Using OpenClaw with Ollama: Free, Private AI Agents with Local LLMs

7 min read

Every request to a cloud LLM provider costs money. A typical OpenClaw conversation involving a few tool calls might cost $0.02--$0.10. If you run your agent frequently, monthly API bills can reach $20--$50 or more.

Ollama changes that equation entirely. By running an open-source language model on your own hardware, you can operate OpenClaw with zero API costs, complete data privacy, and no internet dependency. The tradeoff is capability -- local models are not as strong as frontier cloud models -- but for many everyday agent tasks, they are sufficient.

Why Run a Local LLM?

Cost: After the initial hardware investment, local inference costs only electricity -- a few cents per hour even with a dedicated GPU.

Privacy: A local LLM processes everything on your machine. Nothing leaves your network. This matters for sensitive data like medical records, financial documents, or proprietary code.

Availability: A local LLM works in airplane mode, during ISP outages, and without any API key. For always-on agents, local inference removes an entire category of failure modes.

Installing Ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS
brew install ollama

# Verify and start
ollama --version
ollama serve

By default, Ollama listens on http://localhost:11434. Verify with:

curl http://localhost:11434/api/tags

Choosing the Right Model

Agent workloads require strong instruction following, reliable tool-use formatting, and sufficient context length. Here are the best models for OpenClaw, ranked by capability.

Tier 1: Best Quality (Workstation GPU)

| Model | Parameters | VRAM | Context | Best For |
|---|---|---|---|---|
| Qwen 2.5 72B (Q4) | 72B | 48 GB | 128K | Closest to cloud quality |
| Llama 3.3 70B (Q4) | 70B | 44 GB | 128K | Strong general-purpose |
| DeepSeek-R1 70B (Q4) | 70B | 44 GB | 64K | Reasoning-heavy tasks |

Tier 2: Sweet Spot (Consumer GPU)

| Model | Parameters | VRAM | Context | Best For |
|---|---|---|---|---|
| Qwen 2.5 32B (Q4) | 32B | 20 GB | 128K | Best quality/speed balance |
| DeepSeek-R1 32B (Q4) | 32B | 20 GB | 64K | Complex reasoning |
| Mistral Small 24B (Q4) | 24B | 16 GB | 128K | Fast, good tool use |

# Recommended starting point for RTX 3090/4080/4090
ollama pull qwen2.5:32b-instruct-q4_K_M

A 32B model in Q4 quantization hits the sweet spot for most users with a modern gaming GPU[1].

Tier 3: Accessible (Any GPU or CPU)

| Model | Parameters | VRAM | Context | Best For |
|---|---|---|---|---|
| Llama 3.1 8B (Q4) | 8B | 6 GB | 128K | Light tasks, fast responses |
| Mistral 7B (Q4) | 7B | 5 GB | 32K | Simple automations |
| Qwen 2.5 7B (Q4) | 7B | 5 GB | 128K | Multilingual tasks |

Context Length

OpenClaw includes conversation history, skill definitions, tool results, and system instructions in every prompt. Multi-step tasks can reach 30,000--50,000 tokens. Minimum context length: 64K tokens. Models with 128K context are preferred.
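
To confirm what a pulled model actually supports, ollama show prints its parameters, quantization, and maximum context length. Note that Ollama may load a model with a smaller default context window than its maximum, which is why the tuning section later in this post sets num_ctx explicitly:

# Check a model's maximum context length and other details
ollama show qwen2.5:32b-instruct-q4_K_M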

Configuring OpenClaw for Ollama

Edit ~/.openclaw/openclaw.json:

{
  "providers": {
    "ollama": {
      "type": "openai-compatible",
      "baseUrl": "http://localhost:11434/v1",
      "apiKey": "ollama",
      "models": ["qwen2.5:32b-instruct-q4_K_M"]
    }
  },
  "defaultProvider": "ollama",
  "defaultModel": "qwen2.5:32b-instruct-q4_K_M",
  "gateway": {
    "port": 3700,
    "maxConcurrentTasks": 1,
    "taskTimeout": 300
  }
}

Key settings:

  - type: "openai-compatible" uses Ollama's OpenAI-compatible endpoint.
  - apiKey: "ollama" is a placeholder; Ollama needs no auth, but the field cannot be empty.
  - maxConcurrentTasks: 1 avoids memory pressure from parallel inference.
  - taskTimeout: 300 gives local models adequate time.

Verify with:

openclaw status
openclaw chat "What time is it?"
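
If the chat command fails, test Ollama's OpenAI-compatible endpoint directly with curl; a JSON completion here means the problem is on the OpenClaw side rather than in Ollama:

# Bypass OpenClaw and hit the OpenAI-compatible endpoint directly
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:32b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'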

Hardware and Performance

GPU Inference

| GPU | Model Size | Tokens/sec | 500-token response |
|---|---|---|---|
| RTX 4090 (24 GB) | 32B Q4 | 35--45 t/s | 11--14s |
| RTX 3090 (24 GB) | 32B Q4 | 25--35 t/s | 14--20s |
| RTX 3060 (12 GB) | 8B Q4 | 50--70 t/s | 7--10s |
| M2 Max (32 GB) | 32B Q4 | 20--30 t/s | 17--25s |

For comfortable agent use, aim for 20+ tokens per second[2].
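
These figures vary with quantization, context length, and thermals, so it is worth measuring your own setup. The --verbose flag on ollama run prints evaluation rates in tokens per second after each reply:

# Measure throughput on your own hardware
ollama run qwen2.5:32b-instruct-q4_K_M --verbose "Summarize the benefits of local inference in three sentences."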

CPU Inference (Last Resort)

| CPU | Model Size | Tokens/sec | 500-token response |
|---|---|---|---|
| Ryzen 9 7950X | 8B Q4 | 12--18 t/s | 28--42s |
| Core i7-13700 | 8B Q4 | 10--15 t/s | 33--50s |

Stick with 8B models or smaller on CPU.
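
As a starting point on CPU-only machines, pull one of the Tier 3 models in Q4 quantization (tag names vary between releases, so check the Ollama library page for the exact tag):

# Example: a quantized 8B model suitable for CPU inference
ollama pull llama3.1:8b-instruct-q4_K_M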

Tuning for Agent Workloads

Keep the model loaded to avoid cold-start delays:

# In /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"

Set context size and temperature explicitly with a custom Modelfile:

cat > ~/.ollama/Modelfile-openclaw <<'EOF'
FROM qwen2.5:32b-instruct-q4_K_M
PARAMETER num_ctx 65536
PARAMETER temperature 0.3
PARAMETER repeat_penalty 1.1
EOF

ollama create openclaw-qwen -f ~/.ollama/Modelfile-openclaw

Lower temperature (0.3) produces more consistent agent behavior. The repeat penalty prevents loops that local models sometimes fall into.
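
Once the custom model is created, point OpenClaw at it by name, assuming the same provider block as earlier in this post:

{
  "providers": {
    "ollama": {
      "type": "openai-compatible",
      "baseUrl": "http://localhost:11434/v1",
      "apiKey": "ollama",
      "models": ["openclaw-qwen"]
    }
  },
  "defaultProvider": "ollama",
  "defaultModel": "openclaw-qwen"
}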

Limitations of Local Models

Local models are worse than frontier cloud models in several ways. Instruction following is less precise -- local models sometimes skip steps or hallucinate nonexistent tool calls. Complex reasoning is noticeably weaker; a 32B local model performs roughly at the level of a cloud model from 2--3 generations ago[3]. Tool use reliability is lower -- occasional malformed JSON or wrong parameter names. And long context quality degrades more than with cloud models.

The Hybrid Strategy

Use local for routine tasks, cloud for complex work:

{
  "providers": {
    "ollama": {
      "type": "openai-compatible",
      "baseUrl": "http://localhost:11434/v1",
      "apiKey": "ollama",
      "models": ["qwen2.5:32b-instruct-q4_K_M"]
    },
    "anthropic": { "apiKey": "sk-ant-..." }
  },
  "defaultProvider": "ollama",
  "defaultModel": "qwen2.5:32b-instruct-q4_K_M",
  "routing": {
    "complex": {
      "provider": "anthropic",
      "model": "claude-sonnet-4-20250514"
    }
  }
}

Override per-message:

openclaw chat "Check if my backup ran last night"                      # local
openclaw chat --provider anthropic "Review this PR and suggest fixes"  # cloud

This typically reduces cloud API costs by 60--80%.

Remote Ollama and Offline Operation

Run Ollama on a GPU machine and connect from a Raspberry Pi or laptop:

# On GPU machine
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# In openclaw.json on the client
"baseUrl": "http://192.168.1.100:11434/v1"

For fully air-gapped environments, pre-download models on a connected machine and copy ~/.ollama/ to the offline machine:

# On connected machine
ollama pull qwen2.5:32b-instruct-q4_K_M

# Transfer to offline machine
rsync -av ~/.ollama/ user@offline-machine:~/.ollama/

Then start Ollama and OpenClaw normally on the offline machine. This makes OpenClaw viable in classified environments, remote locations, and any scenario where data must not leave the local machine.

If managing this infrastructure is not for you, ClawTank offers hosted instances with provider management built in.

Security note: Ollama has no authentication by default. Only expose it on trusted networks, or use an SSH tunnel for remote access:

ssh -L 11434:localhost:11434 user@gpu-machine -N &
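
With the tunnel running, the client's openclaw.json keeps pointing at the local forwarded port:

# In openclaw.json on the client
"baseUrl": "http://localhost:11434/v1"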

Monitoring Performance

Keep an eye on local inference:

# Check loaded models and memory usage
ollama ps

# Monitor inference in real-time
journalctl -u ollama -f

If performance degrades, check GPU temperature (thermal throttling), verify VRAM is not oversubscribed with nvidia-smi, ensure the model is fully loaded in VRAM without RAM spill, and restart Ollama to clear memory fragmentation.
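
On NVIDIA GPUs, a single nvidia-smi query covers the temperature and VRAM checks:

# Temperature and VRAM usage at a glance
nvidia-smi --query-gpu=temperature.gpu,memory.used,memory.total --format=csv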

Summary

Running OpenClaw with Ollama gives you a private, zero-cost AI agent. The practical path:

  1. Install Ollama and pull a model matched to your hardware
  2. Start with Qwen 2.5 32B (24 GB GPU) or Llama 3.1 8B (smaller GPUs)
  3. Configure OpenClaw with openai-compatible provider type
  4. Use the hybrid strategy for cost savings with quality where it matters
  5. Tune context length and keep-alive for agent workloads

The local model handles 70--80% of everyday tasks well, and you always have the cloud for the rest.

References

  1. Ollama model library and quantization formats
  2. LLM inference benchmarks across consumer GPUs - Simon Willison
  3. Open LLM Leaderboard - Hugging Face
  4. Ollama OpenAI compatibility documentation
  5. OpenClaw provider configuration guide
  6. Qwen 2.5 model family - technical report
