Using OpenClaw with Ollama: Free, Private AI Agents with Local LLMs
Every request to a cloud LLM provider costs money. A typical OpenClaw conversation involving a few tool calls might cost $0.02--$0.10. If you run your agent frequently, monthly API bills can reach $20--$50 or more.
Ollama changes that equation entirely. By running an open-source language model on your own hardware, you can operate OpenClaw with zero API costs, complete data privacy, and no internet dependency. The tradeoff is capability -- local models are not as strong as frontier cloud models -- but for many everyday agent tasks, they are sufficient.
Why Run a Local LLM?
Cost: Local inference after the initial hardware investment costs only electricity -- a few cents per hour even with a dedicated GPU.
Privacy: A local LLM processes everything on your machine. Nothing leaves your network. This matters for sensitive data like medical records, financial documents, or proprietary code.
Availability: A local LLM works in airplane mode, during ISP outages, and without any API key. For always-on agents, local inference removes an entire category of failure modes.
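The cost claim above can be sanity-checked with quick arithmetic (the figures are assumptions for illustration: a 350 W GPU under sustained load and a $0.15/kWh electricity rate):

```python
# Back-of-envelope electricity cost for local inference.
# Assumed figures: 350 W GPU draw under load, $0.15 per kWh.
gpu_watts = 350
usd_per_kwh = 0.15

cost_per_hour = gpu_watts / 1000 * usd_per_kwh  # kW * $/kWh
print(f"${cost_per_hour:.2f}/hour")  # roughly five cents per hour
```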
Installing Ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# macOS
brew install ollama
# Verify and start
ollama --version
ollama serve
By default, Ollama listens on http://localhost:11434. Verify with:
curl http://localhost:11434/api/tags
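If you script against this endpoint, the tags response can be parsed for installed model names. A minimal sketch, assuming the response shape documented in Ollama's REST API (a JSON object with a "models" array of objects carrying a "name" field); the sample payload is illustrative:

```python
import json

def installed_models(tags_json: str) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

# Illustrative payload in the documented shape:
sample = '{"models": [{"name": "qwen2.5:32b-instruct-q4_K_M"}]}'
print(installed_models(sample))  # ['qwen2.5:32b-instruct-q4_K_M']
```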
Choosing the Right Model
Agent workloads require strong instruction following, reliable tool-use formatting, and sufficient context length. Here are the best models for OpenClaw, ranked by capability.
Tier 1: Best Quality (Workstation GPU)
| Model | Parameters | VRAM | Context | Best For |
|---|---|---|---|---|
| Qwen 2.5 72B (Q4) | 72B | 48 GB | 128K | Closest to cloud quality |
| Llama 3.3 70B (Q4) | 70B | 44 GB | 128K | Strong general-purpose |
| DeepSeek-R1 70B (Q4) | 70B | 44 GB | 64K | Reasoning-heavy tasks |
Tier 2: Sweet Spot (Consumer GPU)
| Model | Parameters | VRAM | Context | Best For |
|---|---|---|---|---|
| Qwen 2.5 32B (Q4) | 32B | 20 GB | 128K | Best quality/speed balance |
| DeepSeek-R1 32B (Q4) | 32B | 20 GB | 64K | Complex reasoning |
| Mistral Small 24B (Q4) | 24B | 16 GB | 128K | Fast, good tool use |
# Recommended starting point for RTX 3090/4080/4090
ollama pull qwen2.5:32b-instruct-q4_K_M
A 32B model in Q4 quantization hits the sweet spot for most users with a modern gaming GPU[1].
Tier 3: Accessible (Any GPU or CPU)
| Model | Parameters | VRAM | Context | Best For |
|---|---|---|---|---|
| Llama 3.1 8B (Q4) | 8B | 6 GB | 128K | Light tasks, fast responses |
| Mistral 7B (Q4) | 7B | 5 GB | 32K | Simple automations |
| Qwen 2.5 7B (Q4) | 7B | 5 GB | 128K | Multilingual tasks |
Context Length
OpenClaw includes conversation history, skill definitions, tool results, and system instructions in every prompt. Multi-step tasks can reach 30,000--50,000 tokens. Minimum context length: 64K tokens. Models with 128K context are preferred.
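A rough token-budget check makes the 64K minimum concrete. The component sizes below are illustrative estimates in line with the ranges above, not measured OpenClaw internals:

```python
def fits_in_context(history, skills, tool_results, system, reply_budget, num_ctx):
    """Return True if the assembled prompt plus the reply budget fits in num_ctx."""
    prompt_tokens = history + skills + tool_results + system
    return prompt_tokens + reply_budget <= num_ctx

# A mid-size multi-step task: ~54K tokens of prompt plus a 4K reply budget.
print(fits_in_context(30_000, 8_000, 10_000, 2_000, 4_000, 65_536))  # True  (64K)
print(fits_in_context(30_000, 8_000, 10_000, 2_000, 4_000, 32_768))  # False (32K)
```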
Configuring OpenClaw for Ollama
Edit ~/.openclaw/openclaw.json:
{
"providers": {
"ollama": {
"type": "openai-compatible",
"baseUrl": "http://localhost:11434/v1",
"apiKey": "ollama",
"models": ["qwen2.5:32b-instruct-q4_K_M"]
}
},
"defaultProvider": "ollama",
"defaultModel": "qwen2.5:32b-instruct-q4_K_M",
"gateway": {
"port": 3700,
"maxConcurrentTasks": 1,
"taskTimeout": 300
}
}
Key settings:
- `type: "openai-compatible"` uses Ollama's OpenAI-compatible endpoint.
- `apiKey: "ollama"` is a placeholder -- Ollama needs no auth, but the field cannot be empty.
- `maxConcurrentTasks: 1` avoids memory pressure from parallel inference.
- `taskTimeout: 300` gives local models adequate time to respond.
Verify with:
openclaw status
openclaw chat "What time is it?"
Hardware and Performance
GPU Inference
| GPU | Model Size | Tokens/sec | 500-token response |
|---|---|---|---|
| RTX 4090 (24 GB) | 32B Q4 | 35--45 t/s | 11--14s |
| RTX 3090 (24 GB) | 32B Q4 | 25--35 t/s | 14--20s |
| RTX 3060 (12 GB) | 8B Q4 | 50--70 t/s | 7--10s |
| M2 Max (32 GB) | 32B Q4 | 20--30 t/s | 17--25s |
For comfortable agent use, aim for 20+ tokens per second[2].
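You can measure your own throughput rather than trust the table: Ollama's /api/generate responses include eval_count (generated tokens) and eval_duration (in nanoseconds). A minimal sketch computing tokens per second from those fields:

```python
def tokens_per_second(resp: dict) -> float:
    """Generation speed from an Ollama /api/generate response.

    eval_count is the number of generated tokens; eval_duration is the
    generation time in nanoseconds (per Ollama's API documentation).
    """
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Illustrative response fields: 500 tokens generated in 14 seconds.
sample = {"eval_count": 500, "eval_duration": 14_000_000_000}
print(round(tokens_per_second(sample), 1))  # 35.7
```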
CPU Inference (Last Resort)
| CPU | Model Size | Tokens/sec | 500-token response |
|---|---|---|---|
| Ryzen 9 7950X | 8B Q4 | 12--18 t/s | 28--42s |
| Core i7-13700 | 8B Q4 | 10--15 t/s | 33--50s |
Stick with 8B models or smaller on CPU.
Tuning for Agent Workloads
Keep the model loaded to avoid cold-start delays:
# In /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"
Set context size and temperature explicitly with a custom Modelfile:
cat > ~/.ollama/Modelfile-openclaw <<'EOF'
FROM qwen2.5:32b-instruct-q4_K_M
PARAMETER num_ctx 65536
PARAMETER temperature 0.3
PARAMETER repeat_penalty 1.1
EOF
ollama create openclaw-qwen -f ~/.ollama/Modelfile-openclaw
Lower temperature (0.3) produces more consistent agent behavior. The repeat penalty prevents loops that local models sometimes fall into.
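To actually use the tuned model, point the model fields in ~/.openclaw/openclaw.json at the new name. A sketch based on the config shown earlier; `openclaw-qwen` is the name created by the `ollama create` command above:

```json
{
  "providers": {
    "ollama": {
      "type": "openai-compatible",
      "baseUrl": "http://localhost:11434/v1",
      "apiKey": "ollama",
      "models": ["openclaw-qwen"]
    }
  },
  "defaultProvider": "ollama",
  "defaultModel": "openclaw-qwen"
}
```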
Limitations of Local Models
Local models are worse than frontier cloud models in several ways. Instruction following is less precise -- local models sometimes skip steps or hallucinate nonexistent tool calls. Complex reasoning is noticeably weaker; a 32B local model performs roughly at the level of a cloud model from 2--3 generations ago[3]. Tool use reliability is lower -- occasional malformed JSON or wrong parameter names. And long context quality degrades more than with cloud models.
The Hybrid Strategy
Use local for routine tasks, cloud for complex work:
{
"providers": {
"ollama": {
"type": "openai-compatible",
"baseUrl": "http://localhost:11434/v1",
"apiKey": "ollama",
"models": ["qwen2.5:32b-instruct-q4_K_M"]
},
"anthropic": { "apiKey": "sk-ant-..." }
},
"defaultProvider": "ollama",
"defaultModel": "qwen2.5:32b-instruct-q4_K_M",
"routing": {
"complex": {
"provider": "anthropic",
"model": "claude-sonnet-4-20250514"
}
}
}
Override per-message:
openclaw chat "Check if my backup ran last night" # local
openclaw chat --provider anthropic "Review this PR and suggest fixes" # cloud
This typically reduces cloud API costs by 60--80%.
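As a back-of-envelope check on that range (assumed numbers, not measurements: 1,000 agent requests a month at an average $0.05 cloud cost each, with the local model handling 75% of requests):

```python
# Hybrid routing savings, illustrative numbers only.
requests = 1000
cloud_cost_each = 0.05     # average $/request on the cloud provider
local_fraction = 0.75      # share of requests the local model handles

all_cloud = requests * cloud_cost_each
hybrid = requests * (1 - local_fraction) * cloud_cost_each
savings = 1 - hybrid / all_cloud
print(f"all-cloud ${all_cloud:.2f}/mo, hybrid ${hybrid:.2f}/mo, savings {savings:.0%}")
```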
Remote Ollama and Offline Operation
Run Ollama on a GPU machine and connect from a Raspberry Pi or laptop:
# On GPU machine
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# In openclaw.json on the client
"baseUrl": "http://192.168.1.100:11434/v1"
For fully air-gapped environments, pre-download models on a connected machine and copy ~/.ollama/ to the offline machine:
# On connected machine
ollama pull qwen2.5:32b-instruct-q4_K_M
# Transfer to offline machine
rsync -av ~/.ollama/ user@offline-machine:~/.ollama/
Then start Ollama and OpenClaw normally on the offline machine. This makes OpenClaw viable in classified environments, remote locations, and any scenario where data must not leave the local machine.
If managing this infrastructure is not for you, ClawTank offers hosted instances with provider management built in.
Security note: Ollama has no authentication by default. Only expose it on trusted networks, or use an SSH tunnel for remote access:
ssh -L 11434:localhost:11434 user@gpu-machine -N &
Monitoring Performance
Keep an eye on local inference:
# Check loaded models and memory usage
ollama ps
# Monitor inference in real-time
journalctl -u ollama -f
If performance degrades, check GPU temperature (thermal throttling), verify VRAM is not oversubscribed with nvidia-smi, ensure the model is fully loaded in VRAM without RAM spill, and restart Ollama to clear memory fragmentation.
Summary
Running OpenClaw with Ollama gives you a private, zero-cost AI agent. The practical path:
- Install Ollama and pull a model matched to your hardware
- Start with Qwen 2.5 32B (24 GB GPU) or Llama 3.1 8B (smaller GPUs)
- Configure OpenClaw with the `openai-compatible` provider type
- Use the hybrid strategy for cost savings with quality where it matters
- Tune context length and keep-alive for agent workloads
The local model handles 70--80% of everyday tasks well, and you always have the cloud for the rest.
