Debugging OpenClaw Agents: What to Do When Your Agent Goes Silent
The Moment Everyone Knows
It's 9:00 AM. The morning report cron was supposed to run at 8:45. No Telegram message. You send the agent a manual message. Nothing.
This isn't an edge case; it happens to everyone running agents in production. The difference between an experienced operator and an inexperienced one is how quickly they find the cause.
After several months running our 6-agent team, we've developed a diagnostics checklist we run through every time something breaks. This post is that checklist — with the exact commands.
---
Step 1: Is the Gateway Even Running?
This is the most common cause of a silent agent. The gateway is the heartbeat of the system: without it there is no channel communication, no cron jobs, and no heartbeats.
```bash
openclaw gateway status
```
Expected output when the gateway is running:
```
Gateway: running (PID 12345)
Uptime: 2h 14m
Channels: telegram (connected), discord (connected)
```
If "stopped" or no process:
```bash
openclaw gateway start
# Or with a systemd setup:
systemctl start openclaw
systemctl status openclaw
```
If the gateway starts but immediately stops:
```bash
# Check the logs
journalctl -u openclaw -n 50 --no-pager
# Or directly:
openclaw gateway start --foreground 2>&1 | head -30
```
The log usually names the culprit directly. In our experience it's most often a missing or invalid API key or a `.env` file that wasn't loaded (both covered in Step 3).
---
Step 2: Check Channel Connection
The gateway is running but the agent isn't responding to messages? The channel might be the issue.
```bash
openclaw channels list
```
Expected output:
```
telegram    connected    last message: 5m ago
discord     connected    last message: 12m ago
```
If "disconnected" or "error":
```bash
# Channel test sends an internal test message
openclaw channels test telegram
# Reconnect the channel
openclaw gateway restart
```
Common channel problems:
*Telegram:* Bot token expired or revoked. Fix: get a new token via @BotFather, replace it in .env, restart the gateway.
*Discord:* Bot was removed from the server or no longer has the required permissions. Fix: regenerate the bot invite link and re-invite the bot.
*All channels:* After a server reboot without systemd autostart, the gateway doesn't start automatically. Fix: set up `systemctl enable openclaw` (see our VPS setup post).
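If you suspect the Telegram token itself, the public Bot API `getMe` endpoint answers `"ok":true` for a valid token and `"ok":false` for a revoked one. A minimal sketch; the `TELEGRAM_BOT_TOKEN` variable name and the helper are ours, not OpenClaw commands:

```shell
# Fetch the bot identity (run this yourself; requires network):
#   curl -s "https://api.telegram.org/bot$TELEGRAM_BOT_TOKEN/getMe"

# Helper: decide from the getMe JSON reply whether the token is still valid
telegram_token_ok() {
  # $1 = raw JSON reply from the getMe call above
  echo "$1" | grep -q '"ok":true'
}
```

Usage: `telegram_token_ok "$(curl -s "https://api.telegram.org/bot$TELEGRAM_BOT_TOKEN/getMe")" && echo "token valid"`.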
---
Step 3: Is the API Key Still Valid?
API keys expire, get rotated, or hit their spending limit. This is a common silent failure — the gateway runs, channels are connected, but the LLM call fails.
```bash
# Direct test with curl (replace sk-ant-... with your actual key)
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-3-haiku-20240307","max_tokens":10,"messages":[{"role":"user","content":"ping"}]}'
```
If you get a 401: Key is invalid or expired. Create a new key, replace it in .env, restart the gateway.
If you get a 429: Rate limit or spending limit reached. Check the Anthropic dashboard under Usage Limits.
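To script this check, capture just the HTTP status with curl's `-w '%{http_code}'` and branch on it. A sketch; `interpret_key_status` is a hypothetical helper name, and the mapping mirrors the 401/429 cases above:

```shell
# Capture only the status code of the key test (same headers/body as the curl test above):
#   status=$(curl -s -o /dev/null -w '%{http_code}' https://api.anthropic.com/v1/messages \
#     -H "x-api-key: $ANTHROPIC_API_KEY" ...)

# Map the status code to a human-readable diagnosis
interpret_key_status() {
  case "$1" in
    200) echo "key OK" ;;
    401) echo "key invalid or expired" ;;
    429) echo "rate or spending limit reached" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}
```

Usage: `interpret_key_status "$status"`.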
If `$ANTHROPIC_API_KEY` turns out to be empty:
```bash
echo $ANTHROPIC_API_KEY
# If empty: .env is not being loaded
# Manually source .env and check
source ~/.openclaw/workspace/.env && echo $ANTHROPIC_API_KEY
```
The gateway needs to start with the correct `EnvironmentFile` configuration (see our systemd setup guide).
---
Step 4: Check Session Status
Sometimes a session hangs — the agent received a request but never finishes processing it. This blocks new messages.
```bash
# Show active sessions
openclaw sessions list
```
If a session shows as "stuck" or has been "running" for hours:
```bash
# Take the session ID from the list output
openclaw sessions kill <session-id>
# Then: send a new message — should work again
```
In Docker setups:
```bash
# Check container status
docker ps
# If a container shows "Exited":
docker compose up -d agent-sam
# View logs from the last crash
docker logs agent-sam --tail=50
```
If the container keeps restarting (restart loop):
```bash
docker logs agent-sam --tail=100 2>&1 | grep -i error
```
---
Step 5: Check Workspace Files
The agent starts but behaves strangely: it gives wrong answers, ignores context, or seems not to know its own name?
This points to problems with the workspace files.
```bash
# Check the most important files
ls -la ~/.openclaw/workspace/
cat ~/.openclaw/workspace/SOUL.md # Personality/behavior
cat ~/.openclaw/workspace/MEMORY.md # Long-term memory
```
Common problems:
*SOUL.md empty or missing:* The agent has no personality and behaves like a raw chatbot. Fix: recreate SOUL.md.
*MEMORY.md too large:* If MEMORY.md exceeds 3000 words, it fills the context window and crowds out more important information. Fix: clean up MEMORY.md — remove old, irrelevant entries.
*Corrupted daily notes:* If a memory/YYYY-MM-DD.md file has invalid content, it can confuse the agent when reading.
```bash
# Check daily notes from the last 3 days
ls -la ~/.openclaw/workspace/memory/ | tail -5
cat ~/.openclaw/workspace/memory/$(date +%Y-%m-%d).md
```
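The 3000-word rule of thumb for MEMORY.md is easy to automate. A sketch under that assumption; `check_memory_size` is our helper, not an OpenClaw command:

```shell
# Warn when a memory file exceeds a word budget (default 3000, per the rule of thumb)
check_memory_size() {
  local file=$1 limit=${2:-3000}
  local words
  words=$(wc -w < "$file")
  if [ "$words" -gt "$limit" ]; then
    echo "WARN: $file has $words words (limit $limit)"
  else
    echo "OK: $file has $words words"
  fi
}

# Example: check_memory_size ~/.openclaw/workspace/MEMORY.md
```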
---
Step 6: Check Cron Job Status
The agent responds to manual messages, but automated tasks no longer run?
```bash
# Show all cron jobs and their status
openclaw cron list
# Logs for a specific job
openclaw cron logs <job-id>
```
If a job shows as "disabled": Either accidentally disabled, or automatically disabled after repeated failures.
```bash
openclaw cron enable <job-id>
```
If the job shows "failed":
```bash
# View the last error
openclaw cron logs <job-id> --limit 1
# Manually trigger the job and watch the output
openclaw cron trigger <job-id>
```
Common cron errors:
*Timing issue:* The job doesn't run at the expected time. Often a timezone mix-up. OpenClaw runs in UTC. Berlin time is UTC+1 (winter) or UTC+2 (summer). Check your cron syntax.
*Prompt references a non-existent file:* If the prompt loads HEARTBEAT.md but the file doesn't exist, the job may fail. Fix: create the file (can be empty).
*Forgot to restart gateway after enabling:* Cron jobs only become active after `openclaw gateway restart`.
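For the timezone pitfall, GNU `date` can translate a Berlin-local time into the UTC value your cron schedule needs, DST included:

```shell
# What is 08:45 Europe/Berlin in UTC today? (GNU date syntax;
# prints 07:45 in winter, 06:45 in summer)
TZ=UTC date -d 'TZ="Europe/Berlin" 08:45' +%H:%M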
---
Step 7: Check Disk Space
This is the most overlooked cause. When the disk is full, the agent can't write new log entries or memory files — and behaves unpredictably.
```bash
df -h
```
If `/` is over 85% full:
```bash
# What's taking up space?
du -sh ~/.openclaw/workspace/* | sort -h | tail -10
# Docker cleanup (removes unused images and build cache; add --volumes to prune volumes too)
docker system prune -a
# Clean up old daily notes (older than 30 days)
find ~/.openclaw/workspace/memory/ -name "*.md" -mtime +30 -delete
# Rotate OpenClaw logs
journalctl --vacuum-size=500M
```
In our setup, a cron job runs daily to check disk usage and sends a Telegram warning if it exceeds 80%. This has saved us from an outage more than once.
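The core of that cron job can be sketched in plain shell. The threshold and wording here are ours; the real job delivers the warning through the agent's Telegram channel:

```shell
# Warn when root filesystem usage crosses a threshold (default 80%)
disk_check() {
  local threshold=${1:-80}
  local usage
  # df --output=pcent is GNU coreutils; tr strips everything but the digits
  usage=$(df --output=pcent / | tail -1 | tr -dc '0-9')
  if [ "$usage" -gt "$threshold" ]; then
    echo "WARN: disk at ${usage}%"
  else
    echo "OK: disk at ${usage}%"
  fi
}
```

Usage: `disk_check 80` from cron, piping any `WARN` line to your notifier of choice.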
---
Step 8: Network and DNS
Less common, but it happens: the agent can't reach external APIs. This shows up as "connection refused" or "timeout" errors in the logs.
```bash
# Basic network test (curl exits non-zero if the host can't be reached at all)
curl -s -o /dev/null https://api.anthropic.com || echo "Anthropic unreachable"
curl -s "https://api.telegram.org/bot<TOKEN>/getMe" | head -c 100
# Check DNS resolution
nslookup api.anthropic.com
# In Docker containers: run the same check from inside the container
docker exec agent-sam curl -s -o /dev/null https://api.anthropic.com || echo "unreachable from container"
```
If the container can't reach the API but the host server can:
```yaml
# docker-compose.yml, under the affected service:
services:
  agent-sam:
    dns:
      - 8.8.8.8
      - 1.1.1.1
```
---
The Quick Diagnostics Checklist
If you don't know where to start, run through this list in order:
```bash
# 1. Gateway status
openclaw gateway status
# 2. Channels
openclaw channels list
# 3. API key
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-3-haiku-20240307","max_tokens":10,"messages":[{"role":"user","content":"ping"}]}'
# 4. Sessions
openclaw sessions list
# 5. Docker (if running in containers)
docker compose ps
# 6. Disk space
df -h
# 7. Logs
journalctl -u openclaw -n 30 --no-pager
```
In our experience, 90% of outages are diagnosed by one of these seven commands.
---
Set Up Monitoring So You're Never the Last to Know
Better than debugging: not getting into that situation in the first place.
Three simple monitoring measures:
1. Uptime monitor (UptimeRobot, free): Set up an HTTP ping to an endpoint on your server. On failure: email or Telegram notification.
2. Disk warning cron:
```
Schedule: 0 */4 * * * (every 4 hours)
Prompt: Check disk space with 'df -h /'. If over 80% used:
send a warning via Telegram. Otherwise: HEARTBEAT_OK.
```
3. Agent self-check:
```
Schedule: */30 * * * * (every 30 minutes, the regular heartbeat)
Note in HEARTBEAT.md: "If no morning report was sent today,
mention it on the next check."
```
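The self-check in item 3 boils down to "was anything written today?". A standalone sketch of that logic; the stamp-file path is hypothetical, and the real check lives in the agent's heartbeat prompt, not in shell:

```shell
# True (exit 0) when the morning report's success marker wasn't written today.
# Assumes the report job touches a stamp file on success; uses GNU date -r (file mtime).
morning_report_missing() {
  local stamp=$1   # e.g. a stamp file in the workspace (hypothetical path)
  [ ! -f "$stamp" ] || [ "$(date -r "$stamp" +%F)" != "$(date +%F)" ]
}

# Example: morning_report_missing "$STAMP_FILE" && echo "no report today"
```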
This sounds simple — and it is. But these three measures have prevented a silent outage from going unnoticed for hours more times than we can count.
---
The Bottom Line
Agent outages are inevitable. The difference is how quickly and systematically you respond. The checklist above gets you to the root cause in under 10 minutes.
The complete setup — including monitoring configuration, systemd service, Docker Compose files, and workspace files for all 6 agents — is documented in the OpenClaw Setup Playbook.
Fully available in German too. 🇩🇪
Want to learn more?
Our playbook contains 18 detailed chapters — available in English and German.
Get the Playbook