Debugging OpenClaw Agents: What to Do When Your Agent Goes Silent
The Moment Everyone Knows
It's 9:00 AM. The morning report cron was supposed to run at 8:45. No Telegram message. You send the agent a manual message. Nothing.
This isn't an edge case; it happens to everyone running agents in production. The difference between an experienced operator and an inexperienced one is how quickly they find the cause.
After several months running our 6-agent team, we've developed a diagnostics checklist we run through every time something breaks. This post is that checklist — with the exact commands.
---
Step 1: Is the Gateway Even Running?
This is the most common cause of a silent agent. The gateway is the heartbeat of the system: without it there is no channel communication, no cron jobs, and no heartbeats.
```bash
openclaw gateway status
```
Expected output when the gateway is running:
```
Gateway: running (PID 12345)
Uptime: 2h 14m
Channels: telegram (connected), discord (connected)
```
If "stopped" or no process:
```bash
openclaw gateway start
# Or with a systemd setup:
systemctl start openclaw
systemctl status openclaw
```
If the gateway starts but immediately stops:
```bash
# Check the logs
journalctl -u openclaw -n 50 --no-pager
# Or directly:
openclaw gateway start --foreground 2>&1 | head -30
```
The log usually names the culprit directly. In our experience it's most often a missing or invalid API key or a `.env` file that wasn't loaded (both covered in Step 3).
---
Step 2: Check Channel Connection
The gateway is running but the agent isn't responding to messages? The channel might be the issue.
```bash
openclaw channels list
```
Expected output:
```
telegram    connected    last message: 5m ago
discord     connected    last message: 12m ago
```
If "disconnected" or "error":
```bash
# Channel test sends an internal test message
openclaw channels test telegram
# Reconnect the channel
openclaw gateway restart
```
Common channel problems:
*Telegram:* Bot token expired or revoked. Fix: get a new token via @BotFather, replace it in .env, restart the gateway.
*Discord:* Bot was removed from the server or no longer has the required permissions. Fix: regenerate the bot invite link and re-invite the bot.
*All channels:* After a server reboot without systemd autostart, the gateway doesn't start automatically. Fix: set up `systemctl enable openclaw` (see our VPS setup post).
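If you suspect the Telegram token itself, the public Bot API `getMe` endpoint answers `"ok":true` for a valid token and `"ok":false` for a revoked one. A minimal sketch; the `TELEGRAM_BOT_TOKEN` variable name and the helper are ours, not OpenClaw commands:

```shell
# Fetch the bot identity (run this yourself; requires network):
#   curl -s "https://api.telegram.org/bot$TELEGRAM_BOT_TOKEN/getMe"

# Helper: decide from the getMe JSON reply whether the token is still valid
telegram_token_ok() {
  # $1 = raw JSON reply from the getMe call above
  echo "$1" | grep -q '"ok":true'
}
```

Usage: `telegram_token_ok "$(curl -s "https://api.telegram.org/bot$TELEGRAM_BOT_TOKEN/getMe")" && echo "token valid"`.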
---
Step 3: Is the API Key Still Valid?
API keys expire, get rotated, or hit their spending limit. This is a common silent failure — the gateway runs, channels are connected, but the LLM call fails.
```bash
# Direct test with curl (replace sk-ant-... with your actual key)
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-3-haiku-20240307","max_tokens":10,"messages":[{"role":"user","content":"ping"}]}'
```
If you get a 401: Key is invalid or expired. Create a new key, replace it in .env, restart the gateway.
If you get a 429: Rate limit or spending limit reached. Check the Anthropic dashboard under Usage Limits.
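To script this check, capture just the HTTP status with curl's `-w '%{http_code}'` and branch on it. A sketch; `interpret_key_status` is a hypothetical helper name, and the mapping mirrors the 401/429 cases above:

```shell
# Capture only the status code of the key test (same headers/body as the curl test above):
#   status=$(curl -s -o /dev/null -w '%{http_code}' https://api.anthropic.com/v1/messages \
#     -H "x-api-key: $ANTHROPIC_API_KEY" ...)

# Map the status code to a human-readable diagnosis
interpret_key_status() {
  case "$1" in
    200) echo "key OK" ;;
    401) echo "key invalid or expired" ;;
    429) echo "rate or spending limit reached" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}
```

Usage: `interpret_key_status "$status"`.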
If `$ANTHROPIC_API_KEY` turns out to be empty:
```bash
echo $ANTHROPIC_API_KEY
# If empty: .env is not being loaded
# Manually source .env and check
source ~/.openclaw/workspace/.env && echo $ANTHROPIC_API_KEY
```
The gateway needs to start with the correct `EnvironmentFile` configuration (see our systemd setup guide).
---
Step 4: Check Session Status
Sometimes a session hangs — the agent received a request but never finishes processing it. This blocks new messages.
```bash
# Show active sessions
openclaw sessions list
```
If a session shows as "stuck" or has been "running" for hours:
```bash
# Take the session ID from the list output
openclaw sessions kill <session-id>
# Then: send a new message — should work again
```
In Docker setups:
```bash
# Check container status
docker ps
# If a container shows "Exited":
docker compose up -d agent-sam
# View logs from the last crash
docker logs agent-sam --tail=50
```
If the container keeps restarting (restart loop):
```bash
docker logs agent-sam --tail=100 2>&1 | grep -i error
```
---
Step 5: Check Workspace Files
The agent starts but behaves strangely: it gives wrong answers, ignores context, or seems not to know its own name?
This points to problems with the workspace files.
```bash
# Check the most important files
ls -la ~/.openclaw/workspace/
cat ~/.openclaw/workspace/SOUL.md # Personality/behavior
cat ~/.openclaw/workspace/MEMORY.md # Long-term memory
```
Common problems:
*SOUL.md empty or missing:* The agent has no personality and behaves like a raw chatbot. Fix: recreate SOUL.md.
*MEMORY.md too large:* If MEMORY.md exceeds 3000 words, it fills the context window and crowds out more important information. Fix: clean up MEMORY.md — remove old, irrelevant entries.
*Corrupted daily notes:* If a memory/YYYY-MM-DD.md file has invalid content, it can confuse the agent when reading.
```bash
# Check daily notes from the last 3 days
ls -la ~/.openclaw/workspace/memory/ | tail -5
cat ~/.openclaw/workspace/memory/$(date +%Y-%m-%d).md
```
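The 3000-word rule of thumb for MEMORY.md is easy to automate. A sketch under that assumption; `check_memory_size` is our helper, not an OpenClaw command:

```shell
# Warn when a memory file exceeds a word budget (default 3000, per the rule of thumb)
check_memory_size() {
  local file=$1 limit=${2:-3000}
  local words
  words=$(wc -w < "$file")
  if [ "$words" -gt "$limit" ]; then
    echo "WARN: $file has $words words (limit $limit)"
  else
    echo "OK: $file has $words words"
  fi
}

# Example: check_memory_size ~/.openclaw/workspace/MEMORY.md
```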
---
Step 6: Check Cron Job Status
The agent responds to manual messages, but automated tasks no longer run?
```bash
# Show all cron jobs and their status
openclaw cron list
# Logs for a specific job
openclaw cron logs <job-id>
```
If a job shows as "disabled": Either accidentally disabled, or automatically disabled after repeated failures.
```bash
openclaw cron enable <job-id>
```
If the job shows "failed":
```bash
# View the last error
openclaw cron logs <job-id> --limit 1
# Manually trigger the job and watch the output
openclaw cron trigger <job-id>
```
Common cron errors:
*Timing issue:* The job doesn't run at the expected time. Often a timezone mix-up. OpenClaw runs in UTC. Berlin time is UTC+1 (winter) or UTC+2 (summer). Check your cron syntax.
*Prompt references a non-existent file:* If the prompt loads HEARTBEAT.md but the file doesn't exist, the job may fail. Fix: create the file (can be empty).
*Forgot to restart gateway after enabling:* Cron jobs only become active after `openclaw gateway restart`.
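For the timezone pitfall, GNU `date` can translate a Berlin-local time into the UTC value your cron schedule needs, DST included:

```shell
# What is 08:45 Europe/Berlin in UTC today? (GNU date syntax;
# prints 07:45 in winter, 06:45 in summer)
TZ=UTC date -d 'TZ="Europe/Berlin" 08:45' +%H:%M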
---
Step 7: Check Disk Space
This is the most overlooked cause. When the disk is full, the agent can't write new log entries or memory files — and behaves unpredictably.
```bash
df -h
```
If `/` is over 85% full:
```bash
# What's taking up space?
du -sh ~/.openclaw/workspace/* | sort -h | tail -10
# Docker cleanup (removes unused images and build cache; add --volumes to prune volumes too)
docker system prune -a
# Clean up old daily notes (older than 30 days)
find ~/.openclaw/workspace/memory/ -name "*.md" -mtime +30 -delete
# Rotate OpenClaw logs
journalctl --vacuum-size=500M
```
In our setup, a cron job runs daily to check disk usage and sends a Telegram warning if it exceeds 80%. This has saved us from an outage more than once.
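The core of that cron job can be sketched in plain shell. The threshold and wording here are ours; the real job delivers the warning through the agent's Telegram channel:

```shell
# Warn when root filesystem usage crosses a threshold (default 80%)
disk_check() {
  local threshold=${1:-80}
  local usage
  # df --output=pcent is GNU coreutils; tr strips everything but the digits
  usage=$(df --output=pcent / | tail -1 | tr -dc '0-9')
  if [ "$usage" -gt "$threshold" ]; then
    echo "WARN: disk at ${usage}%"
  else
    echo "OK: disk at ${usage}%"
  fi
}
```

Usage: `disk_check 80` from cron, piping any `WARN` line to your notifier of choice.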
---
Step 8: Network and DNS
Less common, but it happens: the agent can't reach external APIs. This shows up as "connection refused" or "timeout" errors in the logs.
```bash
# Basic network test (curl exits non-zero if the host can't be reached at all)
curl -s -o /dev/null https://api.anthropic.com || echo "Anthropic unreachable"
curl -s "https://api.telegram.org/bot<TOKEN>/getMe" | head -c 100
# Check DNS resolution
nslookup api.anthropic.com
# In Docker containers: run the same check from inside the container
docker exec agent-sam curl -s -o /dev/null https://api.anthropic.com || echo "unreachable from container"
```
If the container can't reach the API but the host server can:
```yaml
# docker-compose.yml, under the affected service:
services:
  agent-sam:
    dns:
      - 8.8.8.8
      - 1.1.1.1
```
---
The Quick Diagnostics Checklist
If you don't know where to start, run through this list in order:
```bash
# 1. Gateway status
openclaw gateway status
# 2. Channels
openclaw channels list
# 3. API key
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-3-haiku-20240307","max_tokens":10,"messages":[{"role":"user","content":"ping"}]}'
# 4. Sessions
openclaw sessions list
# 5. Docker (if running in containers)
docker compose ps
# 6. Disk space
df -h
# 7. Logs
journalctl -u openclaw -n 30 --no-pager
```
In our experience, 90% of outages are diagnosed by one of these seven commands.
---
Set Up Monitoring So You're Never the Last to Know
Better than debugging: not getting into that situation in the first place.
Three simple monitoring measures:
1. Uptime monitor (UptimeRobot, free): Set up an HTTP ping to an endpoint on your server. On failure: email or Telegram notification.
2. Disk warning cron:
```
Schedule: 0 */4 * * * (every 4 hours)
Prompt: Check disk space with 'df -h /'. If over 80% used:
send a warning via Telegram. Otherwise: HEARTBEAT_OK.
```
3. Agent self-check:
```
Schedule: */30 * * * * (every 30 minutes, the regular heartbeat)
Note in HEARTBEAT.md: "If no morning report was sent today,
mention it on the next check."
```
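The self-check in item 3 boils down to "was anything written today?". A standalone sketch of that logic; the stamp-file path is hypothetical, and the real check lives in the agent's heartbeat prompt, not in shell:

```shell
# True (exit 0) when the morning report's success marker wasn't written today.
# Assumes the report job touches a stamp file on success; uses GNU date -r (file mtime).
morning_report_missing() {
  local stamp=$1   # e.g. a stamp file in the workspace (hypothetical path)
  [ ! -f "$stamp" ] || [ "$(date -r "$stamp" +%F)" != "$(date +%F)" ]
}

# Example: morning_report_missing "$STAMP_FILE" && echo "no report today"
```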
This sounds simple — and it is. But these three measures have prevented a silent outage from going unnoticed for hours more times than we can count.
---
The Bottom Line
Agent outages are inevitable. The difference is how quickly and systematically you respond. The checklist above gets you to the root cause in under 10 minutes.
The complete setup — including monitoring configuration, systemd service, Docker Compose files, and workspace files for all 6 agents — is documented in the OpenClaw Setup Playbook.
Fully available in German too. 🇩🇪
Want to learn more?
Our playbook contains 18 detailed chapters — available in English and German.
Get the Playbook