2026-03-09 · 7 min

Debugging OpenClaw Agents: What to Do When Your Agent Goes Silent

Debugging · Troubleshooting · OpenClaw · Best Practices

The Moment Everyone Knows

It's 9:00 AM. The morning report cron was supposed to run at 8:45. No Telegram message. You send the agent a manual message. Nothing.

This isn't an edge case — it happens to everyone running agents in production. The difference between an experienced and inexperienced operator is how quickly they find the cause.

After several months running our 6-agent team, we've developed a diagnostics checklist we run through every time something breaks. This post is that checklist — with the exact commands.

---

Step 1: Is the Gateway Even Running?

A stopped gateway is the most common cause. The gateway is the heartbeat of the system: without it there is no channel communication, no cron jobs, and no heartbeats.

```bash

openclaw gateway status

```

Expected output when the gateway is running:

```

Gateway: running (PID 12345)

Uptime: 2h 14m

Channels: telegram (connected), discord (connected)

```

If the status is "stopped" or no process is listed:

```bash

openclaw gateway start

# Or with a systemd setup:

systemctl start openclaw

systemctl status openclaw

```
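
The status check and the start command can be combined into a guard that only starts the gateway when it is not already running. This is a minimal sketch, assuming the status output contains a `Gateway: running (PID ...)` line as in the sample above; `gateway_is_running` and `ensure_gateway` are hypothetical helper names, not part of OpenClaw.

```shell
# Sketch: decide from the status text whether the gateway is up.
# Assumes the "Gateway: running (PID ...)" line format from the sample output.

# Reads status text on stdin; succeeds only if a "Gateway: running" line exists.
gateway_is_running() {
  grep -q '^Gateway:[[:space:]]*running'
}

# Hypothetical wrapper: start the gateway only when the status check fails.
ensure_gateway() {
  if openclaw gateway status 2>/dev/null | gateway_is_running; then
    echo "gateway already running"
  else
    echo "gateway not running, starting it"
    openclaw gateway start
  fi
}
```

Calling `ensure_gateway` is safe to repeat, because it does nothing when the gateway already reports as running.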

If the gateway starts but immediately stops:

```bash

# Check the logs

journalctl -u openclaw -n 50 --no-pager

# Or directly:

openclaw gateway start --foreground 2>&1 | head -30

```

Common errors in the log:

  • `ANTHROPIC_API_KEY not set` — environment variable missing
  • `Port already in use` — another process is blocking the port
  • `Cannot find module` — OpenClaw installation is corrupted
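
The three messages above can be matched mechanically when the log is long. A minimal sketch, assuming the log lines contain these strings verbatim; `explain_gateway_errors` is a hypothetical helper name.

```shell
# Sketch: map the known gateway log messages to the causes listed above.
# Reads log text on stdin and prints one hint per matching line.
explain_gateway_errors() {
  while IFS= read -r line; do
    case "$line" in
      *"ANTHROPIC_API_KEY not set"*) echo "env: ANTHROPIC_API_KEY is missing from the environment" ;;
      *"Port already in use"*)       echo "port: another process is blocking the gateway port" ;;
      *"Cannot find module"*)        echo "install: the OpenClaw installation looks corrupted" ;;
    esac
  done
}

# Usage: journalctl -u openclaw -n 50 --no-pager | explain_gateway_errors
```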
---

    Step 2: Check Channel Connection

    The gateway is running but the agent isn't responding to messages? The channel might be the issue.

    ```bash

    openclaw channels list

    ```

    Expected output:

    ```

    telegram connected last message: 5m ago

    discord connected last message: 12m ago

    ```

    If "disconnected" or "error":

    ```bash

    # Channel test sends an internal test message

    openclaw channels test telegram

    # Reconnect the channel

    openclaw gateway restart

    ```
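
With more than a couple of channels it helps to filter the list down to the ones that need attention. A minimal sketch, assuming each line of `openclaw channels list` starts with `<name> <state>` as in the sample output; `unhealthy_channels` is a hypothetical helper name.

```shell
# Sketch: print the names of channels whose state is not "connected".
# Assumes "<name> <state> ..." per line, as in the sample output above.
unhealthy_channels() {
  awk 'NF >= 2 && $2 != "connected" { print $1 }'
}

# Usage: test every channel that is not connected
#   for ch in $(openclaw channels list | unhealthy_channels); do
#     openclaw channels test "$ch"
#   done
```

An empty result means every listed channel reports as connected, so the problem is more likely in a later step.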

    Common channel problems:

    *Telegram:* Bot token expired or revoked. Fix: get a new token via @BotFather, replace it in .env, restart the gateway.

    *Discord:* Bot was removed from the server or no longer has the required permissions. Fix: regenerate the bot invite link and re-invite the bot.

    *All channels:* After a server reboot without systemd autostart, the gateway doesn't start automatically. Fix: run `systemctl enable openclaw` once so the service starts at boot (see our VPS setup post).

    ---

    Step 3: Is the API Key Still Valid?

    API keys expire, get rotated, or hit their spending limit. This is a common silent failure — the gateway runs, channels are connected, but the LLM call fails.

    ```bash
    # Direct test with curl; reads the key from $ANTHROPIC_API_KEY.
    # -w appends the HTTP status so a 401 or 429 is easy to spot.
    # Swap in any model ID your key has access to.
    curl -sS -w '\nHTTP %{http_code}\n' https://api.anthropic.com/v1/messages \
      -H "x-api-key: $ANTHROPIC_API_KEY" \
      -H "anthropic-version: 2023-06-01" \
      -H "content-type: application/json" \
      -d '{"model":"claude-3-haiku-20240307","max_tokens":10,"messages":[{"role":"user","content":"ping"}]}'
    ```

    If you get a 401: Key is invalid or expired. Create a new key, replace it in .env, restart the gateway.

    If you get a 429: Rate limit or spending limit reached. Check the Anthropic dashboard under Usage Limits.
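
The status codes above can be turned into a next step automatically. A minimal sketch; `explain_api_status` is a hypothetical helper name, and the wording of the hints follows the two cases described in this step.

```shell
# Sketch: translate the HTTP status of the test request into a next step.
explain_api_status() {
  case "$1" in
    200) echo "ok: key accepted" ;;
    401) echo "invalid: key is invalid or expired, create a new one" ;;
    429) echo "limited: rate or spending limit reached, check the dashboard" ;;
    *)   echo "other: unexpected status $1, inspect the response body" ;;
  esac
}

# Usage: capture only the status code of the curl test by adding
#   -s -o /dev/null -w '%{http_code}'
# to it, then pass the captured value to explain_api_status.
```

Anything other than 200, 401, or 429 is reported as "other" so that an unfamiliar status is never silently treated as a key problem.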

    If you suspect `ANTHROPIC_API_KEY` is empty in your shell, check it directly:

    ```bash

    echo $ANTHROPIC_API_KEY

    # If empty: .env is not being loaded

    # Manually source .env and check

    source ~/.openclaw/workspace/.env && echo $ANTHROPIC_API_KEY

    ```

    The gateway needs to start with the correct `EnvironmentFile` configuration (see our systemd setup guide).
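
Echoing the key prints a secret into your terminal history and scrollback. As an alternative, here is a minimal sketch that only reports whether the file defines the key. It assumes plain `KEY=value` lines (optionally prefixed with `export`); `env_file_has_key` is a hypothetical helper name.

```shell
# Sketch: report whether a dotenv-style file defines a non-empty value for a key,
# without printing the secret itself. Assumes plain KEY=value lines.
env_file_has_key() {
  file=$1
  key=$2
  [ -r "$file" ] || { echo "unreadable: $file"; return 2; }
  if grep -Eq "^(export[[:space:]]+)?${key}=.+" "$file"; then
    echo "set: $key"
  else
    echo "missing: $key"
    return 1
  fi
}

# Usage: env_file_has_key ~/.openclaw/workspace/.env ANTHROPIC_API_KEY
```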

    ---

    Step 4: Check Session Status

    Sometimes a session hangs — the agent received a request but never finishes processing it. This blocks new messages.

    ```bash

    # Show active sessions

    openclaw sessions list

    ```

    If a session shows as "stuck" or has been "running" for hours:

    ```bash

    # Take the session ID from the list output

    openclaw sessions kill <session-id>

    # Then: send a new message — should work again

    ```

    In Docker setups:

    ```bash

    # Check container status

    docker ps

    # If a container shows "Exited":

    docker compose up -d agent-sam

    # View logs from the last crash

    docker logs agent-sam --tail=50

    ```

    If the container keeps restarting (restart loop):

    ```bash

    docker logs agent-sam --tail=100 2>&1 | grep -i error

    ```

    ---

    Step 5: Check Workspace Files

    The agent starts, but behaves strangely — gives wrong answers, ignores context, seems to not know its own name?

    This points to problems with the workspace files.

    ```bash

    # Check the most important files

    ls -la ~/.openclaw/workspace/

    cat ~/.openclaw/workspace/SOUL.md # Personality/behavior

    cat ~/.openclaw/workspace/MEMORY.md # Long-term memory

    ```

    Common problems:

    *SOUL.md empty or missing:* The agent has no personality and behaves like a raw chatbot. Fix: recreate SOUL.md.

    *MEMORY.md too large:* If MEMORY.md exceeds 3000 words, it fills the context window and crowds out more important information. Fix: clean up MEMORY.md — remove old, irrelevant entries.

    *Corrupted daily notes:* If a memory/YYYY-MM-DD.md file has invalid content, it can confuse the agent when reading.

    ```bash
    # Check daily notes from the last 3 days (requires GNU date for -d)
    ls -la ~/.openclaw/workspace/memory/ | tail -5
    for i in 0 1 2; do
      cat ~/.openclaw/workspace/memory/"$(date -d "$i days ago" +%Y-%m-%d)".md
    done
    ```
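
The 3000-word threshold for MEMORY.md mentioned above is easy to check with `wc -w`. A minimal sketch; `memory_size_check` is a hypothetical helper name, and the default limit is simply the figure from this post.

```shell
# Sketch: flag a memory file whose word count exceeds a limit
# (default 3000, the threshold named above).
memory_size_check() {
  file=$1
  limit=${2:-3000}
  words=$(wc -w < "$file" | tr -d '[:space:]')
  if [ "$words" -gt "$limit" ]; then
    echo "too large: $words words (limit $limit)"
    return 1
  fi
  echo "ok: $words words"
}

# Usage: memory_size_check ~/.openclaw/workspace/MEMORY.md
```

The non-zero return value on an oversized file makes the function usable in a scheduled check.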

    ---

    Step 6: Check Cron Job Status

    The agent responds to manual messages, but automated tasks no longer run?

    ```bash

    # Show all cron jobs and their status

    openclaw cron list

    # Logs for a specific job

    openclaw cron logs <job-id>

    ```

    If a job shows as "disabled", it was either disabled by accident or disabled automatically after repeated failures.

    ```bash

    openclaw cron enable <job-id>

    ```

    If the job shows "failed":

    ```bash

    # View the last error

    openclaw cron logs <job-id> --limit 1

    # Manually trigger the job and watch the output

    openclaw cron trigger <job-id>

    ```

    Common cron errors:

    *Timing issue:* The job doesn't run at the expected time. Often a timezone mix-up. OpenClaw runs in UTC. Berlin time is UTC+1 (winter) or UTC+2 (summer). Check your cron syntax.

    *Prompt references a non-existent file:* If the prompt loads HEARTBEAT.md but the file doesn't exist, the job may fail. Fix: create the file (can be empty).

    *Forgot to restart gateway after enabling:* Cron jobs only become active after `openclaw gateway restart`.
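
For the timezone mix-up described above, the conversion from local time to UTC can be delegated to `date` instead of done by hand. A minimal sketch, assuming GNU `date` (it relies on the `TZ="..."` prefix in the `-d` argument) and an installed timezone database; `local_to_utc` is a hypothetical helper name.

```shell
# Sketch: convert a local wall-clock time to UTC so it can be written into a
# UTC-based cron schedule. Requires GNU date (TZ="..." input syntax).
local_to_utc() {
  zone=$1   # e.g. Europe/Berlin
  when=$2   # e.g. "2026-03-09 08:45"
  date -u -d "TZ=\"$zone\" $when" '+%H:%M'
}

# Usage:
#   local_to_utc Europe/Berlin "2026-01-15 08:45"   # a winter date
#   local_to_utc Europe/Berlin "2026-07-15 08:45"   # a summer date
```

The 08:45 morning report from the introduction belongs at 07:45 UTC in winter and 06:45 UTC in summer, so a fixed UTC schedule shifts by an hour in local time whenever daylight saving time changes.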

    ---

    Step 7: Check Disk Space

    This is the most overlooked cause. When the disk is full, the agent can't write new log entries or memory files — and behaves unpredictably.

    ```bash

    df -h

    ```

    If `/` is over 85% full:

    ```bash
    # What's taking up space?
    du -sh ~/.openclaw/workspace/* | sort -h | tail -10
    # Docker cleanup: removes stopped containers, unused networks, all unused
    # images, and build cache. Volumes are kept unless you add --volumes.
    docker system prune -a
    # Clean up old daily notes (older than 30 days)
    find ~/.openclaw/workspace/memory/ -name "*.md" -mtime +30 -delete
    # Shrink the systemd journal (all units, not only OpenClaw) to 500 MB
    journalctl --vacuum-size=500M
    ```

    In our setup, a cron job runs daily to check disk usage and sends a Telegram warning if it exceeds 80%. This has saved us from an outage more than once.
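
The threshold check behind such a warning fits in a few lines of shell. A minimal sketch, not the exact job from our setup; `disk_usage_pct` and `disk_check` are hypothetical helper names, and the `HEARTBEAT_OK` token mirrors the one used in the monitoring prompt later in this post.

```shell
# Sketch: print the usage percentage of a mount point and compare it to a threshold.
disk_usage_pct() {
  # -P forces the one-line-per-filesystem POSIX format; column 5 is e.g. "42%"
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

disk_check() {
  path=${1:-/}
  threshold=${2:-80}
  pct=$(disk_usage_pct "$path")
  if [ "$pct" -gt "$threshold" ]; then
    echo "WARN: $path at ${pct}% (threshold ${threshold}%)"
    return 1
  fi
  echo "HEARTBEAT_OK: $path at ${pct}%"
}

# Usage: disk_check / 80
```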

    ---

    Step 8: Network and DNS

    Less common, but it happens: the agent can't reach external APIs. This shows up as "connection refused" or "timeout" errors in the logs.

    ```bash
    # Basic network test (curl fails here only if no HTTP response arrives at all)
    curl -s https://api.anthropic.com/health || echo "Anthropic unreachable"
    # Replace <TOKEN> with your bot token. The quotes keep the shell from
    # treating < and > as redirections.
    curl -s "https://api.telegram.org/bot<TOKEN>/getMe" | head -c 100
    # Check DNS resolution
    nslookup api.anthropic.com
    # In Docker containers: check from inside the container
    docker exec agent-sam curl -s https://api.anthropic.com/health
    ```
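
When one of these requests fails, curl's exit status already says which layer broke. A minimal sketch that decodes the codes most relevant here (6, 7, and 28, as documented in the curl man page); `explain_curl_exit` is a hypothetical helper name.

```shell
# Sketch: interpret curl's exit status to tell DNS, connection, and timeout
# problems apart. Codes 6, 7, and 28 are documented in the curl man page.
explain_curl_exit() {
  case "$1" in
    0)  echo "reachable: the server answered" ;;
    6)  echo "dns: could not resolve host" ;;
    7)  echo "connect: failed to connect to host" ;;
    28) echo "timeout: operation timed out" ;;
    *)  echo "other: curl exit code $1" ;;
  esac
}

# Usage:
#   curl -s -o /dev/null --max-time 10 https://api.anthropic.com/
#   explain_curl_exit $?
```

A "dns" result points at name resolution, while "connect" or "timeout" points at routing or a firewall.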

    If the container can't reach the API but the host server can:

    ```yaml
    # Set DNS explicitly in docker-compose.yml
    # Under the affected service:
    dns:
      - 8.8.8.8
      - 1.1.1.1
    ```

    ---

    The Quick Diagnostics Checklist

    If you don't know where to start, run through this list in order:

    ```bash

    # 1. Gateway status

    openclaw gateway status

    # 2. Channels

    openclaw channels list

    # 3. API key

    curl https://api.anthropic.com/v1/messages -H "x-api-key: $ANTHROPIC_API_KEY" -H "anthropic-version: 2023-06-01" -H "content-type: application/json" -d '{"model":"claude-3-haiku-20240307","max_tokens":10,"messages":[{"role":"user","content":"ping"}]}'

    # 4. Sessions

    openclaw sessions list

    # 5. Docker (if running in containers)

    docker compose ps

    # 6. Disk space

    df -h

    # 7. Logs

    journalctl -u openclaw -n 30 --no-pager

    ```

    In our experience, roughly 90% of outages are diagnosed by one of these seven commands.
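
To run the list in one pass without stopping at the first failure, each command can go through a small wrapper. A minimal sketch; `run_check` and `quick_diagnostics` are hypothetical helper names, and it assumes the `openclaw` subcommands exit non-zero when something is wrong, which you should verify for your version.

```shell
# Sketch: run each diagnostic in order and keep going when one fails,
# so a single pass shows every failing layer.
run_check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $label"
  else
    echo "FAIL $label"
  fi
}

# Assumption: these commands signal problems through their exit status.
quick_diagnostics() {
  run_check "gateway"  openclaw gateway status
  run_check "channels" openclaw channels list
  run_check "sessions" openclaw sessions list
  run_check "disk"     df -h
}
```

For any line marked FAIL, rerun that command by hand to see its full output.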

    ---

    Set Up Monitoring So You're Never the Last to Know

    Better than debugging: not getting into that situation in the first place.

    Three simple monitoring measures:

    1. Uptime monitor (UptimeRobot, free): Set up an HTTP ping to an endpoint on your server. On failure: email or Telegram notification.

    2. Disk warning cron:

    ```

    Schedule: 0 */4 * * * (every 4 hours)

    Prompt: Check disk space with 'df -h /'. If over 80% used:

    send a warning via Telegram. Otherwise: HEARTBEAT_OK.

    ```

    3. Agent self-check:

    ```

    Schedule: */30 * * * * (every 30 minutes, the regular heartbeat)

    Note in HEARTBEAT.md: "If no morning report was sent today,

    mention it on the next check."

    ```

    This sounds simple — and it is. But these three measures have prevented a silent outage from going unnoticed for hours more times than we can count.

    ---

    The Bottom Line

    Agent outages are inevitable. The difference is how quickly and systematically you respond. The checklist above gets you to the root cause in under 10 minutes.

    The complete setup — including monitoring configuration, systemd service, Docker Compose files, and workspace files for all 6 agents — is documented in the OpenClaw Setup Playbook.

    Fully available in German too. 🇩🇪
