
25b - Monitoring & Alerting

This is Section 2 of Chapter 25 of the *AI Agent Practical Handbook*. Previous: 25a - High-Availability Deployment | Next: 25c - Log Management & Auditing

Overview

OpenClaw runs 24/7 as an autonomous AI agent platform, so its monitoring needs go well beyond those of a traditional web application: alongside infrastructure metrics (CPU, memory, disk), you also need to track agent-specific business metrics (token consumption, model latency, task success rate, session health). As of 2026, Prometheus + Grafana is the de facto standard for open-source monitoring, and together with Alertmanager it closes the loop from metric collection to incident notification.

This section is a complete, from-scratch guide to building a monitoring stack for OpenClaw, covering Prometheus configuration, custom agent metrics, Grafana dashboard templates, and multi-channel alerting rules.


1. Monitoring Architecture Overview

Monitoring Layers

OpenClaw monitoring spans several layers, from metric collection at the bottom to alert delivery at the top:

```
┌─────────────────────────────────────────────────────────┐
│ Alert notification layer                                │
│   Alertmanager → Slack / Telegram / Email / PagerDuty   │
├─────────────────────────────────────────────────────────┤
│ Visualization layer                                     │
│   Grafana dashboards (infrastructure + agent metrics)   │
├─────────────────────────────────────────────────────────┤
│ Metrics storage layer                                   │
│   Prometheus (time-series database, pull model)         │
├─────────────────────────────────────────────────────────┤
│ Metrics collection layer                                │
│   Node Exporter      cAdvisor       custom /metrics     │
│   (host metrics)     (containers)   (agent metrics)     │
├─────────────────────────────────────────────────────────┤
│ Monitored targets                                       │
│   VPS host    Docker containers    OpenClaw Gateway +   │
│                                    Agents               │
└─────────────────────────────────────────────────────────┘
```

Recommended Tools

| Tool | Purpose | Price | Best For |
| --- | --- | --- | --- |
| Prometheus | Time-series metrics collection & storage | Free (open source) | Core metrics engine for every scenario |
| Grafana | Metrics visualization & dashboards | Free (open source) / Cloud from $0 | Dashboards and alert visualization |
| Alertmanager | Alert routing & notification | Free (open source) | Alert deduplication, grouping, silencing |
| Node Exporter | Host-level metrics collection | Free (open source) | CPU/memory/disk/network monitoring |
| cAdvisor | Container-level metrics collection | Free (open source) | Docker container resource monitoring |
| Uptime Kuma | External availability monitoring | Free (open source) | HTTP/TCP endpoint liveness checks |
| Grafana Cloud | Managed monitoring platform | Free tier 10K metrics / Pro $8/mo | Zero-ops setups |
| Datadog | Full-stack observability | From $15/host/mo | Enterprise requirements |
| New Relic | APM + infrastructure | Free tier 100 GB/mo | Full-stack monitoring |
| Betterstack | Logs + status page + alerting | Usable free tier / from $25/mo | Status pages + incident management |

💡 Recommended combo: Prometheus + Grafana + Alertmanager (self-hosted, free) is the best starting point for monitoring OpenClaw. If you already run a K8s cluster, the kube-prometheus-stack Helm chart deploys the whole stack in one command.


2. Deploying Prometheus + Grafana

Docker Compose Integration

25a-高可用部署 的 Docker Compose 基础上,添加监控组件:

```yaml
# docker-compose.monitoring.yml — monitoring stack
# Use together with the main docker-compose.yml:
#   docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d
version: '3.8'

services:
  # ============================================
  # Prometheus — metrics collection & storage
  # ============================================
  prometheus:
    image: prom/prometheus:v3.4.0
    container_name: openclaw-prometheus
    restart: unless-stopped
    user: "65534:65534"  # nobody user
    ports:
      - "127.0.0.1:9090:9090"
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/prometheus/rules/:/etc/prometheus/rules/:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'   # allows hot-reloading the config
      - '--web.enable-admin-api'
    networks:
      - openclaw-net
    logging:
      driver: json-file
      options:
        max-size: "20m"
        max-file: "3"

  # ============================================
  # Grafana — dashboards & visualization
  # ============================================
  grafana:
    image: grafana/grafana:11.6.0
    container_name: openclaw-grafana
    restart: unless-stopped
    ports:
      - "127.0.0.1:3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=https://grafana.yourdomain.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/provisioning/:/etc/grafana/provisioning/:ro
      - ./monitoring/grafana/dashboards/:/var/lib/grafana/dashboards/:ro
    networks:
      - openclaw-net
    depends_on:
      - prometheus
    logging:
      driver: json-file
      options:
        max-size: "20m"
        max-file: "3"

  # ============================================
  # Alertmanager — alert routing & notification
  # ============================================
  alertmanager:
    image: prom/alertmanager:v0.28.1
    container_name: openclaw-alertmanager
    restart: unless-stopped
    ports:
      - "127.0.0.1:9093:9093"
    volumes:
      - ./monitoring/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    networks:
      - openclaw-net
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"

  # ============================================
  # Node Exporter — host metrics
  # ============================================
  node-exporter:
    image: prom/node-exporter:v1.9.1
    container_name: openclaw-node-exporter
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - openclaw-net

  # ============================================
  # cAdvisor — container metrics
  # ============================================
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.52.1
    container_name: openclaw-cadvisor
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    networks:
      - openclaw-net

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
```
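Because Prometheus is started with `--web.enable-lifecycle`, configuration changes can be applied with a hot reload instead of a container restart. A quick sketch:

```bash
# edit monitoring/prometheus/prometheus.yml or a rules file, then hot-reload:
curl -X POST http://127.0.0.1:9090/-/reload

# confirm the running config was accepted ("success" expected)
curl -s http://127.0.0.1:9090/api/v1/status/config | jq -r '.status'
```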

Directory Layout

```
openclaw-deploy/
├── docker-compose.yml                  # main services
├── docker-compose.monitoring.yml       # monitoring stack
├── monitoring/
│   ├── prometheus/
│   │   ├── prometheus.yml              # Prometheus main config
│   │   └── rules/
│   │       ├── infrastructure.yml      # infrastructure alert rules
│   │       └── openclaw-agent.yml      # agent business alert rules
│   ├── alertmanager/
│   │   └── alertmanager.yml            # alert routing config
│   └── grafana/
│       ├── provisioning/
│       │   ├── datasources/
│       │   │   └── prometheus.yml      # auto-provisioned data source
│       │   └── dashboards/
│       │       └── dashboards.yml      # dashboard auto-discovery
│       └── dashboards/
│           ├── openclaw-overview.json  # OpenClaw overview dashboard
│           └── infrastructure.json     # infrastructure dashboard
└── .env
```

Prometheus Main Configuration

```yaml
# monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s      # default scrape interval
  evaluation_interval: 15s  # rule evaluation interval
  scrape_timeout: 10s

# alerting rule files
rule_files:
  - /etc/prometheus/rules/*.yml

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# scrape targets
scrape_configs:
  # Prometheus' own metrics
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter — host metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'openclaw-vps'

  # cAdvisor — container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # OpenClaw Gateway — application metrics
  # (if OpenClaw exposes a /metrics endpoint)
  - job_name: 'openclaw-gateway'
    static_configs:
      - targets: ['openclaw-gateway:18789']
    metrics_path: /metrics
    scrape_interval: 30s
    # if authentication is required:
    # authorization:
    #   credentials: '${OPENCLAW_GATEWAY_TOKEN}'

  # OpenClaw health-check probe
  - job_name: 'openclaw-health'
    metrics_path: /health
    static_configs:
      - targets: ['openclaw-gateway:18789']
    scrape_interval: 30s
```
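Before reloading, it is worth linting the file. A minimal sketch using the promtool binary bundled in the same Prometheus image (mounting at /etc/prometheus so the rule_files glob resolves too):

```bash
# validate prometheus.yml and the referenced rule files offline
docker run --rm \
  -v "$(pwd)/monitoring/prometheus:/etc/prometheus:ro" \
  --entrypoint promtool \
  prom/prometheus:v3.4.0 check config /etc/prometheus/prometheus.yml
```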

Grafana Data Source Provisioning

```yaml
# monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: '15s'
      httpMethod: POST
```

Grafana Dashboard Auto-Discovery

```yaml
# monitoring/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: 'OpenClaw'
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false
```

Starting the Monitoring Stack

```bash
# create the directory structure
mkdir -p monitoring/{prometheus/rules,alertmanager,grafana/provisioning/{datasources,dashboards},grafana/dashboards}

# start (together with the main services)
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d

# verify each component
curl -s http://127.0.0.1:9090/-/healthy   # Prometheus
curl -s http://127.0.0.1:3000/api/health  # Grafana
curl -s http://127.0.0.1:9093/-/healthy   # Alertmanager

# check Prometheus target status
curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
```

3. Key Metrics

Infrastructure Metrics

OpenClaw runs on a VPS or on Kubernetes, so infrastructure monitoring is the first line of defense:

| Metric | Example PromQL | Suggested Threshold | Notes |
| --- | --- | --- | --- |
| CPU usage | `100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` | > 80% for 5 min | CPU spikes during inference-heavy agent work |
| Memory usage | `(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100` | > 85% for 5 min | Many concurrent agent sessions consume a lot of memory |
| Disk usage | `(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100` | > 80% | Logs and session data grow continuously |
| Disk I/O | `rate(node_disk_io_time_seconds_total[5m])` | > 0.9 (90% busy) | Watch during SQLite-write-heavy periods |
| Network traffic | `rate(node_network_receive_bytes_total[5m])` | Unusual spikes | API calls and WebSocket traffic |
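Any expression from this table can be tried ad hoc against the Prometheus HTTP query API before it goes into a dashboard or alert rule, for example:

```bash
# instant query: current CPU usage percentage
curl -sG http://127.0.0.1:9090/api/v1/query \
  --data-urlencode 'query=100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)' \
  | jq '.data.result[0].value'
```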

Container Metrics

cAdvisor collects per-container resource usage for Docker:

| Metric | PromQL | Notes |
| --- | --- | --- |
| Container CPU | `rate(container_cpu_usage_seconds_total{name="openclaw-gateway"}[5m])` | Gateway process CPU consumption |
| Container memory | `container_memory_usage_bytes{name="openclaw-gateway"}` | Actual memory footprint |
| Container restarts | `container_restart_count{name="openclaw-gateway"}` | Frequent restarts indicate a problem |
| Container network | `rate(container_network_receive_bytes_total{name="openclaw-gateway"}[5m])` | Network throughput |

Agent Business Metrics

This is the heart of OpenClaw monitoring: the AI-agent-specific metrics that traditional infrastructure monitoring cannot see:

| Category | Metric Name | Type | Description |
| --- | --- | --- | --- |
| Session health | openclaw_active_sessions | Gauge | Currently active agent sessions |
| Task execution | openclaw_tasks_total | Counter | Total tasks executed (status label: success/failure/timeout) |
| Task latency | openclaw_task_duration_seconds | Histogram | Distribution of task execution time |
| Model calls | openclaw_llm_requests_total | Counter | LLM API calls (provider/model labels) |
| Model latency | openclaw_llm_request_duration_seconds | Histogram | LLM API response time |
| Token usage | openclaw_llm_tokens_total | Counter | Tokens consumed (type label: input/output) |
| Token cost | openclaw_llm_cost_dollars | Counter | Estimated API cost (USD) |
| Tool calls | openclaw_tool_calls_total | Counter | Tool invocations (tool_name label) |
| Tool failures | openclaw_tool_errors_total | Counter | Failed tool invocations |
| TTFT | openclaw_time_to_first_token_seconds | Histogram | Time to first token |
| Error rate | openclaw_errors_total | Counter | Total errors (error_type label) |
| Health check | openclaw_health_status | Gauge | Gateway health (1=healthy, 0=unhealthy) |

💡 Note: as of early 2026, OpenClaw natively exposes only a limited set of Prometheus metrics. The next section shows how to collect these business metrics with a custom exporter script.


4. Custom Agent Metrics Collection

OpenClaw Metrics Exporter

Because OpenClaw's /metrics endpoint may not include every agent business metric, we can write a lightweight exporter script that pulls data from the OpenClaw API and exposes it in Prometheus format:

```python
#!/usr/bin/env python3
"""
openclaw_exporter.py — OpenClaw Prometheus exporter

Collects agent business metrics and exposes them in Prometheus format.
"""
import os
import time

import requests
from prometheus_client import start_http_server, Gauge, Counter, Histogram

# ============================================
# Metric definitions
# ============================================

# session metrics
ACTIVE_SESSIONS = Gauge(
    'openclaw_active_sessions',
    'Number of currently active agent sessions'
)

# task metrics
TASKS_TOTAL = Counter(
    'openclaw_tasks_total',
    'Total number of tasks executed',
    ['status']  # success, failure, timeout
)
TASK_DURATION = Histogram(
    'openclaw_task_duration_seconds',
    'Task execution duration in seconds',
    buckets=[1, 5, 10, 30, 60, 120, 300, 600]
)

# LLM metrics
LLM_REQUESTS = Counter(
    'openclaw_llm_requests_total',
    'Total LLM API requests',
    ['provider', 'model']
)
LLM_TOKENS = Counter(
    'openclaw_llm_tokens_total',
    'Total tokens consumed',
    ['provider', 'model', 'type']  # type: input/output
)
LLM_COST = Counter(
    'openclaw_llm_cost_dollars',
    'Estimated LLM API cost in USD',
    ['provider', 'model']
)
LLM_LATENCY = Histogram(
    'openclaw_llm_request_duration_seconds',
    'LLM API request duration',
    ['provider', 'model'],
    buckets=[0.5, 1, 2, 5, 10, 30, 60]
)

# tool-call metrics
TOOL_CALLS = Counter(
    'openclaw_tool_calls_total',
    'Total tool invocations',
    ['tool_name']
)
TOOL_ERRORS = Counter(
    'openclaw_tool_errors_total',
    'Total tool invocation errors',
    ['tool_name', 'error_type']
)

# health status
HEALTH_STATUS = Gauge(
    'openclaw_health_status',
    'Gateway health status (1=healthy, 0=unhealthy)'
)
GATEWAY_UPTIME = Gauge(
    'openclaw_gateway_uptime_seconds',
    'Gateway uptime in seconds'
)

# ============================================
# Collection logic
# ============================================
GATEWAY_URL = os.getenv('OPENCLAW_GATEWAY_URL', 'http://127.0.0.1:18789')
GATEWAY_TOKEN = os.getenv('OPENCLAW_GATEWAY_TOKEN', '')


def fetch_api(endpoint):
    """Fetch JSON from the OpenClaw API."""
    headers = {}
    if GATEWAY_TOKEN:
        headers['Authorization'] = f'Bearer {GATEWAY_TOKEN}'
    try:
        resp = requests.get(
            f'{GATEWAY_URL}{endpoint}',
            headers=headers,
            timeout=10
        )
        resp.raise_for_status()
        return resp.json()
    except Exception as e:
        print(f"[WARN] Failed to fetch {endpoint}: {e}")
        return None


def collect_health():
    """Collect gateway health status."""
    data = fetch_api('/health')
    if data and data.get('status') == 'ok':
        HEALTH_STATUS.set(1)
        if 'uptime' in data:
            GATEWAY_UPTIME.set(data['uptime'])
    else:
        HEALTH_STATUS.set(0)


def collect_sessions():
    """Collect session metrics."""
    data = fetch_api('/api/status')
    if data:
        sessions = data.get('activeSessions', 0)
        ACTIVE_SESSIONS.set(sessions)


def collect_metrics():
    """Main collection pass."""
    collect_health()
    collect_sessions()
    # Extend with more collectors here, based on the actual
    # response structure of the OpenClaw API.


if __name__ == '__main__':
    port = int(os.getenv('EXPORTER_PORT', '9101'))
    interval = int(os.getenv('COLLECT_INTERVAL', '30'))
    print(f"[INFO] Starting OpenClaw Exporter on :{port}")
    start_http_server(port)
    while True:
        collect_metrics()
        time.sleep(interval)
```
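To try the exporter outside Docker first, a quick sketch (assuming the script above is saved as openclaw_exporter.py and a Gateway is reachable locally):

```bash
pip install prometheus-client requests

# point it at a local gateway and start it on :9101
OPENCLAW_GATEWAY_URL=http://127.0.0.1:18789 \
OPENCLAW_GATEWAY_TOKEN=changeme \
python openclaw_exporter.py &

# the endpoint should now serve Prometheus text format
curl -s http://127.0.0.1:9101/metrics | grep '^openclaw_'
```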

Exporter Docker Configuration

```dockerfile
# monitoring/exporter/Dockerfile
FROM python:3.12-slim
WORKDIR /app
RUN pip install --no-cache-dir prometheus-client requests
COPY openclaw_exporter.py .
EXPOSE 9101
CMD ["python", "openclaw_exporter.py"]
```

docker-compose.monitoring.yml 中添加:

```yaml
  openclaw-exporter:
    build: ./monitoring/exporter
    container_name: openclaw-exporter
    restart: unless-stopped
    environment:
      - OPENCLAW_GATEWAY_URL=http://openclaw-gateway:18789
      - OPENCLAW_GATEWAY_TOKEN=${OPENCLAW_GATEWAY_TOKEN}
      - EXPORTER_PORT=9101
      - COLLECT_INTERVAL=30
    networks:
      - openclaw-net
    depends_on:
      - openclaw-gateway
```

prometheus.yml 中添加采集目标:

```yaml
  - job_name: 'openclaw-exporter'
    static_configs:
      - targets: ['openclaw-exporter:9101']
    scrape_interval: 30s
```

5. Alert Rule Configuration

Infrastructure Alert Rules

```yaml
# monitoring/prometheus/rules/infrastructure.yml
groups:
  - name: infrastructure
    rules:
      # ---- host-level alerts ----
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "High CPU usage ({{ $labels.instance }})"
          description: "CPU usage has stayed above 80% for 5 minutes; current value {{ $value | printf \"%.1f\" }}%."
          runbook: "Check for runaway agent sessions; consider tightening resource limits or upgrading the VPS."

      - alert: CriticalCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 2m
        labels:
          severity: critical
          component: infrastructure
        annotations:
          summary: "🔴 Critical CPU usage ({{ $labels.instance }})"
          description: "CPU usage is above 95%; current value {{ $value | printf \"%.1f\" }}%. The system may become unresponsive."

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "High memory usage ({{ $labels.instance }})"
          description: "Memory usage is above 85%; current value {{ $value | printf \"%.1f\" }}%."
          runbook: "Check the number of agent sessions; consider capping concurrency or adding memory."

      - alert: DiskSpaceRunningLow
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 80
        for: 10m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "Disk space running low ({{ $labels.instance }})"
          description: "Root filesystem usage is above 80%; current value {{ $value | printf \"%.1f\" }}%."
          runbook: "Clean up old logs and backups; check the Docker image cache: docker system prune"

      - alert: DiskSpaceCritical
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 95
        for: 5m
        labels:
          severity: critical
          component: infrastructure
        annotations:
          summary: "🔴 Disk almost full ({{ $labels.instance }})"
          description: "Root filesystem usage is above 95%. OpenClaw may no longer be able to write data."

      # ---- container-level alerts ----
      - alert: ContainerHighMemory
        expr: container_memory_usage_bytes{name="openclaw-gateway"} / container_spec_memory_limit_bytes{name="openclaw-gateway"} * 100 > 85
        for: 5m
        labels:
          severity: warning
          component: container
        annotations:
          summary: "OpenClaw container memory near its limit"
          description: "Gateway container memory usage is at {{ $value | printf \"%.1f\" }}% of deploy.resources.limits."

      - alert: ContainerRestarted
        expr: increase(container_restart_count{name="openclaw-gateway"}[1h]) > 2
        labels:
          severity: critical
          component: container
        annotations:
          summary: "🔴 OpenClaw container restarting frequently"
          description: "The Gateway container restarted {{ $value }} times in the last hour."
          runbook: "Check the container logs: docker logs openclaw-gateway --tail 100"
```

Agent Business Alert Rules

```yaml
# monitoring/prometheus/rules/openclaw-agent.yml
groups:
  - name: openclaw-agent
    rules:
      # ---- Gateway health ----
      - alert: OpenClawGatewayDown
        expr: openclaw_health_status == 0
        for: 1m
        labels:
          severity: critical
          component: openclaw
        annotations:
          summary: "🔴 OpenClaw Gateway is down"
          description: "The Gateway health check has been failing for over 1 minute. All agent sessions are interrupted."
          runbook: |
            1. Check container status: docker compose ps
            2. Inspect the logs: docker compose logs --tail 50 openclaw-gateway
            3. Try a restart: docker compose restart openclaw-gateway
            4. Check whether the API key has expired

      - alert: OpenClawHealthCheckTimeout
        expr: up{job="openclaw-gateway"} == 0
        for: 2m
        labels:
          severity: critical
          component: openclaw
        annotations:
          summary: "🔴 OpenClaw health endpoint unreachable"
          description: "Prometheus cannot reach the OpenClaw Gateway's /metrics endpoint."

      # ---- task execution ----
      - alert: HighTaskFailureRate
        expr: |
          rate(openclaw_tasks_total{status="failure"}[15m])
          /
          rate(openclaw_tasks_total[15m]) > 0.3
        for: 10m
        labels:
          severity: warning
          component: openclaw
        annotations:
          summary: "High agent task failure rate"
          description: "Task failure rate over the past 15 minutes is above 30%; current value {{ $value | humanizePercentage }}."
          runbook: "Look for error patterns in the agent logs; likely API rate limiting or a broken tool."

      - alert: TaskExecutionSlow
        expr: histogram_quantile(0.95, rate(openclaw_task_duration_seconds_bucket[15m])) > 300
        for: 10m
        labels:
          severity: warning
          component: openclaw
        annotations:
          summary: "Agent tasks running slowly"
          description: "P95 task duration exceeds 5 minutes ({{ $value | printf \"%.0f\" }}s)."

      # ---- LLM API ----
      - alert: LLMApiHighLatency
        expr: histogram_quantile(0.95, rate(openclaw_llm_request_duration_seconds_bucket[10m])) > 30
        for: 5m
        labels:
          severity: warning
          component: llm
        annotations:
          summary: "LLM API responding slowly"
          description: "P95 LLM request latency exceeds 30 seconds ({{ $value | printf \"%.1f\" }}s). Possibly provider-side rate limiting."

      - alert: LLMApiErrors
        expr: rate(openclaw_llm_requests_total{status="error"}[10m]) > 0.1
        for: 5m
        labels:
          severity: warning
          component: llm
        annotations:
          summary: "LLM API error rate rising"
          description: "LLM API error rate exceeds 0.1 req/s. Check API keys and quota."

      - alert: HighTokenBurn
        expr: rate(openclaw_llm_tokens_total[1h]) * 3600 > 500000
        for: 30m
        labels:
          severity: warning
          component: cost
        annotations:
          summary: "⚠️ Abnormal token burn rate"
          description: "Token burn exceeded 500K/hour over the past hour. There may be a runaway agent loop."
          runbook: "Inspect active sessions and stop the suspicious agent via the Control UI or API."

      - alert: DailyCostExceeded
        expr: increase(openclaw_llm_cost_dollars[24h]) > 50
        labels:
          severity: critical
          component: cost
        annotations:
          summary: "🔴 Daily API cost over budget"
          description: "LLM API cost over the past 24 hours exceeds $50 (currently ${{ $value | printf \"%.2f\" }})."

      # ---- tool calls ----
      - alert: ToolCallFailureSpike
        expr: |
          rate(openclaw_tool_errors_total[10m])
          /
          rate(openclaw_tool_calls_total[10m]) > 0.5
        for: 5m
        labels:
          severity: warning
          component: openclaw
        annotations:
          summary: "Tool call failure rate spiking"
          description: "More than 50% of tool calls are failing. An external service may be down."

      # ---- session health ----
      - alert: NoActiveSessions
        expr: openclaw_active_sessions == 0
        for: 15m
        labels:
          severity: info
          component: openclaw
        annotations:
          summary: "No active agent sessions"
          description: "No agent session has been active for 15 minutes. If 24/7 operation is expected, investigate."
```
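Both rule files can be syntax-checked offline before Prometheus loads them; a sketch using promtool from the same image:

```bash
docker run --rm \
  -v "$(pwd)/monitoring/prometheus/rules:/rules:ro" \
  --entrypoint promtool \
  prom/prometheus:v3.4.0 check rules /rules/infrastructure.yml /rules/openclaw-agent.yml
```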

6. Alertmanager Notification Configuration

Multi-Channel Alert Routing

```yaml
# monitoring/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

  # SMTP settings (email notifications)
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'openclaw-alerts@yourdomain.com'
  smtp_auth_username: 'openclaw-alerts@yourdomain.com'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

# notification templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# routing rules
route:
  # default receiver
  receiver: 'slack-default'
  # grouping strategy: group by alertname + component
  group_by: ['alertname', 'component']
  group_wait: 30s        # wait 30s to batch alerts in the same group
  group_interval: 5m     # 5 minutes between notifications for the same group
  repeat_interval: 4h    # re-notify unresolved alerts every 4 hours

  routes:
    # critical alerts → Telegram + email (immediate)
    - match:
        severity: critical
      receiver: 'critical-multi'
      group_wait: 10s
      repeat_interval: 1h

    # cost alerts → dedicated channel
    - match:
        component: cost
      receiver: 'cost-alerts'
      group_wait: 1m
      repeat_interval: 6h

    # info level → Slack only
    - match:
        severity: info
      receiver: 'slack-default'
      repeat_interval: 24h

# inhibition: a firing Critical suppresses Warnings for the same component
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['component']

# receivers
receivers:
  # default Slack channel
  - name: 'slack-default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#openclaw-alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          {{ if .Annotations.runbook }}📋 *Runbook:* {{ .Annotations.runbook }}{{ end }}
          {{ end }}

  # critical multi-channel (Telegram + email)
  - name: 'critical-multi'
    telegram_configs:
      - bot_token: 'YOUR_TELEGRAM_BOT_TOKEN'
        chat_id: -1001234567890
        parse_mode: 'HTML'
        message: |
          {{ if eq .Status "firing" }}🚨 <b>CRITICAL ALERT</b>{{ else }}✅ <b>RESOLVED</b>{{ end }}
          {{ range .Alerts }}
          <b>{{ .Annotations.summary }}</b>
          {{ .Annotations.description }}
          {{ end }}
    email_configs:
      - to: 'oncall@yourdomain.com'
        send_resolved: true
        headers:
          Subject: '{{ if eq .Status "firing" }}🔴 CRITICAL{{ else }}✅ Resolved{{ end }}: {{ .CommonLabels.alertname }}'

  # cost alerts
  - name: 'cost-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#openclaw-costs'
        send_resolved: true
        title: '💰 {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          {{ .Annotations.description }}
          {{ end }}
```
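Alertmanager ships with amtool, which can lint this file before deployment; a minimal sketch using the same image as the compose file:

```bash
docker run --rm \
  -v "$(pwd)/monitoring/alertmanager:/cfg:ro" \
  --entrypoint amtool \
  prom/alertmanager:v0.28.1 check-config /cfg/alertmanager.yml
```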

Prompt Template: Generating an Alertmanager Config

```text
You are a DevOps expert. Generate an Alertmanager configuration for my [platform name].

Requirements:
- Notification channels: [Slack/Telegram/Email/PagerDuty/Discord]
- Critical alerts need [instant notification / phone escalation]
- Warning alerts go to [channel name]
- Grouping strategy: group by [alertname/component/severity]
- Repeat interval: Critical [1 hour], Warning [4 hours]
- Inhibition rule needed: a firing Critical suppresses Warnings for the same component
- Working hours: [timezone]; outside working hours, send Critical only

Produce a complete alertmanager.yml configuration file with explanatory comments.
```

7. Grafana Dashboards

OpenClaw Overview Dashboard

Below is a pre-built Grafana dashboard JSON template covering OpenClaw's core monitoring panels:

{ "dashboard": { "title": "OpenClaw Agent 监控总览", "tags": ["openclaw", "agent", "monitoring"], "timezone": "browser", "refresh": "30s", "panels": [ { "title": "Gateway 状态", "type": "stat", "gridPos": { "h": 4, "w": 4, "x": 0, "y": 0 }, "targets": [{ "expr": "openclaw_health_status", "legendFormat": "Health" }], "fieldConfig": { "defaults": { "mappings": [ { "type": "value", "options": { "1": { "text": "✅ 健康", "color": "green" } } }, { "type": "value", "options": { "0": { "text": "🔴 异常", "color": "red" } } } ] } } }, { "title": "Gateway 运行时间", "type": "stat", "gridPos": { "h": 4, "w": 4, "x": 4, "y": 0 }, "targets": [{ "expr": "openclaw_gateway_uptime_seconds / 3600", "legendFormat": "Uptime" }], "fieldConfig": { "defaults": { "unit": "h", "decimals": 1 } } }, { "title": "活跃会话数", "type": "stat", "gridPos": { "h": 4, "w": 4, "x": 8, "y": 0 }, "targets": [{ "expr": "openclaw_active_sessions", "legendFormat": "Sessions" }] }, { "title": "今日 Token 消耗", "type": "stat", "gridPos": { "h": 4, "w": 4, "x": 12, "y": 0 }, "targets": [{ "expr": "increase(openclaw_llm_tokens_total[24h])", "legendFormat": "Tokens" }], "fieldConfig": { "defaults": { "unit": "short", "decimals": 0 } } }, { "title": "今日估算成本", "type": "stat", "gridPos": { "h": 4, "w": 4, "x": 16, "y": 0 }, "targets": [{ "expr": "increase(openclaw_llm_cost_dollars[24h])", "legendFormat": "Cost" }], "fieldConfig": { "defaults": { "unit": "currencyUSD", "decimals": 2 } } }, { "title": "当前告警数", "type": "stat", "gridPos": { "h": 4, "w": 4, "x": 20, "y": 0 }, "targets": [{ "expr": "count(ALERTS{alertstate=\"firing\"})", "legendFormat": "Firing" }], "fieldConfig": { "defaults": { "thresholds": { "steps": [ { "value": 0, "color": "green" }, { "value": 1, "color": "yellow" }, { "value": 3, "color": "red" } ] } } } }, { "title": "LLM 请求延迟 (P50 / P95 / P99)", "type": "timeseries", "gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 }, "targets": [ { "expr": "histogram_quantile(0.50, rate(openclaw_llm_request_duration_seconds_bucket[5m]))", "legendFormat": "P50" }, { "expr": "histogram_quantile(0.95, rate(openclaw_llm_request_duration_seconds_bucket[5m]))", "legendFormat": "P95" }, { "expr": "histogram_quantile(0.99, rate(openclaw_llm_request_duration_seconds_bucket[5m]))", "legendFormat": "P99" } ], "fieldConfig": { "defaults": { "unit": "s" } } }, { "title": "Token 消耗趋势(按模型)", "type": "timeseries", "gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 }, "targets": [{ "expr": "rate(openclaw_llm_tokens_total[1h]) * 3600", "legendFormat": "{{ provider }}/{{ model }} ({{ type }})" }], "fieldConfig": { "defaults": { "unit": "short" } } }, { "title": "任务成功率", "type": "gauge", "gridPos": { "h": 6, "w": 6, "x": 0, "y": 12 }, "targets": [{ "expr": "rate(openclaw_tasks_total{status=\"success\"}[1h]) / rate(openclaw_tasks_total[1h]) * 100", "legendFormat": "Success Rate" }], "fieldConfig": { "defaults": { "unit": "percent", "min": 0, "max": 100, "thresholds": { "steps": [ { "value": 0, "color": "red" }, { "value": 80, "color": "yellow" }, { "value": 95, "color": "green" } ] } } } }, { "title": "CPU 使用率", "type": "timeseries", "gridPos": { "h": 6, "w": 9, "x": 6, "y": 12 }, "targets": [{ "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)", "legendFormat": "CPU %" }], "fieldConfig": { "defaults": { "unit": "percent", "max": 100 } } }, { "title": "内存使用率", "type": "timeseries", "gridPos": { "h": 6, "w": 9, "x": 15, "y": 12 }, "targets": [{ "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100", "legendFormat": "Memory %" }], 
"fieldConfig": { "defaults": { "unit": "percent", "max": 100 } } }, { "title": "工具调用统计(Top 10)", "type": "barchart", "gridPos": { "h": 8, "w": 12, "x": 0, "y": 18 }, "targets": [{ "expr": "topk(10, increase(openclaw_tool_calls_total[24h]))", "legendFormat": "{{ tool_name }}" }] }, { "title": "累计 API 成本趋势", "type": "timeseries", "gridPos": { "h": 8, "w": 12, "x": 12, "y": 18 }, "targets": [{ "expr": "increase(openclaw_llm_cost_dollars[24h])", "legendFormat": "{{ provider }}/{{ model }}" }], "fieldConfig": { "defaults": { "unit": "currencyUSD" } } } ] } }

Save this JSON as monitoring/grafana/dashboards/openclaw-overview.json; Grafana will load it automatically via provisioning.
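Once Grafana is up, you can confirm the provisioned dashboard was picked up via the search API, authenticating with the admin credentials from .env:

```bash
curl -s -u "admin:${GRAFANA_ADMIN_PASSWORD}" \
  "http://127.0.0.1:3000/api/search?query=OpenClaw" | jq '.[].title'
```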

Dashboard Panel Reference

| Panel | Type | Purpose |
| --- | --- | --- |
| Gateway status | Stat | See at a glance whether the Gateway is healthy |
| Uptime | Stat | Time since the last restart |
| Active sessions | Stat | Agent sessions currently running |
| Tokens today | Stat | Total token usage over the last 24 hours |
| Estimated cost today | Stat | Estimated API cost over the last 24 hours |
| Firing alerts | Stat | Number of currently firing alerts |
| LLM request latency | Time series | P50/P95/P99 latency trends |
| Token burn by model | Time series | Token consumption rate per model |
| Task success rate | Gauge | Percentage of successful task runs |
| CPU/memory usage | Time series | Infrastructure resource trends |
| Tool call stats | Bar chart | Top 10 most-used tools |
| Cumulative API cost | Time series | Cost trend per model |

8. External Availability Monitoring

Uptime Kuma Integration

In addition to internal Prometheus monitoring, configure an external availability probe so you can tell from the outside whether OpenClaw is reachable:

```yaml
# add to docker-compose.monitoring.yml
  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: openclaw-uptime-kuma
    restart: unless-stopped
    ports:
      - "127.0.0.1:3001:3001"
    volumes:
      - uptime_kuma_data:/app/data
    networks:
      - openclaw-net

# and declare the named volume in the top-level volumes block:
# volumes:
#   uptime_kuma_data:
```

Configure the following monitors in Uptime Kuma:

| Monitor | Type | URL/Target | Interval | Notes |
| --- | --- | --- | --- | --- |
| Gateway health | HTTP(s) | https://openclaw.yourdomain.com/health | 60s | Gateway liveness |
| Gateway auth | HTTP(s) - Keyword | /api/status + Bearer token | 120s | Verify authentication works |
| TLS certificate | HTTP(s) | Primary domain | 24h | Alert 14 days before certificate expiry |
| DNS resolution | DNS | openclaw.yourdomain.com | 300s | DNS resolves correctly |

SLA Monitoring

If you need to commit to an SLA, track the following availability numbers:

```
# monthly availability
SLA = (1 - total downtime minutes / total minutes) × 100%

# target SLA reference
99.0% = at most 7.3 hours of downtime per month
99.5% = at most 3.65 hours of downtime per month
99.9% = at most 43.8 minutes of downtime per month
```
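The downtime budgets above follow directly from the formula; a one-liner to compute the budget for any target (assuming a 30-day month):

```bash
# minutes of allowed downtime per 30-day month at a given SLA target
sla=99.9
awk -v sla="$sla" 'BEGIN { printf "%.1f minutes/month\n", (1 - sla/100) * 30 * 24 * 60 }'
# → 43.2 minutes/month
```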

Create an SLA panel in Grafana:

```promql
# PromQL: availability percentage over the past 30 days
(1 - (
  count_over_time((openclaw_health_status == 0)[30d:1m])
  /
  count_over_time(openclaw_health_status[30d:1m])
)) * 100
```

9. Monitoring in Kubernetes

kube-prometheus-stack

If OpenClaw runs on Kubernetes, the recommended way to deploy the complete monitoring stack is the kube-prometheus-stack Helm chart:

```bash
# add the Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword="${GRAFANA_ADMIN_PASSWORD}" \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```
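After installation, verify the rollout and reach Grafana through a temporary port-forward. The service name below assumes the chart's usual `<release>-grafana` naming for a release called `monitoring`:

```bash
# all pods in the monitoring namespace should reach Running
kubectl -n monitoring get pods

# temporary local access to Grafana without an Ingress
kubectl -n monitoring port-forward svc/monitoring-grafana 3000:80
```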

ServiceMonitor Configuration

Create a ServiceMonitor for OpenClaw so the Prometheus Operator discovers the scrape target automatically:

```yaml
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: openclaw-monitor
  namespace: openclaw
  labels:
    release: monitoring  # must match the kube-prometheus-stack label selector
spec:
  selector:
    matchLabels:
      app: openclaw
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
      - openclaw
```
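Apply it and make sure the selector actually matches something: a ServiceMonitor silently scrapes nothing if no Service carries the `app: openclaw` label or a port named `http`. A quick sanity check:

```bash
kubectl apply -f servicemonitor.yaml

# the Service must match both the label selector and the named port
kubectl -n openclaw get svc -l app=openclaw \
  -o custom-columns=NAME:.metadata.name,PORTS:.spec.ports[*].name
```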

PrometheusRule Configuration

Manage the alert rules as Kubernetes resources:

```yaml
# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openclaw-alerts
  namespace: openclaw
  labels:
    release: monitoring
spec:
  groups:
    - name: openclaw-agent
      rules:
        - alert: OpenClawGatewayDown
          expr: openclaw_health_status == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "OpenClaw Gateway is down"
        # ... remaining rules as above
```

Hands-On Case Study: Building the OpenClaw Monitoring Stack from Scratch

Scenario

An indie developer runs OpenClaw on a Hetzner VPS (CX22, 2 vCPU / 4 GB) and wants basic monitoring and alerting without adding much resource overhead.

Step 1: Prepare the Monitoring Config Files

```bash
# inside the openclaw-deploy directory
mkdir -p monitoring/{prometheus/rules,alertmanager,grafana/provisioning/{datasources,dashboards},grafana/dashboards}

# create the Prometheus config (use the template above)
vim monitoring/prometheus/prometheus.yml

# create the alert rules
vim monitoring/prometheus/rules/infrastructure.yml
vim monitoring/prometheus/rules/openclaw-agent.yml

# create the Alertmanager config
vim monitoring/alertmanager/alertmanager.yml

# create the Grafana data source config
vim monitoring/grafana/provisioning/datasources/prometheus.yml

# create the dashboard provider config
vim monitoring/grafana/provisioning/dashboards/dashboards.yml

# copy in the dashboard JSON
vim monitoring/grafana/dashboards/openclaw-overview.json
```

Step 2: Set Up the Telegram Alert Bot

```bash
# 1. Talk to @BotFather in Telegram and create a new bot
# 2. Save the bot token
# 3. Create a group and add the bot to it
# 4. Get the chat ID:
curl -s "https://api.telegram.org/bot<YOUR_BOT_TOKEN>/getUpdates" | jq '.result[0].message.chat.id'

# 5. Put the token and chat ID into alertmanager.yml
```
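Before wiring the bot into Alertmanager, you can send a message directly through the Bot API to confirm the token and chat ID are correct:

```bash
# should return true, and the message should appear in the group
curl -s "https://api.telegram.org/bot<YOUR_BOT_TOKEN>/sendMessage" \
  -d chat_id=-1001234567890 \
  -d text="OpenClaw alerting test" | jq '.ok'
```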

Step 3: Start the Monitoring Stack

```bash
# add the Grafana password to .env
echo "GRAFANA_ADMIN_PASSWORD=$(openssl rand -base64 16)" >> .env

# start everything
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d

# check the status of every container
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml ps
```

Step 4: Reverse-Proxy Access to Grafana

Add a reverse-proxy block for Grafana to the Caddyfile:

```caddyfile
grafana.yourdomain.com {
    reverse_proxy localhost:3000
    header {
        X-Content-Type-Options nosniff
        X-Frame-Options DENY
        -Server
    }
}
```

```bash
# reload the Caddy config
docker compose restart caddy
```
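A quick check that the proxy is up and the security headers are applied:

```bash
curl -sI https://grafana.yourdomain.com/api/health | \
  grep -iE 'HTTP/|x-frame-options|x-content-type-options'
```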

Step 5: Verify the Monitoring Stack

```bash
# check Prometheus targets
curl -s http://127.0.0.1:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# expected output:
# {"job": "prometheus", "health": "up"}
# {"job": "node-exporter", "health": "up"}
# {"job": "cadvisor", "health": "up"}
# {"job": "openclaw-gateway", "health": "up"}

# check the alert rules
curl -s http://127.0.0.1:9090/api/v1/rules | jq '.data.groups[].rules[] | {name: .name, state: .state}'

# test Alertmanager notifications by firing a manual test alert
curl -X POST http://127.0.0.1:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "component": "test"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Verifying that the Alertmanager notification channels work."
    }
  }]'
```

Step 6: Open the Grafana Dashboard

  1. Open https://grafana.yourdomain.com
  2. Log in with admin / ${GRAFANA_ADMIN_PASSWORD}
  3. Navigate to Dashboards → OpenClaw → OpenClaw Agent Overview
  4. Confirm that every panel is showing data

Resource Overhead

| Component | CPU | Memory | Disk |
| --- | --- | --- | --- |
| Prometheus | ~0.1 core | ~200 MB | ~1 GB/month (30-day retention) |
| Grafana | ~0.05 core | ~100 MB | ~50 MB |
| Alertmanager | ~0.01 core | ~30 MB | ~10 MB |
| Node Exporter | ~0.01 core | ~20 MB | — |
| cAdvisor | ~0.05 core | ~80 MB | — |
| Total | ~0.22 cores | ~430 MB | ~1 GB/month |

💡 On a 2 vCPU / 4 GB VPS, the monitoring stack consumes roughly 10% of CPU and 10% of memory, so it coexists comfortably with OpenClaw.


Pitfalls

❌ Common Mistakes

  1. Monitoring only infrastructure while ignoring agent business metrics

    • Problem: healthy CPU and memory do not mean the agent is doing useful work. An agent can be stuck in an infinite loop, running on an expired API key, or failing every tool call while infrastructure metrics look perfectly normal.
    • Correct approach: monitor infrastructure metrics and agent business metrics (task success rate, token burn, LLM latency) together; neither is sufficient on its own.
  2. Over-sensitive thresholds that cause alert fatigue

    • Problem: alerting at 50% CPU or on every API timeout teaches the team to ignore all alerts, and the real incidents drown in the noise.
    • Correct approach: follow the principle that every alert must require human action. Give Warning-level rules a reasonable `for` duration (5-10 minutes) so transient spikes do not fire them.
  3. Prometheus retention set too long, exhausting the disk

    • Problem: the default 15-day retention sounds modest, but high-cardinality metrics (such as tool calls labeled by tool_name) inflate storage quickly.
    • Correct approach: set `--storage.tsdb.retention.time=30d` and watch Prometheus's own disk usage. Use recording rules to pre-aggregate high-frequency queries (see the recording-rule sketch after this list).
  4. Grafana exposed to the internet with the default password

    • Problem: Grafana ships with admin/admin; exposed publicly, anyone can read your monitoring data and alert configuration.
    • Correct approach: access it through a reverse proxy, set a strong password, and disable sign-ups (GF_USERS_ALLOW_SIGN_UP=false).
  5. Never testing the notification channels

    • Problem: Slack/Telegram alerting is configured but never tested; only during a real incident do you discover the webhook URL is wrong or the bot token has expired.
    • Correct approach: send a test alert immediately after deployment to verify every channel, and repeat monthly to confirm the channels still work.
  6. Nobody monitoring the monitoring stack

    • Problem: Prometheus dies and nobody notices; you only find out monitoring has been down when OpenClaw itself breaks.
    • Correct approach: use an external service (Uptime Kuma, UptimeRobot, Betterstack) to watch the availability of Prometheus and Grafana themselves.
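For pitfall 3, a recording rule pre-computes an expensive or high-frequency expression once per interval so dashboards and alerts can query the cheap result instead. A minimal sketch (the rule name here is illustrative):

```bash
# write a recording-rule file next to the alerting rules, then hot-reload Prometheus
cat > monitoring/prometheus/rules/recording.yml <<'EOF'
groups:
  - name: openclaw-recording
    interval: 1m
    rules:
      # pre-aggregated task failure ratio, reusable in dashboards and alerts
      - record: openclaw:task_failure_ratio:rate15m
        expr: rate(openclaw_tasks_total{status="failure"}[15m]) / rate(openclaw_tasks_total[15m])
EOF

curl -X POST http://127.0.0.1:9090/-/reload
```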

✅ Best Practices

  1. Tiered alerting: three levels, Info → Warning → Critical. Critical demands immediate action, Warning is handled during working hours, Info is only recorded.
  2. Every alert must be actionable: put a runbook (remediation steps) in each alert's annotations so whoever receives it knows what to do.
  3. Monitor cost up front: token burn and API cost are the metrics that most easily spiral out of control on an AI agent platform; always set daily/weekly budget alerts.
  4. Review alert rules regularly: audit the firing history monthly, delete rules that never fire, and tune thresholds that keep producing false positives.
  5. Dashboards as documentation: a Grafana dashboard should let a new team member understand system state within 5 minutes; name panels clearly and lay them out sensibly.
  6. Back up the monitoring config: keep prometheus.yml, the alert rules, alertmanager.yml, and the Grafana dashboard JSON under Git version control.

Related Resources & Further Reading

| Resource | Type | Description | Link |
| --- | --- | --- | --- |
| Prometheus documentation | Official docs | Authoritative reference for configuration, PromQL, and alerting rules | prometheus.io/docs |
| Grafana documentation | Official docs | Dashboards, data sources, alert configuration | grafana.com/docs |
| Alertmanager configuration guide | Official docs | Routing, receivers, inhibition rules | prometheus.io/docs/alerting |
| Awesome Prometheus Alerts | Open-source project | Community-maintained collection of alert rules | github.com/samber/awesome-prometheus-alerts |
| kube-prometheus-stack | Helm chart | One-command Prometheus + Grafana deployment on K8s | github.com/prometheus-community/helm-charts |
| Uptime Kuma | Open-source project | Self-hosted availability monitoring | github.com/louislam/uptime-kuma |
| Grafana Cloud free tier | Managed service | 10K metrics free, zero ops | grafana.com/products/cloud |
| AI agent observability guide | Community article | The particular challenges of monitoring AI agents | blaxel.ai/blog/ai-observability |
| Grafana + AI agent monitoring | Official blog | Monitoring AI agent applications with Grafana Cloud | grafana.com/blog |
| Multi-agent system observability | Community guide | Monitoring and troubleshooting production multi-agent systems | xugj520.cn |

Sources

📝 Compiled from public material from 2025-2026; the Prometheus configuration and alert rules have been adapted for the OpenClaw scenario. For exact metric names and API endpoints, consult the official OpenClaw documentation.


📖 Back to Overview & Navigation | Previous: 25a - High-Availability Deployment | Next: 25c - Log Management & Auditing
