25b - Monitoring and Alerting
This is Section 2 of Chapter 25 of the AI Agent Practical Handbook. Previous: 25a - High-Availability Deployment | Next: 25c - Log Management and Auditing
Overview
OpenClaw is an AI agent platform that runs autonomously 24/7, so its monitoring needs go well beyond those of a traditional web application: besides infrastructure metrics (CPU, memory, disk), you also have to track agent-specific business metrics (token consumption, model latency, task success rate, session health). As of 2026, Prometheus + Grafana is the de facto open-source monitoring standard, and together with Alertmanager it closes the loop from metric collection to incident notification.
This section is a complete guide to building an OpenClaw monitoring stack from scratch, covering Prometheus configuration, custom agent metrics, Grafana dashboard templates, and multi-channel alerting rules.
1. Monitoring Architecture Overview
Layers of the monitoring stack
OpenClaw monitoring spans three target levels — host, container, and agent — organized into the following stack:
┌─────────────────────────────────────────────────────────┐
│                Alert notification layer                  │
│  Alertmanager → Slack / Telegram / Email / PagerDuty     │
├─────────────────────────────────────────────────────────┤
│                  Visualization layer                     │
│  Grafana dashboards (infrastructure + agent business)    │
├─────────────────────────────────────────────────────────┤
│                 Metrics storage layer                    │
│        Prometheus (time-series DB, pull model)           │
├─────────────────────────────────────────────────────────┤
│                Metrics collection layer                  │
│  Node Exporter      cAdvisor       custom /metrics       │
│  (host metrics) (container metrics) (agent metrics)      │
├─────────────────────────────────────────────────────────┤
│                   Monitored targets                      │
│  VPS host   Docker containers  OpenClaw Gateway + Agents │
└─────────────────────────────────────────────────────────┘
Tool recommendations
| Tool | Purpose | Price | Best for |
|---|---|---|---|
| Prometheus | Time-series metric collection and storage | Free (open source) | The core metrics engine in every scenario |
| Grafana | Metric visualization and dashboards | Free (open source) / Cloud from $0 | Dashboards and alert visualization |
| Alertmanager | Alert routing and notification | Free (open source) | Alert deduplication, grouping, silencing |
| Node Exporter | Host-level metric collection | Free (open source) | CPU/memory/disk/network monitoring |
| cAdvisor | Container-level metric collection | Free (open source) | Docker container resource monitoring |
| Uptime Kuma | External availability monitoring | Free (open source) | HTTP/TCP endpoint liveness checks |
| Grafana Cloud | Managed monitoring platform | Free tier (10K metrics) / Pro $8/mo | Zero-ops setups |
| Datadog | Full-stack observability | From $15/host/mo | Enterprise requirements |
| New Relic | APM + infrastructure | Free tier (100 GB/mo) | Full-stack monitoring |
| Betterstack | Logs + status pages + alerting | Free tier / from $25/mo | Status pages + incident management |
💡 Recommended combination: Prometheus + Grafana + Alertmanager (self-hosted, free) is the best starting point for OpenClaw monitoring. If you already run a Kubernetes cluster, the kube-prometheus-stack Helm chart deploys the whole stack in one step.
2. Deploying Prometheus + Grafana
Docker Compose integration
On top of the Docker Compose setup from 25a - High-Availability Deployment, add the monitoring components:
# docker-compose.monitoring.yml — monitoring stack
# Use together with the main docker-compose.yml:
#   docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d
version: '3.8'

services:
  # ============================================
  # Prometheus — metric collection and storage
  # ============================================
  prometheus:
    image: prom/prometheus:v3.4.0
    container_name: openclaw-prometheus
    restart: unless-stopped
    user: "65534:65534"  # run as nobody
    ports:
      - "127.0.0.1:9090:9090"
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/prometheus/rules/:/etc/prometheus/rules/:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'   # allow config hot-reload via /-/reload
      - '--web.enable-admin-api'
    networks:
      - openclaw-net
    logging:
      driver: json-file
      options:
        max-size: "20m"
        max-file: "3"

  # ============================================
  # Grafana — dashboards and visualization
  # ============================================
  grafana:
    image: grafana/grafana:11.6.0
    container_name: openclaw-grafana
    restart: unless-stopped
    ports:
      - "127.0.0.1:3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=https://grafana.yourdomain.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/provisioning/:/etc/grafana/provisioning/:ro
      - ./monitoring/grafana/dashboards/:/var/lib/grafana/dashboards/:ro
    networks:
      - openclaw-net
    depends_on:
      - prometheus
    logging:
      driver: json-file
      options:
        max-size: "20m"
        max-file: "3"

  # ============================================
  # Alertmanager — alert routing and notification
  # ============================================
  alertmanager:
    image: prom/alertmanager:v0.28.1
    container_name: openclaw-alertmanager
    restart: unless-stopped
    ports:
      - "127.0.0.1:9093:9093"
    volumes:
      - ./monitoring/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    networks:
      - openclaw-net
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"

  # ============================================
  # Node Exporter — host metrics
  # ============================================
  node-exporter:
    image: prom/node-exporter:v1.9.1
    container_name: openclaw-node-exporter
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - openclaw-net

  # ============================================
  # cAdvisor — container metrics
  # ============================================
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.52.1
    container_name: openclaw-cadvisor
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    networks:
      - openclaw-net

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
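Before bringing anything up, it is worth checking that the two Compose files merge cleanly; docker compose config parses and validates the merged configuration without starting containers:
# no output and exit code 0 means the merged config is valid
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml config --quiet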
Directory layout
openclaw-deploy/
├── docker-compose.yml                 # main services
├── docker-compose.monitoring.yml      # monitoring stack
├── monitoring/
│   ├── prometheus/
│   │   ├── prometheus.yml             # Prometheus main config
│   │   └── rules/
│   │       ├── infrastructure.yml     # infrastructure alert rules
│   │       └── openclaw-agent.yml     # agent business alert rules
│   ├── alertmanager/
│   │   └── alertmanager.yml           # alert routing config
│   └── grafana/
│       ├── provisioning/
│       │   ├── datasources/
│       │   │   └── prometheus.yml     # auto-provisioned data source
│       │   └── dashboards/
│       │       └── dashboards.yml     # dashboard auto-discovery
│       └── dashboards/
│           ├── openclaw-overview.json # OpenClaw overview dashboard
│           └── infrastructure.json    # infrastructure dashboard
└── .env
Prometheus main configuration
# monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s       # default scrape interval
  evaluation_interval: 15s   # rule evaluation interval
  scrape_timeout: 10s

# alert rule files
rule_files:
  - /etc/prometheus/rules/*.yml

# Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# scrape targets
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter — host metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'openclaw-vps'

  # cAdvisor — container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # OpenClaw Gateway — application metrics
  # (only if OpenClaw exposes a /metrics endpoint)
  - job_name: 'openclaw-gateway'
    static_configs:
      - targets: ['openclaw-gateway:18789']
    metrics_path: /metrics
    scrape_interval: 30s
    # if authentication is required:
    # authorization:
    #   credentials: '${OPENCLAW_GATEWAY_TOKEN}'

  # OpenClaw health probe
  # Note: this only works if /health returns Prometheus text format;
  # if it returns plain JSON, probe it with blackbox_exporter or the
  # custom exporter from section 4 instead.
  - job_name: 'openclaw-health'
    metrics_path: /health
    static_configs:
      - targets: ['openclaw-gateway:18789']
    scrape_interval: 30s
Grafana data source provisioning
# monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: '15s'
      httpMethod: POST
Grafana dashboard auto-discovery
# monitoring/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: 'OpenClaw'
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false
Starting the monitoring stack
# create the directory structure
mkdir -p monitoring/{prometheus/rules,alertmanager,grafana/provisioning/{datasources,dashboards},grafana/dashboards}
# start (together with the main services)
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d
# verify each component
curl -s http://127.0.0.1:9090/-/healthy     # Prometheus
curl -s http://127.0.0.1:3000/api/health    # Grafana
curl -s http://127.0.0.1:9093/-/healthy     # Alertmanager
# check Prometheus target status
curl -s http://127.0.0.1:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
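Because Prometheus runs with --web.enable-lifecycle, you can validate and hot-reload configuration changes without restarting the container (promtool ships inside the prom/prometheus image):
# validate the main config and any referenced rule files
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml
# apply the changes in place
curl -X POST http://127.0.0.1:9090/-/reload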
3. Key Monitoring Metrics
Infrastructure metrics
OpenClaw runs on a VPS or on Kubernetes, and infrastructure monitoring is the first line of defense:
| Metric category | PromQL example | Suggested threshold | Notes |
|---|---|---|---|
| CPU usage | 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) | > 80% for 5 min | CPU spikes during inference-heavy agent work |
| Memory usage | (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 | > 85% for 5 min | Many concurrent agent sessions consume a lot of memory |
| Disk usage | (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 | > 80% | Logs and session data grow continuously |
| Disk I/O | rate(node_disk_io_time_seconds_total[5m]) | > 0.9 (90% busy) | Watch during SQLite write-heavy periods |
| Network traffic | rate(node_network_receive_bytes_total[5m]) | Unusual spikes | API calls and WebSocket traffic |
Container metrics
cAdvisor collects per-container resource usage for Docker:
| Metric | PromQL | Notes |
|---|---|---|
| Container CPU | rate(container_cpu_usage_seconds_total{name="openclaw-gateway"}[5m]) | Gateway process CPU consumption |
| Container memory | container_memory_usage_bytes{name="openclaw-gateway"} | Actual memory footprint |
| Container restarts | container_restart_count{name="openclaw-gateway"} | Frequent restarts indicate a problem |
| Container network | rate(container_network_receive_bytes_total{name="openclaw-gateway"}[5m]) | Network throughput |
Agent business metrics
This is the heart of OpenClaw monitoring — the agent-specific metrics that traditional infrastructure monitoring cannot see:
| Category | Metric name | Type | Notes |
|---|---|---|---|
| Session health | openclaw_active_sessions | Gauge | Currently active agent sessions |
| Task execution | openclaw_tasks_total | Counter | Total tasks executed (status label: success/failure/timeout) |
| Task latency | openclaw_task_duration_seconds | Histogram | Distribution of task execution time |
| Model calls | openclaw_llm_requests_total | Counter | LLM API calls (provider/model/status labels) |
| Model latency | openclaw_llm_request_duration_seconds | Histogram | LLM API response time |
| Token consumption | openclaw_llm_tokens_total | Counter | Tokens used (type label: input/output) |
| Token cost | openclaw_llm_cost_dollars | Counter | Estimated API cost (USD) |
| Tool calls | openclaw_tool_calls_total | Counter | Tool invocations (tool_name label) |
| Tool failures | openclaw_tool_errors_total | Counter | Failed tool invocations |
| TTFT | openclaw_time_to_first_token_seconds | Histogram | Time to first token |
| Error rate | openclaw_errors_total | Counter | Total errors (error_type label) |
| Health check | openclaw_health_status | Gauge | Gateway health (1=healthy, 0=unhealthy) |
💡 Note: as of early 2026, OpenClaw exposes only a limited set of native Prometheus metrics. The next section shows how to collect these business metrics with a custom exporter script.
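For orientation, this is roughly what a scrape of such metrics looks like in the Prometheus text exposition format — the names match the table above, but the values and label values are purely illustrative:
# HELP openclaw_active_sessions Number of currently active agent sessions
# TYPE openclaw_active_sessions gauge
openclaw_active_sessions 3
# HELP openclaw_llm_tokens_total Total tokens consumed
# TYPE openclaw_llm_tokens_total counter
openclaw_llm_tokens_total{provider="anthropic",model="claude-sonnet-4",type="input"} 1834520
openclaw_llm_tokens_total{provider="anthropic",model="claude-sonnet-4",type="output"} 412877
# HELP openclaw_task_duration_seconds Task execution duration in seconds
# TYPE openclaw_task_duration_seconds histogram
# (intermediate buckets omitted for brevity)
openclaw_task_duration_seconds_bucket{le="30"} 118
openclaw_task_duration_seconds_bucket{le="+Inf"} 127
openclaw_task_duration_seconds_sum 2714.3
openclaw_task_duration_seconds_count 127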
4. Custom Agent Metric Collection
OpenClaw Metrics Exporter
Since OpenClaw's /metrics endpoint may not cover every agent business metric, we can write a lightweight exporter script that pulls data from the OpenClaw API and exposes it in Prometheus format:
#!/usr/bin/env python3
"""
openclaw_exporter.py — OpenClaw Prometheus Exporter

Collects agent business metrics from the OpenClaw API and exposes
them in Prometheus format.
"""
import os
import time

import requests
from prometheus_client import start_http_server, Gauge, Counter, Histogram

# ============================================
# Metric definitions
# ============================================

# Session metrics
ACTIVE_SESSIONS = Gauge(
    'openclaw_active_sessions',
    'Number of currently active agent sessions'
)

# Task metrics
TASKS_TOTAL = Counter(
    'openclaw_tasks_total',
    'Total number of tasks executed',
    ['status']  # success, failure, timeout
)
TASK_DURATION = Histogram(
    'openclaw_task_duration_seconds',
    'Task execution duration in seconds',
    buckets=[1, 5, 10, 30, 60, 120, 300, 600]
)

# LLM metrics
LLM_REQUESTS = Counter(
    'openclaw_llm_requests_total',
    'Total LLM API requests',
    ['provider', 'model', 'status']  # status: ok/error, referenced by the alert rules below
)
LLM_TOKENS = Counter(
    'openclaw_llm_tokens_total',
    'Total tokens consumed',
    ['provider', 'model', 'type']  # type: input/output
)
LLM_COST = Counter(
    'openclaw_llm_cost_dollars',
    'Estimated LLM API cost in USD',
    ['provider', 'model']
)
LLM_LATENCY = Histogram(
    'openclaw_llm_request_duration_seconds',
    'LLM API request duration',
    ['provider', 'model'],
    buckets=[0.5, 1, 2, 5, 10, 30, 60]
)

# Tool-call metrics
TOOL_CALLS = Counter(
    'openclaw_tool_calls_total',
    'Total tool invocations',
    ['tool_name']
)
TOOL_ERRORS = Counter(
    'openclaw_tool_errors_total',
    'Total tool invocation errors',
    ['tool_name', 'error_type']
)

# Health status
HEALTH_STATUS = Gauge(
    'openclaw_health_status',
    'Gateway health status (1=healthy, 0=unhealthy)'
)
GATEWAY_UPTIME = Gauge(
    'openclaw_gateway_uptime_seconds',
    'Gateway uptime in seconds'
)

# ============================================
# Collection logic
# ============================================
GATEWAY_URL = os.getenv('OPENCLAW_GATEWAY_URL', 'http://127.0.0.1:18789')
GATEWAY_TOKEN = os.getenv('OPENCLAW_GATEWAY_TOKEN', '')


def fetch_api(endpoint):
    """Fetch JSON from the OpenClaw API; return None on any failure."""
    headers = {}
    if GATEWAY_TOKEN:
        headers['Authorization'] = f'Bearer {GATEWAY_TOKEN}'
    try:
        resp = requests.get(f'{GATEWAY_URL}{endpoint}', headers=headers, timeout=10)
        resp.raise_for_status()
        return resp.json()
    except Exception as e:
        print(f"[WARN] Failed to fetch {endpoint}: {e}")
        return None


def collect_health():
    """Collect gateway health status."""
    data = fetch_api('/health')
    if data and data.get('status') == 'ok':
        HEALTH_STATUS.set(1)
        if 'uptime' in data:
            GATEWAY_UPTIME.set(data['uptime'])
    else:
        HEALTH_STATUS.set(0)


def collect_sessions():
    """Collect session metrics."""
    data = fetch_api('/api/status')
    if data:
        ACTIVE_SESSIONS.set(data.get('activeSessions', 0))


def collect_metrics():
    """Main collection pass."""
    collect_health()
    collect_sessions()
    # Extend with more collectors based on the actual shape
    # of the OpenClaw API responses.


if __name__ == '__main__':
    port = int(os.getenv('EXPORTER_PORT', '9101'))
    interval = int(os.getenv('COLLECT_INTERVAL', '30'))
    print(f"[INFO] Starting OpenClaw Exporter on :{port}")
    start_http_server(port)
    while True:
        collect_metrics()
        time.sleep(interval)
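To smoke-test the exporter outside Docker first (assuming Python 3.10+ on the host and a gateway reachable on localhost):
pip install prometheus-client requests
OPENCLAW_GATEWAY_URL=http://127.0.0.1:18789 python openclaw_exporter.py &
curl -s http://127.0.0.1:9101/metrics | grep '^openclaw_'
Even if the gateway is unreachable, you should see openclaw_health_status 0 — which confirms the metrics endpoint itself works.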
Exporter Docker configuration
# monitoring/exporter/Dockerfile
FROM python:3.12-slim
WORKDIR /app
RUN pip install --no-cache-dir prometheus-client requests
COPY openclaw_exporter.py .
EXPOSE 9101
CMD ["python", "openclaw_exporter.py"]
Add to docker-compose.monitoring.yml:
  openclaw-exporter:
    build: ./monitoring/exporter
    container_name: openclaw-exporter
    restart: unless-stopped
    environment:
      - OPENCLAW_GATEWAY_URL=http://openclaw-gateway:18789
      - OPENCLAW_GATEWAY_TOKEN=${OPENCLAW_GATEWAY_TOKEN}
      - EXPORTER_PORT=9101
      - COLLECT_INTERVAL=30
    networks:
      - openclaw-net
    depends_on:
      - openclaw-gateway
Add a scrape target in prometheus.yml:
  - job_name: 'openclaw-exporter'
    static_configs:
      - targets: ['openclaw-exporter:9101']
    scrape_interval: 30s
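After reloading Prometheus, confirm that the new job is actually being scraped — target health should be "up" and lastError empty:
curl -s http://127.0.0.1:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job == "openclaw-exporter") | {health, lastError}'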
5. Alert Rule Configuration
Infrastructure alert rules
# monitoring/prometheus/rules/infrastructure.yml
groups:
  - name: infrastructure
    rules:
      # ---- host-level alerts ----
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "High CPU usage ({{ $labels.instance }})"
          description: "CPU usage has been above 80% for 5 minutes; current value {{ $value | printf \"%.1f\" }}%."
          runbook: "Check for runaway agent sessions; consider tighter resource limits or a bigger VPS."

      - alert: CriticalCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 2m
        labels:
          severity: critical
          component: infrastructure
        annotations:
          summary: "🔴 Critical CPU usage ({{ $labels.instance }})"
          description: "CPU usage is above 95%; current value {{ $value | printf \"%.1f\" }}%. The system may become unresponsive."

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "High memory usage ({{ $labels.instance }})"
          description: "Memory usage is above 85%; current value {{ $value | printf \"%.1f\" }}%."
          runbook: "Check the number of agent sessions; consider capping concurrency or adding memory."

      - alert: DiskSpaceRunningLow
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 80
        for: 10m
        labels:
          severity: warning
          component: infrastructure
        annotations:
          summary: "Disk space running low ({{ $labels.instance }})"
          description: "Root filesystem usage is above 80%; current value {{ $value | printf \"%.1f\" }}%."
          runbook: "Clean up old logs and backups; check the Docker image cache: docker system prune"

      - alert: DiskSpaceCritical
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 95
        for: 5m
        labels:
          severity: critical
          component: infrastructure
        annotations:
          summary: "🔴 Disk almost full ({{ $labels.instance }})"
          description: "Root filesystem usage is above 95%. OpenClaw may be unable to write data."

      # ---- container-level alerts ----
      - alert: ContainerHighMemory
        expr: container_memory_usage_bytes{name="openclaw-gateway"} / container_spec_memory_limit_bytes{name="openclaw-gateway"} * 100 > 85
        for: 5m
        labels:
          severity: warning
          component: container
        annotations:
          summary: "OpenClaw container memory near its limit"
          description: "Gateway container memory usage is at {{ $value | printf \"%.1f\" }}% of deploy.resources.limits."

      - alert: ContainerRestarted
        expr: increase(container_restart_count{name="openclaw-gateway"}[1h]) > 2
        labels:
          severity: critical
          component: container
        annotations:
          summary: "🔴 OpenClaw container restarting frequently"
          description: "The Gateway container restarted {{ $value }} times in the past hour."
          runbook: "Check the container logs: docker logs openclaw-gateway --tail 100"
Agent business alert rules
# monitoring/prometheus/rules/openclaw-agent.yml
groups:
  - name: openclaw-agent
    rules:
      # ---- gateway health ----
      - alert: OpenClawGatewayDown
        expr: openclaw_health_status == 0
        for: 1m
        labels:
          severity: critical
          component: openclaw
        annotations:
          summary: "🔴 OpenClaw Gateway is down"
          description: "The gateway health check has been failing for over a minute. All agent sessions are interrupted."
          runbook: |
            1. Check container status: docker compose ps
            2. Inspect logs: docker compose logs --tail 50 openclaw-gateway
            3. Try a restart: docker compose restart openclaw-gateway
            4. Check whether any API key has expired

      - alert: OpenClawHealthCheckTimeout
        expr: up{job="openclaw-gateway"} == 0
        for: 2m
        labels:
          severity: critical
          component: openclaw
        annotations:
          summary: "🔴 OpenClaw health endpoint not responding"
          description: "Prometheus cannot reach the OpenClaw Gateway /metrics endpoint."

      # ---- task execution ----
      - alert: HighTaskFailureRate
        expr: |
          rate(openclaw_tasks_total{status="failure"}[15m])
            / rate(openclaw_tasks_total[15m]) > 0.3
        for: 10m
        labels:
          severity: warning
          component: openclaw
        annotations:
          summary: "High agent task failure rate"
          description: "Task failure rate over the past 15 minutes exceeds 30%; current value {{ $value | humanizePercentage }}."
          runbook: "Look for error patterns in the agent logs; likely API rate limiting or a broken tool."

      - alert: TaskExecutionSlow
        expr: histogram_quantile(0.95, rate(openclaw_task_duration_seconds_bucket[15m])) > 300
        for: 10m
        labels:
          severity: warning
          component: openclaw
        annotations:
          summary: "Agent tasks running slowly"
          description: "P95 task duration exceeds 5 minutes ({{ $value | printf \"%.0f\" }}s)."

      # ---- LLM API ----
      - alert: LLMApiHighLatency
        expr: histogram_quantile(0.95, rate(openclaw_llm_request_duration_seconds_bucket[10m])) > 30
        for: 5m
        labels:
          severity: warning
          component: llm
        annotations:
          summary: "LLM API responding slowly"
          description: "P95 LLM request latency exceeds 30 seconds ({{ $value | printf \"%.1f\" }}s). The model provider may be rate-limiting."

      - alert: LLMApiErrors
        expr: rate(openclaw_llm_requests_total{status="error"}[10m]) > 0.1
        for: 5m
        labels:
          severity: warning
          component: llm
        annotations:
          summary: "Rising LLM API error rate"
          description: "LLM API errors exceed 0.1 req/s. Check API keys and quotas."

      - alert: HighTokenBurn
        expr: rate(openclaw_llm_tokens_total[1h]) * 3600 > 500000
        for: 30m
        labels:
          severity: warning
          component: cost
        annotations:
          summary: "⚠️ Abnormal token burn rate"
          description: "Token consumption over the past hour exceeds 500K/hour. Possibly a runaway agent loop."
          runbook: "Inspect active sessions and stop the suspicious agent via the Control UI or the API."

      - alert: DailyCostExceeded
        expr: increase(openclaw_llm_cost_dollars[24h]) > 50
        labels:
          severity: critical
          component: cost
        annotations:
          summary: "🔴 Daily API cost over budget"
          description: "LLM API cost over the past 24 hours exceeds $50 (currently ${{ $value | printf \"%.2f\" }})."

      # ---- tool calls ----
      - alert: ToolCallFailureSpike
        expr: |
          rate(openclaw_tool_errors_total[10m])
            / rate(openclaw_tool_calls_total[10m]) > 0.5
        for: 5m
        labels:
          severity: warning
          component: openclaw
        annotations:
          summary: "Tool call failure rate spiking"
          description: "Tool call failure rate exceeds 50%. An external service may be down."

      # ---- session health ----
      - alert: NoActiveSessions
        expr: openclaw_active_sessions == 0
        for: 15m
        labels:
          severity: info
          component: openclaw
        annotations:
          summary: "No active agent sessions"
          description: "No agent session has been active for 15 minutes. If you expect 24/7 operation, investigate."
6. Alertmanager Notification Configuration
Multi-channel alert routing
# monitoring/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  # SMTP (email notifications)
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'openclaw-alerts@yourdomain.com'
  smtp_auth_username: 'openclaw-alerts@yourdomain.com'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

# notification templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# routing
route:
  # default receiver
  receiver: 'slack-default'
  # grouping: by alertname + component
  group_by: ['alertname', 'component']
  group_wait: 30s        # wait 30s to batch alerts in the same group
  group_interval: 5m     # minimum gap between notifications for a group
  repeat_interval: 4h    # re-notify unresolved alerts every 4 hours
  routes:
    # critical alerts → Telegram + email (immediately)
    - match:
        severity: critical
      receiver: 'critical-multi'
      group_wait: 10s
      repeat_interval: 1h
    # cost alerts → dedicated channel
    - match:
        component: cost
      receiver: 'cost-alerts'
      group_wait: 1m
      repeat_interval: 6h
    # info level → Slack only
    - match:
        severity: info
      receiver: 'slack-default'
      repeat_interval: 24h

# inhibition: a firing critical alert silences warnings for the same component
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['component']

# receivers
receivers:
  # default Slack channel
  - name: 'slack-default'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#openclaw-alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          {{ if .Annotations.runbook }}📋 *Runbook:* {{ .Annotations.runbook }}{{ end }}
          {{ end }}

  # critical multi-channel (Telegram + email)
  - name: 'critical-multi'
    telegram_configs:
      - bot_token: 'YOUR_TELEGRAM_BOT_TOKEN'
        chat_id: -1001234567890
        parse_mode: 'HTML'
        message: |
          {{ if eq .Status "firing" }}🚨 <b>CRITICAL ALERT</b>{{ else }}✅ <b>RESOLVED</b>{{ end }}
          {{ range .Alerts }}
          <b>{{ .Annotations.summary }}</b>
          {{ .Annotations.description }}
          {{ end }}
    email_configs:
      - to: 'oncall@yourdomain.com'
        send_resolved: true
        headers:
          Subject: '{{ if eq .Status "firing" }}🔴 CRITICAL{{ else }}✅ Resolved{{ end }}: {{ .CommonLabels.alertname }}'

  # cost alerts
  - name: 'cost-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#openclaw-costs'
        send_resolved: true
        title: '💰 {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          {{ .Annotations.description }}
          {{ end }}
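Alertmanager ships with amtool, which can lint this file before you restart anything:
docker compose exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml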
Prompt template: generate an Alertmanager configuration
You are a DevOps expert. Please generate an Alertmanager configuration for my [platform name].
Requirements:
- Notification channels: [Slack/Telegram/Email/PagerDuty/Discord]
- Critical alerts need [immediate notification/phone escalation]
- Warning alerts go to [channel name]
- Grouping strategy: group by [alertname/component/severity]
- Repeat intervals: Critical [1 hour], Warning [4 hours]
- Inhibition rule: a firing Critical suppresses Warnings for the same component
- Working hours: [timezone]; outside working hours send Critical only
Please generate a complete alertmanager.yml, with explanatory comments.
7. Grafana Dashboards
OpenClaw overview dashboard
Below is a pre-built Grafana dashboard JSON template covering OpenClaw's core monitoring panels:
{
  "title": "OpenClaw Agent Monitoring Overview",
"tags": ["openclaw", "agent", "monitoring"],
"timezone": "browser",
"refresh": "30s",
"panels": [
{
"title": "Gateway 状态",
"type": "stat",
"gridPos": { "h": 4, "w": 4, "x": 0, "y": 0 },
"targets": [{
"expr": "openclaw_health_status",
"legendFormat": "Health"
}],
"fieldConfig": {
"defaults": {
"mappings": [
{ "type": "value", "options": { "1": { "text": "✅ 健康", "color": "green" } } },
{ "type": "value", "options": { "0": { "text": "🔴 异常", "color": "red" } } }
]
}
}
},
{
"title": "Gateway 运行时间",
"type": "stat",
"gridPos": { "h": 4, "w": 4, "x": 4, "y": 0 },
"targets": [{
"expr": "openclaw_gateway_uptime_seconds / 3600",
"legendFormat": "Uptime"
}],
"fieldConfig": {
"defaults": { "unit": "h", "decimals": 1 }
}
},
{
"title": "活跃会话数",
"type": "stat",
"gridPos": { "h": 4, "w": 4, "x": 8, "y": 0 },
"targets": [{
"expr": "openclaw_active_sessions",
"legendFormat": "Sessions"
}]
},
{
"title": "今日 Token 消耗",
"type": "stat",
"gridPos": { "h": 4, "w": 4, "x": 12, "y": 0 },
"targets": [{
"expr": "increase(openclaw_llm_tokens_total[24h])",
"legendFormat": "Tokens"
}],
"fieldConfig": {
"defaults": { "unit": "short", "decimals": 0 }
}
},
{
"title": "今日估算成本",
"type": "stat",
"gridPos": { "h": 4, "w": 4, "x": 16, "y": 0 },
"targets": [{
"expr": "increase(openclaw_llm_cost_dollars[24h])",
"legendFormat": "Cost"
}],
"fieldConfig": {
"defaults": { "unit": "currencyUSD", "decimals": 2 }
}
},
{
"title": "当前告警数",
"type": "stat",
"gridPos": { "h": 4, "w": 4, "x": 20, "y": 0 },
"targets": [{
"expr": "count(ALERTS{alertstate=\"firing\"})",
"legendFormat": "Firing"
}],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{ "value": 0, "color": "green" },
{ "value": 1, "color": "yellow" },
{ "value": 3, "color": "red" }
]
}
}
}
},
{
"title": "LLM 请求延迟 (P50 / P95 / P99)",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
"targets": [
{
"expr": "histogram_quantile(0.50, rate(openclaw_llm_request_duration_seconds_bucket[5m]))",
"legendFormat": "P50"
},
{
"expr": "histogram_quantile(0.95, rate(openclaw_llm_request_duration_seconds_bucket[5m]))",
"legendFormat": "P95"
},
{
"expr": "histogram_quantile(0.99, rate(openclaw_llm_request_duration_seconds_bucket[5m]))",
"legendFormat": "P99"
}
],
"fieldConfig": {
"defaults": { "unit": "s" }
}
},
{
"title": "Token 消耗趋势(按模型)",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
"targets": [{
"expr": "rate(openclaw_llm_tokens_total[1h]) * 3600",
"legendFormat": "{{ provider }}/{{ model }} ({{ type }})"
}],
"fieldConfig": {
"defaults": { "unit": "short" }
}
},
{
"title": "任务成功率",
"type": "gauge",
"gridPos": { "h": 6, "w": 6, "x": 0, "y": 12 },
"targets": [{
"expr": "rate(openclaw_tasks_total{status=\"success\"}[1h]) / rate(openclaw_tasks_total[1h]) * 100",
"legendFormat": "Success Rate"
}],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0, "max": 100,
"thresholds": {
"steps": [
{ "value": 0, "color": "red" },
{ "value": 80, "color": "yellow" },
{ "value": 95, "color": "green" }
]
}
}
}
},
{
"title": "CPU 使用率",
"type": "timeseries",
"gridPos": { "h": 6, "w": 9, "x": 6, "y": 12 },
"targets": [{
"expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "CPU %"
}],
"fieldConfig": {
"defaults": { "unit": "percent", "max": 100 }
}
},
{
"title": "内存使用率",
"type": "timeseries",
"gridPos": { "h": 6, "w": 9, "x": 15, "y": 12 },
"targets": [{
"expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
"legendFormat": "Memory %"
}],
"fieldConfig": {
"defaults": { "unit": "percent", "max": 100 }
}
},
{
"title": "工具调用统计(Top 10)",
"type": "barchart",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 18 },
"targets": [{
"expr": "topk(10, increase(openclaw_tool_calls_total[24h]))",
"legendFormat": "{{ tool_name }}"
}]
},
{
"title": "累计 API 成本趋势",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 18 },
"targets": [{
"expr": "increase(openclaw_llm_cost_dollars[24h])",
"legendFormat": "{{ provider }}/{{ model }}"
}],
"fieldConfig": {
"defaults": { "unit": "currencyUSD" }
}
}
]
}
Save this JSON as monitoring/grafana/dashboards/openclaw-overview.json and Grafana will load it automatically through provisioning. (Note that for file-based provisioning the dashboard model sits at the top level of the file; the { "dashboard": ... } wrapper is only used when POSTing to the HTTP API.)
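Once the stack is up, you can confirm the dashboard was actually provisioned through the Grafana HTTP API (the folder comes from the provisioning config above):
curl -s -u "admin:${GRAFANA_ADMIN_PASSWORD}" \
  'http://127.0.0.1:3000/api/search?query=OpenClaw' | jq '.[].title'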
Dashboard panel reference
| Panel | Type | Purpose |
|---|---|---|
| Gateway Status | Stat | See at a glance whether the gateway is healthy |
| Gateway Uptime | Stat | Time since the last restart |
| Active Sessions | Stat | Agent sessions currently running |
| Tokens Consumed (24h) | Stat | Total token usage over the past 24 hours |
| Estimated Cost (24h) | Stat | Estimated API cost over the past 24 hours |
| Firing Alerts | Stat | Number of alerts currently firing |
| LLM Request Latency | Time series | P50/P95/P99 latency trends |
| Token Consumption by Model | Time series | Token burn rate broken down by model |
| Task Success Rate | Gauge | Percentage of tasks completing successfully |
| CPU / Memory Usage | Time series | Infrastructure resource trends |
| Tool Calls (Top 10) | Bar chart | Most frequently used tools |
| Cumulative API Cost Trend | Time series | Cost trend by model |
8. External Availability Monitoring
Uptime Kuma integration
Alongside the internal Prometheus monitoring, configure an external availability probe that checks whether OpenClaw is reachable from the outside (ideally run it on a separate host, so the probe survives a VPS outage):
# add to docker-compose.monitoring.yml
  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: openclaw-uptime-kuma
    restart: unless-stopped
    ports:
      - "127.0.0.1:3001:3001"
    volumes:
      - uptime_kuma_data:/app/data   # also declare uptime_kuma_data under the top-level volumes: key
    networks:
      - openclaw-net
Configure the following monitors in Uptime Kuma:
| Monitor | Type | URL/Target | Interval | Notes |
|---|---|---|---|---|
| Gateway health | HTTP(s) | https://openclaw.yourdomain.com/health | 60s | Gateway liveness |
| Gateway auth | HTTP(s) - Keyword | /api/status + Bearer token | 120s | Verifies authentication works |
| TLS certificate | HTTP(s) | Main domain | 24h | Alert 14 days before expiry |
| DNS resolution | DNS | openclaw.yourdomain.com | 300s | DNS resolves correctly |
SLA monitoring
For scenarios that require an SLA commitment, track availability like this:
# monthly availability
SLA = (1 - total downtime minutes / total minutes in the month) × 100%
# target SLA reference
99.0% = at most 7.3 hours of downtime per month
99.5% = at most 3.65 hours of downtime per month
99.9% = at most 43.8 minutes of downtime per month
Create an SLA panel in Grafana:
# PromQL: availability percentage over the last 30 days
# (openclaw_health_status is a 0/1 gauge, so its average is the uptime ratio;
# this also avoids returning "no data" during months with zero downtime,
# which happens if you only count the ==0 samples)
avg_over_time(openclaw_health_status[30d]) * 100
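A 30-day range query is expensive to evaluate on every dashboard refresh, so one option is to precompute it as a recording rule — a sketch, with arbitrary file and rule names:
# monitoring/prometheus/rules/sla.yml
groups:
  - name: sla
    interval: 5m
    rules:
      - record: openclaw:availability_30d:percent
        expr: avg_over_time(openclaw_health_status[30d]) * 100
The Grafana panel can then query openclaw:availability_30d:percent directly.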
9. Monitoring in Kubernetes
kube-prometheus-stack
If OpenClaw runs on Kubernetes, the kube-prometheus-stack Helm chart deploys the complete monitoring stack in one step:
# add the Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword="${GRAFANA_ADMIN_PASSWORD}" \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
ServiceMonitor configuration
Create a ServiceMonitor for OpenClaw so the Prometheus Operator discovers the scrape target automatically:
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: openclaw-monitor
  namespace: openclaw
  labels:
    release: monitoring   # must match the kube-prometheus-stack label selector
spec:
  selector:
    matchLabels:
      app: openclaw
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
      - openclaw
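To verify discovery, apply the manifest and check the target list through the operator-managed Prometheus service (the operator creates a prometheus-operated service for every Prometheus instance; the jq filter below simply greps for jobs containing "openclaw"):
kubectl -n openclaw apply -f servicemonitor.yaml
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
curl -s http://127.0.0.1:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job | test("openclaw")) | {job: .labels.job, health}'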
PrometheusRule configuration
Manage the alert rules as Kubernetes resources:
# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openclaw-alerts
  namespace: openclaw
  labels:
    release: monitoring
spec:
  groups:
    - name: openclaw-agent
      rules:
        - alert: OpenClawGatewayDown
          expr: openclaw_health_status == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "OpenClaw Gateway is down"
        # ... remaining rules as above
Hands-On Case Study: Building the OpenClaw Monitoring Stack from Scratch
Scenario
An indie developer runs OpenClaw on a Hetzner VPS (CX22, 2 vCPU / 4 GB) and wants basic monitoring and alerting without adding much resource overhead.
Step 1: Prepare the monitoring config files
# inside the openclaw-deploy directory
mkdir -p monitoring/{prometheus/rules,alertmanager,grafana/provisioning/{datasources,dashboards},grafana/dashboards}
# create the Prometheus config (use the template above)
vim monitoring/prometheus/prometheus.yml
# create the alert rules
vim monitoring/prometheus/rules/infrastructure.yml
vim monitoring/prometheus/rules/openclaw-agent.yml
# create the Alertmanager config
vim monitoring/alertmanager/alertmanager.yml
# create the Grafana data source config
vim monitoring/grafana/provisioning/datasources/prometheus.yml
# create the dashboard provider config
vim monitoring/grafana/provisioning/dashboards/dashboards.yml
# copy in the dashboard JSON
vim monitoring/grafana/dashboards/openclaw-overview.json
Step 2: Set up a Telegram alert bot
# 1. Talk to @BotFather in Telegram and create a new bot
# 2. Save the bot token
# 3. Create a group and add the bot to it
# 4. Get the chat ID:
curl -s "https://api.telegram.org/bot<YOUR_BOT_TOKEN>/getUpdates" | jq '.result[0].message.chat.id'
# 5. Put the token and chat ID into alertmanager.yml
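Before wiring the bot into Alertmanager, you can sanity-check the token and chat ID by sending a message directly through the Bot API:
curl -s "https://api.telegram.org/bot<YOUR_BOT_TOKEN>/sendMessage" \
  -d chat_id=-1001234567890 -d text="OpenClaw alert channel test"
If the message shows up in the group, the same credentials will work in telegram_configs.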
Step 3: Start the monitoring stack
# add a Grafana password to .env
echo "GRAFANA_ADMIN_PASSWORD=$(openssl rand -base64 16)" >> .env
# start all services
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d
# check container status
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml ps
Step 4: Put Grafana behind the reverse proxy
Add a reverse-proxy block for Grafana to the Caddyfile:
grafana.yourdomain.com {
    reverse_proxy localhost:3000
    header {
        X-Content-Type-Options nosniff
        X-Frame-Options DENY
        -Server
    }
}
# reload the Caddy config
docker compose restart caddy
Step 5: Verify the monitoring stack
# check Prometheus targets
curl -s http://127.0.0.1:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# expected output:
# {"job": "prometheus", "health": "up"}
# {"job": "node-exporter", "health": "up"}
# {"job": "cadvisor", "health": "up"}
# {"job": "openclaw-gateway", "health": "up"}
# check the alert rules
curl -s http://127.0.0.1:9090/api/v1/rules | jq '.data.groups[].rules[] | {name: .name, state: .state}'
# test the Alertmanager notification channels
# by firing a manual test alert
curl -X POST http://127.0.0.1:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "component": "test"
    },
    "annotations": {
      "summary": "This is a test alert",
      "description": "Verifies that the Alertmanager notification channels work."
    }
  }]'
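The test alert should appear in Alertmanager within seconds; you can confirm it was accepted via the API before checking Slack or Telegram:
curl -s http://127.0.0.1:9093/api/v2/alerts | jq '.[].labels.alertname'
# expected to include: "TestAlert"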
Step 6: Open the Grafana dashboard
- Open https://grafana.yourdomain.com
- Log in as admin with ${GRAFANA_ADMIN_PASSWORD}
- Navigate to Dashboards → OpenClaw → OpenClaw Agent Monitoring Overview
- Confirm that every panel is showing data
Resource overhead
| Component | CPU | Memory | Disk |
|---|---|---|---|
| Prometheus | ~0.1 core | ~200 MB | ~1 GB/month (30-day retention) |
| Grafana | ~0.05 core | ~100 MB | ~50 MB |
| Alertmanager | ~0.01 core | ~30 MB | none |
| Node Exporter | ~0.01 core | ~20 MB | none |
| cAdvisor | ~0.05 core | ~80 MB | none |
| Total | ~0.22 core | ~430 MB | ~1 GB/month |
💡 On a 2 vCPU / 4 GB VPS the monitoring stack uses roughly 10% of CPU and memory — it coexists comfortably with OpenClaw.
Pitfall Guide
❌ Common mistakes
- Monitoring only infrastructure while ignoring agent business metrics
  - Problem: healthy CPU and memory do not mean the agent is doing useful work. An agent can be stuck in an infinite loop, running with an expired API key, or failing every tool call while the infrastructure metrics look perfectly normal.
  - Fix: monitor infrastructure metrics and agent business metrics (task success rate, token consumption, LLM latency) together; neither is optional.
- Over-sensitive thresholds causing alert fatigue
  - Problem: alerting at 50% CPU, or on every single API timeout, trains the team to ignore all alerts — and the real incidents get buried.
  - Fix: follow the principle that every alert must require human action. Give warning-level rules a sensible `for` duration (5-10 minutes) so transient spikes don't fire.
- Prometheus retention set too long, exhausting the disk
  - Problem: the 15-day default sounds modest, but high-cardinality metrics (such as tool calls labeled by tool_name) inflate storage quickly.
  - Fix: set --storage.tsdb.retention.time=30d, watch Prometheus's own disk usage, and pre-aggregate hot queries with recording rules.
- Grafana exposed to the internet with the default password
  - Problem: Grafana's default credentials are admin/admin; exposed publicly, anyone can read your monitoring data and alert configuration.
  - Fix: access it through the reverse proxy, set a strong password, and disable sign-up (GF_USERS_ALLOW_SIGN_UP=false).
- Never testing the notification channels
  - Problem: Slack/Telegram alerting is configured but never exercised, and a broken webhook URL or expired bot token only surfaces during a real incident.
  - Fix: send a test alert to every channel right after deployment, and repeat monthly to confirm the channels still work.
- No monitoring for the monitoring stack itself
  - Problem: if Prometheus dies, nobody notices — until OpenClaw breaks and you discover monitoring has been down all along.
  - Fix: watch Prometheus and Grafana availability from an external service (Uptime Kuma, UptimeRobot, Betterstack).
✅ Best practices
- Tiered alerting: three levels, Info → Warning → Critical. Critical demands immediate action, Warning is handled during working hours, Info is record-only.
- Every alert must be actionable: include a runbook (remediation steps) in each alert's annotations, so whoever receives it knows what to do.
- Put cost monitoring in place early: token consumption and API spend are the metrics most likely to run away on an AI agent platform; set daily/weekly budget alerts.
- Review alert rules regularly: audit the firing history monthly, delete rules that never fire, and tune thresholds that produce frequent false positives.
- Dashboards as documentation: a new team member should understand system state from the Grafana dashboard within five minutes; name panels clearly and lay them out sensibly.
- Back up the monitoring config: keep prometheus.yml, the alert rules, alertmanager.yml, and the Grafana dashboard JSON all under Git version control.
Related Resources and Further Reading
| Resource | Type | Notes | Link |
|---|---|---|---|
| Prometheus documentation | Official docs | Authoritative reference for configuration, PromQL, and alert rules | prometheus.io/docs |
| Grafana documentation | Official docs | Dashboards, data sources, alerting | grafana.com/docs |
| Alertmanager configuration guide | Official docs | Routing, receivers, inhibition rules | prometheus.io/docs/alerting |
| Awesome Prometheus Alerts | Open source | Community-maintained alert rule collection | github.com/samber/awesome-prometheus-alerts |
| kube-prometheus-stack | Helm chart | One-step Prometheus + Grafana on K8s | github.com/prometheus-community/helm-charts |
| Uptime Kuma | Open source | Self-hosted availability monitoring | github.com/louislam/uptime-kuma |
| Grafana Cloud free tier | Managed service | 10K metrics free, zero ops | grafana.com/products/cloud |
| AI agent observability guide | Community article | The particular challenges of monitoring AI agents | blaxel.ai/blog/ai-observability |
| Grafana + AI agent monitoring | Official blog | Monitoring AI agent apps with Grafana Cloud | grafana.com/blog |
| Multi-agent system observability | Community guide | Monitoring and troubleshooting production multi-agent systems | xugj520.cn |
Sources
- Integrating Grafana and Prometheus with AI for Advanced Monitoring (February 2026)
- Monitoring Agents and Flows with Grafana and Sentry: A Practical Playbook for 2026 (December 2025)
- AI Observability for Coding Agents: Complete Guide (January 2026)
- How to Monitor AI Agent Applications with Grafana Cloud (November 2025)
- Enterprise Multi-Agent AI Deployment: Observability & Troubleshooting Guide (February 2026)
- LiveKit Agent Monitoring in Production: Prometheus, Grafana & Alerts (February 2026)
- Prometheus & Grafana: Monitoring Your Infrastructure (October 2025)
- Real-Time Token Metrics: TTFT, TTR, Cached Tokens, and Cost (October 2025)
- Grafana & Prometheus Complete Guide 2026 (January 2026)
- Prometheus Alertmanager vs Grafana Alerts: Which One Should You Use (2025)
📝 Compiled from public material published in 2025-2026; the Prometheus configuration and alert rules have been adapted for the OpenClaw scenario. Check the official OpenClaw documentation for the exact metric names and API endpoints.
📖 Back to Overview & Navigation | Previous: 25a - High-Availability Deployment | Next: 25c - Log Management and Auditing