Prometheus-01: Framework Architecture and Core Concepts Explained

2025/9/26 14:06:24 / Article source: https://www.cnblogs.com/wzzkaifa/p/19113374

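Preface:

Documents 01 through 10 of this series:

Each document opens with a curated list of official documentation, GitHub projects, technical resources, and community links

Each provides practical configuration examples, scripts, and best practices

Together they cover physical-machine, container, application, and network monitoring scenarios

Each document runs to more than 6,000 words and explains the material in depth but accessibly

The series spans the full stack, from basic concepts to advanced architecture and from deployment and configuration to troubleshooting

The main text follows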

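Monitoring template types:

System monitoring: baseline metrics such as CPU, memory, disk, and network
Container monitoring: Docker, Kubernetes clusters, and Pods
Application monitoring: application-layer services such as web servers, databases, and message queues
Network monitoring: black-box probing, SSL certificates, DNS, and other network services
Business monitoring: custom business metrics and SLA tracking

Prometheus + Grafana: System Architecture and Core Concepts

Prometheus System Architecture Overview

Overall Architecture Diagram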

┌─────────────────────────────────────────────────────────────────┐
│                 Prometheus Monitoring Ecosystem                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         │
│  │   Target    │    │   Target    │    │   Target    │         │
│  │  (Node)     │    │  (MySQL)    │    │ (Kubernetes)│         │
│  │             │    │             │    │             │         │
│  └─────┬───────┘    └─────┬───────┘    └─────┬───────┘         │
│        │                  │                  │                 │
│        │ /metrics         │ /metrics         │ /metrics        │
│        │                  │                  │                 │
│  ┌─────▼─────────────────────▼─────────────────▼─────────────┐  │
│  │                Prometheus Server                         │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │  │
│  │  │ Retrieval   │  │   TSDB      │  │    HTTP     │      │  │
│  │  │             │  │             │  │   Server    │      │  │
│  │  └─────────────┘  └─────────────┘  └─────────────┘      │  │
│  └─────────────────────┬─────────────────┬─────────────────┘  │
│                        │                 │                    │
│  ┌─────────────────────▼─┐           ┌───▼─────────────────┐  │
│  │     Alertmanager      │           │     Grafana         │  │
│  │                       │           │                     │  │
│  │  ┌─────────────────┐  │           │  ┌───────────────┐  │  │
│  │  │   Routing       │  │           │  │  Dashboard    │  │  │
│  │  │   Grouping      │  │           │  │  Panels       │  │  │
│  │  │   Throttling    │  │           │  │  Queries      │  │  │
│  │  └─────────────────┘  │           │  └───────────────┘  │  │
│  └─────────────────────┬─┘           └─────────────────────┘  │
│                        │                                      │
│  ┌─────────────────────▼─┐                                    │
│  │   Notification        │                                    │
│  │   (Email/Slack/...)   │                                    │
│  └───────────────────────┘                                    │
└─────────────────────────────────────────────────────────────────┘

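Core Design Principles

Prometheus is built on the following core design principles:

Pull model: the server actively scrapes metrics from targets instead of passively receiving them
Time series database: storage is optimized specifically for time series data
Multi-dimensional data model: labels make data modeling flexible
Powerful query language: PromQL provides rich data analysis capabilities
No external dependencies: a single binary that is easy to deploy and maintain

System Characteristics

Strengths:

High-performance time series database
Built-in web UI with basic graphing
Flexible query language (PromQL)
Support for many service discovery mechanisms
Active open-source community

Design constraints:

Prioritizes reliability over 100% accuracy
Not suitable for data that must be exact, such as billing
Storage is limited to a single node (can be mitigated with federation or remote storage)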

Core Components in Detail

Prometheus Server

Prometheus Server is the core of the monitoring system. It consists mainly of the following components:

1. Retrieval component

# prometheus.yml example configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 5s
    metrics_path: /metrics
    scheme: http
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
    scrape_interval: 10s
  - job_name: 'mysql-exporter'
    static_configs:
      - targets: ['mysql-server:9104']
    scrape_interval: 30s

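Main functions:

Scrapes metrics from targets on the configured schedule
Supports several authentication methods (HTTP Basic, TLS, Bearer Token)
Automatic service discovery and target management
Target health checks and status tracking

2. Time Series Database (TSDB)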

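// Illustrative TSDB data model (simplified; not the actual Prometheus source types)
type Label struct {
	Name  string // label name
	Value string // label value
}

type Sample struct {
	Timestamp int64   // timestamp in milliseconds
	Value     float64 // sample value
}

type Series struct {
	Labels  []Label  // label set identifying the series
	Samples []Sample // samples ordered by time
}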

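Storage characteristics:

Time-partitioned block storage
Highly compressed sample encoding
Fast range queries
Automatic expiry and cleanup of old data

Storage directory layout:

data/
├── 01BKGV7JBM69T2G1BGBGM6KB12/     # block directory
│   ├── chunks/                      # chunk files holding the sample data
│   │   └── 000001
│   ├── tombstones                   # deletion markers
│   ├── index                        # index file
│   └── meta.json                    # block metadata
├── 01BKGTZQ1SYQJTR4PB43C8PD98/
├── lock                             # lock file
└── wal/                             # write-ahead log
    ├── 00000000
    ├── 00000001
    └── checkpoint.000002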

3. HTTP API Service

# Commonly used API endpoints
GET /api/v1/query                   # instant query
GET /api/v1/query_range             # range query
GET /api/v1/targets                 # list scrape targets
GET /api/v1/rules                   # list alerting and recording rules
GET /api/v1/alerts                  # list active alerts
GET /api/v1/label/<label>/values    # list the values of a label

API usage examples:

# Query CPU usage
curl 'http://localhost:9090/api/v1/query?query=cpu_usage_percent'
# Query memory usage over a time range
curl 'http://localhost:9090/api/v1/query_range?query=memory_usage_bytes&start=2024-01-01T00:00:00Z&end=2024-01-02T00:00:00Z&step=300s'
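PromQL expressions usually contain braces, quotes, and brackets, which must be URL-encoded before they are placed in a query string. The following is a minimal sketch using only the Python standard library; the helper name `build_query_range_url` and the example expression are illustrative assumptions, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_query_range_url(base_url, query, start, end, step):
    """Return a /api/v1/query_range URL with URL-encoded parameters."""
    params = {"query": query, "start": start, "end": end, "step": step}
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# Hypothetical expression; any PromQL string can be passed in.
url = build_query_range_url(
    "http://localhost:9090",
    'rate(http_requests_total{job="api"}[5m])',
    "2024-01-01T00:00:00Z",
    "2024-01-02T00:00:00Z",
    "300s",
)
print(url)
# Decoding the query string recovers the original expression.
print(parse_qs(urlsplit(url).query)["query"][0])
```

The resulting URL can be passed to curl or any HTTP client in place of a hand-written query string.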

Exporters

Exporters are the data collectors of the Prometheus ecosystem. They expose the metrics of systems and services in the Prometheus format.

Common Exporters

Exporter	Purpose	Default port	GitHub URL
node_exporter	System hardware and OS metrics	9100	https://github.com/prometheus/node_exporter
mysqld_exporter	MySQL database metrics	9104	https://github.com/prometheus/mysqld_exporter
redis_exporter	Redis cache metrics	9121	https://github.com/oliver006/redis_exporter
nginx-exporter	Nginx web server metrics	9113	https://github.com/nginxinc/nginx-prometheus-exporter
blackbox_exporter	Black-box probing (HTTP/HTTPS/DNS/TCP)	9115	https://github.com/prometheus/blackbox_exporter
Node Exporter Configuration

# Download and start node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz
tar xvf node_exporter-1.6.0.linux-amd64.tar.gz
cd node_exporter-1.6.0.linux-amd64
./node_exporter --web.listen-address=":9100"

Main metric categories:

# CPU metrics
node_cpu_seconds_total{cpu="0",mode="idle"}
node_cpu_seconds_total{cpu="0",mode="user"}
node_cpu_seconds_total{cpu="0",mode="system"}
# Memory metrics
node_memory_MemTotal_bytes
node_memory_MemFree_bytes
node_memory_MemAvailable_bytes
# Disk metrics
node_disk_read_bytes_total{device="sda"}
node_disk_written_bytes_total{device="sda"}

Alertmanager

Alertmanager handles the alerts sent by the Prometheus server: it deduplicates, groups, and routes them, then sends notifications.

Configuration File Structure

# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'password'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: critical-receiver
      group_wait: 0s
    - match:
        severity: warning
      receiver: warning-receiver
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'admin@example.com'
        # The subject is set through headers; text is the plain-text body
        headers:
          Subject: 'Prometheus Alert'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
  - name: 'critical-receiver'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
        title: 'Critical Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: 'warning-receiver'
    webhook_configs:
      - url: 'http://webhook-server:8080/alert'
        send_resolved: true
Alerting Rule Configuration

# alert_rules.yml
groups:
  - name: system.rules
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for instance {{ $labels.instance }}"
      - alert: OutOfMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Out of memory"
          description: "Memory usage is above 90% for instance {{ $labels.instance }}"
      - alert: DiskSpaceWarning
        expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Disk space is below 20% for {{ $labels.mountpoint }} on {{ $labels.instance }}"

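Data Model and Metric Types

Time Series Data Model

Every time series in Prometheus is uniquely identified by its metric name and a set of labels:

<metric name>{<label name>=<label value>, ...}

Example: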

http_requests_total{method="POST", handler="/api/tracks", status="200", instance="localhost:8080"}
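One consequence of this model is that label order does not matter, while any change to a label value produces a different series. The sketch below illustrates that with a canonical string key; `series_key` is a hypothetical helper for illustration, not a Prometheus API.

```python
def series_key(metric_name, labels):
    """Canonical identity of a series: the metric name plus its labels sorted by name."""
    rendered = ",".join(f'{name}="{value}"' for name, value in sorted(labels.items()))
    return f"{metric_name}{{{rendered}}}"


a = series_key("http_requests_total", {"method": "POST", "status": "200"})
b = series_key("http_requests_total", {"status": "200", "method": "POST"})
c = series_key("http_requests_total", {"method": "POST", "status": "500"})
print(a)
print(a == b)  # same series: only the label order differs
print(a == c)  # different series: the status value differs
```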

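The Four Metric Types

1. Counter

A Counter is a cumulative metric representing a monotonically increasing count. Its value can only increase or be reset to zero (for example, on restart).

Use cases:

Total number of HTTP requests
Number of errors
Number of completed tasks

Example metrics:

# Total requests
http_requests_total{method="GET", handler="/api/users", status="200"} 1027
# Total errors
http_request_errors_total{method="POST", handler="/api/login"} 23

Common PromQL queries:

# Per-second request rate
rate(http_requests_total[5m])
# Increase in requests over the last 5 minutes
increase(http_requests_total[5m])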
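The reason `rate()` and `increase()` are used instead of raw counter values is that they compensate for counter resets. The following sketch shows the core idea under simplifying assumptions: it works on a plain list of sample values and omits the extrapolation to the window boundaries that Prometheus performs, so it is not a reimplementation of `increase()`.

```python
def simple_increase(values):
    """Sum the growth of a counter across consecutive samples.

    A drop between two samples is treated as a reset to zero, so the
    whole new value counts as growth since the reset.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


# The counter restarts between 15 and 3 (for example, a process restart).
print(simple_increase([10, 15, 3, 8]))
```

Here the growth is 5 before the reset, 3 counted from zero after it, and 5 more afterwards, for a total of 13, whereas subtracting the first value from the last would give a negative number.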

2. Gauge

A Gauge is a value that can go up and down arbitrarily.

Use cases:

Memory usage
CPU utilization
Queue size
Temperature

Example metrics:

# Memory usage
memory_usage_bytes{instance="server1"} 8589934592
# CPU utilization
cpu_usage_percent{instance="server1", cpu="0"} 75.5
# Queue length
queue_size{queue="processing"} 42

Common PromQL queries:

# Current memory utilization (percent)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Average CPU utilization per instance
avg(cpu_usage_percent) by (instance)

3. Histogram

A Histogram samples observations (typically request durations or response sizes) and counts them in configurable buckets.

Generated metrics:

<basename>_bucket{le="<upper inclusive bound>"}: cumulative counters, one per bucket
<basename>_count: total number of observations
<basename>_sum: sum of all observed values

Example:

# HTTP request latency histogram
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.2"} 33444
http_request_duration_seconds_bucket{le="0.5"} 100392
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_count 144320
http_request_duration_seconds_sum 53423

Common PromQL queries:

# 95th percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Average latency
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
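A quantile computed from a histogram is an estimate: the bucket containing the target rank is located, and the value is linearly interpolated inside that bucket. The sketch below applies this idea to the cumulative bucket counts from the example above (used directly rather than as rates). It is a simplified illustration with assumed names, not the Prometheus implementation, and it assumes the lowest bucket starts at 0.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with math.inf as the last upper bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for i, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0]
            if count == lower_count:
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


buckets = [(0.1, 24054), (0.2, 33444), (0.5, 100392), (math.inf, 144320)]
print(histogram_quantile(0.5, buckets))
print(histogram_quantile(0.95, buckets))
```

With these buckets the 95th percentile lands in the `+Inf` bucket, so the estimate is capped at 0.5. This shows why bucket boundaries should extend beyond the latencies that matter for alerting.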

4. Summary

A Summary is similar to a Histogram, but it calculates configurable quantiles on the client side.

Generated metrics:

<basename>{quantile="<φ>"}: streaming φ-quantiles
<basename>_count: total number of observations
<basename>_sum: sum of all observed values

Example:

# RPC latency summary
rpc_duration_seconds{quantile="0.5"} 0.05
rpc_duration_seconds{quantile="0.9"} 0.1
rpc_duration_seconds{quantile="0.99"} 0.3
rpc_duration_seconds_count 2693
rpc_duration_seconds_sum 1756.021

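Label Best Practices

Label Naming Conventions

# Good label naming
http_requests_total{method="GET", status="200", handler="/api/users"}
# Label naming to avoid
http_requests_total{METHOD="get", Status_Code="two_hundred"}

Controlling Label Cardinality

# Reasonable label cardinality
user_login_total{country="US"} # the number of countries is bounded
# Avoid high-cardinality labels
user_login_total{user_id="12345"} # the number of user IDs is huge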
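A quick way to reason about cardinality is that the worst-case number of series for one metric is the product of the number of distinct values of each label. The sketch below makes that arithmetic explicit; the label counts are made-up numbers for illustration.

```python
import math


def estimate_series(label_cardinalities):
    """Upper bound on the series count of one metric: the product of distinct values per label."""
    return math.prod(label_cardinalities.values())


# Hypothetical counts of distinct values per label.
bounded = {"method": 5, "status": 10, "handler": 20}
print(estimate_series(bounded))

# Adding a per-user label multiplies the total by the number of users.
with_user_id = dict(bounded, user_id=100_000)
print(estimate_series(with_user_id))
```

In this example a single `user_id` label turns a metric with at most 1,000 series into one with up to 100 million, which is why per-user or per-request identifiers belong in logs or traces rather than in labels.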

The PromQL Query Language

PromQL Syntax Basics

Basic Selectors

# Select by metric name
http_requests_total
# Select by exact label match
http_requests_total{job="apiserver", handler="/api/comments"}
# Select by label regex match
http_requests_total{handler=~"/api/.*"}
# Select by negative label match
http_requests_total{handler!="/api/health"}

Range Selectors

# 5-minute range
http_requests_total[5m]
# 1-hour range
cpu_usage_percent[1h]
# 1-day range
memory_usage_bytes[1d]

Operators and Functions

Arithmetic Operators

# Addition
node_memory_MemTotal_bytes + node_memory_Buffers_bytes
# Subtraction
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
# Multiplication
rate(cpu_seconds_total[5m]) * 100
# Division
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes

Comparison Operators

# Greater than
cpu_usage_percent > 80
# Less than or equal
memory_usage_bytes <= 8589934592
# Equal
http_response_status == 200
# Not equal
http_response_status != 404

Logical Operators

# AND
(cpu_usage_percent > 80) and (memory_usage_percent > 90)
# OR
(http_response_status == 404) or (http_response_status == 500)
# UNLESS (exclusion)
up unless on(instance) (maintenance_mode == 1)

Aggregation Operators

Basic Aggregation

# Sum
sum(http_requests_total)
# Average
avg(cpu_usage_percent)
# Maximum
max(memory_usage_bytes)
# Minimum
min(disk_free_bytes)
# Count
count(up == 1)

Grouped Aggregation

# Sum grouped by job
sum(http_requests_total) by (job)
# Average grouped by instance and method
avg(request_duration_seconds) by (instance, method)
# Aggregate over everything except the listed labels
sum(http_requests_total) without (instance)

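Time Series Functions

Rate Functions

# Average per-second rate of increase over the 5-minute window
rate(http_requests_total[5m])
# Instantaneous per-second rate, computed from the last two samples in the window
irate(http_requests_total[5m])
# Total increase over the given time range
increase(http_requests_total[1h])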
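The difference between `rate()` and `irate()` is easiest to see on a short series of samples. The sketch below is a simplified model under stated assumptions: samples are `(timestamp_seconds, value)` pairs, counter resets and window extrapolation are ignored, and the function names are illustrative.

```python
def simple_rate(samples):
    """Average per-second increase between the first and last sample in the window."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)


# Steady traffic for 30 seconds, then a burst in the final scrape interval.
samples = [(0, 0), (15, 30), (30, 60), (45, 150)]
print(simple_rate(samples))
print(simple_irate(samples))
```

The averaged value smooths the burst, which suits alerting and slow-moving dashboards, while the last-two-samples value reacts immediately and is better for zoomed-in graphs of volatile counters.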

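Prediction and Trend Functions

# Linear prediction of the value 4 hours from now, based on the last hour
predict_linear(disk_free_bytes[1h], 4*3600)
# Time offset (offset is a modifier, not a function)
cpu_usage_percent offset 1h
# Per-second derivative of a gauge
deriv(cpu_temperature[5m])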
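Linear prediction fits a straight line through the samples in the window by least squares and extends it into the future. The sketch below shows that calculation on `(timestamp_seconds, value)` pairs. It is an illustration under simplifying assumptions: the horizon is measured from the last sample rather than from the query evaluation time, and the sample data is made up.

```python
def predict_linear(samples, horizon_seconds):
    """Fit a least-squares line and evaluate it horizon_seconds after the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_seconds)


# Free disk space (GB) dropping by 10 GB per minute.
samples = [(0, 100.0), (60, 90.0), (120, 80.0)]
print(predict_linear(samples, 600))
```

A predicted value below zero, as here, means the resource would be exhausted before the horizon. That is the basis of alerts such as `predict_linear(...) < 0`.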

Math Functions

# Absolute value
abs(temperature_celsius)
# Round up
ceil(response_time_seconds)
# Round down
floor(response_time_seconds)
# Round to the nearest integer
round(cpu_usage_percent)
# Square root
sqrt(memory_usage_bytes)

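Complex Query Examples

Service Availability

# Availability as a percentage (compare the result against an SLA target such as 99.9%)
(
  sum(rate(http_requests_total{status!~"5.."}[5m])) /
  sum(rate(http_requests_total[5m]))
) * 100

Error Rate Monitoring

# 5xx error rate per service
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
sum(rate(http_requests_total[5m])) by (service) * 100

Top-K Resource Usage

# The 5 servers with the highest CPU utilization
topk(5, avg(cpu_usage_percent) by (instance))
# The 3 services with the highest memory usage
topk(3, sum(memory_usage_bytes) by (service))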

Service Discovery Mechanisms

Static Configuration

The simplest form of service discovery, suitable when target addresses are relatively fixed.

scrape_configs:
  - job_name: 'static-nodes'
    static_configs:
      - targets:
          - '192.168.1.10:9100'
          - '192.168.1.11:9100'
          - '192.168.1.12:9100'
        labels:
          env: 'production'
          datacenter: 'dc1'

File-Based Service Discovery

The target list is updated dynamically by watching files for changes.

scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 30s

Example target file:

[
  {
    "targets": ["node1:9100", "node2:9100"],
    "labels": {
      "job": "node-exporter",
      "env": "prod"
    }
  },
  {
    "targets": ["mysql1:9104", "mysql2:9104"],
    "labels": {
      "job": "mysql-exporter",
      "env": "prod"
    }
  }
]
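Target files are typically generated by a script or an inventory tool. Since Prometheus may re-read the file while it is being written, one common precaution is to write to a temporary file and rename it into place. The following is a sketch of that approach using only the standard library; the helper name `write_targets`, the demo directory, and the file name are assumptions for illustration.

```python
import json
import os
import tempfile


def write_targets(path, groups):
    """Write target groups as JSON and move the file into place atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    # The .tmp suffix keeps the partial file outside the *.json glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600; make it readable by Prometheus
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


groups = [
    {"targets": ["node1:9100", "node2:9100"], "labels": {"job": "node-exporter", "env": "prod"}},
]
demo_dir = tempfile.mkdtemp()  # stand-in for /etc/prometheus/targets
demo_path = os.path.join(demo_dir, "nodes.json")
write_targets(demo_path, groups)
print(open(demo_path).read())
```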

Docker Service Discovery

scrape_configs:
  - job_name: 'docker'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 30s
        filters:
          - name: label
            values: ["prometheus.scrape=true"]
    relabel_configs:
      - source_labels: [__meta_docker_container_name]
        target_label: container
      # Keep the discovered host and replace the port with the value of the prometheus.port container label
      - source_labels: [__address__, __meta_docker_container_label_prometheus_port]
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

Kubernetes Service Discovery

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - default
            - monitoring
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
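The `__address__` rule above is the least obvious one: the source label values are joined with `;`, the fully anchored regex is applied to the joined string, and the replacement is written to the target label only when the regex matches. The sketch below models that behavior in Python for this specific rule. It is an illustration with assumed names and sample label values, and it relies on Python's `re` module, which behaves like Prometheus's RE2 engine for this pattern.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Apply one 'replace' relabel rule; labels are returned unchanged if the regex does not match."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # relabel regexes are anchored at both ends
    if match is None:
        return dict(labels)
    # Translate Prometheus-style $1 / ${1} references into Python's \g<1> form.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


PORT_LABEL = "__meta_kubernetes_pod_annotation_prometheus_io_port"
RULE = dict(
    source_labels=["__address__", PORT_LABEL],
    regex=r"([^:]+)(?::\d+)?;(\d+)",
    replacement="$1:$2",
    target_label="__address__",
)

annotated = relabel_replace({"__address__": "10.0.0.5:8080", PORT_LABEL: "9102"}, **RULE)
plain = relabel_replace({"__address__": "10.0.0.5:8080"}, **RULE)
print(annotated["__address__"])
print(plain["__address__"])
```

A Pod that carries the port annotation is scraped on the annotated port, while a Pod without it keeps its discovered address because the regex does not match the joined value.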

Consul Service Discovery

scrape_configs:
  - job_name: 'consul'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        datacenter: 'dc1'
        services: ['web', 'database', 'cache']
        tags: ['prometheus']
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_datacenter]
        target_label: datacenter
      - source_labels: [__meta_consul_tags]
        target_label: consul_tags

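Storage Architecture

Local Storage

Storage Layout

Prometheus uses a purpose-built time series database (TSDB) for local storage, organized as follows:

Block structure:

01BKGV7JBM69T2G1BGBGM6KB12/
├── chunks/
│   └── 000001              # compressed time series data
├── index                   # inverted index
├── meta.json              # block metadata
└── tombstones             # deletion markers

WAL (write-ahead log):

wal/
├── 00000000               # WAL segment file
├── 00000001
└── checkpoint.000002      # checkpoint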

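Data Compression Algorithms

// Timestamp compression (delta-of-delta)
type timestampEncoder struct {
	t     int64 // base timestamp
	delta int64 // previous delta
}

// Value compression (XOR)
type valueEncoder struct {
	val      uint64 // bit pattern of the previous value
	leading  uint8  // number of leading zero bits
	trailing uint8  // number of trailing zero bits
}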
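Both encodings exploit regularity in monitoring data: scrape timestamps arrive at nearly constant intervals, and consecutive values are often identical or close. The sketch below computes the quantities that would be encoded, without the bit-level packing. It is an illustration with assumed function names and made-up sample data.

```python
import struct


def delta_of_delta(timestamps):
    """Return (first timestamp, first delta, list of delta-of-delta values)."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods


def xor_bits(previous, current):
    """XOR of the IEEE 754 bit patterns of two float64 values."""
    def to_bits(x):
        return struct.unpack(">Q", struct.pack(">d", x))[0]
    return to_bits(previous) ^ to_bits(current)


# Millisecond timestamps scraped every 15 s, with 1 ms of jitter on the last scrape.
print(delta_of_delta([1000, 16000, 31000, 46001]))
# An unchanged value XORs to zero and can be stored in a single bit.
print(xor_bits(75.5, 75.5))
# Changed values leave only a few meaningful bits between leading and trailing zeros.
print(hex(xor_bits(1.0, 2.0)))
```

After the first timestamp and delta, a regularly scraped series yields delta-of-delta values at or near zero, which is what allows the small per-sample storage cost.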

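Data retention policy:

# prometheus.yml
global:
  external_labels:
    cluster: 'production'

# Storage settings are passed as command-line flags at startup:
#   retention.time       keep 15 days of data
#   retention.size       cap local storage at 512GB
#   wal-compression      compress the WAL
#   max-block-duration   maximum time span of a block
#   min-block-duration   minimum time span of a block
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=512GB \
  --storage.tsdb.wal-compression \
  --storage.tsdb.max-block-duration=36h \
  --storage.tsdb.min-block-duration=2h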

Remote Storage

Remote Write

# Remote write configuration
remote_write:
  - url: "https://prometheus-remote-write.example.com/api/v1/write"
    basic_auth:
      username: 'user'
      password: 'password'
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'high_cardinality_.*'
        action: drop
    queue_config:
      max_samples_per_send: 1000
      max_shards: 200
      capacity: 2500

Remote Read

# Remote read configuration
remote_read:
  - url: "https://prometheus-remote-read.example.com/api/v1/read"
    basic_auth:
      username: 'user'
      password: 'password'
    required_matchers:
      job: 'special'

High-Availability Deployment

Prometheus Clustering Options

1. Federation

# Global Prometheus configuration
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"prometheus"}'
        - '{__name__=~"job:.*"}'
        - '{__name__=~"node_cpu.*"}'
    static_configs:
      - targets:
          - 'prometheus-dc1:9090'
          - 'prometheus-dc2:9090'
          - 'prometheus-dc3:9090'

Federation architecture diagram:

┌─────────────────────────────────────────────────────────┐
│                    Global Prometheus                    │
├─────────────────────────────────────────────────────────┤
│          Aggregates data via the /federate API          │
└────────┬───────────────────┬───────────────────┬────────┘
         │                   │                   │
┌────────▼───────┐  ┌────────▼───────┐  ┌────────▼───────┐
│ Prometheus DC1 │  │ Prometheus DC2 │  │ Prometheus DC3 │
└────────┬───────┘  └────────┬───────┘  └────────┬───────┘
         │                   │                   │
┌────────▼───────┐  ┌────────▼───────┐  ┌────────▼───────┐
│  Targets DC1   │  │  Targets DC2   │  │  Targets DC3   │
└────────────────┘  └────────────────┘  └────────────────┘

2. Horizontal Scaling

# prometheus.yml on server 1
global:
  external_labels:
    replica: A
    datacenter: dc1

# prometheus.yml on server 2
global:
  external_labels:
    replica: B
    datacenter: dc1

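Alertmanager Cluster

# Clustering is configured with command-line flags on each instance
# (shown here for 192.168.1.10; the other peers list the remaining addresses)
alertmanager \
  --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.advertise-address=192.168.1.10:9094 \
  --cluster.peer=192.168.1.11:9094 \
  --cluster.peer=192.168.1.12:9094

# alertmanager.yml (identical on every instance)
route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
receivers:
  - name: 'default'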

Integration with Grafana

Data Source Configuration

Prometheus Data Source

{
  "name": "Prometheus",
  "type": "prometheus",
  "url": "http://prometheus:9090",
  "access": "proxy",
  "basicAuth": false,
  "jsonData": {
    "timeInterval": "5s",
    "queryTimeout": "60s",
    "httpMethod": "POST"
  }
}

High-Availability Data Source

{
  "name": "Prometheus-HA",
  "type": "prometheus",
  "url": "http://prometheus-ha-proxy:9090",
  "access": "proxy",
  "jsonData": {
    "timeInterval": "15s",
    "queryTimeout": "300s",
    "httpMethod": "POST",
    "exemplarTraceIdDestinations": [
      {
        "name": "trace_id",
        "datasourceUid": "jaeger-uid"
      }
    ]
  }
}

Creating Dashboards

Basic Panel Configuration

{
  "panels": [
    {
      "title": "CPU Usage",
      "type": "stat",
      "targets": [
        {
          "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "{{instance}}",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "min": 0,
          "max": 100,
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 70 },
              { "color": "red", "value": 90 }
            ]
          }
        }
      }
    }
  ]
}

Monitoring Template Configuration

Physical Machine Monitoring Template

Complete Node Exporter Configuration

# docker-compose.yml
version: '3.8'
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
      - '--collector.systemd'
      - '--collector.processes'
    # With host networking, port 9100 is exposed on the host directly, so no ports mapping is needed
    network_mode: host

Physical Machine Metrics

# Notation: name = PromQL expression (the names on the left are descriptive labels, not PromQL syntax)
# CPU utilization
cpu_usage = 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization
memory_usage = (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Disk utilization
disk_usage = (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
# Network traffic
network_receive = rate(node_network_receive_bytes_total[5m])
network_transmit = rate(node_network_transmit_bytes_total[5m])
# Load average
load_average = node_load1
# Disk I/O
disk_read_iops = rate(node_disk_reads_completed_total[5m])
disk_write_iops = rate(node_disk_writes_completed_total[5m])

Container Monitoring Template

cAdvisor Configuration

version: '3.8'
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /cgroup:/cgroup:ro
    ports:
      - "8080:8080"
    command:
      - '--housekeeping_interval=30s'
      - '--docker_only=false'
      # Note: disabling "disk" also drops the container_fs_* metrics; remove it from the list if filesystem usage is needed
      - '--disable_metrics=percpu,sched,tcp,udp,disk,diskIO,accelerator,hugetlb,referenced_memory,cpu_topology,resctrl'

Container Metrics

# Container CPU utilization
container_cpu_usage = rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100
# Container memory utilization
container_memory_usage = container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""} * 100
# Container network traffic
container_network_rx = rate(container_network_receive_bytes_total{name!=""}[5m])
container_network_tx = rate(container_network_transmit_bytes_total{name!=""}[5m])
# Container filesystem utilization
container_fs_usage = container_fs_usage_bytes{name!=""} / container_fs_limit_bytes{name!=""} * 100

Application Monitoring Template

HTTP Service Monitoring

# QPS (requests per second)
http_qps = sum(rate(http_requests_total[5m])) by (service)
# Response time quantiles
http_latency_p50 = histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
http_latency_p95 = histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
http_latency_p99 = histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
# Error rate
http_error_rate = sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) * 100
# Availability
http_availability = sum(rate(http_requests_total{status!~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) * 100

Database Monitoring Template

# MySQL metrics
mysql_connections = mysql_global_status_threads_connected
mysql_qps = rate(mysql_global_status_queries[5m])
mysql_slow_queries = rate(mysql_global_status_slow_queries[5m])
mysql_innodb_buffer_pool_hit_rate = mysql_global_status_innodb_buffer_pool_read_requests / (mysql_global_status_innodb_buffer_pool_read_requests + mysql_global_status_innodb_buffer_pool_reads) * 100
# Redis metrics (hit and miss counters are rated so the ratio reflects recent traffic)
redis_connected_clients = redis_connected_clients
redis_memory_usage = redis_memory_used_bytes / redis_memory_max_bytes * 100
redis_hit_rate = rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) * 100
redis_ops = rate(redis_commands_processed_total[5m])

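Best Practice Recommendations

Performance Optimization

1. Query optimization

# Avoid aggregating by high-cardinality labels
# Bad example
sum(http_requests_total) by (user_id)  # user_id cardinality can be very high
# Good example
sum(http_requests_total) by (service, method)  # cardinality stays bounded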

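2. Storage optimization

# prometheus.yml
global:
  external_labels:
    cluster: 'prod'

# Set a sensible retention period; storage settings are passed as command-line flags:
#   retention.time     choose according to your needs
#   retention.size     cap on local storage size
#   wal-compression    enable WAL compression
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=7d \
  --storage.tsdb.retention.size=100GB \
  --storage.tsdb.wal-compression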

3. Scrape optimization

scrape_configs:
  - job_name: 'app'
    scrape_interval: 30s        # adjust the scrape frequency as needed
    scrape_timeout: 10s         # set a reasonable timeout
    metric_relabel_configs:     # drop metrics that are not needed
      - source_labels: [__name__]
        regex: 'debug_.*'
        action: drop

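Security Configuration

1. Basic authentication

# web-config.yml, loaded with --web.config.file=web-config.yml
# Basic auth users (bcrypt-hashed passwords)
basic_auth_users:
  admin: $2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay
# TLS configuration
tls_server_config:
  cert_file: server.crt
  key_file: server.key

2. Network security

# Restrict the listen address (a command-line flag, not a prometheus.yml key)
prometheus --web.listen-address=127.0.0.1:9090
# Or, when listening on a routable address, restrict access with firewall rules
iptables -A INPUT -p tcp --dport 9090 -s 192.168.1.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 9090 -j DROP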

Monitoring Strategy

1. Alert severity levels

groups:
  - name: critical.rules
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
  - name: warning.rules
    rules:
      - alert: HighCPU
        expr: cpu_usage > 80   # assumes a recording rule named cpu_usage exists
        for: 5m
        labels:
          severity: warning

2. Metric naming conventions

# Good names
http_requests_total
http_request_duration_seconds
mysql_queries_total
# Names to avoid
HttpRequestsTotal
http-requests-total
requests

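Capacity Planning

Estimating Storage Requirements

# Estimation formula
# Each sample takes roughly 1-2 bytes after compression
# Storage = number of time series × samples per series per day × retention days × 2 bytes
# Worked example
# 1,000 time series, scraped every 15 seconds, retained for 7 days
# Storage = 1000 × (86400/15) × 7 × 2 = 80,640,000 bytes ≈ 77MB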
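The estimate above can be wrapped in a small helper to compare scenarios. This is a sketch of the same arithmetic; the function name and the 2-bytes-per-sample default are assumptions taken from the upper end of the 1-2 byte range, and real usage also includes index and WAL overhead.

```python
def estimate_storage_bytes(series, scrape_interval_s, retention_days, bytes_per_sample=2):
    """Rough TSDB size: series x samples per day x retention days x bytes per sample."""
    samples_per_day = 86400 / scrape_interval_s
    return series * samples_per_day * retention_days * bytes_per_sample


small = estimate_storage_bytes(1000, 15, 7)
print(f"{small:,.0f} bytes = {small / 1024**2:.1f} MiB")

# The same formula for a larger, hypothetical deployment: 1 million series kept for 15 days.
large = estimate_storage_bytes(1_000_000, 15, 15)
print(f"{large / 1024**3:.1f} GiB")
```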

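Hardware Recommendations

Active time series	CPU	Memory	Storage	Network
< 10,000	2 cores	4GB	100GB	1Gbps
10,000 to 100,000	4 cores	8GB	500GB	1Gbps
100,000 to 1,000,000	8 cores	16GB	2TB	10Gbps
> 1,000,000	Clustered deployment	Scale as needed	Distributed storage	High-speed network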

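Summary

The Prometheus + Grafana monitoring stack offers the following core advantages:

Simple architecture: pull model, no external dependencies, easy to deploy
Powerful features: multi-dimensional data model and the PromQL query language
Rich ecosystem: a large number of exporters and an active community
Strong visualization: Grafana provides professional monitoring dashboards
Good scalability: supports federation, remote storage, and other scaling options

With sound architecture design, tuned configuration, and consistent application of best practices, this stack can form a stable and efficient enterprise-grade monitoring system. In a real deployment, choose the deployment model and monitoring strategy according to business scale, technology stack, and team capability.

This article covered the system architecture, core concepts, deployment configuration, and best practices of Prometheus + Grafana, giving both the theoretical basis and practical guidance for building a complete monitoring system. Later articles in the series go deeper into hands-on topics such as installation and deployment, configuration management, and alert setup.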
