Building a Container Monitoring Platform
1. Hands-on
1.1 Create the directories
Create the directory structure for the project.
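The layout can be sketched with a few shell commands. The paths below are inferred from the volume mounts and file names used in the later steps, so treat them as an assumption rather than the only valid arrangement; run the commands from an empty project directory.

```shell
# Directories expected by the volume mounts in the Compose file (inferred layout).
mkdir -p prometheus/conf/rules alertmanager/conf alertmanager/templates

# Empty placeholders for the files written in the following steps.
# docker-compose.yml is the default file name that docker-compose looks for.
touch docker-compose.yml \
      prometheus/conf/prometheus.yml \
      prometheus/conf/rules/rule_1.yml \
      alertmanager/conf/config.yml \
      alertmanager/templates/alert.html

# List the result to confirm the layout.
find prometheus alertmanager -type f | sort
```

The rule file sits under prometheus/conf/rules because prometheus.yml loads rules/*.yml relative to its own location.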
1.2 Write the Prometheus configuration file prometheus.yml
# Global configuration
global:
  # Scrape targets every 5 seconds
  scrape_interval: 5s
  # Evaluate alerting rules every 5 seconds
  evaluation_interval: 5s
  # Labels attached when talking to external systems such as Alertmanager
  external_labels:
    monitor: 'monitor'
# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
# Rule files to load
rule_files:
  - rules/*.yml
# Scrape configuration
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
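1.3 Write the Prometheus alerting rule file rule_1.yml
groups:
  - name: rule-
    rules:
      - alert: "Service availability alert"
        expr: up{job="cadvisor"} < 1
        # How long the condition must hold before the alert fires
        for: 1m
        labels:
          severity: warning
        annotations:
          # alertname is not part of $labels in annotations, so use the job label
          summary: "Service name: {{ $labels.job }}"
          description: "The cadvisor container has stopped"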
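Before starting the stack, it can help to validate prometheus.yml together with the rule files it loads. The sketch below wraps promtool, which ships in the prom/prometheus image, in a helper function. The function name check_prometheus_config is hypothetical, and the mount path assumes the ./prometheus/conf directory used by the Compose file.

```shell
# Validate prometheus.yml and every rule file matched by rule_files.
# Assumes the current directory is the project root.
check_prometheus_config() {
  docker run --rm \
    -v "$PWD/prometheus/conf:/etc/prometheus:ro" \
    --entrypoint promtool \
    prom/prometheus:v2.19.2 \
    check config /etc/prometheus/prometheus.yml
}
```

Call check_prometheus_config from the project root; a non-zero exit status means promtool rejected the configuration or one of the rule files.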
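1.4 Write the Alertmanager configuration file config.yml
# Global settings
global:
  resolve_timeout: 5m # Time after which an alert that is no longer reported is treated as resolved; default 5m
  smtp_smarthost: 'smtp.qq.com:587' # SMTP server; QQ Mail is used as the example here
  smtp_from: '[email protected]' # Sender address (replace with your own)
  smtp_auth_username: '[email protected]' # Mailbox account (replace with your own)
  smtp_auth_password: 'xxxxxxxxxxxx' # Mailbox authorization code
# Notification templates
templates:
  - '/etc/alertmanager/templates/*.html'
# Routing tree
route:
  group_by: ['alertname'] # Labels used to group alerts
  group_wait: 10s # How long to wait before sending the first notification for a new group
  group_interval: 10s # How long to wait before notifying about new alerts added to an existing group
  repeat_interval: 1m # How often to resend a notification; for email do not set this too low, or the SMTP server may reject the messages for being too frequent
  receiver: 'email' # Receiver name; must match a name under receivers below
# Receivers
receivers:
  - name: 'email' # Receiver name
    email_configs: # Email settings
      - to: '[email protected]' # Recipient address (replace with your own)
        html: '{{ template "alert.html" . }}' # Template used for the email body
        headers: { Subject: "[WARN] Alert notification" } # Email subject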
1.5 Write the alert email template alert.html
{{ define "alert.html" }}
<table border="1">
  <tr>
    <td>Alert</td>
    <td>Instance</td>
    <td>Description</td>
    <td>Start time</td>
  </tr>
  {{ range $i, $alert := .Alerts }}
  <tr>
    <td>{{ index $alert.Labels "alertname" }}</td>
    <td>{{ index $alert.Labels "instance" }}</td>
    <td>{{ index $alert.Annotations "description" }}</td>
    <td>{{ $alert.StartsAt }}</td>
  </tr>
  {{ end }}
</table>
{{ end }}
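With config.yml and alert.html both written, the Alertmanager side can be checked in the same way. This is a minimal sketch that runs amtool, which is bundled in the prom/alertmanager image. The function name check_alertmanager_config is hypothetical, and the mounts assume the ./alertmanager/conf and ./alertmanager/templates paths from the Compose file.

```shell
# Validate config.yml and the templates it references.
# Assumes the current directory is the project root.
check_alertmanager_config() {
  docker run --rm \
    -v "$PWD/alertmanager/conf/config.yml:/etc/alertmanager/config.yml:ro" \
    -v "$PWD/alertmanager/templates:/etc/alertmanager/templates:ro" \
    --entrypoint amtool \
    prom/alertmanager:v0.21.0 \
    check-config /etc/alertmanager/config.yml
}
```

Mounting the templates directory matters here, because the check also tries to load the files matched by the templates entry.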
1.6 Write docker-compose.yml
version: "3.8"
services:
  cAdvisor:
    image: google/cadvisor:v0.33.0
    container_name: cadvisor
    restart: always
    deploy:
      resources:
        limits:
          cpus: '0.20'
          memory: 500M
    networks:
      - monitor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
  Prometheus:
    image: prom/prometheus:v2.19.2
    container_name: prometheus
    restart: always
    deploy:
      resources:
        limits:
          cpus: '0.20'
          memory: 500M
    volumes:
      - ./prometheus/conf:/etc/prometheus:ro
    networks:
      - monitor
    depends_on:
      - cAdvisor
    ports:
      - "9090:9090"
  alertmanager:
    image: prom/alertmanager:v0.21.0
    container_name: alertmanager
    restart: always
    deploy:
      resources:
        limits:
          cpus: '0.20'
          memory: 500M
    networks:
      - monitor
    ports:
      - "9093:9093"
    depends_on:
      - Prometheus
    volumes:
      - ./alertmanager/conf/config.yml:/etc/alertmanager/config.yml
      - ./alertmanager/templates:/etc/alertmanager/templates
    command:
      - '--config.file=/etc/alertmanager/config.yml'
      - '--storage.path=/alertmanager'
      - '--log.level=info'
  Grafana:
    image: grafana/grafana:7.0.5
    container_name: grafana
    restart: always
    deploy:
      resources:
        limits:
          cpus: '0.20'
          memory: 500M
    networks:
      - monitor
    environment:
      # Grafana only reads upper-case GF_<SECTION>_<KEY> variables
      - GF_SECURITY_ADMIN_PASSWORD=123456
    depends_on:
      - Prometheus
    ports:
      - "3000:3000"
networks:
  monitor:
    name: monitornet
    driver: bridge
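The resource limits under deploy are only honored in Swarm mode by default. To make docker-compose apply them, run it in compatibility mode with the --compatibility flag.
# Start
docker-compose --compatibility up -d
# Remove
docker-compose --compatibility down
# Restart
docker-compose --compatibility restart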
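Once the stack is up, a quick way to confirm that the published services respond is to probe their health endpoints from the host. The sketch below assumes the port mappings from the Compose file; the function name check_stack_health is hypothetical. cAdvisor is left out because the Compose file does not publish its port.

```shell
# Probe the health endpoint of each service published on the host.
# Prints one line per service and returns non-zero if any probe fails.
check_stack_health() {
  failed=0
  for url in \
    "http://127.0.0.1:9090/-/healthy" \
    "http://127.0.0.1:9093/-/healthy" \
    "http://127.0.0.1:3000/api/health"
  do
    if curl -fsS -o /dev/null "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}
```

Run check_stack_health a few seconds after docker-compose --compatibility up -d, since the containers need a moment to start listening.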
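1.7 Configure the Grafana Dashboard page
docker-compose --compatibility up -d
After startup, open http://127.0.0.1:3000 in a browser to reach the Grafana web UI.
Select Prometheus as the data source type.
Enter the Prometheus address. With the default server access mode, Grafana reaches Prometheus over the monitor network, so http://prometheus:9090 should work.
Save the configuration.
Then select the Prometheus data source for the dashboard.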
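The same data source can also be registered without the web UI, through Grafana's HTTP API. This is only a sketch under stated assumptions: the admin user name is Grafana's default, the password defaults to the value from the Compose file, the data source name "Prometheus" is arbitrary, and the function name add_prometheus_datasource is hypothetical.

```shell
# Register Prometheus as the default Grafana data source via the HTTP API.
# Usage: add_prometheus_datasource [admin_password]
add_prometheus_datasource() {
  curl -fsS -X POST "http://127.0.0.1:3000/api/datasources" \
    -u "admin:${1:-123456}" \
    -H "Content-Type: application/json" \
    -d '{"name":"Prometheus","type":"prometheus","url":"http://prometheus:9090","access":"proxy","isDefault":true}'
}
```

If the login is rejected, the password variable may not have been picked up by Grafana; in that case pass Grafana's default password, for example add_prometheus_datasource admin.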
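2. Alert test
Stop the cadvisor container to simulate a failure:
docker stop cadvisor
After a few seconds, refresh the Prometheus Alerts page. The alert has entered the Pending state.
After one minute (the for: 1m setting in rule_1.yml), the alert enters the Firing state.
Check the mailbox for the alert email.
Start the container again to clear the fault and return to normal:
docker start cadvisor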
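The Pending and Firing states described above can also be read from the command line through the Prometheus alerts API. The sketch below assumes Prometheus is published on port 9090 as in the Compose file; the function name alert_states is hypothetical, and the grep pattern is a deliberately simple way to pull the state fields out of the JSON response.

```shell
# Print the state of every active alert reported by Prometheus.
alert_states() {
  # Fail early if Prometheus cannot be reached.
  body=$(curl -fsS "http://127.0.0.1:9090/api/v1/alerts") || return 1
  # Each active alert carries a "state" field (pending or firing).
  printf '%s\n' "$body" | grep -Eo '"state": *"[a-z]+"' || echo "no active alerts"
}
```

Running alert_states while cadvisor is stopped should show the alert moving from pending to firing, and running it again after docker start cadvisor helps confirm that the alert has cleared.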