Installing Prometheus + Grafana + Alertmanager on Linux

Posted by 我是一只小小茑


1. Software and system versions

Software versions

├── grafana-enterprise-8.0.5-1.x86_64.rpm
├── prometheus-2.32.1.linux-amd64.tar.gz
├── node_exporter-1.3.1.linux-amd64.tar.gz
└── alertmanager-0.23.0.linux-amd64.tar.gz

Kernel version             System version
2.6.32-642.el6.x86_64      CentOS release 6.8 (Final)
3.10.0-1160.el7.x86_64     CentOS Linux release 7.9.2009 (Core)

Upload the packages to the server.

2. Install Prometheus

[root@master ~]# tar zxvf prometheus-2.32.1.linux-amd64.tar.gz -C /home
[root@master ~]# mv /home/prometheus-2.32.1.linux-amd64/ /home/prometheus-2.32.1
[root@master ~]# cd /home/prometheus-2.32.1/ && nohup ./prometheus &
[root@master ~]# netstat -ntpl

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
tcp        0      0 0.0.0.0:22                  0.0.0.0:*                   LISTEN      1360/sshd
tcp        0      0 :::9090                     :::*                        LISTEN      1528/./prometheus     # service started successfully
tcp        0      0 :::22                       :::*                        LISTEN      1360/sshd
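Reading the netstat listing by eye works for one host; with several hosts a scripted check is handier. The following is a minimal sketch, not part of the original setup: the function name `port_is_listening` is hypothetical, and it assumes only the column layout of `netstat -ntpl` shown above (local address in column 4, state in column 6).

```shell
#!/bin/sh
# Hypothetical helper: succeed if the netstat listing on stdin shows a
# TCP socket in LISTEN state on the given port.
port_is_listening() {
    awk -v port="$1" '
        # Column 4 is the local address (e.g. :::9090), column 6 the state.
        $6 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on a live host:
#   netstat -ntpl | port_is_listening 9090 && echo "prometheus is listening"
```

The same function can be reused later for the other ports that appear in this walkthrough, such as 9100 for node_exporter.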


Register Prometheus as a service so it can be enabled at boot

[root@master ~]# cat > /usr/lib/systemd/system/prometheus.service << EOF
[Unit]
Description=https://prometheus.io

[Service]
Restart=on-failure
ExecStart=/home/prometheus-2.32.1/prometheus --config.file=/home/prometheus-2.32.1/prometheus.yml

[Install]
WantedBy=multi-user.target
EOF






3. Install node_exporter (required on every monitored Linux host)

Upload node_exporter-1.3.1.linux-amd64.tar.gz to the client host.

[root@master ~]# tar zxvf node_exporter-1.3.1.linux-amd64.tar.gz -C /home
[root@master ~]# mv /home/node_exporter-1.3.1.linux-amd64/ /home/node_exporter-1.3.1
[root@master ~]# cd /home/node_exporter-1.3.1/ && nohup ./node_exporter &
[root@master ~]# netstat -ntpl

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      961/sshd
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      1150/master
tcp6       0      0 :::9100                 :::*                    LISTEN      2153/./node_exporte
tcp6       0      0 :::22                   :::*                    LISTEN      961/sshd
tcp6       0      0 ::1:25                  :::*                    LISTEN      1150/master
tcp6       0      0 :::9090                 :::*                    LISTEN      2130/./prometheus


Register node_exporter as a service so it can be enabled at boot

[root@master ~]# cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/docs/introduction/overview
After=network-online.target remote-fs.target nss-lookup.target
Wants=network-online.target

[Service]
Type=simple
PIDFile=/var/run/node_exporter.pid
ExecStart=/home/node_exporter-1.3.1/node_exporter
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID

[Install]
WantedBy=multi-user.target
EOF
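Because this unit file is written through a here-document and contains `$MAINPID`, it matters whether the delimiter is quoted: with a bare `EOF` the shell expands the variable while writing the file, and with `'EOF'` the text is written literally for systemd to expand later. A small demonstration, assuming `MAINPID` is empty in the interactive shell (which is the usual case):

```shell
#!/bin/sh
# Demonstration only: MAINPID is set by systemd at runtime; in a login shell
# it is normally empty, which is what we simulate here.
MAINPID=""

# Unquoted delimiter: the shell substitutes $MAINPID before cat sees the text.
unquoted=$(cat << EOF
ExecStop=/bin/kill -s TERM $MAINPID
EOF
)

# Quoted delimiter: the text is passed through literally.
quoted=$(cat << 'EOF'
ExecStop=/bin/kill -s TERM $MAINPID
EOF
)

printf 'unquoted: [%s]\n' "$unquoted"
printf 'quoted:   [%s]\n' "$quoted"
```

With the unquoted form the `ExecReload` and `ExecStop` lines end up as `kill` commands without a PID, so checking the written file with `cat /etc/systemd/system/node_exporter.service` before running `systemctl daemon-reload` is worthwhile.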


4. Add the node_exporter targets on the server

[root@master ~]# vim /home/prometheus-2.32.1/prometheus.yml

Add the following job under scrape_configs; the YAML indentation must be exact

  - job_name: "linux"
    static_configs:
      - targets: ["192.168.1.60:9100","192.168.1.61:9100","192.168.1.62:9100"]





Enable the services at boot and start them

[root@master ~]# systemctl daemon-reload
[root@master ~]# systemctl enable node_exporter
[root@master ~]# systemctl start node_exporter
[root@master ~]# systemctl enable prometheus
[root@master ~]# systemctl start prometheus

Check the status

[Screenshot: service status]
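Besides the service status, both components can be checked over HTTP: Prometheus serves a health endpoint at `/-/healthy`, and node_exporter serves its metrics at `/metrics`. The sketch below is one possible way to script that; the function names `probe_url` and `probe` are hypothetical, the ports are the defaults seen in the netstat listings, and the host addresses are the ones from the scrape config.

```shell
#!/bin/sh
# Hypothetical helper: print the URL to probe for a given component.
probe_url() {
    component="$1"; host="$2"
    case "$component" in
        prometheus)    printf 'http://%s:9090/-/healthy\n' "$host" ;;
        node_exporter) printf 'http://%s:9100/metrics\n' "$host" ;;
        *)             echo "unknown component: $component" >&2; return 1 ;;
    esac
}

# Probe one component; curl -f turns HTTP error codes into a non-zero status.
probe() {
    url=$(probe_url "$1" "$2") || return 1
    curl -fsS -o /dev/null --max-time 5 "$url"
}

# Usage on the monitoring server:
#   probe prometheus localhost && echo "prometheus healthy"
#   for h in 192.168.1.60 192.168.1.61 192.168.1.62; do
#       probe node_exporter "$h" || echo "node_exporter on $h not reachable"
#   done
```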

5. Install Grafana

[root@master ~]# yum -y install urw-fonts
[root@master ~]# rpm -ivh grafana-enterprise-8.0.5-1.x86_64.rpm

warning: grafana-enterprise-8.0.5-1.x86_64.rpm: Header V4 RSA/SHA256 Signature, key ID 24098cb6: NOKEY
Preparing...                          ################################# [100%]
Updating / installing...
   1:grafana-enterprise-8.0.5-1       ################################# [100%]
### NOT starting on installation, please execute the following statements to configure grafana to start automatically using systemd
 sudo /bin/systemctl daemon-reload
 sudo /bin/systemctl enable grafana-server.service
### You can start grafana-server by executing
 sudo /bin/systemctl start grafana-server.service
POSTTRANS: Running script

[root@master ~]# systemctl enable grafana-server
[root@master ~]# systemctl start grafana-server
[root@master ~]# netstat -ntpl

[Screenshot: netstat -ntpl output]


5.1. Install the pie chart plugin (only needed if pie chart panels report an error)

[root@master ~]# grafana-cli plugins install grafana-piechart-panel

5.2. Restart grafana-server

[root@master ~]# systemctl restart grafana-server


Open the Grafana web page

[Screenshots: Grafana web UI]

Prometheus and Grafana are now installed.


6. Install Alertmanager

[root@master ~]# tar zxvf alertmanager-0.23.0.linux-amd64.tar.gz -C /home/
[root@master ~]# mv /home/alertmanager-0.23.0.linux-amd64 /home/alertmanager-0.23.0

7. Configure alerting

[root@master ~]# cd /home/alertmanager-0.23.0
[root@master alertmanager-0.23.0]# cp alertmanager.yml alertmanager.yml-back
[root@master alertmanager-0.23.0]# vim alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: smtp.163.com:25
  smtp_from: 18600000000@163.com
  smtp_auth_username: 18600000000@163.com
  smtp_auth_password: XXXXXXXXXX
  smtp_require_tls: false

route:
  group_by: [alertname]
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 3m
  receiver: mail

receivers:
- name: mail
  email_configs:
  - to: 58888888@qq.com
    headers: { Subject: "[WARN] Alert email" }
    send_resolved: true

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, dev, instance]

As shown:

[Screenshot: the edited alertmanager.yml]



[root@master ~]# mkdir /home/alertmanager-0.23.0/rules
[root@master ~]# vim /home/alertmanager-0.23.0/rules/node.yml

groups:
  - name: Host status alerts
    rules:
      - alert: Host status
        expr: up == 0
        for: 5m
        labels:
          status: critical
        annotations:
          summary: "{{ $labels.instance }}: server down"
          description: "{{ $labels.instance }}: server unreachable for more than 5 minutes"

      - alert: CPU usage
        expr: 100-(avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 80
        for: 1m
        labels:
          status: warning
        annotations:
          summary: "{{ $labels.instance }} CPU usage too high!"
          description: "{{ $labels.instance }} CPU usage above 80% (current: {{ $value }}%)"

      - alert: Memory usage
        expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 80
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{ $labels.instance }} memory usage too high!"
          description: "{{ $labels.instance }} memory usage above 80% (current: {{ $value }}%)"

      - alert: IO performance
        expr: (avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100) > 80
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{ $labels.instance }} disk IO utilization too high!"
          description: "{{ $labels.instance }} disk IO above 80% (current: {{ $value }})"

      - alert: Network
        expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{ $labels.instance }} inbound network bandwidth too high!"
          description: "{{ $labels.instance }} inbound network bandwidth has stayed above 100M. RX bandwidth: {{ $value }}"

      - alert: Network
        expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{ $labels.instance }} outbound network bandwidth too high!"
          description: "{{ $labels.instance }} outbound network bandwidth has stayed above 100M. TX bandwidth: {{ $value }}"

      - alert: TCP sessions
        expr: node_netstat_Tcp_CurrEstab > 1000
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{ $labels.instance }} TCP_ESTABLISHED too high!"
          description: "{{ $labels.instance }} TCP_ESTABLISHED above 1000 (current: {{ $value }})"

      - alert: Disk capacity
        expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100) > 80
        for: 1m
        labels:
          status: severe
        annotations:
          summary: "{{ $labels.instance }} disk partition usage too high!"
          description: "{{ $labels.instance }} disk partition usage above 80% (current: {{ $value }}%)"
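Two of the rule expressions are easier to judge with the arithmetic spelled out. The CPU rule takes the per-core idle fraction, averages it per instance, and subtracts the result (as a percentage) from 100. The network rules divide a bytes-per-second rate by 100 and compare it with 102400. The sketch below replays that arithmetic with awk; the function names and the sample idle values are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper mirroring 100 - avg(idle_rate) * 100:
# arguments are per-core idle fractions (0..1), output is the busy percentage.
cpu_busy_pct() {
    printf '%s\n' "$@" |
        awk '{ sum += $1; n++ } END { printf "%.1f\n", 100 - (sum / n) * 100 }'
}

# Hypothetical helper mirroring (rate / 100) > 102400: prints the byte rate
# at which the network rules start to fire, in bytes/s and in Mbit/s.
network_threshold() {
    awk 'BEGIN {
        bytes = 102400 * 100
        printf "%d bytes/s = %.2f Mbit/s\n", bytes, bytes * 8 / 1000000
    }'
}

# Made-up sample: two cores, idle 90% and 70% of the time.
cpu_busy_pct 0.9 0.7
network_threshold
```

By this calculation the network rules trigger at 10,240,000 bytes/s, roughly 82 Mbit/s, so the divisor or the threshold may need adjusting if a different limit is intended.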




Edit prometheus.yml to enable alerting: point the alerting section at Alertmanager and list the rule files under rule_files

[root@master ~]# vim /home/prometheus-2.32.1/prometheus.yml

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["192.168.1.60:9093"]
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global evaluation_interval.
rule_files:
  - "/home/alertmanager-0.23.0/rules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # scrape interval
    scrape_interval: 5s
    # metrics_path defaults to /metrics
    # scheme defaults to http.
    static_configs:
      - targets: ["localhost:9090","192.168.2.231:9100"]

  - job_name: "linux-host"
    scrape_interval: 5s
    # metrics_path defaults to /metrics
    # scheme defaults to http.
    static_configs:
      - targets: ["192.168.2.230:9100"]
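Indentation slips are an easy way to break prometheus.yml. The Prometheus tarball ships `promtool`, and `promtool check config <file>` is the authoritative check. As a rough pre-check, the sketch below lists lines whose leading indentation is an odd number of spaces or contains a tab. It is a crude heuristic under the assumption that the file uses two-space steps like the stock config; the function name `odd_indent_lines` is hypothetical, and even-width mistakes pass unnoticed.

```shell
#!/bin/sh
# Hypothetical pre-check: print "LINE: text" for every non-comment line whose
# leading indentation is an odd number of spaces or contains a tab.
odd_indent_lines() {
    awk '
        /^[ \t]*(#|$)/ { next }          # skip comments and blank lines
        {
            rest = $0
            sub(/^[ \t]+/, "", rest)     # strip the indentation from a copy
            indent = substr($0, 1, length($0) - length(rest))
            if (index(indent, "\t") > 0 || length(indent) % 2 == 1)
                printf "%d: %s\n", NR, $0
        }
    ' "$1"
}

# Usage:
#   odd_indent_lines /home/prometheus-2.32.1/prometheus.yml
#   /home/prometheus-2.32.1/promtool check config /home/prometheus-2.32.1/prometheus.yml
```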




Restart the services

[root@master ~]# cat > /home/alertmanager-0.23.0/start_alertmanager.sh << EOF
#!/bin/bash

nohup /home/alertmanager-0.23.0/alertmanager --config.file="/home/alertmanager-0.23.0/alertmanager.yml" > /home/alertmanager-0.23.0/alertmanager.log 2>&1 &
EOF

[root@master ~]# sh /home/alertmanager-0.23.0/start_alertmanager.sh
[root@master ~]# netstat -ntpl

[Screenshot: netstat -ntpl output]

Restart the prometheus service

[root@master ~]# kill -9 `ps -ef | grep prometheus | grep -v grep | awk '{print $2}'`
[root@master ~]# systemctl start prometheus

View the monitoring configuration

[Screenshot: monitoring configuration]

8. Test the email alert

[root@node1 ~]# kill -9 `ps -ef | grep node_exporter | grep -v grep | awk '{print $2}'`

[Screenshots: alert state after node_exporter is stopped]

Check the mailbox for the alert email:

[Screenshot: alert email]

Start the node_exporter service on node1 again

[Screenshot: node_exporter running again on node1]

The recovery email arrives

[Screenshot: recovery email]
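Stopping node_exporter exercises the whole chain, but the alert has to wait out the rule's `for` period first. To test only the Alertmanager routing and mail delivery, a synthetic alert can be posted straight to Alertmanager's v2 API (`/api/v2/alerts`). This is a sketch under assumptions: the function names are hypothetical, the alert name, instance and summary are illustrative, and Alertmanager is assumed to listen at the address used in prometheus.yml.

```shell
#!/bin/sh
# Hypothetical helper: print a one-alert JSON payload. The three values must
# not contain double quotes or backslashes (no JSON escaping is done here).
test_alert_payload() {
    printf '[{"labels":{"alertname":"%s","instance":"%s"},"annotations":{"summary":"%s"}}]\n' \
        "$1" "$2" "$3"
}

# POST the payload to Alertmanager. Without an endsAt field, Alertmanager
# considers the alert resolved once resolve_timeout passes with no update.
send_test_alert() {
    test_alert_payload "$2" "$3" "$4" |
        curl -fsS -X POST -H 'Content-Type: application/json' \
             --data-binary @- "http://$1/api/v2/alerts"
}

# Usage:
#   send_test_alert 192.168.1.60:9093 TestAlert node1:9100 "synthetic test alert"
```

With `send_resolved: true` in the receiver, the matching recovery mail should follow after the resolve timeout.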


9. Configure the email alert template

[root@master ~]# cat /home/alertmanager-0.23.0/template/wechat.tmpl   (email alert template)

{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 -}}
**********Alert notification**********<br>
Alert type: {{ $alert.Labels.alertname }} <br>
Severity: {{ $alert.Labels.severity }} <br>
{{- end }}
=====================<br>
Summary: {{ $alert.Annotations.summary }} <br>
Details: {{ $alert.Annotations.description }} <br>
Started at: {{ $alert.StartsAt.Local.Format "2006-01-02 15:04:05" }} <br>
{{ if gt (len $alert.Labels.instance) 0 -}}Instance: {{ $alert.Labels.instance }}{{- end -}}
{{- end }}
{{- end }}

{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 -}}
**********Recovery notification**********<br>
Alert type: {{ $alert.Labels.alertname }} <br>
Severity: {{ $alert.Labels.severity }} <br>
{{- end }}
=====================<br>
Summary: {{ $alert.Annotations.summary }} <br>
Details: {{ $alert.Annotations.description }} <br>
Started at: {{ $alert.StartsAt.Local.Format "2006-01-02 15:04:05" }} <br>
Resolved at: {{ $alert.EndsAt.Local.Format "2006-01-02 15:04:05" }} <br>
{{ if gt (len $alert.Labels.instance) 0 -}}Instance: {{ $alert.Labels.instance }}{{- end -}}<br>
{{- end }}
{{- end }}
{{- end }}
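The string `"2006-01-02 15:04:05"` passed to `Format` in the template is not a literal date: Go describes time layouts by writing out its fixed reference time, so this layout means year-month-day hour:minute:second. The strftime equivalent is `%Y-%m-%d %H:%M:%S`, which the sketch below applies to a Unix timestamp. The function name is hypothetical, the `-d @SECONDS` form assumes GNU date as found on CentOS, and UTC is used here only to keep the output independent of the host, whereas the template's `.Local` uses the Alertmanager host's time zone.

```shell
#!/bin/sh
# Hypothetical helper: format a Unix timestamp the way the template's
# layout "2006-01-02 15:04:05" would, but in UTC.
format_alert_time() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

# Prints the formatted epoch start.
format_alert_time 0
```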




[root@master ~]# cat /home/alertmanager-0.23.0/alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: smtp.163.com:25
  smtp_from: 1866@163.com
  smtp_auth_username: 1866@163.com
  smtp_auth_password: RKONLSDFSDMFZM
  smtp_require_tls: false

templates:
- /home/alertmanager-0.23.0/template/wechat.tmpl

route:
  group_by: [alertname]
  group_wait: 10s
  group_interval: 20s
  repeat_interval: 3m
  receiver: mail

receivers:
- name: mail
  email_configs:
  - to: 534234548@qq.com
    headers: { Subject: "[WARN] Alert email" }
    send_resolved: true
    html: '{{ template "wechat.default.message" . }}'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, dev, instance]

Check the email

[Screenshots: alert emails rendered with the template]


10. Configure the WeChat alert template

[root@master ~]# cat /home/alertmanager-0.23.0/template/wechat.tmpl   (WeChat alert template)

{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 -}}
**********Alert notification**********
Alert type: {{ $alert.Labels.alertname }}
Severity: {{ $alert.Labels.severity }}
{{- end }}
=====================
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.description }}
Started at: {{ $alert.StartsAt.Local.Format "2006-01-02 15:04:05" }}
{{ if gt (len $alert.Labels.instance) 0 -}}Instance: {{ $alert.Labels.instance }}{{- end -}}
{{- end }}
{{- end }}

{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 -}}
**********Recovery notification**********
Alert type: {{ $alert.Labels.alertname }}
Severity: {{ $alert.Labels.severity }}
{{- end }}
=====================
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.description }}
Started at: {{ $alert.StartsAt.Local.Format "2006-01-02 15:04:05" }}
Resolved at: {{ $alert.EndsAt.Local.Format "2006-01-02 15:04:05" }}
{{ if gt (len $alert.Labels.instance) 0 -}}Instance: {{ $alert.Labels.instance }}{{- end -}}
{{- end }}
{{- end }}
{{- end }}





[root@master ~]# cat /home/alertmanager-0.23.0/alertmanager.yml

global:
  resolve_timeout: 5m
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/

templates:
- /home/alertmanager-0.23.0/template/wechat.tmpl

route:
  group_by: [alertname]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 3m
  receiver: wechat

receivers:
- name: wechat
  wechat_configs:
  - corp_id: weterterterterhc6
    to_party: 1
    agent_id: 1000002
    api_secret: Ha_wefsdfertgretgerguRCVPnzvK1fY
    send_resolved: true

inhibit_rules:
- equal: [alertname, cluster, service]
  source_match:
    severity: high
  target_match:
    severity: warning

