Monitoring Lecture Series: Common System Monitoring Metrics (Storage)

Posted by 云原生技术课堂


4. Disk/Storage Monitoring Metrics

Generally speaking, when we monitor storage devices we are mostly monitoring file systems, that is, the part the operating system can use directly. In real production, however, there are other monitoring needs:

  • Disks that have not been formatted with a file system (XFS, FAT32, EXT4). These are invisible to a normal df when in use, for example Oracle ASM disks, block storage, and LUNs mapped in from a storage array (FC, iSCSI). We need special means to put them to work for us; a sketch of at least making them visible follows after this list.

  • Devices that provide storage, such as HP 3PAR, IBM DS arrays, and EMC arrays, or servers that provide storage services, such as NFS servers and Ceph/Swift clusters. For vendor products the best approach is to ask the vendor's engineers about monitoring: is there a plugin so that some monitoring software (Zabbix, Grafana) can collect metrics directly, or an API that external programs can poll? Even when neither exists, SNMP remains a simple way to monitor them, though the number of metrics SNMP exposes is limited; think of it as a fallback solution. As for open-source solutions, what our customers or managers probably most want to hear are hard numbers such as random read/write speed and sequential read/write speed, because those metrics are one of the key yardsticks for our systems.
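As a hedged example of the first case: lsblk lists every block device the kernel knows about, formatted or not, so raw disks and mapped LUNs that df skips still show up. The device names and sizes below are made up for illustration:

# list all block devices; an empty FSTYPE column means no filesystem
$ lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT
NAME   SIZE TYPE FSTYPE MOUNTPOINT
sda    100G disk ext4   /
sdb    500G disk               <- raw disk: no FSTYPE, invisible to df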

We will come back to this in detail when we cover distributed storage and Ceph; here we only compare the metrics that the built-in templates of a few common tools can pick up.


4.1. Viewing Disk Metrics on the System

As before, there are two categories:

  • Via commands: top, iostat, vmstat, and sar show instantaneous rates, and the df family shows utilization. We can also combine dd with time and read the throughput off the result to test speed (a sketch follows after the listing below). And of course there are third-party tools such as fio, hdparm, and smartctl.

  • Via files: generally speaking these disks are also files, and each has corresponding metrics that we can find under /sys/block/<device>. The listing below is from /sys/block/mmcblk0 (an SD card; a SCSI disk would appear as /sys/block/sda). Note that not all Linux/Unix systems work this way; macOS, for example, has no /sys/block directory.

    # ls /sys/block/mmcblk0
    alignment_offset discard_alignment inflight queue slaves
    bdi ext_range mmcblk0p1 range stat
    capability force_ro mmcblk0p2 removable subsystem
    dev hidden mq ro trace
    device holders power size uevent
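As mentioned in the first bullet, wrapping dd in time gives a rough throughput test. A minimal sketch, assuming /tmp has enough free space; oflag=direct/iflag=direct bypass the page cache so the numbers reflect the disk rather than RAM:

    # rough write-speed test: 512 MB of zeros, bypassing the page cache
    $ time dd if=/dev/zero of=/tmp/ddtest.bin bs=1M count=512 oflag=direct
    # rough read-speed test on the same file
    $ time dd if=/tmp/ddtest.bin of=/dev/null bs=1M iflag=direct
    $ rm /tmp/ddtest.bin

dd itself also prints the throughput it measured, so time is mostly a cross-check.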

In fact the metrics visible on the system itself are the most complete, yet the vmstat command we use so often provides very few of them:

Swap
  si: Amount of memory swapped in from disk (/s).
  so: Amount of memory swapped to disk (/s).

IO
  bi: Blocks received from a block device (blocks/s).
  bo: Blocks sent to a block device (blocks/s).
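For reference, vmstat is usually run with an interval so these fields become per-second rates rather than averages since boot; the interval and count below are arbitrary:

# report once per second, five times
$ vmstat 1 5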

That is only swap reads/writes and block-device reads/writes. For disk I/O we more often use iostat -d:

$ iostat
Linux 2.6.32-431.11.15.el6.ucloud.x86_64 (ssdk1) 10/14/2016 _x86_64_ (4 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
0.44 0.00 0.26 0.01 0.01 99.29

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
vda 0.66 0.09 6.75 1404732 105885456
vdb 1.42 12.47 55.86 195619082 876552296

This shows the rates for each disk:

tps: number of transfers (I/O requests) issued to the device per second
Blk_read/s: amount of data read from the device, expressed in blocks per second
Blk_wrtn/s: amount of data written to the device, expressed in blocks per second
Blk_read: total number of blocks read
Blk_wrtn: total number of blocks written
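When per-request latency and device utilization matter, iostat's extended mode adds them (the -x flag is standard sysstat; interval and count are arbitrary):

# extended per-device stats, including await (latency) and %util, sampled every second, three times
$ iostat -dx 1 3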

Then there is the df command, which shows disk usage. This is a very important metric: if a disk fills up then, much as with CPU exhaustion, some running programs may terminate unexpectedly because they can no longer write data.

df -H
Filesystem Size Used Avail Use% Mounted on
/dev/root 126G 2.0G 119G 2% /
devtmpfs 1.9G 0 1.9G 0% /dev
tmpfs 2.0G 0 2.0G 0% /dev/shm
tmpfs 2.0G 8.8M 2.0G 1% /run
tmpfs 5.3M 4.1k 5.3M 1% /run/lock
tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
/dev/mmcblk0p1 265M 55M 210M 21% /boot
tmpfs 400M 0 400M 0% /run/user/1000
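A filesystem can also "fill up" by running out of inodes while df still shows free blocks; df -i reports inode usage, which is exactly the extra metric the Grafana dashboard in 4.3 adds:

# inode usage rather than block usage
$ df -i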

4.2. Storage Monitoring Metrics in Zabbix

Much the same as the metrics we can see on the system itself.

[Screenshot: Zabbix built-in storage monitoring items]
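For instance, filesystem usage comes from the Zabbix agent's built-in vfs.fs.size item key; a sketch of one item (the filesystem path and mode are example choices):

# used space on / as a percentage
vfs.fs.size[/,pused]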

4.3. Storage Monitoring Metrics in Grafana

It adds inode monitoring; the rest is basically the same.

[Screenshot: Grafana storage monitoring panels]

4.4. Storage Monitoring Metrics from node_exporter

The monitoring here appears to expose quite a lot more:

# HELP node_disk_discard_time_seconds_total This is the total number of seconds spent by all discards.
# TYPE node_disk_discard_time_seconds_total counter
node_disk_discard_time_seconds_total{device="mmcblk0"} 0
node_disk_discard_time_seconds_total{device="mmcblk0p1"} 0
node_disk_discard_time_seconds_total{device="mmcblk0p2"} 0
# HELP node_disk_discarded_sectors_total The total number of sectors discarded successfully.
# TYPE node_disk_discarded_sectors_total counter
node_disk_discarded_sectors_total{device="mmcblk0"} 0
node_disk_discarded_sectors_total{device="mmcblk0p1"} 0
node_disk_discarded_sectors_total{device="mmcblk0p2"} 0
# HELP node_disk_discards_completed_total The total number of discards completed successfully.
# TYPE node_disk_discards_completed_total counter
node_disk_discards_completed_total{device="mmcblk0"} 0
node_disk_discards_completed_total{device="mmcblk0p1"} 0
node_disk_discards_completed_total{device="mmcblk0p2"} 0
# HELP node_disk_discards_merged_total The total number of discards merged.
# TYPE node_disk_discards_merged_total counter
node_disk_discards_merged_total{device="mmcblk0"} 0
node_disk_discards_merged_total{device="mmcblk0p1"} 0
node_disk_discards_merged_total{device="mmcblk0p2"} 0
# HELP node_disk_io_now The number of I/Os currently in progress.
# TYPE node_disk_io_now gauge
node_disk_io_now{device="mmcblk0"} 0
node_disk_io_now{device="mmcblk0p1"} 0
node_disk_io_now{device="mmcblk0p2"} 0
# HELP node_disk_io_time_seconds_total Total seconds spent doing I/Os.
# TYPE node_disk_io_time_seconds_total counter
node_disk_io_time_seconds_total{device="mmcblk0"} 11.476
node_disk_io_time_seconds_total{device="mmcblk0p1"} 0.44
node_disk_io_time_seconds_total{device="mmcblk0p2"} 11.064
# HELP node_disk_io_time_weighted_seconds_total The weighted # of seconds spent doing I/Os.
# TYPE node_disk_io_time_weighted_seconds_total counter
node_disk_io_time_weighted_seconds_total{device="mmcblk0"} 16.476
node_disk_io_time_weighted_seconds_total{device="mmcblk0p1"} 0.668
node_disk_io_time_weighted_seconds_total{device="mmcblk0p2"} 15.792
# HELP node_disk_read_bytes_total The total number of bytes read successfully.
# TYPE node_disk_read_bytes_total counter
node_disk_read_bytes_total{device="mmcblk0"} 2.32966144e+08
node_disk_read_bytes_total{device="mmcblk0p1"} 1.153536e+07
node_disk_read_bytes_total{device="mmcblk0p2"} 2.20890112e+08
# HELP node_disk_read_time_seconds_total The total number of seconds spent by all reads.
# TYPE node_disk_read_time_seconds_total counter
node_disk_read_time_seconds_total{device="mmcblk0"} 11.972
node_disk_read_time_seconds_total{device="mmcblk0p1"} 0.704
node_disk_read_time_seconds_total{device="mmcblk0p2"} 11.232000000000001
# HELP node_disk_reads_completed_total The total number of reads completed successfully.
# TYPE node_disk_reads_completed_total counter
node_disk_reads_completed_total{device="mmcblk0"} 4883
node_disk_reads_completed_total{device="mmcblk0p1"} 416
node_disk_reads_completed_total{device="mmcblk0p2"} 4447
# HELP node_disk_reads_merged_total The total number of reads merged.
# TYPE node_disk_reads_merged_total counter
node_disk_reads_merged_total{device="mmcblk0"} 6505
node_disk_reads_merged_total{device="mmcblk0p1"} 3795
node_disk_reads_merged_total{device="mmcblk0p2"} 2710
# HELP node_disk_write_time_seconds_total This is the total number of seconds spent by all writes.
# TYPE node_disk_write_time_seconds_total counter
node_disk_write_time_seconds_total{device="mmcblk0"} 26.967000000000002
node_disk_write_time_seconds_total{device="mmcblk0p1"} 0.008
node_disk_write_time_seconds_total{device="mmcblk0p2"} 26.958000000000002
# HELP node_disk_writes_completed_total The total number of writes completed successfully.
# TYPE node_disk_writes_completed_total counter
node_disk_writes_completed_total{device="mmcblk0"} 1456
node_disk_writes_completed_total{device="mmcblk0p1"} 3
node_disk_writes_completed_total{device="mmcblk0p2"} 1453
# HELP node_disk_writes_merged_total The number of writes merged.
# TYPE node_disk_writes_merged_total counter
node_disk_writes_merged_total{device="mmcblk0"} 2529
node_disk_writes_merged_total{device="mmcblk0p1"} 0
node_disk_writes_merged_total{device="mmcblk0p2"} 2529
# HELP node_disk_written_bytes_total The total number of bytes written successfully.
# TYPE node_disk_written_bytes_total counter
node_disk_written_bytes_total{device="mmcblk0"} 6.9829632e+07
node_disk_written_bytes_total{device="mmcblk0p1"} 5120
node_disk_written_bytes_total{device="mmcblk0p2"} 6.9824512e+07
node_scrape_collector_duration_seconds{collector="diskstats"} 0.001754445
node_scrape_collector_success{collector="diskstats"} 1
  • The *_merged_total metrics are not an aggregate across disks; they count adjacent I/O requests that the kernel block layer merged into a single request before sending them to the device.

  • The discard metrics are not a disk "packet loss rate"; they count discard (TRIM/UNMAP) requests, which tell an SSD or thin-provisioned volume that certain blocks are no longer in use. They are all zero here simply because no discards have been issued.
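Since these are Prometheus counters, dashboards normally graph their rate() rather than the raw values. A minimal PromQL sketch (the 5m window is an arbitrary choice):

# per-device read throughput in bytes/s over the last 5 minutes
rate(node_disk_read_bytes_total[5m])

# fraction of time the device spent doing I/O, similar to iostat's %util
rate(node_disk_io_time_seconds_total[5m])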

