Linux System Fault Diagnosis and Optimization

Posted 2021-08-24 John08

Process Inspection

Ex1. CPU load suddenly spikes?

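Troubleshooting approach:

1) User traffic has increased, raising the machine's CPU load;

2) A program is misbehaving, raising CPU usage;

3) A disk I/O fault is raising the load average (on Linux, processes blocked in uninterruptible I/O wait count toward the load even when the CPU is idle);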

Reference commands for troubleshooting:

>> Find the processes with the highest CPU usage;

# ps aux |sort -k3 |tail
root      1654  0.0  0.1 12308  2636 ?        S<  18:28   0:00 /sbin/udevd -d
root      1655  0.0  0.1 12308  2636 ?        S<  18:28   0:00 /sbin/udevd -d
root      1612  0.0  0.1 81012  3480 ?        Ss  18:28   0:00 /usr/libexec/postfix/master
postfix   1618  0.0  0.1 81092  3436 ?        S   18:28   0:00 pickup -l -t fifo -u
postfix   1619  0.0  0.1 81160  3484 ?        S   18:28   0:00 qmgr -l -t fifo -u
root      1689  0.3  0.2 98008  4436 ?        Ss  18:30   0:00 sshd: root@pts/0
root      1626  0.5  0.0 116884 1392 ?        Ss   18:28  0:01 crond
root         1  0.7  0.0 19232  1500 ?        Ss  18:27   0:02 /sbin/init
root       1708 99.5  0.0 100940   680 pts/0   R    18:31   1:13 sha256sum /dev/zero
USER       PID %CPU %MEM    VSZ   RSS TTY     STAT START   TIME COMMAND
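
Note that `sort -k3` compares the key as text rather than as a number, so depending on locale and column padding it may misorder values such as 9.0 and 10.0, and it leaves the header line at the bottom. A minimal sketch of a numeric alternative follows. `top_cpu` is a hypothetical helper name, not a standard command, and it assumes `ps aux`-style input with %CPU in the third column.

```shell
# Hypothetical helper: numeric sort of `ps aux`-style text by %CPU (field 3).
# Reads the listing on stdin, prints the header plus the top N rows (default 10).
top_cpu() {
    n="${1:-10}"
    input="$(cat)"
    # Keep the header line first.
    printf '%s\n' "$input" | head -n 1
    # Sort the remaining rows by field 3, numerically, highest first.
    printf '%s\n' "$input" | tail -n +2 | LC_ALL=C sort -k3,3nr | head -n "$n"
}

# Example (not run here): ps aux | top_cpu 5
```

On systems with the procps version of ps, `ps aux --sort=-%cpu | head` should give a similar result without a helper.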

>> Find the user and terminal that launched the process;

# ps axjf |sort -k3|tail -n 20
  1612   1618  1612   1612 ?            -1 S       89  0:00  \_ pickup -l -t fifo -u
  1612   1619  1612   1612 ?            -1 S       89  0:00  \_ qmgr -l -t fifo -u
     1   1612  1612   1612 ?            -1 Ss       0  0:00 /usr/libexec/postfix/master
     1   1626  1626   1626 ?            -1 Ss       0  0:01 crond
     1   1639  1639   1639 tty1       1639 Ss+      0  0:00 /sbin/mingetty /dev/tty1
     1   1641  1641   1641 tty2       1641 Ss+      0  0:00 /sbin/mingetty /dev/tty2
     1   1643  1643   1643 tty3       1643 Ss+      0  0:00 /sbin/mingetty /dev/tty3
     1   1645  1645   1645 tty4       1645 Ss+      0  0:00 /sbin/mingetty /dev/tty4
     1   1647  1647   1647 tty5       1647 Ss+      0  0:00 /sbin/mingetty /dev/tty5
     1   1649  1649   1649 tty6       1649 Ss+      0  0:00 /sbin/mingetty /dev/tty6
  1533   1689   1689  1689 ?            -1 Ss       0  0:00  \_ sshd: root@pts/0
  1689   1693   1693  1693 pts/0      1747 Ss       0  0:00      \_ -bash
  1693   1708   1708  1693 pts/0      1747 R        0 10:17          \_ sha256sum /dev/zero
  1693   1747  1747   1693 pts/0      1747 R+       0  0:00          \_ ps axjf
  1693   1748  1747   1693 pts/0      1747 S+       0  0:00          \_ sort -k3
  1693   1749   1747  1693 pts/0      1747 S+       0  0:00          \_ tail -n 20
   614   1654   614    614 ?            -1 S<       0  0:00  \_ /sbin/udevd -d
   614   1655   614    614 ?            -1 S<       0  0:00  \_ /sbin/udevd -d
     1    614   614    614 ?            -1 S<s      0  0:00 /sbin/udevd -d
  PPID    PID  PGID    SID TTY       TPGID STAT   UID  TIME COMMAND
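
Reading the parent chain out of a long tree by eye is error-prone, so the lookup can be sketched as a small filter. `ancestry` is a hypothetical helper name. It assumes `ps axj`-style input (no `f` tree drawing) with a header line and the column order shown above.

```shell
# Hypothetical helper: walk from a PID up to init using `ps axj`-style text
# on stdin (columns: PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND).
ancestry() {
    awk -v pid="$1" '
        NR > 1 {
            ppid[$2] = $1; tty[$2] = $5; uid[$2] = $8
            cmdline = $10
            for (i = 11; i <= NF; i++) cmdline = cmdline " " $i
            cmd[$2] = cmdline
        }
        END {
            # Follow parent links; the hop limit guards against malformed input.
            for (hops = 0; (pid in ppid) && hops < 64; hops++) {
                printf "pid=%s uid=%s tty=%s cmd=%s\n", pid, uid[pid], tty[pid], cmd[pid]
                pid = ppid[pid]
            }
        }'
}

# Example (not run here): ps axj | ancestry 1708
```

For the sha256sum process above, the chain would pass through `-bash` and `sshd: root@pts/0`, which is what ties the process to a login session.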

>> Find where that user's terminal session logged in from;

# w
 18:46:23 up 18 min,  1 user,  load average: 1.00, 0.94, 0.60
USER    TTY      FROM              LOGIN@   IDLE   JCPU  PCPU WHAT
root     pts/0    192.168.23.1     18:30   0.00s 15:18   0.00s w

>> If it is a network service process, check the number of network connections

# ss -4tu
Netid State      Recv-Q Send-Q                                Local Address:Port                                                Peer Address:Port
tcp   ESTAB      0      0                                      1*.2**.2**.7:59197                                                10.202.13.3:10050
tcp   ESTAB      0      0                                      1*.2**.2**.7:41727                                                10.202.13.10:10050
tcp   LAST-ACK   1      1                                      1*.2**.2**.7:18876                                                10.202.13.5:10050
tcp   ESTAB      0      0                                      1*.2**.2**.7:41771                                               10.202.13.10:10050
tcp   ESTAB      0      0                                      1*.2**.2**.7:40051                                               10.202.13.15:10050
tcp   ESTAB      0     0                                     1*.2**.2**.7:22087                                              10.202.167.12:10050
tcp   ESTAB      0      0                                      1*.2**.2**.7:17288                                                 10.202.13.4:10050
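Count network service connections: ss -4tup | wc -l (the total includes the header line);
Count connections for a single service process: ss -4tup | grep 'process_name' | wc -l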
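
A raw total hides whether the sockets are established or stuck in a closing state, so a per-state breakdown can help. The following is a minimal sketch. `count_by_state` is a hypothetical helper name, and it assumes `ss -4tu`-style input where line 1 is the header and the state is in column 2 (with `-t` alone, ss omits the Netid column and the state moves to column 1).

```shell
# Hypothetical helper: count sockets per state from `ss -4tu`-style text
# on stdin (header on line 1, state in column 2).
count_by_state() {
    awk 'NR > 1 && NF > 0 { n[$2]++ } END { for (s in n) print n[s], s }' |
        LC_ALL=C sort -k1,1nr -k2,2
}

# Example (not run here): ss -4tun | count_by_state
```

Applied to the sample output above, this would report 6 ESTAB and 1 LAST-ACK.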

>> Check whether any process is blocked on I/O;

# ps aux |sort -k8
USER        PID %CPU %MEM    VSZ   RSS TTY     STAT START   TIME COMMAND
root     4169515 30.6  0.7 2975800 2872816 ?     D   09:36  26:17 /usr/bin/python3.6 -s /bin/s3cmd ***
root       82363 28.9  0.0 295004 193892 ?       D   10:58   0:54 /usr/bin/python3.6 -s /bin/s3cmd ***
root          7  0.0  0.0     0     0 ?        D   Jul21   4:34 [kworker/u96:0+ixgbe]
root     3436655 0.0  0.0      0    0 ?        I    00:00  0:01 [kworker/16:1-mm_percpu_wq]
root    3445624  0.0  0.0     0     0 ?        I   00:07   0:01 [kworker/26:1-events]
root    3481033  0.0  0.0     0     0 ?        I   00:34   0:01 [kworker/11:3-xfs-buf/dm-2

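A blocked process (state D, uninterruptible sleep) returns to normal on its own once the disk I/O it is waiting for completes. When you find processes in this state, first check whether the local disks, the backend NAS, block storage, and so on are healthy.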
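
Sorting the whole listing by STAT still leaves the D rows mixed in with everything else, so a filter is a convenient alternative. A minimal sketch, where `blocked_procs` is a hypothetical helper name and the input is assumed to be `ps aux`-style text with STAT in the eighth column:

```shell
# Hypothetical helper: keep the header and any row whose STAT (field 8)
# starts with D (uninterruptible sleep) from `ps aux`-style text on stdin.
blocked_procs() {
    awk 'NR == 1 || $8 ~ /^D/'
}

# Example (not run here): ps aux | blocked_procs
```

Running it several times is worthwhile: a process that stays in D across samples is more suspicious than one caught there for a moment.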

>> For intermittent CPU spikes, use an interactive tool such as top to find the processes using the most CPU

# top -M -d 1
Tasks: 145 total,  2 running, 143 sleeping,   0 stopped,   0 zombie
Cpu(s): 49.2%us, 0.8%sy,  0.0%ni, 49.9%id,  0.0%wa, 0.0%hi,  0.0%si,  0.0%st
Mem: 1860.512M total,  364.098M used, 1496.414M free, 9284.000k buffers
Swap: 2047.996M total,    0.000k used, 2047.996M free,  207.379M cached
 
   PID USER      PR  NI VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
  1708 root      20   0 98.6m 680  560 R 100.0  0.0 56:50.53 sha256sum
  1900 root      20   0 15020 1356 1012 R  0.3 0.1   0:00.49 top
     1 root      20   0 19232 1508 1232 S  0.0 0.1   0:02.11 init
     2 root      20   0    0    0    0 S 0.0  0.0   0:00.02 kthreadd
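
An intermittent spike is easy to miss while watching the screen, so top can also be run unattended in batch mode and its output filtered. A minimal sketch under stated assumptions: `hot_procs` is a hypothetical helper name, and it assumes top's default column layout as shown above (%CPU in field 9, COMMAND in field 12).

```shell
# Hypothetical helper: from `top -b` output on stdin, print PID, USER, %CPU
# and COMMAND for rows whose %CPU (field 9) is at or above a threshold.
hot_procs() {
    awk -v limit="$1" '
        $1 == "PID" { in_table = 1; next }   # process table starts after this header
        NF == 0     { in_table = 0; next }   # a blank line ends the table
        in_table && $9 + 0 >= limit + 0 { print $1, $2, $9, $12 }'
}

# Example (not run here): top -b -d 1 -n 5 | hot_procs 80
```

Redirecting the result to a file lets the capture run over a longer window and be reviewed afterwards.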