Monitoring High CPU Usage and Analyzing Logs on Linux


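When a Linux system shows excessive CPU usage, a script can track down the cause:

1. Monitor in real time, so that the script starts as soon as a process shows high CPU usage;

2. Analyze that process to identify the threads responsible;

3. Analyze the log files of the program those threads belong to; WebSphere middleware, for example, keeps a very thorough set of log files;

4. Examine the errors and warnings in the log files in detail. Log files are sometimes very large and details are easy to overlook, so combining sed and awk with regular expressions helps locate the errors without missing anything.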
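
As an illustration of step 4, the following is a minimal sketch of scanning a log with awk and a regular expression. The function name `scan_log`, the keywords `ERROR` and `WARNING`, and the sample log lines are assumptions made for this example; a real WebSphere log may use different markers, so the pattern would need adjusting.

```shell
#!/bin/bash
# Hypothetical helper: list and count ERROR/WARNING lines in a log file.
scan_log() {
    local logfile=$1
    # Print every matching line with its line number so nothing is skipped.
    awk '/ERROR|WARNING/ { printf "%d: %s\n", NR, $0 }' "$logfile"
    # Summarize how many lines matched each keyword.
    awk '/ERROR/ { e++ } /WARNING/ { w++ }
         END { printf "errors=%d warnings=%d\n", e, w }' "$logfile"
}

# Made-up sample data standing in for a real application log.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[10:00:01] INFO  server started
[10:00:05] WARNING pool nearly exhausted
[10:00:09] ERROR  request timed out
[10:00:12] INFO  request served
[10:00:15] ERROR  connection reset
EOF

scan_log "$sample"
rm -f "$sample"
```

Printing the line number with each match makes it easy to jump to the surrounding context in a very large file.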

 

This approach uses one local script and one remote script to monitor the host, locate the log files, and analyze them.

Local script: highCpuAnalysis_l.sh

Preparation: define variables

###############################################################################

#The source code is created in 10.19.90.165 and 192.168.86.198

# This script is used to analyze performance data for high CPU issues on Linux

# Usage:    ./highCpuAnalysis_l.sh USER IP

# Author: HuangTao

# Email:[email protected]

#

###############################################################################

##########################

#  Define Variables      #

##########################

export USER=$1;

export IP=$2;

 

##Usage:

if [ $# -lt 2 ]

then

echo " Unable to find USER and IP."

echo " Please rerun the script as follows: ./highCpuAnalysis_l.sh USER IP"

echo "eg: ./highCpuAnalysis_l.sh root 192.168.86.198 "

exit 1

fi

 

##get the remote server's WAS application server name

##(--sort=-pcpu keeps the header on line 1, so line 2 is the top process)

export wasappname=$(ssh "$USER@$IP" "ps -eo pcpu,pmem,pid,user,args --sort=-pcpu" | sed -n '2p' | awk '{print $NF}')

 

##get the remote server's hostname

export remotehostname=$(ssh "$USER@$IP" hostname)

##get the current directory

export dir=$(pwd)

 

 

 

 

Step 1: Copy the local analysis script to the remote host:

###############################################################################

##Copy the script highCpuAnalysis_r.sh to the target host

echo "*********************************************************************" 

echo "Step 1: "

echo "Copying highCpuAnalysis_r.sh to the remote host ..."

scp  highCpuAnalysis_r.sh  "$USER@$IP:/tmp/"

ssh "$USER@$IP" chmod 755 /tmp/highCpuAnalysis_r.sh

echo "highCpuAnalysis_r.sh is RUNNING on $remotehostname($IP). "

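Step 6: Copy the analysis files generated on the remote host to the local host, then delete them from the remote host (steps 2 to 5 are performed by the remote script, highCpuAnalysis_r.sh)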

###############################################################################

##run the script on the target remote host:

ssh "$USER@$IP"  /tmp/highCpuAnalysis_r.sh

echo "*************************************************************************"

echo "Step 6:"

echo "Copy the report and javacore to the local analysis host:"

###############################################################################

##Copy the report and javacore to the local host then delete them:

export dir=$(pwd)

scp "$USER@$IP:/tmp/HighCpuReport*" .

scp "$USER@$IP:/tmp/javacore*.gz"  .

tar -zxvf javacore*.gz

 

##Remove all related files on the remote server

##(the quotes make the wildcards expand on the remote host, not locally)

ssh "$USER@$IP" "rm -f /tmp/HighCpu*Report*"

ssh "$USER@$IP" "rm -f /tmp/javacore*"

ssh "$USER@$IP" "rm -f /tmp/highCpuAnalysis_r.sh"

ssh "$USER@$IP" "rm -f /tmp/topdashH.*"

echo "   "

 

 

Step 7: Display the analysis results

echo "*********************************************************************" 

echo "Step 7:"

echo "Show All information:"

echo "Remote hostname: $remotehostname($IP)."

echo "Remote Appserver name:$wasappname."

echo "Report and javacore:"

 

rm -f javacore*.gz

ls -rlt HighCpu*Report* |tail -1

ls -rtl javacore* |tail -3

 

echo  "*******************************END**********************************"

 

 

 

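Remote script: highCpuAnalysis_r.sh

Preparation for the remote script: define the variables used on the remote host (mainly the sampling interval and the number of samples)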

###############################################################################

#The source code is created in 10.19.90.165 and 192.168.86.198.

# This script is used to analyze performance data for high CPU issues on Linux

# Usage:    ./highCpuAnalysis_r.sh 

# Author: HuangTao

# Email:[email protected]

###############################################################################

##########################

#  Define Variables      #

##########################

# Interval, in seconds, between two top dash H samples.

TOP_DASH_H_VAL=30  

# Number of top dash H samples to take.

TOP_DASH_H_VAL_T=3

 

# Interval, in seconds, between two javacores.    

JAVACORE_VAL=60 

# Number of javacores to take.   

JAVACORE_VAL_T=3 

##get the PID of the process using the most CPU

##(--sort=-pcpu keeps the header on line 1, so line 2 is the top process)

export pid=$(ps -eo pcpu,pmem,pid,user,args --sort=-pcpu | sed -n '2p' | awk '{print $3}')

 

##convert the pid to hexadecimal (from base 10 to base 16)

export pid16=$(echo "obase=16; $pid" | bc)

 

##get the owner of that process, used to check whether it is a WAS process

export was=$(ps -eo pcpu,pmem,pid,user,args --sort=-pcpu | sed -n '2p' | awk '{print $4}')

 

##get the WAS application name

export wasappname=$(ps -eo pcpu,pmem,pid,user,args --sort=-pcpu | sed -n '2p' | awk '{print $NF}')

 

##get hostname

export hostname=$(hostname)

 

 

Step 2: Get the PID of the process with the highest CPU usage and check whether it belongs to the target program

##########################

# Get High CPU PID       #

##########################

## put the report in /tmp/HighCpuReport.$pid.$hostname.out

echo "Script execution time:" $(date)  > /tmp/HighCpuReport.$pid.$hostname.out

echo "   " 

## check whether the process is a WAS process (owned by wasuser or wasadmin)

if [ "$was" = wasuser ]  || [ "$was" = wasadmin ]

then 

echo "*********************************************************************"                                                                       

echo "Step 2:"

echo "The highest CPU pid is :  $pid, the process is a WAS process. "

else

echo "The highest CPU pid :  $pid is NOT a WAS process."  | tee -a /tmp/HighCpuReport.$pid.$hostname.out

echo "   " 

exit 1

fi

sleep 1;

echo "*********************************************************************" 

 

Under a specific user name (such as wasadmin or httpd), find the 10 processes using the most CPU.
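
A per-user listing like this can be sketched with `ps` alone. The function name `top_cpu_by_user` is hypothetical, and the sketch assumes the procps version of `ps` found on most Linux distributions, which supports `--sort`; when no user is passed, it falls back to the current user.

```shell
#!/bin/bash
# Hypothetical helper: list the 10 processes of one user that use the most CPU.
top_cpu_by_user() {
    local user=$1
    # --sort=-pcpu orders rows by CPU usage, highest first (procps ps);
    # head -n 11 keeps the header line plus 10 processes.
    ps -u "$user" -o pcpu,pmem,pid,user,args --sort=-pcpu | head -n 11
}

# Example: pass an account such as wasadmin, or default to the current user.
top_cpu_by_user "${1:-$(id -un)}"
```

Letting `ps` do the sorting avoids the pitfalls of sorting the CPU column as text, where a value such as 100 can end up below 99.9.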

 

 

 

 

 

 

 

 

 

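Step 3: For the process using the most CPU, sample the CPU usage of its threads (or child processes) over a set period and write the samples to a temporary file for later analysis.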

#########################

#                       #

# Start collection of:  #

#  * top dash H         #

#                       #

#########################

# Start the collection of top dash H data.

echo  "Step 3:" 

echo  "Starting collection of top dash H data ..." 

echo  "Need $((TOP_DASH_H_VAL * TOP_DASH_H_VAL_T)) seconds to complete this step:"

 

 top -bH -d $TOP_DASH_H_VAL -n $TOP_DASH_H_VAL_T -p $pid  > /tmp/topdashH.$pid.$hostname.out

 

    #eg:   top -bH -d 30 -n 3 -p 7031

    #eg:  grep -v Swap toplog.out |grep -v Task |grep -v "Cpu(s)"|grep -v "Mem:" |grep -v top| sort -k 1 -r | head -10 | sed -n '2p' |awk '{print $3}'

#echo "Analyzing the snapshot in /tmp/topdashH.$pid.$hostname.out can find the high CPU thread" ;

echo  "Collected the top dash H data." 

sleep 2;
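
Because the file collected in this step holds several samples, one way to read it is to average each thread's %CPU across the samples. The sketch below is an assumption-laden illustration, not part of the original script: the function name `summarize_threads` is hypothetical, and it assumes the column layout `PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND` (thread ID in column 1, %CPU in column 9), which can differ between top versions.

```shell
#!/bin/bash
# Hypothetical helper: average %CPU per thread over all samples in a
# top -bH output file, then list the 10 busiest threads.
summarize_threads() {
    local file=$1
    # Keep only thread rows (first field numeric), sum column 9 per thread ID.
    awk '$1 ~ /^[0-9]+$/ && NF >= 12 { sum[$1] += $9; n[$1]++ }
         END { for (t in sum) printf "%s %.1f %d\n", t, sum[t] / n[t], n[t] }' "$file" |
    sort -k2,2nr | head -n 10
}

# Made-up sample with two snapshots of two threads.
sample=$(mktemp)
cat > "$sample" <<'EOF'
top - 10:00:00 up 10 days,  1:02,  2 users,  load average: 1.00, 0.80, 0.50
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7040 wasadmin  20   0 2048m 512m  10m R 95.0  6.3   1:23.45 java
 7041 wasadmin  20   0 2048m 512m  10m S  5.0  6.3   0:02.10 java

top - 10:00:30 up 10 days,  1:02,  2 users,  load average: 1.10, 0.85, 0.52
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7040 wasadmin  20   0 2048m 512m  10m R 85.0  6.3   1:49.10 java
 7041 wasadmin  20   0 2048m 512m  10m S 15.0  6.3   0:06.60 java
EOF

# Output columns: thread ID, average %CPU, number of samples.
summarize_threads "$sample"
rm -f "$sample"
```

Averaging smooths out a thread that spikes in a single sample, which a plain sort over all rows would rank at the top.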

 

 

 

 

 

 

 

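Step 4: Analyze the file from the previous step to find the top 10 threads by CPU consumption over the sampling period; if WAS is what is driving CPU usage up, also generate javacores during the specified period for analysis.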

 

###########################

#  Find the top 10 threads by CPU

#  and TIME consumption.

###########################

##delete /tmp/topdashH.$pid.$hostname.out once the data collection is complete

############################

# Start collection of:  #

#  * javacores          #

#########################

# Javacores are output to the working directory of the JVM; in most cases this is the <profile_root>

echo "*********************************************************************"    

echo  "Step 4:" 

echo  "Starting collection of Javacores ..." 

echo  "Need $((JAVACORE_VAL * JAVACORE_VAL_T)) seconds to complete this step:"

##clear the javacore about this PID first:

rm -f /opt/IBM/WebSphere/AppServer/profiles/$wasappname/javacore*$pid*

##then generate the javacore

        kill -3 $pid ;

        echo "Collected the first javacore for PID $pid ."   

        sleep $JAVACORE_VAL

        

        kill -3 $pid ;

        echo "Collected the second javacore for PID $pid ." 

        sleep $JAVACORE_VAL

 

        kill -3 $pid ;

        echo "Collected the third javacore for PID $pid ."     

        sleep $JAVACORE_VAL    

        

##mv the javacore to the /tmp DIR and then zip:

rm -f /tmp/javacore*

mv -f /opt/IBM/WebSphere/AppServer/profiles/$wasappname/javacore*$pid*     /tmp/

cd /tmp

tar -zcvf javacore.$(date +%Y%m%d"."%H%M%S).$pid.gz javacore*$pid*

 

Step 5: Display the analysis results and save them to a temporary file

 

echo "*********************************************************************"  

echo  "Step 5:" 

echo  "Print out the analysis information:" 

echo "   "                                                                      | tee -a /tmp/HighCpuReport.$pid.$hostname.out

echo "*********The top 10 PROCESSES by CPU consumption:********************"   | tee -a /tmp/HighCpuReport.$pid.$hostname.out

ps -eo pcpu,pmem,pid,user,args --sort=-pcpu | head -11                           | tee -a /tmp/HighCpuReport.$pid.$hostname.out

echo "*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*"    | tee -a /tmp/HighCpuReport.$pid.$hostname.out

echo "   "

 

echo "****The top 10 *Threads* by CPU consumption from process $pid:*******"    | tee -a /tmp/HighCpuReport.$pid.$hostname.out

echo "  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND "    | tee -a /tmp/HighCpuReport.$pid.$hostname.out

## keep only thread rows (lines starting with a numeric ID), sort by %CPU (column 9)

grep -E '^ *[0-9]+ ' /tmp/topdashH.$pid.$hostname.out | sort -k9,9nr -k1,1n -u | head -10      | tee -a /tmp/HighCpuReport.$pid.$hostname.out

echo "*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*"    | tee -a /tmp/HighCpuReport.$pid.$hostname.out

echo "   "                                                                      | tee -a /tmp/HighCpuReport.$pid.$hostname.out

 

echo "****The top 10 *Threads* by TIME consumption from process $pid:******"    | tee -a /tmp/HighCpuReport.$pid.$hostname.out

echo "  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND "    | tee -a /tmp/HighCpuReport.$pid.$hostname.out

## sort by TIME+ (column 11); the numeric sort compares the leading minutes value

grep -E '^ *[0-9]+ ' /tmp/topdashH.$pid.$hostname.out | sort -k11,11nr -k1,1n -u | head -10     | tee -a /tmp/HighCpuReport.$pid.$hostname.out

echo "*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*"    | tee -a /tmp/HighCpuReport.$pid.$hostname.out

echo "   "

 

echo "Please check the javacore and HighCpuReport.$pid.$hostname.out under the current directory."
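
To connect a busy thread from the report with the javacore, the decimal thread ID shown by top is usually converted to hexadecimal first. The sketch below assumes the javacore lists each thread with a field of the form `native thread ID:0x...`, as IBM javacores commonly do; the function names `tid_to_hex` and `find_thread_in_javacore` are hypothetical, and the pattern would need adapting if the file uses another layout.

```shell
#!/bin/bash
# Convert a decimal thread ID (the PID column of top -H) to hexadecimal.
tid_to_hex() {
    printf '0x%X\n' "$1"
}

# Hypothetical helper: show the javacore lines that mention that thread.
# Assumes the javacore contains "native thread ID:0x..." entries.
find_thread_in_javacore() {
    local tid=$1 javacore=$2
    # -w avoids matching a longer ID that merely starts with the same digits.
    grep -i -n -w "native thread ID:$(tid_to_hex "$tid")" "$javacore"
}

tid_to_hex 7031     # prints 0x1B77
```

Once the matching line is found, the stack trace printed below it in the javacore shows what that thread was executing when the dump was taken.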

 
