Monitoring a Service Port with Python and Sending Alerts

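Recently I noticed that the port of a Socket service in our company's test environment kept going down for no apparent reason, even though the service itself was still running. It looked as if the service had hung.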

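Even though it is only a test environment, it can't just be left like that, so I wrote a simple monitoring script overnight. Because the server runs Windows, the script uses the wmi module. The logic is as follows: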

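1. Use the wmi module to collect the services on the system that are in the stopped state and build a dictionary from them.

2. Check whether the monitored service is in that dictionary. If it is, the service has stopped, so the script tries to start it and sends an alert email.

3. Send a connect to the local Socket service port. If an exception is caught, the script tries to restart the service and sends an alert email.

4. On each run, the script repeats the steps above up to three times, 10 seconds apart, to make sure the service is in a normal state.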
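
Steps 3 and 4 can be sketched on their own, independent of Windows. This is a minimal illustration rather than the script's actual code: `port_is_open` and `check_with_retries` are hypothetical names, and the restart/alert action is passed in as a callback so that the loop can be tried without a real service.

```python
import socket
import time


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_with_retries(probe, on_failure, retries=3, interval=10.0):
    """Call probe() up to `retries` times; run on_failure() after each miss.

    Returns True as soon as probe() succeeds, False if every attempt fails.
    """
    for attempt in range(retries):
        if probe():
            return True
        on_failure()
        if attempt < retries - 1:  # no need to wait after the last attempt
            time.sleep(interval)
    return False
```

For the service described here, the call might look like `check_with_retries(lambda: port_is_open('127.0.0.1', 20000), restart_and_alert)`, where `restart_and_alert` stands for whatever restart and mail logic is in use.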

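While running the script I noticed a problem: Python is remarkably slow when it operates on Windows through the wmi module. I don't know whether there is an alternative; if anyone knows a better way, please point it out.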
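
One possible alternative, offered as an untested-for-speed sketch, is to ask the `sc` command about a single service instead of enumerating every service through WMI. The function names below are hypothetical, and the parser assumes the usual English `STATE : <code> <NAME>` line in the `sc query` output; whether this is actually faster on a given machine would need to be measured.

```python
import re
import subprocess

# Matches a line such as "STATE              : 4  RUNNING"
_STATE_RE = re.compile(r'STATE\s*:\s*(\d+)\s+(\w+)')


def parse_sc_state(output):
    """Extract the state name (e.g. 'RUNNING', 'STOPPED') from `sc query` output.

    Returns None if no STATE line is found.
    """
    match = _STATE_RE.search(output)
    return match.group(2) if match else None


def query_service_state(name):
    """Run `sc query <name>` and return the parsed state (Windows only)."""
    result = subprocess.run(['sc', 'query', name],
                            capture_output=True, text=True)
    return parse_sc_state(result.stdout)
```

With this, `query_service_state(name) == 'STOPPED'` could stand in for the dictionary lookup in step 2.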

 

The source code is as follows:

#!/usr/bin/env python

import os
import wmi
import time
import socket
import base64
import smtplib
import logging
from email.mime.text import MIMEText


def GetSrv(designation):
    """Collect the captions of all stopped services and check
    whether the 'designation' service is among them.

    :return: 'down' if the service is stopped, otherwise True
    """
    c = wmi.WMI()
    ret = dict()
    for service in c.Win32_Service():
        # Caption is the service's display name
        state, caption = service.State, service.Caption
        if state == 'Stopped':
            t = ret.get(state, [])
            t.append(caption)
            ret[state] = t
    # If the 'designation' service is among the stopped ones, return 'down'.
    # The [] default avoids a TypeError when no service is stopped.
    if designation in ret.get('Stopped', []):
        logging.error('Service [%s] is down, try to restart the service. \r\n' % designation)
        return 'down'
    return True


def Monitor(sname):
    """Try to connect to port 20000 on the local machine.

    :return: the string 'ex' if the connection fails, otherwise True
    """
    s = socket.socket()
    s.settimeout(3)  # timeout in seconds
    host = ('127.0.0.1', 20000)
    try:  # Try to connect to the host
        s.connect(host)
    except socket.error as e:
        logging.warning('[%s] service connection failed: %s \r\n' % (sname, e))
        return 'ex'
    finally:
        s.close()  # always release the socket
    return True


def RestartSocket(rstname, conn, run):
    """First check whether the service is stopped;
    if it is, start the service directly.
    Then check whether the service has hung;
    if it has, restart the service.

    :return: True on success or if nothing needed doing, False on failure
    """
    flag = False
    try:
        # 'run' is the return value of GetSrv(), 'conn' that of Monitor()
        if run == 'down':
            ret = os.system('sc start "%s"' % rstname)
            if ret != 0:
                raise Exception('[Errno %s]' % ret)
            flag = True
        elif conn == 'ex':
            retStop = os.system('sc stop "%s"' % rstname)
            retStart = os.system('sc start "%s"' % rstname)
            if retStart != 0:
                raise Exception('retStop [Status code %s] '
                                'retStart [Status code %s]' % (retStop, retStart))
            flag = True
        else:
            logging.info('[%s] service running status is normal' % rstname)
            return True
    except Exception as e:
        logging.warning('[%s] service restart failed: %s \r\n' % (rstname, e))
    # Returned on both the success and the failure path
    return flag


def SendMail(to_list, sub, contents):
    """Send an alarm mail.

    :return: True if the mail was sent, otherwise False
    """
    mail_server = 'mail.stmp.com'  # SMTP server (placeholder)
    mail_user = 'YouAccount'  # Mail account (placeholder)
    # Placeholder: put the Base64-encoded password here.
    # Base64 is an encoding, not encryption.
    mail_pass = base64.b64decode('Password').decode('utf-8')
    mail_postfix = 'smtp.com'  # Domain name (placeholder)

    me = 'Monitor alarm<%s@%s>' % (mail_user, mail_postfix)
    message = MIMEText(contents, _subtype='html', _charset='utf-8')

    message['Subject'] = sub
    message['From'] = me
    message['To'] = ', '.join(to_list)  # header addresses are comma-separated

    flag = False  # Whether the mail was sent successfully
    try:
        s = smtplib.SMTP()
        s.connect(mail_server)
        s.login(mail_user, mail_pass)
        s.sendmail(me, to_list, message.as_string())
        s.close()
        flag = True
    except Exception as e:
        logging.warning('Send mail failed, exception: [%s]. \r\n' % e)

    return flag


def main(sname):
    """Take the name of the service to monitor, run the functions
    defined above in turn, and return the result.
    Each run tests up to three times,
    10 seconds apart.

    :return: retValue
    """
    retry = 3
    count = 0
    retValue = False  # State of the socket connection
    while count < retry:
        ret = Monitor(sname)
        if ret != 'ex':  # If the socket connection is normal, return retValue
            retValue = ret
            return retValue
        isDown = GetSrv(sname)
        RestartSocket(rstname=sname, conn=ret, run=isDown)

        host = socket.gethostname()
        address = socket.gethostbyname(host)
        # Alarm contacts; the address is redacted in the source,
        # replace it with a real recipient
        mailto_list = ['[email protected]', ]
        SendMail(mailto_list, 'Alarm',
                 '<h4>Level: <u>ERROR</u><br> Host name: %s<br>'
                 'IP Address: %s<br>'
                 'Service name:</h4> <h5>%s</h5>'
                 % (host, address, sname))
        count += 1
        time.sleep(10)
    else:
        logging.error('[%s] service try to restart more than three times \r\n' % sname)

    return retValue


if __name__ == '__main__':

    # filemode 'a' is text append; the original 'ab' only suits Python 2
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s',
                        datefmt='%Y/%m/%d %H:%M:%S',
                        filename='D:\\em_logs\\SocketMonitor.log',
                        filemode='a')

    name = 'IM_IMConnectorServerWinService'
    response = main(name)
    if response:
        logging.info('The [%s] service connection is normal \r\n' % name)
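
The alert mail in the script is built and sent in one function, which makes it hard to check without an SMTP server. As a small sketch under that observation, the message can be built separately and inspected before anything is sent. `build_alert` is a hypothetical helper, and the addresses in the usage line are example values only.

```python
from email.mime.text import MIMEText


def build_alert(sender, recipients, host, address, service):
    """Build an HTML alert message; sending it is left to the caller."""
    body = ('<h4>Level: <u>ERROR</u><br>Host name: %s<br>'
            'IP Address: %s<br>Service name:</h4> <h5>%s</h5>'
            % (host, address, service))
    message = MIMEText(body, _subtype='html', _charset='utf-8')
    message['Subject'] = 'Alarm'
    message['From'] = sender
    message['To'] = ', '.join(recipients)  # comma-separated address list
    return message
```

A call such as `build_alert('monitor@example.com', ['ops@example.com'], host, address, sname).as_string()` yields the text that `smtplib.SMTP.sendmail` expects as its third argument.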

 
