性能监控之Telegraf+InfluxDB+Grafana+Python实现Oracle实时监控

Posted zuozewei

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了性能监控之Telegraf+InfluxDB+Grafana+Python实现Oracle实时监控相关的知识,希望对你有一定的参考价值。

文章目录

一、前言

监控集成选型的 Telegraf 探针,最近需要实现对 Oracle 数据库的做实时监控,查了下 Telegraf 竟然还不支持 Oracle 监控,WTF?于是自己研究了下,通过 Python + SQL 脚本折中解决了,此文去且当作小结。

二、实现的效果

三、预备知识

1、Oracle动态性能视图

动态性能视图属于数据字典,它们的所有者为SYS,并且多数动态性能视图只能由特权用户和DBA用户查询。
当数据库处于不同状态时,可以访问的动态性能视图有所不同。
启动例程时,ORACLE会自动建立动态性能视图;停止例程时,ORACLE会自动删除动态性能视图。
数据字典信息是从数据文件中获得,而动态性能视图信息是从SGA和控制文件取得。
所以,两者所反映的信息还是有很大差异的。数据库管理员利用这些动态性能视图,可以了解数据库运行的一些基本信息,为我们进行数据库维护以及数据库性能优化提供一些数据上的支持。
所有动态性能视图都是以V_$ 开始的,Oracle 为每个动态性能视图提供了相应的同义词(V$开头)

通过查询 V$FIXED_TABLE ,可以列出所有可用的动态性能视图和动态性能表。

SQL> select  * from V$FIXED_TABLE  where name like 'V$%';
 
NAME                            OBJECT_ID TYPE   TABLE_NUM
------------------------------ ---------- ----- ----------
V$WAITSTAT                     4294950915 VIEW       65537
V$BH                           4294951406 VIEW       65537
V$GC_ELEMENT                   4294951794 VIEW       65537
V$CR_BLOCK_SERVER              4294951796 VIEW       65537
V$CURRENT_BLOCK_SERVER         4294952095 VIEW       65537
V$POLICY_HISTORY               4294953128 VIEW       65537
V$ENCRYPTED_TABLESPACES        4294952996 VIEW       65537
V$GC_ELEMENTS_WITH_COLLISIONS  4294951798 VIEW       65537
V$FILE_CACHE_TRANSFER          4294951800 VIEW       65537
V$TEMP_CACHE_TRANSFER          4294951802 VIEW       65537
V$CLASS_CACHE_TRANSFER         4294951804 VIEW       65537
V$INSTANCE_CACHE_TRANSFER      4294952151 VIEW       65537
V$LOCK_ELEMENT                 4294951408 VIEW       65537
V$BSP                          4294951594 VIEW       65537
V$LOCKS_WITH_COLLISIONS        4294951410 VIEW       65537
V$FILE_PING                    4294951412 VIEW       65537
V$TEMP_PING                    4294951532 VIEW       65537
V$CLASS_PING                   4294951414 VIEW       65537
V$LOCK_ACTIVITY                4294951437 VIEW       65537
V$ROWCACHE                     4294950916 VIEW       65537

以下是不同类型的指标视图的快速表格比较:

该表的第一行是经典的等待事件和统计视图。以下几行是度量标准视图。度量标准视图是在 Oracle 10g 中引入的。

度量视图计算增量和速率,这极大地简化了解决简单问题的能力,比如 “现在我的数据库的I/O速率是多少?” 这个问题,在10g之前,处理起来出奇的乏味。要回答这个问题,你必须查询 v$sysstat,例如:

Select value from v$sysstat where name='physical reads';

但是仅查询一次 v$sysstat 不能解决问题,而是“自数据库启动以来已完成了多少I / O?”的问题。要回答原始问题,必须两次查询 v$sysstat 并接受两个值之间的增量:

  • 在时间A取值
  • 在时间B取值
  • Delta = (B-A)
  • and/or get Rate = (B-A)/elapsed time

获得这些差值和速率可能是一项艰巨的工作。然后 10g Oracle 引入了度量标准表,这些度量表可以在一个查询中解决问题。

等待事件视图为(系统级别)

  • V$SYSTEM_EVENT – 自启动以来累积的等待事件
  • V$EVENTMETRIC - 等待事件增量持续60秒
  • DBA_HIST_SYSTEM_EVENT – 自启动以来累计的上周按快照(小时)的等待事件

等待事件汇总到称为等待类的组中。对于等待类,有以下视图:

  • V$SYSTEM_WAIT_CLASS – 自启动以来累积
  • V$WAITCLASSMETRIC – 持续60秒增量
  • V$WAITCLASSMETRIC_HISTORY – 最后一小时的60秒增量

注意:DBA_HIST_WAITCLASSMETRIC_HISTORY 用于警报或基准,而不是日常值。

其他的就不一一展开了,具体可以参考下文:
http://datavirtualizer.com/wait-event-and-wait-class-metrics-vs-vsystem_event/

2、cx_Oracle

cx_Oracle 是一个 Python 扩展模块,可以访问 Oracle 数据库。它符合 Python 数据库API 2.0 规范。

2.1、基本要求

要在 Python 和 Oracle 数据库中使用 cx_Oracle 7,需要满足以下条件:

  • Python 2.7或 3.5 及更高版本。
  • Oracle 客户端库。
  • Oracle 数据库。Oracle的标准客户端 - 服务器版本互操作性允许cx_Oracle连接到较旧和较新的数据库。(推荐)

2.2、快速安装

在 Linux 上安装 cx_Oracle 的一般方法是使用 Python 的 Pip 包从 PyPI 安装 cx_Oracle

从 PyPI 安装 cx_Oracle:

python -m pip install cx_Oracle --upgrade

将 Oracle 客户端库添加到操作系统库搜索路径,例如 Linux 的 LD_LIBRARY_PATH

如果你的数据库位于远程计算机上,请下 适用于你的操作系统体系结构的免费Oracle Instant Client “Basic” 或 “Basic Light” 包

至于具体的 Oracle Client 安装,可以参考下文:
https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html#installing-cx-oracle-on-linux

四、解决方案

  • Python:收集 Oracle 指标数据
  • Telegraf:收集 Python 打印的性能指标数据
  • InfluxDB:存储时间序列 Oracle 性能指标数据
  • Grafana:可视化 Dashboard

1、安装

具体的安装可以参考官方文档:

  • Telegraf:https://docs.influxdata.com/telegraf/v1.12/introduction/installation/
  • InfluxDB:https://docs.influxdata.com/influxdb/v1.7/introduction/installation/
  • Grafanak:https://grafana.com/docs/installation/rpm/

2、具体设置

在 InfluxDB 中创建一个 Telegraf 数据库:

[root@zuozewei ~]# influx
Connected to http://localhost:8086 version 1.6.2
InfluxDB shell version: 1.6.2
> create user "telegraf" with password 'telegraf'
> create database telegraf
> show databases
name: databases
name
----
_internal
telegraf

编写 python+sql 脚本以收集 oracle 指标。脚本的输出内容很重要,必须是 InfluxDB line-protocol。该脚本查询 v$ SYSMETRICv$eventmetric ,获得最后一分钟时,等待类和等待事件指标。

python代码是:

import socket,argparse,subprocess,re,cx_Oracle

fqdn = socket.getfqdn()

class OraStats():

    def __init__(self, user, passwd, sid):
        self.user = user
        self.passwd = passwd
        self.sid = sid
        self.delengine = "none"
        connstr=self.user+'/'+self.passwd+'@'+self.sid
        self.connection = cx_Oracle.connect(connstr)
        cursor = self.connection.cursor()
        cursor.execute("select distinct(SVRNAME)  from v$dnfs_servers")
        rows = cursor.fetchall()

        for i in range(0, cursor.rowcount):
            self.dengine_ip = rows[i][0]
            proc = subprocess.Popen(["nslookup", self.dengine_ip], stdout=subprocess.PIPE)
            lookupresult = proc.communicate()[0].split('\\n')

            for line in lookupresult:
                if 'name=' in re.sub(r'\\s', '', line):
                    self.delengine = re.sub('\\..*$', '', re.sub(r'^.*name=', '', re.sub(r'\\s', '', re.sub(r'.$', '', line))))

    # 等待类别
    def waitclassstats(self, user, passwd, sid, format):
        cursor = self.connection.cursor()
        cursor.execute("""
        select n.wait_class, round(m.time_waited/m.INTSIZE_CSEC,3) AAS
        from   v$waitclassmetric  m, v$system_wait_class n
        where m.wait_class_id=n.wait_class_id and n.wait_class != 'Idle'
        union
        select  'CPU', round(value/100,3) AAS
        from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2
        union select 'CPU_OS', round((prcnt.busy*parameter.cpu_count)/100,3) - aas.cpu
        from
        ( select value busy
        from v$sysmetric
        where metric_name='Host CPU Utilization (%)'
         and group_id=2 ) prcnt,
        ( select value cpu_count from v$parameter where name='cpu_count' )  parameter,
        ( select  'CPU', round(value/100,3) cpu from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2) aas
        """)
        for wait in cursor:
            wait_name = wait[0]
            wait_value = wait[1]
            print ("oracle_wait_class,fqdn=0,delphix=1,db=2,wait_class=3 wait_value=4".format(fqdn, self.delengine, sid, re.sub(' ', '_', wait_name), wait_value))

    # 系统指标
    def sysmetrics(self, user, passwd, sid, format):
        cursor = self.connection.cursor()
        cursor.execute("""
        select METRIC_NAME,VALUE,METRIC_UNIT from v$sysmetric where group_id=2
        """)
        for metric in cursor:
            metric_name = metric[0]
            metric_value = metric[1]
            print ("oracle_sysmetric,fqdn=0,delphix=1,db=2,metric_name=3 metric_value=4".format(fqdn,self.delengine,sid,re.sub(' ', '_', metric_name),metric_value))

    # 在闪回恢复区中有关磁盘配额和当前磁盘使用情况
    def fraused(self, user, passwd, sid, format):
        cursor = self.connection.cursor()
        cursor.execute("""
        select round((SPACE_USED-SPACE_RECLAIMABLE)*100/SPACE_LIMIT,1) from  V$RECOVERY_FILE_DEST
        """)
        for frau in cursor:
            fra_used = frau[0]
            print ("oracle_fra_pctused,fqdn=0,delphix=1,db=2 fra_pctused=3".format(fqdn,self.delengine,sid,fra_used))

    # 磁盘使用状态
    def fsused(self):
     fss = ['/oracle', '/data']
     for fs in fss:
            df = subprocess.Popen(["df","-P",fs], stdout=subprocess.PIPE)
            output = df.communicate()[0]
            total = re.sub('%','',output.split("\\n")[1].split()[1])
            used = re.sub('%','',output.split("\\n")[1].split()[2])
            pctused = re.sub('%','',output.split("\\n")[1].split()[4])
            print("oracle_fs_pctused,fqdn=0,fs_name=1 oraclefs_pctused=2,oraclefs_alloc=3,oraclefs_used=4".format(fqdn,fs,pctused,total,used))

    # 等待状态
    def waitstats(self, user, passwd, sid, format):
        cursor = self.connection.cursor()
        cursor.execute("""
        select /*+ ordered use_hash(n) */
        n.wait_class wait_class,
        n.name wait_name,
        m.wait_count  cnt,
        nvl(round(10*m.time_waited/nullif(m.wait_count,0),3) ,0) avg_ms
        from v$eventmetric m,
        v$event_name n
        where m.event_id=n.event_id
        and n.wait_class <> 'Idle' and m.wait_count > 0 order by 1""")
        for wait in cursor:
            wait_class = wait[0]
            wait_name = wait[1]
            wait_cnt = wait[2]
            wait_avgms = wait[3]
            print ("oracle_wait_event,fqdn=0,delphix=1,db=2,wait_class=3,wait_event=4 count=5,latency=6".format(fqdn, self.delengine,sid,re.sub(' ', '_', wait_class), re.sub(' ','_',wait_name),wait_cnt,wait_avgms))

    # 表空间使用状态
    def tbsstats(self, user, passwd, sid, format):
        cursor = self.connection.cursor()
        cursor.execute("""
        select /*+ ordered */ tablespace_name,
            round(used_space),
            round(max_size-used_space) free_space,
            round(max_size),
            round(used_space*100/max_size,2) percent_used
            from (
                select m.tablespace_name,
                m.used_space*t.block_size/1024/1024 used_space,
                (case when t.bigfile='YES' then power(2,32)*t.block_size/1024/1024
                        else tablespace_size*t.block_size/1024/1024 end) max_size
            from dba_tablespace_usage_metrics m, dba_tablespaces t
        where m.tablespace_name=t.tablespace_name)
        """)
        for tbs in cursor:
            tbs_name = tbs[0]
            used_space_mb = tbs[1]
            free_space_mb = tbs[2]
            max_size_mb = tbs[3]
            percent_used = tbs[4]
            print ("oracle_tablespaces,fqdn=0,delphix=1,db=2,tbs_name=3 used_space_mb=4,free_space_mb=5,percent_used=6,max_size_mb=7".format(fqdn, self.delengine, sid, re.sub(' ', '_', tbs_name), used_space_mb,free_space_mb,percent_used,max_size_mb))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-f', '--format', help="Output format, default influx", choices=['kafka', 'influx'], default='influx')
    subparsers = parser.add_subparsers(dest='stat')
    parser_all = subparsers.add_parser('ALL', help="Get all database stats")
    parser_all.add_argument('-u', '--user', help="Username with sys views grant", required=True)
    parser_all.add_argument('-p', '--passwd', required=True)
    parser_all.add_argument('-s', '--sid', help="tnsnames SID to connect", required=True)

    args = parser.parse_args()

    if args.stat == "ALL":
        stats = OraStats(args.user, args.passwd, args.sid)
        stats.waitclassstats(args.user, args.passwd, args.sid, args.format)
        stats.waitstats(args.user, args.passwd, args.sid, args.format)
        stats.sysmetrics(args.user, args.passwd, args.sid, args.format)
        stats.tbsstats(args.user, args.passwd, args.sid, args.format)
        stats.fraused(args.user, args.passwd, args.sid, args.format)
        stats.fsused()

输出格式化为 InfluxDB line-protocol

[root@localhost tools]# ./oracle.sh 
oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Application wait_value=0
oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=CPU wait_value=0.003
oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=CPU_OS wait_value=0.778
oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Commit wait_value=0
oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Concurrency wait_value=0.001
oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Configuration wait_value=0
oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Network wait_value=0
oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Other wait_value=0
oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=System_I/O wait_value=0.001
oracle_wait_class,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=User_I/O wait_value=0
oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Commit,wait_event=log_file_sync count=2,latency=0.122
oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Concurrency,wait_event=os_thread_startup count=2,latency=21.595
oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Network,wait_event=SQL*Net_message_to_client count=17,latency=0.001
oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=Other,wait_event=asynch_descriptor_resize count=4,latency=0.001
oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=System_I/O,wait_event=db_file_parallel_write count=2,latency=0.081
oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=System_I/O,wait_event=control_file_parallel_write count=24,latency=0.268
oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=System_I/O,wait_event=control_file_sequential_read count=71,latency=0.716
oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=System_I/O,wait_event=log_file_parallel_write count=7,latency=0.076
oracle_wait_event,fqdn=localhost.localdomain,delphix=none,db=172.16.106.251:1521/orcl,wait_class=User_I/O,wait_event=Disk_file_operations_I/O count=16,laten

定义一个 shell 脚本执行 Python 脚本

#!/usr/bin/env bash
python /home/oracle/scripts/oracle_metrics.sh -f "influx" "ALL" "-u" "system" "-p" "xxxx" "-s" "172.16.106.251:1521/orcl"

在 oracle主机上,配置 telegraf 以60秒的间隔执行 python sh,然后将输出发送到 InfluxDB。
编辑 /etc/telegraf/telegraf.conf 配置文件:

# Telegraf configuration

# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.

# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.

# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.

# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"
  host="Dprima"
  collector="telegraf"

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will cache metric_buffer_limit metrics for each output, and will
  ## flush this buffer on a successful write.
  metric_buffer_limit = 10000
  ## Flush the buffer whenever full, regardless of flush_interval.
  flush_buffer_when_full = true

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "60s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## Run telegraf in debug mode
  debug = false
  ## Run telegraf in quiet mode
  quiet = false
  ## Override default hostname, if empty use os.Hostname()
  hostname = "Dprima"


###############################################################################
#                                  OUTPUTS                                    #
###############################################################################

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  urls = ["http://influxgraf:8086"] # required
  database = "telegraf" # required
  precision = "s"
  timeout = "5s"

[[outputs.influxdb]]
  urls = ["http://localhost:9092"] # required
  database = "kapacitor" # required
  precision = "s"
  retention_policy = "default"
  timeout = "5s"

#[[outputs.file]]
#  files=["/home/oracle/scripts/telegraf_debug.txt"]
###############################################################################
#                                  INPUTS                                     #
###############################################################################

# Oracle metrics
[[inputs.exec]]
  # Shell/commands array
  commands = ["/home/oracle/scripts/oracle_metrics.sh"]
  # Data format to consume. This can be "json", "influx" or "graphite" (line-protocol)
  # NOTE json only reads numerical measurements, strings and booleans are ignored.
  data_format = "influx"
  interval = "60s"

###############################################################################
#                              SERVICE INPUTS                                 #
###############################################################################

启动 telegraf:

telegraf -config /etc/telegraf/telegraf.conf

3、数据可视化

查询 InfluxDB 数据

以上是关于性能监控之Telegraf+InfluxDB+Grafana+Python实现Oracle实时监控的主要内容,如果未能解决你的问题,请参考以下文章

Telegraf+Influxdb+Chronograf+Kapacitor主机性能监控告警

Linux下部署grafana + influxdb + telegraf , 构建性能监控平台

grafana + influxdb + telegraf , 构建性能监控平台

Jmeter高级性能测试-Telegraf+Influxdb+Granfana实时监控系统

Spring Boot Actutaur + Telegraf + InFluxDB + Grafana 构建监控平台之应用数据分析

telegraf+influxdb+grafana开源监控架构