Logstash grok

Posted by 终点即起点


nginx matching example

nginx log format
'$remote_user [$time_local]  $http_x_Forwarded_for $remote_addr  $request $status $upstream_status'
                       '$http_x_forwarded_for'
                       '$upstream_addr '
                       'ups_resp_time: $upstream_response_time '
                       'request_time: $request_time';
Sample nginx log line
- [09/May/2023:15:01:31 +0800]  11.20.1.30 38.34.246.127  GET / HTTP/1.1 200 -11.20.1.30- ups_resp_time: - request_time: 0.000
grok pattern
filter {
   grok {
       match => {
         "message" => "%{DATA:remote_user} \[%{HTTPDATE:log_times}\]  %{IPV4:http_x_Forwarded_for} %{IPV4:remote_addr}  %{WORD:request_method} %{DATA:uri} HTTP/%{NUMBER:http_version} %{NUMBER:response_code} %{DATA:upstream_status}%{IPV4:http_x_forwarded_for}%{DATA:upstream_addr} ups_resp_time: %{DATA:ups_resp_time} request_time: %{NUMBER:request_time}"
       }
   }
}
Parsed result
{
    "http_x_Forwarded_for" => "11.20.1.30",
                    "host" => "elk3",
                 "message" => "- [09/May/2023:15:01:31 +0800]  11.20.1.30 38.34.246.127  GET / HTTP/1.1 200 -11.20.1.30- ups_resp_time: - request_time: 0.000",
          "request_method" => "GET",
         "upstream_status" => "-",
           "ups_resp_time" => "-",
            "request_time" => "0.000",
             "remote_user" => "-",
               "log_times" => "09/May/2023:15:01:31 +0800",
           "upstream_addr" => "-",
                "@version" => "1",
              "@timestamp" => 2023-05-09T08:12:35.912Z,
            "http_version" => "1.1",
             "remote_addr" => "38.34.246.127",
    "http_x_forwarded_for" => "11.20.1.30",
                     "uri" => "/",
           "response_code" => "200"
}

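As a rough cross-check of the field boundaries in the nginx example, the same sample line can be split with an ordinary Python regex. This is only a sketch under stated assumptions: IPV4, HTTPDATE and NUMBER below are simplified stand-ins written for this example, not the real grok definitions, and Python spells named groups (?P<name>...) rather than (?<name>...).

```python
import re

# Simplified stand-ins for the grok patterns (assumptions, not the real definitions).
IPV4 = r"\d{1,3}(?:\.\d{1,3}){3}"
HTTPDATE = r"\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4}"
NUMBER = r"\d+(?:\.\d+)?"

# Same field order and literal spacing as the grok expression; .*? plays the role of DATA.
NGINX_LINE = re.compile(
    rf"(?P<remote_user>.*?) \[(?P<log_times>{HTTPDATE})\]  "
    rf"(?P<http_x_Forwarded_for>{IPV4}) (?P<remote_addr>{IPV4})  "
    rf"(?P<request_method>\w+) (?P<uri>.*?) HTTP/(?P<http_version>{NUMBER}) "
    rf"(?P<response_code>{NUMBER}) (?P<upstream_status>.*?)"
    rf"(?P<http_x_forwarded_for>{IPV4})(?P<upstream_addr>.*?) "
    rf"ups_resp_time: (?P<ups_resp_time>.*?) request_time: (?P<request_time>{NUMBER})"
)

sample = (
    "- [09/May/2023:15:01:31 +0800]  11.20.1.30 38.34.246.127  "
    "GET / HTTP/1.1 200 -11.20.1.30- ups_resp_time: - request_time: 0.000"
)

match = NGINX_LINE.search(sample)
fields = match.groupdict() if match else {}
for name, value in fields.items():
    print(f"{name:>22} => {value!r}")
```

The lazy .*? groups only work because each one is followed by a literal or a stricter pattern, which is also why the grok version can place %{DATA:upstream_status} directly before %{IPV4:http_x_forwarded_for}.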
 

grok syntax

%{SYNTAX:SEMANTIC}
%{name of a predefined pattern:custom field name}

Built-in patterns

 USERNAME [a-zA-Z0-9._-]+
 USER %{USERNAME}
 EMAILLOCALPART [a-zA-Z][a-zA-Z0-9_.+-=:]+
 EMAILADDRESS %{EMAILLOCALPART}@%{HOSTNAME}
 INT (?:[+-]?(?:[0-9]+))
 BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
 NUMBER (?:%{BASE10NUM})
 BASE16NUM (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))
 BASE16FLOAT \b(?<![0-9A-Fa-f.])(?:[+-]?(?:0x)?(?:(?:[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]*)?)|(?:\.[0-9A-Fa-f]+)))\b

 POSINT \b(?:[1-9][0-9]*)\b
 NONNEGINT \b(?:[0-9]+)\b
 WORD \b\w+\b
 NOTSPACE \S+
 SPACE \s*
 DATA .*?
 GREEDYDATA .*
 QUOTEDSTRING (?>(?<!\\)(?>"(?>\\.|[^\\"]+)+"|""|(?>'(?>\\.|[^\\']+)+')|''|(?>`(?>\\.|[^\\`]+)+`)|``))
 UUID [A-Fa-f0-9]{8}-(?:[A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}
 # URN, allowing use of RFC 2141 section 2.3 reserved characters
 URN urn:[0-9A-Za-z][0-9A-Za-z-]{0,31}:(?:%[0-9a-fA-F]{2}|[0-9A-Za-z()+,.:=@;$_!*'/?#-])+
 
 # Networking
 MAC (?:%{CISCOMAC}|%{WINDOWSMAC}|%{COMMONMAC})
 CISCOMAC (?:(?:[A-Fa-f0-9]{4}\.){2}[A-Fa-f0-9]{4})
 WINDOWSMAC (?:(?:[A-Fa-f0-9]{2}-){5}[A-Fa-f0-9]{2})
 COMMONMAC (?:(?:[A-Fa-f0-9]{2}:){5}[A-Fa-f0-9]{2})
 IPV6 ((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?
 IPV4 (?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])
 IP (?:%{IPV6}|%{IPV4})
 HOSTNAME \b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)
 IPORHOST (?:%{IP}|%{HOSTNAME})
 HOSTPORT %{IPORHOST}:%{POSINT}
 
 # paths
 PATH (?:%{UNIXPATH}|%{WINPATH})
 UNIXPATH (/([\w_%!$@:.,+~-]+|\\.)*)+
 TTY (?:/dev/(pts|tty([pq])?)(\w+)?/?(?:[0-9]+))
 WINPATH (?>[A-Za-z]+:|\\)(?:\\[^\\?*]*)+
 URIPROTO [A-Za-z]([A-Za-z0-9+\-.]+)+
 URIHOST %{IPORHOST}(?::%{POSINT:port})?
 # uripath comes loosely from RFC1738, but mostly from what Firefox
 # doesn't turn into %XX
 URIPATH (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%&_\-]*)+
 #URIPARAM \?(?:[A-Za-z0-9]+(?:=(?:[^&]*))?(?:&(?:[A-Za-z0-9]+(?:=(?:[^&]*))?)?)*)?
 URIPARAM \?[A-Za-z0-9$.+!*'|(){},~@#%&/=:;_?\-\[\]<>]*
 URIPATHPARAM %{URIPATH}(?:%{URIPARAM})?
 URI %{URIPROTO}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST})?(?:%{URIPATHPARAM})?
 
 # Months: January, Feb, 3, 03, 12, December
 MONTH \b(?:[Jj]an(?:uary|uar)?|[Ff]eb(?:ruary|ruar)?|[Mm](?:a|ä)?r(?:ch|z)?|[Aa]pr(?:il)?|[Mm]a(?:y|i)?|[Jj]un(?:e|i)?|[Jj]ul(?:y)?|[Aa]ug(?:ust)?|[Ss]ep(?:tember)?|[Oo](?:c|k)?t(?:ober)?|[Nn]ov(?:ember)?|[Dd]e(?:c|z)(?:ember)?)\b
 MONTHNUM (?:0?[1-9]|1[0-2])
 MONTHNUM2 (?:0[1-9]|1[0-2])
 MONTHDAY (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])
 
 # Days: Monday, Tue, Thu, etc...
 DAY (?:Mon(?:day)?|Tue(?:sday)?|Wed(?:nesday)?|Thu(?:rsday)?|Fri(?:day)?|Sat(?:urday)?|Sun(?:day)?)
 
 # Years?
 YEAR (?>\d\d){1,2}
 HOUR (?:2[0123]|[01]?[0-9])
 MINUTE (?:[0-5][0-9])
 # '60' is a leap second in most time standards and thus is valid.
 SECOND (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)
 TIME (?!<[0-9])%{HOUR}:%{MINUTE}(?::%{SECOND})(?![0-9])
 # datestamp is YYYY/MM/DD-HH:MM:SS.UUUU (or something like it)
 DATE_US %{MONTHNUM}[/-]%{MONTHDAY}[/-]%{YEAR}
 DATE_EU %{MONTHDAY}[./-]%{MONTHNUM}[./-]%{YEAR}
 ISO8601_TIMEZONE (?:Z|[+-]%{HOUR}(?::?%{MINUTE}))
 ISO8601_SECOND (?:%{SECOND}|60)
 TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
 DATE %{DATE_US}|%{DATE_EU}
 DATESTAMP %{DATE}[- ]%{TIME}
 TZ (?:[APMCE][SD]T|UTC)
 DATESTAMP_RFC822 %{DAY} %{MONTH} %{MONTHDAY} %{YEAR} %{TIME} %{TZ}
 DATESTAMP_RFC2822 %{DAY}, %{MONTHDAY} %{MONTH} %{YEAR} %{TIME} %{ISO8601_TIMEZONE}
 DATESTAMP_OTHER %{DAY} %{MONTH} %{MONTHDAY} %{TIME} %{TZ} %{YEAR}
 DATESTAMP_EVENTLOG %{YEAR}%{MONTHNUM2}%{MONTHDAY}%{HOUR}%{MINUTE}%{SECOND}

 # Syslog Dates: Month Day HH:MM:SS
 SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}
 PROG [\x21-\x5a\x5c\x5e-\x7e]+
 SYSLOGPROG %{PROG:program}(?:\[%{POSINT:pid}\])?
 SYSLOGHOST %{IPORHOST}
 SYSLOGFACILITY <%{NONNEGINT:facility}.%{NONNEGINT:priority}>
 HTTPDATE %{MONTHDAY}/%{MONTH}/%{YEAR}:%{TIME} %{INT}
 
 # Shortcuts
 QS %{QUOTEDSTRING}

 # Log formats
 SYSLOGBASE %{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:
 
 # Log Levels
 LOGLEVEL ([Aa]lert|ALERT|[Tt]race|TRACE|[Dd]ebug|DEBUG|[Nn]otice|NOTICE|[Ii]nfo|INFO|[Ww]arn?(?:ing)?|WARN?(?:ING)?|[Ee]rr?(?:or)?|ERR?(?:OR)?|[Cc]rit?(?:ical)?|CRIT?(?:ICAL)?|[Ff]atal|FATAL|[Ss]evere|SEVERE|EMERG(?:ENCY)?|[Ee]merg(?:ency)?)
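
Many of the definitions above refer to other definitions (USER is built on USERNAME, SYSLOGPROG on PROG and POSINT), so a reference has to be expanded recursively before it becomes a plain regex. The Python sketch below illustrates that idea on a handful of the listed patterns. It is not how Logstash is implemented: the expand helper and the PATTERNS table are hypothetical names for this example, and named groups are emitted in Python's (?P<name>...) spelling.

```python
import re

# A small subset of the definitions listed above, in %{NAME} form.
PATTERNS = {
    "USERNAME": r"[a-zA-Z0-9._-]+",
    "USER": r"%{USERNAME}",
    "POSINT": r"\b(?:[1-9][0-9]*)\b",
    "PROG": r"[\x21-\x5a\x5c\x5e-\x7e]+",
    "SYSLOGPROG": r"%{PROG:program}(?:\[%{POSINT:pid}\])?",
}

# One reference: %{NAME} or %{NAME:field}
REF = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def expand(expression, patterns=PATTERNS):
    """Replace every %{NAME[:field]} reference with its fully expanded regex."""

    def substitute(match):
        name, field = match.group(1), match.group(2)
        body = expand(patterns[name], patterns)  # definitions may nest
        # A field name becomes a named group; otherwise keep the body grouped.
        return f"(?P<{field}>{body})" if field else f"(?:{body})"

    return REF.sub(substitute, expression)


syslogprog = re.compile(expand("%{SYSLOGPROG}"))
fields = syslogprog.fullmatch("sshd[1234]").groupdict()
print(expand("%{USER}"))
print(fields)
```

Definitions that use atomic groups such as (?>...) are left out of the table on purpose, since Python only accepts that syntax from version 3.11 on.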

 

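Logstash grok usage examples

Grok is arguably the most important Logstash plugin. You can predefine named regular expressions in grok and reference them later, either in grok options or inside other regular expressions. It works well for syslog logs, Apache and other web server logs, and MySQL logs. Grok ships with many predefined patterns, and you can also define your own.

grok syntax:

%{SYNTAX:SEMANTIC}

SYNTAX is the name of a pattern that grok already defines; SEMANTIC is the field name you choose for the matched text.

For example, given 192.168.0.100,

%{IP:client} matches the IP address and stores it in the field client.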
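
To make the two halves of a reference concrete, the following Python sketch splits a grok expression into its (SYNTAX, SEMANTIC) pairs. The grok_references helper is a hypothetical name for this example, and the matcher is deliberately simplified: it assumes both parts consist of word characters only.

```python
import re

# One grok reference: %{SYNTAX} or %{SYNTAX:SEMANTIC}
GROK_REF = re.compile(r"%\{(?P<syntax>\w+)(?::(?P<semantic>\w+))?\}")


def grok_references(expression):
    """Return (SYNTAX, SEMANTIC) pairs in order; SEMANTIC is None when omitted."""
    return [
        (m.group("syntax"), m.group("semantic"))
        for m in GROK_REF.finditer(expression)
    ]


pairs = grok_references("%{IP:client} %{WORD:method} %{NUMBER}")
print(pairs)
```

A reference without a SEMANTIC part, like %{NUMBER} here, still has to match but does not produce a field.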


Suppose a web server log contains lines in the following format:

55.3.244.1 GET /index.html 15824 0.043

Grok can split this line into fields with the following expression:

%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}

In a configuration file it usually looks like this:

input {
  file {
    path => "/var/log/http.log"
  }
}
filter {
  grok {
    match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
  }
}


After the grok filter runs, the event contains the following fields:

  • client: 55.3.244.1

  • method: GET

  • request: /index.html

  • bytes: 15824

  • duration: 0.043

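How do you define a custom pattern?

Syntax: (?<field_name>the pattern here)

Suppose you have the following content:

begin 123.456 end

To capture 123.456 into a request_time field, the regular expression can be written like this:

\s+(?<request_time>\d+(?:\.\d+)?)\s+

Explanation:

\s: matches any whitespace character, including spaces, tabs, and form feeds; equivalent to [ \f\n\r\t\v]. The + means one or more occurrences.

(?<request_time>  ): a named capture group, which grok supports; request_time is the field name that receives the captured text.

\d+: matches one or more digits.

(?:\.\d+): a non-capturing group. (?: pattern) matches pattern but does not store the result for later use. This is handy when combining parts of a pattern with the alternation character "|". For example, "industr(?:y|ies)" is more concise than "industry|industries".

\.\d+: a dot followed by one or more digits. In (?:\.\d+)?, the trailing ? means this fractional part occurs zero or one time; when it is absent, request_time is an integer. So the expression captures values such as 123.456 or 123. A value like 123.4.5.6 does not match, because only one fractional part is allowed before the trailing whitespace.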
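
The behaviour of this expression is easy to check outside Logstash. The sketch below runs it in Python against three inputs; the only change is the named-group spelling, since Python's re module writes (?P<name>...) where grok accepts (?<name>...). The request_time helper is a hypothetical name for this example.

```python
import re

# Same expression as above, with Python's (?P<name>...) named-group spelling.
REQUEST_TIME = re.compile(r"\s+(?P<request_time>\d+(?:\.\d+)?)\s+")


def request_time(line):
    """Return the captured request_time, or None when the line does not match."""
    m = REQUEST_TIME.search(line)
    return m.group("request_time") if m else None


for line in ("begin 123.456 end", "begin 123 end", "begin 123.4.5.6 end"):
    print(repr(line), "->", request_time(line))
```

The third input yields None: after 123.4 the expression requires whitespace, but the next character is another dot.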


Test it:

Create a configuration file with the following content:

input {stdin{}}
filter {
    grok {
        match => {
            "message" => "\s+(?<request_time>\d+(?:\.\d+)?)\s+"
        }
    }
}
output {stdout{}}

Run Logstash and type "begin 123.456 end"; you should see output similar to this:

{
         "message" => "begin 123.456 end",
        "@version" => "1",
      "@timestamp" => "2014-08-09T11:55:38.186Z",
            "host" => "raochenlindeMacBook-Air.local",
    "request_time" => "123.456"
}

Exercise:

Lines read from the /var/log/userlog.info log file have the following format and need a custom expression:

2016-05-20T20:00:15.703407+08:00 localhost [audit root/13283 as root/13283 on pts/0/172.16.100.99:64790->10.10.10.6:22]: #=== session closed ===
2016-05-21T09:52:54.424055+08:00 localhost [audit root/13558 as root/13558 on pts/0/172.16.100.99:50897->10.10.10.6:22]: #=== session opened ===
2016-05-21T09:53:25.687134+08:00 localhost [audit root/13558 as root/13558 on pts/0/172.16.100.99:50897->10.10.10.6:22] /root: cd /etc/logstash/conf.d/
2016-05-21T09:53:26.284741+08:00 localhost [audit root/13558 as root/13558 on pts/0/172.16.100.99:50897->10.10.10.6:22] /etc/logstash/conf.d: ll

Note that not every line in the log above has the same format. The grok expression is as follows:

%{TIMESTAMP_ISO8601:timestamp} %{IPORHOST:login_host} \[\S+ %{USER:login_user}/%{NUMBER:pid} as %{USER:sudouser}/%{NUMBER:sudouser_pid} on %{WORD:tty}/%{NUMBER:tty_id}/%{IPORHOST:host_ip}:%{NUMBER:source_port}-\>%{IPORHOST:local_ip}:%{NUMBER:dest_port}\](?:\:|) (%{UNIXPATH:current_path} %{GREEDYDATA:command}|%{GREEDYDATA:detail})
Note the final part: (?:\:|) (%{UNIXPATH:current_path} %{GREEDYDATA:command}|%{GREEDYDATA:detail})

The log lines above differ from one another toward the end.


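How do you handle this? Add the following at the point where the lines diverge:

(?:\:|)

This matches either a colon or nothing at that position. After it, write one grok expression for each form that appears in the file. For example, when the line continues with

/root: cd /etc/logstash/conf.d/

the grok expression is

%{UNIXPATH:current_path} %{GREEDYDATA:command}

When it continues with

#=== session closed ===

the grok expression is

%{GREEDYDATA:detail}

The two cases have to be joined with "|" so that grok tries the first and falls back to the second, so the correct form is

(%{UNIXPATH:current_path} %{GREEDYDATA:command}|%{GREEDYDATA:detail})

The alternation as a whole must be wrapped in parentheses. It mirrors the

(?:\:|)

before it, which is built the same way.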
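
The two-branch ending can be tried out with a plain Python regex. The sketch below is an approximation, not the grok expression itself: the grok patterns are replaced by simplified stand-ins chosen for this example (\S+ for the timestamp and host, /\S* for UNIXPATH, .* for GREEDYDATA), and parse_audit is a hypothetical helper name.

```python
import re

AUDIT_LINE = re.compile(
    r"(?P<timestamp>\S+) (?P<login_host>\S+) "
    r"\[\S+ (?P<login_user>[\w.-]+)/(?P<pid>\d+) "
    r"as (?P<sudouser>[\w.-]+)/(?P<sudouser_pid>\d+) "
    r"on (?P<tty>\w+)/(?P<tty_id>\d+)/(?P<host_ip>[\d.]+):(?P<source_port>\d+)"
    r"->(?P<local_ip>[\d.]+):(?P<dest_port>\d+)\]"
    r"(?::|) "  # a colon or nothing, then the space
    # Branch 1: path plus command. Branch 2: anything else goes to detail.
    r"(?:(?P<current_path>/\S*) (?P<command>.*)|(?P<detail>.*))"
)

PREFIX = (
    "2016-05-21T09:53:25.687134+08:00 localhost "
    "[audit root/13558 as root/13558 on pts/0/172.16.100.99:50897->10.10.10.6:22]"
)
session_line = PREFIX + ": #=== session opened ==="
command_line = PREFIX + " /root: cd /etc/logstash/conf.d/"


def parse_audit(line):
    """Return the captured fields as a dict, or None if the line does not match."""
    m = AUDIT_LINE.match(line)
    return m.groupdict() if m else None


for line in (session_line, command_line):
    fields = parse_audit(line)
    print({k: fields[k] for k in ("current_path", "command", "detail")})
```

In this sketch the path branch keeps the trailing colon in current_path, because the colon is consumed by the path stand-in rather than by the optional colon before it.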


This article comes from the "zengestudy" blog; please keep this attribution: http://zengestudy.blog.51cto.com/1702365/1782637
