Logstash: grok
Posted by 终点即起点
nginx matching example
nginx log format:
'$remote_user [$time_local] $http_x_Forwarded_for $remote_addr $request $status $upstream_status' '$http_x_forwarded_for' '$upstream_addr ' 'ups_resp_time: $upstream_response_time ' 'request_time: $request_time';
nginx log sample:
- [09/May/2023:15:01:31 +0800] 11.20.1.30 38.34.246.127 GET / HTTP/1.1 200 -11.20.1.30- ups_resp_time: - request_time: 0.000
grok match:
filter {
  grok {
    match => { "message" => "%{DATA:remote_user} \[%{HTTPDATE:log_times}\] %{IPV4:http_x_Forwarded_for} %{IPV4:remote_addr} %{WORD:request_method} %{DATA:uri} HTTP/%{NUMBER:http_version} %{NUMBER:response_code} %{DATA:upstream_status}%{IPV4:http_x_forwarded_for}%{DATA:upstream_addr} ups_resp_time: %{DATA:ups_resp_time} request_time: %{NUMBER:request_time}" }
  }
}
Data after matching:
{
    "http_x_Forwarded_for" => "11.20.1.30",
    "host" => "elk3",
    "message" => "- [09/May/2023:15:01:31 +0800] 11.20.1.30 38.34.246.127 GET / HTTP/1.1 200 -11.20.1.30- ups_resp_time: - request_time: 0.000",
    "request_method" => "GET",
    "upstream_status" => "-",
    "ups_resp_time" => "-",
    "request_time" => "0.000",
    "remote_user" => "-",
    "log_times" => "09/May/2023:15:01:31 +0800",
    "upstream_addr" => "-",
    "@version" => "1",
    "@timestamp" => 2023-05-09T08:12:35.912Z,
    "http_version" => "1.1",
    "remote_addr" => "38.34.246.127",
    "http_x_forwarded_for" => "11.20.1.30",
    "uri" => "/",
    "response_code" => "200"
}
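The part of the expression that reads %{DATA:upstream_status}%{IPV4:http_x_forwarded_for}%{DATA:upstream_addr} has no separators between its fields, so it relies on DATA being a lazy match (.*?). The Python sketch below is only an illustration of that behavior: the IPV4 regex is a simplified stand-in (an assumption, not the real grok definition), and Python's `re` stands in for the regex engine Logstash uses.

```python
import re

# Simplified stand-in for grok's IPV4 pattern (assumption for illustration only).
IPV4 = r"\d{1,3}(?:\.\d{1,3}){3}"

# DATA is ".*?" in grok: it takes as little as possible, so the IP in the
# middle decides where upstream_status ends and upstream_addr begins.
tail = re.compile(
    r"(?P<upstream_status>.*?)"
    rf"(?P<http_x_forwarded_for>{IPV4})"
    r"(?P<upstream_addr>.*?) ups_resp_time: "
)

# The tail of the sample log line from the article.
sample = "-11.20.1.30- ups_resp_time: - request_time: 0.000"
fields = tail.match(sample).groupdict()
print(fields)
```

Both dashes end up in their own fields, which agrees with the "upstream_status" and "upstream_addr" values in the matched data above.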
grok syntax
%{SYNTAX:SEMANTIC}
%{name of a predefined pattern:custom field name}
Built-in patterns
USERNAME [a-zA-Z0-9._-]+
USER %{USERNAME}
EMAILLOCALPART [a-zA-Z][a-zA-Z0-9_.+-=:]+
EMAILADDRESS %{EMAILLOCALPART}@%{HOSTNAME}
INT (?:[+-]?(?:[0-9]+))
BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
NUMBER (?:%{BASE10NUM})
BASE16NUM (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))
BASE16FLOAT \b(?<![0-9A-Fa-f.])(?:[+-]?(?:0x)?(?:(?:[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]*)?)|(?:\.[0-9A-Fa-f]+)))\b
POSINT \b(?:[1-9][0-9]*)\b
NONNEGINT \b(?:[0-9]+)\b
WORD \b\w+\b
NOTSPACE \S+
SPACE \s*
DATA .*?
GREEDYDATA .*
QUOTEDSTRING (?>(?<!\\)(?>"(?>\\.|[^\\"]+)+"|""|(?>'(?>\\.|[^\\']+)+')|''|(?>`(?>\\.|[^\\`]+)+`)|``))
UUID [A-Fa-f0-9]{8}-(?:[A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}
# URN, allowing use of RFC 2141 section 2.3 reserved characters
URN urn:[0-9A-Za-z][0-9A-Za-z-]{0,31}:(?:%[0-9a-fA-F]{2}|[0-9A-Za-z()+,.:=@;$_!*'/?#-])+

# Networking
MAC (?:%{CISCOMAC}|%{WINDOWSMAC}|%{COMMONMAC})
CISCOMAC (?:(?:[A-Fa-f0-9]{4}\.){2}[A-Fa-f0-9]{4})
WINDOWSMAC (?:(?:[A-Fa-f0-9]{2}-){5}[A-Fa-f0-9]{2})
COMMONMAC (?:(?:[A-Fa-f0-9]{2}:){5}[A-Fa-f0-9]{2})
IPV6 ((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?
IPV4 (?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])
IP (?:%{IPV6}|%{IPV4})
HOSTNAME \b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)
IPORHOST (?:%{IP}|%{HOSTNAME})
HOSTPORT %{IPORHOST}:%{POSINT}

# paths
PATH (?:%{UNIXPATH}|%{WINPATH})
UNIXPATH (/([\w_%!$@:.,+~-]+|\\.)*)+
TTY (?:/dev/(pts|tty([pq])?)(\w+)?/?(?:[0-9]+))
WINPATH (?>[A-Za-z]+:|\\)(?:\\[^\\?*]*)+
URIPROTO [A-Za-z]([A-Za-z0-9+\-.]+)+
URIHOST %{IPORHOST}(?::%{POSINT:port})?
# uripath comes loosely from RFC1738, but mostly from what Firefox
# doesn't turn into %XX
URIPATH (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%&_\-]*)+
#URIPARAM \?(?:[A-Za-z0-9]+(?:=(?:[^&]*))?(?:&(?:[A-Za-z0-9]+(?:=(?:[^&]*))?)?)*)?
URIPARAM \?[A-Za-z0-9$.+!*'|(){},~@#%&/=:;_?\-\[\]<>]*
URIPATHPARAM %{URIPATH}(?:%{URIPARAM})?
URI %{URIPROTO}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST})?(?:%{URIPATHPARAM})?

# Months: January, Feb, 3, 03, 12, December
MONTH \b(?:[Jj]an(?:uary|uar)?|[Ff]eb(?:ruary|ruar)?|[Mm](?:a|ä)?r(?:ch|z)?|[Aa]pr(?:il)?|[Mm]a(?:y|i)?|[Jj]un(?:e|i)?|[Jj]ul(?:y)?|[Aa]ug(?:ust)?|[Ss]ep(?:tember)?|[Oo](?:c|k)?t(?:ober)?|[Nn]ov(?:ember)?|[Dd]e(?:c|z)(?:ember)?)\b
MONTHNUM (?:0?[1-9]|1[0-2])
MONTHNUM2 (?:0[1-9]|1[0-2])
MONTHDAY (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])

# Days: Monday, Tue, Thu, etc...
DAY (?:Mon(?:day)?|Tue(?:sday)?|Wed(?:nesday)?|Thu(?:rsday)?|Fri(?:day)?|Sat(?:urday)?|Sun(?:day)?)

# Years?
YEAR (?>\d\d){1,2}
HOUR (?:2[0123]|[01]?[0-9])
MINUTE (?:[0-5][0-9])
# '60' is a leap second in most time standards and thus is valid.
SECOND (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)
TIME (?!<[0-9])%{HOUR}:%{MINUTE}(?::%{SECOND})(?![0-9])
# datestamp is YYYY/MM/DD-HH:MM:SS.UUUU (or something like it)
DATE_US %{MONTHNUM}[/-]%{MONTHDAY}[/-]%{YEAR}
DATE_EU %{MONTHDAY}[./-]%{MONTHNUM}[./-]%{YEAR}
ISO8601_TIMEZONE (?:Z|[+-]%{HOUR}(?::?%{MINUTE}))
ISO8601_SECOND (?:%{SECOND}|60)
TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
DATE %{DATE_US}|%{DATE_EU}
DATESTAMP %{DATE}[- ]%{TIME}
TZ (?:[APMCE][SD]T|UTC)
DATESTAMP_RFC822 %{DAY} %{MONTH} %{MONTHDAY} %{YEAR} %{TIME} %{TZ}
DATESTAMP_RFC2822 %{DAY}, %{MONTHDAY} %{MONTH} %{YEAR} %{TIME} %{ISO8601_TIMEZONE}
DATESTAMP_OTHER %{DAY} %{MONTH} %{MONTHDAY} %{TIME} %{TZ} %{YEAR}
DATESTAMP_EVENTLOG %{YEAR}%{MONTHNUM2}%{MONTHDAY}%{HOUR}%{MINUTE}%{SECOND}

# Syslog Dates: Month Day HH:MM:SS
SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}
PROG [\x21-\x5a\x5c\x5e-\x7e]+
SYSLOGPROG %{PROG:program}(?:\[%{POSINT:pid}\])?
SYSLOGHOST %{IPORHOST}
SYSLOGFACILITY <%{NONNEGINT:facility}.%{NONNEGINT:priority}>
HTTPDATE %{MONTHDAY}/%{MONTH}/%{YEAR}:%{TIME} %{INT}

# Shortcuts
QS %{QUOTEDSTRING}

# Log formats
SYSLOGBASE %{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:

# Log Levels
LOGLEVEL ([Aa]lert|ALERT|[Tt]race|TRACE|[Dd]ebug|DEBUG|[Nn]otice|NOTICE|[Ii]nfo|INFO|[Ww]arn?(?:ing)?|WARN?(?:ING)?|[Ee]rr?(?:or)?|ERR?(?:OR)?|[Cc]rit?(?:ical)?|CRIT?(?:ICAL)?|[Ff]atal|FATAL|[Ss]evere|SEVERE|EMERG(?:ENCY)?|[Ee]merg(?:ency)?)
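Logstash grok usage examples
Grok is the most important Logstash plugin. You can predefine named regular expressions in grok and reference them later, either in grok parameters or inside other regular expressions. It works well for syslog logs, Apache and other web server logs, and MySQL logs. grok ships with many predefined patterns, and you can also define your own.
grok syntax:
%{SYNTAX:SEMANTIC}
SYNTAX is the name of a pattern that grok already defines; SEMANTIC is the field name you choose for the matched text.
For example, given 192.168.0.100,
%{IP:client} matches the IP address and stores it in a field named client.
Suppose a web server log contains lines in the following format: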
55.3.244.1 GET /index.html 15824 0.043
We can use grok to map this information onto the following fields:
%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}
In a configuration file it usually looks like this:
input {
  file {
    path => "/var/log/http.log"
  }
}
filter {
  grok {
    match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
  }
}
After the grok filter, the event carries the following fields:
client: 55.3.244.1
method: GET
request: /index.html
bytes: 15824
duration: 0.043
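To make the %{SYNTAX:SEMANTIC} mechanism concrete, here is a minimal Python sketch of the substitution idea: each reference is replaced by a named capture group built from a pattern table. The table below holds simplified stand-ins (assumptions for illustration; the real built-in definitions are stricter), and `grok_to_regex` is a hypothetical helper, not part of Logstash.

```python
import re

# Simplified stand-ins for a few built-in patterns (assumptions, not the
# real grok definitions).
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\b\w+\b",
    "URIPATHPARAM": r"/\S*",
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
}

def grok_to_regex(expression: str) -> str:
    """Replace each %{SYNTAX:SEMANTIC} with a named group (?P<SEMANTIC>...)."""
    def expand(m: re.Match) -> str:
        syntax, semantic = m.group(1), m.group(2)
        return f"(?P<{semantic}>{PATTERNS[syntax]})"
    return re.sub(r"%\{(\w+):(\w+)\}", expand, expression)

expr = ("%{IP:client} %{WORD:method} %{URIPATHPARAM:request} "
        "%{NUMBER:bytes} %{NUMBER:duration}")
regex = grok_to_regex(expr)
fields = re.search(regex, "55.3.244.1 GET /index.html 15824 0.043").groupdict()
print(fields)
```

Every captured value is a string here, which is also what grok produces unless a type conversion is requested.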
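How do you define a custom pattern?
Syntax: (?<field_name>the pattern here)
Suppose we have the following content:
begin 123.456 end
We want 123.456 to become the request_time field, so the regular expression can be written like this: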
\s+(?<request_time>\d+(?:\.\d+)?)\s+
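Explanation:
\s: matches any whitespace character, including space, tab, form feed and so on. Equivalent to [ \f\n\r\t\v]. + means one or more occurrences.
(?<request_time> ): named-capture syntax supported by grok; request_time is the field name that receives the captured text.
\d+: matches one or more digits.
(?:\.\d+): a non-capturing group, explained next.
(?: pattern): a non-capturing match. It matches pattern but does not store the result for later use. This is useful when combining parts of a pattern with the alternation character "|". For example, "industr(?:y|ies)" is a more concise expression than "industry|industries".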
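\.\d+: a dot followed by one or more digits. (?:\.\d+)? means this group appears zero times or once; when it appears zero times, request_time is an integer. So the captured value can be 123.456 or 123. A value such as 123.4.5.6 does not match, because ? allows the fractional part at most once and the surrounding \s+ require whitespace on both sides of the number.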
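The effect of the ? quantifier on the fractional group can be checked quickly outside Logstash. The sketch below uses Python's `re` module, which requires (?P<name>...) for named groups instead of (?<name>...); the `repeated` variant with * is added here only for contrast and is not part of the original expression.

```python
import re

# Same expression as above, in Python's named-group syntax.
optional = re.compile(r"\s+(?P<request_time>\d+(?:\.\d+)?)\s+")  # ? : zero or one
# Contrast case (not in the article): * allows any number of ".digits" parts.
repeated = re.compile(r"\s+(?P<request_time>\d+(?:\.\d+)*)\s+")  # * : zero or more

def capture(pattern, line):
    """Return the request_time capture, or None when the line does not match."""
    m = pattern.search(line)
    return m.group("request_time") if m else None

print(capture(optional, "begin 123.456 end"))    # 123.456
print(capture(optional, "begin 123 end"))        # 123
print(capture(optional, "begin 123.4.5.6 end"))  # None
print(capture(repeated, "begin 123.4.5.6 end"))  # 123.4.5.6
```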
Let's test it.
Create a configuration file with the following content:
input { stdin {} }
filter {
  grok {
    match => { "message" => "\s+(?<request_time>\d+(?:\.\d+)?)\s+" }
  }
}
output { stdout {} }
Run the logstash process and type "begin 123.456 end"; you will see output similar to this:
{
    "message" => "begin 123.456 end",
    "@version" => "1",
    "@timestamp" => "2014-08-09T11:55:38.186Z",
    "host" => "raochenlindeMacBook-Air.local",
    "request_time" => "123.456"
}
Exercise:
The log file /var/log/userlog.info contains entries in the following format and needs a custom grok expression:
2016-05-20T20:00:15.703407+08:00 localhost [audit root/13283 as root/13283 on pts/0/172.16.100.99:64790->10.10.10.6:22]: #=== session closed ===
2016-05-21T09:52:54.424055+08:00 localhost [audit root/13558 as root/13558 on pts/0/172.16.100.99:50897->10.10.10.6:22]: #=== session opened ===
2016-05-21T09:53:25.687134+08:00 localhost [audit root/13558 as root/13558 on pts/0/172.16.100.99:50897->10.10.10.6:22] /root: cd /etc/logstash/conf.d/
2016-05-21T09:53:26.284741+08:00 localhost [audit root/13558 as root/13558 on pts/0/172.16.100.99:50897->10.10.10.6:22] /etc/logstash/conf.d: ll
Note that not every line in the log above has the same format. The grok expression is as follows:
%{TIMESTAMP_ISO8601:timestamp} %{IPORHOST:login_host} \[\S+ %{USER:login_user}/%{NUMBER:pid} as %{USER:sudouser}/%{NUMBER:sudouser_pid} on %{WORD:tty}/%{NUMBER:tty_id}/%{IPORHOST:host_ip}:%{NUMBER:source_port}-\>%{IPORHOST:local_ip}:%{NUMBER:dest_port}\](?:\:|) (%{UNIXPATH:current_path} %{GREEDYDATA:command}|%{GREEDYDATA:detail})
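Note: (?:\:|) (%{UNIXPATH:current_path} %{GREEDYDATA:command}|%{GREEDYDATA:detail})
The log lines above differ from each other after the closing bracket.
How do we handle that? At the point where they differ, add
(?:\:|)
which matches either a colon or nothing, that is, an optional colon. After it, write the grok expression for whatever actually appears in the file. For example, when the content looks like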
/root: cd /etc/logstash/conf.d/
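the grok expression is
%{UNIXPATH:current_path} %{GREEDYDATA:command}
and when the content looks like
#=== session closed ===
the grok expression is
%{GREEDYDATA:detail}
The two cases have to be combined with "|" so that either one can match, so the correct form is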
(%{UNIXPATH:current_path} %{GREEDYDATA:command}|%{GREEDYDATA:detail})
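The whole alternation must be wrapped in parentheses. It is the counterpart of
(?:\:|)
earlier in the expression: the colon is present right after the bracket on the session lines and absent on the command lines.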
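The interplay between the optional colon and the alternation can be tried on the tails of two of the sample lines. This Python sketch is an illustration only: /\S* stands in for UNIXPATH and .* for GREEDYDATA (simplifying assumptions), and `parse_tail` is a hypothetical helper name.

```python
import re

# Simplified stand-ins (assumptions): UNIXPATH -> /\S*, GREEDYDATA -> .*
tail = re.compile(
    r"\](?::|) "
    r"(?:(?P<current_path>/\S*) (?P<command>.*)|(?P<detail>.*))$"
)

def parse_tail(line):
    """Return only the groups that took part in the match."""
    m = tail.search(line)
    return {k: v for k, v in m.groupdict().items() if v is not None}

session = parse_tail("->10.10.10.6:22]: #=== session closed ===")
command = parse_tail("->10.10.10.6:22] /root: cd /etc/logstash/conf.d/")
print(session)
print(command)
```

In the second case the captured path is "/root:" with the trailing colon, because \S accepts ":"; the UNIXPATH character class listed among the built-in patterns also contains ":", so the grok expression most likely behaves the same way.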
This article comes from the "zengestudy" blog; please keep this source link: http://zengestudy.blog.51cto.com/1702365/1782637