Parsing Web access logs with Python

I am trying to parse Web access logs with the following regular expression:

import re

pattern = re.compile(r"""(?x)^
    (?P<remote_host>\S+)            \s+         # host %h
    \S+                             \s+         # ident %l (unused)
    (?P<remote_user>\S+)            \s+         # user %u
    \[(?P<time_received>.*?)\]      \s+         # time %t
    "(?P<request>.*?)"              \s+         # request "%r"
    (?P<status>[0-9]+)              \s+         # status %>s
    (?P<response_bytes_clf>\S+)     (?:\s+      # size %b (careful, can be '-')
    "(?P<referrer>[^"?\s]*[^"]*)"   \s+         # referrer "%{Referer}i"
    "(?P<user_agent>[^"]*)"         (?:\s+      # user agent "%{User-agent}i"
    "[^"]*"                         )?)?        # optional argument (unused)
$""")

def get_structured_access_log(access_log):
    return pattern.match(access_log).groupdict()

However, some log lines contain malicious requests, like these:

190.2.7.178 - - [21/Dec/2011:05:47:03 +0000] "GET /gnu3/index.php?doc=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 273 "-" "<?php system("id"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:04 +0000] "GET /gnu/index.php?doc=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 271 "-" "<?php system("id"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:04 +0000] "GET /phpgwapi/setup/tables_update.inc.php?appdir=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 286 "-" "<?php system("id"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:05 +0000] "GET /forum/install.php?phpbb_root_dir=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 274 "-" "<?php system("id"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:06 +0000] "GET /includes/calendar.php?phpc_root_path=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 275 "-" "<?php system("id"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:06 +0000] "GET /includes/setup.php?phpc_root_path=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 273 "-" "<?php system("id"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:07 +0000] "GET /inc/authform.inc.php?path_pre=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 275 "-" "<?php system("id"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:07 +0000] "GET /include/authform.inc.php?path_pre=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 278 "-" "<?php system("id"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:08 +0000] "GET /index.php?nic=../../../../../../../proc/self/environ%00 HTTP/1.1" 200 4399 "-" "<?php system("id"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:11 +0000] "GET /index.php?sec=../../../../../../../proc/self/environ%00 HTTP/1.1" 200 4399 "-" "<?php system("id"); ?>"

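These requests cannot be parsed with the regular expression above, while other, normal Web requests are parsed successfully.

Here are some access log lines that parse successfully: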

123.125.71.79 - - [28/Apr/2012:08:12:57 +0100] "GET /robots.txt HTTP/1.1" 404 268 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
157.56.95.126 - - [28/Apr/2012:10:23:02 +0100] "GET /robots.txt HTTP/1.1" 404 268 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
157.56.95.126 - - [28/Apr/2012:10:23:02 +0100] "GET / HTTP/1.1" 200 4399 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
110.75.173.193 - - [28/Apr/2012:11:57:26 +0100] "GET / HTTP/1.1" 200 4399 "-" "Yahoo! Slurp China"

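Exception message:

'NoneType' object has no attribute 'groupdict'

Please fix the regular expression so that it can also parse these complex requests.

Thank you for your help.

Answer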

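re.match returns a match object when the string matches the pattern, and None when it does not. Calling .groupdict() on that None is what raises the error.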
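
As a minimal sketch of guarding against that None, the helper below checks the result before calling groupdict(). The pattern SIMPLE and the function name parse_or_none are hypothetical and deliberately much smaller than the pattern in the question; they only illustrate the check.

```python
import re

# Hypothetical, simplified pattern used only for this illustration:
# a host followed by a three-digit status code.
SIMPLE = re.compile(r"^(?P<host>\S+)\s+(?P<status>[0-9]{3})$")

def parse_or_none(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = SIMPLE.match(line)
    if match is None:
        # re.match found no match; avoid calling .groupdict() on None.
        return None
    return match.groupdict()

good = parse_or_none("1.2.3.4 404")
bad = parse_or_none("not a log line")
```

Returning None (or logging and skipping the line) lets a batch job keep going past lines it cannot parse instead of stopping at the first one.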

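In the first sample data, the part that fails is the last field, which contains (escaped) double quotes: "<?php system("id"); ?>"

The negated character class [^"]* matches any character except a double quote, so it cannot get past the first double quote in ("id. Because the pattern then requires the closing quote followed by the end of the string, the whole match fails.

You can fix the pattern by replacing the negated character class in "(?P<user_agent>[^"]*)" with .*?, which matches any character except a newline (double quotes included), as few times as possible.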
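
The difference between the two sub-patterns can be seen on the user agent field in isolation. This is only a sketch: the variable names are made up for the illustration, and the field is taken without backslashes, as it appears in the sample lines above.

```python
import re

# The user agent field from the malicious sample lines, quotes included.
agent_field = '"<?php system("id"); ?>"'

# Negated character class: stops at the first inner double quote.
strict = re.compile(r'^"(?P<ua>[^"]*)"$')
# Lazy dot: may cross inner quotes until a quote is followed by end of string.
lazy = re.compile(r'^"(?P<ua>.*?)"$')

strict_match = strict.match(agent_field)
lazy_match = lazy.match(agent_field)

print(strict_match)            # None
print(lazy_match.group("ua"))  # <?php system("id"); ?>
```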

Your pattern could look like this:

(?x)^
    (?P<remote_host>\S+)            \s+         # host %h
    \S+                             \s+         # ident %l (unused)
    (?P<remote_user>\S+)            \s+         # user %u
    \[(?P<time_received>.*?)\]      \s+         # time %t
    "(?P<request>.*?)"              \s+         # request "%r"
    (?P<status>[0-9]+)              \s+         # status %>s
    (?P<response_bytes_clf>\S+)     (?:\s+      # size %b (careful, can be '-')
    "(?P<referrer>[^"?\s]*[^"]*)"   \s+         # referrer "%{Referer}i"
    "(?P<user_agent>.*?)"           (?:\s+      # user agent "%{User-agent}i"
    "[^"]*"                         )?)?        # optional argument (unused)
$
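
To check the adjusted pattern, here is a small self-contained sketch that compiles it and runs it against one malicious line and one normal line from the question. The variable names malicious and normal are only for this illustration.

```python
import re

pattern = re.compile(r"""(?x)^
    (?P<remote_host>\S+)            \s+         # host %h
    \S+                             \s+         # ident %l (unused)
    (?P<remote_user>\S+)            \s+         # user %u
    \[(?P<time_received>.*?)\]      \s+         # time %t
    "(?P<request>.*?)"              \s+         # request "%r"
    (?P<status>[0-9]+)              \s+         # status %>s
    (?P<response_bytes_clf>\S+)     (?:\s+      # size %b (can be '-')
    "(?P<referrer>[^"?\s]*[^"]*)"   \s+         # referrer "%{Referer}i"
    "(?P<user_agent>.*?)"           (?:\s+      # user agent, lazy match
    "[^"]*"                         )?)?        # optional argument (unused)
$""")

# Sample lines copied from the question.
malicious = ('190.2.7.178 - - [21/Dec/2011:05:47:08 +0000] '
             '"GET /index.php?nic=../../../../../../../proc/self/environ%00 HTTP/1.1" '
             '200 4399 "-" "<?php system("id"); ?>"')
normal = ('110.75.173.193 - - [28/Apr/2012:11:57:26 +0100] '
          '"GET / HTTP/1.1" 200 4399 "-" "Yahoo! Slurp China"')

malicious_fields = pattern.match(malicious).groupdict()
normal_fields = pattern.match(normal).groupdict()

print(malicious_fields["user_agent"])  # <?php system("id"); ?>
print(normal_fields["user_agent"])     # Yahoo! Slurp China
```

Because .*? is anchored by the closing quote and the end of the line, it still captures the whole user agent for ordinary lines; the trade-off is that the field is no longer guaranteed to be free of double quotes.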
