python bleach --- making HTML a little cleaner

Posted 2020-10-12

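I. What bleach does

When you do web development in Python, you have to guard against XSS injection from users. You could write your own allowlist and filter tags and attributes with an HTML-processing library such as BeautifulSoup. Bleach is a Python library that does this for you. Its documentation describes it as follows: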

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes.

Bleach can also linkify text safely, applying filters that Django’s urlize filter cannot, and optionally setting rel attributes, even on links already in the text.

Bleach is intended for sanitizing text from untrusted sources. If you find yourself jumping through hoops to allow your site administrators to do lots of things, you’re probably outside the use cases. Either trust those users, or don’t.

Because it relies on html5lib, Bleach is as good as modern browsers at dealing with weird, quirky HTML fragments. And any of Bleach’s methods will fix unbalanced or mis-nested tags.

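Put plainly:

bleach takes untrusted HTML and keeps only the tags and attributes on an allowlist, escaping or stripping everything else. It can also turn bare URLs into links and set rel attributes, even on links that are already in the text.

It is not meant for giving trusted site administrators fine-grained control over what they may post. If you are bending it in that direction, you are outside its intended use.

Because it parses with html5lib, it copes with odd HTML fragments as well as modern browsers do, and it repairs unbalanced or mis-nested tags along the way.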

II. Installing bleach

Now that we know what bleach is for, let's install it. Bleach is available on PyPI, so it can be installed with pip or easy_install:

$ pip install bleach

$ easy_install bleach

III. Basic usage

>>> import bleach

>>> bleach.clean('an <script>evil()</script> example')
u'an &lt;script&gt;evil()&lt;/script&gt; example'

>>> bleach.linkify('an http://example.com url')
u'an <a href="http://example.com" rel="nofollow">http://example.com</a> url'

That is enough for simple HTML sanitizing.

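IV. Common methods

Sanitizing text:

bleach.clean()

Sanitizes an HTML fragment. Note that it works on fragments, not whole HTML documents, and that its output is only meant to be placed in an HTML context: it is not designed for HTML attributes, CSS, JSON, XHTML, SVG or other contexts.
If you render the result inside an attribute, you still need to escape it in your template. If you are cleaning a lot of text with the same arguments, or want more configurability, consider using a bleach.sanitizer.Cleaner instance.

Parameters:

- text (str) – the text to clean, usually an HTML fragment
- tags (list) – allowlist of tags; defaults to bleach.sanitizer.ALLOWED_TAGS (an iterable of tag-name strings; tags not in it are stripped or escaped)
- attributes (dict, list or callable) – allowlist of attributes; defaults to bleach.sanitizer.ALLOWED_ATTRIBUTES (a dict maps each tag name to the list of attributes allowed on it, with the key * applying to all tags; a list applies to every tag)
- styles (list) – allowlist of CSS properties; defaults to bleach.sanitizer.ALLOWED_STYLES, which is empty, so without this argument whatever is written in a style attribute is filtered out
- protocols (list) – allowlist of link protocols; defaults to bleach.sanitizer.ALLOWED_PROTOCOLS = [u'http', u'https', u'mailto']. For tags that carry links or anchors, such as tags with an href attribute, the protocol must be allowed, otherwise the href attribute is removed. To support more protocols, pass a list that extends bleach.sanitizer.ALLOWED_PROTOCOLS
- strip (bool) – whether to remove elements that are not on the allowlist; with the default False they are escaped, with True they are removed
- strip_comments (bool) – whether to remove HTML comments; defaults to True

Returns:
    Unicode text

A short example for each parameter: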

# tags parameter example

>>> import bleach

>>> bleach.clean(
...     u'<b><i>an example</i></b>',
...     tags=['b'],
... )
u'<b>&lt;i&gt;an example&lt;/i&gt;</b>'

# attributes as a list

>>> import bleach

>>> bleach.clean(
...     u'<p class="foo" style="color: red; font-weight: bold;">blah blah blah</p>',
...     tags=['p'],
...     attributes=['style'],
...     styles=['color'],
... )
u'<p style="color: red;">blah blah blah</p>'

# attributes as a dict

>>> import bleach

>>> attrs = {
...     '*': ['class'],
...     'a': ['href', 'rel'],
...     'img': ['alt'],
... }

>>> bleach.clean(
...    u'<img alt="an example" width=500>',
...    tags=['img'],
...    attributes=attrs
... )
u'<img alt="an example">'

# attributes as a callable

>>> import bleach

>>> def allow_h(tag, name, value):
...     return name[0] == 'h'

>>> bleach.clean(
...    u'<a href="http://example.com" title="link">link</a>',
...    tags=['a'],
...    attributes=allow_h,
... )
u'<a href="http://example.com">link</a>'

>>> from urlparse import urlparse  # Python 2 module; on Python 3 use urllib.parse
>>> import bleach

>>> def allow_src(tag, name, value):
...     if name in ('alt', 'height', 'width'):
...         return True
...     if name == 'src':
...         p = urlparse(value)
...         return (not p.netloc) or p.netloc == 'mydomain.com'
...     return False

>>> bleach.clean(
...    u'<img src="http://example.com" alt="an example">',
...    tags=['img'],
...    attributes={
...        'img': allow_src
...    }
... )
u'<img alt="an example">'
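Because an attribute callable is just a function of (tag, name, value) that returns True or False, it can be written and checked on its own before it is handed to bleach. The following is a minimal Python 3 sketch of the same idea, not the library's own code. The names allow_img_attr, ALLOWED_IMG_HOSTS and ALLOWED_SCHEMES are made up for illustration, and the extra scheme check is an added assumption about what you might want.

```python
from urllib.parse import urlparse

# Hypothetical values for illustration; adjust them to your own site.
ALLOWED_IMG_HOSTS = {"mydomain.com"}
ALLOWED_SCHEMES = {"", "http", "https"}


def allow_img_attr(tag, name, value):
    """Return True if attribute `name` with `value` may stay on `tag`."""
    if name in ("alt", "height", "width"):
        return True
    if name == "src":
        parts = urlparse(value)
        # Reject schemes such as javascript: before looking at the host.
        if parts.scheme not in ALLOWED_SCHEMES:
            return False
        # Relative URLs have an empty netloc and stay on our own site.
        return parts.netloc == "" or parts.netloc in ALLOWED_IMG_HOSTS
    return False


for candidate in ("/static/a.png", "http://example.com/a.png"):
    print(candidate, allow_img_attr("img", "src", candidate))
```

It would be passed to bleach in the same way as allow_src, as the value for the img key of the attributes dict.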

# styles parameter example

>>> import bleach

>>> tags = ['p', 'em', 'strong']
>>> attrs = {
...     '*': ['style']
... }
>>> styles = ['color', 'font-weight']

>>> bleach.clean(
...     u'<p style="font-weight: heavy;">my html</p>',
...     tags=tags,
...     attributes=attrs,
...     styles=styles
... )
u'<p style="font-weight: heavy;">my html</p>'

# protocols parameter example

>>> import bleach

>>> bleach.clean(
...     '<a href="smb://more_text">allowed protocol</a>',
...     protocols=['http', 'https', 'smb']
... )
u'<a href="smb://more_text">allowed protocol</a>'

>>> import bleach

>>> bleach.clean(
...     '<a href="smb://more_text">allowed protocol</a>',
...     protocols=bleach.ALLOWED_PROTOCOLS + ['smb']
... )
u'<a href="smb://more_text">allowed protocol</a>'

# strip parameter example

>>> import bleach

>>> bleach.clean('<span>is not allowed</span>')
u'&lt;span&gt;is not allowed&lt;/span&gt;'

>>> bleach.clean('<b><span>is not allowed</span></b>', tags=['b'])
u'<b>&lt;span&gt;is not allowed&lt;/span&gt;</b>'

>>> import bleach

>>> bleach.clean('<span>is not allowed</span>', strip=True)
u'is not allowed'

>>> bleach.clean('<b><span>is not allowed</span></b>', tags=['b'], strip=True)
u'<b>is not allowed</b>'

# strip_comments parameter example

>>> import bleach

>>> html = 'my<!-- commented --> html'

>>> bleach.clean(html)
u'my html'

>>> bleach.clean(html, strip_comments=False)
u'my<!-- commented --> html'

bleach.sanitizer.Cleaner

class bleach.sanitizer.Cleaner(tags=[u'a', u'abbr', u'acronym', u'b', u'blockquote', u'code', u'em', u'i', u'li', u'ol', u'strong', u'ul'], attributes={u'a': [u'href', u'title'], u'acronym': [u'title'], u'abbr': [u'title']}, styles=[], protocols=[u'http', u'https', u'mailto'], strip=False, strip_comments=True, filters=None)

The parameters are essentially the same as for bleach.clean(). The filters parameter takes a list of html5lib Filter classes through which the token stream is passed.
clean method: returns the same thing as bleach.clean(), but raises TypeError if the input is not text.

A simple example:

from bleach.sanitizer import Cleaner

cleaner = Cleaner()

for text in all_the_yucky_things:
    sanitized = cleaner.clean(text)

An html5lib Filter passed through the filters parameter can implement custom processing: it can add, remove or change parts of the token stream. By default bleach deletes malicious content. With a custom filter you can instead rewrite such content into something acceptable without discarding it.
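As a rough illustration of what a custom filter looks like, the sketch below rewrites every attribute value instead of dropping anything. It is deliberately written without html5lib so that it runs on its own: RewriteAttrValues is a hypothetical duck-typed stand-in, and the hand-written tokens only assume the dict shape that html5lib tree walkers emit. In real use the class would subclass html5lib's Filter base class and be passed as filters=[...] to a Cleaner.

```python
class RewriteAttrValues:
    """Duck-typed sketch of an html5lib-style filter.

    It only mirrors the shape of a real filter: wrap a token source and
    yield (possibly modified) tokens from __iter__.
    """

    def __init__(self, source, replacement="moo"):
        self.source = source
        self.replacement = replacement

    def __iter__(self):
        for token in self.source:
            # Only tag tokens carry an attribute mapping in token["data"].
            if token["type"] in ("StartTag", "EmptyTag") and token.get("data"):
                for key in token["data"]:
                    token["data"][key] = self.replacement
            yield token


# Hand-written tokens; the exact shape is an assumption for this sketch.
tokens = [
    {"type": "EmptyTag", "name": "img",
     "data": {(None, "src"): "http://example.com/puppy.jpg",
              (None, "rel"): "nofollow"}},
    {"type": "Characters", "data": "this is cute!"},
]
filtered = list(RewriteAttrValues(tokens))
print(filtered[0]["data"])
```

The point of the design is that nothing is removed from the stream: the img token survives, only its attribute values change.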

bleach.clean() instantiates a bleach.sanitizer.Cleaner, which in turn instantiates a bleach.sanitizer.BleachSanitizerFilter, and it is that filter which actually does the sanitizing. bleach.sanitizer.BleachSanitizerFilter is an html5lib filter and can be used anywhere an html5lib filter can.

bleach.sanitizer.BleachSanitizerFilter

class bleach.sanitizer.BleachSanitizerFilter(source, attributes={u'a': [u'href', u'title'], u'acronym': [u'title'], u'abbr': [u'title']}, strip_disallowed_elements=False, strip_html_comments=True, **kwargs)

The parameters mirror those of bleach.clean(): strip_disallowed_elements corresponds to the strip parameter, and strip_html_comments corresponds to strip_comments.

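Linkifying text:

bleach.linkify()

Converts URL-like strings in HTML text into links. URL-like strings include URLs, domain names and email addresses. The following are left as they are:

1. links that are already present in the text;
2. URLs that appear inside attribute values;
3. email addresses, unless parse_email is set to True.

Overall, linkify makes a best effort to turn as much link-like text as possible into a tags.

If you want to linkify text with one consistent set of rules, use a bleach.linkifier.Linker instance.
If you want to both sanitize HTML and linkify the URL-like strings in it, use bleach.linkifier.LinkifyFilter, which avoids parsing the HTML twice.

bleach.linkify(text, callbacks=[<function nofollow>], skip_tags=None, parse_email=False)

Parameters:

- text (str) – the HTML text to linkify
- callbacks (list) – list of callback functions that adjust link attributes; defaults to bleach.linkifier.DEFAULT_CALLBACKS
- skip_tags (list) – list of tag names whose contents are not linkified; for example, ['pre'] skips pre tags
- parse_email (bool) – whether to linkify email addresses

Each function in callbacks must have this signature:

def my_callback(attrs, new=False):

attrs is a dict of the link's attributes. As the examples in this section show, attribute keys are (namespace, name) tuples such as (None, u'href'), and the special key u'_text' holds the link text. A callback can add, remove or modify attributes; it returns the resulting dict, or None if the text should not be a link at all. new tells the callback whether it is handling a newly created link (URL-like text that was just found) or a link that already existed in the text.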
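The way a list of callbacks behaves can be sketched in a few lines: each callback receives the dict returned by the previous one, and a None result ends the chain, so no link is produced. This is an illustrative model of those semantics, not bleach's source. run_callbacks, add_nofollow and skip_mailto are hypothetical names, and the attrs dicts are written by hand.

```python
def run_callbacks(callbacks, attrs, new):
    """Apply callbacks in order; stop and return None if one vetoes the link."""
    for callback in callbacks:
        attrs = callback(attrs, new)
        if attrs is None:
            return None
    return attrs


def add_nofollow(attrs, new=False):
    # Attribute keys are (namespace, name) tuples.
    attrs[(None, "rel")] = "nofollow"
    return attrs


def skip_mailto(attrs, new=False):
    # Returning None means "do not produce a link for this text".
    if attrs.get((None, "href"), "").startswith("mailto:"):
        return None
    return attrs


web = {(None, "href"): "http://example.com", "_text": "http://example.com"}
mail = {(None, "href"): "mailto:someone@example.com", "_text": "mail me"}

web_result = run_callbacks([skip_mailto, add_nofollow], web, new=True)
mail_result = run_callbacks([skip_mailto, add_nofollow], mail, new=False)
print(web_result)
print(mail_result)
```

Order matters in such a chain: a callback placed after one that returns None never runs for that link.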

1. Adding attributes:

>>> from bleach.linkifier import Linker

>>> def set_title(attrs, new=False):
...     attrs[(None, u'title')] = u'link in user text'
...     return attrs
...
>>> linker = Linker(callbacks=[set_title])
>>> linker.linkify('abc http://example.com def')
u'abc <a href="http://example.com" title="link in user text">http://example.com</a> def'

An example that makes internal links open in the current page and external links open in a new one:

>>> from urlparse import urlparse  # Python 2 module; on Python 3 use urllib.parse
>>> from bleach.linkifier import Linker

>>> def set_target(attrs, new=False):
...     p = urlparse(attrs[(None, u'href')])
...     if p.netloc not in ['my-domain.com', 'other-domain.com']:
...         attrs[(None, u'target')] = u'_blank'
...         attrs[(None, u'class')] = u'external'
...     else:
...         attrs.pop((None, u'target'), None)
...     return attrs
...
>>> linker = Linker(callbacks=[set_target])
>>> linker.linkify('abc http://example.com def')
u'abc <a class="external" href="http://example.com" target="_blank">http://example.com</a> def'

2. Removing attributes. A callback can act as an attribute allowlist and remove attributes that are already on a tag, including on links that were not created by linkify. In that respect it works much like clean():

>>> from bleach.linkifier import Linker

>>> def allowed_attrs(attrs, new=False):
...     """Only allow href, target, rel and title."""
...     allowed = [
...         (None, u'href'),
...         (None, u'target'),
...         (None, u'rel'),
...         (None, u'title'),
...         u'_text',
...     ]
...     return dict((k, v) for k, v in attrs.items() if k in allowed)
...
>>> linker = Linker(callbacks=[allowed_attrs])
>>> linker.linkify('<a style="font-weight: super bold;" href="http://example.com">link</a>')
u'<a href="http://example.com">link</a>'

Besides removing everything outside an allowlist, you can also remove one specific attribute:

>>> from bleach.linkifier import Linker

>>> def remove_title(attrs, new=False):
...     attrs.pop((None, u'title'), None)
...     return attrs
...
>>> linker = Linker(callbacks=[remove_title])
>>> linker.linkify('<a href="http://example.com">link</a>')
u'<a href="http://example.com">link</a>'

>>> linker.linkify('<a title="bad title" href="http://example.com">link</a>')
u'<a href="http://example.com">link</a>'

3. Modifying attributes. For example, shortening how a long URL is displayed on the page:

>>> from bleach.linkifier import Linker

>>> def shorten_url(attrs, new=False):
...     """Shorten overly-long URLs in the text."""
...     # Only adjust newly-created links
...     if not new:
...         return attrs
...     # _text will be the same as the URL for new links
...     text = attrs[u'_text']
...     if len(text) > 25:
...         attrs[u'_text'] = text[0:22] + u'...'
...     return attrs
...
>>> linker = Linker(callbacks=[shorten_url])
>>> linker.linkify('http://example.com/longlonglonglonglongurl')
u'<a href="http://example.com/longlonglonglonglongurl">http://example.com/lon...</a>'

You can also route every outgoing link through a bouncer:

>>> from six.moves.urllib.parse import quote, urlparse
>>> from bleach.linkifier import Linker

>>> def outgoing_bouncer(attrs, new=False):
...     """Send outgoing links through a bouncer."""
...     href_key = (None, u'href')
...     p = urlparse(attrs.get(href_key, None))
...     if p.netloc not in ['example.com', 'www.example.com', '']:
...         bouncer = 'http://bn.ce/?destination=%s'
...         attrs[href_key] = bouncer % quote(attrs[href_key])
...     return attrs
...
>>> linker = Linker(callbacks=[outgoing_bouncer])
>>> linker.linkify('http://example.com')
u'<a href="http://example.com">http://example.com</a>'

>>> linker.linkify('http://foo.com')
u'<a href="http://bn.ce/?destination=http%3A//foo.com">http://foo.com</a>'
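A Python 3 variant of the bouncer callback might look like the sketch below. It uses only the standard library, so the callback can be exercised on a hand-built attrs dict without bleach. OWN_HOSTS and BOUNCER are illustrative names, and two details are assumptions rather than part of the example above: a missing href is treated as an empty string, and quote is called with safe="" so that slashes in the target are percent-encoded too.

```python
from urllib.parse import quote, urlparse

# Hypothetical configuration for illustration.
OWN_HOSTS = {"example.com", "www.example.com", ""}  # "" covers relative links
BOUNCER = "http://bn.ce/?destination=%s"


def outgoing_bouncer(attrs, new=False):
    """Send links that leave our own hosts through a bouncer URL."""
    href_key = (None, "href")
    href = attrs.get(href_key, "")
    if urlparse(href).netloc not in OWN_HOSTS:
        # safe="" also encodes "/", so the target stays a single query value.
        attrs[href_key] = BOUNCER % quote(href, safe="")
    return attrs


external = outgoing_bouncer({(None, "href"): "http://foo.com"})
internal = outgoing_bouncer({(None, "href"): "http://example.com"})
print(external[(None, "href")])
print(internal[(None, "href")])
```

Encoding the slashes is a design choice: it yields a different string from the default quote call, so the bouncer on the receiving side has to decode accordingly.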

You can prevent certain URL-like strings from being turned into links:

>>> from bleach.linkifier import Linker

>>> def dont_linkify_python(attrs, new=False):
...     # This is an existing link, so leave it be
...     if not new:
...         return attrs
...     # If the TLD is '.py', make sure it starts with http: or https:.
...     # Use _text because that's the original text
...     link_text = attrs[u'_text']
...     if link_text.endswith('.py') and not link_text.startswith(('http:', 'https:')):
...         # This looks like a Python file, not a URL. Don't make a link.
...         return None
...     # Everything checks out, keep going to the next callback.
...     return attrs
...
>>> linker = Linker(callbacks=[dont_linkify_python])
>>> linker.linkify('abc http://example.com def')
u'abc <a href="http://example.com">http://example.com</a> def'

>>> linker.linkify('abc models.py def')
u'abc models.py def'

Links can also be undone: even an a tag that already exists in the HTML can be turned back into plain text by a callback:

>>> from bleach.linkifier import Linker

>>> def remove_mailto(attrs, new=False):
...     if attrs[(None, u'href')].startswith(u'mailto:'):
...         return None
...     return attrs
...
>>> linker = Linker(callbacks=[remove_mailto])
>>> linker.linkify('<a href="mailto:janet@example.com">mail janet!</a>')
u'mail janet!'

bleach.linkifier.Linker

When you linkify text with one consistent set of rules, use a bleach.linkifier.Linker instance, since linkify() itself works by using one. Basic usage:

>>> from bleach.linkifier import Linker

>>> linker = Linker(skip_tags=['pre'])
>>> linker.linkify('a b c http://example.com d e f')
u'a b c <a href="http://example.com" rel="nofollow">http://example.com</a> d e f'

class bleach.linkifier.Linker(callbacks=[<function nofollow>], skip_tags=None, parse_email=False, url_re=<_sre.SRE_Pattern object at 0x25b8e90>, email_re=<_sre.SRE_Pattern object at 0x258b5f0>)

Parameters:

- callbacks (list) – same as for linkify()
- skip_tags (list) – same as for linkify()
- parse_email (bool) – same as for linkify()
- url_re (re) – compiled regular expression that matches URLs
- email_re (re) – compiled regular expression that matches email addresses

Returns: linkified Unicode text

Instance method:

linkify(text)

Parameters: text (str) – the HTML text to linkify

Returns: Unicode text

Raises: TypeError if the input is not text

bleach.linkifier.LinkifyFilter

bleach.linkify() does its linkifying through this filter. Like bleach.sanitizer.BleachSanitizerFilter, it is an html5lib filter and can be used wherever one is accepted. For example, you can pass it in the filters of a bleach.sanitizer.Cleaner so that sanitizing and linkifying happen together in a single pass. With the default configuration:

>>> from functools import partial

>>> from bleach.sanitizer import Cleaner
>>> from bleach.linkifier import LinkifyFilter

>>> cleaner = Cleaner(tags=['pre'])
>>> cleaner.clean('<pre>http://example.com</pre>')
u'<pre>http://example.com</pre>'

>>> cleaner = Cleaner(tags=['pre'], filters=[LinkifyFilter])
>>> cleaner.clean('<pre>http://example.com</pre>')
u'<pre><a href="http://example.com">http://example.com</a></pre>'

For comparison, with arguments passed to the filter:

>>> from functools import partial

>>> from bleach.sanitizer import Cleaner
>>> from bleach.linkifier import LinkifyFilter

>>> cleaner = Cleaner(
...     tags=['pre'],
...     filters=[partial(LinkifyFilter, skip_tags=['pre'])]
... )
...
>>> cleaner.clean('<pre>http://example.com</pre>')
u'<pre>http://example.com</pre>'

class bleach.linkifier.LinkifyFilter(source, callbacks=None, skip_tags=None, parse_email=False, url_re=<_sre.SRE_Pattern object at 0x25b8e90>, email_re=<_sre.SRE_Pattern object at 0x258b5f0>)

Parameters:

- source (TreeWalker) – the token stream
- callbacks (list) – same as for linkify()
- skip_tags (list) – same as for linkify()
- parse_email (bool) – same as for linkify()
- url_re (re) – same as for Linker
- email_re (re) – same as for Linker
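The use of functools.partial here deserves a word. The filters list holds filter classes, not instances, and the sketch below assumes each entry is later called with only the token source. partial pre-binds the remaining options, leaving a factory that still takes source. SkipTagsFilter is a hypothetical stand-in used to show the mechanism without importing bleach.

```python
from functools import partial


class SkipTagsFilter:
    """Hypothetical stand-in for a filter class that takes extra options."""

    def __init__(self, source, skip_tags=None):
        self.source = source
        self.skip_tags = skip_tags or []

    def __iter__(self):
        # A real filter would inspect tokens here; this one passes them through.
        return iter(self.source)


# partial() pre-binds skip_tags, so the factory needs only `source`.
factory = partial(SkipTagsFilter, skip_tags=["pre"])

# Later, the caller supplies the token stream (assumed call shape).
instance = factory(source=["token-1", "token-2"])
print(instance.skip_tags, list(instance))
```

A small subclass or a lambda would work as well; partial simply keeps the configuration next to the place where the filters list is built.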
