BeautifulSoup 4 spans containing '@' return strange results

Posted 2023-03-06

【Asked】: 2017-10-31 01:42:44 【Question】:

I am able to get the list of spans I need using:

attrs = soup.find_all("span")

This returns lists of spans that act as key/value pairs:

[
    <span>back camera resolution</span>, 
    <span class="even">12 MP</span>
]

[
    <span>front camera resolution</span>, 
    <span class="even">16 MP</span>
]

[
    <span>video resolution</span>, 
    <span class="even"><a class="__cf_email__" data-cfemail="b98b888f89c9f98a89dfc9ca" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="4677767e7636067576203635" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="5067626010616260362023" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> 
*/</script>
    </span>
]

The original HTML is:

Why is "video resolution" being converted like this?

【Comments】:

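Don't confuse the DOM viewer with the source the server delivers to the browser. BeautifulSoup cannot execute the JavaScript code the server sends. It looks like the server uses a JavaScript library to obfuscate email addresses automatically, and the browser executes that JavaScript to re-insert the text.

@MartijnPieters Wow! If it's that complicated, I don't think it's that important, so I'll skip it. Thanks.

Reversing it is not hard; I posted a method in my answer. It takes a BeautifulSoup tree and replaces every occurrence with the de-obfuscated result.

【Answer 1】: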

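The site is using the CloudFlare email protection feature, which appears to have replaced every string containing @ with an obfuscated (XOR-encrypted) value, to keep scrapers from harvesting email addresses. Each replacement comes with the JavaScript code needed to decode it.

BeautifulSoup does not execute JavaScript, but your browser did, and it replaced the <a class="__cf_email__"> tags with the resulting decrypted data.

You can do the same thing with a small Python 3 function; all the JavaScript code does is "decrypt" the (hex-encoded) value, using the first byte as the key in a simple XOR decryption routine: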

def decode(cfemail):
    enc = bytes.fromhex(cfemail)
    return bytes([c ^ enc[0] for c in enc[1:]]).decode('utf8')

def deobfuscate_cf_email(soup):
    for encrypted_email in soup.select('a.__cf_email__'):
        decrypted = decode(encrypted_email['data-cfemail'])
        # remove the <script> tag from the tree, if one follows the link
        script_tag = encrypted_email.find_next_sibling('script')
        if script_tag is not None:
            script_tag.decompose()
        # replace the <a class="__cf_email__"> tag with the decoded result
        encrypted_email.replace_with(decrypted)

To do the same in Python 2, replace bytes with bytearray.
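
To make the link to the inline script more concrete, here is a minimal sketch that follows the JavaScript step by step rather than working on bytes. The function name decode_like_js is hypothetical and is not part of the answer's code; it assumes the data-cfemail value is well-formed hex, as in the question's output.

```python
from urllib.parse import unquote

def decode_like_js(cfemail):
    """Hypothetical helper mirroring the inline script one step at a time."""
    key = int(cfemail[:2], 16)                 # r = '0x' + a.substr(0, 2) | 0
    escaped = ''
    for n in range(2, len(cfemail), 2):        # n = 2; a.length - n; n += 2
        value = int(cfemail[n:n + 2], 16) ^ key
        escaped += '%' + format(value, '02x')  # '%' + ('0' + hex).slice(-2)
    return unquote(escaped)                    # decodeURIComponent(e)

print(decode_like_js('b98b888f89c9f98a89dfc9ca'))  # 2160p@30fps
```

Here the key byte is 0xb9, so the second byte 0x8b becomes 0x8b ^ 0xb9 = 0x32, which is the character "2"; the remaining bytes decode the same way.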

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
...     <span>video resolution</span>,
...     <span class="even"><a class="__cf_email__" data-cfemail="b98b888f89c9f98a89dfc9ca" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="4677767e7636067576203635" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script> - <a class="__cf_email__" data-cfemail="5067626010616260362023" href="/cdn-cgi/l/email-protection">[email protected]</a><script data-cfhash="f9e31" type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> 
... */</script>
...     </span>
... ''')
>>> deobfuscate_cf_email(soup)
>>> soup
<html><body><span>video resolution</span>,
    <span class="even">2160p@30fps - 1080p@30fps - 720@120fps
</span>
</body></html>
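
If only the decoded strings are needed and the tree does not have to be rewritten, a rough alternative is to collect the data-cfemail attributes directly. The sketch below swaps BeautifulSoup for the standard library's html.parser, which is an assumption made purely to keep the example dependency-free; the class name CfEmailCollector and the shortened HTML snippet are hypothetical, and decode is repeated so the block runs on its own.

```python
from html.parser import HTMLParser

def decode(cfemail):
    # hex-decode, then XOR every remaining byte with the first (key) byte
    enc = bytes.fromhex(cfemail)
    return bytes(c ^ enc[0] for c in enc[1:]).decode('utf8')

class CfEmailCollector(HTMLParser):
    """Hypothetical helper: gathers decoded values from <a data-cfemail=...>."""

    def __init__(self):
        super().__init__()
        self.decoded = []

    def handle_starttag(self, tag, attrs):
        value = dict(attrs).get('data-cfemail')
        if tag == 'a' and value:
            self.decoded.append(decode(value))

# shortened version of the markup from the question, without the scripts
html = ('<span class="even">'
        '<a class="__cf_email__" data-cfemail="4677767e7636067576203635" '
        'href="/cdn-cgi/l/email-protection">[email protected]</a> - '
        '<a class="__cf_email__" data-cfemail="5067626010616260362023" '
        'href="/cdn-cgi/l/email-protection">[email protected]</a></span>')

collector = CfEmailCollector()
collector.feed(html)
print(collector.decoded)  # ['1080p@30fps', '720@120fps']
```

This only reads the values; unlike deobfuscate_cf_email it leaves the document untouched, so the placeholder text and any script tags remain in the markup.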
