Is it not possible to clean web crawl without tagging? Is it impossible to make it clean with regular expression?

Posted 2023-02-24

【Title】: Is it not possible to clean web crawl without tagging? Is it impossible to make it clean with regular expression? 【Posted】: 2020-09-10 08:06:48 【Question】:

import re
data = re.sub(r'<[^>]*>', '', string=html).lower()

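I want to crawl arbitrary pages. Since I cannot scrape only the content I need, I am asking here: is it effective to strip the HTML with a regular expression after scraping?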
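To make the concern concrete, here is a minimal sketch of what the tag-stripping regex does to a small page. The `sample` markup and the function name `strip_tags_with_regex` are made up for illustration; only the pattern comes from the snippet above.

```python
import re

# Same pattern as in the question: anything between "<" and the next ">".
TAG_RE = re.compile(r'<[^>]*>')


def strip_tags_with_regex(html: str) -> str:
    """Remove everything that looks like a tag, then lowercase the rest."""
    return TAG_RE.sub('', html).lower()


# Hypothetical page: inline CSS, inline JavaScript, and an HTML entity.
sample = (
    "<html><head><style>p { color: red; }</style>"
    "<script>if (a < b) { alert('hi'); }</script></head>"
    "<body><p>Fish &amp; Chips</p></body></html>"
)

result = strip_tags_with_regex(sample)
print(result)
```

On this input the tags are gone, but the CSS rule is still in the output, the `<` inside the script is mistaken for the start of a tag, and `&amp;` is left undecoded. The regex removes markup characters, but it has no notion of which text is actual page content.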

【Question comments】:

Does this answer your question? Can you provide some examples of why it is hard to parse XML and HTML with a regex? 【Answer 1】:

The html2text library or the pextract library works for this problem.
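If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's `html.parser`. This is only an illustration, not how html2text or pextract work internally; the names `TextExtractor` and `html_to_text` are hypothetical, and the choice to skip only `script` and `style` is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}  # assumption: only these two are non-content

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p { color: red; }</style>"
    "<script>if (a < b) { alert('hi'); }</script></head>"
    "<body><p>Fish &amp; Chips</p></body></html>"
)
print(html_to_text(page))
```

Because the parser knows which element it is inside, the script and style bodies are dropped and the entity is decoded. Navigation menus, footers, and other boilerplate text would still come through, so picking out only the main content of arbitrary pages needs more than this.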

【Comments】:
