How do I remove lines that are repeated or contain certain words from a text file?
【Posted】: 2022-01-17 17:47:31
【Question】: I am trying to remove duplicate lines, and lines that contain certain words, from scraped data. I have tried various code snippets I found, but none of them work :(
Here is my code. Only the first part works, the part that removes the duplicate lines:
from time import sleep

openFile = open("links.txt", "r")
writeFile = open("updatedfile.txt", "w")
# Store traversed lines
tmp = set()
for txtLine in openFile:
    # Check new line
    if txtLine not in tmp:
        writeFile.write(txtLine)
        # Add new traversed line to tmp
        tmp.add(txtLine)
openFile.close()
writeFile.close()
sleep(5)
with open("updatedfile.txt", "r") as fp:
    lines = fp.readlines()
with open("updatedfile.txt", "w") as fp:
    for line in lines:
        if line.strip("\n") != "search":
            fp.write(line)
Here is the links.txt file:
https://twitter.com/search?q=%23BTC&src=hashtag_click
https://twitter.com/search?q=%23ADA&src=hashtag_click
https://twitter.com/search?q=%23LTC&src=hashtag_click
https://twitter.com/search?q=%23CAKE&src=hashtag_click
https://twitter.com/Marie62943337
https://twitter.com/Marie62943337
https://twitter.com/Fathur0501
https://twitter.com/Fathur0501
https://twitter.com/BogdanMar93
https://twitter.com/BogdanMar93
https://t.[spaced because body cannot contain short url]co/74ZzkVwa2W
https://t. co/Gv2tyiWfAk
I want the output to be:
https://twitter.com/Marie62943337
https://twitter.com/Fathur0501
https://twitter.com/BogdanMar93
Thanks for your help.
【Answer 1】: Check this code. I think it works:
with open("test.txt", "r") as fp:
    lines = fp.readlines()
unique = set()
with open("test.txt", "w") as fp:
    for line in lines:
        # Keep a line only if it is a twitter.com link, is not a search link,
        # and has not been written already
        if "search" not in line and line not in unique and "twitter.com" in line:
            fp.write(line)
            unique.add(line)
Feel free to ask any questions in the comments below.
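The same idea can be wrapped in a small function that writes to a separate output file instead of overwriting the input. This is only a sketch: the name `clean_links` and its parameters are made up for illustration, and stripping the trailing newline before comparing is an added assumption, so that a final line without a newline still counts as a duplicate of an earlier identical line.

```python
def clean_links(src, dst, banned="search", required="twitter.com"):
    """Copy src to dst, keeping the first copy of each wanted line.

    A line is dropped if it contains `banned`, lacks `required`,
    or has already been kept. All names here are illustrative.
    """
    seen = set()
    kept = []
    with open(src, encoding="utf-8") as infile:
        for raw in infile:
            # Compare lines without the trailing newline
            line = raw.rstrip("\n")
            if banned in line or required not in line or line in seen:
                continue
            seen.add(line)
            kept.append(line)
    with open(dst, "w", encoding="utf-8") as outfile:
        outfile.writelines(line + "\n" for line in kept)
    return kept
```

Calling `clean_links("links.txt", "updatedfile.txt")` would leave links.txt untouched, which makes it easier to rerun the script while experimenting with the conditions.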
【Discussion】:
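That worked. I also need to remove "https://t.co/Gv2tyiWfAk" from the lines. How can I do that?
You can add a condition that the line must contain twitter.com. I have updated the code, so you can refer to it. And don't forget to mark the answer as accepted, for future reference.
Thank you very much! Everything works fine now.
【Answer 2】: Maybe you want to use this, with 'in':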
lines = ['https://twitter.com/search?q=%23CAKE&src=hashtag_click', 'https://twitter.com/Marie62943337']
for line in lines:
    if 'search' not in line:
        print(line)
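To see why `in` matters here, it may help to compare it with the equality test from the question, `line.strip("\n") != "search"`. The snippet below is a small illustration using one of the sample URLs; the variable names are made up for the example.

```python
line = "https://twitter.com/search?q=%23BTC&src=hashtag_click\n"

# Equality compares the whole string. No URL is exactly "search",
# so this is True for every line and nothing gets filtered out.
whole_line_differs = line.strip("\n") != "search"

# Membership looks for the word anywhere inside the line.
contains_word = "search" in line

print(whole_line_differs, contains_word)  # prints: True True
```

So the second pass in the question rewrites every line unchanged, while a substring test picks out the search links.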
【Discussion】:
I only want to remove the lines that contain "search" and "t.co".
For that, you can extend the condition like this: if 'search' not in line and 't.co' not in line:
Thank you. I checked and this works too. Learning something new every day!
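If the list of unwanted words keeps growing, chaining `and ... not in line` gets long. One possible alternative, sketched here with the made-up names `BLOCKED` and `keep_line`, is to hold the words in a tuple and test them with `any()`:

```python
# Substrings mentioned in the comments; extend this tuple as needed
BLOCKED = ("search", "t.co")


def keep_line(line, blocked=BLOCKED):
    """Return True if the line contains none of the blocked substrings."""
    return not any(word in line for word in blocked)


lines = [
    "https://twitter.com/search?q=%23CAKE&src=hashtag_click",
    "https://twitter.com/Marie62943337",
    "https://t.co/Gv2tyiWfAk",
]
kept = [line for line in lines if keep_line(line)]
print(kept)  # prints: ['https://twitter.com/Marie62943337']
```

Note that these are plain substring checks, so a blocked word that is too short could also match inside URLs you want to keep.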