How to fetch a substring from a text file in Python?

【Title】: How to fetch a substring from text file in python? 【Posted】: 2016-08-19 10:04:10 【Question】:

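I have a bunch of tweets in plain-text form, as shown below. I want to extract only the text part of each one.

Sample data from the file: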

Fri Nov 13 20:27:16 +0000 2015 4181010297 rt     we're treating one of you lads to this d'struct denim shirt! simply follow & rt to enter
Fri Nov 13 20:27:16 +0000 2015 2891325562 this album is wonderful, i'm so proud of you, i loved this album, it really is the best.    -273
Fri Nov 13 20:27:19 +0000 2015 2347993701 international break is garbage smh. it's boring and your players get injured
Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19
Fri Nov 13 20:27:20 +0000 2015 2495101558 woah what happened to twitter this update is horrible
Fri Nov 13 20:27:19 +0000 2015 229544082 i've completed the daily quest in paradise island 2!
Fri Nov 13 20:27:17 +0000 2015 309233999 new post: henderson memorial public library
Fri Nov 13 20:27:21 +0000 2015 291806707 who's going to  next week?
Fri Nov 13 20:27:19 +0000 2015 3031745900 why so blue?    @ golden bee

This is what I tried in the preprocessing stage:

for filename in glob.glob('*.txt'):
    with open("plain text - preprocesshurricane.txt",'a') as outfile ,open(filename, 'r') as infile:
        for tweet in infile.readlines():
            temp=tweet.split(' ')
            text=""
            for i in temp:
                x=str(i)
                if x.isalpha() :
                    text += x + ' '
            print(text)

Output:

Fri Nov rt treating one of you lads to this denim simply follow rt to 
Fri Nov this album is so proud of i loved this it really is the 
Fri Nov international break is garbage boring and your players get 
Fri Nov get weather updates from the weather 
Fri Nov woah what happened to twitter this update is 
Fri Nov completed the daily quest in paradise island 
Fri Nov new henderson memorial public 
Fri Nov going to next 
Fri Nov why so golden 

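This is not the desired output, because:

1. It does not keep numbers/digits that appear in the text part of the tweet.
2. Every line starts with Fri Nov.

Could you suggest a better way to achieve this? I am not very familiar with regular expressions, but I think we could use re.search(r'2015(magic to remove tweetID)/w*',tweet)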
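Both symptoms come from the same test. As a small illustration (the shortened sample line is only for the demo), str.isalpha() is true only for tokens made up entirely of letters, so the date words pass while numbers, timestamps and words with apostrophes are dropped:

```python
# A shortened version of the first sample line, split on single spaces
# exactly as in the attempt above.
line = "Fri Nov 13 20:27:16 +0000 2015 4181010297 rt we're treating one of you"

tokens = line.split(' ')
kept = [tok for tok in tokens if tok.isalpha()]
dropped = [tok for tok in tokens if tok and not tok.isalpha()]

print(kept)     # purely alphabetic tokens, including "Fri" and "Nov"
print(dropped)  # tokens containing a digit, colon, plus sign or apostrophe
```

Filtering token by token therefore cannot tell the header from the text; the answers below cut the line by position instead.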

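【Question comments】:

【Answer 1】:

I propose a slightly more specific pattern than @Rushy Panchal's, to avoid problems when a tweet contains digits: .+ \+(\d+ ){3}

Use the re.sub function:

>>> import re
>>> with open('your_file.txt','r') as file:
...     data = file.read()
...     print(re.sub(r'.+ \+(\d+ ){3}', '', data))

Output: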

rt     we're treating one of you lads to this d'struct denim shirt! simply follow & rt to enter
this album is wonderful, i'm so proud of you, i loved this album, it really is the best.    -273
international break is garbage smh. it's boring and your players get injured
get weather updates from the weather channel. 15:27:19
woah what happened to twitter this update is horrible
i've completed the daily quest in paradise island 2!
new post: henderson memorial public library
who's going to  next week?
why so blue?    @ golden bee
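For processing one line at a time, the same pattern can be precompiled. This is a minimal sketch, not part of the original answer: the names HEADER and strip_header are made up for illustration, and it assumes every header has the form "... +0000 2015 <tweet id> " as in the sample data.

```python
import re

# ".+" is greedy, but the match must end with " +" followed by exactly three
# space-terminated digit groups: the UTC offset, the year and the tweet ID.
HEADER = re.compile(r'.+ \+(\d+ ){3}')

def strip_header(line):
    """Return the tweet text, or the line unchanged if no header is found."""
    return HEADER.sub('', line, count=1)

samples = [
    "Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19",
    "Fri Nov 13 20:27:19 +0000 2015 229544082 i've completed the daily quest in paradise island 2!",
]
for line in samples:
    print(strip_header(line))
```

Digits inside the tweet text survive because they are not preceded by " +" and two further digit groups. A tweet that itself contained such a sequence could still be over-trimmed.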

【Comments】:

【Answer 2】:

It can also be done without a regular expression:

import glob

for filename in glob.glob('file.txt'):
    with open("plain text - preprocesshurricane.txt", 'a') as outfile, open(filename, 'r') as infile:
        for tweet in infile.readlines():
            temp = tweet.split(' ')
            # Items 0-6 are the timestamp fields and the tweet ID; the rest is the text.
            print('{}'.format(' '.join(temp[7:])))

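【Comments】:

I believe this again gives undesired output. Doesn't it include Fri Nov? But I now realize that I only need to split, cut after the 7th space, and join. Thanks for the answer.

【Answer 3】:

The pattern you are looking for is .+ \d+:

import re
p = re.compile(r".+ \d+")
tweets = p.sub('', data) # data is the original string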

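Pattern breakdown

. matches any character, and + matches one or more of the preceding item, so .+ matches one or more characters. If we stopped there, however, we would remove all of the text.

So we end the pattern with \d+: \d matches any digit, so this matches any run of consecutive digits, the last of which is the tweet ID.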
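A quick check shows how the greedy .+ behaves in practice. The two lines are taken from the question's sample data; the variable names are only for illustration:

```python
import re

p = re.compile(r".+ \d+")

plain = "Fri Nov 13 20:27:20 +0000 2015 2495101558 woah what happened to twitter this update is horrible"
with_time = "Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19"

# ".+" consumes as much as it can, so the match ends at the LAST
# space-plus-digits sequence on the line, wherever that is.
print(repr(p.sub('', plain)))      # ' woah what happened to twitter this update is horrible'
print(repr(p.sub('', with_time)))  # ':27:19'
```

When the tweet text contains no digits after a space, the last digit run is the tweet ID and only a leading space is left over. When the text does contain such digits, most of the tweet is removed.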

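【Comments】:

Will check and get back to you.

Your pattern does not work for this line: Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19. It outputs :27:19

【Answer 4】:

In this case you can avoid regular expressions altogether. The lines of text you presented are consistent in the number of spaces that precede the tweet text. Just use split():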

>>> data = """
   lines with tweets here
"""
>>> for line in data.splitlines():
...     print(line.split(" ", 7)[-1])
... 
rt     we're treating one of you lads to this d'struct denim shirt! simply follow & rt to enter
this album is wonderful, i'm so proud of you, i loved this album, it really is the best.    -273
international break is garbage smh. it's boring and your players get injured
get weather updates from the weather channel. 15:27:19
woah what happened to twitter this update is horrible
i've completed the daily quest in paradise island 2!
new post: henderson memorial public library
who's going to  next week?
why so blue?    @ golden bee
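Applied to files, the same idea might look like the sketch below. The helper names tweet_text and extract are hypothetical, and io.StringIO objects stand in for the real input and output files so the example runs on its own. It assumes, as above, that exactly seven single-space-separated fields precede the text.

```python
import io

def tweet_text(line, header_fields=7):
    """Return everything after the first `header_fields` space-separated fields."""
    parts = line.rstrip('\n').split(' ', header_fields)
    # A blank or short line has no text part; return '' rather than a header field.
    return parts[-1] if len(parts) > header_fields else ''

def extract(infile, outfile):
    """Write the text part of every tweet line in `infile` to `outfile`."""
    for line in infile:
        text = tweet_text(line)
        if text:
            outfile.write(text + '\n')

# Stand-ins for open(filename) and open(output_name, 'a').
infile = io.StringIO(
    "Fri Nov 13 20:27:20 +0000 2015 3168571911 get weather updates from the weather channel. 15:27:19\n"
    "Fri Nov 13 20:27:19 +0000 2015 3031745900 why so blue?    @ golden bee\n"
)
outfile = io.StringIO()
extract(infile, outfile)
print(outfile.getvalue(), end='')
```

Because maxsplit stops after seven splits, repeated spaces and digits inside the tweet are left untouched.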

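【Comments】:

What is that [-1] doing? Setting the index before the space?

@MayurH line.split(" ", 7) splits a line at the first 7 spaces. It produces a list in which the tweet text is the last item, and we get it with the last index.

@MayurH The index -1 in <any-list>[-1] refers to the last position in <any-list> (it raises IndexError on an empty list). You can do fancier things such as <some-list>[-3:] to get a list of the last three elements, and so on.

I didn't know line.split() could take more than one argument. Thanks! I forgot that it returns a list. Now it makes sense.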
