Extracting data from text files

Posted 2023-03-28


【Title】: Extracting data from text files 【Posted】: 2020-12-20 03:58:37 【Question】:

I have a file containing nearly 2000 English tweets. It looks like this:

{"data":[{"no.":"1241583652212862978","created":"2020-03-22T04:33:04.000Z","tweet":"@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?"},{"no.":"1241583655538941959","created":"2020-03-22T04:33:05.000Z","tweet":" I know it’s from a few days ago, but these books are in good shape, .......]

I want to extract only the tweet text from this file. How can I pull out just the "tweet" parts? Any suggestions would be helpful. Thanks in advance.

【Comments】:

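Does this answer your question? Reading JSON from a file?
Hi @Rakesh, thanks for your reply, but that does not solve my problem. I am trying to solve this using only the "re" package, so it does not help me much.
No regex is needed here... it is a JSON file. You can access the information you need by key.
@Rakesh, the file is a ".txt" file, not a ".json" file. The problem I am working on requires me to use regular expressions.

【Answer 1】: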

Your file is in JSON format, whatever its extension. Take a look at Python's json library, which lets you extract the tweets. https://docs.python.org/3/library/json.html
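A minimal sketch of that approach, assuming the file's text is valid JSON with a top-level "data" list as in the question. Because the asker is restricted to "re", the sketch also applies the pattern suggested in the comments on this answer so the two results can be compared. The sample string, the second tweet's text, and the function names are made up for illustration; a real script would read the text from the file instead.

```python
import json
import re

# Hypothetical sample mirroring the structure shown in the question.
# In practice: raw_text = open("tweets.txt", encoding="utf-8").read()
# (the file name is an assumption).
raw_text = (
    '{"data":['
    '{"no.":"1241583652212862978","created":"2020-03-22T04:33:04.000Z",'
    '"tweet":"@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?"},'
    '{"no.":"1241583655538941959","created":"2020-03-22T04:33:05.000Z",'
    '"tweet":"Placeholder second tweet for this sketch"}'
    ']}'
)


def tweets_via_json(text):
    """Parse the whole text as JSON and collect every "tweet" value."""
    return [entry["tweet"] for entry in json.loads(text)["data"]]


def tweets_via_regex(text):
    """Capture whatever follows "tweet":" up to the next double quote."""
    return re.findall(r'"tweet":"(.*?)"', text)


print(tweets_via_json(raw_text))
print(tweets_via_regex(raw_text))
```

On this sample both functions return the same two strings. The regex version is more fragile: it stops at the first double quote, so a tweet containing an escaped quote (\") would be cut short, and JSON escapes such as \n are left undecoded. The extension of the file does not matter to json.loads, only its contents.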

【Comments】:

Hi @wildener, is it possible to solve this with a regular expression?
Well, JSON is by far the best solution, but yes, you can use this pattern: \"tweet\":\"(.*?)\" See it here: regex101.com/r/qfbjgY/1

【Answer 2】:

Assuming you use d to refer to the parsed object, it is simple:

tweet = d["data"][0]["tweet"]
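That line returns only the first tweet. Since the question asks for all of them, the same lookup can be repeated for every entry in the list. A small sketch, using a shortened, made-up stand-in for d rather than the asker's real data:

```python
# Hypothetical, shortened stand-in for the parsed object `d`.
d = {
    "data": [
        {"no.": "1", "tweet": "first sample tweet"},
        {"no.": "2", "tweet": "second sample tweet"},
    ]
}

# Collect the "tweet" field of every entry, not only entry 0.
all_tweets = [entry["tweet"] for entry in d["data"]]

for number, text in enumerate(all_tweets, start=1):
    print(number, text)
```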

Also, in case it helps, here is a worked example from my shell session using your sample data:

>>> d = {'data': [{'no.': '1241583652212862978', 'created': '2020-03-22T04:33:04.000Z', 'tweet': '@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?'}, {'no.': '1241583655538941959', 'created': '2020-03-22T04:33:05.000Z', 'tweet': ' I know it’s from a few days ago, but these books are in good shape'}]}
>>> print(d["data"])
[{'no.': '1241583652212862978', 'created': '2020-03-22T04:33:04.000Z', 'tweet': '@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?'}, {'no.': '1241583655538941959', 'created': '2020-03-22T04:33:05.000Z', 'tweet': ' I know it’s from a few days ago, but these books are in good shape'}]
>>> print(d["data"][0]["tweet"])
@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?

【Comments】:
