How to extract URL from Pandas DataFrame?

Posted 2023-03-12

【Asked】: 2020-09-25 12:22:51 【Question】:

I need to extract URLs from a column of a DataFrame created from the following values:

creation_date,tweet_id,tweet_text
2020-06-06 03:01:37,1269102116364324865,#Webinar: Sign up for @SumoLogic's June 16 webinar to learn how to navigate your #Kubernetes environment and unders… https://***.com/questions/42237666/extracting-information-from-pandas-dataframe
2020-06-06 01:29:38,1269078966985461767,"In this #webinar replay, @DisneyStreaming's @rothgar chats with @SumoLogic's @BenoitNewton about how #Kubernetes is… https://***.com/questions/46928636/pandas-split-list-into-columns-with-regex

The tweet_text column contains the URLs. I am trying the following code.

df["tweet_text"]=df["tweet_text"].astype(str)
pattern = r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'

df['links'] = ''
df['links']= df["tweet_text"].str.extract(pattern, expand=True)

print(df)

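I am using the regular expression from an answer to this question, and it matches the URLs in both rows. But I get NaN as the value of the new column df['links']. I also tried the solution from the first answer to this question, namely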

df['links']= df["tweet_text"].str.extract(pattern, expand=False).str.strip()

But I get the following error:

AttributeError: 'DataFrame' object has no attribute 'str'

Finally, I created an empty column with df['links'] = '' because I was getting the error ValueError: Wrong number of items passed 2, placement implies 1, in case that is relevant. Can anyone help me?

【Comments】:

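Your URL pattern is not very clean, but the main problem is that it contains capturing groups where you need non-capturing ones. You then need to wrap the whole thing in a single capturing group: pattern = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&/=]*)'

It worked, thanks. Could you move this comment to an answer so that I can mark it as accepted?

【Answer 1】: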

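The main problem is that your URL pattern contains capturing groups where you need non-capturing groups. You need to replace every ( in the pattern with (?:.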

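However, that is not enough, because str.extract requires a capturing group in the pattern so that it has something to return. So you need to wrap the whole pattern in a capturing group.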
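To make the point about group counts concrete, here is a minimal sketch of how str.extract shapes its result. The sample text and the t.co style link are made up for illustration; the two patterns are the one from the question and the corrected one, with the {1,256} and {1,6} quantifiers written out.

```python
import pandas as pd

# Hypothetical sample text; the short link is a placeholder, not real data.
s = pd.Series(["webinar replay https://t.co/abc123", "no link here"])

# Pattern from the question: two capturing groups.
two_group_pattern = (
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}'
    r'\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'
)
# Corrected pattern: inner group made non-capturing, one group around everything.
one_group_pattern = (
    r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}'
    r'\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&/=]*)'
)

# Two groups -> a DataFrame with one column per group, even with expand=False.
two_groups = s.str.extract(two_group_pattern, expand=False)
# One group with expand=False -> a Series, so .str and column assignment work.
one_group = s.str.extract(one_group_pattern, expand=False)

print(type(two_groups).__name__, two_groups.shape)
print(type(one_group).__name__)
print(one_group.tolist())
```

With two groups the first column only ever holds the optional "www." prefix, which is missing whenever the URL has no such prefix. That would be consistent with the NaN values and the AttributeError reported in the question, although the exact behavior may depend on the pandas version.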

You can use

pattern = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&/=]*)'

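Note that + does not have to be escaped inside a character class. Also, there is no need for // in a character class; a single / is enough.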
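Putting the pieces together, the following is a small end-to-end sketch under stated assumptions: the rows and the example.com, example.org and example.net URLs are placeholders, because the links in the question's data are masked. The str.findall step is an optional addition, not part of the original answer, for rows that may contain more than one URL.

```python
import pandas as pd

# Hypothetical rows shaped like the question's data; the URLs are placeholders.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "tweet_text": [
        "#Webinar: sign up now https://example.com/questions/42237666",
        "Replay here http://www.example.org/a?b=1 and https://example.net/x",
        "no link in this tweet",
    ],
})

# Corrected pattern: a single capturing group around the whole URL.
pattern = (
    r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}'
    r'\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&/=]*)'
)

# One group + expand=False -> a Series with the first URL of each row (NaN if none).
df["links"] = df["tweet_text"].astype(str).str.extract(pattern, expand=False)

# Optional: str.findall collects every match in a row as a list.
df["all_links"] = df["tweet_text"].str.findall(pattern)

print(df[["links", "all_links"]])
```

Because extract now returns a Series, there is no need to pre-create the column with df['links'] = ''.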
