Removing rows that do not start with/contain specific words

Posted 2023-03-12


【Title】: Removing rows that do not start with/contain specific words 【Posted】: 2020-09-23 20:54:51 【Question】:

I have the following output

Age
'1 year old',
'14 years old', 
'music store', 
'7 years old ',
'16 years old ',

which was created with these lines of code:

df['Age']=df['Age'].str.split('.', expand=True,n=0)[0]
df['Age'].tolist()

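I want to remove from the dataset the rows that do not start with a number, or with number + year + old, or with number + years + old (preferably working on a copy, or producing a new filtered dataset).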

Expected output

Age (in a new dataset filtered)
'1 year old',
'14 years old', 
'7 years old ',
'16 years old ',

How can I do this?

【Comments】:

Filter with a regex: ***.com/questions/15325182/…

df['Age'].str.startswith() is a good starting point, or df['Age'].str.contains()

Using df['Age'] = [x for x in df['Age'] if not x.startswith('\d+')] I get this: AttributeError: 'bool' object has no attribute 'startswith'

You can't use a regex with startswith; it only works on literal text, so to speak.

【Answer 1】:

Use Series.str.contains to create a boolean mask, then filter the dataframe with it:

m = df['Age'].str.contains(r'(?i)^\d+\syears?\sold')
df1 = df[m]

Result:

# print(df1)
             Age
0     1 year old
1   14 years old 
3    7 years old
4   16 years old

You can test the regex pattern here.
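For anyone who wants to run this end to end, here is a minimal sketch of the same mask approach. The sample frame is rebuilt from the values in the question (with the surrounding quotes and commas dropped, which is an assumption about the real data). `na=False` is added so that missing or non-string entries count as non-matches, and `.copy()` returns an independent filtered frame, as the question asks.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed: plain strings, no quotes)
df = pd.DataFrame({"Age": ["1 year old", "14 years old", "music store",
                           "7 years old ", "16 years old "]})

# Case-insensitive: digits, whitespace, "year" or "years", whitespace, "old".
# na=False treats missing or non-string values as non-matches.
mask = df["Age"].str.contains(r"(?i)^\d+\syears?\sold", na=False)

# .copy() makes the filtered result independent of the original frame
df1 = df[mask].copy()
print(df1)
```

The original `df` is left untouched, so the dropped rows can still be inspected with `df[~mask]`.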

【Comments】:

Thanks @Shubham Sharma. How can I include an OR condition in m? Would this work: df['Age'].str.contains(r'(?i)^\d+\syears | otherword')? Thank you

@Math Yes, that works, but in that case it matches strings such as 10 year, 20 YEARS, 30 Years, otherword, ...

【Answer 2】:

The code below looks for text that starts with an apostrophe followed by digits, and keeps only those rows:

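import pandas as pd

# Read the sample copied from the question off the clipboard
df = pd.read_clipboard(sep=';')

# Raw string avoids the invalid "\d" escape; match anchors at the start
df.loc[df.Age.str.match(r"'\d+")]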

            Age
0   '1 year old',
1   '14 years old',
3   '7 years old ',
4   '16 years old ',

Note that this only checks for an apostrophe followed by digits; @Shubham's solution covers more.
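To make that difference concrete, here is a small sketch comparing a digits-only pattern with the fuller phrase pattern. The values are hypothetical: the apostrophes are dropped and "14 bananas" is added purely to show a row that the looser pattern keeps and the stricter one rejects.

```python
import pandas as pd

# Hypothetical values; "14 bananas" is included only to show the difference
ages = pd.Series(["1 year old", "14 years old", "music store", "14 bananas"])

# str.match anchors at the start of each string, so no leading ^ is needed
digits_only = ages.str.match(r"\d+")
full_phrase = ages.str.match(r"(?i)\d+\syears?\sold")

print(ages[digits_only].tolist())   # keeps anything that starts with digits
print(ages[full_phrase].tolist())   # keeps only "<number> year(s) old" rows
```

Which pattern is preferable depends on how much noise the column contains; the stricter one is safer when other values may also begin with a number.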
