How to extract the first 8 characters from a string in pandas

【Title】: How to extract first 8 characters from a string in pandas 【Posted】: 2019-01-07 11:55:48 【Question】:

I have a column in a DataFrame, and I am trying to extract the first 8 digits from each string. How can I do that?

Input

Shipment ID
20180504-S-20000
20180514-S-20537
20180514-S-20541
20180514-S-20644
20180514-S-20644
20180516-S-20009
20180516-S-20009
20180516-S-20009
20180516-S-20009

Expected output

Order_Date
20180504
20180514
20180514
20180514
20180514
20180516
20180516
20180516
20180516

I tried the code below, but it did not work.

data['Order_Date'] = data['Shipment ID'][:8]

【Comments on the question】:

【Answer 1】:

You can also remove everything from the first hyphen (the -S part) to the end of the string:

df["Order_Date"]=df['Shipment ID'].replace(regex=r"\-.*",value="")
df
        Shipment ID Order_Date
0  20180504-S-20000   20180504
1  20180514-S-20537   20180514
2  20180514-S-20541   20180514
3  20180514-S-20644   20180514
4  20180514-S-20644   20180514
5  20180516-S-20009   20180516
6  20180516-S-20009   20180516
7  20180516-S-20009   20180516
8  20180516-S-20009   20180516

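You can also capture the first 8 digits, match everything that follows them, and replace the whole string with a backreference to the capture group: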

df['Shipment ID'].replace(regex=r"(\d{8}).*",value="\\1")
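
As a minimal sketch, the two replacements in this answer can be run side by side on a shortened sample of the question's data. The names `stripped` and `captured` are illustrative and do not come from the answer. The quantifier is written as `\d{8}`, with braces, so that it matches exactly eight digits.

```python
import pandas as pd

# A shortened sample of the Shipment ID column from the question.
df = pd.DataFrame(
    {"Shipment ID": ["20180504-S-20000", "20180514-S-20537", "20180516-S-20009"]}
)

# Approach 1: delete everything from the first hyphen to the end of the string.
stripped = df["Shipment ID"].replace(regex=r"-.*", value="")

# Approach 2: capture the leading 8 digits, match the rest of the string,
# and replace the whole match with the captured group.
captured = df["Shipment ID"].replace(regex=r"(\d{8}).*", value=r"\1")

print(stripped.tolist())
print(captured.tolist())
```

Both approaches should give the same result for this data. The first depends on the hyphen being present, while the second depends on the string starting with eight digits.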

【Comments】:

【Answer 2】:

You can also use str.extract.

For example:

import pandas as pd

df = pd.DataFrame({'Shipment ID': ['20180504-S-20000', '20180514-S-20537', '20180514-S-20541', '20180514-S-20644', '20180514-S-20644', '20180516-S-20009', '20180516-S-20009', '20180516-S-20009', '20180516-S-20009']})
# \d{8} matches exactly eight digits; expand=False returns a Series rather than a one-column DataFrame
df["Order_Date"] = df["Shipment ID"].str.extract(r"(\d{8})", expand=False)
print(df)

Output:

        Shipment ID Order_Date
0  20180504-S-20000   20180504
1  20180514-S-20537   20180514
2  20180514-S-20541   20180514
3  20180514-S-20644   20180514
4  20180514-S-20644   20180514
5  20180516-S-20009   20180516
6  20180516-S-20009   20180516
7  20180516-S-20009   20180516
8  20180516-S-20009   20180516
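
One property of str.extract worth knowing is that rows where the pattern does not match come back as missing values instead of raising an error. The sketch below illustrates this with a hypothetical non-matching value, "N/A", which is not part of the question's data. The `^` anchor is also an addition here, so that only a leading run of eight digits is accepted.

```python
import pandas as pd

# "N/A" is a made-up value added only to show the non-matching case.
s = pd.Series(["20180504-S-20000", "N/A", "20180516-S-20009"])

# ^ anchors the match to the start of the string; expand=False returns a Series.
dates = s.str.extract(r"^(\d{8})", expand=False)

print(dates)
```

The second element of `dates` is missing, so a check such as `dates.isna()` can be used to find rows that did not follow the expected format.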

【Comments】:

【Answer 3】:

You were close. You need to index through the str accessor, which applies the slice to each value of the Series:

data['Order_Date'] = data['Shipment ID'].str[:8]

If the column contains no NaN values, a list comprehension gives better performance:

data['Order_Date'] = [x[:8] for x in data['Shipment ID']]

print (data)
        Shipment ID Order_Date
0  20180504-S-20000   20180504
1  20180514-S-20537   20180514
2  20180514-S-20541   20180514
3  20180514-S-20644   20180514
4  20180514-S-20644   20180514
5  20180516-S-20009   20180516
6  20180516-S-20009   20180516
7  20180516-S-20009   20180516
8  20180516-S-20009   20180516
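
The reason for the "no NaN values" condition can be sketched as follows, assuming a column with one missing entry (the missing value is added here for illustration and is not in the question's data). The str accessor passes the missing value through, while the plain list comprehension tries to slice a float and fails. The `guarded` variant is one possible workaround, not something the answer proposes.

```python
import numpy as np
import pandas as pd

# The middle value is deliberately missing to show the difference.
data = pd.DataFrame(
    {"Shipment ID": ["20180504-S-20000", np.nan, "20180516-S-20009"]}
)

# The .str accessor slices each string and leaves missing values missing.
via_str = data["Shipment ID"].str[:8]

# The plain comprehension raises TypeError, because a missing value
# is not a string and cannot be sliced.
try:
    via_list = [x[:8] for x in data["Shipment ID"]]
    failed = False
except TypeError:
    failed = True

# A guarded comprehension slices only real strings and keeps everything else as is.
guarded = [x[:8] if isinstance(x, str) else x for x in data["Shipment ID"]]

print(via_str.tolist())
print(failed)
print(guarded)
```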

If you omit str, the code slices the column by position and returns the first N rows, for example:

print (data['Shipment ID'][:2])
0    20180504-S-20000
1    20180514-S-20537
Name: Shipment ID, dtype: object
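
This also explains what the question's original assignment does. A small sketch, using a three-row sample and a slice of 2 so the effect is visible (the column name `Wrong` is made up for the demonstration): the slice keeps whole strings from the first rows, and because assignment aligns on the index, the remaining rows are filled with NaN.

```python
import pandas as pd

data = pd.DataFrame(
    {"Shipment ID": ["20180504-S-20000", "20180514-S-20537", "20180516-S-20009"]}
)

# Without .str, the slice selects the first 2 rows; each value is still a full string.
first_rows = data["Shipment ID"][:2]

# Assigning the shorter Series aligns on the index, so row 2 becomes NaN.
data["Wrong"] = first_rows

# With .str, the slice is applied to each string instead.
data["Order_Date"] = data["Shipment ID"].str[:8]

print(data)
```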

【Comments】:
