Extracting date time from a mixed letter and numeric column in pandas
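【Title】: Extracting date time from a mixed letter and numeric column pandas
【Posted】: 2022-01-02 03:49:13
【Question】: I have a column in a pandas DataFrame that holds two kinds of information: (1) a date and time, and (2) a company name. I need to split it into two columns (date_time, full_company_name). I first tried splitting on character count (the first 19 characters into one column, the rest into the other), but then realized the date is sometimes missing, so that split may not work. I then tried a regular expression, but I can't get it to extract the values correctly.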
Column:
Desired output:
【Comments】:
【Answer 1】: If the dates are all well-formed, you may not need a regular expression at all:
import pandas as pd

df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "2021-01-01 06:00:00Acme LLC"]})
# A "YYYY-MM-DD HH:MM:SS" timestamp is always 19 characters long
df["date"] = pd.to_datetime(df.A.str[:19])
df["company"] = df.A.str[19:]
df
#                                     A                date          company
# 0  2021-01-01 05:00:00Acme Industries 2021-01-01 05:00:00  Acme Industries
# 1         2021-01-01 06:00:00Acme LLC 2021-01-01 06:00:00         Acme LLC
Or:
df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(.*)")
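Since the question mentions that the timestamp is sometimes missing, here is a minimal sketch of how the regex variant could tolerate that. It assumes the timestamp, when present, is always a prefix of the string. The sample row without a date and the group names (taken from the column names in the question) are illustrative assumptions, not data from the original post.

```python
import pandas as pd

# Hypothetical sample: the second row has no timestamp at all.
df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "Acme LLC"]})

# The trailing "?" makes the timestamp group optional, so a row without
# a date still matches and only its company name is filled in.
pattern = (r"^(?P<date_time>\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})?"
           r"(?P<full_company_name>.*)$")

parts = df["A"].str.extract(pattern)
# Rows with no timestamp become NaT instead of raising an error.
parts["date_time"] = pd.to_datetime(parts["date_time"], errors="coerce")
print(parts)
```

Named groups become the column names of the extracted frame, which avoids a separate rename step.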
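【Comments】:
Thanks! The problem is that the dates may be missing in some later entries (I only shared a snippet of the column).
【Answer 2】: Note: if you have the option to avoid concatenating these strings in the first place, take it. It is not a healthy habit.
Solution (not that pretty):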
import pandas as pd
from datetime import datetime
import re
df = pd.DataFrame()
# creating a list of companies
companies = ['Google', 'Apple', 'Microsoft', 'Facebook', 'Amazon', 'IBM',
'Oracle', 'Intel', 'Yahoo', 'Alphabet']
# creating a list of sample datetime objects (January 1 of 2000 through 2009)
dates = [datetime(year=2000 + i, month=1, day=1) for i in range(10)]
# creating the column named 'date_time/full_company_name'
df['date_time/full_company_name'] = [f'{str(dates[i])}{companies[i]}' for i in range(len(companies))]
# Before:
# date_time/full_company_name
# 2000-01-01 00:00:00Google
# 2001-01-01 00:00:00Apple
# 2002-01-01 00:00:00Microsoft
# 2003-01-01 00:00:00Facebook
# 2004-01-01 00:00:00Amazon
# 2005-01-01 00:00:00IBM
# 2006-01-01 00:00:00Oracle
# 2007-01-01 00:00:00Intel
# 2008-01-01 00:00:00Yahoo
# 2009-01-01 00:00:00Alphabet
new_rows = []
for row in df['date_time/full_company_name']:
    # extract the date_time from the start of the row using regex
    # (re.match anchors at the start, which the slice below relies on)
    date_time = re.match(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', row)
    # handle case of empty date_time
    date_time = date_time.group() if date_time else ''
    # extract the company name from the row from where the date_time ends
    company_name = row[len(date_time):]
    # create a new row with the extracted date_time and company_name
    new_rows.append([date_time, company_name])
# drop the column 'date_time/full_company_name'
df = df.drop(columns=['date_time/full_company_name'])
# add the new columns to the dataframe: 'date_time' and 'company_name'
df['date_time'] = [row[0] for row in new_rows]
df['company_name'] = [row[1] for row in new_rows]
# After:
# date_time company_name
# 2000-01-01 00:00:00 Google
# 2001-01-01 00:00:00 Apple
# 2002-01-01 00:00:00 Microsoft
# 2003-01-01 00:00:00 Facebook
# 2004-01-01 00:00:00 Amazon
# 2005-01-01 00:00:00 IBM
# 2006-01-01 00:00:00 Oracle
# 2007-01-01 00:00:00 Intel
# 2008-01-01 00:00:00 Yahoo
# 2009-01-01 00:00:00 Alphabet
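The row-by-row loop above can also be expressed with pandas string methods. The following is only a sketch of that alternative, not the answer author's method. It assumes the timestamp, when present, sits at the very start of the string, and the row without a date is a hypothetical example added to show the missing-date case.

```python
import pandas as pd

col = "date_time/full_company_name"
# Hypothetical sample: the second row has no timestamp.
df = pd.DataFrame({col: ["2000-01-01 00:00:00Google",
                         "Apple",
                         "2002-01-01 00:00:00Microsoft"]})

# Timestamp anchored at the start of the string.
stamp = r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"

# expand=False returns a Series because the pattern has one group;
# rows without a timestamp come back as NaN and are coerced to NaT.
df["date_time"] = pd.to_datetime(
    df[col].str.extract(f"({stamp})", expand=False), errors="coerce"
)
# Strip the timestamp prefix, leaving only the company name.
df["company_name"] = df[col].str.replace(stamp, "", regex=True)
df = df.drop(columns=[col])
print(df)
```

Unlike the loop, this keeps date_time as a real datetime column, with NaT rather than an empty string where the date is missing.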
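【Comments】:
【Answer 3】: Use an uncaptured ?.* in place of (.*), so the rest of the string is matched but not returned:
df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "2021-01-01 06:00:00Acme LLC"]})
df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})?.*")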
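To make the difference between the two patterns concrete, here is a small comparison sketch. The variable names and the row without a date are assumptions made for illustration only.

```python
import re
import pandas as pd

stamp = r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}"
two_groups = f"({stamp})(.*)"   # date required, rest captured
one_group = f"({stamp})?.*"     # date optional, rest matched but not captured

# Hypothetical sample: the second row has no timestamp.
s = pd.Series(["2021-01-01 05:00:00Acme Industries", "Acme LLC"])

both = s.str.extract(two_groups)   # two columns; no date means no match at all
dates = s.str.extract(one_group)   # one column; no date gives NaN in that column

text = "2021-01-01 05:00:00Acme Industries"
print(re.match(two_groups, text).groups())
print(re.match(one_group, text).groups())
print(both)
print(dates)
```

With the first pattern a row that lacks a date loses its company name as well, because the whole pattern fails to match. The second pattern always matches but only ever returns the date.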
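【Comments】:
Thanks, but how is this different from df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(.*)")?
In ?.* the ? makes the date group optional and the .* that follows is not inside a capturing group, so it is matched but not returned as a group result (for example by re.findall).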