Count common entries between two string variables via Python
[Title]: Count common entries between two string variables via Python
[Posted]: 2016-10-30 04:48:26
[Question]: I would really appreciate help counting the number of matching state names in two columns of my csv file. For example, consider the first 7 observations of the State_born_in and state_lives_in columns:
State_born_in    State_lives_in
New York         Florida
Massachusetts    Massachusetts
Florida          Massachusetts
Illinois         Illinois
Iowa             Texas
New Hampshire    Massachusetts
California       California
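Basically, I want to count the number of people who live in the same state they were born in, and then express that count as a percentage of all people. In the example above the count would be 3, since three people live in the state where they were born (Massachusetts, Illinois, and California). For the percentage, I would just divide that count by the number of observations. I am fairly new to pandas, but this is what I have tried so far: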
df = pd.read_csv("uscitizens.csv","a")
import pandas as pd
counts = df[(df['State_born_in'] == df['state_lives_in'])] ; counts
percentage = counts/len(df['State_born_in'])
Also, how would I do this on a dataset with more than 2 million observations? I would greatly appreciate anyone's help.
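[Answer 1]: You can first use boolean indexing, then simply divide the length of the filtered DataFrame by the length of the original (taking the length of the index is the fastest way to get it):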
print (df)
State_born_in State_lives_in
0 New York Florida
1 Massachusetts Massachusetts
2 Massachusetts Massachusetts
3 Massachusetts Massachusetts
4 Florida Massachusetts
5 Illinois Illinois
6 Iowa Texas
7 New Hampshire Massachusetts
8 California California
same = df[(df['State_born_in'] == df['State_lives_in'])]
print (same)
State_born_in State_lives_in
1 Massachusetts Massachusetts
2 Massachusetts Massachusetts
3 Massachusetts Massachusetts
5 Illinois Illinois
8 California California
counts = len(same.index)
print (counts)
5
percentage = 100 * counts/len(df.index)
print (percentage)
55.55555555555556
Timings:
In [21]: %timeit len(same.index)
The slowest run took 18.82 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 546 ns per loop
In [22]: %timeit same.shape[0]
The slowest run took 21.82 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.37 µs per loop
In [23]: %timeit len(same['State_born_in'])
The slowest run took 46.92 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 10.3 µs per loop
A faster solution:
same = (df['State_born_in'] == df['State_lives_in'])
print (same)
0 False
1 True
2 True
3 True
4 False
5 True
6 False
7 False
8 True
dtype: bool
counts = same.sum()
print (counts)
5
percentage = 100 * counts/len(df.index)
print (percentage)
55.5555555556
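The mask-based approach above can be wrapped in a small helper. This is only a sketch: the function name same_state_stats and its parameters are hypothetical, and it uses Series.mean() on the boolean mask as a shortcut for dividing the sum by the number of rows. The sample data is the 7-row table from the question.

```python
import pandas as pd

def same_state_stats(df, born_col="State_born_in", lives_col="State_lives_in"):
    """Return (count, percentage) of rows where the two columns are equal.

    Hypothetical helper; the column names default to those in the question.
    """
    same = df[born_col] == df[lives_col]   # boolean Series, True where states match
    count = int(same.sum())                # True counts as 1
    percentage = 100 * same.mean()         # mean of booleans = share of True values
    return count, percentage

# The 7 observations from the question
df = pd.DataFrame({
    "State_born_in": ["New York", "Massachusetts", "Florida", "Illinois",
                      "Iowa", "New Hampshire", "California"],
    "State_lives_in": ["Florida", "Massachusetts", "Massachusetts", "Illinois",
                       "Texas", "Massachusetts", "California"],
})

count, percentage = same_state_stats(df)
print(count, percentage)
```

Note that the comparison is exact, so values that differ only in case or surrounding whitespace would not be counted as a match.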
Timings on a DataFrame with 2M rows:
#[2000000 rows x 2 columns]
df = pd.concat([df]*200000).reset_index(drop=True)
#print (df)
In [127]: %timeit (100 * (df['State_born_in'] == df['State_lives_in']).sum()/len(df.index))
1 loop, best of 3: 444 ms per loop
In [128]: %timeit (100 * len(df[(df['State_born_in'] == df['State_lives_in'])].index)/len(df.index))
1 loop, best of 3: 472 ms per loop
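If the full file is too large to load comfortably, the same comparison can be applied chunk by chunk. The following is a minimal sketch rather than part of the original answer: it assumes a comma-separated file with header names State_born_in and State_lives_in, the function name and the chunk size are hypothetical, and an in-memory buffer stands in for the real file.

```python
import io
import pandas as pd

def same_state_percentage_chunked(source, chunksize=500_000):
    """Count matching rows while reading the CSV in chunks (hypothetical helper)."""
    matches = 0
    total = 0
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows
    reader = pd.read_csv(
        source,
        usecols=["State_born_in", "State_lives_in"],  # assumed header names
        chunksize=chunksize,
    )
    for chunk in reader:
        matches += int((chunk["State_born_in"] == chunk["State_lives_in"]).sum())
        total += len(chunk.index)
    percentage = 100 * matches / total if total else float("nan")
    return matches, percentage

# Stand-in for the real file: the 7 rows from the question
sample = io.StringIO(
    "State_born_in,State_lives_in\n"
    "New York,Florida\n"
    "Massachusetts,Massachusetts\n"
    "Florida,Massachusetts\n"
    "Illinois,Illinois\n"
    "Iowa,Texas\n"
    "New Hampshire,Massachusetts\n"
    "California,California\n"
)

matches, percentage = same_state_percentage_chunked(sample, chunksize=2)
print(matches, percentage)
```

Only the running totals are kept between chunks, so memory use depends on the chunk size rather than on the size of the file.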
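[Comments]:
Thank you! Anyway, how could I do this for a large dataset with about 2 million observations?
I think you can use this solution; it handles large DataFrames well. Or is there a problem?
I fixed it :). Basically, I rewrote df = pd.read_csv("uscitizens.csv","a") as with open('uscitizens.csv', 'r') as f: so that the file is treated as an object instead of being held in memory. I still followed your approach, comparing line[2] == line[3], where line[2] is State_born_in and line[3] is state_lives_in. Thank you for your help.
[Answer 2]: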
Is this what you are expecting?
counts = df[ df['State_born_in'] == df['State_lives_in'] ].groupby('State_born_in').agg(['count']).sum()
counts / len(df['State_born_in'])
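Grouping the matching rows before summing also gives a per-state breakdown along the way. The sketch below illustrates that intermediate step on the 9-row sample from Answer 1; it swaps in groupby(...).size() for agg(['count']), which is a simplification and not the code posted in this answer.

```python
import pandas as pd

# The 9-row sample used in Answer 1
df = pd.DataFrame({
    "State_born_in": ["New York", "Massachusetts", "Massachusetts", "Massachusetts",
                      "Florida", "Illinois", "Iowa", "New Hampshire", "California"],
    "State_lives_in": ["Florida", "Massachusetts", "Massachusetts", "Massachusetts",
                       "Massachusetts", "Illinois", "Texas", "Massachusetts", "California"],
})

# Keep only the rows where both columns hold the same state
same = df[df["State_born_in"] == df["State_lives_in"]]

# Number of matching rows per state
per_state = same.groupby("State_born_in").size()
print(per_state)

# Summing the per-state counts gives the overall count of matching rows
total = int(per_state.sum())
share = total / len(df.index)
print(total, share)
```

If only the overall count is needed, the grouping step is unnecessary and summing the boolean mask directly is simpler.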