Counting common entries between two string variables via Python

Posted: 2016-10-30 04:48:26

Question:

I would really appreciate help counting the number of matching state names across two columns of my csv file. For example, consider the first 7 observations of the State_born_in and State_lives_in columns:

State_born_in   State_lives_in
New York    Florida
Massachusetts   Massachusetts
Florida Massachusetts
Illinois    Illinois 
Iowa    Texas
New Hampshire   Massachusetts
California  California

Basically, I want to count the number of people who live in the same state they were born in, and then get that count as a percentage of all people. So in the example above the count would be 3, because three people (in Massachusetts, Illinois, and California) live in the same state they were born in. For the percentage, I would just divide 3 by the number of observations. I am fairly new to pandas, but this is what I have tried so far:

import pandas as pd

df = pd.read_csv("uscitizens.csv")
counts = df[df['State_born_in'] == df['State_lives_in']]; counts
percentage = counts / len(df['State_born_in'])

Also, how would I do this on a dataset with more than 2 million observations? I would really appreciate anyone's help.
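As context for the memory concern above, one way to count matches without holding the whole file in memory is pandas' chunked reader. This is only a sketch: the in-memory CSV stands in for the `uscitizens.csv` file from the question, and the chunk size is illustrative.

```python
import pandas as pd
from io import StringIO

# Stand-in for "uscitizens.csv"; in practice pass the filename instead.
csv_data = StringIO(
    "State_born_in,State_lives_in\n"
    "New York,Florida\n"
    "Massachusetts,Massachusetts\n"
    "Florida,Massachusetts\n"
    "Illinois,Illinois\n"
    "Iowa,Texas\n"
    "New Hampshire,Massachusetts\n"
    "California,California\n"
)

matches = 0
total = 0
# Read the file in chunks so only one chunk is in memory at a time.
for chunk in pd.read_csv(csv_data, chunksize=3):
    matches += (chunk['State_born_in'] == chunk['State_lives_in']).sum()
    total += len(chunk.index)

percentage = 100 * matches / total
print(matches, total, percentage)
```

With the 7 example rows this counts 3 matches, so the percentage is 300/7 ≈ 42.86.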


Answer 1:

You can first use boolean indexing to filter the rows, then simply divide the length of the filtered DataFrame by the length of the original (taking the length of the index is fastest):

print (df)
   State_born_in State_lives_in
0       New York        Florida
1  Massachusetts  Massachusetts
2  Massachusetts  Massachusetts
3  Massachusetts  Massachusetts
4        Florida  Massachusetts
5       Illinois       Illinois
6           Iowa          Texas
7  New Hampshire  Massachusetts
8     California     California

same = df[(df['State_born_in'] == df['State_lives_in'])] 
print (same)
   State_born_in State_lives_in
1  Massachusetts  Massachusetts
2  Massachusetts  Massachusetts
3  Massachusetts  Massachusetts
5       Illinois       Illinois
8     California     California

counts = len(same.index)
print (counts)
5

percentage = 100 * counts/len(df.index)
print (percentage)
55.55555555555556

Timings:

In [21]: %timeit len(same.index)
The slowest run took 18.82 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 546 ns per loop

In [22]: %timeit same.shape[0]
The slowest run took 21.82 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.37 µs per loop

In [23]: %timeit len(same['State_born_in'])
The slowest run took 46.92 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 10.3 µs per loop

A faster solution:

same = (df['State_born_in'] == df['State_lives_in'])
print (same)
0    False
1     True
2     True
3     True
4    False
5     True
6    False
7    False
8     True
dtype: bool

counts = same.sum()
print (counts)
5

percentage = 100 * counts/len(df.index)
print (percentage)
55.5555555556

Timings on a 2M-row DataFrame:

#[2000000 rows x 2 columns]
df = pd.concat([df]*200000).reset_index(drop=True)
#print (df)


In [127]: %timeit (100 * (df['State_born_in'] == df['State_lives_in']).sum()/len(df.index))
1 loop, best of 3: 444 ms per loop

In [128]: %timeit (100 * len(df[(df['State_born_in'] == df['State_lives_in'])].index)/len(df.index))
1 loop, best of 3: 472 ms per loop
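An even more concise variant of the faster solution builds on the fact that the mean of a boolean Series is the fraction of `True` values, so the division by the length drops out entirely (a sketch, assuming the same column names as above):

```python
import pandas as pd

df = pd.DataFrame({
    'State_born_in':  ['New York', 'Massachusetts', 'Florida', 'Illinois',
                       'Iowa', 'New Hampshire', 'California'],
    'State_lives_in': ['Florida', 'Massachusetts', 'Massachusetts', 'Illinois',
                       'Texas', 'Massachusetts', 'California'],
})

# mean() of a boolean mask == (number of True) / (number of rows),
# so no explicit len(df.index) division is needed.
percentage = 100 * (df['State_born_in'] == df['State_lives_in']).mean()
print(percentage)
```

Here 3 of the 7 rows match, so this prints 300/7 ≈ 42.86.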

Comments:

Thank you! How would I do this on a large dataset with about 2 million observations, though?

I think you can use this solution — it works well with large DataFrames. Or is there some problem?

I fixed it :) Basically I rewrote df = pd.read_csv("uscitizens.csv","a") as with open('uscitizens.csv', 'r') as f: so the file is treated as an object rather than stored in memory, but I followed your approach with line[2] == line[3], where line[2] is State_born_in and line[3] is State_lives_in. Thank you for your help!

Answer 2:

Is this what you are expecting?

counts = df[ df['State_born_in'] == df['State_lives_in'] ].groupby('State_born_in').agg(['count']).sum()
counts / len(df['State_born_in'])
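Since the groupby above already splits matches by state, a per-state breakdown can also be sketched more directly with `value_counts` (using the question's column names; the total agrees with the boolean-sum approach in the other answer):

```python
import pandas as pd

df = pd.DataFrame({
    'State_born_in':  ['Massachusetts', 'Massachusetts', 'Illinois',
                       'California', 'Iowa'],
    'State_lives_in': ['Massachusetts', 'Florida', 'Illinois',
                       'California', 'Texas'],
})

# Keep only the birth-state values of the rows where both columns match,
# then count how many matches each state contributes.
same = df.loc[df['State_born_in'] == df['State_lives_in'], 'State_born_in']
per_state = same.value_counts()   # matches per state
total = per_state.sum()           # overall number of matches
print(per_state)
print(total)
```

For this small sample, Massachusetts, Illinois, and California each contribute one match, for a total of 3.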

