How to create a filtered DataFrame with minimum code
Posted: 2017-01-07 12:36:30
Question: There are four cars: bmw, geo, vw and porsche:
import pandas as pd
df = pd.DataFrame({
    'car': ['bmw','geo','vw','porsche'],
    'warranty': ['yes','yes','yes','no'],
    'dvd': ['yes','yes','no','yes'],
    'sunroof': ['yes','no','no','no']})
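I would like to create a filtered DataFrame that lists only the cars having all three features: a DVD player, a sunroof and a warranty (we know it is the BMW here, which has every feature set to 'yes').
I can do it one column at a time: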
cars_with_warranty = df['car'][df['warranty']=='yes']
print(cars_with_warranty)
Then I would need to do the same calculation for the dvd and sunroof columns:
cars_with_dvd = df['car'][df['dvd']=='yes']
cars_with_sunroof = df['car'][df['sunroof']=='yes']
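I wonder if there is a clever way of creating the filtered DataFrame?
Edited later:
The posted solution works well. But the resulting cars_with_all_three is a simple list variable. We need a DataFrame object with the single 'bmw' car as its only row and all three columns: dvd, sunroof and warranty (with all three values set to 'yes').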
cars_with_all_three = []
for ind, car in enumerate(df['car']):
    if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes':
        cars_with_all_three.append(car)
Answer 1: You can use a simple loop with enumerate:
cars_with_all_three = []
for ind, car in enumerate(df['car']):
    if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes':
        cars_with_all_three.append(car)
If you run print(cars_with_all_three), you get ['bmw'].
Or, if you want to be really clever and use a one-liner, you can do:
[car for ind, car in enumerate(df['car']) if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes']
Hope this helps.
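The list produced above can be turned back into a DataFrame, which is what the question's later edit asks for. The following is a minimal sketch, assuming the df from the question; the zip-based comprehension and the name filtered are illustrative choices, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'car': ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd': ['yes', 'yes', 'no', 'yes'],
    'sunroof': ['yes', 'no', 'no', 'no']})

# zip walks the four columns in parallel, avoiding repeated df[col][ind] lookups
cars_with_all_three = [
    car
    for car, dvd, warranty, sunroof
    in zip(df['car'], df['dvd'], df['warranty'], df['sunroof'])
    if dvd == warranty == sunroof == 'yes'
]

# isin turns the list of names back into a row filter on the DataFrame
filtered = df[df['car'].isin(cars_with_all_three)]
print(filtered)
```

Note that isin matches on the car name, so this only works as intended when the names in the car column are unique.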
Answer 2: You can use boolean indexing:
print ((df.dvd == 'yes') & (df.sunroof == 'yes') & (df.warranty == 'yes'))
0     True
1    False
2    False
3    False
dtype: bool
print (df[(df.dvd == 'yes') & (df.sunroof == 'yes') & (df.warranty == 'yes')])
   car  dvd sunroof warranty
0  bmw  yes     yes      yes
# if only the 'car' column is needed
# (.loc replaces the .ix indexer used originally, which was removed in pandas 1.0)
print (df.loc[(df.dvd == 'yes')&(df.sunroof == 'yes')&(df.warranty == 'yes'), 'car'])
0    bmw
Name: car, dtype: object
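As a possible variation on the boolean indexing shown above, the mask can be stored once and reused, or the same condition can be written as a string for DataFrame.query. This is only a sketch, assuming the df from the question; the names mask, filtered_query and car_names are introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'car': ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd': ['yes', 'yes', 'no', 'yes'],
    'sunroof': ['yes', 'no', 'no', 'no']})

# Build the boolean mask once so it can be reused for several selections
mask = (df.dvd == 'yes') & (df.sunroof == 'yes') & (df.warranty == 'yes')

# Whole rows that satisfy the mask
filtered = df[mask]

# Only the 'car' column of those rows
car_names = df.loc[mask, 'car']

# The same condition expressed as a query string
filtered_query = df.query("dvd == 'yes' and sunroof == 'yes' and warranty == 'yes'")

print(filtered)
print(car_names)
```

The parentheses around each comparison in the mask matter, because & binds more tightly than == in Python.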
Another solution is to compare the selected columns with 'yes' and then use all to check whether every value in a row is True:
print ((df[[u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1))
0     True
1    False
2    False
3    False
dtype: bool
print (df[(df[[u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1)])
   car  dvd sunroof warranty
0  bmw  yes     yes      yes
# .loc replaces the .ix indexer used originally, which was removed in pandas 1.0
print (df.loc[(df[[u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1), 'car'])
0    bmw
Name: car, dtype: object
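The all(axis=1) approach generalizes naturally to any list of columns. Below is a minimal sketch, assuming the df from the question; the helper name rows_where_all_equal and its parameters are hypothetical and not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({
    'car': ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd': ['yes', 'yes', 'no', 'yes'],
    'sunroof': ['yes', 'no', 'no', 'no']})


def rows_where_all_equal(frame, columns, value='yes'):
    """Return the rows of frame in which every listed column equals value.

    Hypothetical helper, written only to illustrate the technique.
    """
    # eq compares element-wise; all(axis=1) reduces each row to a single bool
    mask = frame[columns].eq(value).all(axis=1)
    return frame[mask]


result = rows_where_all_equal(df, ['dvd', 'sunroof', 'warranty'])
print(result)
```

Passing a shorter column list relaxes the filter, for example ['dvd', 'warranty'] would keep both bmw and geo.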
The solution with the least code, if the DataFrame has only 4 columns as in the sample:
print (df[(df.set_index('car') == 'yes').all(1).values])
   car  dvd sunroof warranty
0  bmw  yes     yes      yes
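In the short form above, the mask is computed on a frame indexed by car, so its index labels no longer match those of df; converting it to a plain array makes the selection positional. A sketch of that step, together with an alternative that keeps the original index by dropping the car column instead, assuming the df from the question (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'car': ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd': ['yes', 'yes', 'no', 'yes'],
    'sunroof': ['yes', 'no', 'no', 'no']})

# Mask indexed by car name ('bmw', 'geo', ...), not by 0..3
mask_by_car = (df.set_index('car') == 'yes').all(axis=1)

# to_numpy() discards the index, so rows are selected by position
via_array = df[mask_by_car.to_numpy()]

# Alternative: drop 'car' instead, so the mask keeps df's own index
mask_same_index = df.drop(columns='car').eq('yes').all(axis=1)
via_drop = df[mask_same_index]

print(via_array)
print(via_drop)
```

Both variants rely on every column other than car being a feature column, as in the sample data.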
Timings:
In [44]: %timeit ([car for ind, car in enumerate(df['car']) if df['dvd'][ind] == df['warranty'][ind] == df['sunroof'][ind] == 'yes'])
10 loops, best of 3: 120 ms per loop
In [45]: %timeit (df[(df.dvd == 'yes')&(df.sunroof == 'yes')&(df.warranty == 'yes')])
The slowest run took 4.39 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.09 ms per loop
In [46]: %timeit (df[(df[[ u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1)])
1000 loops, best of 3: 1.53 ms per loop
In [47]: %timeit (df[(df.ix[:, [u'dvd', u'sunroof', u'warranty']] == "yes").all(axis=1)])
The slowest run took 4.46 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.51 ms per loop
In [48]: %timeit (df[(df.set_index('car') == 'yes').all(1).values])
1000 loops, best of 3: 1.64 ms per loop
In [49]: %timeit (mer(df))
The slowest run took 4.17 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 3.85 ms per loop
Code for timings:
df = pd.DataFrame({
    'car': ['bmw','geo','vw','porsche'],
    'warranty': ['yes','yes','yes','no'],
    'dvd': ['yes','yes','no','yes'],
    'sunroof': ['yes','no','no','no']})
print (df)
df = pd.concat([df]*1000).reset_index(drop=True)
def mer(df):
    df = df.set_index('car')
    return df[df[[u'dvd', u'sunroof', u'warranty']] == "yes"].dropna().reset_index()
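Outside IPython, where %timeit is not available, a comparable measurement can be made with the standard timeit module. This is a rough sketch under the same setup (the sample frame repeated 1000 times); the function names chained and all_axis are made up for this example, and the numbers will vary by machine and pandas version:

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'car': ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd': ['yes', 'yes', 'no', 'yes'],
    'sunroof': ['yes', 'no', 'no', 'no']})

# Same enlargement as in the timing code: 4000 rows
big = pd.concat([df] * 1000).reset_index(drop=True)
cols = ['dvd', 'sunroof', 'warranty']


def chained(frame):
    # Three comparisons combined with &
    return frame[(frame.dvd == 'yes') & (frame.sunroof == 'yes') & (frame.warranty == 'yes')]


def all_axis(frame):
    # One comparison on a column subset, reduced row-wise
    return frame[(frame[cols] == 'yes').all(axis=1)]


runs = 20
for func in (chained, all_axis):
    seconds = timeit.timeit(lambda: func(big), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1000:.2f} ms per call")
```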
Answer 3: Try this:
df = df.set_index('car')
print (df[df[[u'dvd', u'sunroof', u'warranty']] == "yes"].dropna().reset_index())
   car  dvd sunroof warranty
0  bmw  yes     yes      yes
# df is already indexed by 'car' at this point, so set_index is not repeated
print (df[df[[u'dvd', u'sunroof', u'warranty']] == "yes"].dropna().index.values)
['bmw']
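The reason this works is that indexing a DataFrame with a boolean DataFrame keeps the shape and replaces every non-matching cell with NaN, after which dropna removes any row containing a NaN. A small sketch of the intermediate step, assuming the df from the question (the names indexed and masked are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'car': ['bmw', 'geo', 'vw', 'porsche'],
    'warranty': ['yes', 'yes', 'yes', 'no'],
    'dvd': ['yes', 'yes', 'no', 'yes'],
    'sunroof': ['yes', 'no', 'no', 'no']})

indexed = df.set_index('car')

# Cells that are not 'yes' become NaN; the frame keeps all four rows
masked = indexed[indexed == 'yes']
print(masked)

# Only rows without any NaN survive, i.e. rows where every feature is 'yes'
complete = masked.dropna().reset_index()
print(complete)
```

Because dropna also discards rows that had missing values to begin with, this approach assumes the feature columns contain no NaN in the original data.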