Sampling from dataframe based on another dataframe Python and Pandas
【Posted】: 2021-05-27 09:33:52
【Question】: I hope you are all doing well.
I have two different dataframes, shown below.
Main table:
Car | type | year | condition | price | No.users | No.seats |
---|---|---|---|---|---|---|
BZ | SV | 2000-2010 | old | 10000 | 2 | 7 |
BM | SD | 2000-2010 | new | 8000 | 3 | 2 |
BM | SV | 2000-2010 | old | 9000 | 1 | 4 |
BZ | SD | 2000-2010 | new | 7000 | 3 | 5 |
BM | SV | 2000-2010 | new | 5000 | 5 | 2 |
BM | SD | 2000-2010 | old | 3000 | 6 | 2 |
BZ | SV | 2010-2020 | old | 20000 | 2 | 4 |
BZ | SV | 2010-2020 | new | 1000 | 8 | 4 |
BZ | SV | 2000-2010 | new | 5000 | 0 | 5 |
BZ | SD | 2000-2010 | old | 4000 | 1 | 7 |
Sample table (I want to sample according to this table):
city | type | year | No.sample |
---|---|---|---|
BZ | SV | 2000-2010 | 1 |
BZ | SV | 2010-2020 | 1 |
BZ | SD | 2000-2010 | 1 |
BM | SV | 2000-2010 | 1 |
BM | SD | 2000-2010 | 1 |
I have tried different approaches, but I would like to know how to randomly sample rows from the main table according to SampleTable.
【Answer 1】: Take a look at this answer:
import pandas as pd

data = pd.DataFrame({'cols1': [4, 5, 5, 4, 321, 32, 5],
                     'clol2': [45, 66, 6, 6, 1, 432, 3],
                     'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})
freq = pd.DataFrame({'class': ['A', 'B', 'C'],
                     'nostoextract': [2, 2, 2]})

def bootstrap(data, freq):
    # Index by class so the desired sample size can be looked up per class.
    freq = freq.set_index('class')

    # This function is called once for each group of rows in `data`
    # that share the same class `cls`.
    def sampleClass(cls, classgroup):
        nDesired = int(freq.loc[cls, 'nostoextract'])
        nRows = len(classgroup)
        # Never ask for more rows than the group contains.
        nSamples = min(nRows, nDesired)
        return classgroup.sample(nSamples)

    # Iterating over the groups (rather than using groupby().apply) keeps
    # the 'class' column in every group regardless of the pandas version.
    samples = pd.concat([sampleClass(cls, group)
                         for cls, group in data.groupby('class')])

    # The index of `samples` is the row in `data` where each sample came from.
    # If you want a new index with ascending values instead:
    # samples.index = range(len(samples))
    return samples

print(bootstrap(data, freq))
You can combine the "city", "type", and "year" columns into a single new column.
Prepare MainTable:
MainTable["combination"] = MainTable["city"] + MainTable["type"] + MainTable["year"]
Prepare SampleTable:
SampleTable["combination"] = SampleTable["city"] + SampleTable["type"] + SampleTable["year"]
Then sample as in the linked answer, using SampleTable["combination"].value_counts() in place of freq["class"].
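As an alternative to building a concatenated key, the same idea can be sketched by joining the two tables directly on the three key columns. This is only a minimal sketch under a few assumptions: the helper name `sample_by_table` and its parameters are hypothetical, the first column of the main table is assumed to be named "city" (the question's main table labels it "Car"), each key combination is assumed to appear only once in the sample table, and the requested count is capped at the number of rows actually available.

```python
import numpy as np
import pandas as pd

# Data copied from the question. Assumption: the main table's first column
# is named "city" so that it matches the sample table.
main_table = pd.DataFrame({
    "city": ["BZ", "BM", "BM", "BZ", "BM", "BM", "BZ", "BZ", "BZ", "BZ"],
    "type": ["SV", "SD", "SV", "SD", "SV", "SD", "SV", "SV", "SV", "SD"],
    "year": ["2000-2010"] * 6 + ["2010-2020"] * 2 + ["2000-2010"] * 2,
    "condition": ["old", "new", "old", "new", "new",
                  "old", "old", "new", "new", "old"],
    "price": [10000, 8000, 9000, 7000, 5000, 3000, 20000, 1000, 5000, 4000],
    "No.users": [2, 3, 1, 3, 5, 6, 2, 8, 0, 1],
    "No.seats": [7, 2, 4, 5, 2, 2, 4, 4, 5, 7],
})
sample_table = pd.DataFrame({
    "city": ["BZ", "BZ", "BZ", "BM", "BM"],
    "type": ["SV", "SV", "SD", "SV", "SD"],
    "year": ["2000-2010", "2010-2020", "2000-2010", "2000-2010", "2000-2010"],
    "No.sample": [1, 1, 1, 1, 1],
})

KEYS = ["city", "type", "year"]


def sample_by_table(main, spec, keys=KEYS, n_col="No.sample", seed=None):
    """Hypothetical helper: draw spec[n_col] random rows per key combination."""
    rng = np.random.RandomState(seed)
    # The inner join attaches the requested sample size to every matching row
    # and drops rows whose key combination is not listed in `spec`.
    merged = main.merge(spec[keys + [n_col]], on=keys, how="inner")
    parts = []
    for _, group in merged.groupby(keys, sort=False):
        # Cap the request at the number of rows available in the group.
        n = min(len(group), int(group[n_col].iloc[0]))
        parts.append(group.sample(n=n, random_state=rng))
    if not parts:
        # Nothing matched: return an empty frame with the main table's columns.
        return main.iloc[0:0].copy()
    return pd.concat(parts).drop(columns=n_col).reset_index(drop=True)


result = sample_by_table(main_table, sample_table, seed=0)
print(result)
```

Joining on the columns themselves avoids accidental collisions that plain string concatenation could produce when different column values join into the same text, and it reads the requested count from the "No.sample" column rather than from how often a combination occurs.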