Python, lambda function as argument for groupby


Posted: 2021-03-18 05:31:09

Question:

I'm trying to figure out what a piece of code is doing, but I'm a bit lost.

I have a pandas dataframe that was loaded from the following .csv file:

origin_census_block_group,date_range_start,date_range_end,device_count,distance_traveled_from_home,bucketed_distance_traveled,median_dwell_at_bucketed_distance_traveled,completely_home_device_count,median_home_dwell_time,bucketed_home_dwell_time,at_home_by_each_hour,part_time_work_behavior_devices,full_time_work_behavior_devices,destination_cbgs,delivery_behavior_devices,median_non_home_dwell_time,candidate_device_count,bucketed_away_from_home_time,median_percentage_time_home,bucketed_percentage_time_home,mean_home_dwell_time,mean_non_home_dwell_time,mean_distance_traveled_from_home
010539707003,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,49,626,"""16001-50000"":5,""0"":11,"">50000"":4,""2001-8000"":3,""1-1000"":9,""1001-2000"":7,""8001-16000"":1","""16001-50000"":110,"">50000"":155,""<1000"":40,""2001-8000"":237,""1001-2000"":27,""8001-16000"":180",12,627,"""721-1080"":11,""361-720"":9,""61-360"":1,""<60"":11,"">1080"":12","[32,32,28,30,30,31,27,23,20,20,20,17,19,19,15,14,17,20,20,21,25,22,24,23]",7,3,"""120330012011"":1,""010030107031"":1,""010030114052"":2,""120330038001"":1,""010539701003"":1,""010030108001"":1,""010539707002"":14,""010539705003"":2,""120330015001"":1,""121130102003"":1,""010539701002"":1,""120330040001"":1,""370350101014"":2,""120330033081"":2,""010030106003"":1,""010539706001"":2,""010539707004"":3,""120330039001"":1,""010539699003"":1,""120330030003"":1,""010539707003"":41,""010970029003"":1,""010539705004"":1,""120330009002"":1,""010539705001"":3,""010539704003"":1,""120330028012"":1,""120330035081"":1,""120330036102"":1,""120330036142"":1,""010030114062"":1,""010539706004"":7,""010539706002"":1,""120330036082"":1,""010539707001"":7,""010030102001"":1,""120330028011"":1",2,241,71,"""21-45"":4,""481-540"":2,""541-600"":1,""721-840"":1,""1201-1320"":1,""301-360"":3,""<20"":13,""61-120"":3,""241-300"":3,""121-180"":1,""421-480"":3,""1321-1440"":4,""1081-1200"":1,""961-1080"":2,""601-660"":1,""181-240"":1,""661-720"":2,""361-420"":3",72,"""0-25"":13,""76-100"":21,""51-75"":6,""26-50"":3",657,413,1936
010730144081,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,139,2211,"""16001-50000"":17,""0"":41,"">50000"":15,""2001-8000"":22,""1-1000"":8,""1001-2000"":12,""8001-16000"":24","""16001-50000"":143,"">50000"":104,""<1000"":132,""2001-8000"":39,""1001-2000"":15,""8001-16000"":102",41,806,"""721-1080"":32,""361-720"":16,""61-360"":12,""<60"":30,"">1080"":46","[91,92,93,91,91,90,86,83,78,64,64,61,64,62,65,62,60,74,61,64,75,78,81,84]",8,6,"""131350501064"":1,""131350502151"":1,""010730102002"":1,""011170302131"":2,""010730038024"":1,""010730108041"":1,""010730144133"":1,""010730132003"":1,""011210118002"":1,""011170303053"":1,""010730111084"":2,""011170302142"":1,""010730119011"":1,""010730129063"":2,""010730107063"":1,""010730059083"":1,""010730058003"":1,""011270204003"":1,""010730049012"":2,""130879701001"":1,""010730120021"":1,""130890219133"":1,""010730144082"":4,""170310301031"":1,""010730129112"":1,""010730024002"":1,""011170303034"":2,""481390616004"":1,""121270826052"":1,""010730128021"":2,""121270825073"":1,""010730004004"":1,""211959313002"":1,""010730100012"":1,""011170302151"":1,""010730142041"":1,""010730129123"":1,""010730129084"":1,""010730042002"":1,""010730059033"":2,""170318306001"":1,""130519800001"":1,""010730027003"":1,""121270826042"":1,""481610001002"":1,""010730100011"":1,""010730023032"":1,""350250004002"":1,""010730056003"":1,""010730132001"":1,""011170302171"":2,""120910227003"":1,""011239620001"":1,""130351503002"":1,""010730129155"":1,""010730001001"":2,""010730110021"":1,""170310104003"":1,""010730059082"":2,""010730120022"":1,""011170303151"":1,""010730139022"":1,""011170303441"":4,""010730144092"":3,""010730129151"":1,""011210119001"":2,""010730144081"":117,""010730108052"":1,""010730129122"":9,""370710321003"":1,""010730142034"":2,""010730042001"":2,""010570201003"":1,""010730144132"":6,""010730059032"":1,""010730012001"":2,""010730102003"":1,""011170303332"":1,""010730128032"":2,""010730129081"":1,""010730103011"":1,""01073005800
1"":3,""011150401041"":1,""010730045001"":3,""010730110013"":1,""010730119041"":1,""010730042003"":1,""010730141041"":1,""010730144091"":1,""010730129154"":1,""484759501002"":1,""010730144063"":1,""010730144102"":12,""011170303141"":1,""011250106011"":1,""011170303152"":1,""010730059104"":1,""010730107021"":1,""010730100014"":1,""010730008004"":1,""011170303451"":1,""010730127041"":2,""370559704003"":1,""010730047011"":2,""010730129132"":2,""011010014002"":1,""010730144131"":1,""011170302133"":1,""010730030011"":1,""131350506063"":1,""010730118023"":1,""010890110141"":1,""010730128023"":1,""010730106022"":2,""130879703004"":1,""010730108015"":1,""131390010041"":1,""011170305013"":1,""010730134002"":1,""010730031004"":1,""010730138012"":1,""010730011004"":1,""011250102041"":1,""010730129131"":4,""010730144101"":4,""011170303331"":2,""010730003001"":1,""011010033012"":1,""483539504004"":1,""010550104021"":1,""011170303411"":1,""010730106031"":1,""011170303153"":5,""010730128034"":1,""010730129061"":1,""131390010023"":1,""010730051042"":1,""130510107002"":1,""010730027001"":2,""120090686011"":1,""010730107042"":1,""010730123052"":1,""010730129102"":1,""011210115003"":1,""010730129083"":4,""011170303142"":1,""011010014001"":1,""010730107064"":2",7,176,205,"""21-45"":7,""481-540"":10,""541-600"":4,""46-60"":2,""721-840"":3,""1201-1320"":3,""301-360"":7,""<20"":46,""61-120"":6,""241-300"":4,""121-180"":9,""421-480"":2,""1321-1440"":3,""1081-1200"":5,""961-1080"":1,""601-660"":1,""181-240"":5,""661-720"":1,""361-420"":7",78,"""0-25"":29,""76-100"":71,""51-75"":27,""26-50"":8",751,338,38937
010890017002,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,78,1934,"""16001-50000"":2,""0"":12,"">50000"":9,""2001-8000"":27,""1-1000"":12,""1001-2000"":8,""8001-16000"":8","""16001-50000"":49,"">50000"":99,""<1000"":111,""2001-8000"":37,""1001-2000"":24,""8001-16000"":28",11,787,"""721-1080"":17,""361-720"":11,""61-360"":11,""<60"":15,"">1080"":23","[49,42,48,48,47,48,44,44,39,32,34,32,36,31,32,36,40,37,36,38,49,45,46,46]",5,1,"""010890101002"":1,""010730108041"":1,""010890020003"":2,""010890010001"":2,""010890025011"":3,""010890026001"":4,""280819505003"":1,""281059504004"":1,""010890103022"":1,""120990056011"":1,""010890109012"":2,""010890019021"":6,""010890013021"":4,""010890015004"":3,""010890108003"":1,""010890014022"":6,""281059501003"":1,""281059503001"":1,""010890007022"":3,""010890017001"":3,""010890107023"":1,""010890021002"":1,""010890009011"":1,""010890109013"":1,""010730120022"":1,""010890031003"":15,""011170303151"":1,""010890019011"":9,""010890030002"":2,""010890110221"":1,""011170305021"":1,""010890026003"":2,""010890025012"":3,""010730117034"":1,""010830208022"":1,""010890031002"":2,""010890112002"":1,""010210602001"":1,""010890002022"":1,""010890017002"":65,""281059506021"":1,""010890010003"":2,""010890106222"":1,""120990059182"":1,""010890110222"":1,""010890020001"":1,""010890101003"":1,""010890018013"":1,""010890021001"":1,""010890109021"":1,""010890108001"":1,""010770106005"":1,""281059506011"":1,""010030114032"":2,""010830209001"":1,""010890027222"":1,""010730128023"":1,""010890009021"":1,""010030114051"":1,""010030109031"":1,""010030103003"":1,""010890031001"":1,""010890021003"":1,""010030114062"":4,""010890106241"":1,""281059504003"":1,""010890018011"":10,""010890019031"":5,""010890027012"":1,""010730108054"":1,""010890106223"":2,""010890111001"":1,""010210603002"":1,""010890109011"":1,""010890019012"":2,""010890113001"":1,""010890028013"":3",1,229,99,"""481-540"":3,""541-600"":2,""46-60"":1,""721-840"":1,""1201-1320"":7,""301-360"":6,
""<20"":18,""61-120"":10,""241-300"":5,""121-180"":2,""1321-1440"":2,""841-960"":1,""1081-1200"":1,""961-1080"":3,""601-660"":3,""181-240"":2,""661-720"":3",78,"""0-25"":16,""76-100"":44,""51-75"":11,""26-50"":7",708,353,14328
010950308022,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,100,2481,"""16001-50000"":11,""0"":19,"">50000"":11,""2001-8000"":40,""1-1000"":6,""1001-2000"":3,""8001-16000"":4","""16001-50000"":150,"">50000"":23,""<1000"":739,""2001-8000"":23,""1001-2000"":12,""8001-16000"":208",17,703,"""721-1080"":21,""361-720"":19,""61-360"":10,""<60"":24,"">1080"":26","[62,64,64,63,65,67,54,48,37,37,34,33,30,34,32,33,35,43,50,56,58,56,56,57]",8,6,"""010950306004"":1,""010950302023"":1,""011030054051"":1,""010950311002"":1,""010950309023"":1,""010499606003"":1,""121319506023"":2,""010950308022"":86,""121319506016"":2,""010950304013"":1,""010950307024"":1,""010950309041"":1,""010890019021"":2,""010950312001"":5,""010499607002"":1,""011150402013"":1,""010550102003"":1,""120050027043"":3,""010719509003"":1,""010950302022"":1,""010950308023"":2,""120050027051"":2,""471079701022"":1,""010890106221"":1,""010950306001"":1,""010950302011"":2,""011150405013"":1,""011150402041"":2,""010950312002"":16,""011030054042"":1,""010950301002"":2,""130459105011"":1,""010730001001"":1,""130459102001"":1,""010890109013"":2,""010950308013"":14,""010719508004"":1,""120050027041"":3,""010550110021"":3,""010730049022"":1,""010950308024"":1,""010950312004"":6,""010950312003"":1,""010550104012"":2,""010550110013"":1,""120860004111"":1,""010890027222"":1,""010950306002"":2,""010950304015"":1,""011030054041"":1,""010950309031"":8,""010950308021"":1,""010950302024"":1,""010950307011"":5,""010550110012"":2,""011150404013"":1,""130459103003"":1,""120050027032"":3,""010950307012"":5,""010950309022"":2,""010950307023"":1,""010719508003"":1,""010499608001"":2,""010950310003"":1,""011150402043"":1,""120860099063"":1,""010950309021"":4,""010950309043"":2,""010950308011"":1,""010950306003"":3,""120050027042"":1,""010950308025"":5,""010950309032"":6,""010499607001"":1",1,199,132,"""21-45"":8,""481-540"":6,""541-600"":4,""46-60"":3,""721-840"":3,""1201-1320"":4,""301-360"":3,""<20"":20,""61-120"":10,""241-300"":2,"
"121-180"":4,""421-480"":3,""1321-1440"":1,""841-960"":3,""961-1080"":2,""601-660"":1,""181-240"":3,""661-720"":1,""361-420"":2",74,"""0-25"":20,""76-100"":48,""51-75"":23,""26-50"":4",661,350,5044

import pandas as pd

# csv_file is the path to the .csv file shown above
df = pd.read_csv(csv_file,
                 usecols=[
                     'origin_census_block_group',
                     'date_range_start',
                     'date_range_end',
                     'device_count',
                     'distance_traveled_from_home',
                     'completely_home_device_count',
                     'median_home_dwell_time',
                     'part_time_work_behavior_devices',
                     'full_time_work_behavior_devices'
                 ],
                 # read the block group IDs as strings so leading zeros are kept
                 dtype={'origin_census_block_group': str},
                 ).set_index('origin_census_block_group')

Later in the code, the dataframe is modified with:

df = df.groupby(lambda cbg: cbg[:5]).sum()

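I don't quite understand what this line is doing. groupby usually groups a dataframe by column, so... is it grouping the dataframe using multiple columns (0 to 5)? And what is the effect of .sum() at the end?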

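Comments:

Please provide an MRE. "Is it grouping the dataframe using multiple columns (0 to 5)?" Yes. "What is the effect of .sum() at the end?" The same thing it usually does (please read the documentation).

@KarlKnechtel, any chance you could elaborate? I'm familiar with how .sum() works in general, but I'm confused about how it applies to the result of groupby in this case.

Have you tried running the groupby on its own? Do you understand what sum will do to the result you see?

Answer 1: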

If you run the code exactly as written (creating df and then the groupby), you can see the result. Here are the first few columns of the groupby output:

         device_count    distance_traveled_from_home
-----  --------------  -----------------------------
01053              49                            626
01073             139                           2211
01089              78                           1934
01095             100                           2481

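What happens here is that the function lambda cbg: cbg[:5] is applied to each index value (the strings that look like numbers in the column origin_census_block_group). As an aside, note that by declaring

...
dtype={'origin_census_block_group': str},

when creating df, someone went to the trouble of making sure those values really are str.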
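Why the dtype matters can be seen in a minimal sketch. The two-row CSV text below is made up for illustration (only the ID format follows the question's data), and it is read from memory rather than from the original file:

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the IDs mimic the census block group format.
csv_text = (
    "origin_census_block_group,device_count\n"
    "010539707003,49\n"
    "010730144081,139\n"
)

# Without dtype, pandas infers an integer column and the leading zero is lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, the IDs stay strings, so slicing such as cbg[:5] works as intended.
as_str = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"origin_census_block_group": str},
)

print(inferred["origin_census_block_group"].iloc[0])  # 10539707003
print(as_str["origin_census_block_group"].iloc[0])    # 010539707003
```

With the integer version, cbg[:5] would fail outright, because integers cannot be sliced.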

So the function is applied to strings like '010539707003' and returns the substring made of the first 5 characters of that string:

'010539707003'[:5]

produces

'01053'

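So I assume there are multiple keys sharing the same first 5 characters (in the actual file; in the snippet they are all unique, so it is not very interesting), and all of those rows are grouped together.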
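A small sketch of how several index values can collapse to one key. The index values here are illustrative assumptions, chosen so that two of them share the prefix '01053':

```python
import pandas as pd

# Hypothetical index values; the first two share the prefix '01053'.
idx = pd.Index(["010539707003", "010539707004", "010730144081"])

# The same function that groupby receives, applied to each index label.
keys = idx.map(lambda cbg: cbg[:5])

print(list(keys))  # ['01053', '01053', '01073']
```

Rows whose labels map to the same key end up in the same group.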

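Then .sum() is applied to every numeric column of each group and returns the column totals for each groupby key. That is what you see in the output for columns like 'device_count'.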
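Putting both steps together, here is a minimal sketch on a tiny made-up dataframe (the second row and its values are assumptions added so that one group has more than one member). It also shows an equivalent spelling using the index's .str accessor instead of a lambda:

```python
import pandas as pd

# Hypothetical data: two block groups in county '01053', one in '01073'.
df = pd.DataFrame(
    {
        "device_count": [49, 10, 139],
        "distance_traveled_from_home": [626, 100, 2211],
    },
    index=["010539707003", "010539707004", "010730144081"],
)

# A function passed to groupby is called on each index label.
by_lambda = df.groupby(lambda cbg: cbg[:5]).sum()

# Equivalent grouping key built with the vectorized string accessor.
by_str = df.groupby(df.index.str[:5]).sum()

print(by_lambda)
```

The '01053' row of the result holds 49 + 10 = 59 devices and 626 + 100 = 726 for the distance column.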

Hope it's clear now.

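Comments:

Thank you very much! As I said on the other answer, I was confused because normally you have to specify the column to group by, while in this case it seems to use the index automatically. Also, I didn't know how to print the output of groupby on its own (and actually still don't), so I couldn't see what .sum() was operating on.

@Carlo Does this help with understanding groupby? It can be helpful to look at the result this way: for key, group in df.groupby('First'): print(key, group), where 'First' is the name of the column you are grouping on.

Answer 2: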

pandas' read_csv() loads a csv-formatted file into a pandas DataFrame. I recommend keeping the pandas documentation at hand, as it is very thorough -> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

 usecols=[
                'origin_census_block_group',
                'date_range_start',
                'date_range_end',
                'device_count',
                'distance_traveled_from_home',
                'completely_home_device_count',
                'median_home_dwell_time',
                'part_time_work_behavior_devices',
                'full_time_work_behavior_devices'
                ],

The usecols parameter takes a list of the desired columns as input, and only the specified columns are loaded into the dataframe.
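A minimal sketch of usecols, using a made-up three-column CSV held in memory (the column names a, b, c are placeholders):

```python
import io
import pandas as pd

# Hypothetical CSV with three columns.
csv_text = "a,b,c\n1,2,3\n4,5,6\n"

# Only the listed columns are parsed; column 'b' is skipped.
df_small = pd.read_csv(io.StringIO(csv_text), usecols=["a", "c"])

print(list(df_small.columns))  # ['a', 'c']
```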

dtype={'origin_census_block_group': str}

The dtype parameter accepts a dictionary specifying the data type of the values, in the form {'column': datatype}.

.set_index('origin_census_block_group')

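.set_index() sets the specified column as the index. By default, the index of a pandas DataFrame is the row number, which is displayed as the leftmost column of the dataframe. After setting the index, the specified column takes that place. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html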
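As a rough illustration of what changes, here is a sketch on a two-row frame whose values are copied from the sample data in the question:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "origin_census_block_group": ["010539707003", "010730144081"],
        "device_count": [49, 139],
    }
)
print(list(df.index))  # [0, 1]  (default row numbers)

# The column becomes the row labels and is no longer a regular column.
indexed = df.set_index("origin_census_block_group")

print(list(indexed.index))    # ['010539707003', '010730144081']
print(list(indexed.columns))  # ['device_count']
```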

pandas' .groupby() function regroups the dataframe according to the values appearing in the specified column.

That is, if we have a dataframe such as df =

Fruit       Name    Quality  Count
                           
Apple       Marco    High      4
Pear        Lucia  Medium     10
Apple   Francesco     Low      3
Banana      Carlo  Medium      6
Pear        Timmy     Low      7
Apple     Roberto    High      8
Banana        Joe    High     21
Banana       Jack     Low      3
Pear          Rob  Medium      5
Apple       Louis  Medium      6
Pear     Jennifer     Low      7
Pear        Laura    High      8

then a groupby operation such as:

df = df.groupby(lambda x: x[:2]).sum()

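will take every element of the index, slice it from position 0 up to (but not including) position 2, and return the sum of all the corresponding values, i.e.: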

Ap     21
Ba     30
Pe     37
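The table above can be rebuilt to check these totals. This is a sketch under one assumption: Fruit is taken to be the index, as the layout suggests. The Count column is selected explicitly before summing, because how .sum() treats the text columns (Name, Quality) can differ between pandas versions:

```python
import pandas as pd

fruit_df = pd.DataFrame(
    {
        "Name": ["Marco", "Lucia", "Francesco", "Carlo", "Timmy", "Roberto",
                 "Joe", "Jack", "Rob", "Louis", "Jennifer", "Laura"],
        "Quality": ["High", "Medium", "Low", "Medium", "Low", "High",
                    "High", "Low", "Medium", "Medium", "Low", "High"],
        "Count": [4, 10, 3, 6, 7, 8, 21, 3, 5, 6, 7, 8],
    },
    index=pd.Index(
        ["Apple", "Pear", "Apple", "Banana", "Pear", "Apple",
         "Banana", "Banana", "Pear", "Apple", "Pear", "Pear"],
        name="Fruit",
    ),
)

# Group rows by the first two letters of the index label, then total Count.
totals = fruit_df.groupby(lambda x: x[:2])["Count"].sum()

print(totals)
```

The result has three entries: Ap 21, Ba 30 and Pe 37.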

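Now, you may be wondering about the .sum() method at the end. If you try to print the result without actually calling it (for example, by writing .sum without the parentheses), you may get something like: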

<bound method GroupBy.sum of <pandas.core.groupby.generic.DataFrameGroupBy object at 0x109d260a0>>

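This is because pandas has created a groupby object but does not yet know how you want it displayed. Do you want it displayed by the number of occurrences in the index? Then you would do: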

df = df.groupby(lambda x: x[:2]).size()

That would output:

Ap    4
Ba    3
Pe    5

Or perhaps by the sum of their summable values? (This is what the example does.)

df = df.groupby(lambda x: x[:2]).sum()

which again outputs:

Ap     21
Ba     30
Pe     37
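The two aggregations can be compared side by side. This sketch keeps only the index and the Count column of the example table, which is a simplification of the original dataframe:

```python
import pandas as pd

fruits = ["Apple", "Pear", "Apple", "Banana", "Pear", "Apple",
          "Banana", "Banana", "Pear", "Apple", "Pear", "Pear"]
counts = [4, 10, 3, 6, 7, 8, 21, 3, 5, 6, 7, 8]
df = pd.DataFrame({"Count": counts}, index=fruits)

grouped = df.groupby(lambda x: x[:2])

# size() counts the rows in each group; sum() adds up the values.
summary = pd.DataFrame(
    {"size": grouped.size(), "sum": grouped["Count"].sum()}
)

print(summary)
```

For example, the Ap group contains 4 rows whose Count values add up to 21.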

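Note that it took the first two letters of the strings in the index. With x[:3] it would of course take the first three letters.

To summarize:

-> .groupby() takes the elements of the index (the row labels, displayed as the leftmost column of the dataframe) and organizes the dataframe into groups based on them

-> The input you give to groupby is an anonymous function, i.e. a lambda function, which slices each index value it receives from position 0 up to (but not including) position 5

-> You choose how to obtain the groupby result by appending a method such as .sum() or .size() to the groupby object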
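To look inside a groupby object before any aggregation, one option is to iterate over it or to pull out a single group. A minimal sketch on a shortened, made-up version of the fruit data:

```python
import pandas as pd

df = pd.DataFrame(
    {"Count": [4, 10, 3, 6, 7]},
    index=["Apple", "Pear", "Apple", "Banana", "Pear"],
)

grouped = df.groupby(lambda x: x[:2])

# Iterating yields (key, sub-dataframe) pairs, one per group.
for key, group in grouped:
    print(key)
    print(group)
    print()

# A single group can also be fetched by its key.
apples = grouped.get_group("Ap")
```

Each printed sub-dataframe is what .sum() or .size() would then reduce to a single row.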

I also recommend reading about Python's lambda functions: https://docs.python.org/3/reference/expressions.html
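A lambda is just an unnamed function, so the same grouping can be written with an ordinary def. In this sketch the name first_five and the small dataframe are illustrative assumptions:

```python
import pandas as pd

def first_five(label):
    """Return the first five characters of an index label."""
    return label[:5]

df = pd.DataFrame(
    {"device_count": [49, 10, 139]},
    index=["010539707003", "010539707004", "010730144081"],
)

# Passing the named function behaves the same as passing the lambda.
with_def = df.groupby(first_five).sum()
with_lambda = df.groupby(lambda cbg: cbg[:5]).sum()

print(with_def.equals(with_lambda))  # True
```

A named function can be easier to document and test when the grouping rule becomes more involved than a simple slice.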

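Comments:

Thank you very much! I was confused because normally you need to specify which column the rows must be grouped by, while in this case the grouping happens automatically on the index column (after the lambda function has done its slicing). I completely missed that part at first.

I'm glad I could help Carlo, buona giornata! :D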
