Creating tuples from columns of Dataframe

【Title】: Creating tuples from columns of Dataframe 【Posted】: 2021-03-14 02:31:05 【Question】:

I have a dataset like this. I want to create a list of tuples of the form

(Name_of_State , Literacy_rate)
(JAMMU&KASHMIR, 89.78) #example

I had to do some cleaning first, removing the district rows and keeping only the states.

data=data[data['Name']!='India']    # remove the all-India row
data=data[data['TRU']=='Total']     # keep only the Total rows, excluding Rural and Urban
states_group=data[data['Level']=='State']
states_group
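
As a side note, the three filters above can also be written as a single boolean mask. The following is only a minimal sketch on a made-up toy frame: the column names `Name`, `TRU` and `Level` come from the question, while the rows and values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the census sheet; these rows are made up for illustration.
data = pd.DataFrame({
    "Name": ["India", "PUNJAB", "PUNJAB", "Amritsar", "KERALA"],
    "Level": ["India", "State", "State", "DISTRICT", "State"],
    "TRU": ["Total", "Total", "Rural", "Total", "Total"],
})

# Combine the three conditions with & (each one wrapped in parentheses).
mask = (
    (data["Name"] != "India")
    & (data["TRU"] == "Total")
    & (data["Level"] == "State")
)

# .copy() gives an independent frame, so columns can be added to it later
# without pandas warning about writing to a slice of `data`.
states_group = data.loc[mask].copy()

print(states_group["Name"].tolist())
```

Whether one mask or three successive filters reads better is a matter of taste; both select the same rows.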

After that, this is the main code I want to focus on -

literacy_rate=[]
total_state_pop=0
total_literate_pop=0
for key,group in states_group.iterrows():
    total_state_pop+=states_group['TOT_P']
    
    total_literate_pop+=states_group['P_LIT']
    total_literate_pop+=states_group['F_LIT']
    rate=(total_literate_pop/total_state_pop)*100
    literacy_rate.append((states_group['Name'],rate))
    
print(literacy_rate) 

But the output I get is -

(3            JAMMU & KASHMIR
72          HIMACHAL PRADESH
111                   PUNJAB
174               CHANDIGARH
180              UTTARAKHAND
222                  HARYANA
288             NCT OF DELHI
318                RAJASTHAN
420            UTTAR PRADESH
636                    BIHAR
753                   SIKKIM
768                  MANIPUR
798                  MIZORAM
825                  TRIPURA
840                MEGHALAYA
864                    ASSAM
948              WEST BENGAL
1008               JHARKHAND
1083                  ODISHA
1176            CHHATTISGARH
1233          MADHYA PRADESH
1386                 GUJARAT
1467             DAMAN & DIU
1476    DADRA & NAGAR HAVELI
1482             MAHARASHTRA
1590          ANDHRA PRADESH
1662               KARNATAKA
1755                     GOA
1764                  KERALA
1809              TAMIL NADU
1908              PUDUCHERRY
Name: Name, dtype: object, 3        85.484832
72       99.946393
111      80.810862
174      93.793637
180      89.689123
222      79.608418
288      97.531743
318      67.745833
420      69.971651
636      52.937273
753      98.691424
768      96.236438
798     109.113300
825     116.065370
840      84.108326
864      96.451609
948      87.437511
1008     63.211190
1083     85.260257
1176     85.104889
1233     78.055310
1386     99.236215
1467    121.848465
1476    112.301972
1482    100.968386
1590     79.671587
1662     81.400129
1755    110.110417
1764    120.140132
1809     94.529868
1908    101.165414
dtype: float64), (3            JAMMU & KASHMIR
72          HIMACHAL PRADESH
111                   PUNJAB
174               CHANDIGARH
180              UTTARAKHAND
222                  HARYANA
288             NCT OF DELHI
318                RAJASTHAN
420            UTTAR PRADESH
636                    BIHAR
753                   SIKKIM
768                  MANIPUR
798                  MIZORAM
825                  TRIPURA
840                MEGHALAYA
864                    ASSAM
948              WEST BENGAL
1008               JHARKHAND
1083                  ODISHA
1176            CHHATTISGARH
1233          MADHYA PRADESH
1386                 GUJARAT
1467             DAMAN & DIU
1476    DADRA & NAGAR HAVELI
1482             MAHARASHTRA
1590          ANDHRA PRADESH
1662               KARNATAKA
1755                     GOA
1764                  KERALA
1809              TAMIL NADU
1908              PUDUCHERRY
Name: Name, dtype: object, 3        85.484832
72       99.946393
111      80.810862
174      93.793637
180      89.689123
222      79.608418
288      97.531743
318      67.745833
420      69.971651
636      52.937273
753      98.691424
768      96.236438
798     109.113300
825     116.065370
840      84.108326
864      96.451609
948      87.437511
1008     63.211190
1083     85.260257
1176     85.104889
1233     78.055310
1386     99.236215
1467    121.848465
1476    112.301972
1482    100.968386
1590     79.671587
1662     81.400129
1755    110.110417
1764    120.140132
1809     94.529868
1908    101.165414
dtype: float64), (3            JAMMU & KASHMIR
72          HIMACHAL PRADESH
111                   PUNJAB
174               CHANDIGARH
180              UTTARAKHAND
222                  HARYANA
288             NCT OF DELHI
318                RAJASTHAN
420            UTTAR PRADESH
636                    BIHAR
753                   SIKKIM
768                  MANIPUR
798                  MIZORAM
825                  TRIPURA
840                MEGHALAYA
864                    ASSAM
948              WEST BENGAL
1008               JHARKHAND
1083                  ODISHA
1176            CHHATTISGARH
1233          MADHYA PRADESH
1386                 GUJARAT
1467             DAMAN & DIU
1476    DADRA & NAGAR HAVELI
1482             MAHARASHTRA
1590          ANDHRA PRADESH
1662               KARNATAKA
1755                     GOA
1764                  KERALA
1809              TAMIL NADU
1908              PUDUCHERRY
Name: Name, dtype: object, 3        85.484832
72       99.946393
111      80.810862
174      93.793637
180      89.689123
222      79.608418
288      97.531743
318      67.745833
420      69.971651
636      52.937273
753      98.691424
768      96.236438
798     109.113300
825     116.065370
840      84.108326
864      96.451609
948      87.437511
1008     63.211190
1083     85.260257
1176     85.104889
1233     78.055310
1386     99.236215
1467    121.848465
1476    112.301972

And it goes on like this for a long time. This is the link to the whole dataset. Where am I going wrong? Thanks in advance.

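【Question discussion】:

General advice: try using pandas methods instead... they are more direct and useful...

Actually, I am new to this.. can you help?

Sure... I will look into it as soon as I can. Someone will surely help you before I find the time... if not, good luck!

OK, thanks @adirabargil

【Answer 1】: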

Avoid iterating over rows wherever possible, since it is an anti-pattern in pandas (the original answer linked a good read on this).

import pandas as pd
data = pd.read_excel('state_dist_sc.xls')
data=data[data['Name']!='India']
data=data[data['TRU']=='Total']
states_group=data[data['Level']=='State']

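# Work on a copy so that adding a column does not write to a slice of data.
states_group = states_group.copy()

# Calculate the literacy rate with a vectorized formula, which is faster than looping over rows.
# Note: several rates in the output below exceed 100, which suggests P_LIT may already include F_LIT; check the column definitions.
states_group['literacy_rate'] = 100*(states_group['P_LIT'] + states_group['F_LIT'])/states_group['TOT_P']

# to_records returns a NumPy record array of (Name, literacy_rate) pairs; call .tolist() on it for a plain list of tuples.
ans = states_group[['Name','literacy_rate']].to_records(index=False)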
ans

Output:

rec.array([('JAMMU & KASHMIR',  85.48483174),
           ('HIMACHAL PRADESH',  99.94639301), ('PUNJAB',  80.81086172),
           ('CHANDIGARH',  93.79363692), ('UTTARAKHAND',  89.68912284),
           ('HARYANA',  79.60841792), ('NCT OF DELHI',  97.53174349),
           ('RAJASTHAN',  67.74583313), ('UTTAR PRADESH',  69.97165068),
           ('BIHAR',  52.93727261), ('SIKKIM',  98.69142352),
           ('MANIPUR',  96.23643761), ('MIZORAM', 109.11330049),
           ('TRIPURA', 116.06537002), ('MEGHALAYA',  84.10832613),
           ('ASSAM',  96.45160871), ('WEST BENGAL',  87.43751069),
           ('JHARKHAND',  63.21118996), ('ODISHA',  85.26025661),
           ('CHHATTISGARH',  85.10488906),
           ('MADHYA PRADESH',  78.05530967), ('GUJARAT',  99.23621537),
           ('DAMAN & DIU', 121.84846506),
           ('DADRA & NAGAR HAVELI', 112.3019722 ),
           ('MAHARASHTRA', 100.96838647),
           ('ANDHRA PRADESH',  79.67158709), ('KARNATAKA',  81.40012899),
           ('GOA', 110.11041691), ('KERALA', 120.14013153),
           ('TAMIL NADU',  94.529868  ), ('PUDUCHERRY', 101.16541449)],
          dtype=[('Name', 'O'), ('literacy_rate', '<f8')])
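
The result above is a NumPy record array rather than a plain Python list. If a literal list of tuples is needed, a few options are sketched below. This is a hedged illustration on a made-up two-row frame (the state names and rates are invented); only the column names `Name` and `literacy_rate` follow the answer.

```python
import pandas as pd

# Toy frame with invented values; the real one comes from the census sheet.
states_group = pd.DataFrame({
    "Name": ["STATE_A", "STATE_B"],
    "literacy_rate": [75.0, 25.0],
})
cols = states_group[["Name", "literacy_rate"]]

# Option 1: itertuples with name=None yields plain tuples, one per row.
pairs_itertuples = list(cols.itertuples(index=False, name=None))

# Option 2: zip the two columns together.
pairs_zip = list(zip(states_group["Name"], states_group["literacy_rate"]))

# Option 3: convert the record array from to_records into a list.
pairs_records = cols.to_records(index=False).tolist()

print(pairs_itertuples)
```

All three produce one `(Name, literacy_rate)` tuple per state, so the choice mostly comes down to readability.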

【Discussion】:

【Answer 2】:

Inside the for loop, how about replacing every states_group with group? Otherwise there is no point in looping with .iterrows().

literacy_rate=[]
for key,group in states_group.iterrows():
    # group is a single row (a Series), so these lookups return scalars
    total_state_pop=group['TOT_P']
    # assign rather than accumulate with +=, so each rate covers one state only
    total_literate_pop=group['P_LIT']+group['F_LIT']
    rate=(total_literate_pop/total_state_pop)*100
    literacy_rate.append((group['Name'],rate))
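
To see why the question's loop appended whole columns, it may help to compare the two lookups side by side. The sketch below uses a made-up two-row frame (names, populations and index labels are invented); it only illustrates that `states_group['Name']` is the full column on every pass, while `group['Name']` is the value for the current row.

```python
import pandas as pd

# Toy frame with invented values and index labels.
states_group = pd.DataFrame(
    {"Name": ["STATE_A", "STATE_B"], "TOT_P": [200, 400]},
    index=[3, 72],
)

seen = []
for key, group in states_group.iterrows():
    whole_column = states_group["Name"]  # the entire column, on every pass
    one_value = group["Name"]            # the value for the current row only
    seen.append((isinstance(whole_column, pd.Series), len(whole_column), one_value))

print(seen)
```

Appending `whole_column` to a list is what produced the repeated Series in the question's output; appending `one_value` gives one state name per tuple.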

【Discussion】:
