Creating tuples from columns of a DataFrame
【Title】: Creating tuples from columns of Dataframe 【Posted】: 2021-03-14 02:31:05 【Question】: I have a dataset like this, and I want to create a list of tuples
of the form
(Name_of_State , Literacy_rate)
(JAMMU&KASHMIR, 89.78) #example
I had to do some cleaning first, removing the districts and keeping only the states.
data=data[data['Name']!='India'] #removing the India's row
data=data[data['TRU']=='Total']
#Only keeping total and excluding the rural and urban rows
states_group=data[data['Level']=='State']
states_group
After that, this is the main code I want to focus on:
literacy_rate=[]
total_state_pop=0
total_literate_pop=0
for key,group in states_group.iterrows():
    total_state_pop+=states_group['TOT_P']
    total_literate_pop+=states_group['P_LIT']
    total_literate_pop+=states_group['F_LIT']
    rate=(total_literate_pop/total_state_pop)*100
    literacy_rate.append((states_group['Name'],rate))
print(literacy_rate)
But the output I get is:
(3 JAMMU & KASHMIR
72 HIMACHAL PRADESH
111 PUNJAB
174 CHANDIGARH
180 UTTARAKHAND
222 HARYANA
288 NCT OF DELHI
318 RAJASTHAN
420 UTTAR PRADESH
636 BIHAR
753 SIKKIM
768 MANIPUR
798 MIZORAM
825 TRIPURA
840 MEGHALAYA
864 ASSAM
948 WEST BENGAL
1008 JHARKHAND
1083 ODISHA
1176 CHHATTISGARH
1233 MADHYA PRADESH
1386 GUJARAT
1467 DAMAN & DIU
1476 DADRA & NAGAR HAVELI
1482 MAHARASHTRA
1590 ANDHRA PRADESH
1662 KARNATAKA
1755 GOA
1764 KERALA
1809 TAMIL NADU
1908 PUDUCHERRY
Name: Name, dtype: object, 3 85.484832
72 99.946393
111 80.810862
174 93.793637
180 89.689123
222 79.608418
288 97.531743
318 67.745833
420 69.971651
636 52.937273
753 98.691424
768 96.236438
798 109.113300
825 116.065370
840 84.108326
864 96.451609
948 87.437511
1008 63.211190
1083 85.260257
1176 85.104889
1233 78.055310
1386 99.236215
1467 121.848465
1476 112.301972
1482 100.968386
1590 79.671587
1662 81.400129
1755 110.110417
1764 120.140132
1809 94.529868
1908 101.165414
dtype: float64), (3 JAMMU & KASHMIR
...
and it goes on like this for a long way. Here is the link to the entire dataset. Where am I going wrong? Thanks in advance.
【Comments】:
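General advice: try using pandas methods instead... they are more direct and useful...
Actually, I'm a beginner.. could you help?
Sure... I'll look into it as soon as I can. I'm sure someone will help you before I find the time... if not, good luck!
OK, thanks @adirabargil
【Answer 1】: Avoid iteration wherever possible, because it is an anti-pattern in pandas. good read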
import pandas as pd
data = pd.read_excel('state_dist_sc.xls')
data=data[data['Name']!='India']
data=data[data['TRU']=='Total']
states_group=data[data['Level']=='State']
# Create a copy of the data on which to calculate the literacy rate.
states_group = states_group.copy()
# Calculate the literacy rate with a vectorized expression, which is faster than looping over rows.
states_group['literacy_rate'] = 100*(states_group['P_LIT'] + states_group['F_LIT'])/states_group['TOT_P']
# to_records returns a NumPy record array of (Name, literacy_rate) rows
ans = states_group[['Name','literacy_rate']].to_records(index=False)
ans
Output:
rec.array([('JAMMU & KASHMIR', 85.48483174),
('HIMACHAL PRADESH', 99.94639301), ('PUNJAB', 80.81086172),
('CHANDIGARH', 93.79363692), ('UTTARAKHAND', 89.68912284),
('HARYANA', 79.60841792), ('NCT OF DELHI', 97.53174349),
('RAJASTHAN', 67.74583313), ('UTTAR PRADESH', 69.97165068),
('BIHAR', 52.93727261), ('SIKKIM', 98.69142352),
('MANIPUR', 96.23643761), ('MIZORAM', 109.11330049),
('TRIPURA', 116.06537002), ('MEGHALAYA', 84.10832613),
('ASSAM', 96.45160871), ('WEST BENGAL', 87.43751069),
('JHARKHAND', 63.21118996), ('ODISHA', 85.26025661),
('CHHATTISGARH', 85.10488906),
('MADHYA PRADESH', 78.05530967), ('GUJARAT', 99.23621537),
('DAMAN & DIU', 121.84846506),
('DADRA & NAGAR HAVELI', 112.3019722 ),
('MAHARASHTRA', 100.96838647),
('ANDHRA PRADESH', 79.67158709), ('KARNATAKA', 81.40012899),
('GOA', 110.11041691), ('KERALA', 120.14013153),
('TAMIL NADU', 94.529868 ), ('PUDUCHERRY', 101.16541449)],
dtype=[('Name', 'O'), ('literacy_rate', '<f8')])
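Note that `to_records` returns a NumPy record array rather than a plain Python list. If an actual list of tuples is needed, one option is `itertuples(index=False, name=None)`. The following is a minimal sketch on a toy frame; the column names follow the question, but the rows and numbers are made up for illustration.

```python
import pandas as pd

# Toy data: column names follow the question, values are invented.
states_group = pd.DataFrame({
    "Name": ["STATE A", "STATE B"],
    "literacy_rate": [75.0, 50.0],
})

# name=None makes itertuples yield plain tuples instead of namedtuples;
# index=False leaves the row index out of each tuple.
pairs = list(
    states_group[["Name", "literacy_rate"]].itertuples(index=False, name=None)
)
print(pairs)
```

Each element of `pairs` is an ordinary `(Name, literacy_rate)` tuple, so the result can be sorted, sliced, or passed to code that does not know about pandas or NumPy.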
【Comments】:
【Answer 2】: Inside the for loop, how about replacing every states_group with group? Otherwise there is no point in using .iterrows() for the for loop.
literacy_rate=[]
total_state_pop=0
total_literate_pop=0
for key,group in states_group.iterrows():
    total_state_pop+=group['TOT_P']
    total_literate_pop+=group['P_LIT']
    total_literate_pop+=group['F_LIT']
    rate=(total_literate_pop/total_state_pop)*100
    literacy_rate.append((group['Name'],rate))
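One thing to watch in the loop above: the two totals are never reset, so they keep accumulating across iterations, and each appended rate is a running figure over all rows seen so far rather than that state's own rate. A minimal sketch of a per-row variant follows. It runs on a toy frame whose values are made up, and it keeps the question's `P_LIT + F_LIT` numerator as-is without checking what those columns mean in the real dataset.

```python
import pandas as pd

# Toy data: column names follow the question, values are invented.
states_group = pd.DataFrame({
    "Name": ["STATE A", "STATE B"],
    "TOT_P": [100, 200],
    "P_LIT": [40, 90],
    "F_LIT": [10, 60],
})

literacy_rate = []
for _, row in states_group.iterrows():
    # row is a Series holding one state's values, so row["TOT_P"] is a scalar,
    # whereas states_group["TOT_P"] would be the whole column.
    rate = 100 * (row["P_LIT"] + row["F_LIT"]) / row["TOT_P"]
    literacy_rate.append((row["Name"], rate))

print(literacy_rate)
```

Because nothing is carried over between iterations, each tuple depends only on its own row, which matches what the vectorized expression in the first answer computes.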
【Comments】: