Python Data Analysis: Introductory pandas Exercises

Posted Geek_bao


Python Data Analysis Basics

Preparation

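The datasets for these exercises are available at the address below; download them and work from local copies if you can, because the links inside the individual exercises may no longer open.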
https://github.com/justmarkham/pandas-videos/tree/master/data

Exercise 1-GroupBy

Introduction:

GroupBy can be summarized as Split-Apply-Combine.
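To make Split-Apply-Combine concrete, here is a minimal sketch that performs the three steps by hand and then compares the result with a single groupby call. The toy DataFrame and the names toy, pieces, means, manual and auto are made up for illustration and are not part of the exercise data.

```python
import pandas as pd

# Hypothetical toy data, not taken from drinks.csv.
toy = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS"],
    "beer_servings": [100, 200, 10, 30],
})

# Split: iterate over the groups to get one sub-frame per continent.
pieces = {key: part for key, part in toy.groupby("continent")}

# Apply: compute the mean of each piece independently.
means = {key: part["beer_servings"].mean() for key, part in pieces.items()}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(means, name="beer_servings")

# groupby performs the same three steps in one expression.
auto = toy.groupby("continent")["beer_servings"].mean()

print(manual)
print(auto)
```

Both Series should hold the same per-continent means; the one-line form is what the exercises below use.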

Step 1. Import the necessary libraries

Code:

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called drinks.

Code:

# The comma is the default separator, so sep does not need to be passed.
drinks = pd.read_csv('drinks.csv')
drinks

Output:

                  country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
0             Afghanistan              0                0              0                           0.0         AS
1                 Albania             89              132             54                           4.9         EU
2                 Algeria             25                0             14                           0.7         AF
3                 Andorra            245              138            312                          12.4         EU
4                  Angola            217               57             45                           5.9         AF
5       Antigua & Barbuda            102              128             45                           4.9        NaN
6               Argentina            193               25            221                           8.3         SA
7                 Armenia             21              179             11                           3.8         EU
8               Australia            261               72            212                          10.4         OC
9                 Austria            279               75            191                           9.7         EU
10             Azerbaijan             21               46              5                           1.3         EU
11                Bahamas            122              176             51                           6.3        NaN
12                Bahrain             42               63              7                           2.0         AS
13             Bangladesh              0                0              0                           0.0         AS
14               Barbados            143              173             36                           6.3        NaN
15                Belarus            142              373             42                          14.4         EU
16                Belgium            295               84            212                          10.5         EU
17                 Belize            263              114              8                           6.8        NaN
18                  Benin             34                4             13                           1.1         AF
19                 Bhutan             23                0              0                           0.4         AS
20                Bolivia            167               41              8                           3.8         SA
21     Bosnia-Herzegovina             76              173              8                           4.6         EU
22               Botswana            173               35             35                           5.4         AF
23                 Brazil            245              145             16                           7.2         SA
24                 Brunei             31                2              1                           0.6         AS
25               Bulgaria            231              252             94                          10.3         EU
26           Burkina Faso             25                7              7                           4.3         AF
27                Burundi             88                0              0                           6.3         AF
28          Cote d'Ivoire             37                1              7                           4.0         AF
29             Cabo Verde            144               56             16                           4.0         AF
..                    ...            ...              ...            ...                           ...        ...
163              Suriname            128              178              7                           5.6         SA
164             Swaziland             90                2              2                           4.7         AF
165                Sweden            152               60            186                           7.2         EU
166           Switzerland            185              100            280                          10.2         EU
167                 Syria              5               35             16                           1.0         AS
168            Tajikistan              2               15              0                           0.3         AS
169              Thailand             99              258              1                           6.4         AS
170             Macedonia            106               27             86                           3.9         EU
171           Timor-Leste              1                1              4                           0.1         AS
172                  Togo             36                2             19                           1.3         AF
173                 Tonga             36               21              5                           1.1         OC
174     Trinidad & Tobago            197              156              7                           6.4        NaN
175               Tunisia             51                3             20                           1.3         AF
176                Turkey             51               22              7                           1.4         AS
177          Turkmenistan             19               71             32                           2.2         AS
178                Tuvalu              6               41              9                           1.0         OC
179                Uganda             45                9              0                           8.3         AF
180               Ukraine            206              237             45                           8.9         EU
181  United Arab Emirates             16              135              5                           2.8         AS
182        United Kingdom            219              126            195                          10.4         EU
183              Tanzania             36                6              1                           5.7         AF
184                   USA            249              158             84                           8.7        NaN
185               Uruguay            115               35            220                           6.6         SA
186            Uzbekistan             25              101              8                           2.4         AS
187               Vanuatu             21               18             11                           0.9         OC
188             Venezuela            333              100              3                           7.7         SA
189               Vietnam            111                2              1                           2.0         AS
190                 Yemen              6                0              0                           0.1         AS
191                Zambia             32               19              4                           2.5         AF
192              Zimbabwe             64               18              4                           4.7         AF

193 rows × 6 columns
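The NaN values in the continent column (for example USA and Bahamas) most likely come from the continent code NA for North America, which read_csv treats as a missing-value marker by default. That would also explain why the grouped results in the following steps have no North America row. The sketch below reproduces the effect on a two-row inline sample; csv_text, default and kept are illustrative names, and the sample is an assumption about the file layout rather than the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as drinks.csv.
csv_text = "country,beer_servings,continent\nUSA,249,NA\nAlbania,89,EU\n"

# Default behaviour: the string 'NA' is parsed as a missing value.
default = pd.read_csv(io.StringIO(csv_text))
print(default["continent"].isna().sum())

# keep_default_na=False turns off the built-in missing-value markers,
# so 'NA' survives as an ordinary string.
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(kept["continent"].tolist())
```

Note that keep_default_na=False also stops empty fields from becoming NaN, so it is only a safe choice when the file has no other missing values or when na_values is set explicitly.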

Step 4. Which continent drinks the most beer on average?

Code:

drinks.groupby('continent').beer_servings.mean()

Output:

continent
AF     61.471698
AS     37.045455
EU    193.777778
OC     89.687500
SA    175.083333
Name: beer_servings, dtype: float64
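The question asks for a single continent, so one extra call can pull the answer out of the grouped means. The sketch below rebuilds the Series by hand from the printed output (rounded to two decimals) so that it runs on its own; with the real data you would call idxmax() or sort_values() directly on the groupby result. The names beer_mean, top and ranked are illustrative.

```python
import pandas as pd

# Means copied from the output above, rounded to two decimals.
beer_mean = pd.Series(
    {"AF": 61.47, "AS": 37.05, "EU": 193.78, "OC": 89.69, "SA": 175.08},
    name="beer_servings",
)

# idxmax returns the index label of the largest value.
top = beer_mean.idxmax()

# sort_values gives the full ranking, highest first.
ranked = beer_mean.sort_values(ascending=False)

print(top)
print(ranked)
```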

Step 5. For each continent print the statistics for wine consumption.

Code:

drinks.groupby('continent').wine_servings.describe()

Output:

           count        mean        std  min   25%    50%     75%    max
continent
AF          53.0   16.264151  38.846419  0.0   1.0    2.0   13.00  233.0
AS          44.0    9.068182  21.667034  0.0   0.0    1.0    8.00  123.0
EU          45.0  142.222222  97.421738  0.0  59.0  128.0  195.00  370.0
OC          16.0   35.625000  64.555790  0.0   1.0    8.5   23.25  212.0
SA          12.0   62.416667  88.620189  1.0   3.0   12.0   98.50  221.0

Step 6. Print the mean alcohol consumption per continent for every column

Code:

# numeric_only=True skips the text column 'country', which newer pandas versions refuse to average.
drinks.groupby('continent').mean(numeric_only=True)

Output:

           beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
continent
AF             61.471698        16.339623      16.264151                      3.007547
AS             37.045455        60.840909       9.068182                      2.170455
EU            193.777778       132.555556     142.222222                      8.617778
OC             89.687500        58.437500      35.625000                      3.381250
SA            175.083333       114.750000      62.416667                      6.308333
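When a frame mixes text and numeric columns, there are two common ways to restrict a grouped mean to the numeric ones: select the columns explicitly, or pass numeric_only=True. The sketch below shows both on a made-up three-row frame; toy, by_selection and by_flag are illustrative names.

```python
import pandas as pd

# Hypothetical toy frame with one text column besides the group key.
toy = pd.DataFrame({
    "country": ["Albania", "Andorra", "Algeria"],
    "beer_servings": [89, 245, 25],
    "wine_servings": [54, 312, 14],
    "continent": ["EU", "EU", "AF"],
})

# Option 1: name the columns to aggregate.
by_selection = toy.groupby("continent")[["beer_servings", "wine_servings"]].mean()

# Option 2: let pandas drop every non-numeric column.
by_flag = toy.groupby("continent").mean(numeric_only=True)

print(by_selection)
print(by_flag)
```

Explicit selection is the more defensive choice, because the result does not change if a new numeric column is added to the file later.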

Step 7. Print the median alcohol consumption per continent for every column

Code:

# numeric_only=True skips the text column 'country'.
drinks.groupby('continent').median(numeric_only=True)

Output:

           beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
continent
AF                  32.0              3.0            2.0                          2.30
AS                  17.5             16.0            1.0                          1.20
EU                 219.0            122.0          128.0                         10.00
OC                  52.5             37.0            8.5                          1.75
SA                 162.5            108.5           12.0                          6.85

Step 8. Print the mean, min and max values for spirit consumption.

This time output a DataFrame

Code:

drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
# agg aggregates the grouped data; called on the whole groupby object, it applies to every non-key column.
# To aggregate only some columns, or to apply different functions to different columns, pass a dict:
# spirit_servings_info = {'spirit_servings': ['min', 'mean', 'max']}
# print(drinks.groupby('continent').agg(spirit_servings_info))

Output:

                 mean  min  max
continent
AF          16.339623    0  152
AS          60.840909    0  326
EU         132.555556    0  373
OC          58.437500    0  254
SA         114.750000   25  302
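Besides the list and dict forms mentioned in the comments above, agg also accepts keyword arguments of the form new_name=(column, function), usually called named aggregation. It yields flat, self-describing column names. A minimal sketch on made-up numbers follows; toy, summary and the output column names are illustrative.

```python
import pandas as pd

# Hypothetical toy data, not the real drinks.csv values per continent.
toy = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS"],
    "spirit_servings": [132, 100, 0, 326],
    "wine_servings": [54, 312, 0, 7],
})

# Each keyword becomes an output column: name=(source column, function).
summary = toy.groupby("continent").agg(
    spirit_mean=("spirit_servings", "mean"),
    spirit_max=("spirit_servings", "max"),
    wine_min=("wine_servings", "min"),
)

print(summary)
```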

Exercise 2-Occupation

Introduction:

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Step 1. Import the necessary libraries

Code:

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called users.

Code:

users = pd.read_table('u.user', sep='|', index_col = 'user_id')
users.head()

Output:

         age gender  occupation zip_code
user_id
1         24      M  technician    85711
2         53      F       other    94043
3         23      M      writer    32067
4         24      M  technician    43537
5         33      F       other    15213

Step 4. Find the mean age per occupation

Code:

users.groupby('occupation').age.mean()

Output:

occupation
administrator    38.746835
artist           31.392857
doctor           43.571429
educator         42.010526
engineer         36.388060
entertainment    29.222222
executive        38.718750
healthcare       41.562500
homemaker        32.571429
lawyer           36.750000
librarian        40.000000
marketing        37.615385
none             26.555556
other            34.523810
programmer       33.121212
retired          63.071429
salesman         35.666667
scientist        35.548387
student          22.081633
technician       33.148148
writer           36.311111
Name: age, dtype: float64

Step 5. Find the male ratio per occupation and sort it from highest to lowest

Code:

# Map gender to a number: 1 for male, 0 for female.
def gender_to_numeric(x):
    if x == 'M':
        return 1
    if x == 'F':
        return 0
users['gender_n'] = users['gender'].apply(gender_to_numeric)

# Men per occupation divided by the head count per occupation, as a percentage.
a = users.groupby('occupation').gender_n.sum() / users.occupation.value_counts() * 100
a.sort_values(ascending = False)

Output:

doctor           100.000000
engineer          97.014925
technician        96.296296
retired           92.857143
programmer        90.909091
executive         90.625000
scientist         90.322581
entertainment     88.888889
lawyer            83.333333
salesman          75.000000
educator          72.631579
student           69.387755
other             65.714286
marketing         61.538462
writer            57.777778
none              55.555556
administrator     54.430380
artist            53.571429
librarian         43.137255
healthcare        31.250000
homemaker         14.285714
dtype: float64
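The same ratio can be computed without a helper column: the mean of a boolean Series is the share of True values, so grouping a gender-equals-M mask by occupation gives the male share directly. This is a sketch on made-up rows, not the author's solution; toy_users and male_pct are illustrative names.

```python
import pandas as pd

# Hypothetical toy data in the same shape as the users table.
toy_users = pd.DataFrame({
    "occupation": ["doctor", "doctor", "artist", "artist", "artist", "artist"],
    "gender": ["M", "M", "F", "M", "F", "F"],
})

male_pct = (
    toy_users["gender"].eq("M")        # True where the user is male
    .groupby(toy_users["occupation"])  # group the mask by occupation
    .mean()                            # share of True values per group
    .mul(100)                          # convert to a percentage
    .sort_values(ascending=False)
)

print(male_pct)
```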

Step 6. For each occupation, calculate the minimum and maximum ages

Code:

users.groupby('occupation').age.agg(['min', 'max'])

Output:

               min  max
occupation
administrator   21   70
artist          19   48
doctor          28   64
educator        23   63
engineer        22   70
entertainment   15   50
executive       22   69
healthcare      22   62
homemaker       20   50
lawyer          21   53
librarian       23   69
marketing       24   55
none            11   55
other           13   64
programmer      20   63
retired         51   73
salesman        18   66
scientist       23   55
student          7   42
technician      21   55
writer          18   60

Step 7. For each combination of occupation and gender, calculate the mean age

Code:

# numeric_only=True skips the text column 'zip_code'.
users.groupby(['occupation', 'gender']).mean(numeric_only=True)

Output:

                            age  gender_n
occupation    gender
administrator F       40.638889       0.0
              M       37.162791       1.0
artist        F       30.307692       0.0
              M       32.333333       1.0
doctor        M       43.571429       1.0
educator      F       39.115385       0.0
              M       43.101449       1.0
engineer      F       29.500000       0.0
              M       36.600000       1.0
entertainment F       31.000000       0.0
              M       29.000000       1.0
executive     F       44.000000       0.0
              M       38.172414       1.0
healthcare    F       39.818182       0.0
              M       45.400000       1.0
homemaker     F       34.166667       0.0
              M       23.000000       1.0
lawyer        F       39.500000       0.0
              M       36.200000       1.0
librarian     F       40.000000       0.0
              M       40.000000       1.0
marketing     F       37.200000       0.0
              M       37.875000       1.0
none          F       36.500000       0.0
              M       18.600000       1.0
other         F       35.472222       0.0
              M       34.028986       1.0
programmer    F       32.166667       0.0
              M       33.216667       1.0
retired       F       70.000000       0.0
              M       62.538462       1.0
salesman      F       27.000000       0.0
              M       38.555556       1.0
scientist     F       28.333333       0.0
              M       36.321429       1.0
student       F       20.750000       0.0
              M       22.669118       1.0
technician    F       38.000000       0.0
              M       32.961538       1.0
writer        F       37.631579       0.0
              M       35.346154       1.0

Step 8. For each occupation present the percentage of women and men

Code:

# a = users.groupby('occupation').gender_n.sum() / users.occupation.value_counts() * 100
# print(a.sort_values(ascending = False))
# b = 100 - a
# print(b.sort_values(ascending=True))
gender_ocup = users.groupby(['occupation', 'gender']).agg({'gender': 'count'})  # head count per occupation and gender
occup_count = users.groupby(['occupation']).agg('count')                        # head count per occupation
occup_gender = gender_ocup.div(occup_count, level = "occupation") * 100        # share of each gender within its occupation; returns a DataFrame
occup_gender.loc[:, 'gender']   # show the gender column

Output:

occupation     gender
administrator  F          45.569620
               M          54.430380
artist         F          46.428571
               M          53.571429
doctor         M         100.000000
educator       F          27.368421
               M          72.631579
engineer       F           2.985075
               M          97.014925
entertainment  F          11.111111
               M          88.888889
executive      F           9.375000
               M          90.625000
healthcare     F          68.750000
               M          31.250000
homemaker      F          85.714286
               M          14.285714
lawyer         F          16.666667
               M          83.333333
librarian      F          56.862745
               M          43.137255
marketing      F          38.461538
               M          61.538462
none           F          44.444444
               M          55.555556
other          F          34.285714
               M          65.714286
programmer     F           9.090909
               M          90.909091
retired        F           7.142857
               M          92.857143
salesman       F          25.000000
               M          75.000000
scientist      F           9.677419
               M          90.322581
student        F          30.612245
               M          69.387755
technician     F           3.703704
               M          96.296296
writer         F          42.222222
               M          57.777778
Name: gender, dtype: float64
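A shorter route to the same table is value_counts(normalize=True) on the grouped gender column, which returns each gender's share within its occupation. The sketch below uses made-up rows; toy_users and shares are illustrative names, and this is an alternative rather than the approach used above.

```python
import pandas as pd

# Hypothetical toy data in the same shape as the users table.
toy_users = pd.DataFrame({
    "occupation": ["artist", "artist", "artist", "artist", "doctor", "doctor"],
    "gender": ["F", "F", "F", "M", "M", "M"],
})

# normalize=True returns fractions within each occupation; multiply for percent.
shares = (
    toy_users.groupby("occupation")["gender"]
    .value_counts(normalize=True)
    .mul(100)
)

print(shares)
```

The result is indexed by (occupation, gender), like the output above. A gender that does not occur in an occupation has no row at all, which is why doctor shows only M.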

Exercise 3-Regiment

Introduction:

Special thanks to: http://chrisalbon.com/ for sharing the dataset and materials.

Step 1. Import the necessary libraries

Code:

import pandas as pd

Step 2. Create the DataFrame with the following values:

Code:

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}

Step 3. Assign it to a variable called regiment.

Don’t forget to name each column

Code:

regiment = pd.DataFrame(raw_data, columns = raw_data.keys())
regiment

Output:

      regiment company      name  preTestScore  postTestScore
0   Nighthawks     1st    Miller             4             25
1   Nighthawks     1st  Jacobson            24             94
2   Nighthawks     2nd       Ali            31             57
3   Nighthawks     2nd    Milner             2             62
4     Dragoons     1st     Cooze             3             70
5     Dragoons     1st     Jacon             4             25
6     Dragoons     2nd    Ryaner            24             94
7     Dragoons     2nd      Sone            31             57
8       Scouts     1st     Sloan             2             62
9       Scouts     1st     Piger             3             70
10      Scouts     2nd     Riani             2             62
11      Scouts     2nd       Ali             3             70

Step 4. What is the mean preTestScore from the regiment Nighthawks?

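Code:

# numeric_only=True skips the text columns 'company' and 'name'.
regiment[regiment['regiment'] == 'Nighthawks'].groupby('regiment').mean(numeric_only=True)
# regiment[regiment['regiment'] == 'Nighthawks'].mean(numeric_only=True)

Output:

            preTestScore  postTestScore
regiment
Nighthawks         15.25           59.5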
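Since the question asks for one number, a boolean mask with .loc can return the scalar directly, with no groupby involved. The sketch below uses the Nighthawks and two Dragoons pre-test scores from the table in Step 3; toy_regiment and nighthawks_mean are illustrative names.

```python
import pandas as pd

# Subset of the regiment data shown earlier (Nighthawks plus two Dragoons rows).
toy_regiment = pd.DataFrame({
    "regiment": ["Nighthawks"] * 4 + ["Dragoons"] * 2,
    "preTestScore": [4, 24, 31, 2, 3, 4],
})

# Row mask selects the regiment, column label selects the score.
is_nighthawk = toy_regiment["regiment"] == "Nighthawks"
nighthawks_mean = toy_regiment.loc[is_nighthawk, "preTestScore"].mean()

print(nighthawks_mean)  # 15.25
```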

Step 5. Present general statistics by company

Code:

regiment.groupby('company').describe()

Output:

        postTestScore
                count       mean        std   min    25%   50%   75%   max
company
1st               6.0  57.666667  27.485754  25.0  34.25  66.0  70.0  94.0
2nd               6.0  67.000000  14.057027  57.0  58.25  62.0  68.0  94.0

        preTestScore
               count       mean        std  min   25%   50%    75%   max
company
1st              6.0   6.666667   8.524475  2.0  3.00   3.5   4.00  24.0
2nd              6.0  15.500000  14.652645  2.0  2.25  13.5  29.25  31.0

Step 6. What is the mean of each company's preTestScore?

Code:

regiment.groupby('company').preTestScore.mean()

Output:

company
1st     6.666667
2nd    15.500000
Name: preTestScore, dtype: float64

Step 7. Present the mean preTestScores grouped by regiment and company

Code:

regiment.groupby(['regiment', 'company']).preTestScore.mean()

Output:

regiment    company
Dragoons    1st         3.5
            2nd        27.5
Nighthawks  1st        14.0
            2nd        16.5
Scouts      1st         2.5
            2nd         2.5
Name: preTestScore, dtype: float64

Step 8. Present the mean preTestScores grouped by regiment and company without hierarchical indexing

Code:

'''
stack() and unstack()
stack: pivots columns into rows.
unstack: pivots rows (an index level) into columns.
With a multi-level index, both act on the innermost level by default.
'''
regiment.groupby(['regiment', 'company']).preTestScore.mean().unstack()

Output:

company      1st   2nd
regiment
Dragoons     3.5  27.5
Nighthawks  14.0  16.5
Scouts       2.5   2.5
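unstack() produces a wide table with one column per company. If a long table with ordinary columns is wanted instead, reset_index() or as_index=False moves the group keys out of the index. A minimal sketch on a subset of the regiment data follows; toy_regiment, flat and same are illustrative names.

```python
import pandas as pd

# Subset of the regiment data shown earlier.
toy_regiment = pd.DataFrame({
    "regiment": ["Dragoons", "Dragoons", "Dragoons", "Dragoons", "Scouts", "Scouts"],
    "company": ["1st", "1st", "2nd", "2nd", "1st", "1st"],
    "preTestScore": [3, 4, 24, 31, 2, 3],
})

# Option 1: aggregate, then turn the MultiIndex levels back into columns.
flat = (
    toy_regiment.groupby(["regiment", "company"])["preTestScore"]
    .mean()
    .reset_index()
)

# Option 2: ask groupby not to build the index in the first place.
same = toy_regiment.groupby(["regiment", "company"], as_index=False)["preTestScore"].mean()

print(flat)
print(same)
```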

Step 9. Group the entire dataframe by regiment and company

Code:

# numeric_only=True skips the text column 'name'.
regiment.groupby(['regiment', 'company']).mean(numeric_only=True)

Output:

                    preTestScore  postTestScore
regiment   company
Dragoons   1st               3.5           47.5
           2nd              27.5           75.5
Nighthawks 1st              14.0           59.5
           2nd              16.5           59.5
Scouts     1st               2.5           66.0
           2nd               2.5           66.0

Step 10. What is the number of observations in each regiment and company?

Code:

regiment.groupby(['regiment', 'company']).size()

Output:

regiment    company
Dragoons    1st        2
            2nd        2
Nighthawks  1st        2
            2nd        2
Scouts      1st        2
            2nd        2
dtype: int64
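size() counts rows, including rows with missing values, whereas count() counts only non-missing values per column. The two agree on the regiment data because it has no gaps; the sketch below adds one made-up missing score to show the difference. toy, sizes and counts are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing score.
toy = pd.DataFrame({
    "company": ["1st", "1st", "2nd"],
    "preTestScore": [4, np.nan, 31],
})

# size: number of rows in each group, NaN or not.
sizes = toy.groupby("company").size()

# count: number of non-missing values in the selected column.
counts = toy.groupby("company")["preTestScore"].count()

print(sizes)
print(counts)
```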

Step 11. Iterate over a group and print the name and the whole data from the regiment

Code:

for name, group in regiment.groupby('regiment'):
    print(name)
    print(group)

Output:

Dragoons
   regiment company    name  preTestScore  postTestScore
4  Dragoons     1st   Cooze             3             70
5  Dragoons     1st   Jacon             4             25
6  Dragoons     2nd  Ryaner            24             94
7  Dragoons     2nd    Sone            31             57
Nighthawks
     regiment company      name  preTestScore  postTestScore
0  Nighthawks     1st    Miller             4             25
1  Nighthawks     1st  Jacobson            24             94
2  Nighthawks     2nd       Ali            31             57
3  Nighthawks     2nd    Milner             2             62
Scouts
   regiment company   name  preTestScore  postTestScore
8    Scouts     1st  Sloan             2             62
9    Scouts     1st  Piger             3             70
10   Scouts     2nd  Riani             2             62
11   Scouts     2nd    Ali             3             70
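When only one group is needed, get_group() fetches it without a loop, and the groups attribute lists the available keys. A small sketch on a subset of the regiment data follows; toy_regiment, grouped, scouts and labels are illustrative names.

```python
import pandas as pd

# Subset of the regiment data shown earlier.
toy_regiment = pd.DataFrame({
    "regiment": ["Dragoons", "Dragoons", "Scouts", "Scouts"],
    "name": ["Cooze", "Jacon", "Sloan", "Piger"],
    "preTestScore": [3, 4, 2, 3],
})

grouped = toy_regiment.groupby("regiment")

# The group keys, i.e. the values that 'name' takes in the loop above.
labels = sorted(grouped.groups)

# One group as a regular DataFrame.
scouts = grouped.get_group("Scouts")

print(labels)
print(scouts)
```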

Conclusion

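That is all for today's pandas exercises. Keep practicing! The questions were left in English this time, so it is worth getting used to reading them that way. If you have questions, raise them in the comments, and let's keep learning and improving together.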
