Python Data Analysis: Introductory pandas Exercises

Posted by Geek_bao

Python Data Analysis Basics

Preparation

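The exercises below each use a dataset; download it for local use if you can, because the links given in the exercises may not open.
If you need the datasets, message the author privately or search for them online. They are not uploaded to CSDN because downloading from there requires a membership.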

Exercise 1-Student Alcohol Consumption

Introduction:

This time you will download a dataset from the UCI.

Step 1. Import the necessary libraries

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called df.

Code:

df = pd.read_csv("student-mat.csv", sep=',')
df.head()

Output:

   school  sex  age  address  famsize  Pstatus  Medu  Fedu     Mjob      Fjob  ...  famrel  freetime  goout  Dalc  Walc  health  absences  G1  G2  G3
0      GP    F   18        U      GT3        A     4     4  at_home   teacher  ...       4         3      4     1     1       3         6   5   6   6
1      GP    F   17        U      GT3        T     1     1  at_home     other  ...       5         3      3     1     1       3         4   5   5   6
2      GP    F   15        U      LE3        T     1     1  at_home     other  ...       4         3      2     2     3       3        10   7   8  10
3      GP    F   15        U      GT3        T     4     2   health  services  ...       3         2      2     1     1       5         2  15  14  15
4      GP    F   16        U      GT3        T     3     3    other     other  ...       4         3      2     1     2       5         4   6  10  10

5 rows × 33 columns
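
If the CSV file is not at hand, the loading step can still be tried on an in-memory string. The sketch below assumes a made-up two-row excerpt (`csv_text` is hypothetical, not the real file) and only illustrates how `read_csv` and `sep` behave.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt; the real file has 395 rows and 33 columns.
csv_text = "school,sex,age\nGP,F,18\nGP,F,17\n"

# sep="," is already read_csv's default; it is spelled out to mirror the exercise.
sample = pd.read_csv(io.StringIO(csv_text), sep=",")

print(sample.shape)
print(sample.head())
```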

Step 4. For the purpose of this exercise slice the dataframe from ‘school’ until the ‘guardian’ column

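Code:

# loc slices by row/column labels; iloc slices by integer position
# .copy() makes the slice independent, so later column assignments do not raise SettingWithCopyWarning
stud_alcoh = df.loc[:, 'school':'guardian'].copy()
stud_alcoh.head()

Output:

   school  sex  age  address  famsize  Pstatus  Medu  Fedu      Mjob      Fjob      reason  guardian
0      GP    F   18        U      GT3        A     4     4   at_home   teacher      course    mother
1      GP    F   17        U      GT3        T     1     1   at_home     other      course    father
2      GP    F   15        U      LE3        T     1     1   at_home     other       other    mother
3      GP    F   15        U      GT3        T     4     2    health  services        home    mother
4      GP    F   16        U      GT3        T     3     3     other     other        home    father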
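
One detail of the slice above is easy to miss: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position. A minimal sketch on a hypothetical five-column frame (`toy` is invented for illustration):

```python
import pandas as pd

# Hypothetical three-row frame with a few of the columns from the exercise.
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "F"],
    "age": [18, 17, 15],
    "guardian": ["mother", "father", "other"],
    "G3": [6, 10, 15],
})

by_label = toy.loc[:, "school":"guardian"]   # label slice: end label is included
by_position = toy.iloc[:, 0:4]               # position slice: end position is excluded

print(list(by_label.columns))
print(by_label.equals(by_position))
```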

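Step 5. Create a lambda function that capitalizes strings.

Code:

# capitalize() upper-cases the first character and lower-cases the rest; upper() upper-cases the whole string
capitalizer = lambda s: s.capitalize()   # parameter renamed from str so the built-in type is not shadowed
print(capitalizer('www'))

Output: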

Www
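
The difference between the two string methods can be checked directly in plain Python. This is only a small illustration using a job label from the dataset and one made-up mixed-case string:

```python
label = "at_home"

print(label.capitalize())   # first character upper-cased, the rest lower-cased
print(label.upper())        # every character upper-cased

mixed = "wWW"               # hypothetical mixed-case input
print(mixed.capitalize())   # capitalize() also lower-cases the tail
```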

Step 6. Capitalize both Mjob and Fjob

Code:

# for i in df['Mjob']:
#    print(capitalizer(i))
stud_alcoh.Mjob.apply(capitalizer)
stud_alcoh.Fjob.apply(capitalizer)   # a notebook displays only the last expression, hence the Fjob output

Output:

0       Teacher
1         Other
2         Other
3      Services
4         Other
5         Other
6         Other
7       Teacher
8         Other
9         Other
10       Health
11        Other
12     Services
13        Other
14        Other
15        Other
16     Services
17        Other
18     Services
19        Other
20        Other
21       Health
22        Other
23        Other
24       Health
25     Services
26        Other
27     Services
28        Other
29      Teacher
         ...   
365       Other
366    Services
367    Services
368    Services
369     Teacher
370    Services
371    Services
372     At_home
373       Other
374       Other
375       Other
376       Other
377    Services
378       Other
379       Other
380     Teacher
381       Other
382    Services
383    Services
384       Other
385       Other
386     At_home
387       Other
388    Services
389       Other
390    Services
391    Services
392       Other
393       Other
394     At_home
Name: Fjob, Length: 395, dtype: object
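
As an alternative to `apply` with a lambda, pandas also offers the vectorized `.str` accessor. The sketch below uses a hypothetical three-value Series (`fjob`) and shows that both routes give the same values while the original Series stays untouched.

```python
import pandas as pd

# Hypothetical miniature of the Fjob column.
fjob = pd.Series(["teacher", "other", "at_home"], name="Fjob")

via_apply = fjob.apply(lambda s: s.capitalize())
via_str = fjob.str.capitalize()     # vectorized string accessor

print(via_str.tolist())
print(fjob.tolist())                # the original Series is unchanged
```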

Step 7. Print the last elements of the data set.

Code:

# df.iloc[394, 32]
stud_alcoh.tail()

Output:

     school  sex  age  address  famsize  Pstatus  Medu  Fedu      Mjob      Fjob      reason  guardian
390      MS    M   20        U      LE3        A     2     2  services  services      course     other
391      MS    M   17        U      LE3        T     3     1  services  services      course    mother
392      MS    M   21        R      GT3        T     1     1     other     other      course     other
393      MS    M   18        R      LE3        T     3     2  services     other      course    mother
394      MS    M   19        U      LE3        T     1     1     other   at_home      course    father

Step 8. Did you notice the original dataframe is still lowercase? Why is that? Fix it and capitalize Mjob and Fjob.

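Code:

# apply() returns a new Series instead of changing the column in place,
# so Step 6 left the frame untouched; assign the result back to keep it
stud_alcoh.Mjob = stud_alcoh.Mjob.apply(capitalizer)
stud_alcoh.Fjob = stud_alcoh.Fjob.apply(capitalizer)
stud_alcoh

Output: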

     school  sex  age  address  famsize  Pstatus  Medu  Fedu      Mjob      Fjob      reason  guardian
0        GP    F   18        U      GT3        A     4     4   At_home   Teacher      course    mother
1        GP    F   17        U      GT3        T     1     1   At_home     Other      course    father
2        GP    F   15        U      LE3        T     1     1   At_home     Other       other    mother
3        GP    F   15        U      GT3        T     4     2    Health  Services        home    mother
4        GP    F   16        U      GT3        T     3     3     Other     Other        home    father
5        GP    M   16        U      LE3        T     4     3  Services     Other  reputation    mother
6        GP    M   16        U      LE3        T     2     2     Other     Other        home    mother
7        GP    F   17        U      GT3        A     4     4     Other   Teacher        home    mother
8        GP    M   15        U      LE3        A     3     2  Services     Other        home    mother
9        GP    M   15        U      GT3        T     3     4     Other     Other        home    mother
10       GP    F   15        U      GT3        T     4     4   Teacher    Health  reputation    mother
11       GP    F   15        U      GT3        T     2     1  Services     Other  reputation    father
12       GP    M   15        U      LE3        T     4     4    Health  Services      course    father
13       GP    M   15        U      GT3        T     4     3   Teacher     Other      course    mother
14       GP    M   15        U      GT3        A     2     2     Other     Other        home     other
15       GP    F   16        U      GT3        T     4     4    Health     Other        home    mother
16       GP    F   16        U      GT3        T     4     4  Services  Services  reputation    mother
17       GP    F   16        U      GT3        T     3     3     Other     Other  reputation    mother
18       GP    M   17        U      GT3        T     3     2  Services  Services      course    mother
19       GP    M   16        U      LE3        T     4     3    Health     Other        home    father
20       GP    M   15        U      GT3        T     4     3   Teacher     Other  reputation    mother
21       GP    M   15        U      GT3        T     4     4    Health    Health       other    father
22       GP    M   16        U      LE3        T     4     2   Teacher     Other      course    mother
23       GP    M   16        U      LE3        T     2     2     Other     Other  reputation    mother
24       GP    F   15        R      GT3        T     2     4  Services    Health      course    mother
25       GP    F   16        U      GT3        T     2     2  Services  Services        home    mother
26       GP    M   15        U      GT3        T     2     2     Other     Other        home    mother
27       GP    M   15        U      GT3        T     4     2    Health  Services       other    mother
28       GP    M   16        U      LE3        A     3     4  Services     Other        home    mother
29       GP    M   16        U      GT3        T     4     4   Teacher   Teacher        home    mother
..      ...  ...  ...      ...      ...      ...   ...   ...       ...       ...         ...       ...
365      MS    M   18        R      GT3        T     1     3   At_home     Other      course    mother
366      MS    M   18        U      LE3        T     4     4   Teacher  Services       other    mother
367      MS    F   17        R      GT3        T     1     1     Other  Services  reputation    mother
368      MS    F   18        U      GT3        T     2     3   At_home  Services      course    father
369      MS    F   18        R      GT3        T     4     4     Other   Teacher       other    father
370      MS    F   19        U      LE3        T     3     2  Services  Services        home     other
371      MS    M   18        R      LE3        T     1     2   At_home  Services       other    father
372      MS    F   17        U      GT3        T     2     2     Other   At_home        home    mother
373      MS    F   17        R      GT3        T     1     2     Other     Other      course    mother
374      MS    F   18        R      LE3        T     4     4     Other     Other  reputation    mother
375      MS    F   18        R      GT3        T     1     1     Other     Other        home    mother
376      MS    F   20        U      GT3        T     4     2    Health     Other      course     other
377      MS    F   18        R      LE3        T     4     4   Teacher  Services      course    mother
378      MS    F   18        U      GT3        T     3     3     Other     Other        home    mother
379      MS    F   17        R      GT3        T     3     1   At_home     Other  reputation    mother
380      MS    M   18        U      GT3        T     4     4   Teacher   Teacher        home    father
381      MS    M   18        R      GT3        T     2     1     Other     Other       other    mother
382      MS    M   17        U      GT3        T     2     3     Other  Services        home    father
383      MS    M   19        R      GT3        T     1     1     Other  Services       other    mother
384      MS    M   18        R      GT3        T     4     2     Other     Other        home    father
385      MS    F   18        R      GT3        T     2     2   At_home     Other       other    mother
386      MS    F   18        R      GT3        T     4     4   Teacher   At_home  reputation    mother
387      MS    F   19        R      GT3        T     2     3  Services     Other      course    mother
388      MS    F   18        U      LE3        T     3     1   Teacher  Services      course    mother
389      MS    F   18        U      GT3        T     1     1     Other     Other      course    mother
390      MS    M   20        U      LE3        A     2     2  Services  Services      course     other
391      MS    M   17        U      LE3        T     3     1  Services  Services      course    mother
392      MS    M   21        R      GT3        T     1     1     Other     Other      course     other
393      MS    M   18        R      LE3        T     3     2  Services     Other      course    mother
394      MS    M   19        U      LE3        T     1     1     Other   At_home      course    father

395 rows × 12 columns
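
The answer to the "why" in this step is that the capitalized values only persist once they are assigned back to the frame. The sketch below assumes a hypothetical two-row miniature of the data and takes an explicit `.copy()` of the slice first, which is one common way to keep such assignments unambiguous; it is an illustration, not necessarily how the original notebook was run.

```python
import pandas as pd

# Hypothetical miniature of the full dataset.
df = pd.DataFrame({
    "school": ["GP", "MS"],
    "Mjob": ["at_home", "health"],
    "Fjob": ["teacher", "services"],
    "guardian": ["mother", "father"],
    "G3": [6, 15],
})

# .copy() makes the slice an independent frame, so writing to it is unambiguous.
stud = df.loc[:, "school":"guardian"].copy()

# str.capitalize(), like apply(), returns a new Series; assigning it back changes the frame.
for col in ["Mjob", "Fjob"]:
    stud[col] = stud[col].str.capitalize()

print(stud)
print(df["Mjob"].tolist())   # the source frame stays lowercase
```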

Step 9. Create a function called majority that returns a boolean value to a new column called legal_drinker (Consider majority as older than 17 years old)

Code:

def majority(age):
    # the comparison already yields True or False
    return age > 17
stud_alcoh['legal_drinker'] = stud_alcoh.age.apply(majority)
stud_alcoh

Output:

     school  sex  age  address  famsize  Pstatus  Medu  Fedu      Mjob      Fjob      reason  guardian  legal_drinker
0        GP    F   18        U      GT3        A     4     4   At_home   Teacher      course    mother           True
1        GP    F   17        U      GT3        T     1     1   At_home     Other      course    father          False
2        GP    F   15        U      LE3        T     1     1   At_home     Other       other    mother          False
3        GP    F   15        U      GT3        T     4     2    Health  Services        home    mother          False
4        GP    F   16        U      GT3        T     3     3     Other     Other        home    father          False
5        GP    M   16        U      LE3        T     4     3  Services     Other  reputation    mother          False
6        GP    M   16        U      LE3        T     2     2     Other     Other        home    mother          False
7        GP    F   17        U      GT3        A     4     4     Other   Teacher        home    mother          False
8        GP    M   15        U      LE3        A     3     2  Services     Other        home    mother          False
9        GP    M   15        U      GT3        T     3     4     Other     Other        home    mother          False
10       GP    F   15        U      GT3        T     4     4   Teacher    Health  reputation    mother          False
11       GP    F   15        U      GT3        T     2     1  Services     Other  reputation    father          False
12       GP    M   15        U      LE3        T     4     4    Health  Services      course    father          False
13       GP    M   15        U      GT3        T     4     3   Teacher     Other      course    mother          False
14       GP    M   15        U      GT3        A     2     2     Other     Other        home     other          False
15       GP    F   16        U      GT3        T     4     4    Health     Other        home    mother          False
16       GP    F   16        U      GT3        T     4     4  Services  Services  reputation    mother          False
17       GP    F   16        U      GT3        T     3     3     Other     Other  reputation    mother          False
18       GP    M   17        U      GT3        T     3     2  Services  Services      course    mother          False
19       GP    M   16        U      LE3        T     4     3    Health     Other        home    father          False
20       GP    M   15        U      GT3        T     4     3   Teacher     Other  reputation    mother          False
21       GP    M   15        U      GT3        T     4     4    Health    Health       other    father          False
22       GP    M   16        U      LE3        T     4     2   Teacher     Other      course    mother          False
23       GP    M   16        U      LE3        T     2     2     Other     Other  reputation    mother          False
24       GP    F   15        R      GT3        T     2     4  Services    Health      course    mother          False
25       GP    F   16        U      GT3        T     2     2  Services  Services        home    mother          False
26       GP    M   15        U      GT3        T     2     2     Other     Other        home    mother          False
27       GP    M   15        U      GT3        T     4     2    Health  Services       other    mother          False
28       GP    M   16        U      LE3        A     3     4  Services     Other        home    mother          False
29       GP    M   16        U      GT3        T     4     4   Teacher   Teacher        home    mother          False
..      ...  ...  ...      ...      ...      ...   ...   ...       ...       ...         ...       ...            ...
365      MS    M   18        R      GT3        T     1     3   At_home     Other      course    mother           True
366      MS    M   18        U      LE3        T     4     4   Teacher  Services       other    mother           True
367      MS    F   17        R      GT3        T     1     1     Other  Services  reputation    mother          False
368      MS    F   18        U      GT3        T     2     3   At_home  Services      course    father           True
369      MS    F   18        R      GT3        T     4     4     Other   Teacher       other    father           True
370      MS    F   19        U      LE3        T     3     2  Services  Services        home     other           True
371      MS    M   18        R      LE3        T     1     2   At_home  Services       other    father           True
372      MS    F   17        U      GT3        T     2     2     Other   At_home        home    mother          False
373      MS    F   17        R      GT3        T     1     2     Other     Other      course    mother          False
374      MS    F   18        R      LE3        T     4     4     Other     Other  reputation    mother           True
375      MS    F   18        R      GT3        T     1     1     Other     Other        home    mother           True
376      MS    F   20        U      GT3        T     4     2    Health     Other      course     other           True
377      MS    F   18        R      LE3        T     4     4   Teacher  Services      course    mother           True
378      MS    F   18        U      GT3        T     3     3     Other     Other        home    mother           True
379      MS    F   17        R      GT3        T     3     1   At_home     Other  reputation    mother          False
380      MS    M   18        U      GT3        T     4     4   Teacher   Teacher        home    father           True
381      MS    M   18        R      GT3        T     2     1     Other     Other       other    mother           True
382      MS    M   17        U      GT3        T     2     3     Other  Services        home    father          False
383      MS    M   19        R      GT3        T     1     1     Other  Services       other    mother           True
384      MS    M   18        R      GT3        T     4     2     Other     Other        home    father           True
385      MS    F   18        R      GT3        T     2     2   At_home     Other       other    mother           True
386      MS    F   18        R      GT3        T     4     4   Teacher   At_home  reputation    mother           True
387      MS    F   19        R      GT3        T     2     3  Services     Other      course    mother           True
388      MS    F   18        U      LE3        T     3     1   Teacher  Services      course    mother           True
389      MS    F   18        U      GT3        T     1     1     Other     Other      course    mother           True
390      MS    M   20        U      LE3        A     2     2  Services  Services      course     other           True
391      MS    M   17        U      LE3        T     3     1  Services  Services      course    mother          False
392      MS    M   21        R      GT3        T     1     1     Other     Other      course     other           True
393      MS    M   18        R      LE3        T     3     2  Services     Other      course    mother           True
394      MS    M   19        U      LE3        T     1     1     Other   At_home      course    father           True

395 rows × 13 columns
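
Because the rule is a single comparison, the same boolean column can also be produced without `apply`. A small sketch on a hypothetical `ages` Series, comparing the two routes:

```python
import pandas as pd

ages = pd.Series([18, 17, 15, 19], name="age")   # hypothetical ages

def majority(age):
    return age > 17

via_apply = ages.apply(majority)
via_compare = ages > 17          # element-wise comparison on the whole Series

print(via_compare.tolist())
```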

Step 10. Multiply every number of the dataset by 10.

I know this makes no sense; don't forget it is just an exercise.

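Code:

def times10(x):
    # only plain ints are scaled; strings and booleans pass through unchanged
    if type(x) is int:
        return 10 * x
    return x
# apply() applies a function to each row or column of a DataFrame
# applymap() applies a function to every element of a DataFrame
# (pandas 2.1 and later deprecate applymap in favor of DataFrame.map)
stud_alcoh.applymap(times10).head()

Output:

   school  sex  age  address  famsize  Pstatus  Medu  Fedu      Mjob      Fjob      reason  guardian  legal_drinker
0      GP    F  180        U      GT3        A    40    40   At_home   Teacher      course    mother           True
1      GP    F  170        U      GT3        T    10    10   At_home     Other      course    father          False
2      GP    F  150        U      LE3        T    10    10   At_home     Other       other    mother          False
3      GP    F  150        U      GT3        T    40    20    Health  Services        home    mother          False
4      GP    F  160        U      GT3        T    30    30     Other     Other        home    father          False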
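
An element-by-element function is one way to do this; another is to pick out the numeric columns and multiply them as a block. The sketch below uses `select_dtypes` on a hypothetical four-column frame (`toy`); it is an alternative to the approach above, not the exercise's own solution.

```python
import pandas as pd

# Hypothetical miniature with string, integer, and boolean columns.
toy = pd.DataFrame({
    "school": ["GP", "MS"],
    "age": [18, 17],
    "Medu": [4, 1],
    "legal_drinker": [True, False],
})

# include="number" selects integer and float columns; bool and string columns are left out.
numeric_cols = toy.select_dtypes(include="number").columns

scaled = toy.copy()
scaled[numeric_cols] = scaled[numeric_cols] * 10

print(scaled)
```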

Exercise 2-United States - Crime Rates - 1960 - 2014

Introduction:

This time you will create a data

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Step 1. Import the necessary libraries

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called crime.

Code:

crime = pd.read_csv('US_Crime_Rates_1960_2014.csv')
crime.head()

Output:

   Year  Population    Total  Violent  Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
0  1960   179323175  3384200   288460   3095700    9110          17190   107840              154320    912100        1855400         328200
1  1961   182992000  3488000   289390   3198600    8740          17220   106670              156760    949600        1913000         336000
2  1962   185771000  3752200   301510   3450700    8530          17550   110860              164570    994300        2089600         366800
3  1963   188483000  4109500   316970   3792500    8640          17650   116470              174210   1086400        2297800         408300
4  1964   191141000  4564600   364220   4200400    9360          21420   130390              203050   1213200        2514400         472800

Step 4. What is the type of the columns?

Code:

# crime.columns
crime.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
Year                  55 non-null int64
Population            55 non-null int64
Total                 55 non-null int64
Violent               55 non-null int64
Property              55 non-null int64
Murder                55 non-null int64
Forcible_Rape         55 non-null int64
Robbery               55 non-null int64
Aggravated_assault    55 non-null int64
Burglary              55 non-null int64
Larceny_Theft         55 non-null int64
Vehicle_Theft         55 non-null int64
dtypes: int64(12)
memory usage: 5.2 KB
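
`info()` prints a full report; when only the column types are needed, the `dtypes` attribute returns them as a Series. A minimal sketch on a hypothetical two-row excerpt of the crime table:

```python
import pandas as pd

# Hypothetical two-row excerpt of the crime table.
toy = pd.DataFrame({"Year": [1960, 1961], "Population": [179323175, 182992000]})

print(toy.dtypes)            # one dtype per column, as a Series
print(toy["Year"].dtype)     # dtype of a single column
```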

Have you noticed that the type of Year is int64? pandas has a dedicated type for working with time series. Let's see it now.

Step 5. Convert the type of the column Year to datetime64

Code:

crime.Year = pd.to_datetime(crime.Year, format='%Y')   # parse as dates; format codes follow the %Y%m%d %H%M%S pattern
crime.info()   # prints column names, dtypes, and related information

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
Year                  55 non-null datetime64[ns]
Population            55 non-null int64
Total                 55 non-null int64
Violent               55 non-null int64
Property              55 non-null int64
Murder                55 non-null int64
Forcible_Rape         55 non-null int64
Robbery               55 non-null int64
Aggravated_assault    55 non-null int64
Burglary              55 non-null int64
Larceny_Theft         55 non-null int64
Vehicle_Theft         55 non-null int64
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.2 KB
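
The conversion can be tried in isolation on a few year values. The sketch below assumes a hypothetical three-value Series and goes through strings first, which keeps the `%Y` format unambiguous; each year becomes a timestamp at January 1 of that year.

```python
import pandas as pd

years = pd.Series([1960, 1961, 2014])   # hypothetical sample of the Year column

# Converting to strings first keeps the "%Y" format unambiguous.
as_dates = pd.to_datetime(years.astype(str), format="%Y")

print(as_dates.dt.year.tolist())
print(as_dates.iloc[0])
```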

Step 6. Set the Year column as the index of the dataframe

Code:

crime = crime.set_index('Year', drop = True)# drop defaults to True, which removes the Year column once it becomes the index; pass drop=False to keep it
crime.head()

Output:

            Population    Total  Violent  Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
Year
1960-01-01   179323175  3384200   288460   3095700    9110          17190   107840              154320    912100        1855400         328200
1961-01-01   182992000  3488000   289390   3198600    8740          17220   106670              156760    949600        1913000         336000
1962-01-01   185771000  3752200   301510   3450700    8530          17550   110860              164570    994300        2089600         366800
1963-01-01   188483000  4109500   316970   3792500    8640          17650   116470              174210   1086400        2297800         408300
1964-01-01   191141000  4564600   364220   4200400    9360          21420   130390              203050   1213200        2514400         472800
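
The effect of the `drop` parameter is easiest to see side by side. A small sketch on a hypothetical two-row frame (`toy`), comparing the default with `drop=False`:

```python
import pandas as pd

toy = pd.DataFrame({"Year": [1960, 1961], "Total": [3384200, 3488000]})

default_drop = toy.set_index("Year")              # drop=True is the default
keep_column = toy.set_index("Year", drop=False)   # Year stays as a column too

print(list(default_drop.columns))
print(list(keep_column.columns))
```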

Step 7. Delete the Total column

Code:

del crime['Total']   # del removes the column from the frame in place
crime.head()

Output:

            Population  Violent  Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
Year
1960-01-01   179323175   288460   3095700    9110          17190   107840              154320    912100        1855400         328200
1961-01-01   182992000   289390   3198600    8740          17220   106670              156760    949600        1913000         336000
1962-01-01   185771000   301510   3450700    8530          17550   110860              164570    994300        2089600         366800
1963-01-01   188483000   316970   3792500    8640          17650   116470              174210   1086400        2297800         408300
1964-01-01   191141000   364220   4200400    9360          21420   130390              203050   1213200        2514400         472800
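
`del` changes the frame it is applied to. When the original should be kept, `drop(columns=...)` returns a new frame instead. A sketch on a hypothetical one-row frame showing both:

```python
import pandas as pd

toy = pd.DataFrame({"Total": [3384200], "Violent": [288460], "Property": [3095700]})

without_total = toy.drop(columns="Total")   # returns a new frame; toy still has Total
print(list(without_total.columns))
print("Total" in toy.columns)

del toy["Total"]                            # removes the column from toy itself
print(list(toy.columns))
```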

Step 8. Group the year by decades and sum the values

Pay attention to the Population column: summing it across years is a mistake, because the same people would be counted once per year.

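Code:

# Uses resample to sum each decade
# resample() aggregates time-indexed data by a frequency alias: W = week, M = month, Q = quarter,
# QS = quarter start, A = year, 10A = ten years, 10AS = ten years anchored at the first day of the year
# (pandas 2.2 and later deprecate 'AS' in favor of 'YS')
crimes = crime.resample('10AS').sum()

# Uses resample to get the max value only for the "Population" column
population = crime['Population'].resample('10AS').max()

# Updating the "Population" column
crimes['Population'] = population
crimes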

Output:

             Population   Violent   Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
Year
1960-01-01  201385000.0   4134930   45160900  106180         236720  1633510             2158520  13321100       26547700        5292100
1970-01-01  220099000.0   9607930   91383800  192230         554570  4159020             4702120  28486000       53157800        9739900
1980-01-01  248239000.0  14074328  117048900  206439         865639  5383109             7619130  33073494       72040253       11935411
1990-01-01  272690813.0  17527048  119053499  211664         998827  5748930            10568963  26750015       77679366       14624418
2000-01-01  307006550.0  13968056  100944369  163068         922499  4230366             8652124  21565176       67970291       11412834
2010-01-01  318857056.0   6072017   44095950   72867         421059  1749809             3764142  10125170       30401698        3569080
2020-01-01          NaN         0          0       0              0        0                   0         0              0              0
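
The same "sum the crimes, take the maximum population" idea can be sketched without frequency aliases by grouping on a computed decade label. This is an alternative to `resample`, not the method used above, and the four-row frame (`toy`) with its numbers is made up for illustration.

```python
import pandas as pd

# Hypothetical yearly figures spanning two decades.
toy = pd.DataFrame(
    {"Population": [100, 110, 120, 130], "Murder": [5, 6, 7, 8]},
    index=pd.to_datetime(["1968", "1969", "1970", "1971"], format="%Y"),
)

decade = toy.index.year // 10 * 10          # 1968 -> 1960, 1971 -> 1970

# Sum the crime counts but take the maximum population within each decade.
by_decade = toy.groupby(decade).agg({"Population": "max", "Murder": "sum"})

print(by_decade)
```

Passing a dictionary to `agg` lets each column use its own aggregation, so the Population fix does not need a separate step.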

Step 9. What is the most dangerous decade to live in the US?

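Code:

crime.idxmax(0) # for each column, returns the index label at which the maximum occurs
# This runs on the yearly frame, so it reports peak years rather than decades.
# Most peaks fall in the early 1990s, so the 1990s look the most dangerous (the author jokes that 2020-2021 was actually worse).

Output: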

Population           2014-01-01
Violent              1992-01-01
Property             1991-01-01
Murder               1991-01-01
Forcible_Rape        1992-01-01
Robbery              1991-01-01
Aggravated_assault   1993-01-01
Burglary             1980-01-01
Larceny_Theft        1991-01-01
Vehicle_Theft        1991-01-01
dtype: datetime64[ns]
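
To answer the question at the decade level, `idxmax` can also be applied to decade totals. The sketch below uses a hypothetical three-decade frame (`totals`, with made-up numbers) and then counts which decade is the peak for the most columns.

```python
import pandas as pd

# Hypothetical decade totals, indexed by the first year of each decade.
totals = pd.DataFrame(
    {
        "Violent": [40, 175, 60],
        "Property": [117, 119, 44],
        "Burglary": [330, 267, 101],
    },
    index=[1980, 1990, 2010],
)

peak = totals.idxmax(axis=0)          # index label of the maximum, per column
print(peak)
print(peak.value_counts().idxmax())   # the decade that is the peak most often
```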

Conclusion

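Today's practice focused on apply() and related functions. Make sure you understand what each parameter means; only then can you use these functions flexibly to get the numbers you need. That's all for today. We'll keep studying tomorrow. Keep it up, everyone!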
