Python Data Analysis: pandas Beginner Exercises

Posted by Geek_bao

Python Data Analysis Basics

Exercise 1

Step 1. Go to https://www.kaggle.com/openfoodfacts/world-food-facts/data

Step 2. Download the dataset to your computer and unzip it.

Step 3. Use the tsv file and assign it to a dataframe called food

Code:

# Download the dataset from the URL above and unzip it, then read the TSV file into food
import pandas as pd
import numpy as np
food = pd.read_csv('en.openfoodfacts.org.products.tsv', sep='\t')

Output:

D:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3020: DtypeWarning: Columns (0,3,5,19,20,24,25,26,27,28,36,37,38,39,48) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
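
As the warning itself suggests, the mixed-type columns can be dealt with at read time. A minimal optional sketch (not needed for the exercises that follow):

# low_memory=False makes pandas infer each column's dtype from the whole file
# instead of chunk by chunk, which avoids the DtypeWarning
food = pd.read_csv('en.openfoodfacts.org.products.tsv', sep='\t', low_memory=False)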

Step 4. See the first 5 entries

Code:

# Show the first five rows
food.head()

Output:

    code                                                url                     creator      created_datetime                    product_name quantity  ...  nutrition-score-fr_100g  nutrition-score-uk_100g  glycemic-index_100g  water-hardness_100g
0   3087  http://world-en.openfoodfacts.org/product/0000...  openfoodfacts-contributors  2016-09-17T09:17:46Z              Farine de blé noir      1kg  ...                      NaN                      NaN                  NaN                  NaN
1   4530  http://world-en.openfoodfacts.org/product/0000...             usda-ndb-import  2017-03-09T14:32:37Z  Banana Chips Sweetened (Whole)      NaN  ...                     14.0                     14.0                  NaN                  NaN
2   4559  http://world-en.openfoodfacts.org/product/0000...             usda-ndb-import  2017-03-09T14:32:37Z                         Peanuts      NaN  ...                      0.0                      0.0                  NaN                  NaN
3  16087  http://world-en.openfoodfacts.org/product/0000...             usda-ndb-import  2017-03-09T10:35:31Z          Organic Salted Nut Mix      NaN  ...                     12.0                     12.0                  NaN                  NaN
4  16094  http://world-en.openfoodfacts.org/product/0000...             usda-ndb-import  2017-03-09T10:34:13Z                 Organic Polenta      NaN  ...                      NaN                      NaN                  NaN                  NaN

5 rows × 163 columns

Step 5. What is the number of observations in the dataset?

Code:

# How many rows (observations) are in the dataset?
food.shape[0]

Output:

356027

Step 6. What is the number of columns in the dataset?

Code:

# How many columns does the dataset have?
food.shape[1]

Output:

163

food.info() reports the same column count together with the dtypes and memory usage:

food.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356027 entries, 0 to 356026
Columns: 163 entries, code to water-hardness_100g
dtypes: float64(107), object(56)
memory usage: 442.8+ MB
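
Steps 5 and 6 can also be answered together: food.shape returns a (rows, columns) tuple, so a single call gives both counts. A quick sketch:

# shape returns (number of rows, number of columns)
n_rows, n_cols = food.shape  # (356027, 163) for this dataset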

Step 7. Print the name of all the columns.

Code:

# All column names
food.columns

Output:

Index(['code', 'url', 'creator', 'created_t', 'created_datetime',
       'last_modified_t', 'last_modified_datetime', 'product_name',
       'generic_name', 'quantity',
       ...
       'fruits-vegetables-nuts_100g', 'fruits-vegetables-nuts-estimate_100g',
       'collagen-meat-protein-ratio_100g', 'cocoa_100g', 'chlorophyl_100g',
       'carbon-footprint_100g', 'nutrition-score-fr_100g',
       'nutrition-score-uk_100g', 'glycemic-index_100g',
       'water-hardness_100g'],
      dtype='object', length=163)

Step 8. What is the name of the 105th column?

Code:

# Name of the 105th column (position 104, since indexing starts at 0)
food.columns[104]

Output:

'-glucose_100g'

Step 9. What is the type of the observations of the 105th column?

Code:

# Data type of the 105th column
food.dtypes[food.columns[104]]

Output:

dtype('float64')
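
The same dtype can be read off by position alone; a small sketch of the equivalent accessors:

# Position-based alternatives for the 105th column's dtype
food.dtypes.iloc[104]    # index the dtypes Series by position
food.iloc[:, 104].dtype  # select the column by position, then ask for its dtype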

Step 10. How is the dataset indexed?

Code:

# How is the dataset indexed?
food.index

Output:

RangeIndex(start=0, stop=356027, step=1)

Step 11. What is the product name of the 19th observation?

Code:

# What is the product_name of the 19th row?
food.values[18][7]

Output:

'Lotus Organic Brown Jasmine Rice'

The following two lines produce the same result, using different accessors.

food.iloc[18]['product_name']
'Lotus Organic Brown Jasmine Rice'
food.loc[18]['product_name']
'Lotus Organic Brown Jasmine Rice'
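
For reference, .iloc is position-based while .loc is label-based; they agree here only because food uses the default RangeIndex, so label 18 is also position 18. A sketch that does the lookup in a single step instead of chaining:

# Single-step lookups of the 19th product_name
food.loc[18, 'product_name']                          # by row label and column name
food.iloc[18, food.columns.get_loc('product_name')]   # purely by position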

Exercise 2 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.

Step 1. Import the necessary libraries

Code:

# Import the libraries we need
import pandas as pd
import numpy as np

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called chipo.

Code:

# Import the dataset and assign it to the variable chipo
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
chipo = pd.read_csv(url, sep='\t')

Step 4. See the first 10 entries

Code:

# Show the first ten rows
chipo.head(10)

Output:

   order_id  quantity                              item_name                                 choice_description item_price
0         1         1           Chips and Fresh Tomato Salsa                                                NaN      $2.39
1         1         1                                   Izze                                       [Clementine]      $3.39
2         1         1                       Nantucket Nectar                                            [Apple]      $3.39
3         1         1  Chips and Tomatillo-Green Chili Salsa                                                NaN      $2.39
4         2         2                           Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...     $16.98
5         3         1                           Chicken Bowl  [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...     $10.98
6         3         1                          Side of Chips                                                NaN      $1.69
7         4         1                          Steak Burrito  [Tomatillo Red Chili Salsa, [Fajita Vegetables...     $11.75
8         4         1                       Steak Soft Tacos  [Tomatillo Green Chili Salsa, [Pinto Beans, Ch...      $9.25
9         5         1                          Steak Burrito  [Fresh Tomato Salsa, [Rice, Black Beans, Pinto...      $9.25

Step 5. What is the number of observations in the dataset?

Code:

# Solution 1
# Number of observations in the dataset
chipo.shape[0]  # 4622 entries <=> 4622 observations

Output:

4622

The code below reports the same information.

# Solution 2
chipo.info()  # 4622 entries <=> 4622 observations

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
order_id              4622 non-null int64
quantity              4622 non-null int64
item_name             4622 non-null object
choice_description    3376 non-null object
item_price            4622 non-null object
dtypes: int64(2), object(3)
memory usage: 180.6+ KB

Step 6. What is the number of columns in the dataset?

Code:

# Number of columns in the dataset
chipo.shape[1]

Output:

5

Step 7. Print the name of all the columns.

Code:

# Print all column names
chipo.columns

Output:

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')

Step 8. How is the dataset indexed?

Code:

# How is the dataset indexed?
chipo.index

Output:

RangeIndex(start=0, stop=4622, step=1)

Step 9. Which was the most-ordered item?

Code:

# Find the most-ordered item
c = chipo.groupby('item_name')
c = c.sum()
c = c.sort_values(['quantity'], ascending=False)
c.head(1)

Output:

              order_id  quantity
item_name
Chicken Bowl    713926       761

Step 10. For the most-ordered item, how many items were ordered?

Code:

# How many units of the most-ordered item were ordered? (same computation as above)
c = chipo.groupby('item_name')
c = c.sum()
c = c.sort_values(['quantity'], ascending=False)
c.head(1)

Output:

              order_id  quantity
item_name
Chicken Bowl    713926       761
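
The same answer can be reached more directly by summing only the quantity column per item and picking out the largest entry; a sketch:

# Most-ordered item and the number of units ordered
per_item = chipo.groupby('item_name')['quantity'].sum()
per_item.idxmax()  # 'Chicken Bowl'
per_item.max()     # 761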

Step 11. What was the most ordered item in the choice_description column?

Code:

# What was the most-ordered choice in the choice_description column?
c = chipo.groupby('choice_description').sum()
c = c.sort_values(['quantity'], ascending=False)
c.head(1)

Output:

                    order_id  quantity
choice_description
[Diet Coke]           123455       159

Step 12. How many items were ordered in total?

Code:

# How many items were ordered in total?
total_items_orders = chipo.quantity.sum()
total_items_orders

Output:

4972

Step 13. Turn the item price into a float

Step 13.a. Check the item price type

Code:

# Check the current dtype of item_price before converting it to float
chipo.item_price.dtype

Output:

dtype('O')

Step 13.b. Create a lambda function and change the type of item price

Code:

# Drop the first character ('$') and the last character of each price string, then cast to float
dollarizer = lambda x: float(x[1:-1])
chipo.item_price = chipo.item_price.apply(dollarizer)
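
A vectorized alternative (assuming the raw item_price column still holds strings such as '$2.39'; run it instead of, not after, the lambda version):

# Strip the dollar sign with the string accessor, then cast the column to float
chipo.item_price = chipo.item_price.str.replace('$', '', regex=False).astype(float)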

Step 13.c. Check the item price type

Code:

chipo.item_price.dtype

Output:

dtype('float64')

Step 14. How much was the revenue for the period in the dataset?

Code:

# How much was the revenue for the period in the dataset?
revenue = (chipo['quantity'] * chipo['item_price']).sum()
print('Revenue was:  $' + str(np.round(revenue, 2)))
# np.round() rounds the float; the 2 keeps two decimal places

Output:

Revenue was:  $39237.02

Step 15. How many orders were made in the period?

Code:

# How many orders were made in the period?
orders = chipo.order_id.value_counts().count()
# value_counts() tallies how often each distinct order_id appears; count() then counts the distinct values
orders

Output:

1834
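
nunique() condenses the same idea into a single call; a sketch:

# Number of distinct order ids, i.e. the number of orders
chipo.order_id.nunique()  # 1834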

Step 16. What is the average revenue amount per order?

Code:

# Solution 1
# What is the average revenue per order?
chipo['revenue'] = (chipo['quantity'] * chipo['item_price'])
order_grouped = chipo.groupby(by=['order_id']).sum()
order_grouped.mean()['revenue']

Output:

21.394231188658654

The code below produces the same result.

# Solution 2
chipo.groupby(by=['order_id']).sum().mean()['revenue']
21.394231188658654
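
Selecting the revenue column before aggregating is a slightly leaner equivalent (it relies on the 'revenue' column created in Solution 1); a sketch:

# Sum revenue per order, then average across orders
chipo.groupby('order_id')['revenue'].sum().mean()  # about 21.39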

Step 17. How many different items are sold?

Code:

# How many different items are sold?
sold_items = chipo.item_name.value_counts().count()
sold_items

Output:

50

Exercise 3 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.
(The remaining steps are similar to the earlier ones, so they are worked through with minimal commentary.)

Step 1. Import the necessary libraries

Code:

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called users and use the ‘user_id’ as index

Code:

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user'
users = pd.read_table(url, sep='|', index_col='user_id')
# sep/delimiter: the character sequence used to split each row into fields
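
For reference, pd.read_table is essentially pd.read_csv with a tab as the default separator, so an equivalent sketch is:

# Equivalent read with read_csv and an explicit '|' separator
users = pd.read_csv(url, sep='|', index_col='user_id')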

Step 4. See the first 25 entries

Code:

users.head(25)

Output:

         age gender     occupation zip_code
user_id
1         24      M     technician    85711
2         53      F          other    94043
3         23      M         writer    32067
4         24      M     technician    43537
5         33      F          other    15213
6         42      M      executive    98101
7         57      M  administrator    91344
8         36      M  administrator    05201
9         29      M        student    01002
10        53      M         lawyer    90703
11        39      F          other    30329
12        28      F          other    06405
13        47      M       educator    29206
14        45      M      scientist    55106
15        49      F       educator    97301
16        21      M  entertainment    10309
17        30      M     programmer    06355
18        35      F          other    37212
19        40      M      librarian    02138
20        42      F      homemaker    95660
21        26      M         writer    30068
22        25      M         writer    40206
23        30      F         artist    48197
24        21      F         artist    94533
25        39      M       engineer    55107

Step 5. See the last 10 entries

Code:

users.tail(10)

Output:

         age gender     occupation zip_code
user_id
934       61      M       engineer    22902
935       42      M         doctor    66221
936       24      M          other    32789
937       48      M       educator    98072
938       38      F     technician    55038
939       26      F        student    33319
940       32      M  administrator    02215
941       20      M        student    97229
942       48      F      librarian    78209
943       22      M        student    77841

Step 6. What is the number of observations in the dataset?

Code:

users.shape[0]

Output:

943

Step 7. What is the number of columns in the dataset?

Code:

users.shape[1]

Output:

4

Step 8. Print the name of all the columns.

Code:

users.columns

Output:

Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')

Step 9. How is the dataset indexed?

Code:

users.index

Output:

Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
           dtype='int64', name='user_id', length=943)

Step 10. What is the data type of each column?

Code:

users.dtypes

Output:

age            int64
gender        object
occupation    object
zip_code      object
dtype: object

Step 11. Print only the occupation column

Code:

users['occupation']  # or equivalently: users.occupation

Output:

user_id
1         technician
2              other
3             writer
4         technician
5              other
6          executive
7      administrator
8      administrator
9            student
10            lawyer
11             other
12             other
13          educator
14         scientist
15          educator
16     entertainment
17        programmer
18             other
19         librarian
20         homemaker
21            writer
22            writer
23            artist
24            artist
25          engineer
26          engineer
27         librarian
28            writer
29        programmer
30           student
           ...      
914            other
915    entertainment
916         engineer
917          student
918        scientist
919            other
920           artist
921          student
922    administrator
923          student
924            other
925         salesman
926    entertainment
927       programmer
928          student
929        scientist
930        scientist
931         educator
932         educator
933          student
934         engineer
935           doctor
936            other
937         educator
938       technician
939          student
940    administrator
941          student
942        librarian
943          student
Name: occupation, Length: 943, dtype: object

Step 12. How many different occupations are there in this dataset?

Code:

users.occupation.nunique()

Output:

21

Step 13. What is the most frequent occupation?

Code:

users.occupation.value_counts().head()

Output:

student          196
other            105
educator          95
administrator     79
engineer          67
Name: occupation, dtype: int64
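
If only the single most frequent occupation is needed, value_counts() can be reduced further; a sketch:

# The occupation with the highest count
users.occupation.value_counts().idxmax()  # 'student'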

Step 14. Summarize the DataFrame.

Code:

users.describe()

Output:

              age
count  943.000000
mean    34.051962
std     12.192740
min      7.000000
25%     25.000000
50%     31.000000
75%     43.000000
max     73.000000

Step 15. Summarize all the columns

Code:

users.describe(include='all')  # Note: by default, describe() returns only the numeric columns; include='all' adds the rest

Output:

               age gender occupation zip_code
count   943.000000    943        943      943
unique         NaN      2         21      795
top            NaN      M    student    55414
freq           NaN    670        196        9
mean     34.051962    NaN        NaN      NaN
std      12.192740    NaN        NaN      NaN
min       7.000000    NaN        NaN      NaN
25%      25.000000    NaN        NaN      NaN
50%      31.000000    NaN        NaN      NaN
75%      43.000000    NaN        NaN      NaN
max      73.000000    NaN        NaN      NaN

Step 16. Summarize only the occupation column

Code:

users.occupation.describe()

Output:

count         943
unique         21
top       student
freq          196
Name: occupation, dtype: object

Step 17. What is the mean age of users?

Code:

round(users.age.mean())

Output:

34

Step 18. What is the age with the least occurrence?

Code:

users.age.value_counts().tail()  # 7, 10, 11, 66 and 73 years -> only 1 occurrence each

Output:

11    1
10    1
73    1
66    1
7     1
Name: age, dtype: int64
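
To list every age that ties for the lowest count instead of eyeballing the tail, one possible sketch:

# All ages whose count equals the minimum count
age_counts = users.age.value_counts()
age_counts[age_counts == age_counts.min()]  # ages 7, 10, 11, 66 and 73, each once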

Conclusion

This article was written in Jupyter Notebook from the Anaconda distribution, and later articles will continue to use it. It is genuinely convenient and recommended. Let's keep learning together!
