Python | Kaggle Machine Learning Series: Pandas Basics Exercises
Posted by 海轰Pro
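Preface
Hello, everyone!
Thank you very much for reading 海轰's article. If you spot any mistakes, please point them out.
About me ଘ(੭ˊᵕˋ)੭
Nickname: 海轰
Tags: programmer | C++ user | student
Bio: I got into programming through C and later switched to a computer science major. I have been fortunate to win some national and provincial awards and have been admitted to graduate school by recommendation. I am currently learning C++/Linux/Python.
Study tips: solid fundamentals + lots of notes + lots of coding + lots of thinking + good English!
Python beginner, still at the novice stage
This article is only my own study notes, used to build up a knowledge framework and for review
It is not about the number of problems: learn one, understand one
Know what works, and know why it works!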
Introduction
Now you are ready to get a deeper understanding of your data.
Run the following cell to load your data and some utility functions (including code to check your answers).
Run the code below
to import the required data and packages
import pandas as pd
pd.set_option("display.max_rows", 5)
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
from learntools.core import binder; binder.bind(globals())
from learntools.pandas.summary_functions_and_maps import *
print("Setup complete.")
reviews.head()
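The index_col=0 argument tells read_csv to use the first column of the file as the row index instead of treating it as data. Here is a minimal sketch of that behaviour on a tiny in-memory CSV. The three rows are made up for illustration and only stand in for the real Kaggle file:

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real wine file.
# The first (unnamed) column plays the role of the saved row index.
csv_text = """,country,points,price
0,Italy,87,
1,Portugal,87,15.0
2,US,87,14.0
"""

toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(toy.columns.tolist())  # the first column is no longer a data column
print(toy.index.tolist())
```

Without index_col=0, the same file would load with an extra column named "Unnamed: 0".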
The data for this exercise is the wine reviews table in winemag-data-130k-v2.csv, stored in reviews.
Exercises
1.
Problem
What is the median of the points column in the reviews DataFrame?
Solution
What the problem asks:
Find the median of the points column
median_points = reviews.points.median()
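To see what median() does, here is a small sketch on made-up numbers (the toy values are assumptions, not taken from the dataset). With an odd number of values it returns the middle one; with an even number it returns the mean of the two middle ones:

```python
import pandas as pd

# Made-up scores, already sorted to make the middle value easy to spot.
toy = pd.DataFrame({"points": [80, 85, 88, 90, 100]})

median_points = toy.points.median()
print(median_points)  # 88.0, the middle of five values

# Even count: the median is the mean of the two middle values, 85 and 88.
even_median = pd.Series([80, 85, 88, 90]).median()
print(even_median)  # 86.5
```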
2.
Problem
What countries are represented in the dataset? (Your answer should not include any duplicates.)
Solution
What the problem asks:
Find every country that appears in the country column and return them with no duplicates
countries = reviews.country.unique()
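A small sketch of unique() on made-up data (the toy rows are assumptions). It keeps values in order of first appearance, and a missing country still counts as one distinct entry unless it is dropped first:

```python
import pandas as pd

# Made-up column with repeats and one missing value.
toy = pd.DataFrame({"country": ["Italy", "US", "Italy", None, "France", "US"]})

# All distinct values, including the missing one.
countries = toy.country.unique()
print(len(countries))

# Drop missing values first if only real country names are wanted.
no_missing = toy.country.dropna().unique()
print(list(no_missing))
```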
3.
Problem
How often does each country appear in the dataset? Create a Series reviews_per_country mapping countries to the count of reviews of wines from that country.
Solution
What the problem asks:
Count how many times each country appears
reviews_per_country = reviews.country.value_counts()
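value_counts() returns a Series whose index holds the distinct values and whose values are the counts, sorted from most to least frequent. A minimal sketch on made-up rows (the toy data is an assumption):

```python
import pandas as pd

# Made-up column: US appears 3 times, Italy 2, France 1.
toy = pd.DataFrame({"country": ["US", "Italy", "US", "France", "US", "Italy"]})

reviews_per_country = toy.country.value_counts()
print(reviews_per_country)

# Individual counts can be looked up by label.
print(reviews_per_country["US"])  # 3
```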
4.
Problem
Create variable centered_price containing a version of the price column with the mean price subtracted.
(Note: this ‘centering’ transformation is a common preprocessing step before applying various machine learning algorithms.)
Solution
What the problem asks:
Subtract the mean of the price column from every price in that column
centered_price = reviews.price - reviews.price.mean()
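The subtraction is broadcast: the single mean value is subtracted from every row. A small sketch on made-up prices (the toy values are assumptions) also shows that mean() skips missing prices, while the missing row stays missing in the result:

```python
import pandas as pd

# Made-up prices with one missing value.
toy = pd.DataFrame({"price": [10.0, 20.0, None, 30.0]})

mean_price = toy.price.mean()          # NaN is skipped: (10 + 20 + 30) / 3
centered_price = toy.price - mean_price

print(mean_price)      # 20.0
print(centered_price)  # -10.0, 0.0, NaN, 10.0
```

After centering, the mean of the column is zero (up to floating-point error).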
5.
Problem
I’m an economical wine buyer. Which wine is the “best bargain”? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.
Solution
What the problem asks:
Find the title of the wine with the best value for money
Value for money: points / price
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
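The key detail is that idxmax() returns the index label of the largest value, not its position, which is why the lookup uses loc. A minimal sketch on made-up wines (titles, scores, prices, and the index labels 10 to 12 are all assumptions):

```python
import pandas as pd

# Made-up wines with a non-default index, to show that idxmax returns a label.
toy = pd.DataFrame(
    {
        "title": ["Wine A", "Wine B", "Wine C"],
        "points": [90, 86, 95],
        "price": [30.0, 4.0, 100.0],
    },
    index=[10, 11, 12],
)

ratio = toy.points / toy.price          # 3.0, 21.5, 0.95
bargain_idx = ratio.idxmax()            # label of the row with the largest ratio
bargain_wine = toy.loc[bargain_idx, "title"]

print(bargain_idx)   # 11
print(bargain_wine)  # Wine B
```

If several rows share the top ratio, idxmax() returns the label of the first one.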
6.
Problem
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be “tropical” or “fruity”? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset.
Solution
What the problem asks:
Count how many descriptions in the description column contain "tropical" and how many contain "fruity"
Return the result as a Series
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
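The map call produces one True/False per description, and sum() counts the True values. So this counts descriptions that contain the word, not total occurrences, and the test is case-sensitive. A small sketch on made-up descriptions (the toy text is an assumption):

```python
import pandas as pd

# Made-up descriptions. The last one has a capital "T", so the
# lowercase substring test does not match it.
toy = pd.DataFrame({"description": [
    "A tropical nose with fruity notes",
    "Dry and earthy",
    "Very fruity, almost jammy",
    "Tropical fruit dominates",
]})

n_trop = toy.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = toy.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=["tropical", "fruity"])
print(descriptor_counts)

# The same count using the vectorized string accessor.
n_trop_alt = toy.description.str.contains("tropical", regex=False).sum()
```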
7.
Problem
We’d like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we’d like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.
Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.
Create a series star_ratings with the number of stars corresponding to each review in the dataset.
Solution
What the problem asks:
points >= 95: 3 stars
points >= 85 and < 95: 2 stars
points < 85: 1 star
Special case: any wine whose country is Canada gets 3 stars
def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1
star_ratings = reviews.apply(stars, axis='columns')
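With axis='columns', apply passes each row to the function as a Series, so row.country and row.points refer to that row's values. Here is a sketch on made-up rows (the toy data is an assumption), together with one possible vectorized alternative using boolean masks. The alternative is not part of the original solution and is shown only for comparison:

```python
import pandas as pd

# Made-up reviews covering every branch of the rule.
toy = pd.DataFrame({
    "country": ["Canada", "US", "Italy", "France", "Canada"],
    "points": [82, 96, 85, 84, 95],
})

def stars(row):
    # Canada always gets 3 stars, regardless of points.
    if row.country == "Canada":
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

# axis="columns" means the function is called once per row.
star_ratings = toy.apply(stars, axis="columns")
print(star_ratings.tolist())

# Alternative sketch: start everyone at 1 star, then overwrite with masks.
alt = pd.Series(1, index=toy.index)
alt[toy.points >= 85] = 2
alt[(toy.points >= 95) | (toy.country == "Canada")] = 3
```

On a large table the mask-based version avoids calling a Python function for every row, at the cost of being a little less readable.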
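Closing remarks
This article is only a set of study notes, recording a journey from 0 to 1
I hope it helps; if there are any mistakes, corrections are welcome
I am 海轰 ଘ(੭ˊᵕˋ)੭
If you think it is well written, please give it a like
Thanks for your support ❤️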