Python | Kaggle Machine Learning Series: Pandas Basics Exercises

Posted 2021-09-06 海轰Pro


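Preface

Hello, everyone!
Thank you very much for reading 海轰's article. If you find any mistakes, please point them out.

About me ଘ(੭ˊᵕˋ)੭
Nickname: 海轰
Tags: programmer | C++ user | student
Bio: I got into programming through C and later switched to a computer science major. I was lucky enough to win some national and provincial awards, and I have been recommended for graduate admission. I am currently learning C++/Linux/Python.
Study tips: solid fundamentals + plenty of notes + plenty of coding + plenty of thinking + good English!

I am new to Python and still at the beginner stage.
This article is only my own study notes, written to build up my knowledge and to review later.
It is not about doing many exercises: understand each one you do.
Know what works, and know why it works!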

Introduction

Now you are ready to get a deeper understanding of your data.

Run the following cell to load your data and some utility functions (including code to check your answers).

Run the code below
to import the required data and packages.

import pandas as pd
pd.set_option("display.max_rows", 5)
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.summary_functions_and_maps import *
print("Setup complete.")

reviews.head()

The reviews.head() call above previews the data used in this exercise.

Exercises

1.

Question

What is the median of the points column in the reviews DataFrame?

Solution

What the question asks:

Find the median of the points column.

median_points = reviews.points.median()
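
As a quick sanity check of what median() returns, here is a minimal sketch on a made-up Series standing in for reviews.points (the values are assumptions, not taken from the wine dataset). With an even number of values, the median is the mean of the two middle values after sorting.

```python
import pandas as pd

# Hypothetical stand-in for reviews.points; the values are made up.
points = pd.Series([87, 90, 85, 92, 88, 91])

# Sorted: 85, 87, 88, 90, 91, 92 -> the two middle values are 88 and 90.
median_points = points.median()
print(median_points)  # 89.0
```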


2.

Question

What countries are represented in the dataset? (Your answer should not include any duplicates.)

Solution

What the question asks:

Find every country that appears in the country column of the dataset. The returned values must not contain duplicates.

countries = reviews.country.unique()
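
One detail worth knowing: unique() keeps a missing value as one of the distinct entries. The sketch below uses a made-up Series in place of reviews.country (an assumption for illustration) and shows how dropna() removes the missing entry if it is not wanted.

```python
import pandas as pd

# Hypothetical stand-in for reviews.country, including one missing value.
country = pd.Series(["Italy", "US", "Italy", None, "France", "US"])

# unique() returns distinct values in order of first appearance,
# and the missing value counts as one distinct entry.
countries = country.unique()
print(len(countries))  # 4

# Dropping missing values first leaves only real country names.
named_countries = country.dropna().unique()
print(list(named_countries))  # ['Italy', 'US', 'France']
```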


3.

Question

How often does each country appear in the dataset? Create a Series reviews_per_country mapping countries to the count of reviews of wines from that country.

Solution

What the question asks:

Count how many times each country appears.

reviews_per_country = reviews.country.value_counts()
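
To see the shape of the result, here is a small sketch on made-up data (the Series is an assumption standing in for reviews.country). value_counts() returns a Series indexed by the distinct values, sorted by count in descending order, and it leaves out missing values unless dropna=False is passed.

```python
import pandas as pd

# Hypothetical stand-in for reviews.country, including one missing value.
country = pd.Series(["Italy", "US", "Italy", None, "France", "US", "US"])

# Counts per country, most frequent first; the missing value is skipped.
reviews_per_country = country.value_counts()
print(reviews_per_country)

# Passing dropna=False also counts the missing value.
with_missing = country.value_counts(dropna=False)
print(with_missing.sum())  # 7
```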


4.

Question

Create variable centered_price containing a version of the price column with the mean price subtracted.

(Note: this ‘centering’ transformation is a common preprocessing step before applying various machine learning algorithms.)

Solution

What the question asks:

For every value in the price column, compute its difference from the mean of the price column.

centered_price = reviews.price - reviews.price.mean()
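
The subtraction works because pandas broadcasts the scalar mean across the whole column. A minimal sketch on made-up prices (an assumption, not the real data) shows two properties of the result: the centered column averages to zero, and a missing price stays missing because mean() skips it.

```python
import pandas as pd

# Hypothetical stand-in for reviews.price, with one missing price.
price = pd.Series([10.0, 20.0, None, 30.0])

# mean() ignores the missing value, so the mean here is 20.0.
centered_price = price - price.mean()
print(centered_price.tolist())  # [-10.0, 0.0, nan, 10.0]

# After centering, the remaining values average to zero.
print(centered_price.mean())  # 0.0
```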


5.

Question

I’m an economical wine buyer. Which wine is the “best bargain”? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.

Solution

What the question asks:

Find the title of the wine with the best value for money.
Value for money: points / price

bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
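
The two-step pattern relies on idxmax() returning an index label, which loc then uses to look up the title. The sketch below runs it on a made-up DataFrame (all names and values are assumptions) with a non-default index, a missing price, and a tie. It suggests that idxmax() skips the missing ratio and returns only the first of the tied rows, so a boolean mask is one way to list every tied wine.

```python
import pandas as pd

# Hypothetical data; index labels are deliberately not 0..3.
toy = pd.DataFrame(
    {
        "title": ["Wine A", "Wine B", "Wine C", "Wine D"],
        "points": [88, 86, 90, 86],
        "price": [22.0, 4.0, None, 4.0],
    },
    index=[10, 11, 12, 13],
)

# Ratios: 4.0, 21.5, NaN (missing price), 21.5
ratio = toy.points / toy.price

# idxmax() skips NaN and returns the label of the first maximum.
bargain_idx = ratio.idxmax()
bargain_wine = toy.loc[bargain_idx, "title"]
print(bargain_idx, bargain_wine)  # 11 Wine B

# A boolean mask recovers every wine that shares the best ratio.
tied_bargains = toy.loc[ratio == ratio.max(), "title"].tolist()
print(tied_bargains)  # ['Wine B', 'Wine D']
```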


6.

Question

There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be “tropical” or “fruity”? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset.

Solution

What the question asks:

Count how many times tropical and fruity each appear in the description column,
and return the result as a Series.

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
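
Note that the `in` test counts descriptions that contain the word at least once, and it is case-sensitive. The sketch below uses made-up descriptions (an assumption for illustration) to compare that approach with three alternatives from the pandas string accessor: str.contains for the same row count, a lowercased match, and str.count for total occurrences.

```python
import pandas as pd

# Hypothetical descriptions standing in for reviews.description.
description = pd.Series([
    "Tropical fruit aromas with a fruity finish",
    "ripe and fruity, then more fruity notes",
    "tropical fruit and citrus",
    "dry and earthy",
])

# Same approach as the solution: rows containing the exact lowercase word.
n_trop = description.map(lambda desc: "tropical" in desc).sum()
n_fruity = description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=["tropical", "fruity"])
print(descriptor_counts.tolist())  # [1, 2]

# Equivalent row count using the string accessor (plain substring match).
rows_fruity = description.str.contains("fruity", regex=False).sum()

# Lowercasing first also catches the capitalized "Tropical".
rows_trop_any_case = description.str.lower().str.contains("tropical", regex=False).sum()

# str.count adds up every occurrence, so a row can contribute more than once.
total_fruity = description.str.count("fruity").sum()
print(rows_fruity, rows_trop_any_case, total_fruity)  # 2 2 3
```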


7.

Question

We’d like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we’d like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a series star_ratings with the number of stars corresponding to each review in the dataset.

Solution

What the question asks:

points >= 95: 3 stars
points at least 85 but below 95: 2 stars
points below 85: 1 star
Special case: every wine whose country is Canada gets 3 stars

def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings = reviews.apply(stars, axis='columns')
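
Calling apply with axis='columns' runs the Python function once per row, which can be slow on a large DataFrame. As an alternative sketch (not the method used in the exercise), the same rules can be expressed with numpy.select, which takes the first matching condition for each row. The DataFrame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the reviews DataFrame.
toy = pd.DataFrame({
    "country": ["Canada", "US", "Italy", "France"],
    "points": [82, 96, 90, 80],
})

def stars(row):
    # Same rules as the solution: Canada always gets 3 stars.
    if row.country == "Canada":
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

row_wise = toy.apply(stars, axis="columns")

# Conditions are checked in order; rows matching none get the default.
conditions = [
    ((toy.country == "Canada") | (toy.points >= 95)).to_numpy(),
    (toy.points >= 85).to_numpy(),
]
vectorized = pd.Series(np.select(conditions, [3, 2], default=1), index=toy.index)

print(row_wise.tolist())    # [3, 3, 2, 1]
print(vectorized.tolist())  # [3, 3, 2, 1]
```

Both versions give the same ratings on this toy data. The row-wise function is easier to read, while the vectorized form avoids the per-row Python call.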


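Closing

This article is only a set of study notes, recording the journey from 0 to 1.

I hope it helps. If you spot any mistakes, please let me know.

I am 海轰 ଘ(੭ˊᵕˋ)੭

If you think it is well written, please give it a like.

Thanks for your support ❤️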
