Python | Kaggle Machine Learning Series: Pandas Basics Exercises

Posted by 海轰Pro


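Preface

Hello, friends!
Thank you very much for reading 海轰's article. If you spot any mistakes, please point them out.

About me ଘ(੭ˊᵕˋ)੭
Nickname: 海轰
Tags: programmer | C++ user | student
Bio: I got into programming through C and later switched to a computer science major. I have been fortunate to win some national and provincial awards, and I have secured postgraduate admission by recommendation. I am currently learning C++/Linux/Python.
Study tips: solid fundamentals + plenty of notes + lots of coding + lots of thinking + good English!

I am still a Python beginner.
This article is only my own study notes, used to build up a knowledge framework and for review.
It is not about doing many problems: study one, understand one.
Know what works, and know why it works!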

Introduction

Now you are ready to get a deeper understanding of your data.

Run the following cell to load your data and some utility functions (including code to check your answers).

In short: run the code below to load the data and import the required packages.

import pandas as pd
pd.set_option("display.max_rows", 5)
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.summary_functions_and_maps import *
print("Setup complete.")

reviews.head()

reviews.head() previews the data used in this exercise.

Exercises

1.

Question

What is the median of the points column in the reviews DataFrame?

Solution

What the question asks:

Find the median of the points column.

median_points = reviews.points.median()
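As a small self-contained sketch of what median() does, the toy frame below uses made-up values rather than the wine dataset. It assumes only pandas and NumPy, and shows that missing values are skipped by default and that an even number of values gives the average of the two middle ones.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the wine reviews.
toy = pd.DataFrame({"points": [80, 85, 90, 100, np.nan]})

# NaN is ignored (skipna=True by default), leaving 80, 85, 90, 100.
# With four values, the median is the mean of the two middle ones.
toy_median = toy.points.median()
print(toy_median)  # 87.5
```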


2.

Question

What countries are represented in the dataset? (Your answer should not include any duplicates.)

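Solution

What the question asks:

Which countries appear in the dataset, with no duplicates in the answer?
In other words, we need to find every distinct value that occurs in the country column.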

countries = reviews.country.unique()
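One detail worth checking on a toy example is how unique() treats missing values. The sketch below uses invented data (not the wine dataset) and suggests that a missing country shows up as its own entry unless it is dropped first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the names are only for illustration.
toy = pd.DataFrame({"country": ["Italy", "US", "Italy", np.nan, "US", "France"]})

# unique() returns the distinct values in order of first appearance.
# A missing value is kept as one of the entries.
all_values = toy.country.unique()

# Dropping missing values first leaves only real country names.
named = toy.country.dropna().unique()
print(list(named))  # ['Italy', 'US', 'France']
```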


3.

Question

How often does each country appear in the dataset? Create a Series reviews_per_country mapping countries to the count of reviews of wines from that country.

Solution

What the question asks:

Count how many times each country appears.

reviews_per_country = reviews.country.value_counts()
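The following sketch, again on invented data, illustrates two default behaviours of value_counts() that are easy to overlook: the result is sorted from most to least frequent, and missing values are left out unless dropna=False is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the country column.
toy = pd.DataFrame({"country": ["US", "Italy", "US", np.nan, "US", "Italy"]})

# The result is a Series indexed by country, sorted in descending order.
counts = toy.country.value_counts()

# Passing dropna=False adds one more entry for the missing value.
counts_with_nan = toy.country.value_counts(dropna=False)

print(counts["US"], counts["Italy"])  # 3 2
```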


4.

Question

Create variable centered_price containing a version of the price column with the mean price subtracted.

(Note: this ‘centering’ transformation is a common preprocessing step before applying various machine learning algorithms.)

Solution

What the question asks:

Subtract the mean of the price column from every price in that column.

centered_price = reviews.price - reviews.price.mean()
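To make the centering step concrete, here is a minimal sketch on made-up prices. The scalar mean is broadcast across the whole column, missing prices stay missing, and the centered values average to zero.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices, including one missing value.
toy = pd.DataFrame({"price": [10.0, 20.0, np.nan, 30.0]})

# mean() skips NaN, so the mean here is 20.0.
# Subtracting a scalar from a Series applies it to every element.
centered = toy.price - toy.price.mean()
print(centered.tolist())  # [-10.0, 0.0, nan, 10.0]
```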


5.

Question

I’m an economical wine buyer. Which wine is the “best bargain”? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.

Solution

What the question asks:

Find the title of the wine with the best value for money.
Value for money: points / price

bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
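The idxmax() call returns an index label, not a position, which is why it is passed to loc. The sketch below uses a toy frame with a deliberately non-default index (all values invented) to show this. It also shows that a wine with a missing price produces a NaN ratio, which idxmax() skips by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a non-default index to make labels visible.
toy = pd.DataFrame(
    {
        "title": ["Wine A", "Wine B", "Wine C"],
        "points": [90, 86, 88],
        "price": [30.0, 4.0, np.nan],
    },
    index=[10, 20, 30],
)

ratio = toy.points / toy.price  # 3.0, 21.5, NaN
best_idx = ratio.idxmax()  # index label of the largest ratio; NaN is skipped
best_title = toy.loc[best_idx, "title"]
print(best_idx, best_title)  # 20 Wine B
```

If several wines share the highest ratio, idxmax() returns only the first of them, so this approach yields a single title.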


6.

Question

There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be “tropical” or “fruity”? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset.

Solution

What the question asks:

Count how many times tropical and fruity each appear in the description column.
Return the result as a Series.

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
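It may help to see exactly what the lambda counts. In the toy sketch below (invented descriptions), the in test is a case-sensitive substring check, so the result is the number of descriptions containing the word, not the total number of occurrences. The str.contains line is an equivalent vectorized form added here for comparison. It is not part of the original solution.

```python
import pandas as pd

# Hypothetical toy descriptions.
toy = pd.DataFrame(
    {
        "description": [
            "tropical fruit and more tropical notes",
            "fruity and light",
            "Tropical aromas",
            "dry and earthy",
        ]
    }
)

# "in" is case-sensitive, so "Tropical" does not match, and a description
# that mentions the word twice still counts once.
n_trop = toy.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = toy.description.map(lambda desc: "fruity" in desc).sum()

# Equivalent vectorized form using the string accessor (plain substring match).
n_trop_vec = toy.description.str.contains("tropical", regex=False).sum()

counts = pd.Series([n_trop, n_fruity], index=["tropical", "fruity"])
print(counts.tolist())  # [1, 1]
```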


7.

Question

We’d like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we’d like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a series star_ratings with the number of stars corresponding to each review in the dataset.

Solution

What the question asks:

points >= 95: 3 stars
points at least 85 but less than 95: 2 stars
points below 85: 1 star
Special case: every wine whose country is Canada gets 3 stars

def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings = reviews.apply(stars, axis='columns')
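The rules can be checked on a small made-up frame with one row per case. The sketch below applies the same row-wise logic (with the two 3-star branches merged) and compares it with a mask-based alternative. The mask version is not from the original solution; it is only one possible way to avoid calling a Python function for every row.

```python
import pandas as pd

# Hypothetical toy data: one row for each rule.
toy = pd.DataFrame(
    {
        "country": ["Canada", "US", "Italy", "France"],
        "points": [82, 95, 85, 84],
    }
)

def stars(row):
    # Canada always gets 3 stars, regardless of points.
    if row.country == "Canada" or row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    return 1

by_apply = toy.apply(stars, axis="columns")

# Alternative: start everyone at 1 star, then overwrite using boolean masks.
by_mask = pd.Series(1, index=toy.index)
by_mask[toy.points >= 85] = 2
by_mask[(toy.points >= 95) | (toy.country == "Canada")] = 3

print(by_apply.tolist())  # [3, 3, 2, 1]
```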


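Conclusion

This article is only a set of study notes, recording the process of going from 0 to 1.

I hope it helps you. If you find any mistakes, corrections are welcome.

I am 海轰 ଘ(੭ˊᵕˋ)੭

If you think it is well written, please give it a like.

Thank you for your support ❤️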
