Demographic Data Analyzer

Posted Feel Life

Introduction

In this challenge you must analyze demographic data using Pandas. You are given a dataset of demographic data that was extracted from the 1994 Census database.

FreeCodeCamp: https://chinese.freecodecamp.org/learn/data-analysis-with-python/data-analysis-with-python-projects/demographic-data-analyzer

Code

import pandas as pd


def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv('adult.data.csv')

    # How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = df.groupby("race").count()['age'].sort_values(ascending=False)

    # What is the average age of men?
    average_age_men = round(df[df['sex'] == 'Male']['age'].mean(), 1)

    # What is the percentage of people who have a Bachelor's degree?
    percentage_bachelors = round(df[df['education'] == 'Bachelors']['education'].count() / df['education'].count() * 100, 1)

    # What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    # What percentage of people without advanced education make more than 50K?

    # with and without `Bachelors`, `Masters`, or `Doctorate`
    higher_education = df[((df['education'] == 'Bachelors') | (df['education'] == 'Masters') | (df['education'] == 'Doctorate'))]['education'].count()
    lower_education = df[((df['education'] != 'Bachelors') & (df['education'] != 'Masters') & (df['education'] != 'Doctorate'))]['education'].count()

    # percentage with salary >50K
    higher_education_rich = round(df[(df['salary'] == '>50K') & ((df['education'] == 'Bachelors') | (df['education'] == 'Masters') | (df['education'] == 'Doctorate'))]['salary'].count() / higher_education * 100, 1)

    lower_education_rich = round(df[(df['salary'] == '>50K') & ((df['education'] != 'Bachelors') & (df['education'] != 'Masters') & (df['education'] != 'Doctorate'))]['education'].count() / lower_education * 100, 1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    num_min_workers = df[(df['hours-per-week'] == min_work_hours) & (df['salary'] == '>50K')]['salary'].count() 

    rich_percentage = round(num_min_workers / df[(df['hours-per-week'] == min_work_hours)]['hours-per-week'].count() * 100, 1)

    # What country has the highest percentage of people that earn >50K?
    # Reference: https://www.reddit.com/r/FreeCodeCamp/comments/le7ynx/data_analysis_with_python_projects_solving/
    salary = df.loc[df['salary'] == '>50K']['native-country'].value_counts()
    population = df['native-country'].value_counts()
    highest_earning_country = (salary / population).sort_values(ascending=False).index[0]
    highest_earning_country_percentage = round((salary / population * 100).max(), 1)

    # Identify the most popular occupation for those who earn >50K in India.
    top_IN_occupation = df[(df['salary'] == '>50K') & (df['native-country'] == 'India')]['occupation'].mode()[0]
    # print(top_IN_occupation)

    # DO NOT MODIFY BELOW THIS LINE

    if print_data:
        print("Number of each race:\n", race_count) 
        print("Average age of men:", average_age_men)
        print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
        print(f"Percentage with higher education that earn >50K: {higher_education_rich}%")
        print(f"Percentage without higher education that earn >50K: {lower_education_rich}%")
        print(f"Min work time: {min_work_hours} hours/week")
        print(f"Percentage of rich among those who work fewest hours: {rich_percentage}%")
        print("Country with highest percentage of rich:", highest_earning_country)
        print(f"Highest percentage of rich people in country: {highest_earning_country_percentage}%")
        print("Top occupations in India:", top_IN_occupation)

    return {
        'race_count': race_count,
        'average_age_men': average_age_men,
        'percentage_bachelors': percentage_bachelors,
        'higher_education_rich': higher_education_rich,
        'lower_education_rich': lower_education_rich,
        'min_work_hours': min_work_hours,
        'rich_percentage': rich_percentage,
        'highest_earning_country': highest_earning_country,
        'highest_earning_country_percentage':
        highest_earning_country_percentage,
        'top_IN_occupation': top_IN_occupation
    }
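
The repeated three-way comparisons on `education` can also be written with `Series.isin`, and a percentage can be taken as the mean of a boolean Series. The sketch below is an alternative formulation, not the code used in the solution above; the small DataFrame is made up for illustration and is not taken from adult.data.csv.

```python
import pandas as pd

# Tiny made-up sample; the values are illustrative, not from adult.data.csv.
df = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters", "HS-grad", "Doctorate"],
    "salary": [">50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# True for rows whose education is one of the three advanced degrees.
advanced = df["education"].isin(["Bachelors", "Masters", "Doctorate"])
# True for rows earning more than 50K.
rich = df["salary"] == ">50K"

# The mean of a boolean Series is the fraction of True values.
higher_education_rich = round(rich[advanced].mean() * 100, 1)
lower_education_rich = round(rich[~advanced].mean() * 100, 1)

print(higher_education_rich, lower_education_rich)
```

Naming the two masks once keeps each percentage to a single short expression, and `~advanced` replaces the chain of `!=` comparisons.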

DataSet

https://replit.com/@caisi35/boilerplate-demographic-data-analyzer#adult.data.csv

Result

[Image: Demographic Data Analyzer]

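Closing Notes

  1. The first few questions naturally suggested grouped counting, so I used groupby and count: group the rows with groupby, then tally them with count.
  2. pandas sorts with sort_values. The "values" in the name implies there are other kinds of sorting, otherwise the method would simply be called sort. The other kind is sort_index, which sorts by index. The parameter that controls direction also differs from Python's: here it is the Boolean ascending, so descending order is ascending=False.
  3. The most common way to filter on column values here is nested DataFrame indexing: df[df['column'] OPERATOR term]. Multiple conditions are combined with & and |, which correspond to Python's and and or. Another way is loc, which differs little from the first: df[(df['salary'] == '>50K')] and df.loc[df['salary'] == '>50K'] give the same result.
  4. By the last two or three questions my knowledge ran out. Only after watching a video did I learn about two methods I had not seen before: value_counts and mode. value_counts tallies how often each value occurs, which feels a bit like a grouped count, while mode returns the most frequent value(s) of a Series.
  5. round is the function that rounds a number to a given number of decimal places.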
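
The notes above can be tried out on a small example. This is a minimal sketch under stated assumptions: the DataFrame is made up for illustration (not taken from adult.data.csv), and variable names such as `by_group` and `share_rich` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from adult.data.csv.
df = pd.DataFrame({
    "race": ["White", "Black", "White", "Asian", "White", "Black"],
    "age": [39, 50, 38, 53, 28, 37],
    "salary": [">50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K"],
})

# Note 1: groupby + count, next to the value_counts shortcut from note 4.
by_group = df.groupby("race")["age"].count().sort_values(ascending=False)
by_value = df["race"].value_counts()

# Note 2: sort_values orders by the data, sort_index by the labels.
by_label = by_group.sort_index()

# Note 3: a boolean mask selects the same rows with or without .loc.
mask = (df["salary"] == ">50K") & (df["age"] < 40)
plain = df[mask]
with_loc = df.loc[mask]

# Note 4: mode returns a Series because ties are possible; [0] takes the first.
top_race = df["race"].mode()[0]

# Note 5: round with a second argument keeps that many decimal places.
share_rich = round((df["salary"] == ">50K").mean() * 100, 1)

print(by_group, by_label, top_race, share_rich, sep="\n")
```

Each condition inside the mask needs its own parentheses, because & and | bind more tightly than == and <.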

