text 数据分析师/体验课/市委书记养成记.ipynb
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "### Import packages"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2018-12-22T08:28:43.784481Z",
"start_time": "2018-12-22T08:28:43.112139Z"
},
"trusted": true
},
"cell_type": "code",
"source": "import pandas as pd # data analysis toolkit\nimport numpy as np # numerical computing toolkit\nimport matplotlib.pyplot as plt # plotting toolkit\nimport seaborn as sns # statistical visualization toolkit",
"execution_count": 1,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Read the data"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2018-12-22T08:28:43.814002Z",
"start_time": "2018-12-22T08:28:43.786085Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df = pd.read_csv(\"地市级党委书记数据库(2000-10).csv\", encoding=\"gbk\")",
"execution_count": 2,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### First 5 rows"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2018-12-22T08:28:43.848844Z",
"start_time": "2018-12-22T08:28:43.815554Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df.head()",
"execution_count": 3,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>省级政区代码</th>\n <th>省级政区名称</th>\n <th>地市级政区代码</th>\n <th>地市级政区名称</th>\n <th>年份</th>\n <th>党委书记姓名</th>\n <th>出生年份</th>\n <th>出生月份</th>\n <th>籍贯省份代码</th>\n <th>籍贯省份名称</th>\n <th>...</th>\n <th>民族</th>\n <th>教育</th>\n <th>是否是党校教育(是=1,否=0)</th>\n <th>专业:人文</th>\n <th>专业:社科</th>\n <th>专业:理工</th>\n <th>专业:农科</th>\n <th>专业:医科</th>\n <th>入党年份</th>\n <th>工作年份</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>130000</td>\n <td>河北省</td>\n <td>130100</td>\n <td>石家庄市</td>\n <td>2000</td>\n <td>陈来立</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>...</td>\n <td>NaN</td>\n <td>硕士</td>\n <td>1.0</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>1</th>\n <td>130000</td>\n <td>河北省</td>\n <td>130100</td>\n <td>石家庄市</td>\n <td>2001</td>\n <td>吴振华</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>...</td>\n <td>NaN</td>\n <td>本科</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>2</th>\n <td>130000</td>\n <td>河北省</td>\n <td>130100</td>\n <td>石家庄市</td>\n <td>2002</td>\n <td>吴振华</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>...</td>\n <td>NaN</td>\n <td>本科</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>3</th>\n <td>130000</td>\n <td>河北省</td>\n <td>130100</td>\n <td>石家庄市</td>\n <td>2003</td>\n <td>吴振华</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>...</td>\n <td>NaN</td>\n <td>本科</td>\n 
<td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>4</th>\n <td>130000</td>\n <td>河北省</td>\n <td>130100</td>\n <td>石家庄市</td>\n <td>2004</td>\n <td>吴振华</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>...</td>\n <td>NaN</td>\n <td>本科</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 23 columns</p>\n</div>",
"text/plain": " 省级政区代码 省级政区名称 地市级政区代码 地市级政区名称 年份 党委书记姓名 出生年份 出生月份 籍贯省份代码 籍贯省份名称 \\\n0 130000 河北省 130100 石家庄市 2000 陈来立 NaN NaN NaN NaN \n1 130000 河北省 130100 石家庄市 2001 吴振华 NaN NaN NaN NaN \n2 130000 河北省 130100 石家庄市 2002 吴振华 NaN NaN NaN NaN \n3 130000 河北省 130100 石家庄市 2003 吴振华 NaN NaN NaN NaN \n4 130000 河北省 130100 石家庄市 2004 吴振华 NaN NaN NaN NaN \n\n ... 民族 教育 是否是党校教育(是=1,否=0) 专业:人文 专业:社科 专业:理工 专业:农科 专业:医科 入党年份 工作年份 \n0 ... NaN 硕士 1.0 NaN NaN NaN NaN NaN NaN NaN \n1 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN \n2 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN \n3 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN \n4 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN \n\n[5 rows x 23 columns]"
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Select a range of rows (positions 10 through 19)"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2018-12-22T08:28:43.880141Z",
"start_time": "2018-12-22T08:28:43.850338Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df[10:20]",
"execution_count": 4,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>省级政区代码</th>\n <th>省级政区名称</th>\n <th>地市级政区代码</th>\n <th>地市级政区名称</th>\n <th>年份</th>\n <th>党委书记姓名</th>\n <th>出生年份</th>\n <th>出生月份</th>\n <th>籍贯省份代码</th>\n <th>籍贯省份名称</th>\n <th>...</th>\n <th>民族</th>\n <th>教育</th>\n <th>是否是党校教育(是=1,否=0)</th>\n <th>专业:人文</th>\n <th>专业:社科</th>\n <th>专业:理工</th>\n <th>专业:农科</th>\n <th>专业:医科</th>\n <th>入党年份</th>\n <th>工作年份</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>10</th>\n <td>130000</td>\n <td>河北省</td>\n <td>130100</td>\n <td>石家庄市</td>\n <td>2010</td>\n <td>孙瑞彬</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>...</td>\n <td>NaN</td>\n <td>硕士</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>11</th>\n <td>130000</td>\n <td>河北省</td>\n <td>130200</td>\n <td>唐山市</td>\n <td>2000</td>\n <td>白润璋</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>...</td>\n <td>NaN</td>\n <td>本科</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>12</th>\n <td>130000</td>\n <td>河北省</td>\n <td>130200</td>\n <td>唐山市</td>\n <td>2001</td>\n <td>白润璋</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>...</td>\n <td>NaN</td>\n <td>本科</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>13</th>\n <td>130000</td>\n <td>河北省</td>\n <td>130200</td>\n <td>唐山市</td>\n <td>2002</td>\n <td>白润璋</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>...</td>\n <td>NaN</td>\n <td>本科</td>\n 
<td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>14</th>\n <td>130000</td>\n <td>河北省</td>\n <td>130200</td>\n <td>唐山市</td>\n <td>2003</td>\n <td>张和</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>...</td>\n <td>NaN</td>\n <td>本科</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>15</th>\n <td>130000</td>\n <td>河北省</td>\n <td>130200</td>\n <td>唐山市</td>\n <td>2004</td>\n <td>张和</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>...</td>\n <td>NaN</td>\n <td>本科</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>16</th>\n <td>130000</td>\n <td>河北省</td>\n <td>130200</td>\n <td>唐山市</td>\n <td>2005</td>\n <td>张和</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>...</td>\n <td>NaN</td>\n <td>本科</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>17</th>\n <td>130000</td>\n <td>河北省</td>\n <td>130200</td>\n <td>唐山市</td>\n <td>2006</td>\n <td>张和</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>...</td>\n <td>NaN</td>\n <td>本科</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>18</th>\n <td>130000</td>\n <td>河北省</td>\n <td>130200</td>\n <td>唐山市</td>\n <td>2007</td>\n <td>赵勇</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>...</td>\n <td>NaN</td>\n <td>博士</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>19</th>\n <td>130000</td>\n <td>河北省</td>\n <td>130200</td>\n <td>唐山市</td>\n <td>2008</td>\n <td>赵勇</td>\n 
<td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>NaN</td>\n <td>...</td>\n <td>NaN</td>\n <td>博士</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>1.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>0.0</td>\n <td>NaN</td>\n <td>NaN</td>\n </tr>\n </tbody>\n</table>\n<p>10 rows × 23 columns</p>\n</div>",
"text/plain": " 省级政区代码 省级政区名称 地市级政区代码 地市级政区名称 年份 党委书记姓名 出生年份 出生月份 籍贯省份代码 籍贯省份名称 \\\n10 130000 河北省 130100 石家庄市 2010 孙瑞彬 NaN NaN NaN NaN \n11 130000 河北省 130200 唐山市 2000 白润璋 NaN NaN NaN NaN \n12 130000 河北省 130200 唐山市 2001 白润璋 NaN NaN NaN NaN \n13 130000 河北省 130200 唐山市 2002 白润璋 NaN NaN NaN NaN \n14 130000 河北省 130200 唐山市 2003 张和 NaN NaN NaN NaN \n15 130000 河北省 130200 唐山市 2004 张和 NaN NaN NaN NaN \n16 130000 河北省 130200 唐山市 2005 张和 NaN NaN NaN NaN \n17 130000 河北省 130200 唐山市 2006 张和 NaN NaN NaN NaN \n18 130000 河北省 130200 唐山市 2007 赵勇 NaN NaN NaN NaN \n19 130000 河北省 130200 唐山市 2008 赵勇 NaN NaN NaN NaN \n\n ... 民族 教育 是否是党校教育(是=1,否=0) 专业:人文 专业:社科 专业:理工 专业:农科 专业:医科 入党年份 \\\n10 ... NaN 硕士 1.0 0.0 1.0 0.0 0.0 0.0 NaN \n11 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN \n12 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN \n13 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN \n14 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN \n15 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN \n16 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN \n17 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN \n18 ... NaN 博士 0.0 0.0 1.0 0.0 0.0 0.0 NaN \n19 ... NaN 博士 0.0 0.0 1.0 0.0 0.0 0.0 NaN \n\n 工作年份 \n10 NaN \n11 NaN \n12 NaN \n13 NaN \n14 NaN \n15 NaN \n16 NaN \n17 NaN \n18 NaN \n19 NaN \n\n[10 rows x 23 columns]"
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### List all column names"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2018-12-22T08:28:43.893834Z",
"start_time": "2018-12-22T08:28:43.882316Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df.columns.tolist()",
"execution_count": 5,
"outputs": [
{
"data": {
"text/plain": "['省级政区代码',\n '省级政区名称',\n '地市级政区代码',\n '地市级政区名称',\n '年份',\n '党委书记姓名',\n '出生年份',\n '出生月份',\n '籍贯省份代码',\n '籍贯省份名称',\n '籍贯地市代码',\n '籍贯地市名称',\n '性别',\n '民族',\n '教育',\n '是否是党校教育(是=1,否=0)',\n '专业:人文',\n '专业:社科',\n '专业:理工',\n '专业:农科',\n '专业:医科',\n '入党年份',\n '工作年份']"
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### First 10 values of a given column"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2018-12-22T08:28:43.904829Z",
"start_time": "2018-12-22T08:28:43.895282Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df[\"党委书记姓名\"].head(10)",
"execution_count": 6,
"outputs": [
{
"data": {
"text/plain": "0 陈来立\n1 吴振华\n2 吴振华\n3 吴振华\n4 吴振华\n5 吴振华\n6 吴振华\n7 吴显国\n8 吴显国\n9 车俊\nName: 党委书记姓名, dtype: object"
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
]
},
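{
"metadata": {},
"cell_type": "markdown",
"source": "### Basic summary statistics\n- `include=[np.number]`: summarize only numeric columns: count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum\n- `include=[object]`: summarize only string (object) columns: count, number of unique values, most frequent value, and its frequency. Use the builtin `object`; the `np.object` alias was removed in NumPy 1.24"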
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2018-12-22T08:28:43.965838Z",
"start_time": "2018-12-22T08:28:43.906196Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df.describe(include=[np.number])",
"execution_count": 7,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>省级政区代码</th>\n <th>地市级政区代码</th>\n <th>年份</th>\n <th>出生年份</th>\n <th>出生月份</th>\n <th>籍贯省份代码</th>\n <th>籍贯地市代码</th>\n <th>是否是党校教育(是=1,否=0)</th>\n <th>专业:人文</th>\n <th>专业:社科</th>\n <th>专业:理工</th>\n <th>专业:农科</th>\n <th>专业:医科</th>\n <th>入党年份</th>\n <th>工作年份</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>count</th>\n <td>3663.000000</td>\n <td>3663.000000</td>\n <td>3663.000000</td>\n <td>2676.000000</td>\n <td>2645.000000</td>\n <td>2624.000000</td>\n <td>2615.000000</td>\n <td>2493.000000</td>\n <td>2370.000000</td>\n <td>2376.000000</td>\n <td>2371.000000</td>\n <td>2369.000000</td>\n <td>2370.000000</td>\n <td>2384.000000</td>\n <td>2568.000000</td>\n </tr>\n <tr>\n <th>mean</th>\n <td>403393.393393</td>\n <td>404456.756757</td>\n <td>2005.000000</td>\n <td>1953.622571</td>\n <td>6.790548</td>\n <td>364428.353659</td>\n <td>365742.332696</td>\n <td>0.430405</td>\n <td>0.275527</td>\n <td>0.627525</td>\n <td>0.256854</td>\n <td>0.067539</td>\n <td>0.009705</td>\n <td>1976.906879</td>\n <td>1973.129673</td>\n </tr>\n <tr>\n <th>std</th>\n <td>148176.721620</td>\n <td>148485.810327</td>\n <td>3.162709</td>\n <td>4.416316</td>\n <td>3.614664</td>\n <td>126267.485520</td>\n <td>125961.993399</td>\n <td>0.576136</td>\n <td>0.446874</td>\n <td>0.483566</td>\n <td>0.436990</td>\n <td>0.251006</td>\n <td>0.098054</td>\n <td>5.310080</td>\n <td>4.856564</td>\n </tr>\n <tr>\n <th>min</th>\n <td>130000.000000</td>\n <td>130100.000000</td>\n <td>2000.000000</td>\n <td>1941.000000</td>\n <td>1.000000</td>\n <td>110000.000000</td>\n <td>120000.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n 
<td>0.000000</td>\n <td>1961.000000</td>\n <td>1958.000000</td>\n </tr>\n <tr>\n <th>25%</th>\n <td>330000.000000</td>\n <td>330100.000000</td>\n <td>2002.000000</td>\n <td>1951.000000</td>\n <td>3.000000</td>\n <td>320000.000000</td>\n <td>320700.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>1973.000000</td>\n <td>1970.000000</td>\n </tr>\n <tr>\n <th>50%</th>\n <td>420000.000000</td>\n <td>420200.000000</td>\n <td>2005.000000</td>\n <td>1954.000000</td>\n <td>7.000000</td>\n <td>370000.000000</td>\n <td>370700.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>1.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>1976.000000</td>\n <td>1972.500000</td>\n </tr>\n <tr>\n <th>75%</th>\n <td>510000.000000</td>\n <td>513400.000000</td>\n <td>2008.000000</td>\n <td>1956.000000</td>\n <td>10.000000</td>\n <td>430000.000000</td>\n <td>431300.000000</td>\n <td>1.000000</td>\n <td>1.000000</td>\n <td>1.000000</td>\n <td>1.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>1981.000000</td>\n <td>1976.000000</td>\n </tr>\n <tr>\n <th>max</th>\n <td>650000.000000</td>\n <td>654300.000000</td>\n <td>2010.000000</td>\n <td>1966.000000</td>\n <td>14.000000</td>\n <td>640000.000000</td>\n <td>640500.000000</td>\n <td>9.000000</td>\n <td>1.000000</td>\n <td>1.000000</td>\n <td>1.000000</td>\n <td>1.000000</td>\n <td>1.000000</td>\n <td>1994.000000</td>\n <td>1990.000000</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " 省级政区代码 地市级政区代码 年份 出生年份 出生月份 \\\ncount 3663.000000 3663.000000 3663.000000 2676.000000 2645.000000 \nmean 403393.393393 404456.756757 2005.000000 1953.622571 6.790548 \nstd 148176.721620 148485.810327 3.162709 4.416316 3.614664 \nmin 130000.000000 130100.000000 2000.000000 1941.000000 1.000000 \n25% 330000.000000 330100.000000 2002.000000 1951.000000 3.000000 \n50% 420000.000000 420200.000000 2005.000000 1954.000000 7.000000 \n75% 510000.000000 513400.000000 2008.000000 1956.000000 10.000000 \nmax 650000.000000 654300.000000 2010.000000 1966.000000 14.000000 \n\n 籍贯省份代码 籍贯地市代码 是否是党校教育(是=1,否=0) 专业:人文 \\\ncount 2624.000000 2615.000000 2493.000000 2370.000000 \nmean 364428.353659 365742.332696 0.430405 0.275527 \nstd 126267.485520 125961.993399 0.576136 0.446874 \nmin 110000.000000 120000.000000 0.000000 0.000000 \n25% 320000.000000 320700.000000 0.000000 0.000000 \n50% 370000.000000 370700.000000 0.000000 0.000000 \n75% 430000.000000 431300.000000 1.000000 1.000000 \nmax 640000.000000 640500.000000 9.000000 1.000000 \n\n 专业:社科 专业:理工 专业:农科 专业:医科 入党年份 \\\ncount 2376.000000 2371.000000 2369.000000 2370.000000 2384.000000 \nmean 0.627525 0.256854 0.067539 0.009705 1976.906879 \nstd 0.483566 0.436990 0.251006 0.098054 5.310080 \nmin 0.000000 0.000000 0.000000 0.000000 1961.000000 \n25% 0.000000 0.000000 0.000000 0.000000 1973.000000 \n50% 1.000000 0.000000 0.000000 0.000000 1976.000000 \n75% 1.000000 1.000000 0.000000 0.000000 1981.000000 \nmax 1.000000 1.000000 1.000000 1.000000 1994.000000 \n\n 工作年份 \ncount 2568.000000 \nmean 1973.129673 \nstd 4.856564 \nmin 1958.000000 \n25% 1970.000000 \n50% 1972.500000 \n75% 1976.000000 \nmax 1990.000000 "
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2018-12-22T08:28:44.011786Z",
"start_time": "2018-12-22T08:28:43.967523Z"
},
"trusted": true
},
"cell_type": "code",
"source": "df.describe(include=[object]) # builtin object; the np.object alias was removed in NumPy 1.24",
"execution_count": 8,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>省级政区名称</th>\n <th>地市级政区名称</th>\n <th>党委书记姓名</th>\n <th>籍贯省份名称</th>\n <th>籍贯地市名称</th>\n <th>性别</th>\n <th>民族</th>\n <th>教育</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>count</th>\n <td>3663</td>\n <td>3663</td>\n <td>3021</td>\n <td>2624</td>\n <td>2615</td>\n <td>2708</td>\n <td>2517</td>\n <td>2550</td>\n </tr>\n <tr>\n <th>unique</th>\n <td>27</td>\n <td>333</td>\n <td>901</td>\n <td>29</td>\n <td>240</td>\n <td>2</td>\n <td>2</td>\n <td>7</td>\n </tr>\n <tr>\n <th>top</th>\n <td>四川省</td>\n <td>张掖市</td>\n <td>焉荣竹</td>\n <td>山东省</td>\n <td>威海市</td>\n <td>男</td>\n <td>汉族</td>\n <td>硕士</td>\n </tr>\n <tr>\n <th>freq</th>\n <td>231</td>\n <td>11</td>\n <td>11</td>\n <td>313</td>\n <td>58</td>\n <td>2633</td>\n <td>2351</td>\n <td>1381</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " 省级政区名称 地市级政区名称 党委书记姓名 籍贯省份名称 籍贯地市名称 性别 民族 教育\ncount 3663 3663 3021 2624 2615 2708 2517 2550\nunique 27 333 901 29 240 2 2 7\ntop 四川省 张掖市 焉荣竹 山东省 威海市 男 汉族 硕士\nfreq 231 11 11 313 58 2633 2351 1381"
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Share by gender"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2018-12-22T08:28:44.028176Z",
"start_time": "2018-12-22T08:28:44.013857Z"
},
"trusted": true
},
"cell_type": "code",
"source": "gender = df['性别']\ngd = gender[gender.notnull()] # .notnull() keeps only the non-null entries\ncount = gd.count()\nrate_m = \"{:.2f}\".format(gd[gd == \"男\"].count() * 100 / count)\nrate_w = \"{:.2f}\".format(gd[gd == \"女\"].count() * 100 / count)\nprint(gd.head(), \"\\n\\n\", gd.unique(), \"\\n\\n\",\n rate_m, rate_w) # .unique() returns the distinct values",
"execution_count": 9,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "121 男\n122 男\n123 男\n124 男\n125 男\nName: 性别, dtype: object \n\n ['男' '女'] \n\n 97.23 2.77\n"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Share by province and gender"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2018-12-22T08:28:44.055894Z",
"start_time": "2018-12-22T08:28:44.029829Z"
},
"trusted": true
},
"cell_type": "code",
"source": "prov_gender = df[[\"省级政区名称\", \"性别\"]].dropna()\n# pd.crosstab(rows, columns) builds a frequency table of categorical values (similar to an Excel pivot table)\npg = pd.crosstab(prov_gender[\"省级政区名称\"], prov_gender[\"性别\"])\npg.head()",
"execution_count": 10,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th>性别</th>\n <th>女</th>\n <th>男</th>\n </tr>\n <tr>\n <th>省级政区名称</th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>云南省</th>\n <td>2</td>\n <td>73</td>\n </tr>\n <tr>\n <th>内蒙古自治区</th>\n <td>0</td>\n <td>86</td>\n </tr>\n <tr>\n <th>吉林省</th>\n <td>4</td>\n <td>72</td>\n </tr>\n <tr>\n <th>四川省</th>\n <td>8</td>\n <td>155</td>\n </tr>\n <tr>\n <th>宁夏回族自治区</th>\n <td>0</td>\n <td>49</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": "性别 女 男\n省级政区名称 \n云南省 2 73\n内蒙古自治区 0 86\n吉林省 4 72\n四川省 8 155\n宁夏回族自治区 0 49"
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2018-12-22T08:28:44.076762Z",
"start_time": "2018-12-22T08:28:44.057435Z"
},
"trusted": true
},
"cell_type": "code",
"source": "pg[\"女性占比\"] = pg[\"女\"] / (pg[\"男\"] + pg[\"女\"])\n# .sort_values() sorts the rows; ascending=False means descending order\npg = pg.sort_values(by=[\"女性占比\"], ascending=False)\npg.head()",
"execution_count": 11,
"outputs": [
{
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th>性别</th>\n <th>女</th>\n <th>男</th>\n <th>女性占比</th>\n </tr>\n <tr>\n <th>省级政区名称</th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>辽宁省</th>\n <td>13</td>\n <td>121</td>\n <td>0.097015</td>\n </tr>\n <tr>\n <th>陕西省</th>\n <td>9</td>\n <td>93</td>\n <td>0.088235</td>\n </tr>\n <tr>\n <th>吉林省</th>\n <td>4</td>\n <td>72</td>\n <td>0.052632</td>\n </tr>\n <tr>\n <th>山西省</th>\n <td>6</td>\n <td>112</td>\n <td>0.050847</td>\n </tr>\n <tr>\n <th>四川省</th>\n <td>8</td>\n <td>155</td>\n <td>0.049080</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": "性别 女 男 女性占比\n省级政区名称 \n辽宁省 13 121 0.097015\n陕西省 9 93 0.088235\n吉林省 4 72 0.052632\n山西省 6 112 0.050847\n四川省 8 155 0.049080"
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Charts"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "### Share of female city party secretaries by province"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2018-12-22T08:28:44.094962Z",
"start_time": "2018-12-22T08:28:44.078329Z"
},
"trusted": true
},
"cell_type": "code",
"source": "fig_1 = plt.figure(figsize=(8, 4))\nfig_1.show()",
"execution_count": 12,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": "/root/anaconda3/lib/python3.7/site-packages/matplotlib/figure.py:457: UserWarning: matplotlib is currently using a non-GUI backend, so cannot show the figure\n \"matplotlib is currently using a non-GUI backend, \"\n"
},
{
"data": {
"text/plain": "<Figure size 576x288 with 0 Axes>"
},
"metadata": {},
"output_type": "display_data"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"gist": {
"id": "773548b4bc39feb1d8100edf8e85e366",
"data": {
"description": "数据分析师/体验课/市委书记养成记.ipynb",
"public": true
}
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.7.0",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"base_numbering": 1,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": true
},
"varInspector": {
"window_display": true,
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"library": "var_list.py",
"delete_cmd_prefix": "del ",
"delete_cmd_postfix": "",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"library": "var_list.r",
"delete_cmd_prefix": "rm(",
"delete_cmd_postfix": ") ",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"position": {
"height": "836px",
"left": "1565px",
"right": "20px",
"top": "144px",
"width": "537px"
}
},
"_draft": {
"nbviewer_url": "https://gist.github.com/773548b4bc39feb1d8100edf8e85e366"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
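
The notebook stops at an empty figure, so the per-province female share is never turned into something plottable. The sketch below shows one possible way to get there in a single step with `pd.crosstab(..., normalize="index")`. The toy rows and the helper name `female_share_by_province` are assumptions made for illustration; they are not part of the original notebook, and the values are made up rather than taken from the real CSV.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real CSV (values are made up).
toy = pd.DataFrame({
    "省级政区名称": ["A省", "A省", "A省", "B省", "B省", "C省"],
    "性别": ["男", "女", "男", "男", "男", None],
})


def female_share_by_province(frame: pd.DataFrame) -> pd.Series:
    """Return the female share per province, sorted in descending order."""
    # Drop rows where either the province or the gender is missing.
    pairs = frame[["省级政区名称", "性别"]].dropna()
    # normalize="index" divides each row of the table by its row total.
    shares = pd.crosstab(pairs["省级政区名称"], pairs["性别"], normalize="index")
    # reindex guards against a sample that contains no women (or no men) at all.
    shares = shares.reindex(columns=["女", "男"], fill_value=0.0)
    return shares["女"].sort_values(ascending=False)


share = female_share_by_province(toy)
print(share)
```

If matplotlib is available, the resulting Series could then be drawn with something like `share.plot.bar(ax=ax)` on an axes created by `plt.subplots(figsize=(8, 4))`. Chinese tick labels may additionally need a font that covers CJK characters.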