python panda 统计一列有多少数据

Posted 2023-04-19

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了python panda 统计一列有多少数据相关的知识，希望对你有一定的参考价值。

参考技术A 创建数据

通过Python的zip构造出一元组组成的列表作为DataFrame的输入数据rec。

In [3]: import pandas as pd
In [4]: import random
In [5]: num = random.sample(xrange(10000, 1000000), 5)
In [6]: num
Out[6]: [244937, 132008, 278446, 613409, 799201]
In [8]: names = "hello the cruel world en".split()
In [9]: names
Out[9]: ['hello', 'the', 'cruel', 'world', 'en']
In [10]: rec = zip(names, num)
In [15]: data = pd.DataFrame(rec, columns = [u"姓名",u"业绩" ])
In [16]: data
Out[16]:
姓名业绩
0 hello 244937
1 the 132008
2 cruel 278446
3 world 613409
4 en 799201

DataFrame方法函数的第一个参数是数据源，第二个参数columns是输出数据表的表头，或者说是表格的字段名。

导出数据csv

Windows平台上的编码问题，我们可以先做个简单处理，是ipython-notebook支持utf8.
import sys
reload(sys)
sys.setdefaultencoding("utf8")

接下来可以数据导出了。
In [31]: data
Out[31]:
姓名业绩
0 hello 244937
1 the 132008
2 cruel 278446
3 world 613409
4 en 799201
#在ipython-note里后加问号可查帮助，q退出帮助
In [32]: data.to_csv?
In [33]: data.to_csv("c:\\out.csv", index = True, header = [u"雇员", u"销售业绩"])

将data导出到out.csv文件里，index参数是指是否有主索引，header如果不指定则是以data里columns为头，如果指定则是以后边列表里的字符串为表头，但要注意的是header后的字符串列表的个数要和data里的columns字段个数相同。

可到c盘用Notepad++打开out.csv看看。

简单的数据分析
In [43]: data
Out[43]:
姓名业绩
0 hello 244937
1 the 132008
2 cruel 278446
3 world 613409
4 en 799201本回答被提问者采纳

Pandas案例--人口密度分析

需求：
- 导入文件，查看原始数据
- 将人口数据和各州简称数据进行合并
- 将合并的数据中重复的abbreviation列进行删除
- 查看存在缺失数据的列
- 找到有哪些state/region使得state的值为NaN，进行去重操作
- 为找到的这些state/region的state项补上正确的值，从而去除掉state这一列的所有NaN
- 合并各州面积数据areas
- 我们会发现area(sq.mi)这一列有缺失数据，找出是哪些行
- 去除含有缺失数据的行
- 找出2010年的全民人口数据
- 计算各州的人口密度
- 排序，并找出人口密度最高的五个州 df.sort_values()

import numpy as np
from pandas import DataFrame,Series
import pandas as pd

abb = pd.read_csv('./data/state-abbrevs.csv')
abb.head(2)

	state	abbreviation
0	Alabama	AL
1	Alaska	AK

pop = pd.read_csv('./data/state-population.csv')
pop.head(2)

	state/region	ages	year	population
0	AL	under18	2012	1117489.0
1	AL	total	2012	4817528.0

area = pd.read_csv('./data/state-areas.csv')
area.head(2)

	state	area (sq. mi)
0	Alabama	52423
1	Alaska	656425

# 将人口数据和各州简称数据进行合并
abb_pop = pd.merge(abb,pop,left_on='abbreviation',right_on='state/region',how='outer')
abb_pop.head(2)

	state	abbreviation	state/region	ages	year	population
0	Alabama	AL	AL	under18	2012	1117489.0
1	Alabama	AL	AL	total	2012	4817528.0

# 将合并的数据中重复的abbreviation列进行删除
abb_pop.drop(labels='abbreviation',axis=1,inplace=True)
abb_pop.head(2)

	state	state/region	ages	year	population
0	Alabama	AL	under18	2012	1117489.0
1	Alabama	AL	total	2012	4817528.0

# 查看存在缺失数据的列
abb_pop.isnull().any(axis=0)

state            True
state/region    False
ages            False
year            False
population       True
dtype: bool

# 找到有哪些state/region使得state的值为NaN，进行去重操作
# 1.state列中哪些值为空
abb_pop['state'].isnull()

0       False
1       False
2       False
        ...  
2542     True
2543     True
Name: state, Length: 2544, dtype: bool

# 2.可以将step1中空对应的行数据取出(state中的空值对应的行数据)
abb_pop.loc[abb_pop['state'].isnull()]

	state	state/region	ages	year	population
2448	NaN	PR	under18	1990	NaN
...	...	...	...	...	...
2543	NaN	USA	total	2012	313873685.0

96 rows × 5 columns

# 3.将对应的行数据中指定的简称列取出
abb_pop.loc[abb_pop['state'].isnull()]['state/region'].unique()

array(['PR', 'USA'], dtype=object)

# 为找到的这些state/region的state项补上正确的值，从而去除掉state这一列的所有NaN
# 1.先将USA对应的state列中的空值定位到
abb_pop['state/region'] == 'USA'

0       False
1       False
2       False
3       False
        ...  
2541     True
2542     True
2543     True
Name: state/region, Length: 2544, dtype: bool

# 2,将布尔值作为原数据的行索引，取出USA简称对应的行数据
abb_pop.loc[abb_pop['state/region'] == 'USA']

	state	state/region	ages	year	population
2496	NaN	USA	under18	1990	64218512.0
...	...	...	...	...	...
2542	NaN	USA	under18	2012	73708179.0
2543	NaN	USA	total	2012	313873685.0

# 3.获取符合要求行数据的行索引
indexs = abb_pop.loc[abb_pop['state/region'] == 'USA'].index

# 4.将indexs这些行中的state列的值批量赋值成united states
abb_pop.loc[indexs,'state'] = 'United Status'

# 将PR对应的state列中的空批量赋值成 PUERTO RICO
abb_pop['state/region'] == 'PR'
abb_pop.loc[abb_pop['state/region'] == 'PR']
indexs = abb_pop.loc[abb_pop['state/region'] == 'PR'].index
abb_pop.loc[indexs,'state'] = 'PUERTO RICO'

# 合并各州面积数据areas
abb_pop_area = pd.merge(abb_pop,area,how='outer')
abb_pop_area.head(3)

	state	state/region	ages	year	population	area (sq. mi)
0	Alabama	AL	under18	2012.0	1117489.0	52423.0
1	Alabama	AL	total	2012.0	4817528.0	52423.0
2	Alabama	AL	under18	2010.0	1130966.0	52423.0

# 我们会发现area(sq.mi)这一列有缺失数据，找出是哪些行
abb_pop_area['area (sq. mi)'].isnull()
# 将空值对应的行数据取出
indexs = abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index
indexs

Int64Index([2448, 2449, 2450, 2451, 2452, 2453, 2454, 2455, 2456, 2457, 2458,
            2459, 2460, 2461, 2462, 2463, 2464, 2465, 2466, 2467, 2468, 2469,
            2470, 2471, 2472, 2473, 2474, 2475, 2476, 2477, 2478, 2479, 2480,
            2481, 2482, 2483, 2484, 2485, 2486, 2487, 2488, 2489, 2490, 2491,
            2492, 2493, 2494, 2495, 2496, 2497, 2498, 2499, 2500, 2501, 2502,
            2503, 2504, 2505, 2506, 2507, 2508, 2509, 2510, 2511, 2512, 2513,
            2514, 2515, 2516, 2517, 2518, 2519, 2520, 2521, 2522, 2523, 2524,
            2525, 2526, 2527, 2528, 2529, 2530, 2531, 2532, 2533, 2534, 2535,
            2536, 2537, 2538, 2539, 2540, 2541, 2542, 2543],
           dtype='int64')

# 去除含有缺失数据的行
abb_pop_area.drop(labels=indexs,axis=0,inplace=True)

# 找出2010年的全民人口数据    条件查询
abb_pop_area.query('year == 2010 & ages == "total"')

... ...

	state	state/region	ages	year	population	area (sq. mi)
3	Alabama	AL	total	2010.0	4785570.0	52423.0
91	Alaska	AK	total	2010.0	713868.0	656425.0
101	Arizona	AZ	total	2010.0	6408790.0	114006.0
189	Arkansas	AR	total	2010.0	2922280.0	53182.0
197	California	CA	total	2010.0	37333601.0	163707.0
2405	Wyoming	WY	total	2010.0	564222.0	97818.0

# 计算各州的人口密度
abb_pop_area['midu'] = abb_pop_area['population'] / abb_pop_area['area (sq. mi)']
abb_pop_area.head(2)

	state	state/region	ages	year	population	area (sq. mi)	midu
0	Alabama	AL	under18	2012.0	1117489.0	52423.0	21.316769
1	Alabama	AL	total	2012.0	4817528.0	52423.0	91.897221

# 排序，并找出人口密度最高的五个州   df.sort_values()
abb_pop_area.sort_values(by='midu',axis=0,ascending=False).head(5)

	state	state/region	ages	year	population	area (sq. mi)	midu
391	District of Columbia	DC	total	2013.0	646449.0	68.0	9506.602941
385	District of Columbia	DC	total	2012.0	633427.0	68.0	9315.102941
387	District of Columbia	DC	total	2011.0	619624.0	68.0	9112.117647
431	District of Columbia	DC	total	1990.0	605321.0	68.0	8901.779412
389	District of Columbia	DC	total	2010.0	605125.0	68.0	8898.897059

abb_pop_area.groupby(by='state')['area (sq. mi)'].max().sort_values(ascending=False).head(5)

state
Alaska        656425.0
Texas         268601.0
California    163707.0
Montana       147046.0
New Mexico    121593.0
Name: area (sq. mi), dtype: float64

以上是关于python panda 统计一列有多少数据的主要内容，如果未能解决你的问题，请参考以下文章