Python Data Analysis and Data Mining for Beginners, Chapter 2: pandas, Section 5: Getting Started with pandas
Getting Started with pandas
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)
Introduction to pandas Data Structures
Series
obj = pd.Series([4, 7, -5, 3])
obj
obj.values
obj.index # like range(4)
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])  # set the index explicitly
obj2
obj2.index
Init signature: pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
Docstring:
One-dimensional ndarray with axis labels (including time series).
?
Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).
?
Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the two indexes.
?
Parameters
----------
data : array-like, dict, or scalar value
Contains data stored in Series
index : array-like or Index (1d)
Values must be hashable and have the same length as `data`.
Non-unique index values are allowed. Will default to
RangeIndex(len(data)) if not provided. If both a dict and index
sequence are used, the index will override the keys found in the
dict.
dtype : numpy.dtype or None
If None, dtype will be inferred
copy : boolean, default False
Copy input data
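The alignment rule quoted in the docstring above (operations align on index labels, and the result index is the union of both indexes) can be sketched as follows. The Series names and values here are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Two Series of different lengths that share only the label "z".
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["z", "w"])

# Addition aligns on labels; labels present in only one operand yield NaN.
total = a + b
print(total)
```

Only `z` exists on both sides, so it is the only label with a numeric result; the other three labels appear in the output with NaN.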
obj2['a']
obj2['d'] = 6
obj2[['c', 'a', 'd']]
obj2[obj2 > 0]
obj2 * 2
np.exp(obj2)
'b' in obj2
'e' in obj2
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4
pd.isnull(obj4)
Signature: pd.isnull(obj)
Docstring:
Detect missing values (NaN in numeric arrays, None/NaN in object arrays)
?
Parameters
----------
arr : ndarray or object value
Object to check for null-ness
?
Returns
-------
isna : array-like of bool or bool
Array or bool indicating whether an object is null or if an array is
given which of the element is null.
?
See also
--------
pandas.notna: boolean inverse of pandas.isna
pandas.isnull: alias of isna
pd.notnull(obj4)
Signature: pd.notnull(obj)
Docstring:
Replacement for numpy.isfinite / -numpy.isnan which is suitable for use
on object arrays.
?
Parameters
----------
arr : ndarray or object value
Object to check for *not*-null-ness
?
Returns
-------
notisna : array-like of bool or bool
Array or bool indicating whether an object is *not* null or if an array
is given which of the element is *not* null.
?
See also
--------
pandas.isna : boolean inverse of pandas.notna
pandas.notnull : alias of notna
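obj4.isnull()
obj3
obj4
obj3 + obj4  # values with the same index label are added; labels found on one side only give NaN
obj4.name = 'population'
obj4.index.name = 'state'  # name the index
obj4
obj
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']  # replace the index by assignment
obj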
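As a compact recap of the missing-data checks used in this section, here is a minimal sketch with a shortened version of the `sdata` dictionary. It uses `isna` and `notna`, which the docstrings above list as the names that `isnull` and `notnull` alias. The variable names `s`, `missing`, and `present` are illustrative.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000}

# "California" is not a key of sdata, so its value becomes NaN
# and the Series is stored as float64.
s = pd.Series(sdata, index=["California", "Ohio", "Texas"])

missing = s.isna()        # boolean mask, True where the value is NaN
present = s[s.notna()]    # keep only the labels that have data

print(missing)
print(present)
```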
DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame
frame.head()
Signature: frame.head(n=5)
Docstring:
Return the first n rows.
?
Parameters
----------
n : int, default 5
Number of rows to select.
?
Returns
-------
obj_head : type of caller
The first n rows of the caller object.
pd.DataFrame(data, columns=['year', 'state', 'pop'])  # choose the column order
Init signature: pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Docstring:
Two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). Arithmetic operations
align on both row and column labels. Can be thought of as a dict-like
container for Series objects. The primary pandas data structure
?
Parameters
----------
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
Dict can contain Series, arrays, constants, or list-like objects
index : Index or array-like
Index to use for resulting frame. Will default to np.arange(n) if
no indexing information part of input data and no index provided
columns : Index or array-like
Column labels to use for resulting frame. Will default to
np.arange(n) if no column labels are provided
dtype : dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input
?
Examples
--------
Constructing DataFrame from a dictionary.
?
>>> d = {‘col1‘: [1, 2], ‘col2‘: [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
col1 col2
0 1 3
1 2 4
?
Notice that the inferred dtype is int64.
?
>>> df.dtypes
col1 int64
col2 int64
dtype: object
?
To enforce a single dtype:
?
>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1 int8
col2 int8
dtype: object
?
Constructing DataFrame from numpy ndarray:
?
>>> df2 = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
... columns=[‘a‘, ‘b‘, ‘c‘, ‘d‘, ‘e‘])
>>> df2
a b c d e
0 2 8 8 3 4
1 4 2 9 0 9
2 1 0 7 8 0
3 5 1 7 1 3
4 6 0 2 4 2
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],  # a column with no data ('debt') shows NaN
                      index=['one', 'two', 'three', 'four',
                             'five', 'six'])
frame2
frame2.columns
frame2['state']
frame2.year  # attribute access does not work for column names that contain spaces
frame2.loc['three']
frame2['debt'] = 16.5
frame2
frame2['debt'] = np.arange(6.)
frame2
Docstring:
arange([start,] stop[, step,], dtype=None)
?
Return evenly spaced values within a given interval.
?
Values are generated within the half-open interval ``[start, stop)``
(in other words, the interval including `start` but excluding `stop`).
For integer arguments the function is equivalent to the Python built-in
`range <http://docs.python.org/lib/built-in-funcs.html>`_ function,
but returns an ndarray rather than a list.
?
When using a non-integer step, such as 0.1, the results will often not
be consistent. It is better to use ``linspace`` for these cases.
?
Parameters
----------
start : number, optional
Start of interval. The interval includes this value. The default
start value is 0.
stop : number
End of interval. The interval does not include this value, except
in some cases where `step` is not an integer and floating point
round-off affects the length of `out`.
step : number, optional
Spacing between values. For any output `out`, this is the distance
between two adjacent values, ``out[i+1] - out[i]``. The default
step size is 1. If `step` is specified as a position argument,
`start` must also be given.
dtype : dtype
The type of the output array. If `dtype` is not given, infer the data
type from the other input arguments.
?
Returns
-------
arange : ndarray
Array of evenly spaced values.
?
For floating point arguments, the length of the result is
``ceil((stop - start)/step)``. Because of floating point overflow,
this rule may result in the last element of `out` being greater
than `stop`.
?
See Also
--------
linspace : Evenly spaced numbers with careful handling of endpoints.
ogrid: Arrays of evenly spaced numbers in N-dimensions.
mgrid: Grid-shaped arrays of evenly spaced numbers in N-dimensions.
?
Examples
--------
>>> np.arange(3)
array([0, 1, 2])
>>> np.arange(3.0)
array([ 0., 1., 2.])
>>> np.arange(3,7)
array([3, 4, 5, 6])
>>> np.arange(3,7,2)
array([3, 5])
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2  # values are matched on index labels; rows without a match get NaN
frame2['eastern'] = frame2.state == 'Ohio'
frame2
del frame2['eastern']  # delete the column
frame2.columns
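Assigning a Series to a column aligns it with the frame's row index, which is easy to miss. The sketch below uses a small, made-up frame (not the notebook's `frame2`) to show that unmatched rows become NaN and that Series labels absent from the frame are dropped.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]},
                     index=["one", "two", "three"])

# "two" matches a row label; "four" does not exist in the frame.
val = pd.Series([-1.2, -1.5], index=["two", "four"])

frame["debt"] = val  # aligned on index labels, not on position
print(frame)
```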
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3
frame3.T  # transpose
pd.DataFrame(pop, index=[2001, 2002, 2003])
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)
frame3.index.name = 'year'; frame3.columns.name = 'state'  # name the row index and the column index
frame3
frame3.values
frame2.values
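A short sketch of how a nested dictionary maps onto a DataFrame: outer keys become columns and inner keys become row labels. The data repeats the `pop` dictionary from above; the row order of the result can differ between pandas versions, so the example does not rely on it.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame = pd.DataFrame(pop)   # columns: Nevada, Ohio; rows: the union of the years
transposed = frame.T        # rows and columns swapped

# Nevada has no entry for 2000, so that cell is NaN.
print(frame)
print(transposed)
```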
Index Objects
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
index[1:]
index[1] = 'd'  # TypeError: Index objects are immutable
labels = pd.Index(np.arange(3))
labels
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
obj2.index is labels
frame3
frame3.columns
'Ohio' in frame3.columns
2003 in frame3.index
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels
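The two Index properties shown in this section, immutability and tolerance of duplicate labels, can be checked in a few lines. This is a sketch with illustrative variable names (`error`, `dup`), not code from the original notebook.

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

# Item assignment on an Index is rejected with a TypeError.
error = None
try:
    index[1] = "d"
except TypeError as exc:
    error = exc

# Unlike a Python set, an Index may hold repeated labels.
dup = pd.Index(["foo", "foo", "bar"])

print(type(error).__name__)
print(dup.is_unique)
```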
Essential Functionality
Reindexing
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
Signature: obj.reindex(index=None, **kwargs)
Docstring:
Conform Series to new index with optional filling logic, placing
NA/NaN in locations having no value in the previous index. A new object
is produced unless the new index is equivalent to the current one and
copy=False
?
Parameters
----------
?
index : array-like, optional (should be specified using keywords)
New labels / index to conform to. Preferably an Index object to
avoid duplicating data
?
method : {None, ‘backfill‘/‘bfill‘, ‘pad‘/‘ffill‘, ‘nearest‘}, optional
method to use for filling holes in reindexed DataFrame.
Please note: this is only applicable to DataFrames/Series with a
monotonically increasing/decreasing index.
?
* default: don‘t fill gaps
* pad / ffill: propagate last valid observation forward to next
valid
* backfill / bfill: use next valid observation to fill gap
* nearest: use nearest valid observations to fill gap
?
copy : boolean, default True
Return a new object, even if the passed indexes are the same
level : int or name
Broadcast across a level, matching Index values on the
passed MultiIndex level
fill_value : scalar, default np.NaN
Value to use for missing values. Defaults to NaN, but can be any
"compatible" value
limit : int, default None
Maximum number of consecutive elements to forward or backward fill
tolerance : optional
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations must
satisfy the equation ``abs(index[indexer] - target) <= tolerance``.
?
Tolerance may be a scalar value, which applies the same tolerance
to all values, or list-like, which applies variable tolerance per
element. List-like includes list, tuple, array, Series, and must be
the same size as the index and its dtype must exactly match the
index‘s type.
?
.. versionadded:: 0.17.0
.. versionadded:: 0.21.0 (list-like tolerance)
?
Examples
--------
?
``DataFrame.reindex`` supports two calling conventions
?
* ``(index=index_labels, columns=column_labels, ...)``
* ``(labels, axis={‘index‘, ‘columns‘}, ...)``
?
We *highly* recommend using keyword arguments to clarify your
intent.
?
Create a dataframe with some fictional data.
?
>>> index = [‘Firefox‘, ‘Chrome‘, ‘Safari‘, ‘IE10‘, ‘Konqueror‘]
>>> df = pd.DataFrame({
... ‘http_status‘: [200,200,404,404,301],
... ‘response_time‘: [0.04, 0.02, 0.07, 0.08, 1.0]},
... index=index)
>>> df
http_status response_time
Firefox 200 0.04
Chrome 200 0.02
Safari 404 0.07
IE10 404 0.08
Konqueror 301 1.00
?
Create a new index and reindex the dataframe. By default
values in the new index that do not have corresponding
records in the dataframe are assigned ``NaN``.
?
>>> new_index= [‘Safari‘, ‘Iceweasel‘, ‘Comodo Dragon‘, ‘IE10‘,
... ‘Chrome‘]
>>> df.reindex(new_index)
http_status response_time
Safari 404.0 0.07
Iceweasel NaN NaN
Comodo Dragon NaN NaN
IE10 404.0 0.08
Chrome 200.0 0.02
?
We can fill in the missing values by passing a value to
the keyword ``fill_value``. Because the index is not monotonically
increasing or decreasing, we cannot use arguments to the keyword
``method`` to fill the ``NaN`` values.
?
>>> df.reindex(new_index, fill_value=0)
http_status response_time
Safari 404 0.07
Iceweasel 0 0.00
Comodo Dragon 0 0.00
IE10 404 0.08
Chrome 200 0.02
?
>>> df.reindex(new_index, fill_value=‘missing‘)
http_status response_time
Safari 404 0.07
Iceweasel missing missing
Comodo Dragon missing missing
IE10 404 0.08
Chrome 200 0.02
?
We can also reindex the columns.
?
>>> df.reindex(columns=[‘http_status‘, ‘user_agent‘])
http_status user_agent
Firefox 200 NaN
Chrome 200 NaN
Safari 404 NaN
IE10 404 NaN
Konqueror 301 NaN
?
Or we can use "axis-style" keyword arguments
?
>>> df.reindex([‘http_status‘, ‘user_agent‘], axis="columns")
http_status user_agent
Firefox 200 NaN
Chrome 200 NaN
Safari 404 NaN
IE10 404 NaN
Konqueror 301 NaN
?
To further illustrate the filling functionality in
``reindex``, we will create a dataframe with a
monotonically increasing index (for example, a sequence
of dates).
?
>>> date_index = pd.date_range(‘1/1/2010‘, periods=6, freq=‘D‘)
>>> df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]},
... index=date_index)
>>> df2
prices
2010-01-01 100
2010-01-02 101
2010-01-03 NaN
2010-01-04 100
2010-01-05 89
2010-01-06 88
?
Suppose we decide to expand the dataframe to cover a wider
date range.
?
>>> date_index2 = pd.date_range(‘12/29/2009‘, periods=10, freq=‘D‘)
>>> df2.reindex(date_index2)
prices
2009-12-29 NaN
2009-12-30 NaN
2009-12-31 NaN
2010-01-01 100
2010-01-02 101
2010-01-03 NaN
2010-01-04 100
2010-01-05 89
2010-01-06 88
2010-01-07 NaN
?
The index entries that did not have a value in the original data frame
(for example, ‘2009-12-29‘) are by default filled with ``NaN``.
If desired, we can fill in the missing values using one of several
options.
?
For example, to fill the ``NaN`` values backward from the next valid value,
pass ``bfill`` as an argument to the ``method`` keyword.

>>> df2.reindex(date_index2, method='bfill')
prices
2009-12-29 100
2009-12-30 100
2009-12-31 100
2010-01-01 100
2010-01-02 101
2010-01-03 NaN
2010-01-04 100
2010-01-05 89
2010-01-06 88
2010-01-07 NaN
?
Please note that the ``NaN`` value present in the original dataframe
(at index value 2010-01-03) will not be filled by any of the
value propagation schemes. This is because filling while reindexing
does not look at dataframe values, but only compares the original and
desired indexes. If you do want to fill in the ``NaN`` values present
in the original dataframe, use the ``fillna()`` method.
?
See the :ref:`user guide <basics.reindexing>` for more.
?
Returns
-------
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3
obj3.reindex(range(6), method='ffill')  # forward-fill the labels missing from the original index
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
frame.loc[['a', 'b', 'c', 'd'], states]  # this triggers a warning: passing a list with missing labels to .loc will raise an error in future versions; use .reindex instead
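The warning mentioned in the last cell has since become an error: in pandas 1.0 and later, `.loc` with a list that contains missing labels raises `KeyError`. The sketch below records whether the installed version raises and then shows the `reindex` form, which works either way. The names `loc_raised` and `safe` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])

# "b" is not a row label. Whether .loc raises depends on the pandas version.
try:
    frame.loc[["a", "b"]]
    loc_raised = False
except KeyError:
    loc_raised = True

# reindex accepts missing labels on both axes and fills them with NaN.
safe = frame.reindex(index=["a", "b", "c", "d"],
                     columns=["Texas", "Utah", "California"])
print(loc_raised)
print(safe)
```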
Dropping Entries from an Axis
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
new_obj = obj.drop('c')
new_obj
Signature: obj.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
Docstring:
Return new object with labels in requested axis removed.
?
Parameters
----------
labels : single label or list-like
Index or column labels to drop.
axis : int or axis name
Whether to drop labels from the index (0 / ‘index‘) or
columns (1 / ‘columns‘).
index, columns : single label or list-like
Alternative to specifying `axis` (``labels, axis=1`` is
equivalent to ``columns=labels``).
?
.. versionadded:: 0.21.0
level : int or level name, default None
For MultiIndex
inplace : bool, default False
If True, do operation inplace and return None.
errors : {‘ignore‘, ‘raise‘}, default ‘raise‘
If ‘ignore‘, suppress error and existing labels are dropped.
?
Returns
-------
dropped : type of caller
?
Examples
--------
>>> df = pd.DataFrame(np.arange(12).reshape(3,4),
columns=[‘A‘, ‘B‘, ‘C‘, ‘D‘])
>>> df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
?
Drop columns
?
>>> df.drop([‘B‘, ‘C‘], axis=1)
A D
0 0 3
1 4 7
2 8 11
?
>>> df.drop(columns=[‘B‘, ‘C‘])
A D
0 0 3
1 4 7
2 8 11
?
Drop a row by index
?
>>> df.drop([0, 1])
A B C D
2 8 9 10 11
?
Notes
obj.drop(['d', 'c'])
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
data.drop(['Colorado', 'Ohio'])
data.drop('two', axis=1)
data.drop(['two', 'four'], axis='columns')
obj.drop('c', inplace=True)
obj
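A brief sketch of the `drop` behaviors documented above: by default it returns a new object and leaves the original untouched, `columns=` is the keyword alternative to `axis=1`, and `errors='ignore'` skips labels that do not exist. The label `"zzz"` and the result names are hypothetical.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.0), index=list("abcde"))
new_obj = obj.drop("c")                   # obj itself still contains "c"
same = obj.drop("zzz", errors="ignore")   # unknown label is silently skipped

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
no_two = data.drop(columns=["two"])       # same result as data.drop("two", axis=1)

print(new_obj)
print(no_two)
```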
Indexing, Selection, and Filtering
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
obj['b']
obj[1]  # positional access; recent pandas versions deprecate this in favor of obj.iloc[1]
obj[2:4]
obj[['b', 'a', 'd']]
obj[[1, 3]]  # positional access; obj.iloc[[1, 3]] is the explicit form
obj[obj < 2]
obj['b':'c']  # a label slice includes its end point
obj['b':'c'] = 5
obj
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
data['two']
data[['three', 'one']]
data[:2]
data[data['three'] > 5]
data < 5
data[data < 5] = 0
data
Selection with loc and iloc
data.loc['Colorado', ['two', 'three']]
data.iloc[2, [3, 0, 1]]
data.iloc[2]
data.iloc[[1, 2], [3, 0, 1]]
data.loc[:'Utah', 'two']  # select a range of rows by label
data.iloc[:, :3][data.three > 5]  # select by position, then filter rows
Integer Indexes
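ser = pd.Series(np.arange(3.))
ser
ser[-1]  # KeyError: with an integer index, -1 is looked up as a label
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]  # falls back to positional access on a non-integer index; ser2.iloc[-1] is the explicit form
ser[:1]
ser.loc[:1]
ser.iloc[:1]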
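The ambiguity with integer indexes is easiest to see side by side. This sketch, using illustrative result names, shows that plain `[]` with `-1` looks for a label on an integer-indexed Series, while `loc` slices by label (end point included) and `iloc` slices by position (end point excluded).

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.0))  # default integer index 0, 1, 2

# There is no label -1, so plain indexing fails.
try:
    ser[-1]
    raised = False
except KeyError:
    raised = True

last = ser.iloc[-1]          # explicit positional access
by_label = ser.loc[:1]       # labels 0 and 1
by_position = ser.iloc[:1]   # position 0 only

print(raised, last, len(by_label), len(by_position))
```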
Arithmetic and Data Alignment
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=['a', 'c', 'e', 'f', 'g'])
s1
s2
s1 + s2  # a label found in only one Series gives NaN
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
df2
df1 + df2
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df1
df2
df1 - df2  # both row and column labels must match; no column is shared here, so every value is NaN
Arithmetic methods with fill values
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))
df2
df2.loc[1, 'b'] = np.nan
df1
df1 + df2
df1.add(df2, fill_value=0)  # a value missing from one operand is treated as 0
Signature: df1.add(other, axis=‘columns‘, level=None, fill_value=None)
Docstring:
Addition of dataframe and other, element-wise (binary operator `add`).
?
Equivalent to ``dataframe + other``, but with support to substitute a fill_value for
missing data in one of the inputs.
?
Parameters
----------
other : Series, DataFrame, or constant
axis : {0, 1, ‘index‘, ‘columns‘}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame
locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the
passed MultiIndex level
?
Notes
-----
Mismatched indices will be unioned together
?
Returns
-------
result : DataFrame
?
See also
--------
1 / df1
?
df1.rdiv(1)
Signature: df1.rdiv(other, axis=‘columns‘, level=None, fill_value=None)
Docstring:
Floating division of dataframe and other, element-wise (binary operator `rtruediv`).
?
Equivalent to ``other / dataframe``, but with support to substitute a fill_value for
missing data in one of the inputs.
?
Parameters
----------
other : Series, DataFrame, or constant
axis : {0, 1, ‘index‘, ‘columns‘}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame
locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the
passed MultiIndex level
?
Notes
-----
Mismatched indices will be unioned together
?
Returns
-------
result : DataFrame
?
See also
--------
df1.reindex(columns=df2.columns, fill_value=0)
df1.reindex(index=df2.index,columns=df2.columns, fill_value=np.pi)
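To make the effect of `fill_value` concrete, the sketch below compares plain `+` with `add(..., fill_value=0)` on the same two frames (without the extra NaN set in the notebook) and checks that `rdiv(1)` matches `1 / df1`. The result names `plain`, `filled`, and `reciprocal_matches` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.0).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.0).reshape((4, 5)), columns=list("abcde"))

plain = df1 + df2                     # column "e" and row 3 exist only in df2 -> NaN
filled = df1.add(df2, fill_value=0)   # the missing side counts as 0 before adding

# The r-prefixed methods swap the operands: df1.rdiv(1) computes 1 / df1.
reciprocal_matches = (1 / df1).equals(df1.rdiv(1))

print(plain)
print(filled)
```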
Operations between DataFrame and Series
arr = np.arange(12.).reshape((3, 4))
arr
arr[0]
arr - arr[0]
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame
series
frame - series
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2
series3 = frame['d']
frame
series3
frame.sub(series3, axis='index')  # match on the row index and broadcast across the columns
Signature: frame.sub(other, axis=‘columns‘, level=None, fill_value=None)
Docstring:
Subtraction of dataframe and other, element-wise (binary operator `sub`).
?
Equivalent to ``dataframe - other``, but with support to substitute a fill_value for
missing data in one of the inputs.
?
Parameters
----------
other : Series, DataFrame, or constant
axis : {0, 1, ‘index‘, ‘columns‘}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame
locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the
passed MultiIndex level
?
Notes
-----
Mismatched indices will be unioned together
?
Returns
-------
result : DataFrame
?
See also
--------
DataFrame.rsub
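The two broadcasting directions can be compared directly. In this sketch, the plain operator matches the Series index against the column labels and repeats the operation down the rows, while `sub(..., axis='index')` matches against the row labels instead. The names `by_row` and `by_col` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.0).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

row = frame.iloc[0]                    # indexed by the column labels b, d, e
by_row = frame - row                   # subtract the first row from every row

col = frame["d"]                       # indexed by the row labels
by_col = frame.sub(col, axis="index")  # subtract column d from every column

print(by_row)
print(by_col)
```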
Function Application and Mapping
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
np.abs(frame)  # element-wise absolute value
Call signature: np.abs(*args, **kwargs)
Type: ufunc
String form: <ufunc ‘absolute‘>
File: c:\users\qq123\anaconda3\lib\site-packages\numpy\__init__.py
Docstring:
absolute(x, /, out=None, *, where=True, casting=‘same_kind‘, order=‘K‘, dtype=None, subok=True[, signature, extobj])
?
Calculate the absolute value element-wise.
?
Parameters
----------
x : array_like
Input array.
out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have
a shape that the inputs broadcast to. If not provided or `None`,
a freshly-allocated array is returned. A tuple (possible only as a
keyword argument) must have length equal to the number of outputs.
where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values
of False indicate to leave the value in the output alone.
**kwargs
For other keyword-only arguments, see the
:ref:`ufunc docs <ufuncs.kwargs>`.
?
Returns
-------
absolute : ndarray
An ndarray containing the absolute value of
each element in `x`. For complex input, ``a + ib``, the
absolute value is :math:`\sqrt{ a^2 + b^2 }`.
?
Examples
--------
>>> x = np.array([-1.2, 1.2])
>>> np.absolute(x)
array([ 1.2, 1.2])
>>> np.absolute(1.2 + 1j)
1.5620499351813308
?
Plot the function over ``[-10, 10]``:
?
>>> import matplotlib.pyplot as plt
?
>>> x = np.linspace(start=-10, stop=10, num=101)
>>> plt.plot(x, np.absolute(x))
>>> plt.show()
?
Plot the function over the complex plane:
?
>>> xx = x + 1j * x[:, np.newaxis]
>>> plt.imshow(np.abs(xx), extent=[-10, 10, -10, 10], cmap=‘gray‘)
>>> plt.show()
Class docstring:
Functions that operate element by element on whole arrays.
?
To see the documentation for a specific ufunc, use `info`. For
example, ``np.info(np.sin)``. Because ufuncs are written in C
(for speed) and linked into Python with NumPy‘s ufunc facility,
Python‘s help() function finds this page whenever help() is called
on a ufunc.
?
A detailed explanation of ufuncs can be found in the docs for :ref:`ufuncs`.
?
Calling ufuncs:
===============
?
op(*x[, out], where=True, **kwargs)
Apply `op` to the arguments `*x` elementwise, broadcasting the arguments.
?
The broadcasting rules are:
?
* Dimensions of length 1 may be prepended to either array.
* Arrays may be repeated along dimensions of length 1.
?
Parameters
----------
*x : array_like
Input arrays.
out : ndarray, None, or tuple of ndarray and None, optional
Alternate array object(s) in which to put the result; if provided, it
must have a shape that the inputs broadcast to. A tuple of arrays
(possible only as a keyword argument) must have length equal to the
number of outputs; use `None` for outputs to be allocated by the ufunc.
where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values
of False indicate to leave the value in the output alone.
**kwargs
For other keyword-only arguments, see the :ref:`ufunc docs <ufuncs.kwargs>`.
?
Returns
-------
r : ndarray or tuple of ndarray
`r` will have the shape that the arrays in `x` broadcast to; if `out` is
provided, `r` will be equal to `out`. If the function has more than one
output, then the result will be a tuple of arrays.
f = lambda x: x.max() - x.min()
frame.apply(f)  # max minus min of each column
frame.apply(f, axis=1)  # max minus min of each row
frame.apply(f, axis='columns')
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
frame.apply(f, axis=1)
format = lambda x: '%.2f' % x  # format each value with two decimal places
frame.applymap(format)  # pandas 2.1 and later deprecate applymap in favor of DataFrame.map
frame['e'].map(format)
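Since the notebook's frame is random, here is a deterministic sketch of the same three operations on a small made-up frame: a column-wise reduction, a row-wise reduction, and element-wise formatting. The element-wise step applies `Series.map` to each column, a choice made here so the example does not depend on `applymap`, which newer pandas versions deprecate. The names `spread` and `fmt` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0], [3.5, 0.5]],
                     columns=["b", "d"], index=["Utah", "Ohio"])

spread = lambda x: x.max() - x.min()
per_column = frame.apply(spread)                 # one value per column
per_row = frame.apply(spread, axis="columns")    # one value per row

fmt = lambda x: "%.2f" % x
formatted = frame.apply(lambda col: col.map(fmt))  # element-wise, column by column

print(per_column)
print(per_row)
print(formatted)
```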
Sorting and Ranking
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame.sort_index()
Signature: frame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None)
Docstring:
Sort object by labels (along an axis)
?
Parameters
----------
axis : index, columns to direct sorting
level : int or level name or list of ints or list of level names
if not None, sort on values in specified index level(s)
ascending : boolean, default True
Sort ascending vs. descending
inplace : bool, default False
if True, perform operation in-place
kind : {‘quicksort‘, ‘mergesort‘, ‘heapsort‘}, default ‘quicksort‘
Choice of sorting algorithm. See also ndarray.np.sort for more
information. `mergesort` is the only stable algorithm. For
DataFrames, this option is only applied when sorting on a single
column or label.
na_position : {‘first‘, ‘last‘}, default ‘last‘
`first` puts NaNs at the beginning, `last` puts NaNs at the end.
Not implemented for MultiIndex.
sort_remaining : bool, default True
if true and sorting by level and index is multilevel, sort by other
levels too (in order) after sorting by specified level
?
Returns
-------
sorted_obj : DataFrame
frame.sort_index(axis=1)  # sort by column labels
frame.sort_index(axis=1, ascending=False)  # descending order
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()  # missing values are placed at the end by default
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame
frame.sort_values(by='b')
Signature: frame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
Docstring:
Sort by the values along either axis
?
.. versionadded:: 0.17.0
?
Parameters
----------
by : str or list of str
Name or list of names which refer to the axis items.
axis : {0 or ‘index‘, 1 or ‘columns‘}, default 0
Axis to direct sorting
ascending : bool or list of bool, default True
Sort ascending vs. descending. Specify list for multiple sort
orders. If this is a list of bools, must match the length of
the by.
inplace : bool, default False
if True, perform operation in-place
kind : {‘quicksort‘, ‘mergesort‘, ‘heapsort‘}, default ‘quicksort‘
Choice of sorting algorithm. See also ndarray.np.sort for more
information. `mergesort` is the only stable algorithm. For
DataFrames, this option is only applied when sorting on a single
column or label.
na_position : {‘first‘, ‘last‘}, default ‘last‘
`first` puts NaNs at the beginning, `last` puts NaNs at the end
?
Returns
-------
sorted_obj : DataFrame
?
Examples
--------
>>> df = pd.DataFrame({
... ‘col1‘ : [‘A‘, ‘A‘, ‘B‘, np.nan, ‘D‘, ‘C‘],
... ‘col2‘ : [2, 1, 9, 8, 7, 4],
... ‘col3‘: [0, 1, 9, 4, 2, 3],
... })
>>> df
col1 col2 col3
0 A 2 0
1 A 1 1
2 B 9 9
3 NaN 8 4
4 D 7 2
5 C 4 3
?
Sort by col1
?
>>> df.sort_values(by=[‘col1‘])
col1 col2 col3
0 A 2 0
1 A 1 1
2 B 9 9
5 C 4 3
4 D 7 2
3 NaN 8 4
?
Sort by multiple columns
?
>>> df.sort_values(by=[‘col1‘, ‘col2‘])
col1 col2 col3
1 A 1 1
0 A 2 0
2 B 9 9
5 C 4 3
4 D 7 2
3 NaN 8 4
?
Sort Descending
?
>>> df.sort_values(by=‘col1‘, ascending=False)
col1 col2 col3
4 D 7 2
5 C 4 3
2 B 9 9
0 A 2 0
1 A 1 1
3 NaN 8 4
?
Putting NAs first
?
>>> df.sort_values(by=‘col1‘, ascending=False, na_position=‘first‘)
col1 col2 col3
3 NaN 8 4
4 D 7 2
5 C 4 3
2 B 9 9
0 A 2 0
1 A 1 1
frame.sort_values(by=['a', 'b'])
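The sorting calls above can be verified by looking at the order of the resulting index. This sketch reuses the same small frame and Series; the result names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

by_b = frame.sort_values(by="b")           # single key
by_ab = frame.sort_values(by=["a", "b"])   # sort by a first, then by b within each a

s = pd.Series([4, np.nan, 7, np.nan, -3, 2])
sorted_s = s.sort_values()                 # NaN goes last (na_position='last')

print(list(by_b.index))
print(list(by_ab.index))
print(sorted_s)
```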
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()
obj.rank(pct=True)
Signature: obj.rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)
Docstring:
Compute numerical data ranks (1 through n) along axis. Equal values are
assigned a rank that is the average of the ranks of those values

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
index to direct ranking
method : {'average', 'min', 'max', 'first', 'dense'}
* average: average rank of group
* min: lowest rank in group
* max: highest rank in group
* first: ranks assigned in order they appear in the array
* dense: like 'min', but rank always increases by 1 between groups
numeric_only : boolean, default None
Include only float, int, boolean data. Valid only for DataFrame or
Panel objects
na_option : {'keep', 'top', 'bottom'}
* keep: leave NA values where they are
* top: smallest rank if ascending
* bottom: smallest rank if descending
ascending : boolean, default True
False for ranks by high (1) to low (N)
pct : boolean, default False
Computes percentage rank of data
Returns
obj.rank(method='first')
# tied values all receive the highest rank in their group
obj.rank(ascending=False, method='max')
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame
frame.rank(axis='columns')  # rank the values within each row
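The tie-breaking methods listed in the `rank` docstring are easier to compare on one Series. The sketch below reuses the values from this section and computes four of the methods, plus the descending `max` variant used above; the result names are illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                 # ties share the mean of their ranks
first = obj.rank(method="first")     # ties ranked in order of appearance
lowest = obj.rank(method="min")      # ties all get the lowest rank of the group
dense = obj.rank(method="dense")     # like min, but ranks have no gaps
desc_max = obj.rank(ascending=False, method="max")

print(pd.DataFrame({"value": obj, "average": average, "first": first,
                    "min": lowest, "dense": dense, "desc_max": desc_max}))
```

The two 7s occupy ranks 6 and 7, so `average` gives both 6.5, `first` gives 6 and 7, `min` gives both 6, and `dense` gives both 5 because only five distinct values exist.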
Axis Indexes with Duplicate Labels
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
obj.index.is_unique  # check whether the index labels are unique
obj['a']
obj['c']
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
df.loc['b']
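With duplicate labels, the return type of a selection depends on how many entries match the label. A minimal sketch, with illustrative result names:

```python
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

repeated = obj["a"]   # label occurs twice -> a Series with two rows
single = obj["c"]     # label occurs once -> a scalar

print(type(repeated).__name__, len(repeated))
print(single)
```

Code that indexes by label may therefore need to check `obj.index.is_unique` first if it expects a scalar.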
Summarizing and Computing Descriptive Statistics
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df
df.sum()
Signature: df.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
Docstring:
df.sum(axis='columns')
df.mean(axis='columns', skipna=False)  # do not skip NaN: any NaN in a row makes that row's result NaN
df.idxmax()
df.idxmax(axis=1)
Signature: df.idxmax(axis=0, skipna=True)
Docstring:
Return index of first occurrence of maximum over requested axis.
NA/null values are excluded.
?
Parameters
----------
axis : {0 or ‘index‘, 1 or ‘columns‘}, default 0
0 or ‘index‘ for row-wise, 1 or ‘columns‘ for column-wise
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result
will be NA.
?
Raises
------
ValueError
* If the row/column is empty
?
Returns
-------
idxmax : Series
?
Notes
-----
This method is the DataFrame version of ``ndarray.argmax``.
?
See Also
--------
Series.idxmax
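The effect of `skipna` on the reductions in this section can be checked on the same four-row frame. This is a sketch with illustrative result names; it calls `idxmax` only along the default axis, because the all-NaN row `c` makes the row-wise call version-dependent in newer pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

col_sums = df.sum()                                  # NaN skipped by default
row_sums = df.sum(axis="columns")                    # all-NaN row c sums to 0.0
strict_means = df.mean(axis="columns", skipna=False) # any NaN in a row gives NaN
top_labels = df.idxmax()                             # row label of each column's maximum

print(col_sums)
print(row_sums)
print(strict_means)
print(top_labels)
```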
df.cumsum()
Signature: df.cumsum(axis=None, skipna=True, *args, **kwargs)
Docstring:
Return cumulative sum over requested axis.
?
Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result
will be NA
?
Returns
-------
cumsum : Series
df.describe()
Signature: df.describe(percentiles=None, include=None, exclude=None)
Docstring:
Generates descriptive statistics that summarize the central tendency,
dispersion and shape of a dataset's distribution, excluding ``NaN`` values.

Analyzes both numeric and object series, as well as ``DataFrame`` column
sets of mixed data types. The output will vary depending on what is
provided. Refer to the notes below for more detail.

Parameters
----------
percentiles : list-like of numbers, optional
The percentiles to include in the output. All should fall between 0 and 1.
The default is ``[.25, .5, .75]``, which returns the 25th, 50th, and
75th percentiles.
include : 'all', list-like of dtypes or None (default), optional
A white list of data types to include in the result. Ignored for
``Series``. Here are the options:
- 'all' : All columns of the input will be included in the output.
- A list-like of dtypes : Limits the results to the
provided data types.
To limit the result to numeric types submit
``numpy.number``. To limit it instead to object columns submit
the ``numpy.object`` data type. Strings
can also be used in the style of
``select_dtypes`` (e.g. ``df.describe(include=['O'])``). To
select pandas categorical columns, use ``'category'``
- None (default) : The result will include all numeric columns.
exclude : list-like of dtypes or None (default), optional
A black list of data types to omit from the result. Ignored for
``Series``. Here are the options:
- A list-like of dtypes : Excludes the provided data types
from the result. To exclude numeric types submit
``numpy.number``. To exclude object columns submit the data
type ``numpy.object``. Strings can also be used in the style of
``select_dtypes`` (e.g. ``df.describe(exclude=['O'])``). To
exclude pandas categorical columns, use ``'category'``
- None (default) : The result will exclude nothing.

Returns
-------
summary: Series/DataFrame of summary statistics

Notes
-----
For numeric data, the result's index will include ``count``,
``mean``, ``std``, ``min``, ``max`` as well as lower, ``50`` and
upper percentiles. By default the lower percentile is ``25`` and the
upper percentile is ``75``. The ``50`` percentile is the
same as the median.

For object data (e.g. strings or timestamps), the result's index
will include ``count``, ``unique``, ``top``, and ``freq``. The ``top``
is the most common value. The ``freq`` is the most common value's
frequency. Timestamps also include the ``first`` and ``last`` items.

If multiple object values have the highest count, then the
``count`` and ``top`` results will be arbitrarily chosen from
among those with the highest count.

For mixed data types provided via a ``DataFrame``, the default is to
return only an analysis of numeric columns. If the dataframe consists
only of object and categorical data without any numeric columns, the
default is to return an analysis of both the object and categorical
columns. If ``include='all'`` is provided as an option, the result
will include a union of attributes of each type.

The ``include`` and ``exclude`` parameters can be used to limit
which columns in a ``DataFrame`` are analyzed for the output.
The parameters are ignored when analyzing a ``Series``.

Examples
--------
Describing a numeric ``Series``.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0

Describing a categorical ``Series``.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp ``Series``.

>>> s = pd.Series([
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe()
count                       3
unique                      2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object

Describing a ``DataFrame``. By default only numeric fields
are returned.

>>> df = pd.DataFrame({ 'object': ['a', 'b', 'c'],
...                     'numeric': [1, 2, 3],
...                     'categorical': pd.Categorical(['d','e','f'])
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a ``DataFrame`` regardless of data type.

>>> df.describe(include='all')
        categorical  numeric object
count             3      3.0      3
unique            3      NaN      3
top               f      NaN      c
freq              1      NaN      1
mean            NaN      2.0    NaN
std             NaN      1.0    NaN
min             NaN      1.0    NaN
25%             NaN      1.5    NaN
50%             NaN      2.0    NaN
75%             NaN      2.5    NaN
max             NaN      3.0    NaN

Describing a column from a ``DataFrame`` by accessing it as
an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a ``DataFrame`` description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a ``DataFrame`` description.

>>> df.describe(include=[np.object])
       object
count       3
unique      3
top         c
freq        1

Including only categorical columns from a ``DataFrame`` description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              f
freq             1

Excluding numeric columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              f      c
freq             1      1

Excluding object columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.object])
        categorical  numeric
count             3      3.0
unique            3      NaN
top               f      NaN
freq              1      NaN
mean            NaN      2.0
std             NaN      1.0
min             NaN      1.0
25%             NaN      1.5
50%             NaN      2.0
75%             NaN      2.5
max             NaN      3.0

See Also
--------
DataFrame.count
DataFrame.max
DataFrame.min
DataFrame.mean
DataFrame.std
DataFrame.select_dtypes
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()
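For non-numeric data, `describe` switches to the `count` / `unique` / `top` / `freq` summary described in the docstring above. A small sketch of reading those fields back from the same repeated series (the names `letters` and `summary` are illustrative):

```python
import pandas as pd

letters = pd.Series(['a', 'a', 'b', 'c'] * 4)
summary = letters.describe()

# The summary is itself a Series, so each statistic can be looked up by label.
print(summary['count'])   # number of non-null entries
print(summary['unique'])  # number of distinct values
print(summary['top'])     # most frequent value
print(summary['freq'])    # how often the most frequent value occurs
```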
Correlation and Covariance
conda install pandas-datareader
# Load price and volume data from local pickle files
price = pd.read_pickle('examples/yahoo_price.pkl')
volume = pd.read_pickle('examples/yahoo_volume.pkl')
# Alternative: download the data with pandas-datareader (needs network access)
import pandas_datareader.data as web
all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}
price = pd.DataFrame({ticker: data['Adj Close']
                      for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
                       for ticker, data in all_data.items()})
returns = price.pct_change()
returns.tail()
returns['MSFT'].corr(returns['IBM'])  # correlation
returns['MSFT'].cov(returns['IBM'])  # covariance
returns.MSFT.corr(returns.IBM)
returns.corr()
returns.cov()
returns.corrwith(returns.IBM)
returns.corrwith(volume)
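The calls above depend on the Yahoo price files, which may not be at hand. The following sketch reproduces the same API calls on synthetic returns so it can run anywhere; the tickers `AAA`, `BBB`, `CCC` and all values are made up, and `BBB` is deliberately built to track `AAA`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range('2020-01-01', periods=250, freq='B')

# Hypothetical daily returns: BBB follows AAA plus noise, CCC is independent.
base = rng.normal(0, 0.01, size=250)
toy_returns = pd.DataFrame({
    'AAA': base,
    'BBB': 0.8 * base + rng.normal(0, 0.005, size=250),
    'CCC': rng.normal(0, 0.01, size=250),
}, index=dates)

pair_corr = toy_returns['AAA'].corr(toy_returns['BBB'])  # one pair
pair_cov = toy_returns['AAA'].cov(toy_returns['BBB'])
corr_matrix = toy_returns.corr()                         # all pairs
cov_matrix = toy_returns.cov()
vs_aaa = toy_returns.corrwith(toy_returns['AAA'])        # each column vs. AAA

# Correlation is covariance scaled by the two standard deviations.
manual_corr = pair_cov / (toy_returns['AAA'].std() * toy_returns['BBB'].std())

print(corr_matrix)
print(vs_aaa)
```

`corrwith` given a Series compares every column against it, while given another DataFrame (as with `volume` above) it pairs up columns by name.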
Unique Values, Value Counts, and Membership
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques
obj.value_counts()
pd.value_counts(obj.values, sort=False)
obj
mask = obj.isin(['b', 'c'])
mask
obj[mask]
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)
unique_vals
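Since the notebook outputs are not shown here, this sketch reruns the same calls and notes in comments what each one returns. The names `letters`, `distinct`, `counts`, `is_b_or_c` and `positions` are illustrative.

```python
import pandas as pd

letters = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

distinct = letters.unique()           # order of first appearance: c, a, d, b
counts = letters.value_counts()       # a and c occur 3 times, b twice, d once
is_b_or_c = letters.isin(['b', 'c'])  # boolean mask, True where value is b or c
subset = letters[is_b_or_c]           # keeps only the five b/c entries

# get_indexer maps each value in to_match to its position in unique_vals.
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
positions = pd.Index(unique_vals).get_indexer(to_match)  # [0, 2, 1, 1, 0, 2]

print(list(distinct))
print(counts)
print(list(positions))
```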
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data
result = data.apply(pd.value_counts).fillna(0)
result
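Applying `value_counts` column by column produces a histogram table: the row labels are the distinct values found anywhere in the frame, and each cell says how often that value occurs in that column. The sketch below uses `pd.Series.value_counts` in place of the top-level `pd.value_counts`, on the assumption that recent pandas releases warn about the latter as deprecated; the frame name `survey` is illustrative.

```python
import pandas as pd

survey = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                       'Qu2': [2, 3, 1, 2, 3],
                       'Qu3': [1, 5, 2, 4, 4]})

# Each column is counted separately; values missing from a column come
# back as NaN and are then filled with 0.
hist = survey.apply(pd.Series.value_counts).fillna(0)

print(hist)
print(hist.loc[3, 'Qu1'])  # how many times the value 3 appears in Qu1
```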
Conclusion
pd.options.display.max_rows = PREVIOUS_MAX_ROWS