
Posted 败家先森



pandas: powerful Python data analysis toolkit

                             Wes McKinney & PyData Development Team, Release 0.18.0, March 17, 2016
Customarily, we import as follows:

import numpy as np  
import pandas as pd
import matplotlib.pyplot as plt

1. Object Creation

1.1 Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

>>> s = pd.Series(data,index = index)

Here, data can be many different things:

  • a Python dict
  • an ndarray
  • a scalar value

The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:

From ndarray
If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, 1, 2, ..., len(data) - 1].

>>> s = pd.Series(np.random.randn(5),index = ['a','b','c','d','e'])
>>> s
a   -0.159223
b    2.317106
c   -0.341460
d   -1.499552
e    0.400351
dtype: float64

>>> s.index
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')

>>> pd.Series(np.random.randn(5))
0    0.785536
1   -1.014011
2   -0.120812
3    0.289870
4    0.705393
dtype: float64
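
A minimal sketch of the two ndarray cases above, using small fixed values instead of random ones so the result is easy to check. The variable names (values, labeled, default) are only illustrative; the last part assumes that pandas raises a ValueError when the index length does not match the data.

```python
import numpy as np
import pandas as pd

values = np.array([10.0, 20.0, 30.0])

# Explicit labels: one label per element.
labeled = pd.Series(values, index=['x', 'y', 'z'])

# No index passed: pandas builds 0, 1, ..., len(values) - 1.
default = pd.Series(values)
default_labels = list(default.index)

# An index of the wrong length is rejected.
try:
    pd.Series(values, index=['x', 'y'])
    length_error = None
except ValueError as exc:
    length_error = exc

print(default_labels)
print(type(length_error).__name__)
```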

From dict
If data is a dict and an index is passed, the values in data corresponding to the labels in the index will be pulled out. Otherwise, an index will be constructed from the sorted keys of the dict, if possible.

>>> d = {'a': 0., 'b': 1., 'c': 2.}
>>> pd.Series(d)
a    0
b    1
c    2
dtype: float64

>>> pd.Series(d,index = list('bcda'))
b     1
c     2
d   NaN
a     0
dtype: float64

**Note:** NaN (not a number) is the standard missing data marker used in pandas.
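
As a small sketch of the dict case, the snippet below passes an index containing a label ('d') that is not a key of the dict and then lists which labels ended up missing. The names picked and missing_labels are made up for this illustration.

```python
import pandas as pd

d = {'a': 0., 'b': 1., 'c': 2.}

# The index decides both the order and which keys are pulled out.
picked = pd.Series(d, index=list('bcda'))

# 'd' is not a key of the dict, so its value is NaN.
missing_labels = picked[picked.isnull()].index.tolist()

print(picked)
print(missing_labels)
```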

From scalar value
If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

>>> pd.Series(5.,index = ['a','b','c','d','e'])
a    5
b    5
c    5
d    5
e    5
dtype: float64

1.1.1 Series is ndarray-like

Series acts very similarly to an ndarray, and is a valid argument to most numpy functions. However, things like slicing also slice the index.

>>> s[0]
>>> s[:3]
a   -0.159223
b    2.317106
c   -0.341460
dtype: float64

>>> s[s > s.median()]
e    0.400351
b    2.317106
dtype: float64

>>> s[[4,3,2]]
b    2.317106
e    0.400351
a   -0.159223
dtype: float64

>>> np.exp(s)
d     0.223230
c     0.710732
a     0.852806
e     1.492348
b    10.146272
dtype: float64
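
The outputs above depend on random data, so here is a hedged sketch of the same ndarray-like operations on fixed values. It uses .iloc for positional slicing, which is an assumption about what works best on recent pandas versions rather than what the session above used.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0], index=list('abcde'))

# Positional slice: the index labels are sliced along with the values.
first_three = s.iloc[:3]

# Boolean mask: keep only values above the median (4.0 here).
above_median = s[s > s.median()]

# numpy functions accept a Series and the labels survive.
exp = np.exp(s)

print(first_three)
print(above_median)
print(exp.index.tolist())
```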

1.1.2 Series is dict-like

>>> s['a']

>>> s['e'] = 12
>>> s
d    -1.499552
c    -0.341460
a    -0.159223
e    12.000000
b     2.317106
dtype: float64

>>> 'e' in s
>>> 'f' in s 

>>> s['f']
KeyError: 'f'

Using the get method, a missing label will return None or the specified default:

>>> s.get('f')
>>> s.get('f',np.nan)
>>> s.get('e',np.nan)
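
The three get calls above print nothing useful in this transcript, so the following sketch spells out what each form returns on a tiny Series. The values and names are invented for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0], index=['a', 'b'])

has_a = 'a' in s                 # membership tests the labels, not the values
has_f = 'f' in s

missing = s.get('f')             # no default given: None
fallback = s.get('f', np.nan)    # explicit default is returned instead
present = s.get('a', np.nan)     # label exists: default is ignored

print(has_a, has_f, missing, fallback, present)
```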

1.1.3 Vectorized operations and label alignment with Series

When doing data analysis, as with raw numpy arrays, looping through a Series value by value is usually not necessary. Series can also be passed into most numpy methods expecting an ndarray.

>>> s+s
d    -2.999105
c    -0.682919
a    -0.318446
e    24.000000
b     4.634213
dtype: float64

>>> s * 2
>>> np.exp(s)

A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

>>> s[1:] + s[:-1]
a    -0.318446
b          NaN
c    -0.682919
d          NaN
e    24.000000
dtype: float64
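
To make the alignment rule concrete, here is a sketch with fixed values. Labels present in only one operand come out as NaN; the second half uses Series.add with fill_value, which is one possible way (not the only one) to treat a missing side as zero.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=list('abcd'))

left = s.iloc[1:]     # labels b, c, d
right = s.iloc[:-1]   # labels a, b, c

# Values are matched by label, not by position.
total = left + right

# Treat a label missing from one side as 0 instead of producing NaN.
total_filled = left.add(right, fill_value=0)

print(total)
print(total_filled)
```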

Creating a Series by passing a list of values, letting pandas create a default integer index:

>>> s = pd.Series([1,3,5,np.nan,6,8])
>>> s
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64

1.2 DataFrame

Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns:

>>> dates = pd.date_range('20130101',periods=6)

>>> dates
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None

>>> df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

>>> df

               A         B         C         D
2013-01-01  0.716330 -1.782610  0.809990  0.319876
2013-01-02 -0.171806 -0.526268  0.206743 -1.246213
2013-01-03 -1.774970  1.890517 -0.773496 -0.930083
2013-01-04  0.537348 -0.870212 -1.227291  1.322823
2013-01-05 -0.897589  1.275171  1.064439 -2.021186
2013-01-06  0.130427 -1.067145 -1.273118  1.786337

[6 rows x 4 columns]    
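
Because np.random.randn draws new numbers on every run, the values you see will differ from the table above. One possible way to get a repeatable frame is to draw from a seeded generator; the seed value 0 and the helper name make_frame are arbitrary choices for this sketch.

```python
import numpy as np
import pandas as pd

def make_frame(seed=0):
    """Build the 6x4 example frame from a seeded random generator."""
    rng = np.random.RandomState(seed)
    dates = pd.date_range('20130101', periods=6)
    return pd.DataFrame(rng.randn(6, 4), index=dates, columns=list('ABCD'))

df = make_frame()
df_again = make_frame()

print(df.shape)
print(df.equals(df_again))
```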

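Creating a DataFrame by passing a dict of objects that can be converted to series-like:

>>> df2 = pd.DataFrame({'A': 1.,
...                     'B': pd.Timestamp('20130102'),
...                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...                     'D': np.array([3] * 4, dtype='int32'),
...                     'E': ['test', 'train', 'test', 'train'],
...                     'F': 'foo'})
>>> df2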
   A          B  C  D      E    F
0  1 2013-01-02  1  3   test  foo
1  1 2013-01-02  1  3  train  foo
2  1 2013-01-02  1  3   test  foo
3  1 2013-01-02  1  3  train  foo

[4 rows x 6 columns]

Having specific dtypes

>>> df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E            object
F            object
dtype: object

If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled.

>>> df2.<TAB>
Display all 210 possibilities? (y or n)
df2.A                  df2.from_csv           df2.rank
df2.B                  df2.from_dict          df2.rdiv
df2.C                  df2.from_items         df2.reindex
df2.D                  df2.from_records       df2.reindex_axis
df2.E                  df2.ftypes             df2.reindex_like
df2.F                  df2.ge                 df2.rename
df2.T                  df2.get                df2.rename_axis
df2.abs                df2.get_dtype_counts   df2.reorder_levels
df2.add                df2.get_ftype_counts   df2.replace
df2.add_prefix         df2.get_value          df2.resample
df2.add_suffix         df2.get_values         df2.reset_index
df2.align              df2.groupby            df2.rfloordiv
df2.all                df2.gt                 df2.rmod
df2.any                df2.head               df2.rmul
df2.append             df2.hist               df2.rpow
df2.apply              df2.iat                df2.rsub
df2.applymap           df2.icol               df2.rtruediv
df2.as_blocks          df2.idxmax             df2.save
df2.as_matrix          df2.idxmin             df2.select
df2.asfreq             df2.iget_value         df2.set_index
df2.astype             df2.iloc               df2.set_value
df2.at                 df2.index              df2.shape
df2.at_time            df2.info               df2.shift
df2.axes               df2.insert             df2.skew
df2.between_time       df2.interpolate        df2.sort
df2.bfill              df2.irow               df2.sort_index
df2.blocks             df2.is_copy            df2.sortlevel
df2.bool               df2.isin               df2.squeeze
df2.boxplot            df2.isnull             df2.stack
df2.clip               df2.iteritems          df2.std
df2.clip_lower         df2.iterkv             df2.sub
df2.clip_upper         df2.iterrows           df2.subtract
df2.columns            df2.itertuples         df2.sum
df2.combine            df2.ix                 df2.swapaxes
df2.combineAdd         df2.join               df2.swaplevel
df2.combineMult        df2.keys               df2.tail
df2.combine_first      df2.kurt               df2.take
df2.compound           df2.kurtosis           df2.to_clipboard
df2.consolidate        df2.last               df2.to_csv
df2.convert_objects    df2.last_valid_index   df2.to_dense
df2.copy               df2.le                 df2.to_dict
df2.corr               df2.load               df2.to_excel
df2.corrwith           df2.loc                df2.to_gbq
df2.count              df2.lookup             df2.to_hdf
df2.cov                df2.lt                 df2.to_html
df2.cummax             df2.mad                df2.to_json
df2.cummin             df2.mask               df2.to_latex
df2.cumprod            df2.max                df2.to_msgpack
df2.cumsum             df2.mean               df2.to_panel
df2.delevel            df2.median             df2.to_period
df2.describe           df2.merge              df2.to_pickle
df2.diff               df2.min                df2.to_records
df2.div                df2.mod                df2.to_sparse
df2.divide             df2.mode               df2.to_sql
df2.dot                df2.mul                df2.to_stata
df2.drop               df2.multiply           df2.to_string
df2.drop_duplicates    df2.ndim               df2.to_timestamp
df2.dropna             df2.ne                 df2.to_wide
df2.dtypes             df2.notnull            df2.transpose
df2.duplicated         df2.pct_change         df2.truediv
df2.empty              df2.pivot              df2.truncate
df2.eq                 df2.pivot_table        df2.tshift
df2.equals             df2.plot               df2.tz_convert
df2.eval               df2.pop                df2.tz_localize
df2.ffill              df2.pow                df2.unstack
df2.fillna             df2.prod               df2.update
df2.filter             df2.product            df2.values
df2.first              df2.quantile           df2.var
df2.first_valid_index  df2.query              df2.where
df2.floordiv           df2.radd               df2.xs

>>> df2.

2. Viewing Data

See the top and bottom rows of the frame

>>> df.head()
>>> df.tail()
>>> df.head(10)
>>> df.tail(10)

Display the index, columns, and the underlying numpy data

>>> df.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None

>>> df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')

>>> df.values
array([[ 0.71632974, -1.78261015,  0.80999048,  0.31987599],
   [-0.17180562, -0.52626809,  0.20674317, -1.24621339],
   [-1.77496978,  1.89051681, -0.77349583, -0.93008323],
   [ 0.53734751, -0.87021202, -1.22729091,  1.32282329],
   [-0.89758898,  1.27517093,  1.06443943, -2.02118609],
   [ 0.13042695, -1.06714528, -1.27311829,  1.78633711]])

describe() shows a quick statistical summary of your data

>>> df.describe()

          A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.243377 -0.180091 -0.198789 -0.128074
std    0.943312  1.439183  1.031516  1.513146
min   -1.774970 -1.782610 -1.273118 -2.021186
25%   -0.716143 -1.017912 -1.113842 -1.167181
50%   -0.020689 -0.698240 -0.283376 -0.305104
75%    0.435617  0.824811  0.659179  1.072086
max    0.716330  1.890517  1.064439  1.786337

[8 rows x 4 columns]
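
As a rough check of what describe() reports, the sketch below compares a few of its rows with the corresponding direct calls on a small hand-made frame. The frame and names are invented for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [10.0, 20.0, 30.0, 40.0]})

summary = df.describe()

# Each row of the summary is one statistic, each column one input column.
row_labels = summary.index.tolist()
mean_matches = np.isclose(summary.loc['mean', 'A'], df['A'].mean())
std_matches = np.isclose(summary.loc['std', 'A'], df['A'].std())

print(summary)
print(row_labels)
```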

Transposing your data

>>> df.T
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.716330   -0.171806   -1.774970    0.537348   -0.897589    0.130427
B   -1.782610   -0.526268    1.890517   -0.870212    1.275171   -1.067145
C    0.809990    0.206743   -0.773496   -1.227291    1.064439   -1.273118
D    0.319876   -1.246213   -0.930083    1.322823   -2.021186    1.786337

[4 rows x 6 columns]

Sorting by an axis (sort_index method)

>>> df.sort_index(axis=1,ascending=False)   
               D         C         B         A
2013-01-01  0.319876  0.809990 -1.782610  0.716330
2013-01-02 -1.246213  0.206743 -0.526268 -0.171806
2013-01-03 -0.930083 -0.773496  1.890517 -1.774970
2013-01-04  1.322823 -1.227291 -0.870212  0.537348
2013-01-05 -2.021186  1.064439  1.275171 -0.897589
2013-01-06  1.786337 -1.273118 -1.067145  0.130427

[6 rows x 4 columns]

>>> df.sort_index(axis=0,ascending=False)

               A         B         C         D
2013-01-06  0.130427 -1.067145 -1.273118  1.786337
2013-01-05 -0.897589  1.275171  1.064439 -2.021186
2013-01-04  0.537348 -0.870212 -1.227291  1.322823
2013-01-03 -1.774970  1.890517 -0.773496 -0.930083
2013-01-02 -0.171806 -0.526268  0.206743 -1.246213
2013-01-01  0.716330 -1.782610  0.809990  0.319876

[6 rows x 4 columns]

>>> df.sort_index(axis=1,ascending=False).sort_index(axis=0,ascending=False)
               D         C         B         A
2013-01-06  1.786337 -1.273118 -1.067145  0.130427
2013-01-05 -2.021186  1.064439  1.275171 -0.897589
2013-01-04  1.322823 -1.227291 -0.870212  0.537348
2013-01-03 -0.930083 -0.773496  1.890517 -1.774970
2013-01-02 -1.246213  0.206743 -0.526268 -0.171806
2013-01-01  0.319876  0.809990 -1.782610  0.716330

[6 rows x 4 columns]

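Sorting by values (sort method; in the ascending list, 0 means descending and 1 means ascending). In pandas 0.17 and later, sort is deprecated in favor of sort_values.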

>>> df.sort(['A','B'],ascending=[0,1])
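
On pandas 0.17 and later the same sort is written with sort_values. A minimal sketch on a small invented frame, sorting A in descending order and breaking ties by B in ascending order:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 2, 1],
                   'B': [4, 3, 1, 2]})

# One ascending flag per sort key: A descending, then B ascending.
by_values = df.sort_values(['A', 'B'], ascending=[False, True])

# sort_index sorts by the labels instead of the values.
by_labels = by_values.sort_index(axis=0)

print(by_values)
print(by_labels.index.tolist())
```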

3. Selection

3.1 Getting

Selecting a single column, which yields a Series, equivalent to df.A

>>> df.A
2013-01-01    0.716330
2013-01-02   -0.171806
2013-01-03   -1.774970
2013-01-04    0.537348
2013-01-05   -0.897589
2013-01-06    0.130427
Freq: D, Name: A, dtype: float64

>>> df['A']
2013-01-01    0.716330
2013-01-02   -0.171806
2013-01-03   -1.774970
2013-01-04    0.537348
2013-01-05   -0.897589
2013-01-06    0.130427
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows.

>>> df[0:3]
               A         B         C         D
2013-01-01  0.716330 -1.782610  0.809990  0.319876
2013-01-02 -0.171806 -0.526268  0.206743 -1.246213
2013-01-03 -1.774970  1.890517 -0.773496 -0.930083

[3 rows x 4 columns]

3.2 Selecting by Label

For getting a cross section using a label

>>> df.loc[dates[0]]    
A    0.716330
B   -1.782610
C    0.809990
D    0.319876
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label

>>> df.loc[:,['A','B']]
                   A         B
2013-01-01  0.716330 -1.782610
2013-01-02 -0.171806 -0.526268
2013-01-03 -1.774970  1.890517
2013-01-04  0.537348 -0.870212
2013-01-05 -0.897589  1.275171
2013-01-06  0.130427 -1.067145

[6 rows x 2 columns]

Showing label slicing, both endpoints are included

>>> df.loc['20130102':'20130104',['A','B']]
                   A         B
2013-01-02 -0.171806 -0.526268
2013-01-03 -1.774970  1.890517
2013-01-04  0.537348 -0.870212

[3 rows x 2 columns]

Reduction in the dimensions of the returned object

>>> df.loc['20130102',['A','B']]
A   -0.171806
B   -0.526268
Name: 2013-01-02 00:00:00, dtype: float64

For getting a scalar value

>>> df.loc[dates[0],'A']

For getting fast access to a scalar (equivalent to the prior method)

>>> df.at[dates[0],'A']

3.3 Selection by Position

Select via the position of the passed integers

>>> df.iloc[3]    # the fourth row
A    0.537348
B   -0.870212
C   -1.227291
D    1.322823
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similar to numpy/python

>>> df.iloc[3:5,0:2]
                   A         B
2013-01-04  0.537348 -0.870212
2013-01-05 -0.897589  1.275171

[2 rows x 2 columns]

By lists of integer position locations, similar to the numpy/python style

>>> df.iloc[[1,2,4],[0,2]]
                   A         C
2013-01-02 -0.171806  0.206743
2013-01-03 -1.774970 -0.773496
2013-01-05 -0.897589  1.064439

[3 rows x 2 columns]

For slicing rows explicitly

>>> df.iloc[1:3,:]
                   A         B         C         D
2013-01-02 -0.171806 -0.526268  0.206743 -1.246213
2013-01-03 -1.774970  1.890517 -0.773496 -0.930083

[2 rows x 4 columns]

For slicing columns explicitly

>>> df.iloc[:,1:3]
                   B         C
2013-01-01 -1.782610  0.809990
2013-01-02 -0.526268  0.206743
2013-01-03  1.890517 -0.773496
2013-01-04 -0.870212 -1.227291
2013-01-05  1.275171  1.064439
2013-01-06 -1.067145 -1.273118

[6 rows x 2 columns]

For getting a value explicitly

>>> df.iloc[1,1]

For getting fast access to a scalar (equivalent to the prior method)

>>> df.iat[1,1]
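
The main difference between the label-based and position-based accessors is how slices end: a loc slice includes both endpoints, while an iloc slice excludes the stop position, like a Python list. A small sketch with a frame filled by np.arange (an arbitrary choice so the cell values are predictable):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label slice: 2013-01-02 through 2013-01-04 inclusive, so 3 rows.
by_label = df.loc['20130102':'20130104', ['A', 'B']]

# Position slice: rows 1 and 2 only, the stop position 3 is excluded.
by_position = df.iloc[1:3, 0:2]

# Scalar access: at/iat return the same cells as loc/iloc.
label_scalar = df.at[dates[0], 'A']
position_scalar = df.iat[1, 1]

print(by_label.shape, by_position.shape)
print(label_scalar, position_scalar)
```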

3.4 Boolean Indexing

Using a single column’s values to select data

>>> df[df.A>0]
>>> df[df['A']>0]
                   A         B         C         D
2013-01-02  0.859761  0.755971  1.371420  0.271600
2013-01-03  0.606392  0.077458  0.251290  2.134013
2013-01-05  0.022155 -0.216343 -1.179598  0.431374
2013-01-06  2.676268  2.295133 -2.132639  0.702915

[4 rows x 4 columns]

A where operation for getting.

>>> df[df>0]
                   A         B         C         D
2013-01-01       NaN       NaN       NaN  0.209321
2013-01-02  0.859761  0.755971  1.371420  0.271600
2013-01-03  0.606392  0.077458  0.251290  2.134013
2013-01-04       NaN       NaN  0.946518       NaN
2013-01-05  0.022155       NaN       NaN  0.431374
2013-01-06  2.676268  2.295133       NaN  0.702915

[6 rows x 4 columns]

Using the isin() method for filtering:

>>> df2 = df.copy() 
>>> df2['E'] = ['one','one','two','three','four','three']
>>> df2
                   A         B         C         D      E
2013-01-01 -0.234954 -1.346601 -1.030691  0.209321    one
2013-01-02  0.859761  0.755971  1.371420  0.271600    one
2013-01-03  0.606392  0.077458  0.251290  2.134013    two
2013-01-04 -0.938926 -0.749240  0.946518 -0.248072  three
2013-01-05  0.022155 -0.216343 -1.179598  0.431374   four
2013-01-06  2.676268  2.295133 -2.132639  0.702915  three

[6 rows x 5 columns]

>>> df2[df2.E.isin(['two','four'])]
                   A         B         C         D     E
2013-01-03  0.606392  0.077458  0.251290  2.134013   two
2013-01-05  0.022155 -0.216343 -1.179598  0.431374  four

[2 rows x 5 columns]
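
The three kinds of boolean selection in this section behave differently: a column mask drops rows, a whole-frame mask keeps the shape and fills NaN, and isin builds a mask from a list of allowed values. A sketch on a tiny invented frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [-1.0, 2.0, 3.0],
                   'B': [4.0, -5.0, 6.0],
                   'E': ['one', 'two', 'four']})

# Mask from one column: rows where the mask is False are dropped.
positive_a = df[df['A'] > 0]

# Mask from the whole (numeric) frame: shape is kept, False cells become NaN.
numeric = df[['A', 'B']]
masked = numeric[numeric > 0]

# isin: True where the value is one of the listed items.
chosen = df[df['E'].isin(['two', 'four'])]

print(positive_a)
print(masked)
print(chosen)
```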

3.5 Setting

Setting a new column automatically aligns the data by the indexes

>>> s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))
>>> s1
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

>>> df['F'] = s1

Setting values by label

>>> df.at[dates[0],'A'] = 0

Setting values by position

>>> df.iat[0,1] = 0

Setting by assigning with a numpy array

>>> df.loc[:,'D'] = np.array([5*len(df)])

The result of the prior setting operations

>>> df
                   A         B         C   D   F
2013-01-01  0.000000  0.000000 -1.030691  30 NaN
2013-01-02  0.859761  0.755971  1.371420  30   1
2013-01-03  0.606392  0.077458  0.251290  30   2
2013-01-04 -0.938926 -0.749240  0.946518  30   3
2013-01-05  0.022155 -0.216343 -1.179598  30   4
2013-01-06  2.676268  2.295133 -2.132639  30   5

[6 rows x 5 columns]

A where operation with setting

>>> df2 = df.copy()
>>> df2[df2>0] = -df2
>>> df2
                   A         B         C   D   F
2013-01-01  0.000000  0.000000 -1.030691 -30 NaN
2013-01-02 -0.859761 -0.755971 -1.371420 -30  -1
2013-01-03 -0.606392 -0.077458 -0.251290 -30  -2
2013-01-04 -0.938926 -0.749240 -0.946518 -30  -3
2013-01-05 -0.022155 -0.216343 -1.179598 -30  -4
2013-01-06 -2.676268 -2.295133 -2.132639 -30  -5

[6 rows x 5 columns]
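
Two of the setting steps are easy to misread, so here is a hedged sketch on a three-row frame: assigning a Series as a new column aligns on the index (labels the frame lacks are dropped, labels the Series lacks become NaN), and the where-style assignment flips only the positive cells. All names and values are invented for the example.

```python
import pandas as pd

dates = pd.date_range('20130101', periods=3)
df = pd.DataFrame({'A': [1.0, -2.0, 3.0]}, index=dates)

# s1 starts one day later than df, so the labels only partly overlap.
s1 = pd.Series([10, 20, 30], index=pd.date_range('20130102', periods=3))
df['F'] = s1

# Negate every positive cell; NaN compares False and is left alone.
neg = df.copy()
neg[neg > 0] = -neg

print(df)
print(neg)
```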

4. Missing Data

pandas primarily uses the value np.nan to represent missing data. By default it is not included in computations.

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

>>> df1 = df.reindex(index = dates[0:4],columns = list(df.columns) + ['E'])
>>> df1.loc[dates[0]:dates[1],'E'] = 1  
>>> df
                   A         B         C   D   F
2013-01-01  0.000000  0.000000 -1.030691  30 NaN
2013-01-02  0.859761  0.755971  1.371420  30   1
2013-01-03  0.606392  0.077458  0.251290  30   2
2013-01-04 -0.938926 -0.749240  0.946518  30   3
2013-01-05  0.022155 -0.216343 -1.179598  30   4
2013-01-06  2.676268  2.295133 -2.132639  30   5

[6 rows x 5 columns]

>>> df1
                   A         B         C   D   F   E
2013-01-01  0.000000  0.000000 -1.030691  30 NaN   1
2013-01-02  0.859761  0.755971  1.371420  30   1   1
2013-01-03  0.606392  0.077458  0.251290  30   2 NaN
2013-01-04 -0.938926 -0.749240  0.946518  30   3 NaN

[4 rows x 6 columns]

To drop any rows that have missing data.

>>> df1.dropna(how='any')
                   A         B        C   D  F  E
2013-01-02  0.859761  0.755971  1.37142  30  1  1

[1 rows x 6 columns]

>>> df1.fillna(value=5)
                   A         B         C   D  F  E
2013-01-01  0.000000  0.000000 -1.030691  30  5  1
2013-01-02  0.859761  0.755971  1.371420  30  1  1
2013-01-03  0.606392  0.077458  0.251290  30  2  5
2013-01-04 -0.938926 -0.749240  0.946518  30  3  5

[4 rows x 6 columns]

To get the boolean mask where values are nan

>>> df1.isnull()    # equivalently: pd.isnull(df1)
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True

[4 rows x 6 columns]
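
The whole missing-data workflow of this section fits in a few lines. The sketch below uses a tiny invented frame; note that reindex, dropna and fillna all return new objects here and leave their input unchanged.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0]}, index=list('xyz'))

# Keep two rows and add a new, initially empty column 'E'.
df1 = df.reindex(index=['x', 'y'], columns=['A', 'E'])
df1.loc['x', 'E'] = 1

dropped = df1.dropna(how='any')   # rows with any NaN are removed
filled = df1.fillna(value=5)      # NaN cells are replaced by 5
mask = df1.isnull()               # True where a value is missing

print(df1)
print(dropped)
print(filled)
print(mask)
```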

5. Operations

5.1 Stats

Operations in general exclude missing data.
Performing a descriptive statistic

>>> df.mean()
A     0.537608
B     0.360497
C    -0.295617
D    30.000000
F     3.000000
dtype: float64

Same operation on the other axis

>>> df.mean(1)
2013-01-01    7.242327
2013-01-02    6.797430
2013-01-03    6.587028
2013-01-04    6.451670
2013-01-05    6.525243
2013-01-06    7.567752
Freq: D, dtype: float64

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.

>>> s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
>>> s
2013-01-01   NaN
2013-01-02   NaN
2013-01-03     1
2013-01-04     3
2013-01-05     5
2013-01-06   NaN
Freq: D, dtype: float64

>>> df.sub(s,axis = 'index')
                   A         B         C   D   F
2013-01-01       NaN       NaN       NaN NaN NaN
2013-01-02       NaN       NaN       NaN NaN NaN
2013-01-03 -0.393608 -0.922542 -0.748710  29   1
2013-01-04 -3.938926 -3.749240 -2.053482  27   0
2013-01-05 -4.977845 -5.216343 -6.179598  25  -1
2013-01-06       NaN       NaN       NaN NaN NaN

[6 rows x 5 columns]
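
A smaller version of the subtraction above, with values chosen so the arithmetic can be followed by hand. The Series is aligned with the row labels (axis='index') and then subtracted from every column; rows where the shifted Series is NaN become NaN in the result.

```python
import pandas as pd

dates = pd.date_range('20130101', periods=4)
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [10.0, 20.0, 30.0, 40.0]}, index=dates)

# shift(2) moves the values down two rows: NaN, NaN, 1, 3.
s = pd.Series([1, 3, 5, 7], index=dates).shift(2)

# The same s value is subtracted from both columns of each row.
diff = df.sub(s, axis='index')

# Statistics skip NaN by default: the mean of s is (1 + 3) / 2.
s_mean = s.mean()

print(diff)
print(s_mean)
```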

5.2 Apply

Applying functions to the data

>>> df.apply(np.cumsum)    # cumulative sum
                   A         B         C    D   F
2013-01-01  0.000000  0.000000 -1.030691   30 NaN
2013-01-02  0.859761  0.755971  0.340729   60   1
2013-01-03  1.466153  0.833429  0.592019   90   3
2013-01-04  0.527227  0.084189  1.538537  120   6
2013-01-05  0.549382 -0.132154  0.358939  150  10
2013-01-06  3.225651  2.162979 -1.773700  180  15

[6 rows x 5 columns]

>>> df.apply(lambda x:x.max() - x.min())
A    3.615194
B    3.044374
C    3.504059
D    0.000000
F    4.000000
dtype: float64
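
By default apply calls the function once per column. A function that returns an array of the same length (like np.cumsum) gives back a DataFrame, while one that returns a single number (like the max-minus-min lambda) gives back a Series with one entry per column. A sketch on invented integers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [10, 30, 20]})

# Column-wise running total: same shape as the input.
running = df.apply(np.cumsum)

# Column-wise range: one number per column.
spread = df.apply(lambda x: x.max() - x.min())

print(running)
print(spread)
```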

5.3 Histogramming

>>> s = pd.Series(np.random.randint(0,7,size=10))
>>> s
0    6
1    3
2    6
3    0
4    0
5    6
6    4
7    3
8    0
9    1
dtype: int64

>>> s.value_counts()
6    3
0    3
3    2
4    1
1    1
dtype: int64
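
To reproduce the counts above without relying on random draws, the sketch below feeds value_counts the same ten integers that the session printed. The result is indexed by the distinct values, so counts are looked up by value with .loc.

```python
import pandas as pd

s = pd.Series([6, 3, 6, 0, 0, 6, 4, 3, 0, 1])

# Index = distinct values, data = how often each one occurs.
counts = s.value_counts()

print(counts)
print(counts.loc[6], counts.loc[3], counts.loc[1])
```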

5.4 String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them).

>>> s = pd.Series(['A','B','C','AaBa','Baca',np.nan,'CABA','dog','cat'])
>>> s.str.lower()
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
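
As a short sketch of the pattern-matching remark, str.contains below treats its argument as a regular expression by default. The case and na arguments are used here to ignore letter case and to turn the missing entry into False; the sample values are a subset of the ones above.

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'AaBa', np.nan, 'dog'])

# Element-wise lowercasing; the missing value stays missing.
lowered = s.str.lower()

# Regex match: strings that start with 'a', ignoring case.
starts_with_a = s.str.contains('^a', case=False, na=False)

print(lowered)
print(starts_with_a.tolist())
```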


