Pandas Quick Start¶
References:
https://pandas.pydata.org/pandas-docs/stable/10min.html
https://pyzh.readthedocs.io/en/latest/python-pandas.html
1. What is pandas?¶
pandas: a Python data analysis module
pandas was created for data analysis tasks; it incorporates a large number of libraries and standard data models, and provides the tools needed to operate efficiently on large datasets.
Data structures in pandas:
Series: a one-dimensional array, similar to Python's built-in list, except that a Series holds values of a single dtype, which uses memory more efficiently and speeds up computation. Think of it as a column in a database table.
DataFrame: a two-dimensional tabular data structure, similar in many ways to R's data.frame. A DataFrame can be thought of as a container of Series.
Panel: a three-dimensional array; can be thought of as a container of DataFrames. (Panel has since been deprecated and removed from pandas.)
2. Ten minutes to pandas¶
Import the required packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Note:
numpy is a scientific computing package for Python
matplotlib is a 2D plotting library for Python
3. Creating objects¶
See the Intro to Data Structures section for details.
1. Create a Series by passing a list of values; pandas creates a default integer index:
>>> s = pd.Series([1,3,5,np.nan,6,8])
>>> s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
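The list contains integers, but the single np.nan forces the whole Series up to float64; without it, pandas would infer int64. A quick check (not part of the original text):

```python
import numpy as np
import pandas as pd

s_int = pd.Series([1, 3, 5, 6, 8])            # all integers -> int64
s_float = pd.Series([1, 3, 5, np.nan, 6, 8])  # NaN forces an upcast to float64
print(s_int.dtype, s_float.dtype)
```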
2. Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
>>> dates = pd.date_range('20130101',periods=6)
>>> dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
>>> df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
>>> df
A B C D
2013-01-01 0.859619 -0.545903 0.012447 1.257684
2013-01-02 0.119622 -0.484051 0.404728 0.360880
2013-01-03 -0.719234 -0.396174 0.635237 0.216691
2013-01-04 -0.921692 0.876693 -0.670553 1.468060
2013-01-05 -0.300317 -0.011320 -1.376442 1.694740
2013-01-06 -1.903683 0.786785 -0.194179 0.177973
3. Create a DataFrame by passing a dict of objects that can be converted to something series-like:
>>> df2 = pd.DataFrame({'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
>>> df2
A B C D E F
0 1 2013-01-02 1 3 test foo
1 1 2013-01-02 1 3 train foo
2 1 2013-01-02 1 3 test foo
3 1 2013-01-02 1 3 train foo
4. View the dtypes of each column:
>>> df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
5. If you are using IPython, tab completion for column names (as well as public attributes) is automatically enabled, as shown below:
>>> df2.<TAB>
df2.A df2.boxplot
df2.abs df2.C
df2.add df2.clip
df2.add_prefix df2.clip_lower
df2.add_suffix df2.clip_upper
df2.align df2.columns
df2.all df2.combine
df2.any df2.combineAdd
df2.append df2.combine_first
df2.apply df2.combineMult
df2.applymap df2.compound
df2.as_blocks df2.consolidate
df2.asfreq df2.convert_objects
df2.as_matrix df2.copy
df2.astype df2.corr
df2.at df2.corrwith
df2.at_time df2.count
df2.axes df2.cov
df2.B df2.cummax
df2.between_time df2.cummin
df2.bfill df2.cumprod
df2.blocks df2.cumsum
df2.bool df2.D
Notice that columns A, B, C, and D all show up in the completion list.
4. Viewing data¶
1. View the top and bottom rows of the DataFrame:
>>> df.head()
>>> df.tail(3)
2. Display the index, columns, and underlying NumPy data:
>>> df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
>>> df.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')
>>> df.values
array([[ 0.85961861, -0.54590304, 0.01244705, 1.25768432],
[ 0.11962178, -0.4840508 , 0.40472795, 0.36088029],
[-0.7192337 , -0.39617432, 0.63523701, 0.21669124],
[-0.92169244, 0.87669275, -0.67055318, 1.46806034],
[-0.30031679, -0.01132035, -1.37644224, 1.69474031],
[-1.90368258, 0.78678454, -0.19417942, 0.17797326]])
3. Show a quick statistical summary of the data:
>>> df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.477614 0.037671 -0.198127 0.862672
std 0.945047 0.643196 0.736736 0.685969
min -1.903683 -0.545903 -1.376442 0.177973
25% -0.871078 -0.462082 -0.551460 0.252739
50% -0.509775 -0.203747 -0.090866 0.809282
75% 0.014637 0.587258 0.306658 1.415466
max 0.859619 0.876693 0.635237 1.694740
4. Transpose the data (swap rows and columns):
>>> df.T
5. Sort by an axis:
>>> df.sort_index(axis=1, ascending=False) # sort by column labels
6. Sort by values:
>>> df.sort_values(by='B') # sort by the values in column B
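As a minimal self-contained sketch of the two kinds of sorting (the column names and values here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'B': [3, 1, 2], 'A': [9, 8, 7]})

by_label = df.sort_index(axis=1, ascending=False)  # reorder columns by label, descending
by_value = df.sort_values(by='B')                  # reorder rows by column B's values
print(list(by_label.columns), list(by_value.index))
```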
5. Selecting data¶
Standard Python/NumPy expressions for selecting data are sometimes not intuitive enough; prefer the optimized pandas access methods such as at, iat, loc and iloc (the older ix accessor has since been deprecated).
See Indexing and Selecting Data and MultiIndex / Advanced Indexing.
Selection by label¶
Using loc and at:
df.loc[dates[0]]
df.loc[:,['A','B']]
df.loc['20130102':'20130104',['A','B']]
df.loc['20130102':'20130104','A':'B']
df.loc[dates[0],'A']
df.at[dates[0],'A']
Selection by position¶
Using iloc and iat:
df.iloc[3]
df.iloc[3:5,0:2]
df.iloc[[1,2,4],[0,2]]
df.iloc[1:3,:]
df.iloc[:,1:3]
df.iloc[1,1]
df.iat[1,1]
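To make the label/position distinction concrete, here is a small self-contained sketch (the data is invented for illustration):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

by_label = df.loc['2013-01-02', 'B']  # loc/at address rows and columns by label
by_position = df.iloc[1, 1]           # iloc/iat address them by integer position
print(by_label, by_position)
```

Both expressions point at the same cell, so they return the same value.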
Boolean indexing¶
df[df.A > 0] # select rows where column A's values are positive
df[df > 0] # where: keep values meeting the condition
# selecting with isin()
df2['E'] = ['one', 'one','two','three','four','three']
df2[df2['E'].isin(['two','four'])]
A boolean index can be a single expression or a combination of expressions, enabling complex selections and lookups.
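For example, conditions can be combined with & (and) and | (or); each condition must be parenthesized. A minimal sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3, -4], 'B': [10, 20, -30, 40]})

# Keep only rows where both A and B are positive
both_positive = df[(df['A'] > 0) & (df['B'] > 0)]
print(both_positive)
```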
Setting¶
Add a new column:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
df['F'] = s1
Set a value by label: df.at[dates[0],'A'] = 0
Set a value by position: df.iat[0,1] = 0
Set a whole column with a NumPy array: df.loc[:,'D'] = np.array([5] * len(df))
Set values with a where-style boolean mask: df2[df2 > 0] = -df2
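The setting operations above, combined into one runnable sketch (the DataFrame here is invented for illustration):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.zeros((6, 4)), index=dates, columns=list('ABCD'))

df.at[dates[0], 'A'] = 7                  # set one value by label
df.iat[0, 1] = 8                          # set one value by position
df.loc[:, 'D'] = np.array([5] * len(df))  # replace a whole column
print(df.iloc[0].tolist())
```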
6. Missing data¶
pandas uses np.nan to represent missing data. See the Missing Data section.
1. reindex() lets you change/add/delete the index on an axis, returning a copy of the data:
>>> df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
>>> df1.loc[dates[0]:dates[1],'E'] = 1
>>> df1
A B C D F E
2013-01-01 0.000000 0.000000 0.012447 5 NaN 1
2013-01-02 0.119622 -0.484051 0.404728 5 1 1
2013-01-03 -0.719234 -0.396174 0.635237 5 2 NaN
2013-01-04 -0.921692 0.876693 -0.670553 5 3 NaN
2. Drop any rows that have missing data:
>>> df1.dropna(how='any')
A B C D F E
2013-01-02 0.119622 -0.484051 0.404728 5 1 1
3. Fill in missing values:
>>> df1.fillna(value=5)
A B C D F E
2013-01-01 0.000000 0.000000 0.012447 5 5 1
2013-01-02 0.119622 -0.484051 0.404728 5 1 1
2013-01-03 -0.719234 -0.396174 0.635237 5 2 5
2013-01-04 -0.921692 0.876693 -0.670553 5 3 5
4. Get a boolean mask of where values are missing:
>>> pd.isnull(df1)
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
7. Operations¶
Statistics (operations in general exclude missing data)
1. Take the mean of each column:
>>> df.mean()
2. Take the mean of each row:
>>> df.mean(1)
3. Operating on objects of different dimensionality requires alignment first; pandas broadcasts along the specified dimension and automatically aligns on the index:
>>> s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
>>> s
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
>>> df.sub(s, axis='index') # equivalent to df.sub(s, axis=0)
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -1.719234 -1.396174 -0.364763 4 1
2013-01-04 -3.921692 -2.123307 -3.670553 2 0
2013-01-05 -5.300317 -5.011320 -6.376442 0 -1
2013-01-06 NaN NaN NaN NaN NaN
Notes:
Alignment here means aligning on the datetime index.
shift(2) shifts the data two positions forward along the time axis, filling the vacated slots with NaN.
sub performs subtraction; any operation involving NaN also yields NaN.
Apply
1. Apply a function to the data:
>>> df.apply(np.cumsum)
A B C D F
2013-01-01 0.000000 0.000000 0.012447 5 NaN
2013-01-02 0.119622 -0.484051 0.417175 10 1
2013-01-03 -0.599612 -0.880225 1.052412 15 3
2013-01-04 -1.521304 -0.003532 0.381859 20 6
2013-01-05 -1.821621 -0.014853 -0.994583 25 10
2013-01-06 -3.725304 0.771932 -1.188763 30 15
>>> df.apply(lambda x: x.max() - x.min())
A 2.023304
B 1.360744
C 2.011679
D 0.000000
F 4.000000
dtype: float64
Note: cumsum computes the cumulative sum.
Histogramming:
>>> s = pd.Series(np.random.randint(0, 7, size=10))
>>> s
0 1
1 3
2 5
3 1
4 6
5 1
6 3
7 4
8 0
9 3
dtype: int64
>>> s.value_counts()
3 3
1 3
6 1
5 1
4 1
0 1
dtype: int64
pandas comes with a set of string processing methods that make it easy to operate on each element. See Vectorized String Methods.
String methods:
>>> s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
>>> s.str.lower()
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
8. Merging¶
Concat
pandas provides various facilities for easily combining Series, DataFrame, and Panel objects. See the Merging section.
Concatenate pandas objects with concat():
>>> df = pd.DataFrame(np.random.randn(10, 4))
>>> pieces = [df[:3], df[3:7], df[7:]]
>>> pd.concat(pieces)
0 1 2 3
0 -0.199614 1.914485 0.396383 -0.295306
1 -0.061961 -1.352883 0.266751 -0.874132
2 0.346504 -2.328099 -1.492250 0.095392
3 0.187115 0.562740 -1.677737 -0.224807
4 -1.422599 -1.028044 0.789487 0.806940
5 0.439478 -0.592229 0.736081 1.008404
6 -0.205641 -0.649465 -0.706395 0.578698
7 -2.168725 -2.487189 0.060258 1.965318
8 0.207634 0.512572 0.595373 0.816516
9 0.764893 0.612208 -1.022504 -2.032126
Join and merge
SQL-style merge operations. See Database style joining.
>>> left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
>>> right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
>>> left
key lval
0 foo 1
1 foo 2
>>> right
key rval
0 foo 4
1 foo 5
>>> pd.merge(left, right, on='key')
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
>>> left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
>>> right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
>>> left
key lval
0 foo 1
1 bar 2
>>> right
key rval
0 foo 4
1 bar 5
>>> pd.merge(left, right, on='key')
key lval rval
0 foo 1 4
1 bar 2 5
Append
>>> df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
>>> s = df.iloc[3]
>>> df.append(s, ignore_index=True)
A B C D
0 -1.710447 2.541720 -0.654403 0.132077
1 0.667796 -1.124769 -0.430752 -0.244731
2 1.555865 -0.483805 0.066114 -0.409518
3 1.171798 0.036219 -0.515065 0.860625
4 -0.834051 -2.178128 -0.345627 0.819392
5 -0.354886 0.161204 1.465532 1.879841
6 0.560888 1.208905 1.301983 0.799084
7 -0.770196 0.307691 1.212200 0.909137
8 1.171798 0.036219 -0.515065 0.860625
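Note that DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0; in recent versions the same row-append is written with pd.concat(). A sketch of the equivalent call:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]

# Equivalent of the removed df.append(s, ignore_index=True)
result = pd.concat([df, s.to_frame().T], ignore_index=True)
print(result.shape)
```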
9. Grouping¶
group by involves the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
See the Grouping section.
>>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
>>> df
A B C D
0 foo one -0.655020 -0.671592
1 bar one 0.846428 1.884603
2 foo two -2.280466 0.725070
3 bar three 1.166448 -0.208171
4 foo two -0.257124 -0.850319
5 bar two -0.654609 1.258091
6 foo one -1.624213 -0.383978
7 foo three -0.523944 0.114338
Group, then apply sum() to the resulting groups:
>>> df.groupby('A').sum()
C D
A
bar 1.358267 2.934523
foo -5.340766 -1.066481
Group by multiple columns, forming a hierarchical index, then apply sum():
>>> df.groupby(['A','B']).sum()
C D
A B
bar one 0.846428 1.884603
three 1.166448 -0.208171
two -0.654609 1.258091
foo one -2.279233 -1.055570
three -0.523944 0.114338
two -2.537589 -0.125249
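groupby is not limited to sum(): agg() applies several aggregations at once. A minimal sketch with invented data (not from the original text):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# One row per group, one column per aggregation
summary = df.groupby('A')['C'].agg(['sum', 'mean', 'max'])
print(summary)
```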
10. Reshaping¶
See Hierarchical Indexing and Reshaping.
stack:
>>> tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two',
'one', 'two', 'one', 'two']]))
>>> index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
>>> index
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=[u'first', u'second'])
>>> df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
>>> df
A B
first second
bar one -0.922059 -0.918091
two -0.825565 -0.880527
baz one 0.241927 1.130320
two -0.261823 2.463877
foo one -0.220328 -0.519477
two -1.028038 -0.543191
qux one 0.315674 0.558686
two 0.422296 0.241212
>>> df2 = df[:4]
>>> df2
A B
first second
bar one -0.922059 -0.918091
two -0.825565 -0.880527
baz one 0.241927 1.130320
two -0.261823 2.463877
Note: pd.MultiIndex.from_tuples converts a list of tuples into a multi-level (hierarchical) index.
The stack() method "compresses" a level of the DataFrame's columns into the index:
>>> stacked = df2.stack()
>>> stacked
first second
bar one A -0.922059
B -0.918091
two A -0.825565
B -0.880527
baz one A 0.241927
B 1.130320
two A -0.261823
B 2.463877
dtype: float64
unstack() reverses stack(); by default it unstacks the last level, but any level can be specified:
>>> stacked.unstack()
A B
first second
bar one -0.922059 -0.918091
two -0.825565 -0.880527
baz one 0.241927 1.130320
two -0.261823 2.463877
>>> stacked.unstack(1)
second one two
first
bar A -0.922059 -0.825565
B -0.918091 -0.880527
baz A 0.241927 -0.261823
B 1.130320 2.463877
>>> stacked.unstack(0)
first bar baz
second
one A -0.922059 0.241927
B -0.918091 1.130320
two A -0.825565 -0.261823
B -0.880527 2.463877
Pivot tables. See Pivot Tables for details.
>>> df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
'B' : ['A', 'B', 'C'] * 4,
'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D' : np.random.randn(12),
'E' : np.random.randn(12)})
Note: a pivot table lets you freely recombine a table's rows and columns, similar to a cross-tab report.
>>> pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C bar foo
A B
one A -1.250611 -1.047274
B 1.532134 -0.455948
C 0.125989 -0.500260
three A 0.623716 NaN
B NaN 0.095117
C -0.348707 NaN
two A NaN 0.390363
B -0.743466 NaN
C NaN 0.792279
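pivot_table aggregates with the mean by default; the aggfunc argument selects a different reduction. A minimal sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                   'C': ['foo', 'bar', 'foo', 'bar'],
                   'D': [1.0, 2.0, 3.0, 4.0]})

# Sum D over each (A, C) combination instead of averaging
pt = pd.pivot_table(df, values='D', index=['A'], columns=['C'], aggfunc='sum')
print(pt)
```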
11. Time series¶
pandas has simple, efficient functionality for resampling during frequency conversion (for example, converting secondly data into 5-minute data). This is common in financial applications, but not limited to them. See the Time Series section.
>>> rng = pd.date_range('1/1/2012', periods=100, freq='S')
>>> ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
>>> ts.resample('5Min').sum()
2012-01-01 24390
Freq: 5T, dtype: int64
Note: this aggregates the randomly generated per-second data into 5-minute buckets.
Time zone representation:
>>> rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
>>> ts = pd.Series(np.random.randn(len(rng)), rng)
>>> ts
2012-03-06 0.972202
2012-03-07 -0.839969
2012-03-08 -0.979993
2012-03-09 -0.052460
2012-03-10 -0.487963
Freq: D, dtype: float64
>>> ts_utc = ts.tz_localize('UTC')
>>> ts_utc
2012-03-06 00:00:00+00:00    0.972202
2012-03-07 00:00:00+00:00   -0.839969
2012-03-08 00:00:00+00:00   -0.979993
2012-03-09 00:00:00+00:00   -0.052460
2012-03-10 00:00:00+00:00   -0.487963
Freq: D, dtype: float64
Converting to another time zone:
>>> ts_utc.tz_convert('US/Eastern')
2012-03-05 19:00:00-05:00    0.972202
2012-03-06 19:00:00-05:00   -0.839969
2012-03-07 19:00:00-05:00   -0.979993
2012-03-08 19:00:00-05:00   -0.052460
2012-03-09 19:00:00-05:00   -0.487963
Freq: D, dtype: float64
Converting between time span representations:
>>> rng = pd.date_range('1/1/2012', periods=5, freq='M')
>>> ts = pd.Series(np.random.randn(len(rng)), index=rng)
>>> ts
2012-01-31   -0.681068
2012-02-29   -0.263571
2012-03-31    1.268001
2012-04-30    0.331786
2012-05-31    0.663572
Freq: M, dtype: float64
>>> ps = ts.to_period()
>>> ps
2012-01 -0.681068
2012-02 -0.263571
2012-03 1.268001
2012-04 0.331786
2012-05 0.663572
Freq: M, dtype: float64
>>> ps.to_timestamp()
2012-01-01 -0.681068
2012-02-01 -0.263571
2012-03-01 1.268001
2012-04-01 0.331786
2012-05-01 0.663572
Freq: MS, dtype: float64
Note: to_period() defaults to monthly frequency; to_period and to_timestamp convert in both directions.
Converting between period and timestamp enables convenient arithmetic. In the following example, quarterly periods (with year ending in November) are converted to 9am on the first day of the month following each quarter's end:
>>> prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
>>> prng
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
'1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
'1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
'1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
'1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
'1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
'1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
'2000Q3', '2000Q4'],
dtype='int64', freq='Q-NOV')
>>> ts = pd.Series(np.random.randn(len(prng)), prng)
>>> ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
>>> ts.head()
1990-03-01 09:00   -0.927090
1990-06-01 09:00   -1.045881
1990-09-01 09:00   -0.837705
1990-12-01 09:00   -0.529390
1991-03-01 09:00   -0.423405
Freq: H, dtype: float64
12. Categoricals¶
Since version 0.15, pandas can include categorical data in a DataFrame. See category.
>>> df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
1. Convert the raw grades to a categorical dtype:
>>> df["grade"] = df["raw_grade"].astype("category")
>>> df["grade"]
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
2. Rename the categories to more meaningful names:
>>> df["grade"].cat.categories = ["very good", "good", "very bad"]
3. Reorder the categories and add the missing categories:
>>> df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
>>> df["grade"]
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
4. Sorting is per the category order, not lexical order:
>>> df.sort_values(by="grade")
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
5. Grouping by a categorical column also shows empty categories:
>>> df.groupby("grade").size()
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
13. Plotting¶
The plot method provides simple visualization:
>>> ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
>>> df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                      columns=['A', 'B', 'C', 'D'])
>>> df = df.cumsum()
>>> plt.figure(); df.plot(); plt.legend(loc='best')
<matplotlib.legend.Legend at 0x7ff29c8163d0>
14. Data I/O¶
csv¶
Write to a csv file: df.to_csv('foo.csv')
Read from a csv file:
pd.read_csv('foo.csv')
Unnamed: 0 A B C D
0 2000-01-01 0.266457 -0.399641 -0.219582 1.186860
1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2 2000-01-03 -1.734933 0.530468 2.060811 -0.515536
3 2000-01-04 -1.555121 1.452620 0.239859 -1.156896
4 2000-01-05 0.578117 0.511371 0.103552 -2.428202
5 2000-01-06 0.478344 0.449933 -0.741620 -1.962409
6 2000-01-07 1.235339 -0.091757 -1.543861 -1.084753
.. ... ... ... ... ...
993 2002-09-20 -10.628548 -9.153563 -7.883146 28.313940
994 2002-09-21 -10.390377 -8.727491 -6.399645 30.914107
995 2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
996 2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
997 2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
998 2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
999 2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 5 columns]
To use the first column as the index: pd.read_csv('foo.csv', index_col=0)
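A round-trip sketch: index_col=0 restores the index on read-back, and parse_dates=True restores its datetime type (written to an in-memory buffer here so the example stays self-contained):

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 2), columns=['A', 'B'],
                  index=pd.date_range('2000-01-01', periods=3))

buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)

# Without parse_dates the restored index would be plain strings
restored = pd.read_csv(buf, index_col=0, parse_dates=True)
print(restored.index.dtype)
```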
HDF5¶
Write to an HDF5 store:
>>> df.to_hdf('foo.h5','df')
Read from an HDF5 store:
>>> pd.read_hdf('foo.h5','df')
A B C D
2000-01-01 0.266457 -0.399641 -0.219582 1.186860
2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2000-01-03 -1.734933 0.530468 2.060811 -0.515536
2000-01-04 -1.555121 1.452620 0.239859 -1.156896
2000-01-05 0.578117 0.511371 0.103552 -2.428202
2000-01-06 0.478344 0.449933 -0.741620 -1.962409
2000-01-07 1.235339 -0.091757 -1.543861 -1.084753
... ... ... ... ...
2002-09-20 -10.628548 -9.153563 -7.883146 28.313940
2002-09-21 -10.390377 -8.727491 -6.399645 30.914107
2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 4 columns]
Excel¶
Write to an Excel file:
>>> df.to_excel('foo.xlsx', sheet_name='Sheet1')
Read from an Excel file:
>>> pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
A B C D
2000-01-01 0.266457 -0.399641 -0.219582 1.186860
2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2000-01-03 -1.734933 0.530468 2.060811 -0.515536
2000-01-04 -1.555121 1.452620 0.239859 -1.156896
2000-01-05 0.578117 0.511371 0.103552 -2.428202
2000-01-06 0.478344 0.449933 -0.741620 -1.962409
2000-01-07 1.235339 -0.091757 -1.543861 -1.084753
... ... ... ... ...
2002-09-20 -10.628548 -9.153563 -7.883146 28.313940
2002-09-21 -10.390377 -8.727491 -6.399645 30.914107
2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 4 columns]
Miscellaneous¶
Statistical analysis¶
Some commonly used statistical functions:
In [67]: np.random.seed(1234)
In [68]: d1 = pd.Series(2*np.random.normal(size = 100)+3)
In [69]: d2 = np.random.f(2,4,size = 100)
In [70]: d3 = np.random.randint(1,100,size = 100)
In [71]: d1.count() # count of non-missing elements
Out[71]: 100
In [72]: d1.min() # minimum
Out[72]: -4.1270333212494705
In [73]: d1.max() # maximum
Out[73]: 7.7819210309260658
In [74]: d1.idxmin() # index of the minimum, like R's which.min
Out[74]: 81
In [75]: d1.idxmax() # index of the maximum, like R's which.max
Out[75]: 39
In [76]: d1.quantile(0.1) # 10% quantile
Out[76]: 0.68701846440699277
In [77]: d1.sum() # sum
Out[77]: 307.0224566250874
In [78]: d1.mean() # mean
Out[78]: 3.070224566250874
In [79]: d1.median() # median
Out[79]: 3.204555266776845
In [80]: d1.mode() # mode
Out[80]: Series([], dtype: float64)
In [81]: d1.var() # variance
Out[81]: 4.005609378535085
In [82]: d1.std() # standard deviation
Out[82]: 2.0014018533355777
In [83]: d1.mad() # mean absolute deviation
Out[83]: 1.5112880411556109
In [84]: d1.skew() # skewness
Out[84]: -0.64947807604842933
In [85]: d1.kurt() # kurtosis
Out[85]: 1.2201094052398012
In [86]: d1.describe() # many descriptive statistics at once
Out[86]:
count 100.000000
mean 3.070225
std 2.001402
min -4.127033
25% 2.040101
50% 3.204555
75% 4.434788
max 7.781921
dtype: float64
Note that describe only works on a Series or DataFrame; a plain one-dimensional NumPy array (numpy.ndarray) has no such method.
Define a custom function that gathers all of these descriptive statistics together:
In [87]: def stats(x):
...: return pd.Series([x.count(),x.min(),x.idxmin(),
...: x.quantile(.25),x.median(),
...: x.quantile(.75),x.mean(),
...: x.max(),x.idxmax(),
...: x.mad(),x.var(),
...: x.std(),x.skew(),x.kurt()],
...: index = ['Count','Min','Which_Min',
...: 'Q1','Median','Q3','Mean',
...: 'Max','Which_Max','Mad',
...: 'Var','Std','Skew','Kurt'])
In [88]: stats(d1)
Out[88]:
Count 100.000000
Min -4.127033
Which_Min 81.000000
Q1 2.040101
Median 3.204555
Q3 4.434788
Mean 3.070225
Max 7.781921
Which_Max 39.000000
Mad 1.511288
Var 4.005609
Std 2.001402
Skew -0.649478
Kurt 1.220109
dtype: float64
Iterating over a DataFrame¶
Iterating by row and by column
DataFrame iterator methods: iterrows() yields (index, row Series) tuples, iteritems() yields (column name, column Series) tuples, and itertuples() yields one namedtuple per row.
>>> df = pd.DataFrame(np.random.randn(10, 4),index = range(10), columns = list('abcd'))
a b c d
0 -1.445993 0.913242 0.793857 -0.407082
1 -0.857645 0.468194 -2.723609 -0.256915
2 0.495262 -1.148540 0.779249 -0.970309
3 -1.205648 0.121347 0.550200 -1.099447
4 -0.284368 0.112743 -0.730803 2.301130
5 -0.540422 1.343915 -0.505066 -1.826931
6 -1.201243 0.651857 -0.488602 -0.461413
7 0.454218 0.699664 -0.254396 0.591888
8 -0.155522 -0.507313 -0.727528 1.206387
9 -0.052086 -0.687135 -3.011200 0.550724
Iterating over a DataFrame directly yields the column names:
for i in df:
print(i)
a
b
c
d
for i,row in df.iterrows():
print(i)
print(row)
0
a 0.788948
b -0.143154
c -0.156242
d 1.802068
Name: 0, dtype: float64
.....
9
a 0.319311
b 0.567725
c -1.296204
d -0.332152
Name: 9, dtype: float64
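itertuples() is mentioned above but not demonstrated; it yields one namedtuple per row and is usually much faster than iterrows(). A minimal sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Each element is a namedtuple with fields Index, a, b
rows = list(df.itertuples())
print(rows[0].Index, rows[0].a, rows[0].b)
```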