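This post translates the second half of "10 Minutes to pandas". For the first half, see 《数据挖掘比赛（4）ten Minutes to pandas中文版上》 ("Data Mining Competitions (4): 10 Minutes to pandas, Chinese Edition, Part 1").

It continues with the data from the previous post; the code below reuses the df and dates objects defined there.

The second half begins here: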

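Operations

Stats

Operations in general exclude missing data.

Computing a descriptive statistic (here, the mean of each column):

df.mean()

Specifying the axis

The same operation along the other axis (the mean of each row):

df.mean(axis=1)

Automatic alignment

When you operate on objects that have different dimensionality and need alignment, pandas automatically broadcasts along the specified dimension. (Note: in practice this just means the arithmetic is applied along that axis.)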


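# Setup
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
s

(Note: .shift(2) moves the values down by two positions; the vacated positions are filled with NaN.)

df.sub(s, axis='index')

(Note: .sub subtracts s from every column of df, matching rows by index label.)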
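To make the alignment concrete, here is a small sketch with fixed numbers instead of random ones. The dates index and the two-column df below are stand-ins invented for this illustration; they are not the objects from the previous post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the `dates` and `df` of the previous post
dates = pd.date_range("2013-01-01", periods=6)
df = pd.DataFrame(np.arange(12.0).reshape(6, 2), index=dates, columns=["A", "B"])

# shift(2) pushes the values down two rows; the first two rows become NaN
# and the last two original values (6 and 8) fall off the end
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

# Subtract s from every column, matching rows by index label
result = df.sub(s, axis="index")
print(result)
```

Rows where s is NaN come out as NaN in every column, which is why only three rows of the result hold numbers.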

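Apply

Applying functions to the data

Using an existing function

df.apply(np.cumsum)

(Note: np.cumsum returns the cumulative sum, so each column becomes its running total.)

Using an anonymous function

df.apply(lambda x: x.max() - x.min())

(Note: the lambda expression receives each column as a Series and returns that column's range, max minus min. Look up lambda expressions if they are unfamiliar.)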
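A minimal sketch of both calls on a tiny hand-made frame, so the results can be checked by eye. The frame and the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (not the df from the tutorial)
df = pd.DataFrame({"A": [1.0, 2.0, 4.0], "B": [10.0, 0.0, 5.0]})

# An existing function: running total down each column
running = df.apply(np.cumsum)

# An anonymous function: one number per column (max minus min)
spread = df.apply(lambda col: col.max() - col.min())

print(running)
print(spread)
```

The first call returns a DataFrame of the same shape, while the second collapses each column to a single value and returns a Series indexed by column name.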

Histogramming

See more at "Histogramming and Discretization".


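# Setup
s = pd.Series(np.random.randint(0, 7, size=10))
s

s.value_counts()

(Note: value_counts returns how often each distinct value occurs. In the output captured for the original post, 5 occurs 3 times, 2 occurs 2 times, and so on. No chart is drawn, but these counts are exactly what a histogram displays.)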
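Because the series above is random, the counts differ on every run. The sketch below uses fixed values, chosen only for illustration, so the counts are predictable.

```python
import pandas as pd

# Fixed values instead of random ones
s = pd.Series([4, 2, 0, 2, 4, 6, 4])

# The index holds the distinct values, the data holds their counts,
# sorted from most to least frequent
counts = s.value_counts()
print(counts)
```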

String Methods

A Series has a set of string-processing methods under its str attribute that make it easy to operate on every element, as in the code below. Note that pattern matching in str generally uses regular expressions by default. See more at "Vectorized String Methods".


# Setup
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s

# .lower() converts each string to lowercase; NaN stays NaN
s.str.lower()
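As a sketch of the regular-expression default mentioned above, the example below pairs str.lower() with str.contains(), which treats its pattern as a regex unless told otherwise. The shortened series and the variable names are chosen for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "Aaba", np.nan, "CABA", "dog"])

# Element-wise lowercase; the missing value is passed through
lowered = s.str.lower()

# "^a" is a regular expression: strings that start with "a".
# case=False ignores case, na=False reports missing values as False.
starts_with_a = s.str.contains(r"^a", case=False, na=False)

print(lowered)
print(starts_with_a)
```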

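Merge

For join / merge-type operations, pandas provides various facilities for easily combining Series, DataFrame, and Panel objects (Panel has since been removed from pandas) using set logic on the indexes and relational-algebra operations. See more at the "Merging section".
(Note: in the original this paragraph opens the Concat section; I moved it here so that the structure reads more smoothly.)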

Concat

Concatenating pandas objects together with concat():

# Setup
df = pd.DataFrame(np.random.randn(10, 4))
df

# Setup 2: break the df just created into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces

# Join the pieces back together with concat()
pd.concat(pieces)
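A small check, using fixed numbers rather than random ones, that concatenating the pieces restores the frame they were cut from. The variable name rebuilt is made up for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(40).reshape(10, 4))

# Row slices of 3, 4 and 3 rows
pieces = [df[:3], df[3:7], df[7:]]

# Stacks the pieces vertically, keeping their original row labels
rebuilt = pd.concat(pieces)
print(rebuilt.equals(df))
```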

Join

SQL-style joins. See more at "Database style joining".

Example 1

# Setup 1
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left

# Setup 2
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
right

# Join with merge(). Both frames repeat the key 'foo', so every left row
# pairs with every right row and the result has 4 rows.
pd.merge(left, right, on='key')

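Example 2

Another example, this time with unique keys:

# Setup 1
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
left

# Setup 2
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
right

# Each key matches exactly one row on the other side, so the result has 2 rows
pd.merge(left, right, on='key')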
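The two examples differ only in whether the key is repeated. The sketch below runs both side by side and compares row counts; the names dup and uniq are invented for this illustration.

```python
import pandas as pd

# Repeated keys: each of the 2 left rows matches both right rows
left_dup = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right_dup = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
dup = pd.merge(left_dup, right_dup, on="key")

# Unique keys: each left row matches exactly one right row
left_uniq = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right_uniq = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
uniq = pd.merge(left_uniq, right_uniq, on="key")

print(len(dup), len(uniq))
```

This is the same behavior as an SQL inner join: repeated keys on both sides multiply the matching rows.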

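Append

Appending rows to a DataFrame. See more at "Appending".

# Setup 1
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df

# Setup 2
s = df.iloc[3]
s

# Append the row. On older pandas this was df.append(s, ignore_index=True);
# DataFrame.append was removed in pandas 2.0, and concat() is the equivalent.
pd.concat([df, s.to_frame().T], ignore_index=True)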

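Grouping

By "group by" we mean a process involving one or more of the following steps:

Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure

See more at the "Grouping section".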

# Setup
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df

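A single column

Group, then apply a function (sum) to each group. The numeric columns are selected explicitly because, from pandas 2.0 on, sum() would otherwise also concatenate the strings in column B.

df.groupby('A')[['C', 'D']].sum()

Multiple columns

Grouping by several columns forms a hierarchical index, and the function is again applied to each group.

df.groupby(['A', 'B']).sum()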
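A sketch of the split, apply, combine steps on a four-row frame with fixed numbers, so the sums can be verified by hand. The data is made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "one"],
    "C": [1, 2, 3, 4],
})

# One key: split on A, sum C within each group, combine into one frame
by_a = df.groupby("A")[["C"]].sum()

# Two keys: the result is indexed by a two-level (A, B) MultiIndex
by_ab = df.groupby(["A", "B"])[["C"]].sum()

print(by_a)
print(by_ab)
```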

Reshaping

See more at "Hierarchical Indexing" and "Reshaping".

Stack

# Setup
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2

# stack() moves the column labels into a new innermost level of the row index
stacked = df2.stack()
stacked

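With a "stacked" DataFrame or Series (one whose index is hierarchical), the inverse of stack() is unstack(), which by default unstacks the last level. Passing a level number unstacks that level instead.

stacked.unstack()

stacked.unstack(1)

stacked.unstack(0)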
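The sketch below uses integers instead of random numbers to show that unstack() undoes stack(), and which index level ends up as the columns for each argument. The frame is a reduced version of the one above, built for this illustration.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["A", "B"])

# 4 rows x 2 columns become one Series of 8 values with a 3-level index
stacked = df.stack()

# Default: the last level (the old column labels) goes back to the columns
round_trip = stacked.unstack()

# Level 0 ("first") or level 1 ("second") can be moved to the columns instead
by_first = stacked.unstack(0)
by_second = stacked.unstack(1)

print(round_trip)
print(by_first)
print(by_second)
```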

Pivot Tables

See more at "Pivot Tables".

# Setup
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
df

We can produce a pivot table from this data very easily:

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
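A reduced sketch with fixed values, showing two details that are easy to miss in the random example: pivot_table aggregates duplicates with the mean by default, and combinations that never occur appear as NaN. The data is invented for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two", "two"],
    "C": ["foo", "bar", "foo", "foo"],
    "D": [1.0, 2.0, 3.0, 5.0],
})

# Rows come from A, columns from C, cells hold the mean of D
table = pd.pivot_table(df, values="D", index="A", columns="C")
print(table)
```

Here ("two", "foo") has two source rows, so its cell is their mean, and ("two", "bar") has none, so its cell is NaN.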

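Time Series

pandas has simple, powerful, and efficient functionality for resampling during frequency conversion (for example, converting data sampled every second into 5-minute data). This is extremely common in financial applications, but not limited to them. See more at the "Time Series section".

Hours, minutes, and seconds

# Setup: 100 observations, one per second
# (lowercase 's' is used because recent pandas versions deprecate the uppercase 'S' alias)
rng = pd.date_range('1/1/2012', periods=100, freq='s')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts

(Note: the output shown in the original post was truncated to part of the data.)

# Convert: sum the values within each 5-minute bin
ts.resample('5min').sum()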
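With only 100 seconds of data, everything above falls into a single 5-minute bin. The sketch below uses 600 seconds of constant values, chosen for this illustration, so that two bins appear and the sums are obvious.

```python
import pandas as pd

# 600 one-second observations, each with value 1
rng = pd.date_range("2012-01-01", periods=600, freq="s")
ts = pd.Series(1, index=rng)

# Each 5-minute bin covers 300 seconds
five_min = ts.resample("5min").sum()
print(five_min)
```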

Time zones

Time zone representation

# Setup
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts

# Attach the UTC time zone to the naive timestamps
ts_utc = ts.tz_localize('UTC')
ts_utc

Converting to another time zone

ts_utc.tz_convert('US/Eastern')
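A short sketch of the difference between the two calls: tz_localize labels the existing wall-clock times with a zone, while tz_convert keeps the instants and changes how they are displayed. The values are fixed for this illustration.

```python
import pandas as pd

rng = pd.date_range("2012-03-06 00:00", periods=3, freq="D")
ts = pd.Series([1.0, 2.0, 3.0], index=rng)

# Midnight stays midnight, now interpreted as UTC
ts_utc = ts.tz_localize("UTC")

# Same instants shown in US Eastern time, which is 5 hours behind UTC
# on these dates (daylight saving time had not yet started)
ts_eastern = ts_utc.tz_convert("US/Eastern")

print(ts_utc.index[0], ts_eastern.index[0])
```

The data values are untouched; only the index changes.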

Converting to periods

# 'M' is the month-end alias; pandas 2.2 and later spell it 'ME' in date_range
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

ps = ts.to_period()
ps

Converting back to timestamps

ps.to_timestamp()

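Arithmetic on periods

Converting between period and timestamp makes some convenient arithmetic functions available. In the following example, we convert quarterly data whose fiscal year ends in November to 9am on the first day of the month following the quarter end.

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts

# Last month of each quarter, plus one month, then the first hour of that month, plus 9 hours
# (lowercase 'h' is used because recent pandas versions deprecate the uppercase 'H' alias)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('h', 's') + 9
ts.head()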
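The chained expression is easier to follow one step at a time. The sketch below walks a single Period through the same kind of conversion, stopping at daily rather than hourly resolution to keep it short; the intermediate variable names are invented for this illustration.

```python
import pandas as pd

# First quarter of a fiscal year that ends in November 1990,
# so this quarter covers December 1989 through February 1990
quarter = pd.Period("1990Q1", freq="Q-NOV")

# "E" picks the last month of the quarter
last_month = quarter.asfreq("M", "E")

# Adding 1 to a monthly period moves forward one month
next_month = last_month + 1

# "S" picks the first day of that month
first_day = next_month.asfreq("D", "S")

print(quarter, last_month, next_month, first_day)
```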

Categoricals

Since version 0.15, pandas can include categorical data in a DataFrame. See more at the "categorical introduction" and the "API documentation".

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df

Converting

Convert the raw grades to a categorical data type:

df["grade"] = df["raw_grade"].astype("category")
df["grade"]

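Renaming

Rename the categories to more meaningful names. (Older pandas versions allowed assigning to Series.cat.categories in place; that setter was removed in pandas 2.0, so rename_categories is used here.)

df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
df

Tidying the categories

Reorder the categories and, at the same time, add the categories that are missing. (Methods under Series.cat return a new Series by default.)

df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]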

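Sorting

Sorting a categorical column follows the order of its categories, not lexical order.

df.sort_values(by="grade")

Grouping

Grouping by a categorical column also shows the empty categories. Passing observed=False states this explicitly, since the default differs between pandas versions.

df.groupby("grade", observed=False).size()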
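The categorical steps above can be run end to end as one sketch. It assumes a pandas version that provides rename_categories and the observed argument of groupby; the names ordered_ids and sizes are invented for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "e"]})

# Convert: the categories are inferred as a, b, e
df["grade"] = df["raw_grade"].astype("category")

# Rename: a -> very good, b -> good, e -> very bad
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

# Reorder and add the unused categories "bad" and "medium"
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)

# Sorting follows category order, so "very bad" comes first
ordered_ids = df.sort_values(by="grade")["id"].tolist()

# Empty categories are kept, with a count of 0
sizes = df.groupby("grade", observed=False).size()

print(ordered_ids)
print(sizes)
```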

Plotting

See more at "Plotting".

Basic plotting

# Data
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts

# Plot
ts.plot()

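Plotting a DataFrame

On a DataFrame, plot() is a convenient way to plot all of the columns with labels.

import matplotlib.pyplot as plt  # provides plt used below

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')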