Download this book: https://www.jianshu.com/p/62524f4c240e
Importing Pandas and NumPy
>>> import pandas as pd
>>> import numpy as np
The Pandas DataFrame
The read_csv() function reads data from disk into an in-memory DataFrame object.
All of the data can be downloaded from GitHub.
>>> movies = pd.read_csv("data/movie.csv")
>>> movies
color director_name ... aspect_ratio movie_facebook_likes
0 Color James Cameron ... 1.78 33000
1 Color Gore Verbinski ... 2.35 0
2 Color Sam Mendes ... 2.35 85000
3 Color Christopher Nolan ... 2.35 164000
4 NaN Doug Walker ... NaN 0
... ... ... ... ... ...
4911 Color Scott Smith ... NaN 84
4912 Color NaN ... 16.00 32000
4913 Color Benjamin Roberds ... NaN 16
4914 Color Daniel Hsia ... 2.35 660
4915 Color Jon Gunn ... 1.85 456
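Structure of a DataFrame
In the output above, the index runs along axis 0 and the columns run along axis 1.
Pandas uses NaN (not a number) to represent missing values.
movies.head(n) returns the first n rows, and movies.tail(n) returns the last n rows.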
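As a minimal sketch of .head and .tail, the following builds a small made-up DataFrame (the name toy and its values are illustrative assumptions, not taken from movie.csv) and checks which rows each method returns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for movie.csv
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D", "E", "F"],
        "score": [7.9, 7.1, np.nan, 8.5, 6.3, 6.6],
    }
)

first_two = toy.head(2)  # rows labelled 0 and 1
last_two = toy.tail(2)   # rows labelled 4 and 5
n_missing = int(toy["score"].isna().sum())  # the single NaN in "score"

print(first_two)
print(last_two)
print(n_missing)
```

Calling .head() or .tail() without an argument returns five rows.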
DataFrame attributes
Extract the columns, index, and data of the DataFrame:
>>> movies = pd.read_csv("data/movie.csv")
>>> columns = movies.columns
>>> index = movies.index
>>> data = movies.to_numpy()
Display the columns, index, and data:
>>> columns
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
'imdb_score', 'aspect_ratio', 'movie_facebook_likes'], dtype='object')
>>> index
RangeIndex(start=0, stop=4916, step=1)
>>> data
array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
...,
['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)
The Python types of the index, columns, and data:
>>> type(index)
<class 'pandas.core.indexes.range.RangeIndex'>
>>> type(columns)
<class 'pandas.core.indexes.base.Index'>
>>> type(data)
<class 'numpy.ndarray'>
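The index and the columns are both instances of Index or of one of its subclasses; they are sometimes called the row index and the column index: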
>>> issubclass(pd.RangeIndex, pd.Index)
True
>>> issubclass(columns.__class__, pd.Index)
True
The .values attribute (or the .to_numpy() method) converts the index, the columns, or the data to an ndarray, NumPy's n-dimensional array:
>>> index.to_numpy()
array([ 0, 1, 2, ..., 4913, 4914, 4915], dtype=int64)
>>> columns.to_numpy()
array(['color', 'director_name', 'num_critic_for_reviews', 'duration',
'director_facebook_likes', 'actor_3_facebook_likes',
'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres',
'actor_1_name', 'movie_title', 'num_voted_users',
'cast_total_facebook_likes', 'actor_3_name',
'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link',
'num_user_for_reviews', 'language', 'country', 'content_rating',
'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score',
'aspect_ratio', 'movie_facebook_likes'], dtype=object)
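The three components can be pulled apart on any DataFrame. A small sketch, assuming a made-up two-column frame (toy and its values are hypothetical):

```python
import pandas as pd

# Hypothetical frame with one integer and one float column
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

index_arr = toy.index.to_numpy()      # row labels as an ndarray
columns_arr = toy.columns.to_numpy()  # column labels as an ndarray
data_arr = toy.to_numpy()             # the values as a 2-D ndarray

print(index_arr)
print(columns_arr)
print(data_arr.shape, data_arr.dtype)
```

Because an ndarray has a single dtype, mixing an integer column with a float column gives a float64 array here; mixing in strings, as movie.csv does, gives dtype object.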
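Understanding data types
Broadly speaking, data can be divided into continuous data and discrete, categorical data. The data types you will meet most often are:
- float: NumPy floating-point type; supports missing values
- int: NumPy integer type; does not support missing values
- Int64: Pandas nullable integer type; supports missing values
- object: NumPy type for storing strings and mixed types
- category: Pandas categorical type; supports missing values
- bool: NumPy Boolean type; does not support missing values (None becomes False, np.nan becomes True)
- boolean: Pandas nullable Boolean type; supports missing values
- datetime64[ns]: NumPy date type; supports missing values (NaT)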
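To make the difference between the NumPy and Pandas integer types concrete, here is a short sketch; the variable names and values are made up for illustration:

```python
import pandas as pd

# With the default inference, a missing value forces integers to float64
as_float = pd.Series([1, 2, None])

# The nullable Int64 type keeps the values as integers and marks the gap as missing
as_nullable = pd.Series([1, 2, None], dtype="Int64")

# The category type also tolerates missing values
as_category = pd.Series(["a", "b", None], dtype="category")

print(as_float.dtype)     # float64
print(as_nullable.dtype)  # Int64
print(as_category.dtype)  # category
```

This is why several count-like columns in movie.csv, such as actor_1_facebook_likes, are read as float64: they contain missing values.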
Use the .dtypes attribute to display each column name with its data type:
>>> movies = pd.read_csv("data/movie.csv")
>>> movies.dtypes
color object
director_name object
num_critic_for_reviews float64
duration float64
director_facebook_likes float64
...
title_year float64
actor_2_facebook_likes float64
imdb_score float64
aspect_ratio float64
movie_facebook_likes int64
Length: 28, dtype: object
Use the .value_counts method to count how many columns have each data type:
>>> movies.dtypes.value_counts()
float64 13
int64 3
object 12
dtype: int64
The .info method also shows the data types, together with the non-null counts and the memory usage:
>>> movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4916 entries, 0 to 4915
Data columns (total 28 columns):
color 4897 non-null object
director_name 4814 non-null object
num_critic_for_reviews 4867 non-null float64
duration 4901 non-null float64
director_facebook_likes 4814 non-null float64
actor_3_facebook_likes 4893 non-null float64
actor_2_name 4903 non-null object
actor_1_facebook_likes 4909 non-null float64
gross 4054 non-null float64
genres 4916 non-null object
actor_1_name 4909 non-null object
movie_title 4916 non-null object
num_voted_users 4916 non-null int64
cast_total_facebook_likes 4916 non-null int64
actor_3_name 4893 non-null object
facenumber_in_poster 4903 non-null float64
plot_keywords 4764 non-null object
movie_imdb_link 4916 non-null object
num_user_for_reviews 4895 non-null float64
language 4904 non-null object
country 4911 non-null object
content_rating 4616 non-null object
budget 4432 non-null float64
title_year 4810 non-null float64
actor_2_facebook_likes 4903 non-null float64
imdb_score 4916 non-null float64
aspect_ratio 4590 non-null float64
movie_facebook_likes 4916 non-null int64
dtypes: float64(13), int64(3), object(12)
memory usage: 1.1 MB
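By default Pandas stores numeric types in 64 bits, which is why int64 and float64 appear above.
Opening with the DataFrame, the most widely used structure, is a distinctive choice of this book; conventionally one would start with the Series.
An object column may hold any Python object and may also contain missing values. A Pandas Series that holds strings together with missing values has dtype O (object):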
>>> pd.Series(["Paul", np.nan, "George"]).dtype
dtype('O')
Selecting a column
Select a column by passing its name to the index operator:
>>> movies = pd.read_csv("data/movie.csv")
>>> movies["director_name"]
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
4 Doug Walker
...
4911 Scott Smith
4912 NaN
4913 Benjamin Roberds
4914 Daniel Hsia
4915 Jon Gunn
Name: director_name, Length: 4916, dtype: object
Select a column with attribute access:
>>> movies.director_name
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
4 Doug Walker
...
4911 Scott Smith
4912 NaN
4913 Benjamin Roberds
4914 Daniel Hsia
4915 Jon Gunn
Name: director_name, Length: 4916, dtype: object
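Select a column with .loc or .iloc; the former takes the column name, the latter the integer position:
# The bare : slice selects every row, from the first to the last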
>>> movies.loc[:, "director_name"]
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
4 Doug Walker
...
4911 Scott Smith
4912 NaN
4913 Benjamin Roberds
4914 Daniel Hsia
4915 Jon Gunn
Name: director_name, Length: 4916, dtype: object
>>> movies.iloc[:, 1]
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
4 Doug Walker
...
4911 Scott Smith
4912 NaN
4913 Benjamin Roberds
4914 Daniel Hsia
4915 Jon Gunn
Name: director_name, Length: 4916, dtype: object
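The four selection styles return the same Series. A quick check on a made-up two-row frame (toy is a hypothetical stand-in for movies):

```python
import pandas as pd

# Hypothetical stand-in for the movies DataFrame
toy = pd.DataFrame(
    {
        "color": ["Color", "Color"],
        "director_name": ["James Cameron", "Sam Mendes"],
    }
)

by_key = toy["director_name"]          # index operator
by_attr = toy.director_name            # attribute access
by_loc = toy.loc[:, "director_name"]   # label-based
by_iloc = toy.iloc[:, 1]               # position-based

all_equal = (
    by_key.equals(by_attr)
    and by_key.equals(by_loc)
    and by_key.equals(by_iloc)
)
print(all_equal)
```

Attribute access only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method, so the index operator is the safer default.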
Inspect the attributes of the column:
>>> movies["director_name"].index
RangeIndex(start=0, stop=4916, step=1)
>>> movies["director_name"].dtype
dtype('O')
>>> movies["director_name"].size
4916
>>> movies["director_name"].name
'director_name'
Confirm that the result is a Series object:
>>> type(movies["director_name"])
<class 'pandas.core.series.Series'>
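Every column of a DataFrame can be pulled out and worked with as a Series.
Calling Series methods
Use dir() to list the attributes and methods of pd.Series and pd.DataFrame and count how many they share: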
>>> s_attr_methods = set(dir(pd.Series))
>>> len(s_attr_methods)
471
>>> df_attr_methods = set(dir(pd.DataFrame))
>>> len(df_attr_methods)
458
>>> len(s_attr_methods & df_attr_methods)
400
First, select two columns:
>>> movies = pd.read_csv("data/movie.csv")
>>> director = movies["director_name"]
>>> fb_likes = movies["actor_1_facebook_likes"]
>>> director.dtype
dtype('O')
>>> fb_likes.dtype
dtype('float64')
Besides .head, which lists the first five rows of a Series, .sample shows a random selection of rows:
>>> director.head()
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
4 Doug Walker
Name: director_name, dtype: object
>>> director.sample(n=5, random_state=42)
2347 Brian Percival
4687 Lucio Fulci
691 Phillip Noyce
3911 Sam Peckinpah
2488 Rowdy Herrington
Name: director_name, dtype: object
>>> fb_likes.head()
0 1000.0
1 40000.0
2 11000.0
3 27000.0
4 131.0
Name: actor_1_facebook_likes, dtype: float64
The data type of a Series determines which methods are most useful. For an object Series, for example, the most commonly used method is .value_counts:
>>> director.value_counts()
Steven Spielberg 26
Woody Allen 22
Clint Eastwood 20
Martin Scorsese 20
Ridley Scott 16
..
Eric England 1
Moustapha Akkad 1
Jay Oliva 1
Scott Speer 1
Leon Ford 1
Name: director_name, Length: 2397, dtype: int64
Numeric data can also use .value_counts:
>>> fb_likes.value_counts()
1000.0 436
11000.0 206
2000.0 189
3000.0 150
12000.0 131
...
362.0 1
216.0 1
859.0 1
225.0 1
334.0 1
Name: actor_1_facebook_likes, Length: 877, dtype: int64
Use .size, .shape, or len() to get the number of elements; .unique() returns the unique values:
>>> director.size
4916
>>> director.shape
(4916,)
>>> len(director)
4916
>>> director.unique()
array(['James Cameron', 'Gore Verbinski', 'Sam Mendes', ...,
'Scott Smith', 'Benjamin Roberds', 'Daniel Hsia'], dtype=object)
.count() returns the number of non-missing values:
>>> director.count()
4814
>>> fb_likes.count()
4909
The .min, .max, .mean, .median, and .std methods return summary statistics:
>>> fb_likes.min()
0.0
>>> fb_likes.max()
640000.0
>>> fb_likes.mean()
6494.488490527602
>>> fb_likes.median()
982.0
>>> fb_likes.std()
15106.986883848309
.describe returns several summary statistics at once:
>>> fb_likes.describe()
count 4909.000000
mean 6494.488491
std 15106.986884
min 0.000000
25% 607.000000
50% 982.000000
75% 11000.000000
max 640000.000000
Name: actor_1_facebook_likes, dtype: float64
>>> director.describe()
count 4814
unique 2397
top Steven Spielberg
freq 26
Name: director_name, dtype: object
The .quantile() method returns quantiles:
>>> fb_likes.quantile(0.2)
510.0
>>> fb_likes.quantile(
... [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
... )
0.1 240.0
0.2 510.0
0.3 694.0
0.4 854.0
0.5 982.0
0.6 1000.0
0.7 8000.0
0.8 13000.0
0.9 18000.0
Name: actor_1_facebook_likes, dtype: float64
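A small sketch of how .quantile behaves on a made-up Series, where the results are easy to verify by hand. It relies on the default linear interpolation between neighbouring values:

```python
import pandas as pd

# Hypothetical, evenly spaced values
values = pd.Series([0, 10, 20, 30, 40])

q10 = values.quantile(0.1)                # a scalar for a single quantile
quartiles = values.quantile([0.25, 0.75])  # a Series indexed by the quantiles
same_as_median = values.quantile(0.5) == values.median()

print(q10)
print(quartiles)
print(same_as_median)
```

Passing a single number returns a scalar, while passing a list returns a Series whose index holds the requested quantiles, as in the output above.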
.isna() reports, element by element, whether a value is missing:
>>> director.isna()
0 False
1 False
2 False
3 False
4 False
...
4911 False
4912 True
4913 False
4914 False
4915 False
Name: director_name, Length: 4916, dtype: bool
.fillna() fills in missing values:
>>> fb_likes_filled = fb_likes.fillna(0)
>>> fb_likes_filled.count()
4916
.dropna() drops missing values:
>>> fb_likes_dropped = fb_likes.dropna()
>>> fb_likes_dropped.size
4909
When the normalize parameter of .value_counts() is set to True, the method returns relative frequencies:
>>> director.value_counts(normalize=True)
Steven Spielberg 0.005401
Woody Allen 0.004570
Clint Eastwood 0.004155
Martin Scorsese 0.004155
Ridley Scott 0.003324
...
Eric England 0.000208
Moustapha Akkad 0.000208
Jay Oliva 0.000208
Scott Speer 0.000208
Leon Ford 0.000208
Name: director_name, Length: 2397, dtype: float64
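By default the relative frequencies are computed over the non-missing values only, which is why 26 films out of 4814 named directors gives 0.005401 above. A sketch with made-up labels shows the effect of the dropna parameter:

```python
import pandas as pd

# Hypothetical labels with one missing entry
labels = pd.Series(["a", "a", "b", None])

# Missing values are excluded before the frequencies are computed
freq = labels.value_counts(normalize=True)

# dropna=False counts the missing value as its own group
freq_with_na = labels.value_counts(normalize=True, dropna=False)

print(freq)
print(freq_with_na)
```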
Another way to check for missing values is the .hasnans attribute:
>>> director.hasnans
True
The .notna() method reports whether each value is not missing:
>>> director.notna()
0 True
1 True
2 True
3 True
4 True
...
4911 True
4912 False
4913 True
4914 True
4915 True
Name: director_name, Length: 4916, dtype: bool
.isnull() does the same thing as .isna(); because Pandas uses NaN to represent missing values, the latter name is easier to remember.
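The missing-value helpers covered in this section can be compared side by side. A minimal sketch on a made-up Series (the name likes and its values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing values
likes = pd.Series([1000.0, np.nan, 131.0, np.nan])

total = likes.size                  # counts every element, missing or not
present = int(likes.count())        # counts non-missing elements only
n_missing = int(likes.isna().sum())  # True counts as 1 when summed
same_mask = likes.isna().equals(likes.isnull())  # the two methods agree
filled_count = int(likes.fillna(0).count())
dropped_size = likes.dropna().size

print(total, present, n_missing, same_mask, filled_count, dropped_size)
```

Neither .fillna nor .dropna modifies the original Series; each returns a new one.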
Series operations
Load the imdb_score column:
>>> movies = pd.read_csv("data/movie.csv")
>>> imdb_score = movies["imdb_score"]
>>> imdb_score
0 7.9
1 7.1
2 6.8
3 8.5
4 7.1
...
4911 7.7
4912 7.5
4913 6.3
4914 6.3
4915 6.6
Name: imdb_score, Length: 4916, dtype: float64
Addition, subtraction, multiplication, division, and exponentiation can be applied directly to a column:
>>> imdb_score + 1
0 8.9
1 8.1
2 7.8
3 9.5
4 8.1
...
4911 8.7
4912 8.5
4913 7.3
4914 7.3
4915 7.6
Name: imdb_score, Length: 4916, dtype: float64
The // and % operators return the floor quotient and the remainder of a division, respectively:
>>> imdb_score // 7
0 1.0
1 1.0
2 0.0
3 1.0
4 1.0
...
4911 1.0
4912 1.0
4913 0.0
4914 0.0
4915 0.0
Name: imdb_score, Length: 4916, dtype: float64
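Floor division and the remainder fit together: multiplying the quotient by the divisor and adding the remainder recovers the original values. A short sketch with made-up scores:

```python
import numpy as np
import pandas as pd

# Hypothetical scores
scores = pd.Series([7.9, 6.8, 14.5])

quotient = scores // 7   # whole number of sevens in each value
remainder = scores % 7   # what is left over
rebuilt = quotient * 7 + remainder

print(quotient.tolist())
print(np.allclose(rebuilt, scores))
```

The comparison uses np.allclose rather than == because floating-point remainders are rarely exact.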
The six comparison operators, >, <, >=, <=, ==, and !=, return a Series of Boolean values:
>>> imdb_score > 7
0 True
1 True
2 False
3 True
4 True
...
4911 True
4912 True
4913 False
4914 False
4915 False
Name: imdb_score, Length: 4916, dtype: bool
>>> director = movies["director_name"]
>>> director == "James Cameron"
0 True
1 False
2 False
3 False
4 False
...
4911 False
4912 False
4913 False
4914 False
4915 False
Name: director_name, Length: 4916, dtype: bool
The .add() method is equivalent to the + operator, and .gt() is equivalent to >:
>>> imdb_score.add(1) # imdb_score + 1
0 8.9
1 8.1
2 7.8
3 9.5
4 8.1
...
4911 8.7
4912 8.5
4913 7.3
4914 7.3
4915 7.6
Name: imdb_score, Length: 4916, dtype: float64
>>> imdb_score.gt(7) # imdb_score > 7
0 True
1 True
2 False
3 True
4 True
...
4911 True
4912 True
4913 False
4914 False
4915 False
Name: imdb_score, Length: 4916, dtype: bool
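The reason to use the methods is that they accept extra parameters. The .sub method, for example, takes fill_value, which stands in for a missing value before the subtraction is carried out:
>>> money = pd.Series([100, 20, None])
>>> money - 15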
0 85.0
1 5.0
2 NaN
dtype: float64
>>> money.sub(15, fill_value=0)
0 85.0
1 5.0
2 -15.0
dtype: float64
The arithmetic methods are .add, .sub, .mul, .div, .floordiv, .mod, and .pow.
The comparison methods are .lt, .gt, .le, .ge, .eq, and .ne.
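The fill_value parameter is also useful when two Series are combined and their indexes only partly overlap. A sketch with made-up Series (first and second are hypothetical names):

```python
import pandas as pd

# Hypothetical Series that share only the label "y"
first = pd.Series([1, 2], index=["x", "y"])
second = pd.Series([10, 20], index=["y", "z"])

# The operator aligns on labels; labels present on one side only give NaN
with_operator = first + second

# The method treats a label missing from one side as 0 before adding
with_method = first.add(second, fill_value=0)

# Comparison methods mirror the operators exactly
same_comparison = first.gt(1).equals(first > 1)

print(with_operator)
print(with_method)
print(same_comparison)
```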
Method chaining
Methods can be chained, with each call operating on the result of the previous one.
>>> movies = pd.read_csv("data/movie.csv")
>>> fb_likes = movies["actor_1_facebook_likes"]
>>> director = movies["director_name"]
>>> director.value_counts().head(3)
Steven Spielberg 26
Woody Allen 22
Clint Eastwood 20
Name: director_name, dtype: int64
Count the number of missing values:
>>> fb_likes.isna().sum()
7
Fill the missing values so that the column can be converted to integers:
>>> fb_likes.dtype
dtype('float64')
>>> (fb_likes.fillna(0).astype(int).head())
0 1000
1 40000
2 11000
3 27000
4 131
Name: actor_1_facebook_likes, dtype: int64
.pipe() can be used to inspect intermediate values in a method chain:
>>> def debug_ser(ser):
... print("BEFORE")
... print(ser)
... print("AFTER")
... return ser
>>> (fb_likes.fillna(0).pipe(debug_ser).astype(int).head())
BEFORE
0 1000.0
1 40000.0
2 11000.0
3 27000.0
4 131.0
...
4911 637.0
4912 841.0
4913 0.0
4914 946.0
4915 86.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
AFTER
0 1000
1 40000
2 11000
3 27000
4 131
Name: actor_1_facebook_likes, dtype: int64
.pipe can also be used to store an intermediate value in a global variable:
>>> intermediate = None
>>> def get_intermediate(ser):
... global intermediate
... intermediate = ser
... return ser
>>> res = (
... fb_likes.fillna(0)
... .pipe(get_intermediate)
... .astype(int)
... .head()
... )
>>> intermediate
0 1000.0
1 40000.0
2 11000.0
3 27000.0
4 131.0
...
4911 637.0
4912 841.0
4913 0.0
4914 946.0
4915 86.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
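.pipe also forwards extra positional and keyword arguments to the function it calls, so a custom step can take parameters and still sit inside a chain. A sketch under stated assumptions: cap_at is a hypothetical helper written for this example, and the data is made up:

```python
import numpy as np
import pandas as pd


def cap_at(ser, limit):
    """Hypothetical helper: replace values above limit with limit."""
    return ser.clip(upper=limit)


# Hypothetical like counts with one missing value
likes = pd.Series([1000.0, 40000.0, np.nan])

result = (
    likes.fillna(0)          # missing -> 0
    .pipe(cap_at, 10000)     # same as cap_at(likes.fillna(0), 10000)
    .astype(int)             # safe now that no NaN remains
)
print(result.tolist())
```

Writing the step as a plain function keeps it testable on its own, while .pipe lets the chain read top to bottom.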
Renaming columns
>>> movies = pd.read_csv("data/movie.csv")
First, define a dictionary that maps the old column names to the new ones:
>>> col_map = {
... "director_name": "director",
... "num_critic_for_reviews": "critic_reviews",
... }
Pass the dictionary to the rename method:
>>> movies.rename(columns=col_map).head()
color director ... aspect_ratio movie_facebook_likes
0 Color James Cameron ... 1.78 33000
1 Color Gore Verbinski ... 2.35 0
2 Color Sam Mendes ... 2.35 85000
3 Color Christopher Nolan ... 2.35 164000
4 NaN Doug Walker ... NaN 0
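Note that .rename returns a new DataFrame and leaves the original untouched. A minimal sketch on a made-up one-row frame (toy is a hypothetical stand-in for movies):

```python
import pandas as pd

# Hypothetical stand-in for the movies DataFrame
toy = pd.DataFrame(
    {
        "director_name": ["James Cameron"],
        "num_critic_for_reviews": [723.0],
    }
)

col_map = {
    "director_name": "director",
    "num_critic_for_reviews": "critic_reviews",
}

renamed = toy.rename(columns=col_map)

print(list(renamed.columns))  # new names
print(list(toy.columns))      # original names are unchanged
```

To keep the new names, assign the result back to a variable, for example movies = movies.rename(columns=col_map).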