Pandas-17. Missing Data
Take the following code as an example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), index=["a", "c", "e", "f", "h"], columns=["A", "B", "C"])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
'''
          A         B         C
a -0.096388 -1.679405 -0.383818
b       NaN       NaN       NaN
c -0.531495 -1.003009  0.815197
d       NaN       NaN       NaN
e -0.588744  1.575706  1.617404
f -0.520550 -1.436264 -1.116896
g       NaN       NaN       NaN
h -0.851603  0.778596 -1.862553
'''
Checking for missing values
Use the isnull() and notnull() functions to check whether values are missing or present:
print(df["B"].isnull())
print("-----")
print(df["A"].notnull())
'''
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: B, dtype: bool
-----
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: A, dtype: bool
'''
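Beyond checking a single column, the same functions can be applied to a whole DataFrame to count missing values. The following is a minimal sketch; the `demo` frame and its values are made up for illustration so that the counts are reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with fixed values (not the random data used above).
demo = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 6.0]},
    index=["a", "b", "c"],
)

# isnull() on a DataFrame returns a boolean frame; summing it counts the True values.
missing_per_column = demo.isnull().sum()
missing_per_row = demo.isnull().sum(axis=1)

# any(axis=1) flags rows that contain at least one missing value.
rows_with_missing = demo[demo.isnull().any(axis=1)]

print(missing_per_column)
print(missing_per_row)
print(rows_with_missing)
```

isna() and notna() are aliases of isnull() and notnull(), so either spelling can be used.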
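Default calculations with missing values
- When summing, NaN is treated as 0; a row or column made up entirely of NaN therefore sums to 0.
print(df["B"].sum())
print("-----")
print(df["b":"b"].sum(axis=1))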
'''
-1.7643744977503546
-----
b    0.0
dtype: float64
'''
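The default can be overridden when a missing value should not be silently counted as 0. The sketch below uses two small made-up Series (`s` and `all_missing` are illustrative names) to show the skipna and min_count arguments of sum().

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the example above.
s = pd.Series([1.0, np.nan, 3.0])
all_missing = pd.Series([np.nan, np.nan])

total_default = s.sum()             # NaN is skipped: 1.0 + 3.0
total_strict = s.sum(skipna=False)  # NaN propagates into the result
mean_default = s.mean()             # averaged over the two non-missing values

empty_default = all_missing.sum()               # all-NaN input sums to 0.0
empty_min_count = all_missing.sum(min_count=1)  # NaN unless at least 1 valid value exists

print(total_default, total_strict, mean_default)
print(empty_default, empty_min_count)
```

min_count=1 is a reasonable choice when an all-missing row should stay missing after aggregation instead of turning into 0.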
Filling/cleaning missing data
- The fillna() function fills NaN values with non-null data. Take the following code as an example:
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
Filling NaN with a scalar value
print(df)
print("-----")
print(df.fillna(0))
'''
        one       two     three
a  0.882369  0.392508 -0.410003
b       NaN       NaN       NaN
c  1.012354  0.968128 -0.196215
-----
        one       two     three
a  0.882369  0.392508 -0.410003
b  0.000000  0.000000  0.000000
c  1.012354  0.968128 -0.196215
'''
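Filling with the previous/next value
ffill() propagates the last valid value forward and bfill() fills backward from the next valid value. Older pandas code writes these as fillna(method="pad") and fillna(method="bfill"); the method argument is deprecated in recent pandas versions.
print(df)
print("-----")
print(df.ffill())
print("-----")
print(df.bfill())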
'''
        one       two     three
a  0.882369  0.392508 -0.410003
b       NaN       NaN       NaN
c  1.012354  0.968128 -0.196215
-----
        one       two     three
a  0.882369  0.392508 -0.410003
b  0.882369  0.392508 -0.410003
c  1.012354  0.968128 -0.196215
-----
        one       two     three
a  0.882369  0.392508 -0.410003
b  1.012354  0.968128 -0.196215
c  1.012354  0.968128 -0.196215
'''
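Filling does not have to use one value for the whole frame. As a rough sketch on made-up data (the `demo` frame is hypothetical), fillna() also accepts a dict or a Series keyed by column name, and ffill() takes a limit on how many consecutive gaps it fills.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with fixed values so the results are reproducible.
demo = pd.DataFrame(
    {"one": [1.0, np.nan, np.nan, 4.0], "two": [10.0, np.nan, 30.0, np.nan]}
)

# A dict fills each column with its own value.
by_column = demo.fillna({"one": 0.0, "two": -1.0})

# A Series of column means fills each column with that column's mean.
by_mean = demo.fillna(demo.mean())

# limit=1 forward-fills at most one consecutive NaN per gap.
limited = demo.ffill(limit=1)

print(by_column)
print(by_mean)
print(limited)
```

With limit=1, the second NaN in column one stays missing, which helps when a long gap should not be papered over by a single stale value.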
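Dropping missing values
The dropna() function drops rows or columns that contain missing values, depending on the axis parameter (default 0): axis=0 drops rows and axis=1 drops columns.
print(df.dropna())
print("---")
print(df.dropna(axis=1))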
'''
        one       two     three
a  0.882369  0.392508 -0.410003
c  1.012354  0.968128 -0.196215
---
Empty DataFrame
Columns: []
Index: [a, b, c]
'''
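As the output above shows, dropna(axis=1) removes every column as soon as one row is entirely NaN. When that is too aggressive, the how, thresh and subset arguments narrow what gets dropped. The sketch below uses a made-up `demo` frame for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row b is entirely missing, rows c and d are partly missing.
demo = pd.DataFrame(
    {
        "one": [1.0, np.nan, 3.0, np.nan],
        "two": [10.0, np.nan, np.nan, np.nan],
        "three": [100.0, np.nan, 300.0, 400.0],
    },
    index=["a", "b", "c", "d"],
)

# how="all" drops a row only when every value in it is missing.
drop_empty_rows = demo.dropna(how="all")

# thresh=2 keeps rows that have at least two non-missing values.
at_least_two = demo.dropna(thresh=2)

# subset restricts the check to the listed columns.
has_two = demo.dropna(subset=["two"])

print(drop_empty_rows)
print(at_least_two)
print(has_two)
```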
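Replacing specified values
The replace() method replaces specified values with others; here a dict maps each old value to its new value:
df1 = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000],
                    'two': [1000, 0, 30, 40, 50, 60]})
print(df1)
print(df1.replace({1000: 10, 2000: 60}))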
'''
    one   two
0    10  1000
1    20     0
2    30    30
3    40    40
4    50    50
5  2000    60
   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
'''
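replace() also ties back to missing data: placeholder values can be turned into NaN so that the functions above apply to them. In the sketch below, the `raw` frame and the use of -999.0 as a "no reading" marker are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings where -999.0 stands in for "no measurement".
raw = pd.DataFrame(
    {"temp": [21.5, -999.0, 19.0], "humidity": [40.0, 55.0, -999.0]}
)

# Convert the placeholder into NaN so pandas treats it as missing.
cleaned = raw.replace(-999.0, np.nan)

# The usual tools now apply, for example filling with each column's mean.
filled = cleaned.fillna(cleaned.mean())

print(cleaned)
print(filled)
```

Converting first matters because a placeholder left in place would distort sums and means instead of being skipped.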