Pandas-17. Missing Data

2019-05-29 17:17:49

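The following code serves as the running example. Reindexing with labels that are not in the original index produces rows filled entirely with NaN: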

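import numpy as np
import pandas as pd

# Build a 5x3 frame of random values; the reindex below adds the missing rows.
df = pd.DataFrame(np.random.randn(5, 3), index=["a", "c", "e", "f", "h"], columns=["A", "B", "C"])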
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
'''
          A         B         C
a -0.096388 -1.679405 -0.383818
b       NaN       NaN       NaN
c -0.531495 -1.003009  0.815197
d       NaN       NaN       NaN
e -0.588744  1.575706  1.617404
f -0.520550 -1.436264 -1.116896
g       NaN       NaN       NaN
h -0.851603  0.778596 -1.862553
'''

Checking for Missing Values

The isnull() and notnull() methods test each value and return a boolean result marking which entries are missing or present:

print(df["B"].isnull())
print("-----")
print(df["A"].notnull())
'''
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: B, dtype: bool
-----
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: A, dtype: bool
'''
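
Because True counts as 1 when summed, chaining isnull() with sum() gives a count of missing values per column, and a notnull() mask can be used to select rows. The following is a minimal sketch; the frame `demo` and its values are illustrative rather than taken from the example above, so that the result is reproducible.

```python
import numpy as np
import pandas as pd

# Illustrative data (not the random frame above), so results are deterministic.
demo = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 6.0]},
    index=["a", "b", "c"],
)

# Booleans sum as 0/1, so this counts the missing values in each column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Boolean indexing with notnull() keeps only the rows where column "A" is present.
rows_with_a = demo[demo["A"].notnull()]
print(rows_with_a)
```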

Default Handling of Missing Values in Calculations

  • When summing, NaN is treated as 0, so a row or column made up entirely of NaN sums to 0.0:
print(df["B"].sum())
print("-----")
print(df.loc["b":"b"].sum(axis=1))
'''
-1.7643744977503546
-----
b    0.0
dtype: float64
'''
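
The same skipping behavior applies to other reductions such as mean(), and it can be adjusted. The sketch below, on a small illustrative Series (not part of the example above), shows the skipna and min_count arguments of sum(): skipna=False lets NaN propagate, and min_count=1 returns NaN instead of 0 when there are no valid values to add.

```python
import numpy as np
import pandas as pd

# Illustrative Series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                     # NaN is skipped: 1.0 + 3.0
average = s.mean()                  # NaN is skipped: (1.0 + 3.0) / 2
strict_total = s.sum(skipna=False)  # NaN propagates into the result

# A Series with no valid values at all.
all_missing = pd.Series([np.nan, np.nan])
empty_sum = all_missing.sum()                # 0.0 by default
guarded_sum = all_missing.sum(min_count=1)   # NaN, since fewer than 1 valid value

print(total, average, strict_total, empty_sum, guarded_sum)
```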

Filling and Cleaning Missing Data

  • The fillna() method fills NaN values with non-null data. The examples in this section use the following frame:
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])

Filling NaN with a Scalar Value

print(df)
print("-----")
print(df.fillna(0))
'''
        one       two     three
a  0.882369  0.392508 -0.410003
b       NaN       NaN       NaN
c  1.012354  0.968128 -0.196215
-----
        one       two     three
a  0.882369  0.392508 -0.410003
b  0.000000  0.000000  0.000000
c  1.012354  0.968128 -0.196215
'''
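
fillna() also accepts a dict or a Series keyed by column name, which allows a different fill value per column. The following is a minimal sketch on an illustrative frame (`demo` is not part of the example above); filling with demo.mean() is one common choice, not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Illustrative frame with missing values in both columns.
demo = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [10.0, np.nan, np.nan]})

# A dict maps each column name to its own fill value.
by_column = demo.fillna({"one": 0.0, "two": -1.0})
print(by_column)

# demo.mean() is a Series indexed by column name, so each column
# is filled with its own mean (computed over the non-missing values).
by_mean = demo.fillna(demo.mean())
print(by_mean)
```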

Filling with the Previous or Next Value

print(df)
print("-----")
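print(df.ffill())  # forward fill; fillna(method="pad") is deprecated in recent pandas
print("-----")
print(df.bfill())  # backward fill; fillna(method="bfill") is deprecated in recent pandas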
'''
        one       two     three
a  0.882369  0.392508 -0.410003
b       NaN       NaN       NaN
c  1.012354  0.968128 -0.196215
-----
        one       two     three
a  0.882369  0.392508 -0.410003
b  0.882369  0.392508 -0.410003
c  1.012354  0.968128 -0.196215
-----
        one       two     three
a  0.882369  0.392508 -0.410003
b  1.012354  0.968128 -0.196215
c  1.012354  0.968128 -0.196215
'''
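
The ffill() and bfill() methods perform the same forward and backward fill, and they accept a limit argument that caps how many consecutive NaN values are filled. The sketch below uses an illustrative Series (not part of the example above) with a run of three missing values.

```python
import numpy as np
import pandas as pd

# Illustrative Series with three consecutive missing values.
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()          # propagate the last valid value forward
backward = s.bfill()         # propagate the next valid value backward
limited = s.ffill(limit=1)   # fill at most one NaN per gap; the rest stay NaN

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

Capping the fill with limit is useful when a long gap should remain visibly missing instead of being filled with a stale value.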

Dropping Missing Values

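The dropna() method removes rows or columns that contain missing values. Its axis argument (default 0) selects which: axis=0 drops rows, axis=1 drops columns. In the example, every column contains a NaN in row b, so dropping along axis=1 leaves an empty DataFrame.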

print(df.dropna())
print("---")
print(df.dropna(axis=1))
'''
        one       two     three
a  0.882369  0.392508 -0.410003
c  1.012354  0.968128 -0.196215
---
Empty DataFrame
Columns: []
Index: [a, b, c]
'''
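
By default dropna() removes a row as soon as it contains any NaN. The how, thresh, and subset arguments make the rule less aggressive. The following is a minimal sketch on an illustrative frame (`demo` and its values are assumptions for demonstration, not the frame above).

```python
import numpy as np
import pandas as pd

# Illustrative frame: row "a" is partly missing, row "b" is entirely missing.
demo = pd.DataFrame(
    {
        "one": [1.0, np.nan, 3.0],
        "two": [np.nan, np.nan, 6.0],
        "three": [7.0, np.nan, 9.0],
    },
    index=["a", "b", "c"],
)

# how="all": drop a row only if every value in it is NaN.
drop_all_missing = demo.dropna(how="all")

# thresh=3: keep only rows with at least 3 non-missing values.
keep_complete = demo.dropna(thresh=3)

# subset: look for NaN only in the listed columns.
need_two = demo.dropna(subset=["two"])

print(drop_all_missing.index.tolist())
print(keep_complete.index.tolist())
print(need_two.index.tolist())
```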

Replacing Specific Values

The replace() method substitutes specified values with others. Passing a dict maps each old value to its replacement:

df1 = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000],
                    'two': [1000, 0, 30, 40, 50, 60]})
print(df1)
# Map each old value to its replacement: 1000 -> 10, 2000 -> 60.
print(df1.replace({1000: 10, 2000: 60}))
'''
    one   two
0    10  1000
1    20     0
2    30    30
3    40    40
4    50    50
5  2000    60
   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
'''
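
replace() ties back to missing data when a dataset encodes "no value" with a sentinel number. The sketch below assumes a hypothetical sentinel of -999 (not from the example above): it is converted to NaN so that the missing-data methods can handle it. A nested dict restricts a replacement to a single column.

```python
import numpy as np
import pandas as pd

# Illustrative frame in which -999 is a hypothetical "missing" sentinel.
raw = pd.DataFrame({"one": [10, 20, -999], "two": [-999, 40, 50]})

# Turn the sentinel into NaN so it is recognized as missing.
cleaned = raw.replace(-999, np.nan)
missing_count = int(cleaned.isnull().sum().sum())

# From here the usual tools apply, for example fillna().
filled = cleaned.fillna(0)

# Nested dict: replace -999 only in column "one", leaving "two" untouched.
one_only = raw.replace({"one": {-999: 0}})

print(cleaned)
print(filled)
print(one_only)
```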
