1.基本概念
代码语言:python 复制import pandas as pd df = pd.read_csv("test.csv") df.sample(10)
代码语言:javascript复制获取前几行数据
```python
data.head()
代码语言:txt复制获取数据维度信息
df.shape
代码语言:txt复制获取数据表属性的相关信息
```python
data.info()
代码语言:txt复制获取数据表属性类型信息
```python
data.head()
代码语言:txt复制### 2.转换数据类型
```python
import pandas as pd
df = pd.DataFrame([{'col1':'a', 'col2':'1'},
{'col1':'b', 'col2':'2'}])
df.dtypes
代码语言:txt复制Object转换为数值型
```python
df'col2-int' = df'col2'.astype(int)
代码语言:txt复制含有不是数字的Object类型转换为数值型
```python
#此时由于含有不是数字的值,所以就算忽略报错后转换后的类型还是Object
s = pd.Series('1', '2', '4.7', 'pandas', '10')
s.astype(float,errors='ignore')
代码语言:txt复制```python
#这种方法可以将其转换为数值型,Pandas则变为Nan
pd.to_numeric(s, errors='coerce')
代码语言:txt复制转换为日期类型
```python
#可以将三列数据Month、Day、Year转换为日期类型的数据
pd.to_datetime(df['Month', 'Day', 'Year'])
代码语言:txt复制最终的代码
```python
import pandas as pd
import numpy as np
def convert_money(value):
new_value = value.replace(",","").replace("$","")
return float(new_value)
df2 = pd.read_csv("/home/aistudio/data/data20506/sales_types.csv",
dtype = {'Customer Number': 'int'},
converters = {'2016': convert_money,
'2017': convert_money,
'Percent Growth': lambda x: float(x.replace("%", "")) / 100,
'Jan Units': lambda x: pd.to_numeric(x, errors='coerce'),
'Active': lambda x: np.where(x =='Y', 1, 0),
})
df2'Date' = pd.to_datetime(df['Month', 'Day', 'Year'])
df2
代码语言:txt复制### 3.处理重复数据
```python
import pandas as pd
d = {'Name':'Newton', 'Galilei', 'Einstein', 'Feynman', 'Newton', 'Maxwell', 'Galilei',
'Age':26, 30, 28, 28, 26, 39, 40,
'Score':90, 80, 90, 100, 90, 70, 90}
df = pd.DataFrame(d,columns='Name','Age','Score')
df
代码语言:txt复制df.duplicated()函数使用
```python
Consider dataset containing ramen rating.
df = pd.DataFrame({
... 'brand': 'Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie',
... 'style': 'cup', 'cup', 'cup', 'pack', 'pack',
... 'rating': 4, 4, 3.5, 15, 5
... })
代码语言:txt复制df
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
By default, for each set of duplicated values, the first occurrence
is set on False and all others on True.
df.duplicated()
0 False
1 True
2 False
3 False
4 False
dtype: bool
By using 'last', the last occurrence of each set of duplicated values
is set on False and all others on True.
df.duplicated(keep='last')
0 True
1 False
2 False
3 False
4 False
dtype: bool
By setting keep
on False, all duplicates are True.
df.duplicated(keep=False)
0 True
1 True
2 False
3 False
4 False
dtype: bool
To find duplicates on specific column(s), use subset
.
df.duplicated(subset='brand')
0 False
1 True
2 False
3 True
4 True
dtype: bool
代码语言:txt复制删除重复的
```python
#保留第一个
df.drop_duplicates()
#保留最后一个重复的元素
df.drop_duplicates('Age', keep='last')
代码语言:txt复制### 4.处理缺失数据
#### a.检查缺失数据
```python
#方法一
#isna()函数,若为空则为False,若不为空则为True
df = pd.DataFrame({"one":1, 2, np.nan, "two":np.nan, 3, 4})
df.isna() #返回m行n列,每个元素的值都会返回(True,False)
df.isna().any() #只返回1列,只要有一个是False就整个属性的值就为False
#方法二
#可以统计缺失率
(data.shape0 - data.count()) / data.shape0
代码语言:txt复制#### b.直接删除缺失数据
Dropna()函数
```python
Examples
df = pd.DataFrame({"name": 'Alfred', 'Batman', 'Catwoman',
... "toy": np.nan, 'Batmobile', 'Bullwhip',
... "born": [pd.NaT, pd.Timestamp("1940-04-25"),
... pd.NaT]})
代码语言:txt复制df
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Drop the rows where at least one element is missing.
代码语言:txt复制df.dropna()
name toy born
1 Batman Batmobile 1940-04-25
Drop the columns where at least one element is missing.
代码语言:txt复制df.dropna(axis='columns')
name
0 Alfred
1 Batman
2 Catwoman
Drop the rows where all elements are missing.
代码语言:txt复制df.dropna(how='all')
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Keep only the rows with at least 2 non-NA values.
代码语言:txt复制df.dropna(thresh=2)
name toy born
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Define in which columns to look for missing values.
代码语言:txt复制df.dropna(subset='name', 'born')
name toy born
1 Batman Batmobile 1940-04-25
Keep the DataFrame with valid entries in the same variable.
在同一个变量改变
代码语言:txt复制df.dropna(inplace=True)
df
name toy born
1 Batman Batmobile 1940-04-25
代码语言:txt复制#### c.用指定值填补缺失数据
```python
df = pd.DataFrame({'ColA':1, np.nan, np.nan, 4, 5, 6, 7, 'ColB':1, 1, 1, 1, 2, 2, 2})
df'ColA'.fillna(method='ffill')#以前面一个值填充
df'ColA'.fillna(method='bfill')#以后面一个值填充
代码语言:txt复制调用sklearn.impute中的SimpleImputer来填补缺失数据
```python
from sklearn.impute import SimpleImpute
#以均值填充空值
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
col_values = imp_mean.fit_transform(pdf2'Height-na'.values.reshape((-1, 1)))
#以常数填充空值
df = pd.DataFrame({"name": "Google", "Huawei", "Facebook", "Alibaba","price": 100, -1, -1, 90})
imp = SimpleImputer(missing_values=-1, strategy='constant', fill_value=110) # ⑤
imp.fit_transform(df'price'.values.reshape((-1, 1)))
代码语言:txt复制#### d.根据规律填补缺失值
利用sklearn.linear_model的LinearRegression来回归未知的数据
```python
df = pd.DataFrame({"one":np.random.randint(1, 100, 10),
"two": 2, 4, 6, 8, 10, 12, 14, 16, 18, 20,
"three":5, 9, 13, np.nan, 21, np.nan, 29, 33, 37, 41})
from sklearn.linear_model import LinearRegression # ⑥
df_train = df.dropna() #训练集
df_test = df[df'three'.isnull()] #测试集
regr = LinearRegression()
regr.fit(df_train'two'.values.reshape(-1, 1), df_train'three'.values.reshape(-1, 1)) # ⑦
df_three_pred = regr.predict(df_test'two'.values.reshape(-1, 1))
将所得数值填补到原数据集中
df.loc(df.three.isnull()), 'three' = df_three_pred
df
代码语言:txt复制使用sklearn.ensemble的RandomForestRegressor来进行预测
```python
import pandas as pd
train_data = pd.read_csv("/home/aistudio/data/data20507/train.csv")
train_data.info()
df = train_data['Age','Fare', 'Parch', 'SibSp', 'Pclass'] #可能跟年龄有关的特征
known_age = df[df'Age'.notnull()].values
unknown_age = df[df'Age'.isnull()].values
y = known_age:, 0
X = known_age:, 1:
from sklearn.ensemble import RandomForestRegressor # ⑩
rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1) # ○11
rfr.fit(X, y)
pred_age = rfr.predict(unknown_age:, 1:) # ○13
pred_age.mean()
train_data.loc(train_data.Age.isnull()), 'Age' = pred_age
train_data.isna().any()
代码语言:txt复制还可以利用KNN来填补缺失值
```python
%matplotlib inline
import seaborn as sns
from sklearn.datasets import load_iris # 引入鸢尾花数据集
import numpy as np
iris = load_iris()
X = iris.data
制造含有缺失值的数据集
rng = np.random.RandomState(0)
X_missing = X.copy()
mask = np.abs(X:, 2 - rng.normal(loc=5.5, scale=0.7, size=X.shape0)) < 0.6
X_missingmask, 3 = np.nan # X_missing是包含了缺失值的数据集
from missingpy import KNNImputer # 引入KNN填充缺失值的模型
imputer = KNNImputer(n_neighbors=3, weights="uniform")
X_imputed = imputer.fit_transform(X_missing)
#填补之前的数据分布
sns.distplot(X.reshape((-1, 1)))
#填补缺失数据后的分布
sns.distplot(X_imputed.reshape((-1, 1)))
代码语言:txt复制### 5.离群数据
#### 5.1 通过可视化
```python
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_boston
boston = load_boston()
x = boston.data
y = boston.target
columns = boston.feature_names
#create the dataframe
boston_df = pd.DataFrame(x)
boston_df.columns = columns
boston_df.head()
代码语言:txt复制#### Box-plot
```python
import seaborn as sns
sns.boxplot(x=boston_df'INDUS');
代码语言:txt复制![[箱线图.png]]
```python
sns.boxplot(x=boston_df'PTRATIO');
sns.boxplot(x=boston_df'DIS');
代码语言:txt复制#### Scatter-plot
```python
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(boston_df'INDUS', boston_df'TAX')
ax.set_xlabel('Proportion of non-retail business acres per town')
ax.set_ylabel('Full-value property-tax rate per $10,000')
plt.show();
代码语言:txt复制#### 5.2 通过数学计算
#### Z-Score
[Z-Score](https://en.wikipedia.org/wiki/Standard_score)是指观测点或数据的值超过观测值或测量值平均值的标准差的符号数。
```python
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(boston_df))
print(z)
threshold = 3
print(np.where(z > 3))
代码语言:txt复制#### IQR score
[四分位数范围(IQR)](https://en.wikipedia.org/wiki/Interquartile_range),又称平均数或50%平均数,或技术上称为H-spread,是衡量统计学分散度的指标,等于75%和25%之间的差值,或上四分位数和下四分位数之间的差值,IQR=Q3 - Q1。[百度百科解释](https://img.yuanmabao.com/zijie/pic/2022/02/02/yjbnj1ce0ij.png]]
```python
Q1 = boston_df.quantile(0.25)
Q3 = boston_df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
代码语言:txt复制既然我们现在有了IQR估计,那么就可以选择离群值。下面的代码将产生带有真值和假值的结果。带有False的数据点表示这些值是有效的,而True则表示有释放。
```python
print((boston_df < (Q1 - 1.5 * IQR)) | (boston_df > (Q3 1.5 * IQR)))
5.3 处理离群数据
- 识别异常值--分析有无异常值的结果模型--做出结论。
- 如果你确定数值是错误的,就修正它。
- 如果离群值不在利益分配范围内,则删除。
- 考虑到数据的差距,使用抗离群值的统计工具,例如,稳健回归(用另一种参数估计方法)Robust_regression。
- 离群值的常见原因是两种分布的混合,可能是两个不同的子人群,也可能表明 "测量正确 "和 "测量误差";这通常是由混合模型来建模。 (Mixture model).
参考资料
https://blog.csdn.net/weixin_42199542/article/details/106898892
https://blog.csdn.net/bbbeoy/article/details/72124004