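Getting Started with Statistical Analysis on Kaggle
This article walks through a statistical analysis of a tumor dataset from Kaggle and is aimed at beginners who want a quick start. It covers:
- Frequency counts with histograms
- Locating outliers with the quartile (IQR) method
- Descriptive statistics
- Analysis based on the cumulative distribution function
- Pairwise relationships between variables
- Correlation analysis…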
Dataset
The data is available at: https://www.kaggle.com/code/kanncaa1/statistical-learning-tutorial-for-beginners/notebook
The original data comes from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
Importing libraries
In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
plt.style.use("ggplot")
import warnings
warnings.filterwarnings("ignore")
In [2]:
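The cell that loads the data did not survive in this copy of the notebook. Below is a minimal sketch of what it presumably did, assuming the Kaggle CSV has been downloaded locally as `data.csv`. Both the file name and the `load_tumor_data` helper are assumptions for illustration, not taken from the source.

```python
import pandas as pd


def load_tumor_data(path="data.csv"):
    """Read the tumor CSV into a DataFrame (the default path is a hypothetical location)."""
    return pd.read_csv(path)


# df = load_tumor_data("data.csv")  # uncomment once the CSV is available locally
```

All later cells assume the result is bound to the name `df`.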
Basic information
In [3]:
df.shape
Out[3]:
(569, 33)
In [4]:
df.isnull().sum()
Out[4]:
id 0
diagnosis 0
radius_mean 0
texture_mean 0
perimeter_mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave points_worst 0
symmetry_worst 0
fractal_dimension_worst 0
Unnamed: 32 569
dtype: int64
Drop two columns that add nothing to the analysis: "Unnamed: 32" is missing in all 569 rows, and "id" is only a record identifier.
In [5]:
df.drop(["Unnamed: 32", "id"], axis=1, inplace=True)
The remaining columns:
In [6]:
columns = df.columns
columns
Out[6]:
Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst'],
dtype='object')
Analysis 1: Histogram
A histogram counts how many observations fall into each bin of values.
In [7]:
# radius_mean: mean value of the radius
m = plt.hist(df[df["diagnosis"] == "M"].radius_mean,
             bins=30,
             fc=(1, 0, 0, 0.5),  # semi-transparent red
             label="Malignant"
             )
b = plt.hist(df[df["diagnosis"] == "B"].radius_mean,
             bins=30,
             fc=(0, 1, 0, 0.5),  # semi-transparent green
             label="Benign"
             )
plt.legend()
plt.xlabel("Radius Mean Values")
plt.ylabel("Frequency")
plt.title("Histogram of Radius Mean for Benign and Malignant Tumors")
plt.show()
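Summary:
- Most malignant tumors have a larger mean radius than benign tumors
- The distribution for benign tumors (green) is roughly bell-shaped, which is consistent with an approximately normal distribution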
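A bell-shaped histogram only suggests normality. One way to check it more formally is the Shapiro-Wilk test in `scipy.stats`. The sketch below runs on synthetic data. The sample size, mean and standard deviation are assumptions that loosely mimic the benign group, and the same calls would apply to `df[df["diagnosis"] == "B"].radius_mean`.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the benign radius_mean values (parameters are assumptions)
rng = np.random.default_rng(0)
sample = rng.normal(loc=12.1, scale=1.8, size=357)

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed
statistic, p_value = stats.shapiro(sample)

# Skewness is close to 0 for a symmetric distribution
sample_skew = stats.skew(sample)

print("Shapiro-Wilk statistic:", statistic)
print("p-value:", p_value)
print("skewness:", sample_skew)
```

A small p-value (for example below 0.05) is evidence against normality. A large p-value does not prove normality, it only fails to reject it.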
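Analysis 2: Outlier analysis
Outliers are identified from the quartiles of the data: a value counts as an outlier if it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.
In [8]:
data_b = df[df["diagnosis"] == "B"]  # benign tumors
data_m = df[df["diagnosis"] == "M"]  # malignant tumors
desc = data_b.radius_mean.describe()
q1 = desc["25%"]  # first quartile
q3 = desc["75%"]  # third quartile
iqr = q3 - q1     # interquartile range
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
# normal (non-outlier) range
print("Normal range: ({0}, {1})".format(round(lower, 4), round(upper, 4)))
Normal range: (7.645, 16.805)
In [9]:
# outliers
print("Outliers:", data_b[(data_b.radius_mean < lower) | (data_b.radius_mean > upper)].radius_mean.values)
Outliers: [ 6.981 16.84 17.85 ]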
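The same rule can be wrapped in a small reusable function. This is a sketch under stated assumptions: the helper name `iqr_outliers` and the toy data are hypothetical, and it uses `np.percentile` rather than `describe()` to get the quartiles.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return (lower, upper, outliers) using the k * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Anything strictly outside [lower, upper] is flagged
    mask = (values < lower) | (values > upper)
    return lower, upper, values[mask]


# Toy data with one obvious outlier
lower, upper, outliers = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(lower, upper, outliers)
```

With the notebook's data, `iqr_outliers(data_b.radius_mean)` should apply the same 1.5 * IQR rule as the cells above.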
Analysis 3: Locating outliers with box plots
A box plot shows the outliers in the data at a glance.
In [10]:
# Using Plotly
fig = px.box(df,
x="diagnosis",
y="radius_mean",
color="diagnosis")
fig.show()
# Using seaborn
melted_df = pd.melt(df,
id_vars = "diagnosis",
value_vars = ['radius_mean', 'texture_mean'])
plt.figure(figsize=(15,10))
sns.boxplot(x="variable",
y="value",
hue="diagnosis",
data=melted_df
)
plt.show()
Analysis 4: Descriptive statistics with describe
Descriptive statistics for the benign tumor data, data_b:
# for the tumor radius: radius_mean
print("mean: ", data_b.radius_mean.mean())
print("variance: ", data_b.radius_mean.var())
print("standard deviation (std): ", data_b.radius_mean.std())
print("describe method: ", data_b.radius_mean.describe())
# ----------------
mean: 12.14652380952381
variance: 3.170221722043872
standard deviation (std): 1.7805116461410389
describe method: count 357.000000
mean 12.146524
std 1.780512
min 6.981000
25% 11.080000
50% 12.200000
75% 13.370000
max 17.850000
Name: radius_mean, dtype: float64
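One detail worth knowing when reading these numbers: pandas' `var()` and `std()` default to the sample estimators (`ddof=1`), while NumPy's `np.var` and `np.std` default to the population versions (`ddof=0`). A short sketch on made-up values (the series `s` is purely illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = s.var()                    # pandas default: ddof=1, divides by n - 1
population_var = np.var(s.to_numpy())   # NumPy default: ddof=0, divides by n
sample_std = s.std()                    # square root of the sample variance

print("sample variance:", sample_var)
print("population variance:", population_var)
print("sample std:", sample_std)
```

The standard deviation is always the square root of the matching variance, which is also true of the variance and std values printed above for `radius_mean`.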
Analysis 5: CDF analysis (cumulative distribution function)
The cumulative distribution function (CDF) gives the probability that a variable takes a value less than or equal to x: P(X <= x)
In [15]:
plt.hist(data_b.radius_mean,
         bins=50,
         fc=(0, 1, 0, 0.5),
         label="Benign",
         density=True,  # replaces `normed`, which recent Matplotlib releases no longer accept
         cumulative=True
         )
data_sorted = np.sort(data_b.radius_mean)
y = np.arange(len(data_sorted)) / float(len(data_sorted) - 1)
plt.title("CDF of Benign Tumor Radius Mean")
plt.plot(data_sorted, y, color="blue")
plt.show()
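To make the definition P(X <= x) concrete, here is a minimal sketch of an empirical CDF on toy data. The `ecdf` helper is a hypothetical name. It uses the `i / n` convention, which matches the definition directly, whereas the plotting cell above scales the heights from 0 to 1 with `i / (n - 1)`.

```python
import numpy as np


def ecdf(values):
    """Return sorted values x and the fraction of observations <= each x."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


x, y = ecdf([3.0, 1.0, 2.0, 4.0])

# Empirical P(X <= 2.0): count the sorted values that are <= 2.0, divide by n
p_le_2 = np.searchsorted(x, 2.0, side="right") / len(x)
print(x, y, p_le_2)
```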
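Analysis 6: Effect size
Effect size describes how large the difference between two groups is. The larger the value, the more pronounced the difference.
A common rule of thumb:
- <0.2: small effect
- [0.2, 0.8]: medium effect
- >0.8: large effect
Here we measure the difference in radius_mean between benign and malignant tumors.
In [16]:
diff = data_m.radius_mean.mean() - data_b.radius_mean.mean()
var_b = data_b.radius_mean.var()
var_m = data_m.radius_mean.var()
# pooled variance, weighted by group size
var = (len(data_b) * var_b + len(data_m) * var_m) / float(len(data_b) + len(data_m))
effect_size = diff / np.sqrt(var)
print("Effect Size: ", effect_size)
Effect Size: 2.2048585165041428
The effect is clearly large, well above 0.8. This agrees with the earlier conclusion that the mean radius differs substantially between benign and malignant tumors.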
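The calculation can be packaged as a function and checked on numbers small enough to verify by hand. This is a sketch: the `effect_size` helper is hypothetical and follows the cell's weighting (each sample variance weighted by its group size n), whereas the textbook Cohen's d weights by n - 1 and divides by n1 + n2 - 2.

```python
import numpy as np


def effect_size(group_a, group_b):
    """Standardized mean difference, pooling variances weighted by group size."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    diff = a.mean() - b.mean()
    var_a = a.var(ddof=1)  # sample variance, like pandas' .var()
    var_b = b.var(ddof=1)
    pooled_var = (len(a) * var_a + len(b) * var_b) / (len(a) + len(b))
    return diff / np.sqrt(pooled_var)


# Means are 4 and 3, both sample variances are 4, so d = 1 / sqrt(4) = 0.5
d = effect_size([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
print("effect size:", d)
```

The sign only reflects the order of the arguments: swapping the two groups flips it.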
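Analysis 7: Relationships between pairs of variables
Two variables
A scatter plot combined with marginal histograms shows the relationship:
In [17]:
# jointplot creates its own figure, so it is sized with `height`;
# recent seaborn releases require x and y as keyword arguments
sns.jointplot(x="radius_mean",
              y="area_mean",
              data=df,
              kind="reg",
              height=10)
plt.show()
The two features are positively correlated.
Multiple variables
In [18]:
sns.set(style="white")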
df1 = df.loc[:,["radius_mean","area_mean","fractal_dimension_se"]]
g = sns.PairGrid(df1,diag_sharey = False,)
g.map_lower(sns.kdeplot,cmap="Blues_d")
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot,lw =3)
plt.show()
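Analysis 8: Correlation analysis with a heatmap
In [19]:
# correlation matrix of the numeric columns (diagnosis is a string label, so drop it first)
corr = df.drop(columns="diagnosis").corr()
f, ax = plt.subplots(figsize=(18, 8))
sns.heatmap(corr,  # correlation coefficients
            annot=True,
            linewidths=0.5,
            fmt=".1f",
            ax=ax
            )
# rotation of the tick labels
plt.xticks(rotation=90)
plt.yticks(rotation=0)
# title
plt.title('Correlation Map')
# save the figure
plt.savefig('graph.png')
plt.show()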
Analysis 9: Covariance
Covariance measures how two variables vary together:
- If they tend to move in the same direction, the covariance is positive
- If they are orthogonal (no linear relationship), the covariance is zero
- If they move in opposite directions, the covariance is negative
In [20]:
# covariance matrix
np.cov(df.radius_mean, df.area_mean)
Out[20]:
array([[1.24189201e+01, 1.22448341e+03],
       [1.22448341e+03, 1.23843554e+05]])
In [21]:
# covariance of the two variables
df.radius_mean.cov(df.area_mean)
Out[21]:
1224.483409346457
In [22]:
# covariance of the two variables
df.radius_mean.cov(df.fractal_dimension_se)
Out[22]:
-0.0003976248576440629
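The three cases in the list above can be reproduced on toy arrays (all values below are made up for illustration). Note that `np.cov` returns the full 2x2 covariance matrix, so the covariance between the two inputs is the off-diagonal entry `[0, 1]`.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
same_direction = 2 * x + 1          # rises whenever x rises
opposite_direction = -3 * x         # falls whenever x rises
no_linear_relation = np.array([1.0, -1.0, 0.0, -1.0, 1.0])  # symmetric around the middle of x

cov_same = np.cov(x, same_direction)[0, 1]
cov_opposite = np.cov(x, opposite_direction)[0, 1]
cov_none = np.cov(x, no_linear_relation)[0, 1]

print(cov_same, cov_opposite, cov_none)
```

The magnitude of a covariance depends on the units of both variables, which is why the radius/area covariance above is so much larger in absolute value than the one involving `fractal_dimension_se`. Correlation coefficients remove that dependence.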
Analysis 10: Pearson correlation
Given two arrays A and B, the Pearson correlation coefficient is defined as:
$$\mathrm{Pearson}(A, B) = \frac{\mathrm{cov}(A, B)}{\mathrm{std}(A) \cdot \mathrm{std}(B)}$$
In [23]:
p1 = df.loc[:, ["area_mean", "radius_mean"]].corr(method="pearson")
p2 = df.radius_mean.cov(df.area_mean) / (df.radius_mean.std() * df.area_mean.std())
print('Pearson Correlation Metric: \n', p1)
Pearson Correlation Metric:
             area_mean  radius_mean
area_mean     1.000000     0.987357
radius_mean   0.987357     1.000000
In [24]:
print('Pearson Correlation Value: \n', p2)
Pearson Correlation Value:
0.9873571700566132
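The definition can be checked by hand on a small example and compared with NumPy's built-in `np.corrcoef`. The arrays `a` and `b` are toy values chosen so the arithmetic is easy to follow.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Covariance and standard deviations must use the same ddof (here 1)
cov_ab = np.cov(a, b)[0, 1]
pearson_manual = cov_ab / (a.std(ddof=1) * b.std(ddof=1))

# Built-in Pearson correlation for comparison
pearson_numpy = np.corrcoef(a, b)[0, 1]

print(pearson_manual, pearson_numpy)
```

Because the same normalization appears in the numerator and the denominator, the result does not depend on whether `ddof=0` or `ddof=1` is used, as long as it is used consistently.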
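Analysis 11: Spearman's rank correlation
Spearman's rank correlation measures the correlation between the ranks of two variables rather than between their raw values.
The Pearson correlation coefficient assumes that the relationship between the variables is linear and that the variables are roughly normally distributed.
When the data contain outliers, or the variables are not normally distributed, the Pearson coefficient is best avoided.
Here we use Spearman's rank correlation coefficient instead.
In [25]:
# method="spearman" ranks the data internally, so an explicit df.rank() is not needed
spearman_corr = df.loc[:, ["area_mean", "radius_mean"]].corr(method="spearman")
spearman_corr  # Spearman correlation matrix
Out[25]:
| | area_mean | radius_mean |
---|---|---|
area_mean | 1.000000 | 0.999602 |
radius_mean | 0.999602 | 1.000000 |
Comparing the Pearson and Spearman coefficients:
- On this data, the Spearman correlation (0.9996) is slightly higher than the Pearson correlation (0.9874)
- When the data contain outliers, the Spearman coefficient is more robust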
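The robustness claim is easy to see on a toy example with a single extreme value. The sketch also shows that the Spearman coefficient equals the Pearson coefficient computed on the ranks. The DataFrame `data` is made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0, 1000.0],  # last point is an extreme outlier
})

pearson = data.corr(method="pearson").loc["x", "y"]
spearman = data.corr(method="spearman").loc["x", "y"]

# Spearman is Pearson applied to the ranks of the data
spearman_via_ranks = data.rank().corr(method="pearson").loc["x", "y"]

print("Pearson:", pearson)
print("Spearman:", spearman)
print("Pearson on ranks:", spearman_via_ranks)
```

The relationship is perfectly monotonic, so the Spearman coefficient stays at 1 even though the outlier pulls the Pearson coefficient well below 1.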