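Python Data Analysis and Visualization: Titanic
This article covers the data processing and visualization parts of the Titanic survivor problem. For the machine learning part, see: Machine Learning 2: Exploring the Titanic Survivor Problem with KNN and Decision Trees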
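Contents
- Python Data Analysis and Visualization: Titanic
- Importing the Data
- Data Exploration
- Checking for Missing Values
- Exploring Relationships
- Passenger Class and Survival Rate
- Sex and Survival Rate
- Siblings/Spouses, Parents/Children, and Survival
- Visual Analysis
- Data Preprocessing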
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')  # apply the ggplot style sheet through pyplot
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC,LinearSVC
import warnings
warnings.filterwarnings("ignore")
Importing the Data
df = pd.read_csv('train.csv',index_col=None)
test_df = pd.read_csv('test.csv')
Data Exploration
Checking for Missing Values
np.any(pd.isnull(df))
True
np.any(df["Embarked"].isnull())
True
df.describe()
Statistic | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
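The boolean checks only report that something is missing somewhere; the count row of the summary hints at where, since Age has 714 of 891 values. As a minimal sketch of how the gaps could be located column by column, the snippet below counts nulls on a small made-up frame. The name `toy` and all of its values are hypothetical stand-ins for `df`, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv (hypothetical values, not the real data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()

# Keep only the columns that actually contain missing values
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

On the real data the same idea would be `df.isnull().sum()`.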
Exploring Relationships
Passenger Class and Survival Rate
df[['Pclass','Survived']].groupby(by=["Pclass"]).mean().sort_values(by="Survived",ascending=False)
Pclass | Survived |
---|---|
1 | 0.629630 |
2 | 0.472826 |
3 | 0.242363 |
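The table shows that survival rate falls as passenger class gets lower: about 63% in first class, 47% in second class, and 24% in third class.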
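The survival rates come from taking the mean of `Survived`, which works because the column is coded 0/1: the mean of such a column is the number of survivors divided by the number of passengers. A small sketch on made-up data (the frame `toy` is hypothetical) makes that explicit:

```python
import pandas as pd

# Toy data (hypothetical): 1 = survived, 0 = did not survive
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

grouped = toy.groupby("Pclass")["Survived"]

# Mean of a 0/1 column = share of survivors in each group
rate = grouped.mean()

# The same number computed explicitly as survivors / passengers
explicit = grouped.sum() / grouped.count()
print(rate)
```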
Sex and Survival Rate
df['Sex'].value_counts()
male 577
female 314
Name: Sex, dtype: int64
df[['Sex','Survived']].groupby(by=["Sex"]).mean().sort_values(by="Survived",ascending=False)
Sex | Survived |
---|---|
female | 0.742038 |
male | 0.188908 |
The table shows that women survived at a far higher rate than men (about 74% versus 19%), so sex is strongly related to survival.
Siblings/Spouses, Parents/Children, and Survival
df[['SibSp','Survived']].groupby(by=["SibSp"]).mean().sort_index(ascending=False)
SibSp | Survived |
---|---|
8 | 0.000000 |
5 | 0.000000 |
4 | 0.166667 |
3 | 0.250000 |
2 | 0.464286 |
1 | 0.535885 |
0 | 0.345395 |
df[['Parch','Survived']].groupby(by=["Parch"]).mean().sort_index(ascending=False)
Parch | Survived |
---|---|
6 | 0.000000 |
5 | 0.200000 |
4 | 0.000000 |
3 | 0.600000 |
2 | 0.500000 |
1 | 0.550847 |
0 | 0.343658 |
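Neither SibSp (siblings/spouses aboard) nor Parch (parents/children aboard) shows a strong, consistent relationship with survival, although passengers with one or two relatives aboard survived more often than those travelling alone.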
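A rate of 0.0 or 0.6 is hard to interpret without knowing how many passengers it is based on. One way to check, sketched here on made-up data (the frame `toy` and the column names `rate` and `n` are illustrative choices), is to report the group size next to the rate:

```python
import pandas as pd

# Toy data (hypothetical): one large group, one medium, one of size 1
toy = pd.DataFrame({
    "SibSp":    [0, 0, 0, 0, 1, 1, 1, 5],
    "Survived": [0, 1, 0, 0, 1, 1, 0, 0],
})

# Survival rate and number of passengers per SibSp value
summary = toy.groupby("SibSp")["Survived"].agg(rate="mean", n="count")
print(summary)
```

Here the 0.0 rate for `SibSp == 5` rests on a single passenger, so it says little on its own.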
Visual Analysis
- Age and survival rate
g1 = sns.FacetGrid(df,col="Survived")
g1.map(plt.hist,"Age",bins=20)
plt.show()
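The histograms can be complemented with numbers by cutting Age into bands and computing the survival rate per band. This is only a sketch: the data are made up, and the band edges in `bins` are an arbitrary choice for illustration, not something taken from the article.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), including one passenger with unknown age
toy = pd.DataFrame({
    "Age":      [2, 8, 25, 30, 40, 70, np.nan],
    "Survived": [1, 1, 0, 1, 0, 0, 1],
})

# Band edges are an assumption made for this example
bins = [0, 16, 32, 48, 64, 80]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins)

# Rows with a missing Age get no band and drop out of the groupby;
# observed=True hides bands that contain no passengers
band_rate = toy.groupby("AgeBand", observed=True)["Survived"].mean()
print(band_rate)
```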
- Ticket class and survival rate
g2 = sns.FacetGrid(df,col="Survived",row="Pclass")
g2.map(plt.hist,"Age")
plt.show()
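The plots show that ticket class is related to survival, consistent with the per-class survival rates computed earlier.
- Combined relationships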
g3 = sns.FacetGrid(df,row="Embarked",height=3,aspect=1.6)
g3.map(sns.pointplot,"Pclass",'Survived','Sex',palette="deep",order=[1,2,3],hue_order=["female","male"])  # fix the class order so every facet uses the same x positions
g3.add_legend()
plt.show()
g4 = sns.FacetGrid(df,row="Embarked",col="Survived",height=3,aspect=1.6)
g4.map(sns.barplot,'Sex','Fare',palette="deep",order=["female","male"])  # no hue is passed, so fix the x-axis order instead of hue_order
g4.add_legend()
plt.show()
- Fare is related to survival, and the port of embarkation is related to survival as well.
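The faceted plots can be cross-checked with a small table. As a sketch on made-up data (the frame `toy` is hypothetical), `pivot_table` gives the survival rate for each combination of port and sex:

```python
import pandas as pd

# Toy data (hypothetical): two ports, two passengers per port and sex
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "C", "C"],
    "Sex":      ["male", "male", "female", "female",
                 "male", "male", "female", "female"],
    "Survived": [0, 0, 1, 0, 0, 1, 1, 1],
})

# Rows: port of embarkation; columns: sex; cells: mean of Survived
table = toy.pivot_table(index="Embarked", columns="Sex",
                        values="Survived", aggfunc="mean")
print(table)
```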
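Data Preprocessing
- Analysing how deck (cabin) location affects survival is too complex to cover here, so the Ticket and Cabin columns are dropped.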
df = df.drop(["Ticket","Cabin"],axis=1)
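- Filling missing values
# Forward fill (same behaviour as method="pad"); assigning the result back avoids chained in-place modification
df['Age'] = df['Age'].ffill()
df["Embarked"] = df["Embarked"].ffill()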
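Forward fill (`pad` / `ffill`) copies the nearest earlier row's value into each gap, so the result depends on row order rather than on the distribution of the column. The sketch below shows that behaviour on a made-up series and, for comparison, a median fill. The median fill is a common alternative named here only for illustration; it is not what this article uses.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical) with gaps
age = pd.Series([22.0, np.nan, np.nan, 35.0, np.nan])

# Forward fill: each gap takes the last value seen above it
padded = age.ffill()

# Alternative for comparison: fill every gap with the median of the known values
median_filled = age.fillna(age.median())

print(padded.tolist())
print(median_filled.tolist())
```

Note that forward fill cannot fill a gap in the very first row, because there is no earlier value to copy.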
- Encoding categorical data
df["Embarked"]
0 S
1 C
2 S
3 S
4 S
..
886 S
887 S
888 S
889 C
890 Q
Name: Embarked, Length: 891, dtype: object
# Encode sex: female -> 1, male -> 0
df["Sex"] = df["Sex"].map({"female":1,"male":0}).astype(int)
# Alternative with the opposite encoding (1 for male, 0 for female):
#df.loc[df['Sex'] == 'male', 'Sex'] = 1
#df.loc[df['Sex'] == 'female', 'Sex'] = 0
# Encode port of embarkation: S -> 0, C -> 1, Q -> 2
df["Embarked"] = df["Embarked"].map({"S":0,"C":1,"Q":2}).astype(int)
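Mapping ports to 0, 1, 2 is compact, but it implies an ordering (S < C < Q) that the ports do not have. One possible alternative, sketched here on a made-up column and not used in this article, is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

# Toy column (hypothetical)
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Integer codes, as in the article: compact, but implies S < C < Q
codes = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# One-hot alternative: one 0/1 indicator column per port, no implied order
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)
print(one_hot)
```

Also note that `map` returns NaN for any value missing from the dictionary, in which case the following `.astype(int)` raises an error, so the mapping has to cover every category and run after the missing values are filled.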