Python and R are the two workhorses of data analysis. This post walks through a case study I worked through while learning Python.
Step 1: set the working directory
# encoding: utf8
import os
# Raw string, so the backslashes in the Windows path are not read as escape sequences
os.chdir(r"G:\Anaconda3\Scripts\lecture01\Feature_engineering_and_model_tuning\Feature-engineering_and_Parameter_Tuning_XGBoost")
Step 2: load the packages
import pandas as pd
import numpy as np
%matplotlib inline
Step 3: load the data
# Load the data:
train = pd.read_csv('Train.csv',encoding = "ISO-8859-1")
test = pd.read_csv('Test.csv',encoding = "ISO-8859-1")
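Why pass `encoding` at all? Here is a minimal sketch on a made-up two-row sample (the rows and the `raw_bytes` name are my own illustration, not taken from Train.csv). It assumes the file contains bytes that are valid ISO-8859-1 but not valid UTF-8, which is the usual reason for setting this parameter.

```python
import io

import pandas as pd

# Hypothetical sample: "é" is the single byte 0xE9 in ISO-8859-1.
raw_bytes = "ID,Employer_Name\nID001,Café Corp\nID002,Acme Ltd\n".encode("ISO-8859-1")

# Reading with the matching encoding recovers the original text.
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")
print(sample)

# Reading the same bytes as UTF-8 fails: 0xE9 followed by a space is not valid UTF-8.
try:
    pd.read_csv(io.BytesIO(raw_bytes), encoding="utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print("UTF-8 decoding failed:", utf8_failed)
```

If a file loads cleanly without the parameter, there is no need to set it.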
Step 4: inspect the data
- Dimensions
train.shape, test.shape
((87020, 26), (37717, 24))
- Data types
# Take a look at the basic structure of the data
train.dtypes
ID                        object
Gender                    object
City                      object
Monthly_Income            int64
DOB                       object
Lead_Creation_Date        object
Loan_Amount_Applied       float64
Loan_Tenure_Applied       float64
Existing_EMI              float64
Employer_Name             object
Salary_Account            object
Mobile_Verified           object
Var5                      int64
Var1                      object
Loan_Amount_Submitted     float64
Loan_Tenure_Submitted     float64
Interest_Rate             float64
Processing_Fee            float64
EMI_Loan_Submitted        float64
Filled_Form               object
Device_Type               object
Var2                      object
Source                    object
Var4                      int64
LoggedIn                  int64
Disbursed                 int64
dtype: object
- Preview the data
# Look at the first 5 rows
train.head(5)
- Merge the datasets
# Combine train and test into a single DataFrame
train['source']= 'train'
test['source'] = 'test'
data=pd.concat([train, test],ignore_index=True)
data.shape
(124737, 27)
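The merge step can be tried on a toy example. The frames below are made up for illustration (`toy_train`, `toy_test` and their values are hypothetical). The idea is the same as above: tag each row with its origin, stack the two frames, and split them back later using the tag. Columns that exist only in the training set, such as the target `Disbursed`, come out as NaN for the test rows.

```python
import pandas as pd

# Hypothetical toy frames: the test set has no target column.
toy_train = pd.DataFrame({
    "ID": ["a", "b", "c"],
    "Monthly_Income": [100, 200, 300],
    "Disbursed": [0, 1, 0],
})
toy_test = pd.DataFrame({
    "ID": ["d", "e"],
    "Monthly_Income": [150, 250],
})

# Tag each row with its origin before stacking.
toy_train["source"] = "train"
toy_test["source"] = "test"
toy_data = pd.concat([toy_train, toy_test], ignore_index=True)

print(toy_data.shape)                        # (5, 4)
print(toy_data["Disbursed"].isnull().sum())  # 2, one per test row

# After shared preprocessing, split back using the tag.
train_again = toy_data[toy_data["source"] == "train"].drop(columns="source")
test_again = toy_data[toy_data["source"] == "test"].drop(columns=["source", "Disbursed"])
```

This is also why `Disbursed` and `LoggedIn` each show 37717 missing values in the combined data: that is exactly the number of test rows.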
- Check for anomalies
- Missing values
data.apply(lambda x: sum(x.isnull()))
City                      1401
DOB                       0
Device_Type               0
Disbursed                 37717
EMI_Loan_Submitted        84901
Employer_Name             113
Existing_EMI              111
Filled_Form               0
Gender                    0
ID                        0
Interest_Rate             84901
Lead_Creation_Date        0
Loan_Amount_Applied       111
Loan_Amount_Submitted     49535
Loan_Tenure_Applied       111
Loan_Tenure_Submitted     49535
LoggedIn                  37717
Mobile_Verified           0
Monthly_Income            0
Processing_Fee            85346
Salary_Account            16801
Source                    0
Var1                      0
Var2                      0
Var4                      0
Var5                      0
source                    0
dtype: int64
- Check the distinct values in each column
var = ['Gender','Salary_Account','Mobile_Verified','Var1','Filled_Form','Device_Type','Var2','Source']
for v in var:
    print('\n%s: distinct values and their counts\n' % v)
    print(data[v].value_counts())
Gender: distinct values and their counts
Male      71398
Female    53339
Name: Gender, dtype: int64
Salary_Account: distinct values and their counts
HDFC Bank                                          25180
ICICI Bank                                         19547
State Bank of India                                17110
Axis Bank                                          12590
Citibank                                           3398
Kotak Bank                                         2955
IDBI Bank                                          2213
Punjab National Bank                               1747
Bank of India                                      1713
Bank of Baroda                                     1675
Standard Chartered Bank                            1434
Canara Bank                                        1385
Union Bank of India                                1330
Yes Bank                                           1120
ING Vysya                                          996
Corporation bank                                   948
Indian Overseas Bank                               901
State Bank of Hyderabad                            854
Indian Bank                                        773
Oriental Bank of Commerce                          761
IndusInd Bank                                      711
Andhra Bank                                        706
Central Bank of India                              648
Syndicate Bank                                     614
Bank of Maharasthra                                576
HSBC                                               474
State Bank of Bikaner & Jaipur                     448
Karur Vysya Bank                                   435
State Bank of Mysore                               385
Federal Bank                                       377
Vijaya Bank                                        354
Allahabad Bank                                     345
UCO Bank                                           344
State Bank of Travancore                           333
Karnataka Bank                                     279
United Bank of India                               276
Dena Bank                                          268
Saraswat Bank                                      265
State Bank of Patiala                              263
South Indian Bank                                  223
Deutsche Bank                                      176
Abhyuday Co-op Bank Ltd                            161
The Ratnakar Bank Ltd                              113
Tamil Nadu Mercantile Bank                         103
Punjab & Sind bank                                 84
J&K Bank                                           78
Lakshmi Vilas bank                                 69
Dhanalakshmi Bank Ltd                              66
State Bank of Indore                               32
Catholic Syrian Bank                               27
India Bulls                                        21
B N P Paribas                                      15
Firstrand Bank Limited                             11
GIC Housing Finance Ltd                            10
Bank of Rajasthan                                  8
Kerala Gramin Bank                                 4
Industrial And Commercial Bank Of China Limited    3
Ahmedabad Mercantile Cooperative Bank              1
Name: Salary_Account, dtype: int64
Mobile_Verified: distinct values and their counts
Y    80928
N    43809
Name: Mobile_Verified, dtype: int64
Var1: distinct values and their counts
HBXX    84901
HBXC    12952
HBXB    6502
HAXA    4214
HBXA    3042
HAXB    2879
HBXD    2818
HAXC    2171
HBXH    1387
HCXF    990
HAYT    710
HAVC    570
HAXM    386
HCXD    348
HCYS    318
HVYS    252
HAZD    161
HCXG    114
HAXF    22
Name: Var1, dtype: int64
Filled_Form: distinct values and their counts
N    96740
Y    27997
Name: Filled_Form, dtype: int64
Device_Type: distinct values and their counts
Web-browser    92105
Mobile         32632
Name: Device_Type, dtype: int64
Var2: distinct values and their counts
B    53481
G    47338
C    20366
E    1855
D    918
F    770
A    9
Name: Var2, dtype: int64
Source: distinct values and their counts
S122    55249
S133    42900
S159    7999
S143    6140
S127    2804
S137    2450
S134    1900
S161    1109
S151    1018
S157    929
S153    705
S144    447
S156    432
S158    294
S123    112
S141    83
S162    60
S124    43
S150    19
S160    11
S136    5
S138    5
S155    5
S139    4
S129    4
S135    2
S142    1
S140    1
S154    1
S125    1
S130    1
S126    1
S132    1
S131    1
Name: Source, dtype: int64
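The two checks above (missing values per column, then counts per distinct value) can be reproduced on a small made-up frame. The `toy` data is hypothetical. The sketch also shows that `isnull().sum()` gives the same counts as the `apply`/`lambda` form, and that `value_counts()` leaves out missing values unless you pass `dropna=False`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Salary_Account": ["HDFC Bank", None, "HDFC Bank", "ICICI Bank"],
    "Existing_EMI": [0.0, np.nan, 1500.0, np.nan],
})

# Two equivalent ways to count missing values per column.
null_counts_apply = toy.apply(lambda col: sum(col.isnull()))
null_counts_direct = toy.isnull().sum()
print(null_counts_direct)

# value_counts() skips missing values by default...
account_counts = toy["Salary_Account"].value_counts()
# ...and includes them as their own entry with dropna=False.
account_counts_with_nan = toy["Salary_Account"].value_counts(dropna=False)
print(account_counts_with_nan)
```

So the 16801 missing `Salary_Account` entries do not appear anywhere in the per-bank counts listed above.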
- Single-feature analysis
Count the distinct values in the field
# Handle the City field
len(data['City'].unique())
Drop the column
data.drop('City',axis=1,inplace=True)
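A small sketch of the same two operations on made-up data (the `toy` frame and its city names are hypothetical). One detail to keep in mind: `unique()` counts a missing value as one of the distinct values, while `nunique()` ignores it by default, so the two can differ by one on a column with gaps such as `City`.

```python
import pandas as pd

# Hypothetical toy frame with one missing city.
toy = pd.DataFrame({
    "City": ["Delhi", "Mumbai", None, "Delhi"],
    "Monthly_Income": [100, 200, 300, 400],
})

n_with_missing = len(toy["City"].unique())  # missing value counted as a distinct value
n_without_missing = toy["City"].nunique()   # missing values ignored by default
print(n_with_missing, n_without_missing)    # 3 2

# Non-inplace equivalent of data.drop('City', axis=1, inplace=True)
toy = toy.drop(columns="City")
print(list(toy.columns))                    # ['Monthly_Income']
```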
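- Fill missing values with the median
# Fill the missing values with the median (only a few values are missing)
# Assign the result back: calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
data['Loan_Amount_Applied'] = data['Loan_Amount_Applied'].fillna(data['Loan_Amount_Applied'].median())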
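To see what median filling does, here is a minimal sketch on a made-up column (the values are hypothetical). `median()` skips missing values when computing the statistic, so the fill value comes only from the observed entries.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two gaps.
toy = pd.DataFrame({
    "Loan_Amount_Applied": [100000.0, np.nan, 300000.0, 500000.0, np.nan],
})

# Median of the observed values 100000, 300000, 500000.
median_value = toy["Loan_Amount_Applied"].median()
print(median_value)  # 300000.0

# Assign the filled column back to the frame.
toy["Loan_Amount_Applied"] = toy["Loan_Amount_Applied"].fillna(median_value)
print(toy["Loan_Amount_Applied"].tolist())
```

The median is less affected by a few extreme loan amounts than the mean, which makes it a reasonable default for a skewed column with only a handful of gaps.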
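- One-hot encode the dataset
# var_to_encode is the list of categorical column names to encode (its definition is not shown in this post)
data = pd.get_dummies(data, columns=var_to_encode)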
data.columns
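Since the column list is not shown here, this is a minimal sketch of what `pd.get_dummies` does with a `columns` argument. The `toy` frame and the `cols_to_encode` list are hypothetical stand-ins, not the actual list used in this case study.

```python
import pandas as pd

# Hypothetical toy frame: two categorical columns, one numeric.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Device_Type": ["Mobile", "Web-browser", "Mobile"],
    "Monthly_Income": [100, 200, 300],
})

# Hypothetical list of columns to encode.
cols_to_encode = ["Gender", "Device_Type"]

# Each listed column is replaced by one indicator column per distinct value;
# columns that are not listed are kept unchanged.
encoded = pd.get_dummies(toy, columns=cols_to_encode)
print(list(encoded.columns))
```

Each new column is named `<original column>_<value>`, for example `Gender_Male`. A column with many distinct values adds many indicator columns, so it is worth checking the value counts first.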