Titanic with ML (how to preprocess data and use SimpleImputer)

2022-07-15, 3 min read

[Notice] [ML_2]

import numpy as np
import pandas as pd

train = pd.read_csv('https://bit.ly/fc-ml-titanic')

train.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

PassengerId: passenger ID
Survived: survival (1: survived, 0: died)
Pclass: ticket class
Name: passenger name
Sex: gender
Age: age
SibSp: number of siblings and spouses aboard
Parch: number of parents and children aboard
Ticket: ticket number
Fare: ticket fare
Cabin: cabin number
Embarked: port of embarkation

Preprocessing: Split train / validation sets

First, define the feature columns and the label column.
Then split the data into a train set and a validation set in an appropriate proportion.

feature = ['Pclass', 'Sex', 'Age', 'Fare']

label = ['Survived']

train[feature].head()

	Pclass	Sex	Age	Fare
0	3	male	22.0	7.2500
1	1	female	38.0	71.2833
2	3	female	26.0	7.9250
3	1	female	35.0	53.1000
4	3	male	35.0	8.0500

train[label].head()

	Survived
0	0
1	1
2	1
3	1
4	0

from sklearn.model_selection import train_test_split

test_size: fraction of rows to allocate to the validation set (20% -> 0.2)
shuffle: whether to shuffle the rows before splitting (default True)
random_state: random seed; fixing it makes the split reproducible

x_train, x_valid, y_train, y_valid = train_test_split(train[feature], train[label], test_size = 0.2, shuffle = True, random_state = 30)

x_train.shape, y_train.shape

((712, 4), (712, 1))

x_valid.shape, y_valid.shape

((179, 4), (179, 1))
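The split sizes above follow from test_size = 0.2 applied to 891 rows. The sketch below reproduces them on a synthetic stand-in for the Titanic data, so it runs without downloading the CSV. The frame `demo` and its random values are assumptions made only for illustration; the `train_test_split` arguments are the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic frame (hypothetical values): 891 rows.
rng = np.random.default_rng(0)
n_rows = 891
demo = pd.DataFrame({
    'Pclass': rng.integers(1, 4, n_rows),
    'Sex': rng.choice(['male', 'female'], n_rows),
    'Age': rng.uniform(1, 80, n_rows),
    'Fare': rng.uniform(5, 100, n_rows),
    'Survived': rng.integers(0, 2, n_rows),
})

# Same call shape as in the post: 20% of the rows go to the validation set.
x_tr, x_va, y_tr, y_va = train_test_split(
    demo[['Pclass', 'Sex', 'Age', 'Fare']],
    demo[['Survived']],
    test_size=0.2, shuffle=True, random_state=30,
)

split_shapes = (x_tr.shape, x_va.shape, y_tr.shape, y_va.shape)
print(split_shapes)

# The two index sets do not overlap, and together they cover every row.
no_overlap = x_tr.index.intersection(x_va.index).empty
print(no_overlap, len(x_tr) + len(x_va))
```

Because 0.2 * 891 is not a whole number, the validation size is rounded up to 179 and the remaining 712 rows go to the train set.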

Preprocessing: missing values

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

To check for missing values, use pandas' isnull().

Chaining sum() onto it gives the number of missing values per column at a glance.

train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

train['Age'].isnull().sum()
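As a small illustration of why isnull().sum() works: isnull() returns a frame of True/False values, and sum() counts the True entries per column. The toy frame below is made up for this sketch and is not the Titanic data; mean() on the same mask gives the missing fraction instead of the count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing cells.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.93, 53.10],
})

missing_counts = toy.isnull().sum()   # number of missing cells per column
missing_ratio = toy.isnull().mean()   # fraction of missing cells per column

print(missing_counts)
print(missing_ratio)
```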

Reference: scikit-learn imputation documentation (sklearn.impute)

1. Handling missing values in numerical columns

train['Age'].fillna(0).describe()

count    891.000000
mean      23.799293
std       17.596074
min        0.000000
25%        6.000000
50%       24.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64

train['Age'].fillna(train['Age'].mean()).describe()

count    891.000000
mean      29.699118
std       13.002015
min        0.420000
25%       22.000000
50%       29.699118
75%       35.000000
max       80.000000
Name: Age, dtype: float64
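The two describe() outputs above differ because the fill value becomes part of the data. A minimal sketch on a made-up series (the values are assumptions chosen so the arithmetic is easy to follow) shows the same effect: filling with 0 drags the mean down, while filling with the mean keeps the mean but shrinks the spread.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan])

observed_mean = ages.mean()              # NaN is skipped: (20 + 30 + 40) / 3
zero_filled = ages.fillna(0)             # missing ages become 0
mean_filled = ages.fillna(observed_mean) # missing ages become the observed mean

print(observed_mean)       # mean of the observed values only
print(zero_filled.mean())  # pulled toward 0 by the filled entries
print(mean_filled.mean())  # unchanged relative to the observed mean
print(ages.std(), mean_filled.std())  # spread shrinks after mean filling
```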

Use SimpleImputer to handle missing values

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy = 'mean')

imputer.fit(train[['Age', 'Pclass']])

SimpleImputer()

result = imputer.transform(train[['Age', 'Pclass']])

result

array([[22.        ,  3.        ],
       [38.        ,  1.        ],
       [26.        ,  3.        ],
       ...,
       [29.69911765,  3.        ],
       [26.        ,  1.        ],
       [32.        ,  3.        ]])

train[['Age', 'Pclass']] = result

train[['Age', 'Pclass']].isnull().sum()

Age       0
Pclass    0
dtype: int64

train[['Age', 'Pclass']].describe()

	Age	Pclass
count	891.000000	891.000000
mean	29.699118	2.308642
std	13.002015	0.836071
min	0.420000	1.000000
25%	22.000000	2.000000
50%	29.699118	3.000000
75%	35.000000	3.000000
max	80.000000	3.000000
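The fit / transform pair used above separates two steps: fit() learns one statistic per column, and transform() fills missing cells with that learned statistic. The sketch below makes this visible on two tiny, hypothetical frames (`train_part` and `valid_part` are names invented for this example). One possible use of the separation, shown here as an assumption rather than as what the post does, is to fit on the train split and reuse the learned values on the validation split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train and validation pieces.
train_part = pd.DataFrame({'Age': [20.0, np.nan, 40.0], 'Pclass': [3, 1, 3]})
valid_part = pd.DataFrame({'Age': [np.nan, 50.0], 'Pclass': [2, 1]})

imputer = SimpleImputer(strategy='mean')
imputer.fit(train_part)                 # learns one mean per column
learned_means = imputer.statistics_     # array with the Age mean and the Pclass mean

# transform() fills missing cells with the means learned from train_part.
valid_filled = imputer.transform(valid_part)

print(learned_means)
print(valid_filled)
```

The missing Age in `valid_part` is filled with the mean computed from `train_part`, not with a mean recomputed on the validation rows.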

train = pd.read_csv('https://bit.ly/fc-ml-titanic')

train[['Age', 'Pclass']].isnull().sum()

Age       177
Pclass      0
dtype: int64

imputer = SimpleImputer(strategy = 'median')

result = imputer.fit_transform(train[['Age', 'Pclass']])

train[['Age', 'Pclass']] = result

train[['Age', 'Pclass']].isnull().sum()

Age       0
Pclass    0
dtype: int64

train[['Age', 'Pclass']].describe()

	Age	Pclass
count	891.000000	891.000000
mean	29.361582	2.308642
std	13.019697	0.836071
min	0.420000	1.000000
25%	22.000000	2.000000
50%	28.000000	3.000000
75%	35.000000	3.000000
max	80.000000	3.000000
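The median strategy gives a different fill value than the mean when a column is skewed. The sketch below uses a made-up column with one large value to show the gap. It also wraps the NumPy array returned by fit_transform() back into a DataFrame, which is one way (an assumption of this sketch, not the assignment used above) to keep the column names and index.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical skewed column with one missing value and a custom index.
toy = pd.DataFrame({'Age': [2.0, 4.0, np.nan, 6.0, 80.0]},
                   index=[10, 11, 12, 13, 14])

# Statistic each strategy would use as the fill value.
mean_value = SimpleImputer(strategy='mean').fit(toy).statistics_[0]
median_value = SimpleImputer(strategy='median').fit(toy).statistics_[0]

# fit_transform() returns a plain array, so rebuild the DataFrame around it.
filled = pd.DataFrame(
    SimpleImputer(strategy='median').fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(mean_value, median_value)
print(filled)
```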

2. Handling missing values in categorical columns

train = pd.read_csv('https://bit.ly/fc-ml-titanic')

train['Embarked'].fillna('S')

0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: Embarked, Length: 891, dtype: object

imputer = SimpleImputer(strategy = 'most_frequent')

result = imputer.fit_transform(train[['Embarked', 'Cabin']])

train[['Embarked', 'Cabin']] = result

train[['Embarked', 'Cabin']].isnull().sum()

Embarked    0
Cabin       0
dtype: int64
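With strategy = 'most_frequent', the imputer fills each column with its own most common value. The sketch below runs it on a small, hypothetical frame so the chosen values can be inspected through statistics_; the column names mirror the post, but the rows are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical columns with missing entries.
toy = pd.DataFrame({
    'Embarked': ['S', 'C', np.nan, 'S', 'Q'],
    'Cabin': [np.nan, 'C85', np.nan, 'C85', 'E46'],
}, dtype=object)

imputer = SimpleImputer(strategy='most_frequent')
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Most frequent value learned for each column.
modes = dict(zip(toy.columns, imputer.statistics_))

print(modes)
print(filled)
```

When most of a column is missing, as with Cabin above (687 of 891 rows), every filled row receives the same value, so it is worth checking whether the filled column still carries useful information.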
