Titanic with ML (preprocessing and SimpleImputer)
import numpy as np
import pandas as pd
train = pd.read_csv('https://bit.ly/fc-ml-titanic')
train.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
- PassengerId: passenger ID
- Survived: survival (1: survived, 0: died)
- Pclass: ticket class
- Name: passenger name
- Sex: gender
- Age: age
- SibSp: number of siblings and spouses aboard
- Parch: number of parents and children aboard
- Ticket: ticket number
- Fare: fare paid
- Cabin: cabin number
- Embarked: port of embarkation
Preprocessing: Split train / validation sets
- First, define the features and the label.
- Then split the data into train and validation sets in an appropriate proportion.
feature = ['Pclass', 'Sex', 'Age', 'Fare']
label = ['Survived']
train[feature].head()
| | Pclass | Sex | Age | Fare |
|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 7.2500 |
| 1 | 1 | female | 38.0 | 71.2833 |
| 2 | 3 | female | 26.0 | 7.9250 |
| 3 | 1 | female | 35.0 | 53.1000 |
| 4 | 3 | male | 35.0 | 8.0500 |
train[label].head()
| | Survived |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |
from sklearn.model_selection import train_test_split
- test_size: fraction of the data to allocate to the validation set (20% -> 0.2)
- shuffle: whether to shuffle the data before splitting (default True)
- random_state: random seed, fixed so the split is reproducible
x_train, x_valid, y_train, y_valid = train_test_split(train[feature], train[label], test_size = 0.2, shuffle = True, random_state = 30)
x_train.shape, y_train.shape
((712, 4), (712, 1))
x_valid.shape, y_valid.shape
((179, 4), (179, 1))
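The effect of the three parameters can be checked on a small table. The following is a minimal sketch, assuming a hypothetical 10-row DataFrame named `toy` and a helper `split` (neither is part of the original notebook); it shows that `test_size=0.2` sends 2 of 10 rows to validation and that reusing the same `random_state` reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the Titanic features and label.
toy = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3, 2, 2, 1, 3, 2],
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05, 13.0, 26.0, 30.0, 8.46, 21.0],
    'Survived': [0, 1, 1, 1, 0, 0, 1, 1, 0, 0],
})

def split(seed):
    # Illustrative helper: same call pattern as in the walkthrough above.
    return train_test_split(toy[['Pclass', 'Fare']], toy[['Survived']],
                            test_size=0.2, shuffle=True, random_state=seed)

x_tr_a, x_va_a, y_tr_a, y_va_a = split(30)
x_tr_b, x_va_b, y_tr_b, y_va_b = split(30)

print(x_tr_a.shape, x_va_a.shape)          # 8 training rows, 2 validation rows
print(x_va_a.index.equals(x_va_b.index))   # same seed gives the same rows
```

Fixing `random_state` matters mainly when comparing models, since every run then sees the same validation rows.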
Preprocessing: null
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Use pandas' isnull() to check for missing values.
Chaining sum() gives the number of missing values in each column at a glance.
train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
train['Age'].isnull().sum()
177
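Counts are easier to judge next to the share of rows they represent. As a small sketch on a hypothetical four-row DataFrame (`toy` is illustrative only), `isnull().mean()` gives the fraction of missing values per column, because the mean of a boolean column is its proportion of True values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in a numeric and a text column.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

missing_count = toy.isnull().sum()    # number of missing values per column
missing_ratio = toy.isnull().mean()   # fraction of missing values per column

print(missing_count)
print(missing_ratio)
```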
1. Handling missing values in numerical columns
train['Age'].fillna(0).describe()
count    891.000000
mean      23.799293
std       17.596074
min        0.000000
25%        6.000000
50%       24.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64
train['Age'].fillna(train['Age'].mean()).describe()
count    891.000000
mean      29.699118
std       13.002015
min        0.420000
25%       22.000000
50%       29.699118
75%       35.000000
max       80.000000
Name: Age, dtype: float64
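The two describe() outputs above differ because filling with 0 drags the mean and minimum down (mean 23.80, min 0), while filling with the column mean leaves the mean at 29.70. A minimal sketch of the same effect on a hypothetical five-value Series (`age` here is illustrative data, not the Titanic column):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
age = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan], name='Age')

filled_zero = age.fillna(0)            # missing values become 0
filled_mean = age.fillna(age.mean())   # missing values become the observed mean

print(age.mean())          # mean of the observed values only
print(filled_zero.mean())  # pulled toward 0 by the filled entries
print(filled_mean.mean())  # unchanged from the observed mean
```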
Use SimpleImputer to handle missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'mean')
imputer.fit(train[['Age', 'Pclass']])
SimpleImputer()
result = imputer.transform(train[['Age', 'Pclass']])
result
array([[22.        ,  3.        ],
       [38.        ,  1.        ],
       [26.        ,  3.        ],
       ...,
       [29.69911765,  3.        ],
       [26.        ,  1.        ],
       [32.        ,  3.        ]])
train[['Age', 'Pclass']] = result
train[['Age', 'Pclass']].isnull().sum()
Age       0
Pclass    0
dtype: int64
train[['Age', 'Pclass']].describe()
| | Age | Pclass |
|---|---|---|
| count | 891.000000 | 891.000000 |
| mean | 29.699118 | 2.308642 |
| std | 13.002015 | 0.836071 |
| min | 0.420000 | 1.000000 |
| 25% | 22.000000 | 2.000000 |
| 50% | 29.699118 | 3.000000 |
| 75% | 35.000000 | 3.000000 |
| max | 80.000000 | 3.000000 |
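The walkthrough above fits the imputer on the whole `train` DataFrame. Since the data was split into train and validation sets earlier, one common pattern (an assumption here, not something the original notebook does) is to fit on the training rows only and reuse the learned values on the validation rows. A minimal sketch with hypothetical frames `x_train_toy` and `x_valid_toy`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-ins for the train and validation features.
x_train_toy = pd.DataFrame({'Age': [20.0, np.nan, 40.0], 'Pclass': [3, 1, 2]})
x_valid_toy = pd.DataFrame({'Age': [np.nan, 50.0], 'Pclass': [1, 3]})

imputer = SimpleImputer(strategy='mean')
imputer.fit(x_train_toy)               # learn per-column means from training rows only

print(imputer.statistics_)             # the fill value learned for each column

train_filled = imputer.transform(x_train_toy)
valid_filled = imputer.transform(x_valid_toy)  # filled with the training mean
print(valid_filled)
```

The `statistics_` attribute holds the fill value for each column, which is a quick way to confirm what `fit` actually learned.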
train = pd.read_csv('https://bit.ly/fc-ml-titanic')
train[['Age', 'Pclass']].isnull().sum()
Age       177
Pclass      0
dtype: int64
imputer = SimpleImputer(strategy = 'median')
result = imputer.fit_transform(train[['Age', 'Pclass']])
train[['Age', 'Pclass']] = result
train[['Age', 'Pclass']].isnull().sum()
Age       0
Pclass    0
dtype: int64
train[['Age', 'Pclass']].describe()
| | Age | Pclass |
|---|---|---|
| count | 891.000000 | 891.000000 |
| mean | 29.361582 | 2.308642 |
| std | 13.019697 | 0.836071 |
| min | 0.420000 | 1.000000 |
| 25% | 22.000000 | 2.000000 |
| 50% | 28.000000 | 3.000000 |
| 75% | 35.000000 | 3.000000 |
| max | 80.000000 | 3.000000 |
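The choice between 'mean' and 'median' matters most when a column has outliers. The sketch below uses a hypothetical single-column array `ages` with one large value to show how far apart the two fill values can land; the numbers are illustrative and unrelated to the Titanic data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical ages: three similar values, one outlier, one missing entry.
ages = np.array([[20.0], [22.0], [24.0], [80.0], [np.nan]])

mean_filled = SimpleImputer(strategy='mean').fit_transform(ages)
median_filled = SimpleImputer(strategy='median').fit_transform(ages)

print(mean_filled[-1, 0])    # fill value pulled up by the outlier
print(median_filled[-1, 0])  # fill value near the bulk of the data
```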
2. Handling missing values in categorical columns
train = pd.read_csv('https://bit.ly/fc-ml-titanic')
train['Embarked'].fillna('S')
0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: Embarked, Length: 891, dtype: object
imputer = SimpleImputer(strategy = 'most_frequent')
result = imputer.fit_transform(train[['Embarked', 'Cabin']])
train[['Embarked', 'Cabin']] = result
train[['Embarked', 'Cabin']].isnull().sum()
Embarked    0
Cabin       0
dtype: int64
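With 'most_frequent', every missing entry in a column receives that column's mode. Embarked has only 2 missing values, but Cabin has 687 missing out of 891, so this fill overwrites most of that column with a single cabin value; whether that is acceptable depends on how the column is used. A minimal sketch on a hypothetical frame (`toy` and its values are illustrative) that inspects the chosen fill values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical columns; dtype=object keeps np.nan as the missing marker.
toy = pd.DataFrame({
    'Embarked': ['S', 'C', np.nan, 'S', 'Q'],
    'Cabin': [np.nan, 'C85', np.nan, 'C85', 'E46'],
}, dtype=object)

imputer = SimpleImputer(strategy='most_frequent')
filled = imputer.fit_transform(toy)   # returns a NumPy object array

print(imputer.statistics_)            # the mode chosen for each column
print(filled)
```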