Titanic with ML (preprocessing and SimpleImputer)

import numpy as np
import pandas as pd
train = pd.read_csv('https://bit.ly/fc-ml-titanic')
train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
  • PassengerId: passenger ID

  • Survived: survival status (1: survived, 0: died)

  • Pclass: ticket class

  • Name: passenger name

  • Sex: gender

  • Age: age

  • SibSp: number of siblings and spouses aboard

  • Parch: number of parents and children aboard

  • Ticket: ticket number

  • Fare: ticket fare

  • Cabin: cabin number

  • Embarked: port of embarkation

Preprocessing: Split train / validation sets

  1. First, define features and labels.

  2. After defining the features and labels, split the data into train and validation sets in an appropriate proportion.

feature = ['Pclass', 'Sex', 'Age', 'Fare']
label = ['Survived']
train[feature].head()
Pclass Sex Age Fare
0 3 male 22.0 7.2500
1 1 female 38.0 71.2833
2 3 female 26.0 7.9250
3 1 female 35.0 53.1000
4 3 male 35.0 8.0500
train[label].head()
Survived
0 0
1 1
2 1
3 1
4 0
from sklearn.model_selection import train_test_split
  • test_size: proportion of the data to allocate to the validation set (20% -> 0.2)

  • shuffle: shuffle option (default True)

  • random_state: random seed value

x_train, x_valid, y_train, y_valid = train_test_split(train[feature], train[label], test_size = 0.2, shuffle = True, random_state = 30)
x_train.shape, y_train.shape
((712, 4), (712, 1))
x_valid.shape, y_valid.shape
((179, 4), (179, 1))
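
To make the effect of test_size and random_state concrete, here is a minimal sketch on a made-up toy DataFrame (the values are hypothetical, not Titanic rows). It only illustrates that a fixed random_state reproduces the same split and that test_size = 0.2 on 10 rows yields 8 training rows and 2 validation rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the Titanic features and label.
toy = pd.DataFrame({
    'Pclass': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
    'Fare': [71.3, 13.0, 7.9, 53.1, 26.0, 8.1, 51.9, 21.1, 11.1, 30.1],
    'Survived': [1, 0, 1, 1, 0, 0, 0, 1, 0, 1],
})

x = toy[['Pclass', 'Fare']]
y = toy['Survived']

# Two splits with the same random_state put the same rows in each set.
x_train_a, x_valid_a, y_train_a, y_valid_a = train_test_split(
    x, y, test_size=0.2, shuffle=True, random_state=30)
x_train_b, x_valid_b, y_train_b, y_valid_b = train_test_split(
    x, y, test_size=0.2, shuffle=True, random_state=30)

print(x_train_a.shape, x_valid_a.shape)          # (8, 2) (2, 2)
print(x_valid_a.index.equals(x_valid_b.index))   # True
```

Changing random_state (or leaving it unset) would generally select different rows, which is why fixing it helps when comparing experiments.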

Preprocessing: missing values

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Use pandas' isnull() to check for missing values.

Chaining sum() onto it gives the number of missing values per column at a glance.

train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
train['Age'].isnull().sum()
177
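
Alongside the counts, the share of missing values per column can be useful when deciding how to handle each column. The sketch below uses a small made-up DataFrame (hypothetical values) and relies on the fact that the mean of a boolean mask is the fraction of True entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', 'S'],
})

missing_count = toy.isnull().sum()    # number of missing values per column
missing_ratio = toy.isnull().mean()   # fraction of missing values per column

print(missing_count)   # Age 2, Cabin 3, Embarked 0
print(missing_ratio)   # Age 0.5, Cabin 0.75, Embarked 0.0
```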

Reference: scikit-learn impute documentation

1. Handling missing values in numerical columns

train['Age'].fillna(0).describe()
count    891.000000
mean      23.799293
std       17.596074
min        0.000000
25%        6.000000
50%       24.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64
train['Age'].fillna(train['Age'].mean()).describe()
count    891.000000
mean      29.699118
std       13.002015
min        0.420000
25%       22.000000
50%       29.699118
75%       35.000000
max       80.000000
Name: Age, dtype: float64
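
The two describe() outputs above differ because filling with 0 pulls the mean down, while filling with the mean keeps the mean unchanged but shrinks the spread. A minimal sketch on a made-up Series (hypothetical ages) shows the same effect on a scale that can be checked by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value; the observed mean is 30.
ages = pd.Series([20.0, 30.0, np.nan, 40.0])

filled_zero = ages.fillna(0)            # missing value becomes 0
filled_mean = ages.fillna(ages.mean())  # missing value becomes 30

print(filled_zero.mean())   # 22.5, dragged down by the artificial 0
print(filled_mean.mean())   # 30.0, same as the observed mean
print(ages.std(), filled_mean.std())  # the spread gets smaller after mean filling
```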

Use SimpleImputer to handle missing values

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'mean')
imputer.fit(train[['Age', 'Pclass']])
SimpleImputer()
result = imputer.transform(train[['Age', 'Pclass']])
result
array([[22.        ,  3.        ],
       [38.        ,  1.        ],
       [26.        ,  3.        ],
       ...,
       [29.69911765,  3.        ],
       [26.        ,  1.        ],
       [32.        ,  3.        ]])
train[['Age', 'Pclass']] = result
train[['Age', 'Pclass']].isnull().sum()
Age       0
Pclass    0
dtype: int64
train[['Age', 'Pclass']].describe()
Age Pclass
count 891.000000 891.000000
mean 29.699118 2.308642
std 13.002015 0.836071
min 0.420000 1.000000
25% 22.000000 2.000000
50% 29.699118 3.000000
75% 35.000000 3.000000
max 80.000000 3.000000
# Reload the original data so that the missing values are back
train = pd.read_csv('https://bit.ly/fc-ml-titanic')
train[['Age', 'Pclass']].isnull().sum()
Age       177
Pclass      0
dtype: int64
imputer = SimpleImputer(strategy = 'median')
result = imputer.fit_transform(train[['Age', 'Pclass']])
train[['Age', 'Pclass']] = result
train[['Age', 'Pclass']].isnull().sum()
Age       0
Pclass    0
dtype: int64
train[['Age', 'Pclass']].describe()
Age Pclass
count 891.000000 891.000000
mean 29.361582 2.308642
std 13.019697 0.836071
min 0.420000 1.000000
25% 22.000000 2.000000
50% 28.000000 3.000000
75% 35.000000 3.000000
max 80.000000 3.000000
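
The examples above fit the imputer on the whole train DataFrame. When a validation set has been split off, as in the earlier step, one common practice is to fit the imputer on the training rows only and reuse the learned statistics for the validation rows, so that validation data does not influence the fill value. A minimal sketch, assuming two made-up toy frames (hypothetical values, not Titanic data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and validation features.
x_train = pd.DataFrame({'Age': [20.0, np.nan, 40.0, 60.0], 'Pclass': [3, 1, 3, 2]})
x_valid = pd.DataFrame({'Age': [np.nan, 80.0], 'Pclass': [1, 3]})

imputer = SimpleImputer(strategy='median')
imputer.fit(x_train)   # medians are learned from the training rows only

# transform() returns a NumPy array, so wrap it back into a DataFrame.
x_train_filled = pd.DataFrame(imputer.transform(x_train),
                              columns=x_train.columns, index=x_train.index)
x_valid_filled = pd.DataFrame(imputer.transform(x_valid),
                              columns=x_valid.columns, index=x_valid.index)

print(imputer.statistics_)   # per-column medians learned during fit
print(x_valid_filled)
```

The statistics_ attribute holds the fill value learned for each column, which is a quick way to confirm what the imputer will insert.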

2. Handling missing values in categorical columns

# Reload the original data so that the missing values are back
train = pd.read_csv('https://bit.ly/fc-ml-titanic')
train['Embarked'].fillna('S')
0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: Embarked, Length: 891, dtype: object
imputer = SimpleImputer(strategy = 'most_frequent')
result = imputer.fit_transform(train[['Embarked', 'Cabin']])
train[['Embarked', 'Cabin']] = result
train[['Embarked', 'Cabin']].isnull().sum()
Embarked    0
Cabin       0
dtype: int64
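
To see which value the 'most_frequent' strategy actually inserts, the fitted imputer can be inspected. The sketch below uses a small made-up DataFrame (hypothetical values) with a clear most frequent value in each column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with string columns and missing entries.
toy = pd.DataFrame({
    'Embarked': ['S', 'C', np.nan, 'S', 'Q'],
    'Cabin': [np.nan, 'C85', np.nan, 'C85', 'E46'],
})

imputer = SimpleImputer(strategy='most_frequent')
result = imputer.fit_transform(toy[['Embarked', 'Cabin']])

print(imputer.statistics_)   # the most frequent value found in each column

toy[['Embarked', 'Cabin']] = result
print(toy)
```

Note that in the Titanic data Cabin is missing in 687 of 891 rows, so this strategy would fill most of that column with a single cabin value; whether that is appropriate is a judgment call that depends on how the column is used later.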
