Predicting income
Predicting the income range with financial data
Introduction to Financial Data and Overview of Predictive Models
Predicting customer income ranges is one of the most important problems in financial data analysis.
Before we get into the analysis, let’s point out two things.
Properties of financial data
Financial data mainly has the following characteristics:

- 1) Combination of heterogeneous data: sources, formats, scales, and so on differ across datasets.
- 2) Skewness of distributions: when predictions and the true answers are far apart, the trained model can end up heavily biased.
- 3) Ambiguity of classification labels: income brackets, credit ratings, product types, etc. embed business logic, so the class boundaries are somewhat arbitrary, and the analyst's interpretation matters.
- 4) Multicollinearity of variables: variables may be strongly interdependent or correlated (see the sketch after this list).
- 5) Nonlinearity of variables: the influence of a variable may not be linear, e.g., what is the effect of age on income?

In addition, data may be incomplete (missing, truncated, censored) due to other practical limitations such as regulation, collection, and storage.
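For instance, multicollinearity (item 4) can be checked quickly with a correlation matrix once a dataset is loaded. A minimal sketch, assuming a DataFrame named df is already in memory; the name and the 0.8 cutoff are illustrative, not part of the analysis below:

import pandas as pd

# Assuming a DataFrame `df` is already loaded: pairwise correlations of the
# numeric columns give a quick hint of multicollinearity.
corr = df.select_dtypes('number').corr()

# Show variable pairs whose absolute correlation exceeds 0.8 (an arbitrary threshold).
high = (corr.abs() > 0.8) & (corr.abs() < 1.0)
print(corr.where(high).stack())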
Multiclass classification and predicting income brackets
When there are three or more classes (also called labels or levels) to predict, it is called a multiclass classification problem; if a regression-style method is used, it is called multinomial logistic regression. The classes are assumed to be equivalent, with no hierarchical (inclusion) relationship between them.
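As a rough illustration of what multinomial logistic regression looks like in scikit-learn, here is a minimal sketch; the feature matrix X and label vector y are assumed to be already prepared and fully numeric, which is not yet the case at this point in the walkthrough:

from sklearn.linear_model import LogisticRegression

# Minimal multinomial (softmax) logistic regression sketch.
# X: an already-prepared numeric feature matrix, y: labels with three or more classes.
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
clf.fit(X, y)
print(clf.predict_proba(X[:5]))  # one probability per class for each of the first 5 rows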
Forecasting income brackets is a classic multiclass classification problem. Before analyzing, let's consider the following:
- 1) When the division between classes is not clear: how should the income brackets be defined, and how many classes should there be?
- 2) When there is an order among the classes: strictly speaking, each income bracket should be treated as an ordinal class.
- 3) When a specific class has too few observations: how do we handle the gap between the number of customers in the high-income bracket and the number in the middle-income bracket?
The multiclass classification problem has the following additional considerations compared to the binary classification problem.
- 1) Cautions when implementing the model: one-hot encoding of variables, choice of objective function, etc.
- 2) Cautions when interpreting the results: accuracy, F1 score, confusion matrix, etc. (see the sketch below).
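These metrics are all available in scikit-learn. A small sketch, assuming true labels y_test and model predictions y_pred already exist (they are produced only later in the workflow):

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Assuming true labels y_test and predictions y_pred are available.
print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred, average='macro'))  # macro F1 weights every class equally
print(confusion_matrix(y_test, y_pred))           # rows: true classes, columns: predicted classes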
Load data to predict
Introduction to data
- This topic uses data collected by the US Census Bureau and distributed by UCI as the US Adult Income dataset, with simulated variables added and modified by the instructor.
- The first dataset to be used is the US Adult Income dataset, and its columns are as follows.
- age: age
- workclass: type of employment
- education: education level
- education.num: education level (numerically coded)
- marital.status: marital status
- occupation: occupation
- relationship: family relationship
- race: race
- sex: sex
- capital.gain: capital gains
- capital.loss: capital losses
- hours.per.week: working hours per week
- income: income bracket
Data from: https://archive.ics.uci.edu/ml/datasets/adult
Import data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
datapath = 'https://github.com/mchoimis/financialML/raw/main/income/'
df = pd.read_csv(datapath + 'income.csv')
df.head()
| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
2 | 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |
Data preview
print(df.shape)
print(df.columns)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64
 5   marital.status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64
 11  capital.loss    32561 non-null  int64
 12  hours.per.week  32561 non-null  int64
 13  native.country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
Check Data
# Replace '?' placeholders with NaN
df[df == '?'] = np.nan
# Fill missing values with the mode of each column
for col in ['workclass', 'occupation', 'native.country']:
    df[col].fillna(df[col].mode()[0], inplace=True)
# Check the result
df.head()
| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 90 | Private | 77053 | HS-grad | 9 | Widowed | Prof-specialty | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
2 | 66 | Private | 186061 | Some-college | 10 | Widowed | Prof-specialty | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |
df.isnull().sum()
age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
income            0
dtype: int64
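As an optional sanity check (a small sketch; nothing later depends on it), we can confirm that no '?' placeholders remain and see which mode was used for each imputed column:

# Double-check: no '?' placeholders should remain after the replacement above.
print((df == '?').sum())

# For reference, the mode that was used to fill each imputed column.
for col in ['workclass', 'occupation', 'native.country']:
    print(col, df[col].mode()[0])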
Feature Engineering
Creating input features and target values
X = df.drop(['income', 'education', 'fnlwgt'], axis=1)
y = df['income']
X.head()
| | age | workclass | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 90 | Private | 9 | Widowed | Prof-specialty | Not-in-family | White | Female | 0 | 4356 | 40 | United-States |
1 | 82 | Private | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States |
2 | 66 | Private | 10 | Widowed | Prof-specialty | Unmarried | Black | Female | 0 | 4356 | 40 | United-States |
3 | 54 | Private | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States |
4 | 41 | Private | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States |
y.head()
0    <=50K
1    <=50K
2    <=50K
3    <=50K
4    <=50K
Name: income, dtype: object
Split the raw data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.head()
| | age | workclass | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country |
---|---|---|---|---|---|---|---|---|---|---|---|---|
32098 | 40 | State-gov | 13 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 20 | United-States |
25206 | 39 | Local-gov | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 38 | United-States |
23491 | 42 | Private | 10 | Never-married | Exec-managerial | Not-in-family | White | Female | 0 | 0 | 40 | United-States |
12367 | 27 | Local-gov | 9 | Never-married | Farming-fishing | Own-child | White | Male | 0 | 0 | 40 | United-States |
7054 | 38 | Federal-gov | 14 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States |
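Note that the income classes are imbalanced, so the class proportions can differ slightly between the two sets. Here is a variant of the same split using stratification, shown for reference only; the rest of this walkthrough keeps the unstratified split above:

# Optional variant (for reference only, not used below): stratify=y keeps the
# <=50K / >50K class proportions identical in the training and test sets.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
print(y_train_s.value_counts(normalize=True))
print(y_test_s.value_counts(normalize=True))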
Handling categorical variables
from sklearn.preprocessing import LabelEncoder
categorical = ['workclass', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
for feature in categorical:
    le = LabelEncoder()
    X_train[feature] = le.fit_transform(X_train[feature])
    X_test[feature] = le.transform(X_test[feature])
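One caveat with this loop: each LabelEncoder is fitted on X_train only, so le.transform(X_test[feature]) would raise a ValueError if the test set contained a category never seen in training (it happens not to with this split, but it can with new data). A sketch of one defensive variant, fitting each encoder on the union of both sets; it is shown for reference only and not applied here:

# Variant (for reference only): fit each encoder on the combined train and test
# values so that a category appearing only in the test set cannot break transform().
for feature in categorical:
    le = LabelEncoder()
    le.fit(pd.concat([X_train[feature], X_test[feature]]))
    X_train[feature] = le.transform(X_train[feature])
    X_test[feature] = le.transform(X_test[feature])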
Check the result of categorical variable processing
# Check the transformed categorical variable column (X_train)
X_train[categorical].head(3)
| | workclass | marital.status | occupation | relationship | race | sex | native.country |
---|---|---|---|---|---|---|---|
32098 | 6 | 2 | 3 | 5 | 4 | 0 | 38 |
25206 | 1 | 2 | 6 | 0 | 4 | 1 | 38 |
23491 | 3 | 4 | 3 | 1 | 4 | 0 | 38 |
# Checking the converted categorical variable column (X_test)
X_test[categorical].head(3)
| | workclass | marital.status | occupation | relationship | race | sex | native.country |
---|---|---|---|---|---|---|---|
22278 | 3 | 6 | 11 | 4 | 4 | 0 | 38 |
8950 | 3 | 4 | 5 | 3 | 4 | 0 | 38 |
7838 | 3 | 4 | 7 | 1 | 1 | 0 | 39 |
X_train[categorical].nunique()
workclass          8
marital.status     7
occupation        14
relationship       6
race               5
sex                2
native.country    41
dtype: int64
X_test[categorical].nunique()
workclass          8
marital.status     7
occupation        14
relationship       6
race               5
sex                2
native.country    40
dtype: int64
Note: Handling of categorical variables
Categorical variables can be handled in roughly two ways.

- Converting each class to a number (label encoding)
- One-hot encoding (dummy encoding)

In the case of financial data, categorical variables make up most of the dataset, so one-hot encoding can leave the majority of the entries equal to 0. When a high-dimensional dataset contains many uninformative values, the features are said to be sparse, and training efficiency may suffer.
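For comparison, one-hot encoding the same columns with pandas shows how quickly the dimensionality grows; a minimal sketch using the categorical list defined above:

# One-hot (dummy) encoding of the original categorical columns, for comparison.
dummies = pd.get_dummies(df[categorical])
print(dummies.shape)                  # far more columns than the label-encoded version
print((dummies == 0).mean().mean())   # overall share of zero entries, i.e. how sparse it is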
Scaling Features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns = X.columns)
# Check the original (unscaled) X_train data for comparison
X_train.head()
| | age | workclass | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country |
---|---|---|---|---|---|---|---|---|---|---|---|---|
32098 | 40 | 6 | 13 | 2 | 3 | 5 | 4 | 0 | 0 | 0 | 20 | 38 |
25206 | 39 | 1 | 9 | 2 | 6 | 0 | 4 | 1 | 0 | 0 | 38 | 38 |
23491 | 42 | 3 | 10 | 4 | 3 | 1 | 4 | 0 | 0 | 0 | 40 | 38 |
12367 | 27 | 1 | 9 | 4 | 4 | 3 | 4 | 1 | 0 | 0 | 40 | 38 |
7054 | 38 | 0 | 14 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 40 | 38 |
print(min(X_train['age']))
print(max(X_train['age']))
print(np.mean(X_train['age']))
print(np.var(X_train['age']))
print('\n')
print(min(X_test['age']))
print(max(X_test['age']))
print(np.mean(X_test['age']))
print(np.var(X_test['age']))
17
90
38.61429448929449
186.44402697680712


17
90
38.505476507319074
185.14136114309127
print(min(X_train_scaled['age']))
print(max(X_train_scaled['age']))
print(np.mean(X_train_scaled['age']))
print(np.var(X_train_scaled['age']))
print('\n')
print(min(X_test_scaled['age']))
print(max(X_test_scaled['age']))
print(np.mean(X_test_scaled['age']))
print(np.var(X_test_scaled['age']))
-1.5829486507307393
3.7632934651328265
1.7567165303651125e-16
1.0


-1.5829486507307393
3.7632934651328265
-0.007969414769866482
0.9930130996694361
Note: feature scalers provided by scikit-learn

- StandardScaler: the default choice; transforms each feature to mean 0 and standard deviation 1
- RobustScaler: similar to the above, but uses the median and interquartile range instead of the mean and standard deviation, which minimizes the influence of outliers
- MinMaxScaler: scales each feature so that its maximum is 1 and its minimum is 0
- Normalizer: normalizes each row (sample) rather than each feature (column), so that its Euclidean norm is 1
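All of these follow the same fit/transform pattern, so swapping one for another is a one-line change. A small sketch applying two of the alternatives to the already-encoded X_train, for illustration only:

from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Same fit/transform pattern as the StandardScaler used above; only the scaling rule changes.
mm = MinMaxScaler().fit(X_train)   # rescales each feature to the [0, 1] range
rb = RobustScaler().fit(X_train)   # centers on the median and scales by the IQR
print(mm.transform(X_train)[:2])
print(rb.transform(X_train)[:2])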
The reason for scaling is that training may not work properly when feature values are too large or too small. For classifiers where scale has a decisive effect (e.g., distance-based algorithms such as kNN), scaling is essential.
On the other hand, for some features it may be better to keep the original distribution. For example, if a feature whose values are concentrated in a narrow range is standardized to match the other distributions, small changes can be learned as large differences. Scaling can also be omitted when the classifier is not strongly affected by scale (e.g., tree-based ensemble algorithms), when performance is already acceptable, or when overfitting is less of a concern.
One thing to keep in mind is that scaling can strip the original values of their meaning. If the ultimate goal is not just to get the right answer but to interpret the model or apply it to other datasets later, losing the explanatory power of the original features can make the model harder to improve. Please keep this in mind as well.