

Predicting the income range with financial data

Introduction to Financial Data and Overview of Predictive Models

Predicting customer income ranges is one of the most important problems in financial data analysis.

Before we get into the analysis, let’s point out two things.

Properties of financial data

Financial data mainly has the following characteristics:

  • 1) Combination of heterogeneous data: sources, formats, and scales differ across the data

  • 2) Skewness of distributions: when predictions fall far from the true values, the learned model can end up heavily biased

  • 3) Ambiguity of classification labels: income brackets, credit ratings, product types, etc. embed business logic, so the class boundaries are somewhat arbitrary → the analyst’s interpretation matters

  • 4) Multicollinearity of variables: interdependence or correlation between variables may be strong

  • 5) Nonlinearity of variables: the influence of a variable may not be linear, e.g. what is the effect of age on income?

  • 6) Incomplete data: values may be missing, truncated, or censored due to practical limitations such as regulation, collection, and storage

Multiclass classification and prediction of income brackets

A problem with three or more classes (also called labels or levels) to predict is called a multiclass classification problem; when it is approached with a regression-style method, it is often called multinomial logistic regression. The classes are assumed to be on an equal footing, with no hierarchical (inclusion) relationship between them.

Forecasting income brackets is a classic multiclass classification problem. Before analyzing, let’s consider the following:

  • 1) When the divisions between classes are not clear: how should the income brackets be defined, and how many classes should there be? (see the sketch after this list)

  • 2) When there is an order among the classes: strictly speaking, each income bracket should be treated as an ordinal class.

  • 3) When a specific class has too few observations: how do you handle the gap between the number of customers in the high-income bracket and the number in the middle-income bracket?
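As a minimal sketch for points 1) and 2), ordered income brackets can be built from a continuous income column with pandas; the income_usd values and bracket edges below are hypothetical and not part of the dataset used later.

import pandas as pd

# Hypothetical continuous incomes in USD; the edges and number of classes are a business decision
income_usd = pd.Series([18000, 42000, 55000, 73000, 120000])
brackets = pd.cut(income_usd,
                  bins=[0, 30000, 60000, 100000, float('inf')],
                  labels=['low', 'lower-mid', 'upper-mid', 'high'],
                  ordered=True)
print(brackets.value_counts(sort=False))  # how many observations fall into each ordered class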

Compared to a binary classification problem, a multiclass classification problem brings the following additional considerations:

  • 1) Cautions when implementing the model: one-hot encoding of variables, choice of objective function, etc.

  • 2) Cautions when interpreting results: accuracy, F1 score, confusion matrix, etc. (a minimal evaluation sketch follows this list)
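As a minimal sketch for the interpretation point, scikit-learn provides these multiclass metrics directly; the y_true and y_pred arrays below are toy placeholders, not results from this dataset.

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels for three hypothetical income brackets
y_true = ['low', 'mid', 'high', 'mid', 'low', 'high', 'mid']
y_pred = ['low', 'mid', 'mid', 'mid', 'low', 'high', 'low']

print(accuracy_score(y_true, y_pred))                 # overall accuracy
print(f1_score(y_true, y_pred, average='macro'))      # unweighted mean of per-class F1
print(confusion_matrix(y_true, y_pred, labels=['low', 'mid', 'high']))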


Load data to predict

Introduction to data

  • This topic uses the US Adult Income dataset, collected by the US Census Bureau and distributed by UCI, with simulated variables added and modified by the instructor.

  • The first dataset to be used is the US Adult Income dataset, and its columns are as follows.

  • age: age

  • workclass: type of employment

  • education: education level

  • education.num: education level (numerically coded)

  • marital.status: marital status

  • occupation: occupation

  • relationship: family relationship

  • race: race

  • sex: sex

  • capital.gain: capital gains

  • capital.loss: capital losses

  • hours.per.week: working hours per week

  • income: income bracket

Data from: https://archive.ics.uci.edu/ml/datasets/adult


Import data

import numpy as np  
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
datapath = 'https://github.com/mchoimis/financialML/raw/main/income/'
df = pd.read_csv(datapath + 'income.csv')
df.head()
age workclass fnlwgt education education.num marital.status occupation relationship race sex capital.gain capital.loss hours.per.week native.country income
0 90 ? 77053 HS-grad 9 Widowed ? Not-in-family White Female 0 4356 40 United-States <=50K
1 82 Private 132870 HS-grad 9 Widowed Exec-managerial Not-in-family White Female 0 4356 18 United-States <=50K
2 66 ? 186061 Some-college 10 Widowed ? Unmarried Black Female 0 4356 40 United-States <=50K
3 54 Private 140359 7th-8th 4 Divorced Machine-op-inspct Unmarried White Female 0 3900 40 United-States <=50K
4 41 Private 264663 Some-college 10 Separated Prof-specialty Own-child White Female 0 3900 40 United-States <=50K

Data preview

print(df.shape)
print(df.columns)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64 
 5   marital.status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64 
 11  capital.loss    32561 non-null  int64 
 12  hours.per.week  32561 non-null  int64 
 13  native.country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

Check Data

# Replace '?' placeholders with NaN
df[df == '?'] = np.nan
# Fill missing values with the mode of each column
for col in ['workclass', 'occupation', 'native.country']:
    df[col].fillna(df[col].mode()[0], inplace=True)
# result
df.head()
age workclass fnlwgt education education.num marital.status occupation relationship race sex capital.gain capital.loss hours.per.week native.country income
0 90 Private 77053 HS-grad 9 Widowed Prof-specialty Not-in-family White Female 0 4356 40 United-States <=50K
1 82 Private 132870 HS-grad 9 Widowed Exec-managerial Not-in-family White Female 0 4356 18 United-States <=50K
2 66 Private 186061 Some-college 10 Widowed Prof-specialty Unmarried Black Female 0 4356 40 United-States <=50K
3 54 Private 140359 7th-8th 4 Divorced Machine-op-inspct Unmarried White Female 0 3900 40 United-States <=50K
4 41 Private 264663 Some-college 10 Separated Prof-specialty Own-child White Female 0 3900 40 United-States <=50K
df.isnull().sum()
age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
income            0
dtype: int64

Feature Engineering

Creating input features and target values

X = df.drop(['income', 'education', 'fnlwgt'], axis=1)
y = df['income']
X.head()
age workclass education.num marital.status occupation relationship race sex capital.gain capital.loss hours.per.week native.country
0 90 Private 9 Widowed Prof-specialty Not-in-family White Female 0 4356 40 United-States
1 82 Private 9 Widowed Exec-managerial Not-in-family White Female 0 4356 18 United-States
2 66 Private 10 Widowed Prof-specialty Unmarried Black Female 0 4356 40 United-States
3 54 Private 4 Divorced Machine-op-inspct Unmarried White Female 0 3900 40 United-States
4 41 Private 10 Separated Prof-specialty Own-child White Female 0 3900 40 United-States

y.head()
0    <=50K
1    <=50K
2    <=50K
3    <=50K
4    <=50K
Name: income, dtype: object

Divide the raw data into training and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.head()
age workclass education.num marital.status occupation relationship race sex capital.gain capital.loss hours.per.week native.country
32098 40 State-gov 13 Married-civ-spouse Exec-managerial Wife White Female 0 0 20 United-States
25206 39 Local-gov 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 38 United-States
23491 42 Private 10 Never-married Exec-managerial Not-in-family White Female 0 0 40 United-States
12367 27 Local-gov 9 Never-married Farming-fishing Own-child White Male 0 0 40 United-States
7054 38 Federal-gov 14 Married-civ-spouse Exec-managerial Husband White Male 0 0 40 United-States
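If the class imbalance mentioned earlier is a concern, one option (not used in the rest of this post) is to let train_test_split preserve the class proportions; a minimal sketch:

# Stratified alternative: keeps the ratio of income classes the same in train and test
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
print(y_train_s.value_counts(normalize=True))
print(y_test_s.value_counts(normalize=True))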

Handling categorical variables

from sklearn.preprocessing import LabelEncoder

categorical = ['workclass', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
# Fit each encoder on the training set and reuse it to transform the test set
for feature in categorical:
    le = LabelEncoder()
    X_train[feature] = le.fit_transform(X_train[feature])
    X_test[feature] = le.transform(X_test[feature])
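One caveat: because each encoder is fit on the training set only, le.transform() raises a ValueError if the test set contains a category never seen in training (the split above happens to be safe). A toy sketch of the failure mode:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['Local-gov', 'Private', 'State-gov'])   # categories seen in training
print(le.transform(['Private']))                # fine: [1]
# le.transform(['Never-worked'])                # would raise ValueError: unseen label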

Check the result of categorical variable processing

# Check the transformed categorical variable column (X_train)
X_train[categorical].head(3)
workclass marital.status occupation relationship race sex native.country
32098 6 2 3 5 4 0 38
25206 1 2 6 0 4 1 38
23491 3 4 3 1 4 0 38
# Checking the converted categorical variable column (X_test)

X_test[categorical].head(3)
workclass marital.status occupation relationship race sex native.country
22278 3 6 11 4 4 0 38
8950 3 4 5 3 4 0 38
7838 3 4 7 1 1 0 39
X_train[categorical].nunique()
workclass          8
marital.status     7
occupation        14
relationship       6
race               5
sex                2
native.country    41
dtype: int64
X_test[categorical].nunique()
workclass          8
marital.status     7
occupation        14
relationship       6
race               5
sex                2
native.country    40
dtype: int64

Note: Handling of categorical variables

Categorical variables can be handled in roughly two ways.

  • Convert each class to a number (label encoding)

  • One-hot encoding (dummy encoding)

In financial data, categorical variables often make up most of the columns, so after one-hot encoding the majority of the dataset may consist of zeros. When a high-dimensional dataset contains many uninformative values, the features are said to be sparse, and learning efficiency may suffer.
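A minimal sketch contrasting the two approaches on a single column (illustrative only; the pipeline above keeps the label-encoded version):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(['Private', 'State-gov', 'Private', 'Local-gov'], name='workclass')

# 1) Convert each class to a number: a single column of integer codes
print(LabelEncoder().fit_transform(s))          # [1 2 1 0]

# 2) One-hot encoding: one binary column per class, mostly zeros when there are many classes
print(pd.get_dummies(s, prefix='workclass'))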

Scaling Features

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()   
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns) 
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns = X.columns)
# Check X_train before scaling (compare with the scaled statistics below)
X_train.head()
age workclass education.num marital.status occupation relationship race sex capital.gain capital.loss hours.per.week native.country
32098 40 6 13 2 3 5 4 0 0 0 20 38
25206 39 1 9 2 6 0 4 1 0 0 38 38
23491 42 3 10 4 3 1 4 0 0 0 40 38
12367 27 1 9 4 4 3 4 1 0 0 40 38
7054 38 0 14 2 3 0 4 1 0 0 40 38
print(min(X_train['age']))
print(max(X_train['age']))
print(np.mean(X_train['age']))
print(np.var(X_train['age']))
print('\n')
print(min(X_test['age']))
print(max(X_test['age']))
print(np.mean(X_test['age']))
print(np.var(X_test['age']))
17
90
38.61429448929449
186.44402697680712


17
90
38.505476507319074
185.14136114309127
print(min(X_train_scaled['age']))
print(max(X_train_scaled['age']))
print(np.mean(X_train_scaled['age']))
print(np.var(X_train_scaled['age']))
print('\n')
print(min(X_test_scaled['age']))
print(max(X_test_scaled['age']))
print(np.mean(X_test_scaled['age']))
print(np.var(X_test_scaled['age']))
-1.5829486507307393
3.7632934651328265
1.7567165303651125e-16
1.0


-1.5829486507307393
3.7632934651328265
-0.007969414769866482
0.9930130996694361

Note: feature scalers provided by scikit-learn (a usage sketch follows the list below)

  • StandardScaler: the default choice; transforms each feature to have mean 0 and standard deviation 1

  • RobustScaler: similar to the above, but uses the median and interquartile range instead of the mean and standard deviation, minimizing the influence of outliers

  • MinMaxScaler: scales every feature so that its maximum is 1 and its minimum is 0

  • Normalizer: normalizes each row rather than each feature (column), adjusting the data so that its Euclidean norm is 1
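A minimal sketch of swapping the scalers listed above into the same scaling step (assuming the X_train built earlier; only StandardScaler is actually used in this post):

from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Drop-in replacements for StandardScaler in the step above
for Scaler in (MinMaxScaler, RobustScaler):
    scaled = Scaler().fit_transform(X_train)
    print(Scaler.__name__, scaled[:, 0].min(), scaled[:, 0].max())  # range of the scaled 'age' column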

The reason for scaling is that training may not work properly when the values of the data are too large or too small. For classifiers where the effect of scale is decisive (e.g. distance-based algorithms such as kNN), scaling is essential.

On the other hand, for some features it may be better to keep the original distribution. For example, if a feature whose values are concentrated in a narrow range is standardized so that all distributions look alike, tiny changes may be learned as large differences. Scaling can also be omitted when the classifier is not very sensitive to scale (e.g. tree-based ensemble algorithms), when the performance is already acceptable, or when overfitting is less of a concern.

One thing to keep in mind is that scaling can strip the original data of its meaning. If the ultimate goal is not just getting the answer but interpreting the model or applying it to other datasets later, losing the explanatory power of the original features can make the model harder to improve. Keep this trade-off in mind.
