Predicting income
Predicting the income range with financial data
Introduction to Financial Data and Overview of Predictive Models
Predicting customer income ranges is one of the most important problems in financial data analysis.
Before we get into the analysis, let’s point out two things.
Properties of financial data
Financial data mainly has the following characteristics:

- 1) Combination of heterogeneous data: sources, formats, scales, and so on differ across datasets.
- 2) Skewness of distributions: when predictions and the true answers are far apart, the trained model can end up heavily biased.
- 3) Ambiguity of classification labels: income brackets, credit ratings, product types, etc. embed business logic, so the class boundaries are somewhat arbitrary, and the analyst's interpretation matters.
- 4) Multicollinearity of variables: variables may be strongly interdependent or correlated (see the sketch after this list).
- 5) Nonlinearity of variables: the influence of a variable may not be linear, e.g., what is the effect of age on income?

In addition, data may be incomplete (missing, truncated, censored) due to other practical limitations such as regulation, collection, and storage.
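For instance, multicollinearity (item 4) can be checked quickly with a correlation matrix once a dataset is loaded. A minimal sketch, assuming a DataFrame named df is already in memory; the name and the 0.8 cutoff are illustrative, not part of the analysis below:

import pandas as pd

# Assuming a DataFrame `df` is already loaded: pairwise correlations of the
# numeric columns give a quick hint of multicollinearity.
corr = df.select_dtypes('number').corr()

# Show variable pairs whose absolute correlation exceeds 0.8 (an arbitrary threshold).
high = (corr.abs() > 0.8) & (corr.abs() < 1.0)
print(corr.where(high).stack())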
Multiclass classification and predicting income brackets
When there are three or more classes (also called labels or levels) to predict, it is called a multiclass classification problem; if a regression-style method is used, it is called multinomial logistic regression. The classes are assumed to be equivalent, with no hierarchical (inclusion) relationship between them.
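As a rough illustration of what multinomial logistic regression looks like in scikit-learn, here is a minimal sketch; the feature matrix X and label vector y are assumed to be already prepared and fully numeric, which is not yet the case at this point in the walkthrough:

from sklearn.linear_model import LogisticRegression

# Minimal multinomial (softmax) logistic regression sketch.
# X: an already-prepared numeric feature matrix, y: labels with three or more classes.
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
clf.fit(X, y)
print(clf.predict_proba(X[:5]))  # one probability per class for each of the first 5 rows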
Forecasting income brackets is a classic multiclass classification problem. Before analyzing, let's consider the following:
- 1) When the division between classes is not clear: how should the income brackets be defined, and how many classes should there be?
- 2) When there is an order among the classes: strictly speaking, each income bracket should be treated as an ordinal class.
- 3) When a specific class has too few observations: how do we handle the gap between the number of customers in the high-income bracket and the number in the middle-income bracket?
The multiclass classification problem has the following additional considerations compared to the binary classification problem.
- 1) Cautions when implementing the model: one-hot encoding of variables, choice of objective function, etc.
- 2) Cautions when interpreting the results: accuracy, F1 score, confusion matrix, etc. (see the sketch below).
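These metrics are all available in scikit-learn. A small sketch, assuming true labels y_test and model predictions y_pred already exist (they are produced only later in the workflow):

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Assuming true labels y_test and predictions y_pred are available.
print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred, average='macro'))  # macro F1 weights every class equally
print(confusion_matrix(y_test, y_pred))           # rows: true classes, columns: predicted classes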
Load data to predict
Introduction to data
- This topic uses data collected by the US Census Bureau and distributed by UCI as the US Adult Income dataset, with simulated variables added and modified by the instructor.
- The first dataset to be used is the US Adult Income dataset, and its columns are as follows.
- age: age
- workclass: type of employment
- education: education level
- education.num: education level (numerically coded)
- marital.status: marital status
- occupation: occupation
- relationship: family relationship
- race: race
- sex: sex
- capital.gain: capital gains
- capital.loss: capital losses
- hours.per.week: working hours per week
- income: income bracket
Data from: https://archive.ics.uci.edu/ml/datasets/adult
Import data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
datapath = 'https://github.com/mchoimis/financialML/raw/main/income/'
df = pd.read_csv(datapath + 'income.csv')
df.head()
| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
2 | 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |
Data preview
print(df.shape)
print(df.columns)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64
 5   marital.status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64
 11  capital.loss    32561 non-null  int64
 12  hours.per.week  32561 non-null  int64
 13  native.country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
Check Data
# Replace '?' placeholders with NaN
df[df == '?'] = np.nan
# Fill missing values with the mode of each column
for col in ['workclass', 'occupation', 'native.country']:
    df[col].fillna(df[col].mode()[0], inplace=True)
# Check the result
df.head()
| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 90 | Private | 77053 | HS-grad | 9 | Widowed | Prof-specialty | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
2 | 66 | Private | 186061 | Some-college | 10 | Widowed | Prof-specialty | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |
df.isnull().sum()
age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
income            0
dtype: int64
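As an optional sanity check (a small sketch; nothing later depends on it), we can confirm that no '?' placeholders remain and see which mode was used for each imputed column:

# Double-check: no '?' placeholders should remain after the replacement above.
print((df == '?').sum())

# For reference, the mode that was used to fill each imputed column.
for col in ['workclass', 'occupation', 'native.country']:
    print(col, df[col].mode()[0])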
Feature Engineering
Creating input features and target values
X = df.drop(['income', 'education', 'fnlwgt'], axis=1)
y = df['income']
X.head()
| | age | workclass | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 90 | Private | 9 | Widowed | Prof-specialty | Not-in-family | White | Female | 0 | 4356 | 40 | United-States |
1 | 82 | Private | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States |
2 | 66 | Private | 10 | Widowed | Prof-specialty | Unmarried | Black | Female | 0 | 4356 | 40 | United-States |
3 | 54 | Private | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States |
4 | 41 | Private | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States |
y.head()
0    <=50K
1    <=50K
2    <=50K
3    <=50K
4    <=50K
Name: income, dtype: object
Split the raw data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.head()
| | age | workclass | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country |
---|---|---|---|---|---|---|---|---|---|---|---|---|
32098 | 40 | State-gov | 13 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 20 | United-States |
25206 | 39 | Local-gov | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 38 | United-States |
23491 | 42 | Private | 10 | Never-married | Exec-managerial | Not-in-family | White | Female | 0 | 0 | 40 | United-States |
12367 | 27 | Local-gov | 9 | Never-married | Farming-fishing | Own-child | White | Male | 0 | 0 | 40 | United-States |
7054 | 38 | Federal-gov | 14 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States |
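Note that the income classes are imbalanced, so the class proportions can differ slightly between the two sets. Here is a variant of the same split using stratification, shown for reference only; the rest of this walkthrough keeps the unstratified split above:

# Optional variant (for reference only, not used below): stratify=y keeps the
# <=50K / >50K class proportions identical in the training and test sets.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
print(y_train_s.value_counts(normalize=True))
print(y_test_s.value_counts(normalize=True))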
Handling categorical variables
from sklearn.preprocessing import LabelEncoder
categorical = ['workclass', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
for feature in categorical:
    le = LabelEncoder()
    X_train[feature] = le.fit_transform(X_train[feature])
    X_test[feature] = le.transform(X_test[feature])
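One caveat with this loop: each LabelEncoder is fitted on X_train only, so le.transform(X_test[feature]) would raise a ValueError if the test set contained a category never seen in training (it happens not to with this split, but it can with new data). A sketch of one defensive variant, fitting each encoder on the union of both sets; it is shown for reference only and not applied here:

# Variant (for reference only): fit each encoder on the combined train and test
# values so that a category appearing only in the test set cannot break transform().
for feature in categorical:
    le = LabelEncoder()
    le.fit(pd.concat([X_train[feature], X_test[feature]]))
    X_train[feature] = le.transform(X_train[feature])
    X_test[feature] = le.transform(X_test[feature])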
Check the result of categorical variable processing
# Check the transformed categorical variable column (X_train)
X_train[categorical].head(3)
| | workclass | marital.status | occupation | relationship | race | sex | native.country |
---|---|---|---|---|---|---|---|
32098 | 6 | 2 | 3 | 5 | 4 | 0 | 38 |
25206 | 1 | 2 | 6 | 0 | 4 | 1 | 38 |
23491 | 3 | 4 | 3 | 1 | 4 | 0 | 38 |
# Checking the converted categorical variable column (X_test)
X_test[categorical].head(3)
| | workclass | marital.status | occupation | relationship | race | sex | native.country |
---|---|---|---|---|---|---|---|
22278 | 3 | 6 | 11 | 4 | 4 | 0 | 38 |
8950 | 3 | 4 | 5 | 3 | 4 | 0 | 38 |
7838 | 3 | 4 | 7 | 1 | 1 | 0 | 39 |
X_train[categorical].nunique()
workclass          8
marital.status     7
occupation        14
relationship       6
race               5
sex                2
native.country    41
dtype: int64
X_test[categorical].nunique()
workclass          8
marital.status     7
occupation        14
relationship       6
race               5
sex                2
native.country    40
dtype: int64
Note: Handling of categorical variables
Categorical variables can be handled in roughly two ways.

- Converting each class to a number (label encoding)
- One-hot encoding (dummy encoding)

In the case of financial data, categorical variables make up most of the dataset, so one-hot encoding can leave the majority of the entries equal to 0. When a high-dimensional dataset contains many uninformative values, the features are said to be sparse, and training efficiency may suffer.
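For comparison, one-hot encoding the same columns with pandas shows how quickly the dimensionality grows; a minimal sketch using the categorical list defined above:

# One-hot (dummy) encoding of the original categorical columns, for comparison.
dummies = pd.get_dummies(df[categorical])
print(dummies.shape)                  # far more columns than the label-encoded version
print((dummies == 0).mean().mean())   # overall share of zero entries, i.e. how sparse it is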
Scaling Features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns = X.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns = X.columns)
# Check the original (unscaled) X_train data for comparison
X_train.head()
| | age | workclass | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country |
---|---|---|---|---|---|---|---|---|---|---|---|---|
32098 | 40 | 6 | 13 | 2 | 3 | 5 | 4 | 0 | 0 | 0 | 20 | 38 |
25206 | 39 | 1 | 9 | 2 | 6 | 0 | 4 | 1 | 0 | 0 | 38 | 38 |
23491 | 42 | 3 | 10 | 4 | 3 | 1 | 4 | 0 | 0 | 0 | 40 | 38 |
12367 | 27 | 1 | 9 | 4 | 4 | 3 | 4 | 1 | 0 | 0 | 40 | 38 |
7054 | 38 | 0 | 14 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 40 | 38 |
print(min(X_train['age']))
print(max(X_train['age']))
print(np.mean(X_train['age']))
print(np.var(X_train['age']))
print('\n')
print(min(X_test['age']))
print(max(X_test['age']))
print(np.mean(X_test['age']))
print(np.var(X_test['age']))
17
90
38.61429448929449
186.44402697680712


17
90
38.505476507319074
185.14136114309127
print(min(X_train_scaled['age']))
print(max(X_train_scaled['age']))
print(np.mean(X_train_scaled['age']))
print(np.var(X_train_scaled['age']))
print('\n')
print(min(X_test_scaled['age']))
print(max(X_test_scaled['age']))
print(np.mean(X_test_scaled['age']))
print(np.var(X_test_scaled['age']))
-1.5829486507307393
3.7632934651328265
1.7567165303651125e-16
1.0


-1.5829486507307393
3.7632934651328265
-0.007969414769866482
0.9930130996694361
Note: feature scalers provided by scikit-learn

- StandardScaler: the default choice; transforms each feature to mean 0 and standard deviation 1
- RobustScaler: similar to the above, but uses the median and interquartile range instead of the mean and standard deviation, which minimizes the influence of outliers
- MinMaxScaler: scales each feature so that its maximum is 1 and its minimum is 0
- Normalizer: normalizes each row (sample) rather than each feature (column), so that its Euclidean norm is 1
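All of these follow the same fit/transform pattern, so swapping one for another is a one-line change. A small sketch applying two of the alternatives to the already-encoded X_train, for illustration only:

from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Same fit/transform pattern as the StandardScaler used above; only the scaling rule changes.
mm = MinMaxScaler().fit(X_train)   # rescales each feature to the [0, 1] range
rb = RobustScaler().fit(X_train)   # centers on the median and scales by the IQR
print(mm.transform(X_train)[:2])
print(rb.transform(X_train)[:2])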
The reason for scaling is that training may not work properly when feature values are too large or too small. For classifiers where scale has a decisive effect (e.g., distance-based algorithms such as kNN), scaling is essential.
On the other hand, for some features it may be better to keep the original distribution. For example, if a feature whose values are concentrated in a narrow range is standardized to match the other distributions, small changes can be learned as large differences. Scaling can also be omitted when the classifier is not strongly affected by scale (e.g., tree-based ensemble algorithms), when performance is already acceptable, or when overfitting is less of a concern.
One thing to keep in mind is that scaling can strip the original values of their meaning. If the ultimate goal is not just to get the right answer but to interpret the model or apply it to other datasets later, losing the explanatory power of the original features can make the model harder to improve. Please keep this in mind as well.