Acquisition_advertising

2022-08-06 7 분 소요

[Notice] [Marketing_Mix_Modeling]

Marketing mix modeling through sales data analysis

Data

Data is (https://www.kaggle.com/sazid28/advertising.csv/).

TV - TV media fee.
radio - Radio media fee.
newspaper - Newspaper media fee.
sales - Sales.

Explore data

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('Data/Advertising.csv')
print(df.shape)
df.tail() 

(200, 5)

	Unnamed: 0	TV	radio	newspaper	sales
195	196	38.2	3.7	13.8	7.6
196	197	94.2	4.9	8.1	9.7
197	198	177.0	9.3	6.4	12.8
198	199	283.6	42.0	66.2	25.5
199	200	232.1	8.6	8.7	13.4

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  200 non-null    int64  
 1   TV          200 non-null    float64
 2   radio       200 non-null    float64
 3   newspaper   200 non-null    float64
 4   sales       200 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB

df = df[["TV", "radio", "newspaper", "sales"]]
print(df.shape)
df.tail()

(200, 4)

	TV	radio	newspaper	sales
195	38.2	3.7	13.8	7.6
196	94.2	4.9	8.1	9.7
197	177.0	9.3	6.4	12.8
198	283.6	42.0	66.2	25.5
199	232.1	8.6	8.7	13.4

df.describe()

	TV	radio	newspaper	sales
count	200.000000	200.000000	200.000000	200.000000
mean	147.042500	23.264000	30.554000	14.022500
std	85.854236	14.846809	21.778621	5.217457
min	0.700000	0.000000	0.300000	1.600000
25%	74.375000	9.975000	12.750000	10.375000
50%	149.750000	22.900000	25.750000	12.900000
75%	218.825000	36.525000	45.100000	17.400000
max	296.400000	49.600000	114.000000	27.000000

df.corr()

	TV	radio	newspaper	sales
TV	1.000000	0.054809	0.056648	0.782224
radio	0.054809	1.000000	0.354104	0.576223
newspaper	0.056648	0.354104	1.000000	0.228299
sales	0.782224	0.576223	0.228299	1.000000

corr = df.corr()
sns.heatmap(corr, annot = True)

<AxesSubplot:>

sns.pairplot(df[["TV", "radio", "newspaper", "sales"]])
plt.show()

Labels = df["sales"]
features = df[["TV", "radio", "newspaper"]]

print(Labels.shape)
Labels

(200,)

0      22.1
1      10.4
2       9.3
3      18.5
4      12.9
       ... 
195     7.6
196     9.7
197    12.8
198    25.5
199    13.4
Name: sales, Length: 200, dtype: float64

print(features.shape)
features

(200, 3)

	TV	radio	newspaper
0	230.1	37.8	69.2
1	44.5	39.3	45.1
2	17.2	45.9	69.3
3	151.5	41.3	58.5
4	180.8	10.8	58.4
...	...	...	...
195	38.2	3.7	13.8
196	94.2	4.9	8.1
197	177.0	9.3	6.4
198	283.6	42.0	66.2
199	232.1	8.6	8.7

200 rows × 3 columns

Data analysis

figure, ((ax1, ax2, ax3)) = plt.subplots(nrows=1, ncols=3)

figure.set_size_inches(20, 6)

sns.scatterplot(data=df, x="TV", y="sales", ax=ax1)
sns.scatterplot(data=df, x="radio", y="sales", ax=ax2)
sns.scatterplot(data=df, x="newspaper", y="sales", ax=ax3)

<AxesSubplot:xlabel='newspaper', ylabel='sales'>

매출액과의 scatter plot을 보면, TV가 매출액과 가장 관련이 높은 것 같아 보입니다.

라디오도 관련이 있지만 신문의 상관관계는 애매해 보입니다.

Linear regression (stats model)

import statsmodels.formula.api as sm
model = sm.ols(formula = "sales ~ TV + radio + newspaper", data = df).fit()

print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     570.3
Date:                Tue, 26 Jul 2022   Prob (F-statistic):           1.58e-96
Time:                        13:51:44   Log-Likelihood:                -386.18
No. Observations:                 200   AIC:                             780.4
Df Residuals:                     196   BIC:                             793.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.9389      0.312      9.422      0.000       2.324       3.554
TV             0.0458      0.001     32.809      0.000       0.043       0.049
radio          0.1885      0.009     21.893      0.000       0.172       0.206
newspaper     -0.0010      0.006     -0.177      0.860      -0.013       0.011
==============================================================================
Omnibus:                       60.414   Durbin-Watson:                   2.084
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              151.241
Skew:                          -1.327   Prob(JB):                     1.44e-33
Kurtosis:                       6.332   Cond. No.                         454.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Linear regression analysis (sklearn)

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(features, Labels)
print(model.intercept_, model.coef_)

2.938889369459412 [ 0.04576465  0.18853002 -0.00103749]

model1 = sm.ols(formula = "sales ~ TV + radio + newspaper", data=df).fit()
model2 = sm.ols(formula = "sales ~ TV + radio", data=df).fit()
model3 = sm.ols(formula = "sales ~ TV", data=df).fit()

print(model1.summary())
print(model2.summary())
print(model3.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     570.3
Date:                Tue, 26 Jul 2022   Prob (F-statistic):           1.58e-96
Time:                        14:29:52   Log-Likelihood:                -386.18
No. Observations:                 200   AIC:                             780.4
Df Residuals:                     196   BIC:                             793.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.9389      0.312      9.422      0.000       2.324       3.554
TV             0.0458      0.001     32.809      0.000       0.043       0.049
radio          0.1885      0.009     21.893      0.000       0.172       0.206
newspaper     -0.0010      0.006     -0.177      0.860      -0.013       0.011
==============================================================================
Omnibus:                       60.414   Durbin-Watson:                   2.084
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              151.241
Skew:                          -1.327   Prob(JB):                     1.44e-33
Kurtosis:                       6.332   Cond. No.                         454.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     859.6
Date:                Tue, 26 Jul 2022   Prob (F-statistic):           4.83e-98
Time:                        14:29:52   Log-Likelihood:                -386.20
No. Observations:                 200   AIC:                             778.4
Df Residuals:                     197   BIC:                             788.3
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.9211      0.294      9.919      0.000       2.340       3.502
TV             0.0458      0.001     32.909      0.000       0.043       0.048
radio          0.1880      0.008     23.382      0.000       0.172       0.204
==============================================================================
Omnibus:                       60.022   Durbin-Watson:                   2.081
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              148.679
Skew:                          -1.323   Prob(JB):                     5.19e-33
Kurtosis:                       6.292   Cond. No.                         425.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.612
Model:                            OLS   Adj. R-squared:                  0.610
Method:                 Least Squares   F-statistic:                     312.1
Date:                Tue, 26 Jul 2022   Prob (F-statistic):           1.47e-42
Time:                        14:29:52   Log-Likelihood:                -519.05
No. Observations:                 200   AIC:                             1042.
Df Residuals:                     198   BIC:                             1049.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.0326      0.458     15.360      0.000       6.130       7.935
TV             0.0475      0.003     17.668      0.000       0.042       0.053
==============================================================================
Omnibus:                        0.531   Durbin-Watson:                   1.935
Prob(Omnibus):                  0.767   Jarque-Bera (JB):                0.669
Skew:                          -0.089   Prob(JB):                        0.716
Kurtosis:                       2.779   Cond. No.                         338.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

model1.predict({"TV" : 300, "radio": 10, "newspaper": 4})

0    18.549433
dtype: float64

sales = 2.9389 + 0.0458*300 + 0.1885*10 - 0.001*4
sales

18.5599

model3.predict({"TV":[300, 10, 50]})

0    21.293586
1     7.507960
2     9.409426
dtype: float64

Reanalysis after data transformation

import warnings
warnings.filterwarnings("ignore")

figure, ((ax1, ax2, ax3)) = plt.subplots(nrows=1, ncols=3)
figure.set_size_inches(20, 6)



sns.distplot(df['newspaper'], ax = ax1)
sns.distplot(df['TV'], ax = ax2)
sns.distplot(df['radio'], ax = ax3)

<AxesSubplot:xlabel='radio', ylabel='Density'>

import numpy as np

df["log_newspaper"] = np.log(df["newspaper"] + 1)
print(df.shape)
df[["log_newspaper", "newspaper"]]

(200, 5)

	log_newspaper	newspaper
0	4.251348	69.2
1	3.830813	45.1
2	4.252772	69.3
3	4.085976	58.5
4	4.084294	58.4
...	...	...
195	2.694627	13.8
196	2.208274	8.1
197	2.001480	6.4
198	4.207673	66.2
199	2.272126	8.7

200 rows × 2 columns

figure, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2)

 
figure.set_size_inches(20, 10)


sns.distplot(df["TV"], ax=ax1)
sns.distplot(df["radio"], ax=ax2)
sns.distplot(df["newspaper"], ax=ax3)
sns.distplot(df["log_newspaper"], ax=ax4)

<AxesSubplot:xlabel='log_newspaper', ylabel='Density'>

model1 = sm.ols(formula = "sales ~ TV + radio + newspaper", data=df).fit()

model4 = sm.ols(formula = "sales ~ TV + radio + log_newspaper", data=df).fit()

print(model1.summary())
print(model4.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     570.3
Date:                Tue, 26 Jul 2022   Prob (F-statistic):           1.58e-96
Time:                        14:50:46   Log-Likelihood:                -386.18
No. Observations:                 200   AIC:                             780.4
Df Residuals:                     196   BIC:                             793.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.9389      0.312      9.422      0.000       2.324       3.554
TV             0.0458      0.001     32.809      0.000       0.043       0.049
radio          0.1885      0.009     21.893      0.000       0.172       0.206
newspaper     -0.0010      0.006     -0.177      0.860      -0.013       0.011
==============================================================================
Omnibus:                       60.414   Durbin-Watson:                   2.084
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              151.241
Skew:                          -1.327   Prob(JB):                     1.44e-33
Kurtosis:                       6.332   Cond. No.                         454.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     570.4
Date:                Tue, 26 Jul 2022   Prob (F-statistic):           1.53e-96
Time:                        14:50:46   Log-Likelihood:                -386.15
No. Observations:                 200   AIC:                             780.3
Df Residuals:                     196   BIC:                             793.5
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.8057      0.495      5.671      0.000       1.830       3.782
TV                0.0457      0.001     32.822      0.000       0.043       0.048
radio             0.1874      0.008     22.510      0.000       0.171       0.204
log_newspaper     0.0414      0.142      0.291      0.772      -0.240       0.322
==============================================================================
Omnibus:                       59.391   Durbin-Watson:                   2.074
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              144.867
Skew:                          -1.316   Prob(JB):                     3.49e-32
Kurtosis:                       6.233   Cond. No.                         731.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interpretation of results

The correlation coefficient for newspapers changed from negative to positive, but still not significant at the P-value 0.05 level.
newspaper is not shown to be significant. In other words, it can be said that the effect of newspaper advertisements on sales is insignificant.

If you do additional verification

Diagnosis of linear regression results:

Equal variance verification through residual distribution, normality verification through QQ plot

Model evaluation of linear regression

RMSE, MAE, MSE are mainly used

Application plan

It’s 2011 and your boss still thinks newspaper ads are valid, but the data analysis doesn’t.
Newspaper advertisements should be stopped, and TV and radio advertisements should be focused.
However, TV advertising seems to be a little less cost-effective.
Radio advertisements have a higher correlation coefficient. Our products sell well the more they advertise on the radio.

Twitter Facebook LinkedIn

Acquisition_advertising

Marketing mix modeling through sales data analysis

Data

Explore data

Data analysis

Linear regression (stats model)

Linear regression analysis (sklearn)

Reanalysis after data transformation

Interpretation of results

If you do additional verification

Model evaluation of linear regression

Application plan

공유하기

댓글남기기

참고

Predicting_income

seasonal_decompose

Dickey Fuller Test

Arima_forecast