Net Profit Forecasting

Overview:

This notebook demonstrates a solution to a simple regression problem: predicting net profit from a set of attributes. I start with a short EDA to understand the distributions of the numerical and categorical variables. I then fit a simple statistical model, followed by a linear regression and a more complex model; each model comes with scoring metrics and explanations. Finally, I report the predictions for the provided out-of-sample test set and use Shapley values to show which attributes drove those predictions.

  1. Basic EDA.
  2. Feature Engineering.
  3. Statistical Model.
  4. Baseline Model.
  5. Complex Model.
  6. Scoring.
  7. SHAP Values, Interpretable ML.
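# data handling and plotting (imported explicitly instead of relying on utils):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# modeling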
from sklearn.linear_model import LinearRegression
from lightgbm import LGBMRegressor
from statsmodels.regression.linear_model import OLS
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# model diagnosis:
from yellowbrick.model_selection import LearningCurve
from yellowbrick.regressor import ResidualsPlot

# Interpretable Machine Learning:
import shap
# helper functions I wrote for this dataset:
from utils import *
import warnings
warnings.filterwarnings('ignore')

# Settings and aesthetics:
cmap_data = plt.cm.Paired
cmap_cv = plt.cm.coolwarm

# Some basic settings here:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.options.display.max_columns = 30

EDA

modeling_data = pd.read_csv("../../Data/labeled_dataset.csv")
modeling_data.set_index("index", inplace=True)
modeling_data.head()

index   Age  Monthly premium  Socioeconomic category  Monthly kilometers  Coefficient bonus malus  Vehicle type  CRM score  Standard of living    Brand  Yearly income  Credit score  Yearly maintenance cost  Net profit
0      58.0             40.0                 Student                 973                      106           SUV        164                3762  Peugeot          20420           309                      801   54.998558
1      26.0             27.0            Labor worker                 637                       95       5 doors        126                3445  Renault          25750           135                      667    7.840930
2      27.0             26.0           Office worker                 978                      136           SUV        153                 986  Renault           6790           786                      696   46.078889
3      22.0              8.0                 Student                 771                       96       3 doors        111                2366  Peugeot          15140           320                      765  -11.048213
4      60.0             20.0              Unemployed                 758                      101       3 doors        149                1441  Peugeot          12850           287                      808    1.180078
modeling_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      983 non-null    float64
 1   Monthly premium          989 non-null    float64
 2   Socioeconomic category   1000 non-null   object 
 3   Monthly kilometers       1000 non-null   int64  
 4   Coefficient bonus malus  1000 non-null   int64  
 5   Vehicle type             1000 non-null   object 
 6   CRM score                1000 non-null   int64  
 7   Standard of living       1000 non-null   int64  
 8   Brand                    948 non-null    object 
 9   Yearly income            1000 non-null   int64  
 10  Credit score             1000 non-null   int64  
 11  Yearly maintenance cost  1000 non-null   int64  
 12  Net profit               1000 non-null   float64
dtypes: float64(3), int64(7), object(3)
memory usage: 109.4+ KB
modeling_data.isnull().sum().to_frame().style.background_gradient(cmap='summer')
                          0
Age                      17
Monthly premium          11
Socioeconomic category    0
Monthly kilometers        0
Coefficient bonus malus   0
Vehicle type              0
CRM score                 0
Standard of living        0
Brand                    52
Yearly income             0
Credit score              0
Yearly maintenance cost   0
Net profit                0
# Treat missing brands as their own "other" category:
modeling_data["Brand"] = modeling_data["Brand"].fillna("other")
# Drop the rows with a missing monthly premium:
modeling_data = modeling_data[modeling_data["Monthly premium"].notna()]
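
The missing-value handling above can be tried out on a toy frame. The sketch below uses made-up values (not rows from labeled_dataset.csv) and also illustrates a side effect worth keeping in mind: a range filter on a numeric column silently removes rows where that column is NaN, because comparisons with NaN evaluate to False.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Age": [58.0, np.nan, 27.0, 22.0],
    "Monthly premium": [40.0, 27.0, np.nan, 8.0],
    "Brand": ["Peugeot", None, "Renault", None],
})

# Missing brands become their own category instead of being dropped.
toy["Brand"] = toy["Brand"].fillna("other")

# Rows without a premium are removed.
toy = toy[toy["Monthly premium"].notna()]

# A range filter on Age also removes the row whose Age is NaN.
filtered = toy[(toy["Age"] > 0) & (toy["Age"] < 124)]

print(toy["Brand"].tolist())
print(len(toy), len(filtered))
```

Whether dropping rows with a missing Age is acceptable depends on the use case; imputing the value would be an alternative.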
modeling_data.columns.tolist()
    ['Age',
     'Monthly premium',
     'Socioeconomic category',
     'Monthly kilometers',
     'Coefficient bonus malus',
     'Vehicle type',
     'CRM score',
     'Standard of living',
     'Brand',
     'Yearly income',
     'Credit score',
     'Yearly maintenance cost',
     'Net profit']
plt.figure(figsize=(10,6))
sns.set(style="whitegrid", font_scale=1.2)
chart = sns.countplot(x='Brand',
                      hue='Socioeconomic category',
                      data=modeling_data,
                      palette='deep')
chart.set_title('Histogram for Categorical Variables', fontsize=16, color='firebrick')
plt.show()

sns.set(style="whitegrid")
plt.figure(figsize=(11,7))
chart = sns.countplot(x='Brand',
                      hue='Vehicle type',
                      data=modeling_data,
                      palette='deep')
chart.set_title('Count-plot for Categorical Variables', fontsize=16, color='firebrick')
plt.show()

plt.figure(figsize=(11,7))
sns.boxplot(data=modeling_data[["Age",
                                "Monthly premium", 
                                "Coefficient bonus malus", 
                                "CRM score", 
                                "Net profit"]])
plt.title('Box-plots for Selected Variables', fontsize=16, color='firebrick')
plt.show()

# Drop the extreme outliers seen in the box plots above.
# The Age filter also removes rows with a missing Age, since comparisons with NaN are False.
modeling_data = modeling_data[modeling_data["CRM score"] < 254]
modeling_data = modeling_data[(modeling_data["Age"] < 124) & (modeling_data["Age"] > 0)]
# No notable outliers for credit score, monthly kilometers and yearly maintenance cost:
plt.figure(figsize=(11,7))
sns.boxplot(data=modeling_data[["Monthly kilometers", "Credit score"]])
plt.title('Box-plots for Selected Variables', fontsize=16, color='firebrick')
plt.grid(False)
plt.show()

plt.figure(figsize=(11,7))
sns.boxplot(x=modeling_data["Yearly maintenance cost"], color='aqua')
plt.title('Boxplot for Yearly Maintenance Cost', fontsize=16, color='firebrick')
plt.show()

plt.figure(figsize=(11,7))
sns.boxplot(x=modeling_data["Yearly income"], color='lightseagreen')
plt.title('Boxplot for Yearly Income', fontsize=16, color='firebrick')
plt.show()

sns.set(style="whitegrid", font_scale=1.2)
plt.figure(figsize=(8, 5))
sns.boxplot(x=modeling_data["Standard of living"])
plt.xlabel("Standard of living")
plt.ylabel("Values")
plt.title("Box Plot - Standard of living")
plt.show()

sns.set(style="whitegrid", font_scale=1.2)
plt.figure(figsize=(8, 6))
sns.histplot(data=modeling_data, x="Yearly income", hue="Socioeconomic category", multiple="stack")
plt.xlabel("Yearly income")
plt.ylabel("Count")
plt.title("Yearly income per Socioeconomic Category")
plt.show()

sns.set(style="whitegrid", font_scale=1.2)
plt.figure(figsize=(8, 6))
sns.histplot(data=modeling_data, x="Yearly income", hue="Vehicle type", multiple="stack")
plt.xlabel("Yearly income")
plt.ylabel("Count")
plt.title("Vehicle Type Per Yearly Income")
plt.show()

sns.set(style="whitegrid", font_scale=1.2)
plt.figure(figsize=(8, 6))
sns.histplot(data=modeling_data, x="Yearly income", hue="Socioeconomic category", element="step")
plt.xlabel("Yearly income")
plt.ylabel("Count")
plt.title("Yearly income per Socioeconomic Category")
plt.show()

sns.set(style="whitegrid", font_scale=1.2)
plt.figure(figsize=(8, 5))
sns.histplot(data=modeling_data, x="Yearly income", hue="Brand", multiple="stack")
plt.xlabel("Yearly income")
plt.ylabel("Count")
plt.title("Brand Type Per Yearly Income")
plt.show()

sns.set(style="whitegrid", font_scale=1.2)
plt.figure(figsize=(8, 5))
sns.histplot(data=modeling_data, x="Net profit", hue="Socioeconomic category", multiple="stack")
plt.xlabel("Net profit")
plt.ylabel("Count")
plt.title("Net Profit per Socioeconomic Category")
plt.show()

sns.set(style="whitegrid", font_scale=1.2)
plt.figure(figsize=(8, 5))
sns.histplot(data=modeling_data, x="Net profit", hue="Vehicle type", multiple="stack")
plt.xlabel("Net profit")
plt.ylabel("Count")
plt.title("Net Profit Distribution Per Vehicle Type")
plt.show()

sns.set(style="whitegrid", font_scale=1.2)
plt.figure(figsize=(8, 5))
sns.histplot(data=modeling_data, x="Net profit", hue="Brand", multiple="stack")
plt.xlabel("Net profit")
plt.ylabel("Count")
plt.title("Net Profit per Brand")
plt.show()

Observations:

  • No notable outliers in the data except for Yearly income, CRM score and Age.
  • The numerical variables look approximately normally distributed.
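
The box plots flag points that fall beyond the whiskers, which by default sit 1.5 times the interquartile range (IQR) away from the quartiles. The same rule can be applied numerically. The sketch below is only an illustration: `iqr_outlier_mask` is a hypothetical helper (it is not part of utils) and the ages are made up.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up ages with one implausible entry.
ages = pd.Series([22, 26, 27, 33, 58, 60, 640])
mask = iqr_outlier_mask(ages)

print(ages[mask].tolist())
```

A rule like this gives a reproducible threshold, whereas the cut-offs used in this notebook (CRM score below 254, Age below 124) were read off the plots.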

Encode categorical variables and clean data for modeling:

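# Categorical variables to one-hot encode:
categories = [
    'Brand',
    'Vehicle type',
    'Socioeconomic category',
]
# Keep every dummy level for the correlation plots below, which select
# columns such as 'Brand_Citroen' that drop_first=True would remove:
modeling_data_visualization = pd.get_dummies(modeling_data,
                               columns=categories,
                               drop_first=False)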
modeling_data = pd.get_dummies(
                               modeling_data,
                               columns=categories,
                               drop_first=True
                               )
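
To make the effect of `drop_first` concrete, here is a small sketch on made-up vehicle types. With `drop_first=True` the first level in sorted order ("3 doors" here) becomes the implicit baseline, so the coefficients of the remaining dummies are read relative to it.

```python
import pandas as pd

# Made-up sample of one categorical column.
toy = pd.DataFrame({"Vehicle type": ["3 doors", "5 doors", "SUV", "utility", "SUV"]})

# All levels kept: one column per category.
full = pd.get_dummies(toy, columns=["Vehicle type"], drop_first=False)

# First level dropped: "3 doors" is encoded as all-zero in the other columns.
reduced = pd.get_dummies(toy, columns=["Vehicle type"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping one level avoids perfectly collinear dummy columns in linear models, which is why it is used for the modeling frame.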
# rearrange columns:
ordered_corr_columns = modeling_data_visualization.copy()
ordered_corr_columns = ordered_corr_columns[['Age',
                                             'Monthly premium',
                                             'Monthly kilometers',
                                             'Coefficient bonus malus',
                                             'CRM score',
                                             'Standard of living',
                                             'Yearly income',
                                             'Credit score',
                                             'Yearly maintenance cost',
                                             'Brand_Citroen',
                                             'Brand_Opel',
                                             'Brand_Peugeot',
                                             'Brand_Renault',
                                             'Brand_Toyota',
                                             'Brand_Volkswagen',
                                             'Brand_other',
                                             'Vehicle type_3 doors',
                                             'Vehicle type_5 doors',
                                             'Vehicle type_SUV',
                                             'Vehicle type_utility',
                                             'Socioeconomic category_Labor worker',
                                             'Socioeconomic category_Office worker',
                                             'Socioeconomic category_Self employed',
                                             'Socioeconomic category_Student',
                                             'Socioeconomic category_Unemployed',
                                             'Net profit'
                                             ]]
plt.figure(figsize=(15, 15))
corr = ordered_corr_columns.corr()
corr_top = corr.index
sns.heatmap(modeling_data_visualization[corr_top].corr(),
            vmax=1.0,
            vmin=-1.0,
            linewidths=0.1,
            annot=True,
            annot_kws={"size": 8},
            square=True, cbar=False);

corr = ordered_corr_columns.corr()
plt.figure(figsize=(15, 15))
sns.heatmap(corr[(corr >= 0.1)],
            vmax=1.0,
            vmin=-1.0,
            linewidths=0.1,
            annot=True,
            annot_kws={"size": 8}, 
            square=True,
            cbar=False)

corr = ordered_corr_columns.corr()
plt.figure(figsize=(15, 15))
sns.heatmap(corr[(corr <= -0.1)], 
            vmax=1.0,
            vmin=-1.0,
            linewidths=0.1,
            annot=True,
            annot_kws={"size": 8}, 
            square=True,
            cbar=False);

fig, ax = plt.subplots(figsize=(8, 8))
sns.heatmap(modeling_data_visualization.corr()[["Net profit"]].sort_values("Net profit"),
            vmax=1,
            vmin=-1,
            annot=True, 
            ax=ax)
ax.invert_yaxis()

Observations:

  • No independent variable shows a suspiciously high correlation with the target variable.
  • There is multicollinearity between Standard of living and Yearly income, and between CRM score and Coefficient bonus malus; this can make the coefficients of linear models hard to interpret.
  • A few independent variables correlate positively or negatively with Net profit and can be used to fit a model.
  • There is enough signal in the data that a simple model with fewer coefficients should generalize to unseen data.
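
Reading strongly correlated pairs off a large heatmap is error-prone, so it can help to list them directly. The sketch below is one way to do that; `correlated_pairs` is a hypothetical helper, the threshold of 0.8 is an arbitrary assumption, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data: "Standard of living" is built from "Yearly income".
rng = np.random.default_rng(0)
income = rng.normal(20000, 5000, size=200)
toy = pd.DataFrame({
    "Yearly income": income,
    "Standard of living": income / 6 + rng.normal(0, 100, size=200),
    "Credit score": rng.normal(500, 100, size=200),
})

result = correlated_pairs(toy)
print(result)
```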

Simple Model:

Features = modeling_data.columns.tolist()
Features.remove("Net profit")
Ys = modeling_data["Net profit"]
Xs = modeling_data[Features]

Multi-collinearity Check VIF:

calculate_vif(Xs).style.background_gradient(cmap='summer')
    variables                                    VIF
0   Age                                    63.237779
1   Monthly premium                         3.050082
2   Monthly kilometers                      7.627456
3   Coefficient bonus malus               432.507997
4   CRM score                             718.673947
5   Standard of living                    110.184626
6   Yearly income                         115.704855
7   Credit score                            3.866306
8   Yearly maintenance cost                35.804418
9   Brand_Opel                              1.240398
10  Brand_Peugeot                           2.379924
11  Brand_Renault                           2.191521
12  Brand_Toyota                            1.432597
13  Brand_Volkswagen                        1.562894
14  Brand_other                             1.252540
15  Vehicle type_5 doors                    2.692416
16  Vehicle type_SUV                        2.464803
17  Vehicle type_utility                    1.374975
18  Socioeconomic category_Office worker    2.635837
19  Socioeconomic category_Self employed    1.248430
20  Socioeconomic category_Student          2.359763
21  Socioeconomic category_Unemployed       1.440898
fixed_Xs = Xs.drop(["Standard of living",
                    "Coefficient bonus malus",
                    "CRM score",
                    "Yearly maintenance cost"], axis=1)
calculate_vif(fixed_Xs).style.background_gradient(cmap='summer')
    variables                                    VIF
0   Age                                     7.099762
1   Monthly premium                         2.836646
2   Monthly kilometers                      6.413401
3   Yearly income                           3.427115
4   Credit score                            3.583600
5   Brand_Opel                              1.204267
6   Brand_Peugeot                           2.175342
7   Brand_Renault                           2.026626
8   Brand_Toyota                            1.377058
9   Brand_Volkswagen                        1.472219
10  Brand_other                             1.225380
11  Vehicle type_5 doors                    2.558872
12  Vehicle type_SUV                        2.345649
13  Vehicle type_utility                    1.333014
14  Socioeconomic category_Office worker    2.430892
15  Socioeconomic category_Self employed    1.226225
16  Socioeconomic category_Student          2.203248
17  Socioeconomic category_Unemployed       1.386635
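
`calculate_vif` comes from utils and its source is not shown here. As a rough sketch of what a VIF table computes, the hypothetical `vif_table` below regresses each column on all the others and returns 1 / (1 - R²). It includes an intercept in each auxiliary regression, which is an assumption and may differ from the helper used above, so the numbers would not necessarily match. statsmodels also ships `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.

```python
import numpy as np
import pandas as pd

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing
    column j on the remaining columns plus an intercept."""
    values = X.to_numpy(dtype=float)
    vifs = []
    for j in range(values.shape[1]):
        target = values[:, j]
        others = np.delete(values, j, axis=1)
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - (residual @ residual) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return pd.DataFrame({"variables": X.columns, "VIF": vifs})

# Synthetic data: two nearly redundant columns and one independent column.
rng = np.random.default_rng(0)
income = rng.normal(20000, 5000, size=200)
toy = pd.DataFrame({
    "Yearly income": income,
    "Standard of living": income / 6 + rng.normal(0, 100, size=200),
    "Credit score": rng.normal(500, 100, size=200),
})

vif = vif_table(toy).set_index("variables")["VIF"]
print(vif)
```

A common rule of thumb treats VIF values above 5 or 10 as a sign of problematic collinearity, which is consistent with the columns dropped above.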
simple_model = LinearRegression().fit(fixed_Xs, Ys)
simple_model.score(fixed_Xs, Ys)
0.8834266631920002
coef = get_regressor_coefficients(simple_model, fixed_Xs.columns.tolist())
coefficient_plot(coef.values(), coef.keys())

est = OLS(Ys, Xs).fit()
print(est.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:             Net profit   R-squared (uncentered):                   0.930
Model:                            OLS   Adj. R-squared (uncentered):              0.929
Method:                 Least Squares   F-statistic:                              569.0
Date:                Mon, 26 Jun 2023   Prob (F-statistic):                        0.00
Time:                        15:46:44   Log-Likelihood:                         -3203.3
No. Observations:                 958   AIC:                                      6451.
Df Residuals:                     936   BIC:                                      6558.
Df Model:                          22                                                  
Covariance Type:            nonrobust                                                  
========================================================================================================
                                           coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
Age                                      0.2148      0.048      4.474      0.000       0.121       0.309
Monthly premium                         -0.0320      0.017     -1.875      0.061      -0.066       0.002
Monthly kilometers                       0.0072      0.001      7.597      0.000       0.005       0.009
Coefficient bonus malus                 -0.0124      0.046     -0.270      0.787      -0.103       0.078
CRM score                                0.0156      0.044      0.354      0.724      -0.071       0.102
Standard of living                       0.0003      0.000      0.566      0.572      -0.001       0.001
Yearly income                           -0.0003   6.95e-05     -3.946      0.000      -0.000      -0.000
Credit score                            -0.0010      0.001     -1.340      0.181      -0.003       0.000
Yearly maintenance cost                 -0.0139      0.002     -8.276      0.000      -0.017      -0.011
Brand_Opel                              -1.5706      1.151     -1.364      0.173      -3.830       0.689
Brand_Peugeot                           -1.7943      0.666     -2.694      0.007      -3.101      -0.487
Brand_Renault                           -0.7600      0.683     -1.113      0.266      -2.100       0.580
Brand_Toyota                            -1.7598      0.905     -1.943      0.052      -3.537       0.017
Brand_Volkswagen                        -2.6734      0.830     -3.220      0.001      -4.303      -1.044
Brand_other                             -2.2858      1.109     -2.062      0.039      -4.461      -0.110
Vehicle type_5 doors                    18.3364      0.600     30.537      0.000      17.158      19.515
Vehicle type_SUV                        37.8587      0.621     60.937      0.000      36.639      39.078
Vehicle type_utility                    57.4583      0.921     62.415      0.000      55.652      59.265
Socioeconomic category_Office worker    -0.2975      0.620     -0.480      0.631      -1.514       0.919
Socioeconomic category_Self employed    -1.5587      1.130     -1.379      0.168      -3.776       0.659
Socioeconomic category_Student           3.5937      0.633      5.676      0.000       2.351       4.836
Socioeconomic category_Unemployed        3.9310      0.887      4.431      0.000       2.190       5.672
==============================================================================
Omnibus:                       15.750   Durbin-Watson:                   1.959
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               24.111
Skew:                          -0.131   Prob(JB):                     5.81e-06
Kurtosis:                       3.731   Cond. No.                     2.28e+05
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 2.28e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
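
Note [1] matters when comparing scores: the OLS fit above has no constant, so statsmodels reports an uncentered R², which is not directly comparable with the centered R² returned by `LinearRegression.score`. The sketch below uses synthetic data to show how far the two can diverge for a no-intercept fit. Adding a constant column with `statsmodels.api.add_constant` before fitting would make the summary report the centered version.

```python
import numpy as np

# Synthetic data with a true intercept of 20 and a predictor far from zero.
rng = np.random.default_rng(1)
x = rng.normal(50.0, 5.0, size=500)
y = 20.0 + 0.1 * x + rng.normal(0, 1.0, size=500)

# Least-squares fit WITHOUT an intercept: y ≈ b * x
b = (x @ y) / (x @ x)
resid = y - b * x

# Uncentered R² compares residuals with the raw sum of squares of y.
r2_uncentered = 1.0 - (resid @ resid) / (y @ y)

# Centered R² compares residuals with the variation of y around its mean.
r2_centered = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

print(round(r2_uncentered, 3), round(r2_centered, 3))
```

In this toy case the uncentered value looks excellent while the centered one is negative, so the 0.930 in the summary should not be read as an improvement over the linear regression score.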
statistical_results_as_frame = get_dataframe_from_summary(est)
significant = statistical_results_as_frame[statistical_results_as_frame["P>|t|"] <= 0.000]
significant.style.background_gradient(cmap='summer')
                                         coef     std err           t       P>|t|      [0.025      0.975]
Age                                  0.214800    0.048000    4.474000    0.000000    0.121000    0.309000
Monthly kilometers                   0.007200    0.001000    7.597000    0.000000    0.005000    0.009000
Yearly income                       -0.000300    0.000069   -3.946000    0.000000   -0.000000   -0.000000
Yearly maintenance cost             -0.013900    0.002000   -8.276000    0.000000   -0.017000   -0.011000
Vehicle type_5 doors                18.336400    0.600000   30.537000    0.000000   17.158000   19.515000
Vehicle type_SUV                    37.858700    0.621000   60.937000    0.000000   36.639000   39.078000
Vehicle type_utility                57.458300    0.921000   62.415000    0.000000   55.652000   59.265000
Socioeconomic category_Student       3.593700    0.633000    5.676000    0.000000    2.351000    4.836000
Socioeconomic category_Unemployed    3.931000    0.887000    4.431000    0.000000    2.190000    5.672000
X_train, X_test, y_train, y_test = train_test_split(fixed_Xs,
                                                    Ys,
                                                    test_size=0.3,
                                                    random_state=10101)
visualizer = LearningCurve(
    simple_model, cv=20, scoring='r2',
    n_jobs=4
)
visualizer.fit(fixed_Xs,Ys )
visualizer.show()

<Axes: title={'center': 'Learning Curve for LinearRegression'}, xlabel='Training Instances', ylabel='Score'>
model = LinearRegression()
visualizer = ResidualsPlot(model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show()

<Axes: title={'center': 'Residuals for LinearRegression Model'}, xlabel='Predicted Value', ylabel='Residuals'>

Observations:

  • As suspected, Standard of living, Yearly income, CRM score and Coefficient bonus malus are strongly correlated with other independent variables (VIF above 100), meaning they can be predicted from the other independent variables.
  • As suspected, the model stays stable with fewer features: after dropping the independent variables with high VIF scores, R² is about 0.88.
  • P-values: Socioeconomic category (Student, Unemployed), Vehicle type (5 doors, SUV, utility), Yearly income, Yearly maintenance cost, Age and Monthly kilometers all have p-values below 0.001, suggesting a significant relation with the target variable.
  • Brand: Peugeot and Volkswagen have low p-values (0.007 and 0.001), suggesting a stronger influence on the target variable than the other brands; Renault (0.266) and Toyota (0.052) are not significant at the 5% level.
  • Residuals: they indicate a good fit for a linear model, although predicted values greater than 40 fit worse than the bulk of the training samples.
  • Unexpectedly high p-values for Socioeconomic category Self employed (0.168) and Credit score (0.181).
  • The model tends to generalize under 20-fold cross-validation: the training and validation scores converge.
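
The generalization claim can also be checked numerically rather than only from the learning curve. The sketch below compares mean training and validation R² under k-fold cross-validation. It runs on synthetic data, and the choice of 5 shuffled folds is an assumption made for illustration, not the notebook's setting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression problem with a known linear signal plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(0, 1.0, size=300)

# KFold (rather than StratifiedKFold) because the target is continuous.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(LinearRegression(), X, y, cv=cv,
                        scoring="r2", return_train_score=True)

train_r2 = scores["train_score"].mean()
test_r2 = scores["test_score"].mean()
print(round(train_r2, 3), round(test_r2, 3))
```

A small gap between the two averages points to a model that is not overfitting; a large gap would argue for fewer features or regularization.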

Complex Model:

  • From the observations above, we have built a stable model from simple features and a linear regressor. I would favor simplified models, since they are easier to explain to stakeholders and to productionize than a more complex, higher-variance model.
  • The data is synthetic and generated from a known function, which makes it even easier to fit a model to a normally distributed target variable.
  • Though not required for this particular dataset, the demonstration below helps explain what happens behind the scenes when the model makes a particular prediction.
complex_model = LGBMRegressor(random_state=np.random.RandomState().get_state()[1][0])
complex_model.fit(X_train, y_train)
LGBMRegressor(random_state=2147483648)
{"training score": complex_model.score(X_train, y_train),
 "testing score": complex_model.score(X_test, y_test)}
{'training score': 0.9954600929824392, 'testing score': 0.9744489721140687}
visualizer = LearningCurve(
    complex_model,
    cv=20,
    scoring='r2',
    n_jobs=4
)
visualizer.fit(fixed_Xs,Ys )
visualizer.show()

<Axes: title={'center': 'Learning Curve for LGBMRegressor'}, xlabel='Training Instances', ylabel='Score'>
visualizer = ResidualsPlot(complex_model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show()

<Axes: title={'center': 'Residuals for LGBMRegressor Model'}, xlabel='Predicted Value', ylabel='Residuals'>
feature_importance_plot(complex_model.feature_importances_, X_train.columns.tolist())

X_train, X_test, y_train, y_test = train_test_split(fixed_Xs,
                                                    Ys,
                                                    test_size=0.3,
                                                    random_state=10101)
complex_model.fit(X_train, y_train)
complex_model.score(X_test, y_test)
0.9744489721140687
complex_model.fit(fixed_Xs, Ys)
LGBMRegressor(random_state=2147483648)
save_model(complex_model, "lgb.pkl")
model saved....

Observations:

  • The baseline boosting model generalizes well to held-out data: R² is 0.995 on the training split and 0.974 on the test split.

Predict on unseen Out-of-Sample Data:

golden_data = pd.read_csv("../../Data/scoring_dataset.csv")
golden_data = prepare_out_of_sample_data(golden_data)
predictions = complex_model.predict(golden_data)
predictions
array([ 58.91912859,  49.89491262,  55.44678162, -24.23093475,
       -14.2723258 ,  -8.53386753, -13.64680361,  -9.4624002 ,
        17.76568715,  11.00738228, -19.44384469,  13.65440611,
        10.68901548,  42.35671464,  64.73750346,  31.21405441,
        33.57720901,  66.54627244,  17.34307286, -17.41840316,
        10.57054402,  27.4537283 ,  69.35269097, -12.27879405,
        15.81947529,  -9.96537782,  -3.50549684,  10.60525138,
        29.81544625, -11.33242524,   7.88417464,  11.87869155,
       -21.32857179,  68.10191868,  34.02686042,  12.06295287,
         9.21788335,  11.4395712 , -22.66475147,  10.90707741,
       -11.89317256, -21.2898417 , -16.45650735,  58.73523363,
        11.32063774,  51.41548191,  -6.16929583, -17.21084875,
         8.84313568,  -0.23152295,  16.33390489,  -7.37397323,
        47.33042878,  11.59818288,  35.74334818,  31.79114097,
        31.00438204,  30.59942712,   3.56928081,  32.61225069,
        30.14717748, -16.59407291,  36.6298427 ,  34.46728649,
         5.73856431,   3.28154025, -15.33032491,  23.72479771,
       -13.24799205,   1.69154499,  28.76880574,  45.32716805,
        10.66541024, -16.7040101 ,  13.71867892,  38.66482697,
        64.90888121,  18.33568536,  64.48881502, -12.42543113,
        13.08235257,  32.67607121,   7.26262942,   0.93023009,
       -17.55597795,  52.73909182,  54.72162463,  56.50324747,
       -16.27963481,  -3.71731225,   1.58822318,  46.84579724,
         5.79652318,   4.07176173,  11.33118164,  -9.52012316,
        72.39011481, -13.65875517,  70.09885245,  15.08417505,
         5.53303608,  22.4515763 , -10.57416188,  -7.47169768,
        -3.75537854,  11.61249461,  11.04423492, -14.5301915 ,
        67.9224915 ,  68.21783413,  64.9184886 ,  45.26689911,
        14.07367844, -21.0615744 ,  -9.68622993,   7.99011859,
        70.128609  ,   1.33740385, -12.34573747, -16.53098494,
         0.64245634, -11.7774245 , -17.50944337,  10.03978082,
        12.20080724,   0.85478322, -14.41113003,  17.45473449,
        11.11047532, -13.92590323, -26.3335464 ,  -0.49609702,
        67.70577189,   8.76217364,  11.37061747,  13.94466969,
        31.95999566,  -4.73164322,  -4.78788719,  35.05434567,
       -17.65590835,  11.19294598, -23.46091508, -14.90541622,
        59.80512017,  -4.39993801,  65.87373554, -10.21459077,
         3.34296328,   3.46447491,  18.02436197,  70.92517188,
       -16.13044562,  15.22610701,  67.3571339 ,  31.31605017,
        16.25664797,  65.32404436,  -2.44943542,  11.07774281,
        -1.36484196,  46.27491016,  11.75272127,  31.86936077,
        37.50262643, -12.96696298,  57.84506126,  69.08963572,
       -11.01156093,  -4.98246628,  -2.70502784, -19.59879822,
        67.11843829,  12.43076254,  -2.18990868,  -5.76053432,
        30.75636258,   2.32066105, -27.34262945,  23.86179321,
         6.83557926,  38.6771704 ,   5.4039176 , -19.74764152,
        -1.87067716,  36.43479928,  -6.51198102,   1.46725762,
        54.16904585,  10.90929559,  63.31117438,  11.46281534,
        30.56631811,  19.12706775,  44.89570016,  -2.61577375,
         2.26721515,  30.26308875,  58.11514791,   5.97828353,
       -11.03024255,  -2.95225412,  49.22757724,  12.35191354,
        60.96914324,   7.13226438, -12.56451221,  66.19490992,
       -17.87080904,  27.71120954,   1.35281436,  35.80188293,
       -16.9561018 , -15.39986594,   0.58514808,  32.63009318,
        12.20705734,  -4.23991439,  66.45667292,  67.1467423 ,
        32.08794442,  71.41162079,  -6.69518428,  -6.21886221,
         0.12338652, -17.3127624 ,   5.34228616,  -5.28529409,
        12.78867194,  -0.79418746,   9.26418582,  -5.09157918,
        46.85243872,   8.764054  ,   3.67975868, -14.2596615 ,
        10.76950969,  40.52879738,  66.68991439,  23.82932953,
         3.60368565,  42.69170346,   8.47585468,  64.57462177,
        -7.65456788,  -6.94218944,  53.62941074,  -9.46495613,
        33.06521281,  53.70369254,  21.28483189,  35.32587129,
         8.0646854 ,  69.71120081,  48.09456914,  66.440138  ,
       -11.75407332,  29.66519856,  20.11456971, -15.5240168 ,
        -5.4359874 , -14.34577371,  -0.26161014,  -4.41381492,
       -16.60818486, -17.69304935,  -7.09102285,  -6.55568194,
       -11.53152147,  59.64454771,  15.40693518,  68.11999211,
        36.45709991,  -6.1543164 , -16.20525925, -11.24193557,
        18.08161099,  40.05749666,  35.53561637,  47.87203529,
        -4.38230475,   7.44872708,   7.24849564,   7.74243436,
        43.23607339,  -7.2239746 , -12.54961912,  11.10762799,
        51.24267263,  -3.18134816,  -5.56914843,  66.37558673,
        34.01434221])
golden_data_output = golden_data.copy()
golden_data_output["Predictions"] = predictions
golden_data_output.to_csv("../../Output Files/Data/prediction.csv")
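
`prepare_out_of_sample_data` comes from utils and is not shown, so it is unclear whether it guarantees the same column set and order as the training frame. One defensive option is to reindex the scoring frame against the training columns before predicting. In the sketch below, `align_to_training_columns` is a hypothetical helper and the column names are a shortened, made-up subset.

```python
import pandas as pd

def align_to_training_columns(frame: pd.DataFrame, training_columns: list) -> pd.DataFrame:
    """Reorder columns to the training order and add missing dummy columns as 0.
    Columns that the model never saw are dropped."""
    return frame.reindex(columns=training_columns, fill_value=0)

# Shortened, made-up column list standing in for fixed_Xs.columns.
training_columns = ["Age", "Brand_Opel", "Brand_other", "Vehicle type_SUV"]

# Scoring data with a different column order and no "Brand_other" dummy.
scoring = pd.DataFrame({
    "Age": [21.0, 33.0],
    "Vehicle type_SUV": [0, 1],
    "Brand_Opel": [1, 0],
})

aligned = align_to_training_columns(scoring, training_columns)
print(aligned)
```

In this notebook the equivalent call would pass `fixed_Xs.columns.tolist()` as the training columns; that usage is a suggestion, not something the original code does.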

Analyze Predictions with SHAP values:

shap.initjs()
explainer = shap.TreeExplainer(complex_model)
shap_values = explainer.shap_values(golden_data, check_additivity=False)

# max_display counts features, so use the number of columns:
shap.summary_plot(shap_values, golden_data, max_display=golden_data.shape[1])

Observations:

  • Vehicle types 5 doors and SUV are the main drivers of low net-profit forecasts.
  • The numerical variables Age, Monthly kilometers and CRM score contributed the most to higher net profit.
  • Other features, such as Brand and Socioeconomic category, have little effect compared with the variables at the top.
  • The feature importance plot shows that Yearly income, Monthly kilometers, Age and CRM score are the most important factors in deciding net profit.
  • It is insightful to look at what drives a high or low net profit for individual forecasts in the out-of-sample dataset.
  • Higher Age and Monthly kilometers contribute to higher net profit, though in some cases a high Age value (red) decreases net profit. A dependence plot would show how similar values of Monthly kilometers can contribute differently to profit, depending on the interaction between the two features.
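
SHAP values are additive: each prediction equals a base value (the expected model output) plus the sum of that row's per-feature contributions. The sketch below illustrates this property on a toy linear model, where, assuming independent features, the contribution of feature j is w_j (x_j − mean(x_j)). It is not a reimplementation of `TreeExplainer`; the weights and data are made up.

```python
import numpy as np

# Toy linear model with made-up weights.
rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3))
weights = np.array([2.0, -1.0, 0.5])
intercept = 10.0

def predict(data: np.ndarray) -> np.ndarray:
    return intercept + data @ weights

# Base value: the average prediction over the background data.
expected_value = predict(X).mean()

# Per-feature contributions for a linear model with independent features.
shap_values = weights * (X - X.mean(axis=0))

# Additivity: base value + contributions recovers every prediction.
reconstructed = expected_value + shap_values.sum(axis=1)
print(np.allclose(reconstructed, predict(X)))
```

The individual explanations that follow display this same decomposition for single rows of the out-of-sample data.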

Individual Explanations of High and Low net profit forecasts.

shap.plots._waterfall.waterfall_legacy(explainer.expected_value,
                                       shap_values[1],
                                       golden_data.iloc[1],
                                       golden_data.columns)
print(predictions[1])
golden_data.iloc[1]

Predicted-Profit             49.894912616226186
Age                                        21.0
Monthly premium                            14.0
Monthly kilometers                        694.0
Yearly income                           34900.0
Credit score                              833.0
Brand_Opel                                  1.0
Brand_Peugeot                               0.0
Brand_Renault                               0.0
Brand_Toyota                                0.0
Brand_Volkswagen                            0.0
Vehicle type_5 doors                        1.0
Vehicle type_SUV                            0.0
Vehicle type_utility                        0.0
Socioeconomic category_Office worker        1.0
Socioeconomic category_Self employed        0.0
Socioeconomic category_Student              0.0
Socioeconomic category_Unemployed           0.0
Brand_other                                 0.0
Name: 1001, dtype: float64

shap.plots._waterfall.waterfall_legacy(explainer.expected_value,
                                       shap_values[292],
                                       golden_data.iloc[292],
                                       golden_data.columns)
print(predictions[292])
golden_data.iloc[292]

[Figure: SHAP waterfall plot for the row at position 292]

Predicted-Profit               34.0143422063608
Age                                        33.0
Monthly premium                             8.0
Monthly kilometers                        748.0
Yearly income                           32360.0
Credit score                              271.0
Brand_Opel                                  1.0
Brand_Peugeot                               0.0
Brand_Renault                               0.0
Brand_Toyota                                0.0
Brand_Volkswagen                            0.0
Vehicle type_5 doors                        0.0
Vehicle type_SUV                            0.0
Vehicle type_utility                        1.0
Socioeconomic category_Office worker        0.0
Socioeconomic category_Self employed        0.0
Socioeconomic category_Student              1.0
Socioeconomic category_Unemployed           0.0
Brand_other                                 0.0
Name: 1298, dtype: float64

shap.plots._waterfall.waterfall_legacy(explainer.expected_value,
                                       shap_values[185],
                                       golden_data.iloc[185],
                                       golden_data.columns)
print(predictions[185])
golden_data.iloc[185]

[Figure: SHAP waterfall plot for the row at position 185]

Predicted-Profit             36.434799275715484
Age                                        42.0
Monthly premium                            15.0
Monthly kilometers                        345.0
Yearly income                           21730.0
Credit score                               78.0
Brand_Opel                                  0.0
Brand_Peugeot                               0.0
Brand_Renault                               0.0
Brand_Toyota                                0.0
Brand_Volkswagen                            0.0
Vehicle type_5 doors                        0.0
Vehicle type_SUV                            0.0
Vehicle type_utility                        0.0
Socioeconomic category_Office worker        1.0
Socioeconomic category_Self employed        0.0
Socioeconomic category_Student              0.0
Socioeconomic category_Unemployed           0.0
Brand_other                                 0.0
Name: 1190, dtype: float64

shap.plots._waterfall.waterfall_legacy(explainer.expected_value,
                                       shap_values[3],
                                       golden_data.iloc[3],
                                       golden_data.columns)
print(predictions[3])
golden_data.iloc[3]

[Figure: SHAP waterfall plot for the row at position 3]

Predicted-Profit             -24.23093475368557
Age                                        33.0
Monthly premium                             2.0
Monthly kilometers                        970.0
Yearly income                           45450.0
Credit score                              990.0
Brand_Opel                                  0.0
Brand_Peugeot                               0.0
Brand_Renault                               0.0
Brand_Toyota                                0.0
Brand_Volkswagen                            0.0
Vehicle type_5 doors                        1.0
Vehicle type_SUV                            0.0
Vehicle type_utility                        0.0
Socioeconomic category_Office worker        0.0
Socioeconomic category_Self employed        0.0
Socioeconomic category_Student              1.0
Socioeconomic category_Unemployed           0.0
Brand_other                                 0.0
Name: 1003, dtype: float64

shap.plots._waterfall.waterfall_legacy(explainer.expected_value,
                                       shap_values[19],
                                       golden_data.iloc[19],
                                       golden_data.columns)
print(predictions[19])
golden_data.iloc[19]

[Figure: SHAP waterfall plot for the row at position 19]

Predicted-Profit             -17.41840315625436
Age                                        45.0
Monthly premium                             4.0
Monthly kilometers                        993.0
Yearly income                           29160.0
Credit score                              255.0
Brand_Opel                                  0.0
Brand_Peugeot                               0.0
Brand_Renault                               1.0
Brand_Toyota                                0.0
Brand_Volkswagen                            0.0
Vehicle type_5 doors                        0.0
Vehicle type_SUV                            0.0
Vehicle type_utility                        0.0
Socioeconomic category_Office worker        0.0
Socioeconomic category_Self employed        0.0
Socioeconomic category_Student              0.0
Socioeconomic category_Unemployed           0.0
Brand_other                                 0.0
Name: 1021, dtype: float64

shap.plots._waterfall.waterfall_legacy(explainer.expected_value,
                                       shap_values[140],
                                       golden_data.iloc[140],
                                       golden_data.columns)
print(predictions[140])
golden_data.iloc[140]

[Figure: SHAP waterfall plot for the row at position 140]

Predicted-Profit             -17.65590834806663
Age                                        30.0
Monthly premium                            27.0
Monthly kilometers                        915.0
Yearly income                           19460.0
Credit score                              746.0
Brand_Opel                                  0.0
Brand_Peugeot                               1.0
Brand_Renault                               0.0
Brand_Toyota                                0.0
Brand_Volkswagen                            0.0
Vehicle type_5 doors                        0.0
Vehicle type_SUV                            0.0
Vehicle type_utility                        0.0
Socioeconomic category_Office worker        0.0
Socioeconomic category_Self employed        0.0
Socioeconomic category_Student              0.0
Socioeconomic category_Unemployed           0.0
Brand_other                                 0.0
Name: 1145, dtype: float64

shap.plots._waterfall.waterfall_legacy(explainer.expected_value,
                                       shap_values[264],
                                       golden_data.iloc[264],
                                       golden_data.columns)
print(predictions[264])
golden_data.iloc[264]

[Figure: SHAP waterfall plot for the row at position 264]

Predicted-Profit             -16.60818486301198
Age                                        41.0
Monthly premium                             5.0
Monthly kilometers                        801.0
Yearly income                           32950.0
Credit score                              824.0
Brand_Opel                                  0.0
Brand_Peugeot                               0.0
Brand_Renault                               1.0
Brand_Toyota                                0.0
Brand_Volkswagen                            0.0
Vehicle type_5 doors                        0.0
Vehicle type_SUV                            0.0
Vehicle type_utility                        0.0
Socioeconomic category_Office worker        0.0
Socioeconomic category_Self employed        0.0
Socioeconomic category_Student              0.0
Socioeconomic category_Unemployed           0.0
Brand_other                                 0.0
Name: 1270, dtype: float64
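
The row positions in the cells above are hard-coded, and the notebook does not show how they were chosen. As a minimal sketch, assuming the goal is to explain the highest and lowest forecasts, the positions could be selected by sorting the predictions. The helper `extreme_positions` and the example values are hypothetical and use only the standard library.

```python
def extreme_positions(predictions, k=3):
    """Return (highest, lowest): positions of the k largest and k smallest forecasts."""
    # Positions sorted by forecast value, ascending.
    order = sorted(range(len(predictions)), key=lambda i: predictions[i])
    highest = order[-k:][::-1]   # largest forecast first
    lowest = order[:k]           # smallest forecast first
    return highest, lowest


# Hypothetical forecasts, for illustration only (not the notebook's predictions).
example_predictions = [12.0, 49.9, -3.5, -24.2, 30.1, 7.7]
highest, lowest = extreme_positions(example_predictions, k=2)
print(highest)  # [1, 4]
print(lowest)   # [3, 2]
```

Each returned position could then be passed to the same waterfall call used above, for example in a loop, instead of repeating the cell by hand. The helper assumes `predictions` supports positional indexing, as a list or NumPy array does.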