Will the customer buy a product after their haircut?

This analysis is based on our survey data of over 100 responses.

The Dataset

The survey measures the following features, each of which is included in the CSV:

  • Timestamp
  • E-mail Address
  • Work Zip Code
  • Home Zip Code
  • Business Name
  • City, State last haircut
  • Gender
  • Age
  • Race
  • Income Range
  • Time since last haircut
  • Time between haircuts
  • Buy Products
  • How much spent last haircut
  • Maximum spend for haircut
  • How find current barber
  • Leave reviews online
  • Importance of Price (1-5)
  • Importance of Convenience (1-5)
  • Importance of Atmosphere (1-5)
  • Importance of Additional Services (1-5)
  • Additional Comments

Correlation Matrix


From the Pearson correlation matrix generated by pandas_profiling, we can see which fields correlate strongly with the 'Buy Products' response.
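
As a minimal sketch (assuming the same survey CSV loaded below), the Pearson matrix that pandas_profiling reports can also be computed directly with pandas:

# sketch: compute Pearson correlations against the target column directly
# (df.corr() defaults to the Pearson method)
import pandas as pd

df = pd.read_csv('../data/survey04172018.csv', index_col=None)
print(df.corr()['products_values'].sort_values(ascending=False))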

We will import the numeric_survey CSV output from the latest survey and create a new DataFrame with just the strongly correlated fields.

In [13]:
# import dependencies
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score
In [2]:
# read in CSV from numeric_survey function
df = pd.read_csv('../data/survey04172018.csv',index_col=None)
df.head()
Out[2]:
work_zip home_zip gender_values age race_values income_values days_last_values time_between_values products_values spend_values max_spend_values how_find_values review_values price convenient atmosphere amenities
0 92614 92614 1 38.0 1.0 1.0 1.0 3.0 0.0 5.0 5.0 3.0 1.0 4 4 4 1
1 92660 92677 1 34.0 4.0 5.0 6.0 7.0 0.0 5.0 7.0 2.0 2.0 5 3 3 1
2 92612 92602 0 35.0 4.0 4.0 0.0 7.0 1.0 5.0 0.0 3.0 2.0 5 5 4 4
3 92620 92780 1 35.0 1.0 4.0 1.0 3.0 1.0 5.0 6.0 3.0 3.0 3 5 4 4
4 97205 97205 1 38.0 4.0 4.0 1.0 6.0 0.0 5.0 5.0 4.0 3.0 4 2 3 2
In [77]:
# create new dataframe with fields from correlation matrix
df2 = df[['products_values','spend_values','max_spend_values','atmosphere','amenities']]
df2.head()
Out[77]:
products_values spend_values max_spend_values atmosphere amenities
0 0.0 5.0 5.0 4 1
1 0.0 5.0 7.0 3 1
2 1.0 5.0 0.0 4 4
3 1.0 5.0 6.0 4 4
4 0.0 5.0 5.0 3 2
In [78]:
# sum of 'Buy Product' responses
df2.products_values.sum()
Out[78]:
17.0

Summing products_values shows only 17 'Yes' responses, so we have an imbalanced class: over 80% of the survey answers were 'No'.
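
As a quick check (a sketch using the df2 frame above), value_counts makes the split explicit:

# sketch: show the class proportions directly
print(df2.products_values.value_counts(normalize=True))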


To account for this imbalance we will up-sample the minority class.

Up-sampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal.

https://elitedatascience.com/imbalanced-classes


Up-sampling the minority class

In [79]:
# import dependency from sklearn to resample
from sklearn.utils import resample
In [80]:
# Separate majority and minority classes
df_majority = df2[df2.products_values==0]
df_minority = df2[df2.products_values==1]

# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=104,    # to match majority class
                                 random_state=17) # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
 
# Display new class counts
df_upsampled.products_values.value_counts()
Out[80]:
1.0    104
0.0     87
Name: products_values, dtype: int64

Split our data into training and testing.

In [81]:
# Assign X (data) and y (target)
y = df_upsampled.products_values
X = df_upsampled.drop('products_values', axis=1)

print(X.shape, y.shape)
(191, 4) (191,)
In [82]:
# split sample into training and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
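
train_test_split holds out 25% of the rows by default, and stratify=y keeps the 'Buy Products' ratio roughly equal in both splits. A quick check (a sketch):

# sketch: confirm stratification preserved the class ratios
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))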

Create a Logistic Regression Model

In [83]:
# set class_weight='balanced': each class is weighted inversely
# proportional to its frequency in the training data
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(class_weight='balanced')
classifier
Out[83]:
LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

Fit (train) our model using the training data.

In [84]:
classifier.fit(X_train, y_train)
Out[84]:
LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

Validate the model using the test data.

In [85]:
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")
Training Data Score: 0.7902097902097902
Testing Data Score: 0.5208333333333334

The model currently scores 0.79 on the training data but only 0.52 on the test data.

Ideally we would have more survey results to pull from a larger sample. This is just a start, something we can build on as more responses come in.
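
With a sample this small, a single train/test split is noisy. As a sketch (reusing X and y from above), k-fold cross-validation gives a steadier estimate, with the caveat that up-sampled duplicates can land in both the training and validation folds and inflate the numbers:

# sketch: 5-fold cross-validation for a more stable score estimate
from sklearn.model_selection import cross_val_score

scores = cross_val_score(classifier, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Mean score: {scores.mean():.2f}")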

Make predictions

In [86]:
# predict using the X_test sample
predictions = classifier.predict(X_test)
print(f"First 10 Predictions:   {predictions[:10]}")
print(f"First 10 Actual labels: {y_test[:10].tolist()}")
First 10 Predictions:   [ 1.  1.  0.  0.  1.  1.  1.  0.  1.  0.]
First 10 Actual labels: [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
In [87]:
# create data frame to show results
pd.DataFrame({"Prediction": predictions, "Actual": y_test}).reset_index(drop=True)
Out[87]:
Actual Prediction
0 1.0 1.0
1 0.0 1.0
2 1.0 0.0
3 0.0 0.0
4 0.0 1.0
5 0.0 1.0
6 1.0 1.0
7 0.0 0.0
8 1.0 1.0
9 1.0 0.0
10 1.0 1.0
11 1.0 1.0
12 1.0 1.0
13 1.0 0.0
14 1.0 0.0
15 1.0 1.0
16 1.0 1.0
17 0.0 1.0
18 1.0 1.0
19 0.0 0.0
20 1.0 0.0
21 0.0 1.0
22 0.0 0.0
23 1.0 0.0
24 0.0 1.0
25 1.0 1.0
26 0.0 0.0
27 1.0 0.0
28 0.0 1.0
29 0.0 1.0
30 0.0 0.0
31 1.0 0.0
32 0.0 1.0
33 0.0 1.0
34 0.0 1.0
35 0.0 1.0
36 1.0 1.0
37 1.0 1.0
38 0.0 1.0
39 1.0 1.0
40 1.0 1.0
41 0.0 0.0
42 1.0 1.0
43 1.0 0.0
44 1.0 1.0
45 0.0 1.0
46 0.0 0.0
47 1.0 1.0
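
The accuracy_score imported at the top can confirm the test score, and a confusion matrix shows where the misses fall. A minimal sketch using the predictions above:

# sketch: summarize prediction quality on the test set
from sklearn.metrics import confusion_matrix

print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
print(confusion_matrix(y_test, predictions))  # rows: actual, columns: predicted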