
Capstone Project Report — Bertelsmann/Arvato Customer Segmentation Report



Project Overview

This is the final project of the Udacity Data Scientist Nanodegree. The goal of the project is to use appropriate data analytics tools and methodologies to predict potential customers for Arvato's mail-order organic products.

Historical data on general population demographics, on existing customers, and on client responses to a previous campaign have been provided to help predict which members of the general population are likely to be good responders to Arvato's marketing campaign.

There are 3 major sections in this project:

1. Understanding the business problem and data characteristics

2. Using unsupervised machine learning to identify which segments of the general population match Arvato's existing customer segments

3. Using supervised machine learning to predict which individuals are likely to respond to the company's marketing campaign.

All the supporting analysis and documentation can be found on GitHub.

Problem Statement

Marketing is crucial for the growth and sustainability of a business, as it helps build the company's brand, engage customers, grow revenue and increase sales. One of the key pain points for a business is understanding customers and identifying their needs in order to tailor campaigns to the customer segments most likely to purchase its products. Customer segmentation makes marketing campaigns easier to plan by focusing on specific customer groups instead of targeting the mass market, and is therefore more efficient in terms of time, money and other resources.

  • What is the relationship between the demographics of the company's existing customers and the general population of Germany?
  • Which parts of the general population are more likely to be part of the mail-order company's main customer base, and which parts are less so?
  • How can historical demographic data help the business build a prediction model and thereby identify potential customers?

Fortunately, these business questions can be answered using appropriate data analytics tools and methodologies.

The approach to solving the problem is to use an unsupervised machine learning algorithm, KMeans, to segment customers based on their demographic features, and then employ a binary classification model, DecisionTreeClassifier, to predict which members of the general population are likely to be Arvato's potential customers.

Datasets and Inputs

Four datasets provided by Arvato will be explored in this project:

1. Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891,211 persons (rows) x 366 features (columns).

2. Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191,652 persons (rows) x 369 features (columns).

3. Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42,982 persons (rows) x 367 features (columns).

4. Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42,833 persons (rows) x 366 features (columns).

Two metadata files are associated with these datasets:

1. DIAS Information Levels — Attributes 2017.xlsx is a top-level list of attributes and descriptions, organized by informational category.

2. DIAS Attributes — Values 2017.xlsx is a detailed mapping of data values for each feature, in alphabetical order.

Evaluation Metrics

This is a two-class classification problem. Because of the large class imbalance (most individuals did not respond to the mailout), the most appropriate evaluation metric is the Area Under the Receiver Operating Characteristic Curve (ROC AUC). The curve measures how well the model separates the two classes; the higher the score, the better the model is performing.
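As a quick illustration of the metric (on toy labels and scores, not the project's data), scikit-learn's roc_auc_score computes it directly from true labels and predicted probabilities for the positive class:

```python
from sklearn.metrics import roc_auc_score

# Toy example: true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_scores)
print(auc)  # 0.75
```

A score of 0.75 means the model ranks a randomly chosen positive above a randomly chosen negative 75% of the time; 0.5 is no better than chance and 1.0 is perfect separation.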

Data Exploration

I merged the 2 metadata files, DIAS Information Levels — Attributes 2017.xlsx and DIAS Attributes — Values 2017.xlsx, to form a data dictionary for Arvato's general population and customer demographics files.
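A minimal sketch of such a merge on toy frames; the column names ('Attribute', 'Description', 'Value', 'Meaning') are illustrative assumptions, not the exact headers of the Excel files:

```python
import pandas as pd

# Toy stand-ins for the two DIAS metadata files (column names are assumptions)
info = pd.DataFrame({'Attribute': ['AGER_TYP', 'ALTER_HH'],
                     'Description': ['best-ager typology', 'main age within household']})
values = pd.DataFrame({'Attribute': ['AGER_TYP', 'AGER_TYP', 'ALTER_HH'],
                       'Value': [-1, 0, 0],
                       'Meaning': ['unknown', 'no classification possible', 'unknown']})

# One row per (attribute, value) pair, keeping the attribute description
data_dict = info.merge(values, on='Attribute', how='left')
print(data_dict.shape)  # (3, 4)
```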

It is interesting to see how NaN values are encoded in the major attributes of the datasets.

The values -1, 0 and even 9 (not displayed in the screenshot above) represent 'unknown'.

We can see that most attributes have less than 30% missing values; the 7 columns with more than 30% NaN are considered outliers and are therefore removed.

Plotting the columns and their percentage of missing values side by side, we can see that most attributes have 30% or fewer missing values in both the general population and customer files. Those attributes will be retained; the rest (greater than 30%) will be dropped from the datasets.
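The column-level screening can be sketched as follows (a hypothetical helper, not the project's exact code):

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold=0.30):
    """Drop columns whose fraction of missing values exceeds `threshold`."""
    missing_frac = df.isna().mean()          # per-column fraction of NaN
    keep = missing_frac[missing_frac <= threshold].index
    return df[keep]

# Toy demonstration
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [1, np.nan, np.nan, np.nan],   # 75% missing -> dropped
                   'c': [1, np.nan, 3, 4]})            # 25% missing -> kept
print(drop_sparse_columns(df).columns.tolist())  # ['a', 'c']
```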

One interesting observation from the charts is that both files have the same number of columns (~80) with no missing values. The mode of the percentage of missing values is 12% in the population file, whereas it is 27% in the customer file.

Let's look at the NaN analysis at row level in the figure on the left. The population file appears to have fewer NaNs than the customer file: the mean number of NaNs per row is 37 in the population file, whereas it is 72 in the customer file. Similarly, at the third quartile, it is 16 in the population file and 225 in the customer file.

Most attributes in the general population and customer files are numeric, with only a few categorical variables. As you can see in the screenshot of the azdias file below, 360 out of 366 variables are numeric.

Data Pre-processing

Data cleansing

Data cleansing plays an important role as it improves data quality and therefore prediction. The following steps have been performed:

  • drop rows with more than 75% missing values
  • drop columns with more than 70% missing values
  • drop the customer id column
  • drop categorical columns (there are only a few, not worth employing an encoding technique)
  • drop the 3 columns that exist in the Customer file but not in the Population file
  • for numeric variables, replace NaN with the value implying 'unknown' in the data dictionary, in this case -1
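The bullets above can be combined into one cleansing function; this is a sketch under the stated thresholds, not the project's exact code, and the toy column names (LNR, CAT, X) are hypothetical:

```python
import numpy as np
import pandas as pd

def clean(df, row_thresh=0.75, col_thresh=0.70, drop_cols=()):
    """Sketch of the cleansing steps listed above."""
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])
    df = df[df.isna().mean(axis=1) <= row_thresh]    # drop rows > 75% NaN
    df = df.loc[:, df.isna().mean() <= col_thresh]   # drop columns > 70% NaN
    df = df.select_dtypes(include=[np.number])       # drop categorical columns
    return df.fillna(-1)                             # NaN -> 'unknown' (-1)

# Toy demonstration (LNR stands in for the customer id column)
df = pd.DataFrame({'LNR': [1, 2, 3],
                   'CAT': ['a', 'b', 'c'],
                   'X': [np.nan, 2, 3]})
cleaned = clean(df, drop_cols=['LNR'])
print(cleaned.columns.tolist())  # ['X']
```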

Feature scaling

With 98% of variables being numeric, feature scaling is an essential preprocessing step, especially for KMeans: this distance-based algorithm is sensitive to the scale of the variables.

There are many debates on Stack Exchange and Stack Overflow about which scaler should be employed. In this exercise I will use StandardScaler with default parameters:

from sklearn.preprocessing import StandardScaler
import pandas as pd

scaler = StandardScaler()

df_scaled = pd.DataFrame(scaler.fit_transform(df), index=df.index, columns=df.columns)

The output of sklearn.preprocessing.StandardScaler is a NumPy array; however, I converted it into a pandas DataFrame because I need the column names for reverse engineering later.

Feature reduction

Given the huge number of explanatory variables (350), many of which may not contribute to predicting the target variable, I used sklearn.decomposition.PCA to limit the number of components fed to the machine learning models.

Let's have a look at the chart below. The choice of 150 principal components seems reasonable: it removes more than half of the features while still retaining more than 80% of the explained variance.
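A sketch of the dimensionality reduction step, illustrated on random data rather than the actual Arvato features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the scaled 350-feature matrix (random data for illustration)
rng = np.random.RandomState(42)
X = rng.normal(size=(500, 350))

pca = PCA(n_components=150, random_state=42)
X_pca = pca.fit_transform(X)

# Cumulative explained variance of the retained components
retained = pca.explained_variance_ratio_.sum()
print(X_pca.shape)  # (500, 150)
```

On the real data, the scree plot of `pca.explained_variance_ratio_.cumsum()` is what justifies the 150-component cut-off.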

First 10 records of the first principal component

The first principal component relates to the social status and lifestyle of individuals (mobility, social status, 1–2 family houses, number of buildings, share of cars, lifestyle).

First 10 records of the second principal component

This is an interesting component: the attributes all start with 'KBA05', which are vehicle-related features.

Unsupervised Machine Learning

KMeans Algorithm

The reason I chose the KMeans algorithm is that it is relatively simple to implement and scales to large datasets.

Let's start with finding the optimum number of clusters.

The plot above suggests a number of clusters between 4 and 6; let's pick the number of clusters by calculation.

The score difference between k and (k-1) shrinks as the number of clusters increases. From the figure on the left, we can see that starting from k=5, the score difference (-3515696.98195) becomes small.

So k = 5 seems to be the right choice.
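The elbow calculation can be sketched as follows, on toy data (the real run used the 150 PCA components):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated toy clusters stand in for the PCA-transformed data
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

# Inertia (within-cluster sum of squares) for each candidate k
inertias = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

# The drop in inertia between consecutive k values shrinks past the elbow
drops = {k: inertias[k - 1] - inertias[k] for k in range(3, 8)}
```

Choosing k is then a matter of finding where these consecutive drops stop being large.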

Applying KMeans(5) to the PCA-transformed datasets gives the results below:

Customer Segmentation Report

Comparing the proportion of persons in each cluster of the general population and of Arvato's customers, we can see big differences in Cluster 1 and Cluster 5.

A higher proportion of persons in a cluster in the customer data compared to the general population suggests that the people in that cluster are likely targets for the company, because they possess the characteristics of Arvato's customers. That means the population in Cluster 1 and Cluster 4 is more likely to be part of the mail-order company's main customer base, while Cluster 3 and Cluster 5 of the general population are less so. There is not much difference in Cluster 2, which suggests part of this population may be potential customers.
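The comparison logic can be sketched like this; the toy label arrays below stand in for the output of kmeans.predict() on each dataset:

```python
import numpy as np

def cluster_proportions(labels, n_clusters=5):
    """Share of individuals falling in each cluster."""
    counts = np.bincount(labels, minlength=n_clusters)
    return counts / counts.sum()

# Hypothetical cluster assignments for population and customers
pop_labels = np.array([0, 0, 1, 2, 3, 4, 4, 4])
cust_labels = np.array([0, 0, 0, 0, 1, 2])

diff = cluster_proportions(cust_labels) - cluster_proportions(pop_labels)
# Positive entries: clusters over-represented among customers (likely targets)
# Negative entries: clusters under-represented among customers
```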

Supervised Machine Learning

Data Exploration

Given that mailout_train and mailout_test are similar to the customers file, which has been thoroughly analysed in previous sections, my focus now is on the distribution of 'RESPONSE' values in the mailout_train dataset.

We can clearly see a huge imbalance in the values of the 'RESPONSE' attribute; let's check the exact counts and proportions of each class below:

The percentage of individuals who responded is 1.24%, compared to 98.76% who did not respond.
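Checking the imbalance is a one-liner with pandas; the toy series below mimics a ~1% positive rate rather than the actual RESPONSE column:

```python
import pandas as pd

# Toy stand-in for mailout_train['RESPONSE'] with a 1% positive rate
response = pd.Series([0] * 99 + [1])

print(response.value_counts())                 # absolute counts per class
print(response.value_counts(normalize=True))   # 0 -> 0.99, 1 -> 0.01
```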

Data preparation (removing NaNs, imputing NaNs, dropping columns, scaling) is similar to what was done for the azdias and customers files.


As illustrated in the previous section, individuals who did not respond (0) outnumber those who responded (1) by roughly 80 to 1. Therefore accuracy is a poor metric to use, and the Receiver Operating Characteristic Area Under the Curve (ROC AUC) will be used instead. The higher the ROC AUC score, the better the model performs; in other words, the closer the ROC AUC is to 1, the better the model predicts who is likely to respond or not respond.

Base Models

Given the imbalance of 'RESPONSE', I calculated class weights, which are passed to my models where applicable.
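A sketch of the class-weight calculation using scikit-learn's built-in helper, on a toy 9:1 imbalanced vector rather than the real RESPONSE column:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy RESPONSE vector with a 9:1 imbalance
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weights each class by n_samples / (n_classes * count)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y)
class_weight = dict(zip([0, 1], weights))
print(class_weight)  # {0: 0.555..., 1: 5.0}
```

This dictionary can be passed as `class_weight=` to the estimators that support it.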

My selected algorithms are:

1. RandomForestClassifier

2. LogisticRegression

3. DecisionTreeClassifier

4. GradientBoostingClassifier
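A sketch of the model comparison loop, run here on synthetic imbalanced data rather than the mailout features (note that GradientBoostingClassifier does not accept a class_weight parameter):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the prepared mailout_train
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

models = {
    'RandomForestClassifier': RandomForestClassifier(class_weight='balanced', random_state=0),
    'LogisticRegression': LogisticRegression(class_weight='balanced', max_iter=1000),
    'DecisionTreeClassifier': DecisionTreeClassifier(class_weight='balanced', random_state=0),
    'GradientBoostingClassifier': GradientBoostingClassifier(random_state=0),  # no class_weight
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    results[name] = scores.mean()
    print(f'{name}: {results[name]:.3f}')
```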

Model Evaluation and Validation

Given the ROC AUC training scores in the previous section, I picked GradientBoostingClassifier to explore and fine-tune further.

The approach for GradientBoostingClassifier evaluation and validation is as follows:

1) validate the base model using StratifiedKFold(5) cross-validation

2) fine-tune hyperparameters using GridSearchCV

3) retrain the model using the best parameters from the step above

4) apply StratifiedKFold validation to the tuned model

5) compare validation scores before and after tuning.
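The validation step can be sketched as below; StratifiedKFold keeps the ~1% positive rate roughly constant in every fold (synthetic data is used here in place of the real features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for the prepared mailout_train data
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GradientBoostingClassifier(random_state=0),
                         X, y, cv=cv, scoring='roc_auc')
print(scores.mean(), scores.std())
```

Running the same loop on the base and the tuned model gives the before/after comparison in step 5.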

The hyperparameter grid used for GridSearchCV:

param_grid = {
    'learning_rate': [0.05, 0.1, 0.15],
    'max_features': ['log2', 'sqrt'],
}
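Wiring the grid into GridSearchCV looks roughly like this (synthetic data and a small 3-fold CV here, to keep the sketch fast; the real search used the full training data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

param_grid = {'learning_rate': [0.05, 0.1, 0.15],
              'max_features': ['log2', 'sqrt']}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid,
                      scoring='roc_auc',
                      cv=StratifiedKFold(n_splits=3),
                      n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```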



The parameters that gave the best validation scores when experimenting with different hyperparameter values are:

learning_rate = 0.05

max_depth = 3

n_estimators = 50

max_features = None

I retrained the model using the hyperparameters above. The validation scores are below:

From the scores above we can see that GradientBoostingClassifier gives consistent scores across the 5 folds.

There is a slight improvement after tuning the parameters, so the retrained model will be used to predict 'RESPONSE' in the mailout_test file.


One pain point was tuning hyperparameters with GridSearchCV: it took forever to run, so I had to kill it and manually train the model with different hyperparameters one by one. I believe the solution is reasonably adequate, as I have experimented with different models, hyperparameters and validation techniques, and thoroughly analysed, wrangled and scaled the data.

I think the validation score of 0.7687 is quite good; in addition, the consistency of scores across the 5 folds gives me confidence that my model is robust to small perturbations in the training data, and therefore reliable in predicting individuals' responses.


This project gave me the opportunity to expand and apply my knowledge to a real-world business problem. I went through the end-to-end solution journey: starting from business understanding and data understanding, using unsupervised machine learning to cluster customers, then exploring 4 supervised classification models, experimenting with various validation and hyperparameter tuning techniques, and finally building a model that I am comfortable with. Of course, there is still lots of room for improvement: using a Pipeline for more modular code, feature engineering, and more charts/diagrams to visualize results.
