Korey Lee, Vice President, Data, South China Morning Post
One challenge every business encounters is how to retain customers. How do we retain, re-engage, and keep our users captivated? What patterns of engagement or triggers create loyal return customers?
In the news media business, we’re also seeking more meaningful ways to engage with our readers, but we have the added challenge of having less data than a software, subscription, or e-commerce business.
In January 2019, SCMP started working on a machine learning project that would predict if a reader would return and become a loyal reader. We decided to internally codename this project “Bluefin” for two reasons:
1. Bluefin tuna return to their birthplace to spawn, just as we hope our users return to find meaningful content and information.
2. Bluefin populations are sadly critically endangered in the wild, as are loyal customers: both worth protecting and nurturing back to healthy levels.
This predictive scoring engine stack-ranks our users and optimizes marketing campaigns to focus resources on our highest-potential readers. All marketing budgets are limited, and audience selection is critical when optimizing retargeting campaigns. To measure the effectiveness of the engine, we ran A/B tests with the hypothesis that the Bluefin-selected audience would outperform the control group.
Here are the steps we took to build this out:
1) Data Preparation and Training
As news cycles are rather unpredictable and may not capture seasonality, we wanted to ensure we selected a period that generally avoided holidays or big spikes in our traffic that would skew our model. We selected the period from June 2018 to November 2018 as it was a relatively “normal” period in our traffic with consistent return readers and also generally avoided holiday troughs.
2) Model Training
Loyalty is defined as a multi-session user who returns to the site with a pre-specified frequency and recency. While we included over 40 raw variables in the model, we found several features were most impactful in driving loyalty:
● Percentage of page views in each section
● Time on page
● Days between the last 2 visits
● Percentage of sessions on various platforms
● Percentage of sessions by source and medium
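To make the feature list above concrete, here is a minimal sketch of what one row per reader might look like. The column names and values are illustrative assumptions for this article, not SCMP’s actual schema:

```python
import pandas as pd

# Hypothetical per-reader feature table mirroring the variables listed above.
features = pd.DataFrame({
    "user_id": [101, 102, 103],
    "pct_views_news": [0.6, 0.1, 0.3],       # % of page views in one section
    "avg_time_on_page": [45.0, 12.0, 80.0],  # seconds
    "days_between_last_2_visits": [1, 14, 3],
    "pct_sessions_mobile": [0.8, 0.2, 0.5],  # % of sessions on each platform
    "pct_sessions_direct": [0.5, 0.1, 0.7],  # % of sessions by source/medium
})

# Label: 1 if the user met the loyalty frequency/recency definition, else 0.
labels = pd.Series([1, 0, 1], name="is_loyal")
```

A table like this, one row per user, is the typical input to the model-training step described next.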
3) Loyalty Prediction
In terms of evaluation criteria, accuracy can be misleading here: there are far more non-loyal readers than loyal readers, so the classes are imbalanced.
For such imbalanced classification problems, we instead use the F1 score, which balances precision and recall, to evaluate a model’s performance. For marketing purposes, gain and lift are used to select targeted campaign lists. We assigned thresholds from 0% to 100%, with the metrics defined as follows:
Recall = True Positives / Total Actual Loyal Users
(the proportion of actual loyal users the model predicted correctly)
Precision = True Positives / Total Predicted Loyal Users
F1 score = 2 x ((Precision x Recall) / (Precision + Recall))
True Positive ⇒ loyal reader correctly predicted as loyal
False Negative ⇒ loyal reader incorrectly predicted as fly-by
True Negative ⇒ fly-by reader correctly predicted as fly-by
False Positive ⇒ fly-by reader incorrectly predicted as loyal
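These definitions can be checked with a small worked example; the true labels and predictions below are hypothetical:

```python
# 1 = loyal reader, 0 = fly-by reader (hypothetical data)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # loyal, predicted loyal
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # loyal, predicted fly-by
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # fly-by, predicted loyal

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * (precision * recall) / (precision + recall)
print(precision, recall, f1)  # here all three come out to 0.75
```

Sweeping the classification threshold and recomputing these metrics at each cut-off is what produces the threshold calibration described below.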
The predictions had to meet specific accuracy and recall limits, which we then used to calibrate our classification threshold.
Lift and gain are metrics used to answer the following:
a) How good is the predictive model?
b) Which readers are most likely to return, and should therefore be selected for campaign retargeting and lookalike prospecting?
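As a worked illustration of gain and lift (with made-up scores and outcomes), target the top-scored slice of readers and compare against random selection:

```python
# Hypothetical model scores, sorted descending, with actual return outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actual = [1,   1,   0,   1,   0,   0,   1,   0,   0,   0]  # 1 = actually returned

# Target the top 30% of readers by score.
k = int(len(scores) * 0.3)                # top 3 users
gain = sum(actual[:k]) / sum(actual)      # share of all loyal users captured: 2/4 = 0.5
lift = gain / (k / len(scores))           # vs. random targeting: 0.5 / 0.3 ≈ 1.67
```

A lift above 1.0 means the model-selected audience contains proportionally more returning readers than a random sample of the same size, which is what makes it useful for choosing retargeting lists.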
4) Overall workflow
The pipeline was built from five primary components, starting with data extraction.
[The general workflow for the Bluefin model]
The scoring engine’s output is fed back into the data engine to continuously improve the model on a monthly basis. In this way, the model refreshes itself from the latest data, recalibrating the scoring engine and incorporating any new variables that prove significant.
5) Model Technique & Considerations
Let’s go back to our primary objective for Bluefin: to use historical data to predict whether infrequent users will become loyal customers.
We ran separate models for each country, since our readership in each region differs substantially in content preferences and the insights are therefore specific to each market. In this case, it doesn’t make sense to meld what should be multiple models into one. Furthermore, we looked at random forest⁴, grid search⁵, k-fold cross validation⁶, and logistic regression⁷ to assess which robust approach would be best suited to our objectives. We also used data from January to April 2019 to perform cross-time validation, confirming that the model built on 2018 data could predict uplift on 2019 data. In the end, we chose random forest for its ability to determine which users are likely to return and which segments of visitors are most valuable for retargeting.
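As a sketch of that model-selection loop, here is a random forest tuned by grid search with 5-fold cross validation, scored on F1 to respect the class imbalance. It uses scikit-learn on synthetic data, since the article doesn’t specify its tooling; the grid values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the reader feature matrix, with an 80/20 class imbalance.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8], random_state=0)

# The grid mirrors the search dimensions mentioned above:
# number of trees, maximum depth, and split criterion.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, None],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1", cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`GridSearchCV` evaluates every hyper-parameter combination with 5-fold cross validation and keeps the one with the best mean F1, rather than a randomly chosen setting.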
[A visual representation of k-fold cross validation]
4 Random forest is an ensemble method that can be thought of as one model combining hundreds or thousands of decision trees. With the bagging method, each tree is built on a bootstrapped sample of the data and variables. Random forest also helps to dilute dominant signals and capture weak but useful information. Furthermore, it reduces the impact of outliers, saving us time on normalization.
5 Grid search tries different combinations of hyper-parameters such as “the number of trees”, “the maximum depth of trees”, and “the split method”, and chooses the combination that best fits the data, rather than a random or default setting.
6 In K-fold cross validation, we further split our data set into K number of subsets called folds.
We then iteratively fit the model K times, each time training on K-1 of the folds and evaluating on the Kth fold (the validation data). As an example, consider fitting a model with K = 5. In the first iteration we train on the first four folds and evaluate on the fifth. In the second we train on the first, second, third, and fifth folds and evaluate on the fourth. We repeat this procedure three more times, each time evaluating on a different fold. At the end, we average the performance across the folds to produce the final validation metrics for the model.
7 Logistic regression is most frequently used when the dependent variable is dichotomous; it aims to explain the relationship between the outcome and the independent variables, which can include nominal, ordinal, and continuous factors.
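The K-fold procedure in footnote 6 can be illustrated with scikit-learn’s `KFold` on a toy data set (an assumed tool; any implementation follows the same split pattern):

```python
from sklearn.model_selection import KFold

data = list(range(10))    # toy data set of 10 samples
kf = KFold(n_splits=5)    # K = 5

# Each iteration trains on 4 folds (8 samples) and validates on the 5th (2 samples).
for train_idx, val_idx in kf.split(data):
    print("train:", list(train_idx), "validate:", list(val_idx))
```

Every sample serves as validation data exactly once across the five iterations, which is what makes the averaged metric a fair estimate of out-of-sample performance.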
6) Results
We tested the Bluefin audience in multivariate A/B test campaigns on the Google Display Network in the US and Asia regions. We found that engagement increased by 58%-78% and cost efficiency improved by 36%-52%.
While it’s still early days, we’re excited about engaging with our loyal readers in more meaningful ways and being able to leverage our marketing budgets more effectively and efficiently. Additionally, we’re also starting to use Bluefin to prospect for new look-alike audiences who share similar characteristics to our existing high potential loyal users.
7) Next steps
Algorithms are never finished and as our customers and marketplace continue to evolve, we need to adapt accordingly. We’ve designed Bluefin with feedback loops in mind and will continue to factor in new variables, customer behaviors, and further raise the bar on what we internally define as a loyal reader.