Beyond CLV: Modeling meaningful moments in the customer journey

An approach for modeling alternative customer actions like loyalty sign-ups, credit card enrollment, or gift giving

By Nick Resnick

Introduction

Predictive marketing analytics mostly focuses on the main revenue-generating activities of a business, e.g. purchases in retail or bookings in hospitality. While it’s true that these activities have the most direct impact on customer lifetime value (CLV), they’re not the only moments in the customer journey that influence CLV.

In recent years, businesses have expanded their services and products to create a more layered customer experience. Things like loyalty programs, credit cards, pop-up shops, and subscription services have become increasingly popular, yet they’ve largely been ignored by modern marketing analytics. There are a few data science problems in this context that would be interesting to study:

  1. To what extent do these offerings actually impact CLV?
  2. Can we measure the effectiveness of campaigns driving conversions of these offerings?
  3. Can we even predict which customers are most likely to engage?

Figure 1: We envision a predictive modeling pipeline that considers various event types beyond purchases (green), such as loyalty signups (yellow) and page views (red)

A decent amount of research already exists on the first question, and the second one can be difficult to measure due to unreliability of email engagement and attribution data. At Amperity, we thought it would be interesting to tackle the third problem to see if we could identify customers most likely to engage in certain events along the customer journey. Many of the businesses we work with are interested in this question, and have established their own business rules for targeting customers likely to engage. We wanted to see if we could leverage Amperity’s unified customer data pipeline, along with expertise in predictive marketing analytics via the recent acquisition of Custora, to deliver more accurate predictions to our clients. The idea is that with more accurate predictions, businesses will a) be able to reduce marketing costs by targeting fewer people, and b) increase conversions in activities known to drive CLV.

We started by looking at the most popular alternative event type for most businesses: loyalty program sign-ups. Our research question was: “Can we predict who’s going to sign up for a loyalty program in the next 90 days?” To test this out, we used first-party data from a major US airline with one of the most popular loyalty programs in the industry. Employing a random forest model with a diverse set of features, we achieved an F1 score of 0.87, a 37% lift over the best performing baseline approach.

Data and Environment

The schema of the data used in this research is shown in Figure 2. The core datasets used were loyalty membership records, bookings, and flights. We only considered customers that have booked at least once and filtered out those that enrolled in loyalty before their first booking. In cases where customers had more than one loyalty sign-up, we only considered the first instance.
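To make the population filter concrete, here’s a minimal pyspark sketch. The `bookings` and `loyalty` DataFrames and their `customer_id`, `booked_at`, and `enrolled_at` columns are hypothetical stand-ins (the real Amperity schemas differ); taking the earliest enrollment per customer also handles the “first instance only” rule.

```python
from pyspark.sql import functions as F

# Hypothetical DataFrames/columns for illustration only.
first_booking = bookings.groupBy("customer_id").agg(F.min("booked_at").alias("first_booked_at"))
first_enroll = loyalty.groupBy("customer_id").agg(F.min("enrolled_at").alias("first_enrolled_at"))

population = (
    first_booking  # customers with at least one booking
    .join(first_enroll, "customer_id", "left")
    # Drop customers who enrolled in loyalty before their first booking.
    .filter(
        F.col("first_enrolled_at").isNull()
        | (F.col("first_enrolled_at") >= F.col("first_booked_at"))
    )
)
```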

Figure 2: Table schemas from Amperity’s internal database used for the research. Demographic data was taken from Bookings to preserve timing. Some seating and reservation data lived on a separate table that was joined to Flights by booking_id.

The research was conducted in Databricks using pyspark [1]. A high-level workflow of this research is illustrated in Figure 3.

Models and Experimental Setup

Loyalty sign-ups are an exceedingly rare event: on average, only about 0.2% of active customers that haven’t already enrolled do so in a given 90-day window. To increase the sample size of positive labels (i.e. enrolls) in the dataset, we generated examples using 24 feature/label split dates, each a month apart, from 2018-01-31 to 2020-01-01. For each of the 24 split dates, we generated features using all data up to that date and computed a binary label representing whether each customer signed up for the loyalty program in the subsequent 90-day period. This produced 900 MM examples with 1.74 MM positive cases. While the positive rate stays constant, the larger pool of positives lets us downsample the negative cases (discussed below) while still retaining a decently sized train set.
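As a rough sketch of how the examples could be assembled, here’s one way to do it in pyspark; the `flights`, `loyalty`, `split_dates`, and `build_features` names are hypothetical stand-ins for our actual pipeline.

```python
from functools import reduce
from pyspark.sql import functions as F

def examples_for_split(split_date, flights, loyalty, build_features):
    """split_date is an ISO date string, e.g. "2018-01-31"."""
    cutoff = F.to_date(F.lit(split_date))
    # Features use only data strictly before the split date.
    features = build_features(flights.filter(F.col("flight_date") < cutoff), split_date)
    # Label = 1 if the customer enrolled within 90 days after the split date.
    positives = (
        loyalty
        .filter((F.col("enrolled_at") >= cutoff) & (F.col("enrolled_at") < F.date_add(cutoff, 90)))
        .select("customer_id")
        .distinct()
        .withColumn("label", F.lit(1))
    )
    return features.join(positives, "customer_id", "left").fillna(0, subset=["label"])

# Stack all 24 split dates into one example set.
examples = reduce(
    lambda a, b: a.unionByName(b),
    [examples_for_split(d, flights, loyalty, build_features) for d in split_dates],
)
```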

We used spark MLlib’s random forest classification model for the analysis. A random forest was chosen for its out-of-the-box feature importance calculations, its ability to model nonlinear relationships between features and labels, and its flexibility, which matters given that we eventually want to train loyalty sign-up models across Amperity’s tenants. We conducted cross validation over the full feature set to arrive at optimal hyperparameters [2].
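A minimal sketch of the model and tuning setup in MLlib, assuming a `train` DataFrame with an assembled `features` vector and a binary `label` column; the grid values here are illustrative (footnote 2 lists the values that actually won).

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(featuresCol="features", labelCol="label", seed=42)

# Illustrative grid over the hyperparameters we tuned.
grid = (
    ParamGridBuilder()
    .addGrid(rf.maxDepth, [10, 20, 30])
    .addGrid(rf.minInstancesPerNode, [1, 2, 4])
    .addGrid(rf.numTrees, [150, 300])
    .build()
)

evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train)  # cv_model.bestModel holds the winning random forest
```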

Figure 3: Data science workflow

Due to the random forest’s sensitivity to imbalanced datasets, we conducted stratified sampling so that the train and test sets had the same number of positive and negative labels. In the interest of runtime, we also downsampled from 900 MM to 300 MM examples before running the experiment. After both of these downsamples, the train and test datasets ended up with 810.6k and 346.7k examples, respectively [3].
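One way to do this kind of downsampling in pyspark is with `sampleBy`; a minimal sketch follows (the exact fractions and split used in the research differed).

```python
# Keep every positive example and a matching-sized random sample of negatives.
pos_count = examples.filter("label = 1").count()
neg_count = examples.filter("label = 0").count()

balanced = examples.sampleBy("label", fractions={1: 1.0, 0: pos_count / neg_count}, seed=42)

# Split the balanced examples into train and test sets.
train, test = balanced.randomSplit([0.7, 0.3], seed=42)
```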

To evaluate the efficacy of our model, we created a handful of baseline models that represent the airline’s next best alternative for predicting loyalty sign-ups. The baseline models we generated internally include: number of flights all time, number of flights last year, number of miles flown all time, number of miles flown last year, days since last flight, and has upcoming flight. We also included the airline’s actual loyalty program targeting rule that they orchestrate out of Amperity, which is whether a customer flew exactly four days ago (i.e. recency == 4.0). For all baselines except the last two, we swept over the deciles of the feature to find the threshold that yielded the highest F1 score on the train set; for the last two, we predicted 1 if the boolean feature was true and 0 otherwise.
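The decile sweep for the numeric baselines can be sketched as follows; `best_decile_threshold` is a hypothetical helper, assuming a DataFrame with the baseline feature and a binary `label` column.

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import functions as F

def best_decile_threshold(df, feature, label_col="label"):
    """Sweep the feature's deciles and return the (cutoff, F1) pair that scores best."""
    evaluator = MulticlassClassificationEvaluator(labelCol=label_col, metricName="f1")
    deciles = df.approxQuantile(feature, [i / 10.0 for i in range(1, 10)], 0.01)
    best = None
    for cutoff in deciles:
        # Predict 1 when the feature is at or above the candidate cutoff.
        scored = df.withColumn("prediction", (F.col(feature) >= cutoff).cast("double"))
        f1 = evaluator.evaluate(scored)
        if best is None or f1 > best[1]:
            best = (cutoff, f1)
    return best
```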

Features

The majority of the analysis focused on generating and evaluating features for the random forest model. There are six main classes of features we considered:

  • Demographic features. This includes things like age, email domain, and zip. Demographic features were derived from booking data associated with a timestamp to avoid information leakage into the features. This matters because demographic data commonly only becomes available after a customer signs up for loyalty, so someone who signed up after the feature generation period could have populated demographic fields in a cross-sectional (i.e. “timeless”) dataset.
  • Historical flight features. Various properties of the customer’s flight history, including things like inferred home airport, most common service class, and last flight miles.
  • Upcoming flight features. Various properties of a customer’s upcoming flights, like how far in advance it is from the split date, how many miles are being flown, and if the customer has booked a first class ticket.
  • Seasonality features. This includes split date and last flight week to capture time-of-year information.
  • High dimensional flight features using word2vec. Each customer’s full airport and service class histories were concatenated into a string (e.g. “JFK-LAX-LAX-JFK” and “X-G” for airports and service classes, respectively) and fed into MLlib’s word2vec model [4]. The idea here was to create a feature embedding that preserves full flight history and encodes similarities between certain airports and service classes.
  • Flight/Demographic statistics by group. To the best of our knowledge, this is a novel way of incorporating interactions between features. Features in this class were calculated by grouping by some categorical feature (e.g. email domain, inferred home airport) and computing aggregated statistics for one or more numerical features (e.g. average miles flown, number of flights). These aggregated statistics are then joined back to each customer based on their value for the categorical feature, as sketched in the code after this list.
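A minimal sketch of the stats-by-group calculation; `customer_features` and the column names (`email_domain`, `miles_flown`, `num_flights`) are hypothetical placeholders for a per-customer feature table.

```python
from pyspark.sql import functions as F

group_col = "email_domain"  # any categorical feature works as the grouping variable

# One row per distinct value of the categorical feature: an N x (M + 1) statistics table.
group_stats = (
    customer_features
    .groupBy(group_col)
    .agg(
        F.avg("miles_flown").alias(group_col + "_avg_miles"),
        F.avg("num_flights").alias(group_col + "_avg_flights"),
    )
)

# Join the group-level stats back to each customer, encoding the interaction.
with_group_stats = customer_features.join(group_stats, on=group_col, how="left")
```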

Illustrations of how feature classes 1-5 and feature class 6 were calculated are shown in Figures 4 and 5, respectively.

Figure 4: Example of how user-level features are generated for different split dates

To gain insight into how each feature class impacted model performance, we conducted a feature ablation study. A feature ablation study trains a model on each possible combination of feature classes (in our case 2^6 - 1 = 63 combinations) and records the evaluation metric (in our case, F1) on the evaluation dataset for each. This is helpful for measuring how much predictive value each feature class adds to the model. Figure 6 contains the results of the feature ablation study.
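To make the procedure concrete, here’s a rough sketch of the ablation loop; `fit_random_forest` and `evaluate_f1` are hypothetical helpers wrapping the MLlib training and evaluation code, and the column lists are placeholders.

```python
from itertools import combinations

feature_classes = {
    "demographic": demo_cols,
    "historical_flight": hist_cols,
    "upcoming_flight": upcoming_cols,
    "seasonality": season_cols,
    "word2vec": w2v_cols,
    "stats_by_group": group_stat_cols,
}

results = {}
names = list(feature_classes)
for r in range(1, len(names) + 1):          # 2^6 - 1 = 63 non-empty combinations
    for combo in combinations(names, r):
        cols = [c for name in combo for c in feature_classes[name]]
        model = fit_random_forest(train, cols)            # hypothetical helper
        results[combo] = evaluate_f1(model, test, cols)   # hypothetical helper
```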

Note that features that belong to different feature classes may encode similar information. For example, age and email domain are both encoded in the Demographic class as well as the Flight/Demographic Stats by Group class (since they’re both grouping variables). This likely explains why the Demographic and Flight/Demographic Stats by Group feature classes have relatively similar performance.

Figure 5: Illustration of how features in the Flight/Demographic Stats by Group class are calculated

Speaking of which, we were interested to see that these were the two most predictive feature classes. Thanks to Amperity’s identity resolution pipeline, demographic data was relatively dense, and as we saw in our predictive CLV paper, accurate, dense demographic data can have a large impact on predicting events in the customer journey. On top of that, the Flight/Demographic Stats by Group class is the only one to encode feature interaction, so its performance may imply a strong nonlinear relationship between our feature set and loyalty sign-ups.

The Flight/Demographic Stats by Group class is especially useful for practitioners building models in pyspark who are looking to incorporate feature interactions, as the MLlib Interaction transformer isn’t available in pyspark as of 2.4.5. Even if it were available, since the statistics table for one group has dimension N x (M + 1), where N is the number of distinct values in the categorical feature and M is the number of numerical features you want to aggregate, this is a much more computationally feasible approach than vanilla feature interaction.

Figure 6: Feature ablation results (UpSet plot)

Comparing the ablation plot to the results table below, you can see that the Seasonality class alone (which includes last flight week and the split date used to generate features) has an F1 score close to 0.6, which is around the performance of the best-performing naive model. You can also see that the fit with only the Upcoming Flight feature class has about the same F1 score as a baseline model using has_upcoming_flight as a boolean predictor. This implies that attributes of the upcoming flight don’t add much value beyond the fact that an upcoming flight exists.

One of the biggest learnings we had in this analysis came from optimizing the feature generation code. Originally written in pandas, the transformations were first ported to pyspark using PandasUDFType.GROUPED_MAP, which allowed us to write a vectorized spark UDF with minimal changes. Unfortunately, generating features over 24 split dates on a 5 MM customer downsample took roughly 19 hours, which made feature iteration largely infeasible. To resolve this, we rewrote the feature generation code in spark sql, which decreased the runtime to 26 minutes on the full 60 MM customer dataset. Wow. In short, the key was to group by customer and aggregate flights (or purchases, hotel stays, etc. depending on the industry) into an array of structs, then write a spark sql query that leverages built-in array aggregation functions on the grouped data.
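A minimal sketch of that pattern, with a hypothetical `flights` temp view and column names; Spark 2.4’s built-in higher-order array functions (`aggregate`, `transform`, `array_max`) do the per-customer work.

```python
features_df = spark.sql("""
    WITH grouped AS (
      SELECT
        customer_id,
        -- Collapse each customer's flight history into an array of structs.
        collect_list(struct(flight_date, miles, service_class)) AS flights
      FROM flights
      WHERE flight_date < DATE '2019-06-30'   -- example split date
      GROUP BY customer_id
    )
    SELECT
      customer_id,
      size(flights)                                        AS num_flights,
      aggregate(flights, 0L, (acc, f) -> acc + f.miles)    AS total_miles,
      array_max(transform(flights, f -> f.flight_date))    AS last_flight_date
    FROM grouped
""")
```

Because the heavy lifting happens in a plain SQL string, the same query can move between notebooks, jobs, and languages without rewriting UDFs.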

Results

Using the best performing feature combination from the ablation study, our model produced an F1 score of 0.870, roughly a 37% increase over the best baseline model F1 of 0.63. Table 1 contains a comparison between our model and the basket of baseline models we considered. We’re particularly excited by the 860% boost in F1 score between our model and the baseline that the airline currently uses in Amperity.

Table 1: F1 scores for our random forest model against a series of baseline approaches

Conclusion and Future Work

Our goal was to generate useful predictions for loyalty program sign-ups as a primary use-case for modeling alternative moments in the customer journey. Such predictions allow businesses to target customers most likely to convert, reducing the cost of irrelevant outreach and increasing engagement in valuable actions. Given a solid F1 score and meaningful lift over the airline’s next best alternative, we’re excited to continue development of this ML pipeline across event types and industries.

We’re especially interested in the generalizability of the approach, both across event types and industries. Looking at the feature classes used in this research, you could imagine replacing instances of “flight” with “stay” and applying the model to a hospitality loyalty program use-case. Similarly, while this research focused on loyalty program sign-ups, you could imagine predicting credit card sign-ups, website visits, or even traditional purchases in a similar framework. The main difference between each use-case is the features you’d include in the model [5], though there’s likely a large amount of feature overlap between them.

On the engineering side, we had some great learnings about using spark sql for feature generation. Not only did it massively improve performance, it also established a pattern that makes it easier to iterate on features. And because spark sql queries are language agnostic (assuming no UDFs), you can easily copy the feature generation code developed in your data science environment over to your production environment.

Finally, the importance of demographic features in the model reminded us of the utility of Amperity’s core identity resolution pipeline. In the future, we’re interested in carefully measuring the impact of unified data on customer-centric predictive output by comparing models trained on raw versus stitched data.

Thanks for reading and stay safe!

Footnotes

1: We used pyspark 2.4.5 and Databricks runtime 6.4. Amperity has an internal package that connects Databricks to S3/DL via wrappers around Databricks’ dbutils API. We also have custom code to ping our API from Databricks for read-only access to tenant data. The data is read in after running Amperity’s patented identity resolution model, which connects disparate customer data to a proprietary amperity id that represents the underlying customer entity.

2: Running cross validation on a grid search of maxDepth, minInstancesPerNode, and numTrees on the full feature set yielded winning values of 30, 2, and 300 respectively. That being said, in the interest of runtime, we decreased numTrees down to 150 after seeing a negligible impact on the evaluation dataset F1 score.

3: Quick math check: after downsampling from 900 MM to 300 MM examples, we’d expect roughly 0.33 * 1.74 MM ≈ 574k positive labels across the train and test sets combined. Since both sets are balanced, (810.6k + 346.7k) / 2 ≈ 578k positive cases, which checks out.

4: While we didn’t conduct a formal cross validation process to tune the word2vec parameters, from some testing and exploratory analysis we settled on output vector size (i.e. vectorSize) of 2 and minimum word count in corpus (minCount) of 1.

5: You’d also need to use random forest regression instead of classification if predicting event types that can happen more than once in the holdout period.