In the first blog of this series, we explained why FMCG retailers need to use Artificial Intelligence (AI) and Machine Learning (ML) to stay competitive. We listed a few paths to progress and explained why the creation of a Shopping List is a good use case of ML for FMCG retail.

In this blog, we will focus on the science of ML in the context of the Shopping List use case. We will cover four main topics:

- The model selection and the definition of the success criteria
- Data preparation including why this is important
- The selection of the right model features
- Initial testing of the results

**Selecting the right model and the success criteria**

The core of the Shopping List requirement is the need to help a shopper find and select the 20+ items she currently needs out of an assortment of over 10,000 items. A good shopping list algorithm aims to provide high accuracy. For example, we might say if the first 20 items on our list contain at least 10 of a shopper’s 20 required items we will consider our list a success.

As we showed in the previous blog this is a more complex problem than we might initially think. Here are a few examples:

If a typical basket consists of 20 items and there are 10,000 SKUs in the store catalog, the random probability that a product will be purchased in the current visit is only 0.2%. To create a list that includes 10 of the 20 currently required products, the list would have to be 5,000 items long. We can agree that a list of 5,000 items is anything but helpful!

A quick improvement would be to build the list only from products that were purchased by the shopper in the past. However, looking at the data from our previous blog, you will find that in order to list 10 relevant products we would still need a list that is 250 items long.
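The arithmetic behind both list lengths can be sketched in a few lines. This is a back-of-the-envelope check only; the history size of roughly 500 distinct products is an assumption drawn from the previous blog's data, not a universal figure.

```python
# Back-of-the-envelope check of the list-length arithmetic above.
# A randomly ordered list of k items drawn from a pool of n products is
# expected to contain k * (b / n) of a shopper's b required items.

def expected_list_length(pool_size, basket_size, required_hits):
    """Smallest k such that k * basket_size / pool_size == required_hits."""
    return required_hits * pool_size / basket_size

# Random ranking over the full 10,000-SKU catalog:
print(expected_list_length(10_000, 20, 10))  # -> 5000.0

# Ranking only the ~500 products the shopper bought in the past
# (an assumed history size, per the previous blog's data):
print(expected_list_length(500, 20, 10))     # -> 250.0
```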

A good ML algorithm should keep the list much shorter while staying relevant. This not only improves the shopping experience. It also frees up screen space and shopper time that can be used to promote other products.

The Shopping List problem is a flavor of the Recommendation System. In a typical application of recommendation systems, users are recommended movies they are likely to watch or songs they are likely to enjoy. Making good recommendations is critical in media services such as Netflix and Spotify. Recommendation systems rank items in a typically very large inventory in a manner that matches the user’s profile and past usage. Similarly, in a Shopping List, we aim to rank products in an order that matches the shopper’s shopping basket. While recommending movies, songs, or products all come down to ranking items, there are important differences between the item types: the cadence with which a product is purchased, price sensitivity, and product substitutes are considerations that don’t exist when ranking movies or songs.

For our Shopping List, we use a Supervised model, where the main idea is to use historical data and recent activity to train the algorithm and later test how well it predicts future behavior based on past data. There are a few families of recommendation system algorithms that can be used to build a shopping list; Logistic Regression, Factorization Machines (FM) and Deep Neural Networks (DNN), to name a few. We at ciValue use the XGBoost package (Gradient Boosting Trees) for the shopping list problem.

The success criteria should be defined by a mathematical formula that takes into account the probability of finding the desired products in the list (a high probability is good) and their position: a list is relevant if the desired products appear near its top, while a list that forces the shopper to scroll far down is less relevant.

For example, if a typical basket includes 20 items, the success score for a specific list can range from 0% to 100%. A score of 100% means we successfully listed all 20 needed items within the first 20 items of the list, 50% means we listed 10 of them in the first 20, and 0% means none of the desired products appear in the first 20 items.

__Note for data scientists:__

*The accuracy metric described above is r-precision@k, where k is the number of items in the list. In the example above the value of k is 20. Typically, we measure our accuracy over several k values, such as 5, 10 and 20. To obtain a single metric for all shoppers we compute the mean or median of the r-precision@k computed for all shoppers. A different accuracy metric we use to measure shopping list accuracy is average precision (AveP). As opposed to r-precision@k, which only tests whether a relevant item appears in the top of the list, average precision considers the order of items and is therefore not restricted to a list of limited size. The mean average precision (MAP) is used to obtain a single accuracy value for all shoppers.*
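The two metrics in the note can be sketched for a single shopper as follows; the function names and toy data are our own, not from any library.

```python
# Sketch of the two accuracy metrics from the note, for one shopper.

def r_precision_at_k(ranked_list, relevant, k):
    """Fraction of the shopper's relevant items found in the top k."""
    hits = sum(1 for item in ranked_list[:k] if item in relevant)
    return hits / min(k, len(relevant))

def average_precision(ranked_list, relevant):
    """Order-sensitive: precision at each relevant hit, averaged
    over all relevant items."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked_list, start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

ranked = ["milk", "eggs", "soap", "bread", "jam"]
needed = {"milk", "bread"}
print(r_precision_at_k(ranked, needed, 2))  # milk in top 2 -> 0.5
print(average_precision(ranked, needed))    # (1/1 + 2/4) / 2 = 0.75
```

MAP is then just the mean of `average_precision` across all shoppers.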

**Data Preparation**

A good ML algorithm requires good data, so the data preparation step is critical. While organizing the data, there are a few specific things that an FMCG retailer must keep in mind:

- **Product definition from a shopper’s point of view.** In most other retail businesses, the straightforward way is to define a product as an SKU in the catalog. However, in FMCG, in many cases a group of SKUs represents the same product from the shopper’s point of view. If the only difference between two SKUs is the package size or a small difference in flavor, shoppers might see them as one product. This relation between SKUs is known as substitutes.
- **Catalog cleansing.** In FMCG catalogs with tens of thousands of SKUs, many are not active or are only available in a limited number of stores. This must be taken into consideration when providing recommendations.
- **Measurable customers.** Many of the million shoppers in the loyalty program database are not regular shoppers but have only made occasional visits to the store. It is critical to identify the right active shoppers to train the model on.
- **Pre-calculation.** In order to keep the model updated and re-train it on an ongoing basis, you must pre-calculate some of the model features.

For the initial development, we recommend using 12 to 24 months of historical purchasing data. Once a model is live, it is important to refresh the training model with recent data at least on a weekly basis to make sure seasonality and changes to the catalog can be taken into account.

Here are the five main steps of data preparation:

**ETL** – Extract data from the operational system and load it into the data lake to be used by the ML workflow.

**Cardinalities** – Pre-aggregate the data along the dimensions of Customer, Date, Store, and SKU. For each shopper, pre-calculate the main KPIs at all relevant levels (items, brands, and categories).

**Classification and cleansing** – Identify measurable customers, relevant categories, and SKUs to be used for future calculations, for example:

- Measurable Customers – The most valuable customers in the category. These shoppers will serve as a panel for calculating a category’s characteristics.
- Measurable Categories – A measurable category is a category that has a sufficient number of measurable customer visits. The non-measurable categories should be ignored by the ML models.
- Category Replenishment Time Unit – For example, a weekly time unit for the Vegetable category in a grocery retailer.
- Category Nature – Measurable categories will be classified as either “Continuance” (customers who buy it, buy it consistently) or “Temporary” (customers normally forsake it after a while).
- Category’s Interesting Period – The duration required to build a shopper profile for shoppers that buy in the category.

**Profiling** – Classification of customers into groups and ranges, for example: Active, Occasional, Gone, etc.

**Refresh** – Use a workflow to refresh and re-train the ML models that are based on this data.
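The "Cardinalities" pre-aggregation step above can be sketched with pandas. The transaction schema and KPI names are illustrative assumptions, not ciValue's actual tables.

```python
# Sketch of the "Cardinalities" pre-aggregation step.
# The transaction schema below is illustrative.
import pandas as pd

tx = pd.DataFrame({
    "customer": ["c1", "c1", "c2", "c2", "c2"],
    "date": pd.to_datetime(["2023-01-02", "2023-01-09", "2023-01-02",
                            "2023-01-02", "2023-01-16"]),
    "store": ["s1", "s1", "s2", "s2", "s2"],
    "sku": ["milk", "milk", "milk", "soap", "milk"],
    "spend": [1.5, 1.5, 1.6, 3.0, 1.6],
})

# Per-shopper, per-SKU KPIs that downstream feature engineering reads;
# the same aggregation is repeated at brand and category level.
kpis = (tx.groupby(["customer", "sku"])
          .agg(purchases=("date", "count"),
               last_purchase=("date", "max"),
               total_spend=("spend", "sum"))
          .reset_index())
print(kpis)
```

Pre-computing these tables once keeps weekly re-training cheap, since the model only reads aggregates rather than raw transactions.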

**Selection of the right model’s features**

To run a supervised algorithm, you have to provide the model with a set of data that contains both the inputs and the desired outputs. The data is known as training data and consists of a set of training examples. Each training example is represented by an array of features, sometimes called a feature vector, and the full set of training data is represented by a matrix. Through iterative optimization, the algorithm builds a function that can later be used to predict the output associated with new inputs.

In the Shopping List problem, the list of input features might include, for example:

- Shopper ID
- Product (SKU/UPC) ID
- Feature #1: Number of times the shopper purchased the product in the last 3 months.
- Feature #2: Number of days since the last purchase.
- Feature #3: Total spending on the product in the last 6 months.

The desired output can be:

- Indication (True or False) if the shopper purchased the product in her next visit to the store.
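The training examples described above can be sketched as rows of a feature matrix with a binary label. The class and field names below are our own illustrative choices.

```python
# One training example per (shopper, product) pair; the label records
# whether the product appeared in the shopper's next basket.
# Schema and values are illustrative.
from dataclasses import dataclass

@dataclass
class TrainingExample:
    shopper_id: str
    sku: str
    purchases_3m: int        # Feature 1: purchases in the last 3 months
    days_since_last: int     # Feature 2: recency
    spend_6m: float          # Feature 3: spend in the last 6 months
    bought_next_visit: bool  # desired output (label)

examples = [
    TrainingExample("s1", "milk", 9, 4, 13.5, True),
    TrainingExample("s1", "soap", 1, 95, 3.0, False),
]

# Feature matrix X and label vector y, as the supervised model consumes them:
X = [[e.purchases_3m, e.days_since_last, e.spend_6m] for e in examples]
y = [e.bought_next_visit for e in examples]
```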

For a successful ML algorithm, the most critical step is finding the features that will be used to train the model. The model’s features should include the ones we assume will best predict the probability that the shopper will repurchase the product.

You will not be surprised to know that the top two features in predicting a product repurchasing pattern are:

- The number of past purchases – How many times the product was purchased.
- Recency – Number of days since the last purchase.

However, using just these two features will not be enough to give an optimal result.

Here are a few things that retailers must take into account:

- As discussed before, in many cases a single SKU does not represent the product from the shopper’s point of view. You must combine the data for a group of SKUs that shoppers consider one product.
- Recency might have a very different impact on different types of products. If you purchased a cucumber last week you will need one this week, but if you just purchased an expensive perfume it might take a couple of months before you buy a new one
- FMCG is a “Long Tail” business. So, in many cases, you will not have enough data at the SKU level to train the model and will need to use other features that are relevant. For example: Loyalty to the brand, access to the product’s shelf, alternative product preferences, or complementary products already purchased.

In the ciValue model, we use over 50 features to train the shopping list model. In addition to SKU-level features, these include features that describe the customer behavior at the department, category, section (shelf), and brand levels.

There are different approaches to measuring the importance and impact of each feature on the final score (e.g. the probability to purchase). It is highly recommended that you analyze the outcome and validate that it matches your basic instincts as a marketer. If you find that a feature you expect to be important has little impact, it might indicate a data preparation problem.

The table below shows the top model features contributing to the probability that a product will be purchased, computed for a certain family of products using the SHAP package (based on Shapley values from game theory). A feature’s contribution to a product score may differ between product families or shopper profiles.

Looking at the table you can see for example that:

- The most important feature is “Section Brand Recency” – the time since the last purchase of the brand on the product’s shelf. The probability of repurchasing the SKU is lower (X is negative) when a long time has passed since the previous purchase. In general, for most shoppers, the impact is negative (the RED color is on the negative side of the X-axis).
- The #3 feature is “Last Year Purchases”. As expected, the number of times the SKU was purchased during the last year is a clear indication of a high repurchasing probability: a higher value means a higher probability to repurchase.


**Initial testing of the model**

One of the challenges with ML is that the output of the algorithm comes in the form of a black box. In the Shopping List use case, the output is a function (a decision tree or regression formula) that gives the probability of a product being purchased by a specific shopper.

In order to validate the scoring, we recommend viewing the results in a simple Excel sheet that gives you a feel for whether the output matches your instincts as a professional and as a buyer.

Here is an example of a quick and easy sanity check that can be helpful:

Select a product or SKU that you well understand. In this example, I am using diapers for babies that are 3 to 6 months old. We know that the purchasing window for this product is 90 days, and we also know that shoppers are highly loyal to the diapers brand. So, in most cases, they will keep purchasing the same brand for 90 days.

So, you expect to find that shoppers who just purchased the product will repurchase it soon. However, as the time since the last purchase increases, the chance of repurchasing dramatically decreases.

If we extract all customers who purchased this product in the last 12 months and calculate the probability score for the product being repurchased on the next visit, we can use a simple Excel pivot to see the correlation between the number of days since the last purchase (0–365) and the score (0%–100%).

These are the results.

As you can see, out of the 24,575 shoppers who purchased these SKUs in the last 12 months, 938 get a score of 50% or higher, while 13,601, at the other end, get a score of 0%.

For those who purchased in the last 30 days, 54% (1,565/2,887) have a high probability (30%+) of repurchasing on their next visit.

At the other end, for the 75% of shoppers who have not purchased the product for 180 days or more, the algorithm predicts a 0% probability of repurchase. As the score decays over time, you should remove the SKU from their shopping list.
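The same pivot can be built outside Excel in a few lines. The DataFrame columns, bucket boundaries, and scores below are illustrative, chosen to mirror the diapers example.

```python
# Sketch of the sanity check: bucket shoppers by days since their last
# purchase and look at the average predicted repurchase score per bucket.
# Columns and values are illustrative.
import pandas as pd

scores = pd.DataFrame({
    "days_since_last_purchase": [5, 12, 40, 95, 200, 300],
    "score": [0.62, 0.55, 0.31, 0.08, 0.0, 0.0],
})

buckets = pd.cut(scores["days_since_last_purchase"],
                 bins=[0, 30, 90, 180, 365],
                 labels=["0-30", "31-90", "91-180", "181-365"])
pivot = scores.groupby(buckets, observed=False)["score"].mean()
print(pivot)
# A healthy diapers model should show scores decaying toward 0 once the
# 90-day purchasing window is exceeded.
```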

We highly recommend repeating this kind of test with a high-frequency product (such as milk) and a low-frequency product (such as shampoo) to see how well the algorithm aligns with your view of the purchasing pattern.

**Next blog**

While data science is very important, there is often a dissonance between what the science tells us (i.e. probability) and the ideal customer experience. In the third and last blog of this series, we will review some examples and will show how to combine data science with customer-centric experience design to create the best possible experience.