How to Use Spatial Data to Identify CPG Demand Hotspots

Hands-on Tutorials

Detecting Demand hotspots for Consumer Packaged Goods using Spatial Analytics.

In recent years, the consumption and promotion of products which fall under the category of Organic / Natural / Local has increased dramatically. These are specific types of products which have not undergone any industrial process or lack certain food additives and preservatives. In the US, the marketing of organic products has grown significantly, appealing to a new generation of consumers looking for healthy products and plant-based food. According to the Organic Trade Association[1], organic food is the fastest growing sector in the US food industry:

Organic is the fastest growing sector of the U.S. food industry. Organic food sales increase by double digits annually, far outstripping the growth rate for the overall food market. Now, an unprecedented and conclusive study links economic health to organic agriculture.

This growth in demand for organic products is closely linked to cultural, socio-economic and health factors. In this case study, we analyze these factors spatially to understand which features and city areas can help a CPG data and marketing professional decide where to prioritize distribution rollout and which POS (points of sale) to target for certain organic food products in two major US cities, New York and Philadelphia.

Data

In order to do this, we selected different data sources available from Carto’s Data Observatory [2] that could help us identify which areas in a city are better suited for the distribution of organic products. The datasets that we have used for this analysis are the following:

  • Mastercard Geographic Insights: sales-based dynamics of a location, with indices measuring the evolution of credit card spend, number of transactions, average ticket, etc. in a retail area over time;
  • Spatial.ai Geosocial Segments: behavioral segments based on the analysis of social media feeds with location information;
  • Dstillery Behavioral Audiences: audiences derived from online behaviors;
  • Pitney Bowes Points of Interest: database with the location of businesses and other points of interest categorized by classes and industry groups;
  • AGS Sociodemographics: basic socio-demographic and socio-economic attributes estimated for the current year and projected 5 years into the future.

Methodology

Our analysis follows three main steps:

  1. Identification of target areas with high potential for a successful rollout of organic products.
  2. Analysis of the different factors that characterize and have driven the selection of the target areas.
  3. Identification of twin areas in San Francisco based on those selected in New York and Philadelphia.

Identifying Target Areas for Distributing Organic Products

In general, organic or "bio" products are considered premium and typically carry higher prices. For example, in Detroit, organic milk is 88% more expensive than regular milk [3]. This means that it is preferable to place such products in stores located where consumers are willing to pay that premium, whether it’s where they live, work, or spend their leisure time.

Therefore, to identify the potential areas in which to roll out the distribution of such products, we followed these 3 steps:

  1. Identification of areas with higher average ticket size in Grocery Stores based on Mastercard data;
  2. Identification of areas where organic food has potentially a higher demand via the exploration of social media posts (using Spatial.ai geosocial segmentation) and internet search behaviours (with Dstillery’s audience data);
  3. Intersection of the areas identified in the above two steps; these will be the resulting selected target areas for the remainder of the case study.

Note that all three sources used in this phase provide features aggregated at census block group level.

Step 1 uses the index based on the monthly average ticket from credit card transactions in grocery stores per census block group, and looks for the areas with higher ticket sizes. The rationale is that organic products are considered "premium", as mentioned at the beginning of this section, so shoppers must have both the capability and the will to spend the extra bucks to purchase them.

First, we check whether this variable is spatially autocorrelated by calculating Moran’s I in the two cities, which tells us whether the spatial pattern is clustered, dispersed, or random. The result shows a spatial pattern in which the values at a location are related to the values at nearby locations. As a next step, we therefore compute the spatial lag as the average value of the average ticket index in a location and its neighboring areas, which essentially smooths the index over space.
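To make the two quantities concrete, here is a minimal, dependency-free sketch of global Moran's I and of the spatial lag described above. The toy adjacency structure, the area names and the index values are hypothetical, and the row-standardized weights are an assumption, since the weighting scheme used in the analysis is not stated.

```python
def row_standardize(neighbors):
    """Turn {area: [neighbor, ...]} into weights where each area's row sums to 1."""
    return {i: {j: 1.0 / len(js) for j in js} for i, js in neighbors.items() if js}


def morans_i(values, weights):
    """Global Moran's I for values {area: x} and weights {area: {neighbor: w}}."""
    n = len(values)
    mean = sum(values.values()) / n
    dev = {i: x - mean for i, x in values.items()}
    s0 = sum(w for row in weights.values() for w in row.values())
    cross = sum(w * dev[i] * dev[j] for i, row in weights.items() for j, w in row.items())
    total = sum(d * d for d in dev.values())
    return (n / s0) * (cross / total)


def spatial_lag(values, neighbors):
    """Average of each area's own value and its neighbors' values (a spatial smoother)."""
    lag = {}
    for i, js in neighbors.items():
        group = [i] + list(js)
        lag[i] = sum(values[j] for j in group) / len(group)
    return lag


# Hypothetical toy data: five areas along a line with a low-to-high gradient.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
ticket_index = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0}

moran = morans_i(ticket_index, row_standardize(neighbors))
lagged = spatial_lag(ticket_index, neighbors)
print(moran)   # positive: similar values sit next to each other
print(lagged)
```

A value of Moran's I well above 0 indicates clustering, a value near 0 a random pattern, and a negative value a dispersed one. Assessing statistical significance (for example through permutations) is left out of this sketch.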

Then, the spatial lag of the average ticket is classified into 5 classes using the Fisher-Jenks algorithm [4], which minimizes each class’s average deviation from the class mean while maximizing each class’s deviation from the means of the other classes. From the result, we select the top 2 classes as the census block groups with more potential to be "profitable" for our use case from the credit card spend point of view.
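As an illustration of this classification step, the sketch below finds natural breaks with a small dynamic program that minimizes the within-class sum of squared deviations, then keeps the top 2 of 5 classes. It is a simplified stand-in written for this post, not the implementation used in the analysis, and the input values are hypothetical.

```python
from bisect import bisect_left


def fisher_jenks_breaks(values, k):
    """Upper bound of each of k classes that minimize within-class squared deviation."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")
    # Prefix sums give the squared deviation of any slice in constant time.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def ssd(i, j):
        """Sum of squared deviations from the mean for xs[i:j]."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    split = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = cost[c - 1][i] + ssd(i, j)
                if candidate < cost[c][j]:
                    cost[c][j], split[c][j] = candidate, i
    # Walk back through the stored split points to recover the class bounds.
    bounds, j = [], n
    for c in range(k, 0, -1):
        bounds.append(xs[j - 1])
        j = split[c][j]
    return bounds[::-1]


# Hypothetical smoothed ticket indices for 13 census block groups.
lagged = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0, 40.0, 41.0, 80.0, 82.0]
breaks = fisher_jenks_breaks(lagged, k=5)
classes = [bisect_left(breaks, x) for x in lagged]  # class 0 is lowest, 4 is highest
top_two = [x for x, c in zip(lagged, classes) if c >= 3]
print(breaks)
print(top_two)
```

The dynamic program is cubic in the worst case, which is fine for a sketch. For a full city, an optimized library implementation would be the more practical choice.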

The next step is to identify areas where there is more interest in organic products based on behavioral data. For that purpose, we use the datasets from Spatial.ai and Dstillery. The former has an affinity index for organic food, which we analyze following the same procedure used with the Mastercard index. Because the Spatial.ai index ranges from 0 to 100, we classify it using quantiles.

From the Dstillery dataset, we select the interest type named "Organic and local food", and we follow an analysis similar to the one used for the Mastercard and Spatial.ai indices. Again, spatial correlation was observed and found to be statistically significant, although in this case the correlation was much lower than in the Spatial.ai dataset.

Having identified the most relevant areas for selling organic products with each data source individually, we select the final set of target areas by taking the union of the two behavioral selections (Spatial.ai and Dstillery) and intersecting it with the selection from credit card transaction data (Mastercard):

{Selected areas} = {Mastercard ∩ {Spatial.ai ∪ Dstillery}}
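With each per-source selection stored as a set of census block group identifiers, the expression above maps directly onto set operations. The identifiers below are hypothetical placeholders.

```python
# Hypothetical block group identifiers selected from each source.
mastercard = {"bg01", "bg02", "bg03", "bg05"}
spatial_ai = {"bg02", "bg04"}
dstillery = {"bg03", "bg04", "bg06"}

# Mastercard ∩ (Spatial.ai ∪ Dstillery)
selected = mastercard & (spatial_ai | dstillery)
print(sorted(selected))  # ['bg02', 'bg03']
```

Note that "bg04" is dropped even though both behavioral sources flag it, because it does not pass the ticket size filter.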

In the figures below, the selection process for each city is illustrated. In each figure, 4 subfigures are shown, each one showing the areas selected using the Mastercard, Spatial.ai, Dstillery and the final selection respectively.

Characterizing the Selected Areas

Having already identified the areas of interest, we now want to further understand which are the factors that characterize them and examine the driving attributes that make an area attractive for placing organic products. For that purpose, we analyze the sociodemographic and socioeconomic factors in the selected areas provided by AGS, the number of Points of Interest (POI) from the Pitney Bowes (now Precisely) database aggregated by business group, as well as the geosocial segments from Spatial.ai. We will compare the behaviors of these factors between the selected and non-selected areas. Note that in terms of sociodemographic and socioeconomic attributes we selected those available in the dataset as a 5-year projection.

To identify the driving factors, we first compute and compare the distribution of each feature in the selected and non-selected areas. We perform a t-test to evaluate whether the means of the two sets of areas are significantly different from each other, and we drop the features for which the test finds no significant difference.
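The filtering step can be sketched as follows. This version computes Welch's t statistic and, to stay within the standard library, approximates the p-value with a normal distribution, which is only reasonable for large samples. The 0.05 significance level, the feature names and the values are assumptions made for illustration. In practice a function such as `scipy.stats.ttest_ind` provides the proper t-distribution p-value.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a = variance(sample_a) / len(sample_a)
    var_b = variance(sample_b) / len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / sqrt(var_a + var_b)


def means_differ(sample_a, sample_b, alpha=0.05):
    """Two-sided test using a normal approximation to the t distribution."""
    t = welch_t(sample_a, sample_b)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return p_value < alpha


# Hypothetical feature values: (selected areas, non-selected areas).
features = {
    "per_capita_income": ([90, 95, 100, 105, 110], [50, 55, 60, 65, 70]),
    "uninformative_feature": ([10, 12, 11, 13, 9], [11, 10, 12, 9, 13]),
}

kept = [name for name, (sel, rest) in features.items() if means_differ(sel, rest)]
print(kept)
```

Only the features whose means differ significantly between the two groups survive this filter.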

Additionally, to reduce the dimensionality of the Spatial.ai Geosocial Segments, we follow an extra procedure to identify the segments that differ most between the selected and non-selected areas. For each segment we use as features the average value within the selected areas, the average value within the non-selected areas, and the ratio of those averages. The tables below show the average index in the entire city and in the selected areas, as well as the ratio between the two values. We then cluster these calculated features and choose the geosocial segments at the two opposite extremes. The rationale is that segments in the center behave similarly in the selected and non-selected areas, while segments at the extremes show either a clearly high or a clearly low value, which makes the distinction obvious.

After this procedure, we discarded the constant features and the highly correlated features, using a correlation threshold of 80%.
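One simple way to carry out this clean-up is sketched below. It assumes that the threshold applies to the absolute Pearson correlation and that, within a correlated pair, the feature encountered first is the one kept. Neither detail is specified in the analysis, and the column values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def prune_features(columns, threshold=0.8):
    """Drop constant columns, then drop columns too correlated with an already kept one."""
    kept = []
    for name, col in columns.items():
        if len(set(col)) <= 1:
            continue  # constant feature carries no information
        if any(abs(pearson(col, columns[k])) > threshold for k in kept):
            continue  # redundant with a feature we already keep
        kept.append(name)
    return kept


# Hypothetical feature columns, one value per census block group.
columns = {
    "per_capita_income": [30, 45, 60, 80, 120],
    "avg_household_income": [70, 100, 140, 180, 260],
    "legal_services_pois": [5, 1, 4, 0, 3],
    "constant_flag": [1, 1, 1, 1, 1],
}

kept = prune_features(columns)
print(kept)
```

Here the household income column is dropped because it is almost perfectly correlated with per capita income, and the constant column is dropped outright.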

The tables show the values for the top 5 and bottom 5 features based on the ratio between the city average and the average in the selected areas in the respective cities. It’s interesting to note that the geosocial segment "ED06_ingredient_attentive" popped up in Philadelphia and "ED03_trendy_eats" in New York.
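The ranking behind such tables can be reproduced with a few lines. In this sketch the ratio is taken as the selected-area average divided by the city average, which is an assumption about the direction of the ratio. The index values and the third segment name are hypothetical.

```python
def segment_ratios(index_by_area, selected):
    """Per segment: (city average, selected-area average, selected / city ratio)."""
    result = {}
    for segment, by_area in index_by_area.items():
        city_avg = sum(by_area.values()) / len(by_area)
        chosen = [v for area, v in by_area.items() if area in selected]
        selected_avg = sum(chosen) / len(chosen)
        result[segment] = (city_avg, selected_avg, selected_avg / city_avg)
    return result


# Hypothetical segment indices for four areas, two of which are selected.
index_by_area = {
    "ED08_wine_lovers": {"a": 80, "b": 60, "c": 20, "d": 40},
    "ED01_sweet_treats": {"a": 50, "b": 50, "c": 50, "d": 50},
    "hypothetical_segment": {"a": 10, "b": 20, "c": 50, "d": 40},
}
selected = {"a", "b"}

ratios = segment_ratios(index_by_area, selected)
ranked = sorted(ratios, key=lambda s: ratios[s][2], reverse=True)
top, bottom = ranked[:1], ranked[-1:]  # the tables use the top 5 and bottom 5
print(top, bottom)
```

Segments with a ratio close to 1 behave the same inside and outside the selected areas, which is why only the extremes are informative.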

Selecting the driving factors

Having formed and cleaned the features, we then build a classifier to derive the final impact of each selected feature. But first, we need to address the class imbalance, as the selected areas are far fewer than the rest: 623 selected areas compared to 5,389 in New York, and 35 selected areas compared to 1,043 in Philadelphia.

To achieve this, we use SMOTE [5], an oversampling technique that generates artificial data. The training set is oversampled, generating "new" data. The goal is to create enough samples of each class for the classifier to learn the different driving factors and their impact without being overwhelmed by the majority class.
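The core idea of SMOTE is to create each synthetic point on the line segment between a minority sample and one of its nearest minority neighbors. The sketch below shows only that interpolation step, with hypothetical two-feature points and class counts. A library such as imbalanced-learn offers a complete implementation.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating toward nearby minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        nearest = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbor = rng.choice(nearest)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbor)))
    return synthetic


# Hypothetical training data: 4 selected areas against 12 non-selected areas.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
n_majority = 12

new_points = smote(minority, n_new=n_majority - len(minority), k=3)
balanced_minority = minority + new_points
print(len(balanced_minority))
```

Because the new points are interpolations, they always fall inside the region already covered by the minority class, so no values outside the observed range are invented.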

The oversampling step is followed by classification with a random forest classifier. Hyperparameter tuning is performed to minimize the number of incorrectly identified "selected" areas while maintaining good accuracy. As performance metrics for the classification, the confusion matrix and the receiver operating characteristic (ROC) curve are reported in the figures below.
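For readers less familiar with these two metrics, the sketch below computes a binary confusion matrix and the area under the ROC curve from scratch. The labels and classifier scores are hypothetical, and the 0.5 decision threshold is an assumption.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, where 1 marks a selected area."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test labels and predicted probabilities of being a selected area.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

matrix = confusion_matrix(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(matrix, auc)
```

With imbalanced classes, accuracy alone can look good even when few selected areas are found, which is why the confusion matrix and the ROC curve are the more informative views.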

For both cities, the classifier performs well. Some might argue that overfitting has occurred, but because the selected areas were successfully identified, we have greater confidence that the classifier is appropriate for this imbalanced dataset.

Looking at the importance of the top 20 features for each city, computed using Shapley values [6], we can extract the main driving factors. The driving factors are also similar across the two cities: of the top 20 features, 10 are common to both.
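To show what a Shapley value measures, the sketch below computes exact values for a toy "model" defined over three features. The value function and its numbers are hypothetical. Exact enumeration grows exponentially with the number of features, so analyses on real classifiers rely on approximations, and the tooling used for this case study is not stated here.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values for a set function value(frozenset) -> float."""
    n = len(features)
    phi = {}
    for f in features:
        rest = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(rest, size):
                s = frozenset(subset)
                # Marginal contribution of f when added to the coalition s.
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_score(present):
    """Hypothetical model output given the set of features it is allowed to use."""
    score = 0.0
    if "income" in present:
        score += 0.3
    if "wine_lovers" in present:
        score += 0.1
    if "income" in present and "wine_lovers" in present:
        score += 0.2  # interaction shared between the two features
    return score


features = ["income", "wine_lovers", "legal_services"]
phi = shapley_values(features, toy_score)
print(phi)
```

The interaction term is split evenly between the two features involved, the unused feature gets zero, and the values add up to the score of the full feature set.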

Top 20 features for New York City

Top 20 features for Philadelphia

Looking at the most important features in New York, we can see that areas with higher income, with the presence of the "LGBTQ culture" and "artistic appreciation" geosocial segments, as well as the segments related to premium foods and drinks, are the best suited for the distribution of organic products. We see a similar trend in Philadelphia, with most driving features being shared by both cities.

Identifying Twin Areas in Different Cities

Having analyzed the driving factors behind area selection in New York and Philadelphia, the twin areas method described in one of our previous posts [7] can be applied in order to identify target areas for the distribution of organic products in other parts of the US. As an example for this exercise, we picked San Francisco.

Now we imagine that we have already established a distribution strategy for organic products in New York City and detected the census block group that is yielding optimal results in terms of sales of our products. The purpose of the twin areas method is to help us identify similar areas in a different city leveraging the driving factors (i.e., features) identified in the previous section.

As the top performing area we have selected this census block group in Manhattan:

Census block group in New York City for which we are going to look for twin areas in San Francisco

The features used as criteria for the twin areas method are the common features in New York and Philadelphia that were identified in the previous section:

  • ‘Per capita income (projected, five years)’,
  • ‘Average household Income (projected, five years)’,
  • ‘EB03_lgbtq_culture’,
  • ‘ED09_hops_and_brews’,
  • ‘ED08_wine_lovers’,
  • ‘ED04_whiskey_business’,
  • ‘Median household income (projected, five years)’,
  • ‘ED02_coffee_connoisseur’,
  • ‘LEGAL SERVICES’,
  • ‘ED01_sweet_treats’

The map below illustrates the results from the twin areas method, showcasing the census block groups in San Francisco with similarity scores greater than 0. We can filter the areas using the histogram widget to identify the most similar twins based on the features listed above.
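The details of the twin areas method are in the post referenced above [7]. As a rough illustration only, the sketch below shows one plausible way to obtain a score for which 0 is a meaningful cutoff: features are standardized, and each candidate's distance to the target area is compared with the distance between the target and the average candidate. This scoring formula, the two features and all values are assumptions, not the method's actual definition.

```python
from math import dist
from statistics import mean, pstdev


def similarity_scores(target, candidates):
    """Score candidates against the target; above 0 means closer than the average area."""
    names = list(candidates)
    columns = list(zip(*(candidates[n] for n in names)))
    mu = [mean(col) for col in columns]
    sigma = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero

    def standardize(row):
        return tuple((x - m) / s for x, m, s in zip(row, mu, sigma))

    z_target = standardize(target)
    # Distance from the target to the average candidate (undefined if they coincide).
    baseline = dist(z_target, standardize(mu))
    return {n: 1.0 - dist(z_target, standardize(candidates[n])) / baseline for n in names}


# Hypothetical (per capita income in thousands, wine lovers index) profiles.
target_nyc = (100.0, 80.0)
san_francisco = {
    "sf1": (95.0, 75.0),
    "sf2": (60.0, 40.0),
    "sf3": (30.0, 20.0),
    "sf4": (55.0, 45.0),
}

scores = similarity_scores(target_nyc, san_francisco)
twins = sorted(name for name, score in scores.items() if score > 0)
print(twins)
```

Under this definition a score of 1 is a perfect match, and anything at or below 0 is no more similar to the target than a typical area in the candidate city.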

Conclusions

In this case study we illustrated how to leverage new types of spatial data, such as aggregated credit card transactions and social media behavior, to define a methodology for selecting and characterizing the optimal areas for the rollout of organic products in New York City and Philadelphia, enabling CPG firms to see where they may have gaps in their POS networks. Finally, we applied the Twin Areas method to identify the best areas in San Francisco based on the driving factors identified in the other two cities. These factors are closely tied to high-income areas, a greater concentration of different types of retail stores, and geosocial segments related to premium food products.

The combination of new location data streams and spatial data science techniques opens up a new array of opportunities to define better distribution strategies for consumer goods. A greater understanding of the different retail areas based on their consumer segments lets CPG companies place their products as close as possible to the areas with the greatest potential demand, increasing sales and ROI for their brands.

References

[1] https://ota.com/hotspots

[2] http://www.carto.com/data

[3] https://www.marketwatch.com/story/heres-why-prices-of-organic-food-are-dropping-2019-01-24#:~:text=Last%20year%2C%20organic%20food%20and,percent%20more%2C%20according%20to%20Nielsen.&text=The%20average%20price%20for%20a,a%20gallon%20of%20regular%20milk

[4] https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization

[5] https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis#SMOTE

[6] https://en.wikipedia.org/wiki/Shapley_value

[7] https://carto.com/blog/spatial-data-science-site-planning/


Originally published at https://carto.com on January 12, 2021.

