PROJECT DESIGN

In this project we will work as consultants for an ecommerce company in the cosmetics sector.

The company's results have been flat over the last few months, and it has hired us to analyze its transactional data and, based on that analysis, implement CRO actions tailored to its situation.

In this project, among other things, we will learn:

  • What ecommerce data looks like
  • Analysis techniques aimed at increasing revenue and margin in an ecommerce, from basic, generic ones to advanced techniques specific to this sector
  • The main metrics we have to work on, and some CRO actions we can put into practice to improve them
  • How to build two very powerful analytical resources for this sector: an RFM segmentation and a recommendation system

Much of what we learn here therefore applies to practically any ecommerce.

OBJECTIVE

Analyze the transactional data to identify potential CRO actions that increase visits, conversion and average ticket, and therefore overall ecommerce revenue.

Create advanced analytical assets, such as an RFM segmentation and a recommendation system, that support this objective.

LEVERS

As always, we will first seek to understand the business: its main processes, metrics and concepts.

[Figure: diagram of the ecommerce customer journey]

The process starts when a user arrives at the ecommerce website, normally from one of these sources:

  • Paid campaigns: ads such as Facebook Ads or Google Ads
  • Organic content: blog, RSS, etc.
  • Direct traffic: the user knows the URL and types it into the browser

That traffic is measured as visits, and the pages users see are page views, although in our case we will simply call them views.

The user browses the site and, when a product appeals to them, adds it to the cart.

From there, the user can remove products from the cart, leave without buying anything, or place the order.

A common practice is cross-selling, in which we recommend other products that might also interest the user.

Even after the user has left, we can reach them again through retargeting or email marketing.

This entire process is called the funnel, or the customer journey.

In the online environment practically everything can be recorded.

The user may be logged in or anonymous.

The sequence of actions a user performs in a single visit to the site is called a session.

The ratio of purchases to visits is called the conversion ratio.

In addition, there are other key metrics we have to master to manage an ecommerce properly (illustrated in the sketch below):

  • CPA (cost per acquisition)
  • AOV (average order value, or average ticket)
  • Purchase frequency
  • LTV (lifetime value)
  • Churn
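
As a quick illustration of how these metrics relate to each other, a minimal sketch with invented figures (none of these numbers come from our data):

#Invented example figures, only to illustrate how the metrics relate
inversion_ads = 1000      #spend on paid campaigns in the period
nuevos_clientes = 50
visitas = 4000
pedidos = 120
ingresos = 3600
clientes_activos = 80

cpa = inversion_ads / nuevos_clientes    #20.0: cost per acquisition
conversion = pedidos / visitas           #0.03: purchases over visits
aov = ingresos / pedidos                 #30.0: average order value (average ticket)
frecuencia = pedidos / clientes_activos  #1.5: purchases per customer in the period
ltv = aov * frecuencia * 4               #180.0: assuming a 4-period customer lifetime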

KEY CONCEPT: There are only 3 ways to grow a business:

  1. More customers: this means getting more visits and higher conversion
  2. More frequency: this means getting the same customers to buy more times
  3. Higher average ticket: this means getting customers to buy more products, or more expensive ones, in the same shopping session

To achieve these 3 effects we work on the following operational levers:

  • Customer journey: how we can optimize each step of the process
  • Customers: how we can use the available customer information to optimize the campaigns we run
  • Products: how we can optimize the product catalog and identify, in a personalized way, which products to put in front of each customer

In our case we will understand CRO broadly, that is, as the discipline that puts actions into practice to work on the levers and concepts above.

KPIs

  • Visits
  • Conversion
  • Purchase frequency
  • Average ticket
  • Cart abandonment rate
  • LTV

ENTITIES AND DATA

In our case, the entities present in the granularity of the data are:

  • Users
  • Customers
  • Sessions
  • Events
  • Products

SEED QUESTIONS

Having understood the levers, KPIs and entities, we can now pose the seed questions:

About the customer journey:

  • What is a typical purchasing process like?
  • How many products are viewed, added to cart, abandoned and purchased on average in each session?
  • How has the trend of these indicators been in recent months?

About customers:

  • How many products does each customer buy?
  • How much does each customer spend?
  • Are there "better customers" that need to be identified and treated differently?
  • Do customers repeat purchases in the following months?
  • What is the average LTV of a client?
  • Can we design campaigns personalized to customer value?

About the products:

  • What are the best-selling products?
  • Are there products that are not sold?
  • Is there a relationship between the price of the product and its sales volume?
  • Are there products that are visited but not purchased?
  • Are there products that are repeatedly removed from the cart?
  • Could personalized product recommendations be made for each client?

CREATION OF THE ANALYTICAL DATAMART

SET UP

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns


#Quick autocomplete
%config IPCompleter.greedy=True

#Display format
pd.options.display.float_format = '{:15.2f}'.format
pd.options.display.max_columns = 8

DATA LOADING

import sqlalchemy as sa
con = sa.create_engine('sqlite:///ecommerce.db')
from sqlalchemy import inspect
insp = inspect(con)
tablas = insp.get_table_names()
tablas
['2019-Dec', '2019-Nov', '2019-Oct', '2020-Feb', '2020-Jan']
oct = pd.read_sql('2019-Oct', con)
nov = pd.read_sql('2019-Nov', con)
dic = pd.read_sql('2019-Dec', con)
ene = pd.read_sql('2020-Jan', con)
feb = pd.read_sql('2020-Feb', con)

DATA INTEGRATION

df = pd.concat([oct,nov,dic,ene,feb], axis = 0)
df
index event_time event_type product_id ... brand price user_id user_session
0 68 2019-10-01 00:01:46 UTC view 5843665 ... f.o.x 9.44 462033176 a18e0999-61a1-4218-8f8f-61ec1d375361
1 72 2019-10-01 00:01:55 UTC cart 5868461 ... italwax 3.57 514753614 e2fecb2d-22d0-df2c-c661-15da44b3ccf1
2 95 2019-10-01 00:02:50 UTC view 5877456 ... jessnail 122.22 527418424 86e77869-afbc-4dff-9aa2-6b7dd8c90770
3 122 2019-10-01 00:03:41 UTC view 5649270 ... concept 6.19 555448072 b5f72ceb-0730-44de-a932-d16db62390df
4 124 2019-10-01 00:03:44 UTC view 18082 ... cnd 16.03 552006247 2d8f304b-de45-4e59-8f40-50c603843fe5
... ... ... ... ... ... ... ... ... ...
429785 4156660 2020-02-29 23:58:49 UTC cart 5815662 ... None 0.92 147995998 5ff96629-3627-493e-a25b-5a871ec78c90
429786 4156663 2020-02-29 23:58:57 UTC view 5815665 ... None 0.59 147995998 5ff96629-3627-493e-a25b-5a871ec78c90
429787 4156668 2020-02-29 23:59:05 UTC cart 5815665 ... None 0.59 147995998 5ff96629-3627-493e-a25b-5a871ec78c90
429788 4156675 2020-02-29 23:59:28 UTC view 5817692 ... None 0.79 619841242 18af673b-7fb9-4202-a66d-5c855bc0fd2d
429789 4156680 2020-02-29 23:59:54 UTC view 5716351 ... irisk 0.79 619841242 18af673b-7fb9-4202-a66d-5c855bc0fd2d

2095076 rows × 10 columns

DATA QUALITY

Types of variables

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2095076 entries, 0 to 429789
Data columns (total 10 columns):
 #   Column         Dtype  
---  ------         -----  
 0   index          int64  
 1   event_time     object 
 2   event_type     object 
 3   product_id     int64  
 4   category_id    int64  
 5   category_code  object 
 6   brand          object 
 7   price          float64
 8   user_id        int64  
 9   user_session   object 
dtypes: float64(1), int64(4), object(5)
memory usage: 175.8+ MB

We drop the index column.

df.drop(columns = 'index', inplace = True)

Analysis and correction of types:

  • convert event_time to datetime

We convert event_time to datetime. The raw strings end in a 'UTC' token, which the helper below strips before parsing with an explicit format (much faster than letting pandas infer it).

def datetime_rapido(dt,formato):

    def divide_fecha(fecha):
        #Keep only the date and time tokens, dropping the trailing 'UTC'
        division = fecha.split()
        date = division[0]
        time = division[1]
        cadena = date + ' ' + time
        return cadena

    #Parse with an explicit format for speed
    resultado = pd.to_datetime(dt.apply(lambda x: divide_fecha(x)), format = formato)

    return resultado

We run the function.

formato = '%Y-%m-%d %H:%M:%S'

df.event_time = datetime_rapido(df.event_time,formato)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2095076 entries, 0 to 429789
Data columns (total 9 columns):
 #   Column         Dtype         
---  ------         -----         
 0   event_time     datetime64[ns]
 1   event_type     object        
 2   product_id     int64         
 3   category_id    int64         
 4   category_code  object        
 5   brand          object        
 6   price          float64       
 7   user_id        int64         
 8   user_session   object        
dtypes: datetime64[ns](1), float64(1), int64(3), object(4)
memory usage: 159.8+ MB
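
As a side note, the helper above essentially just strips the trailing 'UTC' token before parsing. A simpler vectorized alternative (a sketch, assuming every timestamp ends in ' UTC') would have been:

df.event_time = pd.to_datetime(df.event_time.str.replace(' UTC', '', regex = False), format = formato)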

Variable names

df.columns = ['fecha',
              'evento',
              'producto',
              'categoria',
              'categoria_cod',
              'marca',
              'precio',
              'usuario',
              'sesion']
df
fecha evento producto categoria ... marca precio usuario sesion
0 2019-10-01 00:01:46 view 5843665 1487580005092295511 ... f.o.x 9.44 462033176 a18e0999-61a1-4218-8f8f-61ec1d375361
1 2019-10-01 00:01:55 cart 5868461 1487580013069861041 ... italwax 3.57 514753614 e2fecb2d-22d0-df2c-c661-15da44b3ccf1
2 2019-10-01 00:02:50 view 5877456 1487580006300255120 ... jessnail 122.22 527418424 86e77869-afbc-4dff-9aa2-6b7dd8c90770
3 2019-10-01 00:03:41 view 5649270 1487580013749338323 ... concept 6.19 555448072 b5f72ceb-0730-44de-a932-d16db62390df
4 2019-10-01 00:03:44 view 18082 1487580005411062629 ... cnd 16.03 552006247 2d8f304b-de45-4e59-8f40-50c603843fe5
... ... ... ... ... ... ... ... ... ...
429785 2020-02-29 23:58:49 cart 5815662 1487580006317032337 ... None 0.92 147995998 5ff96629-3627-493e-a25b-5a871ec78c90
429786 2020-02-29 23:58:57 view 5815665 1487580006317032337 ... None 0.59 147995998 5ff96629-3627-493e-a25b-5a871ec78c90
429787 2020-02-29 23:59:05 cart 5815665 1487580006317032337 ... None 0.59 147995998 5ff96629-3627-493e-a25b-5a871ec78c90
429788 2020-02-29 23:59:28 view 5817692 1487580010872045658 ... None 0.79 619841242 18af673b-7fb9-4202-a66d-5c855bc0fd2d
429789 2020-02-29 23:59:54 view 5716351 1487580010872045658 ... irisk 0.79 619841242 18af673b-7fb9-4202-a66d-5c855bc0fd2d

2095076 rows × 9 columns

Analysis of nulls

df.isna().sum().sort_values(ascending = False)
categoria_cod    2060411
marca             891646
sesion               506
fecha                  0
evento                 0
producto               0
categoria              0
precio                 0
usuario                0
dtype: int64

Conclusions:

  • categoria_cod is null in almost all records
  • marca is null in almost half of the records
  • there are about 500 nulls in sesion

Actions:

  • drop the categoria_cod and marca variables
  • drop the records with null sesion, since it is a relevant variable

df = df.drop(columns = ['categoria_cod','marca']).dropna()
df
fecha evento producto categoria precio usuario sesion
0 2019-10-01 00:01:46 view 5843665 1487580005092295511 9.44 462033176 a18e0999-61a1-4218-8f8f-61ec1d375361
1 2019-10-01 00:01:55 cart 5868461 1487580013069861041 3.57 514753614 e2fecb2d-22d0-df2c-c661-15da44b3ccf1
2 2019-10-01 00:02:50 view 5877456 1487580006300255120 122.22 527418424 86e77869-afbc-4dff-9aa2-6b7dd8c90770
3 2019-10-01 00:03:41 view 5649270 1487580013749338323 6.19 555448072 b5f72ceb-0730-44de-a932-d16db62390df
4 2019-10-01 00:03:44 view 18082 1487580005411062629 16.03 552006247 2d8f304b-de45-4e59-8f40-50c603843fe5
... ... ... ... ... ... ... ...
429785 2020-02-29 23:58:49 cart 5815662 1487580006317032337 0.92 147995998 5ff96629-3627-493e-a25b-5a871ec78c90
429786 2020-02-29 23:58:57 view 5815665 1487580006317032337 0.59 147995998 5ff96629-3627-493e-a25b-5a871ec78c90
429787 2020-02-29 23:59:05 cart 5815665 1487580006317032337 0.59 147995998 5ff96629-3627-493e-a25b-5a871ec78c90
429788 2020-02-29 23:59:28 view 5817692 1487580010872045658 0.79 619841242 18af673b-7fb9-4202-a66d-5c855bc0fd2d
429789 2020-02-29 23:59:54 view 5716351 1487580010872045658 0.79 619841242 18af673b-7fb9-4202-a66d-5c855bc0fd2d

2094570 rows × 7 columns

Analysis of numerical variables

df.describe()
usuario categoria producto precio ... segundo festivo black_friday san_valentin
count 2074026.00 2074026.00 2074026.00 2074026.00 ... 2074026.00 2074026.00 2074026.00 2074026.00
mean 521758909.23 1553192964541272064.00 5485203.09 8.50 ... 29.50 0.06 0.01 0.01
std 87354480.68 168128659843838048.00 1304219.20 19.21 ... 17.29 0.24 0.10 0.08
min 4661182.00 1487580004807082752.00 3752.00 0.05 ... 0.00 0.00 0.00 0.00
25% 482863572.00 1487580005754995456.00 5724633.00 2.06 ... 15.00 0.00 0.00 0.00
50% 553690395.00 1487580008246412288.00 5811652.00 4.06 ... 29.00 0.00 0.00 0.00
75% 578763499.00 1487580013489291520.00 5858221.00 6.98 ... 44.00 0.00 0.00 0.00
max 622087993.00 2242903426784559104.00 5932585.00 327.78 ... 59.00 1.00 1.00 1.00

8 rows × 13 columns

We see suspicious prices at or below zero. Let's dig deeper.

df[df.precio <= 0]
fecha evento producto categoria precio usuario sesion
343 2019-10-01 02:15:41 view 5892052 1487580010377117763 0.00 555455025 320f6021-30ac-4a58-ae17-bac1cc32aac3
924 2019-10-01 05:16:30 view 5889621 1487580010561667147 0.00 523988665 00849bd2-fcd2-4cb4-af31-4e264f151848
933 2019-10-01 05:18:03 view 5889622 1487580010561667147 0.00 523988665 80cfe614-f0a5-4101-a2b6-a21227590470
937 2019-10-01 05:18:46 view 5889623 1487580010561667147 0.00 523988665 c2cd0464-3d2b-48e2-9667-bac248fe297a
1077 2019-10-01 05:38:01 view 5889627 1487580010561667147 0.00 523988665 8b2bf9d8-43f0-43b2-bed3-13b2c956cada
... ... ... ... ... ... ... ...
428011 2020-02-29 20:04:49 cart 5824841 1897124478404526487 0.00 469761446 8bf369b4-92c0-4fb8-88a5-8a2dd0947e46
428012 2020-02-29 20:04:49 cart 5826413 1487580005511725929 0.00 469761446 8bf369b4-92c0-4fb8-88a5-8a2dd0947e46
428013 2020-02-29 20:04:49 cart 5832437 1487580007675986893 0.00 469761446 8bf369b4-92c0-4fb8-88a5-8a2dd0947e46
428014 2020-02-29 20:04:49 cart 5851606 2055161088059638328 0.00 469761446 8bf369b4-92c0-4fb8-88a5-8a2dd0947e46
428370 2020-02-29 20:26:16 view 5923106 1487580008246412266 0.00 622047714 74f04dc6-2b3c-4565-beda-f575d73ed81c

20544 rows × 7 columns

There are about 20,000 such records; we could simply delete them.

But first: are they perhaps concentrated in one specific product?

df[df.precio <= 0].producto.value_counts().head(10)
5896186    79
5903915    50
5873428    37
5851294    29
5851304    29
5837624    28
5712583    27
5851272    27
5907812    26
5899512    26
Name: producto, dtype: int64

It doesn't seem to be an issue with any specific product, so we will remove all of these records.

df = df[df.precio > 0]
df
fecha evento producto categoria precio usuario sesion
0 2019-10-01 00:01:46 view 5843665 1487580005092295511 9.44 462033176 a18e0999-61a1-4218-8f8f-61ec1d375361
1 2019-10-01 00:01:55 cart 5868461 1487580013069861041 3.57 514753614 e2fecb2d-22d0-df2c-c661-15da44b3ccf1
2 2019-10-01 00:02:50 view 5877456 1487580006300255120 122.22 527418424 86e77869-afbc-4dff-9aa2-6b7dd8c90770
3 2019-10-01 00:03:41 view 5649270 1487580013749338323 6.19 555448072 b5f72ceb-0730-44de-a932-d16db62390df
4 2019-10-01 00:03:44 view 18082 1487580005411062629 16.03 552006247 2d8f304b-de45-4e59-8f40-50c603843fe5
... ... ... ... ... ... ... ...
429785 2020-02-29 23:58:49 cart 5815662 1487580006317032337 0.92 147995998 5ff96629-3627-493e-a25b-5a871ec78c90
429786 2020-02-29 23:58:57 view 5815665 1487580006317032337 0.59 147995998 5ff96629-3627-493e-a25b-5a871ec78c90
429787 2020-02-29 23:59:05 cart 5815665 1487580006317032337 0.59 147995998 5ff96629-3627-493e-a25b-5a871ec78c90
429788 2020-02-29 23:59:28 view 5817692 1487580010872045658 0.79 619841242 18af673b-7fb9-4202-a66d-5c855bc0fd2d
429789 2020-02-29 23:59:54 view 5716351 1487580010872045658 0.79 619841242 18af673b-7fb9-4202-a66d-5c855bc0fd2d

2074026 rows × 7 columns

Analysis of categorical variables

df.evento.nunique()
4
df.evento.value_counts()
view                961558
cart                574547
remove_from_cart    410357
purchase            127564
Name: evento, dtype: int64
df.producto.nunique()
45327
df.categoria.nunique()
508

Index

Let's put the date as the index.

df.set_index('fecha', inplace = True)
df
evento producto categoria precio usuario sesion
fecha
2019-10-01 00:01:46 view 5843665 1487580005092295511 9.44 462033176 a18e0999-61a1-4218-8f8f-61ec1d375361
2019-10-01 00:01:55 cart 5868461 1487580013069861041 3.57 514753614 e2fecb2d-22d0-df2c-c661-15da44b3ccf1
2019-10-01 00:02:50 view 5877456 1487580006300255120 122.22 527418424 86e77869-afbc-4dff-9aa2-6b7dd8c90770
2019-10-01 00:03:41 view 5649270 1487580013749338323 6.19 555448072 b5f72ceb-0730-44de-a932-d16db62390df
2019-10-01 00:03:44 view 18082 1487580005411062629 16.03 552006247 2d8f304b-de45-4e59-8f40-50c603843fe5
... ... ... ... ... ... ...
2020-02-29 23:58:49 cart 5815662 1487580006317032337 0.92 147995998 5ff96629-3627-493e-a25b-5a871ec78c90
2020-02-29 23:58:57 view 5815665 1487580006317032337 0.59 147995998 5ff96629-3627-493e-a25b-5a871ec78c90
2020-02-29 23:59:05 cart 5815665 1487580006317032337 0.59 147995998 5ff96629-3627-493e-a25b-5a871ec78c90
2020-02-29 23:59:28 view 5817692 1487580010872045658 0.79 619841242 18af673b-7fb9-4202-a66d-5c855bc0fd2d
2020-02-29 23:59:54 view 5716351 1487580010872045658 0.79 619841242 18af673b-7fb9-4202-a66d-5c855bc0fd2d

2074026 rows × 6 columns

DATA TRANSFORMATION

We are going to create 3 types of new variables:

  • Date components: extracted from the timestamp
  • Calendar variables: local holidays (Russia)
  • Exogenous indicators: days that are not necessarily holidays but have commercial interest: Black Friday, Cyber Monday, Three Kings, Valentine's Day

Date Components

def componentes_fecha(dataframe):
    #Extract each calendar component from the datetime index
    date = dataframe.index.date
    año = dataframe.index.year
    mes = dataframe.index.month
    dia = dataframe.index.day
    hora = dataframe.index.hour
    minuto = dataframe.index.minute
    segundo = dataframe.index.second

    return(pd.DataFrame({'date':date, 'año':año,'mes':mes, 'dia':dia, 'hora':hora, 'minuto':minuto, 'segundo':segundo}))
df = pd.concat([df.reset_index(),componentes_fecha(df)], axis = 1).set_index('fecha')
df
evento producto categoria precio ... dia hora minuto segundo
fecha
2019-10-01 00:01:46 view 5843665 1487580005092295511 9.44 ... 1 0 1 46
2019-10-01 00:01:55 cart 5868461 1487580013069861041 3.57 ... 1 0 1 55
2019-10-01 00:02:50 view 5877456 1487580006300255120 122.22 ... 1 0 2 50
2019-10-01 00:03:41 view 5649270 1487580013749338323 6.19 ... 1 0 3 41
2019-10-01 00:03:44 view 18082 1487580005411062629 16.03 ... 1 0 3 44
... ... ... ... ... ... ... ... ... ...
2020-02-29 23:58:49 cart 5815662 1487580006317032337 0.92 ... 29 23 58 49
2020-02-29 23:58:57 view 5815665 1487580006317032337 0.59 ... 29 23 58 57
2020-02-29 23:59:05 cart 5815665 1487580006317032337 0.59 ... 29 23 59 5
2020-02-29 23:59:28 view 5817692 1487580010872045658 0.79 ... 29 23 59 28
2020-02-29 23:59:54 view 5716351 1487580010872045658 0.79 ... 29 23 59 54

2074026 rows × 13 columns

Calendar variables: holidays

To incorporate holidays we can use the holidays package.

It's not perfect, but it gives us a lot of flexibility, since it covers holidays for many countries, even down to the regional level.

The list of supported countries and the basic usage can be found at:

https://github.com/dr-prodigy/python-holidays

import holidays

We define the festivo_ru object, since this ecommerce is Russian.

festivo_ru = holidays.RU(years=2020)
festivo_ru
{datetime.date(2020, 1, 1): 'Новый год', datetime.date(2020, 1, 2): 'Новый год', datetime.date(2020, 1, 3): 'Новый год', datetime.date(2020, 1, 4): 'Новый год', datetime.date(2020, 1, 5): 'Новый год', datetime.date(2020, 1, 6): 'Новый год', datetime.date(2020, 1, 7): 'Православное Рождество', datetime.date(2020, 1, 8): 'Новый год', datetime.date(2020, 2, 23): 'День защитника отечества', datetime.date(2020, 3, 8): 'День женщин', datetime.date(2020, 5, 1): 'Праздник Весны и Труда', datetime.date(2020, 5, 9): 'День Победы', datetime.date(2020, 6, 12): 'День России', datetime.date(2020, 11, 4): 'День народного единства', datetime.date(2020, 12, 31): 'Новый год'}
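
As an aside, several years can be pre-populated at construction time, and membership tests accept date strings (a brief sketch based on the package's documented behavior):

festivo_ru = holidays.RU(years = [2019, 2020])   #pre-populate both years of our history
'2020-01-07' in festivo_ru   #True: Orthodox Christmas
'2019-10-01' in festivo_ru   #False: an ordinary working day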

We are going to add a variable that indicates, for each record, whether it fell on a holiday or not.

df['festivo'] = df.date.apply(lambda x: 1 if (x in festivo_ru) else 0)
df
evento producto categoria precio ... hora minuto segundo festivo
fecha
2019-10-01 00:01:46 view 5843665 1487580005092295511 9.44 ... 0 1 46 0
2019-10-01 00:01:55 cart 5868461 1487580013069861041 3.57 ... 0 1 55 0
2019-10-01 00:02:50 view 5877456 1487580006300255120 122.22 ... 0 2 50 0
2019-10-01 00:03:41 view 5649270 1487580013749338323 6.19 ... 0 3 41 0
2019-10-01 00:03:44 view 18082 1487580005411062629 16.03 ... 0 3 44 0
... ... ... ... ... ... ... ... ... ...
2020-02-29 23:58:49 cart 5815662 1487580006317032337 0.92 ... 23 58 49 0
2020-02-29 23:58:57 view 5815665 1487580006317032337 0.59 ... 23 58 57 0
2020-02-29 23:59:05 cart 5815665 1487580006317032337 0.59 ... 23 59 5 0
2020-02-29 23:59:28 view 5817692 1487580010872045658 0.79 ... 23 59 28 0
2020-02-29 23:59:54 view 5716351 1487580010872045658 0.79 ... 23 59 54 0

2074026 rows × 14 columns

We check the holidays. Note that although we created the object with years=2020, the holidays package computes additional years on the fly when membership is tested, which is why the 2019 holidays also appear.

df[df.festivo == 1].date.value_counts().sort_index()
2019-11-04    16430
2019-12-31     2848
2020-01-01     7644
2020-01-02    10776
2020-01-03    10617
2020-01-04    13084
2020-01-05    14554
2020-01-06    10621
2020-01-07    12922
2020-01-08    14004
2020-02-23     9817
Name: date, dtype: int64

Exogenous indicators

We are going to add indicators for Black Friday and Valentine's Day.

df['black_friday'] = 0
df.loc['2019-11-29','black_friday'] = 1

df['san_valentin'] = 0
df.loc['2020-02-14','san_valentin'] = 1

We check.

df['black_friday'].value_counts()
0    2051695
1      22331
Name: black_friday, dtype: int64
df['san_valentin'].value_counts()
0    2061781
1      12245
Name: san_valentin, dtype: int64

FINAL ANALYTICAL TABLE

We review what we have.

df.head()
evento producto categoria precio ... segundo festivo black_friday san_valentin
fecha
2019-10-01 00:01:46 view 5843665 1487580005092295511 9.44 ... 46 0 0 0
2019-10-01 00:01:55 cart 5868461 1487580013069861041 3.57 ... 55 0 0 0
2019-10-01 00:02:50 view 5877456 1487580006300255120 122.22 ... 50 0 0 0
2019-10-01 00:03:41 view 5649270 1487580013749338323 6.19 ... 41 0 0 0
2019-10-01 00:03:44 view 18082 1487580005411062629 16.03 ... 44 0 0 0

5 rows × 16 columns

Let's put the columns in a more natural order.

variables = df.columns.to_list()
variables
['evento',
 'producto',
 'categoria',
 'precio',
 'usuario',
 'sesion',
 'date',
 'año',
 'mes',
 'dia',
 'hora',
 'minuto',
 'segundo',
 'festivo',
 'black_friday',
 'san_valentin']
orden = ['usuario',
         'sesion',
         'categoria',
         'evento',
         'producto',
         'precio']

orden
['usuario', 'sesion', 'categoria', 'evento', 'producto', 'precio']
resto = [nombre for nombre in variables if nombre not in orden]

resto
['date',
 'año',
 'mes',
 'dia',
 'hora',
 'minuto',
 'segundo',
 'festivo',
 'black_friday',
 'san_valentin']
df = df[orden + resto]
df
usuario sesion categoria evento ... segundo festivo black_friday san_valentin
fecha
2019-10-01 00:01:46 462033176 a18e0999-61a1-4218-8f8f-61ec1d375361 1487580005092295511 view ... 46 0 0 0
2019-10-01 00:01:55 514753614 e2fecb2d-22d0-df2c-c661-15da44b3ccf1 1487580013069861041 cart ... 55 0 0 0
2019-10-01 00:02:50 527418424 86e77869-afbc-4dff-9aa2-6b7dd8c90770 1487580006300255120 view ... 50 0 0 0
2019-10-01 00:03:41 555448072 b5f72ceb-0730-44de-a932-d16db62390df 1487580013749338323 view ... 41 0 0 0
2019-10-01 00:03:44 552006247 2d8f304b-de45-4e59-8f40-50c603843fe5 1487580005411062629 view ... 44 0 0 0
... ... ... ... ... ... ... ... ... ...
2020-02-29 23:58:49 147995998 5ff96629-3627-493e-a25b-5a871ec78c90 1487580006317032337 cart ... 49 0 0 0
2020-02-29 23:58:57 147995998 5ff96629-3627-493e-a25b-5a871ec78c90 1487580006317032337 view ... 57 0 0 0
2020-02-29 23:59:05 147995998 5ff96629-3627-493e-a25b-5a871ec78c90 1487580006317032337 cart ... 5 0 0 0
2020-02-29 23:59:28 619841242 18af673b-7fb9-4202-a66d-5c855bc0fd2d 1487580010872045658 view ... 28 0 0 0
2020-02-29 23:59:54 619841242 18af673b-7fb9-4202-a66d-5c855bc0fd2d 1487580010872045658 view ... 54 0 0 0

2074026 rows × 16 columns

We save it as a pickle so as not to lose the metadata (dtypes and index).

df.to_pickle('tablon_analitico.pickle')

ANALYSIS AND INSIGHTS

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import fiser_tools as fs
fs.misc.dark_theme()


#Quick autocomplete
%config IPCompleter.greedy=True

#Display format
pd.options.display.float_format = '{:15.2f}'.format
pd.options.display.max_columns = 8
df = pd.read_pickle('tablon_analitico.pickle')
df
usuario sesion categoria evento ... segundo festivo black_friday san_valentin
fecha
2019-10-01 00:01:46 462033176 a18e0999-61a1-4218-8f8f-61ec1d375361 1487580005092295511 view ... 46 0 0 0
2019-10-01 00:01:55 514753614 e2fecb2d-22d0-df2c-c661-15da44b3ccf1 1487580013069861041 cart ... 55 0 0 0
2019-10-01 00:02:50 527418424 86e77869-afbc-4dff-9aa2-6b7dd8c90770 1487580006300255120 view ... 50 0 0 0
2019-10-01 00:03:41 555448072 b5f72ceb-0730-44de-a932-d16db62390df 1487580013749338323 view ... 41 0 0 0
2019-10-01 00:03:44 552006247 2d8f304b-de45-4e59-8f40-50c603843fe5 1487580005411062629 view ... 44 0 0 0
... ... ... ... ... ... ... ... ... ...
2020-02-29 23:58:49 147995998 5ff96629-3627-493e-a25b-5a871ec78c90 1487580006317032337 cart ... 49 0 0 0
2020-02-29 23:58:57 147995998 5ff96629-3627-493e-a25b-5a871ec78c90 1487580006317032337 view ... 57 0 0 0
2020-02-29 23:59:05 147995998 5ff96629-3627-493e-a25b-5a871ec78c90 1487580006317032337 cart ... 5 0 0 0
2020-02-29 23:59:28 619841242 18af673b-7fb9-4202-a66d-5c855bc0fd2d 1487580010872045658 view ... 28 0 0 0
2020-02-29 23:59:54 619841242 18af673b-7fb9-4202-a66d-5c855bc0fd2d 1487580010872045658 view ... 54 0 0 0

2074026 rows × 16 columns

Understanding events

How is the Customer Journey working?

eventos = df.evento.value_counts()
eventos
view                961558
cart                574547
remove_from_cart    410357
purchase            127564
Name: evento, dtype: int64
kpi_visualizaciones_p = 100
kpi_carrito_p = eventos.loc['cart'] / eventos.loc['view'] * 100
kpi_abandono_p = eventos.loc['remove_from_cart'] / eventos.loc['cart'] * 100
kpi_compra_p = eventos.loc['purchase'] / eventos.loc['cart'] * 100

kpis = pd.DataFrame({'kpi':['visitas','carrito','compra'],
                     'valor':[kpi_visualizaciones_p,kpi_carrito_p,kpi_compra_p]})

kpis
kpi valor
0 visitas 100.00
1 carrito 59.75
2 compra 22.20
from plotly import graph_objects as go

fig = go.Figure(go.Funnel(
    y = kpis.kpi,
    x = kpis.valor.round(2),
    marker = {'color': ['red','blue','green']},
    opacity = 0.3
    ))

fig.update_layout(
    title = 'Funnel Conversión Inicial')

fig.show()

[Figure: initial conversion funnel]

Conclusions:

  • Baseline rates are 60% carts over views and 22% purchases over carts
  • In other words, 40% of views never generate a cart, and 78% of carts never end in a purchase: both are margins we can work on

How many products are viewed, added to cart, abandoned and purchased on average in each session?

Unlike the macro analysis of the funnel, this analysis is per session, which makes it more operational.

Knowing the main KPIs per session allows us to establish the baseline against which to measure the results of CRO actions.

First we create a dataframe at the session and event granularity that we need.

sesion_prod = df.groupby(['sesion','evento']).producto.count()
sesion_prod
sesion                                evento
0000597b-de39-4a77-9fe5-02c8792ca14e  view      3
0000645a-8160-4a3d-91bf-154bff0a22e3  view      2
000090e1-da13-42b1-a31b-91a9ee5e6a88  view      1
0000b3cb-5422-4bf2-b8fe-5c1831d0dc1b  view      1
0000de26-bd58-42c9-9173-4763c76b398e  view      1
                                               ..
ffff6695-b64d-4a67-aa14-34b3b7f63c3f  view      2
ffff7d69-b706-4c64-9d6d-da57a04bc32b  view      1
ffff8044-2a22-4846-8a72-999e870abbe9  view      1
ffff91d4-7879-4a4b-8b26-c67915a27dc8  view      1
ffffbe0a-d2c2-47c7-afab-680bfdfda50d  view      1
Name: producto, Length: 581763, dtype: int64

We pivot the events into columns.

sesion_prod = sesion_prod.unstack().fillna(0)
sesion_prod
evento cart purchase remove_from_cart view
sesion
0000597b-de39-4a77-9fe5-02c8792ca14e 0.00 0.00 0.00 3.00
0000645a-8160-4a3d-91bf-154bff0a22e3 0.00 0.00 0.00 2.00
000090e1-da13-42b1-a31b-91a9ee5e6a88 0.00 0.00 0.00 1.00
0000b3cb-5422-4bf2-b8fe-5c1831d0dc1b 0.00 0.00 0.00 1.00
0000de26-bd58-42c9-9173-4763c76b398e 0.00 0.00 0.00 1.00
... ... ... ... ...
ffff6695-b64d-4a67-aa14-34b3b7f63c3f 0.00 0.00 0.00 2.00
ffff7d69-b706-4c64-9d6d-da57a04bc32b 0.00 0.00 0.00 1.00
ffff8044-2a22-4846-8a72-999e870abbe9 0.00 0.00 0.00 1.00
ffff91d4-7879-4a4b-8b26-c67915a27dc8 0.00 0.00 0.00 1.00
ffffbe0a-d2c2-47c7-afab-680bfdfda50d 0.00 0.00 0.00 1.00

446054 rows × 4 columns

As a sanity check, we compute the totals, which should match the global figures.

sesion_prod.sum()
evento
cart                     574547.00
purchase                 127564.00
remove_from_cart         410357.00
view                     961558.00
dtype: float64

We rearrange the columns.

sesion_prod = sesion_prod[['view','cart','remove_from_cart','purchase']]
sesion_prod
evento view cart remove_from_cart purchase
sesion
0000597b-de39-4a77-9fe5-02c8792ca14e 3.00 0.00 0.00 0.00
0000645a-8160-4a3d-91bf-154bff0a22e3 2.00 0.00 0.00 0.00
000090e1-da13-42b1-a31b-91a9ee5e6a88 1.00 0.00 0.00 0.00
0000b3cb-5422-4bf2-b8fe-5c1831d0dc1b 1.00 0.00 0.00 0.00
0000de26-bd58-42c9-9173-4763c76b398e 1.00 0.00 0.00 0.00
... ... ... ... ...
ffff6695-b64d-4a67-aa14-34b3b7f63c3f 2.00 0.00 0.00 0.00
ffff7d69-b706-4c64-9d6d-da57a04bc32b 1.00 0.00 0.00 0.00
ffff8044-2a22-4846-8a72-999e870abbe9 1.00 0.00 0.00 0.00
ffff91d4-7879-4a4b-8b26-c67915a27dc8 1.00 0.00 0.00 0.00
ffffbe0a-d2c2-47c7-afab-680bfdfda50d 1.00 0.00 0.00 0.00

446054 rows × 4 columns

We calculate the mean of each event per session.

media_eventos_sesion = sesion_prod.mean()
media_eventos_sesion
evento
view                          2.16
cart                          1.29
remove_from_cart              0.92
purchase                      0.29
dtype: float64

Conclusion:

In each session, on average:

  • 2.2 products are viewed
  • 1.3 products are added to the cart
  • 0.9 products are removed from the cart
  • 0.3 products are purchased

As we said, these are the baseline numbers that we must improve with CRO actions.

Do events differ by hour of day?

We create the dataframe at event and hour granularity.

eventos_hora = df.groupby(['evento','hora']).producto.count()
eventos_hora
evento  hora
cart    0        6475
        1        5555
        2        6433
        3        8544
        4       11242
                ...  
view    19      63730
        20      57311
        21      38905
        22      23043
        23      13307
Name: producto, Length: 96, dtype: int64

We pivot the events into columns.

eventos_hora = eventos_hora.unstack(level = 0)
eventos_hora
evento cart purchase remove_from_cart view
hora
0 6475 962 3238 8731
1 5555 1128 3930 7280
2 6433 1220 3509 8378
3 8544 1535 5331 11807
4 11242 2389 8095 18365
5 16890 3491 11913 27438
6 21993 5125 16223 38055
7 27069 5951 17883 46072
8 29526 7158 21156 49587
9 32095 7593 21680 54185
10 32901 7816 23982 56458
11 33284 8495 25496 57594
12 34258 8250 23714 57530
13 31996 8133 22852 55534
14 30451 7122 21835 52184
15 28789 6485 20162 49809
16 28775 6531 19791 51055
17 32525 6242 24330 55667
18 36435 8211 30551 59533
19 39609 7435 27666 63730
20 34828 7256 24985 57311
21 23228 4606 17396 38905
22 13589 2883 8680 23043
23 8057 1547 5959 13307

Let's visualize how the events are distributed per hour.

eventos_hora.plot()
plt.xticks(ticks = eventos_hora.index);

[Figure: event counts by hour, line chart]

There is a global daily pattern, as expected.

But to see the differences better, we can create a new variable: the ratio of purchases to views in each hour.

eventos_hora['compras_visitas'] = eventos_hora.purchase / eventos_hora.view * 100
eventos_hora
evento cart purchase remove_from_cart view compras_visitas
hora
0 6475 962 3238 8731 11.02
1 5555 1128 3930 7280 15.49
2 6433 1220 3509 8378 14.56
3 8544 1535 5331 11807 13.00
4 11242 2389 8095 18365 13.01
5 16890 3491 11913 27438 12.72
6 21993 5125 16223 38055 13.47
7 27069 5951 17883 46072 12.92
8 29526 7158 21156 49587 14.44
9 32095 7593 21680 54185 14.01
10 32901 7816 23982 56458 13.84
11 33284 8495 25496 57594 14.75
12 34258 8250 23714 57530 14.34
13 31996 8133 22852 55534 14.65
14 30451 7122 21835 52184 13.65
15 28789 6485 20162 49809 13.02
16 28775 6531 19791 51055 12.79
17 32525 6242 24330 55667 11.21
18 36435 8211 30551 59533 13.79
19 39609 7435 27666 63730 11.67
20 34828 7256 24985 57311 12.66
21 23228 4606 17396 38905 11.84
22 13589 2883 8680 23043 12.51
23 8057 1547 5959 13307 11.63

We rearrange the variables.

eventos_hora = eventos_hora[['view','cart','remove_from_cart','purchase','compras_visitas']]
eventos_hora
evento view cart remove_from_cart purchase compras_visitas
hora
0 8731 6475 3238 962 11.02
1 7280 5555 3930 1128 15.49
2 8378 6433 3509 1220 14.56
3 11807 8544 5331 1535 13.00
4 18365 11242 8095 2389 13.01
5 27438 16890 11913 3491 12.72
6 38055 21993 16223 5125 13.47
7 46072 27069 17883 5951 12.92
8 49587 29526 21156 7158 14.44
9 54185 32095 21680 7593 14.01
10 56458 32901 23982 7816 13.84
11 57594 33284 25496 8495 14.75
12 57530 34258 23714 8250 14.34
13 55534 31996 22852 8133 14.65
14 52184 30451 21835 7122 13.65
15 49809 28789 20162 6485 13.02
16 51055 28775 19791 6531 12.79
17 55667 32525 24330 6242 11.21
18 59533 36435 30551 8211 13.79
19 63730 39609 27666 7435 11.67
20 57311 34828 24985 7256 12.66
21 38905 23228 17396 4606 11.84
22 23043 13589 8680 2883 12.51
23 13307 8057 5959 1547 11.63

We plot it to see whether there are hours in which proportionally more is purchased.

plt.figure(figsize = (12,6))
sns.lineplot(data = eventos_hora, x = eventos_hora.index, y = 'compras_visitas')
plt.xticks(eventos_hora.index);

[Figure: purchases-to-views ratio by hour]

Conclusions:

  • The hours with proportionally most purchases are 1h, 8h, 11h to 13h and 18h
  • The hours with proportionally fewest purchases are 0h, 3h to 7h, 14h to 17h and 19h to 23h

We now analyze, in absolute rather than proportional terms, whether some hours are more frequent for each type of event.

plt.figure(figsize = (12,12))
sns.heatmap(data = eventos_hora);

[Figure: heatmap of raw event counts by hour]

The problem is that each event has a different scale, so this graph does not let us distinguish the patterns well.

To solve it we can standardize the variables (z-scores: z = (x - mean) / std), as we learned in the statistics module.

def tipificar(variable):
    #Standardize to z-scores: subtract the mean and divide by the standard deviation
    media = variable.mean()
    dt = variable.std()
    return(variable.apply(lambda x: (x - media) / dt))
eventos_hora_tip = eventos_hora.apply(tipificar)
eventos_hora_tip
evento view cart remove_from_cart purchase compras_visitas
hora
0 -1.60 -1.56 -1.63 -1.62 -1.83
1 -1.68 -1.64 -1.54 -1.56 1.91
2 -1.62 -1.56 -1.59 -1.53 1.13
3 -1.45 -1.37 -1.38 -1.41 -0.17
4 -1.11 -1.13 -1.06 -1.09 -0.17
5 -0.65 -0.63 -0.61 -0.68 -0.41
6 -0.10 -0.17 -0.10 -0.07 0.22
7 0.31 0.28 0.09 0.24 -0.24
8 0.49 0.50 0.48 0.69 1.03
9 0.72 0.73 0.54 0.85 0.67
10 0.84 0.80 0.81 0.93 0.53
11 0.90 0.83 0.99 1.19 1.29
12 0.89 0.92 0.78 1.09 0.95
13 0.79 0.72 0.67 1.05 1.20
14 0.62 0.58 0.56 0.67 0.37
15 0.50 0.43 0.36 0.44 -0.16
16 0.56 0.43 0.32 0.45 -0.35
17 0.80 0.77 0.85 0.35 -1.67
18 1.00 1.11 1.58 1.08 0.49
19 1.21 1.40 1.24 0.79 -1.29
20 0.88 0.97 0.93 0.72 -0.46
21 -0.06 -0.06 0.03 -0.26 -1.14
22 -0.87 -0.92 -0.99 -0.91 -0.58
23 -1.37 -1.42 -1.31 -1.40 -1.32
plt.figure(figsize = (12,12))
sns.heatmap(data = eventos_hora_tip);

[Figure: heatmap of standardized event counts by hour]

Let's also draw the line charts to see it even more clearly.

eventos_hora_tip.plot(subplots = True, sharex = False, figsize = (12,12),xticks = eventos_hora_tip.index);

[Figure: standardized event counts by hour, one line chart per event]

Conclusions:

  • INSIGHT #1: All metrics are maximized in the time slots between 9 a.m. and 1 p.m. and between 6 p.m. and 8 p.m.
  • This information is very relevant, for example, for paid ads, both for traffic generation and retargeting
  • In addition, there seems to be a small subtype of user who buys at 1 in the morning: infrequent, but with a high purchase ratio

What is the average monthly revenue?

df.loc[df.evento == 'purchase'].groupby('mes').precio.sum().mean()
124309.92

What is the trend over the last few months?

tendencia = df.groupby('evento').resample('W').evento.count().unstack(level = 0)
tendencia
evento cart purchase remove_from_cart view
fecha
2019-10-06 31483 4440 14647 36353
2019-10-13 28151 5422 17989 44410
2019-10-20 23920 5033 15303 39486
2019-10-27 25651 5665 18411 40383
2019-11-03 24087 5746 16491 39365
2019-11-10 29142 6663 24008 46177
2019-11-17 25335 5141 17215 41170
2019-11-24 38069 9754 27973 56477
2019-12-01 31994 7493 23106 48883
2019-12-08 23265 5105 19443 42055
2019-12-15 24636 5953 18246 45874
2019-12-22 19927 4701 15452 39237
2019-12-29 17051 3705 11102 32803
2020-01-05 16735 3294 13464 31909
2020-01-12 26264 5589 17956 46873
2020-01-19 28402 6913 22945 50210
2020-01-26 26353 6359 18544 48478
2020-02-02 29193 7120 21102 52432
2020-02-09 28796 5853 20050 48422
2020-02-16 27836 6332 22601 47213
2020-02-23 25619 6000 18146 43627
2020-03-01 22638 5283 16163 39721
tendencia = tendencia[['view','cart','remove_from_cart','purchase']]
tendencia
evento view cart remove_from_cart purchase
fecha
2019-10-06 36353 31483 14647 4440
2019-10-13 44410 28151 17989 5422
2019-10-20 39486 23920 15303 5033
2019-10-27 40383 25651 18411 5665
2019-11-03 39365 24087 16491 5746
2019-11-10 46177 29142 24008 6663
2019-11-17 41170 25335 17215 5141
2019-11-24 56477 38069 27973 9754
2019-12-01 48883 31994 23106 7493
2019-12-08 42055 23265 19443 5105
2019-12-15 45874 24636 18246 5953
2019-12-22 39237 19927 15452 4701
2019-12-29 32803 17051 11102 3705
2020-01-05 31909 16735 13464 3294
2020-01-12 46873 26264 17956 5589
2020-01-19 50210 28402 22945 6913
2020-01-26 48478 26353 18544 6359
2020-02-02 52432 29193 21102 7120
2020-02-09 48422 28796 20050 5853
2020-02-16 47213 27836 22601 6332
2020-02-23 43627 25619 18146 6000
2020-03-01 39721 22638 16163 5283
tendencia.plot(subplots = True, figsize = (12,6), sharex = True, xticks = tendencia.index, x_compat=True, rot = 90);

[Figure: weekly trend of each event]

The trend is flat across all metrics, confirming the need for CRO actions.

There is a large peak in the week of November 24, obviously due to Black Friday. Let's repeat the analysis at a daily level, only for November and December, to see the effect in detail.

tendencia_diaria = df.loc['2019-11':'2019-12'].groupby('evento').resample('D').evento.count().unstack(level = 0)
tendencia_diaria
evento cart purchase remove_from_cart view
fecha
2019-11-01 3565 709 2810 5352
2019-11-02 3015 912 2124 4857
2019-11-03 3540 755 2622 5583
2019-11-04 4652 676 4854 6248
2019-11-05 4118 753 2711 7213
... ... ... ... ...
2019-12-27 2023 507 1335 4058
2019-12-28 1744 329 1193 3704
2019-12-29 2134 263 1149 3939
2019-12-30 1364 258 823 3434
2019-12-31 563 114 447 1724

61 rows × 4 columns

tendencia_diaria = tendencia_diaria[['view','cart','remove_from_cart','purchase']]
tendencia_diaria
evento view cart remove_from_cart purchase
fecha
2019-11-01 5352 3565 2810 709
2019-11-02 4857 3015 2124 912
2019-11-03 5583 3540 2622 755
2019-11-04 6248 4652 4854 676
2019-11-05 7213 4118 2711 753
... ... ... ... ...
2019-12-27 4058 2023 1335 507
2019-12-28 3704 1744 1193 329
2019-12-29 3939 2134 1149 263
2019-12-30 3434 1364 823 258
2019-12-31 1724 563 447 114

61 rows × 4 columns

tendencia_diaria.plot(subplots = True, figsize = (16,10), sharex = True, xticks = tendencia_diaria.index, x_compat=True, rot = 90);

[Figure: daily trend of each event, November-December 2019]

Conclusions:

  • Indeed, the peak coincides with Black Friday (the 29th)
  • But there is an even bigger peak a few days earlier, on the 22nd, possibly due to the start of Black Friday week
  • Surprisingly, the Christmas days themselves show a decreasing trend, which suggests consumers have clearly brought their purchases forward

We are going to do the same analysis for January and February.

tendencia_diaria = df.loc['2020-01':'2020-02'].groupby('evento').resample('D').evento.count().unstack(level = 0)
tendencia_diaria = tendencia_diaria[['view','cart','remove_from_cart','purchase']]
tendencia_diaria.plot(subplots = True, figsize = (16,10), sharex = True, xticks = tendencia_diaria.index, x_compat=True, rot = 90);

[Figure: daily trend of each event, January-February 2020]

Conclusions:

  • There is no sales peak during Three Kings week either
  • Nor in the days leading up to Valentine's Day
  • But there is a very pronounced peak on January 27, probably due to some local event

INSIGHT #2: The big takeaway is that the bulk of Christmas spending is captured during Black Friday week

Moments of truth?

Could we identify the day-hour combinations in which the greatest number of purchases take place?

It would be very useful to concentrate a large part of the campaign investment on exactly those moments.

compras_dia_hora = df.loc[df.evento == 'purchase'].groupby(['date','hora']).evento.count().unstack(level = 0).fillna(0)
compras_dia_hora
date 2019-10-01 2019-10-02 2019-10-03 2019-10-04 ... 2020-02-26 2020-02-27 2020-02-28 2020-02-29
hora
0 13.00 18.00 1.00 2.00 ... 5.00 40.00 0.00 0.00
1 0.00 0.00 5.00 0.00 ... 0.00 5.00 26.00 33.00
2 0.00 0.00 0.00 24.00 ... 0.00 0.00 0.00 8.00
3 0.00 24.00 10.00 0.00 ... 0.00 10.00 0.00 0.00
4 15.00 0.00 45.00 27.00 ... 10.00 148.00 16.00 0.00
5 49.00 9.00 6.00 17.00 ... 6.00 48.00 11.00 3.00
6 23.00 34.00 18.00 10.00 ... 94.00 26.00 58.00 35.00
7 26.00 60.00 26.00 54.00 ... 30.00 53.00 38.00 65.00
8 28.00 71.00 129.00 49.00 ... 120.00 80.00 67.00 25.00
9 24.00 34.00 90.00 61.00 ... 38.00 92.00 20.00 22.00
10 15.00 62.00 43.00 22.00 ... 66.00 29.00 64.00 19.00
11 95.00 80.00 83.00 36.00 ... 80.00 69.00 86.00 63.00
12 9.00 43.00 100.00 67.00 ... 52.00 57.00 40.00 58.00
13 16.00 76.00 69.00 18.00 ... 50.00 59.00 6.00 23.00
14 74.00 31.00 38.00 36.00 ... 70.00 51.00 26.00 44.00
15 25.00 10.00 45.00 28.00 ... 51.00 28.00 44.00 46.00
16 99.00 21.00 33.00 42.00 ... 55.00 10.00 14.00 59.00
17 88.00 80.00 55.00 31.00 ... 6.00 32.00 27.00 34.00
18 53.00 24.00 35.00 54.00 ... 98.00 220.00 46.00 55.00
19 29.00 25.00 19.00 14.00 ... 28.00 85.00 56.00 21.00
20 53.00 22.00 63.00 17.00 ... 85.00 7.00 12.00 15.00
21 1.00 55.00 25.00 42.00 ... 44.00 22.00 16.00 17.00
22 33.00 10.00 0.00 42.00 ... 5.00 49.00 2.00 21.00
23 0.00 0.00 7.00 0.00 ... 0.00 6.00 5.00 1.00

24 rows × 152 columns

plt.figure(figsize = (20,14))
sns.heatmap(compras_dia_hora);

[Figure: heatmap of purchases by date and hour]

Understanding customers

To analyze at the customer level, it is best to create a dataframe containing only buyers, at customer granularity, with the variables that interest us.

We must be careful to choose the right aggregation function for each variable.

clientes = df.loc[df.evento == 'purchase'].groupby(['usuario']).agg({'producto':'count',
                                                          'sesion':'nunique', 
                                                          'precio': 'mean',
                                                          'date': 'max'})

clientes
producto sesion precio date
usuario
25392526 3 1 7.38 2019-12-18
27756757 1 1 20.63 2020-01-27
50748978 9 1 1.11 2019-12-14
52747911 3 1 7.67 2019-10-10
65241811 5 1 8.36 2019-11-11
... ... ... ... ...
621995551 5 1 2.09 2020-02-29
622021687 1 1 13.33 2020-02-29
622041514 3 1 0.63 2020-02-29
622042698 3 1 28.04 2020-02-29
622065819 4 1 5.12 2020-02-29

11040 rows × 4 columns

We rename the columns.

clientes.columns = ['productos_tot_num','compras_tot_num','precio_medio_prod','ult_compra']
clientes
productos_tot_num compras_tot_num precio_medio_prod ult_compra
usuario
25392526 3 1 7.38 2019-12-18
27756757 1 1 20.63 2020-01-27
50748978 9 1 1.11 2019-12-14
52747911 3 1 7.67 2019-10-10
65241811 5 1 8.36 2019-11-11
... ... ... ... ...
621995551 5 1 2.09 2020-02-29
622021687 1 1 13.33 2020-02-29
622041514 3 1 0.63 2020-02-29
622042698 3 1 28.04 2020-02-29
622065819 4 1 5.12 2020-02-29

11040 rows × 4 columns

We are going to calculate additional variables.

clientes['gasto_tot'] = clientes.productos_tot_num * clientes.precio_medio_prod
clientes['productos_por_compra'] = clientes.productos_tot_num / clientes.compras_tot_num
clientes
productos_tot_num compras_tot_num precio_medio_prod ult_compra gasto_tot productos_por_compra
usuario
25392526 3 1 7.38 2019-12-18 22.14 3.00
27756757 1 1 20.63 2020-01-27 20.63 1.00
50748978 9 1 1.11 2019-12-14 10.01 9.00
52747911 3 1 7.67 2019-10-10 23.02 3.00
65241811 5 1 8.36 2019-11-11 41.79 5.00
... ... ... ... ... ... ...
621995551 5 1 2.09 2020-02-29 10.46 5.00
622021687 1 1 13.33 2020-02-29 13.33 1.00
622041514 3 1 0.63 2020-02-29 1.90 3.00
622042698 3 1 28.04 2020-02-29 84.13 3.00
622065819 4 1 5.12 2020-02-29 20.48 4.00

11040 rows × 6 columns

How are customers distributed in terms of spending?

sns.histplot(data = clientes, x = 'gasto_tot', bins = 50)
plt.xlim([0,300]);

[Figure: histogram of total spend per customer]

The vast majority of customers have spent less than €50 in the period.

How are customers distributed in terms of the number of purchases?

sns.countplot(data = clientes, x = 'compras_tot_num');

[Figure: number of purchases per customer, count plot]

INSIGHT #3: The vast majority of customers only make one purchase.

There is a long way to go to improve this ratio through:

  • email marketing with newsletters and personalized offers

How many products does an average customer buy with each purchase?

clientes.productos_por_compra.describe()
count          11040.00
mean               7.79
std                9.49
min                1.00
25%                3.00
50%                5.00
75%               10.00
max              219.00
Name: productos_por_compra, dtype: float64

INSIGHT #4: The median purchase includes 5 products.

But 25% of customers buy 10 or more products in a single purchase.

There is a long way to go to improve this ratio through:

  • recommendation systems at the time of purchase

Which customers have generated the most revenue?

clientes.nlargest(n = 10, columns = 'gasto_tot')
productos_tot_num compras_tot_num precio_medio_prod ult_compra gasto_tot productos_por_compra
usuario
573823111 268 2 5.82 2020-02-21 1559.21 134.00
539751397 236 13 6.16 2020-02-19 1453.37 18.15
556579890 506 4 2.75 2020-02-27 1392.45 126.50
442763940 195 8 6.37 2019-12-23 1241.53 24.38
561592095 94 3 11.81 2019-10-31 1109.70 31.33
527739278 244 13 4.39 2020-02-16 1071.00 18.77
527806771 195 13 4.86 2020-02-20 948.01 15.00
430220205 190 6 4.99 2020-02-29 947.30 31.67
491009486 219 1 4.32 2020-02-12 946.20 219.00
520501669 64 11 14.27 2020-01-17 913.01 5.82

To put this figure in context, we look at the distribution of total spend per customer.

clientes.gasto_tot.describe()
count          11040.00
mean              56.30
std               81.73
min                0.13
25%               16.22
50%               32.74
75%               60.30
max             1559.21
Name: gasto_tot, dtype: float64

INSIGHT #5: There are customers whose total spend is dozens of times the average.

These customers must be retained through loyalty programs.

What is customer survival?

Since we only have 5 months of history, we will build cohort analyses with a 3-month horizon, which gives us 3 cohorts.

We prepare a dataframe with buyers only, keeping the user and month variables.

c = df.loc[df.evento == 'purchase', ['usuario','mes']]
c
usuario mes
fecha
2019-10-01 00:26:49 536128518 10
2019-10-01 00:26:49 536128518 10
2019-10-01 00:26:49 536128518 10
2019-10-01 00:26:49 536128518 10
2019-10-01 00:26:49 536128518 10
... ... ...
2020-02-29 22:29:19 622065819 2
2020-02-29 22:29:19 622065819 2
2020-02-29 22:29:19 622065819 2
2020-02-29 22:29:19 622065819 2
2020-02-29 23:26:42 610361057 2

127564 rows × 2 columns

We pivot the months into columns.

c = pd.crosstab(c.usuario,c.mes).reset_index()
c
mes usuario 1 2 10 11 12
0 25392526 0 0 0 0 3
1 27756757 1 0 0 0 0
2 50748978 0 0 0 0 9
3 52747911 0 0 3 0 0
4 65241811 0 0 0 5 0
... ... ... ... ... ... ...
11035 621995551 0 5 0 0 0
11036 622021687 0 1 0 0 0
11037 622041514 0 3 0 0 0
11038 622042698 0 3 0 0 0
11039 622065819 0 4 0 0 0

11040 rows × 6 columns

We rename the columns (c1 to c5 in chronological order, October through February) and drop the user id, which we no longer need.

c.columns = ['usuario','c4','c5','c1','c2','c3']
c.drop(columns = 'usuario', inplace = True)
c
c4 c5 c1 c2 c3
0 0 0 0 0 3
1 1 0 0 0 0
2 0 0 0 0 9
3 0 0 3 0 0
4 0 0 0 5 0
... ... ... ... ... ...
11035 0 5 0 0 0
11036 0 1 0 0 0
11037 0 3 0 0 0
11038 0 3 0 0 0
11039 0 4 0 0 0

11040 rows × 5 columns

The first cohort will be that of month 2, since we want to select "new" customers (those who did not purchase in the previous month).

c2 = c.loc[(c.c1 == 0) & (c.c2 > 0)]
c2
c4 c5 c1 c2 c3
4 0 0 0 5 0
6 0 0 0 10 0
8 0 0 0 27 17
9 0 0 0 3 0
13 0 0 0 4 0
... ... ... ... ... ...
7702 0 0 0 5 0
7703 0 5 0 2 0
7705 0 0 0 1 0
7708 0 0 0 5 6
7709 0 0 0 1 0

2640 rows × 5 columns

We convert it to a binary dataframe, since we only care whether or not each customer purchased in each month.

def binarizar(variable):
    #1 if the customer made at least one purchase that month, 0 otherwise
    variable = variable.transform(lambda x: 1 if (x > 0) else 0)
    return(variable)
c2_b = c2.apply(binarizar)
c2_b
c4 c5 c1 c2 c3
4 0 0 0 1 0
6 0 0 0 1 0
8 0 0 0 1 1
9 0 0 0 1 0
13 0 0 0 1 0
... ... ... ... ... ...
7702 0 0 0 1 0
7703 0 1 0 1 0
7705 0 0 0 1 0
7708 0 0 0 1 1
7709 0 0 0 1 0

2640 rows × 5 columns

We calculate the percentage of customers in this cohort who have continued to purchase in subsequent months.

c2_f = c2_b.sum() / c2_b.shape[0]
c2_f = c2_f.sort_index()
c2_f
c1              0.00
c2              1.00
c3              0.10
c4              0.10
c5              0.08
dtype: float64

We replicate the entire process for cohort 3.

c3 = c.loc[(c.c2 == 0) & (c.c3 > 0)]
c3_b = c3.apply(binarizar)
c3_f = c3_b.sum() / c3_b.shape[0]
c3_f = c3_f.sort_index()
c3_f['c1'] = 0
c3_f
c1              0.00
c2              0.00
c3              1.00
c4              0.10
c5              0.08
dtype: float64

We replicate the entire process for cohort 4.

c4 = c.loc[(c.c3 == 0) & (c.c4 > 0)]
c4_b = c4.apply(binarizar)
c4_f = c4_b.sum() / c4_b.shape[0]
c4_f = c4_f.sort_index()
c4_f['c1'] = 0
c4_f['c2'] = 0
c4_f
c1              0.00
c2              0.00
c3              0.00
c4              1.00
c5              0.12
dtype: float64
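
Since the three cohorts follow identical steps, the process could be wrapped in a small helper; a sketch (the function name is ours, and unlike the manual steps above it does not zero out the months before the cohort start):

def tasa_cohorte(c, mes_previo, mes_inicio):
    #"New" customers: no purchases in the previous month, at least one in the starting month
    cohorte = c.loc[(c[mes_previo] == 0) & (c[mes_inicio] > 0)]
    #Share of the cohort with at least one purchase in each month
    return (cohorte > 0).mean().sort_index()

c3_f = tasa_cohorte(c, 'c2', 'c3')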

We create the cohort dataframe.

cohortes = pd.DataFrame({'c2':c2_f,'c3':c3_f,'c4':c4_f})
cohortes
c2 c3 c4
c1 0.00 0.00 0.00
c2 1.00 0.00 0.00
c3 0.10 1.00 0.00
c4 0.10 0.10 1.00
c5 0.08 0.08 0.12
cohortes = cohortes.drop(index = 'c1').T
cohortes
c2 c3 c4 c5
c2 1.00 0.10 0.10 0.08
c3 0.00 1.00 0.10 0.08
c4 0.00 0.00 1.00 0.12
plt.figure(figsize = (12,6))
sns.heatmap(cohortes,annot = True, fmt = '.0%', cmap='Greys');

[Figure: cohort retention heatmap]

INSIGHT #6: 90% of new customers do not buy again in the following months

What is the LTV of our customers?

Given that 90% of new customers do not buy again in the following months, we can estimate the LTV from the history we already have without much fear of being wrong.

To do this, we are going to take the customers of cohort 2 and calculate the total of their purchases.

#Customers who purchased in November (the month in which cohort 2 starts)
maestro_ltv = df.loc[(df.evento == 'purchase') & (df.mes == 11),'usuario'].to_list()
maestro_ltv
[549319657,
 549319657,
 549319657,
 549319657,
 ...
 ...
 510126861,
 510126861,
 510126861,
 440684807,
 ...]
clientes_ltv = clientes.loc[clientes.index.isin(maestro_ltv)]
clientes_ltv
productos_tot_num compras_tot_num precio_medio_prod ult_compra gasto_tot productos_por_compra
usuario
65241811 5 1 8.36 2019-11-11 41.79 5.00
80577370 10 2 10.62 2019-11-29 106.24 5.00
88211255 22 4 4.86 2020-02-25 106.87 5.50
93279832 44 2 3.19 2019-12-19 140.51 22.00
94390236 3 1 9.73 2019-11-07 29.20 3.00
... ... ... ... ... ... ...
579798049 5 1 2.10 2019-11-30 10.52 5.00
579813390 7 2 2.98 2020-02-04 20.83 3.50
579834429 1 1 27.14 2019-11-30 27.14 1.00
579900887 11 2 5.67 2019-12-02 62.34 5.50
579903865 1 1 8.43 2019-11-30 8.43 1.00

3105 rows × 6 columns

clientes_ltv.gasto_tot.describe()
count           3105.00
mean              79.62
std              113.62
min                0.13
25%               20.29
50%               41.49
75%               90.00
max             1453.37
Name: gasto_tot, dtype: float64

Given the variability of the mean, it is safer to take the median.

INSIGHT #7: The median LTV is €42.

Applying our margin to that figure, and the percentage of margin we want to devote to acquisition, gives us the maximum amount we can invest as CPA.

Applying CRO actions will increase the LTV, and with it the maximum CPA we can afford, which is a very important strategic advantage.
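
For example, with purely hypothetical figures (a 30% gross margin and half of that margin devoted to acquisition; both percentages are assumptions, not data from this analysis):

ltv_mediano = 41.49     #median total spend per customer from the cohort above
margen = 0.30           #hypothetical gross margin
pct_captacion = 0.50    #hypothetical share of margin devoted to acquisition

cpa_maximo = ltv_mediano * margen * pct_captacion
cpa_maximo    #~6.2: the most we could pay to acquire a customer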

Which customers should the next campaigns target (RFM)?

We are going to learn a technique called RFM (Recency - Frequency - Monetary).

This technique is very powerful in retail contexts, and therefore also for ecommerce.

It allows responding to needs such as:

  • What is the ratio of single-order customers to repeat customers
  • Which are the VIP customers (who potentially need loyalty programs and personalized attention)
  • How many new customers there are (so we can encourage them to place a second order)
  • How many and which customers have not purchased in a long time
  • How many and which customers are not worth investing more time and resources in
  • Etc.

Despite its power, it is very simple to build, which makes it almost mandatory in this type of analysis.

The first thing is to identify the variables with which to create each of the dimensions:

  • Recency: ult_compra
  • Frequency: compras_tot_num
  • Monetary: gasto_tot

Then we discretize each of them.

We leave recency for last, because it requires a prior transformation.

We start with Frequency.

clientes['F'] = clientes.compras_tot_num.transform(lambda x: pd.cut(x,5, labels = False)) + 1
clientes
productos_tot_num compras_tot_num precio_medio_prod ult_compra gasto_tot productos_por_compra F
usuario
25392526 3 1 7.38 2019-12-18 22.14 3.00 1
27756757 1 1 20.63 2020-01-27 20.63 1.00 1
50748978 9 1 1.11 2019-12-14 10.01 9.00 1
52747911 3 1 7.67 2019-10-10 23.02 3.00 1
65241811 5 1 8.36 2019-11-11 41.79 5.00 1
... ... ... ... ... ... ... ...
621995551 5 1 2.09 2020-02-29 10.46 5.00 1
622021687 1 1 13.33 2020-02-29 13.33 1.00 1
622041514 3 1 0.63 2020-02-29 1.90 3.00 1
622042698 3 1 28.04 2020-02-29 84.13 3.00 1
622065819 4 1 5.12 2020-02-29 20.48 4.00 1

11040 rows × 7 columns

We check.

clientes.groupby('F').compras_tot_num.mean()
F
1              1.31
2              7.06
3             12.00
4             16.50
5             23.50
Name: compras_tot_num, dtype: float64
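
Note that pd.cut builds equal-width bins over the value range, which is why with such a skewed distribution almost all customers fall into F = 1. If more balanced groups were preferred, a quantile-based variant is a common alternative; a sketch (F_q is a hypothetical extra column):

#rank(method = 'first') breaks ties so the quintile edges are unique
clientes['F_q'] = pd.qcut(clientes.compras_tot_num.rank(method = 'first'), 5, labels = False) + 1
clientes.groupby('F_q').compras_tot_num.mean()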

Now Monetary.

clientes['M'] = clientes.gasto_tot.transform(lambda x: pd.cut(x,5, labels = False)) + 1
clientes.groupby('M').gasto_tot.mean()
M
1             48.36
2            410.98
3            765.18
4           1043.96
5           1468.34
Name: gasto_tot, dtype: float64

For recency we have to transform the date into a number, for example the distance in days from each date to the most recent date available.

mas_reciente = clientes.ult_compra.max()

clientes['ult_compra_dias'] = clientes.ult_compra.transform(lambda x: mas_reciente - x)

clientes
productos_tot_num compras_tot_num precio_medio_prod ult_compra ... productos_por_compra F M ult_compra_dias
usuario
25392526 3 1 7.38 2019-12-18 ... 3.00 1 1 73 days
27756757 1 1 20.63 2020-01-27 ... 1.00 1 1 33 days
50748978 9 1 1.11 2019-12-14 ... 9.00 1 1 77 days
52747911 3 1 7.67 2019-10-10 ... 3.00 1 1 142 days
65241811 5 1 8.36 2019-11-11 ... 5.00 1 1 110 days
... ... ... ... ... ... ... ... ... ...
621995551 5 1 2.09 2020-02-29 ... 5.00 1 1 0 days
622021687 1 1 13.33 2020-02-29 ... 1.00 1 1 0 days
622041514 3 1 0.63 2020-02-29 ... 3.00 1 1 0 days
622042698 3 1 28.04 2020-02-29 ... 3.00 1 1 0 days
622065819 4 1 5.12 2020-02-29 ... 4.00 1 1 0 days

11040 rows × 9 columns

This creates a timedelta; we convert it to a number of days.

clientes['ult_compra_dias'] = clientes.ult_compra_dias.dt.days
clientes
productos_tot_num compras_tot_num precio_medio_prod ult_compra ... productos_por_compra F M ult_compra_dias
usuario
25392526 3 1 7.38 2019-12-18 ... 3.00 1 1 73
27756757 1 1 20.63 2020-01-27 ... 1.00 1 1 33
50748978 9 1 1.11 2019-12-14 ... 9.00 1 1 77
52747911 3 1 7.67 2019-10-10 ... 3.00 1 1 142
65241811 5 1 8.36 2019-11-11 ... 5.00 1 1 110
... ... ... ... ... ... ... ... ... ...
621995551 5 1 2.09 2020-02-29 ... 5.00 1 1 0
622021687 1 1 13.33 2020-02-29 ... 1.00 1 1 0
622041514 3 1 0.63 2020-02-29 ... 3.00 1 1 0
622042698 3 1 28.04 2020-02-29 ... 3.00 1 1 0
622065819 4 1 5.12 2020-02-29 ... 4.00 1 1 0

11040 rows × 9 columns

We can now create R, but keep in mind that here the lowest values are the best (most recent).

clientes['R'] = clientes.ult_compra_dias.transform(lambda x: pd.cut(x,5, labels = False)) + 1
clientes.groupby('R').ult_compra_dias.mean()
R
1             14.62
2             43.04
3             75.94
4            103.85
5            135.91
Name: ult_compra_dias, dtype: float64

To standardize its interpretation with the rest of the dimensions, we invert the scale.

clientes['R'] = 6 - clientes.R    # maps 1..5 to 5..1, so 5 now means most recent
clientes.groupby('R').ult_compra_dias.mean()
R
1            135.91
2            103.85
3             75.94
4             43.04
5             14.62
Name: ult_compra_dias, dtype: float64

The three dimensions are now integrated in the same dataframe.

clientes
productos_tot_num compras_tot_num precio_medio_prod ult_compra ... F M ult_compra_dias R
usuario
25392526 3 1 7.38 2019-12-18 ... 1 1 73 3
27756757 1 1 20.63 2020-01-27 ... 1 1 33 4
50748978 9 1 1.11 2019-12-14 ... 1 1 77 3
52747911 3 1 7.67 2019-10-10 ... 1 1 142 1
65241811 5 1 8.36 2019-11-11 ... 1 1 110 2
... ... ... ... ... ... ... ... ... ...
621995551 5 1 2.09 2020-02-29 ... 1 1 0 5
622021687 1 1 13.33 2020-02-29 ... 1 1 0 5
622041514 3 1 0.63 2020-02-29 ... 1 1 0 5
622042698 3 1 28.04 2020-02-29 ... 1 1 0 5
622065819 4 1 5.12 2020-02-29 ... 1 1 0 5

11040 rows × 10 columns

We create additional variables.

# Combined value score (3 to 15) and the concatenated RFM code
clientes['valor'] = clientes.R + clientes.F + clientes.M
clientes['RFM'] = clientes.apply(lambda x: str(x.R) + str(x.F) + str(x.M), axis = 1)
clientes
productos_tot_num compras_tot_num precio_medio_prod ult_compra ... ult_compra_dias R valor RFM
usuario
25392526 3 1 7.38 2019-12-18 ... 73 3 5 311
27756757 1 1 20.63 2020-01-27 ... 33 4 6 411
50748978 9 1 1.11 2019-12-14 ... 77 3 5 311
52747911 3 1 7.67 2019-10-10 ... 142 1 3 111
65241811 5 1 8.36 2019-11-11 ... 110 2 4 211
... ... ... ... ... ... ... ... ... ...
621995551 5 1 2.09 2020-02-29 ... 0 5 7 511
622021687 1 1 13.33 2020-02-29 ... 0 5 7 511
622041514 3 1 0.63 2020-02-29 ... 0 5 7 511
622042698 3 1 28.04 2020-02-29 ... 0 5 7 511
622065819 4 1 5.12 2020-02-29 ... 0 5 7 511

11040 rows × 12 columns

On this dataframe we can now run any number of analyses.

For example, by combining it with the minicube technique we can obtain all kinds of insights.

# Step 1: choose which variables will be the metrics and which the dimensions
metricas = ['productos_tot_num','compras_tot_num','gasto_tot']
dimensiones = ['R','F','M','RFM','valor']

minicubo = clientes[dimensiones + metricas]
minicubo
R F M RFM valor productos_tot_num compras_tot_num gasto_tot
usuario
25392526 3 1 1 311 5 3 1 22.14
27756757 4 1 1 411 6 1 1 20.63
50748978 3 1 1 311 5 9 1 10.01
52747911 1 1 1 111 3 3 1 23.02
65241811 2 1 1 211 4 5 1 41.79
... ... ... ... ... ... ... ... ...
621995551 5 1 1 511 7 5 1 10.46
622021687 5 1 1 511 7 1 1 13.33
622041514 5 1 1 511 7 3 1 1.90
622042698 5 1 1 511 7 3 1 84.13
622065819 5 1 1 511 7 4 1 20.48

11040 rows × 8 columns

# Step 2: melt the dimensions into long (transactional) format
minicubo = minicubo.melt(id_vars = metricas)
minicubo
productos_tot_num compras_tot_num gasto_tot variable value
0 3 1 22.14 R 3
1 1 1 20.63 R 4
2 9 1 10.01 R 3
3 3 1 23.02 R 1
4 5 1 41.79 R 2
... ... ... ... ... ...
55195 5 1 10.46 valor 7
55196 1 1 13.33 valor 7
55197 3 1 1.90 valor 7
55198 3 1 84.13 valor 7
55199 4 1 20.48 valor 7

55200 rows × 5 columns

# Step 3: aggregate the metrics by "variable" and "value" with the desired functions
minicubo = minicubo.groupby(['variable','value'], as_index = False)[metricas].mean()
minicubo
variable value productos_tot_num compras_tot_num gasto_tot
0 F 1 10.61 1.31 52.09
1 F 2 71.42 7.06 320.47
2 F 3 123.64 12.00 643.20
3 F 4 156.75 16.50 560.15
4 F 5 124.00 23.50 652.42
... ... ... ... ... ...
58 valor 9 98.02 7.25 491.71
59 valor 10 140.89 10.22 625.93
60 valor 11 291.00 5.75 1189.31
61 valor 12 189.80 16.60 833.43
62 valor 13 179.00 18.00 1136.70

63 rows × 5 columns

To analyze each dimension we select it.

minicubo[minicubo.variable == 'F']
variable value productos_tot_num compras_tot_num gasto_tot
0 F 1 10.61 1.31 52.09
1 F 2 71.42 7.06 320.47
2 F 3 123.64 12.00 643.20
3 F 4 156.75 16.50 560.15
4 F 5 124.00 23.50 652.42

And we analyze it graphically.

minicubo[minicubo.variable == 'F'].set_index('value').plot.bar(subplots = True, sharex = False, figsize = (12,12))
plt.tight_layout();

png

minicubo[minicubo.variable == 'R']
variable value productos_tot_num compras_tot_num gasto_tot
10 R 1 8.15 1.08 41.56
11 R 2 9.25 1.18 45.58
12 R 3 9.54 1.29 47.25
13 R 4 11.72 1.44 58.19
14 R 5 16.83 1.82 79.04
minicubo[minicubo.variable == 'R'].set_index('value').plot.bar(subplots = True, sharex = False, figsize = (12,12))
plt.tight_layout();

png

minicubo[minicubo.variable == 'M']
variable value productos_tot_num compras_tot_num gasto_tot
5 M 1 10.12 1.34 48.36
6 M 2 74.28 4.31 410.98
7 M 3 138.50 6.86 765.18
8 M 4 189.50 7.33 1043.96
9 M 5 336.67 6.33 1468.34
minicubo[minicubo.variable == 'M'].set_index('value').plot.bar(subplots = True, sharex = False, figsize = (12,12))
plt.tight_layout();

png

minicubo[minicubo.variable == 'RFM']
variable value productos_tot_num compras_tot_num gasto_tot
15 RFM 111 7.97 1.08 39.28
16 RFM 112 37.38 1.50 397.98
17 RFM 114 94.00 3.00 1109.70
18 RFM 211 8.94 1.17 43.16
19 RFM 212 47.00 1.73 391.14
20 RFM 213 35.50 1.00 662.84
21 RFM 221 85.67 6.33 194.48
22 RFM 311 8.98 1.26 43.91
23 RFM 312 53.46 2.46 376.38
24 RFM 321 39.00 6.50 141.28
25 RFM 322 121.00 6.50 389.19
26 RFM 324 195.00 8.00 1241.53
27 RFM 411 10.37 1.36 49.99
28 RFM 412 60.69 2.54 404.22
29 RFM 413 130.50 2.50 868.30
30 RFM 421 41.00 6.92 205.31
31 RFM 422 92.88 7.12 440.47
32 RFM 423 153.33 7.33 672.97
33 RFM 433 64.00 11.00 913.01
34 RFM 511 11.94 1.50 56.07
35 RFM 512 78.79 3.05 399.18
36 RFM 513 133.00 3.14 742.48
37 RFM 514 219.00 1.00 946.20
38 RFM 515 387.00 3.00 1475.83
39 RFM 521 47.88 6.83 192.19
40 RFM 522 89.51 7.51 442.50
41 RFM 523 184.67 7.33 766.90
42 RFM 524 190.00 6.00 947.30
43 RFM 531 58.25 11.50 233.84
44 RFM 532 94.00 12.50 448.09
45 RFM 533 200.00 11.00 858.26
46 RFM 534 219.50 13.00 1009.51
47 RFM 535 236.00 13.00 1453.37
48 RFM 541 121.50 16.50 288.65
49 RFM 543 192.00 16.50 831.65
50 RFM 552 126.00 24.00 484.81
51 RFM 553 122.00 23.00 820.04
minicubo[minicubo.variable == 'RFM'].set_index('value').plot.bar(subplots = True, sharex = False, figsize = (12,12))
plt.tight_layout();

png

minicubo[minicubo.variable == 'valor']
variable value productos_tot_num compras_tot_num gasto_tot
52 valor 3 7.97 1.08 39.28
53 valor 4 9.04 1.17 44.41
54 valor 5 9.31 1.27 46.08
55 valor 6 10.73 1.38 53.07
56 valor 7 12.64 1.54 60.42
57 valor 8 66.13 5.06 313.01
58 valor 9 98.02 7.25 491.71
59 valor 10 140.89 10.22 625.93
60 valor 11 291.00 5.75 1189.31
61 valor 12 189.80 16.60 833.43
62 valor 13 179.00 18.00 1136.70
minicubo[minicubo.variable == 'valor'].set_index('value').plot.bar(subplots = True, sharex = False, figsize = (12,12))
plt.tight_layout();

png

The analysis could be improved: in F and M, outliers cause most of the data to be concentrated in category 1.

What should be done is remove those outliers (or switch to quantile-based bins) and redo the exercise.

I leave it to you as a practice task; a sketch of the quantile route follows.
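
To get you started, a minimal sketch of the quantile route, reusing the clientes dataframe (F_q and M_q are hypothetical names): pd.qcut builds bins holding roughly the same number of customers each, so outliers no longer push everyone into category 1. duplicates = 'drop' handles the repeated bin edges that heavily skewed variables produce, which can leave fewer than 5 bins.

# Quantile-based bins: roughly equal-sized groups instead of equal-width ranges
clientes['F_q'] = pd.qcut(clientes.compras_tot_num, 5, labels = False, duplicates = 'drop') + 1
clientes['M_q'] = pd.qcut(clientes.gasto_tot, 5, labels = False, duplicates = 'drop') + 1

# Check the averages per bin, as we did for F and M
clientes.groupby('F_q').compras_tot_num.mean()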

Even so, this analysis already lets us identify the customers most likely to respond to new campaigns, as well as gain a lot of valuable insights for the business.
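
For example, a hypothetical first audience for a campaign (the valor threshold is illustrative, not derived from the analysis):

# Customers with the highest combined score; valor ranges from 3 to 15
top_clientes = clientes[clientes.valor >= 12].sort_values('valor', ascending = False)
top_clientes.index   # user ids to feed into the campaign tool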

Understanding the products

We are going to create a dataframe at the product level to be able to analyze this dimension.

We first calculate the counts of each event in each product.

prod = df.groupby(['producto','evento']).size()
prod
producto  evento          
3752      view                 10
3762      cart                127
          purchase             28
          remove_from_cart     59
          view                258
                             ... 
5932538   view                  1
5932540   cart                  1
          view                  2
5932578   view                  1
5932585   view                  2
Length: 137068, dtype: int64
prod  = prod.unstack(level = 1).fillna(0)
prod
evento cart purchase remove_from_cart view
producto
3752 0.00 0.00 0.00 10.00
3762 127.00 28.00 59.00 258.00
3763 10.00 2.00 2.00 51.00
3771 0.00 0.00 0.00 9.00
3774 26.00 7.00 13.00 76.00
... ... ... ... ...
5932537 1.00 0.00 0.00 1.00
5932538 0.00 0.00 0.00 1.00
5932540 1.00 0.00 0.00 2.00
5932578 0.00 0.00 0.00 1.00
5932585 0.00 0.00 0.00 2.00

45327 rows × 4 columns

We are going to incorporate the price. To do so, we first create a price master per product.

maestro_precios = df.groupby('producto', as_index = False).precio.mean()
maestro_precios
producto precio
0 3752 15.71
1 3762 19.29
2 3763 16.00
3 3771 15.08
4 3774 15.92
... ... ...
45322 5932537 1.43
45323 5932538 1.43
45324 5932540 1.43
45325 5932578 6.02
45326 5932585 6.33

45327 rows × 2 columns

prod = pd.merge(left = prod, right = maestro_precios, how = 'left', on = 'producto')
prod
producto cart purchase remove_from_cart view precio
0 3752 0.00 0.00 0.00 10.00 15.71
1 3762 127.00 28.00 59.00 258.00 19.29
2 3763 10.00 2.00 2.00 51.00 16.00
3 3771 0.00 0.00 0.00 9.00 15.08
4 3774 26.00 7.00 13.00 76.00 15.92
... ... ... ... ... ... ...
45322 5932537 1.00 0.00 0.00 1.00 1.43
45323 5932538 0.00 0.00 0.00 1.00 1.43
45324 5932540 1.00 0.00 0.00 2.00 1.43
45325 5932578 0.00 0.00 0.00 1.00 6.02
45326 5932585 0.00 0.00 0.00 2.00 6.33

45327 rows × 6 columns

We reorder the columns to follow the funnel order.

prod = prod[['producto','view','cart','remove_from_cart','purchase','precio']]
prod
producto view cart remove_from_cart purchase precio
0 3752 10.00 0.00 0.00 0.00 15.71
1 3762 258.00 127.00 59.00 28.00 19.29
2 3763 51.00 10.00 2.00 2.00 16.00
3 3771 9.00 0.00 0.00 0.00 15.08
4 3774 76.00 26.00 13.00 7.00 15.92
... ... ... ... ... ... ...
45322 5932537 1.00 1.00 0.00 0.00 1.43
45323 5932538 1.00 0.00 0.00 0.00 1.43
45324 5932540 2.00 1.00 0.00 0.00 1.43
45325 5932578 1.00 0.00 0.00 0.00 6.02
45326 5932585 2.00 0.00 0.00 0.00 6.33

45327 rows × 6 columns

What are the best-selling products?

prod.sort_values('purchase',ascending = False)[0:20]
producto view cart remove_from_cart purchase precio
16807 5809910 9195.00 2796.00 1249.00 764.00 5.21
28178 5854897 624.00 2486.00 793.00 483.00 0.32
6644 5700037 1150.00 2603.00 716.00 361.00 0.40
314 5304 516.00 1184.00 426.00 341.00 0.32
9900 5751422 2204.00 1119.00 625.00 331.00 10.87
15394 5802432 701.00 2495.00 745.00 322.00 0.32
16809 5809912 3059.00 1352.00 863.00 321.00 5.19
18415 5815662 1219.00 1697.00 653.00 310.00 0.91
9862 5751383 2341.00 1035.00 550.00 298.00 10.24
14043 5792800 1527.00 911.00 512.00 285.00 10.25
26312 5849033 2099.00 1035.00 583.00 278.00 10.25
5386 5686925 344.00 1677.00 499.00 231.00 0.35
6653 5700046 432.00 1376.00 381.00 215.00 0.40
1761 5528035 1146.00 719.00 401.00 200.00 9.44
22111 5833330 680.00 576.00 359.00 194.00 0.92
16808 5809911 1923.00 828.00 599.00 189.00 5.21
18525 5816170 1642.00 751.00 532.00 182.00 5.22
5420 5687151 508.00 540.00 288.00 179.00 1.90
8232 5729864 160.00 505.00 211.00 176.00 0.41
24787 5843836 165.00 1007.00 265.00 172.00 0.38

Possibly we would be able to increase sales and the average ticket simply by highlighting these products in the store.

Are there products that are not being sold and could be removed from the catalogue?

prod[prod.purchase == 0]
producto view cart remove_from_cart purchase precio
0 3752 10.00 0.00 0.00 0.00 15.71
3 3771 9.00 0.00 0.00 0.00 15.08
6 3790 10.00 0.00 0.00 0.00 7.92
8 3809 2.00 0.00 0.00 0.00 12.54
9 3812 1.00 0.00 0.00 0.00 12.54
... ... ... ... ... ... ...
45322 5932537 1.00 1.00 0.00 0.00 1.43
45323 5932538 1.00 0.00 0.00 0.00 1.43
45324 5932540 2.00 1.00 0.00 0.00 1.43
45325 5932578 1.00 0.00 0.00 0.00 6.02
45326 5932585 2.00 0.00 0.00 0.00 6.33

21850 rows × 6 columns

INSIGHT #8: Almost half of the products have not had any sales in the 5 months of history.

A whole new analysis could be started on these products:

  • Are they not visible?
  • Are they seen but not bought?
  • Is it because they are replaced by other products of our own?
  • Is it because they are much cheaper at the competition?
  • Etc.

They could be removed from the catalog, or at least from the store, newsletter, etc., so that they do not take space away from the products that do sell.
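
As a first cut at those questions (the view thresholds are illustrative), we could split the unsold products between the first two hypotheses:

# Unsold products with real traffic are "seen but not bought";
# unsold products with almost no views point to a visibility problem
sin_ventas = prod[prod.purchase == 0]
vistos_no_comprados = sin_ventas[sin_ventas.view >= 50]
invisibles = sin_ventas[sin_ventas.view < 5]
len(vistos_no_comprados), len(invisibles)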

What is the relationship between price and sales volume?

Since this analysis involves sales, we will exclude the products that have not had any.

sns.scatterplot(data = prod[prod.purchase > 0], x = 'precio', y = 'purchase', hue = 'precio');

png

Yes, there is a clear decreasing relationship.

Let's zoom in, for example below €50, to understand it better.

sns.scatterplot(data = prod[(prod.purchase > 0) & (prod.precio < 50)], x = 'precio', y = 'purchase', hue = 'precio');

png

Are there products that customers have second thoughts about and remove from the cart more often?

# Percentage of cart additions that are later removed
prod.insert(loc = 4,
            column = 'remove_from_cart_porc',
            value = prod.remove_from_cart / prod.cart * 100)
prod
producto view cart remove_from_cart remove_from_cart_porc purchase precio
0 3752 10.00 0.00 0.00 NaN 0.00 15.71
1 3762 258.00 127.00 59.00 46.46 28.00 19.29
2 3763 51.00 10.00 2.00 20.00 2.00 16.00
3 3771 9.00 0.00 0.00 NaN 0.00 15.08
4 3774 76.00 26.00 13.00 50.00 7.00 15.92
... ... ... ... ... ... ... ...
45322 5932537 1.00 1.00 0.00 0.00 0.00 1.43
45323 5932538 1.00 0.00 0.00 NaN 0.00 1.43
45324 5932540 2.00 1.00 0.00 0.00 0.00 1.43
45325 5932578 1.00 0.00 0.00 NaN 0.00 6.02
45326 5932585 2.00 0.00 0.00 NaN 0.00 6.33

45327 rows × 7 columns

prod.loc[prod.cart > 30].sort_values('remove_from_cart_porc', ascending = False)[0:30]
producto view cart remove_from_cart remove_from_cart_porc purchase precio
14330 5797131 26.00 38.00 136.00 357.89 7.00 4.43
37937 5893670 36.00 35.00 109.00 311.43 3.00 4.90
29128 5858481 41.00 31.00 64.00 206.45 7.00 4.55
16658 5809346 8.00 34.00 62.00 182.35 7.00 0.78
26120 5848412 34.00 37.00 66.00 178.38 12.00 0.79
37944 5893677 64.00 41.00 70.00 170.73 10.00 4.69
8416 5731470 39.00 34.00 58.00 170.59 10.00 6.32
3217 5635096 32.00 32.00 52.00 162.50 11.00 4.42
3244 5635127 41.00 32.00 52.00 162.50 10.00 4.43
21617 5830537 35.00 37.00 60.00 162.16 8.00 1.73
39359 5900645 47.00 33.00 52.00 157.58 8.00 4.39
6222 5696152 81.00 41.00 64.00 156.10 12.00 2.37
29629 5859474 44.00 43.00 67.00 155.81 12.00 1.72
31887 5867624 26.00 46.00 70.00 152.17 10.00 3.89
9227 5741027 89.00 35.00 53.00 151.43 7.00 5.19
2123 5560972 73.00 51.00 76.00 149.02 11.00 2.98
6235 5696184 38.00 41.00 61.00 148.78 7.00 2.37
17716 5813067 48.00 43.00 63.00 146.51 5.00 1.72
221 4874 26.00 39.00 57.00 146.15 10.00 0.37
27643 5853242 47.00 43.00 62.00 144.19 6.00 3.15
33396 5875280 53.00 32.00 46.00 143.75 5.00 5.53
34325 5877765 62.00 35.00 50.00 142.86 9.00 9.02
23183 5837619 128.00 81.00 115.00 141.98 18.00 1.73
23833 5839637 55.00 43.00 61.00 141.86 12.00 2.37
28667 5857018 33.00 32.00 45.00 140.62 7.00 3.15
13942 5789608 57.00 32.00 45.00 140.62 7.00 4.69
3005 5619864 80.00 47.00 66.00 140.43 8.00 2.84
3205 5635081 26.00 35.00 49.00 140.00 4.00 4.40
30741 5863821 42.00 51.00 71.00 139.22 1.00 4.49
10701 5760769 38.00 31.00 43.00 138.71 6.00 2.62

We would need to understand why these products are removed more times than they are added:

  • If there is a legitimate reason: review what happens with these products (alternative products, etc.)
  • If there is not: drop these records and analyze only those with remove_from_cart_porc less than or equal to 100, as in the sketch below
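
A sketch of that second option, reusing the filter from the ranking above (prod_coherente is an illustrative name):

# Keep only products whose removals do not exceed their additions
prod_coherente = prod.loc[(prod.cart > 30) & (prod.remove_from_cart_porc <= 100)]
prod_coherente.sort_values('remove_from_cart_porc', ascending = False)[0:30]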

What are the most viewed products?

prod.view.sort_values(ascending = False)[0:20].plot.bar();

png

Again, we could possibly increase sales and the average ticket simply by highlighting these products in the store.

Provided that, in addition to being viewed, they are also bought.

Are there products wanted but not purchased?

For example, products that many customers look at but then do not buy.

If we found them, we would have to check what happens to them.

sns.scatterplot(data = prod, x = 'view', y = 'purchase');

png

We're going to remove the outlier and zoom into the many-views, few-purchases region.

sns.scatterplot(data = prod.loc[prod.view < 4000], x = 'view', y = 'purchase', hue = 'precio')
plt.xlim(1000,3000)
plt.ylim(0,150);

png

There is an opportunity with these products: for some reason they attract customers' interest, but in the end customers do not buy them.

It would be necessary to analyze them in depth; the sketch below extracts them.
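
An illustrative extraction of that window (the bounds are read off the zoom above, so they are approximate):

# Heavily viewed products with comparatively few purchases
oportunidad = prod.loc[prod.view.between(1000, 3000) & (prod.purchase < 150)]
oportunidad.sort_values('view', ascending = False)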

Building a recommendation system

One of the assets that can most increase the sales of an ecommerce is a recommendation system.

We could already implement a basic one using the most-viewed and best-selling analyses carried out above.

But the real power comes when we create a recommender that personalizes recommendations for each purchase.

Types of recommender systems:

  • Collaborative filtering:
    • Item based
    • User based
  • Content based

In our case we are going to develop one with collaborative filtering based on items.

The steps to follow are:

  1. Create the dataframe with the KPI of interest
  2. Reduce dimension (optional)
  3. Select a distance metric
  4. Compute the item-item matrix
  5. Create the prioritization logic

Create the dataframe with the KPI of interest

In this case we will use what is called an implicit KPI: the number of times each product has been purchased by the same user.

Explicit KPIs would be, for example, star ratings or scores from 1 to 10.

Since this algorithm takes time to compute, we are going to reduce the problem and calculate it only for the 100 best-selling products.

First we build a master list of the top 100 best-selling products.

mas_vendidos = prod.sort_values('purchase', ascending = False).producto[0:100]
mas_vendidos
16807    5809910
28178    5854897
6644     5700037
314         5304
9900     5751422
          ...   
30395    5862564
9778     5749720
9732     5749149
22751    5835859
22116    5833335
Name: producto, Length: 100, dtype: int64

We create a temporary dataframe filtering by these products.

temp = df.loc[df.producto.isin(mas_vendidos)]
temp
usuario sesion categoria evento ... segundo festivo black_friday san_valentin
fecha
2019-10-01 00:26:49 536128518 a31f0991-645e-4472-a012-95eb2f814568 1487580006317032337 purchase ... 49 0 0 0
2019-10-01 00:46:20 555415545 b9cc1771-9062-4e08-a3ad-363314cd17d8 1602943681873052386 view ... 20 0 0 0
2019-10-01 00:48:13 555415545 b9cc1771-9062-4e08-a3ad-363314cd17d8 1602943681873052386 view ... 13 0 0 0
2019-10-01 00:52:39 555415545 b9cc1771-9062-4e08-a3ad-363314cd17d8 1487580005092295511 view ... 39 0 0 0
2019-10-01 01:33:26 555456891 b3239dc3-f107-4034-a507-4c41f646e38a 1487580005092295511 view ... 26 0 0 0
... ... ... ... ... ... ... ... ... ...
2020-02-29 23:11:44 615102046 17b94398-0397-4c59-bc84-fe91dde0a8ec 1487580006509970331 cart ... 44 0 0 0
2020-02-29 23:12:40 615102046 17b94398-0397-4c59-bc84-fe91dde0a8ec 2195085255034011676 cart ... 40 0 0 0
2020-02-29 23:12:50 599909613 1c6c708d-135d-487b-afa9-4bbcfd28db4d 1602943681873052386 cart ... 50 0 0 0
2020-02-29 23:20:21 231719601 a7467d5c-e848-406f-97f4-fcb6a4113e68 1602943681873052386 view ... 21 0 0 0
2020-02-29 23:58:49 147995998 5ff96629-3627-493e-a25b-5a871ec78c90 1487580006317032337 cart ... 49 0 0 0

168170 rows × 16 columns

We create the user-item matrix.

usuario_item = temp.loc[temp.evento == 'purchase'].groupby(['usuario','producto']).size().unstack(level = 1).fillna(0)
usuario_item
producto 4497 4600 4768 4938 ... 5857360 5862564 5862943 5889300
usuario
25392526 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00
50748978 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00
74332980 0.00 0.00 0.00 1.00 ... 0.00 0.00 0.00 0.00
80577370 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00
88211255 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00
... ... ... ... ... ... ... ... ... ...
621646584 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00
621788730 0.00 0.00 0.00 0.00 ... 0.00 0.00 1.00 0.00
621925941 1.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00
621974977 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00
622021687 0.00 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00

5064 rows × 100 columns

Select a distance metric

Most common metrics:

  • Euclidean distance
  • Correlation
  • Cosine

In this case we are going to use the Euclidean distance.

We implement it with SciPy's spatial.distance.euclidean function.

from scipy import spatial
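
As a quick sanity check (an illustrative snippet, not part of the pipeline), the function simply returns the straight-line distance between two vectors:

spatial.distance.euclidean([1, 0, 2], [0, 1, 2])   # sqrt(2) ≈ 1.41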

Compute the item-item matrix

We create the recommender: it takes a user-item matrix as input and returns an item-item matrix with the Euclidean distances as values.

def recomendador(dataframe):

    # Distance from every item column of the matrix to the given product vector
    def distancia(producto):
        return(dataframe.apply(lambda x: spatial.distance.euclidean(x,producto)))

    # Applying it to every item column yields the square item-item distance matrix
    return(dataframe.apply(lambda x: distancia(x)))
item_item = recomendador(usuario_item)
item_item
producto 4497 4600 4768 4938 ... 5857360 5862564 5862943 5889300
producto
4497 0.00 14.42 14.49 15.62 ... 15.78 15.17 16.40 15.33
4600 14.42 0.00 10.68 14.49 ... 14.46 13.86 14.73 13.89
4768 14.49 10.68 0.00 14.56 ... 14.39 13.86 14.80 14.11
4938 15.62 14.49 14.56 0.00 ... 15.46 14.97 15.72 15.13
4958 15.91 14.59 14.73 15.52 ... 15.94 15.26 16.12 15.03
... ... ... ... ... ... ... ... ... ...
5857007 15.07 13.60 13.75 14.59 ... 14.83 14.46 15.17 14.42
5857360 15.78 14.46 14.39 15.46 ... 0.00 13.75 16.19 15.43
5862564 15.17 13.86 13.86 14.97 ... 13.75 0.00 15.39 14.66
5862943 16.40 14.73 14.80 15.72 ... 16.19 15.39 0.00 15.23
5889300 15.33 13.89 14.11 15.13 ... 15.43 14.66 15.23 0.00

100 rows × 100 columns
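
As a side note, the double apply above makes O(n²) pandas passes, which is fine for 100 products but slow for a full catalog. A vectorized sketch with SciPy's cdist (same output, assuming the usuario_item matrix above; item_item_rapido is my name) would be:

from scipy.spatial.distance import cdist

# Each row of usuario_item.T is a product vector over users;
# cdist computes all pairwise Euclidean distances in one call
item_item_rapido = pd.DataFrame(
    cdist(usuario_item.T, usuario_item.T, metric = 'euclidean'),
    index = usuario_item.columns,
    columns = usuario_item.columns)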

Create the prioritization logic

We already have the recommender ready.

What we would do is query this table every time a user views a product or adds it to the cart.

But to make it more effective, we could use all the information accumulated in the session, or even the user's entire history if he is logged in.

That means we need a system that can recommend products whether the input is a single product or several.

And that, at the same time, returns several recommendations, to cover all the recommendation "gaps" our site may have.

We will apply a very simple algorithm that will:

  1. Build an array with the input products and extract their vectors from the item-item matrix
  2. Compute the sum of distances across all the input products
  3. Remove the input products themselves so we do not self-recommend
  4. Return the 10 products with the smallest total distance

# If several products are passed, they arrive from the web server as a semicolon-separated string
def priorizador(productos, devuelve = 10):
    # Build an array with the input products
    array = np.int64(productos.split(';'))

    # Extract their vectors from the full item-item matrix
    matriz = item_item[array]

    # Sum of distances to all the input products
    suma_distancias = matriz.sum(axis = 1)

    # Remove the input products so we do not self-recommend
    suma_distancias = suma_distancias.loc[~suma_distancias.index.isin(list(array))]

    # Return the 'devuelve' (default 10) products with the smallest distance
    return(suma_distancias.sort_values()[0:devuelve])

We check how it works when we pass it a single product.

priorizador('4497')
producto
5724230             14.39
4600                14.42
5550302             14.49
4768                14.49
5749149             14.56
5833318             14.63
5824810             14.70
5835859             14.70
5809303             14.73
5833335             14.73
dtype: float64

We check how it works when we pass it several products.

priorizador('4497;4600;4768')
producto
5749149             40.25
5833318             40.47
5833335             40.81
5809303             40.81
5724230             41.00
5824810             41.08
5835859             41.23
5550302             41.47
5816169             41.51
5844894             41.55
dtype: float64

CONCLUSIONS

The current trend is flat across all metrics, confirming the need for CRO actions.

After the analysis carried out on the transactional data, a CRO plan has been developed with 12 specific initiatives organized into 5 major business levers which, with high probability, will raise the baselines and increase overall ecommerce revenue.

Baseline

In each session, on average:

  • KPIs per session: 2.2 products are viewed
  • KPIs per session: 1.3 products are added to the cart
  • KPIs per session: 0.9 products are removed from the cart
  • KPIs per session: 0.3 products are purchased
  • Cross-selling: median of 5 products per purchase
  • Recurrence: 10% of customers buy again after the first month
  • Conversion: 60% add-to-cart rate over views
  • Conversion: 22% purchase rate over cart additions
  • Conversion: 13% purchase rate over views
  • Average monthly turnover: €125,000

Actions to increase views

  1. Review paid campaigns (generation and retargeting) to concentrate investment in slots between 9 a.m. and 1 p.m. and between 6 p.m. and 8 p.m.
  2. Concentrate the investment of the Christmas and post-Christmas period in the week of Black Friday
  3. Increase the investment until reaching the maximum CPA based on the LTV that we have identified

Conversion increase actions

  1. Preconfigure the home page with the products identified in the most viewed and most sold analysis.
  2. Work on products with high cart abandonment rates
  3. Work on products that are highly viewed but rarely purchased

Increase cross-sell actions

  1. The median purchase includes 5 products
  2. Increase this ratio by recommending in real time with the new recommender

Actions to increase purchase frequency

  1. 90% of customers only make one purchase
  2. Create a regular newsletter with the new recommender to increase the frequency of visits
  3. Promotional campaigns on the top segments of the RFM segmentation

Customer loyalty actions

  1. Create a loyalty program segmented by the new RFM segmentation