In this case we will work as consultants for an ecommerce company in the cosmetics sector.
The company's results have been flat in recent months, and it has hired us to analyze its transactional data and implement CRO actions tailored to its situation based on that analysis.
In this case, among other things, we are going to learn to:
Analyze transactional data to identify potential CRO actions that increase visits, conversions and average ticket, and therefore grow the ecommerce's overall revenue.
Create advanced analytics assets, such as an RFM segmentation and a recommendation system, that drive goal achievement.
Therefore, much of what we learn here applies to practically any ecommerce.
As always, we will first understand the business, and its main processes, metrics and concepts.
The first step is when a user arrives at the ecommerce website. They will normally come from:
That traffic is called visits, and the pages users see are called page views, although in our case we will simply call them views.
The user browses the site and, when they like a product, adds it to the cart.
Finally, they can remove products from the cart, leave without buying anything, or place the order.
A common process is cross-selling, in which the user is recommended other products that might also interest them.
Even after the user has left, we can reach them again through retargeting or email marketing.
This entire process is called the funnel, or also the customer journey.
In the online environment practically everything can be registered.
The user may be browsing logged in or not.
The sequence of actions a user performs within the same visit is called a session.
The ratio of purchases to visits is called the conversion rate.
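For example, with hypothetical numbers (not the client's): 500 purchases out of 20,000 visits would give a conversion rate of 500 / 20,000 = 2.5%.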
In addition, there are other key metrics that we have to master to correctly manage an ecommerce:
KEY CONCEPT: There are only 3 ways to grow a business:
To achieve these 3 effects we work on the following operational levers:
In our case, we will understand CRO broadly, that is, as the discipline that puts actions into practice to work on the levers and concepts above.
In our case, the entities available at the granularity of the data are:
Having understood the levers, KPIs and entities, we can now ask the seed questions:
About the customer journey:
About customers:
About the products:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#Quick autocomplete
%config IPCompleter.greedy=True
#Display format
pd.options.display.float_format = '{:15.2f}'.format
pd.options.display.max_columns = 8
import sqlalchemy as sa
con = sa.create_engine('sqlite:///ecommerce.db')
from sqlalchemy import inspect
insp = inspect(con)
tablas = insp.get_table_names()
tablas
['2019-Dec', '2019-Nov', '2019-Oct', '2020-Feb', '2020-Jan']
oct = pd.read_sql('2019-Oct', con)
nov = pd.read_sql('2019-Nov', con)
dic = pd.read_sql('2019-Dec', con)
ene = pd.read_sql('2020-Jan', con)
feb = pd.read_sql('2020-Feb', con)
df = pd.concat([oct,nov,dic,ene,feb], axis = 0)
df
index | event_time | event_type | product_id | ... | brand | price | user_id | user_session | |
---|---|---|---|---|---|---|---|---|---|
0 | 68 | 2019-10-01 00:01:46 UTC | view | 5843665 | ... | f.o.x | 9.44 | 462033176 | a18e0999-61a1-4218-8f8f-61ec1d375361 |
1 | 72 | 2019-10-01 00:01:55 UTC | cart | 5868461 | ... | italwax | 3.57 | 514753614 | e2fecb2d-22d0-df2c-c661-15da44b3ccf1 |
2 | 95 | 2019-10-01 00:02:50 UTC | view | 5877456 | ... | jessnail | 122.22 | 527418424 | 86e77869-afbc-4dff-9aa2-6b7dd8c90770 |
3 | 122 | 2019-10-01 00:03:41 UTC | view | 5649270 | ... | concept | 6.19 | 555448072 | b5f72ceb-0730-44de-a932-d16db62390df |
4 | 124 | 2019-10-01 00:03:44 UTC | view | 18082 | ... | cnd | 16.03 | 552006247 | 2d8f304b-de45-4e59-8f40-50c603843fe5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
429785 | 4156660 | 2020-02-29 23:58:49 UTC | cart | 5815662 | ... | None | 0.92 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
429786 | 4156663 | 2020-02-29 23:58:57 UTC | view | 5815665 | ... | None | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
429787 | 4156668 | 2020-02-29 23:59:05 UTC | cart | 5815665 | ... | None | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
429788 | 4156675 | 2020-02-29 23:59:28 UTC | view | 5817692 | ... | None | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
429789 | 4156680 | 2020-02-29 23:59:54 UTC | view | 5716351 | ... | irisk | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
2095076 rows × 10 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2095076 entries, 0 to 429789
Data columns (total 10 columns):
# Column Dtype
--- ------ -----
0 index int64
1 event_time object
2 event_type object
3 product_id int64
4 category_id int64
5 category_code object
6 brand object
7 price float64
8 user_id int64
9 user_session object
dtypes: float64(1), int64(4), object(5)
memory usage: 175.8+ MB
We remove the index column.
df.drop(columns = 'index', inplace = True)
Analysis and correction of types.
We convert event_time to datetime.
def datetime_rapido(dt,formato):
    #Fast datetime parsing: keep only the date and time parts of strings like
    #'2019-10-01 00:01:46 UTC' (dropping the trailing 'UTC') and parse with an
    #explicit format, which is much faster than letting pandas infer it
    def divide_fecha(fecha):
        division = fecha.split()
        date = division[0]
        time = division[1]
        cadena = date + ' ' + time
        return cadena
    resultado = pd.to_datetime(dt.apply(lambda x: divide_fecha(x)), format = formato)
    return resultado
We run the function.
formato = '%Y-%m-%d %H:%M:%S'
df.event_time = datetime_rapido(df.event_time,formato)
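For reference, a vectorized alternative would avoid the row-by-row apply (a sketch that assumes every timestamp carries the same trailing ' UTC'):
#df.event_time = pd.to_datetime(df.event_time.str.slice(0, 19), format = formato)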
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2095076 entries, 0 to 429789
Data columns (total 9 columns):
# Column Dtype
--- ------ -----
0 event_time datetime64[ns]
1 event_type object
2 product_id int64
3 category_id int64
4 category_code object
5 brand object
6 price float64
7 user_id int64
8 user_session object
dtypes: datetime64[ns](1), float64(1), int64(3), object(4)
memory usage: 159.8+ MB
df.columns = ['fecha',
'evento',
'producto',
'categoria',
'categoria_cod',
'marca',
'precio',
'usuario',
'sesion']
df
fecha | evento | producto | categoria | ... | marca | precio | usuario | sesion | |
---|---|---|---|---|---|---|---|---|---|
0 | 2019-10-01 00:01:46 | view | 5843665 | 1487580005092295511 | ... | f.o.x | 9.44 | 462033176 | a18e0999-61a1-4218-8f8f-61ec1d375361 |
1 | 2019-10-01 00:01:55 | cart | 5868461 | 1487580013069861041 | ... | italwax | 3.57 | 514753614 | e2fecb2d-22d0-df2c-c661-15da44b3ccf1 |
2 | 2019-10-01 00:02:50 | view | 5877456 | 1487580006300255120 | ... | jessnail | 122.22 | 527418424 | 86e77869-afbc-4dff-9aa2-6b7dd8c90770 |
3 | 2019-10-01 00:03:41 | view | 5649270 | 1487580013749338323 | ... | concept | 6.19 | 555448072 | b5f72ceb-0730-44de-a932-d16db62390df |
4 | 2019-10-01 00:03:44 | view | 18082 | 1487580005411062629 | ... | cnd | 16.03 | 552006247 | 2d8f304b-de45-4e59-8f40-50c603843fe5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
429785 | 2020-02-29 23:58:49 | cart | 5815662 | 1487580006317032337 | ... | None | 0.92 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
429786 | 2020-02-29 23:58:57 | view | 5815665 | 1487580006317032337 | ... | None | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
429787 | 2020-02-29 23:59:05 | cart | 5815665 | 1487580006317032337 | ... | None | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
429788 | 2020-02-29 23:59:28 | view | 5817692 | 1487580010872045658 | ... | None | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
429789 | 2020-02-29 23:59:54 | view | 5716351 | 1487580010872045658 | ... | irisk | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
2095076 rows × 9 columns
df.isna().sum().sort_values(ascending = False)
categoria_cod 2060411
marca 891646
sesion 506
fecha 0
evento 0
producto 0
categoria 0
precio 0
usuario 0
dtype: int64
Conclusions:
Actions:
df = df.drop(columns = ['categoria_cod','marca']).dropna()
df
fecha | evento | producto | categoria | precio | usuario | sesion | |
---|---|---|---|---|---|---|---|
0 | 2019-10-01 00:01:46 | view | 5843665 | 1487580005092295511 | 9.44 | 462033176 | a18e0999-61a1-4218-8f8f-61ec1d375361 |
1 | 2019-10-01 00:01:55 | cart | 5868461 | 1487580013069861041 | 3.57 | 514753614 | e2fecb2d-22d0-df2c-c661-15da44b3ccf1 |
2 | 2019-10-01 00:02:50 | view | 5877456 | 1487580006300255120 | 122.22 | 527418424 | 86e77869-afbc-4dff-9aa2-6b7dd8c90770 |
3 | 2019-10-01 00:03:41 | view | 5649270 | 1487580013749338323 | 6.19 | 555448072 | b5f72ceb-0730-44de-a932-d16db62390df |
4 | 2019-10-01 00:03:44 | view | 18082 | 1487580005411062629 | 16.03 | 552006247 | 2d8f304b-de45-4e59-8f40-50c603843fe5 |
... | ... | ... | ... | ... | ... | ... | ... |
429785 | 2020-02-29 23:58:49 | cart | 5815662 | 1487580006317032337 | 0.92 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
429786 | 2020-02-29 23:58:57 | view | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
429787 | 2020-02-29 23:59:05 | cart | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
429788 | 2020-02-29 23:59:28 | view | 5817692 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
429789 | 2020-02-29 23:59:54 | view | 5716351 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
2094570 rows × 7 columns
df.describe()
usuario | categoria | producto | precio | ... | segundo | festivo | black_friday | san_valentin | |
---|---|---|---|---|---|---|---|---|---|
count | 2074026.00 | 2074026.00 | 2074026.00 | 2074026.00 | ... | 2074026.00 | 2074026.00 | 2074026.00 | 2074026.00 |
mean | 521758909.23 | 1553192964541272064.00 | 5485203.09 | 8.50 | ... | 29.50 | 0.06 | 0.01 | 0.01 |
std | 87354480.68 | 168128659843838048.00 | 1304219.20 | 19.21 | ... | 17.29 | 0.24 | 0.10 | 0.08 |
min | 4661182.00 | 1487580004807082752.00 | 3752.00 | 0.05 | ... | 0.00 | 0.00 | 0.00 | 0.00 |
25% | 482863572.00 | 1487580005754995456.00 | 5724633.00 | 2.06 | ... | 15.00 | 0.00 | 0.00 | 0.00 |
50% | 553690395.00 | 1487580008246412288.00 | 5811652.00 | 4.06 | ... | 29.00 | 0.00 | 0.00 | 0.00 |
75% | 578763499.00 | 1487580013489291520.00 | 5858221.00 | 6.98 | ... | 44.00 | 0.00 | 0.00 | 0.00 |
max | 622087993.00 | 2242903426784559104.00 | 5932585.00 | 327.78 | ... | 59.00 | 1.00 | 1.00 | 1.00 |
8 rows × 13 columns
We see suspicious values in the price: let's check for records priced at zero or below.
df[df.precio <= 0]
fecha | evento | producto | categoria | precio | usuario | sesion | |
---|---|---|---|---|---|---|---|
343 | 2019-10-01 02:15:41 | view | 5892052 | 1487580010377117763 | 0.00 | 555455025 | 320f6021-30ac-4a58-ae17-bac1cc32aac3 |
924 | 2019-10-01 05:16:30 | view | 5889621 | 1487580010561667147 | 0.00 | 523988665 | 00849bd2-fcd2-4cb4-af31-4e264f151848 |
933 | 2019-10-01 05:18:03 | view | 5889622 | 1487580010561667147 | 0.00 | 523988665 | 80cfe614-f0a5-4101-a2b6-a21227590470 |
937 | 2019-10-01 05:18:46 | view | 5889623 | 1487580010561667147 | 0.00 | 523988665 | c2cd0464-3d2b-48e2-9667-bac248fe297a |
1077 | 2019-10-01 05:38:01 | view | 5889627 | 1487580010561667147 | 0.00 | 523988665 | 8b2bf9d8-43f0-43b2-bed3-13b2c956cada |
... | ... | ... | ... | ... | ... | ... | ... |
428011 | 2020-02-29 20:04:49 | cart | 5824841 | 1897124478404526487 | 0.00 | 469761446 | 8bf369b4-92c0-4fb8-88a5-8a2dd0947e46 |
428012 | 2020-02-29 20:04:49 | cart | 5826413 | 1487580005511725929 | 0.00 | 469761446 | 8bf369b4-92c0-4fb8-88a5-8a2dd0947e46 |
428013 | 2020-02-29 20:04:49 | cart | 5832437 | 1487580007675986893 | 0.00 | 469761446 | 8bf369b4-92c0-4fb8-88a5-8a2dd0947e46 |
428014 | 2020-02-29 20:04:49 | cart | 5851606 | 2055161088059638328 | 0.00 | 469761446 | 8bf369b4-92c0-4fb8-88a5-8a2dd0947e46 |
428370 | 2020-02-29 20:26:16 | view | 5923106 | 1487580008246412266 | 0.00 | 622047714 | 74f04dc6-2b3c-4565-beda-f575d73ed81c |
20544 rows × 7 columns
There are about 20,000 records; we could simply delete them.
But first: are they perhaps concentrated in one specific product?
df[df.precio <= 0].producto.value_counts().head(10)
5896186 79
5903915 50
5873428 37
5851294 29
5851304 29
5837624 28
5712583 27
5851272 27
5907812 26
5899512 26
Name: producto, dtype: int64
It doesn't seem to be an issue with one specific product, so we will remove all of these records.
df = df[df.precio > 0]
df
fecha | evento | producto | categoria | precio | usuario | sesion | |
---|---|---|---|---|---|---|---|
0 | 2019-10-01 00:01:46 | view | 5843665 | 1487580005092295511 | 9.44 | 462033176 | a18e0999-61a1-4218-8f8f-61ec1d375361 |
1 | 2019-10-01 00:01:55 | cart | 5868461 | 1487580013069861041 | 3.57 | 514753614 | e2fecb2d-22d0-df2c-c661-15da44b3ccf1 |
2 | 2019-10-01 00:02:50 | view | 5877456 | 1487580006300255120 | 122.22 | 527418424 | 86e77869-afbc-4dff-9aa2-6b7dd8c90770 |
3 | 2019-10-01 00:03:41 | view | 5649270 | 1487580013749338323 | 6.19 | 555448072 | b5f72ceb-0730-44de-a932-d16db62390df |
4 | 2019-10-01 00:03:44 | view | 18082 | 1487580005411062629 | 16.03 | 552006247 | 2d8f304b-de45-4e59-8f40-50c603843fe5 |
... | ... | ... | ... | ... | ... | ... | ... |
429785 | 2020-02-29 23:58:49 | cart | 5815662 | 1487580006317032337 | 0.92 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
429786 | 2020-02-29 23:58:57 | view | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
429787 | 2020-02-29 23:59:05 | cart | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
429788 | 2020-02-29 23:59:28 | view | 5817692 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
429789 | 2020-02-29 23:59:54 | view | 5716351 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
2074026 rows × 7 columns
df.evento.nunique()
4
df.evento.value_counts()
view 961558
cart 574547
remove_from_cart 410357
purchase 127564
Name: evento, dtype: int64
df.producto.nunique()
45327
df.categoria.nunique()
508
Let's set the date as the index.
df.set_index('fecha', inplace = True)
df
evento | producto | categoria | precio | usuario | sesion | |
---|---|---|---|---|---|---|
fecha | ||||||
2019-10-01 00:01:46 | view | 5843665 | 1487580005092295511 | 9.44 | 462033176 | a18e0999-61a1-4218-8f8f-61ec1d375361 |
2019-10-01 00:01:55 | cart | 5868461 | 1487580013069861041 | 3.57 | 514753614 | e2fecb2d-22d0-df2c-c661-15da44b3ccf1 |
2019-10-01 00:02:50 | view | 5877456 | 1487580006300255120 | 122.22 | 527418424 | 86e77869-afbc-4dff-9aa2-6b7dd8c90770 |
2019-10-01 00:03:41 | view | 5649270 | 1487580013749338323 | 6.19 | 555448072 | b5f72ceb-0730-44de-a932-d16db62390df |
2019-10-01 00:03:44 | view | 18082 | 1487580005411062629 | 16.03 | 552006247 | 2d8f304b-de45-4e59-8f40-50c603843fe5 |
... | ... | ... | ... | ... | ... | ... |
2020-02-29 23:58:49 | cart | 5815662 | 1487580006317032337 | 0.92 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
2020-02-29 23:58:57 | view | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
2020-02-29 23:59:05 | cart | 5815665 | 1487580006317032337 | 0.59 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 |
2020-02-29 23:59:28 | view | 5817692 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
2020-02-29 23:59:54 | view | 5716351 | 1487580010872045658 | 0.79 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d |
2074026 rows × 6 columns
We are going to create 3 types of new variables
def componentes_fecha(dataframe):
    #Extract the calendar components from the datetime index
    date = dataframe.index.date
    año = dataframe.index.year
    mes = dataframe.index.month
    dia = dataframe.index.day
    hora = dataframe.index.hour
    minuto = dataframe.index.minute
    segundo = dataframe.index.second
    return(pd.DataFrame({'date':date, 'año':año,'mes':mes, 'dia':dia, 'hora':hora, 'minuto':minuto, 'segundo':segundo}))
df = pd.concat([df.reset_index(),componentes_fecha(df)], axis = 1).set_index('fecha')
df
evento | producto | categoria | precio | ... | dia | hora | minuto | segundo | |
---|---|---|---|---|---|---|---|---|---|
fecha | |||||||||
2019-10-01 00:01:46 | view | 5843665 | 1487580005092295511 | 9.44 | ... | 1 | 0 | 1 | 46 |
2019-10-01 00:01:55 | cart | 5868461 | 1487580013069861041 | 3.57 | ... | 1 | 0 | 1 | 55 |
2019-10-01 00:02:50 | view | 5877456 | 1487580006300255120 | 122.22 | ... | 1 | 0 | 2 | 50 |
2019-10-01 00:03:41 | view | 5649270 | 1487580013749338323 | 6.19 | ... | 1 | 0 | 3 | 41 |
2019-10-01 00:03:44 | view | 18082 | 1487580005411062629 | 16.03 | ... | 1 | 0 | 3 | 44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-02-29 23:58:49 | cart | 5815662 | 1487580006317032337 | 0.92 | ... | 29 | 23 | 58 | 49 |
2020-02-29 23:58:57 | view | 5815665 | 1487580006317032337 | 0.59 | ... | 29 | 23 | 58 | 57 |
2020-02-29 23:59:05 | cart | 5815665 | 1487580006317032337 | 0.59 | ... | 29 | 23 | 59 | 5 |
2020-02-29 23:59:28 | view | 5817692 | 1487580010872045658 | 0.79 | ... | 29 | 23 | 59 | 28 |
2020-02-29 23:59:54 | view | 5716351 | 1487580010872045658 | 0.79 | ... | 29 | 23 | 59 | 54 |
2074026 rows × 13 columns
To incorporate holidays we can use the holidays package.
It's not perfect, but it gives us a lot of flexibility, since it covers holidays for many countries and even at the regional level.
We can see the list of countries and the basic usage at:
https://github.com/dr-prodigy/python-holidays
import holidays
We define the festivo_ru object, since this ecommerce is Russian.
festivo_ru = holidays.RU(years=2020)
festivo_ru
{datetime.date(2020, 1, 1): 'Новый год', datetime.date(2020, 1, 2): 'Новый год', datetime.date(2020, 1, 3): 'Новый год', datetime.date(2020, 1, 4): 'Новый год', datetime.date(2020, 1, 5): 'Новый год', datetime.date(2020, 1, 6): 'Новый год', datetime.date(2020, 1, 7): 'Православное Рождество', datetime.date(2020, 1, 8): 'Новый год', datetime.date(2020, 2, 23): 'День защитника отечества', datetime.date(2020, 3, 8): 'День женщин', datetime.date(2020, 5, 1): 'Праздник Весны и Труда', datetime.date(2020, 5, 9): 'День Победы', datetime.date(2020, 6, 12): 'День России', datetime.date(2020, 11, 4): 'День народного единства', datetime.date(2020, 12, 31): 'Новый год'}
We are going to add a variable that indicates, for each record, whether or not it was a holiday.
df['festivo'] = df.date.apply(lambda x: 1 if (x in festivo_ru) else 0)
df
evento | producto | categoria | precio | ... | hora | minuto | segundo | festivo | |
---|---|---|---|---|---|---|---|---|---|
fecha | |||||||||
2019-10-01 00:01:46 | view | 5843665 | 1487580005092295511 | 9.44 | ... | 0 | 1 | 46 | 0 |
2019-10-01 00:01:55 | cart | 5868461 | 1487580013069861041 | 3.57 | ... | 0 | 1 | 55 | 0 |
2019-10-01 00:02:50 | view | 5877456 | 1487580006300255120 | 122.22 | ... | 0 | 2 | 50 | 0 |
2019-10-01 00:03:41 | view | 5649270 | 1487580013749338323 | 6.19 | ... | 0 | 3 | 41 | 0 |
2019-10-01 00:03:44 | view | 18082 | 1487580005411062629 | 16.03 | ... | 0 | 3 | 44 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-02-29 23:58:49 | cart | 5815662 | 1487580006317032337 | 0.92 | ... | 23 | 58 | 49 | 0 |
2020-02-29 23:58:57 | view | 5815665 | 1487580006317032337 | 0.59 | ... | 23 | 58 | 57 | 0 |
2020-02-29 23:59:05 | cart | 5815665 | 1487580006317032337 | 0.59 | ... | 23 | 59 | 5 | 0 |
2020-02-29 23:59:28 | view | 5817692 | 1487580010872045658 | 0.79 | ... | 23 | 59 | 28 | 0 |
2020-02-29 23:59:54 | view | 5716351 | 1487580010872045658 | 0.79 | ... | 23 | 59 | 54 | 0 |
2074026 rows × 14 columns
We check the holidays. Note that although we created the object for 2020, the holidays library populates other years on demand when we test membership, which is why 2019 dates also match.
df[df.festivo == 1].date.value_counts().sort_index()
2019-11-04 16430
2019-12-31 2848
2020-01-01 7644
2020-01-02 10776
2020-01-03 10617
2020-01-04 13084
2020-01-05 14554
2020-01-06 10621
2020-01-07 12922
2020-01-08 14004
2020-02-23 9817
Name: date, dtype: int64
We are going to add indicators for Black Friday and Valentine's Day. Since the index is a DatetimeIndex, selecting with a date string such as '2019-11-29' selects every record of that day.
df['black_friday'] = 0
df.loc['2019-11-29','black_friday'] = 1
df['san_valentin'] = 0
df.loc['2020-02-14','san_valentin'] = 1
We check.
df['black_friday'].value_counts()
0 2051695
1 22331
Name: black_friday, dtype: int64
df['san_valentin'].value_counts()
0 2061781
1 12245
Name: san_valentin, dtype: int64
We review what we have.
df.head()
evento | producto | categoria | precio | ... | segundo | festivo | black_friday | san_valentin | |
---|---|---|---|---|---|---|---|---|---|
fecha | |||||||||
2019-10-01 00:01:46 | view | 5843665 | 1487580005092295511 | 9.44 | ... | 46 | 0 | 0 | 0 |
2019-10-01 00:01:55 | cart | 5868461 | 1487580013069861041 | 3.57 | ... | 55 | 0 | 0 | 0 |
2019-10-01 00:02:50 | view | 5877456 | 1487580006300255120 | 122.22 | ... | 50 | 0 | 0 | 0 |
2019-10-01 00:03:41 | view | 5649270 | 1487580013749338323 | 6.19 | ... | 41 | 0 | 0 | 0 |
2019-10-01 00:03:44 | view | 18082 | 1487580005411062629 | 16.03 | ... | 44 | 0 | 0 | 0 |
5 rows × 16 columns
Let's put the columns in a more natural order.
variables = df.columns.to_list()
variables
['evento',
'producto',
'categoria',
'precio',
'usuario',
'sesion',
'date',
'año',
'mes',
'dia',
'hora',
'minuto',
'segundo',
'festivo',
'black_friday',
'san_valentin']
orden = ['usuario',
'sesion',
'categoria',
'evento',
'producto',
'precio']
orden
['usuario', 'sesion', 'categoria', 'evento', 'producto', 'precio']
resto = [nombre for nombre in variables if nombre not in orden]
resto
['date',
'año',
'mes',
'dia',
'hora',
'minuto',
'segundo',
'festivo',
'black_friday',
'san_valentin']
df = df[orden + resto]
df
usuario | sesion | categoria | evento | ... | segundo | festivo | black_friday | san_valentin | |
---|---|---|---|---|---|---|---|---|---|
fecha | |||||||||
2019-10-01 00:01:46 | 462033176 | a18e0999-61a1-4218-8f8f-61ec1d375361 | 1487580005092295511 | view | ... | 46 | 0 | 0 | 0 |
2019-10-01 00:01:55 | 514753614 | e2fecb2d-22d0-df2c-c661-15da44b3ccf1 | 1487580013069861041 | cart | ... | 55 | 0 | 0 | 0 |
2019-10-01 00:02:50 | 527418424 | 86e77869-afbc-4dff-9aa2-6b7dd8c90770 | 1487580006300255120 | view | ... | 50 | 0 | 0 | 0 |
2019-10-01 00:03:41 | 555448072 | b5f72ceb-0730-44de-a932-d16db62390df | 1487580013749338323 | view | ... | 41 | 0 | 0 | 0 |
2019-10-01 00:03:44 | 552006247 | 2d8f304b-de45-4e59-8f40-50c603843fe5 | 1487580005411062629 | view | ... | 44 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-02-29 23:58:49 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 | 1487580006317032337 | cart | ... | 49 | 0 | 0 | 0 |
2020-02-29 23:58:57 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 | 1487580006317032337 | view | ... | 57 | 0 | 0 | 0 |
2020-02-29 23:59:05 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 | 1487580006317032337 | cart | ... | 5 | 0 | 0 | 0 |
2020-02-29 23:59:28 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d | 1487580010872045658 | view | ... | 28 | 0 | 0 | 0 |
2020-02-29 23:59:54 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d | 1487580010872045658 | view | ... | 54 | 0 | 0 | 0 |
2074026 rows × 16 columns
We save as pickle so as not to lose the metadata (the dtypes).
df.to_pickle('tablon_analitico.pickle')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import fiser_tools as fs
fs.misc.dark_theme()
#Quick autocomplete
%config IPCompleter.greedy=True
#Display format
pd.options.display.float_format = '{:15.2f}'.format
pd.options.display.max_columns = 8
df = pd.read_pickle('tablon_analitico.pickle')
df
usuario | sesion | categoria | evento | ... | segundo | festivo | black_friday | san_valentin | |
---|---|---|---|---|---|---|---|---|---|
fecha | |||||||||
2019-10-01 00:01:46 | 462033176 | a18e0999-61a1-4218-8f8f-61ec1d375361 | 1487580005092295511 | view | ... | 46 | 0 | 0 | 0 |
2019-10-01 00:01:55 | 514753614 | e2fecb2d-22d0-df2c-c661-15da44b3ccf1 | 1487580013069861041 | cart | ... | 55 | 0 | 0 | 0 |
2019-10-01 00:02:50 | 527418424 | 86e77869-afbc-4dff-9aa2-6b7dd8c90770 | 1487580006300255120 | view | ... | 50 | 0 | 0 | 0 |
2019-10-01 00:03:41 | 555448072 | b5f72ceb-0730-44de-a932-d16db62390df | 1487580013749338323 | view | ... | 41 | 0 | 0 | 0 |
2019-10-01 00:03:44 | 552006247 | 2d8f304b-de45-4e59-8f40-50c603843fe5 | 1487580005411062629 | view | ... | 44 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-02-29 23:58:49 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 | 1487580006317032337 | cart | ... | 49 | 0 | 0 | 0 |
2020-02-29 23:58:57 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 | 1487580006317032337 | view | ... | 57 | 0 | 0 | 0 |
2020-02-29 23:59:05 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 | 1487580006317032337 | cart | ... | 5 | 0 | 0 | 0 |
2020-02-29 23:59:28 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d | 1487580010872045658 | view | ... | 28 | 0 | 0 | 0 |
2020-02-29 23:59:54 | 619841242 | 18af673b-7fb9-4202-a66d-5c855bc0fd2d | 1487580010872045658 | view | ... | 54 | 0 | 0 | 0 |
2074026 rows × 16 columns
eventos = df.evento.value_counts()
eventos
view 961558
cart 574547
remove_from_cart 410357
purchase 127564
Name: evento, dtype: int64
kpi_visualizaciones_p = 100
kpi_carrito_p = eventos.loc['cart'] / eventos.loc['view'] * 100
kpi_abandono_p = eventos.loc['remove_from_cart'] / eventos.loc['cart'] * 100
kpi_compra_p = eventos.loc['purchase'] / eventos.loc['cart'] * 100
kpis = pd.DataFrame({'kpi':['visitas','carrito','compra'],
'valor':[kpi_visualizaciones_p,kpi_carrito_p,kpi_compra_p]})
kpis
kpi | valor | |
---|---|---|
0 | visitas | 100.00 |
1 | carrito | 59.75 |
2 | compra | 22.20 |
from plotly import graph_objects as go
fig = go.Figure(go.Funnel(
y = kpis.kpi,
x = kpis.valor.round(2),
marker = {'color': ['red','blue','green']},
opacity = 0.3
))
fig.update_layout(
title = 'Funnel Conversión Inicial')
fig.show()
Conclusions:
Unlike the macro analysis of the funnel, this analysis is per session, which makes it more operational.
Knowing the main KPIs per session allows us to establish the baseline against which to measure the results of the CRO actions.
First we create a dataframe with the granularity at the session and event level that we need.
sesion_prod = df.groupby(['sesion','evento']).producto.count()
sesion_prod
sesion evento
0000597b-de39-4a77-9fe5-02c8792ca14e view 3
0000645a-8160-4a3d-91bf-154bff0a22e3 view 2
000090e1-da13-42b1-a31b-91a9ee5e6a88 view 1
0000b3cb-5422-4bf2-b8fe-5c1831d0dc1b view 1
0000de26-bd58-42c9-9173-4763c76b398e view 1
..
ffff6695-b64d-4a67-aa14-34b3b7f63c3f view 2
ffff7d69-b706-4c64-9d6d-da57a04bc32b view 1
ffff8044-2a22-4846-8a72-999e870abbe9 view 1
ffff91d4-7879-4a4b-8b26-c67915a27dc8 view 1
ffffbe0a-d2c2-47c7-afab-680bfdfda50d view 1
Name: producto, Length: 581763, dtype: int64
We pivot the events into columns.
sesion_prod = sesion_prod.unstack().fillna(0)
sesion_prod
evento | cart | purchase | remove_from_cart | view |
---|---|---|---|---|
sesion | ||||
0000597b-de39-4a77-9fe5-02c8792ca14e | 0.00 | 0.00 | 0.00 | 3.00 |
0000645a-8160-4a3d-91bf-154bff0a22e3 | 0.00 | 0.00 | 0.00 | 2.00 |
000090e1-da13-42b1-a31b-91a9ee5e6a88 | 0.00 | 0.00 | 0.00 | 1.00 |
0000b3cb-5422-4bf2-b8fe-5c1831d0dc1b | 0.00 | 0.00 | 0.00 | 1.00 |
0000de26-bd58-42c9-9173-4763c76b398e | 0.00 | 0.00 | 0.00 | 1.00 |
... | ... | ... | ... | ... |
ffff6695-b64d-4a67-aa14-34b3b7f63c3f | 0.00 | 0.00 | 0.00 | 2.00 |
ffff7d69-b706-4c64-9d6d-da57a04bc32b | 0.00 | 0.00 | 0.00 | 1.00 |
ffff8044-2a22-4846-8a72-999e870abbe9 | 0.00 | 0.00 | 0.00 | 1.00 |
ffff91d4-7879-4a4b-8b26-c67915a27dc8 | 0.00 | 0.00 | 0.00 | 1.00 |
ffffbe0a-d2c2-47c7-afab-680bfdfda50d | 0.00 | 0.00 | 0.00 | 1.00 |
446054 rows × 4 columns
As a check, we compute the totals, which should match the global figures.
sesion_prod.sum()
evento
cart 574547.00
purchase 127564.00
remove_from_cart 410357.00
view 961558.00
dtype: float64
We rearrange the columns.
sesion_prod = sesion_prod[['view','cart','remove_from_cart','purchase']]
sesion_prod
evento | view | cart | remove_from_cart | purchase |
---|---|---|---|---|
sesion | ||||
0000597b-de39-4a77-9fe5-02c8792ca14e | 3.00 | 0.00 | 0.00 | 0.00 |
0000645a-8160-4a3d-91bf-154bff0a22e3 | 2.00 | 0.00 | 0.00 | 0.00 |
000090e1-da13-42b1-a31b-91a9ee5e6a88 | 1.00 | 0.00 | 0.00 | 0.00 |
0000b3cb-5422-4bf2-b8fe-5c1831d0dc1b | 1.00 | 0.00 | 0.00 | 0.00 |
0000de26-bd58-42c9-9173-4763c76b398e | 1.00 | 0.00 | 0.00 | 0.00 |
... | ... | ... | ... | ... |
ffff6695-b64d-4a67-aa14-34b3b7f63c3f | 2.00 | 0.00 | 0.00 | 0.00 |
ffff7d69-b706-4c64-9d6d-da57a04bc32b | 1.00 | 0.00 | 0.00 | 0.00 |
ffff8044-2a22-4846-8a72-999e870abbe9 | 1.00 | 0.00 | 0.00 | 0.00 |
ffff91d4-7879-4a4b-8b26-c67915a27dc8 | 1.00 | 0.00 | 0.00 | 0.00 |
ffffbe0a-d2c2-47c7-afab-680bfdfda50d | 1.00 | 0.00 | 0.00 | 0.00 |
446054 rows × 4 columns
We calculate the mean of each event per session.
media_eventos_sesion = sesion_prod.mean()
media_eventos_sesion
evento
view 2.16
cart 1.29
remove_from_cart 0.92
purchase 0.29
dtype: float64
Conclusion:
In each session, on average:
As we said, these are the numbers that we must increase with the CRO actions.
We create the dataframe at event and time granularity.
eventos_hora = df.groupby(['evento','hora']).producto.count()
eventos_hora
evento hora
cart 0 6475
1 5555
2 6433
3 8544
4 11242
...
view 19 63730
20 57311
21 38905
22 23043
23 13307
Name: producto, Length: 96, dtype: int64
We pivot the events into columns.
eventos_hora = eventos_hora.unstack(level = 0)
eventos_hora
evento | cart | purchase | remove_from_cart | view |
---|---|---|---|---|
hora | ||||
0 | 6475 | 962 | 3238 | 8731 |
1 | 5555 | 1128 | 3930 | 7280 |
2 | 6433 | 1220 | 3509 | 8378 |
3 | 8544 | 1535 | 5331 | 11807 |
4 | 11242 | 2389 | 8095 | 18365 |
5 | 16890 | 3491 | 11913 | 27438 |
6 | 21993 | 5125 | 16223 | 38055 |
7 | 27069 | 5951 | 17883 | 46072 |
8 | 29526 | 7158 | 21156 | 49587 |
9 | 32095 | 7593 | 21680 | 54185 |
10 | 32901 | 7816 | 23982 | 56458 |
11 | 33284 | 8495 | 25496 | 57594 |
12 | 34258 | 8250 | 23714 | 57530 |
13 | 31996 | 8133 | 22852 | 55534 |
14 | 30451 | 7122 | 21835 | 52184 |
15 | 28789 | 6485 | 20162 | 49809 |
16 | 28775 | 6531 | 19791 | 51055 |
17 | 32525 | 6242 | 24330 | 55667 |
18 | 36435 | 8211 | 30551 | 59533 |
19 | 39609 | 7435 | 27666 | 63730 |
20 | 34828 | 7256 | 24985 | 57311 |
21 | 23228 | 4606 | 17396 | 38905 |
22 | 13589 | 2883 | 8680 | 23043 |
23 | 8057 | 1547 | 5959 | 13307 |
Let's visualize how the events are distributed per hour.
eventos_hora.plot()
plt.xticks(ticks = eventos_hora.index);
There is a global pattern as expected.
But to see the differences better, we can create a new variable: the ratio of purchases per visit in each hour.
eventos_hora['compras_visitas'] = eventos_hora.purchase / eventos_hora.view * 100
eventos_hora
evento | cart | purchase | remove_from_cart | view | compras_visitas |
---|---|---|---|---|---|
hora | |||||
0 | 6475 | 962 | 3238 | 8731 | 11.02 |
1 | 5555 | 1128 | 3930 | 7280 | 15.49 |
2 | 6433 | 1220 | 3509 | 8378 | 14.56 |
3 | 8544 | 1535 | 5331 | 11807 | 13.00 |
4 | 11242 | 2389 | 8095 | 18365 | 13.01 |
5 | 16890 | 3491 | 11913 | 27438 | 12.72 |
6 | 21993 | 5125 | 16223 | 38055 | 13.47 |
7 | 27069 | 5951 | 17883 | 46072 | 12.92 |
8 | 29526 | 7158 | 21156 | 49587 | 14.44 |
9 | 32095 | 7593 | 21680 | 54185 | 14.01 |
10 | 32901 | 7816 | 23982 | 56458 | 13.84 |
11 | 33284 | 8495 | 25496 | 57594 | 14.75 |
12 | 34258 | 8250 | 23714 | 57530 | 14.34 |
13 | 31996 | 8133 | 22852 | 55534 | 14.65 |
14 | 30451 | 7122 | 21835 | 52184 | 13.65 |
15 | 28789 | 6485 | 20162 | 49809 | 13.02 |
16 | 28775 | 6531 | 19791 | 51055 | 12.79 |
17 | 32525 | 6242 | 24330 | 55667 | 11.21 |
18 | 36435 | 8211 | 30551 | 59533 | 13.79 |
19 | 39609 | 7435 | 27666 | 63730 | 11.67 |
20 | 34828 | 7256 | 24985 | 57311 | 12.66 |
21 | 23228 | 4606 | 17396 | 38905 | 11.84 |
22 | 13589 | 2883 | 8680 | 23043 | 12.51 |
23 | 8057 | 1547 | 5959 | 13307 | 11.63 |
We rearrange the variables.
eventos_hora = eventos_hora[['view','cart','remove_from_cart','purchase','compras_visitas']]
eventos_hora
evento | view | cart | remove_from_cart | purchase | compras_visitas |
---|---|---|---|---|---|
hora | |||||
0 | 8731 | 6475 | 3238 | 962 | 11.02 |
1 | 7280 | 5555 | 3930 | 1128 | 15.49 |
2 | 8378 | 6433 | 3509 | 1220 | 14.56 |
3 | 11807 | 8544 | 5331 | 1535 | 13.00 |
4 | 18365 | 11242 | 8095 | 2389 | 13.01 |
5 | 27438 | 16890 | 11913 | 3491 | 12.72 |
6 | 38055 | 21993 | 16223 | 5125 | 13.47 |
7 | 46072 | 27069 | 17883 | 5951 | 12.92 |
8 | 49587 | 29526 | 21156 | 7158 | 14.44 |
9 | 54185 | 32095 | 21680 | 7593 | 14.01 |
10 | 56458 | 32901 | 23982 | 7816 | 13.84 |
11 | 57594 | 33284 | 25496 | 8495 | 14.75 |
12 | 57530 | 34258 | 23714 | 8250 | 14.34 |
13 | 55534 | 31996 | 22852 | 8133 | 14.65 |
14 | 52184 | 30451 | 21835 | 7122 | 13.65 |
15 | 49809 | 28789 | 20162 | 6485 | 13.02 |
16 | 51055 | 28775 | 19791 | 6531 | 12.79 |
17 | 55667 | 32525 | 24330 | 6242 | 11.21 |
18 | 59533 | 36435 | 30551 | 8211 | 13.79 |
19 | 63730 | 39609 | 27666 | 7435 | 11.67 |
20 | 57311 | 34828 | 24985 | 7256 | 12.66 |
21 | 38905 | 23228 | 17396 | 4606 | 11.84 |
22 | 23043 | 13589 | 8680 | 2883 | 12.51 |
23 | 13307 | 8057 | 5959 | 1547 | 11.63 |
We visualize to see if there are hours in which proportionally more is purchased.
plt.figure(figsize = (12,6))
sns.lineplot(data = eventos_hora, x = eventos_hora.index, y = 'compras_visitas')
plt.xticks(eventos_hora.index);
Conclusions:
Now let's analyze, in absolute rather than relative terms, whether certain hours concentrate more of each type of event.
plt.figure(figsize = (12,12))
sns.heatmap(data = eventos_hora);
The problem is that since each event has a different scale, this graph does not allow us to differentiate the patterns well.
To solve this we can standardize the variables (z-scores), as we learned in the statistics module.
def tipificar(variable):
    #Standardize: subtract the mean and divide by the standard deviation
    media = variable.mean()
    dt = variable.std()
    return(variable.apply(lambda x: (x - media) / dt))
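For reference, a vectorized equivalent gives the same result without the element-wise apply:
#eventos_hora_tip = (eventos_hora - eventos_hora.mean()) / eventos_hora.std()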
eventos_hora_tip = eventos_hora.apply(tipificar)
eventos_hora_tip
evento | view | cart | remove_from_cart | purchase | compras_visitas |
---|---|---|---|---|---|
hora | |||||
0 | -1.60 | -1.56 | -1.63 | -1.62 | -1.83 |
1 | -1.68 | -1.64 | -1.54 | -1.56 | 1.91 |
2 | -1.62 | -1.56 | -1.59 | -1.53 | 1.13 |
3 | -1.45 | -1.37 | -1.38 | -1.41 | -0.17 |
4 | -1.11 | -1.13 | -1.06 | -1.09 | -0.17 |
5 | -0.65 | -0.63 | -0.61 | -0.68 | -0.41 |
6 | -0.10 | -0.17 | -0.10 | -0.07 | 0.22 |
7 | 0.31 | 0.28 | 0.09 | 0.24 | -0.24 |
8 | 0.49 | 0.50 | 0.48 | 0.69 | 1.03 |
9 | 0.72 | 0.73 | 0.54 | 0.85 | 0.67 |
10 | 0.84 | 0.80 | 0.81 | 0.93 | 0.53 |
11 | 0.90 | 0.83 | 0.99 | 1.19 | 1.29 |
12 | 0.89 | 0.92 | 0.78 | 1.09 | 0.95 |
13 | 0.79 | 0.72 | 0.67 | 1.05 | 1.20 |
14 | 0.62 | 0.58 | 0.56 | 0.67 | 0.37 |
15 | 0.50 | 0.43 | 0.36 | 0.44 | -0.16 |
16 | 0.56 | 0.43 | 0.32 | 0.45 | -0.35 |
17 | 0.80 | 0.77 | 0.85 | 0.35 | -1.67 |
18 | 1.00 | 1.11 | 1.58 | 1.08 | 0.49 |
19 | 1.21 | 1.40 | 1.24 | 0.79 | -1.29 |
20 | 0.88 | 0.97 | 0.93 | 0.72 | -0.46 |
21 | -0.06 | -0.06 | 0.03 | -0.26 | -1.14 |
22 | -0.87 | -0.92 | -0.99 | -0.91 | -0.58 |
23 | -1.37 | -1.42 | -1.31 | -1.40 | -1.32 |
plt.figure(figsize = (12,12))
sns.heatmap(data = eventos_hora_tip);
Let's also plot the line graphs to see it more clearly.
eventos_hora_tip.plot(subplots = True, sharex = False, figsize = (12,12),xticks = eventos_hora_tip.index);
Conclusions:
df.loc[df.evento == 'purchase'].groupby('mes').precio.sum().mean()
124309.92
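The cell above sums the revenue of the purchase events in each month and averages it: around €124,000 of revenue per month. This is the baseline that the CRO actions need to lift.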
tendencia = df.groupby('evento').resample('W').evento.count().unstack(level = 0)
tendencia
evento | cart | purchase | remove_from_cart | view |
---|---|---|---|---|
fecha | ||||
2019-10-06 | 31483 | 4440 | 14647 | 36353 |
2019-10-13 | 28151 | 5422 | 17989 | 44410 |
2019-10-20 | 23920 | 5033 | 15303 | 39486 |
2019-10-27 | 25651 | 5665 | 18411 | 40383 |
2019-11-03 | 24087 | 5746 | 16491 | 39365 |
2019-11-10 | 29142 | 6663 | 24008 | 46177 |
2019-11-17 | 25335 | 5141 | 17215 | 41170 |
2019-11-24 | 38069 | 9754 | 27973 | 56477 |
2019-12-01 | 31994 | 7493 | 23106 | 48883 |
2019-12-08 | 23265 | 5105 | 19443 | 42055 |
2019-12-15 | 24636 | 5953 | 18246 | 45874 |
2019-12-22 | 19927 | 4701 | 15452 | 39237 |
2019-12-29 | 17051 | 3705 | 11102 | 32803 |
2020-01-05 | 16735 | 3294 | 13464 | 31909 |
2020-01-12 | 26264 | 5589 | 17956 | 46873 |
2020-01-19 | 28402 | 6913 | 22945 | 50210 |
2020-01-26 | 26353 | 6359 | 18544 | 48478 |
2020-02-02 | 29193 | 7120 | 21102 | 52432 |
2020-02-09 | 28796 | 5853 | 20050 | 48422 |
2020-02-16 | 27836 | 6332 | 22601 | 47213 |
2020-02-23 | 25619 | 6000 | 18146 | 43627 |
2020-03-01 | 22638 | 5283 | 16163 | 39721 |
tendencia = tendencia[['view','cart','remove_from_cart','purchase']]
tendencia
evento | view | cart | remove_from_cart | purchase |
---|---|---|---|---|
fecha | ||||
2019-10-06 | 36353 | 31483 | 14647 | 4440 |
2019-10-13 | 44410 | 28151 | 17989 | 5422 |
2019-10-20 | 39486 | 23920 | 15303 | 5033 |
2019-10-27 | 40383 | 25651 | 18411 | 5665 |
2019-11-03 | 39365 | 24087 | 16491 | 5746 |
2019-11-10 | 46177 | 29142 | 24008 | 6663 |
2019-11-17 | 41170 | 25335 | 17215 | 5141 |
2019-11-24 | 56477 | 38069 | 27973 | 9754 |
2019-12-01 | 48883 | 31994 | 23106 | 7493 |
2019-12-08 | 42055 | 23265 | 19443 | 5105 |
2019-12-15 | 45874 | 24636 | 18246 | 5953 |
2019-12-22 | 39237 | 19927 | 15452 | 4701 |
2019-12-29 | 32803 | 17051 | 11102 | 3705 |
2020-01-05 | 31909 | 16735 | 13464 | 3294 |
2020-01-12 | 46873 | 26264 | 17956 | 5589 |
2020-01-19 | 50210 | 28402 | 22945 | 6913 |
2020-01-26 | 48478 | 26353 | 18544 | 6359 |
2020-02-02 | 52432 | 29193 | 21102 | 7120 |
2020-02-09 | 48422 | 28796 | 20050 | 5853 |
2020-02-16 | 47213 | 27836 | 22601 | 6332 |
2020-02-23 | 43627 | 25619 | 18146 | 6000 |
2020-03-01 | 39721 | 22638 | 16163 | 5283 |
tendencia.plot(subplots = True, figsize = (12,6), sharex = True, xticks = tendencia.index, x_compat=True, rot = 90);
The trend is flat across all metrics, confirming the need for CRO actions.
There is a significant peak in the week of the 24th, obviously due to Black Friday. Let's repeat the analysis at daily granularity, only for November and December, to see the effect.
tendencia_diaria = df.loc['2019-11':'2019-12'].groupby('evento').resample('D').evento.count().unstack(level = 0)
tendencia_diaria
evento | cart | purchase | remove_from_cart | view |
---|---|---|---|---|
fecha | ||||
2019-11-01 | 3565 | 709 | 2810 | 5352 |
2019-11-02 | 3015 | 912 | 2124 | 4857 |
2019-11-03 | 3540 | 755 | 2622 | 5583 |
2019-11-04 | 4652 | 676 | 4854 | 6248 |
2019-11-05 | 4118 | 753 | 2711 | 7213 |
... | ... | ... | ... | ... |
2019-12-27 | 2023 | 507 | 1335 | 4058 |
2019-12-28 | 1744 | 329 | 1193 | 3704 |
2019-12-29 | 2134 | 263 | 1149 | 3939 |
2019-12-30 | 1364 | 258 | 823 | 3434 |
2019-12-31 | 563 | 114 | 447 | 1724 |
61 rows × 4 columns
tendencia_diaria = tendencia_diaria[['view','cart','remove_from_cart','purchase']]
tendencia_diaria
evento | view | cart | remove_from_cart | purchase |
---|---|---|---|---|
fecha | ||||
2019-11-01 | 5352 | 3565 | 2810 | 709 |
2019-11-02 | 4857 | 3015 | 2124 | 912 |
2019-11-03 | 5583 | 3540 | 2622 | 755 |
2019-11-04 | 6248 | 4652 | 4854 | 676 |
2019-11-05 | 7213 | 4118 | 2711 | 753 |
... | ... | ... | ... | ... |
2019-12-27 | 4058 | 2023 | 1335 | 507 |
2019-12-28 | 3704 | 1744 | 1193 | 329 |
2019-12-29 | 3939 | 2134 | 1149 | 263 |
2019-12-30 | 3434 | 1364 | 823 | 258 |
2019-12-31 | 1724 | 563 | 447 | 114 |
61 rows × 4 columns
tendencia_diaria.plot(subplots = True, figsize = (16,10), sharex = True, xticks = tendencia_diaria.index, x_compat=True, rot = 90);
Conclusions:
We are going to do the same analysis for January and February.
tendencia_diaria = df.loc['2020-01':'2020-02'].groupby('evento').resample('D').evento.count().unstack(level = 0)
tendencia_diaria = tendencia_diaria[['view','cart','remove_from_cart','purchase']]
tendencia_diaria.plot(subplots = True, figsize = (16,10), sharex = True, xticks = tendencia_diaria.index, x_compat=True, rot = 90);
Conclusions:
INSIGHT #2: The big takeaway is that practically all of the Christmas shopping is concentrated in Black Friday week.
Could we identify the day-hour slots in which the greatest number of purchases take place?
It would be very useful to concentrate a large part of the campaign investment in exactly those moments.
compras_dia_hora = df.loc[df.evento == 'purchase'].groupby(['date','hora']).evento.count().unstack(level = 0).fillna(0)
compras_dia_hora
date | 2019-10-01 | 2019-10-02 | 2019-10-03 | 2019-10-04 | ... | 2020-02-26 | 2020-02-27 | 2020-02-28 | 2020-02-29 |
---|---|---|---|---|---|---|---|---|---|
hora | |||||||||
0 | 13.00 | 18.00 | 1.00 | 2.00 | ... | 5.00 | 40.00 | 0.00 | 0.00 |
1 | 0.00 | 0.00 | 5.00 | 0.00 | ... | 0.00 | 5.00 | 26.00 | 33.00 |
2 | 0.00 | 0.00 | 0.00 | 24.00 | ... | 0.00 | 0.00 | 0.00 | 8.00 |
3 | 0.00 | 24.00 | 10.00 | 0.00 | ... | 0.00 | 10.00 | 0.00 | 0.00 |
4 | 15.00 | 0.00 | 45.00 | 27.00 | ... | 10.00 | 148.00 | 16.00 | 0.00 |
5 | 49.00 | 9.00 | 6.00 | 17.00 | ... | 6.00 | 48.00 | 11.00 | 3.00 |
6 | 23.00 | 34.00 | 18.00 | 10.00 | ... | 94.00 | 26.00 | 58.00 | 35.00 |
7 | 26.00 | 60.00 | 26.00 | 54.00 | ... | 30.00 | 53.00 | 38.00 | 65.00 |
8 | 28.00 | 71.00 | 129.00 | 49.00 | ... | 120.00 | 80.00 | 67.00 | 25.00 |
9 | 24.00 | 34.00 | 90.00 | 61.00 | ... | 38.00 | 92.00 | 20.00 | 22.00 |
10 | 15.00 | 62.00 | 43.00 | 22.00 | ... | 66.00 | 29.00 | 64.00 | 19.00 |
11 | 95.00 | 80.00 | 83.00 | 36.00 | ... | 80.00 | 69.00 | 86.00 | 63.00 |
12 | 9.00 | 43.00 | 100.00 | 67.00 | ... | 52.00 | 57.00 | 40.00 | 58.00 |
13 | 16.00 | 76.00 | 69.00 | 18.00 | ... | 50.00 | 59.00 | 6.00 | 23.00 |
14 | 74.00 | 31.00 | 38.00 | 36.00 | ... | 70.00 | 51.00 | 26.00 | 44.00 |
15 | 25.00 | 10.00 | 45.00 | 28.00 | ... | 51.00 | 28.00 | 44.00 | 46.00 |
16 | 99.00 | 21.00 | 33.00 | 42.00 | ... | 55.00 | 10.00 | 14.00 | 59.00 |
17 | 88.00 | 80.00 | 55.00 | 31.00 | ... | 6.00 | 32.00 | 27.00 | 34.00 |
18 | 53.00 | 24.00 | 35.00 | 54.00 | ... | 98.00 | 220.00 | 46.00 | 55.00 |
19 | 29.00 | 25.00 | 19.00 | 14.00 | ... | 28.00 | 85.00 | 56.00 | 21.00 |
20 | 53.00 | 22.00 | 63.00 | 17.00 | ... | 85.00 | 7.00 | 12.00 | 15.00 |
21 | 1.00 | 55.00 | 25.00 | 42.00 | ... | 44.00 | 22.00 | 16.00 | 17.00 |
22 | 33.00 | 10.00 | 0.00 | 42.00 | ... | 5.00 | 49.00 | 2.00 | 21.00 |
23 | 0.00 | 0.00 | 7.00 | 0.00 | ... | 0.00 | 6.00 | 5.00 | 1.00 |
24 rows × 152 columns
plt.figure(figsize = (20,14))
sns.heatmap(compras_dia_hora);
To analyze at the customer level, it is best to create a dataframe of only buyers with customer granularity and the variables that interest us.
We must be careful with the aggregation function that we use in each one.
clientes = df.loc[df.evento == 'purchase'].groupby(['usuario']).agg({'producto':'count',
'sesion':'nunique',
'precio': 'mean',
'date': 'max'})
clientes
producto | sesion | precio | date | |
---|---|---|---|---|
usuario | ||||
25392526 | 3 | 1 | 7.38 | 2019-12-18 |
27756757 | 1 | 1 | 20.63 | 2020-01-27 |
50748978 | 9 | 1 | 1.11 | 2019-12-14 |
52747911 | 3 | 1 | 7.67 | 2019-10-10 |
65241811 | 5 | 1 | 8.36 | 2019-11-11 |
... | ... | ... | ... | ... |
621995551 | 5 | 1 | 2.09 | 2020-02-29 |
622021687 | 1 | 1 | 13.33 | 2020-02-29 |
622041514 | 3 | 1 | 0.63 | 2020-02-29 |
622042698 | 3 | 1 | 28.04 | 2020-02-29 |
622065819 | 4 | 1 | 5.12 | 2020-02-29 |
11040 rows × 4 columns
We rename the columns.
clientes.columns = ['productos_tot_num','compras_tot_num','precio_medio_prod','ult_compra']
clientes
productos_tot_num | compras_tot_num | precio_medio_prod | ult_compra | |
---|---|---|---|---|
usuario | ||||
25392526 | 3 | 1 | 7.38 | 2019-12-18 |
27756757 | 1 | 1 | 20.63 | 2020-01-27 |
50748978 | 9 | 1 | 1.11 | 2019-12-14 |
52747911 | 3 | 1 | 7.67 | 2019-10-10 |
65241811 | 5 | 1 | 8.36 | 2019-11-11 |
... | ... | ... | ... | ... |
621995551 | 5 | 1 | 2.09 | 2020-02-29 |
622021687 | 1 | 1 | 13.33 | 2020-02-29 |
622041514 | 3 | 1 | 0.63 | 2020-02-29 |
622042698 | 3 | 1 | 28.04 | 2020-02-29 |
622065819 | 4 | 1 | 5.12 | 2020-02-29 |
11040 rows × 4 columns
We are going to calculate additional variables.
clientes['gasto_tot'] = clientes.productos_tot_num * clientes.precio_medio_prod
clientes['productos_por_compra'] = clientes.productos_tot_num / clientes.compras_tot_num
clientes
productos_tot_num | compras_tot_num | precio_medio_prod | ult_compra | gasto_tot | productos_por_compra | |
---|---|---|---|---|---|---|
usuario | ||||||
25392526 | 3 | 1 | 7.38 | 2019-12-18 | 22.14 | 3.00 |
27756757 | 1 | 1 | 20.63 | 2020-01-27 | 20.63 | 1.00 |
50748978 | 9 | 1 | 1.11 | 2019-12-14 | 10.01 | 9.00 |
52747911 | 3 | 1 | 7.67 | 2019-10-10 | 23.02 | 3.00 |
65241811 | 5 | 1 | 8.36 | 2019-11-11 | 41.79 | 5.00 |
... | ... | ... | ... | ... | ... | ... |
621995551 | 5 | 1 | 2.09 | 2020-02-29 | 10.46 | 5.00 |
622021687 | 1 | 1 | 13.33 | 2020-02-29 | 13.33 | 1.00 |
622041514 | 3 | 1 | 0.63 | 2020-02-29 | 1.90 | 3.00 |
622042698 | 3 | 1 | 28.04 | 2020-02-29 | 84.13 | 3.00 |
622065819 | 4 | 1 | 5.12 | 2020-02-29 | 20.48 | 4.00 |
11040 rows × 6 columns
sns.histplot(data = clientes, x = 'gasto_tot', bins = 50)
plt.xlim([0,300]);
The vast majority of clients have spent less than €50 in the period.
sns.countplot(data = clientes, x = 'compras_tot_num');
INSIGHT #3: The vast majority of customers make only one purchase.
There is a long way to go to improve this ratio through:
clientes.productos_por_compra.describe()
count 11040.00
mean 7.79
std 9.49
min 1.00
25% 3.00
50% 5.00
75% 10.00
max 219.00
Name: productos_por_compra, dtype: float64
INSIGHT #4: The median purchase includes 5 products.
But 25% of customers buy 10 or more products in the same purchase.
There is a long way to go to improve this ratio through:
clientes.nlargest(n = 10, columns = 'gasto_tot')
productos_tot_num | compras_tot_num | precio_medio_prod | ult_compra | gasto_tot | productos_por_compra | |
---|---|---|---|---|---|---|
usuario | ||||||
573823111 | 268 | 2 | 5.82 | 2020-02-21 | 1559.21 | 134.00 |
539751397 | 236 | 13 | 6.16 | 2020-02-19 | 1453.37 | 18.15 |
556579890 | 506 | 4 | 2.75 | 2020-02-27 | 1392.45 | 126.50 |
442763940 | 195 | 8 | 6.37 | 2019-12-23 | 1241.53 | 24.38 |
561592095 | 94 | 3 | 11.81 | 2019-10-31 | 1109.70 | 31.33 |
527739278 | 244 | 13 | 4.39 | 2020-02-16 | 1071.00 | 18.77 |
527806771 | 195 | 13 | 4.86 | 2020-02-20 | 948.01 | 15.00 |
430220205 | 190 | 6 | 4.99 | 2020-02-29 | 947.30 | 31.67 |
491009486 | 219 | 1 | 4.32 | 2020-02-12 | 946.20 | 219.00 |
520501669 | 64 | 11 | 14.27 | 2020-01-17 | 913.01 | 5.82 |
To quantify this, we calculate the average total spend per customer.
clientes.gasto_tot.describe()
count 11040.00
mean 56.30
std 81.73
min 0.13
25% 16.22
50% 32.74
75% 60.30
max 1559.21
Name: gasto_tot, dtype: float64
INSIGHT #5: There are customers whose total spend is dozens of times higher than the average.
These customers must be retained through loyalty programs.
Since we only have 5 months of history, we are going to build the cohort analysis 3 months forward, which gives us 3 cohorts.
We prepare a dataframe with only buyers and the user and month variables.
c = df.loc[df.evento == 'purchase', ['usuario','mes']]
c
usuario | mes | |
---|---|---|
fecha | ||
2019-10-01 00:26:49 | 536128518 | 10 |
2019-10-01 00:26:49 | 536128518 | 10 |
2019-10-01 00:26:49 | 536128518 | 10 |
2019-10-01 00:26:49 | 536128518 | 10 |
2019-10-01 00:26:49 | 536128518 | 10 |
... | ... | ... |
2020-02-29 22:29:19 | 622065819 | 2 |
2020-02-29 22:29:19 | 622065819 | 2 |
2020-02-29 22:29:19 | 622065819 | 2 |
2020-02-29 22:29:19 | 622065819 | 2 |
2020-02-29 23:26:42 | 610361057 | 2 |
127564 rows × 2 columns
We pivot the months into columns.
c = pd.crosstab(c.usuario,c.mes).reset_index()
c
mes | usuario | 1 | 2 | 10 | 11 | 12 |
---|---|---|---|---|---|---|
0 | 25392526 | 0 | 0 | 0 | 0 | 3 |
1 | 27756757 | 1 | 0 | 0 | 0 | 0 |
2 | 50748978 | 0 | 0 | 0 | 0 | 9 |
3 | 52747911 | 0 | 0 | 3 | 0 | 0 |
4 | 65241811 | 0 | 0 | 0 | 5 | 0 |
... | ... | ... | ... | ... | ... | ... |
11035 | 621995551 | 0 | 5 | 0 | 0 | 0 |
11036 | 622021687 | 0 | 1 | 0 | 0 | 0 |
11037 | 622041514 | 0 | 3 | 0 | 0 | 0 |
11038 | 622042698 | 0 | 3 | 0 | 0 | 0 |
11039 | 622065819 | 0 | 4 | 0 | 0 | 0 |
11040 rows × 6 columns
We rename the columns and drop the user column, which we no longer need.
c.columns = ['usuario','c4','c5','c1','c2','c3']
c.drop(columns = 'usuario', inplace = True)
c
c4 | c5 | c1 | c2 | c3 | |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 3 |
1 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 9 |
3 | 0 | 0 | 3 | 0 | 0 |
4 | 0 | 0 | 0 | 5 | 0 |
... | ... | ... | ... | ... | ... |
11035 | 0 | 5 | 0 | 0 | 0 |
11036 | 0 | 1 | 0 | 0 | 0 |
11037 | 0 | 3 | 0 | 0 | 0 |
11038 | 0 | 3 | 0 | 0 | 0 |
11039 | 0 | 4 | 0 | 0 | 0 |
11040 rows × 5 columns
The first cohort we can analyze is the one from month 2, since we want to select "new" clients, i.e. clients who did not buy the previous month.
c2 = c.loc[(c.c1 == 0) & (c.c2 > 0)]
c2
c4 | c5 | c1 | c2 | c3 | |
---|---|---|---|---|---|
4 | 0 | 0 | 0 | 5 | 0 |
6 | 0 | 0 | 0 | 10 | 0 |
8 | 0 | 0 | 0 | 27 | 17 |
9 | 0 | 0 | 0 | 3 | 0 |
13 | 0 | 0 | 0 | 4 | 0 |
... | ... | ... | ... | ... | ... |
7702 | 0 | 0 | 0 | 5 | 0 |
7703 | 0 | 5 | 0 | 2 | 0 |
7705 | 0 | 0 | 0 | 1 | 0 |
7708 | 0 | 0 | 0 | 5 | 6 |
7709 | 0 | 0 | 0 | 1 | 0 |
2640 rows × 5 columns
We convert it to a binary dataframe, since we only care whether or not each customer bought in each month.
def binarizar(variable):
    #1 if the customer bought that month (count > 0), 0 otherwise
    variable = variable.transform(lambda x: 1 if (x > 0) else 0)
    return(variable)
c2_b = c2.apply(binarizar)
c2_b
c4 | c5 | c1 | c2 | c3 | |
---|---|---|---|---|---|
4 | 0 | 0 | 0 | 1 | 0 |
6 | 0 | 0 | 0 | 1 | 0 |
8 | 0 | 0 | 0 | 1 | 1 |
9 | 0 | 0 | 0 | 1 | 0 |
13 | 0 | 0 | 0 | 1 | 0 |
... | ... | ... | ... | ... | ... |
7702 | 0 | 0 | 0 | 1 | 0 |
7703 | 0 | 1 | 0 | 1 | 0 |
7705 | 0 | 0 | 0 | 1 | 0 |
7708 | 0 | 0 | 0 | 1 | 1 |
7709 | 0 | 0 | 0 | 1 | 0 |
2640 rows × 5 columns
We calculate the percentage of customers in this cohort who have continued to purchase in subsequent months.
c2_f = c2_b.sum() / c2_b.shape[0]
c2_f = c2_f.sort_index()
c2_f
c1 0.00
c2 1.00
c3 0.10
c4 0.10
c5 0.08
dtype: float64
We replicate the entire process for cohort 3.
c3 = c.loc[(c.c2 == 0) & (c.c3 > 0)]
c3_b = c3.apply(binarizar)
c3_f = c3_b.sum() / c3_b.shape[0]
c3_f = c3_f.sort_index()
c3_f['c1'] = 0
c3_f
c1 0.00
c2 0.00
c3 1.00
c4 0.10
c5 0.08
dtype: float64
We replicate the entire process for cohort 4.
c4 = c.loc[(c.c3 == 0) & (c.c4 > 0)]
c4_b = c4.apply(binarizar)
c4_f = c4_b.sum() / c4_b.shape[0]
c4_f = c4_f.sort_index()
c4_f['c1'] = 0
c4_f['c2'] = 0
c4_f
c1 0.00
c2 0.00
c3 0.00
c4 1.00
c5 0.12
dtype: float64
We create the cohort dataframe.
cohortes = pd.DataFrame({'c2':c2_f,'c3':c3_f,'c4':c4_f})
cohortes
c2 | c3 | c4 | |
---|---|---|---|
c1 | 0.00 | 0.00 | 0.00 |
c2 | 1.00 | 0.00 | 0.00 |
c3 | 0.10 | 1.00 | 0.00 |
c4 | 0.10 | 0.10 | 1.00 |
c5 | 0.08 | 0.08 | 0.12 |
cohortes = cohortes.drop(index = 'c1').T
cohortes
c2 | c3 | c4 | c5 | |
---|---|---|---|---|
c2 | 1.00 | 0.10 | 0.10 | 0.08 |
c3 | 0.00 | 1.00 | 0.10 | 0.08 |
c4 | 0.00 | 0.00 | 1.00 | 0.12 |
plt.figure(figsize = (12,6))
sns.heatmap(cohortes,annot = True, fmt = '.0%', cmap='Greys');
INSIGHT #6: 90% of new customers do not buy again in the following months.
Given that 90% of new customers do not buy again, we can estimate the LTV from the history we already have without much fear of being wrong.
To do this, we take the customers who bought in month 2 of the history (November) and compute the total of their purchases.
maestro_ltv = df.loc[(df.evento == 'purchase') & (df.mes == 11),'usuario'].to_list()
maestro_ltv
[549319657,
549319657,
549319657,
549319657,
...
...
510126861,
510126861,
510126861,
440684807,
...]
clientes_ltv = clientes.loc[clientes.index.isin(maestro_ltv)]
clientes_ltv
productos_tot_num | compras_tot_num | precio_medio_prod | ult_compra | gasto_tot | productos_por_compra | |
---|---|---|---|---|---|---|
usuario | ||||||
65241811 | 5 | 1 | 8.36 | 2019-11-11 | 41.79 | 5.00 |
80577370 | 10 | 2 | 10.62 | 2019-11-29 | 106.24 | 5.00 |
88211255 | 22 | 4 | 4.86 | 2020-02-25 | 106.87 | 5.50 |
93279832 | 44 | 2 | 3.19 | 2019-12-19 | 140.51 | 22.00 |
94390236 | 3 | 1 | 9.73 | 2019-11-07 | 29.20 | 3.00 |
... | ... | ... | ... | ... | ... | ... |
579798049 | 5 | 1 | 2.10 | 2019-11-30 | 10.52 | 5.00 |
579813390 | 7 | 2 | 2.98 | 2020-02-04 | 20.83 | 3.50 |
579834429 | 1 | 1 | 27.14 | 2019-11-30 | 27.14 | 1.00 |
579900887 | 11 | 2 | 5.67 | 2019-12-02 | 62.34 | 5.50 |
579903865 | 1 | 1 | 8.43 | 2019-11-30 | 8.43 | 1.00 |
3105 rows × 6 columns
clientes_ltv.gasto_tot.describe()
count 3105.00
mean 79.62
std 113.62
min 0.13
25% 20.29
50% 41.49
75% 90.00
max 1453.37
Name: gasto_tot, dtype: float64
Given the variability of the mean, it would be safer to take the median.
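We compute it directly:
clientes_ltv.gasto_tot.median()
41.49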
INSIGHT #7: The median LTV is around €42.
Applying our margin to that figure, and then the % of it we want to devote to acquisition, gives us the maximum amount we can invest per acquisition (CPA).
Applying CRO actions will let us increase the LTV, and therefore the maximum CPA as well, which is a very important strategic advantage.
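A purely illustrative calculation (a sketch: the margin and the acquisition percentage below are hypothetical, not the client's figures):
ltv_mediano = 42        #median LTV (INSIGHT #7)
margen = 0.40           #hypothetical gross margin
pct_adquisicion = 0.30  #hypothetical share of margin devoted to acquisition
cpa_maximo = ltv_mediano * margen * pct_adquisicion
cpa_maximo
5.04
Under these assumptions we could pay at most about €5 per new customer.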
We are going to learn a technique called RFM (Recency - Frequency - Monetary).
This technique is very powerful for retail contexts and therefore also in ecommerce.
It allows responding to needs such as:
Despite its power it is very simple to build, which makes it almost a must in this type of analysis.
The first thing is to identify the variables with which to create each of the dimensions:
And discretize each of them.
We are going to leave the recency for last, because it will require a prior transformation.
We start with Frequency
clientes['F'] = clientes.compras_tot_num.transform(lambda x: pd.cut(x,5, labels = False)) + 1
clientes
productos_tot_num | compras_tot_num | precio_medio_prod | ult_compra | gasto_tot | productos_por_compra | F | |
---|---|---|---|---|---|---|---|
usuario | |||||||
25392526 | 3 | 1 | 7.38 | 2019-12-18 | 22.14 | 3.00 | 1 |
27756757 | 1 | 1 | 20.63 | 2020-01-27 | 20.63 | 1.00 | 1 |
50748978 | 9 | 1 | 1.11 | 2019-12-14 | 10.01 | 9.00 | 1 |
52747911 | 3 | 1 | 7.67 | 2019-10-10 | 23.02 | 3.00 | 1 |
65241811 | 5 | 1 | 8.36 | 2019-11-11 | 41.79 | 5.00 | 1 |
... | ... | ... | ... | ... | ... | ... | ... |
621995551 | 5 | 1 | 2.09 | 2020-02-29 | 10.46 | 5.00 | 1 |
622021687 | 1 | 1 | 13.33 | 2020-02-29 | 13.33 | 1.00 | 1 |
622041514 | 3 | 1 | 0.63 | 2020-02-29 | 1.90 | 3.00 | 1 |
622042698 | 3 | 1 | 28.04 | 2020-02-29 | 84.13 | 3.00 | 1 |
622065819 | 4 | 1 | 5.12 | 2020-02-29 | 20.48 | 4.00 | 1 |
11040 rows × 7 columns
We check.
clientes.groupby('F').compras_tot_num.mean()
F
1 1.31
2 7.06
3 12.00
4 16.50
5 23.50
Name: compras_tot_num, dtype: float64
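Note that pd.cut creates equal-width bins, so with a distribution this skewed almost all customers fall into the lowest bin. If we preferred groups of similar size, a quantile-based sketch would be (duplicates = 'drop' is needed because many customers share the same value, so fewer than 5 bins may remain):
#clientes['F'] = pd.qcut(clientes.compras_tot_num, 5, labels = False, duplicates = 'drop') + 1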
Now Monetary
clientes['M'] = clientes.gasto_tot.transform(lambda x: pd.cut(x,5, labels = False)) + 1
clientes.groupby('M').gasto_tot.mean()
M
1 48.36
2 410.98
3 765.18
4 1043.96
5 1468.34
Name: gasto_tot, dtype: float64
For the recency we have to transform the date into a number, for example the distance in days from each date to the most recent date available.
mas_reciente = clientes.ult_compra.max()
clientes['ult_compra_dias'] = clientes.ult_compra.transform(lambda x: mas_reciente - x)
clientes
productos_tot_num | compras_tot_num | precio_medio_prod | ult_compra | ... | productos_por_compra | F | M | ult_compra_dias | |
---|---|---|---|---|---|---|---|---|---|
usuario | |||||||||
25392526 | 3 | 1 | 7.38 | 2019-12-18 | ... | 3.00 | 1 | 1 | 73 days |
27756757 | 1 | 1 | 20.63 | 2020-01-27 | ... | 1.00 | 1 | 1 | 33 days |
50748978 | 9 | 1 | 1.11 | 2019-12-14 | ... | 9.00 | 1 | 1 | 77 days |
52747911 | 3 | 1 | 7.67 | 2019-10-10 | ... | 3.00 | 1 | 1 | 142 days |
65241811 | 5 | 1 | 8.36 | 2019-11-11 | ... | 5.00 | 1 | 1 | 110 days |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
621995551 | 5 | 1 | 2.09 | 2020-02-29 | ... | 5.00 | 1 | 1 | 0 days |
622021687 | 1 | 1 | 13.33 | 2020-02-29 | ... | 1.00 | 1 | 1 | 0 days |
622041514 | 3 | 1 | 0.63 | 2020-02-29 | ... | 3.00 | 1 | 1 | 0 days |
622042698 | 3 | 1 | 28.04 | 2020-02-29 | ... | 3.00 | 1 | 1 | 0 days |
622065819 | 4 | 1 | 5.12 | 2020-02-29 | ... | 4.00 | 1 | 1 | 0 days |
11040 rows × 9 columns
This created a timedelta; we need to convert it to a number of days.
clientes['ult_compra_dias'] = clientes.ult_compra_dias.dt.days
clientes
productos_tot_num | compras_tot_num | precio_medio_prod | ult_compra | ... | productos_por_compra | F | M | ult_compra_dias | |
---|---|---|---|---|---|---|---|---|---|
usuario | |||||||||
25392526 | 3 | 1 | 7.38 | 2019-12-18 | ... | 3.00 | 1 | 1 | 73 |
27756757 | 1 | 1 | 20.63 | 2020-01-27 | ... | 1.00 | 1 | 1 | 33 |
50748978 | 9 | 1 | 1.11 | 2019-12-14 | ... | 9.00 | 1 | 1 | 77 |
52747911 | 3 | 1 | 7.67 | 2019-10-10 | ... | 3.00 | 1 | 1 | 142 |
65241811 | 5 | 1 | 8.36 | 2019-11-11 | ... | 5.00 | 1 | 1 | 110 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
621995551 | 5 | 1 | 2.09 | 2020-02-29 | ... | 5.00 | 1 | 1 | 0 |
622021687 | 1 | 1 | 13.33 | 2020-02-29 | ... | 1.00 | 1 | 1 | 0 |
622041514 | 3 | 1 | 0.63 | 2020-02-29 | ... | 3.00 | 1 | 1 | 0 |
622042698 | 3 | 1 | 28.04 | 2020-02-29 | ... | 3.00 | 1 | 1 | 0 |
622065819 | 4 | 1 | 5.12 | 2020-02-29 | ... | 4.00 | 1 | 1 | 0 |
11040 rows × 9 columns
We can now create R, but bear in mind that in this case the best values are the lowest ones.
clientes['R'] = clientes.ult_compra_dias.transform(lambda x: pd.cut(x,5, labels = False)) + 1
clientes.groupby('R').ult_compra_dias.mean()
R
1 14.62
2 43.04
3 75.94
4 103.85
5 135.91
Name: ult_compra_dias, dtype: float64
To align its interpretation with the rest of the dimensions (higher is better), we invert it.
clientes['R'] = 6 - clientes.R
clientes.groupby('R').ult_compra_dias.mean()
R
1 135.91
2 103.85
3 75.94
4 43.04
5 14.62
Name: ult_compra_dias, dtype: float64
Everything is now integrated in a single RFM dataframe.
clientes
productos_tot_num | compras_tot_num | precio_medio_prod | ult_compra | ... | F | M | ult_compra_dias | R | |
---|---|---|---|---|---|---|---|---|---|
usuario | |||||||||
25392526 | 3 | 1 | 7.38 | 2019-12-18 | ... | 1 | 1 | 73 | 3 |
27756757 | 1 | 1 | 20.63 | 2020-01-27 | ... | 1 | 1 | 33 | 4 |
50748978 | 9 | 1 | 1.11 | 2019-12-14 | ... | 1 | 1 | 77 | 3 |
52747911 | 3 | 1 | 7.67 | 2019-10-10 | ... | 1 | 1 | 142 | 1 |
65241811 | 5 | 1 | 8.36 | 2019-11-11 | ... | 1 | 1 | 110 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
621995551 | 5 | 1 | 2.09 | 2020-02-29 | ... | 1 | 1 | 0 | 5 |
622021687 | 1 | 1 | 13.33 | 2020-02-29 | ... | 1 | 1 | 0 | 5 |
622041514 | 3 | 1 | 0.63 | 2020-02-29 | ... | 1 | 1 | 0 | 5 |
622042698 | 3 | 1 | 28.04 | 2020-02-29 | ... | 1 | 1 | 0 | 5 |
622065819 | 4 | 1 | 5.12 | 2020-02-29 | ... | 1 | 1 | 0 | 5 |
11040 rows × 10 columns
We create two additional variables: an overall value score (the sum of R, F and M) and the concatenated RFM code.
clientes['valor'] = clientes.R + clientes.F + clientes.M
clientes['RFM'] = clientes.apply(lambda x: str(x.R) + str(x.F) + str(x.M), axis = 1)
clientes
productos_tot_num | compras_tot_num | precio_medio_prod | ult_compra | ... | ult_compra_dias | R | valor | RFM | |
---|---|---|---|---|---|---|---|---|---|
usuario | |||||||||
25392526 | 3 | 1 | 7.38 | 2019-12-18 | ... | 73 | 3 | 5 | 311 |
27756757 | 1 | 1 | 20.63 | 2020-01-27 | ... | 33 | 4 | 6 | 411 |
50748978 | 9 | 1 | 1.11 | 2019-12-14 | ... | 77 | 3 | 5 | 311 |
52747911 | 3 | 1 | 7.67 | 2019-10-10 | ... | 142 | 1 | 3 | 111 |
65241811 | 5 | 1 | 8.36 | 2019-11-11 | ... | 110 | 2 | 4 | 211 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
621995551 | 5 | 1 | 2.09 | 2020-02-29 | ... | 0 | 5 | 7 | 511 |
622021687 | 1 | 1 | 13.33 | 2020-02-29 | ... | 0 | 5 | 7 | 511 |
622041514 | 3 | 1 | 0.63 | 2020-02-29 | ... | 0 | 5 | 7 | 511 |
622042698 | 3 | 1 | 28.04 | 2020-02-29 | ... | 0 | 5 | 7 | 511 |
622065819 | 4 | 1 | 5.12 | 2020-02-29 | ... | 0 | 5 | 7 | 511 |
11040 rows × 12 columns
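As an aside, the row-wise apply used to build the RFM code works, but it is slow on large customer bases; a vectorized equivalent (same result, given the integer R, F and M columns created above) would be:
clientes['RFM'] = (clientes.R.astype(str)
                   + clientes.F.astype(str)
                   + clientes.M.astype(str))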
On this dataframe we can already perform a great many analyses.
For example, combining it with the minicube technique, we can extract all kinds of insights.
#Step 1: Select which variables will be the metrics and which the dimensions
metricas = ['productos_tot_num','compras_tot_num','gasto_tot']
dimensiones = ['R','F','M','RFM','valor']
minicubo = clientes[dimensiones + metricas]
minicubo
R | F | M | RFM | valor | productos_tot_num | compras_tot_num | gasto_tot | |
---|---|---|---|---|---|---|---|---|
usuario | ||||||||
25392526 | 3 | 1 | 1 | 311 | 5 | 3 | 1 | 22.14 |
27756757 | 4 | 1 | 1 | 411 | 6 | 1 | 1 | 20.63 |
50748978 | 3 | 1 | 1 | 311 | 5 | 9 | 1 | 10.01 |
52747911 | 1 | 1 | 1 | 111 | 3 | 3 | 1 | 23.02 |
65241811 | 2 | 1 | 1 | 211 | 4 | 5 | 1 | 41.79 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
621995551 | 5 | 1 | 1 | 511 | 7 | 5 | 1 | 10.46 |
622021687 | 5 | 1 | 1 | 511 | 7 | 1 | 1 | 13.33 |
622041514 | 5 | 1 | 1 | 511 | 7 | 3 | 1 | 1.90 |
622042698 | 5 | 1 | 1 | 511 | 7 | 3 | 1 | 84.13 |
622065819 | 5 | 1 | 1 | 511 | 7 | 4 | 1 | 20.48 |
11040 rows × 8 columns
#Step 2: Melt the dimensions into transactional (long) format
minicubo = minicubo.melt(id_vars = metricas)
minicubo
productos_tot_num | compras_tot_num | gasto_tot | variable | value | |
---|---|---|---|---|---|
0 | 3 | 1 | 22.14 | R | 3 |
1 | 1 | 1 | 20.63 | R | 4 |
2 | 9 | 1 | 10.01 | R | 3 |
3 | 3 | 1 | 23.02 | R | 1 |
4 | 5 | 1 | 41.79 | R | 2 |
... | ... | ... | ... | ... | ... |
55195 | 5 | 1 | 10.46 | valor | 7 |
55196 | 1 | 1 | 13.33 | valor | 7 |
55197 | 3 | 1 | 1.90 | valor | 7 |
55198 | 3 | 1 | 84.13 | valor | 7 |
55199 | 4 | 1 | 20.48 | valor | 7 |
55200 rows × 5 columns
#Step 3: Aggregate the metrics by "variable" and "value" with the desired functions
minicubo = minicubo.groupby(['variable','value'], as_index = False)[metricas].mean()
minicubo
variable | value | productos_tot_num | compras_tot_num | gasto_tot | |
---|---|---|---|---|---|
0 | F | 1 | 10.61 | 1.31 | 52.09 |
1 | F | 2 | 71.42 | 7.06 | 320.47 |
2 | F | 3 | 123.64 | 12.00 | 643.20 |
3 | F | 4 | 156.75 | 16.50 | 560.15 |
4 | F | 5 | 124.00 | 23.50 | 652.42 |
... | ... | ... | ... | ... | ... |
58 | valor | 9 | 98.02 | 7.25 | 491.71 |
59 | valor | 10 | 140.89 | 10.22 | 625.93 |
60 | valor | 11 | 291.00 | 5.75 | 1189.31 |
61 | valor | 12 | 189.80 | 16.60 | 833.43 |
62 | valor | 13 | 179.00 | 18.00 | 1136.70 |
63 rows × 5 columns
To analyze each dimension, we filter for it.
minicubo[minicubo.variable == 'F']
variable | value | productos_tot_num | compras_tot_num | gasto_tot | |
---|---|---|---|---|---|
0 | F | 1 | 10.61 | 1.31 | 52.09 |
1 | F | 2 | 71.42 | 7.06 | 320.47 |
2 | F | 3 | 123.64 | 12.00 | 643.20 |
3 | F | 4 | 156.75 | 16.50 | 560.15 |
4 | F | 5 | 124.00 | 23.50 | 652.42 |
And we analyze it graphically.
minicubo[minicubo.variable == 'F'].set_index('value').plot.bar(subplots = True, sharex = False, figsize = (12,12))
plt.tight_layout();
minicubo[minicubo.variable == 'R']
variable | value | productos_tot_num | compras_tot_num | gasto_tot | |
---|---|---|---|---|---|
10 | R | 1 | 8.15 | 1.08 | 41.56 |
11 | R | 2 | 9.25 | 1.18 | 45.58 |
12 | R | 3 | 9.54 | 1.29 | 47.25 |
13 | R | 4 | 11.72 | 1.44 | 58.19 |
14 | R | 5 | 16.83 | 1.82 | 79.04 |
minicubo[minicubo.variable == 'R'].set_index('value').plot.bar(subplots = True, sharex = False, figsize = (12,12))
plt.tight_layout();
minicubo[minicubo.variable == 'M']
variable | value | productos_tot_num | compras_tot_num | gasto_tot | |
---|---|---|---|---|---|
5 | M | 1 | 10.12 | 1.34 | 48.36 |
6 | M | 2 | 74.28 | 4.31 | 410.98 |
7 | M | 3 | 138.50 | 6.86 | 765.18 |
8 | M | 4 | 189.50 | 7.33 | 1043.96 |
9 | M | 5 | 336.67 | 6.33 | 1468.34 |
minicubo[minicubo.variable == 'M'].set_index('value').plot.bar(subplots = True, sharex = False, figsize = (12,12))
plt.tight_layout();
minicubo[minicubo.variable == 'RFM']
variable | value | productos_tot_num | compras_tot_num | gasto_tot | |
---|---|---|---|---|---|
15 | RFM | 111 | 7.97 | 1.08 | 39.28 |
16 | RFM | 112 | 37.38 | 1.50 | 397.98 |
17 | RFM | 114 | 94.00 | 3.00 | 1109.70 |
18 | RFM | 211 | 8.94 | 1.17 | 43.16 |
19 | RFM | 212 | 47.00 | 1.73 | 391.14 |
20 | RFM | 213 | 35.50 | 1.00 | 662.84 |
21 | RFM | 221 | 85.67 | 6.33 | 194.48 |
22 | RFM | 311 | 8.98 | 1.26 | 43.91 |
23 | RFM | 312 | 53.46 | 2.46 | 376.38 |
24 | RFM | 321 | 39.00 | 6.50 | 141.28 |
25 | RFM | 322 | 121.00 | 6.50 | 389.19 |
26 | RFM | 324 | 195.00 | 8.00 | 1241.53 |
27 | RFM | 411 | 10.37 | 1.36 | 49.99 |
28 | RFM | 412 | 60.69 | 2.54 | 404.22 |
29 | RFM | 413 | 130.50 | 2.50 | 868.30 |
30 | RFM | 421 | 41.00 | 6.92 | 205.31 |
31 | RFM | 422 | 92.88 | 7.12 | 440.47 |
32 | RFM | 423 | 153.33 | 7.33 | 672.97 |
33 | RFM | 433 | 64.00 | 11.00 | 913.01 |
34 | RFM | 511 | 11.94 | 1.50 | 56.07 |
35 | RFM | 512 | 78.79 | 3.05 | 399.18 |
36 | RFM | 513 | 133.00 | 3.14 | 742.48 |
37 | RFM | 514 | 219.00 | 1.00 | 946.20 |
38 | RFM | 515 | 387.00 | 3.00 | 1475.83 |
39 | RFM | 521 | 47.88 | 6.83 | 192.19 |
40 | RFM | 522 | 89.51 | 7.51 | 442.50 |
41 | RFM | 523 | 184.67 | 7.33 | 766.90 |
42 | RFM | 524 | 190.00 | 6.00 | 947.30 |
43 | RFM | 531 | 58.25 | 11.50 | 233.84 |
44 | RFM | 532 | 94.00 | 12.50 | 448.09 |
45 | RFM | 533 | 200.00 | 11.00 | 858.26 |
46 | RFM | 534 | 219.50 | 13.00 | 1009.51 |
47 | RFM | 535 | 236.00 | 13.00 | 1453.37 |
48 | RFM | 541 | 121.50 | 16.50 | 288.65 |
49 | RFM | 543 | 192.00 | 16.50 | 831.65 |
50 | RFM | 552 | 126.00 | 24.00 | 484.81 |
51 | RFM | 553 | 122.00 | 23.00 | 820.04 |
minicubo[minicubo.variable == 'RFM'].set_index('value').plot.bar(subplots = True, sharex = False, figsize = (12,12))
plt.tight_layout();
minicubo[minicubo.variable == 'valor']
variable | value | productos_tot_num | compras_tot_num | gasto_tot | |
---|---|---|---|---|---|
52 | valor | 3 | 7.97 | 1.08 | 39.28 |
53 | valor | 4 | 9.04 | 1.17 | 44.41 |
54 | valor | 5 | 9.31 | 1.27 | 46.08 |
55 | valor | 6 | 10.73 | 1.38 | 53.07 |
56 | valor | 7 | 12.64 | 1.54 | 60.42 |
57 | valor | 8 | 66.13 | 5.06 | 313.01 |
58 | valor | 9 | 98.02 | 7.25 | 491.71 |
59 | valor | 10 | 140.89 | 10.22 | 625.93 |
60 | valor | 11 | 291.00 | 5.75 | 1189.31 |
61 | valor | 12 | 189.80 | 16.60 | 833.43 |
62 | valor | 13 | 179.00 | 18.00 | 1136.70 |
minicubo[minicubo.variable == 'valor'].set_index('value').plot.bar(subplots = True, sharex = False, figsize = (12,12))
plt.tight_layout();
The analysis could be improved: in F and M the outliers cause most of the data to be concentrated in category 1.
What we should do is remove those outliers and redo the exercise.
We leave that to you as a practice task.
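A minimal sketch of that exercise, assuming we simply cap purchases and spend at the 99th percentile before re-binning (the threshold is an illustrative choice):
#Cap at the 99th percentile, then rebuild F and M on the filtered customers
q_compras = clientes.compras_tot_num.quantile(0.99)
q_gasto = clientes.gasto_tot.quantile(0.99)
sin_atipicos = clientes[(clientes.compras_tot_num <= q_compras)
                        & (clientes.gasto_tot <= q_gasto)].copy()
sin_atipicos['F'] = pd.cut(sin_atipicos.compras_tot_num, 5, labels = False) + 1
sin_atipicos['M'] = pd.cut(sin_atipicos.gasto_tot, 5, labels = False) + 1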
But even with this analysis we can already identify the customers most likely to respond well to new campaigns, as well as extract many valuable insights for the business.
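For example, a first campaign target list could be pulled directly from the scores; the cutoff of 12 below is an illustrative assumption, not a value derived from the data:
#Customers with the highest combined R + F + M score, ordered by total spend
objetivo = clientes[clientes.valor >= 12].sort_values('gasto_tot', ascending = False)
objetivo[['R','F','M','valor','gasto_tot']].head(10)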
We are going to create a dataframe at the product level to be able to analyze this dimension.
We first calculate the counts of each event for each product.
prod = df.groupby(['producto','evento']).size()
prod
producto evento
3752 view 10
3762 cart 127
purchase 28
remove_from_cart 59
view 258
...
5932538 view 1
5932540 cart 1
view 2
5932578 view 1
5932585 view 2
Length: 137068, dtype: int64
prod = prod.unstack(level = 1).fillna(0)
prod
evento | cart | purchase | remove_from_cart | view |
---|---|---|---|---|
producto | ||||
3752 | 0.00 | 0.00 | 0.00 | 10.00 |
3762 | 127.00 | 28.00 | 59.00 | 258.00 |
3763 | 10.00 | 2.00 | 2.00 | 51.00 |
3771 | 0.00 | 0.00 | 0.00 | 9.00 |
3774 | 26.00 | 7.00 | 13.00 | 76.00 |
... | ... | ... | ... | ... |
5932537 | 1.00 | 0.00 | 0.00 | 1.00 |
5932538 | 0.00 | 0.00 | 0.00 | 1.00 |
5932540 | 1.00 | 0.00 | 0.00 | 2.00 |
5932578 | 0.00 | 0.00 | 0.00 | 1.00 |
5932585 | 0.00 | 0.00 | 0.00 | 2.00 |
45327 rows × 4 columns
We are going to incorporate the price; to do so, we first create a price master per product.
maestro_precios = df.groupby('producto', as_index = False).precio.mean()
maestro_precios
producto | precio | |
---|---|---|
0 | 3752 | 15.71 |
1 | 3762 | 19.29 |
2 | 3763 | 16.00 |
3 | 3771 | 15.08 |
4 | 3774 | 15.92 |
... | ... | ... |
45322 | 5932537 | 1.43 |
45323 | 5932538 | 1.43 |
45324 | 5932540 | 1.43 |
45325 | 5932578 | 6.02 |
45326 | 5932585 | 6.33 |
45327 rows × 2 columns
prod = pd.merge(left = prod, right = maestro_precios, how = 'left', on = 'producto')
prod
producto | cart | purchase | remove_from_cart | view | precio | |
---|---|---|---|---|---|---|
0 | 3752 | 0.00 | 0.00 | 0.00 | 10.00 | 15.71 |
1 | 3762 | 127.00 | 28.00 | 59.00 | 258.00 | 19.29 |
2 | 3763 | 10.00 | 2.00 | 2.00 | 51.00 | 16.00 |
3 | 3771 | 0.00 | 0.00 | 0.00 | 9.00 | 15.08 |
4 | 3774 | 26.00 | 7.00 | 13.00 | 76.00 | 15.92 |
... | ... | ... | ... | ... | ... | ... |
45322 | 5932537 | 1.00 | 0.00 | 0.00 | 1.00 | 1.43 |
45323 | 5932538 | 0.00 | 0.00 | 0.00 | 1.00 | 1.43 |
45324 | 5932540 | 1.00 | 0.00 | 0.00 | 2.00 | 1.43 |
45325 | 5932578 | 0.00 | 0.00 | 0.00 | 1.00 | 6.02 |
45326 | 5932585 | 0.00 | 0.00 | 0.00 | 2.00 | 6.33 |
45327 rows × 6 columns
We reorder the columns.
prod = prod[['producto','view','cart','remove_from_cart','purchase','precio']]
prod
producto | view | cart | remove_from_cart | purchase | precio | |
---|---|---|---|---|---|---|
0 | 3752 | 10.00 | 0.00 | 0.00 | 0.00 | 15.71 |
1 | 3762 | 258.00 | 127.00 | 59.00 | 28.00 | 19.29 |
2 | 3763 | 51.00 | 10.00 | 2.00 | 2.00 | 16.00 |
3 | 3771 | 9.00 | 0.00 | 0.00 | 0.00 | 15.08 |
4 | 3774 | 76.00 | 26.00 | 13.00 | 7.00 | 15.92 |
... | ... | ... | ... | ... | ... | ... |
45322 | 5932537 | 1.00 | 1.00 | 0.00 | 0.00 | 1.43 |
45323 | 5932538 | 1.00 | 0.00 | 0.00 | 0.00 | 1.43 |
45324 | 5932540 | 2.00 | 1.00 | 0.00 | 0.00 | 1.43 |
45325 | 5932578 | 1.00 | 0.00 | 0.00 | 0.00 | 6.02 |
45326 | 5932585 | 2.00 | 0.00 | 0.00 | 0.00 | 6.33 |
45327 rows × 6 columns
prod.sort_values('purchase',ascending = False)[0:20]
producto | view | cart | remove_from_cart | purchase | precio | |
---|---|---|---|---|---|---|
16807 | 5809910 | 9195.00 | 2796.00 | 1249.00 | 764.00 | 5.21 |
28178 | 5854897 | 624.00 | 2486.00 | 793.00 | 483.00 | 0.32 |
6644 | 5700037 | 1150.00 | 2603.00 | 716.00 | 361.00 | 0.40 |
314 | 5304 | 516.00 | 1184.00 | 426.00 | 341.00 | 0.32 |
9900 | 5751422 | 2204.00 | 1119.00 | 625.00 | 331.00 | 10.87 |
15394 | 5802432 | 701.00 | 2495.00 | 745.00 | 322.00 | 0.32 |
16809 | 5809912 | 3059.00 | 1352.00 | 863.00 | 321.00 | 5.19 |
18415 | 5815662 | 1219.00 | 1697.00 | 653.00 | 310.00 | 0.91 |
9862 | 5751383 | 2341.00 | 1035.00 | 550.00 | 298.00 | 10.24 |
14043 | 5792800 | 1527.00 | 911.00 | 512.00 | 285.00 | 10.25 |
26312 | 5849033 | 2099.00 | 1035.00 | 583.00 | 278.00 | 10.25 |
5386 | 5686925 | 344.00 | 1677.00 | 499.00 | 231.00 | 0.35 |
6653 | 5700046 | 432.00 | 1376.00 | 381.00 | 215.00 | 0.40 |
1761 | 5528035 | 1146.00 | 719.00 | 401.00 | 200.00 | 9.44 |
22111 | 5833330 | 680.00 | 576.00 | 359.00 | 194.00 | 0.92 |
16808 | 5809911 | 1923.00 | 828.00 | 599.00 | 189.00 | 5.21 |
18525 | 5816170 | 1642.00 | 751.00 | 532.00 | 182.00 | 5.22 |
5420 | 5687151 | 508.00 | 540.00 | 288.00 | 179.00 | 1.90 |
8232 | 5729864 | 160.00 | 505.00 | 211.00 | 176.00 | 0.41 |
24787 | 5843836 | 165.00 | 1007.00 | 265.00 | 172.00 | 0.38 |
Simply highlighting these products in the store could plausibly increase sales and the average ticket.
prod[prod.purchase == 0]
producto | view | cart | remove_from_cart | purchase | precio | |
---|---|---|---|---|---|---|
0 | 3752 | 10.00 | 0.00 | 0.00 | 0.00 | 15.71 |
3 | 3771 | 9.00 | 0.00 | 0.00 | 0.00 | 15.08 |
6 | 3790 | 10.00 | 0.00 | 0.00 | 0.00 | 7.92 |
8 | 3809 | 2.00 | 0.00 | 0.00 | 0.00 | 12.54 |
9 | 3812 | 1.00 | 0.00 | 0.00 | 0.00 | 12.54 |
... | ... | ... | ... | ... | ... | ... |
45322 | 5932537 | 1.00 | 1.00 | 0.00 | 0.00 | 1.43 |
45323 | 5932538 | 1.00 | 0.00 | 0.00 | 0.00 | 1.43 |
45324 | 5932540 | 2.00 | 1.00 | 0.00 | 0.00 | 1.43 |
45325 | 5932578 | 1.00 | 0.00 | 0.00 | 0.00 | 6.02 |
45326 | 5932585 | 2.00 | 0.00 | 0.00 | 0.00 | 6.33 |
21850 rows × 6 columns
INSIGHT #8: Almost half of the products have not had any sales in the 5 months of history.
A whole new analysis could be started on these products:
They could be removed from the catalog, or at least from the store, the newsletter, etc., so that they do not take space away from the products that do sell.
Next, let's see whether price is related to sales volume. Since this analysis involves sales, we will exclude the products that have not had any.
sns.scatterplot(data = prod[prod.purchase > 0], x = 'precio', y = 'purchase', hue = 'precio');
There is a clear decreasing relationship: cheaper products sell in larger volumes.
Let's zoom in, for example below €50, to understand it better.
sns.scatterplot(data = prod[(prod.purchase > 0) & (prod.precio < 50)], x = 'precio', y = 'purchase', hue = 'precio');
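To quantify the visual impression we could compute a rank correlation between price and units purchased (Spearman is our assumption here, chosen because both variables are heavily skewed):
con_ventas = prod[prod.purchase > 0]
con_ventas.precio.corr(con_ventas.purchase, method = 'spearman')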
Another useful ratio is the percentage of cart removals over cart additions.
prod.insert(loc = 4,
            column = 'remove_from_cart_porc',
            value = prod.remove_from_cart / prod.cart * 100)
prod
producto | view | cart | remove_from_cart | remove_from_cart_porc | purchase | precio | |
---|---|---|---|---|---|---|---|
0 | 3752 | 10.00 | 0.00 | 0.00 | NaN | 0.00 | 15.71 |
1 | 3762 | 258.00 | 127.00 | 59.00 | 46.46 | 28.00 | 19.29 |
2 | 3763 | 51.00 | 10.00 | 2.00 | 20.00 | 2.00 | 16.00 |
3 | 3771 | 9.00 | 0.00 | 0.00 | NaN | 0.00 | 15.08 |
4 | 3774 | 76.00 | 26.00 | 13.00 | 50.00 | 7.00 | 15.92 |
... | ... | ... | ... | ... | ... | ... | ... |
45322 | 5932537 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.43 |
45323 | 5932538 | 1.00 | 0.00 | 0.00 | NaN | 0.00 | 1.43 |
45324 | 5932540 | 2.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.43 |
45325 | 5932578 | 1.00 | 0.00 | 0.00 | NaN | 0.00 | 6.02 |
45326 | 5932585 | 2.00 | 0.00 | 0.00 | NaN | 0.00 | 6.33 |
45327 rows × 7 columns
prod.loc[prod.cart > 30].sort_values('remove_from_cart_porc', ascending = False)[0:30]
producto | view | cart | remove_from_cart | remove_from_cart_porc | purchase | precio | |
---|---|---|---|---|---|---|---|
14330 | 5797131 | 26.00 | 38.00 | 136.00 | 357.89 | 7.00 | 4.43 |
37937 | 5893670 | 36.00 | 35.00 | 109.00 | 311.43 | 3.00 | 4.90 |
29128 | 5858481 | 41.00 | 31.00 | 64.00 | 206.45 | 7.00 | 4.55 |
16658 | 5809346 | 8.00 | 34.00 | 62.00 | 182.35 | 7.00 | 0.78 |
26120 | 5848412 | 34.00 | 37.00 | 66.00 | 178.38 | 12.00 | 0.79 |
37944 | 5893677 | 64.00 | 41.00 | 70.00 | 170.73 | 10.00 | 4.69 |
8416 | 5731470 | 39.00 | 34.00 | 58.00 | 170.59 | 10.00 | 6.32 |
3217 | 5635096 | 32.00 | 32.00 | 52.00 | 162.50 | 11.00 | 4.42 |
3244 | 5635127 | 41.00 | 32.00 | 52.00 | 162.50 | 10.00 | 4.43 |
21617 | 5830537 | 35.00 | 37.00 | 60.00 | 162.16 | 8.00 | 1.73 |
39359 | 5900645 | 47.00 | 33.00 | 52.00 | 157.58 | 8.00 | 4.39 |
6222 | 5696152 | 81.00 | 41.00 | 64.00 | 156.10 | 12.00 | 2.37 |
29629 | 5859474 | 44.00 | 43.00 | 67.00 | 155.81 | 12.00 | 1.72 |
31887 | 5867624 | 26.00 | 46.00 | 70.00 | 152.17 | 10.00 | 3.89 |
9227 | 5741027 | 89.00 | 35.00 | 53.00 | 151.43 | 7.00 | 5.19 |
2123 | 5560972 | 73.00 | 51.00 | 76.00 | 149.02 | 11.00 | 2.98 |
6235 | 5696184 | 38.00 | 41.00 | 61.00 | 148.78 | 7.00 | 2.37 |
17716 | 5813067 | 48.00 | 43.00 | 63.00 | 146.51 | 5.00 | 1.72 |
221 | 4874 | 26.00 | 39.00 | 57.00 | 146.15 | 10.00 | 0.37 |
27643 | 5853242 | 47.00 | 43.00 | 62.00 | 144.19 | 6.00 | 3.15 |
33396 | 5875280 | 53.00 | 32.00 | 46.00 | 143.75 | 5.00 | 5.53 |
34325 | 5877765 | 62.00 | 35.00 | 50.00 | 142.86 | 9.00 | 9.02 |
23183 | 5837619 | 128.00 | 81.00 | 115.00 | 141.98 | 18.00 | 1.73 |
23833 | 5839637 | 55.00 | 43.00 | 61.00 | 141.86 | 12.00 | 2.37 |
28667 | 5857018 | 33.00 | 32.00 | 45.00 | 140.62 | 7.00 | 3.15 |
13942 | 5789608 | 57.00 | 32.00 | 45.00 | 140.62 | 7.00 | 4.69 |
3005 | 5619864 | 80.00 | 47.00 | 66.00 | 140.43 | 8.00 | 2.84 |
3205 | 5635081 | 26.00 | 35.00 | 49.00 | 140.00 | 4.00 | 4.40 |
30741 | 5863821 | 42.00 | 51.00 | 71.00 | 139.22 | 1.00 | 4.49 |
10701 | 5760769 | 38.00 | 31.00 | 43.00 | 138.71 | 6.00 | 2.62 |
We would need to investigate why these products are removed more often than they are added. Now let's look at the most-viewed products.
prod.view.sort_values(ascending = False)[0:20].plot.bar();
Again, simply highlighting these most-viewed products in the store could increase sales and the average ticket, provided that, in addition to being seen, they also sell.
The opposite case is also interesting: products that many customers look at but then do not buy.
If we find them, we should check what is going wrong with them.
sns.scatterplot(data = prod, x = 'view', y = 'purchase');
We are going to remove the outlier and zoom into the many-views, few-purchases window.
sns.scatterplot(data = prod.loc[prod.view < 4000], x = 'view', y = 'purchase', hue = 'precio')
plt.xlim(1000,3000)
plt.ylim(0,150)
There is an opportunity with these products: for some reason they attract customers' interest, but in the end customers do not buy them.
They deserve a dedicated analysis.
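A minimal sketch to extract that window for further analysis, with thresholds taken from the zoom above:
#Products with many views but few purchases
oportunidad = prod[prod.view.between(1000, 3000) & (prod.purchase < 150)]
oportunidad.sort_values('view', ascending = False).head(10)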
One of the assets that can most increase the sales of an ecommerce is a recommendation system.
We could already implement a basic one using the most-viewed and best-seller analyses carried out above.
But the real power comes when we create a recommender that personalizes for each purchase.
Types of recommender systems:
In our case we are going to develop one with collaborative filtering based on items.
The steps to follow are:
In this case we will use what is called an implicit KPI: the number of times the products have been purchased by the same user.
Explicit KPIs would be, for example, star ratings or scores from 1 to 10.
Since this algorithm is costly to compute, we are going to reduce the problem and calculate it only for the 100 best-selling products.
First we build a master list with the top 100 best sellers.
mas_vendidos = prod.sort_values('purchase', ascending = False).producto[0:100]
mas_vendidos
16807 5809910
28178 5854897
6644 5700037
314 5304
9900 5751422
...
30395 5862564
9778 5749720
9732 5749149
22751 5835859
22116 5833335
Name: producto, Length: 100, dtype: int64
We create a temporary dataframe filtering by these products.
temp = df.loc[df.producto.isin(mas_vendidos)]
temp
usuario | sesion | categoria | evento | ... | segundo | festivo | black_friday | san_valentin | |
---|---|---|---|---|---|---|---|---|---|
fecha | |||||||||
2019-10-01 00:26:49 | 536128518 | a31f0991-645e-4472-a012-95eb2f814568 | 1487580006317032337 | purchase | ... | 49 | 0 | 0 | 0 |
2019-10-01 00:46:20 | 555415545 | b9cc1771-9062-4e08-a3ad-363314cd17d8 | 1602943681873052386 | view | ... | 20 | 0 | 0 | 0 |
2019-10-01 00:48:13 | 555415545 | b9cc1771-9062-4e08-a3ad-363314cd17d8 | 1602943681873052386 | view | ... | 13 | 0 | 0 | 0 |
2019-10-01 00:52:39 | 555415545 | b9cc1771-9062-4e08-a3ad-363314cd17d8 | 1487580005092295511 | view | ... | 39 | 0 | 0 | 0 |
2019-10-01 01:33:26 | 555456891 | b3239dc3-f107-4034-a507-4c41f646e38a | 1487580005092295511 | view | ... | 26 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2020-02-29 23:11:44 | 615102046 | 17b94398-0397-4c59-bc84-fe91dde0a8ec | 1487580006509970331 | cart | ... | 44 | 0 | 0 | 0 |
2020-02-29 23:12:40 | 615102046 | 17b94398-0397-4c59-bc84-fe91dde0a8ec | 2195085255034011676 | cart | ... | 40 | 0 | 0 | 0 |
2020-02-29 23:12:50 | 599909613 | 1c6c708d-135d-487b-afa9-4bbcfd28db4d | 1602943681873052386 | cart | ... | 50 | 0 | 0 | 0 |
2020-02-29 23:20:21 | 231719601 | a7467d5c-e848-406f-97f4-fcb6a4113e68 | 1602943681873052386 | view | ... | 21 | 0 | 0 | 0 |
2020-02-29 23:58:49 | 147995998 | 5ff96629-3627-493e-a25b-5a871ec78c90 | 1487580006317032337 | cart | ... | 49 | 0 | 0 | 0 |
168170 rows × 16 columns
We create the user-item matrix.
usuario_item = (temp.loc[temp.evento == 'purchase']
                    .groupby(['usuario','producto'])
                    .size()
                    .unstack(level = 1)
                    .fillna(0))
usuario_item
producto | 4497 | 4600 | 4768 | 4938 | ... | 5857360 | 5862564 | 5862943 | 5889300 |
---|---|---|---|---|---|---|---|---|---|
usuario | |||||||||
25392526 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 |
50748978 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 |
74332980 | 0.00 | 0.00 | 0.00 | 1.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 |
80577370 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 |
88211255 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
621646584 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 |
621788730 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 1.00 | 0.00 |
621925941 | 1.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 |
621974977 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 |
622021687 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 |
5064 rows × 100 columns
Most common metrics:
In this case we are going to use, for example, the Euclidean distance.
We operationalize it with SciPy's spatial.distance.euclidean function.
from scipy import spatial
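As a quick sanity check of the metric on toy vectors (illustrative values, not data from our matrix):
a = np.array([1, 0, 2])
b = np.array([0, 0, 2])
spatial.distance.euclidean(a, b)   #sqrt((1-0)**2 + 0 + 0) = 1.0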
We create the recommender: it takes a user-item matrix as input and returns an item-item matrix with the Euclidean distances as values.
def recomendador(dataframe):
    #Distance from every item (column) of the matrix to one given item vector
    def distancia(producto):
        return(dataframe.apply(lambda x: spatial.distance.euclidean(x, producto)))
    #Applying it to every column yields the item-item distance matrix
    return(dataframe.apply(lambda x: distancia(x)))
item_item = recomendador(usuario_item)
item_item
producto | 4497 | 4600 | 4768 | 4938 | ... | 5857360 | 5862564 | 5862943 | 5889300 |
---|---|---|---|---|---|---|---|---|---|
producto | |||||||||
4497 | 0.00 | 14.42 | 14.49 | 15.62 | ... | 15.78 | 15.17 | 16.40 | 15.33 |
4600 | 14.42 | 0.00 | 10.68 | 14.49 | ... | 14.46 | 13.86 | 14.73 | 13.89 |
4768 | 14.49 | 10.68 | 0.00 | 14.56 | ... | 14.39 | 13.86 | 14.80 | 14.11 |
4938 | 15.62 | 14.49 | 14.56 | 0.00 | ... | 15.46 | 14.97 | 15.72 | 15.13 |
4958 | 15.91 | 14.59 | 14.73 | 15.52 | ... | 15.94 | 15.26 | 16.12 | 15.03 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5857007 | 15.07 | 13.60 | 13.75 | 14.59 | ... | 14.83 | 14.46 | 15.17 | 14.42 |
5857360 | 15.78 | 14.46 | 14.39 | 15.46 | ... | 0.00 | 13.75 | 16.19 | 15.43 |
5862564 | 15.17 | 13.86 | 13.86 | 14.97 | ... | 13.75 | 0.00 | 15.39 | 14.66 |
5862943 | 16.40 | 14.73 | 14.80 | 15.72 | ... | 16.19 | 15.39 | 0.00 | 15.23 |
5889300 | 15.33 | 13.89 | 14.11 | 15.13 | ... | 15.43 | 14.66 | 15.23 | 0.00 |
100 rows × 100 columns
We already have the recommender ready.
What we would have to do is query this table every time a user views a product or adds it to the cart.
But to make it more effective we could use all the information accumulated in the session, or even for the whole user if he is logged in.
That means we need a system that can recommend products whether the input is a single product or several.
And it should return several recommendations at once, to cover all the recommendation "slots" our website may have.
We will apply a very simple algorithm, whose steps are documented in the code below:
#In the case of several products, the input will come from the web server as a semicolon-separated string
def priorizador(productos, devuelve = 10):
    #Build an array with the input products
    array = np.int64(productos.split(';'))
    #Extract their vectors from the full item-item matrix
    matriz = item_item[array]
    #Compute the sum of distances to the input products
    suma_distancias = matriz.agg(sum, axis = 1)
    #Remove the input products themselves
    suma_distancias = suma_distancias.loc[~suma_distancias.index.isin(list(array))]
    #Return the `devuelve` products with the smallest total distance (10 by default)
    return(suma_distancias.sort_values()[0:devuelve])
We check how it works when we pass it a single product:
priorizador('4497')
producto
5724230 14.39
4600 14.42
5550302 14.49
4768 14.49
5749149 14.56
5833318 14.63
5824810 14.70
5835859 14.70
5809303 14.73
5833335 14.73
dtype: float64
And when we pass it several products:
priorizador('4497;4600;4768')
producto
5749149 40.25
5833318 40.47
5833335 40.81
5809303 40.81
5724230 41.00
5824810 41.08
5835859 41.23
5550302 41.47
5816169 41.51
5844894 41.55
dtype: float64
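To operationalize this, one simple option (our assumption, not something the pipeline already includes) is to persist the item-item matrix into the same SQLite database, so the web layer can query it instead of recomputing distances on every request:
#Column names must be strings for SQL, so we rename the integer product ids on the fly
item_item.rename(columns = str).to_sql('item_item', con, if_exists = 'replace')
#The web layer (or a fresh session) can then reload it; columns come back as strings
recomendaciones = pd.read_sql('item_item', con, index_col = 'producto')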
The current trend is flat across all metrics, which confirms the need for CRO actions.
Following the analysis of the transactional data, a CRO plan has been developed with 12 specific initiatives organized into 5 major business levers which, with high probability, will raise the baseline metrics and increase overall ecommerce revenue.
In each session, on average: