PROJECT DESIGN

This case simulates that we are a real estate company that makes investments in large cities, buying properties to later rent them as tourist apartments.

The management has made the decision to invest in Madrid, and has commissioned us to analyze the data that the leader in the AirBnb sector makes public to try to find the types of properties that have the greatest commercial potential for tourist rentals.

As the main deliverable, they expect the type (or types) of properties that the valuation team must look for among the existing opportunities in the city and the main neighborhoods or geographical areas on which to focus.

To meet the objective we will apply the Discovery methodology and BA techniques.

Although this specific case is focused on tourist rentals, the same type of approach can be used in cases that have a high "location" component:

store opening and closing

reduction of installed capacity

franchise expansion

etc.

OBJECTIVE

Locate the profile (or profiles) of properties that maximize commercial potential in the tourist rental market and the main areas where to look for them.

LEVERS

After speaking with the valuation team, they tell us that the levers that have the most impact on the profitability of this type of investment are:

Rental price: the more you can charge per night, the higher the profitability

Occupation: in general, the more days a year a property can be rented, the greater its profitability

Real estate price: the cheaper the property can be acquired, the greater the profitability

KPI's

In this example the Kpis are pretty straightforward:

  • We will measure occupancy as the number of days per year that the property can be rented
  • We will measure the rental price as the price per night in euros according to Airbnb
  • We will measure the price of a property as the multiplication between the number of square meters and the average price per m2 in your area, and we will apply a 25% discount on the official price due to the negotiation strength of our purchasing team.

ENTITIES AND DATA

The relevant entities for our objective and from which we may have data are:

Estate

owners

districts

The specific data in each of them will be reviewed in the next module.

SEED QUESTIONS

About the rental price:

  • What is the average price? And the price range? And by districts? And by neighborhoods?
  • What is the ranking of districts and neighborhoods by average rental price?
  • What factors (apart from location) determine the rental price?
  • What is the relationship between the size of the property and the price for which it can be rented?
  • How does the competition (number of properties available per neighborhood) influence the rental price?
  • How do the prices vary by rental type (entire flat, private room, shared room)?

About the occupation:

  • What is the average occupancy? And by districts? And by neighborhoods?
  • How likely is each occupancy level in each district?
  • What is the ranking of districts and neighborhoods by occupation?
  • What factors (aside from location) determine occupancy?
  • What is the relationship between the size of the property and its degree of occupancy?
  • How does competition (number of properties available per neighborhood) influence occupancy?

About the purchase price:

  • What is the price ranking per m2 by district?
  • What is the property price ranking (m2 * average size) by district?
  • What is the relationship between the price of the property and the rental price by district?
  • What is the relationship between the price of the property and the occupation by district?

INITIAL ANALYSIS OF FILES AND PREPARATION OF THE CASE

We will use actual data provided by AirBnB on this page: Airbnb Insider

SETUP

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#Automcompletar rápido
%config IPCompleter.greedy=True

UNDERSTAND THE FILES

On the AirBnB website we can see the description of the tables:

jpeg

We are going to load one by one, understand them and make a decision whether to use it or not.

We load and understand aggregated listings

listings = pd.read_csv('../Datos/listings.csv')
listings.head()
id name host_id host_name ... license
0 6369 Rooftop terrace room , ensuite bathroom 13660 Simon ... NaN
1 21853 Bright and airy room 83531 Abdel ... NaN
2 23001 Apartmento Arganzuela- Madrid Rio 82175 Jesus ... NaN
3 24805 Gran Via Studio Madrid 346366726 A ... NaN
4 26825 Single Room whith private Bathroom 114340 Agustina ... NaN
listings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              18909 non-null  int64  
 1   name                            18906 non-null  object 
 2   host_id                         18909 non-null  int64  
 3   host_name                       18883 non-null  object 
 4   neighbourhood_group             18909 non-null  object 
 5   neighbourhood                   18909 non-null  object 
 6   latitude                        18909 non-null  float64
 7   longitude                       18909 non-null  float64
 8   room_type                       18909 non-null  object 
 9   price                           18909 non-null  int64  
 10  minimum_nights                  18909 non-null  int64  
 11  number_of_reviews               18909 non-null  int64  
 12  last_review                     13877 non-null  object 
 13  reviews_per_month               13877 non-null  float64
 14  calculated_host_listings_count  18909 non-null  int64  
 15  availability_365                18909 non-null  int64  
 16  number_of_reviews_ltm           18909 non-null  int64  
 17  license                         2828 non-null   object 
dtypes: float64(3), int64(8), object(7)
memory usage: 2.6+ MB

We upload and understand detailed listings

listings_det = pd.read_csv('../Datos/listings.csv.gz',compression='gzip')
listings_det.head()
id listing_url scrape_id last_scraped name description ...
0 6369 https://www.airbnb.com/rooms/6369 20210910193531 2021-09-11 Rooftop terrace room , ensuite bathroom Excellent connection with the AIRPORT and EXHI... ...
1 21853 https://www.airbnb.com/rooms/21853 20210910193531 2021-09-11 Bright and airy room We have a quiet and sunny room with a good vie... ...
2 23001 https://www.airbnb.com/rooms/23001 20210910193531 2021-09-11 Apartmento Arganzuela- Madrid Rio Apartamento de tres dormitorios dobles, gran s... ...
3 24805 https://www.airbnb.com/rooms/24805 20210910193531 2021-09-11 Gran Via Studio Madrid Studio located 50 meters from Gran Via, next t... ...
4 26825 https://www.airbnb.com/rooms/26825 20210910193531 2021-09-11 Single Room whith private Bathroom Nice and cozy roon for one person with a priva... ...

5 rows × 74 columns

listings_det.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 74 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            18909 non-null  int64  
 1   listing_url                                   18909 non-null  object 
 2   scrape_id                                     18909 non-null  int64  
 3   last_scraped                                  18909 non-null  object 
 4   name                                          18906 non-null  object 
 5   description                                   17854 non-null  object 
 6   neighborhood_overview                         10997 non-null  object 
 7   picture_url                                   18908 non-null  object 
 8   host_id                                       18909 non-null  int64  
 9   host_url                                      18909 non-null  object 
 10  host_name                                     18883 non-null  object 
 11  host_since                                    18883 non-null  object 
 12  host_location                                 18841 non-null  object 
 13  host_about                                    9427 non-null   object 
 14  host_response_time                            11972 non-null  object 
 15  host_response_rate                            11972 non-null  object 
 16  host_acceptance_rate                          11531 non-null  object 
 17  host_is_superhost                             18883 non-null  object 
 18  host_thumbnail_url                            18883 non-null  object 
 19  host_picture_url                              18883 non-null  object 
 20  host_neighbourhood                            12034 non-null  object 
 21  host_listings_count                           18883 non-null  float64
 22  host_total_listings_count                     18883 non-null  float64
 23  host_verifications                            18909 non-null  object 
 24  host_has_profile_pic                          18883 non-null  object 
 25  host_identity_verified                        18883 non-null  object 
 26  neighbourhood                                 10997 non-null  object 
 27  neighbourhood_cleansed                        18909 non-null  object 
 28  neighbourhood_group_cleansed                  18909 non-null  object 
 29  latitude                                      18909 non-null  float64
 30  longitude                                     18909 non-null  float64
 31  property_type                                 18909 non-null  object 
 32  room_type                                     18909 non-null  object 
 33  accommodates                                  18909 non-null  int64  
 34  bathrooms                                     0 non-null      float64
 35  bathrooms_text                                18884 non-null  object 
 36  bedrooms                                      17475 non-null  float64
 37  beds                                          18568 non-null  float64
 38  amenities                                     18909 non-null  object 
 39  price                                         18909 non-null  object 
 40  minimum_nights                                18909 non-null  int64  
 41  maximum_nights                                18909 non-null  int64  
 42  minimum_minimum_nights                        18908 non-null  float64
 43  maximum_minimum_nights                        18908 non-null  float64
 44  minimum_maximum_nights                        18908 non-null  float64
 45  maximum_maximum_nights                        18908 non-null  float64
 46  minimum_nights_avg_ntm                        18908 non-null  float64
 47  maximum_nights_avg_ntm                        18908 non-null  float64
 48  calendar_updated                              0 non-null      float64
 49  has_availability                              18909 non-null  object 
 50  availability_30                               18909 non-null  int64  
 51  availability_60                               18909 non-null  int64  
 52  availability_90                               18909 non-null  int64  
 53  availability_365                              18909 non-null  int64  
 54  calendar_last_scraped                         18909 non-null  object 
 55  number_of_reviews                             18909 non-null  int64  
 56  number_of_reviews_ltm                         18909 non-null  int64  
 57  number_of_reviews_l30d                        18909 non-null  int64  
 58  first_review                                  13877 non-null  object 
 59  last_review                                   13877 non-null  object 
 60  review_scores_rating                          13877 non-null  float64
 61  review_scores_accuracy                        13638 non-null  float64
 62  review_scores_cleanliness                     13640 non-null  float64
 63  review_scores_checkin                         13640 non-null  float64
 64  review_scores_communication                   13640 non-null  float64
 65  review_scores_location                        13637 non-null  float64
 66  review_scores_value                           13636 non-null  float64
 67  license                                       2828 non-null   object 
 68  instant_bookable                              18909 non-null  object 
 69  calculated_host_listings_count                18909 non-null  int64  
 70  calculated_host_listings_count_entire_homes   18909 non-null  int64  
 71  calculated_host_listings_count_private_rooms  18909 non-null  int64  
 72  calculated_host_listings_count_shared_rooms   18909 non-null  int64  
 73  reviews_per_month                             13877 non-null  float64
dtypes: float64(22), int64(17), object(35)
memory usage: 10.7+ MB

Conclusions:

  • They are the same records but the detail file has more columns.
  • We could join them through the id field

We load and understand reviews added

reviews = pd.read_csv('../Datos/reviews.csv')
reviews.head()
listing_id date
0 6369 2010-03-14
1 6369 2010-03-23
2 6369 2010-04-10
3 6369 2010-04-21
4 6369 2010-04-26
reviews.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 618440 entries, 0 to 618439
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   listing_id  618440 non-null  int64 
 1   date        618440 non-null  object
dtypes: int64(1), object(1)
memory usage: 9.4+ MB

We upload and understand detailed reviews

reviews_det = pd.read_csv('../Datos/reviews.csv.gz',compression = 'gzip')
reviews_det.head()
listing_id id date reviewer_id reviewer_name comments
0 6369 29428 2010-03-14 84790 Nancy Simon and Arturo have the ultimate location in...
1 6369 31018 2010-03-23 84338 David Myself and Kristy originally planned on stayin...
2 6369 34694 2010-04-10 98655 Marion We had a great time at Arturo and Simon's ! A ...
3 6369 37146 2010-04-21 109871 Kurt I very much enjoyed the stay. \r<br/>It's a w...
4 6369 38168 2010-04-26 98901 Dennis Arturo and Simon are polite and friendly hosts...
reviews_det.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 618440 entries, 0 to 618439
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   listing_id     618440 non-null  int64 
 1   id             618440 non-null  int64 
 2   date           618440 non-null  object
 3   reviewer_id    618440 non-null  int64 
 4   reviewer_name  618439 non-null  object
 5   comments       618054 non-null  object
dtypes: int64(3), object(3)
memory usage: 28.3+ MB

Conclusions:

  • They are the same records but the detail file has more columns.
  • Really this information from the reviews does not contribute anything to our objective, so we will not use these tables

We load and understand calendar

calendar = pd.read_csv('../Datos/calendar.csv.gz',compression = 'gzip')
calendar.head(30)
listing_id date available price adjusted_price minimum_nights maximum_nights
0 6369 2021-09-11 t $60.00 $60.00 1.0 1125.0
1 6369 2021-09-12 t $60.00 $60.00 1.0 1125.0
2 6369 2021-09-13 t $60.00 $60.00 1.0 1125.0
3 6369 2021-09-14 t $60.00 $60.00 1.0 1125.0
4 6369 2021-09-15 t $60.00 $60.00 1.0 1125.0
5 6369 2021-09-16 t $60.00 $60.00 1.0 1125.0
6 6369 2021-09-17 t $60.00 $60.00 1.0 1125.0
7 6369 2021-09-18 t $60.00 $60.00 1.0 1125.0
8 6369 2021-09-19 t $60.00 $60.00 1.0 1125.0
9 6369 2021-09-20 t $60.00 $60.00 1.0 1125.0
10 6369 2021-09-21 t $60.00 $60.00 1.0 1125.0
11 6369 2021-09-22 t $60.00 $60.00 1.0 1125.0
12 6369 2021-09-23 t $60.00 $60.00 1.0 1125.0
13 6369 2021-09-24 t $60.00 $60.00 1.0 1125.0
14 6369 2021-09-25 t $60.00 $60.00 1.0 1125.0
15 6369 2021-09-26 t $60.00 $60.00 1.0 1125.0
16 6369 2021-09-27 t $60.00 $60.00 1.0 1125.0
17 6369 2021-09-28 t $60.00 $60.00 1.0 1125.0
18 6369 2021-09-29 t $60.00 $60.00 1.0 1125.0
19 6369 2021-09-30 t $60.00 $60.00 1.0 1125.0
20 6369 2021-10-01 t $60.00 $60.00 1.0 1125.0
21 6369 2021-10-02 t $60.00 $60.00 1.0 1125.0
22 6369 2021-10-03 t $60.00 $60.00 1.0 1125.0
23 6369 2021-10-04 t $60.00 $60.00 1.0 1125.0
24 6369 2021-10-05 t $60.00 $60.00 1.0 1125.0
25 6369 2021-10-06 t $60.00 $60.00 1.0 1125.0
26 6369 2021-10-07 t $60.00 $60.00 1.0 1125.0
27 6369 2021-10-08 t $60.00 $60.00 1.0 1125.0
28 6369 2021-10-09 t $60.00 $60.00 1.0 1125.0
29 6369 2021-10-10 t $60.00 $60.00 1.0 1125.0
calendar.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6901414 entries, 0 to 6901413
Data columns (total 7 columns):
 #   Column          Dtype  
---  ------          -----  
 0   listing_id      int64  
 1   date            object 
 2   available       object 
 3   price           object 
 4   adjusted_price  object 
 5   minimum_nights  float64
 6   maximum_nights  float64
dtypes: float64(2), int64(1), object(4)
memory usage: 368.6+ MB

Conclusions:

  • This table is projected into the future, and appears to contain reserve availability
  • It is not information that serves our purposes and therefore we will not use it

We load and understand neighborhoods.csv

neigh = pd.read_csv('../Datos/neighbourhoods.csv')
neigh.head(5)
neighbourhood_group neighbourhood
0 Arganzuela Acacias
1 Arganzuela Atocha
2 Arganzuela Chopera
3 Arganzuela Delicias
4 Arganzuela Imperial

Conclusions:

  • It is simply a neighborhood teacher and neighborhood group
  • At first we will not use it, since both the neighborhood and its group are already incorporated into other tables

Load and understand neighborhoods.geojson

neigh_geo = pd.read_json('../Datos/neighbourhoods.geojson')
neigh_geo.head(5)
type features
0 FeatureCollection {'type': 'Feature', 'geometry': {'type': 'Mult...
1 FeatureCollection {'type': 'Feature', 'geometry': {'type': 'Mult...
2 FeatureCollection {'type': 'Feature', 'geometry': {'type': 'Mult...
3 FeatureCollection {'type': 'Feature', 'geometry': {'type': 'Mult...
4 FeatureCollection {'type': 'Feature', 'geometry': {'type': 'Mult...
neigh_geo.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128 entries, 0 to 127
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   type      128 non-null    object
 1   features  128 non-null    object
dtypes: object(2)
memory usage: 2.1+ KB

Conclusions:

  • Looks like geometry info for maps
  • Making maps with geometries is out of the scope of this program, we will see another much easier option, so we will not use it

Conclusions of the file analysis

  • Main tables we will use:
    • listings.csv
    • listings.csv.gz

CREATION OF A DATABASE

We are going to put the selected tables in an internal database.

We will use what is possibly the simplest database: sqlite, since it is self-contained, serverless and configuration-free, so it is usually the format of choice for storing "own" projects.

We create the connection:

import sqlalchemy as sa

con = sa.create_engine('sqlite:///../Datos/airbnb.db')

We create the tables and load the data

listings.to_sql('listings', con = con, if_exists = 'replace')
listings_det.to_sql('listings_det', con = con, if_exists = 'replace')

CREATION OF THE ANALYTICAL DATAMART

It's a good practice doing the individual data quality of each table first, especially because if the volume is very high, when joining them it is multiplied even more.

Therefore, initially we will leave them as individual tables, then we will apply data quality and finally we will join them to form the board or more officially the analytical datamart.

In this module we are going to:

  1. Access the database
  2. Import the data as Pandas dataframes
  3. Perform data quality
  4. Create the analytical datamart
  5. Save it as a table in the database so you don't have to repeat the process

SET UP

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sqlalchemy as sa

#Automcompletar rápido
%config IPCompleter.greedy=True

DATA IMPORT

Internal data

Create the connection to the database

Internal data

Create the connection to the database

con = sa.create_engine('sqlite:///../Datos/airbnb.db')
from sqlalchemy import inspect
insp = inspect(con)
tablas = insp.get_table_names()
tablas
['df', 'df_preparado', 'listings', 'listings_det']

Mass Load the tables

exec(f'{tabla} = pd.read_sql(tabla, con)')
for cada in tablas:
  print(cada + ': ' + str(eval(cada).shape))
df: (17710, 24)
df_preparado: (17710, 34)
listings: (18909, 19)
listings_det: (18909, 75)

External Data

In our data we do not have the purchase price of a property, but we had seen that it is one of the main levers.

Therefore we are going to look for that data externally.

On this page we have exactly the information we need: idealista

We can easily extract it with the Chrome instant data scraper plugin, and save it in our Data folder with the name 'prices_idealista.csv'

We load the data, remove the first record and select only the price and district columns

precio_m2 = pd.read_csv('../Datos/precios_idealista.csv') \
    .loc[1:,['table__cell','icon-elbow']] \
    .rename(columns = {'table__cell':'precio_m2','icon-elbow':'distrito'})
precio_m2
precio_m2 distrito
1 4.085 €/m2 Arganzuela
2 3.409 €/m2 Barajas
3 2.123 €/m2 Carabanchel
4 4.827 €/m2 Centro
5 5.098 €/m2 Chamartín
6 5.381 €/m2 Chamberí
7 2.940 €/m2 Ciudad Lineal
8 3.568 €/m2 Fuencarral
9 3.871 €/m2 Hortaleza
10 2.267 €/m2 Latina
11 4.033 €/m2 Moncloa
12 2.500 €/m2 Moratalaz
13 1.918 €/m2 Puente de Vallecas
14 4.788 €/m2 Retiro
15 6.114 €/m2 Salamanca
16 2.591 €/m2 San Blas
17 3.678 €/m2 Tetuán
18 1.995 €/m2 Usera
19 2.403 €/m2 Vicálvaro
20 2.354 €/m2 Villa de Vallecas
21 1.693 €/m2 Villaverde

Cleaning the price:

  1. removing the units
  2. removing the thousands separator points
  3. changing the type to integer
precio_m2['precio_m2'] = precio_m2.precio_m2.str.split(expand = True)[0].str.replace('.','',regex=False).astype('int')
precio_m2
precio_m2 distrito
1 4085 Arganzuela
2 3409 Barajas
3 2123 Carabanchel
4 4827 Centro
5 5098 Chamartín
6 5381 Chamberí
7 2940 Ciudad Lineal
8 3568 Fuencarral
9 3871 Hortaleza
10 2267 Latina
11 4033 Moncloa
12 2500 Moratalaz
13 1918 Puente de Vallecas
14 4788 Retiro
15 6114 Salamanca
16 2591 San Blas
17 3678 Tetuán
18 1995 Usera
19 2403 Vicálvaro
20 2354 Villa de Vallecas
21 1693 Villaverde

DATA QUALITY

listings table

Overview

listings.head()
index id name host_id host_name neighbourhood_group ...
0 0 6369 Rooftop terrace room , ensuite bathroom 13660 Simon Chamartín ...
1 1 21853 Bright and airy room 83531 Abdel Latina ...
2 2 23001 Apartmento Arganzuela- Madrid Rio 82175 Jesus Arganzuela ...
3 3 24805 Gran Via Studio Madrid 346366726 A Centro ...
4 4 26825 Single Room whith private Bathroom 114340 Agustina Arganzuela ...
listings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 19 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   index                           18909 non-null  int64  
 1   id                              18909 non-null  int64  
 2   name                            18906 non-null  object 
 3   host_id                         18909 non-null  int64  
 4   host_name                       18883 non-null  object 
 5   neighbourhood_group             18909 non-null  object 
 6   neighbourhood                   18909 non-null  object 
 7   latitude                        18909 non-null  float64
 8   longitude                       18909 non-null  float64
 9   room_type                       18909 non-null  object 
 10  price                           18909 non-null  int64  
 11  minimum_nights                  18909 non-null  int64  
 12  number_of_reviews               18909 non-null  int64  
 13  last_review                     13877 non-null  object 
 14  reviews_per_month               13877 non-null  float64
 15  calculated_host_listings_count  18909 non-null  int64  
 16  availability_365                18909 non-null  int64  
 17  number_of_reviews_ltm           18909 non-null  int64  
 18  license                         2828 non-null   object 
dtypes: float64(3), int64(9), object(7)
memory usage: 2.7+ MB

Variables and Types

We are going to eliminate those variables that we will not need directly for our objectives.

a_eliminar = ['index',
              'host_name',
              'number_of_reviews',
              'last_review',
              'reviews_per_month',
              'number_of_reviews_ltm',
              'license'
             ]

listings.drop(columns = a_eliminar, inplace=True)

listings
id name host_id neighbourhood_group neighbourhood latitude ...
0 6369 Rooftop terrace room , ensuite bathroom 13660 Chamartín Hispanoamérica 40.457240 ...
1 21853 Bright and airy room 83531 Latina Cármenes 40.403810 ...
2 23001 Apartmento Arganzuela- Madrid Rio 82175 Arganzuela Legazpi 40.388400 ...
3 24805 Gran Via Studio Madrid 346366726 Centro Universidad 40.421830 ...
4 26825 Single Room whith private Bathroom 114340 Arganzuela ... ... ...
... ... ... ... ... ... ... ...
18904 52182264 Enormous Private Room in 12-Bedroom Unit - los... 378060726 Centro Justicia 40.424384 ...
18905 52182273 Stunning Private Room in 11-Bedroom Unit - los... 378060726 Centro Justicia 40.424447 ...
18906 52182303 Classic Private Room in 7-Bedroom Unit - los 3... 378060726 Centro Justicia 40.424989 ...
18907 52182321 Elegant Private Room in 12-Bedroom Unit - los ... 378060726 Salamanca Recoletos 40.424352 ...
18908 52182334 Fashioned Private Room in 12-Bedroom Unit - lo... 378060726 Centro Justicia 40.425670 ...

18909 rows × 12 columns

We will check now data types

listings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 12 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              18909 non-null  int64  
 1   name                            18906 non-null  object 
 2   host_id                         18909 non-null  int64  
 3   neighbourhood_group             18909 non-null  object 
 4   neighbourhood                   18909 non-null  object 
 5   latitude                        18909 non-null  float64
 6   longitude                       18909 non-null  float64
 7   room_type                       18909 non-null  object 
 8   price                           18909 non-null  int64  
 9   minimum_nights                  18909 non-null  int64  
 10  calculated_host_listings_count  18909 non-null  int64  
 11  availability_365                18909 non-null  int64  
dtypes: float64(2), int64(6), object(4)
memory usage: 1.7+ MB

Conclusion: pass some object (neighborhood_group, neighborhood, room_type) to categorical.

for variable in ['neighbourhood_group','neighbourhood','room_type']:
    listings[variable] = listings[variable].astype('category')

Check

listings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 12 columns):
 #   Column                          Non-Null Count  Dtype   
---  ------                          --------------  -----   
 0   id                              18909 non-null  int64   
 1   name                            18906 non-null  object  
 2   host_id                         18909 non-null  int64   
 3   neighbourhood_group             18909 non-null  category
 4   neighbourhood                   18909 non-null  category
 5   latitude                        18909 non-null  float64 
 6   longitude                       18909 non-null  float64 
 7   room_type                       18909 non-null  category
 8   price                           18909 non-null  int64   
 9   minimum_nights                  18909 non-null  int64   
 10  calculated_host_listings_count  18909 non-null  int64   
 11  availability_365                18909 non-null  int64   
dtypes: category(3), float64(2), int64(6), object(1)
memory usage: 1.4+ MB

Analysis of nulls

From the Non-null column of info() we see that only name has 3 nulls.

We review them but we see that it is not a problem, so we leave them.

listings[listings.name.isna()]
id name host_id neighbourhood_group neighbourhood latitude longitude ...
1538 7164589 None 37525983 Centro Palacio 40.41458 -3.71422 ...
2394 11687495 None 48387429 San Blas - Canillejas Simancas 40.43765 -3.62672 ...
2842 13585476 None 20922102 Centro Universidad 40.42718 -3.71144 ...

Analysis of Duplicates

We check if there are any duplicate records.

listings.duplicated().sum()
0

Analysis of categorical variables

We are going to analyze the values and frequencies of the categorical variables

listings.neighbourhood_group.value_counts()
Centro                   8433
Salamanca                1267
Chamberí                 1148
Arganzuela               1070
Tetuán                    811
Carabanchel               669
Retiro                    656
Ciudad Lineal             601
Chamartín                 562
Latina                    547
Puente de Vallecas        542
Moncloa - Aravaca         535
San Blas - Canillejas     485
Hortaleza                 374
Fuencarral - El Pardo     294
Usera                     275
Villaverde                186
Barajas                   147
Moratalaz                 131
Villa de Vallecas         105
Vicálvaro                  71
Name: neighbourhood_group, dtype: int64
listings.neighbourhood.value_counts()
Embajadores    2255
Universidad    1772
Palacio        1533
Sol            1126
Justicia        910
               ... 
El Plantío        5
Valdemarín        4
El Pardo          3
Horcajo           2
Atalaya           1
Name: neighbourhood, Length: 128, dtype: int64
listings.room_type.value_counts()
Entire home/apt    11098
Private room        7390
Shared room          258
Hotel room           163
Name: room_type, dtype: int64

We see that there are hotels. Our company is not considering buying hotels, so we have to remove these records.

listings = listings.loc[listings.room_type != 'Hotel room']
listings.room_type.value_counts()
Entire home/apt    11098
Private room        7390
Shared room          258
Hotel room             0
Name: room_type, dtype: int64

Analysis of numerical variables

Of the numeric variables, it makes sense to analyze from price to availability_365, that is, from column positions 8 to 11

listings.iloc[:,8:12].describe().T
count mean std min 25% 50% 75% max
price 18746.0 129.271365 432.384680 8.0 36.0 64.0 105.0 9999.0
minimum_nights 18746.0 7.295850 35.430022 1.0 1.0 2.0 3.0 1125.0
calculated_host_listings_count 18746.0 10.731676 26.429455 1.0 1.0 2.0 6.0 194.0
availability_365 18746.0 153.761656 140.363063 0.0 0.0 113.0 310.0 365.0

Conclusions:

  • In the price you have to check minimums and maximums
  • In minimum_nights you have to review the maximums
  • In calculated_host_listings_count you have to check the maximums

We will review minimums and maximums in the price

listings.price.plot.kde()
<AxesSubplot:ylabel='Density'>

png

We will focus now on highs values

plt.figure(figsize=(16,8))
listings.price.loc[listings.price > 1000].value_counts().sort_index().plot.bar()
plt.xticks(size = 10);

png

Conclusion:

  • the value 9999 is usually a way to impute nulls, but in this case its frequency is not too far from other values that may be valid, such as 8000, so we won't touch it

We will check the values close to zero

plt.figure(figsize=(16,8))
listings.price.loc[listings.price < 30].value_counts().sort_index().plot.bar()
plt.xticks(size = 10);

png

Conclusion:

  • There is a peak at 20 euros, and it seems that below that amount it would be difficult to obtain profitability, so we are going to rule out properties that are rented below 20 euros
listings = listings.loc[listings.price > 19]
listings
id name host_id neighbourhood_group neighbourhood latitude ...
0 6369 Rooftop terrace room , ensuite bathroom 13660 Chamartín Hispanoamérica 40.457240 ...
1 21853 Bright and airy room 83531 Latina Cármenes 40.403810 ...
2 23001 Apartmento Arganzuela- Madrid Rio 82175 Arganzuela Legazpi 40.388400 ...
3 24805 Gran Via Studio Madrid 346366726 Centro Universidad 40.421830 ...
4 26825 Single Room whith private Bathroom 114340 Arganzuela Legazpi 40.389750 ...
... ... ... ... ... ... ... ...
18904 52182264 Enormous Private Room in 12-Bedroom Unit - los... 378060726 Centro Justicia 40.424384 ...
18905 52182273 Stunning Private Room in 11-Bedroom Unit - los... 378060726 Centro Justicia 40.424447 ...
18906 52182303 Classic Private Room in 7-Bedroom Unit - los 3... 378060726 Centro Justicia 40.424989 ...
18907 52182321 Elegant Private Room in 12-Bedroom Unit - los ... 378060726 Salamanca Recoletos 40.424352 ...
18908 52182334 Fashioned Private Room in 12-Bedroom Unit - lo... 378060726 Centro Justicia 40.425670 ...

17710 rows × 12 columns

For minimum_nights and alculated_host_listings_count you would have to do a similar exercise.

However, it is not something that is core in our analysis and therefore I leave it as homework for you to practice.

Table listings_det

Overview

listings_det.head()
index id listing_url scrape_id last_scraped name ...
0 0 6369 https://www.airbnb.com/rooms/6369 20210910193531 2021-09-11 Rooftop terrace room , ensuite bathroom ...
1 1 21853 https://www.airbnb.com/rooms/21853 20210910193531 2021-09-11 Bright and airy room ...
2 2 23001 https://www.airbnb.com/rooms/23001 20210910193531 2021-09-11 Apartmento Arganzuela- Madrid Rio ...
3 3 24805 https://www.airbnb.com/rooms/24805 20210910193531 2021-09-11 Gran Via Studio Madrid ...
4 4 26825 https://www.airbnb.com/rooms/26825 20210910193531 2021-09-11 Single Room whith private Bathroom ...

5 rows × 75 columns

listings_det.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   index                                         18909 non-null  int64  
 1   id                                            18909 non-null  int64  
 2   listing_url                                   18909 non-null  object 
 3   scrape_id                                     18909 non-null  int64  
 4   last_scraped                                  18909 non-null  object 
 5   name                                          18906 non-null  object 
 6   description                                   17854 non-null  object 
 7   neighborhood_overview                         10997 non-null  object 
 8   picture_url                                   18908 non-null  object 
 9   host_id                                       18909 non-null  int64  
 10  host_url                                      18909 non-null  object 
 11  host_name                                     18883 non-null  object 
 12  host_since                                    18883 non-null  object 
 13  host_location                                 18841 non-null  object 
 14  host_about                                    9427 non-null   object 
 15  host_response_time                            11972 non-null  object 
 16  host_response_rate                            11972 non-null  object 
 17  host_acceptance_rate                          11531 non-null  object 
 18  host_is_superhost                             18883 non-null  object 
 19  host_thumbnail_url                            18883 non-null  object 
 20  host_picture_url                              18883 non-null  object 
 21  host_neighbourhood                            12034 non-null  object 
 22  host_listings_count                           18883 non-null  float64
 23  host_total_listings_count                     18883 non-null  float64
 24  host_verifications                            18909 non-null  object 
 25  host_has_profile_pic                          18883 non-null  object 
 26  host_identity_verified                        18883 non-null  object 
 27  neighbourhood                                 10997 non-null  object 
 28  neighbourhood_cleansed                        18909 non-null  object 
 29  neighbourhood_group_cleansed                  18909 non-null  object 
 30  latitude                                      18909 non-null  float64
 31  longitude                                     18909 non-null  float64
 32  property_type                                 18909 non-null  object 
 33  room_type                                     18909 non-null  object 
 34  accommodates                                  18909 non-null  int64  
 35  bathrooms                                     0 non-null      float64
 36  bathrooms_text                                18884 non-null  object 
 37  bedrooms                                      17475 non-null  float64
 38  beds                                          18568 non-null  float64
 39  amenities                                     18909 non-null  object 
 40  price                                         18909 non-null  object 
 41  minimum_nights                                18909 non-null  int64  
 42  maximum_nights                                18909 non-null  int64  
 43  minimum_minimum_nights                        18908 non-null  float64
 44  maximum_minimum_nights                        18908 non-null  float64
 45  minimum_maximum_nights                        18908 non-null  float64
 46  maximum_maximum_nights                        18908 non-null  float64
 47  minimum_nights_avg_ntm                        18908 non-null  float64
 48  maximum_nights_avg_ntm                        18908 non-null  float64
 49  calendar_updated                              0 non-null      float64
 50  has_availability                              18909 non-null  object 
 51  availability_30                               18909 non-null  int64  
 52  availability_60                               18909 non-null  int64  
 53  availability_90                               18909 non-null  int64  
 54  availability_365                              18909 non-null  int64  
 55  calendar_last_scraped                         18909 non-null  object 
 56  number_of_reviews                             18909 non-null  int64  
 57  number_of_reviews_ltm                         18909 non-null  int64  
 58  number_of_reviews_l30d                        18909 non-null  int64  
 59  first_review                                  13877 non-null  object 
 60  last_review                                   13877 non-null  object 
 61  review_scores_rating                          13877 non-null  float64
 62  review_scores_accuracy                        13638 non-null  float64
 63  review_scores_cleanliness                     13640 non-null  float64
 64  review_scores_checkin                         13640 non-null  float64
 65  review_scores_communication                   13640 non-null  float64
 66  review_scores_location                        13637 non-null  float64
 67  review_scores_value                           13636 non-null  float64
 68  license                                       2828 non-null   object 
 69  instant_bookable                              18909 non-null  object 
 70  calculated_host_listings_count                18909 non-null  int64  
 71  calculated_host_listings_count_entire_homes   18909 non-null  int64  
 72  calculated_host_listings_count_private_rooms  18909 non-null  int64  
 73  calculated_host_listings_count_shared_rooms   18909 non-null  int64  
 74  reviews_per_month                             13877 non-null  float64
dtypes: float64(22), int64(18), object(35)
memory usage: 10.8+ MB

Variables and Types

We are going to select only those variables that provide us with relevant information for our objectives.

a_incluir = ['id',
              'description',
              'host_is_superhost',
              'accommodates',
              'bathrooms',
              'bedrooms',
              'beds',
              'number_of_reviews',
              'review_scores_rating',
              'review_scores_communication',
              'review_scores_location'
             ]

listings_det = listings_det.loc[:,a_incluir]

listings_det
id description host_is_superhost accommodates bathrooms bedrooms ...
0 6369 Excellent connection with the AIRPORT and EXHI... t 2 NaN 1.0 ...
1 21853 We have a quiet and sunny room with a good vie... f 1 NaN 1.0 ...
2 23001 Apartamento de tres dormitorios dobles, gran s... f 6 NaN 1.0 ...
3 24805 Studio located 50 meters from Gran Via, next t... f 3 NaN 1.0 ...
4 26825 Nice and cozy roon for one person with a priva... f 1 NaN 1.0 ...
... ... ... ... ... ... ... ...
18904 52182264 ROOM - 8 sqm. • 1st Floor <br /> <br />cozy ro... f 1 NaN 1.0 ...
18905 52182273 ROOM - 10 sqm. • 4th Floor <br /> <br />cozy r... f 1 NaN 1.0 ...
18906 52182303 ROOM <br /> <br />cozy room in madrid centro i... f 1 NaN 1.0 ...
18907 52182321 ROOM -9 sqm. • 1st Floor <br /> <br />cozy roo... f 1 NaN 1.0 ...
18908 52182334 ROOM - 10 sqm. • 1st Floor <br /> <br />cozy r... f 1 NaN 1.0 ...

18909 rows × 11 columns

We are going to analyze data types

listings_det.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   id                           18909 non-null  int64  
 1   description                  17854 non-null  object 
 2   host_is_superhost            18883 non-null  object 
 3   accommodates                 18909 non-null  int64  
 4   bathrooms                    0 non-null      float64
 5   bedrooms                     17475 non-null  float64
 6   beds                         18568 non-null  float64
 7   number_of_reviews            18909 non-null  int64  
 8   review_scores_rating         13877 non-null  float64
 9   review_scores_communication  13640 non-null  float64
 10  review_scores_location       13637 non-null  float64
dtypes: float64(6), int64(3), object(2)
memory usage: 1.6+ MB

Conclusion: change host_is_superhost to categoric.

listings_det['host_is_superhost'] = listings_det['host_is_superhost'].astype('category')

listings_det.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   id                           18909 non-null  int64   
 1   description                  17854 non-null  object  
 2   host_is_superhost            18883 non-null  category
 3   accommodates                 18909 non-null  int64   
 4   bathrooms                    0 non-null      float64 
 5   bedrooms                     17475 non-null  float64 
 6   beds                         18568 non-null  float64 
 7   number_of_reviews            18909 non-null  int64   
 8   review_scores_rating         13877 non-null  float64 
 9   review_scores_communication  13640 non-null  float64 
 10  review_scores_location       13637 non-null  float64 
dtypes: category(1), float64(6), int64(3), object(1)
memory usage: 1.5+ MB

Null analysis

listings_det.isna().sum()
id                                 0
description                     1055
host_is_superhost                 26
accommodates                       0
bathrooms                      18909
bedrooms                        1434
beds                             341
number_of_reviews                  0
review_scores_rating            5032
review_scores_communication     5269
review_scores_location          5272
dtype: int64

Conclusions:

  • bathrooms is completely null, so we remove it
  • description nothing happens because it has nulls, so we leave it
  • host_is_superhost has very few nulls and is not a super relevant variable, so we leave it
  • beds: we can try to impute it from accommodates
  • bedrooms is an important variable for us, we can try imputing nulls through proxies like accommodates or beds

Let's see if we can make an imputation of beds based on the number of people that can be accommodated.

pd.crosstab(listings_det.beds, listings_det.accommodates)
accommodates 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ..
beds
0.0 222 259 41 68 15 12 5 3 0 2 0 0 0 0 ..
1.0 2765 4911 399 639 14 9 0 3 0 2 0 0 0 0 ..
2.0 120 1110 1016 2277 125 148 3 3 0 1 0 0 1 0 ..
3.0 18 75 231 925 395 582 18 26 1 1 0 0 0 0 ..
4.0 20 28 32 216 214 498 73 92 3 9 0 1 0 0 ..
5.0 2 5 6 20 42 178 75 94 9 30 2 2 0 1 ..
6.0 14 9 2 5 7 62 21 61 9 18 4 8 2 0 ..
7.0 0 1 0 2 2 6 10 19 8 17 2 6 1 4 ..
8.0 3 2 0 0 3 2 0 21 3 18 2 9 2 2 ..
9.0 0 0 0 0 1 1 0 0 0 6 2 8 1 0 ..
10.0 10 2 1 1 2 0 0 0 0 7 0 2 1 0 ..
11.0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 ..
12.0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 ..
13.0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ..
14.0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ..
15.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ..
16.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ..
17.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ..
18.0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 ..
23.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ..
24.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 ..

It seems that we could make a more or less direct assignment. Reading the matrix vertically we see that:

  • one or two people usually correspond to a bed
  • three or four people usually correspond to two beds
  • five or six people usually correspond to three beds
  • we will put four beds for more than 6 people

We review the number of nulls and the frequency of each value

listings_det['beds'].value_counts(dropna = False)
1.0     8751
2.0     4805
3.0     2272
4.0     1186
0.0      628
5.0      466
NaN      341
6.0      222
7.0       78
8.0       68
10.0      39
9.0       20
14.0       6
11.0       6
12.0       5
16.0       5
13.0       4
18.0       2
17.0       2
24.0       1
15.0       1
23.0       1
Name: beds, dtype: int64

We create a function to impute the nulls of beds based on accommodates

def imputar_nulos(registro):
    #Lista de condiciones
    condiciones = [(registro.accommodates <= 2),
               (registro.accommodates > 2) & (registro.accommodates <= 4),
               (registro.accommodates > 4) & (registro.accommodates <= 6),
               (registro.accommodates > 6)]

    #Lista de resultados
    resultados = [1,2,3,4]

    #Salida
    return(np.select(condiciones,resultados, default = -999))

#Imputación
listings_det.loc[listings_det.beds.isna(),'beds'] = listings_det.loc[listings_det.beds.isna()].apply(imputar_nulos, axis = 1).astype('float64')

Review

listings_det.beds.value_counts(dropna = False)
1.0     9061
2.0     4828
3.0     2276
4.0     1190
0.0      628
5.0      466
6.0      222
7.0       78
8.0       68
10.0      39
9.0       20
14.0       6
11.0       6
12.0       5
16.0       5
13.0       4
18.0       2
17.0       2
24.0       1
15.0       1
23.0       1
Name: beds, dtype: int64

Now let's see if we can make an imputation of bedrooms.

We start by crossing the number of rooms with the number of people that can be accommodated.

pd.crosstab(listings_det.bedrooms, listings_det.accommodates)
accommodates 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ..
bedrooms
1.0 3009 5569 1141 2265 111 81 4 14 0 7 0 0 0 0 ..
2.0 74 173 348 1560 444 852 51 48 4 6 1 0 0 1 ..
3.0 21 22 45 104 220 509 106 158 7 20 4 13 1 2 ..
4.0 16 21 1 12 18 38 34 96 10 39 2 15 5 2 ..
5.0 9 6 0 0 4 3 4 6 9 34 3 8 1 3 ..
6.0 3 3 0 0 0 1 0 1 3 4 2 1 1 1 ..
7.0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 ..
8.0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 ..
9.0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 ..
10.0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 ..
14.0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 ..
15.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ..
18.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 ..

It doesn't seem very reliable.

Let's contrast it with the number of beds.

pd.crosstab(listings_det.bedrooms, listings_det.beds, dropna=False)
beds 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 ...
bedrooms
1.0 451 8038 3039 459 139 15 36 1 13 1 ...
2.0 65 108 1372 1329 532 110 33 7 5 0 ...
3.0 7 28 22 414 393 242 80 25 11 7 ...
4.0 6 23 3 7 95 63 56 27 19 6 ...
5.0 1 14 0 0 2 26 10 16 17 2 ...
6.0 2 4 0 0 0 0 6 0 2 3 ...
7.0 0 0 1 0 1 0 0 1 1 0 ...
8.0 0 3 0 0 0 0 0 0 0 1 ...
9.0 0 0 0 0 0 0 0 1 0 0 ...
10.0 0 0 0 0 0 0 0 0 0 0 ...
14.0 1 0 0 0 0 0 0 0 0 0 ...
15.0 0 0 0 0 0 0 0 0 0 0 ...
18.0 0 1 0 0 0 0 0 0 0 0 ...

13 rows × 21 columns

Here we could make a more direct assignment. Reading the matrix vertically we see that:

  • zero, one or two beds usually correspond to a room
  • three or four beds usually correspond to two rooms
  • five or six beds usually correspond to three rooms
  • we are going to put four rooms with more beds

We are going to modify the function that we had created to impute the nulls of bedrooms from beds.

First we count the bedrooms

listings_det.bedrooms.value_counts(dropna = False)
1.0     12212
2.0      3562
NaN      1434
3.0      1233
4.0       311
5.0       105
6.0        24
7.0        11
8.0         7
9.0         3
10.0        3
14.0        2
15.0        1
18.0        1
Name: bedrooms, dtype: int64

We execute the updated function

def imputar_nulos(registro):
    #Lista de condiciones
    condiciones = [(registro.beds <= 2),
               (registro.beds > 2) & (registro.beds <= 4),
               (registro.beds > 4) & (registro.beds <= 6),
               (registro.beds > 6)]

    #Lista de resultados
    resultados = [1,2,3,4]

    #Salida
    return(np.select(condiciones,resultados, default = -999))

#Imputación
listings_det.loc[listings_det.bedrooms.isna(),'bedrooms'] = listings_det.loc[listings_det.bedrooms.isna()].apply(imputar_nulos, axis = 1).astype('float64')

Review

listings_det.bedrooms.value_counts(dropna = False)
1.0     13540
2.0      3657
3.0      1244
4.0       311
5.0       105
6.0        24
7.0        11
8.0         7
9.0         3
10.0        3
14.0        2
15.0        1
18.0        1
Name: bedrooms, dtype: int64

Finally we are going to eliminate bathrooms

listings_det.drop(columns = 'bathrooms', inplace = True)
listings_det
id description host_is_superhost accommodates bedrooms beds ...
0 6369 Excellent connection with the AIRPORT and EXHI... t 2 1.0 1.0 ...
1 21853 We have a quiet and sunny room with a good vie... f 1 1.0 1.0 ...
2 23001 Apartamento de tres dormitorios dobles, gran s... f 6 3.0 5.0 ...
3 24805 Studio located 50 meters from Gran Via, next t... f 3 1.0 1.0 ...
4 26825 Nice and cozy roon for one person with a priva... f 1 1.0 1.0 ...
... ... ... ... ... ... ... ...
18904 52182264 ROOM - 8 sqm. • 1st Floor <br /> <br />cozy ro... f 1 1.0 0.0 ...
18905 52182273 ROOM - 10 sqm. • 4th Floor <br /> <br />cozy r... f 1 1.0 0.0 ...
18906 52182303 ROOM <br /> <br />cozy room in madrid centro i... f 1 1.0 0.0 ...
18907 52182321 ROOM -9 sqm. • 1st Floor <br /> <br />cozy roo... f 1 1.0 0.0 ...
18908 52182334 ROOM - 10 sqm. • 1st Floor <br /> <br />cozy r... f 1 1.0 0.0 ...

18909 rows × 10 columns

Duplicate analysis

We check if there are any duplicate records

listings_det.duplicated().sum()
0
listings_det.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype   
---  ------                       --------------  -----   
 0   id                           18909 non-null  int64   
 1   description                  17854 non-null  object  
 2   host_is_superhost            18883 non-null  category
 3   accommodates                 18909 non-null  int64   
 4   bedrooms                     18909 non-null  float64 
 5   beds                         18909 non-null  float64 
 6   number_of_reviews            18909 non-null  int64   
 7   review_scores_rating         13877 non-null  float64 
 8   review_scores_communication  13640 non-null  float64 
 9   review_scores_location       13637 non-null  float64 
dtypes: category(1), float64(5), int64(3), object(1)
memory usage: 1.3+ MB

Analysis of categorical variables

We are going to analyze the values and frequencies of the categorical variables

listings_det.host_is_superhost.value_counts()
f    15423
t     3460
Name: host_is_superhost, dtype: int64

Analysis of numerical variables

listings_det.describe(include = 'number').T
count mean std min 25% 50% ..
id 18909.0 3.055841e+07 1.448889e+07 6369.0 19305558.00 33537761.00 ..
accommodates 18909.0 3.094928e+00 1.963715e+00 0.0 2.00 2.00 ..
bedrooms 18909.0 1.414670e+00 8.116969e-01 1.0 1.00 1.00 ..
beds 18909.0 1.909197e+00 1.444317e+00 0.0 1.00 1.00 ..
number_of_reviews 18909.0 3.270612e+01 6.524996e+01 0.0 0.00 5.00 ..
review_scores_rating 13877.0 4.549404e+00 7.834379e-01 0.0 4.50 4.75 ..
review_scores_communication 13640.0 4.776612e+00 4.632929e-01 1.0 4.75 4.92 ..
review_scores_location 13637.0 4.788151e+00 3.774844e-01 1.0 4.73 4.92 ..

We don't see anything strange.

At this point we have already detected and corrected the main data quality problems, so we went on to create the analytical datamart integrating our tables.

ANALYTICAL DATAMART

We have 2 main tables:

  • listings
  • listings_det

And we know that both share the id field, so we can cross them through it.

The main table is listings, since what the detail table does is give us additional data.

Therefore, the table that has to control in the integration is listings.

In addition, we also have the price table, which in this case conceptually crosses with listings through the district (neighbourhood_group).

Although we haven't checked yet that the literals are the same, so some manual correction may be necessary.

Let's start with the 2 main ones.

Since the listings table is going to be sent, the final result will have to have as many rows as listings and as many columns as those of both tables (minus 1 for the id, which will remain as a single variable).

listings.shape
(17710, 12)
listings_det.shape
(18909, 10)

If it goes well, the final table will have 17710 rows and 21 columns.

df = pd.merge(left = listings, right = listings_det, how = 'left', on = 'id')
df
id name host_id neighbourhood_group neighbourhood latitude ...
0 6369 Rooftop terrace room , ensuite bathroom 13660 Chamartín Hispanoamérica 40.457240 ...
1 21853 Bright and airy room 83531 Latina Cármenes 40.403810 ...
2 23001 Apartmento Arganzuela- Madrid Rio 82175 Arganzuela Legazpi 40.388400 ...
3 24805 Gran Via Studio Madrid 346366726 Centro Universidad 40.421830 ...
4 26825 Single Room whith private Bathroom 114340 Arganzuela Legazpi 40.389750 ...
... ... ... ... ... ... ... ...
17705 52182264 Enormous Private Room in 12-Bedroom Unit - los... 378060726 Centro Justicia 40.424384 ...
17706 52182273 Stunning Private Room in 11-Bedroom Unit - los... 378060726 Centro Justicia 40.424447 ...
17707 52182303 Classic Private Room in 7-Bedroom Unit - los 3... 378060726 Centro Justicia 40.424989 ...
17708 52182321 Elegant Private Room in 12-Bedroom Unit - los ... 378060726 Salamanca Recoletos 40.424352 ...
17709 52182334 Fashioned Private Room in 12-Bedroom Unit - lo... 378060726 Centro Justicia 40.425670 ...

17710 rows × 21 columns

Now we are going to see how we can incorporate the external information of the price per square meter.

To do this, the first thing is to analyze the values of the district variable in both tables, since they need to coincide so that we can cross them.

In df the variable is categorical, so to get the levels we have to use .categories

distritos1 = pd.Series(df.neighbourhood_group.unique().categories).sort_values()
distritos1
2                Arganzuela
16                  Barajas
10              Carabanchel
3                    Centro
0                 Chamartín
7                  Chamberí
14            Ciudad Lineal
6     Fuencarral - El Pardo
9                 Hortaleza
1                    Latina
18        Moncloa - Aravaca
19                Moratalaz
5        Puente de Vallecas
11                   Retiro
4                 Salamanca
13    San Blas - Canillejas
12                   Tetuán
17                    Usera
20                Vicálvaro
15        Villa de Vallecas
8                Villaverde
dtype: object
distritos2 = precio_m2.distrito
distritos2
1             Arganzuela
2                Barajas
3            Carabanchel
4                 Centro
5              Chamartín
6               Chamberí
7          Ciudad Lineal
8             Fuencarral
9              Hortaleza
10                Latina
11               Moncloa
12             Moratalaz
13    Puente de Vallecas
14                Retiro
15             Salamanca
16              San Blas
17                Tetuán
18                 Usera
19             Vicálvaro
20     Villa de Vallecas
21            Villaverde
Name: distrito, dtype: object

Comparing everything seems the same except:

  • Fuencarral - El Pardo
  • Moncloa - Aravaca
  • San Blas - Canillejas

Therefore we are going to replace these values in price_m2 so that they are equal to those of df and we can cross them

precio_m2.distrito = precio_m2.distrito.map({'Fuencarral':'Fuencarral - El Pardo',
                        'Moncloa':'Moncloa - Aravaca',
                        'San Blas':'San Blas - Canillejas'}) \
                    .fillna(precio_m2.distrito)

precio_m2
precio_m2 distrito
1 4085 Arganzuela
2 3409 Barajas
3 2123 Carabanchel
4 4827 Centro
5 5098 Chamartín
6 5381 Chamberí
7 2940 Ciudad Lineal
8 3568 Fuencarral - El Pardo
9 3871 Hortaleza
10 2267 Latina
11 4033 Moncloa - Aravaca
12 2500 Moratalaz
13 1918 Puente de Vallecas
14 4788 Retiro
15 6114 Salamanca
16 2591 San Blas - Canillejas
17 3678 Tetuán
18 1995 Usera
19 2403 Vicálvaro
20 2354 Villa de Vallecas
21 1693 Villaverde

Now we can cross them.

df is our key dataframe.

df = pd.merge(left = df, right = precio_m2, how = 'left', left_on='neighbourhood_group', right_on='distrito')
df
id name host_id neighbourhood_group neighbourhood latitude ...
0 6369 Rooftop terrace room , ensuite bathroom 13660 Chamartín Hispanoamérica 40.457240 ...
1 21853 Bright and airy room 83531 Latina Cármenes 40.403810 ...
2 23001 Apartmento Arganzuela- Madrid Rio 82175 Arganzuela Legazpi 40.388400 ...
3 24805 Gran Via Studio Madrid 346366726 Centro Universidad 40.421830 ...
4 26825 Single Room whith private Bathroom 114340 Arganzuela Legazpi 40.389750 ...
... ... ... ... ... ... ... ...
17705 52182264 Enormous Private Room in 12-Bedroom Unit - los... 378060726 Centro Justicia 40.424384 ...
17706 52182273 Stunning Private Room in 11-Bedroom Unit - los... 378060726 Centro Justicia 40.424447 ...
17707 52182303 Classic Private Room in 7-Bedroom Unit - los 3... 378060726 Centro Justicia 40.424989 ...
17708 52182321 Elegant Private Room in 12-Bedroom Unit - los ... 378060726 Salamanca Recoletos 40.424352 ...
17709 52182334 Fashioned Private Room in 12-Bedroom Unit - lo... 378060726 Centro Justicia 40.425670 ...

17710 rows × 23 columns

We check that no nulls have been generated in the union.

df.precio_m2.isna().sum()
0

SAVE TO DATABASE

Now that we have the analysis board, we are going to save it in the database so that every time we want to do an analysis we do not have to repeat all the processing of this notebook.

df.to_sql('df', con = con, if_exists = 'replace')

DATA PREPARATION

In this phase we are going to create new variables or transform the existing ones in order to give a better response to our objective.

We are going to give examples of both how to use internal variables and how to enrich with external variables.

SET UP

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sqlalchemy as sa

#Automcompletar rápido
%config IPCompleter.greedy=True

pd.options.display.max_columns = None

DATA UPLOAD

con = sa.create_engine('sqlite:///../Datos/airbnb.db')

df = pd.read_sql('df', con = con)

df.head()
index id name host_id neighbourhood_group neighbourhood latitude ...
0 0 6369 Rooftop terrace room , ensuite bathroom 13660 Chamartín Hispanoamérica 40.45724 ...
1 1 21853 Bright and airy room 83531 Latina Cármenes 40.40381 ...
2 2 23001 Apartmento Arganzuela- Madrid Rio 82175 Arganzuela Legazpi 40.38840 ...
3 3 24805 Gran Via Studio Madrid 346366726 Centro Universidad 40.42183 ...
4 4 26825 Single Room whith private Bathroom 114340 Arganzuela Legazpi 40.38975 ...

PREPARATION OF VARIABLES

Creation of lever kpis

First we are going to create the analysis variables, that is, those that we had identified as the Kpis that we will use in the levers that influence the business.

We had said that there were 3:

  • price per night: we already have this directly in the price variable, but we are going to review it to see that we understand it correctly
  • occupation: we have availability_365 but it must be transformed
  • price of the property: we will have to create this with external variables so we leave it for later

We start with the price.

The documentation does not clarify whether the price is for the entire property, or if a room is rented, it is per room.

It is a key piece of information to be able to make an assessment of the potential income of a property.

We are going to try to understand it by analyzing the average price by type of rental.

It is important to filter by only one district so as not to include the "zone" effect.

So first we choose a district that has a lot of data.

df.distrito.value_counts()
Centro                   8127
Salamanca                1218
Chamberí                 1089
Arganzuela               1005
Tetuán                    762
Retiro                    618
Carabanchel               581
Chamartín                 538
Ciudad Lineal             532
Moncloa - Aravaca         493
Puente de Vallecas        465
San Blas - Canillejas     463
Latina                    455
Hortaleza                 354
Fuencarral - El Pardo     265
Usera                     233
Barajas                   142
Villaverde                129
Moratalaz                  94
Villa de Vallecas          90
Vicálvaro                  57
Name: distrito, dtype: int64
df.loc[df.distrito == 'Centro',:].groupby('room_type').price.mean()
room_type
Entire home/apt    148.859980
Private room        67.131643
Shared room         60.464286
Name: price, dtype: float64

Conclusion:

  • renting the apartment has an average price of €148
  • renting a room has an average price of €60 or €67 depending on whether it is shared or private
  • Therefore, to calculate the "income" of a property, we must multiply the price by the number of rooms when it is of the Private room or Shared room types

However, multiplying the price by the total number of rooms can artificially bias upward the income-generating capacity of a property.

Since if it is rented by rooms, it is not likely that it will always be 100%

Therefore we should weight it by the average percentage of rented rooms.

We do not have that data, but suppose we have spoken with the business manager and he has told us that it is 70%.

We can create the total price variable by applying apply on a custom function.

def crear_precio_total(registro):
    if (registro.beds > 1) & ((registro.room_type == 'Private room') | (registro.room_type == 'Shared room')):
        salida = registro.price * registro.beds * 0.7
    else:
        salida = registro.price
    return(salida)

df['precio_total'] = df.apply(crear_precio_total, axis = 1)

Comprobamos

df[['room_type','price','beds','precio_total']].head(30)
room_type price beds precio_total
0 Private room 60 1.0 60.0
1 Private room 31 1.0 31.0
2 Entire home/apt 50 5.0 50.0
3 Entire home/apt 92 1.0 92.0
4 Private room 26 1.0 26.0
5 Entire home/apt 85 3.0 85.0
6 Entire home/apt 65 2.0 65.0
7 Entire home/apt 54 1.0 54.0
8 Entire home/apt 1400 3.0 1400.0
9 Entire home/apt 81 2.0 81.0
10 Entire home/apt 90 3.0 90.0
11 Entire home/apt 82 2.0 82.0
12 Private room 40 1.0 40.0
13 Private room 36 2.0 50.4
14 Entire home/apt 51 1.0 51.0
15 Entire home/apt 76 2.0 76.0
16 Entire home/apt 54 3.0 54.0
17 Entire home/apt 60 3.0 60.0
18 Entire home/apt 55 2.0 55.0
19 Entire home/apt 121 6.0 121.0
20 Entire home/apt 80 2.0 80.0
21 Entire home/apt 63 2.0 63.0
22 Entire home/apt 63 1.0 63.0
23 Entire home/apt 46 1.0 46.0
24 Entire home/apt 171 3.0 171.0
25 Private room 55 1.0 55.0
26 Private room 26 2.0 36.4
27 Private room 31 2.0 43.4
28 Entire home/apt 60 1.0 60.0
29 Entire home/apt 149 2.0 149.0

Now we go with the occupation

The variable we have that allows us to measure this is availability_365.

This variable tells us the number of days in a year that the property is NOT occupied.

Therefore, we would be interested in transforming it into a more direct measure of occupancy, for example, the % of the year that it IS occupied.

We can do it with a direct transformation.

df['ocupacion'] = ((365 - df.availability_365) / 365 * 100).astype('int')
df.head()
index id name ... precio_m2 distrito precio_total ocupacion
0 0 6369 Rooftop terrace room , ensuite bathroom ... 5098 Chamartín 60.0 50
1 1 21853 Bright and airy room ... 2267 Latina 31.0 0
2 2 23001 Apartmento Arganzuela- Madrid Rio ... 4085 Arganzuela 50.0 39
3 3 24805 Gran Via Studio Madrid ... 4827 Centro 92.0 68
4 4 26825 Single Room whith private Bathroom ... 4085 Arganzuela 26.0 4

Transformation of analysis variables

Some of the seed questions are aimed at checking how the price or occupancy behaves according to other variables such as the number of rooms, the average rating, etc.

Normally we can do these analyzes better if we discretize the analysis variable.

In our case, the candidates for this analysis are: accommodates, bedrooms, beds and number_of_reviews.

In bedrooms, a more personalized discretization makes sense. In the others we can make it automatic.

Discretize bedrooms

We begin by evaluating the distribution of the data.

df.bedrooms.value_counts().plot.bar();

png

We are going to discretize for 1,2,3 and more than 3.

We can use np.select

condiciones = [df.bedrooms == 1,
               df.bedrooms == 2,
               df.bedrooms == 3,
               df.bedrooms > 3]

resultados = ['01_Una','02_Dos','03_Tres','04_Cuatro o mas']

df['bedrooms_disc'] = np.select(condiciones, resultados, default = -999)

Review

df.bedrooms_disc.value_counts().plot.bar();

png

Discretize accommodates, beds and number_of_reviews

We are going to use qcut to discriminate with percentiles 0.5, 0.8, 1

df['accommodates_disc'] = pd.qcut(df.accommodates,[0, 0.5, 0.8, 1],
                                 labels = ['0-2','3','4-16'])

df['accommodates_disc'].value_counts().sort_index(ascending = False).plot.barh();

png

df['beds_disc'] = pd.qcut(df.beds,[0, 0.5, 0.8, 1],
                         labels = ['1','2','3-24'])

df['beds_disc'].value_counts().sort_index(ascending = False).plot.barh();

png

df['number_of_reviews_disc'] = pd.qcut(df.number_of_reviews,[0, 0.5, 0.8, 1],
                                      labels = ['1-4','5-48','48-744'])

df['number_of_reviews_disc'].value_counts().sort_index(ascending = False).plot.barh();

png

Creating variables with external data

In this particular case, many things could be done with external data.

The first thing, which we have already partially incorporated, is the property price lever.

We said that we could estimate it by multiplying the square meters of the property by the price per m2.

We have already obtained the price_m2, but we do not have the size of the property in the data.

What we can do is establish criteria based on the number of rooms.

It's not perfect, but it will serve as an approximation.

Estimate of the square meters of the property

Let's use the following algorithm:

  • a room: m2 = 50
  • two rooms: m2 = 70
  • three rooms: m2 = 90
  • four rooms: m2 = 120
  • five or more rooms: m2 = 150
condiciones = [df.bedrooms == 1,
               df.bedrooms == 2,
               df.bedrooms == 3,
               df.bedrooms == 4,
               df.bedrooms > 4]

resultados = [50,70,90,120,150]

df['m2'] = np.select(condiciones, resultados, default = -999)

Comprobamos

df['m2'].value_counts()
50     12422
70      3617
90      1228
120      296
150      147
Name: m2, dtype: int64

Now we can estimate the purchase price of the property.

We remember that we took 30% from the price we got for bargaining power.

df['precio_compra'] = df.m2 * df.precio_m2 * 0.7

Comprobamos

df[['bedrooms','m2','distrito','precio_m2','precio_compra']].head(20)
bedrooms m2 distrito precio_m2 precio_compra
0 1.0 50 Chamartín 5098 178430.0
1 1.0 50 Latina 2267 79345.0
2 3.0 90 Arganzuela 4085 257355.0
3 1.0 50 Centro 4827 168945.0
4 1.0 50 Arganzuela 4085 142975.0
5 3.0 90 Arganzuela 4085 257355.0
6 1.0 50 Centro 4827 168945.0
7 1.0 50 Centro 4827 168945.0
8 2.0 70 Centro 4827 236523.0
9 1.0 50 Centro 4827 168945.0
10 1.0 50 Centro 4827 168945.0
11 2.0 70 Salamanca 6114 299586.0
12 1.0 50 Centro 4827 168945.0
13 1.0 50 Centro 4827 168945.0
14 1.0 50 Centro 4827 168945.0
15 1.0 50 Centro 4827 168945.0
16 1.0 50 Centro 4827 168945.0
17 2.0 70 Centro 4827 236523.0
18 2.0 70 Centro 4827 236523.0
19 3.0 90 Centro 4827 304101.0

Now we are going to give an example of what other types of variables we can build.

In this case we could do a lot with the x,y coordinates.

Since in tourism the location is very important.

For example, we could calculate the distances to different points of interest such as monuments, entertainment venues, sports venues, etc.

Simply as an example we are going to calculate the distance from each property to Puerta del Sol.

To do this, we Google its longitude and latitude: https://www.123coordenadas.com/coordinates/81497-puerta-del-sol-madrid

Latitude: 40.4167278 Longitude: -3.7033387

Calculation of the distance of each property to Puerta del Sol

Given the curvature of the earth, the distance between two points based on their latitude and longitude is calculated using a formula called the Haversine distance.

A Google search gives us a built-in function to calculate it that we can adapt: Haversine

from math import radians, cos, sin, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):

      R = 6372.8 #En km, si usas millas tienes que cambiarlo por 3959.87433

      dLat = radians(lat2 - lat1)
      dLon = radians(lon2 - lon1)
      lat1 = radians(lat1)
      lat2 = radians(lat2)

      a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2
      c = 2*asin(sqrt(a))

      return R * c

We create the variable

#Las coordenadas de la Puerta del Sol serán lat1 y lon1
lat1 = 40.4167278
lon1 = -3.7033387

df['pdi_sol'] = df.apply(lambda registro: haversine(lat1,lon1,registro.latitude,registro.longitude),axis = 1)

We check by reviewing the average distance by districts.

df.groupby('distrito').pdi_sol.mean().sort_values()
distrito
Centro                    0.730611
Arganzuela                1.939114
Chamberí                  2.133167
Retiro                    2.450593
Salamanca                 2.715975
Moncloa - Aravaca         3.294526
Usera                     3.928874
Latina                    3.942634
Carabanchel               3.970238
Chamartín                 4.432842
Puente de Vallecas        4.481127
Tetuán                    4.624605
Moratalaz                 5.073901
Ciudad Lineal             5.231293
Villaverde                7.664192
Fuencarral - El Pardo     8.062301
Hortaleza                 8.074184
San Blas - Canillejas     8.199011
Vicálvaro                 8.599559
Villa de Vallecas         9.176618
Barajas                  11.439064
Name: pdi_sol, dtype: float64

WE SAVE IN THE DATAMART

We are going to save this version as df_preparado

df.to_sql('df_preparado', con = con, if_exists = 'replace')

ANALYSIS AND INSIGHTS

We come to the most important part, where we are going to try to obtain relevant conclusions for the objective using all the preparation work we have done, the Business Analytics techniques that we already know and also we are going to learn how to create a map visualization.

To do this, we will begin by answering the seed questions and it is likely that in the process interesting findings will emerge that lead us to new questions or the application of certain techniques.

SET UP

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sqlalchemy as sa

#Automcompletar rápido
%config IPCompleter.greedy=True

pd.options.display.max_columns = None
import fiser_tools as fs
fs.misc.dark_theme()

DATA UPLOAD

con = sa.create_engine('sqlite:///../DatosCaso1/airbnb.db')

df = pd.read_sql('df_preparado', con = con)

df.head()
level_0 index id name host_id neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights calculated_host_listings_count availability_365 description host_is_superhost accommodates bedrooms beds number_of_reviews review_scores_rating review_scores_communication review_scores_location precio_m2 distrito precio_total ocupacion bedrooms_disc accommodates_disc beds_disc number_of_reviews_disc m2 precio_compra pdi_sol
0 0 0 6369 Rooftop terrace room , ensuite bathroom 13660 Chamartín Hispanoamérica 40.45724 -3.67688 Private room 60 1 2 180 Excellent connection with the AIRPORT and EXHI... t 2 1.0 1.0 80 4.87 4.89 4.77 164 Chamartín 60.0 50 01_Una 0-2 1 48-744 50 5740.0 5.032039
1 1 1 21853 Bright and airy room 83531 Latina Cármenes 40.40381 -3.74130 Private room 31 4 2 364 We have a quiet and sunny room with a good vie... f 1 1.0 1.0 33 4.58 4.82 4.21 125 Latina 31.0 0 01_Una 0-2 1 5-48 50 4375.0 3.521406
2 2 2 23001 Apartmento Arganzuela- Madrid Rio 82175 Arganzuela Legazpi 40.38840 -3.69511 Entire home/apt 50 15 5 222 Apartamento de tres dormitorios dobles, gran s... f 6 3.0 5.0 0 NaN NaN NaN 155 Arganzuela 50.0 39 03_Tres 4-16 3-24 1-4 90 9765.0 3.226963
3 3 3 24805 Gran Via Studio Madrid 346366726 Centro Universidad 40.42183 -3.70529 Entire home/apt 92 5 1 115 Studio located 50 meters from Gran Via, next t... f 3 1.0 1.0 12 4.92 5.00 5.00 186 Centro 92.0 68 01_Una 0-2 1 5-48 50 6510.0 0.591065
4 4 4 26825 Single Room whith private Bathroom 114340 Arganzuela Legazpi 40.38975 -3.69018 Private room 26 2 1 349 Nice and cozy roon for one person with a priva... f 1 1.0 1.0 149 4.68 4.71 4.70 155 Arganzuela 26.0 4 01_Una 0-2 1 48-744 50 5425.0 3.200942

ANALYSIS

Price analysis

What is the average price? And the price range? And by districts? And by neighborhoods?

What is the ranking of districts and neighborhoods by average rental price?

df.precio_total.describe()
count    17710.000000
mean       149.027770
std        523.211567
min         20.000000
25%         43.000000
50%         70.000000
75%        112.000000
max      22400.000000
Name: precio_total, dtype: float64

We see that there is at least one outlier on the upper part that skews the mean, so we are going to use the median as a more reliable measure of centralization.

df.precio_total.median()
70.0
df.groupby('distrito').precio_total.median().sort_values(ascending = False)
distrito
San Blas - Canillejas    90.0
Salamanca                88.0
Centro                   76.0
Chamartín                74.0
Chamberí                 70.0
Hortaleza                69.0
Retiro                   68.0
Tetuán                   66.0
Moncloa - Aravaca        61.0
Fuencarral - El Pardo    60.0
Arganzuela               58.0
Vicálvaro                53.0
Ciudad Lineal            50.0
Barajas                  49.5
Carabanchel              48.0
Villa de Vallecas        47.5
Latina                   47.0
Usera                    42.0
Villaverde               42.0
Moratalaz                40.0
Puente de Vallecas       40.0
Name: precio_total, dtype: float64

The data from San Blas catches our attention, we are going to look at it in more detail to see what is happening.

df.loc[df.distrito == 'San Blas - Canillejas'].sort_values('precio_total',ascending = False).head(10)
level_0 index id name host_id neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights calculated_host_listings_count availability_365 description host_is_superhost accommodates bedrooms beds number_of_reviews review_scores_rating review_scores_communication review_scores_location precio_m2 distrito precio_total ocupacion bedrooms_disc accommodates_disc beds_disc number_of_reviews_disc m2 precio_compra pdi_sol
9290 9290 9290 34600594 3 rooms near Wanda 38951538 San Blas - Canillejas Arcos 40.42105 -3.61457 Private room 2800 1 2 0 3 rooms 15´walking from Wanda. f 6 3.0 5.0 0 NaN NaN NaN 120 San Blas - Canillejas 9800.0 100 03_Tres 4-16 3-24 1-4 90 7560.0 7.532233
9662 9662 9662 34826962 Pozos rooms 222068421 San Blas - Canillejas Arcos 40.41927 -3.61555 Private room 3000 1 1 88 Cerca nos podemos encontrar un centro comercia... f 4 2.0 3.0 0 NaN NaN NaN 120 San Blas - Canillejas 6300.0 75 02_Dos 3 2 1-4 70 5880.0 7.439364
10189 10189 10189 35149419 CHALET FOR THE FINAL CHAMPIONS LEAGUE MADRID 143732655 San Blas - Canillejas Canillejas 40.44937 -3.61633 Entire home/apt 6000 1 1 365 We are talking about one of the last villas fo... f 6 4.0 5.0 0 NaN NaN NaN 120 San Blas - Canillejas 6000.0 0 04_Cuatro o mas 4-16 3-24 1-4 120 10080.0 8.212429
9668 9668 9668 34833756 Wanda Champions Final 29628177 San Blas - Canillejas Rosas 40.42622 -3.60502 Private room 2700 1 1 0 Somos una pareja joven educados y muy sociable... f 5 2.0 3.0 0 NaN NaN NaN 120 San Blas - Canillejas 5670.0 100 02_Dos 4-16 2 1-4 70 5880.0 8.391922
9878 9878 9878 34970236 Habitaciones cerca Wanda Metropolitano (Champi... 263474389 San Blas - Canillejas Arcos 40.41920 -3.61229 Private room 800 1 1 88 Se alquilan 5 habitaciones, cada una para 2 hu... f 10 5.0 8.0 0 NaN NaN NaN 120 San Blas - Canillejas 4480.0 75 04_Cuatro o mas 4-16 3-24 1-4 150 12600.0 7.714952
9748 9748 9748 34882596 2 rooms near wanda 38951538 San Blas - Canillejas Arcos 40.41948 -3.61427 Private room 1500 1 2 363 2 rooms near Wanda stadium (20' walking) with ... f 4 2.0 4.0 0 NaN NaN NaN 120 San Blas - Canillejas 4200.0 0 02_Dos 3 3-24 1-4 70 5880.0 7.548577
9173 9173 9173 34522997 Beautiful Penthouse next to Wanda Metropolitano 260551569 San Blas - Canillejas Canillejas 40.43686 -3.61093 Private room 2000 1 1 0 Wonderful penthouse in a lovely duplex next to... f 4 1.0 3.0 0 NaN NaN NaN 120 San Blas - Canillejas 4200.0 100 01_Una 3 2 1-4 50 4200.0 8.138310
9831 9831 9831 34933116 FINAL CHAMPIONS LEAGUE MADRID 263232278 San Blas - Canillejas Simancas 40.43968 -3.61933 Private room 2000 2 1 365 Near (15 minutes walking) to WANDA STADIUM.<br... f 5 3.0 3.0 0 NaN NaN NaN 120 San Blas - Canillejas 4200.0 0 03_Tres 4-16 2 1-4 90 7560.0 7.557076
10092 10092 10092 35091283 FINAL CHAMPIONS LEAGUE 264225539 San Blas - Canillejas Rejas 40.44592 -3.58746 Private room 1000 1 1 180 Desayuno, comida y cena incluida. Traslados al... f 6 2.0 6.0 0 NaN NaN NaN 120 San Blas - Canillejas 4200.0 50 02_Dos 4-16 3-24 1-4 70 5880.0 10.334061
10404 10404 10404 35251141 Se alquila para final de la champion league 265376740 San Blas - Canillejas Canillejas 40.44292 -3.60764 Entire home/apt 3000 1 1 365 None f 1 2.0 4.0 0 NaN NaN NaN 120 San Blas - Canillejas 3000.0 0 02_Dos 0-2 3-24 1-4 70 5880.0 8.610212

We see that they are prices in the environment of 3,000 - 5,000 euros!

When reading the description we realize that all these prices are defined by the Champions League final.

Which is an interesting insight:

Insight 1: there may be properties with a regular residual value but with high value at specific times due to sporting events or shows

Would it make sense to generate a rental product that consists of renting in a "normal" way at a price below the market with the condition that the tenant leaves the apartment free to rent it "touristically" on designated dates?

In the rest there are no surprises, with districts like Salamanca, Centro or Chanmartín in the lead.

But for example we see that the average price difference between Retiro and Tetuán is very low.

This leads us to compare the average price per district with the average purchase price also per district.

temp = df.groupby('distrito')[['precio_total','precio_compra']].median()
temp
precio_total precio_compra
distrito
Arganzuela 58.0 5425.0
Barajas 49.5 4235.0
Carabanchel 48.0 4305.0
Centro 76.0 6510.0
Chamartín 74.0 5740.0
Chamberí 70.0 6300.0
Ciudad Lineal 50.0 4760.0
Fuencarral - El Pardo 60.0 4480.0
Hortaleza 69.0 4515.0
Latina 47.0 4375.0
Moncloa - Aravaca 61.0 5390.0
Moratalaz 40.0 3955.0
Puente de Vallecas 40.0 4305.0
Retiro 68.0 5670.0
Salamanca 88.0 6685.0
San Blas - Canillejas 90.0 4200.0
Tetuán 66.0 5600.0
Usera 42.0 4585.0
Vicálvaro 53.0 3850.0
Villa de Vallecas 47.5 4130.0
Villaverde 42.0 3920.0
plt.figure(figsize = (16,8))
sns.scatterplot(data = temp, x = 'precio_compra', y = 'precio_total')
#Ponemos las etiquetas
for cada in range(0,temp.shape[0]):
    plt.text(temp.precio_compra[cada], temp.precio_total[cada], temp.index[cada])

png

There is a clear correlation between the purchase price in each district and the price we will be able to charge.

Three clusters of low-low, medium-medium and high-high are clearly perceived.

And the exception of San Blas that we already know why it is.

Therefore, as expected, there is no clear "bargain" at this level a priori.

We are going to repeat the analysis at the neighborhood level to see if we identify anything.

temp = df.groupby('neighbourhood')[['precio_total','precio_compra']].median()
temp
precio_total precio_compra
neighbourhood
Abrantes 35.5 4305.0
Acacias 53.8 5425.0
Adelfas 54.0 7938.0
Aeropuerto 41.3 4235.0
Aguilas 47.0 4375.0
... ... ...
Valverde 62.5 4480.0
Ventas 42.0 4760.0
Vinateros 50.0 3955.0
Vista Alegre 50.0 4305.0
Zofío 36.4 4585.0

128 rows × 2 columns

plt.figure(figsize = (16,20))
sns.scatterplot(data = temp, x = 'precio_compra', y = 'precio_total')
#Ponemos las etiquetas
for cada in range(0,temp.shape[0]):
    plt.text(temp.precio_compra[cada], temp.precio_total[cada], temp.index[cada])

png

At this level we already see more things:

  • 3 neighborhoods that stand out, possibly all 3 are from San Blas
  • Within each low-medium-high group, we can begin to separate
    • Bass: Simancas, Ambroz, Marroquina, San Juan Bautista
    • Medium: El Plantio, Valdemarín, Valdefuentes
    • Medium-high: Jerónimos, Fuentela Reina
    • High: Recoletos

Insight 2: There are certain neighborhoods that a priori can maximize the cost-income ratio and we can also segment it by the type of quality of the property in which we are interested in investing

df.loc[df.neighbourhood.isin(['Rosas','Canillejas','Hellin']),'distrito'].unique()
array(['San Blas - Canillejas'], dtype=object)

What factors (apart from location determine the rental price?

To answer this question we can build a minicube, since we have discretized our analysis variables.

#Paso 1: Seleccionar qué variables serán la métricas y cuales las dimensiones
metricas = ['precio_total','precio_compra']
dimensiones = ['bedrooms_disc','accommodates_disc','beds_disc','number_of_reviews_disc']

minicubo_precio = df[dimensiones + metricas]
minicubo_precio
bedrooms_disc accommodates_disc beds_disc number_of_reviews_disc precio_total precio_compra
0 01_Una 0-2 1 48-744 60.0 5740.0
1 01_Una 0-2 1 5-48 31.0 4375.0
2 03_Tres 4-16 3-24 1-4 50.0 9765.0
3 01_Una 0-2 1 5-48 92.0 6510.0
4 01_Una 0-2 1 48-744 26.0 5425.0
... ... ... ... ... ... ...
17705 01_Una 0-2 1 1-4 29.0 6510.0
17706 01_Una 0-2 1 1-4 29.0 6510.0
17707 01_Una 0-2 1 1-4 29.0 6510.0
17708 01_Una 0-2 1 1-4 29.0 6685.0
17709 01_Una 0-2 1 1-4 33.0 6510.0

17710 rows × 6 columns

#Paso 2: pasar a transaccional las dimensiones
minicubo_precio = minicubo_precio.melt(id_vars=['precio_total','precio_compra'])
minicubo_precio
precio_total precio_compra variable value
0 60.0 5740.0 bedrooms_disc 01_Una
1 31.0 4375.0 bedrooms_disc 01_Una
2 50.0 9765.0 bedrooms_disc 03_Tres
3 92.0 6510.0 bedrooms_disc 01_Una
4 26.0 5425.0 bedrooms_disc 01_Una
... ... ... ... ...
70835 29.0 6510.0 number_of_reviews_disc 1-4
70836 29.0 6510.0 number_of_reviews_disc 1-4
70837 29.0 6510.0 number_of_reviews_disc 1-4
70838 29.0 6685.0 number_of_reviews_disc 1-4
70839 33.0 6510.0 number_of_reviews_disc 1-4

70840 rows × 4 columns

#Paso 3: Agregar las métricas por "variable" y "valor" con las funciones deseadas
minicubo_precio = minicubo_precio.groupby(['variable','value'])[['precio_total','precio_compra']].agg('median')
minicubo_precio
precio_total precio_compra
variable value
accommodates_disc 0-2 50.0 6510.0
3 86.0 6510.0
4-16 126.0 9114.0
bedrooms_disc 01_Una 56.0 6510.0
02_Dos 100.0 9114.0
03_Tres 140.0 11718.0
04_Cuatro o mas 204.0 15624.0
beds_disc 1 59.0 6510.0
2 100.0 9114.0
3-24 139.0 9359.0
number_of_reviews_disc 1-4 70.0 6510.0
48-744 68.0 6510.0
5-48 69.0 6510.0

On the minicube we are analyzing each variable.

minicubo_precio.loc['bedrooms_disc']
precio_total precio_compra
value
01_Una 56.0 6510.0
02_Dos 100.0 9114.0
03_Tres 140.0 11718.0
04_Cuatro o mas 204.0 15624.0
f, ax = plt.subplots()
ax.plot(minicubo_precio.loc['bedrooms_disc'].precio_total)
ax2 = ax.twinx()
ax2.plot(minicubo_precio.loc['bedrooms_disc'].precio_compra,color = 'green');

png

Regarding the number of rooms, there is nothing to highlight.

There is an almost perfect relationship between the purchase price and the total price that can be charged.

Part of this effect may be artificial, since we use the number of rooms to calculate the total price as the purchase price.

minicubo_precio.loc['beds_disc']
precio_total precio_compra
value
1 59.0 6510.0
2 100.0 9114.0
3-24 139.0 9359.0
f, ax = plt.subplots()
ax.plot(minicubo_precio.loc['beds_disc'].precio_total)
ax2 = ax.twinx()
ax2.plot(minicubo_precio.loc['beds_disc'].precio_compra,color = 'green');

png

Regarding the number of beds, there is a conclusion:

Insight 3: the number of beds to avoid is 2. Either we put a bed or we try to fit as many as possible.

Since there was no effect on the number of rooms, could it be that the owners are trying to cram in a lot more beds than rooms to maximize revenue?

Let's see it for example with the floors of a room:

df[df.bedrooms == 1].groupby('beds').precio_total.median().plot();

png

Indeed there is something here, since it appears that for one-room apartments there are people who are putting up to dozens of beds!

It would be a topic to explore in more detail and discuss with someone who knows the business.

Let's see some examples:

df.loc[(df.bedrooms == 1)& (df.beds > 8)]
level_0 index id name host_id neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights calculated_host_listings_count availability_365 description host_is_superhost accommodates bedrooms beds number_of_reviews review_scores_rating review_scores_communication review_scores_location precio_m2 distrito precio_total ocupacion bedrooms_disc accommodates_disc beds_disc number_of_reviews_disc m2 precio_compra pdi_sol
357 357 357 1044902 Masía el Ferrajón 5751753 Moncloa - Aravaca Ciudad Universitaria 40.46549 -3.75115 Private room 112 2 1 0 <b>The space</b><br />Masía el Ferrajón featur... f 16 1.0 14.0 0 NaN NaN NaN 154 Moncloa - Aravaca 1097.6 100 01_Una 4-16 3-24 1-4 50 5390.0 6.767308
605 605 605 2143824 The Hat Madrid - 10 people shared 10940437 Centro Sol 40.41511 -3.70804 Shared room 24 1 6 364 10 people shared room with shared bathroom. <b... t 10 1.0 10.0 91 4.57 4.69 4.84 186 Centro 168.0 0 01_Una 4-16 3-24 48-744 50 6510.0 0.436897
710 710 710 3011110 The Hat Madrid-Compartida 10 pers 2 10940437 Centro Embajadores 40.41311 -3.70621 Shared room 24 1 6 364 The Hat Madrid es el primer Boutique Hostel de... t 10 1.0 10.0 28 4.56 4.81 4.93 186 Centro 168.0 0 01_Una 4-16 3-24 5-48 50 6510.0 0.470155
11848 11848 11848 39600798 Bed in a 10-Bed Dormitory In the Centre of Madrid 304035848 Centro Justicia 40.42710 -3.69912 Shared room 20 1 13 351 Low-key, hostel-style setup on a tree-lined st... f 1 1.0 10.0 2 5.00 5.00 5.00 186 Centro 140.0 3 01_Una 0-2 3-24 1-4 50 6510.0 1.207701
13340 13340 13340 42492960 Apartamento dúplex 10camas en literas 338384437 Tetuán Cuatro Caminos 40.44959 -3.70138 Private room 298 1 1 364 Vivienda en planta baja con 10 plazas distribu... f 10 1.0 10.0 1 5.00 4.00 5.00 160 Tetuán 2086.0 0 01_Una 4-16 3-24 1-4 50 5600.0 3.658902
15722 15722 15722 48855617 habitaciones en una casa tranquila, en el centro 194738186 Salamanca Fuente del Berro 40.42604 -3.66373 Private room 36 2 1 359 Es una casa que tiene dos habitaciones que se ... f 5 1.0 9.0 0 NaN NaN NaN 191 Salamanca 226.8 1 01_Una 4-16 3-24 1-4 50 6685.0 3.510206

Vamos a analizar ahora por el número de huéspedes que aceptan

minicubo_precio.loc['accommodates_disc']
precio_total precio_compra
value
0-2 50.0 6510.0
3 86.0 6510.0
4-16 126.0 9114.0
f, ax = plt.subplots()
ax.plot(minicubo_precio.loc['accommodates_disc'].precio_total)
ax2 = ax.twinx()
ax2.plot(minicubo_precio.loc['accommodates_disc'].precio_compra,color = 'green');

png

Insight 4: the optimal number of guests is 3, since the price of the properties to accommodate 3 is the same as to accommodate 1 or 2. From 4 the floor needs to be larger and the purchase price increases a lot

Finally we are going to analyze the variable that we have built of proximity to a point of interest to see if it has an effect on the price of the rooms.

In a real situation we would have built many of this type, and repeated the analysis with all of them.

In this case, as we have built the distance to Puerta del Sol, we are going to evaluate only the districts for which this may be relevant, that is, the most central ones.

To do this, we will first calculate the average distance per district and choose a cut-off point.

df.groupby('distrito').pdi_sol.median().sort_values()
distrito
Centro                    0.728581
Arganzuela                1.817840
Chamberí                  2.078790
Moncloa - Aravaca         2.130135
Retiro                    2.415432
Salamanca                 2.705209
Latina                    3.496952
Carabanchel               3.647956
Usera                     3.661376
Chamartín                 4.233216
Puente de Vallecas        4.233631
Tetuán                    4.572866
Ciudad Lineal             5.170226
Moratalaz                 5.242395
Villaverde                7.579665
Fuencarral - El Pardo     7.738933
San Blas - Canillejas     7.962141
Hortaleza                 8.132864
Vicálvaro                 8.396723
Villa de Vallecas         8.822907
Barajas                  11.593340
Name: pdi_sol, dtype: float64

Let's cut in Latina included.

And on that selection we are going to visualize with a scatter.

df.groupby('distrito').pdi_sol.median().sort_values()[0:7].index.to_list()
['Centro',
 'Arganzuela',
 'Chamberí',
 'Moncloa - Aravaca',
 'Retiro',
 'Salamanca',
 'Latina']
seleccion = df.groupby('distrito').pdi_sol.median().sort_values()[0:7].index.to_list()

plt.figure(figsize = (16,12))
sns.scatterplot(data = df.loc[df.distrito.isin(seleccion)], x = 'pdi_sol', y = 'precio_total');

png

There does not seem to be as direct a relationship as would be expected between the distance to Puerta del Sol and the rental price.

Insight 5: being within the district it seems that the proximity to points of interest does not have as much impact as would be expected. This opens the door to looking for properties that, being in a central district, are not right next to the PoI and therefore hopefully have a lower purchase price

Analysis of occupation

At this point, we could repeat exactly the same analysis as with the price, but changing the price variable for the occupancy variable that we had built.

Since it would be the same, we are not going to develop it and I will leave it as a homework for you to practice and try to get your first insights.

Instead I prefer the time to show you how we can include analysis on a map, since in this case it would be something very relevant and it is a type of analysis that is always very popular.

Geographic analysis on a map

import folium

We are going to use the coordinates of the Puerta del Sol that we already had as the center point of the map.

For example, we are going to visually analyze the insight on the San Blas district.

datos = df[df.distrito == 'San Blas - Canillejas'].copy()

To represent the markers of all the floors we have to create a loop to add them to the map.

We are going also to add some color legend to see which is the price range.

datos['precio_total_disc'] = pd.qcut(datos['precio_total'], q = [0, .25, .5, .75, 1.], 
                              labels=['yellow', 'orange', 'blue', 'red'])
mapa = folium.Map(location=[40.4167278, -3.7033387],zoom_start=12)

for piso in range(0,len(datos)):
   folium.CircleMarker(
      location = [datos.iloc[piso]['latitude'], datos.iloc[piso]['longitude']],
      popup = datos.iloc[piso]['precio_total'],
      fill=True,
      color = datos.iloc[piso]['precio_total_disc'],
      fill_opacity=1,
      radius = 5
   ).add_to(mapa)

mapa
Make this Notebook Trusted to load map: File -> Trust Notebook

Insight 1b: There are still many yellow and orange dots, which may indicate that there is still potential to further exploit the factor of sporting events

Final Presentation based on Exhibits from analyzed Insights