This case simulates that we are a real estate company that makes investments in large cities, buying properties to later rent them as tourist apartments.
The management has made the decision to invest in Madrid, and has commissioned us to analyze the data that the leader in the AirBnb sector makes public to try to find the types of properties that have the greatest commercial potential for tourist rentals.
As the main deliverable, they expect the type (or types) of properties that the valuation team must look for among the existing opportunities in the city and the main neighborhoods or geographical areas on which to focus.
To meet the objective we will apply the Discovery methodology and BA techniques.
Although this specific case is focused on tourist rentals, the same type of approach can be used in cases that have a high "location" component:
store opening and closing
reduction of installed capacity
franchise expansion
etc.
Locate the profile (or profiles) of properties that maximize commercial potential in the tourist rental market and the main areas where to look for them.
After speaking with the valuation team, they tell us that the levers that have the most impact on the profitability of this type of investment are:
Rental price: the more you can charge per night, the higher the profitability
Occupation: in general, the more days a year a property can be rented, the greater its profitability
Real estate price: the cheaper the property can be acquired, the greater the profitability
In this example the Kpis are pretty straightforward:
The relevant entities for our objective and from which we may have data are:
Estate
owners
districts
The specific data in each of them will be reviewed in the next module.
We will use actual data provided by AirBnB on this page: Airbnb Insider
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#Automcompletar rápido
%config IPCompleter.greedy=True
On the AirBnB website we can see the description of the tables:
We are going to load one by one, understand them and make a decision whether to use it or not.
listings = pd.read_csv('../Datos/listings.csv')
listings.head()
id | name | host_id | host_name | ... | license | |
---|---|---|---|---|---|---|
0 | 6369 | Rooftop terrace room , ensuite bathroom | 13660 | Simon | ... | NaN |
1 | 21853 | Bright and airy room | 83531 | Abdel | ... | NaN |
2 | 23001 | Apartmento Arganzuela- Madrid Rio | 82175 | Jesus | ... | NaN |
3 | 24805 | Gran Via Studio Madrid | 346366726 | A | ... | NaN |
4 | 26825 | Single Room whith private Bathroom | 114340 | Agustina | ... | NaN |
listings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 18909 non-null int64
1 name 18906 non-null object
2 host_id 18909 non-null int64
3 host_name 18883 non-null object
4 neighbourhood_group 18909 non-null object
5 neighbourhood 18909 non-null object
6 latitude 18909 non-null float64
7 longitude 18909 non-null float64
8 room_type 18909 non-null object
9 price 18909 non-null int64
10 minimum_nights 18909 non-null int64
11 number_of_reviews 18909 non-null int64
12 last_review 13877 non-null object
13 reviews_per_month 13877 non-null float64
14 calculated_host_listings_count 18909 non-null int64
15 availability_365 18909 non-null int64
16 number_of_reviews_ltm 18909 non-null int64
17 license 2828 non-null object
dtypes: float64(3), int64(8), object(7)
memory usage: 2.6+ MB
listings_det = pd.read_csv('../Datos/listings.csv.gz',compression='gzip')
listings_det.head()
id | listing_url | scrape_id | last_scraped | name | description | ... | |
---|---|---|---|---|---|---|---|
0 | 6369 | https://www.airbnb.com/rooms/6369 | 20210910193531 | 2021-09-11 | Rooftop terrace room , ensuite bathroom | Excellent connection with the AIRPORT and EXHI... | ... |
1 | 21853 | https://www.airbnb.com/rooms/21853 | 20210910193531 | 2021-09-11 | Bright and airy room | We have a quiet and sunny room with a good vie... | ... |
2 | 23001 | https://www.airbnb.com/rooms/23001 | 20210910193531 | 2021-09-11 | Apartmento Arganzuela- Madrid Rio | Apartamento de tres dormitorios dobles, gran s... | ... |
3 | 24805 | https://www.airbnb.com/rooms/24805 | 20210910193531 | 2021-09-11 | Gran Via Studio Madrid | Studio located 50 meters from Gran Via, next t... | ... |
4 | 26825 | https://www.airbnb.com/rooms/26825 | 20210910193531 | 2021-09-11 | Single Room whith private Bathroom | Nice and cozy roon for one person with a priva... | ... |
5 rows × 74 columns
listings_det.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 74 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 18909 non-null int64
1 listing_url 18909 non-null object
2 scrape_id 18909 non-null int64
3 last_scraped 18909 non-null object
4 name 18906 non-null object
5 description 17854 non-null object
6 neighborhood_overview 10997 non-null object
7 picture_url 18908 non-null object
8 host_id 18909 non-null int64
9 host_url 18909 non-null object
10 host_name 18883 non-null object
11 host_since 18883 non-null object
12 host_location 18841 non-null object
13 host_about 9427 non-null object
14 host_response_time 11972 non-null object
15 host_response_rate 11972 non-null object
16 host_acceptance_rate 11531 non-null object
17 host_is_superhost 18883 non-null object
18 host_thumbnail_url 18883 non-null object
19 host_picture_url 18883 non-null object
20 host_neighbourhood 12034 non-null object
21 host_listings_count 18883 non-null float64
22 host_total_listings_count 18883 non-null float64
23 host_verifications 18909 non-null object
24 host_has_profile_pic 18883 non-null object
25 host_identity_verified 18883 non-null object
26 neighbourhood 10997 non-null object
27 neighbourhood_cleansed 18909 non-null object
28 neighbourhood_group_cleansed 18909 non-null object
29 latitude 18909 non-null float64
30 longitude 18909 non-null float64
31 property_type 18909 non-null object
32 room_type 18909 non-null object
33 accommodates 18909 non-null int64
34 bathrooms 0 non-null float64
35 bathrooms_text 18884 non-null object
36 bedrooms 17475 non-null float64
37 beds 18568 non-null float64
38 amenities 18909 non-null object
39 price 18909 non-null object
40 minimum_nights 18909 non-null int64
41 maximum_nights 18909 non-null int64
42 minimum_minimum_nights 18908 non-null float64
43 maximum_minimum_nights 18908 non-null float64
44 minimum_maximum_nights 18908 non-null float64
45 maximum_maximum_nights 18908 non-null float64
46 minimum_nights_avg_ntm 18908 non-null float64
47 maximum_nights_avg_ntm 18908 non-null float64
48 calendar_updated 0 non-null float64
49 has_availability 18909 non-null object
50 availability_30 18909 non-null int64
51 availability_60 18909 non-null int64
52 availability_90 18909 non-null int64
53 availability_365 18909 non-null int64
54 calendar_last_scraped 18909 non-null object
55 number_of_reviews 18909 non-null int64
56 number_of_reviews_ltm 18909 non-null int64
57 number_of_reviews_l30d 18909 non-null int64
58 first_review 13877 non-null object
59 last_review 13877 non-null object
60 review_scores_rating 13877 non-null float64
61 review_scores_accuracy 13638 non-null float64
62 review_scores_cleanliness 13640 non-null float64
63 review_scores_checkin 13640 non-null float64
64 review_scores_communication 13640 non-null float64
65 review_scores_location 13637 non-null float64
66 review_scores_value 13636 non-null float64
67 license 2828 non-null object
68 instant_bookable 18909 non-null object
69 calculated_host_listings_count 18909 non-null int64
70 calculated_host_listings_count_entire_homes 18909 non-null int64
71 calculated_host_listings_count_private_rooms 18909 non-null int64
72 calculated_host_listings_count_shared_rooms 18909 non-null int64
73 reviews_per_month 13877 non-null float64
dtypes: float64(22), int64(17), object(35)
memory usage: 10.7+ MB
Conclusions:
reviews = pd.read_csv('../Datos/reviews.csv')
reviews.head()
listing_id | date | |
---|---|---|
0 | 6369 | 2010-03-14 |
1 | 6369 | 2010-03-23 |
2 | 6369 | 2010-04-10 |
3 | 6369 | 2010-04-21 |
4 | 6369 | 2010-04-26 |
reviews.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 618440 entries, 0 to 618439
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 listing_id 618440 non-null int64
1 date 618440 non-null object
dtypes: int64(1), object(1)
memory usage: 9.4+ MB
reviews_det = pd.read_csv('../Datos/reviews.csv.gz',compression = 'gzip')
reviews_det.head()
listing_id | id | date | reviewer_id | reviewer_name | comments | |
---|---|---|---|---|---|---|
0 | 6369 | 29428 | 2010-03-14 | 84790 | Nancy | Simon and Arturo have the ultimate location in... |
1 | 6369 | 31018 | 2010-03-23 | 84338 | David | Myself and Kristy originally planned on stayin... |
2 | 6369 | 34694 | 2010-04-10 | 98655 | Marion | We had a great time at Arturo and Simon's ! A ... |
3 | 6369 | 37146 | 2010-04-21 | 109871 | Kurt | I very much enjoyed the stay. \r<br/>It's a w... |
4 | 6369 | 38168 | 2010-04-26 | 98901 | Dennis | Arturo and Simon are polite and friendly hosts... |
reviews_det.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 618440 entries, 0 to 618439
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 listing_id 618440 non-null int64
1 id 618440 non-null int64
2 date 618440 non-null object
3 reviewer_id 618440 non-null int64
4 reviewer_name 618439 non-null object
5 comments 618054 non-null object
dtypes: int64(3), object(3)
memory usage: 28.3+ MB
Conclusions:
calendar = pd.read_csv('../Datos/calendar.csv.gz',compression = 'gzip')
calendar.head(30)
listing_id | date | available | price | adjusted_price | minimum_nights | maximum_nights | |
---|---|---|---|---|---|---|---|
0 | 6369 | 2021-09-11 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
1 | 6369 | 2021-09-12 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
2 | 6369 | 2021-09-13 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
3 | 6369 | 2021-09-14 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
4 | 6369 | 2021-09-15 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
5 | 6369 | 2021-09-16 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
6 | 6369 | 2021-09-17 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
7 | 6369 | 2021-09-18 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
8 | 6369 | 2021-09-19 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
9 | 6369 | 2021-09-20 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
10 | 6369 | 2021-09-21 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
11 | 6369 | 2021-09-22 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
12 | 6369 | 2021-09-23 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
13 | 6369 | 2021-09-24 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
14 | 6369 | 2021-09-25 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
15 | 6369 | 2021-09-26 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
16 | 6369 | 2021-09-27 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
17 | 6369 | 2021-09-28 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
18 | 6369 | 2021-09-29 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
19 | 6369 | 2021-09-30 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
20 | 6369 | 2021-10-01 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
21 | 6369 | 2021-10-02 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
22 | 6369 | 2021-10-03 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
23 | 6369 | 2021-10-04 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
24 | 6369 | 2021-10-05 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
25 | 6369 | 2021-10-06 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
26 | 6369 | 2021-10-07 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
27 | 6369 | 2021-10-08 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
28 | 6369 | 2021-10-09 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
29 | 6369 | 2021-10-10 | t | $60.00 | $60.00 | 1.0 | 1125.0 |
calendar.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6901414 entries, 0 to 6901413
Data columns (total 7 columns):
# Column Dtype
--- ------ -----
0 listing_id int64
1 date object
2 available object
3 price object
4 adjusted_price object
5 minimum_nights float64
6 maximum_nights float64
dtypes: float64(2), int64(1), object(4)
memory usage: 368.6+ MB
Conclusions:
neigh = pd.read_csv('../Datos/neighbourhoods.csv')
neigh.head(5)
neighbourhood_group | neighbourhood | |
---|---|---|
0 | Arganzuela | Acacias |
1 | Arganzuela | Atocha |
2 | Arganzuela | Chopera |
3 | Arganzuela | Delicias |
4 | Arganzuela | Imperial |
Conclusions:
neigh_geo = pd.read_json('../Datos/neighbourhoods.geojson')
neigh_geo.head(5)
type | features | |
---|---|---|
0 | FeatureCollection | {'type': 'Feature', 'geometry': {'type': 'Mult... |
1 | FeatureCollection | {'type': 'Feature', 'geometry': {'type': 'Mult... |
2 | FeatureCollection | {'type': 'Feature', 'geometry': {'type': 'Mult... |
3 | FeatureCollection | {'type': 'Feature', 'geometry': {'type': 'Mult... |
4 | FeatureCollection | {'type': 'Feature', 'geometry': {'type': 'Mult... |
neigh_geo.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128 entries, 0 to 127
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type 128 non-null object
1 features 128 non-null object
dtypes: object(2)
memory usage: 2.1+ KB
Conclusions:
We are going to put the selected tables in an internal database.
We will use what is possibly the simplest database: sqlite, since it is self-contained, serverless and configuration-free, so it is usually the format of choice for storing "own" projects.
We create the connection:
import sqlalchemy as sa
con = sa.create_engine('sqlite:///../Datos/airbnb.db')
We create the tables and load the data
listings.to_sql('listings', con = con, if_exists = 'replace')
listings_det.to_sql('listings_det', con = con, if_exists = 'replace')
It's a good practice doing the individual data quality of each table first, especially because if the volume is very high, when joining them it is multiplied even more.
Therefore, initially we will leave them as individual tables, then we will apply data quality and finally we will join them to form the board or more officially the analytical datamart.
In this module we are going to:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sqlalchemy as sa
#Automcompletar rápido
%config IPCompleter.greedy=True
Create the connection to the database
Create the connection to the database
con = sa.create_engine('sqlite:///../Datos/airbnb.db')
from sqlalchemy import inspect
insp = inspect(con)
tablas = insp.get_table_names()
tablas
['df', 'df_preparado', 'listings', 'listings_det']
Mass Load the tables
exec(f'{tabla} = pd.read_sql(tabla, con)')
for cada in tablas:
print(cada + ': ' + str(eval(cada).shape))
df: (17710, 24)
df_preparado: (17710, 34)
listings: (18909, 19)
listings_det: (18909, 75)
In our data we do not have the purchase price of a property, but we had seen that it is one of the main levers.
Therefore we are going to look for that data externally.
On this page we have exactly the information we need: idealista
We can easily extract it with the Chrome instant data scraper plugin, and save it in our Data folder with the name 'prices_idealista.csv'
We load the data, remove the first record and select only the price and district columns
precio_m2 = pd.read_csv('../Datos/precios_idealista.csv') \
.loc[1:,['table__cell','icon-elbow']] \
.rename(columns = {'table__cell':'precio_m2','icon-elbow':'distrito'})
precio_m2
precio_m2 | distrito | |
---|---|---|
1 | 4.085 €/m2 | Arganzuela |
2 | 3.409 €/m2 | Barajas |
3 | 2.123 €/m2 | Carabanchel |
4 | 4.827 €/m2 | Centro |
5 | 5.098 €/m2 | Chamartín |
6 | 5.381 €/m2 | Chamberí |
7 | 2.940 €/m2 | Ciudad Lineal |
8 | 3.568 €/m2 | Fuencarral |
9 | 3.871 €/m2 | Hortaleza |
10 | 2.267 €/m2 | Latina |
11 | 4.033 €/m2 | Moncloa |
12 | 2.500 €/m2 | Moratalaz |
13 | 1.918 €/m2 | Puente de Vallecas |
14 | 4.788 €/m2 | Retiro |
15 | 6.114 €/m2 | Salamanca |
16 | 2.591 €/m2 | San Blas |
17 | 3.678 €/m2 | Tetuán |
18 | 1.995 €/m2 | Usera |
19 | 2.403 €/m2 | Vicálvaro |
20 | 2.354 €/m2 | Villa de Vallecas |
21 | 1.693 €/m2 | Villaverde |
Cleaning the price:
precio_m2['precio_m2'] = precio_m2.precio_m2.str.split(expand = True)[0].str.replace('.','',regex=False).astype('int')
precio_m2
precio_m2 | distrito | |
---|---|---|
1 | 4085 | Arganzuela |
2 | 3409 | Barajas |
3 | 2123 | Carabanchel |
4 | 4827 | Centro |
5 | 5098 | Chamartín |
6 | 5381 | Chamberí |
7 | 2940 | Ciudad Lineal |
8 | 3568 | Fuencarral |
9 | 3871 | Hortaleza |
10 | 2267 | Latina |
11 | 4033 | Moncloa |
12 | 2500 | Moratalaz |
13 | 1918 | Puente de Vallecas |
14 | 4788 | Retiro |
15 | 6114 | Salamanca |
16 | 2591 | San Blas |
17 | 3678 | Tetuán |
18 | 1995 | Usera |
19 | 2403 | Vicálvaro |
20 | 2354 | Villa de Vallecas |
21 | 1693 | Villaverde |
listings.head()
index | id | name | host_id | host_name | neighbourhood_group | ... | |
---|---|---|---|---|---|---|---|
0 | 0 | 6369 | Rooftop terrace room , ensuite bathroom | 13660 | Simon | Chamartín | ... |
1 | 1 | 21853 | Bright and airy room | 83531 | Abdel | Latina | ... |
2 | 2 | 23001 | Apartmento Arganzuela- Madrid Rio | 82175 | Jesus | Arganzuela | ... |
3 | 3 | 24805 | Gran Via Studio Madrid | 346366726 | A | Centro | ... |
4 | 4 | 26825 | Single Room whith private Bathroom | 114340 | Agustina | Arganzuela | ... |
listings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 18909 non-null int64
1 id 18909 non-null int64
2 name 18906 non-null object
3 host_id 18909 non-null int64
4 host_name 18883 non-null object
5 neighbourhood_group 18909 non-null object
6 neighbourhood 18909 non-null object
7 latitude 18909 non-null float64
8 longitude 18909 non-null float64
9 room_type 18909 non-null object
10 price 18909 non-null int64
11 minimum_nights 18909 non-null int64
12 number_of_reviews 18909 non-null int64
13 last_review 13877 non-null object
14 reviews_per_month 13877 non-null float64
15 calculated_host_listings_count 18909 non-null int64
16 availability_365 18909 non-null int64
17 number_of_reviews_ltm 18909 non-null int64
18 license 2828 non-null object
dtypes: float64(3), int64(9), object(7)
memory usage: 2.7+ MB
We are going to eliminate those variables that we will not need directly for our objectives.
a_eliminar = ['index',
'host_name',
'number_of_reviews',
'last_review',
'reviews_per_month',
'number_of_reviews_ltm',
'license'
]
listings.drop(columns = a_eliminar, inplace=True)
listings
id | name | host_id | neighbourhood_group | neighbourhood | latitude | ... | |
---|---|---|---|---|---|---|---|
0 | 6369 | Rooftop terrace room , ensuite bathroom | 13660 | Chamartín | Hispanoamérica | 40.457240 | ... |
1 | 21853 | Bright and airy room | 83531 | Latina | Cármenes | 40.403810 | ... |
2 | 23001 | Apartmento Arganzuela- Madrid Rio | 82175 | Arganzuela | Legazpi | 40.388400 | ... |
3 | 24805 | Gran Via Studio Madrid | 346366726 | Centro | Universidad | 40.421830 | ... |
4 | 26825 | Single Room whith private Bathroom | 114340 | Arganzuela | ... | ... | ... |
... | ... | ... | ... | ... | ... | ... | ... |
18904 | 52182264 | Enormous Private Room in 12-Bedroom Unit - los... | 378060726 | Centro | Justicia | 40.424384 | ... |
18905 | 52182273 | Stunning Private Room in 11-Bedroom Unit - los... | 378060726 | Centro | Justicia | 40.424447 | ... |
18906 | 52182303 | Classic Private Room in 7-Bedroom Unit - los 3... | 378060726 | Centro | Justicia | 40.424989 | ... |
18907 | 52182321 | Elegant Private Room in 12-Bedroom Unit - los ... | 378060726 | Salamanca | Recoletos | 40.424352 | ... |
18908 | 52182334 | Fashioned Private Room in 12-Bedroom Unit - lo... | 378060726 | Centro | Justicia | 40.425670 | ... |
18909 rows × 12 columns
We will check now data types
listings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 18909 non-null int64
1 name 18906 non-null object
2 host_id 18909 non-null int64
3 neighbourhood_group 18909 non-null object
4 neighbourhood 18909 non-null object
5 latitude 18909 non-null float64
6 longitude 18909 non-null float64
7 room_type 18909 non-null object
8 price 18909 non-null int64
9 minimum_nights 18909 non-null int64
10 calculated_host_listings_count 18909 non-null int64
11 availability_365 18909 non-null int64
dtypes: float64(2), int64(6), object(4)
memory usage: 1.7+ MB
Conclusion: pass some object (neighborhood_group, neighborhood, room_type) to categorical.
for variable in ['neighbourhood_group','neighbourhood','room_type']:
listings[variable] = listings[variable].astype('category')
Check
listings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 18909 non-null int64
1 name 18906 non-null object
2 host_id 18909 non-null int64
3 neighbourhood_group 18909 non-null category
4 neighbourhood 18909 non-null category
5 latitude 18909 non-null float64
6 longitude 18909 non-null float64
7 room_type 18909 non-null category
8 price 18909 non-null int64
9 minimum_nights 18909 non-null int64
10 calculated_host_listings_count 18909 non-null int64
11 availability_365 18909 non-null int64
dtypes: category(3), float64(2), int64(6), object(1)
memory usage: 1.4+ MB
From the Non-null column of info() we see that only name has 3 nulls.
We review them but we see that it is not a problem, so we leave them.
listings[listings.name.isna()]
id | name | host_id | neighbourhood_group | neighbourhood | latitude | longitude | ... | |
---|---|---|---|---|---|---|---|---|
1538 | 7164589 | None | 37525983 | Centro | Palacio | 40.41458 | -3.71422 | ... |
2394 | 11687495 | None | 48387429 | San Blas - Canillejas | Simancas | 40.43765 | -3.62672 | ... |
2842 | 13585476 | None | 20922102 | Centro | Universidad | 40.42718 | -3.71144 | ... |
We check if there are any duplicate records.
listings.duplicated().sum()
0
We are going to analyze the values and frequencies of the categorical variables
listings.neighbourhood_group.value_counts()
Centro 8433
Salamanca 1267
Chamberí 1148
Arganzuela 1070
Tetuán 811
Carabanchel 669
Retiro 656
Ciudad Lineal 601
Chamartín 562
Latina 547
Puente de Vallecas 542
Moncloa - Aravaca 535
San Blas - Canillejas 485
Hortaleza 374
Fuencarral - El Pardo 294
Usera 275
Villaverde 186
Barajas 147
Moratalaz 131
Villa de Vallecas 105
Vicálvaro 71
Name: neighbourhood_group, dtype: int64
listings.neighbourhood.value_counts()
Embajadores 2255
Universidad 1772
Palacio 1533
Sol 1126
Justicia 910
...
El Plantío 5
Valdemarín 4
El Pardo 3
Horcajo 2
Atalaya 1
Name: neighbourhood, Length: 128, dtype: int64
listings.room_type.value_counts()
Entire home/apt 11098
Private room 7390
Shared room 258
Hotel room 163
Name: room_type, dtype: int64
We see that there are hotels. Our company is not considering buying hotels, so we have to remove these records.
listings = listings.loc[listings.room_type != 'Hotel room']
listings.room_type.value_counts()
Entire home/apt 11098
Private room 7390
Shared room 258
Hotel room 0
Name: room_type, dtype: int64
Of the numeric variables, it makes sense to analyze from price to availability_365, that is, from column positions 8 to 11
listings.iloc[:,8:12].describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
price | 18746.0 | 129.271365 | 432.384680 | 8.0 | 36.0 | 64.0 | 105.0 | 9999.0 |
minimum_nights | 18746.0 | 7.295850 | 35.430022 | 1.0 | 1.0 | 2.0 | 3.0 | 1125.0 |
calculated_host_listings_count | 18746.0 | 10.731676 | 26.429455 | 1.0 | 1.0 | 2.0 | 6.0 | 194.0 |
availability_365 | 18746.0 | 153.761656 | 140.363063 | 0.0 | 0.0 | 113.0 | 310.0 | 365.0 |
Conclusions:
We will review minimums and maximums in the price
listings.price.plot.kde()
<AxesSubplot:ylabel='Density'>
We will focus now on highs values
plt.figure(figsize=(16,8))
listings.price.loc[listings.price > 1000].value_counts().sort_index().plot.bar()
plt.xticks(size = 10);
Conclusion:
We will check the values close to zero
plt.figure(figsize=(16,8))
listings.price.loc[listings.price < 30].value_counts().sort_index().plot.bar()
plt.xticks(size = 10);
Conclusion:
listings = listings.loc[listings.price > 19]
listings
id | name | host_id | neighbourhood_group | neighbourhood | latitude | ... | |
---|---|---|---|---|---|---|---|
0 | 6369 | Rooftop terrace room , ensuite bathroom | 13660 | Chamartín | Hispanoamérica | 40.457240 | ... |
1 | 21853 | Bright and airy room | 83531 | Latina | Cármenes | 40.403810 | ... |
2 | 23001 | Apartmento Arganzuela- Madrid Rio | 82175 | Arganzuela | Legazpi | 40.388400 | ... |
3 | 24805 | Gran Via Studio Madrid | 346366726 | Centro | Universidad | 40.421830 | ... |
4 | 26825 | Single Room whith private Bathroom | 114340 | Arganzuela | Legazpi | 40.389750 | ... |
... | ... | ... | ... | ... | ... | ... | ... |
18904 | 52182264 | Enormous Private Room in 12-Bedroom Unit - los... | 378060726 | Centro | Justicia | 40.424384 | ... |
18905 | 52182273 | Stunning Private Room in 11-Bedroom Unit - los... | 378060726 | Centro | Justicia | 40.424447 | ... |
18906 | 52182303 | Classic Private Room in 7-Bedroom Unit - los 3... | 378060726 | Centro | Justicia | 40.424989 | ... |
18907 | 52182321 | Elegant Private Room in 12-Bedroom Unit - los ... | 378060726 | Salamanca | Recoletos | 40.424352 | ... |
18908 | 52182334 | Fashioned Private Room in 12-Bedroom Unit - lo... | 378060726 | Centro | Justicia | 40.425670 | ... |
17710 rows × 12 columns
For minimum_nights and alculated_host_listings_count you would have to do a similar exercise.
However, it is not something that is core in our analysis and therefore I leave it as homework for you to practice.
listings_det.head()
index | id | listing_url | scrape_id | last_scraped | name | ... | |
---|---|---|---|---|---|---|---|
0 | 0 | 6369 | https://www.airbnb.com/rooms/6369 | 20210910193531 | 2021-09-11 | Rooftop terrace room , ensuite bathroom | ... |
1 | 1 | 21853 | https://www.airbnb.com/rooms/21853 | 20210910193531 | 2021-09-11 | Bright and airy room | ... |
2 | 2 | 23001 | https://www.airbnb.com/rooms/23001 | 20210910193531 | 2021-09-11 | Apartmento Arganzuela- Madrid Rio | ... |
3 | 3 | 24805 | https://www.airbnb.com/rooms/24805 | 20210910193531 | 2021-09-11 | Gran Via Studio Madrid | ... |
4 | 4 | 26825 | https://www.airbnb.com/rooms/26825 | 20210910193531 | 2021-09-11 | Single Room whith private Bathroom | ... |
5 rows × 75 columns
listings_det.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 75 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 18909 non-null int64
1 id 18909 non-null int64
2 listing_url 18909 non-null object
3 scrape_id 18909 non-null int64
4 last_scraped 18909 non-null object
5 name 18906 non-null object
6 description 17854 non-null object
7 neighborhood_overview 10997 non-null object
8 picture_url 18908 non-null object
9 host_id 18909 non-null int64
10 host_url 18909 non-null object
11 host_name 18883 non-null object
12 host_since 18883 non-null object
13 host_location 18841 non-null object
14 host_about 9427 non-null object
15 host_response_time 11972 non-null object
16 host_response_rate 11972 non-null object
17 host_acceptance_rate 11531 non-null object
18 host_is_superhost 18883 non-null object
19 host_thumbnail_url 18883 non-null object
20 host_picture_url 18883 non-null object
21 host_neighbourhood 12034 non-null object
22 host_listings_count 18883 non-null float64
23 host_total_listings_count 18883 non-null float64
24 host_verifications 18909 non-null object
25 host_has_profile_pic 18883 non-null object
26 host_identity_verified 18883 non-null object
27 neighbourhood 10997 non-null object
28 neighbourhood_cleansed 18909 non-null object
29 neighbourhood_group_cleansed 18909 non-null object
30 latitude 18909 non-null float64
31 longitude 18909 non-null float64
32 property_type 18909 non-null object
33 room_type 18909 non-null object
34 accommodates 18909 non-null int64
35 bathrooms 0 non-null float64
36 bathrooms_text 18884 non-null object
37 bedrooms 17475 non-null float64
38 beds 18568 non-null float64
39 amenities 18909 non-null object
40 price 18909 non-null object
41 minimum_nights 18909 non-null int64
42 maximum_nights 18909 non-null int64
43 minimum_minimum_nights 18908 non-null float64
44 maximum_minimum_nights 18908 non-null float64
45 minimum_maximum_nights 18908 non-null float64
46 maximum_maximum_nights 18908 non-null float64
47 minimum_nights_avg_ntm 18908 non-null float64
48 maximum_nights_avg_ntm 18908 non-null float64
49 calendar_updated 0 non-null float64
50 has_availability 18909 non-null object
51 availability_30 18909 non-null int64
52 availability_60 18909 non-null int64
53 availability_90 18909 non-null int64
54 availability_365 18909 non-null int64
55 calendar_last_scraped 18909 non-null object
56 number_of_reviews 18909 non-null int64
57 number_of_reviews_ltm 18909 non-null int64
58 number_of_reviews_l30d 18909 non-null int64
59 first_review 13877 non-null object
60 last_review 13877 non-null object
61 review_scores_rating 13877 non-null float64
62 review_scores_accuracy 13638 non-null float64
63 review_scores_cleanliness 13640 non-null float64
64 review_scores_checkin 13640 non-null float64
65 review_scores_communication 13640 non-null float64
66 review_scores_location 13637 non-null float64
67 review_scores_value 13636 non-null float64
68 license 2828 non-null object
69 instant_bookable 18909 non-null object
70 calculated_host_listings_count 18909 non-null int64
71 calculated_host_listings_count_entire_homes 18909 non-null int64
72 calculated_host_listings_count_private_rooms 18909 non-null int64
73 calculated_host_listings_count_shared_rooms 18909 non-null int64
74 reviews_per_month 13877 non-null float64
dtypes: float64(22), int64(18), object(35)
memory usage: 10.8+ MB
We are going to select only those variables that provide us with relevant information for our objectives.
a_incluir = ['id',
'description',
'host_is_superhost',
'accommodates',
'bathrooms',
'bedrooms',
'beds',
'number_of_reviews',
'review_scores_rating',
'review_scores_communication',
'review_scores_location'
]
listings_det = listings_det.loc[:,a_incluir]
listings_det
id | description | host_is_superhost | accommodates | bathrooms | bedrooms | ... | |
---|---|---|---|---|---|---|---|
0 | 6369 | Excellent connection with the AIRPORT and EXHI... | t | 2 | NaN | 1.0 | ... |
1 | 21853 | We have a quiet and sunny room with a good vie... | f | 1 | NaN | 1.0 | ... |
2 | 23001 | Apartamento de tres dormitorios dobles, gran s... | f | 6 | NaN | 1.0 | ... |
3 | 24805 | Studio located 50 meters from Gran Via, next t... | f | 3 | NaN | 1.0 | ... |
4 | 26825 | Nice and cozy roon for one person with a priva... | f | 1 | NaN | 1.0 | ... |
... | ... | ... | ... | ... | ... | ... | ... |
18904 | 52182264 | ROOM - 8 sqm. • 1st Floor <br /> <br />cozy ro... | f | 1 | NaN | 1.0 | ... |
18905 | 52182273 | ROOM - 10 sqm. • 4th Floor <br /> <br />cozy r... | f | 1 | NaN | 1.0 | ... |
18906 | 52182303 | ROOM <br /> <br />cozy room in madrid centro i... | f | 1 | NaN | 1.0 | ... |
18907 | 52182321 | ROOM -9 sqm. • 1st Floor <br /> <br />cozy roo... | f | 1 | NaN | 1.0 | ... |
18908 | 52182334 | ROOM - 10 sqm. • 1st Floor <br /> <br />cozy r... | f | 1 | NaN | 1.0 | ... |
18909 rows × 11 columns
We are going to analyze data types
listings_det.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 18909 non-null int64
1 description 17854 non-null object
2 host_is_superhost 18883 non-null object
3 accommodates 18909 non-null int64
4 bathrooms 0 non-null float64
5 bedrooms 17475 non-null float64
6 beds 18568 non-null float64
7 number_of_reviews 18909 non-null int64
8 review_scores_rating 13877 non-null float64
9 review_scores_communication 13640 non-null float64
10 review_scores_location 13637 non-null float64
dtypes: float64(6), int64(3), object(2)
memory usage: 1.6+ MB
Conclusion: change host_is_superhost to categoric.
listings_det['host_is_superhost'] = listings_det['host_is_superhost'].astype('category')
listings_det.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 18909 non-null int64
1 description 17854 non-null object
2 host_is_superhost 18883 non-null category
3 accommodates 18909 non-null int64
4 bathrooms 0 non-null float64
5 bedrooms 17475 non-null float64
6 beds 18568 non-null float64
7 number_of_reviews 18909 non-null int64
8 review_scores_rating 13877 non-null float64
9 review_scores_communication 13640 non-null float64
10 review_scores_location 13637 non-null float64
dtypes: category(1), float64(6), int64(3), object(1)
memory usage: 1.5+ MB
listings_det.isna().sum()
id 0
description 1055
host_is_superhost 26
accommodates 0
bathrooms 18909
bedrooms 1434
beds 341
number_of_reviews 0
review_scores_rating 5032
review_scores_communication 5269
review_scores_location 5272
dtype: int64
Conclusions:
Let's see if we can make an imputation of beds based on the number of people that can be accommodated.
pd.crosstab(listings_det.beds, listings_det.accommodates)
accommodates | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | .. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
beds | |||||||||||||||
0.0 | 222 | 259 | 41 | 68 | 15 | 12 | 5 | 3 | 0 | 2 | 0 | 0 | 0 | 0 | .. |
1.0 | 2765 | 4911 | 399 | 639 | 14 | 9 | 0 | 3 | 0 | 2 | 0 | 0 | 0 | 0 | .. |
2.0 | 120 | 1110 | 1016 | 2277 | 125 | 148 | 3 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | .. |
3.0 | 18 | 75 | 231 | 925 | 395 | 582 | 18 | 26 | 1 | 1 | 0 | 0 | 0 | 0 | .. |
4.0 | 20 | 28 | 32 | 216 | 214 | 498 | 73 | 92 | 3 | 9 | 0 | 1 | 0 | 0 | .. |
5.0 | 2 | 5 | 6 | 20 | 42 | 178 | 75 | 94 | 9 | 30 | 2 | 2 | 0 | 1 | .. |
6.0 | 14 | 9 | 2 | 5 | 7 | 62 | 21 | 61 | 9 | 18 | 4 | 8 | 2 | 0 | .. |
7.0 | 0 | 1 | 0 | 2 | 2 | 6 | 10 | 19 | 8 | 17 | 2 | 6 | 1 | 4 | .. |
8.0 | 3 | 2 | 0 | 0 | 3 | 2 | 0 | 21 | 3 | 18 | 2 | 9 | 2 | 2 | .. |
9.0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 6 | 2 | 8 | 1 | 0 | .. |
10.0 | 10 | 2 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 7 | 0 | 2 | 1 | 0 | .. |
11.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | .. |
12.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | .. |
13.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | .. |
14.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | .. |
15.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | .. |
16.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | .. |
17.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | .. |
18.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | .. |
23.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | .. |
24.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | .. |
It seems that we could make a more or less direct assignment. Reading the matrix vertically we see that:
We review the number of nulls and the frequency of each value
listings_det['beds'].value_counts(dropna = False)
1.0 8751
2.0 4805
3.0 2272
4.0 1186
0.0 628
5.0 466
NaN 341
6.0 222
7.0 78
8.0 68
10.0 39
9.0 20
14.0 6
11.0 6
12.0 5
16.0 5
13.0 4
18.0 2
17.0 2
24.0 1
15.0 1
23.0 1
Name: beds, dtype: int64
We create a function to impute the nulls of beds based on accommodates
def imputar_nulos(registro):
#Lista de condiciones
condiciones = [(registro.accommodates <= 2),
(registro.accommodates > 2) & (registro.accommodates <= 4),
(registro.accommodates > 4) & (registro.accommodates <= 6),
(registro.accommodates > 6)]
#Lista de resultados
resultados = [1,2,3,4]
#Salida
return(np.select(condiciones,resultados, default = -999))
#Imputación
listings_det.loc[listings_det.beds.isna(),'beds'] = listings_det.loc[listings_det.beds.isna()].apply(imputar_nulos, axis = 1).astype('float64')
Review
listings_det.beds.value_counts(dropna = False)
1.0 9061
2.0 4828
3.0 2276
4.0 1190
0.0 628
5.0 466
6.0 222
7.0 78
8.0 68
10.0 39
9.0 20
14.0 6
11.0 6
12.0 5
16.0 5
13.0 4
18.0 2
17.0 2
24.0 1
15.0 1
23.0 1
Name: beds, dtype: int64
Now let's see if we can make an imputation of bedrooms.
We start by crossing the number of rooms with the number of people that can be accommodated.
pd.crosstab(listings_det.bedrooms, listings_det.accommodates)
accommodates | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | .. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
bedrooms | |||||||||||||||
1.0 | 3009 | 5569 | 1141 | 2265 | 111 | 81 | 4 | 14 | 0 | 7 | 0 | 0 | 0 | 0 | .. |
2.0 | 74 | 173 | 348 | 1560 | 444 | 852 | 51 | 48 | 4 | 6 | 1 | 0 | 0 | 1 | .. |
3.0 | 21 | 22 | 45 | 104 | 220 | 509 | 106 | 158 | 7 | 20 | 4 | 13 | 1 | 2 | .. |
4.0 | 16 | 21 | 1 | 12 | 18 | 38 | 34 | 96 | 10 | 39 | 2 | 15 | 5 | 2 | .. |
5.0 | 9 | 6 | 0 | 0 | 4 | 3 | 4 | 6 | 9 | 34 | 3 | 8 | 1 | 3 | .. |
6.0 | 3 | 3 | 0 | 0 | 0 | 1 | 0 | 1 | 3 | 4 | 2 | 1 | 1 | 1 | .. |
7.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | .. |
8.0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | .. |
9.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | .. |
10.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | .. |
14.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | .. |
15.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | .. |
18.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | .. |
It doesn't seem very reliable.
Let's contrast it with the number of beds.
pd.crosstab(listings_det.bedrooms, listings_det.beds, dropna=False)
beds | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 | ... |
---|---|---|---|---|---|---|---|---|---|---|---|
bedrooms | |||||||||||
1.0 | 451 | 8038 | 3039 | 459 | 139 | 15 | 36 | 1 | 13 | 1 | ... |
2.0 | 65 | 108 | 1372 | 1329 | 532 | 110 | 33 | 7 | 5 | 0 | ... |
3.0 | 7 | 28 | 22 | 414 | 393 | 242 | 80 | 25 | 11 | 7 | ... |
4.0 | 6 | 23 | 3 | 7 | 95 | 63 | 56 | 27 | 19 | 6 | ... |
5.0 | 1 | 14 | 0 | 0 | 2 | 26 | 10 | 16 | 17 | 2 | ... |
6.0 | 2 | 4 | 0 | 0 | 0 | 0 | 6 | 0 | 2 | 3 | ... |
7.0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | ... |
8.0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... |
9.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... |
10.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
14.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
15.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
18.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
13 rows × 21 columns
Here we could make a more direct assignment. Reading the matrix vertically we see that:
We are going to modify the function that we had created to impute the nulls of bedrooms from beds.
First we count the bedrooms
listings_det.bedrooms.value_counts(dropna = False)
1.0 12212
2.0 3562
NaN 1434
3.0 1233
4.0 311
5.0 105
6.0 24
7.0 11
8.0 7
9.0 3
10.0 3
14.0 2
15.0 1
18.0 1
Name: bedrooms, dtype: int64
We execute the updated function
def imputar_nulos(registro):
#Lista de condiciones
condiciones = [(registro.beds <= 2),
(registro.beds > 2) & (registro.beds <= 4),
(registro.beds > 4) & (registro.beds <= 6),
(registro.beds > 6)]
#Lista de resultados
resultados = [1,2,3,4]
#Salida
return(np.select(condiciones,resultados, default = -999))
#Imputación
listings_det.loc[listings_det.bedrooms.isna(),'bedrooms'] = listings_det.loc[listings_det.bedrooms.isna()].apply(imputar_nulos, axis = 1).astype('float64')
Review
listings_det.bedrooms.value_counts(dropna = False)
1.0 13540
2.0 3657
3.0 1244
4.0 311
5.0 105
6.0 24
7.0 11
8.0 7
9.0 3
10.0 3
14.0 2
15.0 1
18.0 1
Name: bedrooms, dtype: int64
Finally we are going to eliminate bathrooms
listings_det.drop(columns = 'bathrooms', inplace = True)
listings_det
id | description | host_is_superhost | accommodates | bedrooms | beds | ... | |
---|---|---|---|---|---|---|---|
0 | 6369 | Excellent connection with the AIRPORT and EXHI... | t | 2 | 1.0 | 1.0 | ... |
1 | 21853 | We have a quiet and sunny room with a good vie... | f | 1 | 1.0 | 1.0 | ... |
2 | 23001 | Apartamento de tres dormitorios dobles, gran s... | f | 6 | 3.0 | 5.0 | ... |
3 | 24805 | Studio located 50 meters from Gran Via, next t... | f | 3 | 1.0 | 1.0 | ... |
4 | 26825 | Nice and cozy roon for one person with a priva... | f | 1 | 1.0 | 1.0 | ... |
... | ... | ... | ... | ... | ... | ... | ... |
18904 | 52182264 | ROOM - 8 sqm. • 1st Floor <br /> <br />cozy ro... | f | 1 | 1.0 | 0.0 | ... |
18905 | 52182273 | ROOM - 10 sqm. • 4th Floor <br /> <br />cozy r... | f | 1 | 1.0 | 0.0 | ... |
18906 | 52182303 | ROOM <br /> <br />cozy room in madrid centro i... | f | 1 | 1.0 | 0.0 | ... |
18907 | 52182321 | ROOM -9 sqm. • 1st Floor <br /> <br />cozy roo... | f | 1 | 1.0 | 0.0 | ... |
18908 | 52182334 | ROOM - 10 sqm. • 1st Floor <br /> <br />cozy r... | f | 1 | 1.0 | 0.0 | ... |
18909 rows × 10 columns
We check if there are any duplicate records
listings_det.duplicated().sum()
0
listings_det.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18909 entries, 0 to 18908
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 18909 non-null int64
1 description 17854 non-null object
2 host_is_superhost 18883 non-null category
3 accommodates 18909 non-null int64
4 bedrooms 18909 non-null float64
5 beds 18909 non-null float64
6 number_of_reviews 18909 non-null int64
7 review_scores_rating 13877 non-null float64
8 review_scores_communication 13640 non-null float64
9 review_scores_location 13637 non-null float64
dtypes: category(1), float64(5), int64(3), object(1)
memory usage: 1.3+ MB
We are going to analyze the values and frequencies of the categorical variables
listings_det.host_is_superhost.value_counts()
f 15423
t 3460
Name: host_is_superhost, dtype: int64
listings_det.describe(include = 'number').T
count | mean | std | min | 25% | 50% | .. | |
---|---|---|---|---|---|---|---|
id | 18909.0 | 3.055841e+07 | 1.448889e+07 | 6369.0 | 19305558.00 | 33537761.00 | .. |
accommodates | 18909.0 | 3.094928e+00 | 1.963715e+00 | 0.0 | 2.00 | 2.00 | .. |
bedrooms | 18909.0 | 1.414670e+00 | 8.116969e-01 | 1.0 | 1.00 | 1.00 | .. |
beds | 18909.0 | 1.909197e+00 | 1.444317e+00 | 0.0 | 1.00 | 1.00 | .. |
number_of_reviews | 18909.0 | 3.270612e+01 | 6.524996e+01 | 0.0 | 0.00 | 5.00 | .. |
review_scores_rating | 13877.0 | 4.549404e+00 | 7.834379e-01 | 0.0 | 4.50 | 4.75 | .. |
review_scores_communication | 13640.0 | 4.776612e+00 | 4.632929e-01 | 1.0 | 4.75 | 4.92 | .. |
review_scores_location | 13637.0 | 4.788151e+00 | 3.774844e-01 | 1.0 | 4.73 | 4.92 | .. |
We don't see anything strange.
At this point we have already detected and corrected the main data quality problems, so we went on to create the analytical datamart integrating our tables.
We have 2 main tables:
And we know that both share the id field, so we can cross them through it.
The main table is listings, since what the detail table does is give us additional data.
Therefore, the table that has to control in the integration is listings.
In addition, we also have the price table, which in this case conceptually crosses with listings through the district (neighbourhood_group).
Although we haven't checked yet that the literals are the same, so some manual correction may be necessary.
Let's start with the 2 main ones.
Since the listings table is going to be sent, the final result will have to have as many rows as listings and as many columns as those of both tables (minus 1 for the id, which will remain as a single variable).
listings.shape
(17710, 12)
listings_det.shape
(18909, 10)
If it goes well, the final table will have 17710 rows and 21 columns.
df = pd.merge(left = listings, right = listings_det, how = 'left', on = 'id')
df
id | name | host_id | neighbourhood_group | neighbourhood | latitude | ... | |
---|---|---|---|---|---|---|---|
0 | 6369 | Rooftop terrace room , ensuite bathroom | 13660 | Chamartín | Hispanoamérica | 40.457240 | ... |
1 | 21853 | Bright and airy room | 83531 | Latina | Cármenes | 40.403810 | ... |
2 | 23001 | Apartmento Arganzuela- Madrid Rio | 82175 | Arganzuela | Legazpi | 40.388400 | ... |
3 | 24805 | Gran Via Studio Madrid | 346366726 | Centro | Universidad | 40.421830 | ... |
4 | 26825 | Single Room whith private Bathroom | 114340 | Arganzuela | Legazpi | 40.389750 | ... |
... | ... | ... | ... | ... | ... | ... | ... |
17705 | 52182264 | Enormous Private Room in 12-Bedroom Unit - los... | 378060726 | Centro | Justicia | 40.424384 | ... |
17706 | 52182273 | Stunning Private Room in 11-Bedroom Unit - los... | 378060726 | Centro | Justicia | 40.424447 | ... |
17707 | 52182303 | Classic Private Room in 7-Bedroom Unit - los 3... | 378060726 | Centro | Justicia | 40.424989 | ... |
17708 | 52182321 | Elegant Private Room in 12-Bedroom Unit - los ... | 378060726 | Salamanca | Recoletos | 40.424352 | ... |
17709 | 52182334 | Fashioned Private Room in 12-Bedroom Unit - lo... | 378060726 | Centro | Justicia | 40.425670 | ... |
17710 rows × 21 columns
Now we are going to see how we can incorporate the external information of the price per square meter.
To do this, the first thing is to analyze the values of the district variable in both tables, since they need to coincide so that we can cross them.
In df the variable is categorical, so to get the levels we have to use .categories
distritos1 = pd.Series(df.neighbourhood_group.unique().categories).sort_values()
distritos1
2 Arganzuela
16 Barajas
10 Carabanchel
3 Centro
0 Chamartín
7 Chamberí
14 Ciudad Lineal
6 Fuencarral - El Pardo
9 Hortaleza
1 Latina
18 Moncloa - Aravaca
19 Moratalaz
5 Puente de Vallecas
11 Retiro
4 Salamanca
13 San Blas - Canillejas
12 Tetuán
17 Usera
20 Vicálvaro
15 Villa de Vallecas
8 Villaverde
dtype: object
distritos2 = precio_m2.distrito
distritos2
1 Arganzuela
2 Barajas
3 Carabanchel
4 Centro
5 Chamartín
6 Chamberí
7 Ciudad Lineal
8 Fuencarral
9 Hortaleza
10 Latina
11 Moncloa
12 Moratalaz
13 Puente de Vallecas
14 Retiro
15 Salamanca
16 San Blas
17 Tetuán
18 Usera
19 Vicálvaro
20 Villa de Vallecas
21 Villaverde
Name: distrito, dtype: object
Comparing everything seems the same except:
Therefore we are going to replace these values in price_m2 so that they are equal to those of df and we can cross them
precio_m2.distrito = precio_m2.distrito.map({'Fuencarral':'Fuencarral - El Pardo',
'Moncloa':'Moncloa - Aravaca',
'San Blas':'San Blas - Canillejas'}) \
.fillna(precio_m2.distrito)
precio_m2
precio_m2 | distrito | |
---|---|---|
1 | 4085 | Arganzuela |
2 | 3409 | Barajas |
3 | 2123 | Carabanchel |
4 | 4827 | Centro |
5 | 5098 | Chamartín |
6 | 5381 | Chamberí |
7 | 2940 | Ciudad Lineal |
8 | 3568 | Fuencarral - El Pardo |
9 | 3871 | Hortaleza |
10 | 2267 | Latina |
11 | 4033 | Moncloa - Aravaca |
12 | 2500 | Moratalaz |
13 | 1918 | Puente de Vallecas |
14 | 4788 | Retiro |
15 | 6114 | Salamanca |
16 | 2591 | San Blas - Canillejas |
17 | 3678 | Tetuán |
18 | 1995 | Usera |
19 | 2403 | Vicálvaro |
20 | 2354 | Villa de Vallecas |
21 | 1693 | Villaverde |
Now we can cross them.
df is our key dataframe.
df = pd.merge(left = df, right = precio_m2, how = 'left', left_on='neighbourhood_group', right_on='distrito')
df
id | name | host_id | neighbourhood_group | neighbourhood | latitude | ... | |
---|---|---|---|---|---|---|---|
0 | 6369 | Rooftop terrace room , ensuite bathroom | 13660 | Chamartín | Hispanoamérica | 40.457240 | ... |
1 | 21853 | Bright and airy room | 83531 | Latina | Cármenes | 40.403810 | ... |
2 | 23001 | Apartmento Arganzuela- Madrid Rio | 82175 | Arganzuela | Legazpi | 40.388400 | ... |
3 | 24805 | Gran Via Studio Madrid | 346366726 | Centro | Universidad | 40.421830 | ... |
4 | 26825 | Single Room whith private Bathroom | 114340 | Arganzuela | Legazpi | 40.389750 | ... |
... | ... | ... | ... | ... | ... | ... | ... |
17705 | 52182264 | Enormous Private Room in 12-Bedroom Unit - los... | 378060726 | Centro | Justicia | 40.424384 | ... |
17706 | 52182273 | Stunning Private Room in 11-Bedroom Unit - los... | 378060726 | Centro | Justicia | 40.424447 | ... |
17707 | 52182303 | Classic Private Room in 7-Bedroom Unit - los 3... | 378060726 | Centro | Justicia | 40.424989 | ... |
17708 | 52182321 | Elegant Private Room in 12-Bedroom Unit - los ... | 378060726 | Salamanca | Recoletos | 40.424352 | ... |
17709 | 52182334 | Fashioned Private Room in 12-Bedroom Unit - lo... | 378060726 | Centro | Justicia | 40.425670 | ... |
17710 rows × 23 columns
We check that no nulls have been generated in the union.
df.precio_m2.isna().sum()
0
Now that we have the analysis board, we are going to save it in the database so that every time we want to do an analysis we do not have to repeat all the processing of this notebook.
df.to_sql('df', con = con, if_exists = 'replace')
In this phase we are going to create new variables or transform the existing ones in order to give a better response to our objective.
We are going to give examples of both how to use internal variables and how to enrich with external variables.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sqlalchemy as sa
#Automcompletar rápido
%config IPCompleter.greedy=True
pd.options.display.max_columns = None
con = sa.create_engine('sqlite:///../Datos/airbnb.db')
df = pd.read_sql('df', con = con)
df.head()
index | id | name | host_id | neighbourhood_group | neighbourhood | latitude | ... | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 6369 | Rooftop terrace room , ensuite bathroom | 13660 | Chamartín | Hispanoamérica | 40.45724 | ... |
1 | 1 | 21853 | Bright and airy room | 83531 | Latina | Cármenes | 40.40381 | ... |
2 | 2 | 23001 | Apartmento Arganzuela- Madrid Rio | 82175 | Arganzuela | Legazpi | 40.38840 | ... |
3 | 3 | 24805 | Gran Via Studio Madrid | 346366726 | Centro | Universidad | 40.42183 | ... |
4 | 4 | 26825 | Single Room whith private Bathroom | 114340 | Arganzuela | Legazpi | 40.38975 | ... |
First we are going to create the analysis variables, that is, those that we had identified as the Kpis that we will use in the levers that influence the business.
We had said that there were 3:
We start with the price.
The documentation does not clarify whether the price is for the entire property, or if a room is rented, it is per room.
It is a key piece of information to be able to make an assessment of the potential income of a property.
We are going to try to understand it by analyzing the average price by type of rental.
It is important to filter by only one district so as not to include the "zone" effect.
So first we choose a district that has a lot of data.
df.distrito.value_counts()
Centro 8127
Salamanca 1218
Chamberí 1089
Arganzuela 1005
Tetuán 762
Retiro 618
Carabanchel 581
Chamartín 538
Ciudad Lineal 532
Moncloa - Aravaca 493
Puente de Vallecas 465
San Blas - Canillejas 463
Latina 455
Hortaleza 354
Fuencarral - El Pardo 265
Usera 233
Barajas 142
Villaverde 129
Moratalaz 94
Villa de Vallecas 90
Vicálvaro 57
Name: distrito, dtype: int64
df.loc[df.distrito == 'Centro',:].groupby('room_type').price.mean()
room_type
Entire home/apt 148.859980
Private room 67.131643
Shared room 60.464286
Name: price, dtype: float64
Conclusion:
However, multiplying the price by the total number of rooms can artificially bias upward the income-generating capacity of a property.
Since if it is rented by rooms, it is not likely that it will always be 100%
Therefore we should weight it by the average percentage of rented rooms.
We do not have that data, but suppose we have spoken with the business manager and he has told us that it is 70%.
We can create the total price variable by applying apply on a custom function.
def crear_precio_total(registro):
if (registro.beds > 1) & ((registro.room_type == 'Private room') | (registro.room_type == 'Shared room')):
salida = registro.price * registro.beds * 0.7
else:
salida = registro.price
return(salida)
df['precio_total'] = df.apply(crear_precio_total, axis = 1)
Comprobamos
df[['room_type','price','beds','precio_total']].head(30)
room_type | price | beds | precio_total | |
---|---|---|---|---|
0 | Private room | 60 | 1.0 | 60.0 |
1 | Private room | 31 | 1.0 | 31.0 |
2 | Entire home/apt | 50 | 5.0 | 50.0 |
3 | Entire home/apt | 92 | 1.0 | 92.0 |
4 | Private room | 26 | 1.0 | 26.0 |
5 | Entire home/apt | 85 | 3.0 | 85.0 |
6 | Entire home/apt | 65 | 2.0 | 65.0 |
7 | Entire home/apt | 54 | 1.0 | 54.0 |
8 | Entire home/apt | 1400 | 3.0 | 1400.0 |
9 | Entire home/apt | 81 | 2.0 | 81.0 |
10 | Entire home/apt | 90 | 3.0 | 90.0 |
11 | Entire home/apt | 82 | 2.0 | 82.0 |
12 | Private room | 40 | 1.0 | 40.0 |
13 | Private room | 36 | 2.0 | 50.4 |
14 | Entire home/apt | 51 | 1.0 | 51.0 |
15 | Entire home/apt | 76 | 2.0 | 76.0 |
16 | Entire home/apt | 54 | 3.0 | 54.0 |
17 | Entire home/apt | 60 | 3.0 | 60.0 |
18 | Entire home/apt | 55 | 2.0 | 55.0 |
19 | Entire home/apt | 121 | 6.0 | 121.0 |
20 | Entire home/apt | 80 | 2.0 | 80.0 |
21 | Entire home/apt | 63 | 2.0 | 63.0 |
22 | Entire home/apt | 63 | 1.0 | 63.0 |
23 | Entire home/apt | 46 | 1.0 | 46.0 |
24 | Entire home/apt | 171 | 3.0 | 171.0 |
25 | Private room | 55 | 1.0 | 55.0 |
26 | Private room | 26 | 2.0 | 36.4 |
27 | Private room | 31 | 2.0 | 43.4 |
28 | Entire home/apt | 60 | 1.0 | 60.0 |
29 | Entire home/apt | 149 | 2.0 | 149.0 |
Now we go with the occupation
The variable we have that allows us to measure this is availability_365.
This variable tells us the number of days in a year that the property is NOT occupied.
Therefore, we would be interested in transforming it into a more direct measure of occupancy, for example, the % of the year that it IS occupied.
We can do it with a direct transformation.
df['ocupacion'] = ((365 - df.availability_365) / 365 * 100).astype('int')
df.head()
index | id | name | ... | precio_m2 | distrito | precio_total | ocupacion | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 6369 | Rooftop terrace room , ensuite bathroom | ... | 5098 | Chamartín | 60.0 | 50 |
1 | 1 | 21853 | Bright and airy room | ... | 2267 | Latina | 31.0 | 0 |
2 | 2 | 23001 | Apartmento Arganzuela- Madrid Rio | ... | 4085 | Arganzuela | 50.0 | 39 |
3 | 3 | 24805 | Gran Via Studio Madrid | ... | 4827 | Centro | 92.0 | 68 |
4 | 4 | 26825 | Single Room whith private Bathroom | ... | 4085 | Arganzuela | 26.0 | 4 |
Some of the seed questions are aimed at checking how the price or occupancy behaves according to other variables such as the number of rooms, the average rating, etc.
Normally we can do these analyzes better if we discretize the analysis variable.
In our case, the candidates for this analysis are: accommodates, bedrooms, beds and number_of_reviews.
In bedrooms, a more personalized discretization makes sense. In the others we can make it automatic.
Discretize bedrooms
We begin by evaluating the distribution of the data.
df.bedrooms.value_counts().plot.bar();
We are going to discretize for 1,2,3 and more than 3.
We can use np.select
condiciones = [df.bedrooms == 1,
df.bedrooms == 2,
df.bedrooms == 3,
df.bedrooms > 3]
resultados = ['01_Una','02_Dos','03_Tres','04_Cuatro o mas']
df['bedrooms_disc'] = np.select(condiciones, resultados, default = -999)
Review
df.bedrooms_disc.value_counts().plot.bar();
Discretize accommodates, beds and number_of_reviews
We are going to use qcut to discriminate with percentiles 0.5, 0.8, 1
df['accommodates_disc'] = pd.qcut(df.accommodates,[0, 0.5, 0.8, 1],
labels = ['0-2','3','4-16'])
df['accommodates_disc'].value_counts().sort_index(ascending = False).plot.barh();
df['beds_disc'] = pd.qcut(df.beds,[0, 0.5, 0.8, 1],
labels = ['1','2','3-24'])
df['beds_disc'].value_counts().sort_index(ascending = False).plot.barh();
df['number_of_reviews_disc'] = pd.qcut(df.number_of_reviews,[0, 0.5, 0.8, 1],
labels = ['1-4','5-48','48-744'])
df['number_of_reviews_disc'].value_counts().sort_index(ascending = False).plot.barh();
In this particular case, many things could be done with external data.
The first thing, which we have already partially incorporated, is the property price lever.
We said that we could estimate it by multiplying the square meters of the property by the price per m2.
We have already obtained the price_m2, but we do not have the size of the property in the data.
What we can do is establish criteria based on the number of rooms.
It's not perfect, but it will serve as an approximation.
Estimate of the square meters of the property
Let's use the following algorithm:
condiciones = [df.bedrooms == 1,
df.bedrooms == 2,
df.bedrooms == 3,
df.bedrooms == 4,
df.bedrooms > 4]
resultados = [50,70,90,120,150]
df['m2'] = np.select(condiciones, resultados, default = -999)
Comprobamos
df['m2'].value_counts()
50 12422
70 3617
90 1228
120 296
150 147
Name: m2, dtype: int64
Now we can estimate the purchase price of the property.
We remember that we took 30% from the price we got for bargaining power.
df['precio_compra'] = df.m2 * df.precio_m2 * 0.7
Comprobamos
df[['bedrooms','m2','distrito','precio_m2','precio_compra']].head(20)
bedrooms | m2 | distrito | precio_m2 | precio_compra | |
---|---|---|---|---|---|
0 | 1.0 | 50 | Chamartín | 5098 | 178430.0 |
1 | 1.0 | 50 | Latina | 2267 | 79345.0 |
2 | 3.0 | 90 | Arganzuela | 4085 | 257355.0 |
3 | 1.0 | 50 | Centro | 4827 | 168945.0 |
4 | 1.0 | 50 | Arganzuela | 4085 | 142975.0 |
5 | 3.0 | 90 | Arganzuela | 4085 | 257355.0 |
6 | 1.0 | 50 | Centro | 4827 | 168945.0 |
7 | 1.0 | 50 | Centro | 4827 | 168945.0 |
8 | 2.0 | 70 | Centro | 4827 | 236523.0 |
9 | 1.0 | 50 | Centro | 4827 | 168945.0 |
10 | 1.0 | 50 | Centro | 4827 | 168945.0 |
11 | 2.0 | 70 | Salamanca | 6114 | 299586.0 |
12 | 1.0 | 50 | Centro | 4827 | 168945.0 |
13 | 1.0 | 50 | Centro | 4827 | 168945.0 |
14 | 1.0 | 50 | Centro | 4827 | 168945.0 |
15 | 1.0 | 50 | Centro | 4827 | 168945.0 |
16 | 1.0 | 50 | Centro | 4827 | 168945.0 |
17 | 2.0 | 70 | Centro | 4827 | 236523.0 |
18 | 2.0 | 70 | Centro | 4827 | 236523.0 |
19 | 3.0 | 90 | Centro | 4827 | 304101.0 |
Now we are going to give an example of what other types of variables we can build.
In this case we could do a lot with the x,y coordinates.
Since in tourism the location is very important.
For example, we could calculate the distances to different points of interest such as monuments, entertainment venues, sports venues, etc.
Simply as an example we are going to calculate the distance from each property to Puerta del Sol.
To do this, we Google its longitude and latitude: https://www.123coordenadas.com/coordinates/81497-puerta-del-sol-madrid
Latitude: 40.4167278 Longitude: -3.7033387
Calculation of the distance of each property to Puerta del Sol
Given the curvature of the earth, the distance between two points based on their latitude and longitude is calculated using a formula called the Haversine distance.
A Google search gives us a built-in function to calculate it that we can adapt: Haversine
from math import radians, cos, sin, asin, sqrt
def haversine(lat1, lon1, lat2, lon2):
R = 6372.8 #En km, si usas millas tienes que cambiarlo por 3959.87433
dLat = radians(lat2 - lat1)
dLon = radians(lon2 - lon1)
lat1 = radians(lat1)
lat2 = radians(lat2)
a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2
c = 2*asin(sqrt(a))
return R * c
We create the variable
#Las coordenadas de la Puerta del Sol serán lat1 y lon1
lat1 = 40.4167278
lon1 = -3.7033387
df['pdi_sol'] = df.apply(lambda registro: haversine(lat1,lon1,registro.latitude,registro.longitude),axis = 1)
We check by reviewing the average distance by districts.
df.groupby('distrito').pdi_sol.mean().sort_values()
distrito
Centro 0.730611
Arganzuela 1.939114
Chamberí 2.133167
Retiro 2.450593
Salamanca 2.715975
Moncloa - Aravaca 3.294526
Usera 3.928874
Latina 3.942634
Carabanchel 3.970238
Chamartín 4.432842
Puente de Vallecas 4.481127
Tetuán 4.624605
Moratalaz 5.073901
Ciudad Lineal 5.231293
Villaverde 7.664192
Fuencarral - El Pardo 8.062301
Hortaleza 8.074184
San Blas - Canillejas 8.199011
Vicálvaro 8.599559
Villa de Vallecas 9.176618
Barajas 11.439064
Name: pdi_sol, dtype: float64
We are going to save this version as df_preparado
df.to_sql('df_preparado', con = con, if_exists = 'replace')
We come to the most important part, where we are going to try to obtain relevant conclusions for the objective using all the preparation work we have done, the Business Analytics techniques that we already know and also we are going to learn how to create a map visualization.
To do this, we will begin by answering the seed questions and it is likely that in the process interesting findings will emerge that lead us to new questions or the application of certain techniques.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sqlalchemy as sa
#Automcompletar rápido
%config IPCompleter.greedy=True
pd.options.display.max_columns = None
import fiser_tools as fs
fs.misc.dark_theme()
con = sa.create_engine('sqlite:///../DatosCaso1/airbnb.db')
df = pd.read_sql('df_preparado', con = con)
df.head()
level_0 | index | id | name | host_id | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | calculated_host_listings_count | availability_365 | description | host_is_superhost | accommodates | bedrooms | beds | number_of_reviews | review_scores_rating | review_scores_communication | review_scores_location | precio_m2 | distrito | precio_total | ocupacion | bedrooms_disc | accommodates_disc | beds_disc | number_of_reviews_disc | m2 | precio_compra | pdi_sol | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 6369 | Rooftop terrace room , ensuite bathroom | 13660 | Chamartín | Hispanoamérica | 40.45724 | -3.67688 | Private room | 60 | 1 | 2 | 180 | Excellent connection with the AIRPORT and EXHI... | t | 2 | 1.0 | 1.0 | 80 | 4.87 | 4.89 | 4.77 | 164 | Chamartín | 60.0 | 50 | 01_Una | 0-2 | 1 | 48-744 | 50 | 5740.0 | 5.032039 |
1 | 1 | 1 | 21853 | Bright and airy room | 83531 | Latina | Cármenes | 40.40381 | -3.74130 | Private room | 31 | 4 | 2 | 364 | We have a quiet and sunny room with a good vie... | f | 1 | 1.0 | 1.0 | 33 | 4.58 | 4.82 | 4.21 | 125 | Latina | 31.0 | 0 | 01_Una | 0-2 | 1 | 5-48 | 50 | 4375.0 | 3.521406 |
2 | 2 | 2 | 23001 | Apartmento Arganzuela- Madrid Rio | 82175 | Arganzuela | Legazpi | 40.38840 | -3.69511 | Entire home/apt | 50 | 15 | 5 | 222 | Apartamento de tres dormitorios dobles, gran s... | f | 6 | 3.0 | 5.0 | 0 | NaN | NaN | NaN | 155 | Arganzuela | 50.0 | 39 | 03_Tres | 4-16 | 3-24 | 1-4 | 90 | 9765.0 | 3.226963 |
3 | 3 | 3 | 24805 | Gran Via Studio Madrid | 346366726 | Centro | Universidad | 40.42183 | -3.70529 | Entire home/apt | 92 | 5 | 1 | 115 | Studio located 50 meters from Gran Via, next t... | f | 3 | 1.0 | 1.0 | 12 | 4.92 | 5.00 | 5.00 | 186 | Centro | 92.0 | 68 | 01_Una | 0-2 | 1 | 5-48 | 50 | 6510.0 | 0.591065 |
4 | 4 | 4 | 26825 | Single Room whith private Bathroom | 114340 | Arganzuela | Legazpi | 40.38975 | -3.69018 | Private room | 26 | 2 | 1 | 349 | Nice and cozy roon for one person with a priva... | f | 1 | 1.0 | 1.0 | 149 | 4.68 | 4.71 | 4.70 | 155 | Arganzuela | 26.0 | 4 | 01_Una | 0-2 | 1 | 48-744 | 50 | 5425.0 | 3.200942 |
What is the average price? And the price range? And by districts? And by neighborhoods?
What is the ranking of districts and neighborhoods by average rental price?
df.precio_total.describe()
count 17710.000000
mean 149.027770
std 523.211567
min 20.000000
25% 43.000000
50% 70.000000
75% 112.000000
max 22400.000000
Name: precio_total, dtype: float64
We see that there is at least one outlier on the upper part that skews the mean, so we are going to use the median as a more reliable measure of centralization.
df.precio_total.median()
70.0
df.groupby('distrito').precio_total.median().sort_values(ascending = False)
distrito
San Blas - Canillejas 90.0
Salamanca 88.0
Centro 76.0
Chamartín 74.0
Chamberí 70.0
Hortaleza 69.0
Retiro 68.0
Tetuán 66.0
Moncloa - Aravaca 61.0
Fuencarral - El Pardo 60.0
Arganzuela 58.0
Vicálvaro 53.0
Ciudad Lineal 50.0
Barajas 49.5
Carabanchel 48.0
Villa de Vallecas 47.5
Latina 47.0
Usera 42.0
Villaverde 42.0
Moratalaz 40.0
Puente de Vallecas 40.0
Name: precio_total, dtype: float64
The data from San Blas catches our attention, we are going to look at it in more detail to see what is happening.
df.loc[df.distrito == 'San Blas - Canillejas'].sort_values('precio_total',ascending = False).head(10)
level_0 | index | id | name | host_id | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | calculated_host_listings_count | availability_365 | description | host_is_superhost | accommodates | bedrooms | beds | number_of_reviews | review_scores_rating | review_scores_communication | review_scores_location | precio_m2 | distrito | precio_total | ocupacion | bedrooms_disc | accommodates_disc | beds_disc | number_of_reviews_disc | m2 | precio_compra | pdi_sol | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9290 | 9290 | 9290 | 34600594 | 3 rooms near Wanda | 38951538 | San Blas - Canillejas | Arcos | 40.42105 | -3.61457 | Private room | 2800 | 1 | 2 | 0 | 3 rooms 15´walking from Wanda. | f | 6 | 3.0 | 5.0 | 0 | NaN | NaN | NaN | 120 | San Blas - Canillejas | 9800.0 | 100 | 03_Tres | 4-16 | 3-24 | 1-4 | 90 | 7560.0 | 7.532233 |
9662 | 9662 | 9662 | 34826962 | Pozos rooms | 222068421 | San Blas - Canillejas | Arcos | 40.41927 | -3.61555 | Private room | 3000 | 1 | 1 | 88 | Cerca nos podemos encontrar un centro comercia... | f | 4 | 2.0 | 3.0 | 0 | NaN | NaN | NaN | 120 | San Blas - Canillejas | 6300.0 | 75 | 02_Dos | 3 | 2 | 1-4 | 70 | 5880.0 | 7.439364 |
10189 | 10189 | 10189 | 35149419 | CHALET FOR THE FINAL CHAMPIONS LEAGUE MADRID | 143732655 | San Blas - Canillejas | Canillejas | 40.44937 | -3.61633 | Entire home/apt | 6000 | 1 | 1 | 365 | We are talking about one of the last villas fo... | f | 6 | 4.0 | 5.0 | 0 | NaN | NaN | NaN | 120 | San Blas - Canillejas | 6000.0 | 0 | 04_Cuatro o mas | 4-16 | 3-24 | 1-4 | 120 | 10080.0 | 8.212429 |
9668 | 9668 | 9668 | 34833756 | Wanda Champions Final | 29628177 | San Blas - Canillejas | Rosas | 40.42622 | -3.60502 | Private room | 2700 | 1 | 1 | 0 | Somos una pareja joven educados y muy sociable... | f | 5 | 2.0 | 3.0 | 0 | NaN | NaN | NaN | 120 | San Blas - Canillejas | 5670.0 | 100 | 02_Dos | 4-16 | 2 | 1-4 | 70 | 5880.0 | 8.391922 |
9878 | 9878 | 9878 | 34970236 | Habitaciones cerca Wanda Metropolitano (Champi... | 263474389 | San Blas - Canillejas | Arcos | 40.41920 | -3.61229 | Private room | 800 | 1 | 1 | 88 | Se alquilan 5 habitaciones, cada una para 2 hu... | f | 10 | 5.0 | 8.0 | 0 | NaN | NaN | NaN | 120 | San Blas - Canillejas | 4480.0 | 75 | 04_Cuatro o mas | 4-16 | 3-24 | 1-4 | 150 | 12600.0 | 7.714952 |
9748 | 9748 | 9748 | 34882596 | 2 rooms near wanda | 38951538 | San Blas - Canillejas | Arcos | 40.41948 | -3.61427 | Private room | 1500 | 1 | 2 | 363 | 2 rooms near Wanda stadium (20' walking) with ... | f | 4 | 2.0 | 4.0 | 0 | NaN | NaN | NaN | 120 | San Blas - Canillejas | 4200.0 | 0 | 02_Dos | 3 | 3-24 | 1-4 | 70 | 5880.0 | 7.548577 |
9173 | 9173 | 9173 | 34522997 | Beautiful Penthouse next to Wanda Metropolitano | 260551569 | San Blas - Canillejas | Canillejas | 40.43686 | -3.61093 | Private room | 2000 | 1 | 1 | 0 | Wonderful penthouse in a lovely duplex next to... | f | 4 | 1.0 | 3.0 | 0 | NaN | NaN | NaN | 120 | San Blas - Canillejas | 4200.0 | 100 | 01_Una | 3 | 2 | 1-4 | 50 | 4200.0 | 8.138310 |
9831 | 9831 | 9831 | 34933116 | FINAL CHAMPIONS LEAGUE MADRID | 263232278 | San Blas - Canillejas | Simancas | 40.43968 | -3.61933 | Private room | 2000 | 2 | 1 | 365 | Near (15 minutes walking) to WANDA STADIUM.<br... | f | 5 | 3.0 | 3.0 | 0 | NaN | NaN | NaN | 120 | San Blas - Canillejas | 4200.0 | 0 | 03_Tres | 4-16 | 2 | 1-4 | 90 | 7560.0 | 7.557076 |
10092 | 10092 | 10092 | 35091283 | FINAL CHAMPIONS LEAGUE | 264225539 | San Blas - Canillejas | Rejas | 40.44592 | -3.58746 | Private room | 1000 | 1 | 1 | 180 | Desayuno, comida y cena incluida. Traslados al... | f | 6 | 2.0 | 6.0 | 0 | NaN | NaN | NaN | 120 | San Blas - Canillejas | 4200.0 | 50 | 02_Dos | 4-16 | 3-24 | 1-4 | 70 | 5880.0 | 10.334061 |
10404 | 10404 | 10404 | 35251141 | Se alquila para final de la champion league | 265376740 | San Blas - Canillejas | Canillejas | 40.44292 | -3.60764 | Entire home/apt | 3000 | 1 | 1 | 365 | None | f | 1 | 2.0 | 4.0 | 0 | NaN | NaN | NaN | 120 | San Blas - Canillejas | 3000.0 | 0 | 02_Dos | 0-2 | 3-24 | 1-4 | 70 | 5880.0 | 8.610212 |
We see that they are prices in the environment of 3,000 - 5,000 euros!
When reading the description we realize that all these prices are defined by the Champions League final.
Which is an interesting insight:
Insight 1: there may be properties with a regular residual value but with high value at specific times due to sporting events or shows
Would it make sense to generate a rental product that consists of renting in a "normal" way at a price below the market with the condition that the tenant leaves the apartment free to rent it "touristically" on designated dates?
In the rest there are no surprises, with districts like Salamanca, Centro or Chanmartín in the lead.
But for example we see that the average price difference between Retiro and Tetuán is very low.
This leads us to compare the average price per district with the average purchase price also per district.
temp = df.groupby('distrito')[['precio_total','precio_compra']].median()
temp
precio_total | precio_compra | |
---|---|---|
distrito | ||
Arganzuela | 58.0 | 5425.0 |
Barajas | 49.5 | 4235.0 |
Carabanchel | 48.0 | 4305.0 |
Centro | 76.0 | 6510.0 |
Chamartín | 74.0 | 5740.0 |
Chamberí | 70.0 | 6300.0 |
Ciudad Lineal | 50.0 | 4760.0 |
Fuencarral - El Pardo | 60.0 | 4480.0 |
Hortaleza | 69.0 | 4515.0 |
Latina | 47.0 | 4375.0 |
Moncloa - Aravaca | 61.0 | 5390.0 |
Moratalaz | 40.0 | 3955.0 |
Puente de Vallecas | 40.0 | 4305.0 |
Retiro | 68.0 | 5670.0 |
Salamanca | 88.0 | 6685.0 |
San Blas - Canillejas | 90.0 | 4200.0 |
Tetuán | 66.0 | 5600.0 |
Usera | 42.0 | 4585.0 |
Vicálvaro | 53.0 | 3850.0 |
Villa de Vallecas | 47.5 | 4130.0 |
Villaverde | 42.0 | 3920.0 |
plt.figure(figsize = (16,8))
sns.scatterplot(data = temp, x = 'precio_compra', y = 'precio_total')
#Ponemos las etiquetas
for cada in range(0,temp.shape[0]):
plt.text(temp.precio_compra[cada], temp.precio_total[cada], temp.index[cada])
There is a clear correlation between the purchase price in each district and the price we will be able to charge.
Three clusters of low-low, medium-medium and high-high are clearly perceived.
And the exception of San Blas that we already know why it is.
Therefore, as expected, there is no clear "bargain" at this level a priori.
We are going to repeat the analysis at the neighborhood level to see if we identify anything.
temp = df.groupby('neighbourhood')[['precio_total','precio_compra']].median()
temp
precio_total | precio_compra | |
---|---|---|
neighbourhood | ||
Abrantes | 35.5 | 4305.0 |
Acacias | 53.8 | 5425.0 |
Adelfas | 54.0 | 7938.0 |
Aeropuerto | 41.3 | 4235.0 |
Aguilas | 47.0 | 4375.0 |
... | ... | ... |
Valverde | 62.5 | 4480.0 |
Ventas | 42.0 | 4760.0 |
Vinateros | 50.0 | 3955.0 |
Vista Alegre | 50.0 | 4305.0 |
Zofío | 36.4 | 4585.0 |
128 rows × 2 columns
plt.figure(figsize = (16,20))
sns.scatterplot(data = temp, x = 'precio_compra', y = 'precio_total')
#Ponemos las etiquetas
for cada in range(0,temp.shape[0]):
plt.text(temp.precio_compra[cada], temp.precio_total[cada], temp.index[cada])
At this level we already see more things:
Insight 2: There are certain neighborhoods that a priori can maximize the cost-income ratio and we can also segment it by the type of quality of the property in which we are interested in investing
df.loc[df.neighbourhood.isin(['Rosas','Canillejas','Hellin']),'distrito'].unique()
array(['San Blas - Canillejas'], dtype=object)
What factors (apart from location determine the rental price?
To answer this question we can build a minicube, since we have discretized our analysis variables.
#Paso 1: Seleccionar qué variables serán la métricas y cuales las dimensiones
metricas = ['precio_total','precio_compra']
dimensiones = ['bedrooms_disc','accommodates_disc','beds_disc','number_of_reviews_disc']
minicubo_precio = df[dimensiones + metricas]
minicubo_precio
bedrooms_disc | accommodates_disc | beds_disc | number_of_reviews_disc | precio_total | precio_compra | |
---|---|---|---|---|---|---|
0 | 01_Una | 0-2 | 1 | 48-744 | 60.0 | 5740.0 |
1 | 01_Una | 0-2 | 1 | 5-48 | 31.0 | 4375.0 |
2 | 03_Tres | 4-16 | 3-24 | 1-4 | 50.0 | 9765.0 |
3 | 01_Una | 0-2 | 1 | 5-48 | 92.0 | 6510.0 |
4 | 01_Una | 0-2 | 1 | 48-744 | 26.0 | 5425.0 |
... | ... | ... | ... | ... | ... | ... |
17705 | 01_Una | 0-2 | 1 | 1-4 | 29.0 | 6510.0 |
17706 | 01_Una | 0-2 | 1 | 1-4 | 29.0 | 6510.0 |
17707 | 01_Una | 0-2 | 1 | 1-4 | 29.0 | 6510.0 |
17708 | 01_Una | 0-2 | 1 | 1-4 | 29.0 | 6685.0 |
17709 | 01_Una | 0-2 | 1 | 1-4 | 33.0 | 6510.0 |
17710 rows × 6 columns
#Paso 2: pasar a transaccional las dimensiones
minicubo_precio = minicubo_precio.melt(id_vars=['precio_total','precio_compra'])
minicubo_precio
precio_total | precio_compra | variable | value | |
---|---|---|---|---|
0 | 60.0 | 5740.0 | bedrooms_disc | 01_Una |
1 | 31.0 | 4375.0 | bedrooms_disc | 01_Una |
2 | 50.0 | 9765.0 | bedrooms_disc | 03_Tres |
3 | 92.0 | 6510.0 | bedrooms_disc | 01_Una |
4 | 26.0 | 5425.0 | bedrooms_disc | 01_Una |
... | ... | ... | ... | ... |
70835 | 29.0 | 6510.0 | number_of_reviews_disc | 1-4 |
70836 | 29.0 | 6510.0 | number_of_reviews_disc | 1-4 |
70837 | 29.0 | 6510.0 | number_of_reviews_disc | 1-4 |
70838 | 29.0 | 6685.0 | number_of_reviews_disc | 1-4 |
70839 | 33.0 | 6510.0 | number_of_reviews_disc | 1-4 |
70840 rows × 4 columns
#Paso 3: Agregar las métricas por "variable" y "valor" con las funciones deseadas
minicubo_precio = minicubo_precio.groupby(['variable','value'])[['precio_total','precio_compra']].agg('median')
minicubo_precio
precio_total | precio_compra | ||
---|---|---|---|
variable | value | ||
accommodates_disc | 0-2 | 50.0 | 6510.0 |
3 | 86.0 | 6510.0 | |
4-16 | 126.0 | 9114.0 | |
bedrooms_disc | 01_Una | 56.0 | 6510.0 |
02_Dos | 100.0 | 9114.0 | |
03_Tres | 140.0 | 11718.0 | |
04_Cuatro o mas | 204.0 | 15624.0 | |
beds_disc | 1 | 59.0 | 6510.0 |
2 | 100.0 | 9114.0 | |
3-24 | 139.0 | 9359.0 | |
number_of_reviews_disc | 1-4 | 70.0 | 6510.0 |
48-744 | 68.0 | 6510.0 | |
5-48 | 69.0 | 6510.0 |
On the minicube we are analyzing each variable.
minicubo_precio.loc['bedrooms_disc']
precio_total | precio_compra | |
---|---|---|
value | ||
01_Una | 56.0 | 6510.0 |
02_Dos | 100.0 | 9114.0 |
03_Tres | 140.0 | 11718.0 |
04_Cuatro o mas | 204.0 | 15624.0 |
f, ax = plt.subplots()
ax.plot(minicubo_precio.loc['bedrooms_disc'].precio_total)
ax2 = ax.twinx()
ax2.plot(minicubo_precio.loc['bedrooms_disc'].precio_compra,color = 'green');
Regarding the number of rooms, there is nothing to highlight.
There is an almost perfect relationship between the purchase price and the total price that can be charged.
Part of this effect may be artificial, since we use the number of rooms to calculate the total price as the purchase price.
minicubo_precio.loc['beds_disc']
precio_total | precio_compra | |
---|---|---|
value | ||
1 | 59.0 | 6510.0 |
2 | 100.0 | 9114.0 |
3-24 | 139.0 | 9359.0 |
f, ax = plt.subplots()
ax.plot(minicubo_precio.loc['beds_disc'].precio_total)
ax2 = ax.twinx()
ax2.plot(minicubo_precio.loc['beds_disc'].precio_compra,color = 'green');
Regarding the number of beds, there is a conclusion:
Insight 3: the number of beds to avoid is 2. Either we put a bed or we try to fit as many as possible.
Since there was no effect on the number of rooms, could it be that the owners are trying to cram in a lot more beds than rooms to maximize revenue?
Let's see it for example with the floors of a room:
df[df.bedrooms == 1].groupby('beds').precio_total.median().plot();
Indeed there is something here, since it appears that for one-room apartments there are people who are putting up to dozens of beds!
It would be a topic to explore in more detail and discuss with someone who knows the business.
Let's see some examples:
df.loc[(df.bedrooms == 1)& (df.beds > 8)]
level_0 | index | id | name | host_id | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | calculated_host_listings_count | availability_365 | description | host_is_superhost | accommodates | bedrooms | beds | number_of_reviews | review_scores_rating | review_scores_communication | review_scores_location | precio_m2 | distrito | precio_total | ocupacion | bedrooms_disc | accommodates_disc | beds_disc | number_of_reviews_disc | m2 | precio_compra | pdi_sol | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
357 | 357 | 357 | 1044902 | Masía el Ferrajón | 5751753 | Moncloa - Aravaca | Ciudad Universitaria | 40.46549 | -3.75115 | Private room | 112 | 2 | 1 | 0 | <b>The space</b><br />Masía el Ferrajón featur... | f | 16 | 1.0 | 14.0 | 0 | NaN | NaN | NaN | 154 | Moncloa - Aravaca | 1097.6 | 100 | 01_Una | 4-16 | 3-24 | 1-4 | 50 | 5390.0 | 6.767308 |
605 | 605 | 605 | 2143824 | The Hat Madrid - 10 people shared | 10940437 | Centro | Sol | 40.41511 | -3.70804 | Shared room | 24 | 1 | 6 | 364 | 10 people shared room with shared bathroom. <b... | t | 10 | 1.0 | 10.0 | 91 | 4.57 | 4.69 | 4.84 | 186 | Centro | 168.0 | 0 | 01_Una | 4-16 | 3-24 | 48-744 | 50 | 6510.0 | 0.436897 |
710 | 710 | 710 | 3011110 | The Hat Madrid-Compartida 10 pers 2 | 10940437 | Centro | Embajadores | 40.41311 | -3.70621 | Shared room | 24 | 1 | 6 | 364 | The Hat Madrid es el primer Boutique Hostel de... | t | 10 | 1.0 | 10.0 | 28 | 4.56 | 4.81 | 4.93 | 186 | Centro | 168.0 | 0 | 01_Una | 4-16 | 3-24 | 5-48 | 50 | 6510.0 | 0.470155 |
11848 | 11848 | 11848 | 39600798 | Bed in a 10-Bed Dormitory In the Centre of Madrid | 304035848 | Centro | Justicia | 40.42710 | -3.69912 | Shared room | 20 | 1 | 13 | 351 | Low-key, hostel-style setup on a tree-lined st... | f | 1 | 1.0 | 10.0 | 2 | 5.00 | 5.00 | 5.00 | 186 | Centro | 140.0 | 3 | 01_Una | 0-2 | 3-24 | 1-4 | 50 | 6510.0 | 1.207701 |
13340 | 13340 | 13340 | 42492960 | Apartamento dúplex 10camas en literas | 338384437 | Tetuán | Cuatro Caminos | 40.44959 | -3.70138 | Private room | 298 | 1 | 1 | 364 | Vivienda en planta baja con 10 plazas distribu... | f | 10 | 1.0 | 10.0 | 1 | 5.00 | 4.00 | 5.00 | 160 | Tetuán | 2086.0 | 0 | 01_Una | 4-16 | 3-24 | 1-4 | 50 | 5600.0 | 3.658902 |
15722 | 15722 | 15722 | 48855617 | habitaciones en una casa tranquila, en el centro | 194738186 | Salamanca | Fuente del Berro | 40.42604 | -3.66373 | Private room | 36 | 2 | 1 | 359 | Es una casa que tiene dos habitaciones que se ... | f | 5 | 1.0 | 9.0 | 0 | NaN | NaN | NaN | 191 | Salamanca | 226.8 | 1 | 01_Una | 4-16 | 3-24 | 1-4 | 50 | 6685.0 | 3.510206 |
Vamos a analizar ahora por el número de huéspedes que aceptan
minicubo_precio.loc['accommodates_disc']
precio_total | precio_compra | |
---|---|---|
value | ||
0-2 | 50.0 | 6510.0 |
3 | 86.0 | 6510.0 |
4-16 | 126.0 | 9114.0 |
f, ax = plt.subplots()
ax.plot(minicubo_precio.loc['accommodates_disc'].precio_total)
ax2 = ax.twinx()
ax2.plot(minicubo_precio.loc['accommodates_disc'].precio_compra,color = 'green');
Insight 4: the optimal number of guests is 3, since the price of the properties to accommodate 3 is the same as to accommodate 1 or 2. From 4 the floor needs to be larger and the purchase price increases a lot
Finally we are going to analyze the variable that we have built of proximity to a point of interest to see if it has an effect on the price of the rooms.
In a real situation we would have built many of this type, and repeated the analysis with all of them.
In this case, as we have built the distance to Puerta del Sol, we are going to evaluate only the districts for which this may be relevant, that is, the most central ones.
To do this, we will first calculate the average distance per district and choose a cut-off point.
df.groupby('distrito').pdi_sol.median().sort_values()
distrito
Centro 0.728581
Arganzuela 1.817840
Chamberí 2.078790
Moncloa - Aravaca 2.130135
Retiro 2.415432
Salamanca 2.705209
Latina 3.496952
Carabanchel 3.647956
Usera 3.661376
Chamartín 4.233216
Puente de Vallecas 4.233631
Tetuán 4.572866
Ciudad Lineal 5.170226
Moratalaz 5.242395
Villaverde 7.579665
Fuencarral - El Pardo 7.738933
San Blas - Canillejas 7.962141
Hortaleza 8.132864
Vicálvaro 8.396723
Villa de Vallecas 8.822907
Barajas 11.593340
Name: pdi_sol, dtype: float64
Let's cut in Latina included.
And on that selection we are going to visualize with a scatter.
df.groupby('distrito').pdi_sol.median().sort_values()[0:7].index.to_list()
['Centro',
'Arganzuela',
'Chamberí',
'Moncloa - Aravaca',
'Retiro',
'Salamanca',
'Latina']
seleccion = df.groupby('distrito').pdi_sol.median().sort_values()[0:7].index.to_list()
plt.figure(figsize = (16,12))
sns.scatterplot(data = df.loc[df.distrito.isin(seleccion)], x = 'pdi_sol', y = 'precio_total');
There does not seem to be as direct a relationship as would be expected between the distance to Puerta del Sol and the rental price.
Insight 5: being within the district it seems that the proximity to points of interest does not have as much impact as would be expected. This opens the door to looking for properties that, being in a central district, are not right next to the PoI and therefore hopefully have a lower purchase price
At this point, we could repeat exactly the same analysis as with the price, but changing the price variable for the occupancy variable that we had built.
Since it would be the same, we are not going to develop it and I will leave it as a homework for you to practice and try to get your first insights.
Instead I prefer the time to show you how we can include analysis on a map, since in this case it would be something very relevant and it is a type of analysis that is always very popular.
import folium
We are going to use the coordinates of the Puerta del Sol that we already had as the center point of the map.
For example, we are going to visually analyze the insight on the San Blas district.
datos = df[df.distrito == 'San Blas - Canillejas'].copy()
To represent the markers of all the floors we have to create a loop to add them to the map.
We are going also to add some color legend to see which is the price range.
datos['precio_total_disc'] = pd.qcut(datos['precio_total'], q = [0, .25, .5, .75, 1.],
labels=['yellow', 'orange', 'blue', 'red'])
mapa = folium.Map(location=[40.4167278, -3.7033387],zoom_start=12)
for piso in range(0,len(datos)):
folium.CircleMarker(
location = [datos.iloc[piso]['latitude'], datos.iloc[piso]['longitude']],
popup = datos.iloc[piso]['precio_total'],
fill=True,
color = datos.iloc[piso]['precio_total_disc'],
fill_opacity=1,
radius = 5
).add_to(mapa)
mapa
Insight 1b: There are still many yellow and orange dots, which may indicate that there is still potential to further exploit the factor of sporting events