Analyzing Airbnb New User Bookings

Andrea Cabello
3 min read · Apr 8, 2021

What will a new Airbnb user’s first booking destination be?

I. Overview

For my capstone project at Flatiron School, I analyzed data that Airbnb published on Kaggle.

  • All the users in the data set are from the USA.
  • Airbnb itself provided the data, as several data sets, for a challenge on Kaggle.
  • I use only the train_data set and perform my own train_test_split on it.

This project consists of two parts.

  • Part I: Binary Classification Model

Will a new Airbnb user end up booking a destination? True or False. For this, we create a new feature called ‘effective_booking’.

  • Part II: Multi-Class Classification

What will a new Airbnb user’s first booking destination be? There are 12 possible outcomes for the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, ‘DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’. Note that ‘NDF’ differs from ‘other’: ‘other’ means the user booked, but in a country not on this list, while ‘NDF’ means there was no booking.

II. Business Problem

  1. Predict whether a new Airbnb user will actually book a destination.
  2. Predict which country a new Airbnb user’s first booking will be in.
  • We created a True/False feature, “effective_booking”, and built a binary classification model to predict whether a user ends up booking.
  • What is happening? What determines whether a user ends up booking, at a granular or an overall level?
  • We then built a classifier to predict where the users who do book are going.
  • The train data has 128,070 observations (users).
  • 74,878 of them (58%) are NDF (no destination found).
  • The remaining 53,192 (42%) made an actual booking.
  • ‘US’ represents domestic travel, which is 70% of all bookings in our data set.
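The percentages above follow directly from the counts. As a quick check, here is a small sketch that uses only the numbers quoted in this section (the variable names are my own, not from the original notebook):

```python
# Counts quoted in this section for the train data.
total_users = 128070
ndf_users = 74878        # 'NDF': no booking was made
booking_users = 53192    # any destination other than 'NDF'

# The two groups should add up to all users.
assert ndf_users + booking_users == total_users

ndf_share = ndf_users / total_users
booking_share = booking_users / total_users

print(f"NDF share: {ndf_share:.1%}")
print(f"Booking share: {booking_share:.1%}")
```

Because the classes are imbalanced, a model that always predicts "no booking" would already be right about 58% of the time on this data, which is a useful floor to keep in mind when reading accuracy figures.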

III. Feature Engineering

  • Age feature: we used the pd.cut() function to create age bins and assign each user’s age to the corresponding bin.
import pandas as pd

# Bin ages into 5-year intervals: (14, 19], (19, 24], ..., (94, 99]
train_data['age_bins'] = pd.cut(x=train_data['age'], bins=[14, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 69, 74, 79, 84, 89, 94, 99])
# Convert the intervals to strings so they can be renamed; missing ages become 'nan'
train_data['age_bins'] = train_data.age_bins.astype(str)
# Map each interval string to a readable label; everything above 74 is grouped as '75+'
age_mapper = {'nan':'unknown',
'(29.0, 34.0]':'30-34',
'(24.0, 29.0]':'25-29',
'(34.0, 39.0]':'35-39',
'(39.0, 44.0]':'40-44',
'(19.0, 24.0]':'20-24',
'(44.0, 49.0]':'45-49',
'(49.0, 54.0]':'50-54',
'(54.0, 59.0]':'55-59',
'(59.0, 64.0]':'60-64',
'(64.0, 69.0]':'65-69',
'(14.0, 19.0]':'15-19',
'(69.0, 74.0]':'70-74',
'(74.0, 79.0]':'75+',
'(79.0, 84.0]':'75+',
'(94.0, 99.0]':'75+',
'(84.0, 89.0]':'75+',
'(89.0, 94.0]':'75+',}
# Assign the result back instead of using inplace=True on a column selection
train_data['age_bins'] = train_data['age_bins'].replace(age_mapper)
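The string mapper depends on exactly how pandas prints each interval, which can vary between versions. As an alternative, here is a minimal sketch that passes the labels straight to pd.cut. It is not the code from the original notebook: the toy ages and the names ages, bin_edges and bin_labels are made up for illustration, and the single wide (74, 99] bin is my own shortcut for the ‘75+’ group.

```python
import pandas as pd

# Toy data standing in for train_data['age']; None marks a missing age.
ages = pd.Series([17, 22, 31, 60, 64, 72, 88, None], name="age")

# 5-year bins up to 74, then one wide bin (74, 99] for everyone older.
bin_edges = [14, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 69, 74, 99]
bin_labels = ["15-19", "20-24", "25-29", "30-34", "35-39", "40-44",
              "45-49", "50-54", "55-59", "60-64", "65-69", "70-74", "75+"]

# Bins are right-inclusive by default, so age 64 falls in '60-64'.
age_bins = pd.cut(ages, bins=bin_edges, labels=bin_labels)

# Categorical -> plain objects, with missing ages labelled 'unknown'.
age_bins = age_bins.astype(object).fillna("unknown")

print(age_bins.tolist())
```

With labels supplied up front, there is no intermediate interval string to match, so a label typo can only happen in one place.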
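  • effective_booking feature: True for every user whose country_destination is anything other than ‘NDF’.
# Every destination value except 'NDF' counts as an actual booking
countries_list = train_data['country_destination'].unique().tolist()
countries_list.remove('NDF')
train_data['effective_booking'] = train_data['country_destination'].isin(countries_list)
train_data.effective_booking.value_counts()
False    124543
True      88908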
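Since ‘NDF’ is the only value that means "no booking", the same flag can also be built with a direct comparison. The sketch below checks that on a small made-up frame (toy, via_isin and via_compare are illustrative names, not from the original notebook):

```python
import pandas as pd

# Toy stand-in for train_data with a few destination values.
toy = pd.DataFrame({"country_destination": ["NDF", "US", "other", "NDF", "FR"]})

# isin-based version, following the steps used in this post.
countries_list = toy["country_destination"].unique().tolist()
countries_list.remove("NDF")
via_isin = toy["country_destination"].isin(countries_list)

# Direct comparison: True wherever the destination is not 'NDF'.
via_compare = toy["country_destination"] != "NDF"

print(via_compare.tolist())
print(via_isin.equals(via_compare))
```

On this toy frame both versions produce the same boolean column; the comparison simply skips building the list of countries.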

IV. EDA

V. Model Results

  • Binary Classification Model: Random Forest Classifier
Training Accuracy for Random Forest: 64.72%
Test Accuracy for Random Forest: 64.89%
  • Multi-Class Classification Model: XGBoost Classifier
Training Accuracy: 87.56%
Validation Accuracy: 87.59%
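For readers who want to see how a pair of training and test accuracy figures like these is produced, here is a minimal sketch with scikit-learn. It runs on synthetic data, not the Airbnb data, and the split size, hyperparameters and variable names are assumptions chosen for illustration rather than the settings behind the numbers above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 5 numeric features, noisy binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=1000)) > 0

# Hold out 25% of the rows; stratify keeps the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# max_depth limits tree size, which helps reduce overfitting.
model = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
model.fit(X_train, y_train)

# Accuracy on the rows the model saw versus the held-out rows.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Training accuracy: {train_acc:.2%}")
print(f"Test accuracy: {test_acc:.2%}")
```

When the two numbers are close, as they are in the results above, the model is not memorizing the training rows, although that alone does not say how far it is above a majority-class baseline.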

VI. Conclusions and Future Work

  • Because the data set describes new users, the value ‘unknown’ appears often in several categories. Reaching about 87.6% validation accuracy on destinations despite so many unknown values is a surprisingly good result.
  • Our binary classification model (about 65% test accuracy) could be improved, but it still helps us understand which traits are common among first-time users who end up booking versus those who don’t.
  • Our XGBoost classifier predicts which destination a new user will choose with about 87.6% validation accuracy. This is very valuable information for marketing purposes.
  • Predicting user behavior allows us to implement targeted marketing strategies.
  • Future work: build a model to help us define our market segmentation strategy.
