# Skytrax Data #4: The Logit Model


We're going to quickly follow up on the OLS model presented in the last post with an alternative outcome variable: Whether or not passengers chose to recommend the airline to others.

In [1]:
import psycopg2
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import math
import statsmodels.api as sm

In [2]:
#Connect to the local postgres database holding the Skytrax data
#(connection parameters assumed here; substitute your own credentials).
connection = psycopg2.connect('dbname=skytrax')
cursor = connection.cursor()

In [3]:
connection.rollback()
cursor.execute('SELECT * from airline;')
data = cursor.fetchall()

In [4]:
data = pd.DataFrame(data)
descriptions = [desc[0] for desc in cursor.description]
data.columns = descriptions


In these data, "recommended" is coded as 1 and "not recommended" is coded as 0:

In [25]:
print(len(data.loc[data['recommended'] == 1]))
print(len(data.loc[data['recommended'] == 0]))

22098
19298

In [27]:
data.groupby(by = 'recommended').mean()

Out[27]:
| recommended | overall_rating | seat_comfort_rating | cabin_staff_rating | food_beverages_rating | inflight_entertainment_rating | ground_service_rating | wifi_connectivity_rating | value_money_rating |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.640637 | 2.082208 | 2.094998 | 1.773961 | 1.618873 | 1.626176 | 1.403509 | 1.812525 |
| 1 | 8.327206 | 3.953765 | 4.358122 | 3.667807 | 3.043509 | 3.992263 | 3.547085 | 4.242555 |

Unsurprisingly, ratings for flights that are recommended are higher than those that are not.

### The logistic regression model

How do these ratings influence whether or not a passenger ultimately chooses to recommend an airline to others? To find out, we can calculate the change in the odds of recommending with each unit change in each predictor. For example, for every additional point an airline scores on a given rating, how much more likely is a passenger to select "1" (would recommend) rather than "0"?

Again, we're going to ignore the ground service and wifi connectivity ratings because the vast majority of cases are missing values on these variables.

The first thing we're going to do is center all our predictors. This, as we will see shortly, makes the constant interpretable, and is useful when trying to make point predictions (i.e., what are the odds that someone recommends an airline vs. not if he/she rated it an 8, for example).

In [ ]:
predictors = ['overall_rating', 'seat_comfort_rating', 'cabin_staff_rating', 'food_beverages_rating',
              'inflight_entertainment_rating', 'value_money_rating']

for predictor in predictors:
    data['{0}_centered'.format(predictor)] = data[predictor] - data[predictor].mean()


I ran two models, one that included passengers' overall ratings, and one that did not:

In [55]:
#x = data[['seat_comfort_rating', 'cabin_staff_rating', 'food_beverages_rating',
#          'inflight_entertainment_rating', 'value_money_rating']]

x = data[['seat_comfort_rating_centered', 'cabin_staff_rating_centered',
          'food_beverages_rating_centered', 'inflight_entertainment_rating_centered',
          'value_money_rating_centered']]
x = sm.add_constant(x)  #the summary below reports a constant, so add one explicitly

logit = sm.Logit(data['recommended'], x, missing='drop')  #drop rows with missing values

results = logit.fit()
results.summary()

Optimization terminated successfully.
Current function value: 0.215391
Iterations 8

Out[55]:
| Dep. Variable: | recommended | No. Observations: | 31075 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 31069 |
| Method: | MLE | Df Model: | 5 |
| Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 0.6876 |
| Time: | 17:17:33 | Log-Likelihood: | -6693.3 |
| converged: | True | LL-Null: | -21423 |
| | | LLR p-value: | 0.000 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | 0.0595 | 0.023 | 2.610 | 0.009 | 0.015 0.104 |
| seat_comfort_rating_centered | 0.5314 | 0.022 | 24.398 | 0.000 | 0.489 0.574 |
| cabin_staff_rating_centered | 0.8361 | 0.021 | 39.970 | 0.000 | 0.795 0.877 |
| food_beverages_rating_centered | 0.2134 | 0.019 | 11.344 | 0.000 | 0.177 0.250 |
| inflight_entertainment_rating_centered | 0.0747 | 0.015 | 4.915 | 0.000 | 0.045 0.105 |
| value_money_rating_centered | 1.3341 | 0.026 | 52.277 | 0.000 | 1.284 1.384 |
In [56]:
np.exp(results.params)

Out[56]:
const                                     1.061330
seat_comfort_rating_centered              1.701393
cabin_staff_rating_centered               2.307242
food_beverages_rating_centered            1.237909
inflight_entertainment_rating_centered    1.077572
value_money_rating_centered               3.796635
dtype: float64

### Interpreting the output

The coefficients of a logistic regression are in log odds, so to make them intuitively interpretable, we exponentiate them to get odds ratios. Odds ratios are simply the likelihood of one case over the other (in our case, it's the likelihood of recommending vs not).
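As a quick sketch of that relationship (toy numbers, not from these data):

```python
import math

p = 0.8                    # probability of recommending (made up for illustration)
odds = p / (1 - p)         # odds of recommending vs. not: 4 to 1
log_odds = math.log(odds)  # the scale raw logit coefficients are reported on

# Exponentiating the log odds recovers the odds, which is exactly
# what np.exp(results.params) does to the fitted coefficients.
print(odds, math.exp(log_odds))
```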

#### The constant

The constant in this model is the predicted value (in this case, the predicted log odds) when all Xs = 0.

In this case, because we centered our variables at the mean, X = 0 refers to the mean of each variable. Thus, our constant in this analysis refers to the odds of recommending vs. not recommending for a hypothetical passenger who gave the airline mean ratings across the board. Its value, in this case 1.06, tells us that this hypothetical passenger was 6% more likely to recommend than not. This is close to the unconditional odds (i.e., when we don't have any predictors) of about 1.14, or 14% more likely, which is what you get when you divide the number of passengers who recommended by the number who did not.
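The unconditional odds can be checked directly from the counts printed earlier in this post:

```python
# Counts of recommended (1) vs. not recommended (0) from the data above
recommended = 22098
not_recommended = 19298

# Unconditional odds of recommending: how many recommendations per non-recommendation
unconditional_odds = float(recommended) / not_recommended
print(round(unconditional_odds, 2))  # ~1.15, i.e. roughly 14-15% more likely to recommend
```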

#### The coefficients

The coefficients in this model refer to the change in odds with each unit change in each variable. A value above one means that higher passenger ratings on that variable make it more likely that a passenger recommends the airline. As one might expect, this was the case across the board here.

Let's look at a specific example: The coefficient for value for money, which is the largest here, tells us that, holding all other variables constant (i.e., for our hypothetical passenger who gave the airline mean ratings on everything else), each one-point increase in the value-for-money rating multiplies the odds of recommending vs. not by about 3.8, starting from the baseline odds of 1.06.
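Putting those two numbers together gives a point prediction. This sketch uses the fitted odds ratios from Out[56] above:

```python
# Odds ratios taken from np.exp(results.params) above
base_odds = 1.061330  # constant: odds of recommending at the mean of every predictor
or_value = 3.796635   # odds ratio for value_money_rating_centered

# A passenger who is average on everything else but rates
# value for money one point above the mean:
new_odds = base_odds * or_value
prob_recommend = new_odds / (1 + new_odds)  # convert odds back to a probability
print(round(new_odds, 2), round(prob_recommend, 2))  # odds ~4.03, probability ~0.80
```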

### Including passengers' overall ratings into the model

Next, we're going to look at a model that includes passengers' overall ratings:

In [60]:
#x = data[['overall_rating', 'seat_comfort_rating', 'cabin_staff_rating', 'food_beverages_rating',
#          'inflight_entertainment_rating', 'value_money_rating']]

x = data[['overall_rating_centered', 'seat_comfort_rating_centered', 'cabin_staff_rating_centered',
          'food_beverages_rating_centered', 'inflight_entertainment_rating_centered',
          'value_money_rating_centered']]
x = sm.add_constant(x)  #the summary below reports a constant, so add one explicitly

logit = sm.Logit(data['recommended'], x, missing='drop')  #drop rows with missing values

results = logit.fit()
results.summary()

Optimization terminated successfully.
Current function value: 0.128654
Iterations 9

Out[60]:
| Dep. Variable: | recommended | No. Observations: | 28341 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 28334 |
| Method: | MLE | Df Model: | 6 |
| Date: | Tue, 17 May 2016 | Pseudo R-squ.: | 0.8093 |
| Time: | 17:32:21 | Log-Likelihood: | -3646.2 |
| converged: | True | LL-Null: | -19124 |
| | | LLR p-value: | 0.000 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | 0.9089 | 0.033 | 27.320 | 0.000 | 0.844 0.974 |
| overall_rating_centered | 1.2794 | 0.023 | 54.786 | 0.000 | 1.234 1.325 |
| seat_comfort_rating_centered | 0.1691 | 0.029 | 5.802 | 0.000 | 0.112 0.226 |
| cabin_staff_rating_centered | 0.2845 | 0.029 | 9.981 | 0.000 | 0.229 0.340 |
| food_beverages_rating_centered | 0.0528 | 0.027 | 1.974 | 0.048 | 0.000 0.105 |
| inflight_entertainment_rating_centered | 0.0154 | 0.021 | 0.726 | 0.468 | -0.026 0.057 |
| value_money_rating_centered | 0.5350 | 0.034 | 15.784 | 0.000 | 0.469 0.601 |
In [61]:
np.exp(results.params)

Out[61]:
const                                     2.481574
overall_rating_centered                   3.594482
seat_comfort_rating_centered              1.184220
cabin_staff_rating_centered               1.329129
food_beverages_rating_centered            1.054215
inflight_entertainment_rating_centered    1.015530
value_money_rating_centered               1.707426
dtype: float64

Including passengers' overall ratings substantially changed many of the coefficients. Let's go over the different parameters again:

#### The constant

The constant for this model is now 2.48, which tells us that the hypothetical passenger who gave the airline the mean score on all ratings is now ~2.5 times more likely to recommend than not (contrast this with 1.06 in the previous model). What this implies is that giving an airline an "average" overall rating is really, on average, an endorsement of the airline.
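As before, we can convert the constant's odds into a probability. This sketch uses the exponentiated constant from Out[61] above:

```python
# Exponentiated constant from the second model (np.exp(results.params) above)
const_odds = 2.481574

# Odds of ~2.48 correspond to this probability of recommending
prob = const_odds / (1 + const_odds)
print(round(prob, 2))  # ~0.71: the "all-average" passenger recommends about 71% of the time
```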

#### The coefficients

Here, compared to the previous model, the coefficients for all the other ratings are "depressed", in that they are much smaller than before. They are still positive, which means that, in general, higher ratings on these criteria make a recommendation more likely, but knowing a passenger's overall rating makes that information somewhat redundant.

## Summary

In sum, when passengers rate airlines this way, their decision to recommend is dominated by their overall, holistic rating of the airline. That said, these ratings are all correlated, so the takeaway should not be that inflight entertainment (which was a non-significant predictor even with an n of over 28,000) is unimportant, but rather that further analyses should be done to assess the relative contribution of each predictor to a passenger's overall rating.