Skytrax Data #3: The OLS Model

Skytrax Data #3: The OLS Model

In this post, we're going to examine one way someone might evaluate which rating is the "most important" one in predicting passengers' overall ratings: By examining coefficients in a multiple regression. Recall that besides passengers' overall ratings, there were 7 other criteria: their ratings of the flight's seat comfort, cabin staff, food and beverage, inflight entertainment, ground service, wifi connectivity, and value for money.

In [1]:
import psycopg2
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import math
import statsmodels.api as sm
In [2]:
#This connects to the postgres database above.
connection = psycopg2.connect("dbname = skytrax user=skytraxadmin password=skytraxadmin")
cursor = connection.cursor()
In [3]:
connection.rollback()
cursor.execute('SELECT * from airline;')
data = cursor.fetchall()
In [4]:
data = pd.DataFrame(data)
descriptions = [desc[0] for desc in cursor.description]
data.columns = descriptions
In [5]:
data.count()
Out[5]:
airline_name                     41396
link                             41396
title                            41396
author                           41396
author_country                   39805
date                             41396
content                          41396
aircraft                          1278
type_traveller                    2378
cabin_flown                      38520
route                             2341
overall_rating                   36861
seat_comfort_rating              33706
cabin_staff_rating               33708
food_beverages_rating            33264
inflight_entertainment_rating    31114
ground_service_rating             2203
wifi_connectivity_rating           565
value_money_rating               39723
recommended                      41396
dtype: int64

In our subsequent analysis, we're going to ignore ground service ratings as well as wifi connectivity rating because, as you can see, these have very limited data.

In [6]:
from pandas.stats.api import ols as olspd

olspd(y= data['overall_rating'], x = data[['seat_comfort_rating', 'cabin_staff_rating',\
                                           'food_beverages_rating', 'inflight_entertainment_rating',\
                                           'value_money_rating']])
C:\Users\dfthd\Anaconda3\envs\python2\lib\site-packages\IPython\core\interactiveshell.py:2885: FutureWarning: The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html
  exec(code_obj, self.user_global_ns, self.user_ns)
Out[6]:
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <seat_comfort_rating> + <cabin_staff_rating>
             + <food_beverages_rating> + <inflight_entertainment_rating> + <value_money_rating>
             + <intercept>

Number of Observations:         28341
Number of Degrees of Freedom:   6

R-squared:         0.7887
Adj R-squared:     0.7887

Rmse:              1.4956

F-stat (5, 28335): 21157.9768, p-value:     0.0000

Degrees of Freedom: model 5, resid 28335

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
seat_comfort_rating     0.4255     0.0097      43.89     0.0000     0.4065     0.4445
cabin_staff_rating     0.6621     0.0094      70.60     0.0000     0.6437     0.6805
food_beverages_rating     0.1660     0.0085      19.63     0.0000     0.1494     0.1826
inflight_entertainment_rating     0.0317     0.0062       5.09     0.0000     0.0195     0.0439
value_money_rating     0.9759     0.0103      94.89     0.0000     0.9557     0.9960
--------------------------------------------------------------------------------
     intercept    -1.5582     0.0254     -61.42     0.0000    -1.6079    -1.5085
---------------------------------End of Summary---------------------------------

First, this tells us that the model with these 5 predictors predicts about 79% of the variance in participants' overall ratings. In addition, the F-test tells us that this percentage is statistically significant, although it has to be said that with such a large sample (n = 28341), this test was probably always going to be significant.

Second, each of the coefficients tell us the change in Y (overall_ratings) for each unit change in that predictor. For example, for each unit change in passengers' seat comfort ratings, passengers overall ratings increased by .43 points.

In the case of passengers' seat comfort ratings, this relationship is statistically significant. In fact, in this analysis, all the predictors are significantly related to passengers' overall ratings.

However, it should be said that statistical significance and practical significance are not the same. From the coefficients, for example, each unit change in passengers' inflight entertainment ratings only increased their overall rating by 0.03, but each unit change in passengers' value for money ratings increased their overall ratings by .98 points.

That said, the coefficients in this analysis are conditional in that they are adjusted for one another. In other words, these coefficients represent what we know about the relationship between passenger ratings of each of the criteria given our knowledge of how they rated the other elements.

Trying to isolate the impact of a single predictor is not as simple as running 5 separate regression models, because the predictors are correlated. We can illustrate this as follows:

In [7]:
import statsmodels.api as sm

predictors = ['seat_comfort_rating', 'cabin_staff_rating', 'food_beverages_rating', 'inflight_entertainment_rating',\
                'value_money_rating']

cleaned_data = data[['overall_rating','seat_comfort_rating', 'cabin_staff_rating', 'food_beverages_rating', 'inflight_entertainment_rating',\
                'value_money_rating']]

cleaned_data = cleaned_data.dropna()

for predictor in predictors:
    #x = sm.add_constant(cleaned_data[predictor])
    model = sm.OLS(cleaned_data[predictor], cleaned_data['overall_rating'])
    results = model.fit()
    print '{0} coefficient: {1}'.format(predictor, results.params[0])
    print '{0} rsquared: {1}'.format(predictor, results.rsquared)
seat_comfort_rating coefficient: 0.485086064206
seat_comfort_rating rsquared: 0.892254153535
cabin_staff_rating coefficient: 0.53019002014
cabin_staff_rating rsquared: 0.911693773785
food_beverages_rating coefficient: 0.452602436575
food_beverages_rating rsquared: 0.852452501199
inflight_entertainment_rating coefficient: 0.374759040658
inflight_entertainment_rating rsquared: 0.710378283952
value_money_rating coefficient: 0.52075902098
value_money_rating rsquared: 0.931569962502

As you can see, the five separate OLS models give us five R2 values that do not add up to 100% of variance explained. This is because the five predictors are correlated. In a future post, we will use Shapley Value Regression to try and tease them apart.