Skytrax Data #4: The Logit Model

Skytrax Data #4: The Logit Model

We're going to quickly follow up on the OLS model presented in the last post with an alternative outcome variable: Whether or not passengers chose to recommend the airline to others.

In [1]:
import psycopg2
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import math
import statsmodels.api as sm
In [2]:
#This connects to the postgres database above.
connection = psycopg2.connect("dbname = skytrax user=skytraxadmin password=skytraxadmin")
cursor = connection.cursor()
In [3]:
connection.rollback()
cursor.execute('SELECT * from airline;')
data = cursor.fetchall()
In [4]:
data = pd.DataFrame(data)
descriptions = [desc[0] for desc in cursor.description]
data.columns = descriptions

In these data, "recommended" is coded as 1 and "not recommended" is coded as 0:

In [25]:
print len(data.loc[data['recommended'] == 1])
print len(data.loc[data['recommended'] == 0])
22098
19298
In [27]:
data.groupby(by = 'recommended').mean()
Out[27]:
overall_rating seat_comfort_rating cabin_staff_rating food_beverages_rating inflight_entertainment_rating ground_service_rating wifi_connectivity_rating value_money_rating
recommended
0 2.640637 2.082208 2.094998 1.773961 1.618873 1.626176 1.403509 1.812525
1 8.327206 3.953765 4.358122 3.667807 3.043509 3.992263 3.547085 4.242555

Unsurprisingly, ratings for flights that are recommended are higher than those that are not.

The logistic regression model

How do these ratings influence whether or not a passenger ultimately choses to recommend an airline to others? To find out, we can calculate the change in odds with each unit change in each variable. For example, for every point higher an airline scores on its overall rating, what is the probability that a passenger selected "1", would recommend, rather than "0"?

Again, we're going to ignore ground service ratings as well as wifi connectivity rating because the vast majority of cases have missing values on these variables.

The first thing we're going to do is center all our predictors. This, as we will see shortly, makes the constant interpretable, and is useful when trying to make point predictions (i.e., what are the odds that someone recommends an airline vs. not if he/she rated it an 8, for example).

In [ ]:
predictors = ['overall_rating', 'seat_comfort_rating', 'cabin_staff_rating', 'food_beverages_rating', \
          'inflight_entertainment_rating', 'value_money_rating']

for predictor in predictors:
    data['{0}_centered'.format(predictor)] = data[predictor] - data[predictor].mean()

I ran two models, one that included passengers' overall ratings, and one that did not:

In [55]:
import statsmodels.api as sm

#x = data[['seat_comfort_rating', 'cabin_staff_rating', 'food_beverages_rating', \
#          'inflight_entertainment_rating', 'value_money_rating']]

x = data[['seat_comfort_rating_centered', 'cabin_staff_rating_centered',\
          'food_beverages_rating_centered', 'inflight_entertainment_rating_centered', 'value_money_rating_centered']]


x = sm.add_constant(x)

logit = sm.Logit(data['recommended'], x, missing = 'drop') #drop missing values

results = logit.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.215391
         Iterations 8
Out[55]:
Logit Regression Results
Dep. Variable: recommended No. Observations: 31075
Model: Logit Df Residuals: 31069
Method: MLE Df Model: 5
Date: Tue, 17 May 2016 Pseudo R-squ.: 0.6876
Time: 17:17:33 Log-Likelihood: -6693.3
converged: True LL-Null: -21423.
LLR p-value: 0.000
coef std err z P>|z| [95.0% Conf. Int.]
const 0.0595 0.023 2.610 0.009 0.015 0.104
seat_comfort_rating_centered 0.5314 0.022 24.398 0.000 0.489 0.574
cabin_staff_rating_centered 0.8361 0.021 39.970 0.000 0.795 0.877
food_beverages_rating_centered 0.2134 0.019 11.344 0.000 0.177 0.250
inflight_entertainment_rating_centered 0.0747 0.015 4.915 0.000 0.045 0.105
value_money_rating_centered 1.3341 0.026 52.277 0.000 1.284 1.384
In [56]:
np.exp(results.params)
Out[56]:
const                                     1.061330
seat_comfort_rating_centered              1.701393
cabin_staff_rating_centered               2.307242
food_beverages_rating_centered            1.237909
inflight_entertainment_rating_centered    1.077572
value_money_rating_centered               3.796635
dtype: float64

Interpreting the output

The coefficients of a logistic regression are in log odds, so to make them intuitively interpretable, we exponentiate them to get odds ratios. Odds ratios are simply the likelihood of one case over the other (in our case, it's the likelihood of recommending vs not).

The constant

The constant in this model is the predicted value (in this case, the predicted log odds) when all Xs = 0.

In this case, because we centered our variables at the mean, X=0 refers to the mean of that variable. Thus, our constant in this analysis refers to the odds of recommending vs not recommending for a hypothetical passenger who gave the airline mean ratings across the board. It's value, in this case, 1.06, tells us that this hypothetical passenger was 6% more likely to recommend than not. This is close to the unconditional (i.e., when we don't have any predictors) ratio of 14% (which is what you get when you divide the number of people who recommended vs not).

The coefficients

The coefficients in this model refer to the change in odds with each unit change for each variable. A value above one means that higher passenger ratings on that variable makes it more likely that a passenger recommends an airline. As one might expect, this was the case across the board here.

Let's look at a specific example: The coefficient of value for money, which is the largest here, tells us that, holding all other variables constant (i.e., for our hypothetical passenger that scored an airline the mean for everything), each unit change in value for money made it 3.8 times more likely (than the base rate of 6%) that he/she would recommend vs not.

Including passengers' overall ratings into the model

Next, we're going to look at a model that includes passengers' overall ratings:

In [60]:
#x = data[['overall_rating', 'seat_comfort_rating', 'cabin_staff_rating', 'food_beverages_rating', \
#          'inflight_entertainment_rating', 'value_money_rating']]

x = data[['overall_rating_centered', 'seat_comfort_rating_centered', 'cabin_staff_rating_centered',\
          'food_beverages_rating_centered', 'inflight_entertainment_rating_centered', 'value_money_rating_centered']]

x = sm.add_constant(x)

logit = sm.Logit(data['recommended'], x, missing = 'drop') #drop missing values

results = logit.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.128654
         Iterations 9
Out[60]:
Logit Regression Results
Dep. Variable: recommended No. Observations: 28341
Model: Logit Df Residuals: 28334
Method: MLE Df Model: 6
Date: Tue, 17 May 2016 Pseudo R-squ.: 0.8093
Time: 17:32:21 Log-Likelihood: -3646.2
converged: True LL-Null: -19124.
LLR p-value: 0.000
coef std err z P>|z| [95.0% Conf. Int.]
const 0.9089 0.033 27.320 0.000 0.844 0.974
overall_rating_centered 1.2794 0.023 54.786 0.000 1.234 1.325
seat_comfort_rating_centered 0.1691 0.029 5.802 0.000 0.112 0.226
cabin_staff_rating_centered 0.2845 0.029 9.981 0.000 0.229 0.340
food_beverages_rating_centered 0.0528 0.027 1.974 0.048 0.000 0.105
inflight_entertainment_rating_centered 0.0154 0.021 0.726 0.468 -0.026 0.057
value_money_rating_centered 0.5350 0.034 15.784 0.000 0.469 0.601
In [61]:
np.exp(results.params)
Out[61]:
const                                     2.481574
overall_rating_centered                   3.594482
seat_comfort_rating_centered              1.184220
cabin_staff_rating_centered               1.329129
food_beverages_rating_centered            1.054215
inflight_entertainment_rating_centered    1.015530
value_money_rating_centered               1.707426
dtype: float64

Including passengers' overall ratings substantially changed many of the coefficients. Let's go over the different parameters agian:

The constant

The constant for this model is now 2.48, which tells us that the hypothetical passenger who gave the airline the mean score on all ratings is now ~2.5 times more likely to recommend than not (contrast this to 1.06 in the previous model). What this implies is that an "average" in terms of an overall rating is really, on average, an endorsement of the airline.

The coefficients

Here, compared to the previous model, the coefficients for all the other ratings are "depressed", in they are much smaller than before. They are still positive, which means that in general, higher ratings on these criteria = more likely to recommend, but knowing a passenger's overall rating makes that information somewhat redundant.

Summary

In sum, when passengers are made to rate airlines this way, their decision to recommend is dominated by their overall, wholistic rating of the airline. That said, these ratings are all correlated, so the takeaway point should not be that inflight entertainment (which was a non-significant predictor even with an n of 28,000) is unimportant, but rather that further analyses should be done to assess the relative contribution of each predictor to a passenger's overall rating.

Skytrax Data #3: The OLS Model

Skytrax Data #3: The OLS Model

In this post, we're going to examine one way someone might evaluate which rating is the "most important" one in predicting passengers' overall ratings: By examining coefficients in a multiple regression. Recall that besides passengers' overall ratings, there were 7 other criteria: their ratings of the flight's seat comfort, cabin staff, food and beverage, inflight entertainment, ground service, wifi connectivity, and value for money.

In [1]:
import psycopg2
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import math
import statsmodels.api as sm
In [2]:
#This connects to the postgres database above.
connection = psycopg2.connect("dbname = skytrax user=skytraxadmin password=skytraxadmin")
cursor = connection.cursor()
In [3]:
connection.rollback()
cursor.execute('SELECT * from airline;')
data = cursor.fetchall()
In [4]:
data = pd.DataFrame(data)
descriptions = [desc[0] for desc in cursor.description]
data.columns = descriptions
In [5]:
data.count()
Out[5]:
airline_name                     41396
link                             41396
title                            41396
author                           41396
author_country                   39805
date                             41396
content                          41396
aircraft                          1278
type_traveller                    2378
cabin_flown                      38520
route                             2341
overall_rating                   36861
seat_comfort_rating              33706
cabin_staff_rating               33708
food_beverages_rating            33264
inflight_entertainment_rating    31114
ground_service_rating             2203
wifi_connectivity_rating           565
value_money_rating               39723
recommended                      41396
dtype: int64

In our subsequent analysis, we're going to ignore ground service ratings as well as wifi connectivity rating because, as you can see, these have very limited data.

In [6]:
from pandas.stats.api import ols as olspd

olspd(y= data['overall_rating'], x = data[['seat_comfort_rating', 'cabin_staff_rating',\
                                           'food_beverages_rating', 'inflight_entertainment_rating',\
                                           'value_money_rating']])
C:\Users\dfthd\Anaconda3\envs\python2\lib\site-packages\IPython\core\interactiveshell.py:2885: FutureWarning: The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html
  exec(code_obj, self.user_global_ns, self.user_ns)
Out[6]:
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <seat_comfort_rating> + <cabin_staff_rating>
             + <food_beverages_rating> + <inflight_entertainment_rating> + <value_money_rating>
             + <intercept>

Number of Observations:         28341
Number of Degrees of Freedom:   6

R-squared:         0.7887
Adj R-squared:     0.7887

Rmse:              1.4956

F-stat (5, 28335): 21157.9768, p-value:     0.0000

Degrees of Freedom: model 5, resid 28335

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
seat_comfort_rating     0.4255     0.0097      43.89     0.0000     0.4065     0.4445
cabin_staff_rating     0.6621     0.0094      70.60     0.0000     0.6437     0.6805
food_beverages_rating     0.1660     0.0085      19.63     0.0000     0.1494     0.1826
inflight_entertainment_rating     0.0317     0.0062       5.09     0.0000     0.0195     0.0439
value_money_rating     0.9759     0.0103      94.89     0.0000     0.9557     0.9960
--------------------------------------------------------------------------------
     intercept    -1.5582     0.0254     -61.42     0.0000    -1.6079    -1.5085
---------------------------------End of Summary---------------------------------

First, this tells us that the model with these 5 predictors predicts about 79% of the variance in participants' overall ratings. In addition, the F-test tells us that this percentage is statistically significant, although it has to be said that with such a large sample (n = 28341), this test was probably always going to be significant.

Second, each of the coefficients tell us the change in Y (overall_ratings) for each unit change in that predictor. For example, for each unit change in passengers' seat comfort ratings, passengers overall ratings increased by .43 points.

In the case of passengers' seat comfort ratings, this relationship is statistically significant. In fact, in this analysis, all the predictors are significantly related to passengers' overall ratings.

However, it should be said that statistical significance and practical significance are not the same. From the coefficients, for example, each unit change in passengers' inflight entertainment ratings only increased their overall rating by 0.03, but each unit change in passengers' value for money ratings increased their overall ratings by .98 points.

That said, the coefficients in this analysis are conditional in that they are adjusted for one another. In other words, these coefficients represent what we know about the relationship between passenger ratings of each of the criteria given our knowledge of how they rated the other elements.

Trying to isolate the impact of a single predictor is not as simple as running 5 separate regression models, because the predictors are correlated. We can illustrate this as follows:

In [7]:
import statsmodels.api as sm

predictors = ['seat_comfort_rating', 'cabin_staff_rating', 'food_beverages_rating', 'inflight_entertainment_rating',\
                'value_money_rating']

cleaned_data = data[['overall_rating','seat_comfort_rating', 'cabin_staff_rating', 'food_beverages_rating', 'inflight_entertainment_rating',\
                'value_money_rating']]

cleaned_data = cleaned_data.dropna()

for predictor in predictors:
    #x = sm.add_constant(cleaned_data[predictor])
    model = sm.OLS(cleaned_data[predictor], cleaned_data['overall_rating'])
    results = model.fit()
    print '{0} coefficient: {1}'.format(predictor, results.params[0])
    print '{0} rsquared: {1}'.format(predictor, results.rsquared)
seat_comfort_rating coefficient: 0.485086064206
seat_comfort_rating rsquared: 0.892254153535
cabin_staff_rating coefficient: 0.53019002014
cabin_staff_rating rsquared: 0.911693773785
food_beverages_rating coefficient: 0.452602436575
food_beverages_rating rsquared: 0.852452501199
inflight_entertainment_rating coefficient: 0.374759040658
inflight_entertainment_rating rsquared: 0.710378283952
value_money_rating coefficient: 0.52075902098
value_money_rating rsquared: 0.931569962502

As you can see, the five separate OLS models give us five R2 values that do not add up to 100% of variance explained. This is because the five predictors are correlated. In a future post, we will use Shapley Value Regression to try and tease them apart.

Skytrax Data #2: Shapley Value Regression

Shapley Value Regression

When we have many variables predicting an outcome, how do we know which is the most "important"? In a multiple regression context, one might think, intuitively, that if you threw all the predictors into the same model, the one with the largest coefficient woud be the most important predictor.

If all the predictors were perfectly orthogonal (i.e., completely uncorrelated), this would be true, as the coefficients would not change as a function of which predictors were included/excluded in the model.

However, in most applications (perhaps even the vast majority of real-world cases), predictors are correlated with one another. This results in overlap in the variance that each predictor explains, and confounds simple interpretation of coefficients and R2 statistics with regard to judging how important one predictor is over another. One consequence of predictors being correlated, for example, is that the order in which predictors are entered into the model affects the degree to which they increase R2.

One solution to this problem is, when dealing with multiple predictors, to iterate over all possible orderings of the predictors, and obtain the average importance of each predictor. This procedure is known as Shapley value regression, and is detailed here.

The R package relaimpo does this, but I wanted to work out an analog in Python.

In [2]:
import numpy as np
import itertools

class fit:
    '''
    This creates an instance of the class OLS with attributes coefficients, fitted_values, residuals, and rsquared 
    '''
    def __init__(self, data, outcome, predictors):
        #We do the following so we can call them as attributes
        self.outcome = outcome
        self.predictors = {key: value for (key,value) in [(x, 0) for x in predictors]}
        #Create some dictionaries with all the predictors as keys, and empty lists as values
        self.resultsrsquared = {k:[] for k in predictors} 
        self.resultscoefficients = {k:[] for k in predictors}
        self.results = {k:[] for k in predictors}
        #The above is preferred over dict.fromkeys(list,default) as the default value is shared among all keys, so appending an object
        #will append it to all keys.
        #See: http://stackoverflow.com/questions/23397153/append-value-to-one-list-in-dictionary-appends-value-to-all-lists-in-dictionary
        self.iterate(data, outcome, predictors)
    
    '''
    The method below returns the coefficients from a least-squares regression.
    '''    
    def ols(self, Y, X): #Y is an nx1 vector, X is an nxp matrix
        X_transpose = np.transpose(X)
        X_transpose_X = np.dot(X_transpose, X)
        X_inverse = np.linalg.inv(X_transpose_X)
        X_transpose_Y = np.dot(X_transpose, Y)
               
        self.coefficients = np.dot(X_inverse, X_transpose_Y)
        self.intercept = float(self.coefficients[0])
        self.coefficientslist = list(self.coefficients[1:]) #Change type to list
        
        '''
        The lines below are for calculating R^2.
        '''
        self.fitted_values = np.dot(X, self.coefficients)
        self.residuals = Y - self.fitted_values        
        
        #SStotal
        yty = np.dot(np.transpose(Y),Y)
        j = np.ones((len(Y),len(Y)))
        ytj = np.dot(np.transpose(Y), j)
        ytj = np.dot(ytj,Y)/len(Y)
        sst = yty - ytj
        #SSerror
        sse = np.dot(np.transpose(self.residuals), self.residuals)
        #SSregression
        ssr = sst - sse
        #rsquared
        self.rsquared = ssr/sst
        return self.coefficientslist
        
    def iterate(self,data, outcome, predictors):
        #Create dataset where we remove missing values listwise
        #SOMETHING IS WRONG WITH HOW YOU ARE SELECTING COLUMNS
        Y = np.array(data['{0}'.format(outcome)])
        
        print "Total cases: {0}".format(len(Y))
        
        all_data = np.column_stack((Y, np.ones((len(Y), 1))))
        for bs in predictors:
            all_data =  np.column_stack((all_data, np.array(data[bs])))
        all_data = all_data[~np.isnan(all_data).any(axis=1)]
        Y = all_data[:,0]
        C = all_data[:,1]
        X = all_data[:,2:]
        
        print "Cases after listwise deletion: {0}".format(len(Y))   
        
        '''
        Iterate through all predictors and return the average coefficient for all combinations of predictors
        '''
        for predictor in range(0, len(predictors)): #for each predictor, let's find the average for 1 to n predictors
        
            otherpredictors = [oth for oth in range(0,len(predictors)) if oth != predictor] #list of other predictors
            dictionaryofcoefficients = {n:[] for n in range(1,len(predictors)+1)} # +1 to reflect 1 to n predictors rather than 0 to n-1
            dictionaryofrsquared = {n:[] for n in range(1,len(predictors)+1)}
            
            Xpredictor = np.column_stack((C,X[:,predictor]))

            for npred in range(0,len(predictors)): #from 1 to n-1 other predictors
                ordering = itertools.combinations(otherpredictors, npred)
                for comb in ordering:
                    #do ols for y and our subset of X
                    if len(comb) == 0:
                        dictionaryofcoefficients[npred+1].append(self.ols(Y, Xpredictor)[0]) 
                        dictionaryofrsquared[npred+1].append(self.rsquared)
                    else:
                        Xsub = X[:,list(comb)] #the other predictors are in this matrix
                        Xsub = np.column_stack((Xpredictor,Xsub)) #combine them
                        dictionaryofcoefficients[npred+1].append(self.ols(Y, Xsub)[0]) 
                        dictionaryofrsquared[npred+1].append(self.rsquared)
            
            dictionaryofrsquared.pop(len(predictors)) #because the model with all predictors always has the same R squared
            
            for k,v in dictionaryofcoefficients.iteritems():
                dictionaryofcoefficients[k] = sum(dictionaryofcoefficients[k])/len(dictionaryofcoefficients[k])
                self.resultscoefficients[predictors[predictor]] = dictionaryofcoefficients
            
            for k,v in sorted(dictionaryofrsquared.iteritems(), reverse = True):    
                if k == 1:
                    dictionaryofrsquared[k] = v[0]
                else:
                    dictionaryofrsquared[k] = (sum(dictionaryofrsquared[k])/len(dictionaryofrsquared[k])) 
                    #- (sum(dictionaryofrsquared[k-1])/len(dictionaryofrsquared[k-1]))
                self.resultsrsquared[predictors[predictor]] = dictionaryofrsquared
            
        for k, v in self.resultsrsquared.iteritems():
            self.results[k] = sum(v.values())/len(v.values())
             

As far as I can tell, this code does about 80% of what I want it to. It returns a list of what the average coefficient for each predictor is as a function of how many other predictors are in the model, and also returns the average R2 of the model with N predictors in it. There's still more to do though:

  1. The code runs extremely slow, probably because of the weird nested for-loops I have in there.
  2. The method doesn't actually return the relative importance of each predictor. This has to do with the average change in R2 that each predictor contributes when added to the model. Right now, it just returns the R2 of the whole model.

I'll be working on this in the coming days, but for now, as an example for what does (and does not) work, let's run this on the Skytrax data, and see what happens:

In [3]:
import psycopg2
import pandas as pd
import numpy as np
In [6]:
#get data from postgres database
connection = psycopg2.connect("dbname = skytrax user=skytraxadmin password=skytraxadmin")
cursor = connection.cursor()
cursor.execute("SELECT * from airline")
testdata = cursor.fetchall()
column_names = [desc[0] for desc in cursor.description]
testdata = pd.DataFrame(testdata)
testdata.columns = column_names
In [8]:
import time
import Shapley
reload(Shapley)

start = time.time()

fit = Shapley.fit(testdata, "overall_rating", ['cabin_staff_rating', 'seat_comfort_rating', 'food_beverages_rating', \
                                                    'inflight_entertainment_rating', 'value_money_rating'])

print "\nAverage coefficients with N other predictors in the model:\n"
print fit.resultscoefficients
print "\nAverage Rsquared with N other predictors in the model:\n"
print fit.resultsrsquared
print "\nThis crap took {0}s to run".format(time.time() - start)
Total cases: 41396
Cases after listwise deletion: 28341

Average coefficients with N other predictors in the model:

{'value_money_rating': {1: 1.8649063450094783, 2: 1.5144640753771565, 3: 1.2811039278852754, 4: 1.1094616213211062, 5: 0.97588253839017369}, 'cabin_staff_rating': {1: 1.7220375706163686, 2: 1.2423932090274965, 3: 0.96482096920203941, 4: 0.78336800727344302, 5: 0.66212106011664673}, 'food_beverages_rating': {1: 1.3824000680960342, 2: 0.74574656752576396, 3: 0.41993379259250335, 4: 0.25813695445838092, 5: 0.16598034625130431}, 'seat_comfort_rating': {1: 1.7279134729889485, 2: 1.103370681632768, 3: 0.76087228190797218, 4: 0.55735092810755438, 5: 0.42553696755928527}, 'inflight_entertainment_rating': {1: 0.81813312432369223, 2: 0.25077293584225169, 3: 0.10988060366414072, 4: 0.055685681377817153, 5: 0.031685430819348603}}

Average Rsquared with N other predictors in the model:

{'value_money_rating': {1: 0.70000482291020016, 2: 0.73437848675413542, 3: 0.75769834885876719, 4: 0.77509515934098283}, 'cabin_staff_rating': {1: 0.61108838008392297, 2: 0.68880734158425838, 3: 0.73593199899832795, 4: 0.76760232624119729}, 'food_beverages_rating': {1: 0.41976593597077677, 2: 0.60519081823348375, 3: 0.70695963415469354, 4: 0.7590299721429592}, 'seat_comfort_rating': {1: 0.52806397291997464, 2: 0.64985921373852684, 3: 0.71984029509356728, 4: 0.76190216873812322}, 'inflight_entertainment_rating': {1: 0.1814590593325032, 2: 0.5789830526105908, 3: 0.70255932748876526, 4: 0.75836033918843182}}

This crap took 255.414000034s to run

Skytrax Data #1

In [10]:
%matplotlib inline

Skytrax ratings

Following up on this (http://www.quangn.com/exploring-reviews-of-airline-services/) investigation into Skytrax ratings, which are a collection of ratings by made by airline passengers, I wanted to answer some interesting questions I had of my own.

Specifically, what I was interested in is the relative contribution of each of the different metrics (e.g., ground staff, seat comfort, value) to (a) a passenger's overall rating of the airline, and (b) their decision to recommend the airline to another person. Consequently, which criteria should airlines focus on to improve customers' overall perceptions of them (the equivalent of model lift in logit regression)?

Along the way, I thought I'd sate my own curiousity about the following:

  1. How do passenger ratings differ between types of aircraft?
  2. How do passenger ratings differ by cabin class?
  3. How do passenger artings differ depending on where they come from?
  4. How have ratings changed over time?

But first, let's import the data:

In [1]:
import psycopg2
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from matplotlib import pyplot as plt
import math

I first loaded the data (obtained from https://github.com/quankiquanki/skytrax-reviews-dataset) into a postgres database. Pandas has a very helpful .to_sql() method that makes this easy:

In [ ]:
#This snippet of code loads all the .csv files into a postgres database that I'd set up.
engine = create_engine("postgres://skytraxadmin:skytraxadmin@localhost/skytrax")
databases = ['airline', 'airport', 'seat', 'lounge'] #csv files

for x in databases:
    data = pd.read_csv("{0}.csv".format(x))
    data.to_sql(x, engine, index = False)
In [2]:
#This connects to the postgres database above.
connection = psycopg2.connect("dbname = skytrax user=skytraxadmin password=skytraxadmin")
cursor = connection.cursor()

Ratings by airline

The first thing I was interested in doing was a sanity check to just make sure that I was looking at the same data that the original blog post analyzed (http://www.quangn.com/exploring-reviews-of-airline-services/).

This was the case, although it appears that the original author of that post took only airlines where there were more than 100 ratings, which is what I happened to do.

First, we select the average rating of each airline, as long as that airline has more than 100 ratings. 100 is arbitrary, but if an airline has only a few ratings, the average rating is likely to be extremely variable across samples:

In [4]:
connection.rollback()
cursor.execute("SELECT airline_name, count(airline_name), avg(overall_rating) FROM airline \
               GROUP BY airline_name HAVING count(airline_name) > 100 ORDER BY avg(overall_rating) desc;")
In [5]:
airline_name_data = cursor.fetchall()
column_names = [desc[0] for desc in cursor.description] #column names are stored in the .description attribute
In [8]:
airline_name_data = pd.DataFrame(airline_name_data)
airline_name_data.columns = column_names

Next, we use matplotlib to plot a simple chart of the top 10 rated airlines:

In [13]:
top_ten_airlines = airline_name_data.loc[range(0,9), ['airline_name','avg']]
figure = plt.figure(figsize=(13,8))
bar_x = range(0,9)
plt.bar(range(0,9), top_ten_airlines['avg'], align = 'center')
plt.xticks(range(0,9),top_ten_airlines['airline_name'], rotation ='vertical', fontsize=15)
plt.xlabel('Airline Name',fontsize=15)
plt.ylabel('Rating out of 10', fontsize=15)
plt.yticks(fontsize=15)
print top_ten_airlines #the bar graph does look a little pointless when you can just print out a list of the top 10 airlines.
#but whatever.
         airline_name       avg
0     asiana-airlines  8.348837
1    garuda-indonesia  8.307692
2          air-astana  8.281553
3     bangkok-airways  8.122066
4     indigo-airlines  8.076923
5          korean-air  8.034921
6             eva-air  7.976351
7     aegean-airlines  7.823789
8  singapore-airlines  7.773148

Ratings by aircraft type

The next thing I wanted to do was to see if ratings differed as a function of what aircraft the passenger took.

In [27]:
connection.rollback()
cursor.execute("SELECT aircraft, count(aircraft), avg(overall_rating) AS ovr, \
avg(seat_comfort_rating) AS cmft, avg(inflight_entertainment_rating) AS inft FROM airline \
WHERE aircraft != 'None' GROUP BY aircraft HAVING count(aircraft) >25 ORDER BY avg(overall_rating) asc;")
In [28]:
aircraft_type_data = cursor.fetchall()
column_names = [d[0] for d in cursor.description]
aircraft_type_data = pd.DataFrame(aircraft_type_data)
aircraft_type_data.columns = column_names
In [29]:
aircraft_type_data
Out[29]:
aircraft count ovr cmft inft
0 Boeing 737 47 5.553191 2.829787 2.888889
1 Boeing 747-400 26 5.615385 3.153846 2.720000
2 Boeing 777 48 5.625000 3.354167 3.191489
3 A320 132 5.636364 2.856061 2.333333
4 A330 74 6.189189 3.243243 3.101449
5 Boeing 787 43 6.255814 3.558140 3.275000
6 A319 26 6.307692 3.115385 2.727273
7 A321 45 6.444444 3.444444 2.416667
8 A380 52 7.000000 3.788462 4.115385
9 Boeing 737-800 61 7.278689 3.622951 3.257143
10 Boeing 777-300ER 29 7.517241 4.103448 3.892857
11 A330-300 28 8.250000 4.000000 4.000000

From the above there are only 11 aircraft types that have more than 25 ratings. A couple of aircraft have multiple listings. These appear to be the updated versions of the same aircraft (737-800 vs. 737 and A330-300 vs A330; I assume the generic types are the older models).

There can be many reasons for this, of course. Newer aircraft tend to be more comfortable and reliable. Furthermore, an airline that runs older aircraft may also be associated with other negative experiences that arise because they're struggling. Alternatively, older aircraft may also be associated with "value" airlines, in which case, if customer expectations are commensurately low, these older aircraft may actually be rated more positively. That seems like a future question worth investigaing.

Boeing vs. Airbus

Just for fun, let's compare the average ratings of Boeing vs. Airbus aircraft. We can use regex to select the different rows:

In [30]:
airbus_aircraft = aircraft_type_data.loc[aircraft_type_data['aircraft'].str.contains(r'A.*')]
boeing_aircraft = aircraft_type_data.loc[aircraft_type_data['aircraft'].str.contains(r'Boeing.*')]
print airbus_aircraft.mean()
print boeing_aircraft.mean()
count    59.500000
ovr       6.637948
cmft      3.407932
inft      3.115684
dtype: float64
count    42.333333
ovr       6.307553
cmft      3.437056
inft      3.204230
dtype: float64

Interestingly, even though Boeing aircraft have comparable (or even slightly higher) ratings on "comfort" and "inflight entertainment", Airbus aircraft have higher overall ratings.

Ratings by cabin type

Finally, let's take a look at whether the flying experience is different for the top 1% than it is for the rest of us. As an aside, if you go on Youtube and listen to air traffic controller recordings at JFK, the ATC uses "Top 1%" as a colloquialism for private jets. We go through the same process to import the data:

In [31]:
connection.rollback()
cursor.execute("SELECT avg(overall_rating) AS ovr, avg(seat_comfort_rating) AS seat, avg(cabin_staff_rating) AS cstaff,\
avg(food_beverages_rating) AS food, avg(inflight_entertainment_rating) AS inflight, avg(ground_service_rating) AS gstaff,\
avg(wifi_connectivity_rating) AS wifi, avg(value_money_rating) AS value, cabin_flown, count(*)\
FROM airline GROUP BY cabin_flown;")
In [32]:
data = cursor.fetchall()
column_names = [desc[0] for desc in cursor.description]
In [33]:
data = pd.DataFrame(data)
data.columns = column_names
In [41]:
data['ovr'] = data['ovr']/2 #rescaling the overall rating (out of 10) to be out of 5
In [40]:
data
Out[40]:
ovr seat cstaff food inflight gstaff wifi value cabin_flown count
0 2.111056 2.371125 2.600531 1.936776 1.578055 2.750000 2.000000 2.664648 None 2876
1 2.928374 3.150485 3.406380 2.971567 2.827994 3.644444 3.153846 3.122679 Premium Economy 1510
2 2.987079 2.986317 3.207894 2.646467 2.222536 2.573143 2.066820 3.099697 Economy 29784
3 3.325650 3.740325 3.771536 3.357233 3.002574 3.276316 3.285714 3.340547 First Class 879
4 3.437469 3.616658 3.873538 3.559762 3.046943 3.375000 2.726190 3.556118 Business Class 6347

The last thing we're going to do for now is to compare ratings between passengers who end up recommending the airline to friends vs. those who do not:

In [35]:
connection.rollback()
cursor.execute("SELECT avg(overall_rating)/2 AS ovr, avg(seat_comfort_rating) AS seat, avg(cabin_staff_rating) AS cstaff,\
avg(food_beverages_rating) AS food, avg(inflight_entertainment_rating) AS inflight, avg(ground_service_rating) AS gstaff,\
avg(wifi_connectivity_rating) AS wifi, avg(value_money_rating) AS value, cabin_flown, count(*), recommended \
FROM airline GROUP BY (cabin_flown, recommended) ORDER BY recommended, cabin_flown;")
In [36]:
cabin_recommend_data = cursor.fetchall()
column_names = [desc[0] for desc in cursor.description]
In [37]:
cabin_recommend_data = pd.DataFrame(cabin_recommend_data)
cabin_recommend_data.columns = column_names

The data below suggest a high degree of polarization. One might imagine that people have an implicit "threshold" value that, once crossed, results in a recommendation. Instead, what we find, perhaps unsurprisingly, is that flights that are not recommended are rated a lot worse than those that are recommended.

This is consistent with, but not exclusively because of, a process whereby people judge their flights as "good" or "bad", and then use the rating system to express their satisfaction/disatisfaction with each of the criteria.

In [38]:
cabin_recommend_data
Out[38]:
ovr seat cstaff food inflight gstaff wifi value cabin_flown count recommended
0 1.542186 2.379014 2.442661 2.227797 2.062937 2.086538 1.433333 1.934594 Business Class 1978 0
1 1.279593 2.059694 2.062266 1.728103 1.556541 1.567488 1.384354 1.822827 Economy 13988 0
2 1.392473 2.623693 2.358885 2.003546 1.942029 1.655172 1.500000 1.723473 First Class 312 0
3 1.249601 2.044709 2.166915 1.862891 1.854790 2.250000 2.000000 1.715318 Premium Economy 692 0
4 1.364362 1.574697 1.641992 1.266938 1.165536 2.000000 1.600000 1.441138 None 2328 0
5 4.203916 4.182438 4.527654 4.165486 3.493651 3.995370 3.444444 4.282444 Business Class 4369 1
6 4.140086 3.860673 4.288908 3.497350 2.851571 3.961905 3.500000 4.223123 Economy 15796 1
7 4.276896 4.363813 4.560311 4.101365 3.586826 4.276596 4.000000 4.227513 First Class 567 1
8 4.215159 4.112840 4.485084 3.936446 3.674479 4.151515 3.875000 4.316176 Premium Economy 818 1
9 4.339286 3.904145 4.445596 3.220779 2.369792 5.000000 4.000000 4.390511 None 548 1