Skytrax Data #3: The OLS Model
In this post, we're going to examine one way someone might evaluate which rating is the "most important" one in predicting passengers' overall ratings: By examining coefficients in a multiple regression. Recall that besides passengers' overall ratings, there were 7 other criteria: their ratings of the flight's seat comfort, cabin staff, food and beverage, inflight entertainment, ground service, wifi connectivity, and value for money.
import psycopg2
import pandas as pd
import statsmodels.api as sm

# Connect to the postgres database holding the Skytrax data.
connection = psycopg2.connect("dbname=skytrax user=skytraxadmin password=skytraxadmin")
cursor = connection.cursor()
connection.rollback()

# Pull the full airline table into a DataFrame, using the cursor's
# column descriptions as column names.
cursor.execute('SELECT * from airline;')
data = pd.DataFrame(cursor.fetchall())
data.columns = [desc[0] for desc in cursor.description]
data.count()
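As an aside, pandas can collapse the cursor round-trip above into a single call with pd.read_sql, which runs the query and builds the DataFrame with column names in one step. A minimal sketch, shown against a throwaway in-memory SQLite table (a stand-in, since the Postgres instance above isn't reproducible here); in real use you would pass the psycopg2 connection and 'SELECT * FROM airline;':

```python
import sqlite3

import pandas as pd

# Stand-in database with a hypothetical two-column airline table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE airline (airline_name TEXT, overall_rating REAL)')
conn.executemany('INSERT INTO airline VALUES (?, ?)',
                 [('demo air', 7.0), ('demo air', 3.0)])

# read_sql executes the query and returns a DataFrame, columns included.
data = pd.read_sql('SELECT * FROM airline;', conn)
print(data.columns.tolist())
```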
In our subsequent analysis, we're going to ignore the ground service and wifi connectivity ratings because, as you can see, these have very limited data.
# pandas' built-in regression (pandas.stats.api.ols) has since been removed,
# so we fit the same multiple regression with statsmodels instead.
predictors = ['seat_comfort_rating', 'cabin_staff_rating', 'food_beverages_rating',
              'inflight_entertainment_rating', 'value_money_rating']
cleaned_data = data[['overall_rating'] + predictors].dropna()
model = sm.OLS(cleaned_data['overall_rating'], sm.add_constant(cleaned_data[predictors]))
print(model.fit().summary())
First, this tells us that the model with these 5 predictors explains about 79% of the variance in passengers' overall ratings. In addition, the F-test tells us that this percentage is statistically significant, although it has to be said that with such a large sample (n = 28,341), this test was probably always going to be significant.
Second, each coefficient tells us the change in Y (overall_rating) for each unit change in that predictor. For example, for each unit change in passengers' seat comfort ratings, passengers' overall ratings increased by .43 points.
In the case of passengers' seat comfort ratings, this relationship is statistically significant. In fact, in this analysis, all the predictors are significantly related to passengers' overall ratings.
However, it should be said that statistical significance and practical significance are not the same thing. From the coefficients, for example, each unit change in passengers' inflight entertainment ratings increased their overall ratings by only .03 points, whereas each unit change in passengers' value for money ratings increased their overall ratings by .98 points.
That said, the coefficients in this analysis are conditional in that they are adjusted for one another. In other words, each coefficient represents the relationship between one criterion and passengers' overall ratings, given what we know about how they rated the other criteria.
Trying to isolate the impact of a single predictor is not as simple as running 5 separate regression models, because the predictors are correlated. We can illustrate this as follows:
predictors = ['seat_comfort_rating', 'cabin_staff_rating', 'food_beverages_rating',
              'inflight_entertainment_rating', 'value_money_rating']
cleaned_data = data[['overall_rating'] + predictors].dropna()

# Regress overall_rating on each predictor separately. Note that sm.OLS
# takes the outcome (endog) first and the predictor (exog) second, and
# that an intercept has to be added explicitly with sm.add_constant.
for predictor in predictors:
    x = sm.add_constant(cleaned_data[predictor])
    results = sm.OLS(cleaned_data['overall_rating'], x).fit()
    print('{0} coefficient: {1}'.format(predictor, results.params[predictor]))
    print('{0} rsquared: {1}'.format(predictor, results.rsquared))
As you can see, the five separate OLS models give us five R² values that do not simply add up to the variance explained by the full model. This is because the five predictors are correlated and share explained variance. In a future post, we will use Shapley Value Regression to try to tease their contributions apart.
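As a preview of that post, the idea can be sketched in a few lines: each predictor's Shapley value is its marginal contribution to R², averaged over every order in which the predictors could be added to the model, and these contributions sum back to the full model's R². A toy version on synthetic data (three hypothetical correlated predictors, not the Skytrax ratings):

```python
import itertools

import numpy as np

# Correlated predictors: x2 and x3 partly overlap with x1.
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
x3 = 0.3 * x1 + rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = X @ np.array([1.0, 0.5, 0.25]) + rng.normal(size=n)

def r_squared(cols):
    """R-squared of y regressed (with intercept) on the given columns."""
    if not cols:
        return 0.0
    design = np.column_stack([np.ones(n), X[:, list(cols)]])
    resid = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return 1.0 - resid.var() / y.var()

# Average each predictor's marginal R-squared gain over all orderings.
shapley = np.zeros(X.shape[1])
orderings = list(itertools.permutations(range(X.shape[1])))
for order in orderings:
    for i, col in enumerate(order):
        shapley[col] += r_squared(order[:i + 1]) - r_squared(order[:i])
shapley /= len(orderings)

# The Shapley values decompose the full model's R-squared exactly.
print(shapley, shapley.sum(), r_squared(range(X.shape[1])))
```

With 5 predictors this means averaging over 120 orderings, which is still cheap; the cost grows factorially, so real implementations enumerate subsets with combinatorial weights instead.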