%matplotlib inline

Skytrax ratings¶

Following up on this (http://www.quangn.com/exploring-reviews-of-airline-services/) investigation into Skytrax ratings, which are a collection of ratings by made by airline passengers, I wanted to answer some interesting questions I had of my own.

Specifically, what I was interested in is the relative contribution of each of the different metrics (e.g., ground staff, seat comfort, value) to (a) a passenger's overall rating of the airline, and (b) their decision to recommend the airline to another person. Consequently, which criteria should airlines focus on to improve customers' overall perceptions of them (the equivalent of model lift in logit regression)?

Along the way, I thought I'd sate my own curiousity about the following:

How do passenger ratings differ between types of aircraft?
How do passenger ratings differ by cabin class?
How do passenger artings differ depending on where they come from?
How have ratings changed over time?

But first, let's import the data:

import psycopg2
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from matplotlib import pyplot as plt
import math

I first loaded the data (obtained from https://github.com/quankiquanki/skytrax-reviews-dataset) into a postgres database. Pandas has a very helpful .to_sql() method that makes this easy:

#This snippet of code loads all the .csv files into a postgres database that I'd set up.
engine = create_engine("postgres://skytraxadmin:skytraxadmin@localhost/skytrax")
databases = ['airline', 'airport', 'seat', 'lounge'] #csv files

for x in databases:
    data = pd.read_csv("{0}.csv".format(x))
    data.to_sql(x, engine, index = False)

#This connects to the postgres database above.
connection = psycopg2.connect("dbname = skytrax user=skytraxadmin password=skytraxadmin")
cursor = connection.cursor()

Ratings by airline¶

The first thing I was interested in doing was a sanity check to just make sure that I was looking at the same data that the original blog post analyzed (http://www.quangn.com/exploring-reviews-of-airline-services/).

This was the case, although it appears that the original author of that post took only airlines where there were more than 100 ratings, which is what I happened to do.

First, we select the average rating of each airline, as long as that airline has more than 100 ratings. 100 is arbitrary, but if an airline has only a few ratings, the average rating is likely to be extremely variable across samples:

connection.rollback()
cursor.execute("SELECT airline_name, count(airline_name), avg(overall_rating) FROM airline \
               GROUP BY airline_name HAVING count(airline_name) > 100 ORDER BY avg(overall_rating) desc;")

airline_name_data = cursor.fetchall()
column_names = [desc[0] for desc in cursor.description] #column names are stored in the .description attribute

airline_name_data = pd.DataFrame(airline_name_data)
airline_name_data.columns = column_names

Next, we use matplotlib to plot a simple chart of the top 10 rated airlines:

top_ten_airlines = airline_name_data.loc[range(0,9), ['airline_name','avg']]
figure = plt.figure(figsize=(13,8))
bar_x = range(0,9)
plt.bar(range(0,9), top_ten_airlines['avg'], align = 'center')
plt.xticks(range(0,9),top_ten_airlines['airline_name'], rotation ='vertical', fontsize=15)
plt.xlabel('Airline Name',fontsize=15)
plt.ylabel('Rating out of 10', fontsize=15)
plt.yticks(fontsize=15)
print top_ten_airlines #the bar graph does look a little pointless when you can just print out a list of the top 10 airlines.
#but whatever.

         airline_name       avg
0     asiana-airlines  8.348837
1    garuda-indonesia  8.307692
2          air-astana  8.281553
3     bangkok-airways  8.122066
4     indigo-airlines  8.076923
5          korean-air  8.034921
6             eva-air  7.976351
7     aegean-airlines  7.823789
8  singapore-airlines  7.773148

Ratings by aircraft type¶

The next thing I wanted to do was to see if ratings differed as a function of what aircraft the passenger took.

connection.rollback()
cursor.execute("SELECT aircraft, count(aircraft), avg(overall_rating) AS ovr, \
avg(seat_comfort_rating) AS cmft, avg(inflight_entertainment_rating) AS inft FROM airline \
WHERE aircraft != 'None' GROUP BY aircraft HAVING count(aircraft) >25 ORDER BY avg(overall_rating) asc;")

aircraft_type_data = cursor.fetchall()
column_names = [d[0] for d in cursor.description]
aircraft_type_data = pd.DataFrame(aircraft_type_data)
aircraft_type_data.columns = column_names

aircraft_type_data

From the above there are only 11 aircraft types that have more than 25 ratings. A couple of aircraft have multiple listings. These appear to be the updated versions of the same aircraft (737-800 vs. 737 and A330-300 vs A330; I assume the generic types are the older models).

There can be many reasons for this, of course. Newer aircraft tend to be more comfortable and reliable. Furthermore, an airline that runs older aircraft may also be associated with other negative experiences that arise because they're struggling. Alternatively, older aircraft may also be associated with "value" airlines, in which case, if customer expectations are commensurately low, these older aircraft may actually be rated more positively. That seems like a future question worth investigaing.

Boeing vs. Airbus¶

Just for fun, let's compare the average ratings of Boeing vs. Airbus aircraft. We can use regex to select the different rows:

airbus_aircraft = aircraft_type_data.loc[aircraft_type_data['aircraft'].str.contains(r'A.*')]
boeing_aircraft = aircraft_type_data.loc[aircraft_type_data['aircraft'].str.contains(r'Boeing.*')]
print airbus_aircraft.mean()
print boeing_aircraft.mean()

count    59.500000
ovr       6.637948
cmft      3.407932
inft      3.115684
dtype: float64
count    42.333333
ovr       6.307553
cmft      3.437056
inft      3.204230
dtype: float64

Interestingly, even though Boeing aircraft have comparable (or even slightly higher) ratings on "comfort" and "inflight entertainment", Airbus aircraft have higher overall ratings.

Ratings by cabin type¶

Finally, let's take a look at whether the flying experience is different for the top 1% than it is for the rest of us. As an aside, if you go on Youtube and listen to air traffic controller recordings at JFK, the ATC uses "Top 1%" as a colloquialism for private jets. We go through the same process to import the data:

connection.rollback()
cursor.execute("SELECT avg(overall_rating) AS ovr, avg(seat_comfort_rating) AS seat, avg(cabin_staff_rating) AS cstaff,\
avg(food_beverages_rating) AS food, avg(inflight_entertainment_rating) AS inflight, avg(ground_service_rating) AS gstaff,\
avg(wifi_connectivity_rating) AS wifi, avg(value_money_rating) AS value, cabin_flown, count(*)\
FROM airline GROUP BY cabin_flown;")

data = cursor.fetchall()
column_names = [desc[0] for desc in cursor.description]

data = pd.DataFrame(data)
data.columns = column_names

data['ovr'] = data['ovr']/2 #rescaling the overall rating (out of 10) to be out of 5

data

The last thing we're going to do for now is to compare ratings between passengers who end up recommending the airline to friends vs. those who do not:

connection.rollback()
cursor.execute("SELECT avg(overall_rating)/2 AS ovr, avg(seat_comfort_rating) AS seat, avg(cabin_staff_rating) AS cstaff,\
avg(food_beverages_rating) AS food, avg(inflight_entertainment_rating) AS inflight, avg(ground_service_rating) AS gstaff,\
avg(wifi_connectivity_rating) AS wifi, avg(value_money_rating) AS value, cabin_flown, count(*), recommended \
FROM airline GROUP BY (cabin_flown, recommended) ORDER BY recommended, cabin_flown;")

cabin_recommend_data = cursor.fetchall()
column_names = [desc[0] for desc in cursor.description]

cabin_recommend_data = pd.DataFrame(cabin_recommend_data)
cabin_recommend_data.columns = column_names

The data below suggest a high degree of polarization. One might imagine that people have an implicit "threshold" value that, once crossed, results in a recommendation. Instead, what we find, perhaps unsurprisingly, is that flights that are not recommended are rated a lot worse than those that are recommended.

This is consistent with, but not exclusively because of, a process whereby people judge their flights as "good" or "bad", and then use the rating system to express their satisfaction/disatisfaction with each of the criteria.

cabin_recommend_data

	aircraft	count	ovr	cmft	inft
0	Boeing 737	47	5.553191	2.829787	2.888889
1	Boeing 747-400	26	5.615385	3.153846	2.720000
2	Boeing 777	48	5.625000	3.354167	3.191489
3	A320	132	5.636364	2.856061	2.333333
4	A330	74	6.189189	3.243243	3.101449
5	Boeing 787	43	6.255814	3.558140	3.275000
6	A319	26	6.307692	3.115385	2.727273
7	A321	45	6.444444	3.444444	2.416667
8	A380	52	7.000000	3.788462	4.115385
9	Boeing 737-800	61	7.278689	3.622951	3.257143
10	Boeing 777-300ER	29	7.517241	4.103448	3.892857
11	A330-300	28	8.250000	4.000000	4.000000

	ovr	seat	cstaff	food	inflight	gstaff	wifi	value	cabin_flown	count
0	2.111056	2.371125	2.600531	1.936776	1.578055	2.750000	2.000000	2.664648	None	2876
1	2.928374	3.150485	3.406380	2.971567	2.827994	3.644444	3.153846	3.122679	Premium Economy	1510
2	2.987079	2.986317	3.207894	2.646467	2.222536	2.573143	2.066820	3.099697	Economy	29784
3	3.325650	3.740325	3.771536	3.357233	3.002574	3.276316	3.285714	3.340547	First Class	879
4	3.437469	3.616658	3.873538	3.559762	3.046943	3.375000	2.726190	3.556118	Business Class	6347

	ovr	seat	cstaff	food	inflight	gstaff	wifi	value	cabin_flown	count	recommended
0	1.542186	2.379014	2.442661	2.227797	2.062937	2.086538	1.433333	1.934594	Business Class	1978	0
1	1.279593	2.059694	2.062266	1.728103	1.556541	1.567488	1.384354	1.822827	Economy	13988	0
2	1.392473	2.623693	2.358885	2.003546	1.942029	1.655172	1.500000	1.723473	First Class	312	0
3	1.249601	2.044709	2.166915	1.862891	1.854790	2.250000	2.000000	1.715318	Premium Economy	692	0
4	1.364362	1.574697	1.641992	1.266938	1.165536	2.000000	1.600000	1.441138	None	2328	0
5	4.203916	4.182438	4.527654	4.165486	3.493651	3.995370	3.444444	4.282444	Business Class	4369	1
6	4.140086	3.860673	4.288908	3.497350	2.851571	3.961905	3.500000	4.223123	Economy	15796	1
7	4.276896	4.363813	4.560311	4.101365	3.586826	4.276596	4.000000	4.227513	First Class	567	1
8	4.215159	4.112840	4.485084	3.936446	3.674479	4.151515	3.875000	4.316176	Premium Economy	818	1
9	4.339286	3.904145	4.445596	3.220779	2.369792	5.000000	4.000000	4.390511	None	548	1