# Skytrax Data #2: Shapley Value Regression

## Shapley Value Regression

When we have many variables predicting an outcome, how do we know which is the most "important"? In a multiple regression context, one might think, intuitively, that if you threw all the predictors into the same model, the one with the largest coefficient would be the most important predictor.

If all the predictors were perfectly orthogonal (i.e., completely uncorrelated), this would be true, as the coefficients would not change as a function of which predictors were included/excluded in the model.

However, in most applications (perhaps even the vast majority of real-world cases), predictors are correlated with one another. This results in overlap in the variance that each predictor explains, which confounds any simple use of coefficients and R^{2} statistics to judge how important one predictor is relative to another. One consequence of correlated predictors, for example, is that the order in which predictors are entered into the model affects how much each one increases R^{2}.
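To see the order effect concretely, here's a small simulation (Python 3, with made-up coefficients, so the numbers are purely illustrative): with two correlated predictors, the R^{2} credited to the first predictor depends on whether it enters the model first or second.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)  # x2 is correlated with x1
y = x1 + x2 + rng.normal(size=n)

def rsquared(y, *cols):
    """R^2 from an OLS fit of y on the given columns (plus an intercept)."""
    X = np.column_stack((np.ones(len(y)),) + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# R^2 credited to x1 when it enters first vs. when it enters after x2
r2_first = rsquared(y, x1)
r2_second = rsquared(y, x1, x2) - rsquared(y, x2)
print(r2_first, r2_second)  # the two numbers differ substantially
```

Because x2 carries much of the same information as x1, x1 looks important when it enters first and nearly redundant when it enters second.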

One solution to this problem, when dealing with multiple predictors, is to iterate over all possible orderings of the predictors and obtain the *average* importance of each predictor. This procedure is known as Shapley value regression, and is detailed here.
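In miniature, the procedure looks like this (a Python 3 sketch of the idea, not the implementation below): enumerate every ordering of the predictors, credit each predictor with the R^{2} it adds at the point it enters, and average those credits over all orderings.

```python
import itertools
import numpy as np

def rsquared(y, X):
    """R^2 from an OLS fit of y on the columns of X (plus an intercept)."""
    Z = np.column_stack((np.ones(len(y)), X))
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def shapley_shares(y, X):
    """Average, over all orderings, of the R^2 each column adds when it enters."""
    p = X.shape[1]
    shares = np.zeros(p)
    perms = list(itertools.permutations(range(p)))
    for perm in perms:
        entered, previous_r2 = [], 0.0
        for j in perm:
            entered.append(j)
            current_r2 = rsquared(y, X[:, entered])
            shares[j] += current_r2 - previous_r2
            previous_r2 = current_r2
    return shares / len(perms)
```

A nice property of this decomposition is that the shares sum exactly to the full model's R^{2}, so each predictor gets an unambiguous slice of the explained variance.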

The R package `relaimpo` does this, but I wanted to work out an analog in Python.

```
import numpy as np
import itertools

class fit:
    '''
    This creates an instance of the class fit with attributes coefficients,
    fitted_values, residuals, and rsquared
    '''
    def __init__(self, data, outcome, predictors):
        #Store these so we can call them as attributes
        self.outcome = outcome
        self.predictors = {x: 0 for x in predictors}
        #Create dictionaries with all the predictors as keys and empty lists as values.
        #This is preferred over dict.fromkeys(list, default), where the default value
        #is shared among all keys, so appending an object appends it to all keys.
        #See: http://stackoverflow.com/questions/23397153/append-value-to-one-list-in-dictionary-appends-value-to-all-lists-in-dictionary
        self.resultsrsquared = {k: [] for k in predictors}
        self.resultscoefficients = {k: [] for k in predictors}
        self.results = {k: [] for k in predictors}
        self.iterate(data, outcome, predictors)

    def ols(self, Y, X): #Y is an nx1 vector, X is an nxp matrix whose first column is the constant
        '''
        Returns the coefficients from a least-squares regression.
        '''
        X_transpose = np.transpose(X)
        X_transpose_X = np.dot(X_transpose, X)
        X_transpose_X_inverse = np.linalg.inv(X_transpose_X)
        X_transpose_Y = np.dot(X_transpose, Y)
        self.coefficients = np.dot(X_transpose_X_inverse, X_transpose_Y)
        self.intercept = float(self.coefficients[0])
        self.coefficientslist = list(self.coefficients[1:]) #Change type to list
        '''
        The lines below are for calculating R^2.
        '''
        self.fitted_values = np.dot(X, self.coefficients)
        self.residuals = Y - self.fitted_values
        #SStotal = Y'Y - (sum Y)^2 / n
        yty = np.dot(np.transpose(Y), Y)
        sst = yty - (np.sum(Y) ** 2) / len(Y)
        #SSerror
        sse = np.dot(np.transpose(self.residuals), self.residuals)
        #SSregression
        ssr = sst - sse
        #rsquared
        self.rsquared = ssr / sst
        return self.coefficientslist

    def iterate(self, data, outcome, predictors):
        #Create dataset where we remove missing values listwise
        #SOMETHING IS WRONG WITH HOW YOU ARE SELECTING COLUMNS
        Y = np.array(data['{0}'.format(outcome)])
        print "Total cases: {0}".format(len(Y))
        all_data = np.column_stack((Y, np.ones((len(Y), 1))))
        for bs in predictors:
            all_data = np.column_stack((all_data, np.array(data[bs])))
        all_data = all_data[~np.isnan(all_data).any(axis=1)]
        Y = all_data[:, 0]
        C = all_data[:, 1] #the constant
        X = all_data[:, 2:]
        print "Cases after listwise deletion: {0}".format(len(Y))
        '''
        Iterate through all predictors and return the average coefficient
        for all combinations of predictors
        '''
        for predictor in range(0, len(predictors)): #for each predictor, find its average over models with 1 to n predictors
            otherpredictors = [oth for oth in range(0, len(predictors)) if oth != predictor] #list of the other predictors
            dictionaryofcoefficients = {n: [] for n in range(1, len(predictors) + 1)} # +1 to reflect 1 to n predictors rather than 0 to n-1
            dictionaryofrsquared = {n: [] for n in range(1, len(predictors) + 1)}
            Xpredictor = np.column_stack((C, X[:, predictor]))
            for npred in range(0, len(predictors)): #from 0 to n-1 other predictors
                ordering = itertools.combinations(otherpredictors, npred)
                for comb in ordering:
                    #do ols for y and our subset of X
                    if len(comb) == 0:
                        dictionaryofcoefficients[npred + 1].append(self.ols(Y, Xpredictor)[0])
                        dictionaryofrsquared[npred + 1].append(self.rsquared)
                    else:
                        Xsub = X[:, list(comb)] #the other predictors are in this matrix
                        Xsub = np.column_stack((Xpredictor, Xsub)) #combine them
                        dictionaryofcoefficients[npred + 1].append(self.ols(Y, Xsub)[0])
                        dictionaryofrsquared[npred + 1].append(self.rsquared)
            dictionaryofrsquared.pop(len(predictors)) #because the model with all predictors always has the same R squared
            for k, v in dictionaryofcoefficients.iteritems():
                dictionaryofcoefficients[k] = sum(v) / len(v)
            self.resultscoefficients[predictors[predictor]] = dictionaryofcoefficients
            for k, v in sorted(dictionaryofrsquared.iteritems(), reverse=True):
                if k == 1:
                    dictionaryofrsquared[k] = v[0]
                else:
                    dictionaryofrsquared[k] = sum(v) / len(v)
                    #- (sum(dictionaryofrsquared[k-1])/len(dictionaryofrsquared[k-1]))
            self.resultsrsquared[predictors[predictor]] = dictionaryofrsquared
        for k, v in self.resultsrsquared.iteritems():
            self.results[k] = sum(v.values()) / len(v.values())
```

As far as I can tell, this code does about 80% of what I want it to. It returns the average coefficient for each predictor as a function of how many other predictors are in the model, and also returns the average R^{2} of the models with N predictors in them. There's still more to do, though:

- The code runs extremely slowly, probably because of the weird nested for-loops I have in there.
- The method doesn't actually return the relative importance of each predictor: that would be the average *change* in R^{2} that each predictor contributes when added to the model. Right now, it just returns the R^{2} of the whole model.
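For what it's worth, both gaps could be closed together. Here's a sketch (again in Python 3, and standalone rather than part of the class above) that computes each predictor's average ΔR^{2} directly: instead of looping over every permutation, it weights each subset of the other predictors by the fraction of orderings in which exactly that subset precedes the predictor, which is the standard combination-weighted form of the Shapley value.

```python
import itertools
import math
import numpy as np

def shapley_importance(y, X):
    """Each column's relative importance: its average R^2 gain across all
    subsets of the other columns, weighted as in the Shapley value."""
    n, p = X.shape

    def r2(cols):
        # R^2 of y regressed on the given columns (intercept-only if none)
        Z = np.column_stack((np.ones(n),) + tuple(X[:, j] for j in cols))
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

    importance = np.zeros(p)
    for i in range(p):
        others = [j for j in range(p) if j != i]
        for size in range(p):
            # Fraction of orderings in which exactly `size` others precede i
            weight = (math.factorial(size) * math.factorial(p - size - 1)
                      / math.factorial(p))
            for subset in itertools.combinations(others, size):
                importance[i] += weight * (r2(subset + (i,)) - r2(subset))
    return importance
```

The importances sum to the full model's R^{2}, so each one can be read as that predictor's share of the explained variance.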

I'll be working on this in the coming days, but for now, as an example of what does (and does not) work, let's run this on the Skytrax data and see what happens:

```
import psycopg2
import pandas as pd
import numpy as np
```

```
#get data from postgres database
connection = psycopg2.connect("dbname = skytrax user=skytraxadmin password=skytraxadmin")
cursor = connection.cursor()
cursor.execute("SELECT * from airline")
testdata = cursor.fetchall()
column_names = [desc[0] for desc in cursor.description]
testdata = pd.DataFrame(testdata)
testdata.columns = column_names
```

```
import time
import Shapley
reload(Shapley)
start = time.time()
fit = Shapley.fit(testdata, "overall_rating", ['cabin_staff_rating', 'seat_comfort_rating', 'food_beverages_rating', \
'inflight_entertainment_rating', 'value_money_rating'])
print "\nAverage coefficients with N other predictors in the model:\n"
print fit.resultscoefficients
print "\nAverage Rsquared with N other predictors in the model:\n"
print fit.resultsrsquared
print "\nThis crap took {0}s to run".format(time.time() - start)
```