Skytrax Data #2: Shapley Value Regression

Shapley Value Regression

When we have many variables predicting an outcome, how do we know which is the most "important"? In a multiple regression context, one might think, intuitively, that if you threw all the predictors into the same model, the one with the largest coefficient woud be the most important predictor.

If all the predictors were perfectly orthogonal (i.e., completely uncorrelated), this would be true, as the coefficients would not change as a function of which predictors were included/excluded in the model.

However, in most applications (perhaps even the vast majority of real-world cases), predictors are correlated with one another. This results in overlap in the variance that each predictor explains, and confounds simple interpretation of coefficients and R2 statistics with regard to judging how important one predictor is over another. One consequence of predictors being correlated, for example, is that the order in which predictors are entered into the model affects the degree to which they increase R2.

One solution to this problem is, when dealing with multiple predictors, to iterate over all possible orderings of the predictors, and obtain the average importance of each predictor. This procedure is known as Shapley value regression, and is detailed here.

The R package relaimpo does this, but I wanted to work out an analog in Python.

In [2]:
import numpy as np
import itertools

class fit:
    '''
    This creates an instance of the class OLS with attributes coefficients, fitted_values, residuals, and rsquared 
    '''
    def __init__(self, data, outcome, predictors):
        #We do the following so we can call them as attributes
        self.outcome = outcome
        self.predictors = {key: value for (key,value) in [(x, 0) for x in predictors]}
        #Create some dictionaries with all the predictors as keys, and empty lists as values
        self.resultsrsquared = {k:[] for k in predictors} 
        self.resultscoefficients = {k:[] for k in predictors}
        self.results = {k:[] for k in predictors}
        #The above is preferred over dict.fromkeys(list,default) as the default value is shared among all keys, so appending an object
        #will append it to all keys.
        #See: http://stackoverflow.com/questions/23397153/append-value-to-one-list-in-dictionary-appends-value-to-all-lists-in-dictionary
        self.iterate(data, outcome, predictors)
    
    '''
    The method below returns the coefficients from a least-squares regression.
    '''    
    def ols(self, Y, X): #Y is an nx1 vector, X is an nxp matrix
        X_transpose = np.transpose(X)
        X_transpose_X = np.dot(X_transpose, X)
        X_inverse = np.linalg.inv(X_transpose_X)
        X_transpose_Y = np.dot(X_transpose, Y)
               
        self.coefficients = np.dot(X_inverse, X_transpose_Y)
        self.intercept = float(self.coefficients[0])
        self.coefficientslist = list(self.coefficients[1:]) #Change type to list
        
        '''
        The lines below are for calculating R^2.
        '''
        self.fitted_values = np.dot(X, self.coefficients)
        self.residuals = Y - self.fitted_values        
        
        #SStotal
        yty = np.dot(np.transpose(Y),Y)
        j = np.ones((len(Y),len(Y)))
        ytj = np.dot(np.transpose(Y), j)
        ytj = np.dot(ytj,Y)/len(Y)
        sst = yty - ytj
        #SSerror
        sse = np.dot(np.transpose(self.residuals), self.residuals)
        #SSregression
        ssr = sst - sse
        #rsquared
        self.rsquared = ssr/sst
        return self.coefficientslist
        
    def iterate(self,data, outcome, predictors):
        #Create dataset where we remove missing values listwise
        #SOMETHING IS WRONG WITH HOW YOU ARE SELECTING COLUMNS
        Y = np.array(data['{0}'.format(outcome)])
        
        print "Total cases: {0}".format(len(Y))
        
        all_data = np.column_stack((Y, np.ones((len(Y), 1))))
        for bs in predictors:
            all_data =  np.column_stack((all_data, np.array(data[bs])))
        all_data = all_data[~np.isnan(all_data).any(axis=1)]
        Y = all_data[:,0]
        C = all_data[:,1]
        X = all_data[:,2:]
        
        print "Cases after listwise deletion: {0}".format(len(Y))   
        
        '''
        Iterate through all predictors and return the average coefficient for all combinations of predictors
        '''
        for predictor in range(0, len(predictors)): #for each predictor, let's find the average for 1 to n predictors
        
            otherpredictors = [oth for oth in range(0,len(predictors)) if oth != predictor] #list of other predictors
            dictionaryofcoefficients = {n:[] for n in range(1,len(predictors)+1)} # +1 to reflect 1 to n predictors rather than 0 to n-1
            dictionaryofrsquared = {n:[] for n in range(1,len(predictors)+1)}
            
            Xpredictor = np.column_stack((C,X[:,predictor]))

            for npred in range(0,len(predictors)): #from 1 to n-1 other predictors
                ordering = itertools.combinations(otherpredictors, npred)
                for comb in ordering:
                    #do ols for y and our subset of X
                    if len(comb) == 0:
                        dictionaryofcoefficients[npred+1].append(self.ols(Y, Xpredictor)[0]) 
                        dictionaryofrsquared[npred+1].append(self.rsquared)
                    else:
                        Xsub = X[:,list(comb)] #the other predictors are in this matrix
                        Xsub = np.column_stack((Xpredictor,Xsub)) #combine them
                        dictionaryofcoefficients[npred+1].append(self.ols(Y, Xsub)[0]) 
                        dictionaryofrsquared[npred+1].append(self.rsquared)
            
            dictionaryofrsquared.pop(len(predictors)) #because the model with all predictors always has the same R squared
            
            for k,v in dictionaryofcoefficients.iteritems():
                dictionaryofcoefficients[k] = sum(dictionaryofcoefficients[k])/len(dictionaryofcoefficients[k])
                self.resultscoefficients[predictors[predictor]] = dictionaryofcoefficients
            
            for k,v in sorted(dictionaryofrsquared.iteritems(), reverse = True):    
                if k == 1:
                    dictionaryofrsquared[k] = v[0]
                else:
                    dictionaryofrsquared[k] = (sum(dictionaryofrsquared[k])/len(dictionaryofrsquared[k])) 
                    #- (sum(dictionaryofrsquared[k-1])/len(dictionaryofrsquared[k-1]))
                self.resultsrsquared[predictors[predictor]] = dictionaryofrsquared
            
        for k, v in self.resultsrsquared.iteritems():
            self.results[k] = sum(v.values())/len(v.values())
             

As far as I can tell, this code does about 80% of what I want it to. It returns a list of what the average coefficient for each predictor is as a function of how many other predictors are in the model, and also returns the average R2 of the model with N predictors in it. There's still more to do though:

  1. The code runs extremely slow, probably because of the weird nested for-loops I have in there.
  2. The method doesn't actually return the relative importance of each predictor. This has to do with the average change in R2 that each predictor contributes when added to the model. Right now, it just returns the R2 of the whole model.

I'll be working on this in the coming days, but for now, as an example for what does (and does not) work, let's run this on the Skytrax data, and see what happens:

In [3]:
import psycopg2
import pandas as pd
import numpy as np
In [6]:
#get data from postgres database
connection = psycopg2.connect("dbname = skytrax user=skytraxadmin password=skytraxadmin")
cursor = connection.cursor()
cursor.execute("SELECT * from airline")
testdata = cursor.fetchall()
column_names = [desc[0] for desc in cursor.description]
testdata = pd.DataFrame(testdata)
testdata.columns = column_names
In [8]:
import time
import Shapley
reload(Shapley)

start = time.time()

fit = Shapley.fit(testdata, "overall_rating", ['cabin_staff_rating', 'seat_comfort_rating', 'food_beverages_rating', \
                                                    'inflight_entertainment_rating', 'value_money_rating'])

print "\nAverage coefficients with N other predictors in the model:\n"
print fit.resultscoefficients
print "\nAverage Rsquared with N other predictors in the model:\n"
print fit.resultsrsquared
print "\nThis crap took {0}s to run".format(time.time() - start)
Total cases: 41396
Cases after listwise deletion: 28341

Average coefficients with N other predictors in the model:

{'value_money_rating': {1: 1.8649063450094783, 2: 1.5144640753771565, 3: 1.2811039278852754, 4: 1.1094616213211062, 5: 0.97588253839017369}, 'cabin_staff_rating': {1: 1.7220375706163686, 2: 1.2423932090274965, 3: 0.96482096920203941, 4: 0.78336800727344302, 5: 0.66212106011664673}, 'food_beverages_rating': {1: 1.3824000680960342, 2: 0.74574656752576396, 3: 0.41993379259250335, 4: 0.25813695445838092, 5: 0.16598034625130431}, 'seat_comfort_rating': {1: 1.7279134729889485, 2: 1.103370681632768, 3: 0.76087228190797218, 4: 0.55735092810755438, 5: 0.42553696755928527}, 'inflight_entertainment_rating': {1: 0.81813312432369223, 2: 0.25077293584225169, 3: 0.10988060366414072, 4: 0.055685681377817153, 5: 0.031685430819348603}}

Average Rsquared with N other predictors in the model:

{'value_money_rating': {1: 0.70000482291020016, 2: 0.73437848675413542, 3: 0.75769834885876719, 4: 0.77509515934098283}, 'cabin_staff_rating': {1: 0.61108838008392297, 2: 0.68880734158425838, 3: 0.73593199899832795, 4: 0.76760232624119729}, 'food_beverages_rating': {1: 0.41976593597077677, 2: 0.60519081823348375, 3: 0.70695963415469354, 4: 0.7590299721429592}, 'seat_comfort_rating': {1: 0.52806397291997464, 2: 0.64985921373852684, 3: 0.71984029509356728, 4: 0.76190216873812322}, 'inflight_entertainment_rating': {1: 0.1814590593325032, 2: 0.5789830526105908, 3: 0.70255932748876526, 4: 0.75836033918843182}}

This crap took 255.414000034s to run