Mousetracker Data #3¶

In this post, I'm extracting some additional information about the stimuli so that we can run further analysis on participants' choices. For further background, please refer to the first post in this series.

import os
import re
import pandas as pd

data = pd.read_csv('./data cleaned/%s' % os.listdir('./data cleaned')[0])
data.head()

includelist = pd.read_csv('n=452 subjectID.csv', header = None)
includelist = includelist[0].values

data = data.loc[data['subject'].isin(includelist)]

Overview¶

The basic idea is that participants were shown one of two sets of poker chips that they would split between themselves and another person close to them. In every case, they could make a choice that was either selfish (gave more to themselves), or altruistic (gave more to their partner).

We wanted to know how much utility a choice would have to give to a participant before it made them selfish. In other words, how much more would a selfish choice have to give me for me to not be altruistic.

data['RESPONSE1'] = [x for x in data['resp_1'].str.extract(r'..(\d*)_.(\d*)', expand = True).values]
data['RESPONSE2'] = [x for x in data['resp_2'].str.extract(r'..(\d*)_.(\d*)', expand = True).values]

The columns 'resp_1' and 'resp_2' are image files of the choices shown. The naming convention is as follows: 'S' for how many chips the self gets, followed by that number, and 'O' for how many chips the other person gets.

We also have two other columns of interest: 'response', and 'error'. 'response' is which option the participant chose, and 'error' is whether or not that was the selfish choice. In this case, '0' is a selfish choice and '1' is an altruistic choice. This will become important shortly.

data.head()

Algorithm¶

This got a little more complicated because the software we were using to capture mouse telemetry data randomized the position of the stimuli (i.e., whether the selfish choice was on the left or right of the screen was randomized), as it should.

This information is not a feature/variable on its own, but can be inferred from the 'response' and 'error' variables. If a participant chose the option on the left (response == 1), and that was coded as an 'error', it means the selfish choice was on the right (because 'errors' are altruistic choices).

There were two pieces of information that we wanted to extract:

How many more chips did the selfish choice give vs. the altruistic choice?
How many more chips did the selfish choice give the group vs. the altruistic choice?

For example, let's look at the first row data:

data.head(1)

In this case, our participant chose the left option, which was the selfish choice. This choice gave him/her 1 more chip (9-8), and gave the group 4 fewer chips ((9+7)-(8+12)).

I didn't have time to think of an efficient way to do this for all the rows at once, so I decided to brute force it. First, I created a smaller dataframe:

tempdata = pd.DataFrame(columns = ('RESPONSE','ERROR','RESPONSE1','RESPONSE2'))
tempdata['RESPONSE'] = data['response']
tempdata['ERROR'] = data['error']
tempdata['RESPONSE1'] = data['RESPONSE1']
tempdata['RESPONSE2'] = data['RESPONSE2']
tempdata['SELFISHCHOICESELFMORE'] = 0
tempdata['SELFISHCHOICEGROUPMORE'] = 0

This algorithm basically iterates through each row in the data fram, checks to see if the selfish choice is on the left or right, and does the math I described above.

SELFISHCHOICESELFMORE = []
SELFISHCHOICEGROUPMORE = []

for row in tempdata.iterrows():
    if (row[1][0] == 1) & (row[1][1] == 0) | ((row[1][0] == 2) & (row[1][1] == 1)):
        try:
            SELFISHCHOICESELFMORE.append(int(row[1][2][0]) - int(row[1][3][0]))
            SELFISHCHOICEGROUPMORE.append((int(row[1][2][0]) + int(row[1][2][1])) - (int(row[1][3][0]) + int(row[1][3][1])))
        except:
            SELFISHCHOICESELFMORE.append(None)
            SELFISHCHOICEGROUPMORE.append(None)
    elif ((row[1][0] == 2) & (row[1][1] == 0)) | ((row[1][0] == 1) & (row[1][1] == 1)):
        try:
            SELFISHCHOICESELFMORE.append(int(row[1][3][0]) - int(row[1][2][0]))
            SELFISHCHOICEGROUPMORE.append((int(row[1][3][0]) + int(row[1][3][1])) - (int(row[1][2][0]) + int(row[1][2][1])))
        except:
            SELFISHCHOICESELFMORE.append(None)
            SELFISHCHOICEGROUPMORE.append(None)

tempdata = tempdata.drop(['RESPONSE1','RESPONSE2', 'RESPONSE', 'ERROR'], axis = 1)
tempdata['SELFISHCHOICESELFMORE'] = SELFISHCHOICESELFMORE
tempdata['SELFISHCHOICEGROUPMORE'] = SELFISHCHOICEGROUPMORE

Concatenating and writing to a csv:

outdata = pd.concat([data,tempdata], axis = 1)
outdata.to_csv('combineddata3.csv')

%matplotlib inline

Misc #3: Soup¶

My friend and I once argued about whether one "eats" or "drinks" soup. I believed that you "drink" soup, whereas she thought that you "eat" it.

From our conversation, I realized the verbs people associate with various foods have to do with the state of matter they mentally conjure up when thinking about those foods. For example, I posit that if you came up to me while I was holding a Slurpee (for the sake of balance, other semi-frozen beverages are available), I would be eating it if it just came out of the machine, but drinking it if I'd been standing out in the sweltering New York summer heat for a while.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import math

Thus, this simple dataset about soup was born. Using Amazon's Mechanical Turk, I asked 107 people: (a) Whether they thought soup was more similar to stew (closer to solid) or broth (closer to liquid), and (b) whether they were more likely to eat or drink soup.

Participants answered both questions on a 1 to 5 scale with 5 being "most likely to drink" and "closer to stew" (effectively reverse-coding their responses and erring conservatively with regard to my own hypothesis should participants cluster their responses at the ends of the scale). There was also a free-response question that was quite entertaining to read.

soupdata = pd.read_csv('SOUPDATA.csv')

soupdata.head()

Scatterplot of responses¶

This first scatterplot is not a very good one, because both scales consisted of 5 discrete points, so the points on the graph overlap (we'll fix that in a second).

plt.figure(figsize=(12,10))
plt.scatter(soupdata['SIMILAR'],soupdata['EATDRINK']);
plt.xlabel("5 = most similar to stew", size = 24);
plt.ylabel("5 = more likely to drink", size = 24);

We can artifically jitter the points to make the variance of these points more visible. First, let's write a simple function that takes an array and jitters it:

def jitter(array):
    multiplier = 1 + 0.1* np.random.random(np.shape(array))
    return multiplier * array

This allows us to see how responses were clustered. Already, this pattern of data appears to reflect a negative correlation (which is what I expected), but we'll test that later.

plt.figure(figsize=(12,10))
plt.scatter(jitter(soupdata['SIMILAR']), jitter(soupdata['EATDRINK']));
plt.xlabel("5 = most similar to stew", size = 24);
plt.ylabel("5 = more likely to drink", size = 24);

Broth vs. stew¶

We can then look at the distribution of the two scales to give us further insight into how our sample responded to the two scales.

np.mean(soupdata['SIMILAR'])

2.9056603773584904

This mean tells us that on average, the sample was split between whether or not soup was broth-like or stew-like.

However, the distribution was such that there weren't two discrete groups of "broth people" vs. "stew people". Rather, most people think it's somewhere in between, or lean slightly to one side. In some ways, this serves as a "sanity check" and also tells us that people know that different types of soup exist.

plt.figure(figsize=(8,6));
plt.hist(soupdata['SIMILAR'], bins = range(1,7));
plt.xlabel('5 = most simlar to stew', size = 24);
plt.ylabel('Frequency', size = 24);
plt.xticks(range(1,6));

Eating vs. drinking soup¶

On the other hand, the participants' average "eating/drinking" score was slightly below the mean, indictating that they'd be more likely to "eat" soup than "drink" it. This is not surprising considering I restricted the sample to respondents living in the U.S., where my friend and I were having our argument.

np.mean(soupdata['EATDRINK'])

2.4622641509433962

The distribution of responses is also normal, with a slight positive skew.

plt.figure(figsize=(8,6));
plt.hist(soupdata['EATDRINK'], bins = range(1,7));
plt.xlabel('5 = most simlar to stew', size = 24);
plt.ylabel('Frequency', size = 24);
plt.xticks(range(1,6));

To tell us whether or not the 2.46 average that we obtained (vs. the scale midpoint of 3) might have been due to sampling error, we can also calculate the results of a one-way t-test, comparing 2.46 to the midpoint of the scale. This involves finding the standard error of the sample, and although there are numerous packages that can do this for you, it's easy enough to do by hand:

#Standard deviation
soupsd = np.std(soupdata['EATDRINK'])
print 'Standard deviation = {0}'.format(soupsd)

#Standard error = standard deviation/square root(sample size)
soupse = soupsd/math.sqrt(len(soupdata['EATDRINK']))
print 'Standard error = {0}'.format(soupse)

#t-value = (mean1 - mean2)/standard error
tvalue = (3 - np.mean(soupdata['EATDRINK']))/soupse
print 't({0}) = {1}'.format(len(soupdata['EATDRINK']), tvalue)

Standard deviation = 1.08309939875
Standard error = 0.105199913354
t(106) = 5.11156171059

This gives us a t-value of 5.11 with 106 degrees of freedom, which, if you look up in a t-distribution table, returns a p of < .01 (for reference, a t-value of ~2 with df of ~30 gives you p <.05). We can also calculate confidence intervals:

upperbound = 2.462 + (soupse *1.96)
lowerbound = 2.462 - (soupse *1.96)
print '95% confidence intervals [{0}]'.format((lowerbound,upperbound))

95% confidence intervals [(2.255808169826536, 2.6681918301734644)]

This tells us that if we randomly sampled our population over and over again, 95% of the sample means would fall between 2.26 and 2.67, still below the scale midpoint of 3. Taken together, it's quite likely (and also quite theoretically reasonable), that these participants' tendency to favor "eating" soup was not due to sampling error.

Predicting eating vs. drinking from broth-y vs. stew-y soups¶

So, what about my hypothesis? To test whether people's mental representation of soup predicts the verb that they use to describe consuming it, we can run a simply OLS regression. I'm not going to manually write out the algorithm for it (although I have before, see here), and instead, borrow the much prettier statsmodel one.

(CAVEAT: These are correlational data; the usual caveats with regard to establishing causality apply)

import statsmodels.api as sm

subdata = sm.add_constant(soupdata)
subdata['SIMILARC'] = subdata['SIMILAR'] - np.mean(subdata['SIMILAR'])
model = sm.OLS(subdata['EATDRINK'],subdata.loc[:,['const','SIMILARC']],hasconstant=1)
results = model.fit()

results.summary()

In this analysis, the first coefficient (const), 2.46, tells us that a hypothetical person who though soup could be just as much broth-like as it could be stew-like is still more likely to eat it than drink it (2.46 being significantly lower than the midpoint of 3).

The second coefficient, SIMILARC, is negative, statistically significant and tells us that believing soup is stew-like is associated with saying that one would eat (vs. drink) soup. Specifically, for every 1 point a hypothetical respondent says that soup is similar to stew (vs. broth), he/she is predicted to respond -.27 points lower on the eat vs. drink scale.

Coming back to our original scatter plot, we can draw a fit-line over it:

#Defining a function for our fit line:
def fitline(Xs):
    return [y*(-.2701)+3.47 for y in Xs] #The 3.47 (vs. 2.46) comes from the uncentered version of this analysis

plt.figure(figsize=(12,10))
plt.scatter(jitter(soupdata['SIMILAR']), jitter(soupdata['EATDRINK']));
plt.xlabel("5 = most similar to stew", size = 24);
plt.ylabel("5 = more likely to drink", size = 24);
plt.plot(range(1,6), fitline(range(1,6)));

Conclusion¶

In summary, I learned that:

Most people obviously know that there are different types of soup, but there is variabilty in what someone's "default" category is.
Not everyone "eats" soup, but more people in the U.S. "eat" it than "drink it".
Finally, people who associate soup with stew are more likely to eat it than drink it.

I'm not sure who won the argument in the end, but this analysis does feel a little like winning the battle but losing the war.

import pandas as pd

Misc #2: Restructuring data¶

This post details some of the "dirty work" (specifically, restructuring data), that goes on behind data analysis.

In this case, I'm trying to restructure a person-period/stacked dataset into a person-level/long dataset.

These two formats can be transposed from one to the other, but we often prefer one over the other for readibility/analytic purposes (detailed later).

Person-period vs. person-level data¶

Person-period datasets are usually found in longitudinal research, where a single person contributes multiple rows of data, reflecting the same measure at multiple time points. For example, a person's blood pressure measured every day over a year might be stored in a table where every row is a new blood pressure reading. Each person contributes 365 rows of data, identified by an ID number.

On the other hand, this blood pressure data can also be stored in a person-level dataset, where each row represents a single person, and multiple variables represent each time point (for example, "BP1", "BP2", "BP3" for days 1, 2 and 3 respectively).

As you might have guessed, in this example, a person-period dataset would be preferred. Firstly, it would take less space to store, as we only have to declare the type of data once for the column and not have to do it for each new row. In contrast, a person-level dataset would have to store information about what type of data is contained in each column, and this redundant information is replicated every time a new reading is taken.

Person-period datasets are also more convenient when missing data is to be expected. Substantial amounts of missing data in a person-level dataset would mean lots of empty cells, whereas missing rows do not have to be stored, unless there are other unique, period-specific data (e.g., if we were missing blood pressure data, but also had to store heart rate data for the day).

On the other hand, person-level datasets are often preferred when there are multiple variables that do not change with each time period. They also tend to be more human-readable. For example, if we were interested in someone's blood pressure at 3 time points, and also 100 other variables about their stable, demographic information a person-period dataset would result in a lot of redundancy, as these 100 variables are replicated for each new row.

Current data¶

The annonymized data I'm working with here come from an experiment I did a couple of years ago. In this study, participants completed a simple counting task where they had to count backwards, and I measured how long they took for each response. The specific details about the study are irrelevant to this post.

stackeddata = pd.read_csv('stacked dataset.csv')

This is what the person-period version of the data look like:

stackeddata.head()

Here, you can see ID, the unique identifier for each participant, TIME and TIMETRIAL, the time they took to type in each response both cumulatively and trial-by-trial (so these are all different for each row), as well as some time-invariant data, such as the total number of mistakes they made (MISTAKESTOTAL).

Transposing from person-period to person-level data involves summarizing data in some way. In this case, I'd like to analyze how many responses participants made in total, so that's what I'm going to collapse the data against.

longdata = pd.DataFrame(index=stackeddata[stackeddata.columns.values[0]].unique(), columns = ('TOTRESPONSES','BEFPASS',\
                                                                     'AFTPASS', 'IPOVERT', 'IP', 'T', 'DIFFICULTY', 'HIvsLEffect'))

This creates an empty dataframe with the unique participant IDs as indices, as well as the columns I want to carry over.

longdata.head()

Let's deal with the time-invariant columns first. They're easy (because each row is the same for each participant, so we can just take the first one):

for column in ('IPOVERT', 'IP', 'T', 'DIFFICULTY', 'HIvsLEffect'):
    for participant in longdata.index.values:
        longdata.loc[participant,column] =  stackeddata.loc[stackeddata[stackeddata.columns.values[0]]==participant][column].values[0]

longdata.head()

There are three columns that we have to populate. The first, TOTRESPONSES is simply the number of responses that each participant made.

for participant in longdata.index.values:
        longdata.loc[participant, 'TOTRESPONSES'] = len(stackeddata.loc[(stackeddata[stackeddata.columns.values[0]]==participant) & \
                                                           (stackeddata['MISTAKE'] == 0)])

longdata.head()

BEFPASS and AFTPASS have to do with whether or not a participant passed a displayed score on the screen (denoted by the column PASSED; 0 for no, 1 for yes).

for participant in longdata.index.values:
        longdata.loc[participant, 'BEFPASS'] = len(stackeddata.loc[(stackeddata[stackeddata.columns.values[0]]==participant) & \
                                                           (stackeddata['MISTAKE'] == 0) & (stackeddata['PASSED'] == 0)])
        longdata.loc[participant, 'AFTPASS'] = len(stackeddata.loc[(stackeddata[stackeddata.columns.values[0]]==participant) & \
                                                           (stackeddata['MISTAKE'] == 0) & (stackeddata['PASSED'] == 1)])

Finally, let's output these data to a csv:

longdata.to_csv('long data.csv')

%matplotlib inline

from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date


pd.options.display.max_columns = 40

Hack for Heat #11: NYC's top complaints pt. 2¶

In this post, I'm going to post a visualization of the top complaints reported in the previous post. I have this in a csv:

tempdata = pd.read_csv('topfiverev.csv', names = ['Year','Category','Count'])

tempdata.head()

#Fixing some category names
tempdata['Category'] = ['PAINT/PLASTER' if (x == 'PAINT/PLASTER') | (x== 'PAINT - PLASTER') else x for x in tempdata['Category']]
tempdata['Category'] = ['Street Condition' if x == 'Traffic Signal Condition' else x for x in tempdata['Category'] ]

What we want is to create a data frame with as many rows as there are categories, as many columns as there are years, and the counts as the data points.

upcaseindex = [x.upper() for x in tempdata['Category'].unique()]
upcaseindex.remove('NONCONST') #remove this category because it refers to missing values

plotdata = pd.DataFrame(index = upcaseindex, columns = tempdata['Year'].unique())

plotdata

Now let's populate it with values.

for row in tempdata['Category'].unique():
    for year in  tempdata['Year'].unique():
        try:
            plotdata.loc[row.upper(),year] = tempdata.loc[(tempdata['Year'] == year) & (tempdata['Category'] == row)]['Count'].values[0]
        except:
            plotdata.loc[row.upper(),year] = None

plotdata = plotdata.drop(['NONCONST'])
plotdata

x = range(0,len(plotdata.columns))
xlabels = plotdata.columns

plt.figure(figsize=(16,10));
plt.xlim(-1,len(xlabels));
plt.ylim(0,500000);
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')
plt.title('Complaint Category by Year', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

count = 0
for row in plotdata.index:
    y = plotdata.loc[row]
    plt.plot(x, y, color = plotcolors[count], lw=4, marker = '.', markersize = 20);
    count += 1

plt.legend(legendcolors, upcaseindex, bbox_to_anchor=(0.3, 1));
plt.xticks(x,xlabels);

So that's my graph. Do note that the reason these complaints dropped in 2016 is obviously because we're only midway through the year! I grabbed these data in early July, so that's when these numbers were last updated.

%matplotlib inline

from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date


pd.options.display.max_columns = 40

connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

Hack for Heat #10: NYC's top complaints¶

In this post, I'm going to look at what the top 311 complaints in NYC are, and how their numbers have changed over the years.

cursor.execute('''SELECT createddate, borough, complainttype FROM service''')
totalcomplaints = cursor.fetchall()

len(totalcomplaints)

13092182

Over the past 7 years, there were about 13 million complaints, which is about just 1.5 per person over the 7 years, or .22 complaints per person per year

totalcomplaints = pd.DataFrame(totalcomplaints)

totalcomplaints.columns = ['Date', 'Boro', 'Comptype']

totalcomplaints['Year'] = [x.year for x in totalcomplaints['Date']]

This is how it breaks down over the years:

totalcomplaints.groupby(by=['Year']).count()

complainttypesbyseason = totalcomplaints.groupby(by=['Year','Comptype']).count()

topfive = complainttypesbyseason['Date'].groupby(level=0,group_keys=False).apply(lambda x:x.sort_values(ascending=False).head(5))
topfive

Year  Comptype              
2010  HEATING                   214218
      GENERAL CONSTRUCTION      127131
      Street Light Condition    116466
      Noise - Residential       115615
      PLUMBING                  111504
2011  HEATING                   190184
      GENERAL CONSTRUCTION      133242
      PLUMBING                  113218
      Noise - Residential       111447
      Street Condition          104694
2012  HEATING                   182974
      Noise - Residential       127943
      GENERAL CONSTRUCTION      112436
      Street Light Condition     93981
      PLUMBING                   87561
2013  HEATING                   202896
      Noise - Residential       151516
      GENERAL CONSTRUCTION      112169
      Street Light Condition     92502
      PLUMBING                   86564
2014  Noise - Residential       192608
      HEAT/HOT WATER            132767
      Street Condition          114545
      HEATING                    97597
      Street Light Condition     94665
2015  HEAT/HOT WATER            225706
      Noise - Residential       208026
      Street Condition          124378
      Blocked Driveway          100881
      Street Light Condition     98106
2016  HEAT/HOT WATER            126906
      Noise - Residential       104324
      Illegal Parking            55932
      Blocked Driveway           53955
      Street Condition           51977
dtype: int64

These data gives us a rough idea of how the number of complaints have changed over time, but one problem is that the labels that were used changed each year (presumably because they were replaced by new labels that better reflected what the complaints were). Secondly, for our purposes, some categories can be combined. For example, 'Street Condition' and 'Street Light Condition' may be important for administrators, but for now, we can be content with just "stuff that's gone wrong on the street".

topfive.index.levels[1]

Index([u'Blocked Driveway', u'GENERAL CONSTRUCTION', u'HEAT/HOT WATER',
       u'HEATING', u'Illegal Parking', u'Noise - Residential', u'PLUMBING',
       u'Street Condition', u'Street Light Condition'],
      dtype='object', name=u'Comptype')

So, over the last 5 years, there have been 10 different categories that have made up the top five. Notice that some of these labels have changed over time. We'll have to standardize them.

years = totalcomplaints.Year.unique()

#first have a list of labels that are the same
oldlabels = totalcomplaints.loc[totalcomplaints['Year'] == years[0],['Comptype']]['Comptype'].unique()

newlabels = []

for x in years:
    yearslabels = totalcomplaints.loc[totalcomplaints['Year'] == x,['Comptype']]['Comptype'].unique()
    [newlabels.append(x) for x in yearslabels if x not in oldlabels and x not in newlabels]

From doing this, we get two lists: oldlabels and newlabels, that we will use to decide how to combine stuff.

#sorted(oldlabels)

#sorted(newlabels)

One modification I'm going to make is to combine 'Street Condition', 'Street Light Condition', 'Street Sign - Damaged', 'Street Sign - Dangling', and 'Street Sign - Missing' into 'Street Condition'.

For the current purposes, these are the same thing.

I'm also going to collapse 'GENERAL CONSTRUCTION', 'General Construction/Plumbing' and 'PLUMBING' into a new category called 'CONSTRUCTION/PLUMBING'.

I'll collapse noise related complaints ('Noise', 'Noise - Commercial', 'Noise - Helicopter', 'Noise - House of Worship', 'Noise - Park', 'Noise - Residential', 'Noise - Street/Sidewalk', 'Noise - Vehicle') into a label named 'NOISY'.

Finally, I'll combine 'HEAT/HOT WATER' and 'HEATING' into 'HEAT RELATED'.

streetconditioncomb = ['Street Condition', 'Street Light Condition', 'Street Sign - Damaged', 'Street Sign - Dangling', \
                       'Street Sign - Missing']

totalcomplaints['RevComptype'] = ['Street Condition' if x in streetconditioncomb else x for x in totalcomplaints['Comptype']]

genconstructcomb = ['GENERAL CONSTRUCTION', 'General Construction/Plumbing', 'PLUMBING']

totalcomplaints['RevComptype'] = ['CONSTRUCTION/PLUMBING' if x in genconstructcomb else x for x in totalcomplaints['RevComptype']]

noisecomb = ['Noise', 'Noise - Commercial', 'Noise - Helicopter', 'Noise - House of Worship', 'Noise - Park',\
             'Noise - Residential', 'Noise - Street/Sidewalk', 'Noise - Vehicle']

totalcomplaints['RevComptype'] = ['NOISY' if x in noisecomb else x for x in totalcomplaints['RevComptype']]

heatcomb = ['HEAT/HOT WATER', 'HEATING']

totalcomplaints['RevComptype'] = ['HEAT RELATED' if x in heatcomb else x for x in totalcomplaints['RevComptype']]

revisedcomplainttypes = totalcomplaints.groupby(by=['Year','RevComptype']).count()

topfiverev = revisedcomplainttypes['Date'].groupby(level=0, group_keys=False).apply(lambda x : x.sort_values(ascending=False).head(8))
topfiverev

Year  RevComptype             
2010  CONSTRUCTION/PLUMBING       287253
      Street Condition            229584
      HEAT RELATED                214218
      NOISY                       201834
      PAINT - PLASTER              93194
      Water System                 70346
      NONCONST                     69659
      Traffic Signal Condition     53773
2011  CONSTRUCTION/PLUMBING       272234
      Street Condition            216865
      NOISY                       196453
      HEAT RELATED                190184
      PAINT - PLASTER             100704
      NONCONST                     68407
      Water System                 61577
      Blocked Driveway             52592
2012  CONSTRUCTION/PLUMBING       225338
      NOISY                       220392
      HEAT RELATED                182974
      Street Condition            171910
      PAINT - PLASTER              77287
      NONCONST                     60055
      Water System                 57678
      Blocked Driveway             50818
2013  NOISY                       259356
      CONSTRUCTION/PLUMBING       223314
      HEAT RELATED                202896
      Street Condition            182618
      PAINT - PLASTER              77957
      Blocked Driveway             57408
      NONCONST                     53831
      Water System                 53186
2014  NOISY                       337892
      HEAT RELATED                230364
      Street Condition            221097
      CONSTRUCTION/PLUMBING       106251
      Blocked Driveway             79170
      PAINT/PLASTER                63503
      Illegal Parking              63243
      UNSANITARY CONDITION         61789
2015  NOISY                       387227
      Street Condition            233839
      HEAT RELATED                225706
      Blocked Driveway            100881
      Illegal Parking              92679
      CONSTRUCTION/PLUMBING        83585
      UNSANITARY CONDITION         82888
      Water System                 71086
2016  NOISY                       197186
      HEAT RELATED                126906
      Street Condition            105346
      Illegal Parking              55932
      Blocked Driveway             53955
      CONSTRUCTION/PLUMBING        39460
      UNSANITARY CONDITION         35357
      Water System                 31299
dtype: int64

This gives us some different numbers. Taken together, although heat related complaints are still constantly in the top 5, what we observe over time is also a gradual decrease in construction related complaints, a constant, variable but large amount of street-related complaints, and a steady increase in noise-related complaints. Next, I'll post a version of this looking at only the heating season.

%matplotlib inline

from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date


pd.options.display.max_columns = 40

connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

Hack for Heat #9: Total complaints over time, complaints per capita, etc.¶

In this post, I'm going to be doing a couple of miscellaneous things to prepare for our first blog post of the summer. Last year's posts can be found here.

cursor.execute('''SELECT createddate, borough, complainttype FROM service;''')
borodata = cursor.fetchall()

borodata = pd.DataFrame(borodata)

borodata.columns = ['Date', 'Boro', 'Comptype']

Again, I removed entires earlier than 2011 because there are data quality issues (substantially fewer cases):

borodata = borodata.loc[borodata['Date'] > date(2011,3,1)]

borodata = borodata.loc [borodata['Date'] <date(2016,6,1)]

Remove cases where the borough was unspecified:

borodata = borodata.loc[borodata['Boro'] != 'Unspecified']

Total complaints by borough, year-over-year¶

We already have a chart of heat complaints by month; we need one for the total number of complaints received.

borodata['Year'] = [x.year for x in borodata['Date']]
borodata['Month'] = [x.month for x in borodata['Date']]
borodata['Day'] = [x.day for x in borodata['Date']]

plotdata = borodata.groupby(by=['Boro', 'Year', 'Month']).count()
plotdata.head()

Now we generate a dictionary that we can use to plot these data:

plotdict = {x:[] for x in borodata['Boro'].unique()}

for boro in plotdict:
    plotdict[boro] = list(plotdata.xs(boro).Date)

Now we need to format our data for the plot. We need 5 "rows", each representing a borough, and n columns, each representing one month.

plotdata = np.zeros(len(plotdict['BROOKLYN']))
legendkey = []

for boro in plotdict.keys():
    plotdata = np.row_stack((plotdata, plotdict[boro]))
    legendkey.append(boro)
    
plotdata = np.delete(plotdata, (0), axis=0)

x = np.arange(len(plotdata[0]))

#crude xlabels
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years = ['2011', '2012', '2013', '2014', '2015', '2016']
xlabels = []
for year in years:
    for month in months:
        xlabels.append("{0} {1}".format(month,year))

        
xlabels = xlabels[2:-7] #start from march 2011, end may2016

plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,plotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.legend(legendcolors,legendkey, bbox_to_anchor=(0.2, 1));
plt.title('Heating Complaints by Borough', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

plotdatacsv = pd.DataFrame(plotdata)
plotdatacsv.index = legendkey
plotdatacsv.columns = xlabels
plotdatacsv.to_csv('totalcomplaintsbyboroughandmonth.csv')

Normalized total complaints¶

We'll also need a normalized version of the above, where everything is scaled to 100%, to see if complaints come disproportionately from one borough vs. another. There are probably better ways to do this, but I'm going to simply sum the number of complaints from each time point, and divide each column by the total for that column (i.e., number of complaints from that borough for that month divided by total number of complaints).

normplotdata = np.ndarray(shape=np.shape(plotdata))

It will thus be helpful to have an array of the total number of complaints:

totalcomplaintsbymonth = borodata.groupby(by=['Year','Month']).count().Date

totalcomplaintsbymonth = totalcomplaintsbymonth.values

totalcomplaintsbymonth = np.tile(totalcomplaintsbymonth,(5,1))

normplotdata = plotdata/totalcomplaintsbymonth

plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,normplotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.ylim(0,1)
plt.legend(legendcolors,legendkey, bbox_to_anchor=(0.2, 1));
plt.title('Heating Complaints by Borough', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

Heat complaints per capita:¶

We'd also like to look at how many complaints per capita are generated. Given that we only have full years for 2012-2015, let's consider those.

boropop = {
    'MANHATTAN': 1636268,
    'BRONX': 1438159,
    'BROOKLYN': 2621793,
    'QUEENS': 2321580,
    'STATEN ISLAND': 473279,
    }

totalnycpop = reduce(lambda x, y: x +y, boropop.values())
totalnycpop

8491079

complaintsbyyear = borodata.groupby(by=['Year']).count()
borocomplaintsbyyear = borodata.groupby(by=['Year','Boro']).count()

complaintsbyyear['Pop'] = [totalnycpop for x in complaintsbyyear.index]

borocomplaintsbyyear['Pop'] = [boropop.get(x[1]) for x in borocomplaintsbyyear.index]

complaintsbyyear['CompPerCap'] = complaintsbyyear['Day']/complaintsbyyear['Pop']

complaintsbyyear

borocomplaintsbyyear['CompPerCap'] = borocomplaintsbyyear['Day']/borocomplaintsbyyear['Pop']

borocomplaintsbyyear

From these data, we can see two things: First, on average, about 1 in 5 people make a 311 complaint each year. Second, overall, this number are pretty consistent across the 5 boroughs.

Overview¶

Since the time of the last post about these data, my friends and I at the NYU couples lab have collect some actual Mousetracker data from online participants. In this post, I wrote some code to pre-process and clean the dataset.

Current project¶

As a refresher, we hypothesized that at any given time, individuals are concerned with:

their self-interest
their partner's interests
the interest of the group or dyad, or the relationship, or them as a pair

and these motives affect the way individuals choose to distribute resources.

To distinguish between these three motives, we generated three sets of stimuli using poker chips that pit each of these motives against each other:

The first set of stimuli pit participants' self-interest against the interests of their partner.
The second set of stimuli pits a participant's concern for the interest of their partner vs. their own self interest and the group's interest.
The last set of stimuli pit participants' self-interest against that of their partner and the group.

The data¶

The data come in a person-period dataset. This is a "long" format where each participant has multiple rows that represent each trial of the experiment (there were 60 or so trials). However, each row also contains multiple columns each representing a bin of average locations the participant's mouse pointer was during that time span. There are ~100 such bins.

In other words, each participant made 60 choices, and their mouse positions were averaged into ~100 time points per trial.

The first thing we're going to do is actually some cleaning. Each of the .csv files have a redundant first row, so we're just going to delete that:

import os
import re

files = os.listdir('./data')
files

['GBPG SMGL VS. SLGM n=118.csv',
 'GBPG SMGM VS. SLGL n=118.csv',
 'GBPG SO n=118.csv',
 'GBPR SMGL VS. SLGM n=121.csv',
 'GBPR SMGM vs. SLGL n=121.csv',
 'GBPR SO n=121.csv',
 'RBPG SMGL vs. SLGM n=105.csv',
 'RBPG SMGM vs. SLGL n=105.csv',
 'RBPG SO n=105.csv',
 'RBPR SMGL vs. SLGM n=120.csv',
 'RBPR SMGM vs. SLGL n=120.csv',
 'RBPR SO n=120.csv']

#for all our files
for name in files:
    with open('./data/{0}'.format(name)) as f:
        contents = f.readlines()
    out =  open('./data cleaned/{0}.csv'.format(re.match(r'(.*) n=.*', name).group(1)), 'wb')
    
    found = 0
    for line in contents[1:]:
        if line == 'MEAN SUBJECT-BY-SUBJECT DATA\n':
            found = 1
        elif found == 0:
            out.write(line)
        elif found == 1:
            pass
    out.close()

If there wasn't a section at the end I had to delete, I could have just done this:

import pandas as pd
   
testdata = pd.read_csv('./data/%s' % files[0],skiprows = 1)
testdata.head()

Next, I'm going to write a loop that basically does what I did in the first post to all the separate datasets. Again, we're going to be finding the mean of participants' reaction time (RT), maximum deviation (MD), and the area under curve (AUC).

What we want in the end is a csv file that has the overall mean RT, MD, AUC, as well as those metrics when participants' were correct vs. incorrect.

I wrote two functions that basically do what I did by hand in the first post. The first combines two redundant columns, and the second finds the mean of that column, depending on whether the participant made an error or not, or whether we want the grand mean.

data = pd.read_csv('./data cleaned/%s' % os.listdir('./data cleaned')[0])
data.head()

#first, combine the redundant columns and type the relevant columns
def combine_columns(dddd):
    dddd['MD'] = dddd.loc[dddd['MD_1'].isnull() == False, ['MD_1']]
    dddd.loc[dddd['MD'].isnull() == True,['MD']] = dddd.loc[dddd['MD_2'].isnull() == False]['MD_2'] 
    dddd['AUC'] = dddd.loc[dddd['AUC_1'].isnull() == False, ['AUC_1']]
    dddd.loc[dddd['AUC'].isnull() == True, ['AUC']] = dddd.loc[dddd['AUC_2'].isnull() == False]['AUC_2']
    
combine_columns(data)

def find_mean(datasource, participantid, metric, error=None):
    participantsdata = datasource.loc[datasource['subject'] == participantid]
    if error == 1:
        return participantsdata.loc[participantsdata['error']==1][metric].astype('float').mean()
    elif error == 0:
        return participantsdata.loc[participantsdata['error']==0][metric].astype('float').mean()
    else:        
        return participantsdata[metric].astype('float').mean()

Next, we're going to test some code that calculates the mean of the afore-mentioned metrics for every participant in a dataset:

combine_columns(data)

participants = data['subject'].unique()

participantdict = {x:[] for x in participants}

for participant in participantdict:
    for metric in ['AUC', 'MD', 'RT']:
        try:
            participantdict[participant].append(find_mean(data, participant, metric, error = 1))
        except:
            participantdict[participant].append(None)

outdata = pd.DataFrame.from_dict(participantdict,orient='index')
outdata.columns = ['AUC', 'MD', 'RT']

outdata.head()

Let's write this as a function:

def sum_participants(pkeys,metrics,err,datttt):
    adictionary = {x:[] for x in pkeys}
    for participant in adictionary:
        for metric in metrics:
            try:
                adictionary[participant].append(find_mean(datttt, participant, metric ,error = err))
            except:
                adictionary[participant].append(None)
        adictionary[participant].append(err)
    return adictionary

Definitely not production code, but it should work.

Combining datasets¶

Alright, now we have all of that working, let's combine the datasets that we have, and add features to tell us where the data came from:

files = os.listdir('./data cleaned')
files

['GBPG SMGL VS. SLGM.csv',
 'GBPG SMGM VS. SLGL.csv',
 'GBPG SO.csv',
 'GBPR SMGL VS. SLGM.csv',
 'GBPR SMGM vs. SLGL.csv',
 'GBPR SO.csv',
 'RBPG SMGL vs. SLGM.csv',
 'RBPG SMGM vs. SLGL.csv',
 'RBPG SO.csv',
 'RBPR SMGL vs. SLGM.csv',
 'RBPR SMGM vs. SLGL.csv',
 'RBPR SO.csv']

What we're going to do is first create an empty DataFrame with all our columns. Next, we'll load all of our data, do the math for the 3 measures, add a feature that captures where the data came from (i.e., one column for the color codes, another color for the comparison type).

combineddata = pd.DataFrame(columns= ['AUC', 'MD', 'RT', 'ERROR', 'COLORCODE', 'COMPARISON'])

Now, let's put everything together, and loop through all the filenames:

metrics = ['AUC', 'MD', 'RT']

for filename in files:
    tempdata = pd.read_csv('./data cleaned/{0}'.format(filename))
    
    combine_columns(tempdata)
    
    participants = tempdata['subject'].unique()
    
    correctdict = sum_participants(participants,metrics,0,tempdata)
    errordict = sum_participants(participants,metrics,1,tempdata)
    
    correctdata = pd.DataFrame.from_dict(correctdict,orient='index')
    errordata = pd.DataFrame.from_dict(errordict,orient='index')
    
    outdata = pd.concat([correctdata,errordata])
    outdata.columns = ['AUC', 'MD', 'RT', 'ERROR']
    outdata['COLORCODE'] = re.match(r'(....)', filename).group(1)
    outdata['COMPARISON'] = re.match(r'.... (.*).csv',filename).group(1)
    
    print filename
    print len(outdata)
    
    combineddata = pd.concat([combineddata,outdata])

GBPG SMGL VS. SLGM.csv
236
GBPG SMGM VS. SLGL.csv
236
GBPG SO.csv
236
GBPR SMGL VS. SLGM.csv
242
GBPR SMGM vs. SLGL.csv
242
GBPR SO.csv
242
RBPG SMGL vs. SLGM.csv
210
RBPG SMGM vs. SLGL.csv
210
RBPG SO.csv
210
RBPR SMGL vs. SLGM.csv
240
RBPR SMGM vs. SLGL.csv
240
RBPR SO.csv
240

combineddata.head()

This is what the data should look like, so let's write it to a csv:

combineddata.sort_index().to_csv('combineddata.csv')

outdata

%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2
from datetime import date

pd.options.display.max_columns = 40
pd.options.mode.chained_assignment = None  # default='warn'

Hack for Heat #8: Complaint resolution time revisited¶

In this post, I'm going to explore a simple way of dealing with the censoring issue in complaint resolution time: Basically, because we only care about heat complaints during the heat season, we can constrain our cases of interest to the heating season specifically (1^st October to May 31^st).

#pull from our database again:
connection = psycopg2.connect('dbname = threeoneone user= threeoneoneadmin password = threeoneoneadmin')

cursor = connection.cursor()

cursor.execute('''SELECT createddate, closeddate, borough, complainttype FROM service;''')

data = cursor.fetchall()

data = pd.DataFrame(data)

data.columns = ['createddate', 'closeddate', 'borough', 'complainttype']

data = data.loc[(data['complainttype'] == 'HEATING') | (data['complainttype'] == 'HEAT/HOT WATER') ]

So far, we've subsetted the data to only include the heating complaints. Now let's create a datetime mask. What we want is to have subsets of data for each of the heating seasons:

heatmonths = range(1, 6) + [10,11,12]

heatmonths

[1, 2, 3, 4, 5, 10, 11, 12]

createdmask = data['createddate'].map(lambda x: (x.month in heatmonths))
closedmask = data['closeddate'].map(lambda x: (x.month in heatmonths) if x != None else False)

mask = createdmask & closedmask

heatseasondata = data.loc[mask]

How many heat/hot water complaints are created and closed inside vs. outside of heating season?

len(data.loc[mask]) #inside heating season

1302688

len(data.loc[~mask])#outside heating season

68750

The next thing we want to do is ignore cases where the complaint was resolved in the next heating season:

prevmonths = range(1, 6) 
nextmonths = [10,11,12]

heatseasondata['createdheatseason'] = [x.year if (x.month in prevmonths) else (x.year-1) for x in heatseasondata['createddate']]

heatseasondata.head()

heatseasondata['closedheatseason'] = [x.year if (x.month in prevmonths) else (x.year-1) for x in heatseasondata['closeddate']]

Now that we've done this, we can select only the cases where the closed date was in the same season as the created date:

heatseasondata = heatseasondata.loc[heatseasondata['createdheatseason']  == heatseasondata['closedheatseason']]

Okay, now we can calculate some average resolution times:

heatseasondata['resolutiontime'] = heatseasondata['closeddate'] - heatseasondata['createddate']

heatseasondata['resolutiontimeint'] = heatseasondata.resolutiontime.astype('timedelta64[D]')

resolutiontimedata = heatseasondata.groupby(by='createdheatseason').mean()['resolutiontimeint']

resolutiontimedata.to_csv('resolutiontimebyyear.csv')

resolutiontimedata

createdheatseason
2009    4.605845
2010    3.695498
2011    3.666409
2012    3.623974
2013    3.514244
2014    3.137578
2015    3.406145
2016    3.446075
Name: resolutiontimeint, dtype: float64

Resolution times by year:¶

x = resolutiontimedata.index.values
y = resolutiontimedata.values

plt.figure(figsize=(12,10));
plt.plot(x,y);

Resolution time by borough:¶

restimebyboro = heatseasondata.groupby(by=['borough', 'createdheatseason']).mean()['resolutiontimeint']

restimebyboro.to_csv('restimebyboro.csv')

restimebyboro = restimebyboro.loc[[x in range(2010,2017) for x in restimebyboro.index.get_level_values('createdheatseason')]]

boros = heatseasondata.borough.unique()

boroplots = {x:[] for x in boros}
for boro in boros:
    boroplots[boro] = restimebyboro.xs(boro).values
    
boroplots.pop('Unspecified')

array([ 3.64268956,  3.89000015])

x = range(2010,2017)

plt.figure(figsize=(12,10));
for boro in boroplots:
    plt.plot(x,boroplots[boro]);
plt.legend(boroplots.keys());
plt.xticks(x, [str(label) for label in x]);

I was told beforehand that the average resolution time from a similar analysis last year was about 3-5 days, so this looks about right.

%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date

pd.options.display.max_columns = 40

Hack for Heat #7: Heat complaints over time¶

In the last post, I plotted how the number of complaints differd by borough over time. This time around, I'm going to revisit this process, but this time focusing on heating complaints only (this is Heat Seek, after all).

Loading data:¶

connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

cursor.execute('''SELECT DISTINCT complainttype FROM service;''')
complainttypes = cursor.fetchall()

cursor.execute('''SELECT createddate, borough, complainttype FROM service;''')
borodata = cursor.fetchall()

borodata = pd.DataFrame(borodata)

borodata.head()

borodata.columns = ['Date', 'Boro', 'Comptype']

heatdata = borodata.loc[(borodata['Comptype'] == 'HEATING') | (borodata['Comptype'] == 'HEAT/HOT WATER')]

I removed entires earlier than 2011 because there are data quality issues (substantially fewer cases):

heatdata = heatdata.loc[heatdata['Date'] > date(2011,3,1)]

Heat complaints by borough:¶

len(heatdata)

1092048

There were about a million heating complaints over the 3+ years we have data for.

heatbydate = heatdata.groupby(by='Boro').count()

heatbydate

Per capita:¶

Again, let's look at how many heat complaints each boro generates per person:

boropop = {
    'MANHATTAN': 1636268,
    'BRONX': 1438159,
    'BROOKLYN': 2621793,
    'QUEENS': 2321580,
    'STATEN ISLAND': 473279,
    }

heatbydate['Pop'] = [boropop.get(x) for x in heatbydate.index]

heatbydate['CompPerCap'] = heatbydate['Date']/heatbydate['Pop']

heatbydate

Complaints by borough over months¶

First, let's recreate the graph from before:

heatdata['Year'] = [x.year for x in heatdata['Date']]
heatdata['Month'] = [x.month for x in heatdata['Date']]
heatdata['Day'] = [x.day for x in heatdata['Date']]

heatdata.head()

We remove data where it was unspecified what the boro was

heatdata = heatdata.loc[heatdata['Boro'] != 'Unspecified']

heatplotdata = heatdata.groupby(by=['Boro', 'Year','Month']).count()
heatplotdata

boros = heatbydate.index
borodict = {x:[] for x in boros}
borodict.pop('Unspecified')

for boro in borodict:
    borodict[boro] = list(heatplotdata.xs(boro).Date)

plotdata = np.zeros(len(borodict['BROOKLYN']))
for boro in sorted(borodict.keys()):
    plotdata = np.row_stack((plotdata, borodict[boro]))

plotdata = np.delete(plotdata, (0), axis=0)

from matplotlib import patches as mpatches

x = np.arange(len(plotdata[0]))

#crude xlabels
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years = ['2011', '2012', '2013', '2014', '2015', '2016']
xlabels = []
for year in years:
    for month in months:
        xlabels.append("{0} {1}".format(month,year))

xlabels = xlabels[2:-7] #start from march 2011, end may2016

A non-normalized plot:¶

The plot below is of raw complaint numbers, and is what we might expect: complaints about eat matter the most during the heating season!

plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,plotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.legend(legendcolors,sorted(borodict.keys()), bbox_to_anchor=(0.2, 1));
plt.title('Heating Complaints by Borough', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

Normalizing data:¶

Next, we're going to normalize these data. This allows us to better visualize if the proportion of heating complaints changed by borough, over time (e.g., did one borough generate more/less complaints vs. others over time?).

What we want to do is divide each of the 5 rows by the total number of complaints by that column (the month)

totalcounts = heatdata.groupby(by=['Year', 'Month']).count().Date.values

boros = heatbydate.index
normdict = {x:[] for x in boros}
normdict.pop('Unspecified')

for boro in normdict:
    for i in range(0,len(plotdata[1])): # for all the values in each row
        normp = float(borodict[boro][i])/float(totalcounts[i])
        normdict[boro].append(normp*100)

normplotdata = np.zeros(len(borodict['BROOKLYN']))
for boro in sorted(normdict.keys()):
    normplotdata = np.row_stack((normplotdata, normdict[boro]))

normplotdata = np.delete(normplotdata,(0),axis = 0)

plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,normplotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.ylim(0,100)
plt.legend(legendcolors,sorted(normdict.keys()), bbox_to_anchor=(0.2, 1));
plt.title('Heating Complaints by Borough (normalized)', size = 24)
plt.ylabel('% of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

Overall, it looks like heating complaints were pretty consistent by borough. In the next post, I'm going to take a look at one way of addressing the afore-mentioned censoring problems.

%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40

Hack for Heat #6: Complaint resolution time, over time and censoring¶

As a follow up to the last post, we're going to try and find out if the average resolution time has changed over time. This might be tricky, as we might run into the issue of censoring. Censoring) is a problem in time-series analyses that occurs when data that are missing because they lie outside of the range of the measure.

In this case, what we might expect to find is that complaint resolution time appears shorter as we get to the present day, and that may be true, but it may also happen because cases where problems have yet to be solved show up as missing data. In other words, for a case that was opened in 2010, the longest possible resolution time is 5 years, whereas for a case opened yesterday, the longest possible resolution time is 1 day.

So, as a first step, let's look at the proportion of unseen cases over time:

#Like before, we're going to select the relevant columns from the database:
connection = psycopg2.connect('dbname= threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

cursor.execute('''SELECT createddate, closeddate, borough FROM service;''')
data = cursor.fetchall()
data = pd.DataFrame(data)

data.columns = ['createddate','closeddate','borough']

Let's extract years and months again:

data['cryear'] = [x.year for x in data['createddate']]
data['crmonth'] = [x.month for x in data['createddate']]

And now, we're going to filter out bad cases again. However, this time, we're going to be a bit more selective. We're going to include cases where the closed date is missing.

#filter out bad casesa
import datetime

today = datetime.date(2016,5,29)
janone = datetime.date(2010,1,1)

data = data.loc[(data['closeddate'] > data['createddate']) | (data['closeddate'].isnull() == True)]

databyyear = data.groupby(by='cryear').count()
databyyear

data['timedelta'] = data['closeddate'] - data['createddate']
data['timedeltaint'] = [x.days if pd.isnull(x) == False else None for x in data['timedelta']]

data.groupby(by='cryear').mean()

This table shows exactly what I'm talking about - as we get closer to the current day, the average resolution time falls more and more. If censorig is occuring, we might expect that the proportion of cases closed is also decreasing over time. This is generally the case:

databyyear['propclosed'] = databyyear['closeddate']/databyyear['createddate']
databyyear

With the exception of year 2011, which we have reason to believe is wonky (see Hack for Heat #4), this is generally the case. So, how do we deal with this issue? To be honest, I don't know at the moment. But I'm going to read this paper tomorrow, and see if I can implement something.

	subject	trial	stimfile	condition	code	resp_1	resp_2	response	distractor	error	...	Y_95	Y_96	Y_97	Y_98	Y_99	Y_100	Y_101	Unnamed: 226	RESPONSE1	RESPONSE2
0	75870	2	NaN	1	smglvsslgm4	~S9_O7.jpg	~S8_O12.jpg	1	2	0	...	1.1826	1.1825	1.1824	1.1874	1.1875	1.1875	1.1875	NaN	[9, 7]	[8, 12]
1	75870	3	NaN	1	smglvsslgm6	~S10_O15.jpg	~S12_O8.jpg	2	1	0	...	1.1823	1.1837	1.1865	1.1876	1.1875	1.1875	1.1875	NaN	[10, 15]	[12, 8]
2	75870	7	NaN	1	smglvsslgm18	~S12_O7.jpg	~S8_O13.jpg	1	2	0	...	1.1950	1.1935	1.1924	1.1925	1.1925	1.1925	1.1925	NaN	[12, 7]	[8, 13]
3	75870	9	NaN	1	smglvsslgm19	~S9_O12.jpg	~S12_O8.jpg	2	1	0	...	1.2999	1.3000	1.3000	1.3000	1.3000	1.3000	1.3000	NaN	[9, 12]	[12, 8]
4	75870	14	NaN	2	smglvsslgm9	~S9_O7.jpg	~S10_O5.jpg	1	1	1	...	1.2575	1.2577	1.2565	1.2543	1.2519	1.2500	1.2500	NaN	[9, 7]	[10, 5]

	SIMILAR	EATDRINK	COMMENT
0	5	3	this little survey made me laugh
1	4	3	Sopu is awesome.
2	4	1	NaN
3	2	2	NaN
4	2	3	Soup can only be eaten if meat and other harde...

Dep. Variable:	EATDRINK	R-squared:	0.062
Model:	OLS	Adj. R-squared:	0.053
Method:	Least Squares	F-statistic:	6.832
Date:	Fri, 29 Jul 2016	Prob (F-statistic):	0.0103
Time:	17:58:57	Log-Likelihood:	-155.50
No. Observations:	106	AIC:	315.0
Df Residuals:	104	BIC:	320.3
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
const	2.4623	0.103	23.933	0.000	2.258 2.666
SIMILARC	-0.2701	0.103	-2.614	0.010	-0.475 -0.065

Omnibus:	3.808	Durbin-Watson:	2.014
Prob(Omnibus):	0.149	Jarque-Bera (JB):	3.819
Skew:	0.450	Prob(JB):	0.148
Kurtosis:	2.766	Cond. No.	1.00

	ID	TIME	TIMETRIAL	MISTAKE	NUMBER	MISTAKESTOTAL	SCORET	RESPONSENUM	...	IPOVERTC	SPEED100	PASSEDNUM	HIvsLEffect	DIFFBIN	RESPONSENUMC	DIFFICULTYC	PHIGHSCORE	PASSEDBY
0	25	16.262476	16.262476	1	2018	1	1981	0	...	-1.091039	6.149125	55	-1	1	-5	4.56	14.0	41.0
1	25	26.887734	10.625258	0	2017	1	1981	1	...	-1.091039	9.411536	54	-1	1	-4	4.56	14.0	41.0
2	25	33.911948	7.024214	0	2014	1	1981	2	...	-1.091039	14.236468	53	-1	1	-3	4.56	14.0	41.0
3	25	39.654125	5.742177	0	2011	1	1981	3	...	-1.091039	17.414998	52	-1	1	-2	4.56	14.0	41.0
4	25	48.295860	8.641734	0	2008	1	1981	4	...	-1.091039	11.571752	51	-1	1	-1	4.56	14.0	41.0

	TOTRESPONSES	BEFPASS	AFTPASS	IPOVERT	IP	T	DIFFICULTY	HIvsLEffect
25	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
26	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
28	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
29	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	TOTRESPONSES	BEFPASS	AFTPASS	IPOVERT	IP	T	DIFFICULTY	HIvsLEffect
25	NaN	NaN	NaN	-2	2.33333	4.33333	10	-1
26	NaN	NaN	NaN	-1.33333	2.16667	3.5	8	-1
28	NaN	NaN	NaN	0.666667	5	4.33333	5	-1
29	NaN	NaN	NaN	-2.33333	3.83333	6.16667	8	-1
31	NaN	NaN	NaN	-4.33333	1.66667	6	6	-1

	Year	Category	Count
0	2010	CONSTRUCTION/PLUMBING	287253
1	2010	Street Condition	229584
2	2010	HEAT RELATED	214218
3	2010	NOISY	201834
4	2010	PAINT - PLASTER	93194

	2010	2011	2012	2013	2014	2015	2016
CONSTRUCTION/PLUMBING	287253	272234	225338	223314	106251	83585	39460
STREET CONDITION	229584	216865	171910	182618	221097	233839	105346
HEAT RELATED	214218	190184	182974	202896	230364	225706	126906
NOISY	201834	196453	220392	259356	337892	387227	197186
PAINT/PLASTER	93194	100704	77287	77957	63503	None	None
WATER SYSTEM	70346	61577	57678	53186	None	71086	31299
BLOCKED DRIVEWAY	None	52592	50818	57408	79170	100881	53955
ILLEGAL PARKING	None	None	None	None	63243	92679	55932
UNSANITARY CONDITION	None	None	None	None	61789	82888	35357

	Date	Boro	Comptype
Year
2010	2005730	2005730	2005730
2011	1918888	1918888	1918888
2012	1783013	1783013	1783013
2013	1848938	1848938	1848938
2014	2102177	2102177	2102177
2015	2286697	2286697	2286697
2016	1146739	1146739	1146739

			Date	Comptype	Day
Boro	Year	Month
BRONX	2011	3	26256	26256	26256
		4	22898	22898	22898
		5	22722	22722	22722
		6	23677	23677	23677
		7	27348	27348	27348

	Date	Boro	Comptype	Month	Day	Pop	CompPerCap
Year
2011	1359368	1359368	1359368	1359368	1359368	8491079	0.160094
2012	1528597	1528597	1528597	1528597	1528597	8491079	0.180024
2013	1546188	1546188	1546188	1546188	1546188	8491079	0.182096
2014	1770357	1770357	1770357	1770357	1770357	8491079	0.208496
2015	1924917	1924917	1924917	1924917	1924917	8491079	0.226699
2016	891783	891783	891783	891783	891783	8491079	0.105026

	AUC	MD	RT
9104386	0.222260	0.147950	618.200000
7279624	-0.039400	-0.006537	1208.375000
3452429	0.378033	0.257911	1381.444444
7354384	0.034620	0.038660	1102.700000
479766	-0.036642	-0.048183	867.166667

	AUC	MD	RT	COLORCODE	COMPARISON
9104386	0.198260	0.117420	729.000000	GBPG	SMGL VS. SLGM
7279624	0.147017	0.087525	1235.500000	GBPG	SMGL VS. SLGM
3452429	0.530336	0.300000	1443.727273	GBPG	SMGL VS. SLGM
7354384	0.006440	0.003520	1173.000000	GBPG	SMGL VS. SLGM
479766	0.005037	0.016762	814.250000	GBPG	SMGL VS. SLGM

		Date	Comptype	Month	Day	Pop	CompPerCap
Year	Boro
2011	BRONX	262597	262597	262597	262597	1438159	0.182592
	BROOKLYN	434038	434038	434038	434038	2621793	0.165550
	MANHATTAN	273282	273282	273282	273282	1636268	0.167015
	QUEENS	315070	315070	315070	315070	2321580	0.135714
	STATEN ISLAND	74381	74381	74381	74381	473279	0.157161
2012	BRONX	296502	296502	296502	296502	1438159	0.206168
	BROOKLYN	482088	482088	482088	482088	2621793	0.183877
	MANHATTAN	314437	314437	314437	314437	1636268	0.192167
	QUEENS	358990	358990	358990	358990	2321580	0.154632
	STATEN ISLAND	76580	76580	76580	76580	473279	0.161807
2013	BRONX	302328	302328	302328	302328	1438159	0.210219
	BROOKLYN	490979	490979	490979	490979	2621793	0.187268
	MANHATTAN	332487	332487	332487	332487	1636268	0.203198
	QUEENS	343772	343772	343772	343772	2321580	0.148077
	STATEN ISLAND	76622	76622	76622	76622	473279	0.161896
2014	BRONX	340297	340297	340297	340297	1438159	0.236620
	BROOKLYN	557290	557290	557290	557290	2621793	0.212561
	MANHATTAN	373895	373895	373895	373895	1636268	0.228505
	QUEENS	408389	408389	408389	408389	2321580	0.175910
	STATEN ISLAND	90486	90486	90486	90486	473279	0.191190
2015	BRONX	355017	355017	355017	355017	1438159	0.246855
	BROOKLYN	602144	602144	602144	602144	2621793	0.229669
	MANHATTAN	417223	417223	417223	417223	1636268	0.254985
	QUEENS	455115	455115	455115	455115	2321580	0.196037
	STATEN ISLAND	95418	95418	95418	95418	473279	0.201610
2016	BRONX	169773	169773	169773	169773	1438159	0.118049
	BROOKLYN	271140	271140	271140	271140	2621793	0.103418
	MANHATTAN	209840	209840	209840	209840	1636268	0.128243
	QUEENS	200335	200335	200335	200335	2321580	0.086293
	STATEN ISLAND	40695	40695	40695	40695	473279	0.085985

	AUC	MD	RT	ERROR	COLORCODE	COMPARISON
448004	2.707033	1.111633	906.333333	0	RBPR	SO
2675720	0.352980	0.169195	953.950000	0	RBPR	SO
2947082	0.160825	0.137081	1037.437500	0	RBPR	SO
8594447	0.907000	0.577125	1015.250000	0	RBPR	SO
515090	0.394933	0.282000	833.888889	0	RBPR	SO
1283091	0.474890	0.310680	781.600000	0	RBPR	SO
1591835	2.229329	0.997536	920.357143	0	RBPR	SO
8448030	0.717493	0.313850	748.642857	0	RBPR	SO
8767007	1.628893	0.699000	990.214286	0	RBPR	SO
9444902	0.676900	0.361963	800.750000	0	RBPR	SO
8873015	0.907767	0.487600	660.000000	0	RBPR	SO
8469940	0.164550	0.122125	647.375000	0	RBPR	SO
8325642	1.047320	0.590833	852.800000	0	RBPR	SO
503358	0.058525	0.050817	682.333333	0	RBPR	SO
7386178	1.234072	0.637233	924.611111	0	RBPR	SO
953414	-0.118170	-0.146170	862.500000	0	RBPR	SO
5312076	0.395989	0.157433	548.111111	0	RBPR	SO
2515533	0.960689	0.571544	974.222222	0	RBPR	SO
6012496	1.453787	0.773067	973.066667	0	RBPR	SO
4717240	0.883219	0.595013	1131.687500	0	RBPR	SO
3007065	0.044767	0.053333	739.555556	0	RBPR	SO
3507292	-0.100017	-0.085700	537.000000	0	RBPR	SO
1363052	1.691200	0.790547	985.941176	0	RBPR	SO
1426541	0.000000	0.000000	404.636364	0	RBPR	SO
9603603	0.744925	0.492655	1258.700000	0	RBPR	SO
5480054	-0.025157	-0.026857	649.714286	0	RBPR	SO
1355895	0.970286	0.480421	1062.285714	0	RBPR	SO
7049336	0.971658	0.564100	852.736842	0	RBPR	SO
3354234	0.390192	0.230254	1166.076923	0	RBPR	SO
2585535	1.319237	0.650863	829.250000	0	RBPR	SO
...	...	...	...	...	...	...
84352	0.928446	0.443062	743.384615	1	RBPR	SO
3042691	NaN	NaN	NaN	1	RBPR	SO
2597764	2.076241	0.906424	939.000000	1	RBPR	SO
9324421	NaN	NaN	NaN	1	RBPR	SO
5227406	NaN	NaN	NaN	1	RBPR	SO
7229839	0.029771	0.035086	437.714286	1	RBPR	SO
4169107	NaN	NaN	NaN	1	RBPR	SO
6001558	-0.091046	-0.119262	596.538462	1	RBPR	SO
647592	1.448136	0.694329	1193.571429	1	RBPR	SO

	0	1	2
0	2011-04-15	MANHATTAN	Street Condition
1	2011-04-15	MANHATTAN	Street Condition
2	2011-04-15	MANHATTAN	Street Condition
3	2011-04-15	MANHATTAN	Street Condition
4	2011-04-14	QUEENS	Street Condition

	Date	Comptype
Boro
BRONX	354335	354335
BROOKLYN	334574	334574
MANHATTAN	250296	250296
QUEENS	138477	138477
STATEN ISLAND	10601	10601
Unspecified	3765	3765

	TOTRESPONSES	BEFPASS	AFTPASS	IPOVERT	IP	T	DIFFICULTY	HIvsLEffect
25	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
26	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
28	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
29	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	createddate	closeddate	borough	complainttype	createdheatseason
0	2014-11-18	2014-11-22	BROOKLYN	HEAT/HOT WATER	2013
1	2014-11-18	2014-11-25	BROOKLYN	HEAT/HOT WATER	2013
2	2014-11-18	2014-11-19	BRONX	HEAT/HOT WATER	2013
3	2014-11-18	2014-11-19	MANHATTAN	HEAT/HOT WATER	2013
4	2014-11-18	2014-11-20	BRONX	HEAT/HOT WATER	2013

	Date	Boro	Comptype	Year	Month	Day
58	2015-10-17	BROOKLYN	HEAT/HOT WATER	2015	10	17
728	2011-04-16	MANHATTAN	HEATING	2011	4	16
1026	2011-04-16	BROOKLYN	HEATING	2011	4	16
1418	2015-10-17	MANHATTAN	HEAT/HOT WATER	2015	10	17
1624	2011-04-15	MANHATTAN	HEATING	2011	4	15

	createddate	closeddate	borough	crmonth
cryear
2010	1340309	1281719	1340309	1340309
2011	1294795	1157180	1294795	1294795
2012	1185598	1148399	1185598	1185598
2013	1221690	1177863	1221690	1221690
2014	1354300	1305561	1354300	1354300
2015	1451521	1378090	1451521	1451521
2016	605070	530959	605070	605070

	crmonth	timedeltaint
cryear
2010	6.505385	24.508043
2011	6.491262	22.512834
2012	6.617468	22.690249
2013	6.414707	20.205784
2014	6.280245	19.712918
2015	6.314193	16.731161
2016	2.799473	8.634098

	TOTRESPONSES	BEFPASS	AFTPASS	IPOVERT	IP	T	DIFFICULTY	HIvsLEffect
25	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
26	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
28	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
29	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN