Hello!

I'm Bryan. This is my blog of the following projects (links above):

Hack for Heat

In my free time, I volunteer for Heat Seek, an NYC based non-profit that deploys temperature sensors to ensure that everyone in the city has adequate heating over the winter. My posts aim to show some of the groundwork that goes into writing some of our blog posts.

Mediocre Museum

The mediocre part of this project comes from my friend Emilia's and my goal of building the most average artifact possible. We're currently working with open data obtained from the Canada Science and Technology Museums Corporation, obtainable here.

Mousetracker Data

I'm working with my colleagues, Qi Xu and Carlina Conrad, at the NYU Couples Lab to learn more about how people in relationships allocate resources amongsts themselves, and balance self-, altruistic, and group interests.

Skytrax Ratings

What makes a good flight? Using Skytrax ratings data scraped by Quang Nguyen, I'm investigating what factors make up a passenger's overall rating, as well as what influences their decision to recommend an airline to others (vs. not).


Mousetracker Data #3

Mousetracker Data #3

In this post, I'm extracting some additional information about the stimuli so that we can run further analysis on participants' choices. For further background, please refer to the first post in this series.

In [4]:
import os
import re
import pandas as pd

data = pd.read_csv('./data cleaned/%s' % os.listdir('./data cleaned')[0])
data.head()

includelist = pd.read_csv('n=452 subjectID.csv', header = None)
includelist = includelist[0].values
In [5]:
data = data.loc[data['subject'].isin(includelist)]

Overview

The basic idea is that participants were shown one of two sets of poker chips that they would split between themselves and another person close to them. In every case, they could make a choice that was either selfish (gave more to themselves), or altruistic (gave more to their partner).

We wanted to know how much utility a choice would have to give to a participant before it made them selfish. In other words, how much more would a selfish choice have to give me for me to not be altruistic.

In [7]:
data['RESPONSE1'] = [x for x in data['resp_1'].str.extract(r'..(\d*)_.(\d*)', expand = True).values]
data['RESPONSE2'] = [x for x in data['resp_2'].str.extract(r'..(\d*)_.(\d*)', expand = True).values]

The columns 'resp_1' and 'resp_2' are image files of the choices shown. The naming convention is as follows: 'S' for how many chips the self gets, followed by that number, and 'O' for how many chips the other person gets.

We also have two other columns of interest: 'response', and 'error'. 'response' is which option the participant chose, and 'error' is whether or not that was the selfish choice. In this case, '0' is a selfish choice and '1' is an altruistic choice. This will become important shortly.

In [8]:
data.head()
Out[8]:
subject trial stimfile condition code resp_1 resp_2 response distractor error ... Y_95 Y_96 Y_97 Y_98 Y_99 Y_100 Y_101 Unnamed: 226 RESPONSE1 RESPONSE2
0 75870 2 NaN 1 smglvsslgm4 ~S9_O7.jpg ~S8_O12.jpg 1 2 0 ... 1.1826 1.1825 1.1824 1.1874 1.1875 1.1875 1.1875 NaN [9, 7] [8, 12]
1 75870 3 NaN 1 smglvsslgm6 ~S10_O15.jpg ~S12_O8.jpg 2 1 0 ... 1.1823 1.1837 1.1865 1.1876 1.1875 1.1875 1.1875 NaN [10, 15] [12, 8]
2 75870 7 NaN 1 smglvsslgm18 ~S12_O7.jpg ~S8_O13.jpg 1 2 0 ... 1.1950 1.1935 1.1924 1.1925 1.1925 1.1925 1.1925 NaN [12, 7] [8, 13]
3 75870 9 NaN 1 smglvsslgm19 ~S9_O12.jpg ~S12_O8.jpg 2 1 0 ... 1.2999 1.3000 1.3000 1.3000 1.3000 1.3000 1.3000 NaN [9, 12] [12, 8]
4 75870 14 NaN 2 smglvsslgm9 ~S9_O7.jpg ~S10_O5.jpg 1 1 1 ... 1.2575 1.2577 1.2565 1.2543 1.2519 1.2500 1.2500 NaN [9, 7] [10, 5]

5 rows × 229 columns

Algorithm

This got a little more complicated because the software we were using to capture mouse telemetry data randomized the position of the stimuli (i.e., whether the selfish choice was on the left or right of the screen was randomized), as it should.

This information is not a feature/variable on its own, but can be inferred from the 'response' and 'error' variables. If a participant chose the option on the left (response == 1), and that was coded as an 'error', it means the selfish choice was on the right (because 'errors' are altruistic choices).

There were two pieces of information that we wanted to extract:

  1. How many more chips did the selfish choice give vs. the altruistic choice?
  2. How many more chips did the selfish choice give the group vs. the altruistic choice?

For example, let's look at the first row data:

In [16]:
data.head(1)
Out[16]:
subject trial stimfile condition code resp_1 resp_2 response distractor error ... Y_95 Y_96 Y_97 Y_98 Y_99 Y_100 Y_101 Unnamed: 226 RESPONSE1 RESPONSE2
0 75870 2 NaN 1 smglvsslgm4 ~S9_O7.jpg ~S8_O12.jpg 1 2 0 ... 1.1826 1.1825 1.1824 1.1874 1.1875 1.1875 1.1875 NaN [9, 7] [8, 12]

1 rows × 229 columns

In this case, our participant chose the left option, which was the selfish choice. This choice gave him/her 1 more chip (9-8), and gave the group 4 fewer chips ((9+7)-(8+12)).

I didn't have time to think of an efficient way to do this for all the rows at once, so I decided to brute force it. First, I created a smaller dataframe:

In [10]:
tempdata = pd.DataFrame(columns = ('RESPONSE','ERROR','RESPONSE1','RESPONSE2'))
tempdata['RESPONSE'] = data['response']
tempdata['ERROR'] = data['error']
tempdata['RESPONSE1'] = data['RESPONSE1']
tempdata['RESPONSE2'] = data['RESPONSE2']
tempdata['SELFISHCHOICESELFMORE'] = 0
tempdata['SELFISHCHOICEGROUPMORE'] = 0

This algorithm basically iterates through each row in the data fram, checks to see if the selfish choice is on the left or right, and does the math I described above.

In [11]:
SELFISHCHOICESELFMORE = []
SELFISHCHOICEGROUPMORE = []

for row in tempdata.iterrows():
    if (row[1][0] == 1) & (row[1][1] == 0) | ((row[1][0] == 2) & (row[1][1] == 1)):
        try:
            SELFISHCHOICESELFMORE.append(int(row[1][2][0]) - int(row[1][3][0]))
            SELFISHCHOICEGROUPMORE.append((int(row[1][2][0]) + int(row[1][2][1])) - (int(row[1][3][0]) + int(row[1][3][1])))
        except:
            SELFISHCHOICESELFMORE.append(None)
            SELFISHCHOICEGROUPMORE.append(None)
    elif ((row[1][0] == 2) & (row[1][1] == 0)) | ((row[1][0] == 1) & (row[1][1] == 1)):
        try:
            SELFISHCHOICESELFMORE.append(int(row[1][3][0]) - int(row[1][2][0]))
            SELFISHCHOICEGROUPMORE.append((int(row[1][3][0]) + int(row[1][3][1])) - (int(row[1][2][0]) + int(row[1][2][1])))
        except:
            SELFISHCHOICESELFMORE.append(None)
            SELFISHCHOICEGROUPMORE.append(None)
In [12]:
tempdata = tempdata.drop(['RESPONSE1','RESPONSE2', 'RESPONSE', 'ERROR'], axis = 1)
tempdata['SELFISHCHOICESELFMORE'] = SELFISHCHOICESELFMORE
tempdata['SELFISHCHOICEGROUPMORE'] = SELFISHCHOICEGROUPMORE

Concatenating and writing to a csv:

In [14]:
outdata = pd.concat([data,tempdata], axis = 1)
outdata.to_csv('combineddata3.csv')

Misc #3: Soup

In [1]:
%matplotlib inline

Misc #3: Soup

My friend and I once argued about whether one "eats" or "drinks" soup. I believed that you "drink" soup, whereas she thought that you "eat" it.

From our conversation, I realized the verbs people associate with various foods have to do with the state of matter they mentally conjure up when thinking about those foods. For example, I posit that if you came up to me while I was holding a Slurpee (for the sake of balance, other semi-frozen beverages are available), I would be eating it if it just came out of the machine, but drinking it if I'd been standing out in the sweltering New York summer heat for a while.

In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import math

Thus, this simple dataset about soup was born. Using Amazon's Mechanical Turk, I asked 107 people: (a) Whether they thought soup was more similar to stew (closer to solid) or broth (closer to liquid), and (b) whether they were more likely to eat or drink soup.

Participants answered both questions on a 1 to 5 scale with 5 being "most likely to drink" and "closer to stew" (effectively reverse-coding their responses and erring conservatively with regard to my own hypothesis should participants cluster their responses at the ends of the scale). There was also a free-response question that was quite entertaining to read.

In [3]:
soupdata = pd.read_csv('SOUPDATA.csv')
In [4]:
soupdata.head()
Out[4]:
SIMILAR EATDRINK COMMENT
0 5 3 this little survey made me laugh
1 4 3 Sopu is awesome.
2 4 1 NaN
3 2 2 NaN
4 2 3 Soup can only be eaten if meat and other harde...

Scatterplot of responses

This first scatterplot is not a very good one, because both scales consisted of 5 discrete points, so the points on the graph overlap (we'll fix that in a second).

In [5]:
plt.figure(figsize=(12,10))
plt.scatter(soupdata['SIMILAR'],soupdata['EATDRINK']);
plt.xlabel("5 = most similar to stew", size = 24);
plt.ylabel("5 = more likely to drink", size = 24);

We can artifically jitter the points to make the variance of these points more visible. First, let's write a simple function that takes an array and jitters it:

In [6]:
def jitter(array):
    multiplier = 1 + 0.1* np.random.random(np.shape(array))
    return multiplier * array

This allows us to see how responses were clustered. Already, this pattern of data appears to reflect a negative correlation (which is what I expected), but we'll test that later.

In [7]:
plt.figure(figsize=(12,10))
plt.scatter(jitter(soupdata['SIMILAR']), jitter(soupdata['EATDRINK']));
plt.xlabel("5 = most similar to stew", size = 24);
plt.ylabel("5 = more likely to drink", size = 24);

Broth vs. stew

We can then look at the distribution of the two scales to give us further insight into how our sample responded to the two scales.

In [8]:
np.mean(soupdata['SIMILAR'])
Out[8]:
2.9056603773584904

This mean tells us that on average, the sample was split between whether or not soup was broth-like or stew-like.

However, the distribution was such that there weren't two discrete groups of "broth people" vs. "stew people". Rather, most people think it's somewhere in between, or lean slightly to one side. In some ways, this serves as a "sanity check" and also tells us that people know that different types of soup exist.

In [9]:
plt.figure(figsize=(8,6));
plt.hist(soupdata['SIMILAR'], bins = range(1,7));
plt.xlabel('5 = most simlar to stew', size = 24);
plt.ylabel('Frequency', size = 24);
plt.xticks(range(1,6));

Eating vs. drinking soup

On the other hand, the participants' average "eating/drinking" score was slightly below the mean, indictating that they'd be more likely to "eat" soup than "drink" it. This is not surprising considering I restricted the sample to respondents living in the U.S., where my friend and I were having our argument.

In [10]:
np.mean(soupdata['EATDRINK'])
Out[10]:
2.4622641509433962

The distribution of responses is also normal, with a slight positive skew.

In [11]:
plt.figure(figsize=(8,6));
plt.hist(soupdata['EATDRINK'], bins = range(1,7));
plt.xlabel('5 = most simlar to stew', size = 24);
plt.ylabel('Frequency', size = 24);
plt.xticks(range(1,6));

To tell us whether or not the 2.46 average that we obtained (vs. the scale midpoint of 3) might have been due to sampling error, we can also calculate the results of a one-way t-test, comparing 2.46 to the midpoint of the scale. This involves finding the standard error of the sample, and although there are numerous packages that can do this for you, it's easy enough to do by hand:

In [12]:
#Standard deviation
soupsd = np.std(soupdata['EATDRINK'])
print 'Standard deviation = {0}'.format(soupsd)

#Standard error = standard deviation/square root(sample size)
soupse = soupsd/math.sqrt(len(soupdata['EATDRINK']))
print 'Standard error = {0}'.format(soupse)

#t-value = (mean1 - mean2)/standard error
tvalue = (3 - np.mean(soupdata['EATDRINK']))/soupse
print 't({0}) = {1}'.format(len(soupdata['EATDRINK']), tvalue)
Standard deviation = 1.08309939875
Standard error = 0.105199913354
t(106) = 5.11156171059

This gives us a t-value of 5.11 with 106 degrees of freedom, which, if you look up in a t-distribution table, returns a p of < .01 (for reference, a t-value of ~2 with df of ~30 gives you p <.05). We can also calculate confidence intervals:

In [13]:
upperbound = 2.462 + (soupse *1.96)
lowerbound = 2.462 - (soupse *1.96)
print '95% confidence intervals [{0}]'.format((lowerbound,upperbound))
95% confidence intervals [(2.255808169826536, 2.6681918301734644)]

This tells us that if we randomly sampled our population over and over again, 95% of the sample means would fall between 2.26 and 2.67, still below the scale midpoint of 3. Taken together, it's quite likely (and also quite theoretically reasonable), that these participants' tendency to favor "eating" soup was not due to sampling error.

Predicting eating vs. drinking from broth-y vs. stew-y soups

So, what about my hypothesis? To test whether people's mental representation of soup predicts the verb that they use to describe consuming it, we can run a simply OLS regression. I'm not going to manually write out the algorithm for it (although I have before, see here), and instead, borrow the much prettier statsmodel one.

(CAVEAT: These are correlational data; the usual caveats with regard to establishing causality apply)

In [14]:
import statsmodels.api as sm
In [15]:
subdata = sm.add_constant(soupdata)
subdata['SIMILARC'] = subdata['SIMILAR'] - np.mean(subdata['SIMILAR'])
model = sm.OLS(subdata['EATDRINK'],subdata.loc[:,['const','SIMILARC']],hasconstant=1)
results = model.fit()
In [16]:
results.summary()
Out[16]:
OLS Regression Results
Dep. Variable: EATDRINK R-squared: 0.062
Model: OLS Adj. R-squared: 0.053
Method: Least Squares F-statistic: 6.832
Date: Fri, 29 Jul 2016 Prob (F-statistic): 0.0103
Time: 17:58:57 Log-Likelihood: -155.50
No. Observations: 106 AIC: 315.0
Df Residuals: 104 BIC: 320.3
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
const 2.4623 0.103 23.933 0.000 2.258 2.666
SIMILARC -0.2701 0.103 -2.614 0.010 -0.475 -0.065
Omnibus: 3.808 Durbin-Watson: 2.014
Prob(Omnibus): 0.149 Jarque-Bera (JB): 3.819
Skew: 0.450 Prob(JB): 0.148
Kurtosis: 2.766 Cond. No. 1.00

In this analysis, the first coefficient (const), 2.46, tells us that a hypothetical person who though soup could be just as much broth-like as it could be stew-like is still more likely to eat it than drink it (2.46 being significantly lower than the midpoint of 3).

The second coefficient, SIMILARC, is negative, statistically significant and tells us that believing soup is stew-like is associated with saying that one would eat (vs. drink) soup. Specifically, for every 1 point a hypothetical respondent says that soup is similar to stew (vs. broth), he/she is predicted to respond -.27 points lower on the eat vs. drink scale.

Coming back to our original scatter plot, we can draw a fit-line over it:

In [17]:
#Defining a function for our fit line:
def fitline(Xs):
    return [y*(-.2701)+3.47 for y in Xs] #The 3.47 (vs. 2.46) comes from the uncentered version of this analysis
In [18]:
plt.figure(figsize=(12,10))
plt.scatter(jitter(soupdata['SIMILAR']), jitter(soupdata['EATDRINK']));
plt.xlabel("5 = most similar to stew", size = 24);
plt.ylabel("5 = more likely to drink", size = 24);
plt.plot(range(1,6), fitline(range(1,6)));

Conclusion

In summary, I learned that:

  1. Most people obviously know that there are different types of soup, but there is variabilty in what someone's "default" category is.
  2. Not everyone "eats" soup, but more people in the U.S. "eat" it than "drink it".
  3. Finally, people who associate soup with stew are more likely to eat it than drink it.

I'm not sure who won the argument in the end, but this analysis does feel a little like winning the battle but losing the war.

Misc #2: Restructuring data

In [1]:
import pandas as pd

Misc #2: Restructuring data

This post details some of the "dirty work" (specifically, restructuring data), that goes on behind data analysis.

In this case, I'm trying to restructure a person-period/stacked dataset into a person-level/long dataset.

These two formats can be transposed from one to the other, but we often prefer one over the other for readibility/analytic purposes (detailed later).

Person-period vs. person-level data

Person-period datasets are usually found in longitudinal research, where a single person contributes multiple rows of data, reflecting the same measure at multiple time points. For example, a person's blood pressure measured every day over a year might be stored in a table where every row is a new blood pressure reading. Each person contributes 365 rows of data, identified by an ID number.

On the other hand, this blood pressure data can also be stored in a person-level dataset, where each row represents a single person, and multiple variables represent each time point (for example, "BP1", "BP2", "BP3" for days 1, 2 and 3 respectively).

As you might have guessed, in this example, a person-period dataset would be preferred. Firstly, it would take less space to store, as we only have to declare the type of data once for the column and not have to do it for each new row. In contrast, a person-level dataset would have to store information about what type of data is contained in each column, and this redundant information is replicated every time a new reading is taken.

Person-period datasets are also more convenient when missing data is to be expected. Substantial amounts of missing data in a person-level dataset would mean lots of empty cells, whereas missing rows do not have to be stored, unless there are other unique, period-specific data (e.g., if we were missing blood pressure data, but also had to store heart rate data for the day).

On the other hand, person-level datasets are often preferred when there are multiple variables that do not change with each time period. They also tend to be more human-readable. For example, if we were interested in someone's blood pressure at 3 time points, and also 100 other variables about their stable, demographic information a person-period dataset would result in a lot of redundancy, as these 100 variables are replicated for each new row.

Current data

The annonymized data I'm working with here come from an experiment I did a couple of years ago. In this study, participants completed a simple counting task where they had to count backwards, and I measured how long they took for each response. The specific details about the study are irrelevant to this post.

In [2]:
stackeddata = pd.read_csv('stacked dataset.csv')

This is what the person-period version of the data look like:

In [3]:
stackeddata.head()
Out[3]:
ID TIME TIMETRIAL MISTAKE NUMBER MISTAKESTOTAL SCORET RESPONSENUM SPLINE PASSED ... IPOVERTC SPEED100 PASSEDNUM HIvsLEffect DIFFBIN RESPONSENUMC DIFFICULTYC PHIGHSCORE PASSEDBY PASSEDBIN
0 25 16.262476 16.262476 1 2018 1 1981 0 0 0 ... -1.091039 6.149125 55 -1 1 -5 4.56 14.0 41.0 0
1 25 26.887734 10.625258 0 2017 1 1981 1 0 0 ... -1.091039 9.411536 54 -1 1 -4 4.56 14.0 41.0 0
2 25 33.911948 7.024214 0 2014 1 1981 2 0 0 ... -1.091039 14.236468 53 -1 1 -3 4.56 14.0 41.0 0
3 25 39.654125 5.742177 0 2011 1 1981 3 0 0 ... -1.091039 17.414998 52 -1 1 -2 4.56 14.0 41.0 0
4 25 48.295860 8.641734 0 2008 1 1981 4 0 0 ... -1.091039 11.571752 51 -1 1 -1 4.56 14.0 41.0 0

5 rows × 31 columns

Here, you can see ID, the unique identifier for each participant, TIME and TIMETRIAL, the time they took to type in each response both cumulatively and trial-by-trial (so these are all different for each row), as well as some time-invariant data, such as the total number of mistakes they made (MISTAKESTOTAL).

Transposing from person-period to person-level data involves summarizing data in some way. In this case, I'd like to analyze how many responses participants made in total, so that's what I'm going to collapse the data against.

In [4]:
longdata = pd.DataFrame(index=stackeddata[stackeddata.columns.values[0]].unique(), columns = ('TOTRESPONSES','BEFPASS',\
                                                                     'AFTPASS', 'IPOVERT', 'IP', 'T', 'DIFFICULTY', 'HIvsLEffect'))

This creates an empty dataframe with the unique participant IDs as indices, as well as the columns I want to carry over.

In [5]:
longdata.head()
Out[5]:
TOTRESPONSES BEFPASS AFTPASS IPOVERT IP T DIFFICULTY HIvsLEffect
25 NaN NaN NaN NaN NaN NaN NaN NaN
26 NaN NaN NaN NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN NaN NaN NaN
29 NaN NaN NaN NaN NaN NaN NaN NaN
31 NaN NaN NaN NaN NaN NaN NaN NaN

Let's deal with the time-invariant columns first. They're easy (because each row is the same for each participant, so we can just take the first one):

In [6]:
for column in ('IPOVERT', 'IP', 'T', 'DIFFICULTY', 'HIvsLEffect'):
    for participant in longdata.index.values:
        longdata.loc[participant,column] =  stackeddata.loc[stackeddata[stackeddata.columns.values[0]]==participant][column].values[0]
In [7]:
longdata.head()
Out[7]:
TOTRESPONSES BEFPASS AFTPASS IPOVERT IP T DIFFICULTY HIvsLEffect
25 NaN NaN NaN -2 2.33333 4.33333 10 -1
26 NaN NaN NaN -1.33333 2.16667 3.5 8 -1
28 NaN NaN NaN 0.666667 5 4.33333 5 -1
29 NaN NaN NaN -2.33333 3.83333 6.16667 8 -1
31 NaN NaN NaN -4.33333 1.66667 6 6 -1

There are three columns that we have to populate. The first, TOTRESPONSES is simply the number of responses that each participant made.

In [8]:
for participant in longdata.index.values:
        longdata.loc[participant, 'TOTRESPONSES'] = len(stackeddata.loc[(stackeddata[stackeddata.columns.values[0]]==participant) & \
                                                           (stackeddata['MISTAKE'] == 0)])
In [9]:
longdata.head()
Out[9]:
TOTRESPONSES BEFPASS AFTPASS IPOVERT IP T DIFFICULTY HIvsLEffect
25 13 NaN NaN -2 2.33333 4.33333 10 -1
26 34 NaN NaN -1.33333 2.16667 3.5 8 -1
28 20 NaN NaN 0.666667 5 4.33333 5 -1
29 10 NaN NaN -2.33333 3.83333 6.16667 8 -1
31 26 NaN NaN -4.33333 1.66667 6 6 -1

BEFPASS and AFTPASS have to do with whether or not a participant passed a displayed score on the screen (denoted by the column PASSED; 0 for no, 1 for yes).

In [10]:
for participant in longdata.index.values:
        longdata.loc[participant, 'BEFPASS'] = len(stackeddata.loc[(stackeddata[stackeddata.columns.values[0]]==participant) & \
                                                           (stackeddata['MISTAKE'] == 0) & (stackeddata['PASSED'] == 0)])
        longdata.loc[participant, 'AFTPASS'] = len(stackeddata.loc[(stackeddata[stackeddata.columns.values[0]]==participant) & \
                                                           (stackeddata['MISTAKE'] == 0) & (stackeddata['PASSED'] == 1)])

Finally, let's output these data to a csv:

In [11]:
longdata.to_csv('long data.csv')

Hack for Heat #11: NYC's top complaints pt.2

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date


pd.options.display.max_columns = 40

Hack for Heat #11: NYC's top complaints pt. 2

In this post, I'm going to post a visualization of the top complaints reported in the previous post. I have this in a csv:

In [2]:
tempdata = pd.read_csv('topfiverev.csv', names = ['Year','Category','Count'])
In [3]:
tempdata.head()
Out[3]:
Year Category Count
0 2010 CONSTRUCTION/PLUMBING 287253
1 2010 Street Condition 229584
2 2010 HEAT RELATED 214218
3 2010 NOISY 201834
4 2010 PAINT - PLASTER 93194
In [4]:
#Fixing some category names
tempdata['Category'] = ['PAINT/PLASTER' if (x == 'PAINT/PLASTER') | (x== 'PAINT - PLASTER') else x for x in tempdata['Category']]
tempdata['Category'] = ['Street Condition' if x == 'Traffic Signal Condition' else x for x in tempdata['Category'] ]

What we want is to create a data frame with as many rows as there are categories, as many columns as there are years, and the counts as the data points.

In [5]:
upcaseindex = [x.upper() for x in tempdata['Category'].unique()]
upcaseindex.remove('NONCONST') #remove this category because it refers to missing values
In [6]:
plotdata = pd.DataFrame(index = upcaseindex, columns = tempdata['Year'].unique())
In [7]:
plotdata
Out[7]:
2010 2011 2012 2013 2014 2015 2016
CONSTRUCTION/PLUMBING NaN NaN NaN NaN NaN NaN NaN
STREET CONDITION NaN NaN NaN NaN NaN NaN NaN
HEAT RELATED NaN NaN NaN NaN NaN NaN NaN
NOISY NaN NaN NaN NaN NaN NaN NaN
PAINT/PLASTER NaN NaN NaN NaN NaN NaN NaN
WATER SYSTEM NaN NaN NaN NaN NaN NaN NaN
BLOCKED DRIVEWAY NaN NaN NaN NaN NaN NaN NaN
ILLEGAL PARKING NaN NaN NaN NaN NaN NaN NaN
UNSANITARY CONDITION NaN NaN NaN NaN NaN NaN NaN

Now let's populate it with values.

In [8]:
for row in tempdata['Category'].unique():
    for year in  tempdata['Year'].unique():
        try:
            plotdata.loc[row.upper(),year] = tempdata.loc[(tempdata['Year'] == year) & (tempdata['Category'] == row)]['Count'].values[0]
        except:
            plotdata.loc[row.upper(),year] = None
In [9]:
plotdata = plotdata.drop(['NONCONST'])
plotdata
Out[9]:
2010 2011 2012 2013 2014 2015 2016
CONSTRUCTION/PLUMBING 287253 272234 225338 223314 106251 83585 39460
STREET CONDITION 229584 216865 171910 182618 221097 233839 105346
HEAT RELATED 214218 190184 182974 202896 230364 225706 126906
NOISY 201834 196453 220392 259356 337892 387227 197186
PAINT/PLASTER 93194 100704 77287 77957 63503 None None
WATER SYSTEM 70346 61577 57678 53186 None 71086 31299
BLOCKED DRIVEWAY None 52592 50818 57408 79170 100881 53955
ILLEGAL PARKING None None None None 63243 92679 55932
UNSANITARY CONDITION None None None None 61789 82888 35357
In [10]:
x = range(0,len(plotdata.columns))
xlabels = plotdata.columns

plt.figure(figsize=(16,10));
plt.xlim(-1,len(xlabels));
plt.ylim(0,500000);
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')
plt.title('Complaint Category by Year', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

count = 0
for row in plotdata.index:
    y = plotdata.loc[row]
    plt.plot(x, y, color = plotcolors[count], lw=4, marker = '.', markersize = 20);
    count += 1

plt.legend(legendcolors, upcaseindex, bbox_to_anchor=(0.3, 1));
plt.xticks(x,xlabels);

So that's my graph. Do note that the reason these complaints dropped in 2016 is obviously because we're only midway through the year! I grabbed these data in early July, so that's when these numbers were last updated.

Hack for Heat #10: NYC's top complaints

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date


pd.options.display.max_columns = 40

connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

Hack for Heat #10: NYC's top complaints

In this post, I'm going to look at what the top 311 complaints in NYC are, and how their numbers have changed over the years.

In [2]:
cursor.execute('''SELECT createddate, borough, complainttype FROM service''')
totalcomplaints = cursor.fetchall()
In [3]:
len(totalcomplaints)
Out[3]:
13092182

Over the past 7 years, there were about 13 million complaints, which is about just 1.5 per person over the 7 years, or .22 complaints per person per year

In [4]:
totalcomplaints = pd.DataFrame(totalcomplaints)
In [5]:
totalcomplaints.columns = ['Date', 'Boro', 'Comptype']
In [6]:
totalcomplaints['Year'] = [x.year for x in totalcomplaints['Date']]

This is how it breaks down over the years:

In [7]:
totalcomplaints.groupby(by=['Year']).count()
Out[7]:
Date Boro Comptype
Year
2010 2005730 2005730 2005730
2011 1918888 1918888 1918888
2012 1783013 1783013 1783013
2013 1848938 1848938 1848938
2014 2102177 2102177 2102177
2015 2286697 2286697 2286697
2016 1146739 1146739 1146739
In [8]:
complainttypesbyseason = totalcomplaints.groupby(by=['Year','Comptype']).count()
In [9]:
topfive = complainttypesbyseason['Date'].groupby(level=0,group_keys=False).apply(lambda x:x.sort_values(ascending=False).head(5))
topfive
Out[9]:
Year  Comptype              
2010  HEATING                   214218
      GENERAL CONSTRUCTION      127131
      Street Light Condition    116466
      Noise - Residential       115615
      PLUMBING                  111504
2011  HEATING                   190184
      GENERAL CONSTRUCTION      133242
      PLUMBING                  113218
      Noise - Residential       111447
      Street Condition          104694
2012  HEATING                   182974
      Noise - Residential       127943
      GENERAL CONSTRUCTION      112436
      Street Light Condition     93981
      PLUMBING                   87561
2013  HEATING                   202896
      Noise - Residential       151516
      GENERAL CONSTRUCTION      112169
      Street Light Condition     92502
      PLUMBING                   86564
2014  Noise - Residential       192608
      HEAT/HOT WATER            132767
      Street Condition          114545
      HEATING                    97597
      Street Light Condition     94665
2015  HEAT/HOT WATER            225706
      Noise - Residential       208026
      Street Condition          124378
      Blocked Driveway          100881
      Street Light Condition     98106
2016  HEAT/HOT WATER            126906
      Noise - Residential       104324
      Illegal Parking            55932
      Blocked Driveway           53955
      Street Condition           51977
dtype: int64

These data gives us a rough idea of how the number of complaints have changed over time, but one problem is that the labels that were used changed each year (presumably because they were replaced by new labels that better reflected what the complaints were). Secondly, for our purposes, some categories can be combined. For example, 'Street Condition' and 'Street Light Condition' may be important for administrators, but for now, we can be content with just "stuff that's gone wrong on the street".

In [10]:
topfive.index.levels[1]
Out[10]:
Index([u'Blocked Driveway', u'GENERAL CONSTRUCTION', u'HEAT/HOT WATER',
       u'HEATING', u'Illegal Parking', u'Noise - Residential', u'PLUMBING',
       u'Street Condition', u'Street Light Condition'],
      dtype='object', name=u'Comptype')

So, over the last 5 years, there have been 10 different categories that have made up the top five. Notice that some of these labels have changed over time. We'll have to standardize them.

In [11]:
years = totalcomplaints.Year.unique()

#first have a list of labels that are the same
oldlabels = totalcomplaints.loc[totalcomplaints['Year'] == years[0],['Comptype']]['Comptype'].unique()

newlabels = []

for x in years:
    yearslabels = totalcomplaints.loc[totalcomplaints['Year'] == x,['Comptype']]['Comptype'].unique()
    [newlabels.append(x) for x in yearslabels if x not in oldlabels and x not in newlabels]

From doing this, we get two lists: oldlabels and newlabels, that we will use to decide how to combine stuff.

In [12]:
#sorted(oldlabels)
In [13]:
#sorted(newlabels)

One modification I'm going to make is to combine 'Street Condition', 'Street Light Condition', 'Street Sign - Damaged', 'Street Sign - Dangling', and 'Street Sign - Missing' into 'Street Condition'.

For the current purposes, these are the same thing.

I'm also going to collapse 'GENERAL CONSTRUCTION', 'General Construction/Plumbing' and 'PLUMBING' into a new category called 'CONSTRUCTION/PLUMBING'.

I'll collapse noise related complaints ('Noise', 'Noise - Commercial', 'Noise - Helicopter', 'Noise - House of Worship', 'Noise - Park', 'Noise - Residential', 'Noise - Street/Sidewalk', 'Noise - Vehicle') into a label named 'NOISY'.

Finally, I'll combine 'HEAT/HOT WATER' and 'HEATING' into 'HEAT RELATED'.

In [14]:
streetconditioncomb = ['Street Condition', 'Street Light Condition', 'Street Sign - Damaged', 'Street Sign - Dangling', \
                       'Street Sign - Missing']

totalcomplaints['RevComptype'] = ['Street Condition' if x in streetconditioncomb else x for x in totalcomplaints['Comptype']]

genconstructcomb = ['GENERAL CONSTRUCTION', 'General Construction/Plumbing', 'PLUMBING']

totalcomplaints['RevComptype'] = ['CONSTRUCTION/PLUMBING' if x in genconstructcomb else x for x in totalcomplaints['RevComptype']]

noisecomb = ['Noise', 'Noise - Commercial', 'Noise - Helicopter', 'Noise - House of Worship', 'Noise - Park',\
             'Noise - Residential', 'Noise - Street/Sidewalk', 'Noise - Vehicle']

totalcomplaints['RevComptype'] = ['NOISY' if x in noisecomb else x for x in totalcomplaints['RevComptype']]

heatcomb = ['HEAT/HOT WATER', 'HEATING']

totalcomplaints['RevComptype'] = ['HEAT RELATED' if x in heatcomb else x for x in totalcomplaints['RevComptype']]
In [15]:
revisedcomplainttypes = totalcomplaints.groupby(by=['Year','RevComptype']).count()
In [16]:
topfiverev = revisedcomplainttypes['Date'].groupby(level=0, group_keys=False).apply(lambda x : x.sort_values(ascending=False).head(8))
topfiverev
Out[16]:
Year  RevComptype             
2010  CONSTRUCTION/PLUMBING       287253
      Street Condition            229584
      HEAT RELATED                214218
      NOISY                       201834
      PAINT - PLASTER              93194
      Water System                 70346
      NONCONST                     69659
      Traffic Signal Condition     53773
2011  CONSTRUCTION/PLUMBING       272234
      Street Condition            216865
      NOISY                       196453
      HEAT RELATED                190184
      PAINT - PLASTER             100704
      NONCONST                     68407
      Water System                 61577
      Blocked Driveway             52592
2012  CONSTRUCTION/PLUMBING       225338
      NOISY                       220392
      HEAT RELATED                182974
      Street Condition            171910
      PAINT - PLASTER              77287
      NONCONST                     60055
      Water System                 57678
      Blocked Driveway             50818
2013  NOISY                       259356
      CONSTRUCTION/PLUMBING       223314
      HEAT RELATED                202896
      Street Condition            182618
      PAINT - PLASTER              77957
      Blocked Driveway             57408
      NONCONST                     53831
      Water System                 53186
2014  NOISY                       337892
      HEAT RELATED                230364
      Street Condition            221097
      CONSTRUCTION/PLUMBING       106251
      Blocked Driveway             79170
      PAINT/PLASTER                63503
      Illegal Parking              63243
      UNSANITARY CONDITION         61789
2015  NOISY                       387227
      Street Condition            233839
      HEAT RELATED                225706
      Blocked Driveway            100881
      Illegal Parking              92679
      CONSTRUCTION/PLUMBING        83585
      UNSANITARY CONDITION         82888
      Water System                 71086
2016  NOISY                       197186
      HEAT RELATED                126906
      Street Condition            105346
      Illegal Parking              55932
      Blocked Driveway             53955
      CONSTRUCTION/PLUMBING        39460
      UNSANITARY CONDITION         35357
      Water System                 31299
dtype: int64

This gives us some different numbers. Taken together, although heat related complaints are still constantly in the top 5, what we observe over time is also a gradual decrease in construction related complaints, a constant, variable but large amount of street-related complaints, and a steady increase in noise-related complaints. Next, I'll post a version of this looking at only the heating season.

Hack for Heat #9: Total complaints over time, complaints per capita, etc.

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date


pd.options.display.max_columns = 40

connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

Hack for Heat #9: Total complaints over time, complaints per capita, etc.

In this post, I'm going to be doing a couple of miscellaneous things to prepare for our first blog post of the summer. Last year's posts can be found here.

In [2]:
cursor.execute('''SELECT createddate, borough, complainttype FROM service;''')
borodata = cursor.fetchall()
In [3]:
borodata = pd.DataFrame(borodata)
In [4]:
borodata.columns = ['Date', 'Boro', 'Comptype']

Again, I removed entires earlier than 2011 because there are data quality issues (substantially fewer cases):

In [5]:
borodata = borodata.loc[borodata['Date'] > date(2011,3,1)]
In [6]:
borodata = borodata.loc [borodata['Date'] <date(2016,6,1)]

Remove cases where the borough was unspecified:

In [7]:
borodata = borodata.loc[borodata['Boro'] != 'Unspecified']

Total complaints by borough, year-over-year

We already have a chart of heat complaints by month; we need one for the total number of complaints received.

In [8]:
borodata['Year'] = [x.year for x in borodata['Date']]
borodata['Month'] = [x.month for x in borodata['Date']]
borodata['Day'] = [x.day for x in borodata['Date']]
In [9]:
plotdata = borodata.groupby(by=['Boro', 'Year', 'Month']).count()
plotdata.head()
Out[9]:
Date Comptype Day
Boro Year Month
BRONX 2011 3 26256 26256 26256
4 22898 22898 22898
5 22722 22722 22722
6 23677 23677 23677
7 27348 27348 27348

Now we generate a dictionary that we can use to plot these data:

In [10]:
plotdict = {x:[] for x in borodata['Boro'].unique()}
In [11]:
for boro in plotdict:
    plotdict[boro] = list(plotdata.xs(boro).Date)

Now we need to format our data for the plot. We need 5 "rows", each representing a borough, and n columns, each representing one month.

In [12]:
plotdata = np.zeros(len(plotdict['BROOKLYN']))
legendkey = []

for boro in plotdict.keys():
    plotdata = np.row_stack((plotdata, plotdict[boro]))
    legendkey.append(boro)
    
plotdata = np.delete(plotdata, (0), axis=0)
In [13]:
x = np.arange(len(plotdata[0]))

#crude xlabels
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years = ['2011', '2012', '2013', '2014', '2015', '2016']
xlabels = []
for year in years:
    for month in months:
        xlabels.append("{0} {1}".format(month,year))

        
xlabels = xlabels[2:-7] #start from march 2011, end may2016
In [14]:
plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,plotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.legend(legendcolors,legendkey, bbox_to_anchor=(0.2, 1));
plt.title('Heating Complaints by Borough', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')
In [15]:
plotdatacsv = pd.DataFrame(plotdata)
plotdatacsv.index = legendkey
plotdatacsv.columns = xlabels
plotdatacsv.to_csv('totalcomplaintsbyboroughandmonth.csv')

Normalized total complaints

We'll also need a normalized version of the above, where everything is scaled to 100%, to see if complaints come disproportionately from one borough vs. another. There are probably better ways to do this, but I'm going to simply sum the number of complaints from each time point, and divide each column by the total for that column (i.e., number of complaints from that borough for that month divided by total number of complaints).

In [16]:
normplotdata = np.ndarray(shape=np.shape(plotdata))

It will thus be helpful to have an array of the total number of complaints:

In [17]:
totalcomplaintsbymonth = borodata.groupby(by=['Year','Month']).count().Date
In [18]:
totalcomplaintsbymonth = totalcomplaintsbymonth.values
In [19]:
totalcomplaintsbymonth = np.tile(totalcomplaintsbymonth,(5,1))
In [20]:
normplotdata = plotdata/totalcomplaintsbymonth
In [21]:
plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,normplotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.ylim(0,1)
plt.legend(legendcolors,legendkey, bbox_to_anchor=(0.2, 1));
plt.title('Heating Complaints by Borough', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

Heat complaints per capita:

We'd also like to look at how many complaints per capita are generated. Given that we only have full years for 2012-2015, let's consider those.

In [22]:
boropop = {
    'MANHATTAN': 1636268,
    'BRONX': 1438159,
    'BROOKLYN': 2621793,
    'QUEENS': 2321580,
    'STATEN ISLAND': 473279,
    }
In [23]:
totalnycpop = reduce(lambda x, y: x +y, boropop.values())
totalnycpop
Out[23]:
8491079
In [24]:
complaintsbyyear = borodata.groupby(by=['Year']).count()
borocomplaintsbyyear = borodata.groupby(by=['Year','Boro']).count()
In [25]:
complaintsbyyear['Pop'] = [totalnycpop for x in complaintsbyyear.index]
In [26]:
borocomplaintsbyyear['Pop'] = [boropop.get(x[1]) for x in borocomplaintsbyyear.index]
In [27]:
complaintsbyyear['CompPerCap'] = complaintsbyyear['Day']/complaintsbyyear['Pop']
In [28]:
complaintsbyyear
Out[28]:
Date Boro Comptype Month Day Pop CompPerCap
Year
2011 1359368 1359368 1359368 1359368 1359368 8491079 0.160094
2012 1528597 1528597 1528597 1528597 1528597 8491079 0.180024
2013 1546188 1546188 1546188 1546188 1546188 8491079 0.182096
2014 1770357 1770357 1770357 1770357 1770357 8491079 0.208496
2015 1924917 1924917 1924917 1924917 1924917 8491079 0.226699
2016 891783 891783 891783 891783 891783 8491079 0.105026
In [29]:
borocomplaintsbyyear['CompPerCap'] = borocomplaintsbyyear['Day']/borocomplaintsbyyear['Pop']
In [30]:
borocomplaintsbyyear
Out[30]:
Date Comptype Month Day Pop CompPerCap
Year Boro
2011 BRONX 262597 262597 262597 262597 1438159 0.182592
BROOKLYN 434038 434038 434038 434038 2621793 0.165550
MANHATTAN 273282 273282 273282 273282 1636268 0.167015
QUEENS 315070 315070 315070 315070 2321580 0.135714
STATEN ISLAND 74381 74381 74381 74381 473279 0.157161
2012 BRONX 296502 296502 296502 296502 1438159 0.206168
BROOKLYN 482088 482088 482088 482088 2621793 0.183877
MANHATTAN 314437 314437 314437 314437 1636268 0.192167
QUEENS 358990 358990 358990 358990 2321580 0.154632
STATEN ISLAND 76580 76580 76580 76580 473279 0.161807
2013 BRONX 302328 302328 302328 302328 1438159 0.210219
BROOKLYN 490979 490979 490979 490979 2621793 0.187268
MANHATTAN 332487 332487 332487 332487 1636268 0.203198
QUEENS 343772 343772 343772 343772 2321580 0.148077
STATEN ISLAND 76622 76622 76622 76622 473279 0.161896
2014 BRONX 340297 340297 340297 340297 1438159 0.236620
BROOKLYN 557290 557290 557290 557290 2621793 0.212561
MANHATTAN 373895 373895 373895 373895 1636268 0.228505
QUEENS 408389 408389 408389 408389 2321580 0.175910
STATEN ISLAND 90486 90486 90486 90486 473279 0.191190
2015 BRONX 355017 355017 355017 355017 1438159 0.246855
BROOKLYN 602144 602144 602144 602144 2621793 0.229669
MANHATTAN 417223 417223 417223 417223 1636268 0.254985
QUEENS 455115 455115 455115 455115 2321580 0.196037
STATEN ISLAND 95418 95418 95418 95418 473279 0.201610
2016 BRONX 169773 169773 169773 169773 1438159 0.118049
BROOKLYN 271140 271140 271140 271140 2621793 0.103418
MANHATTAN 209840 209840 209840 209840 1636268 0.128243
QUEENS 200335 200335 200335 200335 2321580 0.086293
STATEN ISLAND 40695 40695 40695 40695 473279 0.085985

From these data, we can see two things: First, on average, about 1 in 5 people make a 311 complaint each year. Second, overall, this number are pretty consistent across the 5 boroughs.

Mousetracker Data #2

Overview

Since the time of the last post about these data, my friends and I at the NYU couples lab have collect some actual Mousetracker data from online participants. In this post, I wrote some code to pre-process and clean the dataset.

Current project

As a refresher, we hypothesized that at any given time, individuals are concerned with:

  1. their self-interest
  2. their partner's interests
  3. the interest of the group or dyad, or the relationship, or them as a pair

and these motives affect the way individuals choose to distribute resources.

To distinguish between these three motives, we generated three sets of stimuli using poker chips that pit each of these motives against each other:

  1. The first set of stimuli pit participants' self-interest against the interests of their partner.
  2. The second set of stimuli pits a participant's concern for the interest of their partner vs. their own self interest and the group's interest.
  3. The last set of stimuli pit participants' self-interest against that of their partner and the group.

The data

The data come in a person-period dataset. This is a "long" format where each participant has multiple rows that represent each trial of the experiment (there were 60 or so trials). However, each row also contains multiple columns each representing a bin of average locations the participant's mouse pointer was during that time span. There are ~100 such bins.

In other words, each participant made 60 choices, and their mouse positions were averaged into ~100 time points per trial.

The first thing we're going to do is actually some cleaning. Each of the .csv files have a redundant first row, so we're just going to delete that:

In [1]:
import os
import re

files = os.listdir('./data')
files
Out[1]:
['GBPG SMGL VS. SLGM n=118.csv',
 'GBPG SMGM VS. SLGL n=118.csv',
 'GBPG SO n=118.csv',
 'GBPR SMGL VS. SLGM n=121.csv',
 'GBPR SMGM vs. SLGL n=121.csv',
 'GBPR SO n=121.csv',
 'RBPG SMGL vs. SLGM n=105.csv',
 'RBPG SMGM vs. SLGL n=105.csv',
 'RBPG SO n=105.csv',
 'RBPR SMGL vs. SLGM n=120.csv',
 'RBPR SMGM vs. SLGL n=120.csv',
 'RBPR SO n=120.csv']
In [2]:
#for all our files
for name in files:
    with open('./data/{0}'.format(name)) as f:
        contents = f.readlines()
    out =  open('./data cleaned/{0}.csv'.format(re.match(r'(.*) n=.*', name).group(1)), 'wb')
    
    found = 0
    for line in contents[1:]:
        if line == 'MEAN SUBJECT-BY-SUBJECT DATA\n':
            found = 1
        elif found == 0:
            out.write(line)
        elif found == 1:
            pass
    out.close()

If there wasn't a section at the end I had to delete, I could have just done this:

In [3]:
import pandas as pd
   
testdata = pd.read_csv('./data/%s' % files[0],skiprows = 1)
testdata.head()
Out[3]:
subject trial stimfile condition code resp_1 resp_2 response distractor error ... Y_93 Y_94 Y_95 Y_96 Y_97 Y_98 Y_99 Y_100 Y_101 Unnamed: 226
0 75870 2 NaN 1 smglvsslgm4 ~S9_O7.jpg ~S8_O12.jpg 1 2 0 ... 1.1770 1.1811 1.1826 1.1825 1.1824 1.1874 1.1875 1.1875 1.1875 NaN
1 75870 3 NaN 1 smglvsslgm6 ~S10_O15.jpg ~S12_O8.jpg 2 1 0 ... 1.1825 1.1825 1.1823 1.1837 1.1865 1.1876 1.1875 1.1875 1.1875 NaN
2 75870 7 NaN 1 smglvsslgm18 ~S12_O7.jpg ~S8_O13.jpg 1 2 0 ... 1.2003 1.1955 1.1950 1.1935 1.1924 1.1925 1.1925 1.1925 1.1925 NaN
3 75870 9 NaN 1 smglvsslgm19 ~S9_O12.jpg ~S12_O8.jpg 2 1 0 ... 1.3081 1.3013 1.2999 1.3000 1.3000 1.3000 1.3000 1.3000 1.3000 NaN
4 75870 14 NaN 2 smglvsslgm9 ~S9_O7.jpg ~S10_O5.jpg 1 1 1 ... 1.2577 1.2575 1.2575 1.2577 1.2565 1.2543 1.2519 1.2500 1.2500 NaN

5 rows × 227 columns

Next, I'm going to write a loop that basically does what I did in the first post to all the separate datasets. Again, we're going to be finding the mean of participants' reaction time (RT), maximum deviation (MD), and the area under curve (AUC).

What we want in the end is a csv file that has the overall mean RT, MD, AUC, as well as those metrics when participants' were correct vs. incorrect.

I wrote two functions that basically do what I did by hand in the first post. The first combines two redundant columns, and the second finds the mean of that column, depending on whether the participant made an error or not, or whether we want the grand mean.

In [4]:
data = pd.read_csv('./data cleaned/%s' % os.listdir('./data cleaned')[0])
data.head()
Out[4]:
subject trial stimfile condition code resp_1 resp_2 response distractor error ... Y_93 Y_94 Y_95 Y_96 Y_97 Y_98 Y_99 Y_100 Y_101 Unnamed: 226
0 75870 2 NaN 1 smglvsslgm4 ~S9_O7.jpg ~S8_O12.jpg 1 2 0 ... 1.1770 1.1811 1.1826 1.1825 1.1824 1.1874 1.1875 1.1875 1.1875 NaN
1 75870 3 NaN 1 smglvsslgm6 ~S10_O15.jpg ~S12_O8.jpg 2 1 0 ... 1.1825 1.1825 1.1823 1.1837 1.1865 1.1876 1.1875 1.1875 1.1875 NaN
2 75870 7 NaN 1 smglvsslgm18 ~S12_O7.jpg ~S8_O13.jpg 1 2 0 ... 1.2003 1.1955 1.1950 1.1935 1.1924 1.1925 1.1925 1.1925 1.1925 NaN
3 75870 9 NaN 1 smglvsslgm19 ~S9_O12.jpg ~S12_O8.jpg 2 1 0 ... 1.3081 1.3013 1.2999 1.3000 1.3000 1.3000 1.3000 1.3000 1.3000 NaN
4 75870 14 NaN 2 smglvsslgm9 ~S9_O7.jpg ~S10_O5.jpg 1 1 1 ... 1.2577 1.2575 1.2575 1.2577 1.2565 1.2543 1.2519 1.2500 1.2500 NaN

5 rows × 227 columns

In [5]:
#first, combine the redundant columns and type the relevant columns
def combine_columns(dddd):
    dddd['MD'] = dddd.loc[dddd['MD_1'].isnull() == False, ['MD_1']]
    dddd.loc[dddd['MD'].isnull() == True,['MD']] = dddd.loc[dddd['MD_2'].isnull() == False]['MD_2'] 
    dddd['AUC'] = dddd.loc[dddd['AUC_1'].isnull() == False, ['AUC_1']]
    dddd.loc[dddd['AUC'].isnull() == True, ['AUC']] = dddd.loc[dddd['AUC_2'].isnull() == False]['AUC_2']
    
combine_columns(data)
In [6]:
def find_mean(datasource, participantid, metric, error=None):
    participantsdata = datasource.loc[datasource['subject'] == participantid]
    if error == 1:
        return participantsdata.loc[participantsdata['error']==1][metric].astype('float').mean()
    elif error == 0:
        return participantsdata.loc[participantsdata['error']==0][metric].astype('float').mean()
    else:        
        return participantsdata[metric].astype('float').mean()        

Next, we're going to test some code that calculates the mean of the afore-mentioned metrics for every participant in a dataset:

In [7]:
combine_columns(data)
In [8]:
participants = data['subject'].unique()
In [9]:
participantdict = {x:[] for x in participants}
In [10]:
for participant in participantdict:
    for metric in ['AUC', 'MD', 'RT']:
        try:
            participantdict[participant].append(find_mean(data, participant, metric, error = 1))
        except:
            participantdict[participant].append(None)
In [11]:
outdata = pd.DataFrame.from_dict(participantdict,orient='index')
outdata.columns = ['AUC', 'MD', 'RT']
In [12]:
outdata.head()
Out[12]:
AUC MD RT
9104386 0.222260 0.147950 618.200000
7279624 -0.039400 -0.006537 1208.375000
3452429 0.378033 0.257911 1381.444444
7354384 0.034620 0.038660 1102.700000
479766 -0.036642 -0.048183 867.166667

Let's write this as a function:

In [13]:
def sum_participants(pkeys,metrics,err,datttt):
    adictionary = {x:[] for x in pkeys}
    for participant in adictionary:
        for metric in metrics:
            try:
                adictionary[participant].append(find_mean(datttt, participant, metric ,error = err))
            except:
                adictionary[participant].append(None)
        adictionary[participant].append(err)
    return adictionary

Definitely not production code, but it should work.

Combining datasets

Alright, now we have all of that working, let's combine the datasets that we have, and add features to tell us where the data came from:

In [14]:
files = os.listdir('./data cleaned')
files
Out[14]:
['GBPG SMGL VS. SLGM.csv',
 'GBPG SMGM VS. SLGL.csv',
 'GBPG SO.csv',
 'GBPR SMGL VS. SLGM.csv',
 'GBPR SMGM vs. SLGL.csv',
 'GBPR SO.csv',
 'RBPG SMGL vs. SLGM.csv',
 'RBPG SMGM vs. SLGL.csv',
 'RBPG SO.csv',
 'RBPR SMGL vs. SLGM.csv',
 'RBPR SMGM vs. SLGL.csv',
 'RBPR SO.csv']

What we're going to do is first create an empty DataFrame with all our columns. Next, we'll load all of our data, do the math for the 3 measures, add a feature that captures where the data came from (i.e., one column for the color codes, another color for the comparison type).

In [15]:
combineddata = pd.DataFrame(columns= ['AUC', 'MD', 'RT', 'ERROR', 'COLORCODE', 'COMPARISON'])

Now, let's put everything together, and loop through all the filenames:

In [16]:
metrics = ['AUC', 'MD', 'RT']

for filename in files:
    tempdata = pd.read_csv('./data cleaned/{0}'.format(filename))
    
    combine_columns(tempdata)
    
    participants = tempdata['subject'].unique()
    
    correctdict = sum_participants(participants,metrics,0,tempdata)
    errordict = sum_participants(participants,metrics,1,tempdata)
    
    correctdata = pd.DataFrame.from_dict(correctdict,orient='index')
    errordata = pd.DataFrame.from_dict(errordict,orient='index')
    
    outdata = pd.concat([correctdata,errordata])
    outdata.columns = ['AUC', 'MD', 'RT', 'ERROR']
    outdata['COLORCODE'] = re.match(r'(....)', filename).group(1)
    outdata['COMPARISON'] = re.match(r'.... (.*).csv',filename).group(1)
    
    print filename
    print len(outdata)
    
    combineddata = pd.concat([combineddata,outdata])
GBPG SMGL VS. SLGM.csv
236
GBPG SMGM VS. SLGL.csv
236
GBPG SO.csv
236
GBPR SMGL VS. SLGM.csv
242
GBPR SMGM vs. SLGL.csv
242
GBPR SO.csv
242
RBPG SMGL vs. SLGM.csv
210
RBPG SMGM vs. SLGL.csv
210
RBPG SO.csv
210
RBPR SMGL vs. SLGM.csv
240
RBPR SMGM vs. SLGL.csv
240
RBPR SO.csv
240
In [17]:
combineddata.head()
Out[17]:
AUC MD RT ERROR COLORCODE COMPARISON
9104386 0.198260 0.117420 729.000000 0.0 GBPG SMGL VS. SLGM
7279624 0.147017 0.087525 1235.500000 0.0 GBPG SMGL VS. SLGM
3452429 0.530336 0.300000 1443.727273 0.0 GBPG SMGL VS. SLGM
7354384 0.006440 0.003520 1173.000000 0.0 GBPG SMGL VS. SLGM
479766 0.005037 0.016762 814.250000 0.0 GBPG SMGL VS. SLGM

This is what the data should look like, so let's write it to a csv:

In [18]:
combineddata.sort_index().to_csv('combineddata.csv')
In [19]:
outdata
Out[19]:
AUC MD RT ERROR COLORCODE COMPARISON
448004 2.707033 1.111633 906.333333 0 RBPR SO
2675720 0.352980 0.169195 953.950000 0 RBPR SO
2947082 0.160825 0.137081 1037.437500 0 RBPR SO
8594447 0.907000 0.577125 1015.250000 0 RBPR SO
515090 0.394933 0.282000 833.888889 0 RBPR SO
1283091 0.474890 0.310680 781.600000 0 RBPR SO
1591835 2.229329 0.997536 920.357143 0 RBPR SO
8448030 0.717493 0.313850 748.642857 0 RBPR SO
8767007 1.628893 0.699000 990.214286 0 RBPR SO
9444902 0.676900 0.361963 800.750000 0 RBPR SO
8873015 0.907767 0.487600 660.000000 0 RBPR SO
8469940 0.164550 0.122125 647.375000 0 RBPR SO
8325642 1.047320 0.590833 852.800000 0 RBPR SO
503358 0.058525 0.050817 682.333333 0 RBPR SO
7386178 1.234072 0.637233 924.611111 0 RBPR SO
953414 -0.118170 -0.146170 862.500000 0 RBPR SO
5312076 0.395989 0.157433 548.111111 0 RBPR SO
2515533 0.960689 0.571544 974.222222 0 RBPR SO
6012496 1.453787 0.773067 973.066667 0 RBPR SO
4717240 0.883219 0.595013 1131.687500 0 RBPR SO
3007065 0.044767 0.053333 739.555556 0 RBPR SO
3507292 -0.100017 -0.085700 537.000000 0 RBPR SO
1363052 1.691200 0.790547 985.941176 0 RBPR SO
1426541 0.000000 0.000000 404.636364 0 RBPR SO
9603603 0.744925 0.492655 1258.700000 0 RBPR SO
5480054 -0.025157 -0.026857 649.714286 0 RBPR SO
1355895 0.970286 0.480421 1062.285714 0 RBPR SO
7049336 0.971658 0.564100 852.736842 0 RBPR SO
3354234 0.390192 0.230254 1166.076923 0 RBPR SO
2585535 1.319237 0.650863 829.250000 0 RBPR SO
... ... ... ... ... ... ...
84352 0.928446 0.443062 743.384615 1 RBPR SO
3042691 NaN NaN NaN 1 RBPR SO
2597764 2.076241 0.906424 939.000000 1 RBPR SO
9324421 NaN NaN NaN 1 RBPR SO
5227406 NaN NaN NaN 1 RBPR SO
7229839 0.029771 0.035086 437.714286 1 RBPR SO
4169107 NaN NaN NaN 1 RBPR SO
6001558 -0.091046 -0.119262 596.538462 1 RBPR SO
647592 1.448136 0.694329 1193.571429 1 RBPR SO
6338986 -0.117800 -0.135150 452.000000 1 RBPR SO
3329964 0.321055 0.133191 687.454545 1 RBPR SO
4264365 0.253538 0.165177 686.923077 1 RBPR SO
3035571 0.064817 0.089733 607.333333 1 RBPR SO
6431156 0.261650 0.240750 892.500000 1 RBPR SO
7201720 0.105188 0.100663 598.250000 1 RBPR SO
5427132 0.064814 0.078571 694.571429 1 RBPR SO
1544127 0.308789 0.272522 692.666667 1 RBPR SO
265153 0.999410 0.517540 603.100000 1 RBPR SO
1399240 0.176720 0.162690 813.000000 1 RBPR SO
5108681 -0.053300 -0.107300 858.000000 1 RBPR SO
9541367 0.321488 0.219350 659.250000 1 RBPR SO
2938324 0.060600 0.097500 781.000000 1 RBPR SO
374233 0.635582 0.401791 760.272727 1 RBPR SO
8976353 0.608600 0.358400 1020.000000 1 RBPR SO
8706045 0.005117 0.013417 734.833333 1 RBPR SO
9728496 0.165137 0.152912 538.250000 1 RBPR SO
7825395 NaN NaN NaN 1 RBPR SO
6582776 0.665817 0.497900 1213.333333 1 RBPR SO
4016125 1.036192 0.571175 828.333333 1 RBPR SO
7075839 NaN NaN NaN 1 RBPR SO

240 rows × 6 columns

Hack for Heat #8: Complaint resolution time revisted

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2
from datetime import date

pd.options.display.max_columns = 40
pd.options.mode.chained_assignment = None  # default='warn'

Hack for Heat #8: Complaint resolution time revisited

In this post, I'm going to explore a simple way of dealing with the censoring issue in complaint resolution time: Basically, because we only care about heat complaints during the heat season, we can constrain our cases of interest to the heating season specifically (1st October to May 31st).

In [2]:
#pull from our database again:
connection = psycopg2.connect('dbname = threeoneone user= threeoneoneadmin password = threeoneoneadmin')
In [3]:
cursor = connection.cursor()
In [4]:
cursor.execute('''SELECT createddate, closeddate, borough, complainttype FROM service;''')
In [5]:
data = cursor.fetchall()
In [6]:
data = pd.DataFrame(data)
In [7]:
data.columns = ['createddate', 'closeddate', 'borough', 'complainttype']
In [8]:
data = data.loc[(data['complainttype'] == 'HEATING') | (data['complainttype'] == 'HEAT/HOT WATER') ]

So far, we've subsetted the data to only include the heating complaints. Now let's create a datetime mask. What we want is to have subsets of data for each of the heating seasons:

In [9]:
heatmonths = range(1, 6) + [10,11,12]
In [10]:
heatmonths
Out[10]:
[1, 2, 3, 4, 5, 10, 11, 12]
In [11]:
createdmask = data['createddate'].map(lambda x: (x.month in heatmonths))
closedmask = data['closeddate'].map(lambda x: (x.month in heatmonths) if x != None else False)
In [12]:
mask = createdmask & closedmask
In [13]:
heatseasondata = data.loc[mask]

How many heat/hot water complaints are created and closed inside vs. outside of heating season?

In [14]:
len(data.loc[mask]) #inside heating season
Out[14]:
1302688
In [15]:
len(data.loc[~mask])#outside heating season
Out[15]:
68750

The next thing we want to do is ignore cases where the complaint was resolved in the next heating season:

In [16]:
prevmonths = range(1, 6) 
nextmonths = [10,11,12]
In [17]:
heatseasondata['createdheatseason'] = [x.year if (x.month in prevmonths) else (x.year-1) for x in heatseasondata['createddate']]
In [18]:
heatseasondata.head()
Out[18]:
createddate closeddate borough complainttype createdheatseason
0 2014-11-18 2014-11-22 BROOKLYN HEAT/HOT WATER 2013
1 2014-11-18 2014-11-25 BROOKLYN HEAT/HOT WATER 2013
2 2014-11-18 2014-11-19 BRONX HEAT/HOT WATER 2013
3 2014-11-18 2014-11-19 MANHATTAN HEAT/HOT WATER 2013
4 2014-11-18 2014-11-20 BRONX HEAT/HOT WATER 2013
In [19]:
heatseasondata['closedheatseason'] = [x.year if (x.month in prevmonths) else (x.year-1) for x in heatseasondata['closeddate']]

Now that we've done this, we can select only the cases where the closed date was in the same season as the created date:

In [20]:
heatseasondata = heatseasondata.loc[heatseasondata['createdheatseason']  == heatseasondata['closedheatseason']]

Okay, now we can calculate some average resolution times:

In [21]:
heatseasondata['resolutiontime'] = heatseasondata['closeddate'] - heatseasondata['createddate']
In [22]:
heatseasondata['resolutiontimeint'] = heatseasondata.resolutiontime.astype('timedelta64[D]')
In [23]:
resolutiontimedata = heatseasondata.groupby(by='createdheatseason').mean()['resolutiontimeint']
In [24]:
resolutiontimedata.to_csv('resolutiontimebyyear.csv')
In [25]:
resolutiontimedata
Out[25]:
createdheatseason
2009    4.605845
2010    3.695498
2011    3.666409
2012    3.623974
2013    3.514244
2014    3.137578
2015    3.406145
2016    3.446075
Name: resolutiontimeint, dtype: float64

Resolution times by year:

In [26]:
x = resolutiontimedata.index.values
y = resolutiontimedata.values

plt.figure(figsize=(12,10));
plt.plot(x,y);

Resolution time by borough:

In [27]:
restimebyboro = heatseasondata.groupby(by=['borough', 'createdheatseason']).mean()['resolutiontimeint']
In [28]:
restimebyboro.to_csv('restimebyboro.csv')
In [29]:
restimebyboro = restimebyboro.loc[[x in range(2010,2017) for x in restimebyboro.index.get_level_values('createdheatseason')]]
In [31]:
boros = heatseasondata.borough.unique()
In [32]:
boroplots = {x:[] for x in boros}
for boro in boros:
    boroplots[boro] = restimebyboro.xs(boro).values
    
boroplots.pop('Unspecified')
Out[32]:
array([ 3.64268956,  3.89000015])
In [33]:
x = range(2010,2017)

plt.figure(figsize=(12,10));
for boro in boroplots:
    plt.plot(x,boroplots[boro]);
plt.legend(boroplots.keys());
plt.xticks(x, [str(label) for label in x]);

I was told beforehand that the average resolution time from a similar analysis last year was about 3-5 days, so this looks about right.

Hack for Heat #7: Heat complaints over time

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date

pd.options.display.max_columns = 40

Hack for Heat #7: Heat complaints over time

In the last post, I plotted how the number of complaints differd by borough over time. This time around, I'm going to revisit this process, but this time focusing on heating complaints only (this is Heat Seek, after all).

Loading data:

In [2]:
connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()
In [3]:
cursor.execute('''SELECT DISTINCT complainttype FROM service;''')
complainttypes = cursor.fetchall()
In [4]:
cursor.execute('''SELECT createddate, borough, complainttype FROM service;''')
borodata = cursor.fetchall()
In [5]:
borodata = pd.DataFrame(borodata)
In [6]:
borodata.head()
Out[6]:
0 1 2
0 2011-04-15 MANHATTAN Street Condition
1 2011-04-15 MANHATTAN Street Condition
2 2011-04-15 MANHATTAN Street Condition
3 2011-04-15 MANHATTAN Street Condition
4 2011-04-14 QUEENS Street Condition
In [7]:
borodata.columns = ['Date', 'Boro', 'Comptype']
In [8]:
heatdata = borodata.loc[(borodata['Comptype'] == 'HEATING') | (borodata['Comptype'] == 'HEAT/HOT WATER')]

I removed entires earlier than 2011 because there are data quality issues (substantially fewer cases):

In [9]:
heatdata = heatdata.loc[heatdata['Date'] > date(2011,3,1)]

Heat complaints by borough:

In [10]:
len(heatdata)
Out[10]:
1092048

There were about a million heating complaints over the 3+ years we have data for.

In [11]:
heatbydate = heatdata.groupby(by='Boro').count()
In [12]:
heatbydate
Out[12]:
Date Comptype
Boro
BRONX 354335 354335
BROOKLYN 334574 334574
MANHATTAN 250296 250296
QUEENS 138477 138477
STATEN ISLAND 10601 10601
Unspecified 3765 3765

Per capita:

Again, let's look at how many heat complaints each boro generates per person:

In [13]:
boropop = {
    'MANHATTAN': 1636268,
    'BRONX': 1438159,
    'BROOKLYN': 2621793,
    'QUEENS': 2321580,
    'STATEN ISLAND': 473279,
    }
In [14]:
heatbydate['Pop'] = [boropop.get(x) for x in heatbydate.index]
In [15]:
heatbydate['CompPerCap'] = heatbydate['Date']/heatbydate['Pop']
In [16]:
heatbydate
Out[16]:
Date Comptype Pop CompPerCap
Boro
BRONX 354335 354335 1438159.0 0.246381
BROOKLYN 334574 334574 2621793.0 0.127613
MANHATTAN 250296 250296 1636268.0 0.152968
QUEENS 138477 138477 2321580.0 0.059648
STATEN ISLAND 10601 10601 473279.0 0.022399
Unspecified 3765 3765 NaN NaN

Complaints by borough over months

First, let's recreate the graph from before:

In [17]:
heatdata['Year'] = [x.year for x in heatdata['Date']]
heatdata['Month'] = [x.month for x in heatdata['Date']]
heatdata['Day'] = [x.day for x in heatdata['Date']]
In [18]:
heatdata.head()
Out[18]:
Date Boro Comptype Year Month Day
58 2015-10-17 BROOKLYN HEAT/HOT WATER 2015 10 17
728 2011-04-16 MANHATTAN HEATING 2011 4 16
1026 2011-04-16 BROOKLYN HEATING 2011 4 16
1418 2015-10-17 MANHATTAN HEAT/HOT WATER 2015 10 17
1624 2011-04-15 MANHATTAN HEATING 2011 4 15

We remove data where it was unspecified what the boro was

In [19]:
heatdata = heatdata.loc[heatdata['Boro'] != 'Unspecified']
In [20]:
heatplotdata = heatdata.groupby(by=['Boro', 'Year','Month']).count()
heatplotdata
Out[20]:
Date Comptype Day
Boro Year Month
BRONX 2011 3 5101 5101 5101
4 3339 3339 3339
5 1217 1217 1217
6 562 562 562
7 564 564 564
8 482 482 482
9 674 674 674
10 5954 5954 5954
11 7393 7393 7393
12 10536 10536 10536
2012 1 12655 12655 12655
2 7146 7146 7146
3 5049 5049 5049
4 3336 3336 3336
5 1214 1214 1214
6 672 672 672
7 688 688 688
8 711 711 711
9 675 675 675
10 3488 3488 3488
11 11853 11853 11853
12 9365 9365 9365
2013 1 13580 13580 13580
2 10037 10037 10037
3 6794 6794 6794
4 3706 3706 3706
5 1545 1545 1545
6 861 861 861
7 605 605 605
8 593 593 593
... ... ... ... ... ...
STATEN ISLAND 2013 12 261 261 261
2014 1 491 491 491
2 199 199 199
3 206 206 206
4 112 112 112
5 70 70 70
6 35 35 35
7 32 32 32
8 31 31 31
9 36 36 36
10 216 216 216
11 405 405 405
12 246 246 246
2015 1 402 402 402
2 523 523 523
3 216 216 216
4 91 91 91
5 58 58 58
6 31 31 31
7 24 24 24
8 31 31 31
9 31 31 31
10 261 261 261
11 250 250 250
12 230 230 230
2016 1 480 480 480
2 327 327 327
3 133 133 133
4 110 110 110
5 79 79 79

315 rows × 3 columns

In [21]:
boros = heatbydate.index
borodict = {x:[] for x in boros}
borodict.pop('Unspecified')

for boro in borodict:
    borodict[boro] = list(heatplotdata.xs(boro).Date)
In [22]:
plotdata = np.zeros(len(borodict['BROOKLYN']))
for boro in sorted(borodict.keys()):
    plotdata = np.row_stack((plotdata, borodict[boro]))
In [23]:
plotdata = np.delete(plotdata, (0), axis=0)
In [25]:
from matplotlib import patches as mpatches

x = np.arange(len(plotdata[0]))

#crude xlabels
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years = ['2011', '2012', '2013', '2014', '2015', '2016']
xlabels = []
for year in years:
    for month in months:
        xlabels.append("{0} {1}".format(month,year))

xlabels = xlabels[2:-7] #start from march 2011, end may2016

A non-normalized plot:

The plot below is of raw complaint numbers, and is what we might expect: complaints about eat matter the most during the heating season!

In [26]:
plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,plotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.legend(legendcolors,sorted(borodict.keys()), bbox_to_anchor=(0.2, 1));
plt.title('Heating Complaints by Borough', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

Normalizing data:

Next, we're going to normalize these data. This allows us to better visualize if the proportion of heating complaints changed by borough, over time (e.g., did one borough generate more/less complaints vs. others over time?).

What we want to do is divide each of the 5 rows by the total number of complaints by that column (the month)

In [27]:
totalcounts = heatdata.groupby(by=['Year', 'Month']).count().Date.values
In [40]:
boros = heatbydate.index
normdict = {x:[] for x in boros}
normdict.pop('Unspecified')

for boro in normdict:
    for i in range(0,len(plotdata[1])): # for all the values in each row
        normp = float(borodict[boro][i])/float(totalcounts[i])
        normdict[boro].append(normp*100)
In [41]:
normplotdata = np.zeros(len(borodict['BROOKLYN']))
for boro in sorted(normdict.keys()):
    normplotdata = np.row_stack((normplotdata, normdict[boro]))
In [42]:
normplotdata = np.delete(normplotdata,(0),axis = 0)
In [44]:
plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,normplotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.ylim(0,100)
plt.legend(legendcolors,sorted(normdict.keys()), bbox_to_anchor=(0.2, 1));
plt.title('Heating Complaints by Borough (normalized)', size = 24)
plt.ylabel('% of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

Overall, it looks like heating complaints were pretty consistent by borough. In the next post, I'm going to take a look at one way of addressing the afore-mentioned censoring problems.

Hack for Heat #6: Complaint resolution time, over time

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40

Hack for Heat #6: Complaint resolution time, over time and censoring

As a follow up to the last post, we're going to try and find out if the average resolution time has changed over time. This might be tricky, as we might run into the issue of censoring. Censoring) is a problem in time-series analyses that occurs when data that are missing because they lie outside of the range of the measure.

In this case, what we might expect to find is that complaint resolution time appears shorter as we get to the present day, and that may be true, but it may also happen because cases where problems have yet to be solved show up as missing data. In other words, for a case that was opened in 2010, the longest possible resolution time is 5 years, whereas for a case opened yesterday, the longest possible resolution time is 1 day.

So, as a first step, let's look at the proportion of unseen cases over time:

In [2]:
#Like before, we're going to select the relevant columns from the database:
connection = psycopg2.connect('dbname= threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

cursor.execute('''SELECT createddate, closeddate, borough FROM service;''')
data = cursor.fetchall()
data = pd.DataFrame(data)
In [3]:
data.columns = ['createddate','closeddate','borough']

Let's extract years and months again:

In [4]:
data['cryear'] = [x.year for x in data['createddate']]
data['crmonth'] = [x.month for x in data['createddate']]

And now, we're going to filter out bad cases again. However, this time, we're going to be a bit more selective. We're going to include cases where the closed date is missing.

In [5]:
#filter out bad casesa
import datetime

today = datetime.date(2016,5,29)
janone = datetime.date(2010,1,1)

data = data.loc[(data['closeddate'] > data['createddate']) | (data['closeddate'].isnull() == True)]
In [6]:
databyyear = data.groupby(by='cryear').count()
databyyear
Out[6]:
createddate closeddate borough crmonth
cryear
2010 1340309 1281719 1340309 1340309
2011 1294795 1157180 1294795 1294795
2012 1185598 1148399 1185598 1185598
2013 1221690 1177863 1221690 1221690
2014 1354300 1305561 1354300 1354300
2015 1451521 1378090 1451521 1451521
2016 605070 530959 605070 605070
In [7]:
data['timedelta'] = data['closeddate'] - data['createddate']
data['timedeltaint'] = [x.days if pd.isnull(x) == False else None for x in data['timedelta']]
In [8]:
data.groupby(by='cryear').mean()
Out[8]:
crmonth timedeltaint
cryear
2010 6.505385 24.508043
2011 6.491262 22.512834
2012 6.617468 22.690249
2013 6.414707 20.205784
2014 6.280245 19.712918
2015 6.314193 16.731161
2016 2.799473 8.634098

This table shows exactly what I'm talking about - as we get closer to the current day, the average resolution time falls more and more. If censorig is occuring, we might expect that the proportion of cases closed is also decreasing over time. This is generally the case:

In [9]:
databyyear['propclosed'] = databyyear['closeddate']/databyyear['createddate']
databyyear
Out[9]:
createddate closeddate borough crmonth propclosed
cryear
2010 1340309 1281719 1340309 1340309 0.956286
2011 1294795 1157180 1294795 1294795 0.893717
2012 1185598 1148399 1185598 1185598 0.968624
2013 1221690 1177863 1221690 1221690 0.964126
2014 1354300 1305561 1354300 1354300 0.964012
2015 1451521 1378090 1451521 1451521 0.949411
2016 605070 530959 605070 605070 0.877517

With the exception of year 2011, which we have reason to believe is wonky (see Hack for Heat #4), this is generally the case. So, how do we deal with this issue? To be honest, I don't know at the moment. But I'm going to read this paper tomorrow, and see if I can implement something.