# Hello!

I'm Bryan. This is my blog about the following projects (links above):

### Hack for Heat

In my free time, I volunteer for Heat Seek, an NYC-based non-profit that deploys temperature sensors to ensure that everyone in the city has adequate heating over the winter. My posts here show some of the groundwork that goes into our blog posts.

### Mediocre Museum

The mediocre part of this project comes from my friend Emilia's and my goal of building the most average artifact possible. We're currently working with open data from the Canada Science and Technology Museums Corporation, available here.

### Mousetracker Data

I'm working with my colleagues, Qi Xu and Carlina Conrad, at the NYU Couples Lab to learn more about how people in relationships allocate resources amongst themselves, and balance self-, altruistic, and group interests.

### Skytrax Ratings

What makes a good flight? Using Skytrax ratings data scraped by Quang Nguyen, I'm investigating what factors make up a passenger's overall rating, as well as what influences their decision to recommend an airline to others (vs. not).

# Hack for Heat #5: Complaint resolution time

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40


### Hack for Heat #5: How long do complaints take to resolve?

In this post, we're going to see if we can graph how long it takes for complaints to get resolved.

In [2]:
connection = psycopg2.connect('dbname= threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

cursor.execute('''SELECT createddate, closeddate, borough FROM service;''')
data = cursor.fetchall()
data = pd.DataFrame(data)

In [3]:
data.columns = ['createddate','closeddate','borough']

In [4]:
data = data.loc[data['createddate'].notnull()]
data = data.loc[data['closeddate'].notnull()]

In [5]:
data['timedelta'] = data['closeddate'] - data['createddate']

In [6]:
data['timedeltaint'] = [x.days for x in data['timedelta']]

In [7]:
data.head()

Out[7]:
createddate closeddate borough timedelta timedeltaint
0 2013-04-26 2013-04-26 QUEENS 0 days 0
1 2013-04-26 2013-04-29 MANHATTAN 3 days 3
2 2013-04-26 2013-04-30 MANHATTAN 4 days 4
3 2013-04-26 2013-04-26 QUEENS 0 days 0
4 2013-04-26 2013-04-30 STATEN ISLAND 4 days 4
In [8]:
data.groupby(by='borough')['timedeltaint'].mean()

Out[8]:
borough
BRONX            -3.658691
BROOKLYN         -6.835417
MANHATTAN       -20.325506
QUEENS           -4.121806
STATEN ISLAND    -2.108396
Unspecified       4.699638
Name: timedeltaint, dtype: float64

Oops! Looks like something's wrong. Let's try and find out:

In [9]:
data.sort_values('timedeltaint').head()

Out[9]:
createddate closeddate borough timedelta timedeltaint
604564 2016-03-16 1900-01-01 Unspecified -42443 days -42443
596082 2016-03-16 1900-01-01 Unspecified -42443 days -42443
605654 2016-03-16 1900-01-01 QUEENS -42443 days -42443
606458 2016-03-15 1900-01-01 MANHATTAN -42442 days -42442
552013 2016-03-14 1900-01-01 Unspecified -42441 days -42441
In [10]:
data.sort_values('timedeltaint', ascending=False).head()

Out[10]:
createddate closeddate borough timedelta timedeltaint
9392808 2010-04-08 2201-05-13 BRONX 69796 days 69796
9525584 2010-06-01 2201-06-17 BROOKLYN 69777 days 69777
1581464 2011-01-20 2201-03-25 BRONX 69460 days 69460
1536326 2015-11-03 2100-01-01 QUEENS 30740 days 30740
2479091 2013-01-10 2023-05-01 BROOKLYN 3763 days 3763

Ah. Well, as a first step, let's remove any values that are before Jan 1st 2010 or after today:

In [11]:
import datetime

today = datetime.date(2016,5,29)
janone = datetime.date(2010,1,1)


Let's also remove any rows where the close date is before the created date:

In [12]:
subdata = data.loc[(data['closeddate'] > janone) & (data['closeddate'] < today)]
subdata = subdata.loc[subdata['closeddate'] > subdata['createddate']]

In [13]:
len(subdata)

Out[13]:
7979757
In [14]:
subdata.sort_values('timedeltaint').head()

Out[14]:
createddate closeddate borough timedelta timedeltaint
11371297 2015-08-28 2015-08-29 QUEENS 1 days 1
7950080 2015-06-20 2015-06-21 BROOKLYN 1 days 1
2366882 2012-11-19 2012-11-20 MANHATTAN 1 days 1
2366879 2015-06-09 2015-06-10 Unspecified 1 days 1
2366874 2012-04-17 2012-04-18 BROOKLYN 1 days 1
In [15]:
subdata.sort_values('timedeltaint',ascending = False).head()

Out[15]:
createddate closeddate borough timedelta timedeltaint
10774084 2010-01-06 2016-03-24 BROOKLYN 2269 days 2269
1506622 2010-02-19 2016-05-02 BROOKLYN 2264 days 2264
1508205 2010-03-02 2016-05-04 STATEN ISLAND 2255 days 2255
513580 2010-01-01 2016-02-17 STATEN ISLAND 2238 days 2238
589865 2010-01-22 2016-02-25 BRONX 2225 days 2225

This looks a little bit more realistic, but let's also visualize the distribution:

In [16]:
plotdata = list(subdata['timedeltaint'])

In [17]:
plt.figure(figsize=(12,10))
plt.hist(plotdata);


Okay, this still looks really wonky. Let's further subset the data, and see what happens when we remove the top and bottom 2.5%.

Pandas has a quantile function:

In [18]:
subdata.quantile([.025, .975])

Out[18]:
timedeltaint
0.025 1
0.975 138
In [19]:
quantcutdata = subdata.loc[(subdata['timedeltaint'] > 1) & (subdata['timedeltaint'] < 138) ]

In [20]:
len(quantcutdata)

Out[20]:
6278707
In [21]:
plotdata = list(quantcutdata['timedeltaint'])

plt.figure(figsize=(12,10))
plt.hist(plotdata);


That looks a little better, but there might be other ways to filter out bad data.
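One alternative worth trying (not something this post actually used) is a Tukey-style interquartile-range fence instead of fixed 2.5%/97.5% cutoffs. A minimal sketch on made-up resolution times; the `times` values here are hypothetical, not drawn from the 311 data:

```python
import pandas as pd

# Hypothetical resolution times (in days), including two obvious outliers.
times = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 2000, -42443])

# Tukey's rule: flag anything beyond 1.5 * IQR from the quartiles.
q1, q3 = times.quantile(.25), times.quantile(.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

filtered = times[(times >= lower) & (times <= upper)]
print(filtered.tolist())  # the two extreme values are dropped
```

Unlike a fixed percentile cut, the fences here adapt to the spread of the middle 50% of the data, which may or may not be what you want for heavily skewed resolution times.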

In [22]:
subdata.groupby(by='borough').median()

Out[22]:
timedeltaint
borough
BRONX 5
BROOKLYN 5
MANHATTAN 6
QUEENS 5
STATEN ISLAND 4
Unspecified 5
In [23]:
subdata.groupby(by='borough').mean()

Out[23]:
timedeltaint
borough
BRONX 18.789025
BROOKLYN 20.725656
MANHATTAN 20.841213
QUEENS 23.746771
STATEN ISLAND 26.283222
Unspecified 11.652647

So, I definitely wouldn't trust these data right now - I'm still working on finding a data dictionary for the 311 data, and I need to make sure the columns mean what I think they mean. The point of this post is just to show what the process would look like once I get my hands on reliable data.

# Hack for Heat #4: Complaints by borough and date

In [2]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40


### Hack for Heat #4: Number of complaints over time pt.2

In this post, we're going to look at the number of complaints each borough received for the last five or so years. First, let's look at the total number of complaints received:

In [3]:
connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()


### Borough complaints by date

In [4]:
cursor.execute('''SELECT createddate, borough FROM service;''')
borodata = cursor.fetchall()

In [5]:
borodata = pd.DataFrame(borodata)

In [6]:
borodata.columns = ['Date', 'Boro']

In [7]:
borobydate = borodata.groupby(by='Boro').count()

In [8]:
borobydate

Out[8]:
Date
Boro
BRONX 1882215
BROOKLYN 3178206
MANHATTAN 2166983
QUEENS 2444874
STATEN ISLAND 541935
Unspecified 1157085

For fun, let's look at the number of complaints per capita in each of the 5 boroughs. The population values below are from Wikipedia.

In [9]:
boropop = {
'MANHATTAN': 1636268,
'BRONX': 1438159,
'BROOKLYN': 2621793,
'QUEENS': 2321580,
'STATEN ISLAND': 473279,
}

In [10]:
borobydate['Pop'] = [boropop.get(x) for x in borobydate.index]

In [11]:
borobydate['CompPerCap'] = borobydate['Date']/borobydate['Pop']

In [12]:
borobydate

Out[12]:
Date Pop CompPerCap
Boro
BRONX 1882215 1438159 1.308767
BROOKLYN 3178206 2621793 1.212226
MANHATTAN 2166983 1636268 1.324345
QUEENS 2444874 2321580 1.053108
STATEN ISLAND 541935 473279 1.145065
Unspecified 1157085 NaN NaN

### Complaints by borough over months

The next thing we're going to do is make a stacked plot of complaints by borough, over months. To do this, we need to extract the year and month from the date column:

In [13]:
borodata['Year'] = [x.year for x in borodata['Date']]
borodata['Month'] = [x.month for x in borodata['Date']]

In [14]:
borodata = borodata.loc[borodata['Boro'] != 'Unspecified']


Next, we need to generate an array of Ys. We want the rows of this dataframe to be the 5 boroughs, and the columns to be the count of complaints for the year and month.

In [15]:
boroplotdata = borodata.groupby(by=['Boro', 'Year','Month']).count()
boroplotdata

Out[15]:
Date
Boro Year Month
BRONX 2010 1 9395
2 9475
3 13616
4 11786
5 12169
6 15681
7 16861
8 12920
9 11633
10 10590
11 9196
12 8815
2011 1 9599
2 11229
3 26772
4 22898
5 22722
6 23677
7 27348
8 25241
9 25138
10 30071
11 28319
12 30927
2012 1 33097
2 25259
3 24480
4 20330
5 21851
6 24170
... ... ... ...
STATEN ISLAND 2013 12 4984
2014 1 8753
2 8978
3 8211
4 7937
5 8278
6 8173
7 8111
8 6934
9 6597
10 7067
11 5558
12 5889
2015 1 6398
2 8228
3 10537
4 8489
5 8240
6 8803
7 8507
8 7684
9 7920
10 7592
11 6756
12 6264
2016 1 7186
2 7505
3 8531
4 8750
5 6688

385 rows × 1 columns

We basically need to get the above table into a graph.

In [16]:
boros = borobydate.index
borodict = {x:[] for x in boros}
borodict.pop('Unspecified')

for boro in borodict:
    borodict[boro] = list(boroplotdata.xs(boro).Date)

In [17]:
plotdata = np.zeros(len(borodict['BROOKLYN']))
for boro in borodict:
    plotdata = np.row_stack((plotdata, borodict[boro]))

In [18]:
plotdata = np.delete(plotdata, (0), axis=0)

In [19]:
plotdata

Out[19]:
array([[  5804.,   6791.,  11009.,   6578.,   6590.,   6370.,   5909.,
5957.,   5926.,   5597.,   4823.,   4715.,   6008.,   7360.,
8109.,   6836.,   7091.,   7257.,   6380.,   9858.,   8268.,
8356.,   6778.,   5799.,   6169.,   5042.,   6018.,   5607.,
6746.,   6839.,   7037.,   7024.,   6108.,   7036.,   7509.,
5445.,   7207.,   5851.,   6352.,   6915.,   6843.,   7392.,
7552.,   6442.,   5936.,   6152.,   4996.,   4984.,   8753.,
8978.,   8211.,   7937.,   8278.,   8173.,   8111.,   6934.,
6597.,   7067.,   5558.,   5889.,   6398.,   8228.,  10537.,
8489.,   8240.,   8803.,   8507.,   7684.,   7920.,   7592.,
6756.,   6264.,   7186.,   7505.,   8531.,   8750.,   6688.],
[ 21753.,  22463.,  30847.,  25357.,  26545.,  27245.,  29419.,
26279.,  26287.,  24483.,  22788.,  22361.,  22895.,  24331.,
43807.,  41075.,  39214.,  41555.,  42120.,  46155.,  44927.,
47186.,  44536.,  44451.,  46493.,  36073.,  37645.,  35357.,
38451.,  38776.,  41169.,  39849.,  36523.,  41004.,  47840.,
42908.,  50532.,  40369.,  40716.,  38720.,  40310.,  41070.,
41466.,  37545.,  34560.,  40456.,  42451.,  42784.,  60679.,
49657.,  45848.,  40934.,  46025.,  46460.,  46283.,  41112.,
41180.,  45079.,  47862.,  46171.,  52516.,  55555.,  55122.,
47002.,  47346.,  51835.,  51154.,  47652.,  47385.,  52645.,
48386.,  45544.,  56649.,  52388.,  50952.,  52751.,  44888.],
[  9395.,   9475.,  13616.,  11786.,  12169.,  15681.,  16861.,
12920.,  11633.,  10590.,   9196.,   8815.,   9599.,  11229.,
26772.,  22898.,  22722.,  23677.,  27348.,  25241.,  25138.,
30071.,  28319.,  30927.,  33097.,  25259.,  24480.,  20330.,
21851.,  24170.,  25939.,  22423.,  19910.,  23060.,  29396.,
26587.,  33031.,  28243.,  24956.,  22083.,  20914.,  22673.,
26734.,  20546.,  19060.,  23673.,  29511.,  30904.,  43650.,
33274.,  29464.,  24904.,  24040.,  24024.,  26418.,  22771.,
21910.,  25771.,  33015.,  31056.,  36197.,  37733.,  32791.,
26419.,  24552.,  27103.,  29387.,  26814.,  25613.,  30593.,
29559.,  28256.,  39647.,  36866.,  30192.,  29880.,  25408.],
[ 15659.,  14831.,  18597.,  19642.,  20490.,  20316.,  21347.,
19256.,  19772.,  19534.,  18897.,  14786.,  14499.,  17235.,
27956.,  26018.,  26449.,  25271.,  25531.,  25630.,  27482.,
31265.,  29656.,  28735.,  29867.,  24649.,  27054.,  24528.,
25844.,  26173.,  25093.,  25044.,  23468.,  26678.,  28202.,
27837.,  32110.,  27178.,  28061.,  26816.,  27207.,  25809.,
26994.,  25026.,  24760.,  29790.,  30731.,  28005.,  37394.,
29569.,  30384.,  29098.,  30921.,  30898.,  31221.,  28214.,
28754.,  32569.,  33913.,  30960.,  34194.,  34262.,  33343.,
33626.,  33555.,  33952.,  33896.,  34326.,  34295.,  40630.,
37755.,  33389.,  40881.,  37560.,  40559.,  43912.,  37175.],
[ 22951.,  22336.,  32976.,  27739.,  28528.,  28287.,  29154.,
27223.,  34110.,  26295.,  22845.,  21670.,  22964.,  24969.,
32932.,  29108.,  29876.,  31333.,  31231.,  38856.,  33368.,
31570.,  29785.,  28166.,  29217.,  23758.,  26756.,  25594.,
28361.,  30195.,  31792.,  31756.,  28628.,  34079.,  39404.,
29450.,  33078.,  28121.,  27475.,  27384.,  29037.,  29633.,
31044.,  27880.,  26220.,  28053.,  27561.,  28286.,  38144.,
36074.,  35977.,  31202.,  34711.,  35843.,  36394.,  31615.,
31129.,  32470.,  33042.,  31788.,  34332.,  38725.,  44965.,
37836.,  36885.,  40637.,  39691.,  38212.,  36171.,  38864.,
34327.,  34463.,  38982.,  38579.,  38548.,  40171.,  34063.]])

Awesome! Now we have 5 rows with 77 columns each denoting the complaints for each of the boros for each of the months from 2010 to 2016.

In [20]:
from matplotlib import patches as mpatches

x = np.arange(len(plotdata[0]))

#crude xlabels
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years = ['2010', '2011', '2012', '2013', '2014', '2015', '2016']
xlabels = []
for year in years:
    for month in months:
        xlabels.append("{0} {1}".format(month, year))

plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),
              (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,plotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,76)
plt.legend(legendcolors,borodict.keys(), bbox_to_anchor=(0.2, 1));
plt.title('311 Complaints by Borough', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

C:\Users\Bryan\Anaconda2\lib\site-packages\matplotlib\collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
if self._edgecolors == str('face'):


## Sanity checks

So, some parts of this graph bear checking. First, did Staten Island really not increase in complaints over the years? The data below (complaints by borough by year) suggest that that was the case:

In [21]:
borodata.groupby(by = ['Boro', 'Year']).count()

Out[21]:
Date Month
Boro Year
BRONX 2010 142137 142137
2011 283941 283941
2012 296502 296502
2013 302328 302328
2014 340297 340297
2015 355017 355017
2016 161993 161993
BROOKLYN 2010 305827 305827
2011 482252 482252
2012 482088 482088
2013 490979 490979
2014 557290 557290
2015 602142 602142
2016 257628 257628
MANHATTAN 2010 223127 223127
2011 305727 305727
2012 314437 314437
2013 332487 332487
2014 373895 373895
2015 417223 417223
2016 200087 200087
QUEENS 2010 324114 324114
2011 364158 364158
2012 358990 358990
2013 343772 343772
2014 408389 408389
2015 455108 455108
2016 190343 190343
STATEN ISLAND 2010 76069 76069
2011 88100 88100
2012 76580 76580
2013 76622 76622
2014 90486 90486
2015 95418 95418
2016 38660 38660

The other thing that would be worth investigating is what happened between February and March 2011 - complaints doubled during this period. Perhaps some aspect of the pipeline changed at that point, but that's a story for a future post.
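A quick way to flag jumps like that programmatically is a month-over-month percentage change. A sketch using the Bronx counts for Feb-Apr 2011 from the table above:

```python
import pandas as pd

# Bronx monthly complaint counts, Feb-Apr 2011 (from the groupby table above).
counts = pd.Series([11229, 26772, 22898], index=['2011-02', '2011-03', '2011-04'])

# pct_change() gives the relative change from the previous month.
pct = counts.pct_change()
suspicious = pct[pct > 0.5]  # months where complaints jumped by more than 50%
print(suspicious)
```

March 2011 comes out at roughly a 138% increase over February, which is the kind of discontinuity that usually points at a pipeline change rather than a real shift in complaints.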

# Hack for Heat #3: Number of complaints over time

In [3]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40


### Hack for Heat #3: Number of complaints over time

This time, we're going to look at raw 311 complaint data. The data that I was working with previously was summarized data.

This dataset is much bigger, which is nice because it'll give me a chance to maintain my SQL-querying-from-memory-skills.

First, we're going to have to load all of this data into a postgres database. I wrote this tablebase.

### SQL-ing this

The python library psycopg2 lets us work with postgres databases in python. We first create a connection object, that encapsulates the connection to the database, then create a cursor class that lets us make queries from that database.

In [22]:
connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()


For example, we might want to extract the column names from our table:

In [23]:
cursor.execute('''SELECT * FROM threeoneone.INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'service'; ''')
columns = cursor.fetchall()

In [24]:
columns = [x[3] for x in columns]

In [25]:
columns[0:5]

Out[25]:
['uniquekey', 'createddate', 'closeddate', 'agency', 'agencyname']

#### Complaints over time

Let's start with something simple. First, let's extract a list of all complaints, and then plot the number of complaints by month.

In [27]:
cursor.execute('''SELECT createddate FROM service;''')
complaintdates = cursor.fetchall()

In [32]:
complaintdates = pd.DataFrame(complaintdates)

In [33]:
complaintdates.head()

Out[33]:
0
0 (2013-04-26,)
1 (2013-04-26,)
2 (2013-04-26,)
3 (2013-04-26,)
4 (2013-04-26,)

Renaming our column:

In [42]:
complaintdates.columns = ['Date']


Next, we have to unpack these tuples:

In [51]:
complaintdates['Date'] = [x[0] for x in complaintdates['Date']]


Normally, if these were strings, we'd use the extract_dates function we wrote in a previous post. However, because I typed these as datetime objects, we can just extract the .year, .month, and .day attributes:

In [60]:
type(complaintdates['Date'][0])

Out[60]:
datetime.date
In [63]:
complaintdates['Day'] = [x.day for x in complaintdates['Date']]
complaintdates['Month'] = [x.month for x in complaintdates['Date']]
complaintdates['Year'] = [x.year for x in complaintdates['Date']]


This is how many total complaints we have:

In [67]:
len(complaintdates)

Out[67]:
11371298

We can group them by month:

In [69]:
bymonth = complaintdates.groupby(by='Month').count()
bymonth

Out[69]:
Date Year Day
Month
1 1182682 1182682 1182682
2 1065065 1065065 1065065
3 1099983 1099983 1099983
4 985815 985815 985815
5 972913 972913 972913
6 849832 849832 849832
7 875210 875210 875210
8 829573 829573 829573
9 805231 805231 805231
10 895672 895672 895672
11 915066 915066 915066
12 894256 894256 894256

By year:

In [71]:
byyear = complaintdates.groupby(by='Year').count()
byyear

Out[71]:
Date Month Day
Year
2010 1754355 1754355 1754355
2011 1688091 1688091 1688091
2012 1568421 1568421 1568421
2013 1614207 1614207 1614207
2014 1839574 1839574 1839574
2015 2021963 2021963 2021963
2016 884687 884687 884687
In [88]:
byday = complaintdates.groupby(by='Day').count()
bydate = complaintdates.groupby(by='Date').count()


#### Some matplotlib

In [70]:
plt.figure(figsize = (12,10))
x = range(0,12)
y = bymonth['Date']
plt.plot(x,y)

Out[70]:
[<matplotlib.lines.Line2D at 0xa7a0ed68>]
In [78]:
plt.figure(figsize = (12,10))
x = range(0,7)
y = byyear['Date']

plt.plot(x,y)

Out[78]:
[<matplotlib.lines.Line2D at 0xa1925748>]
In [86]:
plt.figure(figsize = (12,10))
x = range(0,len(byday))
y = byday['Date']

plt.plot(x,y)

Out[86]:
[<matplotlib.lines.Line2D at 0x8c9c83c8>]

The sharp decline we see at the end is obviously because not all months have the same number of days.
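That dip could be corrected for: days 29-31 simply occur in fewer months. A sketch of how many months contain each day number, which could serve as a per-day normalizer (the 2010-2015 range here is an assumption for illustration):

```python
import calendar

# Count how many months in 2010-2015 actually contain each day-of-month.
months_containing = {day: 0 for day in range(1, 32)}
for year in range(2010, 2016):
    for month in range(1, 13):
        ndays = calendar.monthrange(year, month)[1]  # days in this month
        for day in range(1, ndays + 1):
            months_containing[day] += 1

# Day 31 only occurs in 7 months per year, so its raw complaint total
# should be scaled up accordingly before comparing across days.
print(months_containing[31])
```

Dividing each day-of-month's complaint total by its count here would put days 1-28 and 29-31 on the same footing.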

In [94]:
plt.figure(figsize=(12,10))
x = range(0,len(bydate))
y = bydate['Year'] #This is arbitrary - year, month, and day are all series that store the counts

plt.plot(x,y)

Out[94]:
[<matplotlib.lines.Line2D at 0xa0ed9860>]

That's all for now. In the next post, I'm going to break this down by borough, as well as polish this graph.

# Misc #1: Experimenting with plots

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt


## Misc #1: Experimenting with plots

I'm in the process of constructing figures for my dissertation, and I thought I'd share the different options I'm considering. I only need to plot simple line graphs, so my main considerations are aesthetics and ease of use (as I'll likely be making changes to them). This is the list I'm going to test:

1. matplotlib
2. MATLAB
3. R + ggplot
4. Excel
5. gnuplot
6. plot.ly
7. Chart.js
8. Tableau
9. Bokeh
10. MS paint (just kidding)

Without further ado, let's go!

### matplotlib

In [2]:
intercept = -.2122
b1 = -1.1776
b2 = -.1192
b3 = .4943

#x1 is a categorical predictor, x2 is a continuous predictor, x3 is their interaction
def y(x1, x2):
    return intercept + b1*x1 + b2*x2 + b3*x1*x2

In [3]:
x = range(1,8)
plt.plot(x, [y(0,z) for z in x], "blue", x, [y(1,z) for z in x], "red")

#pl.fill_between(x, y-error, y+error)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')


### MATLAB

intercept = -.2122
b1 = -1.1776
b2 = -.1192
b3 = .4943
x = 1:7
% moderator = 0
y1 = intercept + b2*x
% moderator = 1
y2 = intercept + b1 + b2*x + b3*x

figure
plot(x,y1,x,y2)

### R

intercept = -.2122
b1 = -1.1776
b2 = -.1192
b3 = .4943

y <- function(x1, x2){
    return(intercept + b1*x1 + b2*x2 + b3*x1*x2)
}

x = c(1:7)
y1 = sapply(x,y,x1=0)
y2 = sapply(x,y,x1=1)
xrange <- range(1:7)
yrange <- range(-1.5:1.5)
plot(xrange,yrange, type='n')
lines(x,y1)
lines(x,y2)

### ggplot2

d1 <- data.frame(x,y=y1,g='0')
d2 <- data.frame(x,y=y2,g='1')
data <- rbind(d1, d2)

cbbPalette <- c("blue", "red")
ggplot(data=data, aes(x=x, y=y, group=g, color=g)) +
    geom_line() +
    scale_color_manual(values=cbbPalette) +
    scale_x_continuous(limits = c(1,7)) +
    scale_y_continuous(limits = c(-1.5,1.5))

### gnuplot

set title ""
intercept = -.2122
b1 = -1.1776
b2 = -.1192
b3 = .4943
y1(x) = intercept + b2*x
y2(x) = intercept + b1 + b2*x + b3*x

plot [1:7] y1(x) lt rgb "blue", y2(x) lt rgb "red"

### plot.ly

plot.ly is web-based; I took screencaps of the interface:

The plot looks like this:

### chart.js

<html>
<head>
    <script src="Chart.js"></script>
</head>
<body>
    <canvas id="myChart" width="640" height="480"></canvas>
    <script>
    var ctx = document.getElementById("myChart");

    var data = {
        labels: Array.apply(null, Array(8)).map(function (_, i) {return i;}),
        datasets: [
            {
                label: "y1",
                fill: false,
                lineTension: 0.1,
                backgroundColor: "blue",
                borderColor: "blue",
                borderCapStyle: 'butt',
                borderJoinStyle: 'miter',
                pointBorderColor: "blue",
                pointBackgroundColor: "#fff",
                pointBorderWidth: 1,
                pointHoverBackgroundColor: "blue",
                pointHoverBorderColor: "blue",
                pointHoverBorderWidth: 2,
                data: [-0.3314,-0.4506,-0.5698,-0.689,-0.8082,-0.9274,-1.0466]
            },
            {
                label: "y2",
                fill: false,
                lineTension: 0.1,
                backgroundColor: "red",
                borderColor: "red",
                borderCapStyle: 'butt',
                borderJoinStyle: 'miter',
                pointBorderColor: "red",
                pointBackgroundColor: "#fff",
                pointBorderWidth: 1,
                pointHoverBackgroundColor: "red",
                pointHoverBorderColor: "red",
                pointHoverBorderWidth: 2,
                data: [-1.0147,-0.6396,-0.2645,0.1106,0.4857,0.8608,1.2359]
            }
        ]
    };

    var myChart = new Chart(ctx, {
        type: 'line',
        data: data,
        options: {
            responsive: false,
            scales: {
                yAxes: [{
                    ticks: {
                        beginAtZero: false
                    }
                }]
            }
        }
    });
    </script>
</body>
</html>
In [5]:
from IPython.display import IFrame
IFrame('https://bryansim.github.io/chartjsplot.html', width=700, height=500)

Out[5]:

### Bokeh

In [5]:
from bokeh.plotting import figure, output_file, show

intercept = -.2122
b1 = -1.1776
b2 = -.1192
b3 = .4943

#x1 is a categorical predictor, x2 is a continuous predictor, x3 is their interaction
def y(x1, x2):
    return intercept + b1*x1 + b2*x2 + b3*x1*x2

x = range(1,8)
y1 = [y(0,z) for z in x]
y2 = [y(1,z) for z in x]

# output to static HTML file
output_file("bokehplot.html")

# create a new plot
p = figure(
    tools="pan,box_zoom,reset,save",
    y_range=[-1.5, 1.5],
)

# add some renderers
p.line(x, y1, legend="mod=0", line_color="blue")
p.line(x, y2, legend="mod=1", line_color="red")

# show the results
show(p)

In [4]:
from IPython.display import IFrame
IFrame('https://bryansim.github.io/bokehplot.html', width=700, height=675)

Out[4]:

## Summary

I think, from here on out, I'll be using matplotlib for simple plots, Bokeh if I need to generate something interactive, and chart.js if I ever need to implement a dynamic chart.

With regard to the alternatives:

1. MATLAB is proprietary.
2. The built-in packages in R generate ugly plots.
3. ggplot is a little unwieldy, and has defaults that I don't like (e.g., the grey background)
4. Excel is good to visualize something really quick, but most people who have worked with linear models for a while (including myself) can do this in our heads or with a pen and paper.
5. gnuplot looks really interesting, and may be something I spend more time on in the future, but it's foreign to me at the moment.
6. plot.ly and Tableau (which I wanted to test, but their sign-up system never sent me the confirmation email) are both web-based, and for me, needlessly obfuscate what happens in the backend. I'd be comfortable writing my own code to generate the plots I want.

# Hack for Heat #2: Problem types over time

In [1]:
%matplotlib inline

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

pd.options.display.max_columns = 40


### Hack for Heat #2: Problem types over time

In this post, I'm going to explore how we might track the composition of problems that HPD (the NYC Department of Housing Preservation and Development) might receive over time. The data that I currently have are limited to 2.5 years, so this is just a proof of concept.

In [2]:
hpdcompprob = pd.read_csv("Complaint_Problems.csv")

In [3]:
type(hpdcompprob.StatusDate[0])

Out[3]:
str

The first thing we're going to do is convert string dates into simple years, months, and days.

In [6]:
def extract_dates(series, nameofdatecolumn):
    series['Year'] = series[nameofdatecolumn].str.extract(r'../../(.*)', expand=True)
    series['Month'] = series[nameofdatecolumn].str.extract(r'(.*)/../....', expand=True)
    series['Day'] = series[nameofdatecolumn].str.extract(r'../(.*)/....', expand=True)

extract_dates(hpdcompprob, 'StatusDate')
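As an aside, a simpler route (not what this post uses) is to let pandas parse the MM/DD/YYYY strings and read the `.dt` accessor; note this yields integers, whereas the regex approach above yields strings:

```python
import pandas as pd

# Sketch with hypothetical StatusDate values in the same MM/DD/YYYY format.
df = pd.DataFrame({'StatusDate': ['04/23/2009', '12/01/2015']})

# Parse once, then pull year/month/day off the datetime accessor.
parsed = pd.to_datetime(df['StatusDate'], format='%m/%d/%Y')
df['Year'] = parsed.dt.year
df['Month'] = parsed.dt.month
df['Day'] = parsed.dt.day
print(df)
```

If you go this route, later comparisons like `hpdcompprob.Year == '2014'` would need integer years instead of strings.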

In [7]:
def count_problem_types(pdseries):
    problemtypes = pdseries.unique()
    problemcounts = {k:0 for k in problemtypes}
    problemcounts['TOTAL'] = 0

    for problem in pdseries:
        problemcounts[problem] += 1

    problemcounts['TOTAL'] = len(pdseries)
    print year, problemcounts #this is just to show us what those counts are

    return problemcounts

In [8]:
#years = hpdcompprob.Year.unique()
years = ['2014','2015','2016'] #We know that only these data are relevant (the other, older years, are remnants of old cases)
yearsdict = {k:{} for k in years}
for year in years:
    yearsdict[year] = count_problem_types(hpdcompprob.loc[hpdcompprob.Year == year].MajorCategory)

2014 {'TOTAL': 259437, 'ELECTRIC': 12684, 'HEAT/HOT WATER': 93548, 'WATER LEAK': 14315, 'APPLIANCE': 5231, 'DOOR/WINDOW': 16352, 'HEATING': 16, 'ELEVATOR': 363, 'GENERAL': 9680, 'FLOORING/STAIRS': 11768, 'NONCONST': 185, 'UNSANITARY CONDITION': 34779, 'SAFETY': 5521, 'PLUMBING': 23179, 'CONSTRUCTION': 3, 'OUTSIDE BUILDING': 758, 'PAINT/PLASTER': 31055}
2015 {'TOTAL': 705644, 'PAINT/PLASTER': 84062, 'WATER LEAK': 42512, 'DOOR/WINDOW': 49154, 'APPLIANCE': 14719, 'NONCONST': 69, 'ELEVATOR': 1213, 'GENERAL': 31482, 'HEAT/HOT WATER': 228157, 'FLOORING/STAIRS': 34741, 'UNSANITARY CONDITION': 97221, 'SAFETY': 17105, 'PLUMBING': 67472, 'CONSTRUCTION': 1, 'OUTSIDE BUILDING': 2183, 'ELECTRIC': 35553}
2016 {'TOTAL': 211830, 'ELECTRIC': 9239, 'WATER LEAK': 12063, 'DOOR/WINDOW': 15465, 'APPLIANCE': 4644, 'HEATING': 39, 'ELEVATOR': 377, 'GENERAL': 9738, 'HEAT/HOT WATER': 73803, 'NONCONST': 33, 'UNSANITARY CONDITION': 27727, 'SAFETY': 5243, 'PLUMBING': 20218, 'FLOORING/STAIRS': 9989, 'OUTSIDE BUILDING': 543, 'PAINT/PLASTER': 22709}


### Graphing our results

We're going to use matplotlib to make a stacked area graph. I referenced this post.

First, let's arrange our dictionary keys in alphabetical order, so that we can extract the values in a meaningful way.

In [9]:
labels = sorted(yearsdict['2014'].keys())
labels.remove('CONSTRUCTION')
labels.remove('HEATING') #removed these because 2015 doesn't have an entry, and both the other years have negligible entries
labels.remove('TOTAL') #not interested in total in this plot

In [10]:
years = ['2014','2015','2016']
plotdict = {k:[] for k in years}
for year in years:
    for label in labels:
        plotdict[year].append(yearsdict[year][label])

In [11]:
len(labels)

Out[11]:
14
In [12]:
x = np.arange(len(years))
y = np.column_stack((plotdict['2014'],plotdict['2015'],plotdict['2016']))
plt.figure(figsize = (15,10));
plt.stackplot(x,y);
plt.legend(labels)

Out[12]:
<matplotlib.legend.Legend at 0xeb24ac8>

This might legit be the ugliest plot I have ever seen. Let's try and fix the colors first.

From this blog post, I got a list of 14 contrasting colors.

In [13]:
from matplotlib import patches as mpatches

plotcolors = [(0,0,0),(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),
              (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(x[0]/float(255),x[1]/float(255),x[2]/float(255)) for x in plotcolors]
legendcolors = [mpatches.Patch(color = x) for x in plotcolors]

x = np.arange(len(years))
y = np.column_stack((plotdict['2014'],plotdict['2015'],plotdict['2016']))
plt.figure(figsize = (15,10));
plt.stackplot(x,y,colors = plotcolors);
plt.legend(legendcolors, labels);

#some labels for the x axis
xlabels = ['2014', '2015', '2016']
plt.xticks(x,xlabels, size = 18);
plt.yticks(size = 18);
plt.title('HPD Complaints by type over time', size = 24);


Okay, so this looks a little better, and with some tweaking, we can make this work.
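One tweak that might help: plotting within-year proportions instead of raw counts, so the composition is comparable across years even though 2015 has far more complaints overall. A sketch using two of the categories from the counts printed above:

```python
import numpy as np

# Rows: HEAT/HOT WATER and UNSANITARY CONDITION; columns: 2014, 2015, 2016.
# Figures taken from the per-year dictionaries above.
counts = np.array([
    [93548, 228157, 73803],
    [34779,  97221, 27727],
])

# Rescale each year (column) so the categories sum to 1 within that year.
shares = counts / counts.sum(axis=0, keepdims=True).astype(float)
print(shares.round(3))
```

Feeding `shares` rather than `counts` into `plt.stackplot` would show whether heat complaints are growing as a fraction of HPD's workload, not just in absolute terms.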

# Hack for Heat #1: Initial thoughts on HPD open data

## Hacking for heat

In this series, I'm going to be posting about the process that goes on behind some of the blog posts we end up writing. In this first entry, I'm going to be exploring a number of datasets.

These are the ones that I'm going to be looking at:

(for HPD datasets, some documentation can be found here)

### HPD litigation database

First, we're going to look at the smallest dataset, one that contains cases against landlords. From the documentation, this file contains "All cases commenced by HPD or by tenants (naming HPD as a party) in [housing court] since August 2006" either seeking orders for landlords to comply with regulations, or awarding HPD civil penalties (i.e., collecting on fines).

In [2]:
import pandas as pd

In [3]:
litigation = pd.read_csv("Housing_Litigations.csv")

In [5]:
litigation.head()

Out[5]:
LitigationID BuildingID BoroID Boro HouseNumber StreetName Zip Block Lot CaseType CaseOpenDate CaseStatus CaseJudgement
0 98385 806633 2 BRONX 866 EAST 221 STREET 10467.0 4679 73 Tenant Action 04/23/2009 CLOSED NO
1 98387 55134 2 BRONX 3940 CARPENTER AVENUE 10466.0 4825 51 Tenant Action 04/23/2009 CLOSED NO
2 98389 119044 2 BRONX 1356 WALTON AVENUE 10452.0 2841 41 Tenant Action 04/23/2009 CLOSED NO
3 98391 889796 2 BRONX 871 EAST 179 STREET 10460.0 3123 77 Tenant Action 04/24/2009 CLOSED NO
4 98393 56949 2 BRONX 1680 CLAY AVENUE 10457.0 2889 1 Tenant Action 04/24/2009 CLOSED NO

Let's take a look at unique values for some of the columns:

In [6]:
litigation['Boro'].unique()

Out[6]:
array(['BRONX', 'QUEENS', 'MANHATTAN', 'BROOKLYN', 'STATEN ISLAND'], dtype=object)
In [14]:
litigation.groupby(by = ['Boro','CaseJudgement']).count()

Out[14]:
LitigationID BuildingID BoroID HouseNumber StreetName Zip Block Lot CaseType CaseOpenDate CaseStatus
Boro CaseJudgement
BRONX NO 19734 19734 19734 19734 19734 19734 19734 19734 19734 19701 19734
YES 481 481 481 481 481 481 481 481 481 478 481
BROOKLYN NO 19833 19833 19833 19833 19833 19831 19833 19833 19833 19807 19833
YES 762 762 762 762 762 760 762 762 762 762 762
MANHATTAN NO 11849 11849 11849 11849 11849 11849 11849 11849 11849 11841 11849
YES 129 129 129 129 129 129 129 129 129 129 129
QUEENS NO 9068 9068 9068 9068 9068 9061 9068 9068 9068 9063 9068
YES 313 313 313 313 313 313 313 313 313 308 313
STATEN ISLAND NO 1064 1064 1064 1064 1064 1064 1064 1064 1064 1064 1064
YES 84 84 84 84 84 84 84 84 84 84 84

The above table tells us that Manhattan has the lowest proportion of cases that receive judgement (about 1 in 90), whereas Staten Island has the highest (about 1 in 14). It may be something worth looking into, but it's also important to note that many cases settle out of court, and landlords in Manhattan may be more willing (or able) to do so.
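Rather than eyeballing the counts, we can compute these proportions directly. Here's a minimal sketch using a tiny made-up DataFrame with the same columns (the real numbers come from the full CSV, of course):

```python
import pandas as pd

# Hypothetical mini-sample with the same columns as the litigation file
litigation = pd.DataFrame({
    'Boro': ['BRONX', 'BRONX', 'MANHATTAN', 'MANHATTAN', 'STATEN ISLAND'],
    'CaseJudgement': ['NO', 'YES', 'NO', 'NO', 'YES'],
})

# Proportion of cases per borough that received a judgement:
# a boolean Series averages to the fraction of True values
judged = (litigation['CaseJudgement'] == 'YES')
rates = judged.groupby(litigation['Boro']).mean()
print(rates)
```

Run against the full dataset, this one-liner gives the per-borough judgement rates in a single Series instead of a wide groupby-count table.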

In [7]:
litigation['CaseType'].unique()

Out[7]:
array(['Tenant Action', 'Heat and Hot Water', 'CONH',
'Comp Supplemental Cases', 'Comprehensive',
'Access Warrant - Non-Lead', 'Access Warrant - lead',
'Heat Supplemental Cases', '7A', 'Failure to Register Only',
'HLD - Other Case Type'], dtype=object)
In [16]:
litigation.groupby(by = ['CaseType', 'CaseJudgement']).count()

Out[16]:
LitigationID BuildingID BoroID Boro HouseNumber StreetName Zip Block Lot CaseOpenDate CaseStatus
CaseType CaseJudgement
7A NO 105 105 105 105 105 105 105 105 105 102 105
Access Warrant - Non-Lead NO 5665 5665 5665 5665 5665 5665 5665 5665 5665 5665 5665
YES 8 8 8 8 8 8 8 8 8 8 8
Access Warrant - lead NO 1786 1786 1786 1786 1786 1786 1786 1786 1786 1785 1786
CONH NO 797 797 797 797 797 797 797 797 797 797 797
Comp Supplemental Cases NO 750 750 750 750 750 750 750 750 750 702 750
YES 31 31 31 31 31 31 31 31 31 29 31
Comprehensive NO 2274 2274 2274 2274 2274 2274 2274 2274 2274 2274 2274
YES 121 121 121 121 121 121 121 121 121 121 121
Failure to Register Only NO 60 60 60 60 60 60 60 60 60 60 60
YES 3 3 3 3 3 3 3 3 3 3 3
False Certification Non-Lead NO 2355 2355 2355 2355 2355 2355 2355 2355 2355 2355 2355
YES 20 20 20 20 20 20 20 20 20 20 20
HLD - Other Case Type NO 3 3 3 3 3 3 3 3 3 2 3
Heat Supplemental Cases NO 124 124 124 124 124 124 124 124 124 108 124
YES 7 7 7 7 7 7 6 7 7 6 7
Heat and Hot Water NO 15760 15760 15760 15760 15760 15760 15758 15760 15760 15757 15760
YES 1299 1299 1299 1299 1299 1299 1298 1299 1299 1294 1299
Lead False Certification NO 239 239 239 239 239 239 239 239 239 239 239
YES 2 2 2 2 2 2 2 2 2 2 2
Tenant Action NO 31630 31630 31630 31630 31630 31630 31623 31630 31630 31630 31630
YES 278 278 278 278 278 278 278 278 278 278 278

The table above shows the same case judgement proportions, but conditioned on what type of case it was. Unhelpfully, the documentation does not specify what the difference between Access Warrant - Lead and Non-Lead is. It could be one of two possibilities: The first is whether the warrants have to do with lead-based paint, which is a common problem, but perhaps still too idiosyncratic to have its own warrant type. The second, perhaps more likely possibility is whether or not HPD was the lead party in the case.

We'll probably end up using these data by aggregating them and examining how complaints change over time, perhaps as a function of their type. There's also the possibility of looking up specific buildings' complaints and tying them to landlords. There's probably also an easy way to join this dataset with another by converting the address information into something standardized, like borough-block-lot (BBL; http://www1.nyc.gov/nyc-resources/service/1232/borough-block-lot-bbl-lookup).

### HPD complaints¶

Next, we're going to look at a dataset of HPD complaints.

In [32]:
hpdcomp = pd.read_csv('Housing_Maintenance_Code_Complaints.csv')

In [34]:
hpdcomp.head()

Out[34]:
ComplaintID BuildingID BoroughID Borough HouseNumber StreetName Zip Block Lot Apartment CommunityBoard ReceivedDate StatusID Status StatusDate
0 6960137 3418 1 MANHATTAN 1989 ADAM C POWELL BOULEVARD 10026.0 1904 4 12D 10 07/07/2014 2 CLOSE 07/29/2014
1 6960832 3512 1 MANHATTAN 2267 ADAM C POWELL BOULEVARD 10030.0 1918 4 3B 10 07/08/2014 2 CLOSE 07/12/2014
2 6946867 5318 1 MANHATTAN 778 11 AVENUE 10019.0 1083 1 4P 4 06/19/2014 2 CLOSE 07/13/2014
3 6966946 5608 1 MANHATTAN 1640 AMSTERDAM AVENUE 10031.0 2073 29 5A 9 07/16/2014 2 CLOSE 07/21/2014
4 6956574 17896 1 MANHATTAN 230 EAST 88 STREET 10128.0 1533 32 1E 8 07/01/2014 2 CLOSE 07/09/2014
In [39]:
len(hpdcomp)

Out[39]:
662672

This dataset is less useful on its own. It doesn't tell us what the type of complaint was, only the date it was received and whether or not the complaint is still open. However, it may be useful in conjunction with the earlier dataset. For example, we might be interested in how many of these complaints end up in court (or at least, have some sort of legal action taken).
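Conveniently, both the complaints file and the litigation file carry a BuildingID column, so a first pass at "which complaints ended up in court" could be a simple left join on that key. A toy sketch with made-up rows (the real join would use the full DataFrames loaded above):

```python
import pandas as pd

# Toy stand-ins for the two files; both share a BuildingID column
hpdcomp = pd.DataFrame({'ComplaintID': [1, 2, 3],
                        'BuildingID': [444, 448, 999]})
litigation = pd.DataFrame({'LitigationID': [10, 11],
                           'BuildingID': [444, 448]})

# Left join: complaints keep their rows; matched rows get a LitigationID
merged = hpdcomp.merge(litigation, on='BuildingID', how='left')
print(merged['LitigationID'].notnull().sum())  # complaints with a case
```

Note this matches at the building level, not the individual complaint level, so it overcounts: any complaint in a building that was ever litigated gets flagged. Tightening that up would mean comparing dates as well.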

### HPD violations¶

The following dataset tracks HPD violations.

In [35]:
hpdviol = pd.read_csv('Housing_Maintenance_Code_Violations.csv')

In [36]:
hpdviol.head()

Out[36]:
ViolationID BuildingID RegistrationID BoroID Boro HouseNumber LowHouseNumber HighHouseNumber StreetName StreetCode ... NewCertifyByDate NewCorrectByDate CertifiedDate OrderNumber NOVID NOVDescription NOVIssuedDate CurrentStatusID CurrentStatus CurrentStatusDate
0 2293208 444 130476 1 MANHATTAN 22 22 22 1 AVENUE 10010 ... NaN NaN NaN 772 338596.0 Â§ 27-2098 ADM CODE FILE WITH THIS DEPARTMENT ... 04/22/1997 19 VIOLATION CLOSED 03/10/2015
1 2293181 444 130476 1 MANHATTAN 22 22 22 1 AVENUE 10010 ... NaN NaN 09/10/1974 775 338584.0 D26-41.05 ADM CODE FILE WITH THIS DEPARTMENT R... 07/16/1974 19 VIOLATION CLOSED 03/10/2015
2 2293249 448 135326 1 MANHATTAN 2222 2222 2222 1 AVENUE 10010 ... NaN NaN NaN 772 338619.0 Â§ 27-2098 ADM CODE FILE WITH THIS DEPARTMENT ... 11/19/1996 19 VIOLATION CLOSED 03/10/2015
3 2293486 467 136913 1 MANHATTAN 2250 2250 2250 1 AVENUE 10010 ... NaN NaN NaN 772 338733.0 D26-41.03 ADM CODE FILE WITH THIS DEPARTMENT A... 05/10/1982 19 VIOLATION CLOSED 03/10/2015
4 2293490 467 136913 1 MANHATTAN 2250 2250 2250 1 AVENUE 10010 ... NaN NaN NaN 772 338737.0 Â§ 27-2098 ADM CODE FILE WITH THIS DEPARTMENT ... 11/19/1996 19 VIOLATION CLOSED 03/10/2015

5 rows × 30 columns

In [40]:
len(hpdviol)

Out[40]:
1437246

These datasets all have different lengths, but that's not surprising, given they come from different years. One productive initial step would be to convert the date strings into something numerical.
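For the date conversion, pandas can parse the MM/DD/YYYY strings these files use into proper datetimes, with missing values coming through as NaT. A minimal sketch:

```python
import pandas as pd

# Sample date strings in the files' MM/DD/YYYY format, plus a missing value
dates = pd.Series(['04/23/2009', '07/16/1974', None])
parsed = pd.to_datetime(dates, format='%m/%d/%Y')
print(parsed)
```

Once parsed, the columns support arithmetic (e.g. closed date minus open date) and grouping by year or month.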

### HPD complaint problems database¶

In [45]:
hpdcompprob = pd.read_csv('Complaint_Problems.csv')

In [48]:
hpdcompprob.head()

Out[48]:
ProblemID ComplaintID UnitTypeID UnitType SpaceTypeID SpaceType TypeID Type MajorCategoryID MajorCategory MinorCategoryID MinorCategory CodeID Code StatusID Status StatusDate StatusDescription
0 14548958 6967900 91 APARTMENT 541 BATHROOM 1 EMERGENCY 9 PLUMBING 63 BATHTUB/SHOWER 2538 BROKEN OR MISSING 2 CLOSE 07/29/2014 The Department of Housing Preservation and De...
1 14548959 6967900 91 APARTMENT 541 BATHROOM 3 NON EMERGENCY 9 PLUMBING 63 BATHTUB/SHOWER 2540 FAUCET BROKEN/MISSING/LEAKING 2 CLOSE 08/04/2014 The Department of Housing Preservation and De...
2 14548960 6967900 91 APARTMENT 543 ENTIRE APARTMENT 3 NON EMERGENCY 58 FLOORING/STAIRS 343 FLOOR 2691 TILE BROKEN OR MISSING 2 CLOSE 08/04/2014 The Department of Housing Preservation and De...
3 14548961 6967900 91 APARTMENT 541 BATHROOM 3 NON EMERGENCY 9 PLUMBING 63 BATHTUB/SHOWER 2541 CHIPPED OR RUSTED 2 CLOSE 08/04/2014 The Department of Housing Preservation and De...
4 14615271 6994958 20 APARTMENT 68 ENTIRE APARTMENT 1 EMERGENCY 59 HEAT/HOT WATER 349 ENTIRE BUILDING 2717 NO HOT WATER 2 CLOSE 08/22/2014 The Department of Housing Preservation and De...

Awesome! This dataset provides some more details about the complaints, and lets us join by ComplaintID.
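That join is a one-liner with pandas. Sketched with toy rows (each complaint can have several problem rows, so the merge produces one row per complaint-problem pair):

```python
import pandas as pd

# Toy versions of the two tables, sharing the ComplaintID key
hpdcomp = pd.DataFrame({'ComplaintID': [6967900, 6994958],
                        'Borough': ['MANHATTAN', 'MANHATTAN']})
hpdcompprob = pd.DataFrame({'ComplaintID': [6967900, 6967900, 6994958],
                            'MajorCategory': ['PLUMBING', 'FLOORING/STAIRS',
                                              'HEAT/HOT WATER']})

# Inner join on the shared key: one output row per complaint-problem pair
detailed = hpdcomp.merge(hpdcompprob, on='ComplaintID')
print(len(detailed))  # 3
```

With the full files this would let us ask, say, how many complaints per borough are heat-related.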

### Summary and next steps¶

In the immediate future, I'm going to be writing scripts to join and clean these datasets. This can be done either in Python or by writing some SQL; I haven't decided yet. Additionally, I'm going to be writing some code to do things like convert date strings, and perhaps scrape text.