%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40

Hack for Heat #5: How long do complaints take to resolve?¶

In this post, we're going to see if we can graph how long it takes for complaints to get resolved.

connection = psycopg2.connect('dbname= threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

cursor.execute('''SELECT createddate, closeddate, borough FROM service;''')
data = cursor.fetchall()
data = pd.DataFrame(data)

data.columns = ['createddate','closeddate','borough']

data = data.loc[data['createddate'].notnull()]
data = data.loc[data['closeddate'].notnull()]

data['timedelta'] = data['closeddate'] - data['createddate']

data['timedeltaint'] = [x.days for x in data['timedelta']]

data.head()

data.groupby(by='borough')['timedeltaint'].mean()

borough
BRONX            -3.658691
BROOKLYN         -6.835417
MANHATTAN       -20.325506
QUEENS           -4.121806
STATEN ISLAND    -2.108396
Unspecified       4.699638
Name: timedeltaint, dtype: float64

Oops! Looks like something's wrong. Let's try and find out:

data.sort_values('timedeltaint').head()

data.sort_values('timedeltaint', ascending=False).head()

Ah. Well, as a first step, let's remove any values that are before Jan 1st 2010 or after today:

import datetime

today = datetime.date(2016,5,29)
janone = datetime.date(2010,1,1)

Let's also remove any rows where the close date is before the created date:

subdata = data.loc[(data['closeddate'] > janone) & (data['closeddate'] < today)]
subdata = subdata.loc[data['closeddate'] > data['createddate']]

len(subdata)

7979757

subdata.sort_values('timedeltaint').head()

subdata.sort_values('timedeltaint',ascending = False).head()

This looks a little bit more realistic, but let's also visualize the distribution:

plotdata = list(subdata['timedeltaint'])

plt.figure(figsize=(12,10))
plt.hist(plotdata);

Okay, this still looks really wonky. Let's further subset the data, and see what happens when we remove the top and bottom 2.5%.

Pandas has a quantile function:

subdata.quantile([.025, .975])

quantcutdata = subdata.loc[(subdata['timedeltaint'] > 1) & (subdata['timedeltaint'] < 138) ]

len(quantcutdata)

6278707

plotdata = list(quantcutdata['timedeltaint'])

plt.figure(figsize=(12,10))
plt.hist(plotdata);

That looks a little better, but there might be other ways to filter out bad data.

subdata.groupby(by='borough').median()

subdata.groupby(by='borough').mean()

So, I definitely wouldn't trust these data right now - I'm still working on finding a data dictionary for the 311 data, and I need to make sure the columns mean what I think they mean. The point of this is just to show that the process would be once I get my hands on reliable data.

%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40

Hack for Heat #4: Number of complaints over time pt.2¶

In this post, we're going to look at the number of complaints each borough received for the last five or so years. First, let's look at the total number of complaints received:

connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

Borough complaints by date¶

cursor.execute('''SELECT createddate, borough FROM service;''')
borodata = cursor.fetchall()

borodata = pd.DataFrame(borodata)

borodata.columns = ['Date', 'Boro']

borobydate = borodata.groupby(by='Boro').count()

borobydate

For fun, let's look at the number of complaints per capita in each of the 5 boroughs. The population values below are from Wikipedia.

boropop = {
    'MANHATTAN': 1636268,
    'BRONX': 1438159,
    'BROOKLYN': 2621793,
    'QUEENS': 2321580,
    'STATEN ISLAND': 473279,
    }

borobydate['Pop'] = [boropop.get(x) for x in borobydate.index]

borobydate['CompPerCap'] = borobydate['Date']/borobydate['Pop']

borobydate

Complaints by borough over months¶

The next thing we're going to do is make a stacked plot of complaints by borough, over months. To do this, we need to extract the day and month from the date column:

borodata['Year'] = [x.year for x in borodata['Date']]
borodata['Month'] = [x.month for x in borodata['Date']]

borodata = borodata.loc[borodata['Boro'] != 'Unspecified']

Next, we need to generate an array of Ys. We want the rows of this dataframe to be the 5 boroughs, and the columns to be the count of complaints for the year and month.

boroplotdata = borodata.groupby(by=['Boro', 'Year','Month']).count()
boroplotdata

We basically need to get the above table into a graph.

boros = borobydate.index
borodict = {x:[] for x in boros}
borodict.pop('Unspecified')

for boro in borodict:
    borodict[boro] = list(boroplotdata.xs(boro).Date)

plotdata = np.zeros(len(borodict['BROOKLYN']))
for boro in borodict:
    plotdata = np.row_stack((plotdata, borodict[boro]))

plotdata = np.delete(plotdata, (0), axis=0)

plotdata

array([[  5804.,   6791.,  11009.,   6578.,   6590.,   6370.,   5909.,
          5957.,   5926.,   5597.,   4823.,   4715.,   6008.,   7360.,
          8109.,   6836.,   7091.,   7257.,   6380.,   9858.,   8268.,
          8356.,   6778.,   5799.,   6169.,   5042.,   6018.,   5607.,
          6746.,   6839.,   7037.,   7024.,   6108.,   7036.,   7509.,
          5445.,   7207.,   5851.,   6352.,   6915.,   6843.,   7392.,
          7552.,   6442.,   5936.,   6152.,   4996.,   4984.,   8753.,
          8978.,   8211.,   7937.,   8278.,   8173.,   8111.,   6934.,
          6597.,   7067.,   5558.,   5889.,   6398.,   8228.,  10537.,
          8489.,   8240.,   8803.,   8507.,   7684.,   7920.,   7592.,
          6756.,   6264.,   7186.,   7505.,   8531.,   8750.,   6688.],
       [ 21753.,  22463.,  30847.,  25357.,  26545.,  27245.,  29419.,
         26279.,  26287.,  24483.,  22788.,  22361.,  22895.,  24331.,
         43807.,  41075.,  39214.,  41555.,  42120.,  46155.,  44927.,
         47186.,  44536.,  44451.,  46493.,  36073.,  37645.,  35357.,
         38451.,  38776.,  41169.,  39849.,  36523.,  41004.,  47840.,
         42908.,  50532.,  40369.,  40716.,  38720.,  40310.,  41070.,
         41466.,  37545.,  34560.,  40456.,  42451.,  42784.,  60679.,
         49657.,  45848.,  40934.,  46025.,  46460.,  46283.,  41112.,
         41180.,  45079.,  47862.,  46171.,  52516.,  55555.,  55122.,
         47002.,  47346.,  51835.,  51154.,  47652.,  47385.,  52645.,
         48386.,  45544.,  56649.,  52388.,  50952.,  52751.,  44888.],
       [  9395.,   9475.,  13616.,  11786.,  12169.,  15681.,  16861.,
         12920.,  11633.,  10590.,   9196.,   8815.,   9599.,  11229.,
         26772.,  22898.,  22722.,  23677.,  27348.,  25241.,  25138.,
         30071.,  28319.,  30927.,  33097.,  25259.,  24480.,  20330.,
         21851.,  24170.,  25939.,  22423.,  19910.,  23060.,  29396.,
         26587.,  33031.,  28243.,  24956.,  22083.,  20914.,  22673.,
         26734.,  20546.,  19060.,  23673.,  29511.,  30904.,  43650.,
         33274.,  29464.,  24904.,  24040.,  24024.,  26418.,  22771.,
         21910.,  25771.,  33015.,  31056.,  36197.,  37733.,  32791.,
         26419.,  24552.,  27103.,  29387.,  26814.,  25613.,  30593.,
         29559.,  28256.,  39647.,  36866.,  30192.,  29880.,  25408.],
       [ 15659.,  14831.,  18597.,  19642.,  20490.,  20316.,  21347.,
         19256.,  19772.,  19534.,  18897.,  14786.,  14499.,  17235.,
         27956.,  26018.,  26449.,  25271.,  25531.,  25630.,  27482.,
         31265.,  29656.,  28735.,  29867.,  24649.,  27054.,  24528.,
         25844.,  26173.,  25093.,  25044.,  23468.,  26678.,  28202.,
         27837.,  32110.,  27178.,  28061.,  26816.,  27207.,  25809.,
         26994.,  25026.,  24760.,  29790.,  30731.,  28005.,  37394.,
         29569.,  30384.,  29098.,  30921.,  30898.,  31221.,  28214.,
         28754.,  32569.,  33913.,  30960.,  34194.,  34262.,  33343.,
         33626.,  33555.,  33952.,  33896.,  34326.,  34295.,  40630.,
         37755.,  33389.,  40881.,  37560.,  40559.,  43912.,  37175.],
       [ 22951.,  22336.,  32976.,  27739.,  28528.,  28287.,  29154.,
         27223.,  34110.,  26295.,  22845.,  21670.,  22964.,  24969.,
         32932.,  29108.,  29876.,  31333.,  31231.,  38856.,  33368.,
         31570.,  29785.,  28166.,  29217.,  23758.,  26756.,  25594.,
         28361.,  30195.,  31792.,  31756.,  28628.,  34079.,  39404.,
         29450.,  33078.,  28121.,  27475.,  27384.,  29037.,  29633.,
         31044.,  27880.,  26220.,  28053.,  27561.,  28286.,  38144.,
         36074.,  35977.,  31202.,  34711.,  35843.,  36394.,  31615.,
         31129.,  32470.,  33042.,  31788.,  34332.,  38725.,  44965.,
         37836.,  36885.,  40637.,  39691.,  38212.,  36171.,  38864.,
         34327.,  34463.,  38982.,  38579.,  38548.,  40171.,  34063.]])

Awesome! Now we have 5 rows with 77 columns each denoting the complaints for each of the boros for each of the months from 2010 to 2016.

from matplotlib import patches as mpatches

x = np.arange(len(plotdata[0]))

#crude xlabels
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years = ['2010', '2011', '2012', '2013', '2014', '2015', '2016']
xlabels = []
for year in years:
    for month in months:
        xlabels.append("{0} {1}".format(month,year))

plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,plotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,76)
plt.legend(legendcolors,borodict.keys(), bbox_to_anchor=(0.2, 1));
plt.title('311 Complaints by Borough', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

C:\Users\Bryan\Anaconda2\lib\site-packages\matplotlib\collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

Sanity checks¶

So, some parts of this graph bear checking. First, did Staten Island really not increase in complaints over the years? The data below (complaints by by borough by year) suggest that that was the case:

borodata.groupby(by = ['Boro', 'Year']).count()

The other thing that would be worth investigating is what happened between Februrary and March 2011 - complaints doubled during this time. Perhaps some aspect of the pipeline changed during this time, but that's a future story.

%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40

Hack for Heat #3: Number of complaints over time¶

This time, we're going to look at raw 311 complaint data. The data that I was working with previously was summarized data.

This dataset is much bigger, which is nice because it'll give me a chance to maintain my SQL-querying-from-memory-skills.

First, we're going to have to load all of this data into a postgres database. I wrote this tablebase.

SQL-ing this¶

The python library psycopg2 lets us work with postgres databases in python. We first create a connection object, that encapsulates the connection to the database, then create a cursor class that lets us make queries from that database.

connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

For example, we might want to extract the column names from our table:

cursor.execute('''SELECT * FROM threeoneone.INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'service'; ''')
columns = cursor.fetchall()

columns = [x[3] for x in columns]

columns[0:5]

['uniquekey', 'createddate', 'closeddate', 'agency', 'agencyname']

Complaints over time¶

Let's start with something simple. First, let's extract a list of all complaints, and the plot the number of complaints by month.

cursor.execute('''SELECT createddate FROM service;''')
complaintdates = cursor.fetchall()

complaintdates = pd.DataFrame(complaintdates)

complaintdates.head()

Renaming our column:

complaintdates.columns = ['Date']

Next we have to convert these tuples into strings:

complaintdates['Date'] = [x [0] for x in complaintdates['Date']]

Normally, if these were strings, we'd use the extract_dates function we wrote in a previous post. However, because I typed these as datetime objects, we can just extract the .year(), .month(), and .day() attributes:

type(complaintdates['Date'][0])

datetime.date

complaintdates['Day'] = [x.day for x in complaintdates['Date']]
complaintdates['Month'] = [x.month for x in complaintdates['Date']]
complaintdates['Year'] = [x.year for x in complaintdates['Date']]

This is how many total complaints we have:

len(complaintdates)

11371298

We can group them by month:

bymonth = complaintdates.groupby(by='Month').count()
bymonth

By year:

byyear = complaintdates.groupby(by='Year').count()
byyear

byday = complaintdates.groupby(by='Day').count()
bydate = complaintdates.groupby(by='Date').count()

Some matplotlib¶

plt.figure(figsize = (12,10))
x = range(0,12)
y = bymonth['Date']
plt.plot(x,y)

[<matplotlib.lines.Line2D at 0xa7a0ed68>]

plt.figure(figsize = (12,10))
x = range(0,7)
y = byyear['Date']

plt.plot(x,y)

[<matplotlib.lines.Line2D at 0xa1925748>]

plt.figure(figsize = (12,10))
x = range(0,len(byday))
y = byday['Date']

plt.plot(x,y)

[<matplotlib.lines.Line2D at 0x8c9c83c8>]

The sharp decline we see at the end is obviously because not all months have the same number of days.

plt.figure(figsize=(12,10))
x = range(0,len(bydate))
y = bydate['Year'] #This is arbitrary - year, month, and day are all series that store the counts

plt.plot(x,y)

[<matplotlib.lines.Line2D at 0xa0ed9860>]

That's all for now. In the next post, I'm going to break this down by borough, as well as polish this graph.

%matplotlib inline

from matplotlib import pyplot as plt

Misc #1: Experimenting with plots¶

I'm in the process of constructing figures for my dissertation, and I thought I'd share the different options I'm considering. I only need to plot simple line graphs, so my main considerations are aesthetics and ease of use (as I'll likely be making changes to them). This is the list I'm going to test:

matplotlib
MATLAB
R + ggplot
Excel
gnuplot
plot.ly
Chart.js
Tableau
Bokeh
MS paint (just kidding)

Without further ado, let's go!

matplotlib¶

intercept = -.2122
b1 = -1.1776 
b2 = -.1192
b3 = .4943

#x1 is a categorical predictor, x2 is a continuous predictor, x3 is their interaction
def y(x1, x2):
    return intercept + b1*x1 + b2*x2 + b3*x1*x2

x = range(1,8)
plt.plot(x, [y(0,z) for z in x], "blue", x, [y(1,z) for z in x], "red")

#pl.fill_between(x, y-error, y+error)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

MATLAB¶

intercept = -.2122
b1 = -1.1776 
b2 = -.1192
b3 = .4943
x = 1:7
% moderator = 0
y1 = intercept + b2*x
% moderator = 1
y2 = intercept + b1 + b2*x + b3*x

figure
plot(x,y1,x,y2)

R¶

intercept = -.2122
b1 = -1.1776 
b2 = -.1192
b3 = .4943

y <- function(x1, x2){
  return(intercept + b1*x1 + b2*x2 + b3*x1*x2)
}

x = c(1:7)
y1 = sapply(x,y,x1=0)
y2 = sapply(x,y,x1=1)
xrange <- range(1:7) 
yrange <- range(-1.5:1.5) 
plot(xrange,yrange, type='n')
lines(x,y1)
lines(x,y2)

ggplot2¶

d1 <- data.frame(x,y=y1,g='0')
d2 <- data.frame(x,y=y2,g='1')
data <- rbind(d1, d2)

cbbPalette <- c("blue", "red")
ggplot(data=data, aes(x=x, y=y, group=g, color = g)) +
  geom_line() +
  scale_color_manual(values=cbbPalette) +
  scale_x_continuous(limits = c(1,7)) + 
  scale_y_continuous(limits = c(-1.5,1.5))

Excel¶

gnuplot¶

set title ""
intercept = -.2122
b1 = -1.1776 
b2 = -.1192
b3 = .4943
y1(x) = intercept + b2*x
y2(x) = intercept + b1 + b2*x + b3*x

plot [1:7] y1(x) lt rgb "blue", y2(x) lt rgb "red"

plot.ly¶

plot.ly is web-based; I took screencaps of the interface:

The plot looks like this:

chart.js¶

<html>
    <head>
        <script src="Chart.js"></script>
    </head>
<canvas id="myChart" width="640" height="480"></canvas>
<body>
    <script>
        var ctx = document.getElementById("myChart");

        var data = {
            labels: Array.apply(null, Array(8)).map(function (_, i) {return i;}),
            datasets: [
                {
                    label: "y1",
                    fill: false,
                    lineTension: 0.1,
                    backgroundColor: "blue",
                    borderColor: "blue",
                    borderCapStyle: 'butt',
                    borderJoinStyle: 'miter',
                    pointBorderColor: "blue",
                    pointBackgroundColor: "#fff",
                    pointBorderWidth: 1,
                    pointHoverRadius: 5,
                    pointHoverBackgroundColor: "blue",
                    pointHoverBorderColor: "blue",
                    pointHoverBorderWidth: 2,
                    pointRadius: 1,
                    pointHitRadius: 10,
                    data: [-0.3314,-0.4506,-0.5698,-0.689,-0.8082,-0.9274,-1.0466],
                },
                        {
                    label: "y2",
                    fill: false,
                    lineTension: 0.1,
                    backgroundColor: "red",
                    borderColor: "red",
                    borderCapStyle: 'butt',
                    borderJoinStyle: 'miter',
                    pointBorderColor: "red",
                    pointBackgroundColor: "#fff",
                    pointBorderWidth: 1,
                    pointHoverRadius: 5,
                    pointHoverBackgroundColor: "red",
                    pointHoverBorderColor: "red",
                    pointHoverBorderWidth: 2,
                    pointRadius: 1,
                    pointHitRadius: 10,
                    data: [-1.0147,-0.6396,-0.2645,0.1106,0.4857,0.8608,1.2359],
                }
            ]
        };

        var myChart = new Chart(ctx, {
            type: 'line',
            data: data,
            options: {
                responsive:false,
                scales: {
                    yAxes: [{
                        ticks: {
                            beginAtZero:false
                        }
                    }]
                }
            }
        });
</script>
</body>
</html>

from IPython.display import IFrame
IFrame('https://bryansim.github.io/chartjsplot.html', width=700, height=500)

Bokeh¶

from bokeh.plotting import figure, output_file, show

intercept = -.2122
b1 = -1.1776 
b2 = -.1192
b3 = .4943

#x1 is a categorical predictor, x2 is a continuous predictor, x3 is their interaction
def y(x1, x2):
    return intercept + b1*x1 + b2*x2 + b3*x1*x2

x = range(1,8)
y1 = [y(0,z) for z in x]
y2 = [y(1,z) for z in x]

# output to static HTML file
output_file("bokehplot.html")

# create a new plot
p = figure(
   tools="pan,box_zoom,reset,save",
   y_range=[-1.5, 1.5],
)

# add some renderers
p.line(x, y1, legend="mod=0", line_color="blue")
p.line(x, y2, legend="mod=1", line_color="red")

# show the results
show(p)

from IPython.display import IFrame
IFrame('https://bryansim.github.io/bokehplot.html', width=700, height=675)

Summary¶

I think, from here on out, I'll be using matplotlib for simple plots, Bokeh if I need to generate something interactive, and chart.js if I ever need to implement a dynamic chart.

With regard to the alternatives:

MATLAB is proprietary.
The built-in packages in R generate ugly plots.
ggplot is a little unwieldy, and has defaults that I don't like (e.g., the grey background)
Excel is good to visualize something really quick, but most people who have worked with linear models for a while (including myself) can do this in our heads or with a pen and paper.
gnuplot looks really interesting, and may be something I spend more time on in the future, but it's foreign to me at the moment.
plot.ly and Tableau (which I wanted to test, but their sign-up system never sent me the confirmation email) are both web-based, and for me, needlessly obfuscate what happens in the backend. I'd be comfortable writing my own code to generate the plots I want.

BONUS: MS Paint¶

%matplotlib inline

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

pd.options.display.max_columns = 40

Hack for Heat #2: Problem types over time¶

In this post, I'm going to explore how we might track the composition of problems that the HPD(Housing Preservation and Development Board) might receive over time. The data that I currently have are only limited to 2.5 years, so this is just a proof of concept.

hpdcompprob = pd.read_csv("Complaint_Problems.csv")

type(hpdcompprob.StatusDate[0])

str

The first thing we're going to do is convert string dates into simple years, months, and days.

def extract_dates(series, nameofdatecolumn):
    series['Year'] = series[nameofdatecolumn].str.extract(r'../../(.*)', expand=True)
    series['Month'] = series[nameofdatecolumn].str.extract(r'(.*)/../....', expand=True)
    series['Day'] = series[nameofdatecolumn].str.extract(r'../(.*)/....', expand=True)
    
extract_dates(hpdcompprob, 'StatusDate')

def count_problem_types(pdseries):
    problemtypes = pdseries.unique()
    problemcounts = {k:0 for k in problemtypes}
    problemcounts['TOTAL'] = 0
    
    for problem in pdseries:
        problemcounts[problem] += 1
    
    problemcounts['TOTAL'] = len(pdseries)
    print year, problemcounts #this is just to show us what those counts are
        
    return problemcounts

#years = hpdcompprob.Year.unique()
years = ['2014','2015','2016'] #We know that only these data are relevant (the other, older years, are remnants of old cases)
yearsdict = {k:{} for k in years}
for year in years:
    yearsdict[year] = count_problem_types(hpdcompprob.loc[hpdcompprob.Year == year].MajorCategory)

2014 {'TOTAL': 259437, 'ELECTRIC': 12684, 'HEAT/HOT WATER': 93548, 'WATER LEAK': 14315, 'APPLIANCE': 5231, 'DOOR/WINDOW': 16352, 'HEATING': 16, 'ELEVATOR': 363, 'GENERAL': 9680, 'FLOORING/STAIRS': 11768, 'NONCONST': 185, 'UNSANITARY CONDITION': 34779, 'SAFETY': 5521, 'PLUMBING': 23179, 'CONSTRUCTION': 3, 'OUTSIDE BUILDING': 758, 'PAINT/PLASTER': 31055}
2015 {'TOTAL': 705644, 'PAINT/PLASTER': 84062, 'WATER LEAK': 42512, 'DOOR/WINDOW': 49154, 'APPLIANCE': 14719, 'NONCONST': 69, 'ELEVATOR': 1213, 'GENERAL': 31482, 'HEAT/HOT WATER': 228157, 'FLOORING/STAIRS': 34741, 'UNSANITARY CONDITION': 97221, 'SAFETY': 17105, 'PLUMBING': 67472, 'CONSTRUCTION': 1, 'OUTSIDE BUILDING': 2183, 'ELECTRIC': 35553}
2016 {'TOTAL': 211830, 'ELECTRIC': 9239, 'WATER LEAK': 12063, 'DOOR/WINDOW': 15465, 'APPLIANCE': 4644, 'HEATING': 39, 'ELEVATOR': 377, 'GENERAL': 9738, 'HEAT/HOT WATER': 73803, 'NONCONST': 33, 'UNSANITARY CONDITION': 27727, 'SAFETY': 5243, 'PLUMBING': 20218, 'FLOORING/STAIRS': 9989, 'OUTSIDE BUILDING': 543, 'PAINT/PLASTER': 22709}

Graphing our results¶

We're going to use matplotlib to make a stacked area graph. I referenced this post.

First, let's arrange our dictionary keys in alphabetical order, so that we can extract the values in a meaningful way.

labels = sorted(yearsdict['2014'].keys())
labels.remove('CONSTRUCTION')
labels.remove('HEATING') #removed these because 2015 doesn't have an entry, and both the other years have negligible entries
labels.remove('TOTAL') #not interested in total in this plot

years = ['2014','2015','2016']
plotdict = {k:[] for k in years}
for year in years:
    for label in labels:
        plotdict[year].append(yearsdict[year][label])

len(labels)

14

x = np.arange(len(years))
y = np.column_stack((plotdict['2014'],plotdict['2015'],plotdict['2016']))
plt.figure(figsize = (15,10));
plt.stackplot(x,y);
plt.legend(labels)

<matplotlib.legend.Legend at 0xeb24ac8>

This might legit be the ugliest plot I have ever seen. Let's try and fix the colors first.

From this blog post, I got a list of 14 contrasting colors.

from matplotlib import patches as mpatches

plotcolors = [(0,0,0),(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(x[0]/float(255),x[1]/float(255),x[2]/float(255)) for x in plotcolors]
legendcolors = [mpatches.Patch(color = x) for x in plotcolors]

x = np.arange(len(years))
y = np.column_stack((plotdict['2014'],plotdict['2015'],plotdict['2016']))
plt.figure(figsize = (15,10));
plt.stackplot(x,y,colors = plotcolors);
plt.legend(legendcolors, labels);

#some labels for the x axis
xlabels = ['2014', '2015', '2016']
plt.xticks(x,xlabels, size = 18);
plt.yticks(size = 18);
plt.title('HPD Complaints by type over time', size = 24);

Okay, so this looks a little better, and with some tweaking, we can make this work.

Hacking for heat¶

In this series, I'm going to be posting about the process that goes on behind some of the blog posts we end up writing. In this first entry, I'm going to be exploring a number of datsets.

These are the ones that I'm going to be looking at:

(for HPD datasets, some documentation can be found here)

HPD litigation database¶

First, we're going to look at the smallest dataset, one that contains cases against landlords. From the documentation, this file contains "All cases commenced by HPD or by tennants (naming HPD as a party) in [housing court] since August 2006" either seeking orders for landlords to comply with regulations, or awarding HPD civil penalties (i.e., collecting on fines).

import pandas as pd

litigation = pd.read_csv("Housing_Litigations.csv")

litigation.head()

Let's take a look at unique values for some of the columns:

litigation['Boro'].unique()

array(['BRONX', 'QUEENS', 'MANHATTAN', 'BROOKLYN', 'STATEN ISLAND'], dtype=object)

litigation.groupby(by = ['Boro','CaseJudgement']).count()

The above table tells us that Manhattan has the lowest proportion of cases that receive judgement (about 1 in 80), whereas Staten Island has the highest (about 1 in 12). It may be something worth looking into, but it's also important to note that many cases settle out of court, and landlords in Manhattan may be more willing (or able) to do so.

litigation['CaseType'].unique()

array(['Tenant Action', 'Heat and Hot Water', 'CONH',
       'Comp Supplemental Cases', 'Comprehensive',
       'Access Warrant - Non-Lead', 'Access Warrant - lead',
       'Lead False Certification', 'False Certification Non-Lead',
       'Heat Supplemental Cases', '7A', 'Failure to Register Only',
       'HLD - Other Case Type'], dtype=object)

litigation.groupby(by = ['CaseType', 'CaseJudgement']).count()

The table above shows the same case judgement proportions, but conditioned on what type of case it was. Unhelpfully, the documentation does not specify what the difference between Access Warrant - Lead and Non-Lead is. It could be one of two possibilities: The first is whether the warrants have to do with lead-based paint, which is a common problem, but perhaps still too idiosyncratic to have it's own warrant type. The second, perhaps more likely possibility is whether or not HPD was the lead party in the case.

We'll probably end up using these data by aggregating it and examining how complaints change over time, perhaps as a function of what type they are. There's also the possibility of looking up specific buildings' complaints and tying them to landlords. There's probably also an easy way to join this dataset with another, by converting the address information into something standardized, like borough-block-lot (BBL; http://www1.nyc.gov/nyc-resources/service/1232/borough-block-lot-bbl-lookup)

HPD complaints¶

Next, we're going to look at a dataset of HPD complaints.

hpdcomp = pd.read_csv('Housing_Maintenance_Code_Complaints.csv')

hpdcomp.head()

len(hpdcomp)

662672

This dataset is less useful on its own. It doesn't tell us what the type of complaint was, only the date it was received and whether or not the complaint is still open. However, it may be useful in conjunction with the earlier dataset. For example, we might be interested in how many of these complaints end up in court (or at least, have some sort of legal action taken).

HPD violations¶

The following dataset tracks HPD violations.

hpdviol = pd.read_csv('Housing_Maintenance_Code_Violations.csv')

hpdviol.head()

len(hpdviol)

1437246

These datasets all have different lengths, but that's not surprising, given they come from different years. One productive initial step would be to convert the date strings into something numerical.

HPD complaint problems database¶

hpdcompprob = pd.read_csv('Complaint_Problems.csv')

hpdcompprob.head()

Awesome! This dataset provides some more details about the complaints, and lets us join by ComplaintID.

Summary and next steps¶

In the immediate future, I'm going to be writing script to join and clean this dataset. This can either be done either in python, or by writing some SQL. I haven't decided yet. Additionally, I'm going to be writing some code to do things like convert date strings, and perhaps scrape text.

	createddate	closeddate	borough	timedelta	timedeltaint
0	2013-04-26	2013-04-26	QUEENS	0 days	0
1	2013-04-26	2013-04-29	MANHATTAN	3 days	3
2	2013-04-26	2013-04-30	MANHATTAN	4 days	4
3	2013-04-26	2013-04-26	QUEENS	0 days	0
4	2013-04-26	2013-04-30	STATEN ISLAND	4 days	4

	createddate	closeddate	borough	timedelta	timedeltaint
604564	2016-03-16	1900-01-01	Unspecified	-42443 days	-42443
596082	2016-03-16	1900-01-01	Unspecified	-42443 days	-42443
605654	2016-03-16	1900-01-01	QUEENS	-42443 days	-42443
606458	2016-03-15	1900-01-01	MANHATTAN	-42442 days	-42442
552013	2016-03-14	1900-01-01	Unspecified	-42441 days	-42441

	createddate	closeddate	borough	timedelta	timedeltaint
9392808	2010-04-08	2201-05-13	BRONX	69796 days	69796
9525584	2010-06-01	2201-06-17	BROOKLYN	69777 days	69777
1581464	2011-01-20	2201-03-25	BRONX	69460 days	69460
1536326	2015-11-03	2100-01-01	QUEENS	30740 days	30740
2479091	2013-01-10	2023-05-01	BROOKLYN	3763 days	3763

	createddate	closeddate	borough	timedelta	timedeltaint
11371297	2015-08-28	2015-08-29	QUEENS	1 days	1
7950080	2015-06-20	2015-06-21	BROOKLYN	1 days	1
2366882	2012-11-19	2012-11-20	MANHATTAN	1 days	1
2366879	2015-06-09	2015-06-10	Unspecified	1 days	1
2366874	2012-04-17	2012-04-18	BROOKLYN	1 days	1

	createddate	closeddate	borough	timedelta	timedeltaint
10774084	2010-01-06	2016-03-24	BROOKLYN	2269 days	2269
1506622	2010-02-19	2016-05-02	BROOKLYN	2264 days	2264
1508205	2010-03-02	2016-05-04	STATEN ISLAND	2255 days	2255
513580	2010-01-01	2016-02-17	STATEN ISLAND	2238 days	2238
589865	2010-01-22	2016-02-25	BRONX	2225 days	2225

Hello!

Hack for Heat

Mediocre Museum

Mousetracker Data

Skytrax Ratings

Hack for Heat #5: Complaint resolution time