Hack for Heat #11: NYC's top complaints pt.2

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date


pd.options.display.max_columns = 40


In this post, I'm going to visualize the top complaints reported in the previous post. I have these data in a csv:

In [2]:
tempdata = pd.read_csv('topfiverev.csv', names = ['Year','Category','Count'])
In [3]:
tempdata.head()
Out[3]:
Year Category Count
0 2010 CONSTRUCTION/PLUMBING 287253
1 2010 Street Condition 229584
2 2010 HEAT RELATED 214218
3 2010 NOISY 201834
4 2010 PAINT - PLASTER 93194
In [4]:
#Fixing some category names
tempdata['Category'] = ['PAINT/PLASTER' if x in ('PAINT/PLASTER', 'PAINT - PLASTER') else x for x in tempdata['Category']]
tempdata['Category'] = ['Street Condition' if x == 'Traffic Signal Condition' else x for x in tempdata['Category']]
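The same renaming can also be done in one pass with `Series.replace` and a mapping dict; a minimal sketch on toy values standing in for `tempdata['Category']`:

```python
import pandas as pd

# toy stand-in for tempdata['Category']
cats = pd.Series(['PAINT - PLASTER', 'Traffic Signal Condition', 'NOISY'])

# map old labels to their replacements; unmatched values pass through unchanged
fixed = cats.replace({'PAINT - PLASTER': 'PAINT/PLASTER',
                      'Traffic Signal Condition': 'Street Condition'})
```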

What we want is to create a data frame with as many rows as there are categories, as many columns as there are years, and the counts as the data points.

In [5]:
upcaseindex = [x.upper() for x in tempdata['Category'].unique()]
upcaseindex.remove('NONCONST') #remove this category because it refers to missing values
In [6]:
plotdata = pd.DataFrame(index = upcaseindex, columns = tempdata['Year'].unique())
In [7]:
plotdata
Out[7]:
2010 2011 2012 2013 2014 2015 2016
CONSTRUCTION/PLUMBING NaN NaN NaN NaN NaN NaN NaN
STREET CONDITION NaN NaN NaN NaN NaN NaN NaN
HEAT RELATED NaN NaN NaN NaN NaN NaN NaN
NOISY NaN NaN NaN NaN NaN NaN NaN
PAINT/PLASTER NaN NaN NaN NaN NaN NaN NaN
WATER SYSTEM NaN NaN NaN NaN NaN NaN NaN
BLOCKED DRIVEWAY NaN NaN NaN NaN NaN NaN NaN
ILLEGAL PARKING NaN NaN NaN NaN NaN NaN NaN
UNSANITARY CONDITION NaN NaN NaN NaN NaN NaN NaN

Now let's populate it with values.

In [8]:
for row in tempdata['Category'].unique():
    for year in  tempdata['Year'].unique():
        try:
            plotdata.loc[row.upper(),year] = tempdata.loc[(tempdata['Year'] == year) & (tempdata['Category'] == row)]['Count'].values[0]
        except IndexError: #no entry for this category/year combination
            plotdata.loc[row.upper(),year] = None
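The nested loop that fills `plotdata` can also be expressed as a single pivot; a sketch on toy data:

```python
import pandas as pd

tempdata = pd.DataFrame({
    'Year':     [2010, 2010, 2011, 2011],
    'Category': ['HEAT RELATED', 'NOISY', 'HEAT RELATED', 'NOISY'],
    'Count':    [214218, 201834, 190184, 196453],
})

# rows = categories, columns = years, cells = counts (NaN where a pair is absent)
plotdata = tempdata.pivot(index='Category', columns='Year', values='Count')
```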
In [9]:
plotdata = plotdata.drop(['NONCONST'])
plotdata
Out[9]:
2010 2011 2012 2013 2014 2015 2016
CONSTRUCTION/PLUMBING 287253 272234 225338 223314 106251 83585 39460
STREET CONDITION 229584 216865 171910 182618 221097 233839 105346
HEAT RELATED 214218 190184 182974 202896 230364 225706 126906
NOISY 201834 196453 220392 259356 337892 387227 197186
PAINT/PLASTER 93194 100704 77287 77957 63503 None None
WATER SYSTEM 70346 61577 57678 53186 None 71086 31299
BLOCKED DRIVEWAY None 52592 50818 57408 79170 100881 53955
ILLEGAL PARKING None None None None 63243 92679 55932
UNSANITARY CONDITION None None None None 61789 82888 35357
In [10]:
x = range(0,len(plotdata.columns))
xlabels = plotdata.columns

plt.figure(figsize=(16,10));
plt.xlim(-1,len(xlabels));
plt.ylim(0,500000);
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')
plt.title('Complaint Category by Year', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

count = 0
for row in plotdata.index:
    y = plotdata.loc[row]
    plt.plot(x, y, color = plotcolors[count], lw=4, marker = '.', markersize = 20);
    count += 1

plt.legend(legendcolors, upcaseindex, bbox_to_anchor=(0.3, 1));
plt.xticks(x,xlabels);

So that's my graph. Note that the apparent drop in 2016 is simply because we're only midway through the year: I grabbed these data in early July, so that's when these numbers were last updated.

Hack for Heat #10: NYC's top complaints

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date


pd.options.display.max_columns = 40

connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()


In this post, I'm going to look at what the top 311 complaints in NYC are, and how their numbers have changed over the years.

In [2]:
cursor.execute('''SELECT createddate, borough, complainttype FROM service''')
totalcomplaints = cursor.fetchall()
In [3]:
len(totalcomplaints)
Out[3]:
13092182

Over the past 7 years, there were about 13 million complaints. That works out to roughly 1.5 per person over the whole period, or about 0.22 complaints per person per year.

In [4]:
totalcomplaints = pd.DataFrame(totalcomplaints)
In [5]:
totalcomplaints.columns = ['Date', 'Boro', 'Comptype']
In [6]:
totalcomplaints['Year'] = [x.year for x in totalcomplaints['Date']]
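psycopg2 hands back `datetime` objects, so the list comprehension works; if the column is converted with `pd.to_datetime` first, the `.dt` accessor does the same thing vectorized. A sketch:

```python
import pandas as pd

# toy stand-in for the 'Date' column
dates = pd.to_datetime(pd.Series(['2010-01-15', '2015-11-30']))
years = dates.dt.year  # vectorized year extraction
```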

This is how it breaks down over the years:

In [7]:
totalcomplaints.groupby(by=['Year']).count()
Out[7]:
Date Boro Comptype
Year
2010 2005730 2005730 2005730
2011 1918888 1918888 1918888
2012 1783013 1783013 1783013
2013 1848938 1848938 1848938
2014 2102177 2102177 2102177
2015 2286697 2286697 2286697
2016 1146739 1146739 1146739
In [8]:
complainttypesbyseason = totalcomplaints.groupby(by=['Year','Comptype']).count()
In [9]:
topfive = complainttypesbyseason['Date'].groupby(level=0,group_keys=False).apply(lambda x:x.sort_values(ascending=False).head(5))
topfive
Out[9]:
Year  Comptype              
2010  HEATING                   214218
      GENERAL CONSTRUCTION      127131
      Street Light Condition    116466
      Noise - Residential       115615
      PLUMBING                  111504
2011  HEATING                   190184
      GENERAL CONSTRUCTION      133242
      PLUMBING                  113218
      Noise - Residential       111447
      Street Condition          104694
2012  HEATING                   182974
      Noise - Residential       127943
      GENERAL CONSTRUCTION      112436
      Street Light Condition     93981
      PLUMBING                   87561
2013  HEATING                   202896
      Noise - Residential       151516
      GENERAL CONSTRUCTION      112169
      Street Light Condition     92502
      PLUMBING                   86564
2014  Noise - Residential       192608
      HEAT/HOT WATER            132767
      Street Condition          114545
      HEATING                    97597
      Street Light Condition     94665
2015  HEAT/HOT WATER            225706
      Noise - Residential       208026
      Street Condition          124378
      Blocked Driveway          100881
      Street Light Condition     98106
2016  HEAT/HOT WATER            126906
      Noise - Residential       104324
      Illegal Parking            55932
      Blocked Driveway           53955
      Street Condition           51977
dtype: int64

These data give us a rough idea of how the number of complaints has changed over time, but one problem is that the labels used changed from year to year (presumably because they were replaced by new labels that better reflected what the complaints were). Second, for our purposes, some categories can be combined. For example, distinguishing 'Street Condition' from 'Street Light Condition' may be important for administrators, but for now, we can be content with just "stuff that's gone wrong on the street".
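The sort-then-head pattern in `In [9]` can also be written with `nlargest`; a sketch on a toy single-year Series:

```python
import pandas as pd

counts = pd.Series({'HEATING': 214218, 'GENERAL CONSTRUCTION': 127131,
                    'Street Light Condition': 116466, 'Noise - Residential': 115615,
                    'PLUMBING': 111504, 'Sewer': 90000})

top5 = counts.nlargest(5)  # five largest values, sorted descending
```

Per-year top fives fall out of applying the same call within each `groupby` group.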

In [10]:
topfive.index.levels[1]
Out[10]:
Index([u'Blocked Driveway', u'GENERAL CONSTRUCTION', u'HEAT/HOT WATER',
       u'HEATING', u'Illegal Parking', u'Noise - Residential', u'PLUMBING',
       u'Street Condition', u'Street Light Condition'],
      dtype='object', name=u'Comptype')

So, over the last 7 years, nine different categories have made up the annual top five. Notice that some of these labels have changed over time. We'll have to standardize them.

In [11]:
years = totalcomplaints.Year.unique()

#first have a list of labels that are the same
oldlabels = totalcomplaints.loc[totalcomplaints['Year'] == years[0],['Comptype']]['Comptype'].unique()

newlabels = []

for x in years:
    yearslabels = totalcomplaints.loc[totalcomplaints['Year'] == x,['Comptype']]['Comptype'].unique()
    for label in yearslabels:
        if label not in oldlabels and label not in newlabels:
            newlabels.append(label)

From doing this, we get two lists: oldlabels and newlabels, that we will use to decide how to combine stuff.
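Since order doesn't matter for this bookkeeping, the same diff can be done with sets; a sketch with hypothetical labels:

```python
# labels seen in the first year
oldlabels = {'HEATING', 'PLUMBING', 'Noise - Residential'}

# labels seen in each later year
later_years = [{'HEAT/HOT WATER', 'PLUMBING'},
               {'Noise - Residential', 'Illegal Parking'}]

# labels that appear in later years but not in the first year
newlabels = set().union(*later_years) - oldlabels
```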

In [12]:
#sorted(oldlabels)
In [13]:
#sorted(newlabels)

One modification I'm going to make is to combine 'Street Condition', 'Street Light Condition', 'Street Sign - Damaged', 'Street Sign - Dangling', and 'Street Sign - Missing' into 'Street Condition'.

For the current purposes, these are the same thing.

I'm also going to collapse 'GENERAL CONSTRUCTION', 'General Construction/Plumbing' and 'PLUMBING' into a new category called 'CONSTRUCTION/PLUMBING'.

I'll collapse noise related complaints ('Noise', 'Noise - Commercial', 'Noise - Helicopter', 'Noise - House of Worship', 'Noise - Park', 'Noise - Residential', 'Noise - Street/Sidewalk', 'Noise - Vehicle') into a label named 'NOISY'.

Finally, I'll combine 'HEAT/HOT WATER' and 'HEATING' into 'HEAT RELATED'.

In [14]:
streetconditioncomb = ['Street Condition', 'Street Light Condition', 'Street Sign - Damaged', 'Street Sign - Dangling', \
                       'Street Sign - Missing']

totalcomplaints['RevComptype'] = ['Street Condition' if x in streetconditioncomb else x for x in totalcomplaints['Comptype']]

genconstructcomb = ['GENERAL CONSTRUCTION', 'General Construction/Plumbing', 'PLUMBING']

totalcomplaints['RevComptype'] = ['CONSTRUCTION/PLUMBING' if x in genconstructcomb else x for x in totalcomplaints['RevComptype']]

noisecomb = ['Noise', 'Noise - Commercial', 'Noise - Helicopter', 'Noise - House of Worship', 'Noise - Park',\
             'Noise - Residential', 'Noise - Street/Sidewalk', 'Noise - Vehicle']

totalcomplaints['RevComptype'] = ['NOISY' if x in noisecomb else x for x in totalcomplaints['RevComptype']]

heatcomb = ['HEAT/HOT WATER', 'HEATING']

totalcomplaints['RevComptype'] = ['HEAT RELATED' if x in heatcomb else x for x in totalcomplaints['RevComptype']]
In [15]:
revisedcomplainttypes = totalcomplaints.groupby(by=['Year','RevComptype']).count()
In [16]:
topfiverev = revisedcomplainttypes['Date'].groupby(level=0, group_keys=False).apply(lambda x : x.sort_values(ascending=False).head(8))
topfiverev
Out[16]:
Year  RevComptype             
2010  CONSTRUCTION/PLUMBING       287253
      Street Condition            229584
      HEAT RELATED                214218
      NOISY                       201834
      PAINT - PLASTER              93194
      Water System                 70346
      NONCONST                     69659
      Traffic Signal Condition     53773
2011  CONSTRUCTION/PLUMBING       272234
      Street Condition            216865
      NOISY                       196453
      HEAT RELATED                190184
      PAINT - PLASTER             100704
      NONCONST                     68407
      Water System                 61577
      Blocked Driveway             52592
2012  CONSTRUCTION/PLUMBING       225338
      NOISY                       220392
      HEAT RELATED                182974
      Street Condition            171910
      PAINT - PLASTER              77287
      NONCONST                     60055
      Water System                 57678
      Blocked Driveway             50818
2013  NOISY                       259356
      CONSTRUCTION/PLUMBING       223314
      HEAT RELATED                202896
      Street Condition            182618
      PAINT - PLASTER              77957
      Blocked Driveway             57408
      NONCONST                     53831
      Water System                 53186
2014  NOISY                       337892
      HEAT RELATED                230364
      Street Condition            221097
      CONSTRUCTION/PLUMBING       106251
      Blocked Driveway             79170
      PAINT/PLASTER                63503
      Illegal Parking              63243
      UNSANITARY CONDITION         61789
2015  NOISY                       387227
      Street Condition            233839
      HEAT RELATED                225706
      Blocked Driveway            100881
      Illegal Parking              92679
      CONSTRUCTION/PLUMBING        83585
      UNSANITARY CONDITION         82888
      Water System                 71086
2016  NOISY                       197186
      HEAT RELATED                126906
      Street Condition            105346
      Illegal Parking              55932
      Blocked Driveway             53955
      CONSTRUCTION/PLUMBING        39460
      UNSANITARY CONDITION         35357
      Water System                 31299
dtype: int64

This gives us some different numbers. Taken together: heat-related complaints are still consistently in the top five, construction-related complaints have gradually declined, street-related complaints remain large (if variable), and noise-related complaints have steadily increased. Next, I'll post a version of this looking at only the heating season.

Hack for Heat #9: Total complaints over time, complaints per capita, etc.

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date


pd.options.display.max_columns = 40

connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()


In this post, I'm going to be doing a couple of miscellaneous things to prepare for our first blog post of the summer. Last year's posts can be found here.

In [2]:
cursor.execute('''SELECT createddate, borough, complainttype FROM service;''')
borodata = cursor.fetchall()
In [3]:
borodata = pd.DataFrame(borodata)
In [4]:
borodata.columns = ['Date', 'Boro', 'Comptype']

Again, I removed entries earlier than March 2011 because there are data quality issues (substantially fewer cases):

In [5]:
borodata = borodata.loc[borodata['Date'] > date(2011,3,1)]
In [6]:
borodata = borodata.loc[borodata['Date'] < date(2016,6,1)]

Remove cases where the borough was unspecified:

In [7]:
borodata = borodata.loc[borodata['Boro'] != 'Unspecified']

Total complaints by borough, year-over-year

We already have a chart of heat complaints by month; we need one for the total number of complaints received.

In [8]:
borodata['Year'] = [x.year for x in borodata['Date']]
borodata['Month'] = [x.month for x in borodata['Date']]
borodata['Day'] = [x.day for x in borodata['Date']]
In [9]:
plotdata = borodata.groupby(by=['Boro', 'Year', 'Month']).count()
plotdata.head()
Out[9]:
Date Comptype Day
Boro Year Month
BRONX 2011 3 26256 26256 26256
4 22898 22898 22898
5 22722 22722 22722
6 23677 23677 23677
7 27348 27348 27348

Now we generate a dictionary that we can use to plot these data:

In [10]:
plotdict = {x:[] for x in borodata['Boro'].unique()}
In [11]:
for boro in plotdict:
    plotdict[boro] = list(plotdata.xs(boro).Date)

Now we need to format our data for the plot. We need 5 "rows", each representing a borough, and n columns, each representing one month.

In [12]:
plotdata = np.zeros(len(plotdict['BROOKLYN']))
legendkey = []

for boro in plotdict.keys():
    plotdata = np.row_stack((plotdata, plotdict[boro]))
    legendkey.append(boro)
    
plotdata = np.delete(plotdata, (0), axis=0)
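The zeros-row seeding and subsequent `np.delete` can be avoided by stacking in one call; a sketch on a toy dict (sorting the keys also pins down the borough order, which plain dict iteration doesn't guarantee in older Pythons):

```python
import numpy as np

# toy stand-in for plotdict: borough -> monthly counts
plotdict = {'BRONX': [1, 2, 3], 'BROOKLYN': [4, 5, 6]}

legendkey = sorted(plotdict)  # deterministic borough order
plotdata = np.vstack([plotdict[boro] for boro in legendkey])
```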
In [13]:
x = np.arange(len(plotdata[0]))

#crude xlabels
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years = ['2011', '2012', '2013', '2014', '2015', '2016']
xlabels = []
for year in years:
    for month in months:
        xlabels.append("{0} {1}".format(month,year))

        
xlabels = xlabels[2:-7] #start from march 2011, end may2016
In [14]:
plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,plotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.legend(legendcolors,legendkey, bbox_to_anchor=(0.2, 1));
plt.title('Total Complaints by Borough', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')
In [15]:
plotdatacsv = pd.DataFrame(plotdata)
plotdatacsv.index = legendkey
plotdatacsv.columns = xlabels
plotdatacsv.to_csv('totalcomplaintsbyboroughandmonth.csv')

Normalized total complaints

We'll also need a normalized version of the above, where everything is scaled to 100%, to see if complaints come disproportionately from one borough vs. another. There are probably better ways to do this, but I'm going to simply sum the number of complaints from each time point, and divide each column by the total for that column (i.e., number of complaints from that borough for that month divided by total number of complaints).
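NumPy broadcasting makes the column-wise division a one-liner, with no need to tile the totals; a sketch:

```python
import numpy as np

# toy 2-borough x 2-month counts
plotdata = np.array([[2., 2.],
                     [6., 2.]])

# divide each column by its total; the (2,) row of sums broadcasts across rows
norm = plotdata / plotdata.sum(axis=0)
```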

In [16]:
normplotdata = np.ndarray(shape=np.shape(plotdata))

It will thus be helpful to have an array of the total number of complaints:

In [17]:
totalcomplaintsbymonth = borodata.groupby(by=['Year','Month']).count().Date
In [18]:
totalcomplaintsbymonth = totalcomplaintsbymonth.values
In [19]:
totalcomplaintsbymonth = np.tile(totalcomplaintsbymonth,(5,1))
In [20]:
normplotdata = plotdata/totalcomplaintsbymonth
In [21]:
plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,normplotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.ylim(0,1)
plt.legend(legendcolors,legendkey, bbox_to_anchor=(0.2, 1));
plt.title('Total Complaints by Borough (Normalized)', size = 24)
plt.ylabel('Proportion of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

Complaints per capita:

We'd also like to look at how many complaints per capita are generated. Bear in mind that we only have full years for 2012-2015, so the 2011 and 2016 figures will be lower.

In [22]:
boropop = {
    'MANHATTAN': 1636268,
    'BRONX': 1438159,
    'BROOKLYN': 2621793,
    'QUEENS': 2321580,
    'STATEN ISLAND': 473279,
    }
In [23]:
totalnycpop = reduce(lambda x, y: x +y, boropop.values())
totalnycpop
Out[23]:
8491079
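In Python 3, `reduce` has moved to `functools`, but the built-in `sum` covers this case directly:

```python
boropop = {'MANHATTAN': 1636268, 'BRONX': 1438159, 'BROOKLYN': 2621793,
           'QUEENS': 2321580, 'STATEN ISLAND': 473279}

# sum the borough populations without reduce
totalnycpop = sum(boropop.values())
```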
In [24]:
complaintsbyyear = borodata.groupby(by=['Year']).count()
borocomplaintsbyyear = borodata.groupby(by=['Year','Boro']).count()
In [25]:
complaintsbyyear['Pop'] = [totalnycpop for x in complaintsbyyear.index]
In [26]:
borocomplaintsbyyear['Pop'] = [boropop.get(x[1]) for x in borocomplaintsbyyear.index]
In [27]:
complaintsbyyear['CompPerCap'] = complaintsbyyear['Day']/complaintsbyyear['Pop']
In [28]:
complaintsbyyear
Out[28]:
Date Boro Comptype Month Day Pop CompPerCap
Year
2011 1359368 1359368 1359368 1359368 1359368 8491079 0.160094
2012 1528597 1528597 1528597 1528597 1528597 8491079 0.180024
2013 1546188 1546188 1546188 1546188 1546188 8491079 0.182096
2014 1770357 1770357 1770357 1770357 1770357 8491079 0.208496
2015 1924917 1924917 1924917 1924917 1924917 8491079 0.226699
2016 891783 891783 891783 891783 891783 8491079 0.105026
In [29]:
borocomplaintsbyyear['CompPerCap'] = borocomplaintsbyyear['Day']/borocomplaintsbyyear['Pop']
In [30]:
borocomplaintsbyyear
Out[30]:
Date Comptype Month Day Pop CompPerCap
Year Boro
2011 BRONX 262597 262597 262597 262597 1438159 0.182592
BROOKLYN 434038 434038 434038 434038 2621793 0.165550
MANHATTAN 273282 273282 273282 273282 1636268 0.167015
QUEENS 315070 315070 315070 315070 2321580 0.135714
STATEN ISLAND 74381 74381 74381 74381 473279 0.157161
2012 BRONX 296502 296502 296502 296502 1438159 0.206168
BROOKLYN 482088 482088 482088 482088 2621793 0.183877
MANHATTAN 314437 314437 314437 314437 1636268 0.192167
QUEENS 358990 358990 358990 358990 2321580 0.154632
STATEN ISLAND 76580 76580 76580 76580 473279 0.161807
2013 BRONX 302328 302328 302328 302328 1438159 0.210219
BROOKLYN 490979 490979 490979 490979 2621793 0.187268
MANHATTAN 332487 332487 332487 332487 1636268 0.203198
QUEENS 343772 343772 343772 343772 2321580 0.148077
STATEN ISLAND 76622 76622 76622 76622 473279 0.161896
2014 BRONX 340297 340297 340297 340297 1438159 0.236620
BROOKLYN 557290 557290 557290 557290 2621793 0.212561
MANHATTAN 373895 373895 373895 373895 1636268 0.228505
QUEENS 408389 408389 408389 408389 2321580 0.175910
STATEN ISLAND 90486 90486 90486 90486 473279 0.191190
2015 BRONX 355017 355017 355017 355017 1438159 0.246855
BROOKLYN 602144 602144 602144 602144 2621793 0.229669
MANHATTAN 417223 417223 417223 417223 1636268 0.254985
QUEENS 455115 455115 455115 455115 2321580 0.196037
STATEN ISLAND 95418 95418 95418 95418 473279 0.201610
2016 BRONX 169773 169773 169773 169773 1438159 0.118049
BROOKLYN 271140 271140 271140 271140 2621793 0.103418
MANHATTAN 209840 209840 209840 209840 1636268 0.128243
QUEENS 200335 200335 200335 200335 2321580 0.086293
STATEN ISLAND 40695 40695 40695 40695 473279 0.085985

From these data, we can see two things: First, on average, about 1 in 5 people make a 311 complaint each year. Second, these numbers are pretty consistent across the 5 boroughs.

Hack for Heat #8: Complaint resolution time revisited

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2
from datetime import date

pd.options.display.max_columns = 40
pd.options.mode.chained_assignment = None  # default='warn'


In this post, I'm going to explore a simple way of dealing with the censoring issue in complaint resolution time: because we only care about heat complaints during the heating season, we can restrict our cases of interest to the heating season itself (October 1st to May 31st).

In [2]:
#pull from our database again:
connection = psycopg2.connect('dbname = threeoneone user= threeoneoneadmin password = threeoneoneadmin')
In [3]:
cursor = connection.cursor()
In [4]:
cursor.execute('''SELECT createddate, closeddate, borough, complainttype FROM service;''')
In [5]:
data = cursor.fetchall()
In [6]:
data = pd.DataFrame(data)
In [7]:
data.columns = ['createddate', 'closeddate', 'borough', 'complainttype']
In [8]:
data = data.loc[(data['complainttype'] == 'HEATING') | (data['complainttype'] == 'HEAT/HOT WATER') ]

So far, we've subsetted the data to only include the heating complaints. Now let's create a datetime mask. What we want is to have subsets of data for each of the heating seasons:

In [9]:
heatmonths = range(1, 6) + [10,11,12]
In [10]:
heatmonths
Out[10]:
[1, 2, 3, 4, 5, 10, 11, 12]
In [11]:
createdmask = data['createddate'].map(lambda x: x.month in heatmonths)
closedmask = data['closeddate'].map(lambda x: x.month in heatmonths if x is not None else False)
In [12]:
mask = createdmask & closedmask
In [13]:
heatseasondata = data.loc[mask]
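If the date columns are proper pandas datetimes, `dt.month.isin` builds the same mask without a Python-level lambda; a sketch:

```python
import pandas as pd

# toy stand-in for a date column
dates = pd.to_datetime(pd.Series(['2015-01-15', '2015-07-04', '2015-11-01']))

heatmonths = list(range(1, 6)) + [10, 11, 12]
mask = dates.dt.month.isin(heatmonths)  # True for Oct-May dates
```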

How many heat/hot water complaints are created and closed inside vs. outside of heating season?

In [14]:
len(data.loc[mask]) #inside heating season
Out[14]:
1302688
In [15]:
len(data.loc[~mask])#outside heating season
Out[15]:
68750

The next thing we want to do is ignore cases where the complaint was resolved in the next heating season:

In [16]:
prevmonths = range(1, 6) 
nextmonths = [10,11,12]
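Season membership can also be captured in a small helper. Note that this hypothetical labeler tags Oct-Dec of year Y and the following Jan-May with the same label Y, whereas the list comprehensions below tag Jan-May with the calendar year instead; it assumes dates have already been filtered to heating months:

```python
from datetime import date

def heat_season(d):
    # Hypothetical helper: tag a date with the year its Oct-May
    # heating season began, so Oct-Dec of year Y and Jan-May of
    # year Y+1 share the label Y.
    return d.year if d.month >= 10 else d.year - 1
```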
In [17]:
heatseasondata['createdheatseason'] = [x.year if (x.month in prevmonths) else (x.year-1) for x in heatseasondata['createddate']]
In [18]:
heatseasondata.head()
Out[18]:
createddate closeddate borough complainttype createdheatseason
0 2014-11-18 2014-11-22 BROOKLYN HEAT/HOT WATER 2013
1 2014-11-18 2014-11-25 BROOKLYN HEAT/HOT WATER 2013
2 2014-11-18 2014-11-19 BRONX HEAT/HOT WATER 2013
3 2014-11-18 2014-11-19 MANHATTAN HEAT/HOT WATER 2013
4 2014-11-18 2014-11-20 BRONX HEAT/HOT WATER 2013
In [19]:
heatseasondata['closedheatseason'] = [x.year if (x.month in prevmonths) else (x.year-1) for x in heatseasondata['closeddate']]

Now that we've done this, we can select only the cases where the closed date was in the same season as the created date:

In [20]:
heatseasondata = heatseasondata.loc[heatseasondata['createdheatseason']  == heatseasondata['closedheatseason']]

Okay, now we can calculate some average resolution times:

In [21]:
heatseasondata['resolutiontime'] = heatseasondata['closeddate'] - heatseasondata['createddate']
In [22]:
heatseasondata['resolutiontimeint'] = heatseasondata.resolutiontime.astype('timedelta64[D]')
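`.dt.days` is an equivalent way to get whole-day resolution times from the timedeltas; a sketch:

```python
import pandas as pd

created = pd.to_datetime(pd.Series(['2015-01-01', '2015-01-03']))
closed = pd.to_datetime(pd.Series(['2015-01-04', '2015-01-03']))

# whole days between close and open for each complaint
resolution_days = (closed - created).dt.days
```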
In [23]:
resolutiontimedata = heatseasondata.groupby(by='createdheatseason').mean()['resolutiontimeint']
In [24]:
resolutiontimedata.to_csv('resolutiontimebyyear.csv')
In [25]:
resolutiontimedata
Out[25]:
createdheatseason
2009    4.605845
2010    3.695498
2011    3.666409
2012    3.623974
2013    3.514244
2014    3.137578
2015    3.406145
2016    3.446075
Name: resolutiontimeint, dtype: float64

Resolution times by year:

In [26]:
x = resolutiontimedata.index.values
y = resolutiontimedata.values

plt.figure(figsize=(12,10));
plt.plot(x,y);

Resolution time by borough:

In [27]:
restimebyboro = heatseasondata.groupby(by=['borough', 'createdheatseason']).mean()['resolutiontimeint']
In [28]:
restimebyboro.to_csv('restimebyboro.csv')
In [29]:
restimebyboro = restimebyboro.loc[[x in range(2010,2017) for x in restimebyboro.index.get_level_values('createdheatseason')]]
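The year filter on the MultiIndex can also be written with `isin` on the level values; a toy sketch:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([('BRONX', 2009), ('BRONX', 2012)],
                                names=['borough', 'createdheatseason'])
restime = pd.Series([4.6, 3.6], index=idx)

# keep only seasons 2010-2016
filtered = restime[restime.index.get_level_values('createdheatseason').isin(range(2010, 2017))]
```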
In [31]:
boros = heatseasondata.borough.unique()
In [32]:
boroplots = {x:[] for x in boros}
for boro in boros:
    boroplots[boro] = restimebyboro.xs(boro).values
    
boroplots.pop('Unspecified')
Out[32]:
array([ 3.64268956,  3.89000015])
In [33]:
x = range(2010,2017)

plt.figure(figsize=(12,10));
for boro in boroplots:
    plt.plot(x,boroplots[boro]);
plt.legend(boroplots.keys());
plt.xticks(x, [str(label) for label in x]);

I was told beforehand that the average resolution time from a similar analysis last year was about 3-5 days, so this looks about right.

Hack for Heat #7: Heat complaints over time

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date

pd.options.display.max_columns = 40


In the last post, I plotted how the number of complaints differed by borough over time. This time around, I'm going to revisit that process, focusing on heating complaints only (this is Heat Seek, after all).

Loading data:

In [2]:
connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()
In [3]:
cursor.execute('''SELECT DISTINCT complainttype FROM service;''')
complainttypes = cursor.fetchall()
In [4]:
cursor.execute('''SELECT createddate, borough, complainttype FROM service;''')
borodata = cursor.fetchall()
In [5]:
borodata = pd.DataFrame(borodata)
In [6]:
borodata.head()
Out[6]:
0 1 2
0 2011-04-15 MANHATTAN Street Condition
1 2011-04-15 MANHATTAN Street Condition
2 2011-04-15 MANHATTAN Street Condition
3 2011-04-15 MANHATTAN Street Condition
4 2011-04-14 QUEENS Street Condition
In [7]:
borodata.columns = ['Date', 'Boro', 'Comptype']
In [8]:
heatdata = borodata.loc[(borodata['Comptype'] == 'HEATING') | (borodata['Comptype'] == 'HEAT/HOT WATER')]

I removed entries earlier than March 2011 because there are data quality issues (substantially fewer cases):

In [9]:
heatdata = heatdata.loc[heatdata['Date'] > date(2011,3,1)]

Heat complaints by borough:

In [10]:
len(heatdata)
Out[10]:
1092048

There were about a million heating complaints over the 5+ years we have data for.

In [11]:
heatbydate = heatdata.groupby(by='Boro').count()
In [12]:
heatbydate
Out[12]:
Date Comptype
Boro
BRONX 354335 354335
BROOKLYN 334574 334574
MANHATTAN 250296 250296
QUEENS 138477 138477
STATEN ISLAND 10601 10601
Unspecified 3765 3765

Per capita:

Again, let's look at how many heat complaints each boro generates per person:

In [13]:
boropop = {
    'MANHATTAN': 1636268,
    'BRONX': 1438159,
    'BROOKLYN': 2621793,
    'QUEENS': 2321580,
    'STATEN ISLAND': 473279,
    }
In [14]:
heatbydate['Pop'] = [boropop.get(x) for x in heatbydate.index]
In [15]:
heatbydate['CompPerCap'] = heatbydate['Date']/heatbydate['Pop']
In [16]:
heatbydate
Out[16]:
Date Comptype Pop CompPerCap
Boro
BRONX 354335 354335 1438159.0 0.246381
BROOKLYN 334574 334574 2621793.0 0.127613
MANHATTAN 250296 250296 1636268.0 0.152968
QUEENS 138477 138477 2321580.0 0.059648
STATEN ISLAND 10601 10601 473279.0 0.022399
Unspecified 3765 3765 NaN NaN

Complaints by borough over months

First, let's recreate the graph from before:

In [17]:
heatdata['Year'] = [x.year for x in heatdata['Date']]
heatdata['Month'] = [x.month for x in heatdata['Date']]
heatdata['Day'] = [x.day for x in heatdata['Date']]
In [18]:
heatdata.head()
Out[18]:
Date Boro Comptype Year Month Day
58 2015-10-17 BROOKLYN HEAT/HOT WATER 2015 10 17
728 2011-04-16 MANHATTAN HEATING 2011 4 16
1026 2011-04-16 BROOKLYN HEATING 2011 4 16
1418 2015-10-17 MANHATTAN HEAT/HOT WATER 2015 10 17
1624 2011-04-15 MANHATTAN HEATING 2011 4 15

We remove cases where the borough was unspecified:

In [19]:
heatdata = heatdata.loc[heatdata['Boro'] != 'Unspecified']
In [20]:
heatplotdata = heatdata.groupby(by=['Boro', 'Year','Month']).count()
heatplotdata
Out[20]:
Date Comptype Day
Boro Year Month
BRONX 2011 3 5101 5101 5101
4 3339 3339 3339
5 1217 1217 1217
6 562 562 562
7 564 564 564
8 482 482 482
9 674 674 674
10 5954 5954 5954
11 7393 7393 7393
12 10536 10536 10536
2012 1 12655 12655 12655
2 7146 7146 7146
3 5049 5049 5049
4 3336 3336 3336
5 1214 1214 1214
6 672 672 672
7 688 688 688
8 711 711 711
9 675 675 675
10 3488 3488 3488
11 11853 11853 11853
12 9365 9365 9365
2013 1 13580 13580 13580
2 10037 10037 10037
3 6794 6794 6794
4 3706 3706 3706
5 1545 1545 1545
6 861 861 861
7 605 605 605
8 593 593 593
... ... ... ... ... ...
STATEN ISLAND 2013 12 261 261 261
2014 1 491 491 491
2 199 199 199
3 206 206 206
4 112 112 112
5 70 70 70
6 35 35 35
7 32 32 32
8 31 31 31
9 36 36 36
10 216 216 216
11 405 405 405
12 246 246 246
2015 1 402 402 402
2 523 523 523
3 216 216 216
4 91 91 91
5 58 58 58
6 31 31 31
7 24 24 24
8 31 31 31
9 31 31 31
10 261 261 261
11 250 250 250
12 230 230 230
2016 1 480 480 480
2 327 327 327
3 133 133 133
4 110 110 110
5 79 79 79

315 rows × 3 columns

In [21]:
boros = heatbydate.index
borodict = {x:[] for x in boros}
borodict.pop('Unspecified')

for boro in borodict:
    borodict[boro] = list(heatplotdata.xs(boro).Date)
In [22]:
plotdata = np.zeros(len(borodict['BROOKLYN']))
for boro in sorted(borodict.keys()):
    plotdata = np.row_stack((plotdata, borodict[boro]))
In [23]:
plotdata = np.delete(plotdata, (0), axis=0)
In [25]:
from matplotlib import patches as mpatches

x = np.arange(len(plotdata[0]))

#crude xlabels
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years = ['2011', '2012', '2013', '2014', '2015', '2016']
xlabels = []
for year in years:
    for month in months:
        xlabels.append("{0} {1}".format(month,year))

xlabels = xlabels[2:-7] #start from March 2011, end May 2016
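The nested loops plus slicing work, but pandas can generate the same labels directly from a monthly period range; a minimal sketch (assuming the same March 2011 through May 2016 window):

```python
import pandas as pd

# monthly periods from March 2011 through May 2016, inclusive
periods = pd.period_range('2011-03', '2016-05', freq='M')
xlabels = [p.strftime('%b %Y') for p in periods]  # e.g. 'Mar 2011'
```

This avoids having to count slice offsets by hand when the date window changes.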

A non-normalized plot:

The plot below is of raw complaint numbers, and is what we might expect: complaints about heat matter the most during the heating season!

In [26]:
plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,plotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.legend(legendcolors,sorted(borodict.keys()), bbox_to_anchor=(0.2, 1));
plt.title('Heating Complaints by Borough', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

Normalizing data:

Next, we're going to normalize these data. This allows us to better see whether the proportion of heating complaints coming from each borough changed over time (e.g., did one borough generate more or fewer complaints relative to the others?).

What we want to do is divide each of the five rows by the total number of complaints for that column (i.e., that month).
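Instead of looping over boroughs, the same column-wise normalization can be done in one step with numpy broadcasting; a sketch on a toy array (the real `plotdata` would take the place of `counts`):

```python
import numpy as np

# rows = boroughs, columns = months
counts = np.array([[10.0, 20.0],
                   [30.0, 20.0]])

# divide each column by its total, then scale to percentages
normed = counts / counts.sum(axis=0) * 100
```

Each column of `normed` sums to 100, which is exactly what the stacked percentage plot needs.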

In [27]:
totalcounts = heatdata.groupby(by=['Year', 'Month']).count().Date.values
In [40]:
boros = heatbydate.index
normdict = {x:[] for x in boros}
normdict.pop('Unspecified')

for boro in normdict:
    for i in range(0,len(plotdata[1])): # for all the values in each row
        normp = float(borodict[boro][i])/float(totalcounts[i])
        normdict[boro].append(normp*100)
In [41]:
normplotdata = np.zeros(len(borodict['BROOKLYN']))
for boro in sorted(normdict.keys()):
    normplotdata = np.row_stack((normplotdata, normdict[boro]))
In [42]:
normplotdata = np.delete(normplotdata,(0),axis = 0)
In [44]:
plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,normplotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.ylim(0,100)
plt.legend(legendcolors,sorted(normdict.keys()), bbox_to_anchor=(0.2, 1));
plt.title('Heating Complaints by Borough (normalized)', size = 24)
plt.ylabel('% of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

Overall, it looks like heating complaints were pretty consistent by borough. In the next post, I'm going to take a look at one way of addressing the aforementioned censoring problems.

Hack for Heat #5: Complaint resolution time

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40

Hack for Heat #5: How long do complaints take to resolve?

In this post, we're going to see if we can graph how long it takes for complaints to get resolved.

In [2]:
connection = psycopg2.connect('dbname= threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

cursor.execute('''SELECT createddate, closeddate, borough FROM service;''')
data = cursor.fetchall()
data = pd.DataFrame(data)
In [3]:
data.columns = ['createddate','closeddate','borough']
In [4]:
data = data.loc[data['createddate'].notnull()]
data = data.loc[data['closeddate'].notnull()]
In [5]:
data['timedelta'] = data['closeddate'] - data['createddate']
In [6]:
data['timedeltaint'] = [x.days for x in data['timedelta']]
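The list comprehension works here because psycopg2 returns plain Python `date` objects (so the columns are object-typed). If the columns are converted to pandas' own datetime64 type first, the subtraction yields a timedelta64 column and the `.days` conversion becomes a vectorized `.dt` call; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'createddate': pd.to_datetime(['2013-04-26', '2013-04-26']),
    'closeddate': pd.to_datetime(['2013-04-26', '2013-04-29']),
})
df['timedelta'] = df['closeddate'] - df['createddate']  # timedelta64 column
df['timedeltaint'] = df['timedelta'].dt.days            # vectorized, no Python loop
```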
In [7]:
data.head()
Out[7]:
createddate closeddate borough timedelta timedeltaint
0 2013-04-26 2013-04-26 QUEENS 0 days 0
1 2013-04-26 2013-04-29 MANHATTAN 3 days 3
2 2013-04-26 2013-04-30 MANHATTAN 4 days 4
3 2013-04-26 2013-04-26 QUEENS 0 days 0
4 2013-04-26 2013-04-30 STATEN ISLAND 4 days 4
In [8]:
data.groupby(by='borough')['timedeltaint'].mean()
Out[8]:
borough
BRONX            -3.658691
BROOKLYN         -6.835417
MANHATTAN       -20.325506
QUEENS           -4.121806
STATEN ISLAND    -2.108396
Unspecified       4.699638
Name: timedeltaint, dtype: float64

Oops! Looks like something's wrong. Let's try and find out:

In [9]:
data.sort_values('timedeltaint').head()
Out[9]:
createddate closeddate borough timedelta timedeltaint
604564 2016-03-16 1900-01-01 Unspecified -42443 days -42443
596082 2016-03-16 1900-01-01 Unspecified -42443 days -42443
605654 2016-03-16 1900-01-01 QUEENS -42443 days -42443
606458 2016-03-15 1900-01-01 MANHATTAN -42442 days -42442
552013 2016-03-14 1900-01-01 Unspecified -42441 days -42441
In [10]:
data.sort_values('timedeltaint', ascending=False).head()
Out[10]:
createddate closeddate borough timedelta timedeltaint
9392808 2010-04-08 2201-05-13 BRONX 69796 days 69796
9525584 2010-06-01 2201-06-17 BROOKLYN 69777 days 69777
1581464 2011-01-20 2201-03-25 BRONX 69460 days 69460
1536326 2015-11-03 2100-01-01 QUEENS 30740 days 30740
2479091 2013-01-10 2023-05-01 BROOKLYN 3763 days 3763

Ah. Well, as a first step, let's remove any values that are before Jan 1st 2010 or after today:

In [11]:
import datetime

today = datetime.date(2016,5,29)
janone = datetime.date(2010,1,1)

Let's also remove any rows where the close date is before the created date:

In [12]:
subdata = data.loc[(data['closeddate'] > janone) & (data['closeddate'] < today)]
subdata = subdata.loc[subdata['closeddate'] > subdata['createddate']]
In [13]:
len(subdata)
Out[13]:
7979757
In [14]:
subdata.sort_values('timedeltaint').head()
Out[14]:
createddate closeddate borough timedelta timedeltaint
11371297 2015-08-28 2015-08-29 QUEENS 1 days 1
7950080 2015-06-20 2015-06-21 BROOKLYN 1 days 1
2366882 2012-11-19 2012-11-20 MANHATTAN 1 days 1
2366879 2015-06-09 2015-06-10 Unspecified 1 days 1
2366874 2012-04-17 2012-04-18 BROOKLYN 1 days 1
In [15]:
subdata.sort_values('timedeltaint',ascending = False).head()
Out[15]:
createddate closeddate borough timedelta timedeltaint
10774084 2010-01-06 2016-03-24 BROOKLYN 2269 days 2269
1506622 2010-02-19 2016-05-02 BROOKLYN 2264 days 2264
1508205 2010-03-02 2016-05-04 STATEN ISLAND 2255 days 2255
513580 2010-01-01 2016-02-17 STATEN ISLAND 2238 days 2238
589865 2010-01-22 2016-02-25 BRONX 2225 days 2225

This looks a little bit more realistic, but let's also visualize the distribution:

In [16]:
plotdata = list(subdata['timedeltaint'])
In [17]:
plt.figure(figsize=(12,10))
plt.hist(plotdata);

Okay, this still looks really wonky. Let's further subset the data, and see what happens when we remove the top and bottom 2.5%.

Pandas has a quantile function:

In [18]:
subdata.quantile([.025, .975])
Out[18]:
timedeltaint
0.025 1
0.975 138
In [19]:
quantcutdata = subdata.loc[(subdata['timedeltaint'] > 1) & (subdata['timedeltaint'] < 138) ]
In [20]:
len(quantcutdata)
Out[20]:
6278707
In [21]:
plotdata = list(quantcutdata['timedeltaint'])

plt.figure(figsize=(12,10))
plt.hist(plotdata);

That looks a little better, but there might be other ways to filter out bad data.
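One common alternative to trimming fixed percentiles is the Tukey fences rule: drop anything more than 1.5 × IQR beyond the quartiles. A sketch on a toy series (the real `subdata['timedeltaint']` would take its place):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # 100 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)
filtered = s[mask]
```

Unlike the fixed 2.5% cut, this adapts the cutoffs to the spread of the data rather than always discarding a fixed fraction.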

In [22]:
subdata.groupby(by='borough').median()
Out[22]:
timedeltaint
borough
BRONX 5
BROOKLYN 5
MANHATTAN 6
QUEENS 5
STATEN ISLAND 4
Unspecified 5
In [23]:
subdata.groupby(by='borough').mean()
Out[23]:
timedeltaint
borough
BRONX 18.789025
BROOKLYN 20.725656
MANHATTAN 20.841213
QUEENS 23.746771
STATEN ISLAND 26.283222
Unspecified 11.652647

So, I definitely wouldn't trust these data right now - I'm still working on finding a data dictionary for the 311 data, and I need to make sure the columns mean what I think they mean. The point of this is just to show what the process would be once I get my hands on reliable data.

Hack for Heat #4: Complaints by borough and date

In [2]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40

Hack for Heat #4: Number of complaints over time pt.2

In this post, we're going to look at the number of complaints each borough received for the last five or so years. First, let's look at the total number of complaints received:

In [3]:
connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

Borough complaints by date

In [4]:
cursor.execute('''SELECT createddate, borough FROM service;''')
borodata = cursor.fetchall()
In [5]:
borodata = pd.DataFrame(borodata)
In [6]:
borodata.columns = ['Date', 'Boro']
In [7]:
borobydate = borodata.groupby(by='Boro').count()
In [8]:
borobydate
Out[8]:
Date
Boro
BRONX 1882215
BROOKLYN 3178206
MANHATTAN 2166983
QUEENS 2444874
STATEN ISLAND 541935
Unspecified 1157085

For fun, let's look at the number of complaints per capita in each of the 5 boroughs. The population values below are from Wikipedia.

In [9]:
boropop = {
    'MANHATTAN': 1636268,
    'BRONX': 1438159,
    'BROOKLYN': 2621793,
    'QUEENS': 2321580,
    'STATEN ISLAND': 473279,
    }
In [10]:
borobydate['Pop'] = [boropop.get(x) for x in borobydate.index]
In [11]:
borobydate['CompPerCap'] = borobydate['Date']/borobydate['Pop']
In [12]:
borobydate
Out[12]:
Date Pop CompPerCap
Boro
BRONX 1882215 1438159 1.308767
BROOKLYN 3178206 2621793 1.212226
MANHATTAN 2166983 1636268 1.324345
QUEENS 2444874 2321580 1.053108
STATEN ISLAND 541935 473279 1.145065
Unspecified 1157085 NaN NaN

Complaints by borough over months

The next thing we're going to do is make a stacked plot of complaints by borough, over months. To do this, we need to extract the day and month from the date column:

In [13]:
borodata['Year'] = [x.year for x in borodata['Date']]
borodata['Month'] = [x.month for x in borodata['Date']]
In [14]:
borodata = borodata.loc[borodata['Boro'] != 'Unspecified']

Next, we need to generate an array of Ys. We want the rows of this dataframe to be the 5 boroughs, and the columns to be the count of complaints for the year and month.

In [15]:
boroplotdata = borodata.groupby(by=['Boro', 'Year','Month']).count()
boroplotdata
Out[15]:
Date
Boro Year Month
BRONX 2010 1 9395
2 9475
3 13616
4 11786
5 12169
6 15681
7 16861
8 12920
9 11633
10 10590
11 9196
12 8815
2011 1 9599
2 11229
3 26772
4 22898
5 22722
6 23677
7 27348
8 25241
9 25138
10 30071
11 28319
12 30927
2012 1 33097
2 25259
3 24480
4 20330
5 21851
6 24170
... ... ... ...
STATEN ISLAND 2013 12 4984
2014 1 8753
2 8978
3 8211
4 7937
5 8278
6 8173
7 8111
8 6934
9 6597
10 7067
11 5558
12 5889
2015 1 6398
2 8228
3 10537
4 8489
5 8240
6 8803
7 8507
8 7684
9 7920
10 7592
11 6756
12 6264
2016 1 7186
2 7505
3 8531
4 8750
5 6688

385 rows × 1 columns

We basically need to get the above table into a graph.
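Rather than building the matrix by hand with dictionaries and `np.row_stack` (as below), the grouped table can be pivoted directly with `unstack`; a sketch on a toy frame shaped like `borodata`:

```python
import pandas as pd

toy = pd.DataFrame({
    'Boro': ['BRONX', 'BRONX', 'QUEENS', 'QUEENS'],
    'Year': [2010, 2010, 2010, 2010],
    'Month': [1, 2, 1, 2],
    'Date': ['a', 'b', 'c', 'd'],  # placeholder values; only the count matters
})

counts = toy.groupby(['Boro', 'Year', 'Month']).count()
# one row per borough, one column per (Year, Month) pair
wide = counts['Date'].unstack(['Year', 'Month'])
```

`wide.values` would then be the 2-D array of counts that `stackplot` expects.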

In [16]:
boros = borobydate.index
borodict = {x:[] for x in boros}
borodict.pop('Unspecified')

for boro in borodict:
    borodict[boro] = list(boroplotdata.xs(boro).Date)
In [17]:
plotdata = np.zeros(len(borodict['BROOKLYN']))
for boro in borodict:
    plotdata = np.row_stack((plotdata, borodict[boro]))
In [18]:
plotdata = np.delete(plotdata, (0), axis=0)
In [19]:
plotdata
Out[19]:
array([[  5804.,   6791.,  11009.,   6578.,   6590.,   6370.,   5909.,
          5957.,   5926.,   5597.,   4823.,   4715.,   6008.,   7360.,
          8109.,   6836.,   7091.,   7257.,   6380.,   9858.,   8268.,
          8356.,   6778.,   5799.,   6169.,   5042.,   6018.,   5607.,
          6746.,   6839.,   7037.,   7024.,   6108.,   7036.,   7509.,
          5445.,   7207.,   5851.,   6352.,   6915.,   6843.,   7392.,
          7552.,   6442.,   5936.,   6152.,   4996.,   4984.,   8753.,
          8978.,   8211.,   7937.,   8278.,   8173.,   8111.,   6934.,
          6597.,   7067.,   5558.,   5889.,   6398.,   8228.,  10537.,
          8489.,   8240.,   8803.,   8507.,   7684.,   7920.,   7592.,
          6756.,   6264.,   7186.,   7505.,   8531.,   8750.,   6688.],
       [ 21753.,  22463.,  30847.,  25357.,  26545.,  27245.,  29419.,
         26279.,  26287.,  24483.,  22788.,  22361.,  22895.,  24331.,
         43807.,  41075.,  39214.,  41555.,  42120.,  46155.,  44927.,
         47186.,  44536.,  44451.,  46493.,  36073.,  37645.,  35357.,
         38451.,  38776.,  41169.,  39849.,  36523.,  41004.,  47840.,
         42908.,  50532.,  40369.,  40716.,  38720.,  40310.,  41070.,
         41466.,  37545.,  34560.,  40456.,  42451.,  42784.,  60679.,
         49657.,  45848.,  40934.,  46025.,  46460.,  46283.,  41112.,
         41180.,  45079.,  47862.,  46171.,  52516.,  55555.,  55122.,
         47002.,  47346.,  51835.,  51154.,  47652.,  47385.,  52645.,
         48386.,  45544.,  56649.,  52388.,  50952.,  52751.,  44888.],
       [  9395.,   9475.,  13616.,  11786.,  12169.,  15681.,  16861.,
         12920.,  11633.,  10590.,   9196.,   8815.,   9599.,  11229.,
         26772.,  22898.,  22722.,  23677.,  27348.,  25241.,  25138.,
         30071.,  28319.,  30927.,  33097.,  25259.,  24480.,  20330.,
         21851.,  24170.,  25939.,  22423.,  19910.,  23060.,  29396.,
         26587.,  33031.,  28243.,  24956.,  22083.,  20914.,  22673.,
         26734.,  20546.,  19060.,  23673.,  29511.,  30904.,  43650.,
         33274.,  29464.,  24904.,  24040.,  24024.,  26418.,  22771.,
         21910.,  25771.,  33015.,  31056.,  36197.,  37733.,  32791.,
         26419.,  24552.,  27103.,  29387.,  26814.,  25613.,  30593.,
         29559.,  28256.,  39647.,  36866.,  30192.,  29880.,  25408.],
       [ 15659.,  14831.,  18597.,  19642.,  20490.,  20316.,  21347.,
         19256.,  19772.,  19534.,  18897.,  14786.,  14499.,  17235.,
         27956.,  26018.,  26449.,  25271.,  25531.,  25630.,  27482.,
         31265.,  29656.,  28735.,  29867.,  24649.,  27054.,  24528.,
         25844.,  26173.,  25093.,  25044.,  23468.,  26678.,  28202.,
         27837.,  32110.,  27178.,  28061.,  26816.,  27207.,  25809.,
         26994.,  25026.,  24760.,  29790.,  30731.,  28005.,  37394.,
         29569.,  30384.,  29098.,  30921.,  30898.,  31221.,  28214.,
         28754.,  32569.,  33913.,  30960.,  34194.,  34262.,  33343.,
         33626.,  33555.,  33952.,  33896.,  34326.,  34295.,  40630.,
         37755.,  33389.,  40881.,  37560.,  40559.,  43912.,  37175.],
       [ 22951.,  22336.,  32976.,  27739.,  28528.,  28287.,  29154.,
         27223.,  34110.,  26295.,  22845.,  21670.,  22964.,  24969.,
         32932.,  29108.,  29876.,  31333.,  31231.,  38856.,  33368.,
         31570.,  29785.,  28166.,  29217.,  23758.,  26756.,  25594.,
         28361.,  30195.,  31792.,  31756.,  28628.,  34079.,  39404.,
         29450.,  33078.,  28121.,  27475.,  27384.,  29037.,  29633.,
         31044.,  27880.,  26220.,  28053.,  27561.,  28286.,  38144.,
         36074.,  35977.,  31202.,  34711.,  35843.,  36394.,  31615.,
         31129.,  32470.,  33042.,  31788.,  34332.,  38725.,  44965.,
         37836.,  36885.,  40637.,  39691.,  38212.,  36171.,  38864.,
         34327.,  34463.,  38982.,  38579.,  38548.,  40171.,  34063.]])

Awesome! Now we have 5 rows with 77 columns each denoting the complaints for each of the boros for each of the months from 2010 to 2016.

In [20]:
from matplotlib import patches as mpatches

x = np.arange(len(plotdata[0]))

#crude xlabels
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years = ['2010', '2011', '2012', '2013', '2014', '2015', '2016']
xlabels = []
for year in years:
    for month in months:
        xlabels.append("{0} {1}".format(month,year))

plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,plotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,76)
plt.legend(legendcolors,borodict.keys(), bbox_to_anchor=(0.2, 1));
plt.title('311 Complaints by Borough', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')
C:\Users\Bryan\Anaconda2\lib\site-packages\matplotlib\collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

Sanity checks

So, some parts of this graph bear checking. First, did Staten Island really not increase in complaints over the years? The data below (complaints by borough by year) suggest that that was the case:

In [21]:
borodata.groupby(by = ['Boro', 'Year']).count()
Out[21]:
Date Month
Boro Year
BRONX 2010 142137 142137
2011 283941 283941
2012 296502 296502
2013 302328 302328
2014 340297 340297
2015 355017 355017
2016 161993 161993
BROOKLYN 2010 305827 305827
2011 482252 482252
2012 482088 482088
2013 490979 490979
2014 557290 557290
2015 602142 602142
2016 257628 257628
MANHATTAN 2010 223127 223127
2011 305727 305727
2012 314437 314437
2013 332487 332487
2014 373895 373895
2015 417223 417223
2016 200087 200087
QUEENS 2010 324114 324114
2011 364158 364158
2012 358990 358990
2013 343772 343772
2014 408389 408389
2015 455108 455108
2016 190343 190343
STATEN ISLAND 2010 76069 76069
2011 88100 88100
2012 76580 76580
2013 76622 76622
2014 90486 90486
2015 95418 95418
2016 38660 38660

The other thing that would be worth investigating is what happened between February and March 2011, when complaints doubled. Perhaps some aspect of the pipeline changed then, but that's a story for a future post.

Hack for Heat #3: Number of complaints over time

In [3]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40

Hack for Heat #3: Number of complaints over time

This time, we're going to look at raw 311 complaint data. The data that I was working with previously was summarized data.

This dataset is much bigger, which is nice because it'll give me a chance to maintain my SQL-querying-from-memory skills.

First, we're going to have to load all of this data into a postgres database.

SQL-ing this

The python library psycopg2 lets us work with postgres databases in python. We first create a connection object that encapsulates the connection to the database, then create a cursor object that lets us run queries against that database.

In [22]:
connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

For example, we might want to extract the column names from our table:

In [23]:
cursor.execute('''SELECT * FROM threeoneone.INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'service'; ''')
columns = cursor.fetchall()
In [24]:
columns = [x[3] for x in columns]
In [25]:
columns[0:5]
Out[25]:
['uniquekey', 'createddate', 'closeddate', 'agency', 'agencyname']

Complaints over time

Let's start with something simple. First, let's extract a list of all complaints, and the plot the number of complaints by month.

In [27]:
cursor.execute('''SELECT createddate FROM service;''')
complaintdates = cursor.fetchall()
In [32]:
complaintdates = pd.DataFrame(complaintdates)
In [33]:
complaintdates.head()
Out[33]:
0
0 (2013-04-26,)
1 (2013-04-26,)
2 (2013-04-26,)
3 (2013-04-26,)
4 (2013-04-26,)

Renaming our column:

In [42]:
complaintdates.columns = ['Date']

Next, we have to unpack these one-element tuples:

In [51]:
complaintdates['Date'] = [x [0] for x in complaintdates['Date']]

Normally, if these were strings, we'd use the extract_dates function we wrote in a previous post. However, because I typed these as datetime objects, we can just read their .year, .month, and .day attributes:

In [60]:
type(complaintdates['Date'][0])
Out[60]:
datetime.date
In [63]:
complaintdates['Day'] = [x.day for x in complaintdates['Date']]
complaintdates['Month'] = [x.month for x in complaintdates['Date']]
complaintdates['Year'] = [x.year for x in complaintdates['Date']]
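As an aside, converting the column to pandas' datetime64 type with `pd.to_datetime` unlocks the vectorized `.dt` accessor, which avoids the per-row comprehensions; a minimal sketch:

```python
import datetime
import pandas as pd

df = pd.DataFrame({'Date': [datetime.date(2013, 4, 26),
                            datetime.date(2014, 5, 1)]})

dates = pd.to_datetime(df['Date'])  # datetime.date -> datetime64
df['Year'] = dates.dt.year
df['Month'] = dates.dt.month
df['Day'] = dates.dt.day
```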

This is how many total complaints we have:

In [67]:
len(complaintdates)
Out[67]:
11371298

We can group them by month:

In [69]:
bymonth = complaintdates.groupby(by='Month').count()
bymonth
Out[69]:
Date Year Day
Month
1 1182682 1182682 1182682
2 1065065 1065065 1065065
3 1099983 1099983 1099983
4 985815 985815 985815
5 972913 972913 972913
6 849832 849832 849832
7 875210 875210 875210
8 829573 829573 829573
9 805231 805231 805231
10 895672 895672 895672
11 915066 915066 915066
12 894256 894256 894256

By year:

In [71]:
byyear = complaintdates.groupby(by='Year').count()
byyear
Out[71]:
Date Month Day
Year
2010 1754355 1754355 1754355
2011 1688091 1688091 1688091
2012 1568421 1568421 1568421
2013 1614207 1614207 1614207
2014 1839574 1839574 1839574
2015 2021963 2021963 2021963
2016 884687 884687 884687
In [88]:
byday = complaintdates.groupby(by='Day').count()
bydate = complaintdates.groupby(by='Date').count()

Some matplotlib

In [70]:
plt.figure(figsize = (12,10))
x = range(0,12)
y = bymonth['Date']
plt.plot(x,y)
Out[70]:
[<matplotlib.lines.Line2D at 0xa7a0ed68>]
In [78]:
plt.figure(figsize = (12,10))
x = range(0,7)
y = byyear['Date']

plt.plot(x,y)
Out[78]:
[<matplotlib.lines.Line2D at 0xa1925748>]
In [86]:
plt.figure(figsize = (12,10))
x = range(0,len(byday))
y = byday['Date']

plt.plot(x,y)
Out[86]:
[<matplotlib.lines.Line2D at 0x8c9c83c8>]

The sharp decline we see at the end is obviously because not all months have the same number of days.

In [94]:
plt.figure(figsize=(12,10))
x = range(0,len(bydate))
y = bydate['Year'] #This is arbitrary - year, month, and day are all series that store the counts

plt.plot(x,y)
Out[94]:
[<matplotlib.lines.Line2D at 0xa0ed9860>]

That's all for now. In the next post, I'm going to break this down by borough, as well as polish this graph.

Hack for Heat #2: Problem types over time

In [1]:
%matplotlib inline

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

pd.options.display.max_columns = 40

Hack for Heat #2: Problem types over time

In this post, I'm going to explore how we might track the composition of problems that HPD (the Department of Housing Preservation and Development) might receive over time. The data that I currently have are only limited to 2.5 years, so this is just a proof of concept.

In [2]:
hpdcompprob = pd.read_csv("Complaint_Problems.csv")
In [3]:
type(hpdcompprob.StatusDate[0])
Out[3]:
str

The first thing we're going to do is convert string dates into simple years, months, and days.

In [6]:
def extract_dates(series, nameofdatecolumn):
    series['Year'] = series[nameofdatecolumn].str.extract(r'../../(.*)', expand=True)
    series['Month'] = series[nameofdatecolumn].str.extract(r'(.*)/../....', expand=True)
    series['Day'] = series[nameofdatecolumn].str.extract(r'../(.*)/....', expand=True)
    
extract_dates(hpdcompprob, 'StatusDate')
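The regexes above assume a fixed MM/DD/YYYY layout; `pd.to_datetime` parses the same format in one call. A sketch (the `StatusDate` values here are made up for illustration; note this yields integers rather than the string columns the regex version produces):

```python
import pandas as pd

df = pd.DataFrame({'StatusDate': ['07/29/2014', '07/12/2014']})

parsed = pd.to_datetime(df['StatusDate'], format='%m/%d/%Y')
df['Year'] = parsed.dt.year
df['Month'] = parsed.dt.month
df['Day'] = parsed.dt.day
```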
In [7]:
def count_problem_types(pdseries):
    problemtypes = pdseries.unique()
    problemcounts = {k:0 for k in problemtypes}
    problemcounts['TOTAL'] = 0
    
    for problem in pdseries:
        problemcounts[problem] += 1
    
    problemcounts['TOTAL'] = len(pdseries)
    print year, problemcounts #relies on the global 'year' set by the loop below; just shows us what those counts are
        
    return problemcounts
In [8]:
#years = hpdcompprob.Year.unique()
years = ['2014','2015','2016'] #We know that only these data are relevant (the other, older years, are remnants of old cases)
yearsdict = {k:{} for k in years}
for year in years:
    yearsdict[year] = count_problem_types(hpdcompprob.loc[hpdcompprob.Year == year].MajorCategory)
2014 {'TOTAL': 259437, 'ELECTRIC': 12684, 'HEAT/HOT WATER': 93548, 'WATER LEAK': 14315, 'APPLIANCE': 5231, 'DOOR/WINDOW': 16352, 'HEATING': 16, 'ELEVATOR': 363, 'GENERAL': 9680, 'FLOORING/STAIRS': 11768, 'NONCONST': 185, 'UNSANITARY CONDITION': 34779, 'SAFETY': 5521, 'PLUMBING': 23179, 'CONSTRUCTION': 3, 'OUTSIDE BUILDING': 758, 'PAINT/PLASTER': 31055}
2015 {'TOTAL': 705644, 'PAINT/PLASTER': 84062, 'WATER LEAK': 42512, 'DOOR/WINDOW': 49154, 'APPLIANCE': 14719, 'NONCONST': 69, 'ELEVATOR': 1213, 'GENERAL': 31482, 'HEAT/HOT WATER': 228157, 'FLOORING/STAIRS': 34741, 'UNSANITARY CONDITION': 97221, 'SAFETY': 17105, 'PLUMBING': 67472, 'CONSTRUCTION': 1, 'OUTSIDE BUILDING': 2183, 'ELECTRIC': 35553}
2016 {'TOTAL': 211830, 'ELECTRIC': 9239, 'WATER LEAK': 12063, 'DOOR/WINDOW': 15465, 'APPLIANCE': 4644, 'HEATING': 39, 'ELEVATOR': 377, 'GENERAL': 9738, 'HEAT/HOT WATER': 73803, 'NONCONST': 33, 'UNSANITARY CONDITION': 27727, 'SAFETY': 5243, 'PLUMBING': 20218, 'FLOORING/STAIRS': 9989, 'OUTSIDE BUILDING': 543, 'PAINT/PLASTER': 22709}
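The hand-rolled counting function above can also be expressed with pandas' `value_counts`; a sketch on toy data shaped like `hpdcompprob` (it omits the 'TOTAL' key, which would be `len(group)` per year):

```python
import pandas as pd

toy = pd.DataFrame({
    'Year': ['2014', '2014', '2015'],
    'MajorCategory': ['HEAT/HOT WATER', 'PLUMBING', 'HEAT/HOT WATER'],
})

# one dict of category counts per year
yearsdict = {
    year: group['MajorCategory'].value_counts().to_dict()
    for year, group in toy.groupby('Year')
}
```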

Graphing our results

We're going to use matplotlib to make a stacked area graph. I referenced this post.

First, let's arrange our dictionary keys in alphabetical order, so that we can extract the values in a meaningful way.

In [9]:
labels = sorted(yearsdict['2014'].keys())
labels.remove('CONSTRUCTION')
labels.remove('HEATING') #removed these because 2015 doesn't have an entry, and both the other years have negligible entries
labels.remove('TOTAL') #not interested in total in this plot
In [10]:
years = ['2014','2015','2016']
plotdict = {k:[] for k in years}
for year in years:
    for label in labels:
        plotdict[year].append(yearsdict[year][label])
In [11]:
len(labels)
Out[11]:
14
In [12]:
x = np.arange(len(years))
y = np.column_stack((plotdict['2014'],plotdict['2015'],plotdict['2016']))
plt.figure(figsize = (15,10));
plt.stackplot(x,y);
plt.legend(labels)
Out[12]:
<matplotlib.legend.Legend at 0xeb24ac8>

This might legit be the ugliest plot I have ever seen. Let's try and fix the colors first.

From this blog post, I got a list of 14 contrasting colors.

In [13]:
from matplotlib import patches as mpatches

plotcolors = [(0,0,0),(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(x[0]/float(255),x[1]/float(255),x[2]/float(255)) for x in plotcolors]
legendcolors = [mpatches.Patch(color = x) for x in plotcolors]

x = np.arange(len(years))
y = np.column_stack((plotdict['2014'],plotdict['2015'],plotdict['2016']))
plt.figure(figsize = (15,10));
plt.stackplot(x,y,colors = plotcolors);
plt.legend(legendcolors, labels);

#some labels for the x axis
xlabels = ['2014', '2015', '2016']
plt.xticks(x,xlabels, size = 18);
plt.yticks(size = 18);
plt.title('HPD Complaints by type over time', size = 24);

Okay, so this looks a little better, and with some tweaking, we can make this work.

Hack for Heat #1: Initial thoughts on HPD open data

Hacking for heat

In this series, I'm going to be posting about the process that goes on behind some of the blog posts we end up writing. In this first entry, I'm going to be exploring a number of datasets.

These are the ones that I'm going to be looking at:

  1. HPD (Housing Preservation and Development) housing litigations
  2. Housing maintenance code complaints
  3. Housing maintenance code violations
  4. HPD complaints

(for HPD datasets, some documentation can be found here)

HPD litigation database

First, we're going to look at the smallest dataset, one that contains cases against landlords. From the documentation, this file contains "All cases commenced by HPD or by tenants (naming HPD as a party) in [housing court] since August 2006" either seeking orders for landlords to comply with regulations, or awarding HPD civil penalties (i.e., collecting on fines).

In [2]:
import pandas as pd
In [3]:
litigation = pd.read_csv("Housing_Litigations.csv")
In [5]:
litigation.head()
Out[5]:
LitigationID BuildingID BoroID Boro HouseNumber StreetName Zip Block Lot CaseType CaseOpenDate CaseStatus CaseJudgement
0 98385 806633 2 BRONX 866 EAST 221 STREET 10467.0 4679 73 Tenant Action 04/23/2009 CLOSED NO
1 98387 55134 2 BRONX 3940 CARPENTER AVENUE 10466.0 4825 51 Tenant Action 04/23/2009 CLOSED NO
2 98389 119044 2 BRONX 1356 WALTON AVENUE 10452.0 2841 41 Tenant Action 04/23/2009 CLOSED NO
3 98391 889796 2 BRONX 871 EAST 179 STREET 10460.0 3123 77 Tenant Action 04/24/2009 CLOSED NO
4 98393 56949 2 BRONX 1680 CLAY AVENUE 10457.0 2889 1 Tenant Action 04/24/2009 CLOSED NO

Let's take a look at unique values for some of the columns:

In [6]:
litigation['Boro'].unique()
Out[6]:
array(['BRONX', 'QUEENS', 'MANHATTAN', 'BROOKLYN', 'STATEN ISLAND'], dtype=object)
In [14]:
litigation.groupby(by = ['Boro','CaseJudgement']).count()
Out[14]:
LitigationID BuildingID BoroID HouseNumber StreetName Zip Block Lot CaseType CaseOpenDate CaseStatus
Boro CaseJudgement
BRONX NO 19734 19734 19734 19734 19734 19734 19734 19734 19734 19701 19734
YES 481 481 481 481 481 481 481 481 481 478 481
BROOKLYN NO 19833 19833 19833 19833 19833 19831 19833 19833 19833 19807 19833
YES 762 762 762 762 762 760 762 762 762 762 762
MANHATTAN NO 11849 11849 11849 11849 11849 11849 11849 11849 11849 11841 11849
YES 129 129 129 129 129 129 129 129 129 129 129
QUEENS NO 9068 9068 9068 9068 9068 9061 9068 9068 9068 9063 9068
YES 313 313 313 313 313 313 313 313 313 308 313
STATEN ISLAND NO 1064 1064 1064 1064 1064 1064 1064 1064 1064 1064 1064
YES 84 84 84 84 84 84 84 84 84 84 84

The above table tells us that Manhattan has the lowest proportion of cases that receive judgement (about 1 in 80), whereas Staten Island has the highest (about 1 in 12). It may be something worth looking into, but it's also important to note that many cases settle out of court, and landlords in Manhattan may be more willing (or able) to do so.

In [7]:
litigation['CaseType'].unique()
Out[7]:
array(['Tenant Action', 'Heat and Hot Water', 'CONH',
       'Comp Supplemental Cases', 'Comprehensive',
       'Access Warrant - Non-Lead', 'Access Warrant - lead',
       'Lead False Certification', 'False Certification Non-Lead',
       'Heat Supplemental Cases', '7A', 'Failure to Register Only',
       'HLD - Other Case Type'], dtype=object)
In [16]:
litigation.groupby(by = ['CaseType', 'CaseJudgement']).count()
Out[16]:
LitigationID BuildingID BoroID Boro HouseNumber StreetName Zip Block Lot CaseOpenDate CaseStatus
CaseType CaseJudgement
7A NO 105 105 105 105 105 105 105 105 105 102 105
Access Warrant - Non-Lead NO 5665 5665 5665 5665 5665 5665 5665 5665 5665 5665 5665
YES 8 8 8 8 8 8 8 8 8 8 8
Access Warrant - lead NO 1786 1786 1786 1786 1786 1786 1786 1786 1786 1785 1786
CONH NO 797 797 797 797 797 797 797 797 797 797 797
Comp Supplemental Cases NO 750 750 750 750 750 750 750 750 750 702 750
YES 31 31 31 31 31 31 31 31 31 29 31
Comprehensive NO 2274 2274 2274 2274 2274 2274 2274 2274 2274 2274 2274
YES 121 121 121 121 121 121 121 121 121 121 121
Failure to Register Only NO 60 60 60 60 60 60 60 60 60 60 60
YES 3 3 3 3 3 3 3 3 3 3 3
False Certification Non-Lead NO 2355 2355 2355 2355 2355 2355 2355 2355 2355 2355 2355
YES 20 20 20 20 20 20 20 20 20 20 20
HLD - Other Case Type NO 3 3 3 3 3 3 3 3 3 2 3
Heat Supplemental Cases NO 124 124 124 124 124 124 124 124 124 108 124
YES 7 7 7 7 7 7 6 7 7 6 7
Heat and Hot Water NO 15760 15760 15760 15760 15760 15760 15758 15760 15760 15757 15760
YES 1299 1299 1299 1299 1299 1299 1298 1299 1299 1294 1299
Lead False Certification NO 239 239 239 239 239 239 239 239 239 239 239
YES 2 2 2 2 2 2 2 2 2 2 2
Tenant Action NO 31630 31630 31630 31630 31630 31630 31623 31630 31630 31630 31630
YES 278 278 278 278 278 278 278 278 278 278 278

The table above shows the same case judgement proportions, this time conditioned on case type. Unhelpfully, the documentation does not specify the difference between Access Warrant - lead and Non-Lead. There are two possibilities: the first is that "lead" refers to lead-based paint, which is a common problem, but perhaps still too idiosyncratic to have its own warrant type; the second, perhaps more likely, is whether or not HPD was the lead party in the case.

We'll probably end up using these data by aggregating them and examining how complaints change over time, perhaps as a function of case type. There's also the possibility of looking up specific buildings' complaints and tying them to landlords. It should also be straightforward to join this dataset with others by converting the address information into something standardized, like borough-block-lot (BBL; http://www1.nyc.gov/nyc-resources/service/1232/borough-block-lot-bbl-lookup).
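Since a BBL is just the one-digit borough code followed by a zero-padded five-digit block and four-digit lot, constructing it from the columns already in these tables is a few lines of pandas. A sketch, using hypothetical rows in place of the real data:

```python
import pandas as pd

# Hypothetical rows with the borough/block/lot fields these datasets carry
df = pd.DataFrame({'BoroID': [1, 3],
                   'Block': [1904, 2073],
                   'Lot': [4, 29]})

# BBL = 1-digit borough code + 5-digit block + 4-digit lot, zero-padded
df['BBL'] = (df['BoroID'].astype(str)
             + df['Block'].astype(str).str.zfill(5)
             + df['Lot'].astype(str).str.zfill(4))
print(df['BBL'].tolist())  # e.g. ['1019040004', '3020730029']
```

With a BBL column on each table, joins against other city datasets keyed the same way become simple merges.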

HPD complaints

Next, we're going to look at a dataset of HPD complaints.

In [32]:
hpdcomp = pd.read_csv('Housing_Maintenance_Code_Complaints.csv')
In [34]:
hpdcomp.head()
Out[34]:
ComplaintID BuildingID BoroughID Borough HouseNumber StreetName Zip Block Lot Apartment CommunityBoard ReceivedDate StatusID Status StatusDate
0 6960137 3418 1 MANHATTAN 1989 ADAM C POWELL BOULEVARD 10026.0 1904 4 12D 10 07/07/2014 2 CLOSE 07/29/2014
1 6960832 3512 1 MANHATTAN 2267 ADAM C POWELL BOULEVARD 10030.0 1918 4 3B 10 07/08/2014 2 CLOSE 07/12/2014
2 6946867 5318 1 MANHATTAN 778 11 AVENUE 10019.0 1083 1 4P 4 06/19/2014 2 CLOSE 07/13/2014
3 6966946 5608 1 MANHATTAN 1640 AMSTERDAM AVENUE 10031.0 2073 29 5A 9 07/16/2014 2 CLOSE 07/21/2014
4 6956574 17896 1 MANHATTAN 230 EAST 88 STREET 10128.0 1533 32 1E 8 07/01/2014 2 CLOSE 07/09/2014
In [39]:
len(hpdcomp)
Out[39]:
662672

This dataset is less useful on its own. It doesn't tell us what type of complaint was made, only the date it was received and whether the complaint is still open. However, it may be useful in conjunction with the earlier dataset. For example, we might be interested in how many of these complaints end up in court (or at least, have some sort of legal action taken).
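That cross-dataset question comes down to a left merge on `BuildingID`, which both tables share. A minimal sketch with hypothetical miniature tables standing in for the real ones:

```python
import pandas as pd

# Hypothetical miniatures of the two tables, linked by BuildingID
hpdcomp = pd.DataFrame({'ComplaintID': [1, 2, 3],
                        'BuildingID': [3418, 3512, 9999]})
litigation = pd.DataFrame({'LitigationID': [10, 11],
                           'BuildingID': [3418, 3512]})

# Left-merge complaints onto litigation; rows with no match get NaN
merged = hpdcomp.merge(litigation, on='BuildingID', how='left')

# Complaints against buildings that show up in the litigation data
n_litigated = merged['LitigationID'].notnull().sum()
print(n_litigated)
```

Note this only links complaints to buildings with *any* litigation, not to the specific case a complaint led to; matching on dates as well would tighten that up.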

HPD violations

The following dataset tracks HPD violations.

In [35]:
hpdviol = pd.read_csv('Housing_Maintenance_Code_Violations.csv')
In [36]:
hpdviol.head()
Out[36]:
ViolationID BuildingID RegistrationID BoroID Boro HouseNumber LowHouseNumber HighHouseNumber StreetName StreetCode ... NewCertifyByDate NewCorrectByDate CertifiedDate OrderNumber NOVID NOVDescription NOVIssuedDate CurrentStatusID CurrentStatus CurrentStatusDate
0 2293208 444 130476 1 MANHATTAN 22 22 22 1 AVENUE 10010 ... NaN NaN NaN 772 338596.0 § 27-2098 ADM CODE FILE WITH THIS DEPARTMENT ... 04/22/1997 19 VIOLATION CLOSED 03/10/2015
1 2293181 444 130476 1 MANHATTAN 22 22 22 1 AVENUE 10010 ... NaN NaN 09/10/1974 775 338584.0 D26-41.05 ADM CODE FILE WITH THIS DEPARTMENT R... 07/16/1974 19 VIOLATION CLOSED 03/10/2015
2 2293249 448 135326 1 MANHATTAN 2222 2222 2222 1 AVENUE 10010 ... NaN NaN NaN 772 338619.0 § 27-2098 ADM CODE FILE WITH THIS DEPARTMENT ... 11/19/1996 19 VIOLATION CLOSED 03/10/2015
3 2293486 467 136913 1 MANHATTAN 2250 2250 2250 1 AVENUE 10010 ... NaN NaN NaN 772 338733.0 D26-41.03 ADM CODE FILE WITH THIS DEPARTMENT A... 05/10/1982 19 VIOLATION CLOSED 03/10/2015
4 2293490 467 136913 1 MANHATTAN 2250 2250 2250 1 AVENUE 10010 ... NaN NaN NaN 772 338737.0 § 27-2098 ADM CODE FILE WITH THIS DEPARTMENT ... 11/19/1996 19 VIOLATION CLOSED 03/10/2015

5 rows × 30 columns

In [40]:
len(hpdviol)
Out[40]:
1437246

These datasets all have different lengths, but that's not surprising, given that they cover different time spans. One productive initial step would be to convert the date strings into something numerical.
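All of these tables store dates as MM/DD/YYYY strings, so the conversion is one `pd.to_datetime` call. A sketch on a couple of hypothetical values:

```python
import pandas as pd

# Hypothetical date strings in the MM/DD/YYYY format these datasets use
dates = pd.Series(['07/07/2014', '04/22/1997'])

# Parse into proper datetimes; an explicit format is faster and safer
parsed = pd.to_datetime(dates, format='%m/%d/%Y')
print(parsed.dt.year.tolist())  # [2014, 1997]
```

Once parsed, the `.dt` accessor gives year, month, and day components for free, which is all the aggregation over time will need.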

HPD complaint problems database

In [45]:
hpdcompprob = pd.read_csv('Complaint_Problems.csv')
In [48]:
hpdcompprob.head()
Out[48]:
ProblemID ComplaintID UnitTypeID UnitType SpaceTypeID SpaceType TypeID Type MajorCategoryID MajorCategory MinorCategoryID MinorCategory CodeID Code StatusID Status StatusDate StatusDescription
0 14548958 6967900 91 APARTMENT 541 BATHROOM 1 EMERGENCY 9 PLUMBING 63 BATHTUB/SHOWER 2538 BROKEN OR MISSING 2 CLOSE 07/29/2014 The Department of Housing Preservation and De...
1 14548959 6967900 91 APARTMENT 541 BATHROOM 3 NON EMERGENCY 9 PLUMBING 63 BATHTUB/SHOWER 2540 FAUCET BROKEN/MISSING/LEAKING 2 CLOSE 08/04/2014 The Department of Housing Preservation and De...
2 14548960 6967900 91 APARTMENT 543 ENTIRE APARTMENT 3 NON EMERGENCY 58 FLOORING/STAIRS 343 FLOOR 2691 TILE BROKEN OR MISSING 2 CLOSE 08/04/2014 The Department of Housing Preservation and De...
3 14548961 6967900 91 APARTMENT 541 BATHROOM 3 NON EMERGENCY 9 PLUMBING 63 BATHTUB/SHOWER 2541 CHIPPED OR RUSTED 2 CLOSE 08/04/2014 The Department of Housing Preservation and De...
4 14615271 6994958 20 APARTMENT 68 ENTIRE APARTMENT 1 EMERGENCY 59 HEAT/HOT WATER 349 ENTIRE BUILDING 2717 NO HOT WATER 2 CLOSE 08/22/2014 The Department of Housing Preservation and De...

Awesome! This dataset provides some more details about the complaints, and lets us join by ComplaintID.
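That join is an inner merge on `ComplaintID`: each complaint picks up one row per problem recorded against it. A sketch with hypothetical miniatures of the two tables:

```python
import pandas as pd

# Hypothetical miniatures: complaints and their per-problem records
hpdcomp = pd.DataFrame({'ComplaintID': [6967900, 6994958],
                        'Borough': ['MANHATTAN', 'QUEENS']})
hpdcompprob = pd.DataFrame({'ProblemID': [14548958, 14548959, 14615271],
                            'ComplaintID': [6967900, 6967900, 6994958],
                            'MajorCategory': ['PLUMBING', 'PLUMBING',
                                              'HEAT/HOT WATER']})

# One output row per problem, carrying the complaint's columns along
joined = hpdcomp.merge(hpdcompprob, on='ComplaintID')
print(len(joined))  # 3
```

Because the relationship is one-to-many, the merged frame will have the length of the problems table, not the complaints table, so per-complaint statistics need a subsequent groupby on `ComplaintID`.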

Summary and next steps

In the immediate future, I'm going to be writing scripts to join and clean these datasets. This can be done either in Python or in SQL; I haven't decided yet. Additionally, I'm going to be writing some code to do things like convert date strings, and perhaps scrape text.