Hack for Heat #11: NYC's top complaints pt.2

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date


pd.options.display.max_columns = 40


In this post, I'm going to visualize the top complaints reported in the previous post. I have these data in a csv:

In [2]:
tempdata = pd.read_csv('topfiverev.csv', names = ['Year','Category','Count'])
In [3]:
tempdata.head()
Out[3]:
Year Category Count
0 2010 CONSTRUCTION/PLUMBING 287253
1 2010 Street Condition 229584
2 2010 HEAT RELATED 214218
3 2010 NOISY 201834
4 2010 PAINT - PLASTER 93194
In [4]:
#Fixing some category names
tempdata['Category'] = ['PAINT/PLASTER' if x in ('PAINT/PLASTER', 'PAINT - PLASTER') else x for x in tempdata['Category']]
tempdata['Category'] = ['Street Condition' if x == 'Traffic Signal Condition' else x for x in tempdata['Category']]
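The same renaming can also be done in one pass with `Series.replace` and a mapping dict; a minimal sketch on toy values standing in for `tempdata['Category']`:

```python
import pandas as pd

# toy stand-in for tempdata['Category']
cats = pd.Series(['PAINT - PLASTER', 'Traffic Signal Condition', 'NOISY'])

# map old labels to their replacements; unmatched values pass through unchanged
fixed = cats.replace({'PAINT - PLASTER': 'PAINT/PLASTER',
                      'Traffic Signal Condition': 'Street Condition'})
```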

What we want is to create a data frame with as many rows as there are categories, as many columns as there are years, and the counts as the data points.

In [5]:
upcaseindex = [x.upper() for x in tempdata['Category'].unique()]
upcaseindex.remove('NONCONST') #remove this category because it refers to missing values
In [6]:
plotdata = pd.DataFrame(index = upcaseindex, columns = tempdata['Year'].unique())
In [7]:
plotdata
Out[7]:
2010 2011 2012 2013 2014 2015 2016
CONSTRUCTION/PLUMBING NaN NaN NaN NaN NaN NaN NaN
STREET CONDITION NaN NaN NaN NaN NaN NaN NaN
HEAT RELATED NaN NaN NaN NaN NaN NaN NaN
NOISY NaN NaN NaN NaN NaN NaN NaN
PAINT/PLASTER NaN NaN NaN NaN NaN NaN NaN
WATER SYSTEM NaN NaN NaN NaN NaN NaN NaN
BLOCKED DRIVEWAY NaN NaN NaN NaN NaN NaN NaN
ILLEGAL PARKING NaN NaN NaN NaN NaN NaN NaN
UNSANITARY CONDITION NaN NaN NaN NaN NaN NaN NaN

Now let's populate it with values.

In [8]:
for row in tempdata['Category'].unique():
    for year in  tempdata['Year'].unique():
        try:
            plotdata.loc[row.upper(),year] = tempdata.loc[(tempdata['Year'] == year) & (tempdata['Category'] == row)]['Count'].values[0]
        except IndexError: #no entry for this category/year combination
            plotdata.loc[row.upper(),year] = None
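The nested loop that fills `plotdata` can also be expressed as a single pivot; a sketch on toy data:

```python
import pandas as pd

tempdata = pd.DataFrame({
    'Year':     [2010, 2010, 2011, 2011],
    'Category': ['HEAT RELATED', 'NOISY', 'HEAT RELATED', 'NOISY'],
    'Count':    [214218, 201834, 190184, 196453],
})

# rows = categories, columns = years, cells = counts (NaN where a pair is absent)
plotdata = tempdata.pivot(index='Category', columns='Year', values='Count')
```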
In [9]:
plotdata = plotdata.drop(['NONCONST'])
plotdata
Out[9]:
2010 2011 2012 2013 2014 2015 2016
CONSTRUCTION/PLUMBING 287253 272234 225338 223314 106251 83585 39460
STREET CONDITION 229584 216865 171910 182618 221097 233839 105346
HEAT RELATED 214218 190184 182974 202896 230364 225706 126906
NOISY 201834 196453 220392 259356 337892 387227 197186
PAINT/PLASTER 93194 100704 77287 77957 63503 None None
WATER SYSTEM 70346 61577 57678 53186 None 71086 31299
BLOCKED DRIVEWAY None 52592 50818 57408 79170 100881 53955
ILLEGAL PARKING None None None None 63243 92679 55932
UNSANITARY CONDITION None None None None 61789 82888 35357
In [10]:
x = range(0,len(plotdata.columns))
xlabels = plotdata.columns

plt.figure(figsize=(16,10));
plt.xlim(-1,len(xlabels));
plt.ylim(0,500000);
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')
plt.title('Complaint Category by Year', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

count = 0
for row in plotdata.index:
    y = plotdata.loc[row]
    plt.plot(x, y, color = plotcolors[count], lw=4, marker = '.', markersize = 20);
    count += 1

plt.legend(legendcolors, upcaseindex, bbox_to_anchor=(0.3, 1));
plt.xticks(x,xlabels);

So that's my graph. Note that the apparent drop in 2016 is simply because we're only midway through the year: I grabbed these data in early July, so that's when these numbers were last updated.

Hack for Heat #10: NYC's top complaints

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date


pd.options.display.max_columns = 40

connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()


In this post, I'm going to look at what the top 311 complaints in NYC are, and how their numbers have changed over the years.

In [2]:
cursor.execute('''SELECT createddate, borough, complainttype FROM service''')
totalcomplaints = cursor.fetchall()
In [3]:
len(totalcomplaints)
Out[3]:
13092182

Over the past 7 years, there were about 13 million complaints. That works out to roughly 1.5 per person over the whole period, or about 0.22 complaints per person per year.

In [4]:
totalcomplaints = pd.DataFrame(totalcomplaints)
In [5]:
totalcomplaints.columns = ['Date', 'Boro', 'Comptype']
In [6]:
totalcomplaints['Year'] = [x.year for x in totalcomplaints['Date']]
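psycopg2 hands back `datetime` objects, so the list comprehension works; if the column is converted with `pd.to_datetime` first, the `.dt` accessor does the same thing vectorized. A sketch:

```python
import pandas as pd

# toy stand-in for the 'Date' column
dates = pd.to_datetime(pd.Series(['2010-01-15', '2015-11-30']))
years = dates.dt.year  # vectorized year extraction
```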

This is how it breaks down over the years:

In [7]:
totalcomplaints.groupby(by=['Year']).count()
Out[7]:
Date Boro Comptype
Year
2010 2005730 2005730 2005730
2011 1918888 1918888 1918888
2012 1783013 1783013 1783013
2013 1848938 1848938 1848938
2014 2102177 2102177 2102177
2015 2286697 2286697 2286697
2016 1146739 1146739 1146739
In [8]:
complainttypesbyseason = totalcomplaints.groupby(by=['Year','Comptype']).count()
In [9]:
topfive = complainttypesbyseason['Date'].groupby(level=0,group_keys=False).apply(lambda x:x.sort_values(ascending=False).head(5))
topfive
Out[9]:
Year  Comptype              
2010  HEATING                   214218
      GENERAL CONSTRUCTION      127131
      Street Light Condition    116466
      Noise - Residential       115615
      PLUMBING                  111504
2011  HEATING                   190184
      GENERAL CONSTRUCTION      133242
      PLUMBING                  113218
      Noise - Residential       111447
      Street Condition          104694
2012  HEATING                   182974
      Noise - Residential       127943
      GENERAL CONSTRUCTION      112436
      Street Light Condition     93981
      PLUMBING                   87561
2013  HEATING                   202896
      Noise - Residential       151516
      GENERAL CONSTRUCTION      112169
      Street Light Condition     92502
      PLUMBING                   86564
2014  Noise - Residential       192608
      HEAT/HOT WATER            132767
      Street Condition          114545
      HEATING                    97597
      Street Light Condition     94665
2015  HEAT/HOT WATER            225706
      Noise - Residential       208026
      Street Condition          124378
      Blocked Driveway          100881
      Street Light Condition     98106
2016  HEAT/HOT WATER            126906
      Noise - Residential       104324
      Illegal Parking            55932
      Blocked Driveway           53955
      Street Condition           51977
dtype: int64

These data give us a rough idea of how the number of complaints has changed over time, but one problem is that the labels used changed from year to year (presumably because they were replaced by new labels that better reflected what the complaints were). Second, for our purposes, some categories can be combined. For example, distinguishing 'Street Condition' from 'Street Light Condition' may be important for administrators, but for now, we can be content with just "stuff that's gone wrong on the street".
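The sort-then-head pattern in `In [9]` can also be written with `nlargest`; a sketch on a toy single-year Series:

```python
import pandas as pd

counts = pd.Series({'HEATING': 214218, 'GENERAL CONSTRUCTION': 127131,
                    'Street Light Condition': 116466, 'Noise - Residential': 115615,
                    'PLUMBING': 111504, 'Sewer': 90000})

top5 = counts.nlargest(5)  # five largest values, sorted descending
```

Per-year top fives fall out of applying the same call within each `groupby` group.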

In [10]:
topfive.index.levels[1]
Out[10]:
Index([u'Blocked Driveway', u'GENERAL CONSTRUCTION', u'HEAT/HOT WATER',
       u'HEATING', u'Illegal Parking', u'Noise - Residential', u'PLUMBING',
       u'Street Condition', u'Street Light Condition'],
      dtype='object', name=u'Comptype')

So, over the last 7 years, nine different categories have made up the annual top five. Notice that some of these labels have changed over time. We'll have to standardize them.

In [11]:
years = totalcomplaints.Year.unique()

#first have a list of labels that are the same
oldlabels = totalcomplaints.loc[totalcomplaints['Year'] == years[0],['Comptype']]['Comptype'].unique()

newlabels = []

for x in years:
    yearslabels = totalcomplaints.loc[totalcomplaints['Year'] == x,['Comptype']]['Comptype'].unique()
    for label in yearslabels:
        if label not in oldlabels and label not in newlabels:
            newlabels.append(label)

From doing this, we get two lists: oldlabels and newlabels, that we will use to decide how to combine stuff.
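Since order doesn't matter for this bookkeeping, the same diff can be done with sets; a sketch with hypothetical labels:

```python
# labels seen in the first year
oldlabels = {'HEATING', 'PLUMBING', 'Noise - Residential'}

# labels seen in each later year
later_years = [{'HEAT/HOT WATER', 'PLUMBING'},
               {'Noise - Residential', 'Illegal Parking'}]

# labels that appear in later years but not in the first year
newlabels = set().union(*later_years) - oldlabels
```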

In [12]:
#sorted(oldlabels)
In [13]:
#sorted(newlabels)

One modification I'm going to make is to combine 'Street Condition', 'Street Light Condition', 'Street Sign - Damaged', 'Street Sign - Dangling', and 'Street Sign - Missing' into 'Street Condition'.

For the current purposes, these are the same thing.

I'm also going to collapse 'GENERAL CONSTRUCTION', 'General Construction/Plumbing' and 'PLUMBING' into a new category called 'CONSTRUCTION/PLUMBING'.

I'll collapse noise related complaints ('Noise', 'Noise - Commercial', 'Noise - Helicopter', 'Noise - House of Worship', 'Noise - Park', 'Noise - Residential', 'Noise - Street/Sidewalk', 'Noise - Vehicle') into a label named 'NOISY'.

Finally, I'll combine 'HEAT/HOT WATER' and 'HEATING' into 'HEAT RELATED'.

In [14]:
streetconditioncomb = ['Street Condition', 'Street Light Condition', 'Street Sign - Damaged', 'Street Sign - Dangling', \
                       'Street Sign - Missing']

totalcomplaints['RevComptype'] = ['Street Condition' if x in streetconditioncomb else x for x in totalcomplaints['Comptype']]

genconstructcomb = ['GENERAL CONSTRUCTION', 'General Construction/Plumbing', 'PLUMBING']

totalcomplaints['RevComptype'] = ['CONSTRUCTION/PLUMBING' if x in genconstructcomb else x for x in totalcomplaints['RevComptype']]

noisecomb = ['Noise', 'Noise - Commercial', 'Noise - Helicopter', 'Noise - House of Worship', 'Noise - Park',\
             'Noise - Residential', 'Noise - Street/Sidewalk', 'Noise - Vehicle']

totalcomplaints['RevComptype'] = ['NOISY' if x in noisecomb else x for x in totalcomplaints['RevComptype']]

heatcomb = ['HEAT/HOT WATER', 'HEATING']

totalcomplaints['RevComptype'] = ['HEAT RELATED' if x in heatcomb else x for x in totalcomplaints['RevComptype']]
In [15]:
revisedcomplainttypes = totalcomplaints.groupby(by=['Year','RevComptype']).count()
In [16]:
topfiverev = revisedcomplainttypes['Date'].groupby(level=0, group_keys=False).apply(lambda x : x.sort_values(ascending=False).head(8))
topfiverev
Out[16]:
Year  RevComptype             
2010  CONSTRUCTION/PLUMBING       287253
      Street Condition            229584
      HEAT RELATED                214218
      NOISY                       201834
      PAINT - PLASTER              93194
      Water System                 70346
      NONCONST                     69659
      Traffic Signal Condition     53773
2011  CONSTRUCTION/PLUMBING       272234
      Street Condition            216865
      NOISY                       196453
      HEAT RELATED                190184
      PAINT - PLASTER             100704
      NONCONST                     68407
      Water System                 61577
      Blocked Driveway             52592
2012  CONSTRUCTION/PLUMBING       225338
      NOISY                       220392
      HEAT RELATED                182974
      Street Condition            171910
      PAINT - PLASTER              77287
      NONCONST                     60055
      Water System                 57678
      Blocked Driveway             50818
2013  NOISY                       259356
      CONSTRUCTION/PLUMBING       223314
      HEAT RELATED                202896
      Street Condition            182618
      PAINT - PLASTER              77957
      Blocked Driveway             57408
      NONCONST                     53831
      Water System                 53186
2014  NOISY                       337892
      HEAT RELATED                230364
      Street Condition            221097
      CONSTRUCTION/PLUMBING       106251
      Blocked Driveway             79170
      PAINT/PLASTER                63503
      Illegal Parking              63243
      UNSANITARY CONDITION         61789
2015  NOISY                       387227
      Street Condition            233839
      HEAT RELATED                225706
      Blocked Driveway            100881
      Illegal Parking              92679
      CONSTRUCTION/PLUMBING        83585
      UNSANITARY CONDITION         82888
      Water System                 71086
2016  NOISY                       197186
      HEAT RELATED                126906
      Street Condition            105346
      Illegal Parking              55932
      Blocked Driveway             53955
      CONSTRUCTION/PLUMBING        39460
      UNSANITARY CONDITION         35357
      Water System                 31299
dtype: int64

This gives us some different numbers. Taken together: heat-related complaints are still consistently in the top five, construction-related complaints have gradually declined, street-related complaints remain large (if variable), and noise-related complaints have steadily increased. Next, I'll post a version of this looking at only the heating season.

Hack for Heat #9: Total complaints over time, complaints per capita, etc.

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date


pd.options.display.max_columns = 40

connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()


In this post, I'm going to be doing a couple of miscellaneous things to prepare for our first blog post of the summer. Last year's posts can be found here.

In [2]:
cursor.execute('''SELECT createddate, borough, complainttype FROM service;''')
borodata = cursor.fetchall()
In [3]:
borodata = pd.DataFrame(borodata)
In [4]:
borodata.columns = ['Date', 'Boro', 'Comptype']

Again, I removed entries earlier than March 2011 because there are data quality issues (substantially fewer cases):

In [5]:
borodata = borodata.loc[borodata['Date'] > date(2011,3,1)]
In [6]:
borodata = borodata.loc[borodata['Date'] < date(2016,6,1)]

Remove cases where the borough was unspecified:

In [7]:
borodata = borodata.loc[borodata['Boro'] != 'Unspecified']

Total complaints by borough, year-over-year

We already have a chart of heat complaints by month; we need one for the total number of complaints received.

In [8]:
borodata['Year'] = [x.year for x in borodata['Date']]
borodata['Month'] = [x.month for x in borodata['Date']]
borodata['Day'] = [x.day for x in borodata['Date']]
In [9]:
plotdata = borodata.groupby(by=['Boro', 'Year', 'Month']).count()
plotdata.head()
Out[9]:
Date Comptype Day
Boro Year Month
BRONX 2011 3 26256 26256 26256
4 22898 22898 22898
5 22722 22722 22722
6 23677 23677 23677
7 27348 27348 27348

Now we generate a dictionary that we can use to plot these data:

In [10]:
plotdict = {x:[] for x in borodata['Boro'].unique()}
In [11]:
for boro in plotdict:
    plotdict[boro] = list(plotdata.xs(boro).Date)

Now we need to format our data for the plot. We need 5 "rows", each representing a borough, and n columns, each representing one month.

In [12]:
plotdata = np.zeros(len(plotdict['BROOKLYN']))
legendkey = []

for boro in plotdict.keys():
    plotdata = np.row_stack((plotdata, plotdict[boro]))
    legendkey.append(boro)
    
plotdata = np.delete(plotdata, (0), axis=0)
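The zeros-row seeding and subsequent `np.delete` can be avoided by stacking in one call; a sketch on a toy dict (sorting the keys also pins down the borough order, which plain dict iteration doesn't guarantee in older Pythons):

```python
import numpy as np

# toy stand-in for plotdict: borough -> monthly counts
plotdict = {'BRONX': [1, 2, 3], 'BROOKLYN': [4, 5, 6]}

legendkey = sorted(plotdict)  # deterministic borough order
plotdata = np.vstack([plotdict[boro] for boro in legendkey])
```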
In [13]:
x = np.arange(len(plotdata[0]))

#crude xlabels
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years = ['2011', '2012', '2013', '2014', '2015', '2016']
xlabels = []
for year in years:
    for month in months:
        xlabels.append("{0} {1}".format(month,year))

        
xlabels = xlabels[2:-7] #start from march 2011, end may2016
In [14]:
plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,plotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.legend(legendcolors,legendkey, bbox_to_anchor=(0.2, 1));
plt.title('Total Complaints by Borough', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')
In [15]:
plotdatacsv = pd.DataFrame(plotdata)
plotdatacsv.index = legendkey
plotdatacsv.columns = xlabels
plotdatacsv.to_csv('totalcomplaintsbyboroughandmonth.csv')

Normalized total complaints

We'll also need a normalized version of the above, where everything is scaled to 100%, to see if complaints come disproportionately from one borough vs. another. There are probably better ways to do this, but I'm going to simply sum the number of complaints from each time point, and divide each column by the total for that column (i.e., number of complaints from that borough for that month divided by total number of complaints).
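NumPy broadcasting makes the column-wise division a one-liner, with no need to tile the totals; a sketch:

```python
import numpy as np

# toy 2-borough x 2-month counts
plotdata = np.array([[2., 2.],
                     [6., 2.]])

# divide each column by its total; the (2,) row of sums broadcasts across rows
norm = plotdata / plotdata.sum(axis=0)
```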

In [16]:
normplotdata = np.ndarray(shape=np.shape(plotdata))

It will thus be helpful to have an array of the total number of complaints:

In [17]:
totalcomplaintsbymonth = borodata.groupby(by=['Year','Month']).count().Date
In [18]:
totalcomplaintsbymonth = totalcomplaintsbymonth.values
In [19]:
totalcomplaintsbymonth = np.tile(totalcomplaintsbymonth,(5,1))
In [20]:
normplotdata = plotdata/totalcomplaintsbymonth
In [21]:
plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,normplotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.ylim(0,1)
plt.legend(legendcolors,legendkey, bbox_to_anchor=(0.2, 1));
plt.title('Total Complaints by Borough (Normalized)', size = 24)
plt.ylabel('Proportion of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

Complaints per capita:

We'd also like to look at how many complaints per capita are generated. Bear in mind that we only have full years for 2012-2015, so the 2011 and 2016 figures will be lower.

In [22]:
boropop = {
    'MANHATTAN': 1636268,
    'BRONX': 1438159,
    'BROOKLYN': 2621793,
    'QUEENS': 2321580,
    'STATEN ISLAND': 473279,
    }
In [23]:
totalnycpop = reduce(lambda x, y: x +y, boropop.values())
totalnycpop
Out[23]:
8491079
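In Python 3, `reduce` has moved to `functools`, but the built-in `sum` covers this case directly:

```python
boropop = {'MANHATTAN': 1636268, 'BRONX': 1438159, 'BROOKLYN': 2621793,
           'QUEENS': 2321580, 'STATEN ISLAND': 473279}

# sum the borough populations without reduce
totalnycpop = sum(boropop.values())
```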
In [24]:
complaintsbyyear = borodata.groupby(by=['Year']).count()
borocomplaintsbyyear = borodata.groupby(by=['Year','Boro']).count()
In [25]:
complaintsbyyear['Pop'] = [totalnycpop for x in complaintsbyyear.index]
In [26]:
borocomplaintsbyyear['Pop'] = [boropop.get(x[1]) for x in borocomplaintsbyyear.index]
In [27]:
complaintsbyyear['CompPerCap'] = complaintsbyyear['Day']/complaintsbyyear['Pop']
In [28]:
complaintsbyyear
Out[28]:
Date Boro Comptype Month Day Pop CompPerCap
Year
2011 1359368 1359368 1359368 1359368 1359368 8491079 0.160094
2012 1528597 1528597 1528597 1528597 1528597 8491079 0.180024
2013 1546188 1546188 1546188 1546188 1546188 8491079 0.182096
2014 1770357 1770357 1770357 1770357 1770357 8491079 0.208496
2015 1924917 1924917 1924917 1924917 1924917 8491079 0.226699
2016 891783 891783 891783 891783 891783 8491079 0.105026
In [29]:
borocomplaintsbyyear['CompPerCap'] = borocomplaintsbyyear['Day']/borocomplaintsbyyear['Pop']
In [30]:
borocomplaintsbyyear
Out[30]:
Date Comptype Month Day Pop CompPerCap
Year Boro
2011 BRONX 262597 262597 262597 262597 1438159 0.182592
BROOKLYN 434038 434038 434038 434038 2621793 0.165550
MANHATTAN 273282 273282 273282 273282 1636268 0.167015
QUEENS 315070 315070 315070 315070 2321580 0.135714
STATEN ISLAND 74381 74381 74381 74381 473279 0.157161
2012 BRONX 296502 296502 296502 296502 1438159 0.206168
BROOKLYN 482088 482088 482088 482088 2621793 0.183877
MANHATTAN 314437 314437 314437 314437 1636268 0.192167
QUEENS 358990 358990 358990 358990 2321580 0.154632
STATEN ISLAND 76580 76580 76580 76580 473279 0.161807
2013 BRONX 302328 302328 302328 302328 1438159 0.210219
BROOKLYN 490979 490979 490979 490979 2621793 0.187268
MANHATTAN 332487 332487 332487 332487 1636268 0.203198
QUEENS 343772 343772 343772 343772 2321580 0.148077
STATEN ISLAND 76622 76622 76622 76622 473279 0.161896
2014 BRONX 340297 340297 340297 340297 1438159 0.236620
BROOKLYN 557290 557290 557290 557290 2621793 0.212561
MANHATTAN 373895 373895 373895 373895 1636268 0.228505
QUEENS 408389 408389 408389 408389 2321580 0.175910
STATEN ISLAND 90486 90486 90486 90486 473279 0.191190
2015 BRONX 355017 355017 355017 355017 1438159 0.246855
BROOKLYN 602144 602144 602144 602144 2621793 0.229669
MANHATTAN 417223 417223 417223 417223 1636268 0.254985
QUEENS 455115 455115 455115 455115 2321580 0.196037
STATEN ISLAND 95418 95418 95418 95418 473279 0.201610
2016 BRONX 169773 169773 169773 169773 1438159 0.118049
BROOKLYN 271140 271140 271140 271140 2621793 0.103418
MANHATTAN 209840 209840 209840 209840 1636268 0.128243
QUEENS 200335 200335 200335 200335 2321580 0.086293
STATEN ISLAND 40695 40695 40695 40695 473279 0.085985

From these data, we can see two things: First, on average, about 1 in 5 people make a 311 complaint each year. Second, these numbers are pretty consistent across the 5 boroughs.

Hack for Heat #8: Complaint resolution time revisited

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2
from datetime import date

pd.options.display.max_columns = 40
pd.options.mode.chained_assignment = None  # default='warn'


In this post, I'm going to explore a simple way of dealing with the censoring issue in complaint resolution time: because we only care about heat complaints during the heating season, we can restrict our cases of interest to the heating season itself (October 1st to May 31st).

In [2]:
#pull from our database again:
connection = psycopg2.connect('dbname = threeoneone user= threeoneoneadmin password = threeoneoneadmin')
In [3]:
cursor = connection.cursor()
In [4]:
cursor.execute('''SELECT createddate, closeddate, borough, complainttype FROM service;''')
In [5]:
data = cursor.fetchall()
In [6]:
data = pd.DataFrame(data)
In [7]:
data.columns = ['createddate', 'closeddate', 'borough', 'complainttype']
In [8]:
data = data.loc[(data['complainttype'] == 'HEATING') | (data['complainttype'] == 'HEAT/HOT WATER') ]

So far, we've subsetted the data to only include the heating complaints. Now let's create a datetime mask. What we want is to have subsets of data for each of the heating seasons:

In [9]:
heatmonths = range(1, 6) + [10,11,12]
In [10]:
heatmonths
Out[10]:
[1, 2, 3, 4, 5, 10, 11, 12]
In [11]:
createdmask = data['createddate'].map(lambda x: x.month in heatmonths)
closedmask = data['closeddate'].map(lambda x: x.month in heatmonths if x is not None else False)
In [12]:
mask = createdmask & closedmask
In [13]:
heatseasondata = data.loc[mask]
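If the date columns are proper pandas datetimes, `dt.month.isin` builds the same mask without a Python-level lambda; a sketch:

```python
import pandas as pd

# toy stand-in for a date column
dates = pd.to_datetime(pd.Series(['2015-01-15', '2015-07-04', '2015-11-01']))

heatmonths = list(range(1, 6)) + [10, 11, 12]
mask = dates.dt.month.isin(heatmonths)  # True for Oct-May dates
```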

How many heat/hot water complaints are created and closed inside vs. outside of heating season?

In [14]:
len(data.loc[mask]) #inside heating season
Out[14]:
1302688
In [15]:
len(data.loc[~mask])#outside heating season
Out[15]:
68750

The next thing we want to do is ignore cases where the complaint was resolved in the next heating season:

In [16]:
prevmonths = range(1, 6) 
nextmonths = [10,11,12]
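Season membership can also be captured in a small helper. Note that this hypothetical labeler tags Oct-Dec of year Y and the following Jan-May with the same label Y, whereas the list comprehensions below tag Jan-May with the calendar year instead; it assumes dates have already been filtered to heating months:

```python
from datetime import date

def heat_season(d):
    # Hypothetical helper: tag a date with the year its Oct-May
    # heating season began, so Oct-Dec of year Y and Jan-May of
    # year Y+1 share the label Y.
    return d.year if d.month >= 10 else d.year - 1
```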
In [17]:
heatseasondata['createdheatseason'] = [x.year if (x.month in prevmonths) else (x.year-1) for x in heatseasondata['createddate']]
In [18]:
heatseasondata.head()
Out[18]:
createddate closeddate borough complainttype createdheatseason
0 2014-11-18 2014-11-22 BROOKLYN HEAT/HOT WATER 2013
1 2014-11-18 2014-11-25 BROOKLYN HEAT/HOT WATER 2013
2 2014-11-18 2014-11-19 BRONX HEAT/HOT WATER 2013
3 2014-11-18 2014-11-19 MANHATTAN HEAT/HOT WATER 2013
4 2014-11-18 2014-11-20 BRONX HEAT/HOT WATER 2013
In [19]:
heatseasondata['closedheatseason'] = [x.year if (x.month in prevmonths) else (x.year-1) for x in heatseasondata['closeddate']]

Now that we've done this, we can select only the cases where the closed date was in the same season as the created date:

In [20]:
heatseasondata = heatseasondata.loc[heatseasondata['createdheatseason']  == heatseasondata['closedheatseason']]

Okay, now we can calculate some average resolution times:

In [21]:
heatseasondata['resolutiontime'] = heatseasondata['closeddate'] - heatseasondata['createddate']
In [22]:
heatseasondata['resolutiontimeint'] = heatseasondata.resolutiontime.astype('timedelta64[D]')
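`.dt.days` is an equivalent way to get whole-day resolution times from the timedeltas; a sketch:

```python
import pandas as pd

created = pd.to_datetime(pd.Series(['2015-01-01', '2015-01-03']))
closed = pd.to_datetime(pd.Series(['2015-01-04', '2015-01-03']))

# whole days between close and open for each complaint
resolution_days = (closed - created).dt.days
```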
In [23]:
resolutiontimedata = heatseasondata.groupby(by='createdheatseason').mean()['resolutiontimeint']
In [24]:
resolutiontimedata.to_csv('resolutiontimebyyear.csv')
In [25]:
resolutiontimedata
Out[25]:
createdheatseason
2009    4.605845
2010    3.695498
2011    3.666409
2012    3.623974
2013    3.514244
2014    3.137578
2015    3.406145
2016    3.446075
Name: resolutiontimeint, dtype: float64

Resolution times by year:

In [26]:
x = resolutiontimedata.index.values
y = resolutiontimedata.values

plt.figure(figsize=(12,10));
plt.plot(x,y);

Resolution time by borough:

In [27]:
restimebyboro = heatseasondata.groupby(by=['borough', 'createdheatseason']).mean()['resolutiontimeint']
In [28]:
restimebyboro.to_csv('restimebyboro.csv')
In [29]:
restimebyboro = restimebyboro.loc[[x in range(2010,2017) for x in restimebyboro.index.get_level_values('createdheatseason')]]
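The year filter on the MultiIndex can also be written with `isin` on the level values; a toy sketch:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([('BRONX', 2009), ('BRONX', 2012)],
                                names=['borough', 'createdheatseason'])
restime = pd.Series([4.6, 3.6], index=idx)

# keep only seasons 2010-2016
filtered = restime[restime.index.get_level_values('createdheatseason').isin(range(2010, 2017))]
```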
In [31]:
boros = heatseasondata.borough.unique()
In [32]:
boroplots = {x:[] for x in boros}
for boro in boros:
    boroplots[boro] = restimebyboro.xs(boro).values
    
boroplots.pop('Unspecified')
Out[32]:
array([ 3.64268956,  3.89000015])
In [33]:
x = range(2010,2017)

plt.figure(figsize=(12,10));
for boro in boroplots:
    plt.plot(x,boroplots[boro]);
plt.legend(boroplots.keys());
plt.xticks(x, [str(label) for label in x]);

I was told beforehand that the average resolution time from a similar analysis last year was about 3-5 days, so this looks about right.

Hack for Heat #7: Heat complaints over time

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2
from datetime import datetime
from datetime import date

pd.options.display.max_columns = 40


In the last post, I plotted how the number of complaints differed by borough over time. This time around, I'm going to revisit that process, focusing on heating complaints only (this is Heat Seek, after all).

Loading data:

In [2]:
connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()
In [3]:
cursor.execute('''SELECT DISTINCT complainttype FROM service;''')
complainttypes = cursor.fetchall()
In [4]:
cursor.execute('''SELECT createddate, borough, complainttype FROM service;''')
borodata = cursor.fetchall()
In [5]:
borodata = pd.DataFrame(borodata)
In [6]:
borodata.head()
Out[6]:
0 1 2
0 2011-04-15 MANHATTAN Street Condition
1 2011-04-15 MANHATTAN Street Condition
2 2011-04-15 MANHATTAN Street Condition
3 2011-04-15 MANHATTAN Street Condition
4 2011-04-14 QUEENS Street Condition
In [7]:
borodata.columns = ['Date', 'Boro', 'Comptype']
In [8]:
heatdata = borodata.loc[(borodata['Comptype'] == 'HEATING') | (borodata['Comptype'] == 'HEAT/HOT WATER')]

I removed entries earlier than March 2011 because there are data quality issues (substantially fewer cases):

In [9]:
heatdata = heatdata.loc[heatdata['Date'] > date(2011,3,1)]

Heat complaints by borough:

In [10]:
len(heatdata)
Out[10]:
1092048

There were about a million heating complaints over the 5+ years we have data for.

In [11]:
heatbydate = heatdata.groupby(by='Boro').count()
In [12]:
heatbydate
Out[12]:
Date Comptype
Boro
BRONX 354335 354335
BROOKLYN 334574 334574
MANHATTAN 250296 250296
QUEENS 138477 138477
STATEN ISLAND 10601 10601
Unspecified 3765 3765

Per capita:

Again, let's look at how many heat complaints each boro generates per person:

In [13]:
boropop = {
    'MANHATTAN': 1636268,
    'BRONX': 1438159,
    'BROOKLYN': 2621793,
    'QUEENS': 2321580,
    'STATEN ISLAND': 473279,
    }
In [14]:
heatbydate['Pop'] = [boropop.get(x) for x in heatbydate.index]
In [15]:
heatbydate['CompPerCap'] = heatbydate['Date']/heatbydate['Pop']
In [16]:
heatbydate
Out[16]:
Date Comptype Pop CompPerCap
Boro
BRONX 354335 354335 1438159.0 0.246381
BROOKLYN 334574 334574 2621793.0 0.127613
MANHATTAN 250296 250296 1636268.0 0.152968
QUEENS 138477 138477 2321580.0 0.059648
STATEN ISLAND 10601 10601 473279.0 0.022399
Unspecified 3765 3765 NaN NaN

Complaints by borough over months

First, let's recreate the graph from before:

In [17]:
heatdata['Year'] = [x.year for x in heatdata['Date']]
heatdata['Month'] = [x.month for x in heatdata['Date']]
heatdata['Day'] = [x.day for x in heatdata['Date']]
In [18]:
heatdata.head()
Out[18]:
Date Boro Comptype Year Month Day
58 2015-10-17 BROOKLYN HEAT/HOT WATER 2015 10 17
728 2011-04-16 MANHATTAN HEATING 2011 4 16
1026 2011-04-16 BROOKLYN HEATING 2011 4 16
1418 2015-10-17 MANHATTAN HEAT/HOT WATER 2015 10 17
1624 2011-04-15 MANHATTAN HEATING 2011 4 15

We remove cases where the borough was unspecified:

In [19]:
heatdata = heatdata.loc[heatdata['Boro'] != 'Unspecified']
In [20]:
heatplotdata = heatdata.groupby(by=['Boro', 'Year','Month']).count()
heatplotdata
Out[20]:
Date Comptype Day
Boro Year Month
BRONX 2011 3 5101 5101 5101
4 3339 3339 3339
5 1217 1217 1217
6 562 562 562
7 564 564 564
8 482 482 482
9 674 674 674
10 5954 5954 5954
11 7393 7393 7393
12 10536 10536 10536
2012 1 12655 12655 12655
2 7146 7146 7146
3 5049 5049 5049
4 3336 3336 3336
5 1214 1214 1214
6 672 672 672
7 688 688 688
8 711 711 711
9 675 675 675
10 3488 3488 3488
11 11853 11853 11853
12 9365 9365 9365
2013 1 13580 13580 13580
2 10037 10037 10037
3 6794 6794 6794
4 3706 3706 3706
5 1545 1545 1545
6 861 861 861
7 605 605 605
8 593 593 593
... ... ... ... ... ...
STATEN ISLAND 2013 12 261 261 261
2014 1 491 491 491
2 199 199 199
3 206 206 206
4 112 112 112
5 70 70 70
6 35 35 35
7 32 32 32
8 31 31 31
9 36 36 36
10 216 216 216
11 405 405 405
12 246 246 246
2015 1 402 402 402
2 523 523 523
3 216 216 216
4 91 91 91
5 58 58 58
6 31 31 31
7 24 24 24
8 31 31 31
9 31 31 31
10 261 261 261
11 250 250 250
12 230 230 230
2016 1 480 480 480
2 327 327 327
3 133 133 133
4 110 110 110
5 79 79 79

315 rows × 3 columns

In [21]:
boros = heatbydate.index
borodict = {x:[] for x in boros}
borodict.pop('Unspecified')

for boro in borodict:
    borodict[boro] = list(heatplotdata.xs(boro).Date)
In [22]:
plotdata = np.zeros(len(borodict['BROOKLYN']))
for boro in sorted(borodict.keys()):
    plotdata = np.row_stack((plotdata, borodict[boro]))
In [23]:
plotdata = np.delete(plotdata, (0), axis=0)
In [25]:
from matplotlib import patches as mpatches

x = np.arange(len(plotdata[0]))

#crude xlabels
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years = ['2011', '2012', '2013', '2014', '2015', '2016']
xlabels = []
for year in years:
    for month in months:
        xlabels.append("{0} {1}".format(month,year))

xlabels = xlabels[2:-7] #start from March 2011, end May 2016
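The nested loops plus slicing work, but pandas can generate the same labels directly from a monthly period range; a minimal sketch (assuming the same March 2011 through May 2016 window):

```python
import pandas as pd

# monthly periods from March 2011 through May 2016, inclusive
periods = pd.period_range('2011-03', '2016-05', freq='M')
xlabels = [p.strftime('%b %Y') for p in periods]  # e.g. 'Mar 2011'
```

This avoids having to count slice offsets by hand when the date window changes.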

A non-normalized plot:

The plot below is of raw complaint numbers, and is what we might expect: complaints about heat matter the most during the heating season!

In [26]:
plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,plotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.legend(legendcolors,sorted(borodict.keys()), bbox_to_anchor=(0.2, 1));
plt.title('Heating Complaints by Borough', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

Normalizing data:

Next, we're going to normalize these data. This allows us to better see whether the proportion of heating complaints coming from each borough changed over time (e.g., did one borough generate more or fewer complaints relative to the others?).

What we want to do is divide each of the five rows by the total number of complaints for that column (i.e., that month).
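Instead of looping over boroughs, the same column-wise normalization can be done in one step with numpy broadcasting; a sketch on a toy array (the real `plotdata` would take the place of `counts`):

```python
import numpy as np

# rows = boroughs, columns = months
counts = np.array([[10.0, 20.0],
                   [30.0, 20.0]])

# divide each column by its total, then scale to percentages
normed = counts / counts.sum(axis=0) * 100
```

Each column of `normed` sums to 100, which is exactly what the stacked percentage plot needs.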

In [27]:
totalcounts = heatdata.groupby(by=['Year', 'Month']).count().Date.values
In [40]:
boros = heatbydate.index
normdict = {x:[] for x in boros}
normdict.pop('Unspecified')

for boro in normdict:
    for i in range(0,len(plotdata[1])): # for all the values in each row
        normp = float(borodict[boro][i])/float(totalcounts[i])
        normdict[boro].append(normp*100)
In [41]:
normplotdata = np.zeros(len(borodict['BROOKLYN']))
for boro in sorted(normdict.keys()):
    normplotdata = np.row_stack((normplotdata, normdict[boro]))
In [42]:
normplotdata = np.delete(normplotdata,(0),axis = 0)
In [44]:
plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,normplotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,len(xlabels))
plt.ylim(0,100)
plt.legend(legendcolors,sorted(normdict.keys()), bbox_to_anchor=(0.2, 1));
plt.title('Heating Complaints by Borough (normalized)', size = 24)
plt.ylabel('% of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')

Overall, it looks like heating complaints were pretty consistent by borough. In the next post, I'm going to take a look at one way of addressing the aforementioned censoring problems.

Hack for Heat #5: Complaint resolution time

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40

Hack for Heat #5: How long do complaints take to resolve?

In this post, we're going to see if we can graph how long it takes for complaints to get resolved.

In [2]:
connection = psycopg2.connect('dbname= threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

cursor.execute('''SELECT createddate, closeddate, borough FROM service;''')
data = cursor.fetchall()
data = pd.DataFrame(data)
In [3]:
data.columns = ['createddate','closeddate','borough']
In [4]:
data = data.loc[data['createddate'].notnull()]
data = data.loc[data['closeddate'].notnull()]
In [5]:
data['timedelta'] = data['closeddate'] - data['createddate']
In [6]:
data['timedeltaint'] = [x.days for x in data['timedelta']]
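The list comprehension works here because psycopg2 returns plain Python `date` objects (so the columns are object-typed). If the columns are converted to pandas' own datetime64 type first, the subtraction yields a timedelta64 column and the `.days` conversion becomes a vectorized `.dt` call; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'createddate': pd.to_datetime(['2013-04-26', '2013-04-26']),
    'closeddate': pd.to_datetime(['2013-04-26', '2013-04-29']),
})
df['timedelta'] = df['closeddate'] - df['createddate']  # timedelta64 column
df['timedeltaint'] = df['timedelta'].dt.days            # vectorized, no Python loop
```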
In [7]:
data.head()
Out[7]:
createddate closeddate borough timedelta timedeltaint
0 2013-04-26 2013-04-26 QUEENS 0 days 0
1 2013-04-26 2013-04-29 MANHATTAN 3 days 3
2 2013-04-26 2013-04-30 MANHATTAN 4 days 4
3 2013-04-26 2013-04-26 QUEENS 0 days 0
4 2013-04-26 2013-04-30 STATEN ISLAND 4 days 4
In [8]:
data.groupby(by='borough')['timedeltaint'].mean()
Out[8]:
borough
BRONX            -3.658691
BROOKLYN         -6.835417
MANHATTAN       -20.325506
QUEENS           -4.121806
STATEN ISLAND    -2.108396
Unspecified       4.699638
Name: timedeltaint, dtype: float64

Oops! Looks like something's wrong. Let's try and find out:

In [9]:
data.sort_values('timedeltaint').head()
Out[9]:
createddate closeddate borough timedelta timedeltaint
604564 2016-03-16 1900-01-01 Unspecified -42443 days -42443
596082 2016-03-16 1900-01-01 Unspecified -42443 days -42443
605654 2016-03-16 1900-01-01 QUEENS -42443 days -42443
606458 2016-03-15 1900-01-01 MANHATTAN -42442 days -42442
552013 2016-03-14 1900-01-01 Unspecified -42441 days -42441
In [10]:
data.sort_values('timedeltaint', ascending=False).head()
Out[10]:
createddate closeddate borough timedelta timedeltaint
9392808 2010-04-08 2201-05-13 BRONX 69796 days 69796
9525584 2010-06-01 2201-06-17 BROOKLYN 69777 days 69777
1581464 2011-01-20 2201-03-25 BRONX 69460 days 69460
1536326 2015-11-03 2100-01-01 QUEENS 30740 days 30740
2479091 2013-01-10 2023-05-01 BROOKLYN 3763 days 3763

Ah. Well, as a first step, let's remove any values that are before Jan 1st 2010 or after today:

In [11]:
import datetime

today = datetime.date(2016,5,29)
janone = datetime.date(2010,1,1)

Let's also remove any rows where the close date is before the created date:

In [12]:
subdata = data.loc[(data['closeddate'] > janone) & (data['closeddate'] < today)]
subdata = subdata.loc[subdata['closeddate'] > subdata['createddate']]
In [13]:
len(subdata)
Out[13]:
7979757
In [14]:
subdata.sort_values('timedeltaint').head()
Out[14]:
createddate closeddate borough timedelta timedeltaint
11371297 2015-08-28 2015-08-29 QUEENS 1 days 1
7950080 2015-06-20 2015-06-21 BROOKLYN 1 days 1
2366882 2012-11-19 2012-11-20 MANHATTAN 1 days 1
2366879 2015-06-09 2015-06-10 Unspecified 1 days 1
2366874 2012-04-17 2012-04-18 BROOKLYN 1 days 1
In [15]:
subdata.sort_values('timedeltaint',ascending = False).head()
Out[15]:
createddate closeddate borough timedelta timedeltaint
10774084 2010-01-06 2016-03-24 BROOKLYN 2269 days 2269
1506622 2010-02-19 2016-05-02 BROOKLYN 2264 days 2264
1508205 2010-03-02 2016-05-04 STATEN ISLAND 2255 days 2255
513580 2010-01-01 2016-02-17 STATEN ISLAND 2238 days 2238
589865 2010-01-22 2016-02-25 BRONX 2225 days 2225

This looks a little bit more realistic, but let's also visualize the distribution:

In [16]:
plotdata = list(subdata['timedeltaint'])
In [17]:
plt.figure(figsize=(12,10))
plt.hist(plotdata);

Okay, this still looks really wonky. Let's further subset the data, and see what happens when we remove the top and bottom 2.5%.

Pandas has a quantile function:

In [18]:
subdata.quantile([.025, .975])
Out[18]:
timedeltaint
0.025 1
0.975 138
In [19]:
quantcutdata = subdata.loc[(subdata['timedeltaint'] > 1) & (subdata['timedeltaint'] < 138) ]
In [20]:
len(quantcutdata)
Out[20]:
6278707
In [21]:
plotdata = list(quantcutdata['timedeltaint'])

plt.figure(figsize=(12,10))
plt.hist(plotdata);

That looks a little better, but there might be other ways to filter out bad data.
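One common alternative to trimming fixed percentiles is the Tukey fences rule: drop anything more than 1.5 × IQR beyond the quartiles. A sketch on a toy series (the real `subdata['timedeltaint']` would take its place):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # 100 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)
filtered = s[mask]
```

Unlike the fixed 2.5% cut, this adapts the cutoffs to the spread of the data rather than always discarding a fixed fraction.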

In [22]:
subdata.groupby(by='borough').median()
Out[22]:
timedeltaint
borough
BRONX 5
BROOKLYN 5
MANHATTAN 6
QUEENS 5
STATEN ISLAND 4
Unspecified 5
In [23]:
subdata.groupby(by='borough').mean()
Out[23]:
timedeltaint
borough
BRONX 18.789025
BROOKLYN 20.725656
MANHATTAN 20.841213
QUEENS 23.746771
STATEN ISLAND 26.283222
Unspecified 11.652647

So, I definitely wouldn't trust these data right now - I'm still working on finding a data dictionary for the 311 data, and I need to make sure the columns mean what I think they mean. The point of this is just to show what the process would be once I get my hands on reliable data.

Hack for Heat #4: Complaints by borough and date

In [2]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40

Hack for Heat #4: Number of complaints over time pt.2

In this post, we're going to look at the number of complaints each borough received for the last five or so years. First, let's look at the total number of complaints received:

In [3]:
connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

Borough complaints by date

In [4]:
cursor.execute('''SELECT createddate, borough FROM service;''')
borodata = cursor.fetchall()
In [5]:
borodata = pd.DataFrame(borodata)
In [6]:
borodata.columns = ['Date', 'Boro']
In [7]:
borobydate = borodata.groupby(by='Boro').count()
In [8]:
borobydate
Out[8]:
Date
Boro
BRONX 1882215
BROOKLYN 3178206
MANHATTAN 2166983
QUEENS 2444874
STATEN ISLAND 541935
Unspecified 1157085

For fun, let's look at the number of complaints per capita in each of the 5 boroughs. The population values below are from Wikipedia.

In [9]:
boropop = {
    'MANHATTAN': 1636268,
    'BRONX': 1438159,
    'BROOKLYN': 2621793,
    'QUEENS': 2321580,
    'STATEN ISLAND': 473279,
    }
In [10]:
borobydate['Pop'] = [boropop.get(x) for x in borobydate.index]
In [11]:
borobydate['CompPerCap'] = borobydate['Date']/borobydate['Pop']
In [12]:
borobydate
Out[12]:
Date Pop CompPerCap
Boro
BRONX 1882215 1438159 1.308767
BROOKLYN 3178206 2621793 1.212226
MANHATTAN 2166983 1636268 1.324345
QUEENS 2444874 2321580 1.053108
STATEN ISLAND 541935 473279 1.145065
Unspecified 1157085 NaN NaN

Complaints by borough over months

The next thing we're going to do is make a stacked plot of complaints by borough, over months. To do this, we need to extract the day and month from the date column:

In [13]:
borodata['Year'] = [x.year for x in borodata['Date']]
borodata['Month'] = [x.month for x in borodata['Date']]
In [14]:
borodata = borodata.loc[borodata['Boro'] != 'Unspecified']

Next, we need to generate an array of Ys. We want the rows of this dataframe to be the 5 boroughs, and the columns to be the count of complaints for the year and month.

In [15]:
boroplotdata = borodata.groupby(by=['Boro', 'Year','Month']).count()
boroplotdata
Out[15]:
Date
Boro Year Month
BRONX 2010 1 9395
2 9475
3 13616
4 11786
5 12169
6 15681
7 16861
8 12920
9 11633
10 10590
11 9196
12 8815
2011 1 9599
2 11229
3 26772
4 22898
5 22722
6 23677
7 27348
8 25241
9 25138
10 30071
11 28319
12 30927
2012 1 33097
2 25259
3 24480
4 20330
5 21851
6 24170
... ... ... ...
STATEN ISLAND 2013 12 4984
2014 1 8753
2 8978
3 8211
4 7937
5 8278
6 8173
7 8111
8 6934
9 6597
10 7067
11 5558
12 5889
2015 1 6398
2 8228
3 10537
4 8489
5 8240
6 8803
7 8507
8 7684
9 7920
10 7592
11 6756
12 6264
2016 1 7186
2 7505
3 8531
4 8750
5 6688

385 rows × 1 columns

We basically need to get the above table into a graph.
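Rather than building the matrix by hand with dictionaries and `np.row_stack` (as below), the grouped table can be pivoted directly with `unstack`; a sketch on a toy frame shaped like `borodata`:

```python
import pandas as pd

toy = pd.DataFrame({
    'Boro': ['BRONX', 'BRONX', 'QUEENS', 'QUEENS'],
    'Year': [2010, 2010, 2010, 2010],
    'Month': [1, 2, 1, 2],
    'Date': ['a', 'b', 'c', 'd'],  # placeholder values; only the count matters
})

counts = toy.groupby(['Boro', 'Year', 'Month']).count()
# one row per borough, one column per (Year, Month) pair
wide = counts['Date'].unstack(['Year', 'Month'])
```

`wide.values` would then be the 2-D array of counts that `stackplot` expects.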

In [16]:
boros = borobydate.index
borodict = {x:[] for x in boros}
borodict.pop('Unspecified')

for boro in borodict:
    borodict[boro] = list(boroplotdata.xs(boro).Date)
In [17]:
plotdata = np.zeros(len(borodict['BROOKLYN']))
for boro in borodict:
    plotdata = np.row_stack((plotdata, borodict[boro]))
In [18]:
plotdata = np.delete(plotdata, (0), axis=0)
In [19]:
plotdata
Out[19]:
array([[  5804.,   6791.,  11009.,   6578.,   6590.,   6370.,   5909.,
          5957.,   5926.,   5597.,   4823.,   4715.,   6008.,   7360.,
          8109.,   6836.,   7091.,   7257.,   6380.,   9858.,   8268.,
          8356.,   6778.,   5799.,   6169.,   5042.,   6018.,   5607.,
          6746.,   6839.,   7037.,   7024.,   6108.,   7036.,   7509.,
          5445.,   7207.,   5851.,   6352.,   6915.,   6843.,   7392.,
          7552.,   6442.,   5936.,   6152.,   4996.,   4984.,   8753.,
          8978.,   8211.,   7937.,   8278.,   8173.,   8111.,   6934.,
          6597.,   7067.,   5558.,   5889.,   6398.,   8228.,  10537.,
          8489.,   8240.,   8803.,   8507.,   7684.,   7920.,   7592.,
          6756.,   6264.,   7186.,   7505.,   8531.,   8750.,   6688.],
       [ 21753.,  22463.,  30847.,  25357.,  26545.,  27245.,  29419.,
         26279.,  26287.,  24483.,  22788.,  22361.,  22895.,  24331.,
         43807.,  41075.,  39214.,  41555.,  42120.,  46155.,  44927.,
         47186.,  44536.,  44451.,  46493.,  36073.,  37645.,  35357.,
         38451.,  38776.,  41169.,  39849.,  36523.,  41004.,  47840.,
         42908.,  50532.,  40369.,  40716.,  38720.,  40310.,  41070.,
         41466.,  37545.,  34560.,  40456.,  42451.,  42784.,  60679.,
         49657.,  45848.,  40934.,  46025.,  46460.,  46283.,  41112.,
         41180.,  45079.,  47862.,  46171.,  52516.,  55555.,  55122.,
         47002.,  47346.,  51835.,  51154.,  47652.,  47385.,  52645.,
         48386.,  45544.,  56649.,  52388.,  50952.,  52751.,  44888.],
       [  9395.,   9475.,  13616.,  11786.,  12169.,  15681.,  16861.,
         12920.,  11633.,  10590.,   9196.,   8815.,   9599.,  11229.,
         26772.,  22898.,  22722.,  23677.,  27348.,  25241.,  25138.,
         30071.,  28319.,  30927.,  33097.,  25259.,  24480.,  20330.,
         21851.,  24170.,  25939.,  22423.,  19910.,  23060.,  29396.,
         26587.,  33031.,  28243.,  24956.,  22083.,  20914.,  22673.,
         26734.,  20546.,  19060.,  23673.,  29511.,  30904.,  43650.,
         33274.,  29464.,  24904.,  24040.,  24024.,  26418.,  22771.,
         21910.,  25771.,  33015.,  31056.,  36197.,  37733.,  32791.,
         26419.,  24552.,  27103.,  29387.,  26814.,  25613.,  30593.,
         29559.,  28256.,  39647.,  36866.,  30192.,  29880.,  25408.],
       [ 15659.,  14831.,  18597.,  19642.,  20490.,  20316.,  21347.,
         19256.,  19772.,  19534.,  18897.,  14786.,  14499.,  17235.,
         27956.,  26018.,  26449.,  25271.,  25531.,  25630.,  27482.,
         31265.,  29656.,  28735.,  29867.,  24649.,  27054.,  24528.,
         25844.,  26173.,  25093.,  25044.,  23468.,  26678.,  28202.,
         27837.,  32110.,  27178.,  28061.,  26816.,  27207.,  25809.,
         26994.,  25026.,  24760.,  29790.,  30731.,  28005.,  37394.,
         29569.,  30384.,  29098.,  30921.,  30898.,  31221.,  28214.,
         28754.,  32569.,  33913.,  30960.,  34194.,  34262.,  33343.,
         33626.,  33555.,  33952.,  33896.,  34326.,  34295.,  40630.,
         37755.,  33389.,  40881.,  37560.,  40559.,  43912.,  37175.],
       [ 22951.,  22336.,  32976.,  27739.,  28528.,  28287.,  29154.,
         27223.,  34110.,  26295.,  22845.,  21670.,  22964.,  24969.,
         32932.,  29108.,  29876.,  31333.,  31231.,  38856.,  33368.,
         31570.,  29785.,  28166.,  29217.,  23758.,  26756.,  25594.,
         28361.,  30195.,  31792.,  31756.,  28628.,  34079.,  39404.,
         29450.,  33078.,  28121.,  27475.,  27384.,  29037.,  29633.,
         31044.,  27880.,  26220.,  28053.,  27561.,  28286.,  38144.,
         36074.,  35977.,  31202.,  34711.,  35843.,  36394.,  31615.,
         31129.,  32470.,  33042.,  31788.,  34332.,  38725.,  44965.,
         37836.,  36885.,  40637.,  39691.,  38212.,  36171.,  38864.,
         34327.,  34463.,  38982.,  38579.,  38548.,  40171.,  34063.]])

Awesome! Now we have 5 rows with 77 columns each denoting the complaints for each of the boros for each of the months from 2010 to 2016.

In [20]:
from matplotlib import patches as mpatches

x = np.arange(len(plotdata[0]))

#crude xlabels
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years = ['2010', '2011', '2012', '2013', '2014', '2015', '2016']
xlabels = []
for year in years:
    for month in months:
        xlabels.append("{0} {1}".format(month,year))

plotcolors = [(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(color[0]/float(255),color[1]/float(255),color[2]/float(255)) for color in plotcolors]
legendcolors = [mpatches.Patch(color = color) for color in plotcolors]

plt.figure(figsize = (15,10));
plt.stackplot(x,plotdata, colors = plotcolors);
plt.xticks(x,xlabels,rotation=90);
plt.xlim(0,76)
plt.legend(legendcolors,borodict.keys(), bbox_to_anchor=(0.2, 1));
plt.title('311 Complaints by Borough', size = 24)
plt.ylabel('Number of Complaints',size = 14)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')
C:\Users\Bryan\Anaconda2\lib\site-packages\matplotlib\collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

Sanity checks

So, some parts of this graph bear checking. First, did Staten Island really not increase in complaints over the years? The data below (complaints by borough by year) suggest that that was the case:

In [21]:
borodata.groupby(by = ['Boro', 'Year']).count()
Out[21]:
Date Month
Boro Year
BRONX 2010 142137 142137
2011 283941 283941
2012 296502 296502
2013 302328 302328
2014 340297 340297
2015 355017 355017
2016 161993 161993
BROOKLYN 2010 305827 305827
2011 482252 482252
2012 482088 482088
2013 490979 490979
2014 557290 557290
2015 602142 602142
2016 257628 257628
MANHATTAN 2010 223127 223127
2011 305727 305727
2012 314437 314437
2013 332487 332487
2014 373895 373895
2015 417223 417223
2016 200087 200087
QUEENS 2010 324114 324114
2011 364158 364158
2012 358990 358990
2013 343772 343772
2014 408389 408389
2015 455108 455108
2016 190343 190343
STATEN ISLAND 2010 76069 76069
2011 88100 88100
2012 76580 76580
2013 76622 76622
2014 90486 90486
2015 95418 95418
2016 38660 38660

The other thing that would be worth investigating is what happened between February and March 2011, when complaints doubled. Perhaps some aspect of the pipeline changed then, but that's a story for a future post.

Hack for Heat #3: Number of complaints over time

In [3]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40

Hack for Heat #3: Number of complaints over time

This time, we're going to look at raw 311 complaint data. The data that I was working with previously was summarized data.

This dataset is much bigger, which is nice because it'll give me a chance to maintain my SQL-querying-from-memory skills.

First, we're going to have to load all of this data into a postgres database.

SQL-ing this

The python library psycopg2 lets us work with postgres databases in python. We first create a connection object that encapsulates the connection to the database, then create a cursor object that lets us run queries against that database.

In [22]:
connection = psycopg2.connect('dbname = threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

For example, we might want to extract the column names from our table:

In [23]:
cursor.execute('''SELECT * FROM threeoneone.INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'service'; ''')
columns = cursor.fetchall()
In [24]:
columns = [x[3] for x in columns]
In [25]:
columns[0:5]
Out[25]:
['uniquekey', 'createddate', 'closeddate', 'agency', 'agencyname']

Complaints over time

Let's start with something simple. First, let's extract a list of all complaints, and the plot the number of complaints by month.

In [27]:
cursor.execute('''SELECT createddate FROM service;''')
complaintdates = cursor.fetchall()
In [32]:
complaintdates = pd.DataFrame(complaintdates)
In [33]:
complaintdates.head()
Out[33]:
0
0 (2013-04-26,)
1 (2013-04-26,)
2 (2013-04-26,)
3 (2013-04-26,)
4 (2013-04-26,)

Renaming our column:

In [42]:
complaintdates.columns = ['Date']

Next, we have to unpack these one-element tuples:

In [51]:
complaintdates['Date'] = [x [0] for x in complaintdates['Date']]

Normally, if these were strings, we'd use the extract_dates function we wrote in a previous post. However, because I typed these as datetime objects, we can just read their .year, .month, and .day attributes:

In [60]:
type(complaintdates['Date'][0])
Out[60]:
datetime.date
In [63]:
complaintdates['Day'] = [x.day for x in complaintdates['Date']]
complaintdates['Month'] = [x.month for x in complaintdates['Date']]
complaintdates['Year'] = [x.year for x in complaintdates['Date']]
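As an aside, converting the column to pandas' datetime64 type with `pd.to_datetime` unlocks the vectorized `.dt` accessor, which avoids the per-row comprehensions; a minimal sketch:

```python
import datetime
import pandas as pd

df = pd.DataFrame({'Date': [datetime.date(2013, 4, 26),
                            datetime.date(2014, 5, 1)]})

dates = pd.to_datetime(df['Date'])  # datetime.date -> datetime64
df['Year'] = dates.dt.year
df['Month'] = dates.dt.month
df['Day'] = dates.dt.day
```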

This is how many total complaints we have:

In [67]:
len(complaintdates)
Out[67]:
11371298

We can group them by month:

In [69]:
bymonth = complaintdates.groupby(by='Month').count()
bymonth
Out[69]:
Date Year Day
Month
1 1182682 1182682 1182682
2 1065065 1065065 1065065
3 1099983 1099983 1099983
4 985815 985815 985815
5 972913 972913 972913
6 849832 849832 849832
7 875210 875210 875210
8 829573 829573 829573
9 805231 805231 805231
10 895672 895672 895672
11 915066 915066 915066
12 894256 894256 894256

By year:

In [71]:
byyear = complaintdates.groupby(by='Year').count()
byyear
Out[71]:
Date Month Day
Year
2010 1754355 1754355 1754355
2011 1688091 1688091 1688091
2012 1568421 1568421 1568421
2013 1614207 1614207 1614207
2014 1839574 1839574 1839574
2015 2021963 2021963 2021963
2016 884687 884687 884687
In [88]:
byday = complaintdates.groupby(by='Day').count()
bydate = complaintdates.groupby(by='Date').count()

Some matplotlib

In [70]:
plt.figure(figsize = (12,10))
x = range(0,12)
y = bymonth['Date']
plt.plot(x,y)
Out[70]:
[<matplotlib.lines.Line2D at 0xa7a0ed68>]
In [78]:
plt.figure(figsize = (12,10))
x = range(0,7)
y = byyear['Date']

plt.plot(x,y)
Out[78]:
[<matplotlib.lines.Line2D at 0xa1925748>]
In [86]:
plt.figure(figsize = (12,10))
x = range(0,len(byday))
y = byday['Date']

plt.plot(x,y)
Out[86]:
[<matplotlib.lines.Line2D at 0x8c9c83c8>]

The sharp decline we see at the end is obviously because not all months have the same number of days.

In [94]:
plt.figure(figsize=(12,10))
x = range(0,len(bydate))
y = bydate['Year'] #This is arbitrary - year, month, and day are all series that store the counts

plt.plot(x,y)
Out[94]:
[<matplotlib.lines.Line2D at 0xa0ed9860>]

That's all for now. In the next post, I'm going to break this down by borough, as well as polish this graph.

Hack for Heat #2: Problem types over time

In [1]:
%matplotlib inline

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

pd.options.display.max_columns = 40

Hack for Heat #2: Problem types over time

In this post, I'm going to explore how we might track the composition of problems that HPD (the Department of Housing Preservation and Development) might receive over time. The data that I currently have are only limited to 2.5 years, so this is just a proof of concept.

In [2]:
hpdcompprob = pd.read_csv("Complaint_Problems.csv")
In [3]:
type(hpdcompprob.StatusDate[0])
Out[3]:
str

The first thing we're going to do is convert string dates into simple years, months, and days.

In [6]:
def extract_dates(series, nameofdatecolumn):
    series['Year'] = series[nameofdatecolumn].str.extract(r'../../(.*)', expand=True)
    series['Month'] = series[nameofdatecolumn].str.extract(r'(.*)/../....', expand=True)
    series['Day'] = series[nameofdatecolumn].str.extract(r'../(.*)/....', expand=True)
    
extract_dates(hpdcompprob, 'StatusDate')
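The regexes above assume a fixed MM/DD/YYYY layout; `pd.to_datetime` parses the same format in one call. A sketch (the `StatusDate` values here are made up for illustration; note this yields integers rather than the string columns the regex version produces):

```python
import pandas as pd

df = pd.DataFrame({'StatusDate': ['07/29/2014', '07/12/2014']})

parsed = pd.to_datetime(df['StatusDate'], format='%m/%d/%Y')
df['Year'] = parsed.dt.year
df['Month'] = parsed.dt.month
df['Day'] = parsed.dt.day
```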
In [7]:
def count_problem_types(pdseries):
    problemtypes = pdseries.unique()
    problemcounts = {k:0 for k in problemtypes}
    problemcounts['TOTAL'] = 0
    
    for problem in pdseries:
        problemcounts[problem] += 1
    
    problemcounts['TOTAL'] = len(pdseries)
    print year, problemcounts #relies on the global 'year' set by the loop below; just shows us what those counts are
        
    return problemcounts
In [8]:
#years = hpdcompprob.Year.unique()
years = ['2014','2015','2016'] #We know that only these data are relevant (the other, older years, are remnants of old cases)
yearsdict = {k:{} for k in years}
for year in years:
    yearsdict[year] = count_problem_types(hpdcompprob.loc[hpdcompprob.Year == year].MajorCategory)
2014 {'TOTAL': 259437, 'ELECTRIC': 12684, 'HEAT/HOT WATER': 93548, 'WATER LEAK': 14315, 'APPLIANCE': 5231, 'DOOR/WINDOW': 16352, 'HEATING': 16, 'ELEVATOR': 363, 'GENERAL': 9680, 'FLOORING/STAIRS': 11768, 'NONCONST': 185, 'UNSANITARY CONDITION': 34779, 'SAFETY': 5521, 'PLUMBING': 23179, 'CONSTRUCTION': 3, 'OUTSIDE BUILDING': 758, 'PAINT/PLASTER': 31055}
2015 {'TOTAL': 705644, 'PAINT/PLASTER': 84062, 'WATER LEAK': 42512, 'DOOR/WINDOW': 49154, 'APPLIANCE': 14719, 'NONCONST': 69, 'ELEVATOR': 1213, 'GENERAL': 31482, 'HEAT/HOT WATER': 228157, 'FLOORING/STAIRS': 34741, 'UNSANITARY CONDITION': 97221, 'SAFETY': 17105, 'PLUMBING': 67472, 'CONSTRUCTION': 1, 'OUTSIDE BUILDING': 2183, 'ELECTRIC': 35553}
2016 {'TOTAL': 211830, 'ELECTRIC': 9239, 'WATER LEAK': 12063, 'DOOR/WINDOW': 15465, 'APPLIANCE': 4644, 'HEATING': 39, 'ELEVATOR': 377, 'GENERAL': 9738, 'HEAT/HOT WATER': 73803, 'NONCONST': 33, 'UNSANITARY CONDITION': 27727, 'SAFETY': 5243, 'PLUMBING': 20218, 'FLOORING/STAIRS': 9989, 'OUTSIDE BUILDING': 543, 'PAINT/PLASTER': 22709}
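The hand-rolled counting function above can also be expressed with pandas' `value_counts`; a sketch on toy data shaped like `hpdcompprob` (it omits the 'TOTAL' key, which would be `len(group)` per year):

```python
import pandas as pd

toy = pd.DataFrame({
    'Year': ['2014', '2014', '2015'],
    'MajorCategory': ['HEAT/HOT WATER', 'PLUMBING', 'HEAT/HOT WATER'],
})

# one dict of category counts per year
yearsdict = {
    year: group['MajorCategory'].value_counts().to_dict()
    for year, group in toy.groupby('Year')
}
```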

Graphing our results

We're going to use matplotlib to make a stacked area graph. I referenced this post.

First, let's arrange our dictionary keys in alphabetical order, so that we can extract the values in a meaningful way.

In [9]:
labels = sorted(yearsdict['2014'].keys())
labels.remove('CONSTRUCTION')
labels.remove('HEATING') #removed these because 2015 doesn't have an entry, and both the other years have negligible entries
labels.remove('TOTAL') #not interested in total in this plot
In [10]:
years = ['2014','2015','2016']
plotdict = {k:[] for k in years}
for year in years:
    for label in labels:
        plotdict[year].append(yearsdict[year][label])
In [11]:
len(labels)
Out[11]:
14
In [12]:
x = np.arange(len(years))
y = np.column_stack((plotdict['2014'],plotdict['2015'],plotdict['2016']))
plt.figure(figsize = (15,10));
plt.stackplot(x,y);
plt.legend(labels)
Out[12]:
<matplotlib.legend.Legend at 0xeb24ac8>

This might legit be the ugliest plot I have ever seen. Let's try and fix the colors first.

From this blog post, I got a list of 14 contrasting colors.

In [13]:
from matplotlib import patches as mpatches

plotcolors = [(0,0,0),(1,0,103),(213,255,0),(255,0,86),(158,0,142),(14,76,161),(255,229,2),(0,95,57),\
            (0,255,0),(149,0,58),(255,147,126),(164,36,0),(0,21,68),(145,208,203),(98,14,0)]

#rescaling rgb from 0-255 to 0 to 1
plotcolors = [(x[0]/float(255),x[1]/float(255),x[2]/float(255)) for x in plotcolors]
legendcolors = [mpatches.Patch(color = x) for x in plotcolors]

x = np.arange(len(years))
y = np.column_stack((plotdict['2014'],plotdict['2015'],plotdict['2016']))
plt.figure(figsize = (15,10));
plt.stackplot(x,y,colors = plotcolors);
plt.legend(legendcolors, labels);

#some labels for the x axis
xlabels = ['2014', '2015', '2016']
plt.xticks(x,xlabels, size = 18);
plt.yticks(size = 18);
plt.title('HPD Complaints by type over time', size = 24);

Okay, so this looks a little better, and with some tweaking, we can make this work.

Hack for Heat #1: Initial thoughts on HPD open data

Hacking for heat

In this series, I'm going to be posting about the process that goes on behind some of the blog posts we end up writing. In this first entry, I'm going to be exploring a number of datasets.

These are the ones that I'm going to be looking at:

  1. HPD (Housing Preservation and Development) housing litigations
  2. Housing maintenance code complaints
  3. Housing maintenance code violations
  4. HPD complaints

(for HPD datasets, some documentation can be found here)

HPD litigation database

First, we're going to look at the smallest dataset, one that contains cases against landlords. From the documentation, this file contains "All cases commenced by HPD or by tenants (naming HPD as a party) in [housing court] since August 2006" either seeking orders for landlords to comply with regulations, or awarding HPD civil penalties (i.e., collecting on fines).

In [2]:
import pandas as pd
In [3]:
litigation = pd.read_csv("Housing_Litigations.csv")
In [5]:
litigation.head()
Out[5]:
LitigationID BuildingID BoroID Boro HouseNumber StreetName Zip Block Lot CaseType CaseOpenDate CaseStatus CaseJudgement
0 98385 806633 2 BRONX 866 EAST 221 STREET 10467.0 4679 73 Tenant Action 04/23/2009 CLOSED NO
1 98387 55134 2 BRONX 3940 CARPENTER AVENUE 10466.0 4825 51 Tenant Action 04/23/2009 CLOSED NO
2 98389 119044 2 BRONX 1356 WALTON AVENUE 10452.0 2841 41 Tenant Action 04/23/2009 CLOSED NO
3 98391 889796 2 BRONX 871 EAST 179 STREET 10460.0 3123 77 Tenant Action 04/24/2009 CLOSED NO
4 98393 56949 2 BRONX 1680 CLAY AVENUE 10457.0 2889 1 Tenant Action 04/24/2009 CLOSED NO

Let's take a look at unique values for some of the columns:

In [6]:
litigation['Boro'].unique()
Out[6]:
array(['BRONX', 'QUEENS', 'MANHATTAN', 'BROOKLYN', 'STATEN ISLAND'], dtype=object)
In [14]:
litigation.groupby(by = ['Boro','CaseJudgement']).count()
Out[14]:
LitigationID BuildingID BoroID HouseNumber StreetName Zip Block Lot CaseType CaseOpenDate CaseStatus
Boro CaseJudgement
BRONX NO 19734 19734 19734 19734 19734 19734 19734 19734 19734 19701 19734
YES 481 481 481 481 481 481 481 481 481 478 481
BROOKLYN NO 19833 19833 19833 19833 19833 19831 19833 19833 19833 19807 19833
YES 762 762 762 762 762 760 762 762 762 762 762
MANHATTAN NO 11849 11849 11849 11849 11849 11849 11849 11849 11849 11841 11849
YES 129 129 129 129 129 129 129 129 129 129 129
QUEENS NO 9068 9068 9068 9068 9068 9061 9068 9068 9068 9063 9068
YES 313 313 313 313 313 313 313 313 313 308 313
STATEN ISLAND NO 1064 1064 1064 1064 1064 1064 1064 1064 1064 1064 1064
YES 84 84 84 84 84 84 84 84 84 84 84

The above table tells us that Manhattan has the lowest proportion of cases that receive judgement (about 1 in 80), whereas Staten Island has the highest (about 1 in 12). It may be something worth looking into, but it's also important to note that many cases settle out of court, and landlords in Manhattan may be more willing (or able) to do so.

In [7]:
litigation['CaseType'].unique()
Out[7]:
array(['Tenant Action', 'Heat and Hot Water', 'CONH',
       'Comp Supplemental Cases', 'Comprehensive',
       'Access Warrant - Non-Lead', 'Access Warrant - lead',
       'Lead False Certification', 'False Certification Non-Lead',
       'Heat Supplemental Cases', '7A', 'Failure to Register Only',
       'HLD - Other Case Type'], dtype=object)
In [16]:
litigation.groupby(by = ['CaseType', 'CaseJudgement']).count()
Out[16]:
LitigationID BuildingID BoroID Boro HouseNumber StreetName Zip Block Lot CaseOpenDate CaseStatus
CaseType CaseJudgement
7A NO 105 105 105 105 105 105 105 105 105 102 105
Access Warrant - Non-Lead NO 5665 5665 5665 5665 5665 5665 5665 5665 5665 5665 5665
YES 8 8 8 8 8 8 8 8 8 8 8
Access Warrant - lead NO 1786 1786 1786 1786 1786 1786 1786 1786 1786 1785 1786
CONH NO 797 797 797 797 797 797 797 797 797 797 797
Comp Supplemental Cases NO 750 750 750 750 750 750 750 750 750 702 750
YES 31 31 31 31 31 31 31 31 31 29 31
Comprehensive NO 2274 2274 2274 2274 2274 2274 2274 2274 2274 2274 2274
YES 121 121 121 121 121 121 121 121 121 121 121
Failure to Register Only NO 60 60 60 60 60 60 60 60 60 60 60
YES 3 3 3 3 3 3 3 3 3 3 3
False Certification Non-Lead NO 2355 2355 2355 2355 2355 2355 2355 2355 2355 2355 2355
YES 20 20 20 20 20 20 20 20 20 20 20
HLD - Other Case Type NO 3 3 3 3 3 3 3 3 3 2 3
Heat Supplemental Cases NO 124 124 124 124 124 124 124 124 124 108 124
YES 7 7 7 7 7 7 6 7 7 6 7
Heat and Hot Water NO 15760 15760 15760 15760 15760 15760 15758 15760 15760 15757 15760
YES 1299 1299 1299 1299 1299 1299 1298 1299 1299 1294 1299
Lead False Certification NO 239 239 239 239 239 239 239 239 239 239 239
YES 2 2 2 2 2 2 2 2 2 2 2
Tenant Action NO 31630 31630 31630 31630 31630 31630 31623 31630 31630 31630 31630
YES 278 278 278 278 278 278 278 278 278 278 278

The table above shows the same case judgement proportions, this time conditioned on case type. Unhelpfully, the documentation does not specify the difference between Access Warrant - lead and Non-Lead. There are two possibilities: the first is that "lead" refers to lead-based paint, which is a common problem, but perhaps still too idiosyncratic to have its own warrant type; the second, perhaps more likely, is whether or not HPD was the lead party in the case.

We'll probably end up using these data by aggregating them and examining how complaints change over time, perhaps as a function of case type. There's also the possibility of looking up specific buildings' complaints and tying them to landlords. It should also be straightforward to join this dataset with others by converting the address information into something standardized, like borough-block-lot (BBL; http://www1.nyc.gov/nyc-resources/service/1232/borough-block-lot-bbl-lookup).
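Since a BBL is just the one-digit borough code followed by a zero-padded five-digit block and four-digit lot, constructing it from the columns already in these tables is a few lines of pandas. A sketch, using hypothetical rows in place of the real data:

```python
import pandas as pd

# Hypothetical rows with the borough/block/lot fields these datasets carry
df = pd.DataFrame({'BoroID': [1, 3],
                   'Block': [1904, 2073],
                   'Lot': [4, 29]})

# BBL = 1-digit borough code + 5-digit block + 4-digit lot, zero-padded
df['BBL'] = (df['BoroID'].astype(str)
             + df['Block'].astype(str).str.zfill(5)
             + df['Lot'].astype(str).str.zfill(4))
print(df['BBL'].tolist())  # e.g. ['1019040004', '3020730029']
```

With a BBL column on each table, joins against other city datasets keyed the same way become simple merges.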

HPD complaints

Next, we're going to look at a dataset of HPD complaints.

In [32]:
hpdcomp = pd.read_csv('Housing_Maintenance_Code_Complaints.csv')
In [34]:
hpdcomp.head()
Out[34]:
ComplaintID BuildingID BoroughID Borough HouseNumber StreetName Zip Block Lot Apartment CommunityBoard ReceivedDate StatusID Status StatusDate
0 6960137 3418 1 MANHATTAN 1989 ADAM C POWELL BOULEVARD 10026.0 1904 4 12D 10 07/07/2014 2 CLOSE 07/29/2014
1 6960832 3512 1 MANHATTAN 2267 ADAM C POWELL BOULEVARD 10030.0 1918 4 3B 10 07/08/2014 2 CLOSE 07/12/2014
2 6946867 5318 1 MANHATTAN 778 11 AVENUE 10019.0 1083 1 4P 4 06/19/2014 2 CLOSE 07/13/2014
3 6966946 5608 1 MANHATTAN 1640 AMSTERDAM AVENUE 10031.0 2073 29 5A 9 07/16/2014 2 CLOSE 07/21/2014
4 6956574 17896 1 MANHATTAN 230 EAST 88 STREET 10128.0 1533 32 1E 8 07/01/2014 2 CLOSE 07/09/2014
In [39]:
len(hpdcomp)
Out[39]:
662672

This dataset is less useful on its own. It doesn't tell us what type of complaint was made, only the date it was received and whether the complaint is still open. However, it may be useful in conjunction with the earlier dataset. For example, we might be interested in how many of these complaints end up in court (or at least, have some sort of legal action taken).
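That cross-dataset question comes down to a left merge on `BuildingID`, which both tables share. A minimal sketch with hypothetical miniature tables standing in for the real ones:

```python
import pandas as pd

# Hypothetical miniatures of the two tables, linked by BuildingID
hpdcomp = pd.DataFrame({'ComplaintID': [1, 2, 3],
                        'BuildingID': [3418, 3512, 9999]})
litigation = pd.DataFrame({'LitigationID': [10, 11],
                           'BuildingID': [3418, 3512]})

# Left-merge complaints onto litigation; rows with no match get NaN
merged = hpdcomp.merge(litigation, on='BuildingID', how='left')

# Complaints against buildings that show up in the litigation data
n_litigated = merged['LitigationID'].notnull().sum()
print(n_litigated)
```

Note this only links complaints to buildings with *any* litigation, not to the specific case a complaint led to; matching on dates as well would tighten that up.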

HPD violations

The following dataset tracks HPD violations.

In [35]:
hpdviol = pd.read_csv('Housing_Maintenance_Code_Violations.csv')
In [36]:
hpdviol.head()
Out[36]:
ViolationID BuildingID RegistrationID BoroID Boro HouseNumber LowHouseNumber HighHouseNumber StreetName StreetCode ... NewCertifyByDate NewCorrectByDate CertifiedDate OrderNumber NOVID NOVDescription NOVIssuedDate CurrentStatusID CurrentStatus CurrentStatusDate
0 2293208 444 130476 1 MANHATTAN 22 22 22 1 AVENUE 10010 ... NaN NaN NaN 772 338596.0 § 27-2098 ADM CODE FILE WITH THIS DEPARTMENT ... 04/22/1997 19 VIOLATION CLOSED 03/10/2015
1 2293181 444 130476 1 MANHATTAN 22 22 22 1 AVENUE 10010 ... NaN NaN 09/10/1974 775 338584.0 D26-41.05 ADM CODE FILE WITH THIS DEPARTMENT R... 07/16/1974 19 VIOLATION CLOSED 03/10/2015
2 2293249 448 135326 1 MANHATTAN 2222 2222 2222 1 AVENUE 10010 ... NaN NaN NaN 772 338619.0 § 27-2098 ADM CODE FILE WITH THIS DEPARTMENT ... 11/19/1996 19 VIOLATION CLOSED 03/10/2015
3 2293486 467 136913 1 MANHATTAN 2250 2250 2250 1 AVENUE 10010 ... NaN NaN NaN 772 338733.0 D26-41.03 ADM CODE FILE WITH THIS DEPARTMENT A... 05/10/1982 19 VIOLATION CLOSED 03/10/2015
4 2293490 467 136913 1 MANHATTAN 2250 2250 2250 1 AVENUE 10010 ... NaN NaN NaN 772 338737.0 § 27-2098 ADM CODE FILE WITH THIS DEPARTMENT ... 11/19/1996 19 VIOLATION CLOSED 03/10/2015

5 rows × 30 columns

In [40]:
len(hpdviol)
Out[40]:
1437246

These datasets all have different lengths, but that's not surprising, given that they cover different time spans. One productive initial step would be to convert the date strings into something numerical.
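All of these tables store dates as MM/DD/YYYY strings, so the conversion is one `pd.to_datetime` call. A sketch on a couple of hypothetical values:

```python
import pandas as pd

# Hypothetical date strings in the MM/DD/YYYY format these datasets use
dates = pd.Series(['07/07/2014', '04/22/1997'])

# Parse into proper datetimes; an explicit format is faster and safer
parsed = pd.to_datetime(dates, format='%m/%d/%Y')
print(parsed.dt.year.tolist())  # [2014, 1997]
```

Once parsed, the `.dt` accessor gives year, month, and day components for free, which is all the aggregation over time will need.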

HPD complaint problems database

In [45]:
hpdcompprob = pd.read_csv('Complaint_Problems.csv')
In [48]:
hpdcompprob.head()
Out[48]:
ProblemID ComplaintID UnitTypeID UnitType SpaceTypeID SpaceType TypeID Type MajorCategoryID MajorCategory MinorCategoryID MinorCategory CodeID Code StatusID Status StatusDate StatusDescription
0 14548958 6967900 91 APARTMENT 541 BATHROOM 1 EMERGENCY 9 PLUMBING 63 BATHTUB/SHOWER 2538 BROKEN OR MISSING 2 CLOSE 07/29/2014 The Department of Housing Preservation and De...
1 14548959 6967900 91 APARTMENT 541 BATHROOM 3 NON EMERGENCY 9 PLUMBING 63 BATHTUB/SHOWER 2540 FAUCET BROKEN/MISSING/LEAKING 2 CLOSE 08/04/2014 The Department of Housing Preservation and De...
2 14548960 6967900 91 APARTMENT 543 ENTIRE APARTMENT 3 NON EMERGENCY 58 FLOORING/STAIRS 343 FLOOR 2691 TILE BROKEN OR MISSING 2 CLOSE 08/04/2014 The Department of Housing Preservation and De...
3 14548961 6967900 91 APARTMENT 541 BATHROOM 3 NON EMERGENCY 9 PLUMBING 63 BATHTUB/SHOWER 2541 CHIPPED OR RUSTED 2 CLOSE 08/04/2014 The Department of Housing Preservation and De...
4 14615271 6994958 20 APARTMENT 68 ENTIRE APARTMENT 1 EMERGENCY 59 HEAT/HOT WATER 349 ENTIRE BUILDING 2717 NO HOT WATER 2 CLOSE 08/22/2014 The Department of Housing Preservation and De...

Awesome! This dataset provides some more details about the complaints, and lets us join by ComplaintID.
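That join is an inner merge on `ComplaintID`: each complaint picks up one row per problem recorded against it. A sketch with hypothetical miniatures of the two tables:

```python
import pandas as pd

# Hypothetical miniatures: complaints and their per-problem records
hpdcomp = pd.DataFrame({'ComplaintID': [6967900, 6994958],
                        'Borough': ['MANHATTAN', 'QUEENS']})
hpdcompprob = pd.DataFrame({'ProblemID': [14548958, 14548959, 14615271],
                            'ComplaintID': [6967900, 6967900, 6994958],
                            'MajorCategory': ['PLUMBING', 'PLUMBING',
                                              'HEAT/HOT WATER']})

# One output row per problem, carrying the complaint's columns along
joined = hpdcomp.merge(hpdcompprob, on='ComplaintID')
print(len(joined))  # 3
```

Because the relationship is one-to-many, the merged frame will have the length of the problems table, not the complaints table, so per-complaint statistics need a subsequent groupby on `ComplaintID`.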

Summary and next steps

In the immediate future, I'm going to be writing scripts to join and clean these datasets. This can be done either in Python or in SQL; I haven't decided yet. Additionally, I'm going to be writing some code to do things like convert date strings, and perhaps scrape text.