Hack for Heat #5: Complaint resolution time
%matplotlib inline
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2
pd.options.display.max_columns = 40
Hack for Heat #5: How long do complaints take to resolve?
In this post, we're going to see if we can graph how long it takes for complaints to get resolved.
connection = psycopg2.connect('dbname=threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()
cursor.execute('''SELECT createddate, closeddate, borough FROM service;''')
data = cursor.fetchall()
data = pd.DataFrame(data)
data.columns = ['createddate','closeddate','borough']
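As an aside, pandas can pull a query straight into a DataFrame with `pd.read_sql`, which skips the fetchall-and-rename steps. A minimal sketch, using an in-memory SQLite database as a stand-in for the Postgres connection (the table and values here are toy data, not the real 311 table):

```python
import sqlite3

import pandas as pd

# Stand-in for the Postgres connection: an in-memory SQLite database
# with a toy version of the service table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE service (createddate TEXT, closeddate TEXT, borough TEXT)')
conn.execute("INSERT INTO service VALUES ('2015-01-01', '2015-01-05', 'BROOKLYN')")
conn.commit()

# read_sql returns a DataFrame with column names already set,
# and parse_dates converts the date columns to datetime64.
df = pd.read_sql('SELECT createddate, closeddate, borough FROM service', conn,
                 parse_dates=['createddate', 'closeddate'])
print(df.columns.tolist())  # ['createddate', 'closeddate', 'borough']
```

The same call works against the psycopg2 connection directly, with the dates already typed on the Postgres side.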
data = data.loc[data['createddate'].notnull()]
data = data.loc[data['closeddate'].notnull()]
data['timedelta'] = data['closeddate'] - data['createddate']
data['timedeltaint'] = [x.days for x in data['timedelta']]
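The list comprehension works, but for a timedelta column pandas exposes the same conversion vectorized through the `.dt` accessor, which is faster on millions of rows. A quick sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'createddate': pd.to_datetime(['2015-01-01', '2015-03-10']),
    'closeddate':  pd.to_datetime(['2015-01-05', '2015-03-12']),
})
df['timedelta'] = df['closeddate'] - df['createddate']

# Vectorized equivalent of [x.days for x in df['timedelta']]
df['timedeltaint'] = df['timedelta'].dt.days
print(df['timedeltaint'].tolist())  # [4, 2]
```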
data.head()
data.groupby(by='borough')['timedeltaint'].mean()
Oops! Looks like something's wrong. Let's try to find out what:
data.sort_values('timedeltaint').head()
data.sort_values('timedeltaint', ascending=False).head()
Ah. Well, as a first step, let's remove any values that are before Jan 1st 2010 or after today:
import datetime
today = datetime.date(2016,5,29)
janone = datetime.date(2010,1,1)
Let's also remove any rows where the close date is before the created date:
subdata = data.loc[(data['closeddate'] > janone) & (data['closeddate'] < today)]
subdata = subdata.loc[subdata['closeddate'] > subdata['createddate']]
len(subdata)
subdata.sort_values('timedeltaint').head()
subdata.sort_values('timedeltaint', ascending=False).head()
This looks a little bit more realistic, but let's also visualize the distribution:
plotdata = list(subdata['timedeltaint'])
plt.figure(figsize=(12,10))
plt.hist(plotdata);
Okay, this still looks really wonky. Let's further subset the data, and see what happens when we remove the top and bottom 2.5%.
Pandas has a quantile function:
subdata.quantile([.025, .975])
quantcutdata = subdata.loc[(subdata['timedeltaint'] > 1) & (subdata['timedeltaint'] < 138)] # cutoffs read off the quantile output above
len(quantcutdata)
plotdata = list(quantcutdata['timedeltaint'])
plt.figure(figsize=(12,10))
plt.hist(plotdata);
That looks a little better, but there might be other ways to filter out bad data.
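For instance, rather than reading the cutoffs off the quantile table by hand, we could compute them directly so they track whatever data we load. A sketch on toy numbers (the series below is made up, not the real resolution times):

```python
import pandas as pd

# Toy resolution times in days, including two implausible extremes
s = pd.Series([-5, 0, 1, 2, 3, 4, 5, 200])

# Compute the 2.5th and 97.5th percentiles instead of hardcoding them
lo, hi = s.quantile([.025, .975])
trimmed = s[(s > lo) & (s < hi)]  # drop the extreme tails
print(len(trimmed))  # 6
```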
subdata.groupby(by='borough').median()
subdata.groupby(by='borough').mean()
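One reason to look at both statistics: the mean is pulled around by extreme values far more than the median, so even a handful of bad close dates can distort it. A toy illustration:

```python
import pandas as pd

days = pd.Series([3, 4, 5, 6, 7])
print(days.mean(), days.median())  # 5.0 5.0

# One bogus 1000-day record shifts the mean dramatically; the median barely moves.
days_bad = pd.Series([3, 4, 5, 6, 7, 1000])
print(days_bad.mean(), days_bad.median())  # 170.833... 5.5
```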
So, I definitely wouldn't trust these numbers yet - I'm still trying to find a data dictionary for the 311 data, and I need to make sure the columns mean what I think they mean. The point of this post is just to show what the process will look like once I get my hands on reliable data.