Hack for Heat #6: Complaint resolution time, over time and censoring
%matplotlib inline
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2
pd.options.display.max_columns = 40
As a follow-up to the last post, we're going to try to find out whether the average resolution time has changed over time. This might be tricky, because we might run into the issue of censoring. Censoring is a problem in time-series analyses that occurs when data are missing because they lie outside the range of the measurement.
In this case, what we might expect to find is that complaint resolution times appear shorter as we approach the present day. That may be genuinely true, but it may also be an artifact: cases whose problems have yet to be solved show up as missing data. In other words, for a case opened in 2010, the longest possible resolution time is 5 years, whereas for a case opened yesterday, the longest possible resolution time is 1 day.
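This artifact is easy to demonstrate with simulated data. The sketch below (a toy illustration, not the 311 data) opens cases uniformly over a six-year window, draws every resolution time from the *same* distribution, and then only "observes" cases that closed before the window ended:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100000
window_days = 6 * 365
opened = rng.uniform(0, window_days, size=n)            # day each case was opened
true_resolution = rng.exponential(scale=60, size=n)     # true days to resolve

closed = opened + true_resolution
observed = closed <= window_days                        # unresolved cases are censored

# Mean resolution time among *observed* cases, by year opened:
years = (opened // 365).astype(int)
for year in range(6):
    mask = (years == year) & observed
    print(year, round(true_resolution[mask].mean(), 1))
```

Even though the true distribution never changes, the printed means shrink for later years, because long-running cases opened near the end of the window haven't closed yet and drop out of the average. That is exactly the pattern we need to rule out before interpreting the real data.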
So, as a first step, let's look at the proportion of unseen cases over time:
#Like before, we're going to select the relevant columns from the database:
connection = psycopg2.connect('dbname=threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()
cursor.execute('''SELECT createddate, closeddate, borough FROM service;''')
data = cursor.fetchall()
data = pd.DataFrame(data)
data.columns = ['createddate','closeddate','borough']
Let's extract years and months again:
data['cryear'] = [x.year for x in data['createddate']]
data['crmonth'] = [x.month for x in data['createddate']]
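If `createddate` is a proper datetime column, the same extraction can also be done with pandas' `.dt` accessor, which is vectorized and usually faster than a list comprehension. A minimal sketch with made-up dates standing in for the 311 data:

```python
import pandas as pd

# Hypothetical miniature frame standing in for the 311 data:
df = pd.DataFrame({'createddate': pd.to_datetime(['2010-03-15', '2015-12-01'])})

# Vectorized year/month extraction via the .dt accessor:
df['cryear'] = df['createddate'].dt.year
df['crmonth'] = df['createddate'].dt.month
```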
And now, we're going to filter out bad cases again. However, this time, we're going to be a bit more selective. We're going to include cases where the closed date is missing.
#filter out bad cases
import datetime
today = datetime.date(2016,5,29)
janone = datetime.date(2010,1,1)
data = data.loc[(data['closeddate'] > data['createddate']) | (data['closeddate'].isnull())]
databyyear = data.groupby(by='cryear').count()
databyyear
data['timedelta'] = data['closeddate'] - data['createddate']
data['timedeltaint'] = [x.days if pd.notnull(x) else None for x in data['timedelta']]
data.groupby(by='cryear').mean()
This table shows exactly what I'm talking about: as we get closer to the present day, the average resolution time falls more and more. If censoring is occurring, we might expect the proportion of cases closed to decrease over time as well. This is generally the case:
databyyear['propclosed'] = databyyear['closeddate']/databyyear['createddate']
databyyear
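Since the notebook already imports matplotlib, the trend is easier to see as a plot. A sketch of what that would look like; the `propclosed` values below are placeholders, not the real 311 numbers (in the notebook you would plot `databyyear['propclosed']` directly):

```python
from matplotlib import pyplot as plt
import pandas as pd

# Placeholder stand-in for databyyear['propclosed'] computed above:
propclosed = pd.Series([0.98, 0.92, 0.97, 0.96, 0.95, 0.90, 0.75],
                       index=range(2010, 2017), name='propclosed')

fig, ax = plt.subplots()
ax.plot(propclosed.index, propclosed.values, marker='o')
ax.set_xlabel('year opened')
ax.set_ylabel('proportion of cases closed')
ax.set_title('Proportion closed by year opened')
```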
With the exception of year 2011, which we have reason to believe is wonky (see Hack for Heat #4), this is generally the case. So, how do we deal with this issue? To be honest, I don't know at the moment. But I'm going to read this paper tomorrow and see if I can implement something.
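For what it's worth, one standard tool for right-censored durations is the Kaplan-Meier estimator from survival analysis: instead of dropping open cases, it treats their current age as a lower bound on their resolution time. A minimal, self-contained sketch (this is a generic textbook estimator with made-up durations, not necessarily the approach from that paper):

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Kaplan-Meier survival estimate.

    durations: days to close (if observed) or age so far (if censored).
    observed:  True where the case actually closed.
    Returns (event_times, survival_probabilities).
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)

    times = np.unique(durations[observed])
    survival = []
    s = 1.0
    for t in times:
        at_risk = np.sum(durations >= t)              # cases still unresolved just before t
        events = np.sum((durations == t) & observed)  # cases closing exactly at t
        s *= 1.0 - events / at_risk
        survival.append(s)
    return times, np.array(survival)

# Made-up durations (days): three closed cases and two still-open (censored).
durations = [5, 10, 10, 8, 20]
observed = [True, True, True, False, False]
times, surv = kaplan_meier(durations, observed)
```

The resulting survival curve estimates the probability that a complaint is still open after a given number of days, using the censored cases' ages instead of discarding them.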