Hack for Heat #6: Complaint resolution time, over time and censoring

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40


As a follow-up to the last post, we're going to try to find out whether average resolution time has changed over the years. This might be tricky, because we might run into the issue of censoring. Censoring is a problem in time-series analyses that occurs when data are missing because they lie outside the range of the measurement window.

In this case, what we might expect to find is that complaint resolution time appears shorter as we get closer to the present day. That may be true, but it may also be an artifact: cases whose problems have yet to be solved show up as missing data. In other words, for a case that was opened in 2010, the longest possible resolution time is about six years, whereas for a case opened yesterday, the longest possible resolution time is one day.
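To make this concrete, here's a quick sketch, using made-up dates rather than rows from our dataset, of how the snapshot date caps the longest resolution time we could possibly observe:

import datetime

observed_until = datetime.date(2016,5,29) #the day we pulled the data

#one hypothetical old case and one hypothetical brand-new case
for created in [datetime.date(2010,1,1), datetime.date(2016,5,28)]:
    longest = (observed_until - created).days
    print('{0}: longest observable resolution time is {1} days'.format(created, longest))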

So, as a first step, let's look at the proportion of cases closed in each year:

In [2]:
#Like before, we're going to select the relevant columns from the database:
connection = psycopg2.connect('dbname=threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

cursor.execute('''SELECT createddate, closeddate, borough FROM service;''')
data = cursor.fetchall()
data = pd.DataFrame(data)
In [3]:
data.columns = ['createddate','closeddate','borough']

Let's extract years and months again:

In [4]:
data['cryear'] = [x.year for x in data['createddate']]
data['crmonth'] = [x.month for x in data['createddate']]

And now, we're going to filter out bad cases again. However, this time, we're going to be a bit more inclusive: we'll keep cases where the closed date is missing.

In [5]:
#filter out bad cases: keep rows where the closed date comes after the created date,
#or where the case is still open (closed date missing)
import datetime

today = datetime.date(2016,5,29) #the day we pulled the data
janone = datetime.date(2010,1,1) #the start of our data

data = data.loc[(data['closeddate'] > data['createddate']) | (data['closeddate'].isnull())]
In [6]:
databyyear = data.groupby(by='cryear').count()
databyyear
Out[6]:
createddate closeddate borough crmonth
cryear
2010 1340309 1281719 1340309 1340309
2011 1294795 1157180 1294795 1294795
2012 1185598 1148399 1185598 1185598
2013 1221690 1177863 1221690 1221690
2014 1354300 1305561 1354300 1354300
2015 1451521 1378090 1451521 1451521
2016 605070 530959 605070 605070
In [7]:
#resolution time in days; cases that are still open stay missing
data['timedelta'] = data['closeddate'] - data['createddate']
data['timedeltaint'] = [x.days if pd.notnull(x) else None for x in data['timedelta']]
In [8]:
data.groupby(by='cryear').mean()
Out[8]:
crmonth timedeltaint
cryear
2010 6.505385 24.508043
2011 6.491262 22.512834
2012 6.617468 22.690249
2013 6.414707 20.205784
2014 6.280245 19.712918
2015 6.314193 16.731161
2016 2.799473 8.634098

This table shows exactly what I'm talking about: as we get closer to the present day, the average resolution time falls more and more. If censoring is occurring, we might expect the proportion of cases closed to also decrease over time. This is generally the case:

In [9]:
databyyear['propclosed'] = databyyear['closeddate']/databyyear['createddate']
databyyear
Out[9]:
createddate closeddate borough crmonth propclosed
cryear
2010 1340309 1281719 1340309 1340309 0.956286
2011 1294795 1157180 1294795 1294795 0.893717
2012 1185598 1148399 1185598 1185598 0.968624
2013 1221690 1177863 1221690 1221690 0.964126
2014 1354300 1305561 1354300 1354300 0.964012
2015 1451521 1378090 1451521 1451521 0.949411
2016 605070 530959 605070 605070 0.877517

With the exception of 2011, which we have reason to believe is wonky (see Hack for Heat #4), that is indeed what we see. So, how do we deal with this issue? To be honest, I don't know at the moment. But I'm going to read this paper tomorrow, and see if I can implement something.
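In the meantime, here's a rough sketch of one standard way to handle right-censored durations: the Kaplan-Meier estimator from survival analysis, via the lifelines package. This is my guess at a sensible approach, not something taken from the paper; open cases are treated as censored at the day we pulled the data:

import datetime
from lifelines import KaplanMeierFitter

snapshot = datetime.date(2016,5,29) #assumed: the day we pulled the data

#closed cases are observed 'events'; open cases are censored at the snapshot date
closed = data['closeddate'].notnull()
durations = [days if pd.notnull(days) else (snapshot - created).days
             for days, created in zip(data['timedeltaint'], data['createddate'])]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=closed)
kmf.plot() #estimated probability that a case is still open after t days

The appeal of this approach is that open cases contribute what we do know about them (they survived at least this long) instead of being thrown away.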