Hack for Heat #6: Complaint resolution time, over time and censoring

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import psycopg2

pd.options.display.max_columns = 40


As a follow-up to the last post, we're going to try to find out whether average resolution time has changed over the years. This might be tricky, because we might run into the issue of censoring. Censoring is a problem in time-series analyses that occurs when data are missing because they lie outside the range of the measurement window.

In this case, what we might expect to find is that complaint resolution time appears shorter as we get closer to the present day. That may be true, but it may also be an artifact: cases whose problems have yet to be solved show up as missing data. In other words, for a case that was opened in 2010, the longest possible resolution time is about six years, whereas for a case opened yesterday, the longest possible resolution time is one day.
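To make this concrete, here's a quick sketch, using made-up dates rather than rows from our dataset, of how the snapshot date caps the longest resolution time we could possibly observe:

import datetime

observed_until = datetime.date(2016,5,29) #the day we pulled the data

#one hypothetical old case and one hypothetical brand-new case
for created in [datetime.date(2010,1,1), datetime.date(2016,5,28)]:
    longest = (observed_until - created).days
    print('{0}: longest observable resolution time is {1} days'.format(created, longest))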

So, as a first step, let's look at the proportion of cases closed in each year:

In [2]:
#Like before, we're going to select the relevant columns from the database:
connection = psycopg2.connect('dbname=threeoneone user=threeoneoneadmin password=threeoneoneadmin')
cursor = connection.cursor()

cursor.execute('''SELECT createddate, closeddate, borough FROM service;''')
data = cursor.fetchall()
data = pd.DataFrame(data)
In [3]:
data.columns = ['createddate','closeddate','borough']

Let's extract years and months again:

In [4]:
data['cryear'] = [x.year for x in data['createddate']]
data['crmonth'] = [x.month for x in data['createddate']]

And now, we're going to filter out bad cases again. However, this time, we're going to be a bit more inclusive: we'll keep cases where the closed date is missing.

In [5]:
#filter out bad cases: keep rows where the closed date comes after the created date,
#or where the case is still open (closed date missing)
import datetime

today = datetime.date(2016,5,29) #the day we pulled the data
janone = datetime.date(2010,1,1) #the start of our data

data = data.loc[(data['closeddate'] > data['createddate']) | (data['closeddate'].isnull())]
In [6]:
databyyear = data.groupby(by='cryear').count()
databyyear
Out[6]:
createddate closeddate borough crmonth
cryear
2010 1340309 1281719 1340309 1340309
2011 1294795 1157180 1294795 1294795
2012 1185598 1148399 1185598 1185598
2013 1221690 1177863 1221690 1221690
2014 1354300 1305561 1354300 1354300
2015 1451521 1378090 1451521 1451521
2016 605070 530959 605070 605070
In [7]:
#resolution time in days; cases that are still open stay missing
data['timedelta'] = data['closeddate'] - data['createddate']
data['timedeltaint'] = [x.days if pd.notnull(x) else None for x in data['timedelta']]
In [8]:
data.groupby(by='cryear').mean()
Out[8]:
crmonth timedeltaint
cryear
2010 6.505385 24.508043
2011 6.491262 22.512834
2012 6.617468 22.690249
2013 6.414707 20.205784
2014 6.280245 19.712918
2015 6.314193 16.731161
2016 2.799473 8.634098

This table shows exactly what I'm talking about: as we get closer to the present day, the average resolution time falls more and more. If censoring is occurring, we might expect the proportion of cases closed to also decrease over time. This is generally the case:

In [9]:
databyyear['propclosed'] = databyyear['closeddate']/databyyear['createddate']
databyyear
Out[9]:
createddate closeddate borough crmonth propclosed
cryear
2010 1340309 1281719 1340309 1340309 0.956286
2011 1294795 1157180 1294795 1294795 0.893717
2012 1185598 1148399 1185598 1185598 0.968624
2013 1221690 1177863 1221690 1221690 0.964126
2014 1354300 1305561 1354300 1354300 0.964012
2015 1451521 1378090 1451521 1451521 0.949411
2016 605070 530959 605070 605070 0.877517

With the exception of 2011, which we have reason to believe is wonky (see Hack for Heat #4), that is indeed what we see. So, how do we deal with this issue? To be honest, I don't know at the moment. But I'm going to read this paper tomorrow, and see if I can implement something.
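In the meantime, here's a rough sketch of one standard way to handle right-censored durations: the Kaplan-Meier estimator from survival analysis, via the lifelines package. This is my guess at a sensible approach, not something taken from the paper; open cases are treated as censored at the day we pulled the data:

import datetime
from lifelines import KaplanMeierFitter

snapshot = datetime.date(2016,5,29) #assumed: the day we pulled the data

#closed cases are observed 'events'; open cases are censored at the snapshot date
closed = data['closeddate'].notnull()
durations = [days if pd.notnull(days) else (snapshot - created).days
             for days, created in zip(data['timedeltaint'], data['createddate'])]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=closed)
kmf.plot() #estimated probability that a case is still open after t days

The appeal of this approach is that open cases contribute what we do know about them (they survived at least this long) instead of being thrown away.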