Hack for Heat #1: Initial thoughts on HPD open data
Hacking for heat¶
In this series, I'm going to be posting about the process that goes on behind some of the blog posts we end up writing. In this first entry, I'm going to be exploring a number of datsets.
These are the ones that I'm going to be looking at:
- HPD (Housing Preservation and Development) housing litigations
- Housing maintenance code complaints
- Housing maintanence code violations
- HPD complaints
(for HPD datasets, some documentation can be found here)
HPD litigation database¶
First, we're going to look at the smallest dataset, one that contains cases against landlords. From the documentation, this file contains "All cases commenced by HPD or by tennants (naming HPD as a party) in [housing court] since August 2006" either seeking orders for landlords to comply with regulations, or awarding HPD civil penalties (i.e., collecting on fines).
import pandas as pd
litigation = pd.read_csv("Housing_Litigations.csv")
litigation.head()
Let's take a look at unique values for some of the columns:
litigation['Boro'].unique()
litigation.groupby(by = ['Boro','CaseJudgement']).count()
The above table tells us that Manhattan has the lowest proportion of cases that receive judgement (about 1 in 80), whereas Staten Island has the highest (about 1 in 12). It may be something worth looking into, but it's also important to note that many cases settle out of court, and landlords in Manhattan may be more willing (or able) to do so.
litigation['CaseType'].unique()
litigation.groupby(by = ['CaseType', 'CaseJudgement']).count()
The table above shows the same case judgement proportions, but conditioned on what type of case it was. Unhelpfully, the documentation does not specify what the difference between Access Warrant - Lead and Non-Lead is. It could be one of two possibilities: The first is whether the warrants have to do with lead-based paint, which is a common problem, but perhaps still too idiosyncratic to have it's own warrant type. The second, perhaps more likely possibility is whether or not HPD was the lead party in the case.
We'll probably end up using these data by aggregating it and examining how complaints change over time, perhaps as a function of what type they are. There's also the possibility of looking up specific buildings' complaints and tying them to landlords. There's probably also an easy way to join this dataset with another, by converting the address information into something standardized, like borough-block-lot (BBL; http://www1.nyc.gov/nyc-resources/service/1232/borough-block-lot-bbl-lookup)
HPD complaints¶
Next, we're going to look at a dataset of HPD complaints.
hpdcomp = pd.read_csv('Housing_Maintenance_Code_Complaints.csv')
hpdcomp.head()
len(hpdcomp)
This dataset is less useful on its own. It doesn't tell us what the type of complaint was, only the date it was received and whether or not the complaint is still open. However, it may be useful in conjunction with the earlier dataset. For example, we might be interested in how many of these complaints end up in court (or at least, have some sort of legal action taken).
HPD violations¶
The following dataset tracks HPD violations.
hpdviol = pd.read_csv('Housing_Maintenance_Code_Violations.csv')
hpdviol.head()
len(hpdviol)
These datasets all have different lengths, but that's not surprising, given they come from different years. One productive initial step would be to convert the date strings into something numerical.
HPD complaint problems database¶
hpdcompprob = pd.read_csv('Complaint_Problems.csv')
hpdcompprob.head()
Awesome! This dataset provides some more details about the complaints, and lets us join by ComplaintID.
Summary and next steps¶
In the immediate future, I'm going to be writing script to join and clean this dataset. This can either be done either in python, or by writing some SQL. I haven't decided yet. Additionally, I'm going to be writing some code to do things like convert date strings, and perhaps scrape text.