Hack for Heat #1: Initial thoughts on HPD open data

Hacking for heat

In this series, I'm going to be posting about the process that goes on behind some of the blog posts we end up writing. In this first entry, I'm going to be exploring a number of datsets.

These are the ones that I'm going to be looking at:

  1. HPD (Housing Preservation and Development) housing litigations
  2. Housing maintenance code complaints
  3. Housing maintanence code violations
  4. HPD complaints

(for HPD datasets, some documentation can be found here)

HPD litigation database

First, we're going to look at the smallest dataset, one that contains cases against landlords. From the documentation, this file contains "All cases commenced by HPD or by tennants (naming HPD as a party) in [housing court] since August 2006" either seeking orders for landlords to comply with regulations, or awarding HPD civil penalties (i.e., collecting on fines).

In [2]:
import pandas as pd
In [3]:
litigation = pd.read_csv("Housing_Litigations.csv")
In [5]:
litigation.head()
Out[5]:
LitigationID BuildingID BoroID Boro HouseNumber StreetName Zip Block Lot CaseType CaseOpenDate CaseStatus CaseJudgement
0 98385 806633 2 BRONX 866 EAST 221 STREET 10467.0 4679 73 Tenant Action 04/23/2009 CLOSED NO
1 98387 55134 2 BRONX 3940 CARPENTER AVENUE 10466.0 4825 51 Tenant Action 04/23/2009 CLOSED NO
2 98389 119044 2 BRONX 1356 WALTON AVENUE 10452.0 2841 41 Tenant Action 04/23/2009 CLOSED NO
3 98391 889796 2 BRONX 871 EAST 179 STREET 10460.0 3123 77 Tenant Action 04/24/2009 CLOSED NO
4 98393 56949 2 BRONX 1680 CLAY AVENUE 10457.0 2889 1 Tenant Action 04/24/2009 CLOSED NO

Let's take a look at unique values for some of the columns:

In [6]:
litigation['Boro'].unique()
Out[6]:
array(['BRONX', 'QUEENS', 'MANHATTAN', 'BROOKLYN', 'STATEN ISLAND'], dtype=object)
In [14]:
litigation.groupby(by = ['Boro','CaseJudgement']).count()
Out[14]:
LitigationID BuildingID BoroID HouseNumber StreetName Zip Block Lot CaseType CaseOpenDate CaseStatus
Boro CaseJudgement
BRONX NO 19734 19734 19734 19734 19734 19734 19734 19734 19734 19701 19734
YES 481 481 481 481 481 481 481 481 481 478 481
BROOKLYN NO 19833 19833 19833 19833 19833 19831 19833 19833 19833 19807 19833
YES 762 762 762 762 762 760 762 762 762 762 762
MANHATTAN NO 11849 11849 11849 11849 11849 11849 11849 11849 11849 11841 11849
YES 129 129 129 129 129 129 129 129 129 129 129
QUEENS NO 9068 9068 9068 9068 9068 9061 9068 9068 9068 9063 9068
YES 313 313 313 313 313 313 313 313 313 308 313
STATEN ISLAND NO 1064 1064 1064 1064 1064 1064 1064 1064 1064 1064 1064
YES 84 84 84 84 84 84 84 84 84 84 84

The above table tells us that Manhattan has the lowest proportion of cases that receive judgement (about 1 in 80), whereas Staten Island has the highest (about 1 in 12). It may be something worth looking into, but it's also important to note that many cases settle out of court, and landlords in Manhattan may be more willing (or able) to do so.

In [7]:
litigation['CaseType'].unique()
Out[7]:
array(['Tenant Action', 'Heat and Hot Water', 'CONH',
       'Comp Supplemental Cases', 'Comprehensive',
       'Access Warrant - Non-Lead', 'Access Warrant - lead',
       'Lead False Certification', 'False Certification Non-Lead',
       'Heat Supplemental Cases', '7A', 'Failure to Register Only',
       'HLD - Other Case Type'], dtype=object)
In [16]:
litigation.groupby(by = ['CaseType', 'CaseJudgement']).count()
Out[16]:
LitigationID BuildingID BoroID Boro HouseNumber StreetName Zip Block Lot CaseOpenDate CaseStatus
CaseType CaseJudgement
7A NO 105 105 105 105 105 105 105 105 105 102 105
Access Warrant - Non-Lead NO 5665 5665 5665 5665 5665 5665 5665 5665 5665 5665 5665
YES 8 8 8 8 8 8 8 8 8 8 8
Access Warrant - lead NO 1786 1786 1786 1786 1786 1786 1786 1786 1786 1785 1786
CONH NO 797 797 797 797 797 797 797 797 797 797 797
Comp Supplemental Cases NO 750 750 750 750 750 750 750 750 750 702 750
YES 31 31 31 31 31 31 31 31 31 29 31
Comprehensive NO 2274 2274 2274 2274 2274 2274 2274 2274 2274 2274 2274
YES 121 121 121 121 121 121 121 121 121 121 121
Failure to Register Only NO 60 60 60 60 60 60 60 60 60 60 60
YES 3 3 3 3 3 3 3 3 3 3 3
False Certification Non-Lead NO 2355 2355 2355 2355 2355 2355 2355 2355 2355 2355 2355
YES 20 20 20 20 20 20 20 20 20 20 20
HLD - Other Case Type NO 3 3 3 3 3 3 3 3 3 2 3
Heat Supplemental Cases NO 124 124 124 124 124 124 124 124 124 108 124
YES 7 7 7 7 7 7 6 7 7 6 7
Heat and Hot Water NO 15760 15760 15760 15760 15760 15760 15758 15760 15760 15757 15760
YES 1299 1299 1299 1299 1299 1299 1298 1299 1299 1294 1299
Lead False Certification NO 239 239 239 239 239 239 239 239 239 239 239
YES 2 2 2 2 2 2 2 2 2 2 2
Tenant Action NO 31630 31630 31630 31630 31630 31630 31623 31630 31630 31630 31630
YES 278 278 278 278 278 278 278 278 278 278 278

The table above shows the same case judgement proportions, but conditioned on what type of case it was. Unhelpfully, the documentation does not specify what the difference between Access Warrant - Lead and Non-Lead is. It could be one of two possibilities: The first is whether the warrants have to do with lead-based paint, which is a common problem, but perhaps still too idiosyncratic to have it's own warrant type. The second, perhaps more likely possibility is whether or not HPD was the lead party in the case.

We'll probably end up using these data by aggregating it and examining how complaints change over time, perhaps as a function of what type they are. There's also the possibility of looking up specific buildings' complaints and tying them to landlords. There's probably also an easy way to join this dataset with another, by converting the address information into something standardized, like borough-block-lot (BBL; http://www1.nyc.gov/nyc-resources/service/1232/borough-block-lot-bbl-lookup)

HPD complaints

Next, we're going to look at a dataset of HPD complaints.

In [32]:
hpdcomp = pd.read_csv('Housing_Maintenance_Code_Complaints.csv')
In [34]:
hpdcomp.head()
Out[34]:
ComplaintID BuildingID BoroughID Borough HouseNumber StreetName Zip Block Lot Apartment CommunityBoard ReceivedDate StatusID Status StatusDate
0 6960137 3418 1 MANHATTAN 1989 ADAM C POWELL BOULEVARD 10026.0 1904 4 12D 10 07/07/2014 2 CLOSE 07/29/2014
1 6960832 3512 1 MANHATTAN 2267 ADAM C POWELL BOULEVARD 10030.0 1918 4 3B 10 07/08/2014 2 CLOSE 07/12/2014
2 6946867 5318 1 MANHATTAN 778 11 AVENUE 10019.0 1083 1 4P 4 06/19/2014 2 CLOSE 07/13/2014
3 6966946 5608 1 MANHATTAN 1640 AMSTERDAM AVENUE 10031.0 2073 29 5A 9 07/16/2014 2 CLOSE 07/21/2014
4 6956574 17896 1 MANHATTAN 230 EAST 88 STREET 10128.0 1533 32 1E 8 07/01/2014 2 CLOSE 07/09/2014
In [39]:
len(hpdcomp)
Out[39]:
662672

This dataset is less useful on its own. It doesn't tell us what the type of complaint was, only the date it was received and whether or not the complaint is still open. However, it may be useful in conjunction with the earlier dataset. For example, we might be interested in how many of these complaints end up in court (or at least, have some sort of legal action taken).

HPD violations

The following dataset tracks HPD violations.

In [35]:
hpdviol = pd.read_csv('Housing_Maintenance_Code_Violations.csv')
In [36]:
hpdviol.head()
Out[36]:
ViolationID BuildingID RegistrationID BoroID Boro HouseNumber LowHouseNumber HighHouseNumber StreetName StreetCode ... NewCertifyByDate NewCorrectByDate CertifiedDate OrderNumber NOVID NOVDescription NOVIssuedDate CurrentStatusID CurrentStatus CurrentStatusDate
0 2293208 444 130476 1 MANHATTAN 22 22 22 1 AVENUE 10010 ... NaN NaN NaN 772 338596.0 § 27-2098 ADM CODE FILE WITH THIS DEPARTMENT ... 04/22/1997 19 VIOLATION CLOSED 03/10/2015
1 2293181 444 130476 1 MANHATTAN 22 22 22 1 AVENUE 10010 ... NaN NaN 09/10/1974 775 338584.0 D26-41.05 ADM CODE FILE WITH THIS DEPARTMENT R... 07/16/1974 19 VIOLATION CLOSED 03/10/2015
2 2293249 448 135326 1 MANHATTAN 2222 2222 2222 1 AVENUE 10010 ... NaN NaN NaN 772 338619.0 § 27-2098 ADM CODE FILE WITH THIS DEPARTMENT ... 11/19/1996 19 VIOLATION CLOSED 03/10/2015
3 2293486 467 136913 1 MANHATTAN 2250 2250 2250 1 AVENUE 10010 ... NaN NaN NaN 772 338733.0 D26-41.03 ADM CODE FILE WITH THIS DEPARTMENT A... 05/10/1982 19 VIOLATION CLOSED 03/10/2015
4 2293490 467 136913 1 MANHATTAN 2250 2250 2250 1 AVENUE 10010 ... NaN NaN NaN 772 338737.0 § 27-2098 ADM CODE FILE WITH THIS DEPARTMENT ... 11/19/1996 19 VIOLATION CLOSED 03/10/2015

5 rows × 30 columns

In [40]:
len(hpdviol)
Out[40]:
1437246

These datasets all have different lengths, but that's not surprising, given they come from different years. One productive initial step would be to convert the date strings into something numerical.

HPD complaint problems database

In [45]:
hpdcompprob = pd.read_csv('Complaint_Problems.csv')
In [48]:
hpdcompprob.head()
Out[48]:
ProblemID ComplaintID UnitTypeID UnitType SpaceTypeID SpaceType TypeID Type MajorCategoryID MajorCategory MinorCategoryID MinorCategory CodeID Code StatusID Status StatusDate StatusDescription
0 14548958 6967900 91 APARTMENT 541 BATHROOM 1 EMERGENCY 9 PLUMBING 63 BATHTUB/SHOWER 2538 BROKEN OR MISSING 2 CLOSE 07/29/2014 The Department of Housing Preservation and De...
1 14548959 6967900 91 APARTMENT 541 BATHROOM 3 NON EMERGENCY 9 PLUMBING 63 BATHTUB/SHOWER 2540 FAUCET BROKEN/MISSING/LEAKING 2 CLOSE 08/04/2014 The Department of Housing Preservation and De...
2 14548960 6967900 91 APARTMENT 543 ENTIRE APARTMENT 3 NON EMERGENCY 58 FLOORING/STAIRS 343 FLOOR 2691 TILE BROKEN OR MISSING 2 CLOSE 08/04/2014 The Department of Housing Preservation and De...
3 14548961 6967900 91 APARTMENT 541 BATHROOM 3 NON EMERGENCY 9 PLUMBING 63 BATHTUB/SHOWER 2541 CHIPPED OR RUSTED 2 CLOSE 08/04/2014 The Department of Housing Preservation and De...
4 14615271 6994958 20 APARTMENT 68 ENTIRE APARTMENT 1 EMERGENCY 59 HEAT/HOT WATER 349 ENTIRE BUILDING 2717 NO HOT WATER 2 CLOSE 08/22/2014 The Department of Housing Preservation and De...

Awesome! This dataset provides some more details about the complaints, and lets us join by ComplaintID.

Summary and next steps

In the immediate future, I'm going to be writing script to join and clean this dataset. This can either be done either in python, or by writing some SQL. I haven't decided yet. Additionally, I'm going to be writing some code to do things like convert date strings, and perhaps scrape text.