The Average Thing #1
The Average Thing¶
The Canada Science and Technology Museums Corporation hosts a repository of all the artifacts in its collection here: (http://techno-science.ca/en/data.php). Let's do something interesting with it.
Reading and cleaning data¶
The data come in either a .csv or .xml format. Let's first try reading the .csv file, because Pandas has native support for that. First, we import all the packages that we might need (e.g., regex), and tell Pandas not to truncate the columns like it usually does.
import pandas as pd
import re
import csv
import itertools
pd.get_option("display.max_columns")
pd.set_option("display.max_columns", 40)
Now, we import the data to a datafram named "data", using Pandas' built-in method .read_csv, specifying the delimiter "|", and telling it not to break when it comes across a bad line, but instead show a bunch of warnings and still continue.
data = pd.read_csv("cstmc-CSV-en.csv", delimiter = "|", error_bad_lines=False, warn_bad_lines=False)
If you turn the warnings on, it looks like some of the rows of the csv file have extra columns (i.e., they have 37 columns when we only have 36 column names. Because there are multiple missing values, it's not possible to determine what the extra columns represent. Just by eyeballing things, it looks like about ~500 of the 100,000 cases are problematic.
Our two options are to either ignore these cases, or try and read the xml file, which may or may not pay off. My recommendation is that we just move forward and try and get some initial analysis done before backtracking later if we want to.
Here are the first 3 artifacts in the data:
data.head(n = 3)
And here are some cases which have too many columns:
debug = pd.read_csv("cstmc-CSV-en.csv", header=None, delimiter = "|", error_bad_lines=False, skiprows = 37899, nrows=3)
debug
Recoding dimensions into actual units¶
The first thing we're going to do is to do convert all the dimensions, which have mixed units (e.g., 16cm and 10m or whatever) into one common unit. To do this, we're going to use regular expressions and loop through the dimension columns.
The other thing that we might want to do is decide what artifacts to actually use. It looks like some of the artifacts are things like documents and paper, and others are really big things like engines.
Dealing with missing data¶
As you can see from the example cases above, lots of data is missing or unavailable, which is probably to be expected given that not all columns are necessarily relevant to each artifact. However, this forces us to make a decision when trying to summarize data: How do we decide whether to include or exclude artifacts?
For example, the code below subsets data depending on whether or not any of the measurements of their physical dimensions (weight, height, etc.) are missing:
datasub1 = data.loc[data['Length'].isnull() == False]
datasub2 = datasub1.loc[datasub1['Width'].isnull() == False]
datasub3 = datasub2.loc[datasub2['Height'].isnull() == False]
datasub4 = datasub3.loc[datasub3['Weight'].isnull() == False]
Running the code below will list the problem cases with our regex - some have typos in them, others have European style decimal places (thanks Canada!)
subdata = datasub3
subdata = subdata.reset_index(drop = True)
x = 0
badlines = []
for line in subdata['Length']:
try:
re.search(r'(\d*\.\d*) .*', line).group(1)
except:
badlines.append(subdata.loc[x, ['Length']][0])
x +=1
print badlines
Let's first sub out the commas for decimal places and remove the '\t' characters.
subdata['Length'] = [re.sub(',', '.', x) for x in subdata['Length']]
subdata['Length'] = [re.sub('\t', '', x) for x in subdata['Length']]
AS IT HAPPENS, you can achieve the above using Pandas' built-in string methods, because I am a moron. Below, we use these methods on the "Length" dimension to replace the commas, extract the digits, extract the units, and then replace the units with the numbers to multiply the digits by (i.e., cm = 1, m = 100).
artifactlengths = data['Length'].str.extract(r'(\d*.\d*)', expand=False)
artifactlengths = artifactlengths.str.replace(',', '.')
artifactlengthmultipliers = data['Length'].str.extract(r'.* ([a-zA-Z]*)', expand=False)
Doing this, we've created a vector (i.e., a Pandas series) of floats, and a separate vector of strings that have the units. For the next step, we want to first replace cm and CM with 1, and m and M with 100, and look for the person who used inches and shoot them.
artifactlengthmultipliers.unique()
We create a dictionary to use with the .replace method:
unitreplacement = {
"mm" : .1,
"MM" : .1,
"cm" : 1,
"CM" : 1,
"m" : 100,
"M" : 100,
"in" : 2.54,
"kg" : 1,
"KG" : 1,
"gm" : .001,
"tons" : 1000,
"lbs" : .453
}
artifactlengthmultipliersNUM = artifactlengthmultipliers.replace(unitreplacement)
And then we multiple our numbers by our multipliers, and now we can use the data to find out the average length of all artifacts:
#We have to coerce because our series contains missing values that cannot be typed
artifactlengthincm = pd.to_numeric(artifactlengths, errors = 'coerce') * artifactlengthmultipliersNUM
print artifactlengthincm.mean()
print artifactlengthincm.median()
The average length is 44.71 cm, and the median is 22.8 cm.
Functions to extract digits and units¶
Now that we've done this once with length, we can write some functions to do the above easily with the rest of the dimensions. We also have a function that will take a list of dimensions, and "numberize" all of them.
#These are the above as functions to run on our other dimensions
def extract_digits(series):
lengths = series.str.extract(r'(\d*.\d*)', expand = False)
lengths = lengths.str.replace(',' , '.')
return pd.to_numeric(lengths, errors = 'coerce')
def extract_multipliers(series):
multipliers = series.str.extract(r'.* ([a-zA-Z]*)', expand = False)
unitreplacement = {
"cm" : 1,
"CM" : 1,
"m" : 100,
"M" : 100,
"in" : 2.54,
"kg" : 1,
"KG" : 1,
"gm" : .001,
"tons" : 1000,
"lbs" : .453
}
multipliers = multipliers.replace(unitreplacement)
return pd.to_numeric(multipliers, errors = 'coerce')
def numberize_dimensions(listofdimensions, data):
for dimension in listofdimensions:
data[dimension + "INT"] = extract_digits(data[dimension])
data[dimension + "MULT"] = extract_multipliers(data[dimension])
data[dimension + "NUM"] = data[dimension + "INT"] + data[dimension + "MULT"]
#This was some hacky debugging when I manually did a binary search-type thing to find out which row numbers were messed up
#pd.to_numeric(artifactlengths[96050:96100])
#artifactlengths[96050:96100]
Rest of the dimensions¶
From the above, the average length of all the artifacts in the collection is 44.7cm. Let's find out what the average is for all the other dimensions:
data['HeightINT'] = extract_digits(data['Height'])
data['HeightMULTI'] = extract_multipliers(data['Height'])
data['HeightNUM'] = data['HeightINT'] * data['HeightMULTI']
print data['HeightNUM'].mean()
print data['HeightNUM'].median()
The average height is 25.88 cm and the median is 9.4 cm.
data['WidthINT'] = extract_digits(data['Width'])
data['WidthMULTI'] = extract_multipliers(data['Width'])
data['WidthNUM'] = data['WidthINT'] * data['WidthMULTI']
print data['WidthNUM'].mean()
print data['WidthNUM'].median()
The average width is 24.16 cm and the median is 12 cm.
Now, we work on weight. Given that the units for weights range from gm to tons, let's go with kg as the unit we'll use, and assume that tons refer to metric tons. There's also a 'cm' there. Something is wrong.
data['Weight'].str.extract('.* (.*)', expand = False).unique()
Below, we extract the indices of cases with "cm" in the weight. There are 13 cases, and some of them look like a misalignment of the columns, but it's impossible to tell which way they are misaligned.
cmtest = data['Weight'].str.extract('(.* cm)', expand=False)
#Here are three
data[cmtest.isnull() == False][0:3]
Let's set those cases to NaN, and just ignore them for now:
data.loc[cmtest.isnull() == False,['Weight']] = None
data['WeightINT'] = extract_digits(data['Weight'])
data['WeightMULT'] = extract_multipliers(data['Weight'])
data['WeightNUM'] = data['WeightINT'] * data['WeightMULT']
print data['WeightNUM'].mean()
print data['WeightNUM'].median()
The average weight is 1 ton, which may be a problem if we want to fabricate this. However, the median is much lower, which suggests that the distribution is skewed, and this is also true for the other dimensions, let's take a look at that next.