The Average Thing #1

The Canada Science and Technology Museums Corporation hosts a repository of all the artifacts in its collection here: http://techno-science.ca/en/data.php. Let's do something interesting with it.

Reading and cleaning data

The data come in either a .csv or an .xml format. Let's first try reading the .csv file, because Pandas has native support for that. First, we import all the packages we might need (e.g., re for regular expressions), and raise Pandas' display limit so it doesn't truncate the columns like it usually does.

In [1]:
import pandas as pd
import re
import csv
import itertools

#Check the current display limit, then raise it so all 36 columns show
pd.get_option("display.max_columns")
pd.set_option("display.max_columns", 40)

Now, we import the data to a dataframe named "data", using Pandas' built-in .read_csv method, specifying the delimiter "|", and telling it not to break when it comes across a bad line, but to drop it and continue (we've also suppressed the per-line warnings).

In [2]:
data = pd.read_csv("cstmc-CSV-en.csv", delimiter = "|", error_bad_lines=False, warn_bad_lines=False)
C:\Users\dfthd\Anaconda3\envs\python2\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (9,10) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

If you turn the warnings on, it looks like some of the rows of the csv file have extra columns (i.e., they have 37 columns when we only have 36 column names). Because there are multiple missing values, it's not possible to determine what the extra columns represent. Just by eyeballing things, it looks like about 500 of the 100,000 cases are problematic.

Our two options are to either ignore these cases, or to try reading the xml file, which may or may not pay off. My recommendation is that we move forward and get some initial analysis done, backtracking later if we need to.
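If you want to quantify that eyeball estimate, here's a minimal sketch that counts the raw rows whose field count doesn't match the header, re-reading the file with the csv module we imported above:

with open("cstmc-CSV-en.csv") as f:
    reader = csv.reader(f, delimiter="|")
    header = next(reader)
    #Count rows whose number of fields differs from the header's 36
    badrows = sum(1 for row in reader if len(row) != len(header))
print badrows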

Here are the first 3 artifacts in the data:

In [3]:
data.head(n = 3)
Out[3]:
artifactNumber ObjectName GeneralDescription model SerialNumber Manufacturer ManuCountry ManuProvince ManuCity BeginDate EndDate date_qualifier patent NumberOfComponents ArtifactFinish ContextCanada ContextFunction ContextTechnical group1 category1 subcategory1 group2 category2 subcategory2 group3 category3 subcategory3 material Length Width Height Thickness Weight Diameter image thumbnail
0 1966.0001.001 Cover PAPER WESTERN CANADA AIRWAYS LTD. NaN Unknown Unknown NaN NaN 1927 NaN NaN NaN 1.0 NaN AT THE TIME IT WAS THE WORLD'S MOST NORTHERNLY... NaN NaN Aviation Commemorative Stamps & coins NaN NaN NaN NaN NaN NaN paper 4.5 cm 2.6 cm NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte...
1 1966.0002.001 Stamp, postage PAPER WESTERN CANADA AIRWAYS LTD. NaN Unknown Unknown NaN NaN 1927 NaN NaN NaN 1.0 PINK & BLACK ON WHITE NaN NaN NaN Aviation Commemorative Stamps & coins NaN NaN NaN NaN NaN NaN paper 3.8 cm 2.7 cm NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte...
2 1966.0003.001 Stamp, postage PAPER NaN NaN Unknown Unknown NaN NaN 1932 NaN NaN NaN 1.0 DARK & PALE BLUE ON WHITE NaN NaN NaN Aviation Commemorative Stamps & coins NaN NaN NaN NaN NaN NaN paper 12.8 cm 8.4 cm NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte...

And here are some cases which have too many columns:

In [4]:
debug = pd.read_csv("cstmc-CSV-en.csv", header=None, delimiter = "|", error_bad_lines=False, skiprows = 37899, nrows=3)
debug
Out[4]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
0 1984.1214.001 Engine, airplane NaN LTC1K-4A/LYCOMING T53 LE09281X Avco Lycoming Div. United States of America Connecticut Stratford 1969 NaN circa NaN 1 NaN NaN Converts heat into mechanical power NaN Aviation Motive power NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte...
1 1984.1215.001 Engine, airplane NaN LTC1K-4A/LYCOMING T53 LE09285X Avco Lycoming Div. United States of America Connecticut Stratford 1969 NaN circa NaN 1 NaN NaN Converts heat into mechanical power NaN Aviation Motive power NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte...
2 1984.1216.001 Propeller Glass - Fibre HAMILTON STANDARD 337-339-330-325 Hamilton Standard Propeller Div., United Aircr... United States of America Connecticut East hartford 1969 NaN circa NaN 1 NaN NaN Aircraft propeller NaN Aviation Aircraft parts NaN NaN NaN NaN NaN NaN NaN glass;fibre NaN NaN NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte...

Recoding dimensions into actual units

The first thing we're going to do is convert all the dimensions, which come in mixed units (e.g., 16 cm vs. 10 m), into one common unit. To do this, we're going to use regular expressions and loop through the dimension columns.

The other thing we might want to do is decide which artifacts to actually use. It looks like some of the artifacts are small things like documents and paper, while others are really big things like engines.
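To get a rough feel for what's in the collection before deciding, one option is a quick tally of the top-level groupings (a sketch using the group1 and category1 columns from the header above):

print data['group1'].value_counts().head()
print data['category1'].value_counts().head()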

Dealing with missing data

As you can see from the example cases above, lots of data is missing or unavailable, which is probably to be expected given that not all columns are necessarily relevant to each artifact. However, this forces us to make a decision when trying to summarize data: How do we decide whether to include or exclude artifacts?

For example, the code below progressively subsets the data, keeping only artifacts for which none of the physical measurements (length, width, height, weight) are missing:

In [5]:
datasub1 = data.loc[data['Length'].isnull() == False]
datasub2 = datasub1.loc[datasub1['Width'].isnull() == False]
datasub3 = datasub2.loc[datasub2['Height'].isnull() == False]
datasub4 = datasub3.loc[datasub3['Weight'].isnull() == False]
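As an aside, Pandas' dropna can collapse that chain into a single call; this one-liner should be equivalent, keeping only rows where all four measurements are present:

datasub4 = data.dropna(subset=['Length', 'Width', 'Height', 'Weight'])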

Running the code below will list the problem cases for our regex: some have typos in them, others have European-style decimal commas (thanks, Canada!).

In [6]:
subdata = datasub3
subdata = subdata.reset_index(drop = True)

badlines = []
for line in subdata['Length']:
    try:
        re.search(r'(\d*\.\d*) .*', line).group(1)
    except AttributeError:
        #re.search returned None, so this length doesn't match the pattern
        badlines.append(line)

print badlines
['610 cm', '12 cm', '210 cm', '254,9 cm', '12,7 cm', '12,7 cm', '71.0\t cm', '81.0\t cm', '61,5 cm', '61,5 cm', '61,5 cm', '61,5 cm', '42.3\t cm', '59,5 cm', '58,0 cm', '176.O cm', '157,5 cm', '18,5 cm', '2,4 cm', '19,5 cm', '32,0 cm', ' cm', '21,0 cm', '30 cm', '58 cm', '5200 cm', '172,0 cm', '26.,0 cm', '26.,0 cm', '43,7 cm', '200,0 cm', '41,5 cm', '180,0 cm', '150 cm', '41,5 cm', '200 cm', '24 cm', '190 cm', '229 cm', '236 cm']

Let's first sub out the commas for decimal places and remove the '\t' characters.

In [7]:
subdata['Length'] = [re.sub(',', '.', x) for x in subdata['Length']]
subdata['Length'] = [re.sub('\t', '', x) for x in subdata['Length']]

AS IT HAPPENS, you can achieve the above using Pandas' built-in string methods, because I am a moron. Below, we use these methods on the "Length" dimension to replace the commas, extract the digits, extract the units, and then replace the units with the numbers to multiply the digits by (i.e., cm = 1, m = 100).

In [8]:
#The unescaped '.' in the pattern deliberately matches a period or a decimal comma
artifactlengths = data['Length'].str.extract(r'(\d*.\d*)', expand=False)
artifactlengths = artifactlengths.str.replace(',', '.')
#Extract the alphabetic unit that follows the last space
artifactlengthmultipliers = data['Length'].str.extract(r'.* ([a-zA-Z]*)', expand=False)
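To see what those two extractions are doing, here's a toy example on made-up values:

sample = pd.Series(['254,9 cm', '71.0 cm', '1.2 M'])
print sample.str.extract(r'(\d*.\d*)', expand=False).str.replace(',', '.')  #254.9, 71.0, 1.2
print sample.str.extract(r'.* ([a-zA-Z]*)', expand=False)  #cm, cm, M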

Doing this, we've created a vector (i.e., a Pandas series) of number strings (which we'll convert to floats below), and a separate vector of strings that hold the units. For the next step, we want to replace cm and CM with 1, and m and M with 100, and look for the person who used inches and shoot them.

In [9]:
artifactlengthmultipliers.unique()
Out[9]:
array(['cm', nan, 'M', 'in', 'CM', 'm'], dtype=object)

We create a dictionary to use with the .replace method:

In [10]:
unitreplacement = {
    #length units, normalized to centimetres
    "mm" : .1,
    "MM" : .1,
    "cm" : 1,
    "CM" : 1,
    "m" : 100,
    "M" : 100,
    "in" : 2.54,
    #weight units, normalized to kilograms (included here so we can reuse this below)
    "kg" : 1,
    "KG" : 1,
    "gm" : .001,
    "tons" : 1000,
    "lbs" : .453
}
artifactlengthmultipliersNUM = artifactlengthmultipliers.replace(unitreplacement)

And then we multiply our numbers by our multipliers, and now we can use the data to find out the average length of all artifacts:

In [11]:
#We have to coerce because our series contains missing values that cannot be typed
artifactlengthincm = pd.to_numeric(artifactlengths, errors = 'coerce') * artifactlengthmultipliersNUM
In [12]:
print artifactlengthincm.mean()
print artifactlengthincm.median()
44.7114966571
22.8

The average length is 44.71 cm, and the median is 22.8 cm.

Functions to extract digits and units

Now that we've done this once with length, we can write some functions to do the above easily for the rest of the dimensions. We'll also write a function that takes a list of dimensions and "numberizes" all of them.

In [13]:
#These are the above as functions to run on our other dimensions
def extract_digits(series):
    #The unescaped '.' deliberately matches a period or a decimal comma
    lengths = series.str.extract(r'(\d*.\d*)', expand = False)
    lengths = lengths.str.replace(',' , '.')
    return pd.to_numeric(lengths, errors = 'coerce')

def extract_multipliers(series):
    multipliers = series.str.extract(r'.* ([a-zA-Z]*)', expand = False)
    unitreplacement = {
    "cm" : 1, 
    "CM" : 1,
    "m" : 100,
    "M" : 100,
    "in" : 2.54,
    "kg" : 1,
    "KG" : 1,
    "gm" : .001,
    "tons" : 1000,
    "lbs" : .453   #note: the Weight column appears to use 'lb' (see below), which this misses
    }
    multipliers = multipliers.replace(unitreplacement)
    return pd.to_numeric(multipliers, errors = 'coerce')

def numberize_dimensions(listofdimensions, data):
    for dimension in listofdimensions:
        data[dimension + "INT"] = extract_digits(data[dimension])
        data[dimension + "MULT"] = extract_multipliers(data[dimension])
        #Multiply (not add) the digits by the unit multiplier
        data[dimension + "NUM"] = data[dimension + "INT"] * data[dimension + "MULT"]
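We don't actually call numberize_dimensions below (the remaining cells do this column by column, with a MULTI rather than MULT suffix), but usage would look like this:

numberize_dimensions(['Height', 'Width', 'Weight'], data)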
In [14]:
#This was some hacky debugging when I manually did a binary search-type thing to find out which row numbers were messed up
#pd.to_numeric(artifactlengths[96050:96100])
#artifactlengths[96050:96100]

Rest of the dimensions

From the above, the average length of all the artifacts in the collection is 44.7 cm. Let's find out what the average is for all the other dimensions:

In [15]:
data['HeightINT'] = extract_digits(data['Height'])
data['HeightMULTI'] = extract_multipliers(data['Height'])
data['HeightNUM'] = data['HeightINT'] * data['HeightMULTI']
print data['HeightNUM'].mean()
print data['HeightNUM'].median()
25.8840089939
9.4

The average height is 25.88 cm and the median is 9.4 cm.

In [16]:
data['WidthINT'] = extract_digits(data['Width'])
data['WidthMULTI'] = extract_multipliers(data['Width'])
data['WidthNUM'] = data['WidthINT'] * data['WidthMULTI']
print data['WidthNUM'].mean()
print data['WidthNUM'].median()
24.1553862744
12.0

The average width is 24.16 cm and the median is 12 cm.

Now, we work on weight. Given that the units for weights range from gm to tons, let's go with kg as our unit, and assume that tons are metric tons. There's also a 'cm' in there; something is wrong.

In [17]:
data['Weight'].str.extract('.* (.*)', expand = False).unique()
Out[17]:
array([nan, 'kg', 'cm', 'tons', 'gm', 'lb'], dtype=object)
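Note that the list shows 'lb', while our multiplier dictionary only maps 'lbs'; if the data really say 'lb', those weights will coerce to NaN and drop out of the averages. A quick check (a sketch; na=False treats missing weights as non-matches):

print data['Weight'].str.contains(r' lb$', na=False).sum()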

Below, we extract the indices of cases with "cm" in the weight. There are 13 cases, and some of them look like a misalignment of the columns, but it's impossible to tell which way they are misaligned.

In [18]:
cmtest = data['Weight'].str.extract('(.* cm)', expand=False)
#Here are three
data[cmtest.isnull() == False][0:3]
Out[18]:
artifactNumber ObjectName GeneralDescription model SerialNumber Manufacturer ManuCountry ManuProvince ManuCity BeginDate EndDate date_qualifier patent NumberOfComponents ArtifactFinish ContextCanada ContextFunction ContextTechnical group1 category1 ... category2 subcategory2 group3 category3 subcategory3 material Length Width Height Thickness Weight Diameter image thumbnail HeightINT HeightMULTI HeightNUM WidthINT WidthMULTI WidthNUM
1934 1966.0971.001 Photograph photograph paper; wood frame; glass; paper & c... NaN NaN JARMAN, FRANK LTD. Canada Ontario Ottawa 1941 NaN NaN NaN 1.0 b + w photograph; black frame presented by Frank Learman to Ralph Bell, Dire... display NaN Aviation Archives ... Archives Personal NaN NaN NaN paper->cardboard;wood->;glass->;metal->steel 49.8 cm 38.8 cm NaN NaN 1.9 cm NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... NaN NaN NaN 38.8 1.0 38.8
1941 1966.0978.001 Photograph photograph paper; glass; plastic frame; steel ... NaN NaN Royal Canadian Air Force Canada Ontario Ottawa 1927 NaN circa NaN 1.0 b + w photograph; caption in white ink; mounte... RCAF Station Ottawa (later Rockcliffe) was the... taken as part of aerial photography survey for... NaN Aviation Archives ... NaN NaN NaN NaN NaN paper->;glass->;synthetic->plastic;metal->steel 45.3 cm 40.0 cm NaN NaN 2.3 cm NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... NaN NaN NaN 40.0 1.0 40.0
52400 1989.0488.001 Calculator, air navigation plastic; metal swivel at centre - possibly alu... B-24D NaN Consolidated Aircraft Corp. United States of America California San diego 1941 NaN after NaN 2.0 white with black detail; pale yellow rotating ... NaN used to determine gross weight & fore & aft ce... especially useful on bombers because of varyin... Aviation Navigation instruments & equipment ... NaN NaN NaN NaN NaN synthetic->plastic;metal->aluminum - possible 28.0 cm 25.2 cm NaN NaN 1.0 cm NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... NaN NaN NaN 25.2 1.0 25.2

3 rows × 42 columns

Let's set those cases to NaN, and just ignore them for now:

In [19]:
data.loc[cmtest.isnull() == False,['Weight']] = None
In [20]:
data['WeightINT'] = extract_digits(data['Weight'])
data['WeightMULT'] = extract_multipliers(data['Weight'])
data['WeightNUM'] = data['WeightINT'] * data['WeightMULT']
print data['WeightNUM'].mean()
print data['WeightNUM'].median()
1058.87995356
16.78

The average weight is 1 ton, which may be a problem if we want to fabricate this. However, the median is much lower, which suggests that the distribution is skewed. This is also true for the other dimensions, so let's take a look at that next.
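As a quick sanity check on that claim, Pandas can compute sample skewness directly (a minimal sketch; positive values indicate a long right tail):

for col in ['HeightNUM', 'WidthNUM', 'WeightNUM']:
    print col, data[col].skew()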