The Average Thing¶

The Canada Science and Technology Museums Corporation hosts a repository of all the artifacts in its collection here: (http://techno-science.ca/en/data.php). Let's do something interesting with it.

Reading and cleaning data¶

The data come in either a .csv or .xml format. Let's first try reading the .csv file, because Pandas has native support for that. First, we import all the packages that we might need (e.g., regex), and tell Pandas not to truncate the columns like it usually does.

import pandas as pd
import re
import csv
import itertools
pd.get_option("display.max_columns")
pd.set_option("display.max_columns", 40)

Now, we import the data to a datafram named "data", using Pandas' built-in method .read_csv, specifying the delimiter "|", and telling it not to break when it comes across a bad line, but instead show a bunch of warnings and still continue.

data = pd.read_csv("cstmc-CSV-en.csv", delimiter = "|", error_bad_lines=False, warn_bad_lines=False)

C:\Users\dfthd\Anaconda3\envs\python2\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (9,10) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

If you turn the warnings on, it looks like some of the rows of the csv file have extra columns (i.e., they have 37 columns when we only have 36 column names. Because there are multiple missing values, it's not possible to determine what the extra columns represent. Just by eyeballing things, it looks like about ~500 of the 100,000 cases are problematic.

Our two options are to either ignore these cases, or try and read the xml file, which may or may not pay off. My recommendation is that we just move forward and try and get some initial analysis done before backtracking later if we want to.

Here are the first 3 artifacts in the data:

data.head(n = 3)

And here are some cases which have too many columns:

debug = pd.read_csv("cstmc-CSV-en.csv", header=None, delimiter = "|", error_bad_lines=False, skiprows = 37899, nrows=3)
debug

Recoding dimensions into actual units¶

The first thing we're going to do is to do convert all the dimensions, which have mixed units (e.g., 16cm and 10m or whatever) into one common unit. To do this, we're going to use regular expressions and loop through the dimension columns.

The other thing that we might want to do is decide what artifacts to actually use. It looks like some of the artifacts are things like documents and paper, and others are really big things like engines.

Dealing with missing data¶

As you can see from the example cases above, lots of data is missing or unavailable, which is probably to be expected given that not all columns are necessarily relevant to each artifact. However, this forces us to make a decision when trying to summarize data: How do we decide whether to include or exclude artifacts?

For example, the code below subsets data depending on whether or not any of the measurements of their physical dimensions (weight, height, etc.) are missing:

datasub1 = data.loc[data['Length'].isnull() == False]
datasub2 = datasub1.loc[datasub1['Width'].isnull() == False]
datasub3 = datasub2.loc[datasub2['Height'].isnull() == False]
datasub4 = datasub3.loc[datasub3['Weight'].isnull() == False]

Running the code below will list the problem cases with our regex - some have typos in them, others have European style decimal places (thanks Canada!)

subdata = datasub3
subdata = subdata.reset_index(drop = True)

x = 0
badlines = []
for line in subdata['Length']:
    try:
        re.search(r'(\d*\.\d*) .*', line).group(1)
    except:
        badlines.append(subdata.loc[x, ['Length']][0])
    x +=1

print badlines

['610 cm', '12 cm', '210 cm', '254,9 cm', '12,7 cm', '12,7 cm', '71.0\t cm', '81.0\t cm', '61,5 cm', '61,5 cm', '61,5 cm', '61,5 cm', '42.3\t cm', '59,5 cm', '58,0 cm', '176.O cm', '157,5 cm', '18,5 cm', '2,4 cm', '19,5 cm', '32,0 cm', ' cm', '21,0 cm', '30 cm', '58 cm', '5200 cm', '172,0 cm', '26.,0 cm', '26.,0 cm', '43,7 cm', '200,0 cm', '41,5 cm', '180,0 cm', '150 cm', '41,5 cm', '200 cm', '24 cm', '190 cm', '229 cm', '236 cm']

Let's first sub out the commas for decimal places and remove the '\t' characters.

subdata['Length'] = [re.sub(',', '.', x) for x in subdata['Length']]
subdata['Length'] = [re.sub('\t', '', x) for x in subdata['Length']]

AS IT HAPPENS, you can achieve the above using Pandas' built-in string methods, because I am a moron. Below, we use these methods on the "Length" dimension to replace the commas, extract the digits, extract the units, and then replace the units with the numbers to multiply the digits by (i.e., cm = 1, m = 100).

artifactlengths = data['Length'].str.extract(r'(\d*.\d*)', expand=False)
artifactlengths = artifactlengths.str.replace(',', '.')
artifactlengthmultipliers = data['Length'].str.extract(r'.* ([a-zA-Z]*)', expand=False)

Doing this, we've created a vector (i.e., a Pandas series) of floats, and a separate vector of strings that have the units. For the next step, we want to first replace cm and CM with 1, and m and M with 100, and look for the person who used inches and shoot them.

artifactlengthmultipliers.unique()

array(['cm', nan, 'M', 'in', 'CM', 'm'], dtype=object)

We create a dictionary to use with the .replace method:

unitreplacement = {
    "mm" : .1,
    "MM" : .1,
    "cm" : 1, 
    "CM" : 1,
    "m" : 100,
    "M" : 100,
    "in" : 2.54,
    "kg" : 1,
    "KG" : 1,
    "gm" : .001,
    "tons" : 1000,
    "lbs" : .453
}
artifactlengthmultipliersNUM = artifactlengthmultipliers.replace(unitreplacement)

And then we multiple our numbers by our multipliers, and now we can use the data to find out the average length of all artifacts:

#We have to coerce because our series contains missing values that cannot be typed
artifactlengthincm = pd.to_numeric(artifactlengths, errors = 'coerce') * artifactlengthmultipliersNUM

print artifactlengthincm.mean()
print artifactlengthincm.median()

44.7114966571
22.8

The average length is 44.71 cm, and the median is 22.8 cm.

Functions to extract digits and units¶

Now that we've done this once with length, we can write some functions to do the above easily with the rest of the dimensions. We also have a function that will take a list of dimensions, and "numberize" all of them.

#These are the above as functions to run on our other dimensions
def extract_digits(series):
    lengths = series.str.extract(r'(\d*.\d*)', expand = False)
    lengths = lengths.str.replace(',' , '.')
    return pd.to_numeric(lengths, errors = 'coerce')

def extract_multipliers(series):
    multipliers = series.str.extract(r'.* ([a-zA-Z]*)', expand = False)
    unitreplacement = {
    "cm" : 1, 
    "CM" : 1,
    "m" : 100,
    "M" : 100,
    "in" : 2.54,
    "kg" : 1,
    "KG" : 1,
    "gm" : .001,
    "tons" : 1000,
    "lbs" : .453
    }
    multipliers = multipliers.replace(unitreplacement)
    return pd.to_numeric(multipliers, errors = 'coerce')

def numberize_dimensions(listofdimensions, data):
    for dimension in listofdimensions:
        data[dimension + "INT"] = extract_digits(data[dimension])
        data[dimension + "MULT"] = extract_multipliers(data[dimension])
        data[dimension + "NUM"] = data[dimension + "INT"] + data[dimension + "MULT"]

#This was some hacky debugging when I manually did a binary search-type thing to find out which row numbers were messed up
#pd.to_numeric(artifactlengths[96050:96100])
#artifactlengths[96050:96100]

Rest of the dimensions¶

From the above, the average length of all the artifacts in the collection is 44.7cm. Let's find out what the average is for all the other dimensions:

data['HeightINT'] = extract_digits(data['Height'])
data['HeightMULTI'] = extract_multipliers(data['Height'])
data['HeightNUM'] = data['HeightINT'] * data['HeightMULTI']
print data['HeightNUM'].mean()
print data['HeightNUM'].median()

25.8840089939
9.4

The average height is 25.88 cm and the median is 9.4 cm.

data['WidthINT'] = extract_digits(data['Width'])
data['WidthMULTI'] = extract_multipliers(data['Width'])
data['WidthNUM'] = data['WidthINT'] * data['WidthMULTI']
print data['WidthNUM'].mean()
print data['WidthNUM'].median()

24.1553862744
12.0

The average width is 24.16 cm and the median is 12 cm.

Now, we work on weight. Given that the units for weights range from gm to tons, let's go with kg as the unit we'll use, and assume that tons refer to metric tons. There's also a 'cm' there. Something is wrong.

data['Weight'].str.extract('.* (.*)', expand = False).unique()

array([nan, 'kg', 'cm', 'tons', 'gm', 'lb'], dtype=object)

Below, we extract the indices of cases with "cm" in the weight. There are 13 cases, and some of them look like a misalignment of the columns, but it's impossible to tell which way they are misaligned.

cmtest = data['Weight'].str.extract('(.* cm)', expand=False)
#Here are three
data[cmtest.isnull() == False][0:3]

Let's set those cases to NaN, and just ignore them for now:

data.loc[cmtest.isnull() == False,['Weight']] = None

data['WeightINT'] = extract_digits(data['Weight'])
data['WeightMULT'] = extract_multipliers(data['Weight'])
data['WeightNUM'] = data['WeightINT'] * data['WeightMULT']
print data['WeightNUM'].mean()
print data['WeightNUM'].median()

1058.87995356
16.78

The average weight is 1 ton, which may be a problem if we want to fabricate this. However, the median is much lower, which suggests that the distribution is skewed, and this is also true for the other dimensions, let's take a look at that next.

	artifactNumber	ObjectName	GeneralDescription	model	SerialNumber	Manufacturer	ManuCountry	ManuProvince	ManuCity	BeginDate	EndDate	date_qualifier	patent	NumberOfComponents	ArtifactFinish	ContextCanada	ContextFunction	ContextTechnical	group1	category1	subcategory1	group2	category2	subcategory2	group3	category3	subcategory3	material	Length	Width	Height	Thickness	Weight	Diameter	image	thumbnail
0	1966.0001.001	Cover	PAPER	WESTERN CANADA AIRWAYS LTD.	NaN	Unknown	Unknown	NaN	NaN	1927	NaN	NaN	NaN	1.0	NaN	AT THE TIME IT WAS THE WORLD'S MOST NORTHERNLY...	NaN	NaN	Aviation	Commemorative	Stamps & coins	NaN	NaN	NaN	NaN	NaN	NaN	paper	4.5 cm	2.6 cm	NaN	NaN	NaN	NaN	http://source.techno-science.ca/artifacts-arte...	http://source.techno-science.ca/artifacts-arte...
1	1966.0002.001	Stamp, postage	PAPER	WESTERN CANADA AIRWAYS LTD.	NaN	Unknown	Unknown	NaN	NaN	1927	NaN	NaN	NaN	1.0	PINK & BLACK ON WHITE	NaN	NaN	NaN	Aviation	Commemorative	Stamps & coins	NaN	NaN	NaN	NaN	NaN	NaN	paper	3.8 cm	2.7 cm	NaN	NaN	NaN	NaN	http://source.techno-science.ca/artifacts-arte...	http://source.techno-science.ca/artifacts-arte...
2	1966.0003.001	Stamp, postage	PAPER	NaN	NaN	Unknown	Unknown	NaN	NaN	1932	NaN	NaN	NaN	1.0	DARK & PALE BLUE ON WHITE	NaN	NaN	NaN	Aviation	Commemorative	Stamps & coins	NaN	NaN	NaN	NaN	NaN	NaN	paper	12.8 cm	8.4 cm	NaN	NaN	NaN	NaN	http://source.techno-science.ca/artifacts-arte...	http://source.techno-science.ca/artifacts-arte...

	artifactNumber	ObjectName	GeneralDescription	model	SerialNumber	Manufacturer	ManuCountry	ManuProvince	ManuCity	BeginDate	EndDate	date_qualifier	patent	NumberOfComponents	ArtifactFinish	ContextCanada	ContextFunction	ContextTechnical	group1	category1	...	category2	subcategory2	group3	category3	subcategory3	material	Length	Width	Height	Thickness	Weight	Diameter	image	thumbnail	HeightINT	HeightMULTI	HeightNUM	WidthINT	WidthMULTI	WidthNUM
1934	1966.0971.001	Photograph	photograph paper; wood frame; glass; paper & c...	NaN	NaN	JARMAN, FRANK LTD.	Canada	Ontario	Ottawa	1941	NaN	NaN	NaN	1.0	b + w photograph; black frame	presented by Frank Learman to Ralph Bell, Dire...	display	NaN	Aviation	Archives	...	Archives	Personal	NaN	NaN	NaN	paper->cardboard;wood->;glass->;metal->steel	49.8 cm	38.8 cm	NaN	NaN	1.9 cm	NaN	http://source.techno-science.ca/artifacts-arte...	http://source.techno-science.ca/artifacts-arte...	NaN	NaN	NaN	38.8	1.0	38.8
1941	1966.0978.001	Photograph	photograph paper; glass; plastic frame; steel ...	NaN	NaN	Royal Canadian Air Force	Canada	Ontario	Ottawa	1927	NaN	circa	NaN	1.0	b + w photograph; caption in white ink; mounte...	RCAF Station Ottawa (later Rockcliffe) was the...	taken as part of aerial photography survey for...	NaN	Aviation	Archives	...	NaN	NaN	NaN	NaN	NaN	paper->;glass->;synthetic->plastic;metal->steel	45.3 cm	40.0 cm	NaN	NaN	2.3 cm	NaN	http://source.techno-science.ca/artifacts-arte...	http://source.techno-science.ca/artifacts-arte...	NaN	NaN	NaN	40.0	1.0	40.0
52400	1989.0488.001	Calculator, air navigation	plastic; metal swivel at centre - possibly alu...	B-24D	NaN	Consolidated Aircraft Corp.	United States of America	California	San diego	1941	NaN	after	NaN	2.0	white with black detail; pale yellow rotating ...	NaN	used to determine gross weight & fore & aft ce...	especially useful on bombers because of varyin...	Aviation	Navigation instruments & equipment	...	NaN	NaN	NaN	NaN	NaN	synthetic->plastic;metal->aluminum - possible	28.0 cm	25.2 cm	NaN	NaN	1.0 cm	NaN	http://source.techno-science.ca/artifacts-arte...	http://source.techno-science.ca/artifacts-arte...	NaN	NaN	NaN	25.2	1.0	25.2

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	33	34	35	36
0	1984.1214.001	Engine, airplane	NaN	LTC1K-4A/LYCOMING T53	LE09281X	Avco Lycoming Div.	United States of America	Connecticut	Stratford	1969	NaN	circa	NaN	1	NaN	NaN	Converts heat into mechanical power	NaN	Aviation	Motive power	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	http://source.techno-science.ca/artifacts-arte...
1	1984.1215.001	Engine, airplane	NaN	LTC1K-4A/LYCOMING T53	LE09285X	Avco Lycoming Div.	United States of America	Connecticut	Stratford	1969	NaN	circa	NaN	1	NaN	NaN	Converts heat into mechanical power	NaN	Aviation	Motive power	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	http://source.techno-science.ca/artifacts-arte...
2	1984.1216.001	Propeller	Glass - Fibre	HAMILTON STANDARD	337-339-330-325	Hamilton Standard Propeller Div., United Aircr...	United States of America	Connecticut	East hartford	1969	NaN	circa	NaN	1	NaN	NaN	Aircraft propeller	NaN	Aviation	Aircraft parts	NaN	NaN	NaN	NaN	NaN	NaN	NaN	glass;fibre	NaN	NaN	NaN	NaN	NaN	NaN	http://source.techno-science.ca/artifacts-arte...	http://source.techno-science.ca/artifacts-arte...	http://source.techno-science.ca/artifacts-arte...