The Average Thing #5: The Average Place pt.2

In [1]:
%matplotlib inline


This is a quick follow-up to the last post: we're going to check our math and look at an alternative way of finding the average location, using the Mercator Projection.

Earlier, we saw that the origin of the average object was somewhere in the Hudson Bay in Canada. This seemed at odds with my intuition. I also had a feeling that the reason this result seemed weird was because my mental representation of the earth was not spherical, but cylindrical (e.g., the Mercator Projection).

To demonstrate this, let's test the simple case where we find the point between the middle of Canada and the UK:

In [6]:
canada = country_locations['Canada']
uk = country_locations['England']

#convert to radians
canada = to_radians(canada)
uk = to_radians(uk)
#convert to cartesian
canada = to_cartesian(canada)
uk = to_cartesian(uk)
#find average
averagepoint = ((canada[0]+uk[0])/2, (canada[1]+uk[1])/2, (canada[2]+uk[2])/2)

print to_latlong(averagepoint)
(68.71255890231265, -45.65693500322893)
In [7]:
from IPython.display import IFrame

IFrame('''https://www.google.com/maps/embed?pb=!1m18!1m12!1m3!1d32299245.83742049!2d-81.10312744481017!3d60\
.38924413554626!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x0%3A0x0!2zNjjCsDQyJzQ1LjIiTiA0NcKwMzknM\
jUuMCJX!5e0!3m2!1sen!2sus!4v1462973875452''', width=800, height=600)
Out[7]:

To visualize how this really is the midpoint, we can look at that point on a 3D globe (external link):

The above visualization makes it clear how my intuition differed from reality - the shortest distance between the middle of Canada and the UK is not across the Atlantic, but crosses Greenland.
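As a quick check on the math, we can confirm that the averaged point sits (roughly) equidistant from the two starting points along the surface of the sphere. The haversine_km helper below is my own addition, not part of the original code:

import math

def haversine_km(pointA, pointB, radius_km = 6371.0):
    #great-circle distance between two (latitude, longitude) tuples, in kilometres
    lat1, lon1 = math.radians(pointA[0]), math.radians(pointA[1])
    lat2, lon2 = math.radians(pointB[0]), math.radians(pointB[1])
    a = math.sin((lat2 - lat1)/2)**2 + math.cos(lat1)*math.cos(lat2)*math.sin((lon2 - lon1)/2)**2
    return 2 * radius_km * math.asin(math.sqrt(a))

midpoint = to_latlong(averagepoint)                         #(68.71..., -45.66...) from above
print haversine_km(country_locations['Canada'], midpoint)   #these two distances should be
print haversine_km(country_locations['England'], midpoint)  #nearly identical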

But let's, for a moment, be flat-earthers and stick to a flat representation of the earth, like we do whenever we use a map or chart. In this case, we simply average the latitudes and longitudes of all the points (see: http://www.geomidpoint.com/calculation.html).

In [10]:
latlon_no_na = data.loc[data['GeoLocations'].isnull() == False]['GeoLocations']
latlon_no_na = latlon_no_na[latlon_no_na != '']
        
avelatlon = reduce(lambda x,y: (x[0] + y[0], x[1] + y[1]), latlon_no_na)

avelatlon = (avelatlon[0]/len(latlon_no_na), avelatlon[1]/len(latlon_no_na))
avelatlon
Out[10]:
(49.76023781406744, -68.28145709782275)

Averaging this way gives us a point somewhere in Quebec, which is more in line with how we might picture the average point if the world were flat:

In [11]:
IFrame('''https://www.google.com/maps/embed?pb=!1m18!1m12!1m3!1d46671329.280431576!2d-113.3722364450685!\
3d44.44153336080491!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x0%3A0x0!2zNDnCsDQ1JzM2LjkiTiA2OM\
KwMTYnNTMuMiJX!5e0!3m2!1sen!2sus!4v1462977105095''', width = 800, height = 600)
Out[11]:

The Average Thing #4: The Average Place pt.1

In [1]:
%matplotlib inline


Today, we're going to take an initial look at finding where the average artifact comes from.

In [2]:
import pandas as pd
import re
import numpy as np
from matplotlib import pyplot as plt
from geopy.geocoders import Nominatim
pd.get_option("display.max_columns")
pd.set_option("display.max_columns", 40)

data = pd.read_csv("cstmc-CSV-en.csv", delimiter = "|", error_bad_lines=False, warn_bad_lines=False)
C:\Users\dfthd\Anaconda3\envs\python2\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (9,10) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

We're going to use the GeoPy package to see if we can extract the latitude and longitude data from each of the countries. Let's start with a test case:

In [3]:
geolocator = Nominatim() #Using the service Nominatim
examplecase = data['ManuCountry'][7]
print examplecase
location = geolocator.geocode(examplecase) #Use the geocode method, 
print location.latitude
print location.longitude
United States of America
39.7837304
-100.4458824

Now, with that latitude and longitude, let's see where that takes us. If you input the coordinates on Google Maps, it points to somewhere in the middle of the U.S. That's good enough for now:

In [4]:
from IPython.display import IFrame
from IPython.core.display import display

IFrame('''https://www.google.com/maps/embed?pb=!1m14!1m8!1m3!1d11800310.147808608!\
2d-107.43346408765365!3d38.054495438444!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x0%3A0x0!2zMznCsDQ3JzAx\
LjQiTiAxMDDCsDI2JzQ1LjIiVw!5e0!3m2!1sen!2sus!4v1462918136743''', width = 800, height = 600)
Out[4]:

Next, we're going to do the same for all the countries of origin:

In [5]:
#This simple function takes a string and returns a tuple of the latitude and longitude
def find_lat_long(string):
    geolocator = Nominatim() #Using the service Nominatim
    location = geolocator.geocode(string, timeout=None) #Use the geocode method, 
    return(location.latitude, location.longitude)

import math

#create a dictionary of all the countries
country_locations = {country:"" for country in data['ManuCountry'].unique() if (country != 'Unknown')\
                          and (country != 'unknown') and (type(country) is str)}

#country_locations = {k: find_lat_long(k) for k, v in country_locations.iteritems()}
for k, v in country_locations.iteritems():
    try:
        country_locations[k] = find_lat_long(k)
    except:
        pass
In [6]:
print country_locations
{'Canada': (61.0666922, -107.991707), 'Brazil': (-10.3333332, -53.1999999), 'Qatar': (25.3336984, 51.2295295), 'Italy': (42.6384261, 12.674297), 'Peru': (-6.8699696, -75.0458514), 'Antarctica': (-82.1081819, 34.378236), 'Panama': (8.3096067, -81.3066245), 'Scotland': (56.7861112, -4.1140517), 'Cambodia': (13.5066394, 104.869423), 'France': (46.603354, 1.8883335), 'Great Britain': (54.31536155, -1.91802339480124), 'Bermuda': (32.3018217, -64.7603582), 'Ireland': (52.865196, -7.9794598), 'Wales': (52.2928116, -3.7389299), 'Argentina': (-34.9964962, -64.9672816), 'Norway': (64.574131, 11.5270102892405), 'Thailand': (14.8971921, 100.83273), 'Bangladesh': (24.4768783, 90.2932426), 'Federal Republic of Germany': (45.4185139, -75.6817796429953), 'Slovakia': (48.7411522, 19.4528646), 'Israel': (30.8760272, 35.0015196), 'Australia': (-24.7761085, 134.755), 'Columbia': (2.893108, -73.7845142), 'Indonesia': (-4.7993355, 114.5632032), 'Singapore': (1.2904527, 103.852038), 'Afganistan': (33.0000866, 64.9998468), 'Czechoslovakia': (41.2133453, -98.6228518), 'German Democratic Republic': (19.4347018, -99.2052135855648), 'United States of America': (39.7837304, -100.4458824), 'South America': (-21.0002178, -61.0006564), 'Union of Soviet Socialist Republics': '', 'Guatemala': (15.6356088, -89.8988086), 'Germany': (51.0834196, 10.4234469), 'Chile': (-31.7613364, -71.3187696), 'Martinique': (14.6113732, -60.9620776), 'Belgium': (50.6407351, 4.66696), 'China': (35.000074, 104.999927), 'Haiti': (19.1399952, -72.3570971), 'Hong Kong': (22.2793278, 114.1628131), 'Philippines': (12.7503486, 122.7312101), 'Spain': (40.0028028, -4.0031039), 'Ukraine': (49.4871968, 31.2718321), 'Europe': (51.0000003, 9.9999997), 'Netherlands': (52.2379891, 5.53460738161551), 'Holland': (42.7876022, -86.1090827), 'Pakistan': (30.3308401, 71.247499), 'Denmark': (55.670249, 10.3333283), 'Poland': (52.0977181, 19.0258159), 'Finland': (63.2467777, 25.9209164), 'Mexico': (19.4325301, -99.1332101), 'North America': (51.0000002, -109.0000001), 'Sweden': (59.6749712, 14.5208584), 'Korea': (35.3985008, 127.937111), 'Hungary': (47.1817585, 19.5060937), 'Switzerland': (46.7985624, 8.2319736), 'Honduras': (15.0610686, -84.5978533), 'New Zealand': (-41.500083, 172.8344077), 'Cuba': (23.0131338, -80.8328747), 'Physics': (14.4832485, 121.0269845), 'Jamaica': (18.1850507, -77.3947692), 'Romania': (45.9852129, 24.6859225), 'England': (52.7954791, -0.540240236617432), 'Portugal': (40.033265, -7.8896262), 'South Africa': (-28.8166235, 24.991639), 'Egypt': (26.2540493, 29.2675469), 'Unknown.': (24.9993709, -77.4417422759927), 'Liechtenshtein': (31.9345932, 34.8325777), 'India': (22.3511148, 78.6677428), 'United Kingdom': (54.7023545, -3.2765752), 'Malaysia': (2.3923759, 112.8471939), 'Austria': (47.2000338, 13.199959), 'Trinidad and Tabago': '', 'Republic of Korea': (35.3985008, 127.937111), 'Northern Ireland': (54.65688285, -6.5634352096912), 'Sri Lanka': (7.878, 80.7038245875), 'Japan': (36.5748441, 139.2394179), 'Taiwan': (23.9739374, 120.9820179)}
In [7]:
data['GeoLocations'] = [country_locations.get(x) for x in data['ManuCountry']] 
#we use .get() instead of country_locations[x] to prevent errors when keys are not in the dictionary
In [12]:
data['GeoLocations'][80:100]
Out[12]:
80                          None
81                          None
82                          None
83    (39.7837304, -100.4458824)
84    (39.7837304, -100.4458824)
85                          None
86                          None
87     (61.0666922, -107.991707)
88     (61.0666922, -107.991707)
89    (39.7837304, -100.4458824)
90    (39.7837304, -100.4458824)
91    (39.7837304, -100.4458824)
92    (39.7837304, -100.4458824)
93    (39.7837304, -100.4458824)
94    (39.7837304, -100.4458824)
95    (39.7837304, -100.4458824)
96    (39.7837304, -100.4458824)
97                          None
98                          None
99    (39.7837304, -100.4458824)
Name: GeoLocations, dtype: object

The above process has yielded tuples of latitudes and longitudes where an artifact's country of origin was available, and None where it wasn't. To average these, we're going to have to do some math, because the earth is round and we can't just add the latitudes and longitudes up:

In [9]:
#We're going to have to convert degrees to radians, so let's define a function to do that:

def to_radians(latlong):
    try:
        return (latlong[0]*3.142/180, latlong[1]*3.142/180)
    except:
        pass

#Next, because the earth is round, we have to convert these to cartesian coordinates:
def to_cartesian(latlongradians):
    try:
        x = math.cos(latlongradians[0]) * math.cos(latlongradians[1])
        y = math.cos(latlongradians[0]) * math.sin(latlongradians[1])
        z = math.sin(latlongradians[0])
        return(x,y,z)
    except:
        pass
    
#Finally, we convert back to our original latitude/longitude:
def to_latlong(cartesian):
    try:
        x = cartesian[0]
        y = cartesian[1]
        z = cartesian[2]
        lon = math.atan2(y, x)
        hyp = math.sqrt(x * x + y * y)
        lat = math.atan2(z, hyp)
        return(lat*180/3.142,lon*180/3.142)
    except:
        pass
    
print data['GeoLocations'][90]    
print to_radians(data['GeoLocations'][90])
print to_cartesian(to_radians(data['GeoLocations'][90]))
print to_latlong(to_cartesian(to_radians(data['GeoLocations'][90])))
(39.7837304, -100.4458824)
(0.6944471162044444, -1.75333868056)
(-0.13948924758883874, -0.7556408246036178, 0.6399606972302785)
(39.7837304, -100.4458824)

Looks like this works! Now, let's map it to the entire series:

In [10]:
data['GeoRadians'] = map(to_radians,data['GeoLocations'])
data['GeoCartesian'] = map(to_cartesian, data['GeoRadians'])

#Now, let's average across all the dimensions by reducing:
geo_no_nan = data.loc[data['GeoCartesian'].isnull() == False]['GeoCartesian']

avgcoordinates = reduce(lambda x,y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]), geo_no_nan)
print avgcoordinates
avgcoordinates = (avgcoordinates[0]/len(geo_no_nan), avgcoordinates[1]/len(geo_no_nan), avgcoordinates[2]/len(geo_no_nan))
print avgcoordinates
print to_latlong(avgcoordinates)
(1612.787448314339, -17961.012440354283, 33139.39424219098)
(0.036587737030724564, -0.40746398458154, 0.7518011397956211)
(61.43857821633736, -84.85795258473567)

This procedure says that the location where the average artifact is manufactured is in the Hudson Bay in Canada:

In [11]:
IFrame('''https://www.google.com/maps/embed?pb=!1m18!1m12!1m3!1d9892352.643336799!2d-98.04822081277437\
!3d59.40187990911089!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x0%3A0x0!2zNjHCsDI2JzE4LjkiTiA\
4NMKwNTEnMjguNiJX!5e0!3m2!1sen!2sus!4v1462926074704''', width = 800, height = 600)
Out[11]:

This seems a little strange to me, and I'm definitely going to have to dig deeper in a future post, but I think my math is right, and perhaps I'm just visualizing this wrong in my head.

The Average Thing #3: Artifacts Over Time

In [1]:
%matplotlib inline


In this post, we're going to explore how the number of artifacts received by the museum has changed over time.

In [2]:
import pandas as pd
import re
import csv
import itertools
import numpy as np
from matplotlib import pyplot as plt
pd.get_option("display.max_columns")
pd.set_option("display.max_columns", 40)

data = pd.read_csv("cstmc-CSV-en.csv", delimiter = "|", error_bad_lines=False, warn_bad_lines=False)
C:\Users\dfthd\Anaconda3\envs\python2\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (9,10) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

From the data dictionary, the first four digits of the artifact number reflect the year in which the museum acquired them. How many artifacts did the museum acquire each year?

In [3]:
data['YearObtained'] = data['artifactNumber'].str.extract(r'(\d\d\d\d)', expand=False)
In [4]:
data['YearObtained'].unique()
Out[4]:
array(['1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973',
       '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981',
       '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989',
       '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997',
       '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005',
       '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015'], dtype=object)

Number of objects each year

This is a dataframe/table of the number of non-missing values each attribute has per year. Besides telling us how many artifacts the museum acquired each year (in artifactNumber), it also tells us how many of those cases have data available for each attribute.

In [5]:
data.groupby(by = 'YearObtained').count()
Out[5]:
artifactNumber ObjectName GeneralDescription model SerialNumber Manufacturer ManuCountry ManuProvince ManuCity BeginDate EndDate date_qualifier patent NumberOfComponents ArtifactFinish ContextCanada ContextFunction ContextTechnical group1 category1 subcategory1 group2 category2 subcategory2 group3 category3 subcategory3 material Length Width Height Thickness Weight Diameter image thumbnail
YearObtained
1966 2110 1719 1540 787 193 1629 1537 588 653 1073 432 925 143 2107 1375 422 1124 483 1662 1662 1276 164 164 115 14 14 9 1550 1188 1221 742 57 20 54 1193 1193
1967 1937 1556 1074 1017 543 1402 1290 521 700 849 257 718 51 1937 735 349 749 456 1498 1498 706 67 67 25 2 2 2 1119 845 999 849 8 39 21 1257 1257
1968 2002 1313 938 469 261 1008 918 342 444 442 151 382 94 2002 478 224 385 215 1239 1239 938 46 46 30 3 3 1 943 694 705 490 1 0 6 1185 1185
1969 2793 2065 1798 950 450 1912 1674 610 753 1094 309 986 189 2793 1098 554 1139 689 2002 2002 1655 201 201 170 2 2 2 1809 1555 1549 1154 0 8 78 1934 1934
1970 2041 1529 827 977 790 1494 1128 370 537 464 139 417 123 2041 662 159 358 157 1503 1503 520 59 59 21 1 1 1 825 778 716 629 2 30 39 1253 1253
1971 1143 809 601 408 241 744 659 329 365 317 111 281 99 1143 383 130 227 166 756 756 506 21 21 13 2 2 0 605 539 550 443 3 4 11 678 678
1972 2127 1540 1528 932 286 1439 1143 561 608 723 257 623 169 2127 1495 197 457 278 1442 1442 793 149 149 122 14 14 0 1533 1258 1335 641 73 1 6 1482 1482
1973 1304 894 778 376 214 796 754 305 376 339 60 256 71 1304 740 206 336 234 848 848 685 44 44 31 0 0 0 780 733 749 645 5 1 17 784 784
1974 1226 894 666 505 324 775 713 431 503 365 131 319 123 1226 500 173 314 198 813 813 617 37 37 15 0 0 0 671 543 552 460 1 1 27 881 881
1975 1683 1128 1047 793 224 1097 1006 570 635 671 258 616 111 1683 560 274 812 361 1088 1088 881 38 38 14 4 4 4 1049 604 972 869 0 2 2 1139 1139
1976 1309 882 950 591 371 871 817 373 433 512 125 425 151 1308 697 208 535 352 812 812 684 33 33 21 3 3 2 953 688 759 607 0 3 4 890 890
1977 1419 943 953 522 264 908 848 420 483 440 112 379 154 1419 754 378 464 397 917 917 651 37 37 20 1 1 1 954 633 884 675 1 2 8 970 970
1978 3567 1271 1172 685 364 1214 1127 498 659 612 208 547 231 3567 933 249 699 354 1246 1246 859 101 101 78 2 2 1 1175 813 1027 832 0 0 18 1143 1143
1979 2322 1339 1137 641 259 1233 1175 500 648 508 122 360 154 2322 857 163 536 237 1303 1301 1114 121 121 101 27 27 21 1137 988 1049 838 0 3 65 1234 1234
1980 1633 1187 1023 654 329 1041 981 427 523 556 249 483 153 1633 725 195 466 306 1033 1033 809 91 91 67 1 1 1 1026 951 891 730 0 34 66 960 960
1981 3163 2168 2028 1343 606 2053 1993 911 1241 1131 442 1071 316 3163 1813 236 1692 438 1979 1979 1867 140 140 81 4 4 3 2031 1744 1807 1554 0 3 28 1928 1928
1982 1796 953 915 544 237 931 882 414 537 505 149 422 148 1796 778 244 572 297 926 926 707 142 142 65 3 3 1 920 807 828 605 10 0 22 896 896
1983 1671 775 811 408 234 747 717 332 439 449 112 419 99 1670 683 191 524 236 754 753 595 70 71 54 0 0 0 810 787 759 667 29 2 45 717 717
1984 2546 1332 1305 661 536 1262 1213 570 765 505 95 477 155 2546 1168 303 995 258 1267 1267 865 87 87 68 1 1 1 1308 1217 1310 1149 15 2 40 1355 1355
1985 2780 1346 1377 583 296 1056 993 473 542 487 114 436 106 2780 1100 396 783 371 1011 1011 790 175 175 86 13 13 13 1377 1199 1096 953 21 1 172 991 991
1986 2113 1046 1107 665 322 1030 974 370 496 588 220 520 129 2113 989 473 846 508 1034 1034 700 130 130 79 7 7 6 1108 1111 1119 893 12 1 33 1080 1080
1987 5122 3544 3727 1923 609 2975 3123 1506 1715 1898 411 1154 259 5122 3674 2476 3077 1386 3461 3461 3261 347 347 267 14 14 10 3724 3454 3364 1710 58 2 97 3157 3157
1988 3277 1736 2047 922 280 1779 1672 558 925 1048 348 941 203 3277 1723 1023 1481 783 1800 1800 1587 459 459 366 50 50 27 2046 1997 2002 1623 47 1 183 1579 1579
1989 1371 696 771 454 161 667 631 292 366 514 194 390 49 1371 731 393 581 337 669 669 574 93 93 67 3 3 1 771 707 645 452 11 1 51 630 630
1990 2280 1380 1320 861 146 1255 1200 305 419 510 232 435 106 2280 1332 626 791 336 916 916 728 379 379 126 23 23 20 1317 1128 1337 1057 34 2 159 941 941
1991 2818 2015 2131 895 234 1741 1708 555 630 1429 319 1106 134 2817 1937 635 1718 524 1791 1791 1603 328 328 214 48 48 46 2126 1871 1828 982 373 3 197 1732 1732
1992 4819 3765 3852 1712 608 3618 3012 1274 1340 2792 583 2549 693 4819 3711 1595 3321 1486 3747 3747 1475 303 303 217 57 57 39 3852 3480 3216 2134 72 9 446 3642 3642
1993 1367 1000 978 587 185 854 817 401 484 658 200 410 62 1367 898 467 785 382 925 925 823 184 184 152 23 23 20 978 861 863 437 29 1 35 974 974
1994 2431 1533 1479 952 153 1421 1383 723 790 1120 341 572 115 2431 1344 559 1390 532 1494 1494 809 227 227 206 26 26 25 1479 1371 1355 626 34 2 18 1478 1478
1995 5929 2256 2249 1374 366 2222 2086 768 993 1244 390 939 220 5929 2234 1543 2183 1218 2297 2297 2035 269 269 188 21 21 6 2249 1915 2052 1485 33 3 59 2242 2242
1996 2496 1470 1361 872 279 1402 1338 550 708 950 261 725 132 2496 1323 766 1239 685 1413 1413 1216 381 381 278 31 31 30 1360 1174 1226 851 31 3 111 1381 1381
1997 2966 1748 1776 1090 504 1706 1634 686 949 1194 237 913 203 2966 1715 1234 1663 1098 1745 1745 1003 311 311 268 31 31 29 1775 1463 1439 1096 100 1 245 1900 1900
1998 2641 1904 1905 1340 140 1926 1659 481 619 1369 368 869 201 2641 1883 1406 1773 941 1872 1871 1661 227 227 207 5 5 3 1906 1330 1517 964 18 2 310 2197 2197
1999 1410 849 892 475 107 866 776 343 372 500 257 431 92 1410 831 675 832 585 900 900 789 150 150 138 46 46 46 871 725 684 476 82 1 78 1041 1041
2000 1059 638 801 404 122 656 703 268 336 408 126 263 119 1059 727 439 577 385 774 774 665 117 117 111 13 13 13 801 686 664 499 31 0 50 783 783
2001 2111 1208 1147 498 163 1129 1062 534 585 637 225 481 126 2111 1094 848 973 671 1132 1132 882 216 216 160 7 7 3 1147 919 950 686 42 0 64 1299 1299
2002 3167 2412 2577 1059 354 2259 2187 774 1109 1970 1073 1626 188 3167 2555 2116 2041 1345 2433 2433 2180 377 377 322 42 42 3 2587 2199 1950 1515 26 1 413 2625 2625
2003 2235 1853 1884 730 76 1859 1724 423 453 1225 609 926 100 2234 1879 1480 1703 531 1774 1764 1436 265 265 225 24 24 7 1886 1501 1438 772 124 2 208 1697 1697
2004 5017 4985 4950 4087 181 4746 4763 828 3563 1836 736 1238 73 5017 4989 4270 4844 4033 4863 4863 1802 2808 2808 657 43 43 29 4989 4838 4627 1064 179 7 186 2896 2896
2005 742 742 738 493 167 658 603 241 274 556 173 395 27 742 734 319 592 258 683 683 583 138 138 116 23 23 23 742 681 648 306 203 3 84 700 700
2006 851 851 851 428 52 840 811 95 282 756 70 510 16 851 851 373 718 554 820 820 767 44 44 32 10 10 0 851 774 452 253 71 3 64 462 462
2007 745 745 743 224 78 706 658 148 200 620 287 481 43 745 738 195 690 178 731 730 608 160 160 143 0 0 0 743 686 658 233 179 0 74 483 483
2008 2005 1996 1998 1670 134 1962 1916 1302 1421 613 172 478 69 2005 2001 1419 1760 1412 1983 1982 646 1323 1323 1260 21 21 10 1998 1925 1909 242 145 0 58 1078 1078
2009 861 861 860 432 117 851 823 258 343 687 209 560 64 847 860 562 644 446 818 805 610 239 230 202 6 6 6 861 761 727 466 31 0 78 668 668
2010 1966 1966 1955 593 83 1923 1913 309 421 1784 1155 1642 205 1966 1957 1792 1811 1581 1957 1957 1745 1192 1188 1091 277 277 272 1954 1618 1500 1129 42 1 269 1139 1139
2011 258 258 255 93 28 240 238 90 101 224 28 138 4 258 253 129 177 79 256 256 163 46 45 43 2 2 2 253 226 203 138 27 0 38 235 235
2012 258 258 257 119 19 256 252 115 114 227 23 181 39 258 258 149 234 122 258 230 182 96 96 83 16 16 5 254 207 205 179 3 0 47 249 249
2013 589 589 587 259 54 582 575 164 187 556 85 426 51 588 588 370 573 359 588 588 539 172 172 168 25 25 20 586 516 490 393 1 0 86 516 516
2014 782 782 782 318 27 743 727 157 220 603 82 488 141 782 781 756 568 567 781 773 562 499 499 446 32 32 0 782 615 458 463 1 1 270 144 144
2015 85 85 75 68 6 59 59 4 40 81 1 64 28 85 85 82 74 82 85 85 34 2 2 0 0 0 0 84 78 67 39 0 0 11 0 0
In [6]:
data.groupby(by = 'YearObtained').count()['artifactNumber'].mean()
Out[6]:
2086.86

The mean number of artifacts per year is 2086.86. Let's combine a scatterplot with a fit line to visualize this. I used the examples from here and here.

In [7]:
from scipy.stats import linregress

plt.figure(figsize = (15,10))

meandata = data.groupby(by = 'YearObtained').count()['artifactNumber']
x = pd.to_numeric(meandata.index.values)
y = meandata

print linregress(x,y)

fitline = np.polyfit(x, y, deg=2)

xpoints = np.linspace(min(x),max(x),100);
plt.plot(xpoints, fitline[2] + fitline[1]*xpoints + fitline[0]*xpoints**2 , color = 'red');

plt.scatter(range(min(x),max(x)+1), meandata);
plt.tight_layout()
plt.xticks(size = 20);
plt.yticks(size = 20);
LinregressResult(slope=-17.623769507803122, intercept=37166.973205282113, rvalue=-0.20498455928950232, pvalue=0.15328806044170382, stderr=12.1460638488731)

Some wrong statistics

If you look above, the linregress function from the scipy.stats package only takes one predictor (x). To add in the polynomial term (x^2), we're going to use OLS from the statsmodels.api package (imported as sm).

A glance at these results suggests that a quadratic model fits the data better than a linear model (adjusted R^2 of .25 vs. .02).

It should be said, however, that the inference tests (t-tests) from these results violate the assumption of independent errors. If you look at the "Warnings" footnote [1], the standard errors are only correct if the covariance matrix of the errors has been properly specified - and here it hasn't been. One assumption OLS regression makes is that the errors, or residuals, of the model are independent - basically, that the observations are not related to one another in any way beyond the variables you're using to predict the outcome. In our case, that is untrue: these are time series data, and years that are temporally close to one another will be related to each other.

In the future, I'm going to learn more about the types of models that are appropriate for these data (e.g., an autoregressive model).
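As a rough illustration of the idea, the sketch below regresses each year's count on the previous year's count (a lag-1, "autoregressive" term) using the same sm.OLS machinery as the models below. This is just a quick look at serial dependence, not a proper time-series analysis:

#minimal sketch: check for serial dependence with a lag-1 predictor
import numpy as np
import statsmodels.api as sm

counts = data.groupby(by = 'YearObtained').count()['artifactNumber'].values.astype(float)

y_t = counts[1:]     #count in year t
y_lag = counts[:-1]  #count in year t-1

ar1 = sm.OLS(y_t, sm.add_constant(y_lag)).fit()
print ar1.params     #the second coefficient reflects the year-to-year dependence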

In [9]:
import statsmodels.api as sm

x = pd.to_numeric(meandata.index.values).reshape(50,1)
y = np.array(meandata).reshape(50,1)

xpred = np.column_stack((np.ones(50),np.array(x)))
xpredsquared = np.column_stack((xpred,np.array(x**2)))


print sm.OLS(y,xpred,hasconst=True).fit().summary()
print sm.OLS(y,xpredsquared, hasconst=True).fit().summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.042
Model:                            OLS   Adj. R-squared:                  0.022
Method:                 Least Squares   F-statistic:                     2.105
Date:                Sat, 07 May 2016   Prob (F-statistic):              0.153
Time:                        00:51:12   Log-Likelihood:                -426.05
No. Observations:                  50   AIC:                             856.1
Df Residuals:                      48   BIC:                             859.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       3.717e+04   2.42e+04      1.537      0.131     -1.14e+04  8.58e+04
x1           -17.6238     12.146     -1.451      0.153       -42.045     6.798
==============================================================================
Omnibus:                       17.851   Durbin-Watson:                   1.535
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               21.727
Skew:                           1.365   Prob(JB):                     1.91e-05
Kurtosis:                       4.724   Cond. No.                     2.75e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.75e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.282
Model:                            OLS   Adj. R-squared:                  0.252
Method:                 Least Squares   F-statistic:                     9.248
Date:                Sat, 07 May 2016   Prob (F-statistic):           0.000411
Time:                        00:51:12   Log-Likelihood:                -418.82
No. Observations:                  50   AIC:                             843.6
Df Residuals:                      47   BIC:                             849.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const      -1.291e+07   3.26e+06     -3.956      0.000     -1.95e+07 -6.34e+06
x1          1.299e+04   3278.612      3.962      0.000      6395.437  1.96e+04
x2            -3.2677      0.824     -3.968      0.000        -4.925    -1.611
==============================================================================
Omnibus:                       16.589   Durbin-Watson:                   2.045
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               19.539
Skew:                           1.287   Prob(JB):                     5.72e-05
Kurtosis:                       4.659   Cond. No.                     8.43e+10
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.43e+10. This might indicate that there are
strong multicollinearity or other numerical problems.

The Average Thing #2

In [1]:
%matplotlib inline


This time, we're going to take a more in-depth look at the artifacts at the museum and what they are made of. The first thing we need to do is load the data. Because the code for that is in the first post, I'm just going to jump straight to the point where we've successfully processed the dimension data (and converted it to standardized numbers).
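For reference, the skipped setup looks roughly like this - a condensed sketch of the loading and unit-conversion steps from the first post (which appears further down this page); the regexes and unit dictionary mirror the ones used there:

#condensed sketch: load the data and convert each dimension string
#(e.g., '12.5 cm', '2 m') into a number in centimetres (or kilograms for weight)
import pandas as pd
from matplotlib import pyplot as plt

data = pd.read_csv("cstmc-CSV-en.csv", delimiter = "|", error_bad_lines=False, warn_bad_lines=False)

units = {"cm": 1, "CM": 1, "m": 100, "M": 100, "in": 2.54, "kg": 1, "gm": .001, "tons": 1000, "lbs": .453}

for dim in ['Length', 'Width', 'Height', 'Weight']:
    digits = pd.to_numeric(data[dim].str.extract(r'(\d*.\d*)', expand=False).str.replace(',', '.'),
                           errors = 'coerce')
    multipliers = pd.to_numeric(data[dim].str.extract(r'.* ([a-zA-Z]*)', expand=False).replace(units),
                                errors = 'coerce')
    data[dim + 'NUM'] = digits * multipliers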

A deeper dive into object dimensions

In the last post, we found that all of the dimensions appeared to be skewed in that the means were much higher than the medians. Let's plot them to see what those distributions look like.

In [4]:
plt.figure(figsize = (10,6))
plt.hist(data['WeightNUM'].dropna());
In [5]:
plt.figure(figsize = (10,6));
plt.hist(data['LengthNUM'].dropna());
In [6]:
plt.figure(figsize = (10,6));
plt.hist(data['WidthNUM'].dropna(), bins = 100);
In [7]:
plt.figure(figsize = (10,6));
plt.hist(data['HeightNUM'].dropna(), bins = 100);

The graphs above suggest that there are just a few objects (perhaps only one) that are REALLY REALLY big, and they're pulling the mean up. Just for fun, let's find out what they are.

In [8]:
data.sort_values(['LengthNUM'], ascending = False).iloc[0:3,:]
Out[8]:
artifactNumber ObjectName GeneralDescription model SerialNumber Manufacturer ManuCountry ManuProvince ManuCity BeginDate EndDate date_qualifier patent NumberOfComponents ArtifactFinish ContextCanada ContextFunction ContextTechnical group1 category1 ... Thickness Weight Diameter image thumbnail LengthINT LengthMULT LengthNUM HeightINT HeightMULT HeightNUM WidthINT WidthMULT WidthNUM WeightINT WeightMULT WeightNUM ThicknessINT ThicknessMULT ThicknessNUM
81085 1999.0265.001 Parasail .1 NYLON SAIL/ METAL FITTINGS/ PLASTIC & POLYE... Astra 29 300316 APCO Aviation Ltd. Israel NaN Netanya 1992 NaN NaN NaN 2.0 .1 BLUE, YELLOW AND GREEN NYLON SAIL;.2 PURPLE... PARAGLIDING IS A DEVELOPMENT OF THE SPORT OF H... A PARAGLIDER IS A CROSS BETWEEN A HANG GLIDER ... ORIGINALLY DEVELOPED FROM THE SQUARE PARACHUTE... Aviation Aircraft ... NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... 112.8 100.0 11280.0 NaN NaN NaN 29.8 100.0 2980.0 NaN NaN NaN NaN NaN NaN
97539 2008.0001.001 Automobile metal, glass, fibre, resin, wood and synthetic... Meteor/Montcalm 5066433 265A 61 Ford Motor Co. of Canada Ltd. Canada Ontario Oakville 1961 NaN NaN NaN 12.0 Automobile has turquoise colour body; white ha... Canadian-built and owned car considered to be ... Owner driven passenger vehicle for general tra... Following WWll, the Ford Motor Co. offered pot... Motorized Ground Transportation Automotive vehicles ... NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... 5200.0 1.0 5200.0 1405.00 1.0 1405.0 1210.0 1.0 1210.0 NaN NaN NaN NaN NaN NaN
102774 2012.0039.001 Airplane Limited information available at time of catal... Boeing 720-023B 18024 Boeing Airplane Co. United States of America Washington Seattle 1960 NaN NaN NaN 1.0 Red & white colour scheme. Limited information... , C-FETB proved invaluable in validating elect... Used in the testing of selected aircraft compo... , Pratt & Whitney Canada (PWC) bought the Mode... Aviation Aircraft ... NaN NaN NaN NaN NaN 41.4 100.0 4140.0 12.65 100.0 1265.0 39.9 100.0 3990.0 NaN NaN NaN NaN NaN NaN

3 rows × 51 columns

Apparently there is a paraglider that is 112.8 meters long. Let's pull up a picture of this thing. While we're at it, let's also look at this 5200 cm (52 m) car. The last item is a Boeing 720; for that, ~40 meters seems reasonable, and it's what's listed on Wikipedia.

In [9]:
print data['image'].iloc[81085]
print data['image'].iloc[97539]
http://source.techno-science.ca/artifacts-artefacts/images/1999.0265.001.aa.cs.png
http://source.techno-science.ca/artifacts-artefacts/images/2008.0001.001.aa.cs.png

I'll be honest, I'm not sure the paraglider is really that long, and the second picture just looks like a regular car. These look like very understandable typos, but they also suggest that we should use the median, which is relatively unaffected by these issues.

Using the median dimensions of all the artifacts, we can calculate the density of the average object and compare it to this table of material densities: http://www.psyclops.com/tools/technotes/materials/density.html:

In [10]:
print 22.8*9.4*12 # volume in cm^3
print 16780/(22.8*9.4*12) # density in grams/cm^3
2571.84
6.52451163369

The density of the average artifact is 6.5 g/cm^3, or 6500 kg/m^3, which is closest to zirconium but, in general, consistent with some sort of metal. For reference, aluminium has a density of 2700 kg/m^3, iron 7870 kg/m^3, and gold 19320 kg/m^3.

Materials

So, what is the average artifact actually made of? There's a materials column that lists a bunch of different materials. Let's find a way to extract those strings.

In [11]:
data['material'][50:60]
Out[11]:
50                   metal->steel;glass
51                                  NaN
52    metal->steel;glass;synthetic;wood
53    metal->steel;glass;synthetic;wood
54       metal->steel;metal->brass;wood
55         metal->steel;synthetic;glass
56         metal->steel;synthetic;glass
57         metal->steel;synthetic;glass
58         metal->steel;synthetic;glass
59         metal->steel;synthetic;glass
Name: material, dtype: object

The format for these strings is an arrow (->) to represent a subtype of material, and a semicolon to separate different materials. As a first pass, let's extract the different materials into a list:

In [12]:
data['material'].str.findall(r'\w+')[50:60]
Out[12]:
50                     [metal, steel, glass]
51                                       NaN
52    [metal, steel, glass, synthetic, wood]
53    [metal, steel, glass, synthetic, wood]
54        [metal, steel, metal, brass, wood]
55          [metal, steel, synthetic, glass]
56          [metal, steel, synthetic, glass]
57          [metal, steel, synthetic, glass]
58          [metal, steel, synthetic, glass]
59          [metal, steel, synthetic, glass]
Name: material, dtype: object

There's probably a better way of separating sub-types of materials from the materials themselves, but the regex for that is beyond me at the moment, so let's run with this for now. As a quick and dirty way of guesstimating the main material of the average artifact, we can extract the first material in the list.
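For what it's worth, here is one way to keep materials and their sub-types paired without a fancier regex. This is just a sketch - parse_materials is a hypothetical helper and isn't used in the rest of this post:

#sketch: split on ';' (different materials) and '->' (material -> subtype)
def parse_materials(materialstring):
    pairs = []
    for part in materialstring.split(';'):
        if '->' in part:
            material, subtype = part.split('->', 1)
        else:
            material, subtype = part, ''
        pairs.append((material.strip(), subtype.strip()))
    return pairs

print parse_materials('metal->steel;metal->brass;wood')
#[('metal', 'steel'), ('metal', 'brass'), ('wood', '')]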

In [18]:
data['firstmaterial'] = data['material'].str.extract(r'^(\w+)', expand=False)

These are all the unique materials:

In [14]:
data['firstmaterial'].unique()
Out[14]:
array(['paper', 'wood', 'metal', 'glass', nan, 'skin', 'synthetic',
       'fibre', 'resin', 'ceramic', 'stone', 'fluid', 'bone', 'composite',
       'plant', 'animal', 'shell'], dtype=object)

The next thing we want to do is form a dictionary with this list, and then add counts to it.

In [15]:
withmaterials = data.loc[data['firstmaterial'].isnull() == False] #use only rows with data
firstmaterials = dict.fromkeys(withmaterials['firstmaterial'].unique(), 0) #using the dict method .fromkeys(list, default value)

for material in withmaterials['firstmaterial']:
    firstmaterials[material] += 1

firstmaterials
Out[15]:
{'animal': 39,
 'bone': 98,
 'ceramic': 799,
 'composite': 28,
 'fibre': 2117,
 'fluid': 9,
 'glass': 6177,
 'metal': 29370,
 'paper': 14106,
 'plant': 20,
 'resin': 513,
 'shell': 6,
 'skin': 774,
 'stone': 165,
 'synthetic': 7440,
 'wood': 6978}
In [16]:
#plot a bar graph where the x-axis goes from 0 to the number of dictionary entries, and the values are the counts of those entries
plt.figure(figsize = (15,10))
plt.bar(range(0,len(firstmaterials)), firstmaterials.values(), align='center')
plt.xticks(range(0,len(firstmaterials)),firstmaterials.keys(),rotation = 'vertical', size = 20);
plt.xlim([-1,len(firstmaterials)]);

Here they are as percentages:

In [17]:
total = sum(firstmaterials.values())

for key, value in firstmaterials.iteritems():
    print key, str(round((float(value)/total)*100,2)) + "%"
synthetic 10.84%
fibre 3.08%
stone 0.24%
shell 0.01%
composite 0.04%
ceramic 1.16%
metal 42.79%
fluid 0.01%
resin 0.75%
paper 20.55%
glass 9.0%
wood 10.17%
plant 0.03%
animal 0.06%
skin 1.13%
bone 0.14%

The Average Thing #1


The Canada Science and Technology Museums Corporation hosts a repository of all the artifacts in its collection here: (http://techno-science.ca/en/data.php). Let's do something interesting with it.

Reading and cleaning data

The data come in either a .csv or .xml format. Let's first try reading the .csv file, because Pandas has native support for that. First, we import all the packages that we might need (e.g., regex), and tell Pandas not to truncate the columns like it usually does.

In [1]:
import pandas as pd
import re
import csv
import itertools
pd.get_option("display.max_columns")
pd.set_option("display.max_columns", 40)

Now, we import the data into a dataframe named "data", using Pandas' built-in method .read_csv, specifying the delimiter "|", and telling it not to break when it comes across a bad line, but instead show a bunch of warnings and continue.

In [2]:
data = pd.read_csv("cstmc-CSV-en.csv", delimiter = "|", error_bad_lines=False, warn_bad_lines=False)
C:\Users\dfthd\Anaconda3\envs\python2\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (9,10) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

If you turn the warnings on, it looks like some of the rows of the csv file have extra columns (i.e., they have 37 columns when we only have 36 column names). Because there are multiple missing values, it's not possible to determine what the extra columns represent. Just by eyeballing things, it looks like about ~500 of the 100,000 cases are problematic.

Our two options are to either ignore these cases, or try and read the xml file, which may or may not pay off. My recommendation is that we just move forward and try and get some initial analysis done before backtracking later if we want to.
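If we wanted to put a firmer number on "about ~500", one rough way is to count the delimiters on each line and compare against the header row. This is a sketch; quoting or embedded newlines could throw the per-line count off slightly:

#rough count of rows that have more fields than the header row
with open("cstmc-CSV-en.csv") as f:
    lines = f.readlines()

expected = lines[0].count("|")   #number of delimiters in the header
badrows = [i for i, line in enumerate(lines[1:], 1) if line.count("|") > expected]
print len(badrows)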

Here are the first 3 artifacts in the data:

In [3]:
data.head(n = 3)
Out[3]:
artifactNumber ObjectName GeneralDescription model SerialNumber Manufacturer ManuCountry ManuProvince ManuCity BeginDate EndDate date_qualifier patent NumberOfComponents ArtifactFinish ContextCanada ContextFunction ContextTechnical group1 category1 subcategory1 group2 category2 subcategory2 group3 category3 subcategory3 material Length Width Height Thickness Weight Diameter image thumbnail
0 1966.0001.001 Cover PAPER WESTERN CANADA AIRWAYS LTD. NaN Unknown Unknown NaN NaN 1927 NaN NaN NaN 1.0 NaN AT THE TIME IT WAS THE WORLD'S MOST NORTHERNLY... NaN NaN Aviation Commemorative Stamps & coins NaN NaN NaN NaN NaN NaN paper 4.5 cm 2.6 cm NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte...
1 1966.0002.001 Stamp, postage PAPER WESTERN CANADA AIRWAYS LTD. NaN Unknown Unknown NaN NaN 1927 NaN NaN NaN 1.0 PINK & BLACK ON WHITE NaN NaN NaN Aviation Commemorative Stamps & coins NaN NaN NaN NaN NaN NaN paper 3.8 cm 2.7 cm NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte...
2 1966.0003.001 Stamp, postage PAPER NaN NaN Unknown Unknown NaN NaN 1932 NaN NaN NaN 1.0 DARK & PALE BLUE ON WHITE NaN NaN NaN Aviation Commemorative Stamps & coins NaN NaN NaN NaN NaN NaN paper 12.8 cm 8.4 cm NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte...

And here are some cases which have too many columns:

In [4]:
debug = pd.read_csv("cstmc-CSV-en.csv", header=None, delimiter = "|", error_bad_lines=False, skiprows = 37899, nrows=3)
debug
Out[4]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
0 1984.1214.001 Engine, airplane NaN LTC1K-4A/LYCOMING T53 LE09281X Avco Lycoming Div. United States of America Connecticut Stratford 1969 NaN circa NaN 1 NaN NaN Converts heat into mechanical power NaN Aviation Motive power NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte...
1 1984.1215.001 Engine, airplane NaN LTC1K-4A/LYCOMING T53 LE09285X Avco Lycoming Div. United States of America Connecticut Stratford 1969 NaN circa NaN 1 NaN NaN Converts heat into mechanical power NaN Aviation Motive power NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte...
2 1984.1216.001 Propeller Glass - Fibre HAMILTON STANDARD 337-339-330-325 Hamilton Standard Propeller Div., United Aircr... United States of America Connecticut East hartford 1969 NaN circa NaN 1 NaN NaN Aircraft propeller NaN Aviation Aircraft parts NaN NaN NaN NaN NaN NaN NaN glass;fibre NaN NaN NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte...

Recoding dimensions into actual units

The first thing we're going to do is convert all the dimensions, which have mixed units (e.g., 16 cm and 10 m), into one common unit. To do this, we're going to use regular expressions and loop through the dimension columns.

The other thing that we might want to do is decide what artifacts to actually use. It looks like some of the artifacts are things like documents and paper, and others are really big things like engines.

Dealing with missing data

As you can see from the example cases above, lots of data is missing or unavailable, which is probably to be expected given that not all columns are necessarily relevant to each artifact. However, this forces us to make a decision when trying to summarize data: How do we decide whether to include or exclude artifacts?

For example, the code below subsets data depending on whether or not any of the measurements of their physical dimensions (weight, height, etc.) are missing:

In [5]:
datasub1 = data.loc[data['Length'].isnull() == False]
datasub2 = datasub1.loc[datasub1['Width'].isnull() == False]
datasub3 = datasub2.loc[datasub2['Height'].isnull() == False]
datasub4 = datasub3.loc[datasub3['Weight'].isnull() == False]

Running the code below will list the problem cases with our regex - some have typos in them, others have European style decimal places (thanks Canada!)

In [6]:
subdata = datasub3
subdata = subdata.reset_index(drop = True)

x = 0
badlines = []
for line in subdata['Length']:
    try:
        re.search(r'(\d*\.\d*) .*', line).group(1)
    except:
        badlines.append(subdata.loc[x, ['Length']][0])
    x +=1

print badlines
['610 cm', '12 cm', '210 cm', '254,9 cm', '12,7 cm', '12,7 cm', '71.0\t cm', '81.0\t cm', '61,5 cm', '61,5 cm', '61,5 cm', '61,5 cm', '42.3\t cm', '59,5 cm', '58,0 cm', '176.O cm', '157,5 cm', '18,5 cm', '2,4 cm', '19,5 cm', '32,0 cm', ' cm', '21,0 cm', '30 cm', '58 cm', '5200 cm', '172,0 cm', '26.,0 cm', '26.,0 cm', '43,7 cm', '200,0 cm', '41,5 cm', '180,0 cm', '150 cm', '41,5 cm', '200 cm', '24 cm', '190 cm', '229 cm', '236 cm']

Let's first sub out the commas for decimal places and remove the '\t' characters.

In [7]:
subdata['Length'] = [re.sub(',', '.', x) for x in subdata['Length']]
subdata['Length'] = [re.sub('\t', '', x) for x in subdata['Length']]

AS IT HAPPENS, you can achieve the above using Pandas' built-in string methods, because I am a moron. Below, we use these methods on the "Length" dimension to replace the commas, extract the digits, extract the units, and then replace the units with the numbers to multiply the digits by (i.e., cm = 1, m = 100).

In [8]:
artifactlengths = data['Length'].str.extract(r'(\d*.\d*)', expand=False)
artifactlengths = artifactlengths.str.replace(',', '.')
artifactlengthmultipliers = data['Length'].str.extract(r'.* ([a-zA-Z]*)', expand=False)

Doing this, we've created a vector (i.e., a Pandas Series) holding the numeric part of each measurement, and a separate vector of strings that hold the units. For the next step, we want to replace cm and CM with 1, and m and M with 100, and look for the person who used inches and shoot them.

In [9]:
artifactlengthmultipliers.unique()
Out[9]:
array(['cm', nan, 'M', 'in', 'CM', 'm'], dtype=object)

We create a dictionary to use with the .replace method:

In [10]:
unitreplacement = {
    "mm" : .1,
    "MM" : .1,
    "cm" : 1, 
    "CM" : 1,
    "m" : 100,
    "M" : 100,
    "in" : 2.54,
    "kg" : 1,
    "KG" : 1,
    "gm" : .001,
    "tons" : 1000,
    "lbs" : .453
}
artifactlengthmultipliersNUM = artifactlengthmultipliers.replace(unitreplacement)

And then we multiply our numbers by our multipliers, and now we can use the data to find out the average length of all the artifacts:

In [11]:
#We have to coerce because our series contains missing values that cannot be typed
artifactlengthincm = pd.to_numeric(artifactlengths, errors = 'coerce') * artifactlengthmultipliersNUM
In [12]:
print artifactlengthincm.mean()
print artifactlengthincm.median()
44.7114966571
22.8

The average length is 44.71 cm, and the median is 22.8 cm.

Functions to extract digits and units

Now that we've done this once with length, we can write some functions to do the above easily with the rest of the dimensions. We also have a function that will take a list of dimensions, and "numberize" all of them.

In [13]:
#These are the above as functions to run on our other dimensions
def extract_digits(series):
    lengths = series.str.extract(r'(\d*.\d*)', expand = False)
    lengths = lengths.str.replace(',' , '.')
    return pd.to_numeric(lengths, errors = 'coerce')

def extract_multipliers(series):
    multipliers = series.str.extract(r'.* ([a-zA-Z]*)', expand = False)
    unitreplacement = {
    "cm" : 1, 
    "CM" : 1,
    "m" : 100,
    "M" : 100,
    "in" : 2.54,
    "kg" : 1,
    "KG" : 1,
    "gm" : .001,
    "tons" : 1000,
    "lbs" : .453
    }
    multipliers = multipliers.replace(unitreplacement)
    return pd.to_numeric(multipliers, errors = 'coerce')

def numberize_dimensions(listofdimensions, data):
    for dimension in listofdimensions:
        data[dimension + "INT"] = extract_digits(data[dimension])
        data[dimension + "MULT"] = extract_multipliers(data[dimension])
        data[dimension + "NUM"] = data[dimension + "INT"] * data[dimension + "MULT"] #multiply the digits by the unit multiplier
In [14]:
#This was some hacky debugging when I manually did a binary search-type thing to find out which row numbers were messed up
#pd.to_numeric(artifactlengths[96050:96100])
#artifactlengths[96050:96100]
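As a quick usage sketch, the numberize_dimensions helper defined above can handle several dimensions in one call. The post below does the same thing column by column instead, so this is purely illustrative:

#illustrative only: numberize two of the remaining dimension columns in one call
numberize_dimensions(['Height', 'Width'], data)
print data[['HeightNUM', 'WidthNUM']].median()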

Rest of the dimensions

From the above, the average length of all the artifacts in the collection is 44.7cm. Let's find out what the average is for all the other dimensions:

In [15]:
data['HeightINT'] = extract_digits(data['Height'])
data['HeightMULTI'] = extract_multipliers(data['Height'])
data['HeightNUM'] = data['HeightINT'] * data['HeightMULTI']
print data['HeightNUM'].mean()
print data['HeightNUM'].median()
25.8840089939
9.4

The average height is 25.88 cm and the median is 9.4 cm.

In [16]:
data['WidthINT'] = extract_digits(data['Width'])
data['WidthMULTI'] = extract_multipliers(data['Width'])
data['WidthNUM'] = data['WidthINT'] * data['WidthMULTI']
print data['WidthNUM'].mean()
print data['WidthNUM'].median()
24.1553862744
12.0

The average width is 24.16 cm and the median is 12 cm.

Now, we work on weight. Given that the units for weights range from gm to tons, let's go with kg as the unit we'll use, and assume that tons refer to metric tons. There's also a 'cm' there. Something is wrong.

In [17]:
data['Weight'].str.extract('.* (.*)', expand = False).unique()
Out[17]:
array([nan, 'kg', 'cm', 'tons', 'gm', 'lb'], dtype=object)

Below, we extract the indices of cases with "cm" in the weight. There are 13 cases, and some of them look like a misalignment of the columns, but it's impossible to tell which way they are misaligned.

In [18]:
cmtest = data['Weight'].str.extract('(.* cm)', expand=False)
#Here are three
data[cmtest.isnull() == False][0:3]
Out[18]:
artifactNumber ObjectName GeneralDescription model SerialNumber Manufacturer ManuCountry ManuProvince ManuCity BeginDate EndDate date_qualifier patent NumberOfComponents ArtifactFinish ContextCanada ContextFunction ContextTechnical group1 category1 ... category2 subcategory2 group3 category3 subcategory3 material Length Width Height Thickness Weight Diameter image thumbnail HeightINT HeightMULTI HeightNUM WidthINT WidthMULTI WidthNUM
1934 1966.0971.001 Photograph photograph paper; wood frame; glass; paper & c... NaN NaN JARMAN, FRANK LTD. Canada Ontario Ottawa 1941 NaN NaN NaN 1.0 b + w photograph; black frame presented by Frank Learman to Ralph Bell, Dire... display NaN Aviation Archives ... Archives Personal NaN NaN NaN paper->cardboard;wood->;glass->;metal->steel 49.8 cm 38.8 cm NaN NaN 1.9 cm NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... NaN NaN NaN 38.8 1.0 38.8
1941 1966.0978.001 Photograph photograph paper; glass; plastic frame; steel ... NaN NaN Royal Canadian Air Force Canada Ontario Ottawa 1927 NaN circa NaN 1.0 b + w photograph; caption in white ink; mounte... RCAF Station Ottawa (later Rockcliffe) was the... taken as part of aerial photography survey for... NaN Aviation Archives ... NaN NaN NaN NaN NaN paper->;glass->;synthetic->plastic;metal->steel 45.3 cm 40.0 cm NaN NaN 2.3 cm NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... NaN NaN NaN 40.0 1.0 40.0
52400 1989.0488.001 Calculator, air navigation plastic; metal swivel at centre - possibly alu... B-24D NaN Consolidated Aircraft Corp. United States of America California San diego 1941 NaN after NaN 2.0 white with black detail; pale yellow rotating ... NaN used to determine gross weight & fore & aft ce... especially useful on bombers because of varyin... Aviation Navigation instruments & equipment ... NaN NaN NaN NaN NaN synthetic->plastic;metal->aluminum - possible 28.0 cm 25.2 cm NaN NaN 1.0 cm NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... NaN NaN NaN 25.2 1.0 25.2

3 rows × 42 columns

Let's set those cases to NaN, and just ignore them for now:

In [19]:
data.loc[cmtest.isnull() == False,['Weight']] = None
In [20]:
data['WeightINT'] = extract_digits(data['Weight'])
data['WeightMULT'] = extract_multipliers(data['Weight'])
data['WeightNUM'] = data['WeightINT'] * data['WeightMULT']
print data['WeightNUM'].mean()
print data['WeightNUM'].median()
1058.87995356
16.78

The average weight is about a ton, which may be a problem if we ever want to fabricate this. However, the median is much lower, which suggests that the distribution is skewed. This is also true for the other dimensions - let's take a look at that next.
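One quick way to quantify that skew, before the histograms in the next post, is scipy's sample skewness. This is a sketch, not part of the original analysis; values well above zero indicate a long right tail:

#sketch: sample skewness of each converted dimension
from scipy.stats import skew

for name, series in [('Length', artifactlengthincm), ('Height', data['HeightNUM']),
                     ('Width', data['WidthNUM']), ('Weight', data['WeightNUM'])]:
    print name, skew(series.dropna())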