The Average Thing #5: The Average Place pt.2

In [1]:
%matplotlib inline


This is a quick follow-up to the last post: we're going to check our math and look at an alternative way of finding the average location, using the Mercator Projection.

Earlier, we saw that the origin of the average object was somewhere in the Hudson Bay in Canada. This seemed at odds with my intuition. I also had a feeling that the reason this result seemed weird was because my mental representation of the earth was not spherical, but cylindrical (e.g., the Mercator Projection).

To demonstrate this, let's test the simple case where we find the point between the middle of Canada and the UK:

In [6]:
canada = country_locations['Canada']
uk = country_locations['England']

#convert to radians
canada = to_radians(canada)
uk = to_radians(uk)
#convert to cartesian
canada = to_cartesian(canada)
uk = to_cartesian(uk)
#find average
averagepoint = ((canada[0]+uk[0])/2, (canada[1]+uk[1])/2, (canada[2]+uk[2])/2)

print to_latlong(averagepoint)
(68.71255890231265, -45.65693500322893)
In [7]:
from IPython.display import IFrame

IFrame('''https://www.google.com/maps/embed?pb=!1m18!1m12!1m3!1d32299245.83742049!2d-81.10312744481017!3d60\
.38924413554626!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x0%3A0x0!2zNjjCsDQyJzQ1LjIiTiA0NcKwMzknM\
jUuMCJX!5e0!3m2!1sen!2sus!4v1462973875452''', width=800, height=600)
Out[7]:

To visualize how this really is the midpoint, we can look at that point on a 3D globe (external link):

The above visualization makes it clear how my intuition differed from reality - the shortest distance between the middle of Canada and the UK is not across the Atlantic, but crosses Greenland.
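As a quick check on the math, we can confirm that the averaged point sits (roughly) equidistant from the two starting points along the surface of the sphere. The haversine_km helper below is my own addition, not part of the original code:

import math

def haversine_km(pointA, pointB, radius_km = 6371.0):
    #great-circle distance between two (latitude, longitude) tuples, in kilometres
    lat1, lon1 = math.radians(pointA[0]), math.radians(pointA[1])
    lat2, lon2 = math.radians(pointB[0]), math.radians(pointB[1])
    a = math.sin((lat2 - lat1)/2)**2 + math.cos(lat1)*math.cos(lat2)*math.sin((lon2 - lon1)/2)**2
    return 2 * radius_km * math.asin(math.sqrt(a))

midpoint = to_latlong(averagepoint)                         #(68.71..., -45.66...) from above
print haversine_km(country_locations['Canada'], midpoint)   #these two distances should be
print haversine_km(country_locations['England'], midpoint)  #nearly identical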

But let's, for a moment, be flat-earthers and stick to a flat representation of the earth, like we do whenever we use a map or chart. In this case, we simply average the latitudes and longitudes of all the points (see: http://www.geomidpoint.com/calculation.html).

In [10]:
latlon_no_na = data.loc[data['GeoLocations'].isnull() == False]['GeoLocations']
latlon_no_na = latlon_no_na[latlon_no_na != '']
        
avelatlon = reduce(lambda x,y: (x[0] + y[0], x[1] + y[1]), latlon_no_na)

avelatlon = (avelatlon[0]/len(latlon_no_na), avelatlon[1]/len(latlon_no_na))
avelatlon
Out[10]:
(49.76023781406744, -68.28145709782275)

Averaging this way gives us a point somewhere in Quebec, which is more in line with how we might picture the average point if the world were flat:

In [11]:
IFrame('''https://www.google.com/maps/embed?pb=!1m18!1m12!1m3!1d46671329.280431576!2d-113.3722364450685!\
3d44.44153336080491!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x0%3A0x0!2zNDnCsDQ1JzM2LjkiTiA2OM\
KwMTYnNTMuMiJX!5e0!3m2!1sen!2sus!4v1462977105095''', width = 800, height = 600)
Out[11]:

The Average Thing #4: The Average Place pt.1

In [1]:
%matplotlib inline


Today, we're going to take an initial look at finding where the average artifact comes from.

In [2]:
import pandas as pd
import re
import numpy as np
from matplotlib import pyplot as plt
from geopy.geocoders import Nominatim
pd.get_option("display.max_columns")
pd.set_option("display.max_columns", 40)

data = pd.read_csv("cstmc-CSV-en.csv", delimiter = "|", error_bad_lines=False, warn_bad_lines=False)
C:\Users\dfthd\Anaconda3\envs\python2\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (9,10) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

We're going to use the GeoPy package to see if we can extract the latitude and longitude data from each of the countries. Let's start with a test case:

In [3]:
geolocator = Nominatim() #Using the service Nominatim
examplecase = data['ManuCountry'][7]
print examplecase
location = geolocator.geocode(examplecase) #Use the geocode method, 
print location.latitude
print location.longitude
United States of America
39.7837304
-100.4458824

Now, with that latitude and longitude, let's see where that takes us. If you input the coordinates on Google Maps, it points to somewhere in the middle of the U.S. That's good enough for now:

In [4]:
from IPython.display import IFrame
from IPython.core.display import display

IFrame('''https://www.google.com/maps/embed?pb=!1m14!1m8!1m3!1d11800310.147808608!\
2d-107.43346408765365!3d38.054495438444!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x0%3A0x0!2zMznCsDQ3JzAx\
LjQiTiAxMDDCsDI2JzQ1LjIiVw!5e0!3m2!1sen!2sus!4v1462918136743''', width = 800, height = 600)
Out[4]:

Next, we're going to do the same for all the countries of origin:

In [5]:
#This simple function takes a string and returns a tuple of the latitude and longitude
def find_lat_long(string):
    geolocator = Nominatim() #Using the service Nominatim
    location = geolocator.geocode(string, timeout=None) #Use the geocode method, 
    return(location.latitude, location.longitude)

import math

#create a dictionary of all the countries
country_locations = {country:"" for country in data['ManuCountry'].unique() if (country != 'Unknown')\
                          and (country != 'unknown') and (type(country) is str)}

#country_locations = {k: find_lat_long(k) for k, v in country_locations.iteritems()}
for k, v in country_locations.iteritems():
    try:
        country_locations[k] = find_lat_long(k)
    except:
        pass
In [6]:
print country_locations
{'Canada': (61.0666922, -107.991707), 'Brazil': (-10.3333332, -53.1999999), 'Qatar': (25.3336984, 51.2295295), 'Italy': (42.6384261, 12.674297), 'Peru': (-6.8699696, -75.0458514), 'Antarctica': (-82.1081819, 34.378236), 'Panama': (8.3096067, -81.3066245), 'Scotland': (56.7861112, -4.1140517), 'Cambodia': (13.5066394, 104.869423), 'France': (46.603354, 1.8883335), 'Great Britain': (54.31536155, -1.91802339480124), 'Bermuda': (32.3018217, -64.7603582), 'Ireland': (52.865196, -7.9794598), 'Wales': (52.2928116, -3.7389299), 'Argentina': (-34.9964962, -64.9672816), 'Norway': (64.574131, 11.5270102892405), 'Thailand': (14.8971921, 100.83273), 'Bangladesh': (24.4768783, 90.2932426), 'Federal Republic of Germany': (45.4185139, -75.6817796429953), 'Slovakia': (48.7411522, 19.4528646), 'Israel': (30.8760272, 35.0015196), 'Australia': (-24.7761085, 134.755), 'Columbia': (2.893108, -73.7845142), 'Indonesia': (-4.7993355, 114.5632032), 'Singapore': (1.2904527, 103.852038), 'Afganistan': (33.0000866, 64.9998468), 'Czechoslovakia': (41.2133453, -98.6228518), 'German Democratic Republic': (19.4347018, -99.2052135855648), 'United States of America': (39.7837304, -100.4458824), 'South America': (-21.0002178, -61.0006564), 'Union of Soviet Socialist Republics': '', 'Guatemala': (15.6356088, -89.8988086), 'Germany': (51.0834196, 10.4234469), 'Chile': (-31.7613364, -71.3187696), 'Martinique': (14.6113732, -60.9620776), 'Belgium': (50.6407351, 4.66696), 'China': (35.000074, 104.999927), 'Haiti': (19.1399952, -72.3570971), 'Hong Kong': (22.2793278, 114.1628131), 'Philippines': (12.7503486, 122.7312101), 'Spain': (40.0028028, -4.0031039), 'Ukraine': (49.4871968, 31.2718321), 'Europe': (51.0000003, 9.9999997), 'Netherlands': (52.2379891, 5.53460738161551), 'Holland': (42.7876022, -86.1090827), 'Pakistan': (30.3308401, 71.247499), 'Denmark': (55.670249, 10.3333283), 'Poland': (52.0977181, 19.0258159), 'Finland': (63.2467777, 25.9209164), 'Mexico': (19.4325301, -99.1332101), 'North America': (51.0000002, -109.0000001), 'Sweden': (59.6749712, 14.5208584), 'Korea': (35.3985008, 127.937111), 'Hungary': (47.1817585, 19.5060937), 'Switzerland': (46.7985624, 8.2319736), 'Honduras': (15.0610686, -84.5978533), 'New Zealand': (-41.500083, 172.8344077), 'Cuba': (23.0131338, -80.8328747), 'Physics': (14.4832485, 121.0269845), 'Jamaica': (18.1850507, -77.3947692), 'Romania': (45.9852129, 24.6859225), 'England': (52.7954791, -0.540240236617432), 'Portugal': (40.033265, -7.8896262), 'South Africa': (-28.8166235, 24.991639), 'Egypt': (26.2540493, 29.2675469), 'Unknown.': (24.9993709, -77.4417422759927), 'Liechtenshtein': (31.9345932, 34.8325777), 'India': (22.3511148, 78.6677428), 'United Kingdom': (54.7023545, -3.2765752), 'Malaysia': (2.3923759, 112.8471939), 'Austria': (47.2000338, 13.199959), 'Trinidad and Tabago': '', 'Republic of Korea': (35.3985008, 127.937111), 'Northern Ireland': (54.65688285, -6.5634352096912), 'Sri Lanka': (7.878, 80.7038245875), 'Japan': (36.5748441, 139.2394179), 'Taiwan': (23.9739374, 120.9820179)}
In [7]:
data['GeoLocations'] = [country_locations.get(x) for x in data['ManuCountry']] 
#we use .get() instead of country_locations[x] to prevent errors when keys are not in the dictionary
In [12]:
data['GeoLocations'][80:100]
Out[12]:
80                          None
81                          None
82                          None
83    (39.7837304, -100.4458824)
84    (39.7837304, -100.4458824)
85                          None
86                          None
87     (61.0666922, -107.991707)
88     (61.0666922, -107.991707)
89    (39.7837304, -100.4458824)
90    (39.7837304, -100.4458824)
91    (39.7837304, -100.4458824)
92    (39.7837304, -100.4458824)
93    (39.7837304, -100.4458824)
94    (39.7837304, -100.4458824)
95    (39.7837304, -100.4458824)
96    (39.7837304, -100.4458824)
97                          None
98                          None
99    (39.7837304, -100.4458824)
Name: GeoLocations, dtype: object

The above process has yielded tuples of latitudes and longitudes where an artifact's country of origin was available, and None where it wasn't. To average these, we're going to have to do some math, because the earth is round and we can't just add the latitudes and longitudes up:

In [9]:
#We're going to have to convert degrees to radians, so let's define a function to do that:

def to_radians(latlong):
    try:
        return (latlong[0]*3.142/180, latlong[1]*3.142/180)
    except:
        pass

#Next, because the earth is round, we have to convert these to cartesian coordinates:
def to_cartesian(latlongradians):
    try:
        x = math.cos(latlongradians[0]) * math.cos(latlongradians[1])
        y = math.cos(latlongradians[0]) * math.sin(latlongradians[1])
        z = math.sin(latlongradians[0])
        return(x,y,z)
    except:
        pass
    
#Finally, we convert back to our original latitude/longitude:
def to_latlong(cartesian):
    try:
        x = cartesian[0]
        y = cartesian[1]
        z = cartesian[2]
        lon = math.atan2(y, x)
        hyp = math.sqrt(x * x + y * y)
        lat = math.atan2(z, hyp)
        return(lat*180/3.142,lon*180/3.142)
    except:
        pass
    
print data['GeoLocations'][90]    
print to_radians(data['GeoLocations'][90])
print to_cartesian(to_radians(data['GeoLocations'][90]))
print to_latlong(to_cartesian(to_radians(data['GeoLocations'][90])))
(39.7837304, -100.4458824)
(0.6944471162044444, -1.75333868056)
(-0.13948924758883874, -0.7556408246036178, 0.6399606972302785)
(39.7837304, -100.4458824)

Looks like this works! Now, let's map it to the entire series:

In [10]:
data['GeoRadians'] = map(to_radians,data['GeoLocations'])
data['GeoCartesian'] = map(to_cartesian, data['GeoRadians'])

#Now, let's average across all the dimensions by reducing:
geo_no_nan = data.loc[data['GeoCartesian'].isnull() == False]['GeoCartesian']

avgcoordinates = reduce(lambda x,y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]), geo_no_nan)
print avgcoordinates
avgcoordinates = (avgcoordinates[0]/len(geo_no_nan), avgcoordinates[1]/len(geo_no_nan), avgcoordinates[2]/len(geo_no_nan))
print avgcoordinates
print to_latlong(avgcoordinates)
(1612.787448314339, -17961.012440354283, 33139.39424219098)
(0.036587737030724564, -0.40746398458154, 0.7518011397956211)
(61.43857821633736, -84.85795258473567)

This procedure says that the location where the average artifact is manufactured is in the Hudson Bay in Canada:

In [11]:
IFrame('''https://www.google.com/maps/embed?pb=!1m18!1m12!1m3!1d9892352.643336799!2d-98.04822081277437\
!3d59.40187990911089!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x0%3A0x0!2zNjHCsDI2JzE4LjkiTiA\
4NMKwNTEnMjguNiJX!5e0!3m2!1sen!2sus!4v1462926074704''', width = 800, height = 600)
Out[11]:

This seems a little strange to me, and I'm definitely going to have to dig deeper in a future post, but I think my math is right, and perhaps I'm just visualizing this wrong in my head.

The Average Thing #3: Artifacts Over Time

In [1]:
%matplotlib inline


In this post, we're going to explore how the number of artifacts received by the museum has changed over time.

In [2]:
import pandas as pd
import re
import csv
import itertools
import numpy as np
from matplotlib import pyplot as plt
pd.get_option("display.max_columns")
pd.set_option("display.max_columns", 40)

data = pd.read_csv("cstmc-CSV-en.csv", delimiter = "|", error_bad_lines=False, warn_bad_lines=False)
C:\Users\dfthd\Anaconda3\envs\python2\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (9,10) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

From the data dictionary, the first four digits of the artifact number reflect the year in which the museum acquired them. How many artifacts did the museum acquire each year?

In [3]:
data['YearObtained'] = data['artifactNumber'].str.extract(r'(\d\d\d\d)', expand=False)
In [4]:
data['YearObtained'].unique()
Out[4]:
array(['1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973',
       '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981',
       '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989',
       '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997',
       '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005',
       '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015'], dtype=object)

Number of objects each year

This is a dataframe/table of the number of non-missing values each attribute has per year. Besides telling us how many artifacts the museum acquired each year (in artifactNumber), it also tells us how many of those cases have data available for each attribute.

In [5]:
data.groupby(by = 'YearObtained').count()
Out[5]:
artifactNumber ObjectName GeneralDescription model SerialNumber Manufacturer ManuCountry ManuProvince ManuCity BeginDate EndDate date_qualifier patent NumberOfComponents ArtifactFinish ContextCanada ContextFunction ContextTechnical group1 category1 subcategory1 group2 category2 subcategory2 group3 category3 subcategory3 material Length Width Height Thickness Weight Diameter image thumbnail
YearObtained
1966 2110 1719 1540 787 193 1629 1537 588 653 1073 432 925 143 2107 1375 422 1124 483 1662 1662 1276 164 164 115 14 14 9 1550 1188 1221 742 57 20 54 1193 1193
1967 1937 1556 1074 1017 543 1402 1290 521 700 849 257 718 51 1937 735 349 749 456 1498 1498 706 67 67 25 2 2 2 1119 845 999 849 8 39 21 1257 1257
1968 2002 1313 938 469 261 1008 918 342 444 442 151 382 94 2002 478 224 385 215 1239 1239 938 46 46 30 3 3 1 943 694 705 490 1 0 6 1185 1185
1969 2793 2065 1798 950 450 1912 1674 610 753 1094 309 986 189 2793 1098 554 1139 689 2002 2002 1655 201 201 170 2 2 2 1809 1555 1549 1154 0 8 78 1934 1934
1970 2041 1529 827 977 790 1494 1128 370 537 464 139 417 123 2041 662 159 358 157 1503 1503 520 59 59 21 1 1 1 825 778 716 629 2 30 39 1253 1253
1971 1143 809 601 408 241 744 659 329 365 317 111 281 99 1143 383 130 227 166 756 756 506 21 21 13 2 2 0 605 539 550 443 3 4 11 678 678
1972 2127 1540 1528 932 286 1439 1143 561 608 723 257 623 169 2127 1495 197 457 278 1442 1442 793 149 149 122 14 14 0 1533 1258 1335 641 73 1 6 1482 1482
1973 1304 894 778 376 214 796 754 305 376 339 60 256 71 1304 740 206 336 234 848 848 685 44 44 31 0 0 0 780 733 749 645 5 1 17 784 784
1974 1226 894 666 505 324 775 713 431 503 365 131 319 123 1226 500 173 314 198 813 813 617 37 37 15 0 0 0 671 543 552 460 1 1 27 881 881
1975 1683 1128 1047 793 224 1097 1006 570 635 671 258 616 111 1683 560 274 812 361 1088 1088 881 38 38 14 4 4 4 1049 604 972 869 0 2 2 1139 1139
1976 1309 882 950 591 371 871 817 373 433 512 125 425 151 1308 697 208 535 352 812 812 684 33 33 21 3 3 2 953 688 759 607 0 3 4 890 890
1977 1419 943 953 522 264 908 848 420 483 440 112 379 154 1419 754 378 464 397 917 917 651 37 37 20 1 1 1 954 633 884 675 1 2 8 970 970
1978 3567 1271 1172 685 364 1214 1127 498 659 612 208 547 231 3567 933 249 699 354 1246 1246 859 101 101 78 2 2 1 1175 813 1027 832 0 0 18 1143 1143
1979 2322 1339 1137 641 259 1233 1175 500 648 508 122 360 154 2322 857 163 536 237 1303 1301 1114 121 121 101 27 27 21 1137 988 1049 838 0 3 65 1234 1234
1980 1633 1187 1023 654 329 1041 981 427 523 556 249 483 153 1633 725 195 466 306 1033 1033 809 91 91 67 1 1 1 1026 951 891 730 0 34 66 960 960
1981 3163 2168 2028 1343 606 2053 1993 911 1241 1131 442 1071 316 3163 1813 236 1692 438 1979 1979 1867 140 140 81 4 4 3 2031 1744 1807 1554 0 3 28 1928 1928
1982 1796 953 915 544 237 931 882 414 537 505 149 422 148 1796 778 244 572 297 926 926 707 142 142 65 3 3 1 920 807 828 605 10 0 22 896 896
1983 1671 775 811 408 234 747 717 332 439 449 112 419 99 1670 683 191 524 236 754 753 595 70 71 54 0 0 0 810 787 759 667 29 2 45 717 717
1984 2546 1332 1305 661 536 1262 1213 570 765 505 95 477 155 2546 1168 303 995 258 1267 1267 865 87 87 68 1 1 1 1308 1217 1310 1149 15 2 40 1355 1355
1985 2780 1346 1377 583 296 1056 993 473 542 487 114 436 106 2780 1100 396 783 371 1011 1011 790 175 175 86 13 13 13 1377 1199 1096 953 21 1 172 991 991
1986 2113 1046 1107 665 322 1030 974 370 496 588 220 520 129 2113 989 473 846 508 1034 1034 700 130 130 79 7 7 6 1108 1111 1119 893 12 1 33 1080 1080
1987 5122 3544 3727 1923 609 2975 3123 1506 1715 1898 411 1154 259 5122 3674 2476 3077 1386 3461 3461 3261 347 347 267 14 14 10 3724 3454 3364 1710 58 2 97 3157 3157
1988 3277 1736 2047 922 280 1779 1672 558 925 1048 348 941 203 3277 1723 1023 1481 783 1800 1800 1587 459 459 366 50 50 27 2046 1997 2002 1623 47 1 183 1579 1579
1989 1371 696 771 454 161 667 631 292 366 514 194 390 49 1371 731 393 581 337 669 669 574 93 93 67 3 3 1 771 707 645 452 11 1 51 630 630
1990 2280 1380 1320 861 146 1255 1200 305 419 510 232 435 106 2280 1332 626 791 336 916 916 728 379 379 126 23 23 20 1317 1128 1337 1057 34 2 159 941 941
1991 2818 2015 2131 895 234 1741 1708 555 630 1429 319 1106 134 2817 1937 635 1718 524 1791 1791 1603 328 328 214 48 48 46 2126 1871 1828 982 373 3 197 1732 1732
1992 4819 3765 3852 1712 608 3618 3012 1274 1340 2792 583 2549 693 4819 3711 1595 3321 1486 3747 3747 1475 303 303 217 57 57 39 3852 3480 3216 2134 72 9 446 3642 3642
1993 1367 1000 978 587 185 854 817 401 484 658 200 410 62 1367 898 467 785 382 925 925 823 184 184 152 23 23 20 978 861 863 437 29 1 35 974 974
1994 2431 1533 1479 952 153 1421 1383 723 790 1120 341 572 115 2431 1344 559 1390 532 1494 1494 809 227 227 206 26 26 25 1479 1371 1355 626 34 2 18 1478 1478
1995 5929 2256 2249 1374 366 2222 2086 768 993 1244 390 939 220 5929 2234 1543 2183 1218 2297 2297 2035 269 269 188 21 21 6 2249 1915 2052 1485 33 3 59 2242 2242
1996 2496 1470 1361 872 279 1402 1338 550 708 950 261 725 132 2496 1323 766 1239 685 1413 1413 1216 381 381 278 31 31 30 1360 1174 1226 851 31 3 111 1381 1381
1997 2966 1748 1776 1090 504 1706 1634 686 949 1194 237 913 203 2966 1715 1234 1663 1098 1745 1745 1003 311 311 268 31 31 29 1775 1463 1439 1096 100 1 245 1900 1900
1998 2641 1904 1905 1340 140 1926 1659 481 619 1369 368 869 201 2641 1883 1406 1773 941 1872 1871 1661 227 227 207 5 5 3 1906 1330 1517 964 18 2 310 2197 2197
1999 1410 849 892 475 107 866 776 343 372 500 257 431 92 1410 831 675 832 585 900 900 789 150 150 138 46 46 46 871 725 684 476 82 1 78 1041 1041
2000 1059 638 801 404 122 656 703 268 336 408 126 263 119 1059 727 439 577 385 774 774 665 117 117 111 13 13 13 801 686 664 499 31 0 50 783 783
2001 2111 1208 1147 498 163 1129 1062 534 585 637 225 481 126 2111 1094 848 973 671 1132 1132 882 216 216 160 7 7 3 1147 919 950 686 42 0 64 1299 1299
2002 3167 2412 2577 1059 354 2259 2187 774 1109 1970 1073 1626 188 3167 2555 2116 2041 1345 2433 2433 2180 377 377 322 42 42 3 2587 2199 1950 1515 26 1 413 2625 2625
2003 2235 1853 1884 730 76 1859 1724 423 453 1225 609 926 100 2234 1879 1480 1703 531 1774 1764 1436 265 265 225 24 24 7 1886 1501 1438 772 124 2 208 1697 1697
2004 5017 4985 4950 4087 181 4746 4763 828 3563 1836 736 1238 73 5017 4989 4270 4844 4033 4863 4863 1802 2808 2808 657 43 43 29 4989 4838 4627 1064 179 7 186 2896 2896
2005 742 742 738 493 167 658 603 241 274 556 173 395 27 742 734 319 592 258 683 683 583 138 138 116 23 23 23 742 681 648 306 203 3 84 700 700
2006 851 851 851 428 52 840 811 95 282 756 70 510 16 851 851 373 718 554 820 820 767 44 44 32 10 10 0 851 774 452 253 71 3 64 462 462
2007 745 745 743 224 78 706 658 148 200 620 287 481 43 745 738 195 690 178 731 730 608 160 160 143 0 0 0 743 686 658 233 179 0 74 483 483
2008 2005 1996 1998 1670 134 1962 1916 1302 1421 613 172 478 69 2005 2001 1419 1760 1412 1983 1982 646 1323 1323 1260 21 21 10 1998 1925 1909 242 145 0 58 1078 1078
2009 861 861 860 432 117 851 823 258 343 687 209 560 64 847 860 562 644 446 818 805 610 239 230 202 6 6 6 861 761 727 466 31 0 78 668 668
2010 1966 1966 1955 593 83 1923 1913 309 421 1784 1155 1642 205 1966 1957 1792 1811 1581 1957 1957 1745 1192 1188 1091 277 277 272 1954 1618 1500 1129 42 1 269 1139 1139
2011 258 258 255 93 28 240 238 90 101 224 28 138 4 258 253 129 177 79 256 256 163 46 45 43 2 2 2 253 226 203 138 27 0 38 235 235
2012 258 258 257 119 19 256 252 115 114 227 23 181 39 258 258 149 234 122 258 230 182 96 96 83 16 16 5 254 207 205 179 3 0 47 249 249
2013 589 589 587 259 54 582 575 164 187 556 85 426 51 588 588 370 573 359 588 588 539 172 172 168 25 25 20 586 516 490 393 1 0 86 516 516
2014 782 782 782 318 27 743 727 157 220 603 82 488 141 782 781 756 568 567 781 773 562 499 499 446 32 32 0 782 615 458 463 1 1 270 144 144
2015 85 85 75 68 6 59 59 4 40 81 1 64 28 85 85 82 74 82 85 85 34 2 2 0 0 0 0 84 78 67 39 0 0 11 0 0
In [6]:
data.groupby(by = 'YearObtained').count()['artifactNumber'].mean()
Out[6]:
2086.86

The mean number of artifacts per year is 2086.86. Let's combine a scatterplot with a fit line to visualize this. I used the examples from here and here.

In [7]:
from scipy.stats import linregress

plt.figure(figsize = (15,10))

meandata = data.groupby(by = 'YearObtained').count()['artifactNumber']
x = pd.to_numeric(meandata.index.values)
y = meandata

print linregress(x,y)

fitline = np.polyfit(x, y, deg=2)

xpoints = np.linspace(min(x),max(x),100);
plt.plot(xpoints, fitline[2] + fitline[1]*xpoints + fitline[0]*xpoints**2 , color = 'red');

plt.scatter(range(min(x),max(x)+1), meandata);
plt.tight_layout()
plt.xticks(size = 20);
plt.yticks(size = 20);
LinregressResult(slope=-17.623769507803122, intercept=37166.973205282113, rvalue=-0.20498455928950232, pvalue=0.15328806044170382, stderr=12.1460638488731)

Some wrong statistics

If you look above, the linregress function from the scipy.stats package only takes one predictor (x). To add in the polynomial term (x^2), we're going to use OLS from the statsmodels.api package (imported as sm).

A glance at these results suggests that a quadratic model fits the data better than a linear model (adjusted R^2 of .25 vs. .02).

It should be said, however, that the inference tests (t-tests) from these results violate the assumption of independent errors. If you look at the "Warnings" footnote [1], the standard errors are only correct if the covariance matrix of the errors has been properly specified - and here it hasn't been. One assumption OLS regression makes is that the errors, or residuals, of the model are independent - basically, that the observations are not related to one another in any way beyond the variables you're using to predict the outcome. In our case, that is untrue: these are time series data, and years that are temporally close to one another will be related to each other.

In the future, I'm going to learn more about the types of models that are appropriate for these data (e.g., an autoregressive model).
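As a rough illustration of the idea, the sketch below regresses each year's count on the previous year's count (a lag-1, "autoregressive" term) using the same sm.OLS machinery as the models below. This is just a quick look at serial dependence, not a proper time-series analysis:

#minimal sketch: check for serial dependence with a lag-1 predictor
import numpy as np
import statsmodels.api as sm

counts = data.groupby(by = 'YearObtained').count()['artifactNumber'].values.astype(float)

y_t = counts[1:]     #count in year t
y_lag = counts[:-1]  #count in year t-1

ar1 = sm.OLS(y_t, sm.add_constant(y_lag)).fit()
print ar1.params     #the second coefficient reflects the year-to-year dependence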

In [9]:
import statsmodels.api as sm

x = pd.to_numeric(meandata.index.values).reshape(50,1)
y = np.array(meandata).reshape(50,1)

xpred = np.column_stack((np.ones(50),np.array(x)))
xpredsquared = np.column_stack((xpred,np.array(x**2)))


print sm.OLS(y,xpred,hasconst=True).fit().summary()
print sm.OLS(y,xpredsquared, hasconst=True).fit().summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.042
Model:                            OLS   Adj. R-squared:                  0.022
Method:                 Least Squares   F-statistic:                     2.105
Date:                Sat, 07 May 2016   Prob (F-statistic):              0.153
Time:                        00:51:12   Log-Likelihood:                -426.05
No. Observations:                  50   AIC:                             856.1
Df Residuals:                      48   BIC:                             859.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       3.717e+04   2.42e+04      1.537      0.131     -1.14e+04  8.58e+04
x1           -17.6238     12.146     -1.451      0.153       -42.045     6.798
==============================================================================
Omnibus:                       17.851   Durbin-Watson:                   1.535
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               21.727
Skew:                           1.365   Prob(JB):                     1.91e-05
Kurtosis:                       4.724   Cond. No.                     2.75e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.75e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.282
Model:                            OLS   Adj. R-squared:                  0.252
Method:                 Least Squares   F-statistic:                     9.248
Date:                Sat, 07 May 2016   Prob (F-statistic):           0.000411
Time:                        00:51:12   Log-Likelihood:                -418.82
No. Observations:                  50   AIC:                             843.6
Df Residuals:                      47   BIC:                             849.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const      -1.291e+07   3.26e+06     -3.956      0.000     -1.95e+07 -6.34e+06
x1          1.299e+04   3278.612      3.962      0.000      6395.437  1.96e+04
x2            -3.2677      0.824     -3.968      0.000        -4.925    -1.611
==============================================================================
Omnibus:                       16.589   Durbin-Watson:                   2.045
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               19.539
Skew:                           1.287   Prob(JB):                     5.72e-05
Kurtosis:                       4.659   Cond. No.                     8.43e+10
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.43e+10. This might indicate that there are
strong multicollinearity or other numerical problems.

The Average Thing #2

In [1]:
%matplotlib inline


This time, we're going to take a more in-depth look at the artifacts at the museum and what they are made of. The first thing we need to do is load the data. Because the code for that is in the first post, I'm just going to jump straight to the point where we've successfully processed the dimension data (and converted it to standardized numbers).
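For reference, the skipped setup looks roughly like this - a condensed sketch of the loading and unit-conversion steps from the first post (which appears further down this page); the regexes and unit dictionary mirror the ones used there:

#condensed sketch: load the data and convert each dimension string
#(e.g., '12.5 cm', '2 m') into a number in centimetres (or kilograms for weight)
import pandas as pd
from matplotlib import pyplot as plt

data = pd.read_csv("cstmc-CSV-en.csv", delimiter = "|", error_bad_lines=False, warn_bad_lines=False)

units = {"cm": 1, "CM": 1, "m": 100, "M": 100, "in": 2.54, "kg": 1, "gm": .001, "tons": 1000, "lbs": .453}

for dim in ['Length', 'Width', 'Height', 'Weight']:
    digits = pd.to_numeric(data[dim].str.extract(r'(\d*.\d*)', expand=False).str.replace(',', '.'),
                           errors = 'coerce')
    multipliers = pd.to_numeric(data[dim].str.extract(r'.* ([a-zA-Z]*)', expand=False).replace(units),
                                errors = 'coerce')
    data[dim + 'NUM'] = digits * multipliers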

A deeper dive into object dimensions

In the last post, we found that all of the dimensions appeared to be skewed in that the means were much higher than the medians. Let's plot them to see what those distributions look like.

In [4]:
plt.figure(figsize = (10,6))
plt.hist(data['WeightNUM'].dropna());
In [5]:
plt.figure(figsize = (10,6));
plt.hist(data['LengthNUM'].dropna());
In [6]:
plt.figure(figsize = (10,6));
plt.hist(data['WidthNUM'].dropna(), bins = 100);
In [7]:
plt.figure(figsize = (10,6));
plt.hist(data['HeightNUM'].dropna(), bins = 100);

The graphs above suggest that there are just a few objects (perhaps only one) that are REALLY REALLY big, and they're pulling the mean up. Just for fun, let's find out what they are.

In [8]:
data.sort_values(['LengthNUM'], ascending = False).iloc[0:3,:]
Out[8]:
artifactNumber ObjectName GeneralDescription model SerialNumber Manufacturer ManuCountry ManuProvince ManuCity BeginDate EndDate date_qualifier patent NumberOfComponents ArtifactFinish ContextCanada ContextFunction ContextTechnical group1 category1 ... Thickness Weight Diameter image thumbnail LengthINT LengthMULT LengthNUM HeightINT HeightMULT HeightNUM WidthINT WidthMULT WidthNUM WeightINT WeightMULT WeightNUM ThicknessINT ThicknessMULT ThicknessNUM
81085 1999.0265.001 Parasail .1 NYLON SAIL/ METAL FITTINGS/ PLASTIC & POLYE... Astra 29 300316 APCO Aviation Ltd. Israel NaN Netanya 1992 NaN NaN NaN 2.0 .1 BLUE, YELLOW AND GREEN NYLON SAIL;.2 PURPLE... PARAGLIDING IS A DEVELOPMENT OF THE SPORT OF H... A PARAGLIDER IS A CROSS BETWEEN A HANG GLIDER ... ORIGINALLY DEVELOPED FROM THE SQUARE PARACHUTE... Aviation Aircraft ... NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... 112.8 100.0 11280.0 NaN NaN NaN 29.8 100.0 2980.0 NaN NaN NaN NaN NaN NaN
97539 2008.0001.001 Automobile metal, glass, fibre, resin, wood and synthetic... Meteor/Montcalm 5066433 265A 61 Ford Motor Co. of Canada Ltd. Canada Ontario Oakville 1961 NaN NaN NaN 12.0 Automobile has turquoise colour body; white ha... Canadian-built and owned car considered to be ... Owner driven passenger vehicle for general tra... Following WWll, the Ford Motor Co. offered pot... Motorized Ground Transportation Automotive vehicles ... NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... 5200.0 1.0 5200.0 1405.00 1.0 1405.0 1210.0 1.0 1210.0 NaN NaN NaN NaN NaN NaN
102774 2012.0039.001 Airplane Limited information available at time of catal... Boeing 720-023B 18024 Boeing Airplane Co. United States of America Washington Seattle 1960 NaN NaN NaN 1.0 Red & white colour scheme. Limited information... , C-FETB proved invaluable in validating elect... Used in the testing of selected aircraft compo... , Pratt & Whitney Canada (PWC) bought the Mode... Aviation Aircraft ... NaN NaN NaN NaN NaN 41.4 100.0 4140.0 12.65 100.0 1265.0 39.9 100.0 3990.0 NaN NaN NaN NaN NaN NaN

3 rows × 51 columns

Apparently there is a paraglider that is 112.8 meters long. Let's pull up a picture of this thing. While we're at it, let's also look at this 5200 cm (52 m) car. The last item is a Boeing 720; for that, ~40 meters seems reasonable, and it's what's listed on Wikipedia.

In [9]:
print data['image'].iloc[81085]
print data['image'].iloc[97539]
http://source.techno-science.ca/artifacts-artefacts/images/1999.0265.001.aa.cs.png
http://source.techno-science.ca/artifacts-artefacts/images/2008.0001.001.aa.cs.png

I'll be honest, I'm not sure the paraglider is really that long, and the second picture just looks like a regular car. These look like very understandable typos, but they also suggest that we should use the median, which is relatively unaffected by these issues.

Using the median dimensions of all the artifacts, we can calculate the density of the average object and compare it to this table of material densities: http://www.psyclops.com/tools/technotes/materials/density.html:

In [10]:
print 22.8*9.4*12 # volume in cm^3
print 16780/(22.8*9.4*12) # density in grams/cm^3
2571.84
6.52451163369

The density of the average artifact is 6.5 g/cm^3, or 6500 kg/m^3, which is closest to zirconium but, in general, consistent with some sort of metal. For reference, aluminium has a density of 2700 kg/m^3, iron 7870 kg/m^3, and gold 19320 kg/m^3.

Materials

So, what is the average artifact actually made of? There's a materials column that lists a bunch of different materials. Let's find a way to extract those strings.

In [11]:
data['material'][50:60]
Out[11]:
50                   metal->steel;glass
51                                  NaN
52    metal->steel;glass;synthetic;wood
53    metal->steel;glass;synthetic;wood
54       metal->steel;metal->brass;wood
55         metal->steel;synthetic;glass
56         metal->steel;synthetic;glass
57         metal->steel;synthetic;glass
58         metal->steel;synthetic;glass
59         metal->steel;synthetic;glass
Name: material, dtype: object

The format for these strings is an arrow (->) to represent a subtype of material, and a semicolon to separate different materials. As a first pass, let's extract the different materials into a list:

In [12]:
data['material'].str.findall(r'\w+')[50:60]
Out[12]:
50                     [metal, steel, glass]
51                                       NaN
52    [metal, steel, glass, synthetic, wood]
53    [metal, steel, glass, synthetic, wood]
54        [metal, steel, metal, brass, wood]
55          [metal, steel, synthetic, glass]
56          [metal, steel, synthetic, glass]
57          [metal, steel, synthetic, glass]
58          [metal, steel, synthetic, glass]
59          [metal, steel, synthetic, glass]
Name: material, dtype: object

There's probably a better way of separating sub-types of materials from the materials themselves, but the regex for that is beyond me at the moment, so let's run with this for now. As a quick and dirty way of guesstimating the main material of the average artifact, we can extract the first material in the list.
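For what it's worth, here is one way to keep materials and their sub-types paired without a fancier regex. This is just a sketch - parse_materials is a hypothetical helper and isn't used in the rest of this post:

#sketch: split on ';' (different materials) and '->' (material -> subtype)
def parse_materials(materialstring):
    pairs = []
    for part in materialstring.split(';'):
        if '->' in part:
            material, subtype = part.split('->', 1)
        else:
            material, subtype = part, ''
        pairs.append((material.strip(), subtype.strip()))
    return pairs

print parse_materials('metal->steel;metal->brass;wood')
#[('metal', 'steel'), ('metal', 'brass'), ('wood', '')]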

In [18]:
data['firstmaterial'] = data['material'].str.extract(r'^(\w+)', expand=False)

These are all the unique materials:

In [14]:
data['firstmaterial'].unique()
Out[14]:
array(['paper', 'wood', 'metal', 'glass', nan, 'skin', 'synthetic',
       'fibre', 'resin', 'ceramic', 'stone', 'fluid', 'bone', 'composite',
       'plant', 'animal', 'shell'], dtype=object)

The next thing we want to do is form a dictionary with this list, and then add counts to it.

In [15]:
withmaterials = data.loc[data['firstmaterial'].isnull() == False] #use only rows with data
firstmaterials = dict.fromkeys(withmaterials['firstmaterial'].unique(), 0) #using the dict method .fromkeys(list, default value)

for material in withmaterials['firstmaterial']:
    firstmaterials[material] += 1

firstmaterials
Out[15]:
{'animal': 39,
 'bone': 98,
 'ceramic': 799,
 'composite': 28,
 'fibre': 2117,
 'fluid': 9,
 'glass': 6177,
 'metal': 29370,
 'paper': 14106,
 'plant': 20,
 'resin': 513,
 'shell': 6,
 'skin': 774,
 'stone': 165,
 'synthetic': 7440,
 'wood': 6978}
In [16]:
#plot a bar graph where the x-axis goes from 0 to the number of dictionary entries, and the values are the counts of those entries
plt.figure(figsize = (15,10))
plt.bar(range(0,len(firstmaterials)), firstmaterials.values(), align='center')
plt.xticks(range(0,len(firstmaterials)),firstmaterials.keys(),rotation = 'vertical', size = 20);
plt.xlim([-1,len(firstmaterials)]);

Here they are as percentages:

In [17]:
total = sum(firstmaterials.values())

for key, value in firstmaterials.iteritems():
    print key, str(round((float(value)/total)*100,2)) + "%"
synthetic 10.84%
fibre 3.08%
stone 0.24%
shell 0.01%
composite 0.04%
ceramic 1.16%
metal 42.79%
fluid 0.01%
resin 0.75%
paper 20.55%
glass 9.0%
wood 10.17%
plant 0.03%
animal 0.06%
skin 1.13%
bone 0.14%

The Average Thing #1


The Canada Science and Technology Museums Corporation hosts a repository of all the artifacts in its collection here: (http://techno-science.ca/en/data.php). Let's do something interesting with it.

Reading and cleaning data

The data come in either a .csv or .xml format. Let's first try reading the .csv file, because Pandas has native support for that. First, we import all the packages that we might need (e.g., regex), and tell Pandas not to truncate the columns like it usually does.

In [1]:
import pandas as pd
import re
import csv
import itertools
pd.get_option("display.max_columns")
pd.set_option("display.max_columns", 40)

Now, we import the data into a dataframe named "data", using Pandas' built-in method .read_csv, specifying the delimiter "|", and telling it not to break when it comes across a bad line, but instead show a bunch of warnings and continue.

In [2]:
data = pd.read_csv("cstmc-CSV-en.csv", delimiter = "|", error_bad_lines=False, warn_bad_lines=False)
C:\Users\dfthd\Anaconda3\envs\python2\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (9,10) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

If you turn the warnings on, it looks like some of the rows of the csv file have extra columns (i.e., they have 37 columns when we only have 36 column names). Because there are multiple missing values, it's not possible to determine what the extra columns represent. Just by eyeballing things, it looks like about ~500 of the 100,000 cases are problematic.

Our two options are to either ignore these cases, or try and read the xml file, which may or may not pay off. My recommendation is that we just move forward and try and get some initial analysis done before backtracking later if we want to.
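If we wanted to put a firmer number on "about ~500", one rough way is to count the delimiters on each line and compare against the header row. This is a sketch; quoting or embedded newlines could throw the per-line count off slightly:

#rough count of rows that have more fields than the header row
with open("cstmc-CSV-en.csv") as f:
    lines = f.readlines()

expected = lines[0].count("|")   #number of delimiters in the header
badrows = [i for i, line in enumerate(lines[1:], 1) if line.count("|") > expected]
print len(badrows)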

Here are the first 3 artifacts in the data:

In [3]:
data.head(n = 3)
Out[3]:
artifactNumber ObjectName GeneralDescription model SerialNumber Manufacturer ManuCountry ManuProvince ManuCity BeginDate EndDate date_qualifier patent NumberOfComponents ArtifactFinish ContextCanada ContextFunction ContextTechnical group1 category1 subcategory1 group2 category2 subcategory2 group3 category3 subcategory3 material Length Width Height Thickness Weight Diameter image thumbnail
0 1966.0001.001 Cover PAPER WESTERN CANADA AIRWAYS LTD. NaN Unknown Unknown NaN NaN 1927 NaN NaN NaN 1.0 NaN AT THE TIME IT WAS THE WORLD'S MOST NORTHERNLY... NaN NaN Aviation Commemorative Stamps & coins NaN NaN NaN NaN NaN NaN paper 4.5 cm 2.6 cm NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte...
1 1966.0002.001 Stamp, postage PAPER WESTERN CANADA AIRWAYS LTD. NaN Unknown Unknown NaN NaN 1927 NaN NaN NaN 1.0 PINK & BLACK ON WHITE NaN NaN NaN Aviation Commemorative Stamps & coins NaN NaN NaN NaN NaN NaN paper 3.8 cm 2.7 cm NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte...
2 1966.0003.001 Stamp, postage PAPER NaN NaN Unknown Unknown NaN NaN 1932 NaN NaN NaN 1.0 DARK & PALE BLUE ON WHITE NaN NaN NaN Aviation Commemorative Stamps & coins NaN NaN NaN NaN NaN NaN paper 12.8 cm 8.4 cm NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte...

And here are some cases which have too many columns:

In [4]:
debug = pd.read_csv("cstmc-CSV-en.csv", header=None, delimiter = "|", error_bad_lines=False, skiprows = 37899, nrows=3)
debug
Out[4]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
0 1984.1214.001 Engine, airplane NaN LTC1K-4A/LYCOMING T53 LE09281X Avco Lycoming Div. United States of America Connecticut Stratford 1969 NaN circa NaN 1 NaN NaN Converts heat into mechanical power NaN Aviation Motive power NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte...
1 1984.1215.001 Engine, airplane NaN LTC1K-4A/LYCOMING T53 LE09285X Avco Lycoming Div. United States of America Connecticut Stratford 1969 NaN circa NaN 1 NaN NaN Converts heat into mechanical power NaN Aviation Motive power NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte...
2 1984.1216.001 Propeller Glass - Fibre HAMILTON STANDARD 337-339-330-325 Hamilton Standard Propeller Div., United Aircr... United States of America Connecticut East hartford 1969 NaN circa NaN 1 NaN NaN Aircraft propeller NaN Aviation Aircraft parts NaN NaN NaN NaN NaN NaN NaN glass;fibre NaN NaN NaN NaN NaN NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte...

Recoding dimensions into actual units

The first thing we're going to do is convert all the dimensions, which have mixed units (e.g., 16 cm and 10 m), into one common unit. To do this, we're going to use regular expressions and loop through the dimension columns.

The other thing that we might want to do is decide what artifacts to actually use. It looks like some of the artifacts are things like documents and paper, and others are really big things like engines.

Dealing with missing data

As you can see from the example cases above, lots of data is missing or unavailable, which is probably to be expected given that not all columns are necessarily relevant to each artifact. However, this forces us to make a decision when trying to summarize data: How do we decide whether to include or exclude artifacts?

For example, the code below subsets data depending on whether or not any of the measurements of their physical dimensions (weight, height, etc.) are missing:

In [5]:
datasub1 = data.loc[data['Length'].isnull() == False]
datasub2 = datasub1.loc[datasub1['Width'].isnull() == False]
datasub3 = datasub2.loc[datasub2['Height'].isnull() == False]
datasub4 = datasub3.loc[datasub3['Weight'].isnull() == False]

Running the code below will list the problem cases with our regex - some have typos in them, others have European style decimal places (thanks Canada!)

In [6]:
subdata = datasub3
subdata = subdata.reset_index(drop = True)

x = 0
badlines = []
for line in subdata['Length']:
    try:
        re.search(r'(\d*\.\d*) .*', line).group(1)
    except:
        badlines.append(subdata.loc[x, ['Length']][0])
    x +=1

print badlines
['610 cm', '12 cm', '210 cm', '254,9 cm', '12,7 cm', '12,7 cm', '71.0\t cm', '81.0\t cm', '61,5 cm', '61,5 cm', '61,5 cm', '61,5 cm', '42.3\t cm', '59,5 cm', '58,0 cm', '176.O cm', '157,5 cm', '18,5 cm', '2,4 cm', '19,5 cm', '32,0 cm', ' cm', '21,0 cm', '30 cm', '58 cm', '5200 cm', '172,0 cm', '26.,0 cm', '26.,0 cm', '43,7 cm', '200,0 cm', '41,5 cm', '180,0 cm', '150 cm', '41,5 cm', '200 cm', '24 cm', '190 cm', '229 cm', '236 cm']

Let's first sub out the commas for decimal places and remove the '\t' characters.

In [7]:
subdata['Length'] = [re.sub(',', '.', x) for x in subdata['Length']]
subdata['Length'] = [re.sub('\t', '', x) for x in subdata['Length']]

AS IT HAPPENS, you can achieve the above using Pandas' built-in string methods, because I am a moron. Below, we use these methods on the "Length" dimension to replace the commas, extract the digits, extract the units, and then replace the units with the numbers to multiply the digits by (i.e., cm = 1, m = 100).

In [8]:
artifactlengths = data['Length'].str.extract(r'(\d*.\d*)', expand=False)
artifactlengths = artifactlengths.str.replace(',', '.')
artifactlengthmultipliers = data['Length'].str.extract(r'.* ([a-zA-Z]*)', expand=False)

Doing this, we've created a vector (i.e., a Pandas Series) holding the numeric part of each measurement, and a separate vector of strings that hold the units. For the next step, we want to replace cm and CM with 1, and m and M with 100, and look for the person who used inches and shoot them.

In [9]:
artifactlengthmultipliers.unique()
Out[9]:
array(['cm', nan, 'M', 'in', 'CM', 'm'], dtype=object)

We create a dictionary to use with the .replace method:

In [10]:
unitreplacement = {
    "mm" : .1,
    "MM" : .1,
    "cm" : 1, 
    "CM" : 1,
    "m" : 100,
    "M" : 100,
    "in" : 2.54,
    "kg" : 1,
    "KG" : 1,
    "gm" : .001,
    "tons" : 1000,
    "lbs" : .453
}
artifactlengthmultipliersNUM = artifactlengthmultipliers.replace(unitreplacement)

And then we multiply our numbers by our multipliers, and now we can use the data to find out the average length of all the artifacts:

In [11]:
#We have to coerce because our series contains missing values that cannot be typed
artifactlengthincm = pd.to_numeric(artifactlengths, errors = 'coerce') * artifactlengthmultipliersNUM
In [12]:
print artifactlengthincm.mean()
print artifactlengthincm.median()
44.7114966571
22.8

The average length is 44.71 cm, and the median is 22.8 cm.

Functions to extract digits and units

Now that we've done this once with length, we can write some functions to do the above easily with the rest of the dimensions. We also have a function that will take a list of dimensions, and "numberize" all of them.

In [13]:
#These are the above as functions to run on our other dimensions
def extract_digits(series):
    lengths = series.str.extract(r'(\d*.\d*)', expand = False)
    lengths = lengths.str.replace(',' , '.')
    return pd.to_numeric(lengths, errors = 'coerce')

def extract_multipliers(series):
    multipliers = series.str.extract(r'.* ([a-zA-Z]*)', expand = False)
    unitreplacement = {
    "cm" : 1, 
    "CM" : 1,
    "m" : 100,
    "M" : 100,
    "in" : 2.54,
    "kg" : 1,
    "KG" : 1,
    "gm" : .001,
    "tons" : 1000,
    "lbs" : .453
    }
    multipliers = multipliers.replace(unitreplacement)
    return pd.to_numeric(multipliers, errors = 'coerce')

def numberize_dimensions(listofdimensions, data):
    for dimension in listofdimensions:
        data[dimension + "INT"] = extract_digits(data[dimension])
        data[dimension + "MULT"] = extract_multipliers(data[dimension])
        data[dimension + "NUM"] = data[dimension + "INT"] * data[dimension + "MULT"] #multiply the digits by the unit multiplier
In [14]:
#This was some hacky debugging when I manually did a binary search-type thing to find out which row numbers were messed up
#pd.to_numeric(artifactlengths[96050:96100])
#artifactlengths[96050:96100]
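As a quick usage sketch, the numberize_dimensions helper defined above can handle several dimensions in one call. The post below does the same thing column by column instead, so this is purely illustrative:

#illustrative only: numberize two of the remaining dimension columns in one call
numberize_dimensions(['Height', 'Width'], data)
print data[['HeightNUM', 'WidthNUM']].median()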

Rest of the dimensions

From the above, the average length of all the artifacts in the collection is 44.7cm. Let's find out what the average is for all the other dimensions:

In [15]:
data['HeightINT'] = extract_digits(data['Height'])
data['HeightMULTI'] = extract_multipliers(data['Height'])
data['HeightNUM'] = data['HeightINT'] * data['HeightMULTI']
print data['HeightNUM'].mean()
print data['HeightNUM'].median()
25.8840089939
9.4

The average height is 25.88 cm and the median is 9.4 cm.

In [16]:
data['WidthINT'] = extract_digits(data['Width'])
data['WidthMULTI'] = extract_multipliers(data['Width'])
data['WidthNUM'] = data['WidthINT'] * data['WidthMULTI']
print data['WidthNUM'].mean()
print data['WidthNUM'].median()
24.1553862744
12.0

The average width is 24.16 cm and the median is 12 cm.

Now, we work on weight. Given that the units for weights range from gm to tons, let's go with kg as the unit we'll use, and assume that tons refer to metric tons. There's also a 'cm' there. Something is wrong.

In [17]:
data['Weight'].str.extract('.* (.*)', expand = False).unique()
Out[17]:
array([nan, 'kg', 'cm', 'tons', 'gm', 'lb'], dtype=object)

Below, we extract the indices of cases with "cm" in the weight. There are 13 cases, and some of them look like a misalignment of the columns, but it's impossible to tell which way they are misaligned.

In [18]:
cmtest = data['Weight'].str.extract('(.* cm)', expand=False)
#Here are three
data[cmtest.isnull() == False][0:3]
Out[18]:
artifactNumber ObjectName GeneralDescription model SerialNumber Manufacturer ManuCountry ManuProvince ManuCity BeginDate EndDate date_qualifier patent NumberOfComponents ArtifactFinish ContextCanada ContextFunction ContextTechnical group1 category1 ... category2 subcategory2 group3 category3 subcategory3 material Length Width Height Thickness Weight Diameter image thumbnail HeightINT HeightMULTI HeightNUM WidthINT WidthMULTI WidthNUM
1934 1966.0971.001 Photograph photograph paper; wood frame; glass; paper & c... NaN NaN JARMAN, FRANK LTD. Canada Ontario Ottawa 1941 NaN NaN NaN 1.0 b + w photograph; black frame presented by Frank Learman to Ralph Bell, Dire... display NaN Aviation Archives ... Archives Personal NaN NaN NaN paper->cardboard;wood->;glass->;metal->steel 49.8 cm 38.8 cm NaN NaN 1.9 cm NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... NaN NaN NaN 38.8 1.0 38.8
1941 1966.0978.001 Photograph photograph paper; glass; plastic frame; steel ... NaN NaN Royal Canadian Air Force Canada Ontario Ottawa 1927 NaN circa NaN 1.0 b + w photograph; caption in white ink; mounte... RCAF Station Ottawa (later Rockcliffe) was the... taken as part of aerial photography survey for... NaN Aviation Archives ... NaN NaN NaN NaN NaN paper->;glass->;synthetic->plastic;metal->steel 45.3 cm 40.0 cm NaN NaN 2.3 cm NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... NaN NaN NaN 40.0 1.0 40.0
52400 1989.0488.001 Calculator, air navigation plastic; metal swivel at centre - possibly alu... B-24D NaN Consolidated Aircraft Corp. United States of America California San diego 1941 NaN after NaN 2.0 white with black detail; pale yellow rotating ... NaN used to determine gross weight & fore & aft ce... especially useful on bombers because of varyin... Aviation Navigation instruments & equipment ... NaN NaN NaN NaN NaN synthetic->plastic;metal->aluminum - possible 28.0 cm 25.2 cm NaN NaN 1.0 cm NaN http://source.techno-science.ca/artifacts-arte... http://source.techno-science.ca/artifacts-arte... NaN NaN NaN 25.2 1.0 25.2

3 rows × 42 columns

Let's set those cases to NaN, and just ignore them for now:

In [19]:
data.loc[cmtest.isnull() == False,['Weight']] = None
In [20]:
data['WeightINT'] = extract_digits(data['Weight'])
data['WeightMULT'] = extract_multipliers(data['Weight'])
data['WeightNUM'] = data['WeightINT'] * data['WeightMULT']
print data['WeightNUM'].mean()
print data['WeightNUM'].median()
1058.87995356
16.78

The average weight is about a ton, which may be a problem if we ever want to fabricate this. However, the median is much lower, which suggests that the distribution is skewed. This is also true for the other dimensions - let's take a look at that next.
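One quick way to quantify that skew, before the histograms in the next post, is scipy's sample skewness. This is a sketch, not part of the original analysis; values well above zero indicate a long right tail:

#sketch: sample skewness of each converted dimension
from scipy.stats import skew

for name, series in [('Length', artifactlengthincm), ('Height', data['HeightNUM']),
                     ('Width', data['WidthNUM']), ('Weight', data['WeightNUM'])]:
    print name, skew(series.dropna())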