The Average Thing #4: The Average Place pt.1

In [1]:
%matplotlib inline

The Average Thing #4: The average place pt. 1

Today, we're going to take an initial look at finding where the average artifact comes from.

In [2]:
import pandas as pd
import re
import numpy as np
from matplotlib import pyplot as plt
from geopy.geocoders import Nominatim
pd.get_option("display.max_columns")
pd.set_option("display.max_columns", 40)

data = pd.read_csv("cstmc-CSV-en.csv", delimiter = "|", error_bad_lines=False, warn_bad_lines=False)
C:\Users\dfthd\Anaconda3\envs\python2\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (9,10) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

We're going to use the GeoPy package to see if we can extract the latitude and longitude data from each of the countries. Let's start with a test case:

In [3]:
geolocator = Nominatim() #Using the service Nominatim
examplecase = data['ManuCountry'][7]
print examplecase
location = geolocator.geocode(examplecase) #Use the geocode method, 
print location.latitude
print location.longitude
United States of America
39.7837304
-100.4458824

Now, with that latitude and longitude, let's see where that takes us. If you imput the coordinates on Google Maps, it points to somewhere in the middle of the U.S. That's good enough for now:

In [4]:
from IPython.display import IFrame
from IPython.core.display import display

IFrame('''https://www.google.com/maps/embed?pb=!1m14!1m8!1m3!1d11800310.147808608!\
2d-107.43346408765365!3d38.054495438444!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x0%3A0x0!2zMznCsDQ3JzAx\
LjQiTiAxMDDCsDI2JzQ1LjIiVw!5e0!3m2!1sen!2sus!4v1462918136743''', width = 800, height = 600)
Out[4]:

Next, we're going to do the same for all the countries of origin:

In [5]:
#This simple function takes a string and returns a tuple of the latitude and longitude
def find_lat_long(string):
    geolocator = Nominatim() #Using the service Nominatim
    location = geolocator.geocode(string, timeout=None) #Use the geocode method, 
    return(location.latitude, location.longitude)

import math

#create a dictionary of all the countries
country_locations = {country:"" for country in data['ManuCountry'].unique() if (country != 'Unknown')\
                          and (country != 'unknown') and (type(country) is str)}

#country_locations = {k: find_lat_long(k) for k, v in country_locations.iteritems()}
for k, v in country_locations.iteritems():
    try:
        country_locations[k] = find_lat_long(k)
    except:
        pass
In [6]:
print country_locations
{'Canada': (61.0666922, -107.991707), 'Brazil': (-10.3333332, -53.1999999), 'Qatar': (25.3336984, 51.2295295), 'Italy': (42.6384261, 12.674297), 'Peru': (-6.8699696, -75.0458514), 'Antarctica': (-82.1081819, 34.378236), 'Panama': (8.3096067, -81.3066245), 'Scotland': (56.7861112, -4.1140517), 'Cambodia': (13.5066394, 104.869423), 'France': (46.603354, 1.8883335), 'Great Britain': (54.31536155, -1.91802339480124), 'Bermuda': (32.3018217, -64.7603582), 'Ireland': (52.865196, -7.9794598), 'Wales': (52.2928116, -3.7389299), 'Argentina': (-34.9964962, -64.9672816), 'Norway': (64.574131, 11.5270102892405), 'Thailand': (14.8971921, 100.83273), 'Bangladesh': (24.4768783, 90.2932426), 'Federal Republic of Germany': (45.4185139, -75.6817796429953), 'Slovakia': (48.7411522, 19.4528646), 'Israel': (30.8760272, 35.0015196), 'Australia': (-24.7761085, 134.755), 'Columbia': (2.893108, -73.7845142), 'Indonesia': (-4.7993355, 114.5632032), 'Singapore': (1.2904527, 103.852038), 'Afganistan': (33.0000866, 64.9998468), 'Czechoslovakia': (41.2133453, -98.6228518), 'German Democratic Republic': (19.4347018, -99.2052135855648), 'United States of America': (39.7837304, -100.4458824), 'South America': (-21.0002178, -61.0006564), 'Union of Soviet Socialist Republics': '', 'Guatemala': (15.6356088, -89.8988086), 'Germany': (51.0834196, 10.4234469), 'Chile': (-31.7613364, -71.3187696), 'Martinique': (14.6113732, -60.9620776), 'Belgium': (50.6407351, 4.66696), 'China': (35.000074, 104.999927), 'Haiti': (19.1399952, -72.3570971), 'Hong Kong': (22.2793278, 114.1628131), 'Philippines': (12.7503486, 122.7312101), 'Spain': (40.0028028, -4.0031039), 'Ukraine': (49.4871968, 31.2718321), 'Europe': (51.0000003, 9.9999997), 'Netherlands': (52.2379891, 5.53460738161551), 'Holland': (42.7876022, -86.1090827), 'Pakistan': (30.3308401, 71.247499), 'Denmark': (55.670249, 10.3333283), 'Poland': (52.0977181, 19.0258159), 'Finland': (63.2467777, 25.9209164), 'Mexico': (19.4325301, -99.1332101), 'North America': (51.0000002, -109.0000001), 'Sweden': (59.6749712, 14.5208584), 'Korea': (35.3985008, 127.937111), 'Hungary': (47.1817585, 19.5060937), 'Switzerland': (46.7985624, 8.2319736), 'Honduras': (15.0610686, -84.5978533), 'New Zealand': (-41.500083, 172.8344077), 'Cuba': (23.0131338, -80.8328747), 'Physics': (14.4832485, 121.0269845), 'Jamaica': (18.1850507, -77.3947692), 'Romania': (45.9852129, 24.6859225), 'England': (52.7954791, -0.540240236617432), 'Portugal': (40.033265, -7.8896262), 'South Africa': (-28.8166235, 24.991639), 'Egypt': (26.2540493, 29.2675469), 'Unknown.': (24.9993709, -77.4417422759927), 'Liechtenshtein': (31.9345932, 34.8325777), 'India': (22.3511148, 78.6677428), 'United Kingdom': (54.7023545, -3.2765752), 'Malaysia': (2.3923759, 112.8471939), 'Austria': (47.2000338, 13.199959), 'Trinidad and Tabago': '', 'Republic of Korea': (35.3985008, 127.937111), 'Northern Ireland': (54.65688285, -6.5634352096912), 'Sri Lanka': (7.878, 80.7038245875), 'Japan': (36.5748441, 139.2394179), 'Taiwan': (23.9739374, 120.9820179)}
In [7]:
data['GeoLocations'] = [country_locations.get(x) for x in data['ManuCountry']] 
#we use .get() instead of country_locations[x] to prevent errors when keys are not in the dictionary
In [12]:
data['GeoLocations'][80:100]
Out[12]:
80                          None
81                          None
82                          None
83    (39.7837304, -100.4458824)
84    (39.7837304, -100.4458824)
85                          None
86                          None
87     (61.0666922, -107.991707)
88     (61.0666922, -107.991707)
89    (39.7837304, -100.4458824)
90    (39.7837304, -100.4458824)
91    (39.7837304, -100.4458824)
92    (39.7837304, -100.4458824)
93    (39.7837304, -100.4458824)
94    (39.7837304, -100.4458824)
95    (39.7837304, -100.4458824)
96    (39.7837304, -100.4458824)
97                          None
98                          None
99    (39.7837304, -100.4458824)
Name: GeoLocations, dtype: object

The above process has yielded tuples of latitudes and longitudes where an artifact's country of origin was available, and null where it hasn't. To average this, we're going to have to do some math, because the earth is round and we can't just add the latitudes and longitudes up:

In [9]:
#We're going to have to convert degrees to radians, so let's define a function to do that:

def to_radians(latlong):
    try:
        return (latlong[0]*3.142/180, latlong[1]*3.142/180)
    except:
        pass

#Next, because the earth is round, we have to convert these to cartesian coordinates:
def to_cartesian(latlongradians):
    try:
        x = math.cos(latlongradians[0]) * math.cos(latlongradians[1])
        y = math.cos(latlongradians[0]) * math.sin(latlongradians[1])
        z = math.sin(latlongradians[0])
        return(x,y,z)
    except:
        pass
    
#Finally, we convert back to our original latitude/longitude:
def to_latlong(cartesian):
    try:
        x = cartesian[0]
        y = cartesian[1]
        z = cartesian[2]
        lon = math.atan2(y, x)
        hyp = math.sqrt(x * x + y * y)
        lat = math.atan2(z, hyp)
        return(lat*180/3.142,lon*180/3.142)
    except:
        pass
    
print data['GeoLocations'][90]    
print to_radians(data['GeoLocations'][90])
print to_cartesian(to_radians(data['GeoLocations'][90]))
print to_latlong(to_cartesian(to_radians(data['GeoLocations'][90])))
(39.7837304, -100.4458824)
(0.6944471162044444, -1.75333868056)
(-0.13948924758883874, -0.7556408246036178, 0.6399606972302785)
(39.7837304, -100.4458824)

Looks like this works! Now, let's map it to the entire series:

In [10]:
data['GeoRadians'] = map(to_radians,data['GeoLocations'])
data['GeoCartesian'] = map(to_cartesian, data['GeoRadians'])

#Now, let's average across all the dimensions by reducing:
geo_no_nan = data.loc[data['GeoCartesian'].isnull() == False]['GeoCartesian']

avgcoordinates = reduce(lambda x,y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]), geo_no_nan)
print avgcoordinates
avgcoordinates = (avgcoordinates[0]/len(geo_no_nan), avgcoordinates[1]/len(geo_no_nan), avgcoordinates[2]/len(geo_no_nan))
print avgcoordinates
print to_latlong(avgcoordinates)
(1612.787448314339, -17961.012440354283, 33139.39424219098)
(0.036587737030724564, -0.40746398458154, 0.7518011397956211)
(61.43857821633736, -84.85795258473567)

This procedure says that the location where the average artifact is manufactured is in the Hudson Bay in Canada:

In [11]:
IFrame('''https://www.google.com/maps/embed?pb=!1m18!1m12!1m3!1d9892352.643336799!2d-98.04822081277437\
!3d59.40187990911089!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x0%3A0x0!2zNjHCsDI2JzE4LjkiTiA\
4NMKwNTEnMjguNiJX!5e0!3m2!1sen!2sus!4v1462926074704''', width = 800, height = 600)
Out[11]:

This seems a little strange to me, and I'm definitely going to have to dig deeper in a future post, but I think my math is right, and perhaps I'm just visualizing this in my head wrongly.