The Average Thing #4: The Average Place pt.1
%matplotlib inline
Today, we're going to take an initial look at finding where the average artifact comes from.
import pandas as pd
import re
import numpy as np
from matplotlib import pyplot as plt
from geopy.geocoders import Nominatim
pd.get_option("display.max_columns")
pd.set_option("display.max_columns", 40)
data = pd.read_csv("cstmc-CSV-en.csv", delimiter = "|", error_bad_lines=False, warn_bad_lines=False)
We're going to use the GeoPy package to see if we can look up a latitude and longitude for each country of manufacture. Let's start with a test case:
geolocator = Nominatim()  # Use the Nominatim geocoding service
examplecase = data['ManuCountry'][7]
print(examplecase)
location = geolocator.geocode(examplecase)  # Look up the country name
print(location.latitude)
print(location.longitude)
Now, with that latitude and longitude, let's see where that takes us. If you enter the coordinates into Google Maps, they point to somewhere in the middle of the U.S. That's good enough for now:
from IPython.display import IFrame
from IPython.core.display import display
IFrame('''https://www.google.com/maps/embed?pb=!1m14!1m8!1m3!1d11800310.147808608!\
2d-107.43346408765365!3d38.054495438444!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x0%3A0x0!2zMznCsDQ3JzAx\
LjQiTiAxMDDCsDI2JzQ1LjIiVw!5e0!3m2!1sen!2sus!4v1462918136743''', width = 800, height = 600)
Next, we're going to do the same for all the countries of origin:
# This simple function takes a string and returns a tuple of the latitude and longitude
def find_lat_long(string):
    geolocator = Nominatim()  # Using the service Nominatim
    location = geolocator.geocode(string, timeout=None)  # Use the geocode method
    return (location.latitude, location.longitude)
import math
# Create a dictionary keyed by every known country name (None means "not looked up yet")
country_locations = {country: None for country in data['ManuCountry'].unique()
                     if (country != 'Unknown') and (country != 'unknown') and (type(country) is str)}
# country_locations = {k: find_lat_long(k) for k, v in country_locations.iteritems()}
for k in list(country_locations):
    try:
        country_locations[k] = find_lat_long(k)
    except Exception:
        pass  # Leave the value as None if the lookup fails
print(country_locations)
data['GeoLocations'] = [country_locations.get(x) for x in data['ManuCountry']]
#we use .get() instead of country_locations[x] to prevent errors when keys are not in the dictionary
data['GeoLocations'][80:100]
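The lookup pattern above (geocode each unique country once, then map the results back onto every row) can be sketched without touching the network. In this sketch, `build_location_table` and `fake_geocode` are hypothetical names, and the coordinates in the fake table are rough placeholders rather than values returned by Nominatim:

```python
def build_location_table(names, geocode):
    """Geocode each distinct name once; store None when the lookup fails."""
    table = {}
    for name in names:
        # Skip missing values and the dataset's 'Unknown' placeholders
        if not isinstance(name, str) or name.lower() == 'unknown':
            continue
        if name in table:
            continue  # already looked up
        try:
            table[name] = geocode(name)
        except Exception:
            table[name] = None
    return table


# Hypothetical stand-in for a real geocoder; coordinates are rough placeholders
FAKE_COORDINATES = {'Canada': (60.0, -95.0), 'Germany': (51.0, 10.0)}


def fake_geocode(name):
    return FAKE_COORDINATES[name]  # raises KeyError for names it doesn't know


manu_country = ['Canada', 'Unknown', 'Germany', float('nan'), 'Atlantis', 'Canada']
table = build_location_table(manu_country, fake_geocode)
# .get() returns None for anything that never made it into the table
geo_locations = [table.get(name) for name in manu_country]
print(geo_locations)
```

Passing the geocoder in as an argument is just a convenience for trying the logic offline; swapping in `find_lat_long` should give the same shape of result.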
The above process has yielded tuples of latitudes and longitudes where an artifact's country of origin was available, and null where it wasn't. To average these, we're going to have to do some math, because the earth is round and we can't simply add up the latitudes and longitudes:
# We're going to have to convert degrees to radians, so let's define a function to do that:
def to_radians(latlong):
    try:
        return (math.radians(latlong[0]), math.radians(latlong[1]))
    except (TypeError, IndexError):
        return None  # No coordinates available for this row

# Next, because the earth is round, we have to convert these to Cartesian coordinates:
def to_cartesian(latlongradians):
    try:
        x = math.cos(latlongradians[0]) * math.cos(latlongradians[1])
        y = math.cos(latlongradians[0]) * math.sin(latlongradians[1])
        z = math.sin(latlongradians[0])
        return (x, y, z)
    except (TypeError, IndexError):
        return None

# Finally, we convert back to our original latitude/longitude (in degrees):
def to_latlong(cartesian):
    try:
        x = cartesian[0]
        y = cartesian[1]
        z = cartesian[2]
        lon = math.atan2(y, x)
        hyp = math.sqrt(x * x + y * y)
        lat = math.atan2(z, hyp)
        return (math.degrees(lat), math.degrees(lon))
    except (TypeError, IndexError):
        return None
print(data['GeoLocations'][90])
print(to_radians(data['GeoLocations'][90]))
print(to_cartesian(to_radians(data['GeoLocations'][90])))
print(to_latlong(to_cartesian(to_radians(data['GeoLocations'][90]))))
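To see why the detour through Cartesian coordinates is needed, here is a minimal, self-contained sketch. The helper name `mean_latlong` is made up for this illustration, and the two points are invented: they sit on the equator, one degree either side of the 180° meridian. Naively averaging their longitudes gives 0°, on the opposite side of the planet, while averaging their unit vectors gives a point on the 180° meridian:

```python
import math


def mean_latlong(points):
    """Average (lat, lon) pairs given in degrees via unit vectors on a sphere."""
    xs, ys, zs = [], [], []
    for lat, lon in points:
        lat_r, lon_r = math.radians(lat), math.radians(lon)
        xs.append(math.cos(lat_r) * math.cos(lon_r))
        ys.append(math.cos(lat_r) * math.sin(lon_r))
        zs.append(math.sin(lat_r))
    n = float(len(points))
    x, y, z = sum(xs) / n, sum(ys) / n, sum(zs) / n
    lon = math.atan2(y, x)
    lat = math.atan2(z, math.sqrt(x * x + y * y))
    return (math.degrees(lat), math.degrees(lon))


points = [(0.0, 179.0), (0.0, -179.0)]
naive_lon = sum(lon for _, lon in points) / len(points)
vector_lat, vector_lon = mean_latlong(points)
print(naive_lon)        # 0.0, the wrong side of the globe
print(abs(vector_lon))  # very close to 180
```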
Looks like this works! Now, let's map it to the entire series:
data['GeoRadians'] = [to_radians(x) for x in data['GeoLocations']]
data['GeoCartesian'] = [to_cartesian(x) for x in data['GeoRadians']]
# Now, let's average across all the dimensions by reducing:
from functools import reduce  # built in on Python 2, needs importing on Python 3
geo_no_nan = data.loc[data['GeoCartesian'].notnull(), 'GeoCartesian']
avgcoordinates = reduce(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]), geo_no_nan)
print(avgcoordinates)  # at this point, still the component-wise sum
n = float(len(geo_no_nan))
avgcoordinates = (avgcoordinates[0] / n, avgcoordinates[1] / n, avgcoordinates[2] / n)
print(avgcoordinates)
print(to_latlong(avgcoordinates))
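The reduce step is a component-wise mean of unit vectors. The sketch below, using three made-up vectors in place of the real column, shows the same mean written with `zip`, plus one side observation that is not part of the analysis above: the averaged vector is generally shorter than 1, and its length gives a rough sense of how tightly the points cluster.

```python
import math
from functools import reduce

# Three made-up unit vectors standing in for the GeoCartesian column
vectors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# The reduce-based sum, followed by division by the count
total = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2]), vectors)
mean_reduce = tuple(component / len(vectors) for component in total)

# The same mean written with zip: one sum per coordinate axis
mean_zip = tuple(sum(axis) / len(vectors) for axis in zip(*vectors))

# Length of the mean vector: 1.0 when all points coincide, smaller as they spread out
length = math.sqrt(sum(component ** 2 for component in mean_zip))
print(mean_reduce)
print(mean_zip)
print(length)
```

Only the direction of the mean vector matters for the final latitude and longitude, which is why `to_latlong` can be applied to it without rescaling it to length 1.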
This procedure says that the average artifact was manufactured somewhere in Hudson Bay, Canada:
IFrame('''https://www.google.com/maps/embed?pb=!1m18!1m12!1m3!1d9892352.643336799!2d-98.04822081277437\
!3d59.40187990911089!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x0%3A0x0!2zNjHCsDI2JzE4LjkiTiA\
4NMKwNTEnMjguNiJX!5e0!3m2!1sen!2sus!4v1462926074704''', width = 800, height = 600)
This seems a little strange to me, and I'm definitely going to have to dig deeper in a future post. I think my math is right, though, and perhaps I'm just picturing the geography wrongly.
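One thing that may be going on, offered as a hunch rather than something checked against this dataset: averaging unit vectors pulls the result toward the pole whenever the points are spread out in longitude, for the same reason that flights between North America and Europe arc north. The sketch below uses two invented points, both at 40°N, one at 100°W and one at 10°E. The function name `vector_mean` is hypothetical.

```python
import math


def vector_mean(points):
    """Mean of (lat, lon) pairs in degrees, computed via 3D unit vectors."""
    total = [0.0, 0.0, 0.0]
    for lat, lon in points:
        phi, lam = math.radians(lat), math.radians(lon)
        total[0] += math.cos(phi) * math.cos(lam)
        total[1] += math.cos(phi) * math.sin(lam)
        total[2] += math.sin(phi)
    x, y, z = (component / len(points) for component in total)
    lat = math.degrees(math.atan2(z, math.sqrt(x * x + y * y)))
    lon = math.degrees(math.atan2(y, x))
    return lat, lon


# Two made-up points at the same latitude, 110 degrees of longitude apart
mean_lat, mean_lon = vector_mean([(40.0, -100.0), (40.0, 10.0)])
print(mean_lat)  # noticeably north of 40
print(mean_lon)  # midway between the two longitudes
```

Both inputs sit at 40°N, yet the mean comes out at roughly 55.6°N. The straight line joining the two points cuts through the inside of the sphere, closer to the earth's axis, so projecting its midpoint back onto the surface lands at a higher latitude.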