The Average Thing #2
%matplotlib inline
This time, we're going to take a more in-depth look at the artifacts at the museum and what they are made of. The first thing we're going to do is load the data. Because the code for that is in the first post, I'm just going to jump straight to the point where we've successfully processed the dimension data (and converted it to standardized numbers).
A deeper dive into object dimensions
In the last post, we found that all of the dimensions appeared to be skewed in that the means were much higher than the medians. Let's plot them to see what those distributions look like.
plt.figure(figsize = (10,6))
plt.hist(data['WeightNUM'].dropna());
plt.figure(figsize = (10,6));
plt.hist(data['LengthNUM'].dropna());
plt.figure(figsize = (10,6));
plt.hist(data['WidthNUM'].dropna(), bins = 100);
plt.figure(figsize = (10,6));
plt.hist(data['HeightNUM'].dropna(), bins = 100);
The graphs above suggest that there are a few objects (perhaps just one) that are REALLY REALLY big, and that they are pulling the mean up. Just for fun, let's find out what those objects are.
data.sort_values(['LengthNUM'], ascending = False).iloc[0:3,:]
Apparently there is a paraglider that is 112.8 meters long. Let's pull up a picture of this thing. While we're at it, let's also look at this 5200 cm (52 m) car. The last item is a Boeing 720. For that, ~40 meters seems reasonable, and it's what's listed on Wikipedia.
print data['image'].iloc[81085]
print data['image'].iloc[97539]
I'll be honest, I'm not sure this thing is that long. Also, this looks like a regular car. These look like very understandable typos, but they also suggest that we should use the median, which is relatively unaffected by outliers like these.
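To see why the median is the safer summary here, consider a minimal sketch with made-up lengths (not values from the museum data) in which a single entry is mistyped by a factor of 100. The `mean` and `median` helpers are written out by hand purely for illustration.

```python
# Hypothetical lengths in cm; the last one mimics a typo (52 cm entered as 5200 cm).
lengths = [20.0, 22.5, 25.0, 30.0, 5200.0]

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / float(len(values))

def median(values):
    """Middle value of the sorted list (average of the two middle values if even)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0

print(mean(lengths))    # pulled far upward by the single outlier
print(median(lengths))  # stays at a typical value
```

With only one bad entry out of five, the mean ends up far above every plausible value, while the median still lands on one of the ordinary lengths.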
Using the median dimensions of all the artifacts, we can calculate the density of the average object and compare it to this table: http://www.psyclops.com/tools/technotes/materials/density.html
print 22.8*9.4*12 # volume in cm^3
print 16780/(22.8*9.4*12) # density in grams/cm^3
The density of the average artifact is about 6.5 g/cm3 (6500 kg/m3), which is closest to zirconium but, more generally, consistent with some sort of metal. For reference, aluminium has a density of 2700 kg/m3, iron 7870 kg/m3, and gold 19320 kg/m3.
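The arithmetic above, including the unit conversion and the "closest material" comparison, can be sketched as follows. The aluminium, iron and gold figures are the ones quoted in the text; the zirconium value is an approximate figure added here as an assumption, and the `reference` dictionary is only a stand-in for the linked table.

```python
# Median dimensions (cm) and median weight (g) used in the post.
dims_cm = (22.8, 9.4, 12.0)
weight_g = 16780.0

# Treat the average artifact as a rectangular box.
volume_cm3 = dims_cm[0] * dims_cm[1] * dims_cm[2]
density_g_cm3 = weight_g / volume_cm3
density_kg_m3 = density_g_cm3 * 1000.0  # 1 g/cm^3 equals 1000 kg/m^3

# Reference densities in kg/m^3 (zirconium is an approximate, assumed value).
reference = {
    'Aluminium': 2700.0,
    'Zirconium': 6520.0,
    'Iron': 7870.0,
    'Gold': 19320.0,
}

# Pick the material whose density differs least from the computed one.
closest = min(reference, key=lambda name: abs(reference[name] - density_kg_m3))

print(round(density_g_cm3, 2))
print(closest)
```

Treating the object as a solid box is a simplification: most artifacts do not fill their bounding box, so the true material density is probably higher than this estimate.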
Materials
So, what is the average artifact actually made of? There's a materials column that lists a bunch of different materials. Let's find a way to extract those strings.
data['material'][50:60]
The format for these strings uses an arrow (->) to mark a subtype of a material, and a semicolon to separate different materials. As a first pass, let's extract the different materials into a list:
data['material'].str.findall(r'\w+')[50:60]
There's probably a better way of separating sub-types of materials from the materials themselves, but the regex for that is beyond me at the moment, so let's run with this for now. As a quick and dirty way of guesstimating the main material of the average artifact, we can extract the first material in the list.
data['firstmaterial'] = data['material'].str.extract(r'^(\w+)', expand=False)
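The regex above keeps only the first word of each string. As a rough alternative for separating materials from their sub-types, plain string splitting on `;` and `->` may be enough. The sketch below assumes the format described earlier; the sample string and the `parse_materials` helper are hypothetical and not taken from the data.

```python
def parse_materials(text):
    """Split a string like 'a->b;c' into [('a', ['b']), ('c', [])]."""
    parsed = []
    for entry in text.split(';'):
        # The first piece is the material, anything after '->' is a sub-type.
        parts = [part.strip() for part in entry.split('->')]
        parts = [part for part in parts if part]
        if parts:
            parsed.append((parts[0], parts[1:]))
    return parsed

sample = 'metal->steel;wood;synthetic->plastic'  # hypothetical example string
print(parse_materials(sample))
```

If the real strings do follow this format, something like `data['material'].dropna().apply(parse_materials)` would apply the helper to the whole column.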
These are all the unique materials:
data['firstmaterial'].unique()
The next thing we want to do is form a dictionary with this list, and then add counts to it.
withmaterials = data.loc[data['firstmaterial'].isnull() == False] #use only rows with data
firstmaterials = dict.fromkeys(withmaterials['firstmaterial'].unique(), 0) #using the dict method .fromkeys(list, default value)
for material in withmaterials['firstmaterial']:
    firstmaterials[material] += 1
firstmaterials
# plot a bar graph where the x-axis goes from 0 to the number of dictionary entries, and the values are the counts of those entries
plt.figure(figsize = (15,10))
plt.bar(range(0,len(firstmaterials)), firstmaterials.values(), align='center')
plt.xticks(range(0,len(firstmaterials)),firstmaterials.keys(),rotation = 'vertical', size = 20);
plt.xlim([-1,len(firstmaterials)]);
Here they are as percentages:
total = sum(firstmaterials.values())
for key, value in firstmaterials.iteritems():
    print key, str(round((float(value)/total)*100,2)) + "%"
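The counting loop and the percentage calculation can also be written with `collections.Counter` from the standard library. This is only a sketch on a small made-up list, standing in for the real `firstmaterial` column.

```python
from collections import Counter

# Hypothetical first materials, standing in for data['firstmaterial'].dropna().
materials = ['metal', 'wood', 'metal', 'paper', 'metal', 'wood', 'glass', 'metal']

# Counter builds the material -> count mapping in one step.
counts = Counter(materials)
total = sum(counts.values())

# most_common() lists entries from most to least frequent.
for material, count in counts.most_common():
    percent = round(100.0 * count / total, 2)
    print('{}: {}%'.format(material, percent))
```

Sorting by frequency makes the dominant material easier to spot than the arbitrary dictionary order. In pandas, `data['firstmaterial'].value_counts(normalize=True)` returns the corresponding fractions directly.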