The Average Thing #3: Artifacts Over Time

In [1]:
%matplotlib inline

In this post, we're going to explore how the number of artifacts received by the museum has changed over time.

In [2]:
import pandas as pd
import re
import csv
import itertools
import numpy as np
from matplotlib import pyplot as plt
pd.get_option("display.max_columns")
pd.set_option("display.max_columns", 40)

data = pd.read_csv("cstmc-CSV-en.csv", delimiter = "|", error_bad_lines=False, warn_bad_lines=False)
C:\Users\dfthd\Anaconda3\envs\python2\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (9,10) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

From the data dictionary, the first four digits of the artifact number reflect the year in which the museum acquired them. How many artifacts did the museum acquire each year?

In [3]:
# the first four digits of the artifact number encode the year of acquisition
data['YearObtained'] = data['artifactNumber'].str.extract(r'^(\d{4})', expand=False)
In [4]:
data['YearObtained'].unique()
Out[4]:
array(['1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973',
       '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981',
       '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989',
       '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997',
       '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005',
       '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015'], dtype=object)

Number of objects each year

This is a dataframe/table of the number of non-missing values in each column, by year. Besides telling us how many artifacts the museum received each year (the artifactNumber column), it also tells us how many of those records have data available for each attribute.

In [5]:
data.groupby(by = 'YearObtained').count()
Out[5]:
artifactNumber ObjectName GeneralDescription model SerialNumber Manufacturer ManuCountry ManuProvince ManuCity BeginDate EndDate date_qualifier patent NumberOfComponents ArtifactFinish ContextCanada ContextFunction ContextTechnical group1 category1 subcategory1 group2 category2 subcategory2 group3 category3 subcategory3 material Length Width Height Thickness Weight Diameter image thumbnail
YearObtained
1966 2110 1719 1540 787 193 1629 1537 588 653 1073 432 925 143 2107 1375 422 1124 483 1662 1662 1276 164 164 115 14 14 9 1550 1188 1221 742 57 20 54 1193 1193
1967 1937 1556 1074 1017 543 1402 1290 521 700 849 257 718 51 1937 735 349 749 456 1498 1498 706 67 67 25 2 2 2 1119 845 999 849 8 39 21 1257 1257
1968 2002 1313 938 469 261 1008 918 342 444 442 151 382 94 2002 478 224 385 215 1239 1239 938 46 46 30 3 3 1 943 694 705 490 1 0 6 1185 1185
1969 2793 2065 1798 950 450 1912 1674 610 753 1094 309 986 189 2793 1098 554 1139 689 2002 2002 1655 201 201 170 2 2 2 1809 1555 1549 1154 0 8 78 1934 1934
1970 2041 1529 827 977 790 1494 1128 370 537 464 139 417 123 2041 662 159 358 157 1503 1503 520 59 59 21 1 1 1 825 778 716 629 2 30 39 1253 1253
1971 1143 809 601 408 241 744 659 329 365 317 111 281 99 1143 383 130 227 166 756 756 506 21 21 13 2 2 0 605 539 550 443 3 4 11 678 678
1972 2127 1540 1528 932 286 1439 1143 561 608 723 257 623 169 2127 1495 197 457 278 1442 1442 793 149 149 122 14 14 0 1533 1258 1335 641 73 1 6 1482 1482
1973 1304 894 778 376 214 796 754 305 376 339 60 256 71 1304 740 206 336 234 848 848 685 44 44 31 0 0 0 780 733 749 645 5 1 17 784 784
1974 1226 894 666 505 324 775 713 431 503 365 131 319 123 1226 500 173 314 198 813 813 617 37 37 15 0 0 0 671 543 552 460 1 1 27 881 881
1975 1683 1128 1047 793 224 1097 1006 570 635 671 258 616 111 1683 560 274 812 361 1088 1088 881 38 38 14 4 4 4 1049 604 972 869 0 2 2 1139 1139
1976 1309 882 950 591 371 871 817 373 433 512 125 425 151 1308 697 208 535 352 812 812 684 33 33 21 3 3 2 953 688 759 607 0 3 4 890 890
1977 1419 943 953 522 264 908 848 420 483 440 112 379 154 1419 754 378 464 397 917 917 651 37 37 20 1 1 1 954 633 884 675 1 2 8 970 970
1978 3567 1271 1172 685 364 1214 1127 498 659 612 208 547 231 3567 933 249 699 354 1246 1246 859 101 101 78 2 2 1 1175 813 1027 832 0 0 18 1143 1143
1979 2322 1339 1137 641 259 1233 1175 500 648 508 122 360 154 2322 857 163 536 237 1303 1301 1114 121 121 101 27 27 21 1137 988 1049 838 0 3 65 1234 1234
1980 1633 1187 1023 654 329 1041 981 427 523 556 249 483 153 1633 725 195 466 306 1033 1033 809 91 91 67 1 1 1 1026 951 891 730 0 34 66 960 960
1981 3163 2168 2028 1343 606 2053 1993 911 1241 1131 442 1071 316 3163 1813 236 1692 438 1979 1979 1867 140 140 81 4 4 3 2031 1744 1807 1554 0 3 28 1928 1928
1982 1796 953 915 544 237 931 882 414 537 505 149 422 148 1796 778 244 572 297 926 926 707 142 142 65 3 3 1 920 807 828 605 10 0 22 896 896
1983 1671 775 811 408 234 747 717 332 439 449 112 419 99 1670 683 191 524 236 754 753 595 70 71 54 0 0 0 810 787 759 667 29 2 45 717 717
1984 2546 1332 1305 661 536 1262 1213 570 765 505 95 477 155 2546 1168 303 995 258 1267 1267 865 87 87 68 1 1 1 1308 1217 1310 1149 15 2 40 1355 1355
1985 2780 1346 1377 583 296 1056 993 473 542 487 114 436 106 2780 1100 396 783 371 1011 1011 790 175 175 86 13 13 13 1377 1199 1096 953 21 1 172 991 991
1986 2113 1046 1107 665 322 1030 974 370 496 588 220 520 129 2113 989 473 846 508 1034 1034 700 130 130 79 7 7 6 1108 1111 1119 893 12 1 33 1080 1080
1987 5122 3544 3727 1923 609 2975 3123 1506 1715 1898 411 1154 259 5122 3674 2476 3077 1386 3461 3461 3261 347 347 267 14 14 10 3724 3454 3364 1710 58 2 97 3157 3157
1988 3277 1736 2047 922 280 1779 1672 558 925 1048 348 941 203 3277 1723 1023 1481 783 1800 1800 1587 459 459 366 50 50 27 2046 1997 2002 1623 47 1 183 1579 1579
1989 1371 696 771 454 161 667 631 292 366 514 194 390 49 1371 731 393 581 337 669 669 574 93 93 67 3 3 1 771 707 645 452 11 1 51 630 630
1990 2280 1380 1320 861 146 1255 1200 305 419 510 232 435 106 2280 1332 626 791 336 916 916 728 379 379 126 23 23 20 1317 1128 1337 1057 34 2 159 941 941
1991 2818 2015 2131 895 234 1741 1708 555 630 1429 319 1106 134 2817 1937 635 1718 524 1791 1791 1603 328 328 214 48 48 46 2126 1871 1828 982 373 3 197 1732 1732
1992 4819 3765 3852 1712 608 3618 3012 1274 1340 2792 583 2549 693 4819 3711 1595 3321 1486 3747 3747 1475 303 303 217 57 57 39 3852 3480 3216 2134 72 9 446 3642 3642
1993 1367 1000 978 587 185 854 817 401 484 658 200 410 62 1367 898 467 785 382 925 925 823 184 184 152 23 23 20 978 861 863 437 29 1 35 974 974
1994 2431 1533 1479 952 153 1421 1383 723 790 1120 341 572 115 2431 1344 559 1390 532 1494 1494 809 227 227 206 26 26 25 1479 1371 1355 626 34 2 18 1478 1478
1995 5929 2256 2249 1374 366 2222 2086 768 993 1244 390 939 220 5929 2234 1543 2183 1218 2297 2297 2035 269 269 188 21 21 6 2249 1915 2052 1485 33 3 59 2242 2242
1996 2496 1470 1361 872 279 1402 1338 550 708 950 261 725 132 2496 1323 766 1239 685 1413 1413 1216 381 381 278 31 31 30 1360 1174 1226 851 31 3 111 1381 1381
1997 2966 1748 1776 1090 504 1706 1634 686 949 1194 237 913 203 2966 1715 1234 1663 1098 1745 1745 1003 311 311 268 31 31 29 1775 1463 1439 1096 100 1 245 1900 1900
1998 2641 1904 1905 1340 140 1926 1659 481 619 1369 368 869 201 2641 1883 1406 1773 941 1872 1871 1661 227 227 207 5 5 3 1906 1330 1517 964 18 2 310 2197 2197
1999 1410 849 892 475 107 866 776 343 372 500 257 431 92 1410 831 675 832 585 900 900 789 150 150 138 46 46 46 871 725 684 476 82 1 78 1041 1041
2000 1059 638 801 404 122 656 703 268 336 408 126 263 119 1059 727 439 577 385 774 774 665 117 117 111 13 13 13 801 686 664 499 31 0 50 783 783
2001 2111 1208 1147 498 163 1129 1062 534 585 637 225 481 126 2111 1094 848 973 671 1132 1132 882 216 216 160 7 7 3 1147 919 950 686 42 0 64 1299 1299
2002 3167 2412 2577 1059 354 2259 2187 774 1109 1970 1073 1626 188 3167 2555 2116 2041 1345 2433 2433 2180 377 377 322 42 42 3 2587 2199 1950 1515 26 1 413 2625 2625
2003 2235 1853 1884 730 76 1859 1724 423 453 1225 609 926 100 2234 1879 1480 1703 531 1774 1764 1436 265 265 225 24 24 7 1886 1501 1438 772 124 2 208 1697 1697
2004 5017 4985 4950 4087 181 4746 4763 828 3563 1836 736 1238 73 5017 4989 4270 4844 4033 4863 4863 1802 2808 2808 657 43 43 29 4989 4838 4627 1064 179 7 186 2896 2896
2005 742 742 738 493 167 658 603 241 274 556 173 395 27 742 734 319 592 258 683 683 583 138 138 116 23 23 23 742 681 648 306 203 3 84 700 700
2006 851 851 851 428 52 840 811 95 282 756 70 510 16 851 851 373 718 554 820 820 767 44 44 32 10 10 0 851 774 452 253 71 3 64 462 462
2007 745 745 743 224 78 706 658 148 200 620 287 481 43 745 738 195 690 178 731 730 608 160 160 143 0 0 0 743 686 658 233 179 0 74 483 483
2008 2005 1996 1998 1670 134 1962 1916 1302 1421 613 172 478 69 2005 2001 1419 1760 1412 1983 1982 646 1323 1323 1260 21 21 10 1998 1925 1909 242 145 0 58 1078 1078
2009 861 861 860 432 117 851 823 258 343 687 209 560 64 847 860 562 644 446 818 805 610 239 230 202 6 6 6 861 761 727 466 31 0 78 668 668
2010 1966 1966 1955 593 83 1923 1913 309 421 1784 1155 1642 205 1966 1957 1792 1811 1581 1957 1957 1745 1192 1188 1091 277 277 272 1954 1618 1500 1129 42 1 269 1139 1139
2011 258 258 255 93 28 240 238 90 101 224 28 138 4 258 253 129 177 79 256 256 163 46 45 43 2 2 2 253 226 203 138 27 0 38 235 235
2012 258 258 257 119 19 256 252 115 114 227 23 181 39 258 258 149 234 122 258 230 182 96 96 83 16 16 5 254 207 205 179 3 0 47 249 249
2013 589 589 587 259 54 582 575 164 187 556 85 426 51 588 588 370 573 359 588 588 539 172 172 168 25 25 20 586 516 490 393 1 0 86 516 516
2014 782 782 782 318 27 743 727 157 220 603 82 488 141 782 781 756 568 567 781 773 562 499 499 446 32 32 0 782 615 458 463 1 1 270 144 144
2015 85 85 75 68 6 59 59 4 40 81 1 64 28 85 85 82 74 82 85 85 34 2 2 0 0 0 0 84 78 67 39 0 0 11 0 0
In [6]:
data.groupby(by = 'YearObtained').count()['artifactNumber'].mean()
Out[6]:
2086.86

The mean number of artifacts per year is 2086.86. Let's combine a scatterplot with a fit line to visualize this. I used the examples from here and here.

In [7]:
from scipy.stats import linregress

plt.figure(figsize = (15,10))

# number of artifacts acquired in each year
meandata = data.groupby(by = 'YearObtained').count()['artifactNumber']
x = pd.to_numeric(meandata.index.values)
y = meandata

# simple linear regression, for reference
print linregress(x,y)

# quadratic fit; polyfit returns coefficients from the highest degree down
fitline = np.polyfit(x, y, deg=2)

# plot the fitted curve over a fine grid of years
xpoints = np.linspace(min(x),max(x),100);
plt.plot(xpoints, fitline[2] + fitline[1]*xpoints + fitline[0]*xpoints**2, color = 'red');

# scatterplot of the yearly counts
plt.scatter(range(min(x),max(x)+1), meandata);
plt.tight_layout()
plt.xticks(size = 20);
plt.yticks(size = 20);
LinregressResult(slope=-17.623769507803122, intercept=37166.973205282113, rvalue=-0.20498455928950232, pvalue=0.15328806044170382, stderr=12.1460638488731)

Some wrong statistics

If you look above, the linregress function from the scipy.stats package only takes a single predictor (x). To add in the polynomial term (x²), we're going to use OLS from the statsmodels.api package (imported as sm).

A glance at these results suggests that a quadratic model fits the data better than a linear model (adjusted R² of .25 vs. .02).
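If you want to check that comparison by hand (a quick sketch of my own using the same np.polyfit machinery as the plot above, rather than statsmodels), you can compute R² as 1 - SSres/SStot for a degree-1 and a degree-2 fit. These are the plain R² values (about .04 and .28); the summaries below also report the adjusted versions.

In [ ]:
# R-squared by hand for a degree-1 (linear) and degree-2 (quadratic) polynomial fit
yr = pd.to_numeric(meandata.index.values).astype(float)
counts = meandata.values.astype(float)

ss_tot = np.sum((counts - counts.mean())**2)

for degree in (1, 2):
    coefs = np.polyfit(yr, counts, deg=degree)
    resid = counts - np.polyval(coefs, yr)
    r_squared = 1 - np.sum(resid**2) / ss_tot
    print("degree %d: R-squared = %.3f" % (degree, r_squared))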

It should be said, however, that the inference tests (t-tests) from these results violate the assumption of independent errors. If you look at the "Warnings" footnote, under [1], the standard errors are only correct if the covariance matrix of the errors has been properly specified - and here it hasn't. One assumption that OLS regression makes is that the errors, or residuals, of the model are independent - basically, that the points in the model are not related in some way other than through the variables you're using to predict the outcome. In our case, that is untrue: these are time series data, and years that are temporally close to one another will be related to each other.
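One quick way to look at this (a rough sketch of my own, not a full diagnostic) is to check how strongly each year's residual is correlated with the previous year's - for example via the Durbin-Watson statistic that statsmodels reports in the summaries below, or a simple lag-1 correlation of the residuals from the quadratic fit:

In [ ]:
from statsmodels.stats.stattools import durbin_watson

# residuals from a quadratic fit of counts on year (the same curve as the plot above)
yr = pd.to_numeric(meandata.index.values).astype(float)
counts = meandata.values.astype(float)
coefs = np.polyfit(yr, counts, deg=2)
resid = counts - np.polyval(coefs, yr)

# Durbin-Watson: values near 2 suggest little lag-1 autocorrelation,
# values well below 2 suggest positive autocorrelation
print(durbin_watson(resid))

# lag-1 correlation of the residuals
print(np.corrcoef(resid[:-1], resid[1:])[0, 1])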

In the future, I'm going to learn more about the types of models that are appropriate for these data (e.g., an autoregressive model).
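As a first taste of that idea, here is a very crude sketch (plain OLS, not a proper time-series analysis - the real tools live in statsmodels.tsa): an AR(1)-style regression of each year's count on the previous year's count.

In [ ]:
import statsmodels.api as sm

# crude AR(1): regress this year's count on last year's count
counts = meandata.values.astype(float)
y_t = counts[1:]     # counts for 1967-2015
y_lag = counts[:-1]  # counts for 1966-2014

ar1 = sm.OLS(y_t, sm.add_constant(y_lag)).fit()
print(ar1.params)    # [intercept, AR(1) coefficient]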

In [9]:
import statsmodels.api as sm

# years and counts as 50x1 column vectors
x = pd.to_numeric(meandata.index.values).reshape(50,1)
y = np.array(meandata).reshape(50,1)

# design matrices: intercept + year, then intercept + year + year^2
xpred = np.column_stack((np.ones(50),np.array(x)))
xpredsquared = np.column_stack((xpred,np.array(x**2)))

# fit and summarize the linear and quadratic models
print sm.OLS(y,xpred,hasconst=True).fit().summary()
print sm.OLS(y,xpredsquared, hasconst=True).fit().summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.042
Model:                            OLS   Adj. R-squared:                  0.022
Method:                 Least Squares   F-statistic:                     2.105
Date:                Sat, 07 May 2016   Prob (F-statistic):              0.153
Time:                        00:51:12   Log-Likelihood:                -426.05
No. Observations:                  50   AIC:                             856.1
Df Residuals:                      48   BIC:                             859.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       3.717e+04   2.42e+04      1.537      0.131     -1.14e+04  8.58e+04
x1           -17.6238     12.146     -1.451      0.153       -42.045     6.798
==============================================================================
Omnibus:                       17.851   Durbin-Watson:                   1.535
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               21.727
Skew:                           1.365   Prob(JB):                     1.91e-05
Kurtosis:                       4.724   Cond. No.                     2.75e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.75e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.282
Model:                            OLS   Adj. R-squared:                  0.252
Method:                 Least Squares   F-statistic:                     9.248
Date:                Sat, 07 May 2016   Prob (F-statistic):           0.000411
Time:                        00:51:12   Log-Likelihood:                -418.82
No. Observations:                  50   AIC:                             843.6
Df Residuals:                      47   BIC:                             849.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const      -1.291e+07   3.26e+06     -3.956      0.000     -1.95e+07 -6.34e+06
x1          1.299e+04   3278.612      3.962      0.000      6395.437  1.96e+04
x2            -3.2677      0.824     -3.968      0.000        -4.925    -1.611
==============================================================================
Omnibus:                       16.589   Durbin-Watson:                   2.045
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               19.539
Skew:                           1.287   Prob(JB):                     5.72e-05
Kurtosis:                       4.659   Cond. No.                     8.43e+10
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.43e+10. This might indicate that there are
strong multicollinearity or other numerical problems.