Misc #3: Soup

In [1]:
%matplotlib inline

My friend and I once argued about whether one "eats" or "drinks" soup. I believed that you "drink" soup, whereas she thought that you "eat" it.

From our conversation, I realized the verbs people associate with various foods have to do with the state of matter they mentally conjure up when thinking about those foods. For example, I posit that if you came up to me while I was holding a Slurpee (for the sake of balance, other semi-frozen beverages are available), I would be eating it if it just came out of the machine, but drinking it if I'd been standing out in the sweltering New York summer heat for a while.

In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import math

Thus, this simple dataset about soup was born. Using Amazon's Mechanical Turk, I asked 107 people: (a) Whether they thought soup was more similar to stew (closer to solid) or broth (closer to liquid), and (b) whether they were more likely to eat or drink soup.

Participants answered both questions on a 1-to-5 scale, with 5 being "closer to stew" on the first question and "most likely to drink" on the second (effectively reverse-coding the two scales relative to each other, which errs conservatively with respect to my own hypothesis should participants cluster their responses at the ends of the scales). There was also a free-response question that was quite entertaining to read.

In [3]:
soupdata = pd.read_csv('SOUPDATA.csv')
In [4]:
soupdata.head()
Out[4]:
SIMILAR EATDRINK COMMENT
0 5 3 this little survey made me laugh
1 4 3 Sopu is awesome.
2 4 1 NaN
3 2 2 NaN
4 2 3 Soup can only be eaten if meat and other harde...

Scatterplot of responses

This first scatterplot is not a very good one, because both scales consisted of 5 discrete points, so the points on the graph overlap (we'll fix that in a second).

In [5]:
plt.figure(figsize=(12,10))
plt.scatter(soupdata['SIMILAR'],soupdata['EATDRINK']);
plt.xlabel("5 = most similar to stew", size = 24);
plt.ylabel("5 = more likely to drink", size = 24);

We can artificially jitter the points to make their spread more visible. First, let's write a simple function that takes an array and jitters it:

In [6]:
def jitter(array):
    #scale each value by a random factor between 1 and 1.1
    multiplier = 1 + 0.1 * np.random.random(np.shape(array))
    return multiplier * array
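Note that this jitter is multiplicative and one-sided: it only nudges values upward, and nudges 5s more than 1s. A centered additive jitter is a common alternative; here's a minimal sketch (the ±0.15 spread is a hypothetical choice, not from the original analysis):

def jitter_centered(array, spread=0.15):
    #add symmetric uniform noise so each point shifts by at most +/- spread
    return array + np.random.uniform(-spread, spread, np.shape(array))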

This allows us to see how responses were clustered. Already, this pattern of data appears to reflect a negative correlation (which is what I expected), but we'll test that later.

In [7]:
plt.figure(figsize=(12,10))
plt.scatter(jitter(soupdata['SIMILAR']), jitter(soupdata['EATDRINK']));
plt.xlabel("5 = most similar to stew", size = 24);
plt.ylabel("5 = more likely to drink", size = 24);

Broth vs. stew

We can then look at the distribution of responses on each scale for further insight into how our sample answered.

In [8]:
np.mean(soupdata['SIMILAR'])
Out[8]:
2.9056603773584904

This mean tells us that, on average, the sample was evenly split on whether soup is more broth-like or stew-like.

However, the distribution was such that there weren't two discrete groups of "broth people" vs. "stew people". Rather, most people think it's somewhere in between, or lean slightly to one side. In some ways, this serves as a "sanity check" and also tells us that people know that different types of soup exist.

In [9]:
plt.figure(figsize=(8,6));
plt.hist(soupdata['SIMILAR'], bins = range(1,7));
plt.xlabel('5 = most similar to stew', size = 24);
plt.ylabel('Frequency', size = 24);
plt.xticks(range(1,6));

Eating vs. drinking soup

On the other hand, the participants' average "eating/drinking" score fell below the scale midpoint, indicating that they'd be more likely to "eat" soup than "drink" it. This is not surprising considering I restricted the sample to respondents living in the U.S., where my friend and I were having our argument.

In [10]:
np.mean(soupdata['EATDRINK'])
Out[10]:
2.4622641509433962

The distribution of responses is also roughly normal, with a slight positive skew.

In [11]:
plt.figure(figsize=(8,6));
plt.hist(soupdata['EATDRINK'], bins = range(1,7));
plt.xlabel('5 = more likely to drink', size = 24);
plt.ylabel('Frequency', size = 24);
plt.xticks(range(1,6));

To tell us whether the 2.46 average we obtained (vs. the scale midpoint of 3) might have been due to sampling error, we can calculate a one-sample t-test, comparing 2.46 to the midpoint of the scale. This involves finding the standard error of the sample, and although there are numerous packages that can do this for you, it's easy enough to do by hand:

In [12]:
#Standard deviation
#(note: np.std defaults to the population SD, ddof=0; with n = 106 the
#difference from the sample SD, ddof=1, is negligible here)
soupsd = np.std(soupdata['EATDRINK'])
print('Standard deviation = {0}'.format(soupsd))

#Standard error = standard deviation/square root(sample size)
soupse = soupsd/math.sqrt(len(soupdata['EATDRINK']))
print('Standard error = {0}'.format(soupse))

#t-value = (mean1 - mean2)/standard error, with n - 1 degrees of freedom
tvalue = (3 - np.mean(soupdata['EATDRINK']))/soupse
print('t({0}) = {1}'.format(len(soupdata['EATDRINK']) - 1, tvalue))
Standard deviation = 1.08309939875
Standard error = 0.105199913354
t(105) = 5.11156171059
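As a cross-check, scipy implements the same one-sample test (it uses the sample SD, ddof=1, so its t-value will differ very slightly from the hand computation above); a minimal sketch:

from scipy import stats

#one-sample t-test of EATDRINK against the scale midpoint of 3
t, p = stats.ttest_1samp(soupdata['EATDRINK'], 3.0)
print('t = {0}, p = {1}'.format(t, p))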

This gives us a t-value of 5.11 with 105 degrees of freedom, which, if you look up in a t-distribution table, returns a p of < .01 (for reference, a t-value of ~2 with df of ~30 gives you p < .05). We can also calculate confidence intervals:

In [13]:
upperbound = 2.462 + (soupse * 1.96)
lowerbound = 2.462 - (soupse * 1.96)
print('95% confidence intervals [{0}]'.format((lowerbound, upperbound)))
95% confidence intervals [(2.255808169826536, 2.6681918301734644)]
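For a sample this size the normal-approximation multiplier of 1.96 is close enough, but the exact t critical value with 105 degrees of freedom can be substituted; a sketch:

from scipy import stats

crit = stats.t.ppf(0.975, df=105)  #two-sided 95% critical value, ~1.98
upperbound = np.mean(soupdata['EATDRINK']) + (soupse * crit)
lowerbound = np.mean(soupdata['EATDRINK']) - (soupse * crit)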

This tells us that if we repeatedly sampled our population, about 95% of intervals constructed this way would contain the true population mean; here the interval spans 2.26 to 2.67, still below the scale midpoint of 3. Taken together, it's quite likely (and also quite theoretically reasonable) that these participants' tendency to favor "eating" soup was not due to sampling error.

Predicting eating vs. drinking from broth-y vs. stew-y soups

So, what about my hypothesis? To test whether people's mental representation of soup predicts the verb they use to describe consuming it, we can run a simple OLS regression. I'm not going to manually write out the algorithm for it (although I have before, see here), and will instead borrow the much prettier statsmodels one.

(CAVEAT: These are correlational data; the usual caveats with regard to establishing causality apply)

In [14]:
import statsmodels.api as sm
In [15]:
subdata = sm.add_constant(soupdata)
#center the predictor so the intercept is the predicted response at the mean of SIMILAR
subdata['SIMILARC'] = subdata['SIMILAR'] - np.mean(subdata['SIMILAR'])
model = sm.OLS(subdata['EATDRINK'], subdata.loc[:,['const','SIMILARC']], hasconst=True)
results = model.fit()
In [16]:
results.summary()
Out[16]:
OLS Regression Results
Dep. Variable: EATDRINK R-squared: 0.062
Model: OLS Adj. R-squared: 0.053
Method: Least Squares F-statistic: 6.832
Date: Fri, 29 Jul 2016 Prob (F-statistic): 0.0103
Time: 17:58:57 Log-Likelihood: -155.50
No. Observations: 106 AIC: 315.0
Df Residuals: 104 BIC: 320.3
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
const 2.4623 0.103 23.933 0.000 2.258 2.666
SIMILARC -0.2701 0.103 -2.614 0.010 -0.475 -0.065
Omnibus: 3.808 Durbin-Watson: 2.014
Prob(Omnibus): 0.149 Jarque-Bera (JB): 3.819
Skew: 0.450 Prob(JB): 0.148
Kurtosis: 2.766 Cond. No. 1.00

In this analysis, the first coefficient (const), 2.46, tells us that a hypothetical person who thought soup could be just as much broth-like as stew-like is still more likely to eat it than drink it (2.46 being significantly lower than the midpoint of 3).

The second coefficient, SIMILARC, is negative and statistically significant: believing soup is stew-like is associated with saying that one would eat (vs. drink) it. Specifically, for every additional point a hypothetical respondent rates soup as similar to stew (vs. broth), he/she is predicted to respond .27 points lower on the eat vs. drink scale.

Coming back to our original scatter plot, we can draw a fit-line over it:

In [17]:
#Defining a function for our fit line:
def fitline(Xs):
    #3.25 = 2.46 - (-.2701 * 2.91): the intercept implied by the uncentered version of this analysis
    return [x*(-.2701) + 3.25 for x in Xs]
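Rather than hardcoding the coefficients, we could also rebuild the line from the fitted model itself. Expanding the centered model gives an uncentered intercept of const − slope × mean(SIMILAR); a sketch using results.params:

b = results.params['SIMILARC']
a = results.params['const'] - b * np.mean(soupdata['SIMILAR'])  #uncentered intercept

def fitline_from_model(Xs):
    return [a + b * x for x in Xs]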
In [18]:
plt.figure(figsize=(12,10))
plt.scatter(jitter(soupdata['SIMILAR']), jitter(soupdata['EATDRINK']));
plt.xlabel("5 = most similar to stew", size = 24);
plt.ylabel("5 = more likely to drink", size = 24);
plt.plot(range(1,6), fitline(range(1,6)));

Conclusion

In summary, I learned that:

  1. Most people obviously know that there are different types of soup, but there is variability in what someone's "default" category is.
  2. Not everyone "eats" soup, but more people in the U.S. "eat" it than "drink" it.
  3. Finally, people who associate soup with stew are more likely to eat it than drink it.

I'm not sure who won the argument in the end, but this analysis does feel a little like winning the battle but losing the war.

Misc #2: Restructuring data

In [1]:
import pandas as pd

This post details some of the "dirty work" (specifically, restructuring data), that goes on behind data analysis.

In this case, I'm trying to restructure a person-period (stacked) dataset into a person-level dataset, with one row per person.

These two formats can be converted from one to the other, but we often prefer one over the other for readability or analytic purposes (detailed later).

Person-period vs. person-level data

Person-period datasets are usually found in longitudinal research, where a single person contributes multiple rows of data, reflecting the same measure at multiple time points. For example, a person's blood pressure measured every day over a year might be stored in a table where every row is a new blood pressure reading. Each person contributes 365 rows of data, identified by an ID number.

On the other hand, this blood pressure data can also be stored in a person-level dataset, where each row represents a single person, and multiple variables represent each time point (for example, "BP1", "BP2", "BP3" for days 1, 2 and 3 respectively).

As you might have guessed, in this example, a person-period dataset would be preferred. Firstly, it would take less space to store: the type of each column is declared once, and each new reading simply adds a row. In contrast, in a person-level dataset each new reading adds a new column, so this schema information is replicated every time a reading is taken.

Person-period datasets are also more convenient when missing data is to be expected. Substantial amounts of missing data in a person-level dataset would mean lots of empty cells, whereas missing rows do not have to be stored, unless there are other unique, period-specific data (e.g., if we were missing blood pressure data, but also had to store heart rate data for the day).

On the other hand, person-level datasets are often preferred when there are multiple variables that do not change with each time period. They also tend to be more human-readable. For example, if we were interested in someone's blood pressure at 3 time points, plus 100 other variables of stable demographic information, a person-period dataset would result in a lot of redundancy, as these 100 variables would be replicated in each new row.
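A toy example makes the two shapes concrete; a minimal sketch with hypothetical blood-pressure readings:

import pandas as pd

#person-period (stacked): one row per person per day
pp = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'DAY': [1, 2, 1, 2],
                   'BP': [120, 118, 131, 129]})

#person-level: one row per person, one column per day
pl = pp.pivot(index='ID', columns='DAY', values='BP')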

Current data

The anonymized data I'm working with here come from an experiment I ran a couple of years ago. In this study, participants completed a simple counting task where they had to count backwards, and I measured how long they took for each response. The specific details of the study are irrelevant to this post.

In [2]:
stackeddata = pd.read_csv('stacked dataset.csv')

This is what the person-period version of the data look like:

In [3]:
stackeddata.head()
Out[3]:
ID TIME TIMETRIAL MISTAKE NUMBER MISTAKESTOTAL SCORET RESPONSENUM SPLINE PASSED ... IPOVERTC SPEED100 PASSEDNUM HIvsLEffect DIFFBIN RESPONSENUMC DIFFICULTYC PHIGHSCORE PASSEDBY PASSEDBIN
0 25 16.262476 16.262476 1 2018 1 1981 0 0 0 ... -1.091039 6.149125 55 -1 1 -5 4.56 14.0 41.0 0
1 25 26.887734 10.625258 0 2017 1 1981 1 0 0 ... -1.091039 9.411536 54 -1 1 -4 4.56 14.0 41.0 0
2 25 33.911948 7.024214 0 2014 1 1981 2 0 0 ... -1.091039 14.236468 53 -1 1 -3 4.56 14.0 41.0 0
3 25 39.654125 5.742177 0 2011 1 1981 3 0 0 ... -1.091039 17.414998 52 -1 1 -2 4.56 14.0 41.0 0
4 25 48.295860 8.641734 0 2008 1 1981 4 0 0 ... -1.091039 11.571752 51 -1 1 -1 4.56 14.0 41.0 0

5 rows × 31 columns

Here, you can see ID, the unique identifier for each participant, TIME and TIMETRIAL, the time they took to type in each response both cumulatively and trial-by-trial (so these are all different for each row), as well as some time-invariant data, such as the total number of mistakes they made (MISTAKESTOTAL).

Transposing from person-period to person-level data involves summarizing the data in some way. In this case, I'd like to analyze how many responses participants made in total, so that's what I'm going to collapse the data on.

In [4]:
longdata = pd.DataFrame(index=stackeddata[stackeddata.columns.values[0]].unique(),
                        columns=('TOTRESPONSES', 'BEFPASS', 'AFTPASS', 'IPOVERT',
                                 'IP', 'T', 'DIFFICULTY', 'HIvsLEffect'))

This creates an empty dataframe with the unique participant IDs as indices, as well as the columns I want to carry over.

In [5]:
longdata.head()
Out[5]:
TOTRESPONSES BEFPASS AFTPASS IPOVERT IP T DIFFICULTY HIvsLEffect
25 NaN NaN NaN NaN NaN NaN NaN NaN
26 NaN NaN NaN NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN NaN NaN NaN
29 NaN NaN NaN NaN NaN NaN NaN NaN
31 NaN NaN NaN NaN NaN NaN NaN NaN

Let's deal with the time-invariant columns first. They're easy, because every row holds the same value for a given participant, so we can just take the first one:

In [6]:
#stackeddata.columns.values[0] is the ID column
for column in ('IPOVERT', 'IP', 'T', 'DIFFICULTY', 'HIvsLEffect'):
    for participant in longdata.index.values:
        longdata.loc[participant,column] =  stackeddata.loc[stackeddata[stackeddata.columns.values[0]]==participant][column].values[0]
In [7]:
longdata.head()
Out[7]:
TOTRESPONSES BEFPASS AFTPASS IPOVERT IP T DIFFICULTY HIvsLEffect
25 NaN NaN NaN -2 2.33333 4.33333 10 -1
26 NaN NaN NaN -1.33333 2.16667 3.5 8 -1
28 NaN NaN NaN 0.666667 5 4.33333 5 -1
29 NaN NaN NaN -2.33333 3.83333 6.16667 8 -1
31 NaN NaN NaN -4.33333 1.66667 6 6 -1

There are three columns that we have to populate. The first, TOTRESPONSES, is simply the number of correct (mistake-free) responses that each participant made.

In [8]:
for participant in longdata.index.values:
        longdata.loc[participant, 'TOTRESPONSES'] = len(stackeddata.loc[(stackeddata[stackeddata.columns.values[0]]==participant) & \
                                                           (stackeddata['MISTAKE'] == 0)])
In [9]:
longdata.head()
Out[9]:
TOTRESPONSES BEFPASS AFTPASS IPOVERT IP T DIFFICULTY HIvsLEffect
25 13 NaN NaN -2 2.33333 4.33333 10 -1
26 34 NaN NaN -1.33333 2.16667 3.5 8 -1
28 20 NaN NaN 0.666667 5 4.33333 5 -1
29 10 NaN NaN -2.33333 3.83333 6.16667 8 -1
31 26 NaN NaN -4.33333 1.66667 6 6 -1

BEFPASS and AFTPASS count the correct responses a participant made before and after passing a displayed score on the screen (denoted by the column PASSED; 0 for before, 1 for after).

In [10]:
for participant in longdata.index.values:
        longdata.loc[participant, 'BEFPASS'] = len(stackeddata.loc[(stackeddata[stackeddata.columns.values[0]]==participant) & \
                                                           (stackeddata['MISTAKE'] == 0) & (stackeddata['PASSED'] == 0)])
        longdata.loc[participant, 'AFTPASS'] = len(stackeddata.loc[(stackeddata[stackeddata.columns.values[0]]==participant) & \
                                                           (stackeddata['MISTAKE'] == 0) & (stackeddata['PASSED'] == 1)])
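For reference, the same tallies can be computed without explicit loops using groupby; a sketch, assuming the ID column is named 'ID' as shown in the head() output above:

correct = stackeddata[stackeddata['MISTAKE'] == 0]

counts = pd.DataFrame({
    'TOTRESPONSES': correct.groupby('ID').size(),
    'BEFPASS': correct[correct['PASSED'] == 0].groupby('ID').size(),
    'AFTPASS': correct[correct['PASSED'] == 1].groupby('ID').size(),
}).fillna(0).astype(int)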

Finally, let's output these data to a csv:

In [11]:
longdata.to_csv('long data.csv')

Misc #1: Experimenting with plots

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt

I'm in the process of constructing figures for my dissertation, and I thought I'd share the different options I'm considering. I only need to plot simple line graphs, so my main considerations are aesthetics and ease of use (as I'll likely be making changes to them). This is the list I'm going to test:

  1. matplotlib
  2. MATLAB
  3. R + ggplot
  4. Excel
  5. gnuplot
  6. plot.ly
  7. Chart.js
  8. Tableau
  9. Bokeh
  10. MS paint (just kidding)

Without further ado, let's go!

matplotlib

In [2]:
intercept = -.2122
b1 = -1.1776 
b2 = -.1192
b3 = .4943

#x1 is a categorical predictor, x2 is a continuous predictor; b3 is the coefficient on their interaction
def y(x1, x2):
    return intercept + b1*x1 + b2*x2 + b3*x1*x2
In [3]:
x = range(1,8)
plt.plot(x, [y(0,z) for z in x], "blue", x, [y(1,z) for z in x], "red")

#pl.fill_between(x, y-error, y+error)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')
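The commented-out fill_between line above hints at adding an error band; a minimal sketch reusing x and y from the cell above, with a purely hypothetical constant error of 0.2 for illustration:

import numpy as np

err = 0.2  #hypothetical constant standard error, for illustration only
y1 = np.array([y(0, z) for z in x])
plt.fill_between(x, y1 - err, y1 + err, color="blue", alpha=0.2)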

MATLAB

intercept = -.2122
b1 = -1.1776 
b2 = -.1192
b3 = .4943
x = 1:7
% moderator = 0
y1 = intercept + b2*x
% moderator = 1
y2 = intercept + b1 + b2*x + b3*x

figure
plot(x,y1,x,y2)

R

intercept = -.2122
b1 = -1.1776 
b2 = -.1192
b3 = .4943

y <- function(x1, x2){
  return(intercept + b1*x1 + b2*x2 + b3*x1*x2)
}

x = c(1:7)
y1 = sapply(x,y,x1=0)
y2 = sapply(x,y,x1=1)
xrange <- range(1:7) 
yrange <- range(-1.5:1.5) 
plot(xrange,yrange, type='n')
lines(x,y1)
lines(x,y2)

ggplot2

d1 <- data.frame(x,y=y1,g='0')
d2 <- data.frame(x,y=y2,g='1')
data <- rbind(d1, d2)

cbbPalette <- c("blue", "red")
ggplot(data=data, aes(x=x, y=y, group=g, color = g)) +
  geom_line() +
  scale_color_manual(values=cbbPalette) +
  scale_x_continuous(limits = c(1,7)) + 
  scale_y_continuous(limits = c(-1.5,1.5))

Excel

gnuplot

set title ""
intercept = -.2122
b1 = -1.1776 
b2 = -.1192
b3 = .4943
y1(x) = intercept + b2*x
y2(x) = intercept + b1 + b2*x + b3*x

plot [1:7] y1(x) lt rgb "blue", y2(x) lt rgb "red"

plot.ly

plot.ly is web-based; I took screencaps of the interface:

The plot looks like this:

chart.js

<html>
    <head>
        <script src="Chart.js"></script>
    </head>
<body>
<canvas id="myChart" width="640" height="480"></canvas>
    <script>
        var ctx = document.getElementById("myChart");

        var data = {
            labels: Array.apply(null, Array(7)).map(function (_, i) {return i + 1;}),
            datasets: [
                {
                    label: "y1",
                    fill: false,
                    lineTension: 0.1,
                    backgroundColor: "blue",
                    borderColor: "blue",
                    borderCapStyle: 'butt',
                    borderJoinStyle: 'miter',
                    pointBorderColor: "blue",
                    pointBackgroundColor: "#fff",
                    pointBorderWidth: 1,
                    pointHoverRadius: 5,
                    pointHoverBackgroundColor: "blue",
                    pointHoverBorderColor: "blue",
                    pointHoverBorderWidth: 2,
                    pointRadius: 1,
                    pointHitRadius: 10,
                    data: [-0.3314,-0.4506,-0.5698,-0.689,-0.8082,-0.9274,-1.0466],
                },
                        {
                    label: "y2",
                    fill: false,
                    lineTension: 0.1,
                    backgroundColor: "red",
                    borderColor: "red",
                    borderCapStyle: 'butt',
                    borderJoinStyle: 'miter',
                    pointBorderColor: "red",
                    pointBackgroundColor: "#fff",
                    pointBorderWidth: 1,
                    pointHoverRadius: 5,
                    pointHoverBackgroundColor: "red",
                    pointHoverBorderColor: "red",
                    pointHoverBorderWidth: 2,
                    pointRadius: 1,
                    pointHitRadius: 10,
                    data: [-1.0147,-0.6396,-0.2645,0.1106,0.4857,0.8608,1.2359],
                }
            ]
        };

        var myChart = new Chart(ctx, {
            type: 'line',
            data: data,
            options: {
                responsive:false,
                scales: {
                    yAxes: [{
                        ticks: {
                            beginAtZero:false
                        }
                    }]
                }
            }
        });
</script>
</body>
</html>
In [5]:
from IPython.display import IFrame
IFrame('https://bryansim.github.io/chartjsplot.html', width=700, height=500)
Out[5]:

Bokeh

In [5]:
from bokeh.plotting import figure, output_file, show

intercept = -.2122
b1 = -1.1776 
b2 = -.1192
b3 = .4943

#x1 is a categorical predictor, x2 is a continuous predictor, x3 is their interaction
def y(x1, x2):
    return intercept + b1*x1 + b2*x2 + b3*x1*x2

x = range(1,8)
y1 = [y(0,z) for z in x]
y2 = [y(1,z) for z in x]

# output to static HTML file
output_file("bokehplot.html")

# create a new plot
p = figure(
   tools="pan,box_zoom,reset,save",
   y_range=[-1.5, 1.5],
)

# add some renderers
p.line(x, y1, legend="mod=0", line_color="blue")
p.line(x, y2, legend="mod=1", line_color="red")

# show the results
show(p)
In [4]:
from IPython.display import IFrame
IFrame('https://bryansim.github.io/bokehplot.html', width=700, height=675)
Out[4]:

Summary

I think, from here on out, I'll be using matplotlib for simple plots, Bokeh if I need to generate something interactive, and chart.js if I ever need to implement a dynamic chart.

With regard to the alternatives:

  1. MATLAB is proprietary.
  2. The built-in packages in R generate ugly plots.
  3. ggplot is a little unwieldy, and has defaults that I don't like (e.g., the grey background).
  4. Excel is good for visualizing something really quickly, but most people who have worked with linear models for a while (including myself) can do this in our heads or with pen and paper.
  5. gnuplot looks really interesting, and may be something I spend more time on in the future, but it's foreign to me at the moment.
  6. plot.ly and Tableau (which I wanted to test, but their sign-up system never sent me the confirmation email) are both web-based, and for me, needlessly obfuscate what happens in the backend. I'd be comfortable writing my own code to generate the plots I want.

BONUS: MS Paint