Misc #3: Soup

In [1]:
%matplotlib inline

My friend and I once argued about whether one "eats" or "drinks" soup. I believed that you "drink" soup, whereas she thought that you "eat" it.

From our conversation, I realized the verbs people associate with various foods have to do with the state of matter they mentally conjure up when thinking about those foods. For example, I posit that if you came up to me while I was holding a Slurpee (for the sake of balance, other semi-frozen beverages are available), I would be eating it if it just came out of the machine, but drinking it if I'd been standing out in the sweltering New York summer heat for a while.

In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import math

Thus, this simple dataset about soup was born. Using Amazon's Mechanical Turk, I asked 107 people: (a) Whether they thought soup was more similar to stew (closer to solid) or broth (closer to liquid), and (b) whether they were more likely to eat or drink soup.

Participants answered both questions on a 1-to-5 scale, with 5 being "closer to stew" on the first question and "most likely to drink" on the second (effectively reverse-coding the two scales relative to each other, which errs conservatively with respect to my own hypothesis should participants cluster their responses at the ends of the scales). There was also a free-response question that was quite entertaining to read.

In [3]:
soupdata = pd.read_csv('SOUPDATA.csv')
In [4]:
soupdata.head()
Out[4]:
SIMILAR EATDRINK COMMENT
0 5 3 this little survey made me laugh
1 4 3 Sopu is awesome.
2 4 1 NaN
3 2 2 NaN
4 2 3 Soup can only be eaten if meat and other harde...

Scatterplot of responses

This first scatterplot is not a very good one, because both scales consisted of 5 discrete points, so the points on the graph overlap (we'll fix that in a second).

In [5]:
plt.figure(figsize=(12,10))
plt.scatter(soupdata['SIMILAR'],soupdata['EATDRINK']);
plt.xlabel("5 = most similar to stew", size = 24);
plt.ylabel("5 = more likely to drink", size = 24);

We can artificially jitter the points to make their spread more visible. First, let's write a simple function that takes an array and jitters it:

In [6]:
def jitter(array):
    #scale each value by a random factor between 1 and 1.1
    multiplier = 1 + 0.1 * np.random.random(np.shape(array))
    return multiplier * array
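Note that this jitter is multiplicative and one-sided: it only nudges values upward, and nudges 5s more than 1s. A centered additive jitter is a common alternative; here's a minimal sketch (the ±0.15 spread is a hypothetical choice, not from the original analysis):

def jitter_centered(array, spread=0.15):
    #add symmetric uniform noise so each point shifts by at most +/- spread
    return array + np.random.uniform(-spread, spread, np.shape(array))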

This allows us to see how responses were clustered. Already, this pattern of data appears to reflect a negative correlation (which is what I expected), but we'll test that later.

In [7]:
plt.figure(figsize=(12,10))
plt.scatter(jitter(soupdata['SIMILAR']), jitter(soupdata['EATDRINK']));
plt.xlabel("5 = most similar to stew", size = 24);
plt.ylabel("5 = more likely to drink", size = 24);

Broth vs. stew

We can then look at the distribution of responses on each scale for further insight into how our sample answered.

In [8]:
np.mean(soupdata['SIMILAR'])
Out[8]:
2.9056603773584904

This mean tells us that, on average, the sample was evenly split on whether soup is more broth-like or stew-like.

However, the distribution was such that there weren't two discrete groups of "broth people" vs. "stew people". Rather, most people think it's somewhere in between, or lean slightly to one side. In some ways, this serves as a "sanity check" and also tells us that people know that different types of soup exist.

In [9]:
plt.figure(figsize=(8,6));
plt.hist(soupdata['SIMILAR'], bins = range(1,7));
plt.xlabel('5 = most similar to stew', size = 24);
plt.ylabel('Frequency', size = 24);
plt.xticks(range(1,6));

Eating vs. drinking soup

On the other hand, the participants' average "eating/drinking" score fell below the scale midpoint, indicating that they'd be more likely to "eat" soup than "drink" it. This is not surprising considering I restricted the sample to respondents living in the U.S., where my friend and I were having our argument.

In [10]:
np.mean(soupdata['EATDRINK'])
Out[10]:
2.4622641509433962

The distribution of responses is also roughly normal, with a slight positive skew.

In [11]:
plt.figure(figsize=(8,6));
plt.hist(soupdata['EATDRINK'], bins = range(1,7));
plt.xlabel('5 = more likely to drink', size = 24);
plt.ylabel('Frequency', size = 24);
plt.xticks(range(1,6));

To tell us whether the 2.46 average we obtained (vs. the scale midpoint of 3) might have been due to sampling error, we can calculate a one-sample t-test, comparing 2.46 to the midpoint of the scale. This involves finding the standard error of the sample, and although there are numerous packages that can do this for you, it's easy enough to do by hand:

In [12]:
#Standard deviation
#(note: np.std defaults to the population SD, ddof=0; with n = 106 the
#difference from the sample SD, ddof=1, is negligible here)
soupsd = np.std(soupdata['EATDRINK'])
print('Standard deviation = {0}'.format(soupsd))

#Standard error = standard deviation/square root(sample size)
soupse = soupsd/math.sqrt(len(soupdata['EATDRINK']))
print('Standard error = {0}'.format(soupse))

#t-value = (mean1 - mean2)/standard error, with n - 1 degrees of freedom
tvalue = (3 - np.mean(soupdata['EATDRINK']))/soupse
print('t({0}) = {1}'.format(len(soupdata['EATDRINK']) - 1, tvalue))
Standard deviation = 1.08309939875
Standard error = 0.105199913354
t(105) = 5.11156171059
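As a cross-check, scipy implements the same one-sample test (it uses the sample SD, ddof=1, so its t-value will differ very slightly from the hand computation above); a minimal sketch:

from scipy import stats

#one-sample t-test of EATDRINK against the scale midpoint of 3
t, p = stats.ttest_1samp(soupdata['EATDRINK'], 3.0)
print('t = {0}, p = {1}'.format(t, p))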

This gives us a t-value of 5.11 with 105 degrees of freedom, which, if you look up in a t-distribution table, returns a p of < .01 (for reference, a t-value of ~2 with df of ~30 gives you p < .05). We can also calculate confidence intervals:

In [13]:
upperbound = 2.462 + (soupse * 1.96)
lowerbound = 2.462 - (soupse * 1.96)
print('95% confidence intervals [{0}]'.format((lowerbound, upperbound)))
95% confidence intervals [(2.255808169826536, 2.6681918301734644)]
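For a sample this size the normal-approximation multiplier of 1.96 is close enough, but the exact t critical value with 105 degrees of freedom can be substituted; a sketch:

from scipy import stats

crit = stats.t.ppf(0.975, df=105)  #two-sided 95% critical value, ~1.98
upperbound = np.mean(soupdata['EATDRINK']) + (soupse * crit)
lowerbound = np.mean(soupdata['EATDRINK']) - (soupse * crit)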

This tells us that if we repeatedly sampled our population, about 95% of intervals constructed this way would contain the true population mean; here the interval spans 2.26 to 2.67, still below the scale midpoint of 3. Taken together, it's quite likely (and also quite theoretically reasonable) that these participants' tendency to favor "eating" soup was not due to sampling error.

Predicting eating vs. drinking from broth-y vs. stew-y soups

So, what about my hypothesis? To test whether people's mental representation of soup predicts the verb they use to describe consuming it, we can run a simple OLS regression. I'm not going to manually write out the algorithm for it (although I have before, see here), and will instead borrow the much prettier statsmodels one.

(CAVEAT: These are correlational data; the usual caveats with regard to establishing causality apply)

In [14]:
import statsmodels.api as sm
In [15]:
subdata = sm.add_constant(soupdata)
#center the predictor so the intercept is the predicted response at the mean of SIMILAR
subdata['SIMILARC'] = subdata['SIMILAR'] - np.mean(subdata['SIMILAR'])
model = sm.OLS(subdata['EATDRINK'], subdata.loc[:,['const','SIMILARC']], hasconst=True)
results = model.fit()
In [16]:
results.summary()
Out[16]:
OLS Regression Results
Dep. Variable: EATDRINK R-squared: 0.062
Model: OLS Adj. R-squared: 0.053
Method: Least Squares F-statistic: 6.832
Date: Fri, 29 Jul 2016 Prob (F-statistic): 0.0103
Time: 17:58:57 Log-Likelihood: -155.50
No. Observations: 106 AIC: 315.0
Df Residuals: 104 BIC: 320.3
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
const 2.4623 0.103 23.933 0.000 2.258 2.666
SIMILARC -0.2701 0.103 -2.614 0.010 -0.475 -0.065
Omnibus: 3.808 Durbin-Watson: 2.014
Prob(Omnibus): 0.149 Jarque-Bera (JB): 3.819
Skew: 0.450 Prob(JB): 0.148
Kurtosis: 2.766 Cond. No. 1.00

In this analysis, the first coefficient (const), 2.46, tells us that a hypothetical person who thought soup could be just as much broth-like as stew-like is still more likely to eat it than drink it (2.46 being significantly lower than the midpoint of 3).

The second coefficient, SIMILARC, is negative and statistically significant: believing soup is stew-like is associated with saying that one would eat (vs. drink) it. Specifically, for every additional point a hypothetical respondent rates soup as similar to stew (vs. broth), he/she is predicted to respond .27 points lower on the eat vs. drink scale.

Coming back to our original scatter plot, we can draw a fit-line over it:

In [17]:
#Defining a function for our fit line:
def fitline(Xs):
    #3.25 = 2.46 - (-.2701 * 2.91): the intercept implied by the uncentered version of this analysis
    return [x*(-.2701) + 3.25 for x in Xs]
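Rather than hardcoding the coefficients, we could also rebuild the line from the fitted model itself. Expanding the centered model gives an uncentered intercept of const − slope × mean(SIMILAR); a sketch using results.params:

b = results.params['SIMILARC']
a = results.params['const'] - b * np.mean(soupdata['SIMILAR'])  #uncentered intercept

def fitline_from_model(Xs):
    return [a + b * x for x in Xs]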
In [18]:
plt.figure(figsize=(12,10))
plt.scatter(jitter(soupdata['SIMILAR']), jitter(soupdata['EATDRINK']));
plt.xlabel("5 = most similar to stew", size = 24);
plt.ylabel("5 = more likely to drink", size = 24);
plt.plot(range(1,6), fitline(range(1,6)));

Conclusion

In summary, I learned that:

  1. Most people obviously know that there are different types of soup, but there is variability in what someone's "default" category is.
  2. Not everyone "eats" soup, but more people in the U.S. "eat" it than "drink" it.
  3. Finally, people who associate soup with stew are more likely to eat it than drink it.

I'm not sure who won the argument in the end, but this analysis does feel a little like winning the battle but losing the war.

Misc #2: Restructuring data

In [1]:
import pandas as pd

This post details some of the "dirty work" (specifically, restructuring data), that goes on behind data analysis.

In this case, I'm trying to restructure a person-period (stacked) dataset into a person-level dataset, with one row per person.

These two formats can be converted from one to the other, but we often prefer one over the other for readability or analytic purposes (detailed later).

Person-period vs. person-level data

Person-period datasets are usually found in longitudinal research, where a single person contributes multiple rows of data, reflecting the same measure at multiple time points. For example, a person's blood pressure measured every day over a year might be stored in a table where every row is a new blood pressure reading. Each person contributes 365 rows of data, identified by an ID number.

On the other hand, this blood pressure data can also be stored in a person-level dataset, where each row represents a single person, and multiple variables represent each time point (for example, "BP1", "BP2", "BP3" for days 1, 2 and 3 respectively).

As you might have guessed, in this example, a person-period dataset would be preferred. Firstly, it would take less space to store: the type of each column is declared once, and each new reading simply adds a row. In contrast, in a person-level dataset each new reading adds a new column, so this schema information is replicated every time a reading is taken.

Person-period datasets are also more convenient when missing data is to be expected. Substantial amounts of missing data in a person-level dataset would mean lots of empty cells, whereas missing rows do not have to be stored, unless there are other unique, period-specific data (e.g., if we were missing blood pressure data, but also had to store heart rate data for the day).

On the other hand, person-level datasets are often preferred when there are multiple variables that do not change with each time period. They also tend to be more human-readable. For example, if we were interested in someone's blood pressure at 3 time points, plus 100 other variables of stable demographic information, a person-period dataset would result in a lot of redundancy, as these 100 variables would be replicated in each new row.
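A toy example makes the two shapes concrete; a minimal sketch with hypothetical blood-pressure readings:

import pandas as pd

#person-period (stacked): one row per person per day
pp = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'DAY': [1, 2, 1, 2],
                   'BP': [120, 118, 131, 129]})

#person-level: one row per person, one column per day
pl = pp.pivot(index='ID', columns='DAY', values='BP')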

Current data

The anonymized data I'm working with here come from an experiment I ran a couple of years ago. In this study, participants completed a simple counting task where they had to count backwards, and I measured how long they took for each response. The specific details of the study are irrelevant to this post.

In [2]:
stackeddata = pd.read_csv('stacked dataset.csv')

This is what the person-period version of the data look like:

In [3]:
stackeddata.head()
Out[3]:
ID TIME TIMETRIAL MISTAKE NUMBER MISTAKESTOTAL SCORET RESPONSENUM SPLINE PASSED ... IPOVERTC SPEED100 PASSEDNUM HIvsLEffect DIFFBIN RESPONSENUMC DIFFICULTYC PHIGHSCORE PASSEDBY PASSEDBIN
0 25 16.262476 16.262476 1 2018 1 1981 0 0 0 ... -1.091039 6.149125 55 -1 1 -5 4.56 14.0 41.0 0
1 25 26.887734 10.625258 0 2017 1 1981 1 0 0 ... -1.091039 9.411536 54 -1 1 -4 4.56 14.0 41.0 0
2 25 33.911948 7.024214 0 2014 1 1981 2 0 0 ... -1.091039 14.236468 53 -1 1 -3 4.56 14.0 41.0 0
3 25 39.654125 5.742177 0 2011 1 1981 3 0 0 ... -1.091039 17.414998 52 -1 1 -2 4.56 14.0 41.0 0
4 25 48.295860 8.641734 0 2008 1 1981 4 0 0 ... -1.091039 11.571752 51 -1 1 -1 4.56 14.0 41.0 0

5 rows × 31 columns

Here, you can see ID, the unique identifier for each participant, TIME and TIMETRIAL, the time they took to type in each response both cumulatively and trial-by-trial (so these are all different for each row), as well as some time-invariant data, such as the total number of mistakes they made (MISTAKESTOTAL).

Transposing from person-period to person-level data involves summarizing the data in some way. In this case, I'd like to analyze how many responses participants made in total, so that's what I'm going to collapse the data on.

In [4]:
longdata = pd.DataFrame(index=stackeddata[stackeddata.columns.values[0]].unique(),
                        columns=('TOTRESPONSES', 'BEFPASS', 'AFTPASS', 'IPOVERT',
                                 'IP', 'T', 'DIFFICULTY', 'HIvsLEffect'))

This creates an empty dataframe with the unique participant IDs as indices, as well as the columns I want to carry over.

In [5]:
longdata.head()
Out[5]:
TOTRESPONSES BEFPASS AFTPASS IPOVERT IP T DIFFICULTY HIvsLEffect
25 NaN NaN NaN NaN NaN NaN NaN NaN
26 NaN NaN NaN NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN NaN NaN NaN
29 NaN NaN NaN NaN NaN NaN NaN NaN
31 NaN NaN NaN NaN NaN NaN NaN NaN

Let's deal with the time-invariant columns first. They're easy, because every row holds the same value for a given participant, so we can just take the first one:

In [6]:
#stackeddata.columns.values[0] is the ID column
for column in ('IPOVERT', 'IP', 'T', 'DIFFICULTY', 'HIvsLEffect'):
    for participant in longdata.index.values:
        longdata.loc[participant,column] =  stackeddata.loc[stackeddata[stackeddata.columns.values[0]]==participant][column].values[0]
In [7]:
longdata.head()
Out[7]:
TOTRESPONSES BEFPASS AFTPASS IPOVERT IP T DIFFICULTY HIvsLEffect
25 NaN NaN NaN -2 2.33333 4.33333 10 -1
26 NaN NaN NaN -1.33333 2.16667 3.5 8 -1
28 NaN NaN NaN 0.666667 5 4.33333 5 -1
29 NaN NaN NaN -2.33333 3.83333 6.16667 8 -1
31 NaN NaN NaN -4.33333 1.66667 6 6 -1

There are three columns that we have to populate. The first, TOTRESPONSES, is simply the number of correct (mistake-free) responses that each participant made.

In [8]:
for participant in longdata.index.values:
        longdata.loc[participant, 'TOTRESPONSES'] = len(stackeddata.loc[(stackeddata[stackeddata.columns.values[0]]==participant) & \
                                                           (stackeddata['MISTAKE'] == 0)])
In [9]:
longdata.head()
Out[9]:
TOTRESPONSES BEFPASS AFTPASS IPOVERT IP T DIFFICULTY HIvsLEffect
25 13 NaN NaN -2 2.33333 4.33333 10 -1
26 34 NaN NaN -1.33333 2.16667 3.5 8 -1
28 20 NaN NaN 0.666667 5 4.33333 5 -1
29 10 NaN NaN -2.33333 3.83333 6.16667 8 -1
31 26 NaN NaN -4.33333 1.66667 6 6 -1

BEFPASS and AFTPASS count the correct responses a participant made before and after passing a displayed score on the screen (denoted by the column PASSED; 0 for before, 1 for after).

In [10]:
for participant in longdata.index.values:
        longdata.loc[participant, 'BEFPASS'] = len(stackeddata.loc[(stackeddata[stackeddata.columns.values[0]]==participant) & \
                                                           (stackeddata['MISTAKE'] == 0) & (stackeddata['PASSED'] == 0)])
        longdata.loc[participant, 'AFTPASS'] = len(stackeddata.loc[(stackeddata[stackeddata.columns.values[0]]==participant) & \
                                                           (stackeddata['MISTAKE'] == 0) & (stackeddata['PASSED'] == 1)])
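For reference, the same tallies can be computed without explicit loops using groupby; a sketch, assuming the ID column is named 'ID' as shown in the head() output above:

correct = stackeddata[stackeddata['MISTAKE'] == 0]

counts = pd.DataFrame({
    'TOTRESPONSES': correct.groupby('ID').size(),
    'BEFPASS': correct[correct['PASSED'] == 0].groupby('ID').size(),
    'AFTPASS': correct[correct['PASSED'] == 1].groupby('ID').size(),
}).fillna(0).astype(int)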

Finally, let's output these data to a csv:

In [11]:
longdata.to_csv('long data.csv')

Misc #1: Experimenting with plots

In [1]:
%matplotlib inline

from matplotlib import pyplot as plt

I'm in the process of constructing figures for my dissertation, and I thought I'd share the different options I'm considering. I only need to plot simple line graphs, so my main considerations are aesthetics and ease of use (as I'll likely be making changes to them). This is the list I'm going to test:

  1. matplotlib
  2. MATLAB
  3. R + ggplot
  4. Excel
  5. gnuplot
  6. plot.ly
  7. Chart.js
  8. Tableau
  9. Bokeh
  10. MS paint (just kidding)

Without further ado, let's go!

matplotlib

In [2]:
intercept = -.2122
b1 = -1.1776 
b2 = -.1192
b3 = .4943

#x1 is a categorical predictor, x2 is a continuous predictor; b3 is the coefficient on their interaction
def y(x1, x2):
    return intercept + b1*x1 + b2*x2 + b3*x1*x2
In [3]:
x = range(1,8)
plt.plot(x, [y(0,z) for z in x], "blue", x, [y(1,z) for z in x], "red")

#pl.fill_between(x, y-error, y+error)

plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.gca().yaxis.set_ticks_position('left')
plt.gca().xaxis.set_ticks_position('bottom')
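The commented-out fill_between line above hints at adding an error band; a minimal sketch reusing x and y from the cell above, with a purely hypothetical constant error of 0.2 for illustration:

import numpy as np

err = 0.2  #hypothetical constant standard error, for illustration only
y1 = np.array([y(0, z) for z in x])
plt.fill_between(x, y1 - err, y1 + err, color="blue", alpha=0.2)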

MATLAB

intercept = -.2122
b1 = -1.1776 
b2 = -.1192
b3 = .4943
x = 1:7
% moderator = 0
y1 = intercept + b2*x
% moderator = 1
y2 = intercept + b1 + b2*x + b3*x

figure
plot(x,y1,x,y2)

R

intercept = -.2122
b1 = -1.1776 
b2 = -.1192
b3 = .4943

y <- function(x1, x2){
  return(intercept + b1*x1 + b2*x2 + b3*x1*x2)
}

x = c(1:7)
y1 = sapply(x,y,x1=0)
y2 = sapply(x,y,x1=1)
xrange <- range(1:7) 
yrange <- range(-1.5:1.5) 
plot(xrange,yrange, type='n')
lines(x,y1)
lines(x,y2)

ggplot2

d1 <- data.frame(x,y=y1,g='0')
d2 <- data.frame(x,y=y2,g='1')
data <- rbind(d1, d2)

cbbPalette <- c("blue", "red")
ggplot(data=data, aes(x=x, y=y, group=g, color = g)) +
  geom_line() +
  scale_color_manual(values=cbbPalette) +
  scale_x_continuous(limits = c(1,7)) + 
  scale_y_continuous(limits = c(-1.5,1.5))

Excel

gnuplot

set title ""
intercept = -.2122
b1 = -1.1776 
b2 = -.1192
b3 = .4943
y1(x) = intercept + b2*x
y2(x) = intercept + b1 + b2*x + b3*x

plot [1:7] y1(x) lt rgb "blue", y2(x) lt rgb "red"

plot.ly

plot.ly is web-based; I took screencaps of the interface:

The plot looks like this:

chart.js

<html>
    <head>
        <script src="Chart.js"></script>
    </head>
<body>
<canvas id="myChart" width="640" height="480"></canvas>
    <script>
        var ctx = document.getElementById("myChart");

        var data = {
            labels: Array.apply(null, Array(7)).map(function (_, i) {return i + 1;}),
            datasets: [
                {
                    label: "y1",
                    fill: false,
                    lineTension: 0.1,
                    backgroundColor: "blue",
                    borderColor: "blue",
                    borderCapStyle: 'butt',
                    borderJoinStyle: 'miter',
                    pointBorderColor: "blue",
                    pointBackgroundColor: "#fff",
                    pointBorderWidth: 1,
                    pointHoverRadius: 5,
                    pointHoverBackgroundColor: "blue",
                    pointHoverBorderColor: "blue",
                    pointHoverBorderWidth: 2,
                    pointRadius: 1,
                    pointHitRadius: 10,
                    data: [-0.3314,-0.4506,-0.5698,-0.689,-0.8082,-0.9274,-1.0466],
                },
                        {
                    label: "y2",
                    fill: false,
                    lineTension: 0.1,
                    backgroundColor: "red",
                    borderColor: "red",
                    borderCapStyle: 'butt',
                    borderJoinStyle: 'miter',
                    pointBorderColor: "red",
                    pointBackgroundColor: "#fff",
                    pointBorderWidth: 1,
                    pointHoverRadius: 5,
                    pointHoverBackgroundColor: "red",
                    pointHoverBorderColor: "red",
                    pointHoverBorderWidth: 2,
                    pointRadius: 1,
                    pointHitRadius: 10,
                    data: [-1.0147,-0.6396,-0.2645,0.1106,0.4857,0.8608,1.2359],
                }
            ]
        };

        var myChart = new Chart(ctx, {
            type: 'line',
            data: data,
            options: {
                responsive:false,
                scales: {
                    yAxes: [{
                        ticks: {
                            beginAtZero:false
                        }
                    }]
                }
            }
        });
</script>
</body>
</html>
In [5]:
from IPython.display import IFrame
IFrame('https://bryansim.github.io/chartjsplot.html', width=700, height=500)
Out[5]:

Bokeh

In [5]:
from bokeh.plotting import figure, output_file, show

intercept = -.2122
b1 = -1.1776 
b2 = -.1192
b3 = .4943

#x1 is a categorical predictor, x2 is a continuous predictor, x3 is their interaction
def y(x1, x2):
    return intercept + b1*x1 + b2*x2 + b3*x1*x2

x = range(1,8)
y1 = [y(0,z) for z in x]
y2 = [y(1,z) for z in x]

# output to static HTML file
output_file("bokehplot.html")

# create a new plot
p = figure(
   tools="pan,box_zoom,reset,save",
   y_range=[-1.5, 1.5],
)

# add some renderers
p.line(x, y1, legend="mod=0", line_color="blue")
p.line(x, y2, legend="mod=1", line_color="red")

# show the results
show(p)
In [4]:
from IPython.display import IFrame
IFrame('https://bryansim.github.io/bokehplot.html', width=700, height=675)
Out[4]:

Summary

I think, from here on out, I'll be using matplotlib for simple plots, Bokeh if I need to generate something interactive, and chart.js if I ever need to implement a dynamic chart.

With regard to the alternatives:

  1. MATLAB is proprietary.
  2. The built-in packages in R generate ugly plots.
  3. ggplot is a little unwieldy, and has defaults that I don't like (e.g., the grey background).
  4. Excel is good for visualizing something really quickly, but most people who have worked with linear models for a while (including myself) can do this in our heads or with pen and paper.
  5. gnuplot looks really interesting, and may be something I spend more time on in the future, but it's foreign to me at the moment.
  6. plot.ly and Tableau (which I wanted to test, but their sign-up system never sent me the confirmation email) are both web-based, and for me, needlessly obfuscate what happens in the backend. I'd be comfortable writing my own code to generate the plots I want.

BONUS: MS Paint