Misc #2: Restructuring data

In [1]:
import pandas as pd

This post details some of the "dirty work" (specifically, restructuring data) that goes on behind data analysis.

In this case, I'm trying to restructure a person-period/stacked dataset into a person-level dataset (one summary row per participant).

These two formats can be converted from one to the other, but we often prefer one over the other for readability or analytic purposes (detailed later).

Person-period vs. person-level data

Person-period datasets are usually found in longitudinal research, where a single person contributes multiple rows of data, reflecting the same measure at multiple time points. For example, a person's blood pressure measured every day over a year might be stored in a table where every row is a new blood pressure reading. Each person contributes 365 rows of data, identified by an ID number.

On the other hand, this blood pressure data can also be stored in a person-level dataset, where each row represents a single person, and multiple variables represent each time point (for example, "BP1", "BP2", "BP3" for days 1, 2 and 3 respectively).
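
To make the two layouts concrete, here's a toy sketch in pandas (the IDs and readings below are made up purely for illustration):

import pandas as pd

# Person-period/stacked: one row per person per day
period = pd.DataFrame({'ID':  [1, 1, 1, 2, 2, 2],
                       'DAY': [1, 2, 3, 1, 2, 3],
                       'BP':  [120, 118, 122, 135, 133, 131]})

# Person-level: one row per person, one column per day
level = period.pivot(index='ID', columns='DAY', values='BP')
level.columns = ['BP1', 'BP2', 'BP3']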

As you might have guessed, in this example a person-period dataset would be preferred. Firstly, it takes less space to store: the type of data is declared once for the whole column, rather than once per reading. A person-level dataset, in contrast, gains a new column with every reading, so that redundant type information is replicated each time a new reading is taken.

Person-period datasets are also more convenient when missing data are expected. Substantial amounts of missing data in a person-level dataset would mean lots of empty cells, whereas in a person-period dataset missing readings simply don't get stored as rows, unless there are other unique, period-specific data to keep (e.g., if we were missing blood pressure data but still had to store that day's heart rate).

On the other hand, person-level datasets are often preferred when there are multiple variables that do not change from period to period. They also tend to be more human-readable. For example, if we were interested in someone's blood pressure at 3 time points, plus 100 other variables of stable demographic information, a person-period dataset would carry a lot of redundancy, as those 100 values would be replicated on every row.
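
Going the other way is just as mechanical: with the same made-up table as above, melt stacks the BP columns back into one row per person per day (again, a sketch with illustrative names):

level = pd.DataFrame({'BP1': [120, 135], 'BP2': [118, 133], 'BP3': [122, 131]},
                     index=pd.Index([1, 2], name='ID'))

# melt reshapes person-level back into person-period form
period = level.reset_index().melt(id_vars='ID', var_name='DAY', value_name='BP')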

Current data

The anonymized data I'm working with here come from an experiment I did a couple of years ago. In this study, participants completed a simple counting task where they had to count backwards, and I measured how long they took for each response. The specific details of the study are irrelevant to this post.

In [2]:
stackeddata = pd.read_csv('stacked dataset.csv')

This is what the person-period version of the data look like:

In [3]:
stackeddata.head()
Out[3]:
ID TIME TIMETRIAL MISTAKE NUMBER MISTAKESTOTAL SCORET RESPONSENUM SPLINE PASSED ... IPOVERTC SPEED100 PASSEDNUM HIvsLEffect DIFFBIN RESPONSENUMC DIFFICULTYC PHIGHSCORE PASSEDBY PASSEDBIN
0 25 16.262476 16.262476 1 2018 1 1981 0 0 0 ... -1.091039 6.149125 55 -1 1 -5 4.56 14.0 41.0 0
1 25 26.887734 10.625258 0 2017 1 1981 1 0 0 ... -1.091039 9.411536 54 -1 1 -4 4.56 14.0 41.0 0
2 25 33.911948 7.024214 0 2014 1 1981 2 0 0 ... -1.091039 14.236468 53 -1 1 -3 4.56 14.0 41.0 0
3 25 39.654125 5.742177 0 2011 1 1981 3 0 0 ... -1.091039 17.414998 52 -1 1 -2 4.56 14.0 41.0 0
4 25 48.295860 8.641734 0 2008 1 1981 4 0 0 ... -1.091039 11.571752 51 -1 1 -1 4.56 14.0 41.0 0

5 rows × 31 columns

Here you can see ID, the unique identifier for each participant; TIME and TIMETRIAL, the time taken to type each response, cumulatively and trial-by-trial respectively (so these differ on every row); and some time-invariant data, such as the total number of mistakes made (MISTAKESTOTAL).

Converting from person-period to person-level data involves summarizing the data in some way. In this case, I'd like to analyze how many responses participants made in total, so that's the level I'm going to collapse the data to.

In [4]:
# One row per participant, one column per summary variable
longdata = pd.DataFrame(index=stackeddata['ID'].unique(),
                        columns=['TOTRESPONSES', 'BEFPASS', 'AFTPASS', 'IPOVERT',
                                 'IP', 'T', 'DIFFICULTY', 'HIvsLEffect'])

This creates an empty dataframe with the unique participant IDs as the index, along with the columns I want in the final dataset.

In [5]:
longdata.head()
Out[5]:
TOTRESPONSES BEFPASS AFTPASS IPOVERT IP T DIFFICULTY HIvsLEffect
25 NaN NaN NaN NaN NaN NaN NaN NaN
26 NaN NaN NaN NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN NaN NaN NaN
29 NaN NaN NaN NaN NaN NaN NaN NaN
31 NaN NaN NaN NaN NaN NaN NaN NaN

Let's deal with the time-invariant columns first. They're easy: each participant's rows all hold the same value, so we can just take the first one:

In [6]:
# Time-invariant columns: take the first value for each participant
for column in ('IPOVERT', 'IP', 'T', 'DIFFICULTY', 'HIvsLEffect'):
    for participant in longdata.index.values:
        longdata.loc[participant, column] = \
            stackeddata.loc[stackeddata['ID'] == participant, column].iloc[0]
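
As an aside, since these columns never vary within a participant, the loop can be replaced with a single vectorized step. A quick sketch (this isn't the code I originally ran):

# Keep the first row per ID, then select the time-invariant columns
invariant = (stackeddata.drop_duplicates('ID')
             .set_index('ID')[['IPOVERT', 'IP', 'T', 'DIFFICULTY', 'HIvsLEffect']])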
In [7]:
longdata.head()
Out[7]:
TOTRESPONSES BEFPASS AFTPASS IPOVERT IP T DIFFICULTY HIvsLEffect
25 NaN NaN NaN -2 2.33333 4.33333 10 -1
26 NaN NaN NaN -1.33333 2.16667 3.5 8 -1
28 NaN NaN NaN 0.666667 5 4.33333 5 -1
29 NaN NaN NaN -2.33333 3.83333 6.16667 8 -1
31 NaN NaN NaN -4.33333 1.66667 6 6 -1

There are three columns left to populate. The first, TOTRESPONSES, is the number of valid responses each participant made (rows flagged as mistakes are excluded).

In [8]:
# Count each participant's valid (mistake-free) responses
for participant in longdata.index.values:
    longdata.loc[participant, 'TOTRESPONSES'] = len(
        stackeddata.loc[(stackeddata['ID'] == participant)
                        & (stackeddata['MISTAKE'] == 0)])
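
The same counts also fall out of a one-line groupby, if you prefer; a sketch rather than the original code:

# Number of mistake-free rows per participant
totresponses = stackeddata[stackeddata['MISTAKE'] == 0].groupby('ID').size()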
In [9]:
longdata.head()
Out[9]:
TOTRESPONSES BEFPASS AFTPASS IPOVERT IP T DIFFICULTY HIvsLEffect
25 13 NaN NaN -2 2.33333 4.33333 10 -1
26 34 NaN NaN -1.33333 2.16667 3.5 8 -1
28 20 NaN NaN 0.666667 5 4.33333 5 -1
29 10 NaN NaN -2.33333 3.83333 6.16667 8 -1
31 26 NaN NaN -4.33333 1.66667 6 6 -1

BEFPASS and AFTPASS count the valid responses made before and after a participant passed a displayed score on the screen (denoted by the column PASSED; 0 for before, 1 for after).

In [10]:
# Split each participant's valid responses by whether they came
# before (PASSED == 0) or after (PASSED == 1) passing the score
for participant in longdata.index.values:
    valid = (stackeddata['ID'] == participant) & (stackeddata['MISTAKE'] == 0)
    longdata.loc[participant, 'BEFPASS'] = len(
        stackeddata.loc[valid & (stackeddata['PASSED'] == 0)])
    longdata.loc[participant, 'AFTPASS'] = len(
        stackeddata.loc[valid & (stackeddata['PASSED'] == 1)])
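
As with the other columns, a groupby can produce both counts at once. Here's a sketch using named aggregation (it assumes a reasonably recent pandas and isn't the code I originally ran):

valid = stackeddata[stackeddata['MISTAKE'] == 0]
passcounts = valid.groupby('ID')['PASSED'].agg(
    BEFPASS=lambda s: (s == 0).sum(),  # valid responses before passing
    AFTPASS=lambda s: (s == 1).sum())  # valid responses after passing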

Finally, let's write these data out to a CSV file:

In [11]:
longdata.to_csv('long data.csv')