Misc #2: Restructuring data

In [1]:
import pandas as pd

This post details some of the "dirty work" (specifically, restructuring data) that goes on behind data analysis.

In this case, I'm trying to restructure a person-period/stacked dataset into a person-level dataset (one summary row per participant).

These two formats can be converted from one to the other, but we often prefer one over the other for readability or analytic purposes (detailed later).

Person-period vs. person-level data

Person-period datasets are usually found in longitudinal research, where a single person contributes multiple rows of data, reflecting the same measure at multiple time points. For example, a person's blood pressure measured every day over a year might be stored in a table where every row is a new blood pressure reading. Each person contributes 365 rows of data, identified by an ID number.

On the other hand, this blood pressure data can also be stored in a person-level dataset, where each row represents a single person, and multiple variables represent each time point (for example, "BP1", "BP2", "BP3" for days 1, 2 and 3 respectively).
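
To make the two layouts concrete, here's a toy sketch in pandas (the IDs and readings below are made up purely for illustration):

import pandas as pd

# Person-period/stacked: one row per person per day
period = pd.DataFrame({'ID':  [1, 1, 1, 2, 2, 2],
                       'DAY': [1, 2, 3, 1, 2, 3],
                       'BP':  [120, 118, 122, 135, 133, 131]})

# Person-level: one row per person, one column per day
level = period.pivot(index='ID', columns='DAY', values='BP')
level.columns = ['BP1', 'BP2', 'BP3']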

As you might have guessed, in this example a person-period dataset would be preferred. Firstly, it takes less space to store: the type of data is declared once for the whole column, rather than once per reading. A person-level dataset, in contrast, gains a new column with every reading, so that redundant type information is replicated each time a new reading is taken.

Person-period datasets are also more convenient when missing data are expected. Substantial amounts of missing data in a person-level dataset would mean lots of empty cells, whereas in a person-period dataset missing readings simply don't get stored as rows, unless there are other unique, period-specific data to keep (e.g., if we were missing blood pressure data but still had to store that day's heart rate).

On the other hand, person-level datasets are often preferred when there are multiple variables that do not change from period to period. They also tend to be more human-readable. For example, if we were interested in someone's blood pressure at 3 time points, plus 100 other variables of stable demographic information, a person-period dataset would carry a lot of redundancy, as those 100 values would be replicated on every row.
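
Going the other way is just as mechanical: with the same made-up table as above, melt stacks the BP columns back into one row per person per day (again, a sketch with illustrative names):

level = pd.DataFrame({'BP1': [120, 135], 'BP2': [118, 133], 'BP3': [122, 131]},
                     index=pd.Index([1, 2], name='ID'))

# melt reshapes person-level back into person-period form
period = level.reset_index().melt(id_vars='ID', var_name='DAY', value_name='BP')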

Current data

The anonymized data I'm working with here come from an experiment I did a couple of years ago. In this study, participants completed a simple counting task where they had to count backwards, and I measured how long they took for each response. The specific details of the study are irrelevant to this post.

In [2]:
stackeddata = pd.read_csv('stacked dataset.csv')

This is what the person-period version of the data look like:

In [3]:
stackeddata.head()
Out[3]:
ID TIME TIMETRIAL MISTAKE NUMBER MISTAKESTOTAL SCORET RESPONSENUM SPLINE PASSED ... IPOVERTC SPEED100 PASSEDNUM HIvsLEffect DIFFBIN RESPONSENUMC DIFFICULTYC PHIGHSCORE PASSEDBY PASSEDBIN
0 25 16.262476 16.262476 1 2018 1 1981 0 0 0 ... -1.091039 6.149125 55 -1 1 -5 4.56 14.0 41.0 0
1 25 26.887734 10.625258 0 2017 1 1981 1 0 0 ... -1.091039 9.411536 54 -1 1 -4 4.56 14.0 41.0 0
2 25 33.911948 7.024214 0 2014 1 1981 2 0 0 ... -1.091039 14.236468 53 -1 1 -3 4.56 14.0 41.0 0
3 25 39.654125 5.742177 0 2011 1 1981 3 0 0 ... -1.091039 17.414998 52 -1 1 -2 4.56 14.0 41.0 0
4 25 48.295860 8.641734 0 2008 1 1981 4 0 0 ... -1.091039 11.571752 51 -1 1 -1 4.56 14.0 41.0 0

5 rows × 31 columns

Here you can see ID, the unique identifier for each participant; TIME and TIMETRIAL, the time taken to type each response, cumulatively and trial-by-trial respectively (so these differ on every row); and some time-invariant data, such as the total number of mistakes made (MISTAKESTOTAL).

Converting from person-period to person-level data involves summarizing the data in some way. In this case, I'd like to analyze how many responses participants made in total, so that's the level I'm going to collapse the data to.

In [4]:
# One row per participant, one column per summary variable
longdata = pd.DataFrame(index=stackeddata['ID'].unique(),
                        columns=['TOTRESPONSES', 'BEFPASS', 'AFTPASS', 'IPOVERT',
                                 'IP', 'T', 'DIFFICULTY', 'HIvsLEffect'])

This creates an empty dataframe with the unique participant IDs as the index, along with the columns I want in the final dataset.

In [5]:
longdata.head()
Out[5]:
TOTRESPONSES BEFPASS AFTPASS IPOVERT IP T DIFFICULTY HIvsLEffect
25 NaN NaN NaN NaN NaN NaN NaN NaN
26 NaN NaN NaN NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN NaN NaN NaN
29 NaN NaN NaN NaN NaN NaN NaN NaN
31 NaN NaN NaN NaN NaN NaN NaN NaN

Let's deal with the time-invariant columns first. They're easy: each participant's rows all hold the same value, so we can just take the first one:

In [6]:
# Time-invariant columns: take the first value for each participant
for column in ('IPOVERT', 'IP', 'T', 'DIFFICULTY', 'HIvsLEffect'):
    for participant in longdata.index.values:
        longdata.loc[participant, column] = \
            stackeddata.loc[stackeddata['ID'] == participant, column].iloc[0]
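
As an aside, since these columns never vary within a participant, the loop can be replaced with a single vectorized step. A quick sketch (this isn't the code I originally ran):

# Keep the first row per ID, then select the time-invariant columns
invariant = (stackeddata.drop_duplicates('ID')
             .set_index('ID')[['IPOVERT', 'IP', 'T', 'DIFFICULTY', 'HIvsLEffect']])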
In [7]:
longdata.head()
Out[7]:
TOTRESPONSES BEFPASS AFTPASS IPOVERT IP T DIFFICULTY HIvsLEffect
25 NaN NaN NaN -2 2.33333 4.33333 10 -1
26 NaN NaN NaN -1.33333 2.16667 3.5 8 -1
28 NaN NaN NaN 0.666667 5 4.33333 5 -1
29 NaN NaN NaN -2.33333 3.83333 6.16667 8 -1
31 NaN NaN NaN -4.33333 1.66667 6 6 -1

There are three columns left to populate. The first, TOTRESPONSES, is the number of valid responses each participant made (rows flagged as mistakes are excluded).

In [8]:
# Count each participant's valid (mistake-free) responses
for participant in longdata.index.values:
    longdata.loc[participant, 'TOTRESPONSES'] = len(
        stackeddata.loc[(stackeddata['ID'] == participant)
                        & (stackeddata['MISTAKE'] == 0)])
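
The same counts also fall out of a one-line groupby, if you prefer; a sketch rather than the original code:

# Number of mistake-free rows per participant
totresponses = stackeddata[stackeddata['MISTAKE'] == 0].groupby('ID').size()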
In [9]:
longdata.head()
Out[9]:
TOTRESPONSES BEFPASS AFTPASS IPOVERT IP T DIFFICULTY HIvsLEffect
25 13 NaN NaN -2 2.33333 4.33333 10 -1
26 34 NaN NaN -1.33333 2.16667 3.5 8 -1
28 20 NaN NaN 0.666667 5 4.33333 5 -1
29 10 NaN NaN -2.33333 3.83333 6.16667 8 -1
31 26 NaN NaN -4.33333 1.66667 6 6 -1

BEFPASS and AFTPASS count the valid responses made before and after a participant passed a displayed score on the screen (denoted by the column PASSED; 0 for before, 1 for after).

In [10]:
# Split each participant's valid responses by whether they came
# before (PASSED == 0) or after (PASSED == 1) passing the score
for participant in longdata.index.values:
    valid = (stackeddata['ID'] == participant) & (stackeddata['MISTAKE'] == 0)
    longdata.loc[participant, 'BEFPASS'] = len(
        stackeddata.loc[valid & (stackeddata['PASSED'] == 0)])
    longdata.loc[participant, 'AFTPASS'] = len(
        stackeddata.loc[valid & (stackeddata['PASSED'] == 1)])
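
As with the other columns, a groupby can produce both counts at once. Here's a sketch using named aggregation (it assumes a reasonably recent pandas and isn't the code I originally ran):

valid = stackeddata[stackeddata['MISTAKE'] == 0]
passcounts = valid.groupby('ID')['PASSED'].agg(
    BEFPASS=lambda s: (s == 0).sum(),  # valid responses before passing
    AFTPASS=lambda s: (s == 1).sum())  # valid responses after passing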

Finally, let's write these data out to a CSV file:

In [11]:
longdata.to_csv('long data.csv')