Mousetracker Data #2
Overview
Since the last post about these data, my friends and I at the NYU couples lab have collected some actual Mousetracker data from online participants. In this post, I wrote some code to pre-process and clean the dataset.
Current project
As a refresher, we hypothesized that at any given time, individuals are concerned with:
- their self-interest
- their partner's interests
- the interests of the group (the dyad, the relationship, them as a pair)
and these motives affect the way individuals choose to distribute resources.
To distinguish among these three motives, we generated three sets of stimuli using poker chips, each of which pits these motives against one another:
- The first set of stimuli pits participants' self-interest against the interests of their partner.
- The second set of stimuli pits participants' concern for their partner's interests against their own self-interest and the group's interest.
- The last set of stimuli pits participants' self-interest against the interests of their partner and the group.
The data
The data come in a person-period dataset. This is a "long" format where each participant has multiple rows, one for each trial of the experiment (there were 60 or so trials). However, each row also contains multiple columns, each holding the average location of the participant's mouse pointer during one time bin of that trial. There are ~100 such bins.
In other words, each participant made 60 choices, and their mouse positions were averaged into ~100 time points per trial.
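Just to make that layout concrete, here's a made-up miniature of one cleaned file. The subject, error, RT, and MD_1/MD_2 columns appear in the real files; the trial and X_1 to X_3 bin columns are illustrative stand-ins for the ~100 real time bins:
import pandas as pd
# hypothetical miniature: two participants, two trials each, three time bins
toy = pd.DataFrame({
    'subject': [1, 1, 2, 2],
    'trial':   [1, 2, 1, 2],              # illustrative trial index
    'error':   [0, 1, 0, 0],              # 1 = participant chose incorrectly
    'RT':      [812, 1034, 760, 901],     # reaction time per trial
    'MD_1':    [0.21, None, 0.18, 0.33],  # maximum deviation, redundant pair
    'MD_2':    [None, 0.45, None, None],
    'X_1':     [0.01, 0.02, 0.00, 0.01],  # mean pointer position per time bin
    'X_2':     [0.10, 0.25, 0.08, 0.12],
    'X_3':     [0.55, 0.80, 0.49, 0.61],
})
toy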
The first thing we're going to do is actually some cleaning. Each of the .csv files has a redundant first row, so we're just going to delete that:
import os
import re
files = os.listdir('./data')
files
#for all our files: skip the redundant first row and stop copying once we
#reach the summary section at the bottom
for name in files:
    with open('./data/{0}'.format(name)) as f:
        contents = f.readlines()
    # drop the ' n=...' suffix from the filename for the cleaned copy
    out = open('./data cleaned/{0}.csv'.format(re.match(r'(.*) n=.*', name).group(1)), 'w')
    found = 0
    for line in contents[1:]:
        if line == 'MEAN SUBJECT-BY-SUBJECT DATA\n':
            found = 1  # everything after this marker is summary data we don't want
        elif found == 0:
            out.write(line)
    out.close()
If there weren't a section at the end that I had to delete, I could have just done this:
import pandas as pd
testdata = pd.read_csv('./data/%s' % files[0], skiprows=1)
testdata.head()
Next, I'm going to write a loop that does what I did in the first post, but for all the separate datasets. Again, we're going to find the mean of participants' reaction time (RT), maximum deviation (MD), and area under the curve (AUC).
What we want in the end is a csv file that has the overall mean RT, MD, and AUC, as well as those metrics broken out by whether participants were correct or incorrect.
I wrote two functions that do what I did by hand in the first post. The first combines two redundant columns, and the second finds the mean of a column, either across error trials, correct trials, or all trials (the grand mean).
data = pd.read_csv('./data cleaned/%s' % os.listdir('./data cleaned')[0])
data.head()
#first, combine the redundant columns
def combine_columns(df):
    # each trial has a value in exactly one column of each redundant pair,
    # so take MD_1/AUC_1 and fall back to MD_2/AUC_2 where it's missing
    df['MD'] = df['MD_1'].fillna(df['MD_2'])
    df['AUC'] = df['AUC_1'].fillna(df['AUC_2'])
combine_columns(data)
#mean of one metric for one participant: error trials (1), correct trials (0), or all trials (None)
def find_mean(datasource, participantid, metric, error=None):
    participantsdata = datasource.loc[datasource['subject'] == participantid]
    if error == 1:
        return participantsdata.loc[participantsdata['error'] == 1][metric].astype('float').mean()
    elif error == 0:
        return participantsdata.loc[participantsdata['error'] == 0][metric].astype('float').mean()
    else:
        return participantsdata[metric].astype('float').mean()
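For example, here's one way to check it against the test dataset we loaded above (grabbing the first subject ID purely as an illustration):
# mean RT over correct trials for the first subject in the test dataset
first_subject = data['subject'].unique()[0]
find_mean(data, first_subject, 'RT', error=0)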
Next, we're going to test some code that calculates the mean of the aforementioned metrics for every participant in a dataset:
combine_columns(data)
participants = data['subject'].unique()
participantdict = {x: [] for x in participants}
for participant in participantdict:
    for metric in ['AUC', 'MD', 'RT']:
        try:
            participantdict[participant].append(find_mean(data, participant, metric, error=1))
        except Exception:
            # if anything goes wrong (e.g., no error trials), record a missing value
            participantdict[participant].append(None)
outdata = pd.DataFrame.from_dict(participantdict, orient='index')
outdata.columns = ['AUC', 'MD', 'RT']
outdata.head()
Let's write this as a function:
def sum_participants(pkeys, metrics, err, datasource):
    adictionary = {x: [] for x in pkeys}
    for participant in adictionary:
        for metric in metrics:
            try:
                adictionary[participant].append(find_mean(datasource, participant, metric, error=err))
            except Exception:
                adictionary[participant].append(None)
        # tag the row with the error flag it summarizes (0 = correct, 1 = error)
        adictionary[participant].append(err)
    return adictionary
Definitely not production code, but it should work.
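As a quick sanity check on the test dataset from above (the trailing 0 is the error flag, which becomes the ERROR column below):
# summarize correct trials (err=0) for every participant in the test dataset
correct_summary = sum_participants(participants, ['AUC', 'MD', 'RT'], 0, data)
pd.DataFrame.from_dict(correct_summary, orient='index').head()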
Combining datasets
Alright, now that we have all of that working, let's combine the datasets we have and add features to tell us where the data came from:
files = os.listdir('./data cleaned')
files
What we're going to do is first create an empty DataFrame with all our columns. Next, we'll load all of our data, do the math for the 3 measures, and add features that capture where the data came from (i.e., one column for the color codes, another column for the comparison type).
combineddata = pd.DataFrame(columns=['AUC', 'MD', 'RT', 'ERROR', 'COLORCODE', 'COMPARISON'])
Now, let's put everything together, and loop through all the filenames:
metrics = ['AUC', 'MD', 'RT']
for filename in files:
    tempdata = pd.read_csv('./data cleaned/{0}'.format(filename))
    combine_columns(tempdata)
    participants = tempdata['subject'].unique()
    # summarize correct trials and error trials separately, then stack them
    correctdict = sum_participants(participants, metrics, 0, tempdata)
    errordict = sum_participants(participants, metrics, 1, tempdata)
    correctdata = pd.DataFrame.from_dict(correctdict, orient='index')
    errordata = pd.DataFrame.from_dict(errordict, orient='index')
    outdata = pd.concat([correctdata, errordata])
    outdata.columns = ['AUC', 'MD', 'RT', 'ERROR']
    # the first four characters of the filename are the color code;
    # the rest (up to .csv) names the comparison type
    outdata['COLORCODE'] = re.match(r'(....)', filename).group(1)
    outdata['COMPARISON'] = re.match(r'.... (.*).csv', filename).group(1)
    print(filename)
    print(len(outdata))
    combineddata = pd.concat([combineddata, outdata])
combineddata.head()
This is what the data should look like, so let's write it to a csv:
combineddata.sort_index().to_csv('combineddata.csv')
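If we want to pick the analysis back up later, the file reads straight back in (index_col=0 restores the participant IDs as the index):
# read the combined summary back in for analysis
combineddata = pd.read_csv('combineddata.csv', index_col=0)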