Mousetracker Data #2
Overview
Since the last post about these data, my friends and I at the NYU couples lab have collected some actual Mousetracker data from online participants. In this post, I write some code to pre-process and clean the dataset.
Current project
As a refresher, we hypothesized that at any given time, individuals are concerned with:
- their self-interest
- their partner's interests
- the interests of the group (the dyad, the relationship, or the two of them as a pair)
and these motives affect the way individuals choose to distribute resources.
To distinguish between these three motives, we generated three sets of poker-chip stimuli, each pitting the motives against one another:
- The first set of stimuli pits participants' self-interest against the interests of their partner.
- The second set of stimuli pits participants' concern for their partner's interests against their own self-interest and the group's interest.
- The last set of stimuli pits participants' self-interest against the interests of their partner and of the group.
The data
The data come in a person-period dataset. This is a "long" format in which each participant has multiple rows, one per trial of the experiment (there were 60 or so trials). Each row also contains multiple columns, each holding the average location of the participant's mouse pointer within one time bin of that trial. There are ~100 such bins.
In other words, each participant made about 60 choices, and their mouse positions were averaged into ~100 time points per trial.
The first thing we're going to do is some cleaning. Each of the .csv files has a redundant first row (and a summary section at the end), so we're just going to delete those:
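To make the layout concrete, here is a minimal sketch of a person-period frame. The column names (`subject`, `trial`, `X_1` ... `X_4`) and all the values are made up for illustration, and the toy uses 3 trials and 4 bins instead of ~60 and ~100; the real Mousetracker export may name things differently.

```python
import pandas as pd

# Hypothetical layout: 2 participants x 3 trials, 4 time bins per trial.
# Column names and values are placeholders, not the real export.
rows = []
for subject in [1, 2]:
    for trial in [1, 2, 3]:
        row = {'subject': subject, 'trial': trial}
        for b in range(1, 5):
            row['X_{0}'.format(b)] = 0.1 * b * trial  # placeholder position
        rows.append(row)

toy = pd.DataFrame(rows)
print(toy.shape)  # (6, 6): one row per participant-trial, 2 id columns + 4 bins
print(toy.head())
```

Each participant shows up on several rows (one per trial), and the time course of a single trial runs across the columns of that row.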
import os
import re
files = os.listdir('./data')
files
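# For every raw file: drop the redundant first row and everything from the
# 'MEAN SUBJECT-BY-SUBJECT DATA' section onward
for name in files:
    with open('./data/{0}'.format(name)) as f:
        contents = f.readlines()
    # The output name is the part of the file name before ' n=...'
    outname = re.match(r'(.*) n=.*', name).group(1)
    # Text mode ('w') so that writing str lines works on Python 2 and 3
    with open('./data cleaned/{0}.csv'.format(outname), 'w') as out:
        for line in contents[1:]:
            if line == 'MEAN SUBJECT-BY-SUBJECT DATA\n':
                break
            out.write(line)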
If there weren't a section at the end that I had to delete, I could have just skipped the first row when reading the file:
import pandas as pd
   
testdata = pd.read_csv('./data/%s' % files[0],skiprows = 1)
testdata.head()
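The cleaning loop earlier in this section can also be sketched as a small function that works on a list of lines, which makes the logic easy to check without touching any files. The function name `strip_header_and_summary` and the sample lines are made up for illustration; only the marker string comes from the code above.

```python
def strip_header_and_summary(lines, marker='MEAN SUBJECT-BY-SUBJECT DATA\n'):
    """Drop the first line and everything from `marker` onward."""
    kept = []
    for line in lines[1:]:
        if line == marker:
            break
        kept.append(line)
    return kept

# Made-up file contents, just to show the behaviour
raw = [
    'redundant first row\n',
    'subject,trial,RT\n',
    '1,1,850\n',
    'MEAN SUBJECT-BY-SUBJECT DATA\n',
    '1,850\n',
]
cleaned = strip_header_and_summary(raw)
print(cleaned)
```

Only the header row and the trial row survive; the redundant first row and the summary section are gone.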
Next, I'm going to write a loop that basically does what I did in the first post to all the separate datasets. Again, we're going to be finding the mean of participants' reaction time (RT), maximum deviation (MD), and the area under curve (AUC).
What we want in the end is a csv file that has the overall mean RT, MD, and AUC, as well as those metrics when participants were correct vs. incorrect.
I wrote two functions that basically do what I did by hand in the first post. The first combines two redundant columns, and the second finds the mean of that column, depending on whether the participant made an error or not, or whether we want the grand mean.
data = pd.read_csv('./data cleaned/%s' % os.listdir('./data cleaned')[0])
data.head()
# First, combine the redundant column pairs into single 'MD' and 'AUC' columns
def combine_columns(df):
    # Use the *_1 value where it exists; otherwise fall back to the *_2 value
    df['MD'] = df['MD_1'].fillna(df['MD_2'])
    df['AUC'] = df['AUC_1'].fillna(df['AUC_2'])
    
combine_columns(data)
def find_mean(datasource, participantid, metric, error=None):
    """Mean of `metric` for one participant.

    error=1 uses only error trials, error=0 only correct trials,
    and error=None (the default) uses all trials.
    """
    participantsdata = datasource.loc[datasource['subject'] == participantid]
    if error in (0, 1):
        participantsdata = participantsdata.loc[participantsdata['error'] == error]
    return participantsdata[metric].astype('float').mean()
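Here is a quick toy check of the two ideas, as a sketch only. The values are made up, and it assumes that each trial stores its MD in exactly one of `MD_1` or `MD_2` (which is what "redundant" means here); `fillna` is just one way to combine such a pair.

```python
import numpy as np
import pandas as pd

# Made-up data: MD lives in either MD_1 or MD_2 on any given trial
toy = pd.DataFrame({
    'subject': [1, 1, 1, 2],
    'error':   [0, 1, 0, 0],
    'MD_1':    [0.5, np.nan, 0.25, np.nan],
    'MD_2':    [np.nan, 0.75, np.nan, 1.0],
})

# Combine the pair: take MD_1 where present, otherwise MD_2
toy['MD'] = toy['MD_1'].fillna(toy['MD_2'])

# Mean MD over subject 1's correct (error == 0) trials
correct_mean = toy.loc[(toy['subject'] == 1) & (toy['error'] == 0), 'MD'].mean()
print(toy['MD'].tolist())  # [0.5, 0.75, 0.25, 1.0]
print(correct_mean)        # 0.375
```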
Next, we're going to test some code that calculates the mean of the aforementioned metrics for every participant in a dataset:
combine_columns(data)
participants = data['subject'].unique()
participantdict = {x:[] for x in participants}
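# Note: error = 1 means this test run only summarises error trials
for participant in participantdict:
    for metric in ['AUC', 'MD', 'RT']:
        try:
            participantdict[participant].append(find_mean(data, participant, metric, error = 1))
        except Exception:
            # Fall back to a missing value if the mean can't be computed
            participantdict[participant].append(None)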
outdata = pd.DataFrame.from_dict(participantdict,orient='index')
outdata.columns = ['AUC', 'MD', 'RT']
outdata.head()
Let's write this as a function:
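def sum_participants(pkeys, metrics, err, datttt):
    """Return {participant: [mean of each metric..., err]} for one error condition."""
    adictionary = {x: [] for x in pkeys}
    for participant in adictionary:
        for metric in metrics:
            try:
                adictionary[participant].append(find_mean(datttt, participant, metric, error=err))
            except Exception:
                # Fall back to a missing value if the mean can't be computed
                adictionary[participant].append(None)
        # Record which condition (0 = correct, 1 = error) these means belong to
        adictionary[participant].append(err)
    return adictionary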
Definitely not production code, but it should work.
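For comparison, the same kind of per-participant summary can be sketched with pandas' `groupby`. This is an alternative, not what the rest of this post uses, and the toy values below are made up.

```python
import pandas as pd

# Made-up trials for two participants
toy = pd.DataFrame({
    'subject': [1, 1, 1, 2, 2],
    'error':   [0, 0, 1, 0, 1],
    'AUC':     [1.0, 3.0, 5.0, 2.0, 4.0],
    'MD':      [0.5, 0.25, 1.0, 0.5, 0.75],
    'RT':      [800.0, 900.0, 1200.0, 700.0, 1000.0],
})

# One row per (subject, error) pair, with the mean of each metric
summary = (toy.groupby(['subject', 'error'])[['AUC', 'MD', 'RT']]
              .mean()
              .reset_index())
print(summary)
```

One difference worth knowing: `groupby` simply leaves out a participant-by-error combination that has no trials, whereas the dictionary approach above still produces a row (with missing values) for it.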
Combining datasets
Alright, now that we have all of that working, let's combine the datasets that we have and add features that tell us where the data came from:
files = os.listdir('./data cleaned')
files
What we're going to do is first create an empty DataFrame with all our columns. Next, we'll load each dataset, do the math for the 3 measures, and add features that capture where the data came from (i.e., one column for the color codes, another column for the comparison type).
combineddata = pd.DataFrame(columns= ['AUC', 'MD', 'RT', 'ERROR', 'COLORCODE', 'COMPARISON'])
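One way to get those two "where did this come from" features is to parse them out of each file name with regular expressions. Here is a minimal sketch; the file name is made up, and it assumes names follow a `<4-character code> <comparison>.csv` pattern.

```python
import re

# Hypothetical file name: a 4-character color code, a space, then the comparison
filename = 'blue self vs partner.csv'

# The first four characters are the color code
colorcode = re.match(r'(....)', filename).group(1)
# Everything after the code and the space, up to the extension, is the comparison
comparison = re.match(r'.... (.*)\.csv', filename).group(1)

print('{0} | {1}'.format(colorcode, comparison))  # blue | self vs partner
```

If a file name doesn't fit the pattern, `re.match` returns `None` and `.group(1)` raises an `AttributeError`, so it's worth eyeballing the file list first.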
Now, let's put everything together, and loop through all the filenames:
metrics = ['AUC', 'MD', 'RT']
for filename in files:
    tempdata = pd.read_csv('./data cleaned/{0}'.format(filename))
    
    combine_columns(tempdata)
    
    participants = tempdata['subject'].unique()
    
    correctdict = sum_participants(participants,metrics,0,tempdata)
    errordict = sum_participants(participants,metrics,1,tempdata)
    
    correctdata = pd.DataFrame.from_dict(correctdict,orient='index')
    errordata = pd.DataFrame.from_dict(errordict,orient='index')
    
    outdata = pd.concat([correctdata,errordata])
    outdata.columns = ['AUC', 'MD', 'RT', 'ERROR']
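    # The first four characters of the file name are the color code
    outdata['COLORCODE'] = re.match(r'(....)', filename).group(1)
    # The rest, up to the extension, is the comparison type
    outdata['COMPARISON'] = re.match(r'.... (.*)\.csv', filename).group(1)
    
    print(filename)
    print(len(outdata))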
    
    combineddata = pd.concat([combineddata,outdata])
combineddata.head()
This is what the data should look like, so let's write it to a csv:
combineddata.sort_index().to_csv('combineddata.csv')
outdata