Reality Commons

brought to you by the MIT Human Dynamics Lab

Social Evolution Dataset - Data Collected

Data collection includes proximity, location, and call log. Survey data includes Sociometric survey for relationships (chosen from 'friend', 'acquaintance', or 'don't know'), Political opinions (democratic vs. republican), Recent smoking behavior, Attitudes towards exercise and fitness, Attitudes towards diet, Attitudes towards academic performance, Current confidence and anxiety level, music sharing from 1500 independent music tracks from a wide assortment of genres.

Detailed description of the data is presented below:

individual attributes and relationships captured by phone sensors (temporal resolution - 6 minutes):
- Proximity.csv Bluetooth signal sent from whose mobile phone (user.id) and received by whose mobile phone (remote.user.id) and time, indicating the sender's mobile phone was within 10 meters of the receiver's mobile phone at the time of the record. We also included the probability (prob2) for the two person to be on the same floor in each record, estimated from Wi-Fi RSSI to access points. Proximity is an important media of meme diffusion.
- Calls.csv who (user_id) called which hashed phone numbers (dest_phone_hash, which uniquely identifies a phone number in the caller's phone), when (time_stamp), duration (duration), and user ID of the callee (dest_user_id_if_known) if the callee is also a subject in the experiment. Phone calls and SMS are indicators of friendship and important media of meme diffusion in the data set.
- SMS.csv who (user.id) sent short messages to which hashed phone numbers (dest_phone_hash), when (time), duration (duration), and user ID of the recipient (dest.user.id.if.known) if the recipient is also a subject in the experiment.
- WLAN.csv who (user_id) were within the range of which wireless local area network access points (WLAN AP, identified by wireless_mac), signal strength (strength), and when (time). WLAN AP scanning is an informative indicator of the instantaneous locations of the subjects, and hence the attributes of the subjects. We have encrypted the media access control (MAC) addresses of the access points according to IRB requirement.lab, classroom, student center, gym, restaurant, out of town, and more
individual attributes, relationships and diffusion captured by survey data
- Subjects.csv which year the subjects were in (year_school), and which living sector in the dorm building the subjects' apartments were located in (floor). Living sector in the dormitory and year in school are the two major factors in demining relationships in the data set. Other personal information can be accessed through further IRB approval.
- RelationshipsFromSurveys.csv whether a subject (id.A) indicated he had a relationship (relationship) with another subject (id.B) at the time of survey (survey.date). The surveyed relationships are: close friends (CloseFriend), participated in at least two common activities per week (SocializeTwicePerWeek), discussed politics since the last survey (PoliticalDiscussant), shared all tagged Facebook photos (FacebookAllTaggedPhotos), shared blog/ live journal/ Twitter activities (BlogLivejournalTwitter).
- FluSymptoms.csv daily flu symptoms self-report of the subjects from 01/09/2009 to 04/25/2009. The symptoms include sore throat/ coughing (sore.throat.cough), runny nose/ congestion/ sneezing (runnynose.congestion.sneezing), fever (fever), nausea/vomiting/diarrhea (nausea.vomiting.diarrhea), sad/ depressed (sad.depressed), and stressed(often.stressed). Flu symptoms diffuse through in-person contact in this data set.
- Health.csv monthly self-reports of weight, height, number of salads per week (salads_per_week), number of fruits per day (fruits_per_day), whether the meals were considered healthy (health_diet), number of days per week with at least 20 minutes aerobic excercise (aerobic_per_week), number of sports per week (sports_per_week), whether smoking (current_smoking). Researchers have arguments on whether healthy lifestyle and health are infectious.
- Politics.csv political opinions and participation during 2008 presidential election, and political opinions after the election. Before election: whether interested in politics (interested_in_politics), preferred party (preferred_party), preferred candidate (preferred_candidate). After election: whether made the vote (did_vote_in_election), whether voted preferred candidate (not_voted_preferred_candidate), whether approving the president (approve_obama_president), whether approving the president's economy politics (approve_obama_economy), whether approving the congress (approve_congress).

The following figure shows each socialization network, political discussant network, the Facebook network, the blog/Twitter network (columns 1-5) during December 2008, March 2009, April 2009 and May 2009 (rows 1-4). Each of the 20 images shows whether a given person A (x-axis) reported the stated relationship with a given person B (y-axis) in a specific month. We have reordered the participants so that friends go together. The friendship networks, the socialization networks, and the political discussant networks all take the block diagonal form, and the blocks represent different living sectors. Hence, relationships and living sectors have the most important interactions. The Facebook networks and blog/Twitter networks are the least structured. Some individuals reported relationships with all other residents in the student hall, especially towards the end of the academic year. This may indicate that by the end of the year all 84 residents in the small student dorm know one another, and distinguishing between the different relationships in the survey is a difficult task as a result.


zz = bzfile("RelationshipsFromSurveys.csv.bz2")
relationships = read.csv(zz)
relationships$id.A = factor(relationships$id.A, levels = 1:84)
relationships$id.B = factor(relationships$id.B, levels = 1:84)
relationships$relationship = factor(relationships$relationship,
    levels = c("SocializeTwicePerWeek","CloseFriend",
    "PoliticalDiscussant", "FacebookAllTaggedPhotos", "BlogLivejournalTwitter"))
relationships.aggr = with(
    relationships[relationships$relationship == "SocializeTwicePerWeek",
    ], table(id.A, id.B))
relationships.cube =with(relationships, table(id.A, id.B, relationship, survey.date))
my.cor = function(m) cor(m + matrix(rnorm(prod(dim(m)), sd = sd(c(m)) * 0.01), 
    nrow = nrow(m)))
relationships.hclust = hclust(as.dist((1 - my.cor(relationships.aggr))^0.5), 
    , method = "ward")
layout(matrix(1:30, nrow = 5, byrow = TRUE))
par(mar = c(1, 1, 1, 1))
for (i in 1:dim(relationships.cube)[3]) for (j in 1:dim(relationships.cube)[4]) {
    image(relationships.cube[relationships.hclust$order, relationships.hclust$order, 
        i, j], xaxt = "n", yaxt = "n")
    if (j == dim(relationships.cube)[4]) 
        mtext(dimnames(relationships.cube)[[3]][i], side = 4, cex = 0.7)
    if (i == dim(relationships.cube)[3]) 
        mtext(dimnames(relationships.cube)[[4]][j], side = 1)
}

Students in the dormitory reported clusters of relationships in the surveys.

Friends also have a higher correlation in their on-campus activities, as indicated by the monthly surveys. From September 2008 to May 2009, 15 friend pairs each shared all on-campus activities that we surveyed, and 30% of friend pairs shared over 50% of their on-campus activities. In comparison, non-friends shared less than 10% of on-campus activities. In particular, pairs who participated aerobic exercise about three times weekly were more likely to be friends. While the surveys show significant correlation between activity participation and friendship (hypothesis that the correlation of friends' activity participation has the same probability distribution as the correlation of non-friends' was rejected with \( p<10^{-6} \)
in a Kolmogorov-Smirnov test), activities and interactions collected by sensors are nonetheless necessary for estimating friendship or activities, as the surveys do not offer statistically-significant correlations between friendship and factors that do not require shared space and time, such as shared websites and shared music.

The left panel of the following figure compares activity participation between friend pairs and non-friend pairs with a quantile-quantile plot (QQ plot). A QQ plot compares the probability distributions of two samples. When the two samples (participation correlations among friends and participation correlations among non-friends) have equal sizes, the QQ plot draws in a coordinate system the lowest value in one sample (non-friends) against the lowest value in the other sample, draws the second-lowest value in one sample against the second-lowest value in the other sample, and so on. When the two samples have different sizes, the QQ plot interpolates the values in the two samples. The digits 1-9 in this panel mark the 0.1-0.9 percentiles in the two distributions. For example, 20% of friend pairs and 5% of non-friend pairs (marked by red digit 8) have activity correlations greater than 0.4. If we identify friends as those whose activity correlations are greater than 0.4, we can successfully identify the 20% of friend pairs with the highest activity correlations, but also misidentify the 5% of non-friend pairs with the highest activity correlations as friend pairs.

The right panel of the following figure shows the odds that two individuals will be friends given that they perform certain numbers of aerobic exercises per week. When both individuals participate in around three aerobic activities per week, they have higher odds of being friends. This suggests that friendship determines behavior, because otherwise it is hard to explain why people who do 2.5 aerobic activities per week like those who do 2.5 aerobic activities per week, but people who do 3.5 aerobic activities per week like those who do 3.5 aerobic activities per week. When both individuals have one and fewer aerobic activities per week they are more likely to be non-friends. This seems to suggest that physical exercise is an additional factor other than shared living sector and shared courses that also shape friendship relations.


    
layout(matrix(1:2, nrow = 1, byrow = TRUE))
# par(mar=c(1,1,1,1))
org = read.csv("Activities.csv")
org$user.id = factor(org$user.id, levels = 1:84)
user.org = table(org$user.id, org$campus.organization)
my.cor = function(m) cor(m + matrix(rnorm(prod(dim(m)), sd = sd(c(m)) * 0.01), 
    nrow = nrow(m)))
adj.matrix = relationships.cube[, , "CloseFriend", "2008-12-13"]
x = split(my.cor(t(user.org[as.character(1:84), ]))[upper.tri(adj.matrix)], 
    adj.matrix[upper.tri(adj.matrix)])
qqplot(x[[1]], x[[2]], pch = ".", xlab = "non-friends", ylab = "friends",
    sub = "correlation of on-campus activity", main = "QQ-plot")
lines(-1:1, -1:1, col = "red")
points(quantile(x[[1]], 1:9/10), quantile(x[[2]], 1:9/10), pch = as.character(1:9), 
    col = "red")
#
physical = read.csv("Health.csv")
aerobic = with(physical, tapply(aerobic_per_week, data.frame(user_id, survey.month), 
    mean))
aerobic = aerobic[rowSums(!is.na(aerobic)) > 0, ]
library(MASS)
## Loading required package: MASS
edge.list = which(relationships.cube[dimnames(aerobic)[["user_id"]],
    dimnames(aerobic)[["user_id"]],
    "CloseFriend", "2008-12-13"] > 0, arr.ind = TRUE)
f1 <- kde2d(rowMeans(aerobic, na.rm = TRUE)[edge.list[, 1]], rowMeans(aerobic, 
    na.rm = TRUE)[edge.list[, 2]], n = 8)
edge.list = which(relationships.cube[dimnames(aerobic)[["user_id"]],
    dimnames(aerobic)[["user_id"]],
    "CloseFriend", "2008-12-13"] == 0, arr.ind = TRUE)
f2 <- kde2d(rowMeans(aerobic, na.rm = TRUE)[edge.list[, 1]], rowMeans(aerobic, 
    na.rm = TRUE)[edge.list[, 2]], n = 8)
contour(x = f1$x, y = f1$y, z = exp(asinh(f1$z) - asinh(f2$z)), xlab = "person A", 
    ylab = "person B", main = "odds of being friends", sub = "aerobic per week")
lines(0:7, 0:7, col = "gray")

plot of chunk RH-activity-diffusion

The sensors on subjects' mobile phones captured meaningful information on who the subjects were, whom the subjects interacted with, and how information, opinion, virus and other things diffused among the subjects, just like the surveys did. However, electronic devices could track and help individuals continuously without intervention, while surveys require significant intention from the individuals.

The sensor records of phone calls, short messages, and proximity reveal different types of relationships parellel the self-reported relationships at a much finer time scale and can be collected with much less human intervention. To illustrate this, we show how well these sensor records predict the subjects' self-reported close friend'' relationship andmeeting more than twice per week'' relationship at the middle of the experiment. We simply predict that two persons would report the relationship under investigation if they had more than a threshold amount of phone calls, short messages or proximity records during the whole experiment, and we plot the amount of correct vs wrong prediction at different thresholds and a performance indicator (the following two figures).

Different sensor data have different trade-offs in predicting reported relationships. When two subjects made voice calls to each other, we can predict that they were very likely to report themselves as friends and hence they were very likely to report themselves to get together at least twice per week in some meaningful events (from 0 to 250 correct predictions in black line). However, since subjects often socialize with non-friends who never made any voice calls to one another, a big fraction of self-reported sociolization could not be predicted with voice calls. If two persons were within 10 meters distance (proximity) and on the same floor, they would very likely report themselves to be socializing with each other. However, two persons might be in proximity very often without realizing the presence of each other. As a reference, whether two persons were in the same dormitory floor performs closer to voice call.


users = read.csv("Subjects.csv")
zz = bzfile("Calls.csv.bz2", 
    open = "rt")
calls = read.csv(file = zz)
close(zz)
calls$user_id = ordered(calls$user_id, levels = 1:84)
calls$dest_user_id_if_known = ordered(calls$dest_user_id_if_known, levels = 1:84)
calls.adjmat = table(calls$user_id, calls$dest_user_id_if_known)
zz = bzfile("SMS.csv.bz2", 
    open = "rt")
sms = suppressWarnings(read.csv(file = zz))
close(zz)
sms$user.id = ordered(sms$user.id, levels = 1:84)
sms$dest.user.id.if.known = ordered(sms$dest.user.id.if.known, levels = 1:84)
sms.adjmat = table(sms$user.id, sms$dest.user.id.if.known)
same.floor.adjmat = outer(users$floor, users$floor, "==")
zz = bzfile("Proximity.csv.bz2", 
    open = "rt")
proximity = read.csv(file = zz)
close(zz)
proximity$user.id = ordered(proximity$user.id, levels = 1:84)
proximity$remote.user.id.if.known = ordered(proximity$remote.user.id.if.known, 
    levels = 1:84)
proximity.adjmat = table(proximity$user.id, proximity$remote.user.id.if.known)
proximity.floor.adjmat = tapply(proximity$prob2,
    proximity[, c("user.id", "remote.user.id.if.known")],
    sum, na.rm = TRUE)


make.ordered.dyads = function(adjmat, label.mat) {
    dyads.ordered = data.frame(A = c(row(adjmat), col(adjmat)), B = c(col(adjmat), 
        row(adjmat)), score = rep(c(adjmat), 2)/2, friend = rep(c(label.mat), 
        2))
    dyads.ordered[order(dyads.ordered$score, decreasing = TRUE), ]
}
calls.dyads.ordered = make.ordered.dyads(calls.adjmat,
    relationships.cube[,
    , "SocializeTwicePerWeek", "2008-12-13"])
plot(cumsum(calls.dyads.ordered$friend == 0), cumsum(calls.dyads.ordered$friend > 
    0), type = "l", lty = 1, col = 1, xlab = "number of incorrect predictions", 
    ylab = "number of correct predictions",
    main = "socializing twice per week")
sms.dyads.ordered = make.ordered.dyads(sms.adjmat, relationships.cube[, ,
    "SocializeTwicePerWeek",
    "2008-12-13"])
lines(cumsum(sms.dyads.ordered$friend == 0),
    cumsum(sms.dyads.ordered$friend >
    0), lty = 2, col = 2)
same.floor.dyads.ordered = make.ordered.dyads(same.floor.adjmat, relationships.cube[, 
    , "SocializeTwicePerWeek", "2008-12-13"])
lines(cumsum(same.floor.dyads.ordered$friend == 0),
    cumsum(same.floor.dyads.ordered$friend >
    0), lty = 3, col = 3)
proximity.dyads.ordered = make.ordered.dyads(proximity.adjmat, relationships.cube[, 
    , "SocializeTwicePerWeek", "2008-12-13"])
lines(cumsum(proximity.dyads.ordered$friend == 0),
    cumsum(proximity.dyads.ordered$friend >
    0), lty = 4, col = 4)
proximity.floor.dyads.ordered =
    make.ordered.dyads(proximity.floor.adjmat, relationships.cube[,
    , "SocializeTwicePerWeek", "2008-12-13"])
lines(cumsum(proximity.floor.dyads.ordered$friend == 0),
    cumsum(proximity.floor.dyads.ordered$friend >
    0), lty = 5, col = 5)
legend("bottomright", lty = 1:5, col = 1:5, legend = c("voice call", "SMS", 
    "same dorm floor", "proximity", "proximity+Wi-Fi RSSI"))

plot of chunk RH-relationship-sensor-survey

In the following plot, we compare the reported relationship (SocializeTwicePerWeek) with aggregated dyadic relationships from sensors represented by adjacency matrices. Investigators of this data set can compare the adjacency matrices below with the ROC curves above.

layout(matrix(1:6, nrow = 2, byrow = TRUE))
image(x = 1:84, y = 1:84,
    z = relationships.cube[relationships.hclust$order,
    relationships.hclust$order, "SocializeTwicePerWeek", "2008-12-13"], xaxt = "n", 
    yaxt = "n", xlab = "", ylab = "", main = "reported socialization")
image(x = 1:84, y = 1:84,
    z = sms.adjmat[relationships.hclust$order, relationships.hclust$order] >
    0, xaxt = "n", yaxt = "n", xlab = "", ylab = "", main = "SMS")
image(x = 1:84, y = 1:84,
    z = calls.adjmat[relationships.hclust$order, relationships.hclust$order] >
    0, , xaxt = "n", yaxt = "n", xlab = "", ylab = "", main = "voice call")
image(x = 1:84, y = 1:84,
z=ifelse(is.na(same.floor.adjmat), 0, same.floor.adjmat)[relationships.hclust$order,
    relationships.hclust$order] > 0, xaxt = "n", yaxt = "n", xlab = "", ylab = "", 
    main = "same floor")
image(x = 1:84, y = 1:84,
    z = asinh(10 * sweep(ifelse(is.na(proximity.floor.adjmat),
    0, proximity.floor.adjmat), 1, rowSums(proximity.floor.adjmat, na.rm = TRUE) + 
    1e-06, "/")[relationships.hclust$order, relationships.hclust$order]), xaxt = "n", 
    yaxt = "n", xlab = "", ylab = "", main = "proximity+Wi-Fi")
image(x = 1:84, y = 1:84, z = asinh(10 * sweep(ifelse(is.na(proximity.adjmat), 
    0, proximity.adjmat), 1, rowSums(proximity.adjmat, na.rm = TRUE) + 1e-06, 
    "/")[relationships.hclust$order, relationships.hclust$order]), , xaxt = "n", 
    yaxt = "n", xlab = "", ylab = "", main = "proximity")

plot of chunk RH-adjmat-sensor-survey

The sensor records of Wi-Fi access point within range collected by personal mobile phones parellel the subjects' self-reports on who they were and what they did, at a much finer time scale and with much less human intervention. We make sense of the Wi-Fi data by collecting information about the access points. The following figure is a heat map showing how individuals visited places daily. The x-axis is indexed by Wi-Fi access points, and the y-axis is indexed by time during the week from Monday morning to Saturday at midnight. An entry shows how often the residents accessed a Wi-Fi access point in a specific hour in the week. The Wi-Fi access points on the left side were in the dormitory building, and so had many accesses from midnight to morning. The Wi-Fi access points on the right had high usage during work hours, and correspond to the classrooms and offices. The Wi-Fi access points in the middle show high usage from evening to midnight, and correspond to fitness centers and the student activity center. (We can not publish the latitudes, longitudes, building floors, and MAC addresses of the access points, since otherwise the investigators of this data set can estimate the subjects' instantaneous positions, buildings and room numbers from the RSSI to Wi-Fi access points.)


zz = bzfile("WLAN2.csv.bz2",
    open = "rt")
wlan = read.csv(file = zz)
close(zz)
wlan$time = as.POSIXct("1970-1-1", tz = "America/New_York") + wlan$unix_time
wlan.time = with(wlan[with(wlan,
    which(wireless_mac %in% head(names(sort(table(wlan$wireless_mac),
    decreasing = TRUE)), 512))), c("wireless_mac", "time")], table(wireless_mac, 
    strftime(time, "%w&%H")))
wlan.time[is.na(wlan.time)] = 0
v = hclust(as.dist(sqrt(1 - cor(t(asinh(wlan.time))))), method = "ward")
image(x = 1:nrow(wlan.time), y = 1:ncol(wlan.time), z = asinh(wlan.time[v$order, 
    ]), xlab = "wifi hotspots", ylab = "time in one week", xaxt = "n", yaxt = "n")
abline(h = seq(24, 24 * 7, by = 24))

The student dormitory community cycled among dormitory (left), athletic center (middle) and classroom/office (right) from Sunday (bottom stripe) to Saturday (upper stripe), as indicated by Wi-Fi access-point usage.

Note: There was significant loss of BlueTooth data during the late fall and late spring, due to carriers pushing system updates to the phones. Data during September-October political season and January-February flu season were relatively unaffected. To repair this data we conducted the following post processing steps to interpolate the Bluetooth data between observations (i.e., to interpolate missing data):

For each Bluetooth record (subject A, subject B, time stamp), we add (subject B, subject A, time stamp) into the data set.
We remove duplicated records every 6 minutes
If a subject A didn't have Bluetooth record in a 6 minute interval, we estimate the probability of in proximity of A to other individuals B in this time interval in the following way: ( \(\alpha\) * probability of proximity over all time from step 2 + \(\beta\) * probability of proximity over the same hour of day + \(\gamma\) * probability of proximity over the same hour of day and day of week + \(\delta\) * probability of proximity in 30 minutes window) / (\(\alpha + \beta + \gamma + \delta\)) . Then we insert to the database the estimated records of the top individuals who account for 80% accumulated probability.
To remove the Bluetooth records involving two persons on different floors, we estimated the floor of the individuals with Wi-Fi RSSI, and obtain the ground truth by assuming that the individuals were in their rooms between 3:30am ~ 4:30 am. The detail is here: http://dx.doi.org/10.1109/BSN.2012.16