Reality Commons

brought to you by the MIT Human Dynamics Lab

Social Evolution Dataset - Data Collected

Data collection includes proximity, location, and call log. Survey data includes Sociometric survey for relationships (chosen from 'friend', 'acquaintance', or 'don't know'), Political opinions (democratic vs. republican), Recent smoking behavior, Attitudes towards exercise and fitness, Attitudes towards diet, Attitudes towards academic performance, Current confidence and anxiety level, music sharing from 1500 independent music tracks from a wide assortment of genres.

Detailed description of the data is presented below:

The following figure shows each socialization network, political discussant network, the Facebook network, the blog/Twitter network (columns 1-5) during December 2008, March 2009, April 2009 and May 2009 (rows 1-4). Each of the 20 images shows whether a given person A (x-axis) reported the stated relationship with a given person B (y-axis) in a specific month. We have reordered the participants so that friends go together. The friendship networks, the socialization networks, and the political discussant networks all take the block diagonal form, and the blocks represent different living sectors. Hence, relationships and living sectors have the most important interactions. The Facebook networks and blog/Twitter networks are the least structured. Some individuals reported relationships with all other residents in the student hall, especially towards the end of the academic year. This may indicate that by the end of the year all 84 residents in the small student dorm know one another, and distinguishing between the different relationships in the survey is a difficult task as a result.


zz = bzfile("RelationshipsFromSurveys.csv.bz2")
relationships = read.csv(zz)
relationships$id.A = factor(relationships$id.A, levels = 1:84)
relationships$id.B = factor(relationships$id.B, levels = 1:84)
relationships$relationship = factor(relationships$relationship,
    levels = c("SocializeTwicePerWeek","CloseFriend",
    "PoliticalDiscussant", "FacebookAllTaggedPhotos", "BlogLivejournalTwitter"))
relationships.aggr = with(
    relationships[relationships$relationship == "SocializeTwicePerWeek",
    ], table(id.A, id.B))
relationships.cube =with(relationships, table(id.A, id.B, relationship, survey.date))
my.cor = function(m) cor(m + matrix(rnorm(prod(dim(m)), sd = sd(c(m)) * 0.01), 
    nrow = nrow(m)))
relationships.hclust = hclust(as.dist((1 - my.cor(relationships.aggr))^0.5), 
    , method = "ward")
layout(matrix(1:30, nrow = 5, byrow = TRUE))
par(mar = c(1, 1, 1, 1))
for (i in 1:dim(relationships.cube)[3]) for (j in 1:dim(relationships.cube)[4]) {
    image(relationships.cube[relationships.hclust$order, relationships.hclust$order, 
        i, j], xaxt = "n", yaxt = "n")
    if (j == dim(relationships.cube)[4]) 
        mtext(dimnames(relationships.cube)[[3]][i], side = 4, cex = 0.7)
    if (i == dim(relationships.cube)[3]) 
        mtext(dimnames(relationships.cube)[[4]][j], side = 1)
}

Students in the dormitory reported clusters of relationships in the surveys.

Friends also have a higher correlation in their on-campus activities, as indicated by the monthly surveys. From September 2008 to May 2009, 15 friend pairs each shared all on-campus activities that we surveyed, and 30% of friend pairs shared over 50% of their on-campus activities. In comparison, non-friends shared less than 10% of on-campus activities. In particular, pairs who participated aerobic exercise about three times weekly were more likely to be friends. While the surveys show significant correlation between activity participation and friendship (hypothesis that the correlation of friends' activity participation has the same probability distribution as the correlation of non-friends' was rejected with \( p<10^{-6} \)
in a Kolmogorov-Smirnov test), activities and interactions collected by sensors are nonetheless necessary for estimating friendship or activities, as the surveys do not offer statistically-significant correlations between friendship and factors that do not require shared space and time, such as shared websites and shared music.

The left panel of the following figure compares activity participation between friend pairs and non-friend pairs with a quantile-quantile plot (QQ plot). A QQ plot compares the probability distributions of two samples. When the two samples (participation correlations among friends and participation correlations among non-friends) have equal sizes, the QQ plot draws in a coordinate system the lowest value in one sample (non-friends) against the lowest value in the other sample, draws the second-lowest value in one sample against the second-lowest value in the other sample, and so on. When the two samples have different sizes, the QQ plot interpolates the values in the two samples. The digits 1-9 in this panel mark the 0.1-0.9 percentiles in the two distributions. For example, 20% of friend pairs and 5% of non-friend pairs (marked by red digit 8) have activity correlations greater than 0.4. If we identify friends as those whose activity correlations are greater than 0.4, we can successfully identify the 20% of friend pairs with the highest activity correlations, but also misidentify the 5% of non-friend pairs with the highest activity correlations as friend pairs.

The right panel of the following figure shows the odds that two individuals will be friends given that they perform certain numbers of aerobic exercises per week. When both individuals participate in around three aerobic activities per week, they have higher odds of being friends. This suggests that friendship determines behavior, because otherwise it is hard to explain why people who do 2.5 aerobic activities per week like those who do 2.5 aerobic activities per week, but people who do 3.5 aerobic activities per week like those who do 3.5 aerobic activities per week. When both individuals have one and fewer aerobic activities per week they are more likely to be non-friends. This seems to suggest that physical exercise is an additional factor other than shared living sector and shared courses that also shape friendship relations.


    
layout(matrix(1:2, nrow = 1, byrow = TRUE))
# par(mar=c(1,1,1,1))
org = read.csv("Activities.csv")
org$user.id = factor(org$user.id, levels = 1:84)
user.org = table(org$user.id, org$campus.organization)
my.cor = function(m) cor(m + matrix(rnorm(prod(dim(m)), sd = sd(c(m)) * 0.01), 
    nrow = nrow(m)))
adj.matrix = relationships.cube[, , "CloseFriend", "2008-12-13"]
x = split(my.cor(t(user.org[as.character(1:84), ]))[upper.tri(adj.matrix)], 
    adj.matrix[upper.tri(adj.matrix)])
qqplot(x[[1]], x[[2]], pch = ".", xlab = "non-friends", ylab = "friends",
    sub = "correlation of on-campus activity", main = "QQ-plot")
lines(-1:1, -1:1, col = "red")
points(quantile(x[[1]], 1:9/10), quantile(x[[2]], 1:9/10), pch = as.character(1:9), 
    col = "red")
#
physical = read.csv("Health.csv")
aerobic = with(physical, tapply(aerobic_per_week, data.frame(user_id, survey.month), 
    mean))
aerobic = aerobic[rowSums(!is.na(aerobic)) > 0, ]
library(MASS)
## Loading required package: MASS
edge.list = which(relationships.cube[dimnames(aerobic)[["user_id"]],
    dimnames(aerobic)[["user_id"]],
    "CloseFriend", "2008-12-13"] > 0, arr.ind = TRUE)
f1 <- kde2d(rowMeans(aerobic, na.rm = TRUE)[edge.list[, 1]], rowMeans(aerobic, 
    na.rm = TRUE)[edge.list[, 2]], n = 8)
edge.list = which(relationships.cube[dimnames(aerobic)[["user_id"]],
    dimnames(aerobic)[["user_id"]],
    "CloseFriend", "2008-12-13"] == 0, arr.ind = TRUE)
f2 <- kde2d(rowMeans(aerobic, na.rm = TRUE)[edge.list[, 1]], rowMeans(aerobic, 
    na.rm = TRUE)[edge.list[, 2]], n = 8)
contour(x = f1$x, y = f1$y, z = exp(asinh(f1$z) - asinh(f2$z)), xlab = "person A", 
    ylab = "person B", main = "odds of being friends", sub = "aerobic per week")
lines(0:7, 0:7, col = "gray")

plot of chunk RH-activity-diffusion

The sensors on subjects' mobile phones captured meaningful information on who the subjects were, whom the subjects interacted with, and how information, opinion, virus and other things diffused among the subjects, just like the surveys did. However, electronic devices could track and help individuals continuously without intervention, while surveys require significant intention from the individuals.

The sensor records of phone calls, short messages, and proximity reveal different types of relationships parellel the self-reported relationships at a much finer time scale and can be collected with much less human intervention. To illustrate this, we show how well these sensor records predict the subjects' self-reported close friend'' relationship andmeeting more than twice per week'' relationship at the middle of the experiment. We simply predict that two persons would report the relationship under investigation if they had more than a threshold amount of phone calls, short messages or proximity records during the whole experiment, and we plot the amount of correct vs wrong prediction at different thresholds and a performance indicator (the following two figures).

Different sensor data have different trade-offs in predicting reported relationships. When two subjects made voice calls to each other, we can predict that they were very likely to report themselves as friends and hence they were very likely to report themselves to get together at least twice per week in some meaningful events (from 0 to 250 correct predictions in black line). However, since subjects often socialize with non-friends who never made any voice calls to one another, a big fraction of self-reported sociolization could not be predicted with voice calls. If two persons were within 10 meters distance (proximity) and on the same floor, they would very likely report themselves to be socializing with each other. However, two persons might be in proximity very often without realizing the presence of each other. As a reference, whether two persons were in the same dormitory floor performs closer to voice call.


users = read.csv("Subjects.csv")
zz = bzfile("Calls.csv.bz2", 
    open = "rt")
calls = read.csv(file = zz)
close(zz)
calls$user_id = ordered(calls$user_id, levels = 1:84)
calls$dest_user_id_if_known = ordered(calls$dest_user_id_if_known, levels = 1:84)
calls.adjmat = table(calls$user_id, calls$dest_user_id_if_known)
zz = bzfile("SMS.csv.bz2", 
    open = "rt")
sms = suppressWarnings(read.csv(file = zz))
close(zz)
sms$user.id = ordered(sms$user.id, levels = 1:84)
sms$dest.user.id.if.known = ordered(sms$dest.user.id.if.known, levels = 1:84)
sms.adjmat = table(sms$user.id, sms$dest.user.id.if.known)
same.floor.adjmat = outer(users$floor, users$floor, "==")
zz = bzfile("Proximity.csv.bz2", 
    open = "rt")
proximity = read.csv(file = zz)
close(zz)
proximity$user.id = ordered(proximity$user.id, levels = 1:84)
proximity$remote.user.id.if.known = ordered(proximity$remote.user.id.if.known, 
    levels = 1:84)
proximity.adjmat = table(proximity$user.id, proximity$remote.user.id.if.known)
proximity.floor.adjmat = tapply(proximity$prob2,
    proximity[, c("user.id", "remote.user.id.if.known")],
    sum, na.rm = TRUE)

make.ordered.dyads = function(adjmat, label.mat) {
    dyads.ordered = data.frame(A = c(row(adjmat), col(adjmat)), B = c(col(adjmat), 
        row(adjmat)), score = rep(c(adjmat), 2)/2, friend = rep(c(label.mat), 
        2))
    dyads.ordered[order(dyads.ordered$score, decreasing = TRUE), ]
}
calls.dyads.ordered = make.ordered.dyads(calls.adjmat,
    relationships.cube[,
    , "SocializeTwicePerWeek", "2008-12-13"])
plot(cumsum(calls.dyads.ordered$friend == 0), cumsum(calls.dyads.ordered$friend > 
    0), type = "l", lty = 1, col = 1, xlab = "number of incorrect predictions", 
    ylab = "number of correct predictions",
    main = "socializing twice per week")
sms.dyads.ordered = make.ordered.dyads(sms.adjmat, relationships.cube[, ,
    "SocializeTwicePerWeek",
    "2008-12-13"])
lines(cumsum(sms.dyads.ordered$friend == 0),
    cumsum(sms.dyads.ordered$friend >
    0), lty = 2, col = 2)
same.floor.dyads.ordered = make.ordered.dyads(same.floor.adjmat, relationships.cube[, 
    , "SocializeTwicePerWeek", "2008-12-13"])
lines(cumsum(same.floor.dyads.ordered$friend == 0),
    cumsum(same.floor.dyads.ordered$friend >
    0), lty = 3, col = 3)
proximity.dyads.ordered = make.ordered.dyads(proximity.adjmat, relationships.cube[, 
    , "SocializeTwicePerWeek", "2008-12-13"])
lines(cumsum(proximity.dyads.ordered$friend == 0),
    cumsum(proximity.dyads.ordered$friend >
    0), lty = 4, col = 4)
proximity.floor.dyads.ordered =
    make.ordered.dyads(proximity.floor.adjmat, relationships.cube[,
    , "SocializeTwicePerWeek", "2008-12-13"])
lines(cumsum(proximity.floor.dyads.ordered$friend == 0),
    cumsum(proximity.floor.dyads.ordered$friend >
    0), lty = 5, col = 5)
legend("bottomright", lty = 1:5, col = 1:5, legend = c("voice call", "SMS", 
    "same dorm floor", "proximity", "proximity+Wi-Fi RSSI"))

plot of chunk RH-relationship-sensor-survey

In the following plot, we compare the reported relationship (SocializeTwicePerWeek) with aggregated dyadic relationships from sensors represented by adjacency matrices. Investigators of this data set can compare the adjacency matrices below with the ROC curves above.

layout(matrix(1:6, nrow = 2, byrow = TRUE))
image(x = 1:84, y = 1:84,
    z = relationships.cube[relationships.hclust$order,
    relationships.hclust$order, "SocializeTwicePerWeek", "2008-12-13"], xaxt = "n", 
    yaxt = "n", xlab = "", ylab = "", main = "reported socialization")
image(x = 1:84, y = 1:84,
    z = sms.adjmat[relationships.hclust$order, relationships.hclust$order] >
    0, xaxt = "n", yaxt = "n", xlab = "", ylab = "", main = "SMS")
image(x = 1:84, y = 1:84,
    z = calls.adjmat[relationships.hclust$order, relationships.hclust$order] >
    0, , xaxt = "n", yaxt = "n", xlab = "", ylab = "", main = "voice call")
image(x = 1:84, y = 1:84,
z=ifelse(is.na(same.floor.adjmat), 0, same.floor.adjmat)[relationships.hclust$order,
    relationships.hclust$order] > 0, xaxt = "n", yaxt = "n", xlab = "", ylab = "", 
    main = "same floor")
image(x = 1:84, y = 1:84,
    z = asinh(10 * sweep(ifelse(is.na(proximity.floor.adjmat),
    0, proximity.floor.adjmat), 1, rowSums(proximity.floor.adjmat, na.rm = TRUE) + 
    1e-06, "/")[relationships.hclust$order, relationships.hclust$order]), xaxt = "n", 
    yaxt = "n", xlab = "", ylab = "", main = "proximity+Wi-Fi")
image(x = 1:84, y = 1:84, z = asinh(10 * sweep(ifelse(is.na(proximity.adjmat), 
    0, proximity.adjmat), 1, rowSums(proximity.adjmat, na.rm = TRUE) + 1e-06, 
    "/")[relationships.hclust$order, relationships.hclust$order]), , xaxt = "n", 
    yaxt = "n", xlab = "", ylab = "", main = "proximity")

plot of chunk RH-adjmat-sensor-survey

The sensor records of Wi-Fi access point within range collected by personal mobile phones parellel the subjects' self-reports on who they were and what they did, at a much finer time scale and with much less human intervention. We make sense of the Wi-Fi data by collecting information about the access points. The following figure is a heat map showing how individuals visited places daily. The x-axis is indexed by Wi-Fi access points, and the y-axis is indexed by time during the week from Monday morning to Saturday at midnight. An entry shows how often the residents accessed a Wi-Fi access point in a specific hour in the week. The Wi-Fi access points on the left side were in the dormitory building, and so had many accesses from midnight to morning. The Wi-Fi access points on the right had high usage during work hours, and correspond to the classrooms and offices. The Wi-Fi access points in the middle show high usage from evening to midnight, and correspond to fitness centers and the student activity center. (We can not publish the latitudes, longitudes, building floors, and MAC addresses of the access points, since otherwise the investigators of this data set can estimate the subjects' instantaneous positions, buildings and room numbers from the RSSI to Wi-Fi access points.)


zz = bzfile("WLAN2.csv.bz2",
    open = "rt")
wlan = read.csv(file = zz)
close(zz)
wlan$time = as.POSIXct("1970-1-1", tz = "America/New_York") + wlan$unix_time
wlan.time = with(wlan[with(wlan,
    which(wireless_mac %in% head(names(sort(table(wlan$wireless_mac),
    decreasing = TRUE)), 512))), c("wireless_mac", "time")], table(wireless_mac, 
    strftime(time, "%w&%H")))
wlan.time[is.na(wlan.time)] = 0
v = hclust(as.dist(sqrt(1 - cor(t(asinh(wlan.time))))), method = "ward")
image(x = 1:nrow(wlan.time), y = 1:ncol(wlan.time), z = asinh(wlan.time[v$order, 
    ]), xlab = "wifi hotspots", ylab = "time in one week", xaxt = "n", yaxt = "n")
abline(h = seq(24, 24 * 7, by = 24))

The student dormitory community cycled among dormitory (left), athletic center (middle) and classroom/office (right) from Sunday (bottom stripe) to Saturday (upper stripe), as indicated by Wi-Fi access-point usage.

Note: There was significant loss of BlueTooth data during the late fall and late spring, due to carriers pushing system updates to the phones. Data during September-October political season and January-February flu season were relatively unaffected. To repair this data we conducted the following post processing steps to interpolate the Bluetooth data between observations (i.e., to interpolate missing data):

  1. For each Bluetooth record (subject A, subject B, time stamp), we add (subject B, subject A, time stamp) into the data set.
  2. We remove duplicated records every 6 minutes
  3. If a subject A didn't have Bluetooth record in a 6 minute interval, we estimate the probability of in proximity of A to other individuals B in this time interval in the following way: ( \(\alpha\) * probability of proximity over all time from step 2 + \(\beta\) * probability of proximity over the same hour of day + \(\gamma\) * probability of proximity over the same hour of day and day of week + \(\delta\) * probability of proximity in 30 minutes window) / (\(\alpha + \beta + \gamma + \delta\)) . Then we insert to the database the estimated records of the top individuals who account for 80% accumulated probability.
  4. To remove the Bluetooth records involving two persons on different floors, we estimated the floor of the individuals with Wi-Fi RSSI, and obtain the ground truth by assuming that the individuals were in their rooms between 3:30am ~ 4:30 am. The detail is here: http://dx.doi.org/10.1109/BSN.2012.16