Reality Commons

brought to you by the MIT Human Dynamics Lab

Badge Dataset - Data Collected

The data set contains the following tables, and each table contains the following fields:

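# role is used below as a factor index into color vectors, so read strings as
# factors (read.csv no longer does this by default since R 4.0.0)
badge.assignment = read.csv("BadgeAssignment.csv", stringsAsFactors = TRUE)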
trans = read.csv("Transactions.csv")
trans$ = as.POSIXct(trans$, tz = "America/Chicago")
trans$ = as.POSIXct(trans$, tz = "America/Chicago")
zz = bzfile("LocationTrackingEvery1Minute.csv.bz2", open = "rt")
hdc.xy = read.csv(zz)
hdc.xy$time = as.POSIXct(hdc.xy$time, tz = "America/Chicago")
zz = bzfile("IR.csv.bz2", open = "rt")
IR.aggr = read.csv(zz)
IR.aggr$date.time = as.POSIXct(IR.aggr$date.time, tz = "America/Chicago")
zz = bzfile("Zigbee.csv.bz2", open = "rt")
net.aggr = read.csv(zz)
net.aggr$date.time = as.POSIXct(net.aggr$date.time, tz = "America/Chicago")

The following figure shows how often two employees were located within the distance of one cubicle (that is, co-located); rows and columns are indexed by employees. The brightness of the table cell at row \( i \) and column \( j \) represents the amount of time employee \( i \) and employee \( j \) were co-located; the whiter the color, the more total time they were co-located. The dendrograms to the left and top of the heat map represent how employees were grouped according to their co-location relationship. A leaf of the dendrogram corresponds to the same employee that indexes a row and a column of the heat map, while the colors on the leaves of the dendrogram represent different branches in the firm: red is the configuration branch, green the coordination branch, and purple the pricing branch. The numbers at the right and bottom sides of the heat map show the IDs of the employee tracking badges. We constructed the dendrogram by expressing the amounts of time that an employee was co-located with other employees as an observation vector of real numbers for this employee, and defining the distance between two employees \( i \) and \( j \) to be \( d_{ij}=\sqrt{1-\rho_{ij}} \), where \( \rho_{ij} \) is the correlation coefficient between \( i \)'s times of co-location with other employees and \( j \)'s times of co-location with other employees. We use Ward's minimum variance method in hierarchical clustering to find compact, spherical clusters in constructing the dendrogram.
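As a minimal sketch of the clustering step described above, the following R code builds a small synthetic co-location table (the badge names `b1` to `b6` and all values are made up for illustration, not taken from the data set), converts column correlations into distances, and applies Ward clustering. It assumes the distance is the square root of one minus the correlation, and uses `"ward.D"`, the name under which R 3.1.0 and later provide the earlier `"ward"` method.

```r
# Synthetic stand-in for the co-location table: 6 hypothetical badges,
# two groups of three that spend time mostly with their own group.
set.seed(1)
ids = c("b1", "b2", "b3", "b4", "b5", "b6")
coloc = matrix(runif(36, 0, 1), 6, 6, dimnames = list(ids, ids))
coloc[1:3, 1:3] = coloc[1:3, 1:3] + 10
coloc[4:6, 4:6] = coloc[4:6, 4:6] + 10
coloc = (coloc + t(coloc))/2  # co-location time is symmetric

# Distance between two badges: sqrt(1 - correlation of their columns).
d = as.dist(sqrt(1 - cor(coloc)))

# Ward clustering, then cut the tree into two groups.
tree = hclust(d, method = "ward.D")
groups = cutree(tree, k = 2)
print(groups)
```

Badges whose co-location vectors are highly correlated end up at a small distance, so the two synthetic groups fall into separate branches of the tree.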

Employees are consistently co-located with others whose cubicles are close by, confirming the previous finding that shared time and space is a significant factor in relationship-building [7]. However, employees from different branches show different co-location patterns, while employees from the same branch show similar ones, which is not surprising since different branches had different types of tasks. Such patterns differentiate the employees into several clusters. About 70% of employees in the cluster from badge ID 278 to badge ID 292 in the heat map were senior configuration staff, who did most of the tasks assigned to the configuration branch and had intensive co-location with one another but spent very little time with other employees. This is because, in order to finish the advanced tasks assigned to them, they needed to visit only 100 to 200 grid points in the workspace (out of 502 in total), or 7 to 14 cubicles (out of 28), and discuss their tasks with only a limited number of people. About 70% of employees in the heat map cluster from badge ID 265 to badge ID 56 were novice configuration staff, who in contrast discussed their tasks with few others but pursued only a small fraction of tasks assigned to the configuration branch. The cluster of pricing staff spent less time with one another, but spent more time with the configuration staff, and performed many more basic and complex assignments per person compared to senior configuration staff. Note that we used no performance measure in hierarchical clustering; the configuration staff split into a cluster with more senior members and another with more junior members simply because the senior and junior members behave differently.

badge.prox = with(net.aggr[net.aggr$ %in% unique(net.aggr$, 
    ], table(,
badge.prox = sweep(badge.prox, 2, tapply(trunc(as.numeric(net.aggr$date.time)/3600), 
    net.aggr$, function(x) length(unique(x))), "/")
badge.prox.hclust = hclust(as.dist((1 - cor(badge.prox))^0.5))
heatmap(badge.prox, Rowv = as.dendrogram(badge.prox.hclust),
    Colv = as.dendrogram(badge.prox.hclust),
    scale = "none",
    RowSideColors = c("yellow", "red", "green",
        "purple", "gray")[badge.assignment$role[match(rownames(badge.prox),
        badge.assignment$BID)]], ColSideColors = c("yellow", "red", "green",
        "purple", "gray")[badge.assignment$role[match(rownames(badge.prox),
        badge.assignment$BID)]])

plot of chunk prox-symmetry

According to the theory of structural holes [2], people talk more often to those with the same expertise or role, and the less frequent interactions among people with different expertise or roles can be more important when they happen. This is confirmed by how often people engaged in face-to-face communication in the call center, as indicated by the IR messages logged by the employees' badges (cf. figure below), and by how visiting other employees' cubicles could contribute to higher productivity per unit time, to be discussed later. Employees are more likely to have face-to-face discussions when their cubicles are closer, which indicates a way of engineering the communication structure within the call center by rearranging the cubicles.

IR.aggr2 = IR.aggr[IR.aggr$ %in% unique(hdc.xy$id) & IR.aggr$ %in%
    unique(hdc.xy$id), ]
ir.prox = table(unique(IR.aggr2)[, c("", "")])
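# keep only badges that appear as both a row and a column
ir.prox = ir.prox[rownames(ir.prox) %in% colnames(ir.prox), colnames(ir.prox) %in% 
    rownames(ir.prox)]
# "ward" was renamed to "ward.D" in R 3.1.0
ir.prox.hclust = hclust(as.dist(sqrt(1 - cor(asinh(ir.prox)))), method = "ward.D")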
heatmap(asinh(ir.prox * 10), Rowv = as.dendrogram(ir.prox.hclust),
    Colv = as.dendrogram(ir.prox.hclust),
    scale = "none",
    RowSideColors = c("yellow", "red", "green",
        "purple", "gray")[badge.assignment$role[match(rownames(ir.prox),
        badge.assignment$BID)]],
    ColSideColors = c("yellow", "red", "green",
        "purple", "gray")[
          badge.assignment$role[match(rownames(ir.prox), badge.assignment$BID)]])

plot of chunk ir-symmetry

The following figure shows the positive correlation between the number of tasks assigned and where an employee went while working on a task. The employee with the highest number of assignments (badge ID 293) received 132 tasks during one month. His entropy of going to different places to finish these assignments was 5.75, and he typically went to exp(5.75)=315 grid points in the workspace (out of 502 in total), or 19 of the 28 non-empty cubicles. The employee with the fewest assignments received only one task. His entropy was 4.19, and he typically went to exp(4.19)=66 grid points, or 6 cubicles.
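To illustrate how an entropy value translates into a typical number of grid points, here is a minimal R sketch. The helper `visit.entropy` and both count vectors are hypothetical and only serve the illustration. The exponential of the entropy equals the number of grid points when visits are spread evenly, and is smaller when visits concentrate on a few grid points.

```r
# Entropy (in nats) of a vector of visit counts, one count per grid point.
visit.entropy = function(counts) {
    p = counts[counts > 0]/sum(counts)
    -sum(p * log(p))
}

# Hypothetical employee who splits time evenly over 8 grid points.
even = rep(5, 8)
# Hypothetical employee who spends most of the time at one grid point.
skewed = c(33, 1, 1, 1, 1, 1, 1, 1)

exp(visit.entropy(even))    # effective number of grid points: 8
exp(visit.entropy(skewed))  # fewer than 8
```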

The following figure also shows that employees in the pricing branch and in the configuration branch received and finished assignments very differently. In terms of overall tasks assigned, a pricing employee received an average of nine times as many assignments when they were basic, and three times as many when they were complex, as a configuration employee was assigned. Pricing employees also finished these assignments in parallel, and went to many people to solve these assignments. Configuration employees, on the other hand, solved advanced assignments exclusively, worked serially, and went to fewer people to solve their assignments.

The entropy of location distribution in solving a complex task is about 10% higher than the entropy of solving a basic task, meaning that solving a complex task requires discussion with 10% more people. However, the entropy of location distribution in solving an advanced task is more centered around the median in comparison to the entropies of basic and complex tasks – advanced tasks require only a certain number of discussions, suggesting that advanced tasks are more self-contained.

Interpreting the log linear relationship between rate of completion and entropy in terms of survival analysis, we write time of completion = \( \exp(-\sum_{(\tilde{x}_{m},\tilde{y}_{n})}p(\tilde{x}_{m},\tilde{y}_{n})\log p(\tilde{x}_{m},\tilde{y}_{n})) \), where \( (\tilde{x}_{m},\tilde{y}_{n}) \) is the set of location grids onto which we map RSSI, \( p(\tilde{x}_{m},\tilde{y}_{n}) \) is the probability that the grid was visited, the exponent is the entropy of the employee's location-visiting behavior when he had a task, and the visit to every location \( (\tilde{x}_{m},\tilde{y}_{n}) \) makes task completion \( \exp(-\sum_{(\tilde{x}_{m},\tilde{y}_{n})}p(\tilde{x}_{m},\tilde{y}_{n})\log p(\tilde{x}_{m},\tilde{y}_{n})) \) times faster. The “survival” time of a task is an exponential function of the rate of task completion, which in turn is the sum of the contributions from all locations that this employee visited weighted by the frequencies with which this employee visited them. The contribution of a specific location per visit \( -\log p(\tilde{x}_{m},\tilde{y}_{n}) \) is more critical when the location is less visited; however, over all visits, the more-frequently-visited locations contributed more to task completion than the less-visited locations, because \( p\log p \) decreases to 0 when \( p \) decreases to 0.
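As a short worked step, writing \( p_{mn} \) as shorthand for \( p(\tilde{x}_{m},\tilde{y}_{n}) \) (a notation introduced here only for brevity), the exponential of the entropy factorizes into one term per location, which is one way to read the per-location contributions described above:

\[ \exp\Big(-\sum_{m,n}p_{mn}\log p_{mn}\Big)=\prod_{m,n}\exp\big(-p_{mn}\log p_{mn}\big)=\prod_{m,n}p_{mn}^{-p_{mn}} \]

A location that is almost never visited contributes a factor close to 1, because \( \lim_{p\to0^{+}}p\log p=0 \) and hence \( p^{-p}\to1 \).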

# negative entropy of each badge's distribution over visited grid points
hdc.entropy = sapply(split(hdc.xy, hdc.xy$id), function(x) {
    p = table(paste(x$x, x$y))
    p = p/sum(p)
    sum(p * log(p))
})
hdc.accomplishment = c(table(as.character(trans$
hdc.accomplishment = hdc.accomplishment[intersect(names(hdc.accomplishment), 
    names(hdc.entropy))]
hdc.entropy = hdc.entropy[intersect(names(hdc.accomplishment), names(hdc.entropy))]
plot(-hdc.entropy, hdc.accomplishment, xlab = "entropy", ylab = "# of tasks assigned to")
library(maptools)  # provides pointLabel()
pointLabel(-hdc.entropy, hdc.accomplishment, names(hdc.entropy),
    col = sapply(as.character(badge.assignment$role[match(names(hdc.entropy),
    badge.assignment$BID)]), function(x) switch(x, Pricing = "purple",
    `Base station` = "orange",
    Coordinator = "green", Configuration = "red", RSSI = "gray")))
legend("topleft", text.col = c("red", "purple"), legend = c("configuration", 
    "pricing"))

plot of chunk entropy-accomplishment

We show with a quantile-quantile plot (cf. figure below) that two persons were closer to each other within 1 minute of a face-to-face discussion than within 1 hour of it. This serves as a sanity check of both the time stamps estimated from the “jiffy” counts of the badges and the indoor locations estimated from Zigbee RSSI between employees' badges and anchor nodes. We randomly take 200 records of IR proximity from the data set, randomly take 10 locations within 1 minute of the IR proximity from the sender badge and 10 locations from the receiver badge for each record, sort the 20 thousand pairwise distances (200 records \( \times10\times10 \) pairwise distances per record), and plot them against another 20 thousand sorted distances within 1 hour of IR proximity. We find that with 90% probability two persons were within the distance of 1 cubicle in the 1 minute window of their face-to-face discussion, as compared to 70% probability in the 1 hour window. We would not find this structure if either the estimated time stamps had an error bigger than 1 minute or the estimated indoor locations had an error bigger than the distance of 1 cubicle. We can similarly check that two persons were closer to each other at the time of IR proximity than at the time of Zigbee proximity, and that two persons had more IR-proximity and Zigbee-proximity records when their cubicles were closer.
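The pairwise-distance comparison described above can be sketched on synthetic positions as follows. Everything here is hypothetical: the helpers `near` and `pair.dist`, the coordinates, the spreads standing in for the 1 minute and 1 hour windows, and the cubicle width of 3 grid units. Only the use of `outer` to form all 10 by 10 distances mirrors the approach in the text.

```r
set.seed(2)
# Hypothetical grid coordinates: 10 positions per badge around a meeting.
near = function(cx, cy, spread) {
    data.frame(x = rnorm(10, cx, spread), y = rnorm(10, cy, spread))
}
a1 = near(0, 0, 1); b1 = near(1, 0, 1)  # stand-in for the narrow time window
a2 = near(0, 0, 5); b2 = near(1, 0, 5)  # stand-in for the wide time window

# All pairwise Euclidean distances between two sets of positions.
pair.dist = function(a, b) {
    c(sqrt(outer(a$x, b$x, "-")^2 + outer(a$y, b$y, "-")^2))
}
d.tight = pair.dist(a1, b1)
d.wide = pair.dist(a2, b2)

# Share of pairs closer than a hypothetical cubicle width of 3 grid units.
mean(d.tight < 3)
mean(d.wide < 3)
# qqplot(d.wide, d.tight) would draw the kind of comparison used in the text.
```

If the time stamps or locations were badly estimated, the two distance distributions would look alike and the Q-Q plot would follow the diagonal.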

IR.aggr2 = IR.aggr[IR.aggr$ %in% unique(hdc.xy$id) & IR.aggr$ %in%
    unique(hdc.xy$id), ]
IR.aggr2$ndx.local = match(paste(IR.aggr2$, strftime(IR.aggr2$date.time, 
    "%Y-%m-%d %H:%M:00")),
    paste(paste(hdc.xy$id, strftime(hdc.xy$time, "%Y-%m-%d %H:%M:00"))))
IR.aggr2$ndx.sender = match(paste(IR.aggr2$, strftime(IR.aggr2$date.time, 
    "%Y-%m-%d %H:%M:00")),
    paste(paste(hdc.xy$id, strftime(hdc.xy$time, "%Y-%m-%d %H:%M:00"))))
# pairwise distances for 200 random IR records that could be matched to locations
IR.dist = unlist(lapply(sample(which(!is.na(IR.aggr2$ndx.local) & !is.na(IR.aggr2$ndx.sender)),
    200), function(n) {
    a = hdc.xy[IR.aggr2$ndx.local[n] + 0:9, c("x", "y")]
    b = hdc.xy[IR.aggr2$ndx.sender[n] + 0:9, c("x", "y")]
    round((outer(a$x[1:10], b$x[1:10], function(u, v) (u - v))^2 + outer(a$y[1:10], 
        b$y[1:10], function(u, v) (u - v))^2)^0.5)
}))
w = as.numeric(hdc.xy$time)
IR.dist2 = unlist(lapply(sample(which(!is.na(IR.aggr2$ndx.local) & !is.na(IR.aggr2$ndx.sender)),
    200), function(n) {
    ndx = which(IR.aggr2$[n] == hdc.xy$id)
    ndx.local = ndx[abs(w[ndx] - as.numeric(IR.aggr2$date.time[n])) < 60 * 60]
    ndx = which(IR.aggr2$[n] == hdc.xy$id)
    ndx.sender = ndx[abs(w[ndx] - as.numeric(IR.aggr2$date.time[n])) < 60 * 
        60]
    a = hdc.xy[sample(ndx.local, 10, replace = TRUE), c("x", "y")]
    b = hdc.xy[sample(ndx.sender, 10, replace = TRUE), c("x", "y")]
    round((outer(a$x[1:10], b$x[1:10], function(u, v) (u - v))^2 + outer(a$y[1:10], 
        b$y[1:10], function(u, v) (u - v))^2)^0.5)
}))
qqplot(IR.dist2, IR.dist, pch = ".",
    xlab = "distance distribution less than 1 hour from IR proximity",
    ylab = "distance distribution less than 1 minute from IR proximity ",
    main = "Q-Q plot")
abline(coef = c(0, 1), col = "red")
# label the deciles 0.1 to 0.9 on the Q-Q plot
pointLabel(quantile(IR.dist2, 1:9/10), quantile(IR.dist, 1:9/10),
    paste("0", 1:9, sep = "."), col = "red")

plot of chunk IR1hour1minDist

Badge Dataset - Dig Deeper into Data

We repackaged the raw sensor data so that investigators can inspect the call-center dynamics from more perspectives. The time stamps of the raw sensor data (directly from the badge hardware) were badge CPU clock counts, which started from 0 each time a badge was powered on to collect data.

We estimate the call-center time (YYYY-mm-dd HH:MM:SS) corresponding to each badge power-on from the following two facts: (1) The anchor nodes were never rebooted. Hence the CPU clock of each anchor node was non-decreasing over time, and sender.time in Zigbee-raw.csv is non-decreasing within each chunk and consistent across chunks when the sender is an anchor node. (2) The time when the data from the badge hardware were downloaded to a computer must be later than the times of the sensor records.

For example, suppose the CPU clock range of one badge is from 0 to 3600*374400 (one hour at the CPU clock rate of 374400 counts per second), corresponding to the CPU clock range from 3600*374400 to 3600*2*374400 of anchor node A in the Zigbee records and to the CPU clock range from 1800*374400 + 3600*374400 to 1800*374400 + 3600*2*374400 of anchor node B, and the data on the badge were dumped at noon on 2007/03/30. We can infer that anchor node B started half an hour earlier than anchor node A. Because the badge's last record, taken when anchor node B's clock read 2.5 hours, precedes the noon download, anchor node B started slightly earlier than 9:30am on 2007/03/30. After we average over all chunks of sensor data and all anchor nodes, we estimate that our mapping from CPU clock to the time of the call center should have less than 1 second of error.
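The clock arithmetic in this example can be sketched in R as follows. The variable names, the helper `clock.to.time`, and the power-on time `badge.on` are hypothetical and chosen only for illustration; the clock rate and the clock readings are those of the example.

```r
clock.rate = 374400  # CPU clock counts per second, as in the example

# Anchor-node clock readings at the moment the badge powered on (badge clock 0).
anchor.A.at.badge.start = 3600 * clock.rate
anchor.B.at.badge.start = 1800 * clock.rate + 3600 * clock.rate

# Offset between the two anchors' start times, in seconds.
(anchor.B.at.badge.start - anchor.A.at.badge.start)/clock.rate  # 1800

# Map a badge clock count to call-center time, given a hypothetical
# estimate of the badge's power-on time.
badge.on = as.POSIXct("2007-03-30 11:00:00", tz = "America/Chicago")
clock.to.time = function(count, power.on) power.on + count/clock.rate
clock.to.time(1800 * clock.rate, badge.on)  # 30 minutes after power-on
```

Once a power-on time has been estimated for a chunk, every record in that chunk can be converted with the same linear mapping.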

Indoor localization from Zigbee RSSI relies on the fact that employees spent more time at their own cubicles than elsewhere: for each employee and each minute, we compare the RSSI to the anchor nodes with the signature RSSIs to the anchor nodes recorded when employees were in their cubicles.
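One simple way to compare an observed RSSI vector with signature RSSIs is nearest-signature matching, sketched below. This is an assumption made for illustration and not necessarily the procedure used to produce the data set; the `signature` values, the anchor and cubicle names, and the helper `locate` are all hypothetical.

```r
# Hypothetical signature RSSIs (dBm): one row per cubicle, one column per anchor node.
signature = rbind(cubicle.1 = c(-40, -70, -80),
                  cubicle.2 = c(-75, -45, -70),
                  cubicle.3 = c(-80, -72, -42))
colnames(signature) = c("anchor.A", "anchor.B", "anchor.C")

# Return the cubicle whose signature is closest (Euclidean) to an observed RSSI vector.
locate = function(rssi, signature) {
    d = sqrt(rowSums(sweep(signature, 2, rssi, "-")^2))
    names(which.min(d))
}

locate(c(-73, -48, -69), signature)  # closest to cubicle.2's signature
```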