Manipulating NAs in longitudinal data
05 Feb 2016 | web: Sam Clifford | github: samcliffordtags: r, dplyr, tidyr |
This snippet was motivated by a hack that Marcela Cespedes presented to BRAG on 4 February 2016. Sometimes a longitudinal data set, e.g. patient information, doesn’t have all subjects observed at all time points. Here’s a way of generating an index of patient IDs corresponding to non-missing response values which may be useful for analysis with JAGS.
We’ll make use of tidyr
and dplyr
to manipulate some simulated data.
library(tidyr)
library(dplyr)
n.r <- 30 # how many subjects
n.c <- 4 # how many sequential observations
N <- n.r*n.c
X <- matrix(rnorm(N), nrow = n.r, ncol=n.c)
head(X)
## [,1] [,2] [,3] [,4]
## [1,] -0.53875996 -3.0091503 -0.4308401 0.5664380
## [2,] -0.48297471 -0.7620471 0.5479946 0.9791881
## [3,] 0.74274548 -0.1197123 0.4201759 0.2228831
## [4,] -0.70594150 -0.6380808 -0.7879914 -1.4294160
## [5,] -1.07686357 0.3279826 -0.9131584 -1.7212478
## [6,] -0.03247079 0.1207207 0.6485075 -0.3668982
Now we simulate missingness at random
X[sample(1:N, size = N/4)] <- NA
head(X)
## [,1] [,2] [,3] [,4]
## [1,] -0.53875996 -3.0091503 NA 0.5664380
## [2,] -0.48297471 -0.7620471 NA 0.9791881
## [3,] 0.74274548 -0.1197123 0.4201759 0.2228831
## [4,] -0.70594150 -0.6380808 -0.7879914 NA
## [5,] -1.07686357 0.3279826 NA -1.7212478
## [6,] -0.03247079 0.1207207 0.6485075 -0.3668982
Now that we’ve simulated some missing data, we’ll reshape the data from wide to long format, making a new ID column to ensure that we are able to keep track of whose observations these are. We also convert the time information from strings with “X” in them to numbers, sort the remaining data frame by ID
rather than time
and filter out all the rows with missing value
values.
non_missing_data <- data.frame(X, ID=1:n.r) %>%
gather("time", "value", -ID) %>%
mutate(time = extract_numeric(time)) %>%
arrange(ID) %>%
filter(!is.na(value))
If we now wanted a vector of indices that tells us about the IDs for the non-NA values, we could select the ID column
select(non_missing_data, ID) %>% head(10)
## ID
## 1 1
## 2 1
## 3 1
## 4 2
## 5 2
## 6 2
## 7 3
## 8 3
## 9 3
## 10 3
If instead we wanted an object of length n.r
which tells us how many observations correspond to each patient, rather than which observation corresponds to each patient,
patient_counts <- group_by(non_missing_data, ID) %>%
summarise(n = n())
patient_counts %>% select(n)
## Source: local data frame [30 x 1]
##
## n
## (int)
## 1 3
## 2 3
## 3 4
## 4 3
## 5 3
## 6 4
## 7 2
## 8 3
## 9 4
## 10 4
## .. ...
We can then convert this into a data frame with start and endpoints
patient_counts %>%
mutate(end = cumsum(n),
start = 1 + c(0, end[2:length(ID) - 1])) %>%
select(ID:n, start, end)
## Source: local data frame [30 x 4]
##
## ID n start end
## (int) (int) (dbl) (int)
## 1 1 3 1 3
## 2 2 3 4 6
## 3 3 4 7 10
## 4 4 3 11 13
## 5 5 3 14 16
## 6 6 4 17 20
## 7 7 2 21 22
## 8 8 3 23 25
## 9 9 4 26 29
## 10 10 4 30 33
## .. ... ... ... ...
The first solution is probably conceptually the simplest way to store the data for inclusion in a JAGS model, whereas the second index vector will be shorter and potentially simpler to inspect visually when the number of observations per subject is large. The third solution builds on this and gives a column, start
, which can be used as left endpoints of index ranges, and the same for the end
column as the right endpoints.