`MCMCvis` 0.13.1 – HPD intervals, multimodel visualization, and more!

MCMCvis version 0.13.1 is now on CRAN, complete with new features for summarizing and visualizing MCMC output. See HERE and HERE for posts that cover general package features. Highlights since the last CRAN version include:

  • MCMCsummary – Specify quantiles for equal-tailed credible intervals and return highest posterior density (HPD) intervals.
  • MCMCplot – Visualize posteriors from two different models on a single caterpillar plot. Useful when comparing parameter estimates across models and for visualizing shrinkage.
  • MCMCplot – Produce guide lines for caterpillar plots to more easily match parameter names to caterpillars.
  • MCMCplot – Specify colors for caterpillar plots.
  • All functions – Summarize and visualize model output from stanreg objects produced with the rstanarm package and brms objects produced with the brms package (in addition to output produced with the rstan, rjags, R2jags, and jagsUI packages).

Below we demonstrate how one might use some of these features using simulated data and simple models fit in JAGS.

Simulate some data

Let’s simulate data for 12 different groups using a generating process that assumes the means of these groups arise from a normal distribution with a constant mean and variance. In other words, we treat group as a ‘random effect’. We will make the groups differ in size to highlight how a random intercepts model will borrow strength compared to a model where groups are not modeled hierarchically, i.e. groups are treated as fixed effects.

# adapted from Marc Kery's WinBUGS for Ecologists (2010)
set.seed(1)
n_groups <- 12
n_obs <- rpois(n_groups, 8)
n <- sum(n_obs)
mu_group <- 20
sigma_group <- 5 
sigma <- 8
alpha <- rnorm(n_groups, mu_group, sigma_group)
epsilon <- rnorm(n, 0, sigma)
x <- rep(1:n_groups, n_obs)
X <- as.matrix(model.matrix(~as.factor(x) -1))
y <- as.numeric(X %*% as.matrix(alpha) + epsilon)
boxplot(y~x)

Rplot

Fit two models to these data

First, let’s fit a simple means model to these data in JAGS where the means are not modeled hierarchically. We will use diffuse priors and let JAGS pick the initial values for simplicity’s sake.

{
sink("model1.R")
cat("
model {
# priors
for (i in 1:n_groups) { 
alpha[i] ~ dnorm(0, 0.001) 
}
sigma ~ dunif(0, 100)
tau <- 1 / sigma^2

# likelihood
for (i in 1:n) {
y[i] ~ dnorm(alpha[x[i]], tau)
}
}
", fill = TRUE)
sink()
}
data <- list(y = y, x = x, n = n, n_groups = n_groups)
params <- c("alpha", "sigma")
gvals <- c(alpha, sigma)
jm <- rjags::jags.model("model1.R", data = data, n.chains = 3, n.adapt = 5000)
jags_out1 <- rjags::coda.samples(jm, variable.names = params, n.iter = 10000

Next we fit a random intercepts model.

{
sink("model2.R")
cat("
model {
# priors
for (i in 1:n_groups) { 
alpha[i] ~ dnorm(mu_group, tau_group) 
}
mu_group ~ dnorm(0, 0.001)
sigma_group ~ dunif(0, 100)
tau_group <- 1 / sigma_group^2
sigma ~ dunif(0, 100)
tau <- 1 / sigma^2

# likelihood
for (i in 1:n) {
y[i] ~ dnorm(alpha[x[i]], tau)
}
}
", fill = TRUE)
sink()
}
data <- list(y = y, x = x, n = n, n_groups = n_groups)
params <- c("alpha", "mu_group", "sigma", "sigma_group")
gvals <- c(alpha, mu_group, sigma, sigma_group)
jm <- rjags::jags.model("model2.R", data = data, n.chains = 3, n.adapt = 5000)
jags_out2 <- rjags::coda.samples(jm, variable.names = params, n.iter = 10000)

MCMCsummary – custom equal-tailed and HPD intervals

Sample quantiles in MCMCsummary can now be specified directly using the probs argument (removing the need to define custom quantiles with the func argument). The default behavior is to provide 2.5%, 50%, and 97.5% quantiles. These probabilities can be changed by supplying a numeric vector to the probs argument. Here we ask for 90% equal-tailed credible intervals.

MCMCsummary(jags_out1, 
            params = 'alpha', 
            probs = c(0.1, 0.5, 0.9), 
            round = 2)
##            mean   sd   10%   50%   90% Rhat n.eff
## alpha[1]  23.82 3.02 19.97 23.82 27.68    1 30385
## alpha[2]  23.80 2.81 20.25 23.81 27.37    1 30000
## alpha[3]  21.83 2.63 18.49 21.83 25.18    1 29699
## alpha[4]  20.04 2.16 17.30 20.04 22.79    1 30000
## alpha[5]  29.47 3.02 25.58 29.50 33.30    1 29980
## alpha[6]  23.01 2.14 20.28 23.00 25.75    1 30305
## alpha[7]  16.20 2.06 13.60 16.20 18.84    1 30000
## alpha[8]  10.13 2.49  6.95 10.13 13.33    1 28967
## alpha[9]  26.95 2.48 23.79 26.96 30.12    1 30000
## alpha[10] 16.85 3.71 12.05 16.86 21.59    1 30262
## alpha[11] 26.10 3.03 22.24 26.11 29.96    1 30473
## alpha[12] 23.63 3.32 19.40 23.63 27.85    1 29627

Highest posterior density (HPD) intervals can now be displayed instead of equal-tailed intervals by using HPD = TRUE. This uses the HPDinterval function from the coda package to compute intervals based on the probability specified in the hpd_prob argument (this argument is different than probs argument, which is reserved for quantiles). Below we request 90% highest posterior density intervals.

MCMCsummary(jags_out1, 
            params = 'alpha', 
            HPD = TRUE, 
            hpd_prob = 0.9, 
            round = 2)
##            mean   sd 90%_HPDL 90%_HPDU Rhat n.eff
## alpha[1]  23.82 3.02    18.86    28.75    1 30385
## alpha[2]  23.80 2.81    19.20    28.45    1 30000
## alpha[3]  21.83 2.63    17.58    26.17    1 29699
## alpha[4]  20.04 2.16    16.53    23.63    1 30000
## alpha[5]  29.47 3.02    24.45    34.37    1 29980
## alpha[6]  23.01 2.14    19.66    26.63    1 30305
## alpha[7]  16.20 2.06    12.79    19.54    1 30000
## alpha[8]  10.13 2.49     6.12    14.31    1 28967
## alpha[9]  26.95 2.48    22.96    31.10    1 30000
## alpha[10] 16.85 3.71    10.79    22.98    1 30262
## alpha[11] 26.10 3.03    21.29    31.23    1 30473
## alpha[12] 23.63 3.32    18.01    28.94    1 29627

MCMCplot – multimodel visualization

Posterior estimates from two different models can also now be visualized with MCMCplot. Let’s visually compare the alpha parameter posterior samples from the fixed effect and random intercepts models to see how our choice of model affected the posterior estimates. Below we can see that the random intercepts model means (in red) are pulled towards the true (generating value) grand mean (the vertical dashed line) compared to the fixed intercepts model means (in black), as expected due to shrinkage.

MCMCplot(object = jags_out1, 
         object2 = jags_out2, 
         params = 'alpha', 
         offset = 0.2, 
         ref = mu_group)

Rplot05

Guide lines can also be produced (using the guide_lines argument) to help guide to eye from the parameter name to the associated caterpillar, which can be useful when trying to visualize a large number of parameters on a single plot. Colors can also be specified for each parameter (or the entire set of parameters).

MCMCplot(object = jags_out1, 
         params = 'alpha', 
         offset = 0.2, 
         guide_lines = TRUE, 
         ref = NULL, 
         col = topo.colors(12),
         sz_med = 2.5,
         sz_thick = 9,
         sz_thin = 4)

Rplot02

Check out the full package vignette HERE. The MCMCvis source code can be found HERE. More information on authors: Christian Che-Castaldo, Casey Youngflesh.

Posted in R

Precipitation could spell peril for penguins

Lab member Casey Youngflesh has a photo featured as a part of the journal Frontiers in Ecology and the Environment’s new ‘Ecopics’ series. This series features wildlife photos that describe some interesting natural phenomenon. Casey found this Adélie penguin buried by snow near Joinville Island on the Antarctic Peninsula. Increased snow and rain may have negative consequences for penguin populations, as neither eggs nor young chicks can accommodate being wet!

image

Software Carpentry/HPC Workshop at POLAR 2018

Lynch Lab members Casey Youngflesh and Catherine Foley (in addition to some folks at TACC) kicked off this year’s POLAR 2018 conference by teaching a 2-day Software Carpentry/High Performance Computing workshop. The workshop was made possible by an NSF Research Coordination Grant, which endeavors to develop better cyberinfrastructure resources for complex analyses conducted by polar scientists. Thanks to everyone for attending! All workshop materials can be found here.

PolarRCN_workshop_Davos

Congratulations to Casey Youngflesh on the American Ornithological Society Presentation Prize

Lab graduate student Casey Youngflesh was awarded the American Ornithological Society Presentation Prize for his presentation at the World Seabird Twitter Conference 4. The conference challenged participants to communicate their research in four tweets, using anything from hashtags to gifs. Check out all the conference presentations (including Michael Schrimpf’s) on twitter using the #WSTC4 hashtag. Follow Casey on twitter @caseyyoungflesh

`MCMCvis` package peer reviewed and published!

MCMCvis, created by lab PhD candidate Casey Youngflesh is now peer reviewed and published in the Journal of Open Source Software! JOSS is journal that focuses exclusively on scientific software. While I think most would agree that giving software developers credit for their work is important, I think the practice of citing R packages (at least in ecology) is relatively rare. I (Casey) myself am guilty of this, to be sure. Cite the R packages/other software you use – publications and citations of those pubs are the currency of academia!

Check your prior posterior overlap (PPO) – MCMC wrangling in R made easy with `MCMCvis`

When fitting a Bayesian model using MCMC (often via JAGS/BUGS/Stan), a number of checks are typically performed to make sure your model is worth interpreting without further manipulation (remember: all models are wrong, some are useful!):

  • R-hat (AKA Gelman-Rubin statistic) – used to assess convergence of chains in the model
  • Visual assessment of chains – used to assess whether posterior chains mixed well (convergence)
  • Visual assessment of posterior distribution shape – used to determine if the posterior distribution is constrained
  • Posterior predictive check (predicting data using estimated parameters) – used to make sure that the model can generate the data used in the model

PPO

One check, however, is often missing: a robust assessment of the degree to which the prior is informing the posterior distribution. Substantial influence of the prior on the posterior may not be apparent through the use of R-hat and visual checks alone. Version 0.9.2 of MCMCvis (now available on CRAN), makes quantifying and plotting the prior posterior overlap (PPO) simple.

MCMCvis is an R package designed to streamline analysis of Bayesian model results derived from MCMC samplers (e.g., JAGS, BUGS, Stan). It can be used to easily visualize, manipulate, and summarize MCMC output. The newest version is full of new features – a full tutorial can be found here.

An example

To check PPO for a model, we will use the function MCMCtrace. As the function is used to generate trace and density plots, checking for PPO is barely more work than just doing the routine checks that one would ordinarily perform. The function plots trace plots on the left and density plots for both the posterior (black) and prior (red) distributions on the right. The function calculates the percent overlap between the prior and posterior and prints this value on the plot. See ?MCMCtrace in R for details regarding the syntax.

#install package
install.packages('MCMCvis', repos = "http://cran.case.edu")

#load package
require(MCMCvis)

#load example data
data(MCMC_data)

#simulate data from the prior used in your model
#number of iterations should equal the number of draws times the number of chains (although the function will adjust if the correct number of iterations is not specified)
#in JAGS: parameter ~ dnorm(0, 0.001)
PR <- rnorm(15000, 0, 32)

#run the function for just beta parameters
MCMCtrace(MCMC_data, params = 'beta', priors = PR, pdf = FALSE)

unnamed-chunk-1-1

Why check?

Checking the PPO has particular utility when trying to determine if the parameters in your model are identifiable. If substantial PPO exists, the prior may simply be dictating the posterior distribution – the data may have little influence on the results. If a small degree of PPO exists, the data was informative enough to overcome the influence of the prior. In the field of ecology, nonidentifiability is a particular concern in some types of mark-recapture models. Gimenez (2009) developed quantitative guidelines to determine when parameters are robustly identifiable using PPO.

While a large degree of PPO is not always a bad thing (e.g., substantial prior knowledge about the system may result in very informative priors used in the model), it is important to know where data was and was not informative for parameter estimation. The degree of PPO that is acceptable for a particular model will depend on a great number of factors, and may be somewhat subjective (but see Gimenez [2009] for a less subjective case). Like other checks, PPO is just one of many tools to be used for model assessment. Finding substantial PPO when unexpected may suggest that further model manipulation is needed. Happy model building!

Other MCMCvis improvements

Check out the rest of the new package freatures, including the option to calculate the number of effective samples for each parameter, ability to take arguments in the form of a ‘regular expression’ for the params argument, ability to retain the structure of all parameters in model output (e.g., parameters specified as matrices in the model are summarized as matrices).

Follow Casey Youngflesh on Twitter @caseyyoungflesh. The MCMCvis source code can be found on GitHub.

New paper out in Nature Communications

The study, led by Lynch Lab postdoc Chris Che-Castaldo, highlights the need to aggregate abundance estimates over space to produce robust estimates of abundance when substantial stochasticity exists in populations. Adélie penguin population dynamics are inherently noisy, making it difficult to separate signal from noise when using these birds as indicators for environmental change. Nearly the entire global population of Adélie penguins was modeled in this effort, using every piece of publicly available data on Adélie penguin abundance. All code and data (instructions on how to query the database) to run the analyses available in the supplements! Check out the MAPPPD website to interact with the model results and check out penguin population dynamics for yourself.

s41467-017-00890-0