The Lynch Lab celebrated the new academic year, and the addition of Noah Strycker and Michael Wethington to the lab, with the 6th annual Lynch Lab mini golf invitational. With returning champion Casey Youngflesh having graduated, the title went back to (Lynch’s husband) Matt Eisaman.
Congrats to our latest graduates!
Featured
Congratulations to Dr. Casey Youngflesh, Dr. Catherine Foley, and Dr. Maureen Lynch, who graduated with their Ph.D.s from the Department of Ecology & Evolution a few weeks ago. All of them did an outstanding job with their dissertations and are now busy bees in their new jobs. We will miss you guys!
Upping your organizational and reproducibility game for Bayesian analyses with `MCMCvis` 0.15.0
The background
Results from Bayesian analyses are often the result of many hours of study design, data processing, model construction, evaluation, checking, and validation (see here for an excellent overview of a typical Bayesian workflow). Because model building/fitting/checking is an iterative process, we often are left with many versions of scripts, objects, and files associated with an analysis. While this issue isn’t specific to Bayesian analyses, that’s what I focus on here (as that is the focus of the MCMCvis R package, which aims to facilitate the visualization and wrangling of MCMC output).
The problem
This overflow of information can make it difficult to stay organized, keep track of the model development process, and refer back to past results. A number of strategies can be used to help combat the deluge of code and results, one of which being version control, such as git. However, given the large file sizes that model results often generate, version control is often not an effective strategy for tracking changes over time (given that the file size limit for many online version control platforms, such as Github, is several hundred MB). Moreover, wading through one’s git history to find a particular set of model results and the code associated with them can be quite arduous (speaking from personal experience here).
As a user of Bayesian models, I want an archive of my model results, a general summary/list of diagnostics of those results, and the scripts used to fit these models in a single location. I’ve found that creating a new directory (typically with the date the analysis was run) for each model run, and storing all the relevant files associated with that analysis there to be helpful. A typical project directory for me might look like the following:
Archiving results in this way provides a quick and easy way to reference a given model run (how this model fit, what the results were). However, this is somewhat of a pain to produce and code to achieve this can quickly clutter scripts!
A solution
The MCMCdiag
function in MCMCvis 0.15.0, now available on CRAN, aims to provide some structure to do the tedious tasks outlined above with a single function call. Using the default options, MCMCdiag
takes an object output from one of the various packages used to fit Bayesian models in R (including rstan
, nimble
, rjags
, jagsUI
, R2jags
, rstanarm
, and brms
) as an argument, and produces a quick and basic summary of the model results and diagnostics in the form of a .txt file. This .txt file serves as a point of reference for the results from this particular model.
Before getting to the other bells and whistles in this function, let’s first take a look at that file, using results from a model fit using rstan
to interface with Stan
. Code to simulate the data and fit the model can be found at the end of this post.
library(MCMCvis)
#fit is an object produced from rstan::stan
#round = 2 used to round output for readability in this post
MCMCdiag(fit, round = 2)
Based on the information in the .txt file (below), it looks like our model ran quickly, with good numbers for some basic diagnostics. Note that some fields in this file will vary depending on which software was used to fit the model. That is, output from models fit with rstan
will have different fields compared to output from models fit with nimble
(as relevant information such as diagnostics differs slightly between these two programs). More on Stanspecific diagnostics can be found in the Stan documentation. .txt file output:
Information and diagnostics
===========================
Run time (min): 0.01
Total iter: 2000
Warmup: 1000
Thin: 1
Num chains: 4
Adapt delta (specified): 0.95
Max tree depth (specified): 13
Initial step size (specified): 0.01
Mean accept stat: 0.96
Mean tree depth: 2.6
Mean step size: 0.5856
Num divergent transitions: 0
Num max tree depth exceeds: 0
Num chains with BFMI warnings: 0
Max Rhat: 1
Min n.eff: 1530
Model summary
=============
mean sd 2.5% 50% 97.5% Rhat n.eff
alpha 1.92 0.23 1.49 1.92 2.37 1 4020
beta 2.93 0.08 2.78 2.93 3.09 1 3598
sigma 7.26 0.16 6.93 7.26 7.58 1 3006
lp__ 2478.96 1.22 2482.08 2478.65 2477.55 1 1530
However, while this is helpful for summarizing the model output, we still have some work to do to reproduce the organizational structure outlined above. Fortunately, by giving MCMCdiag
several arguments we can do the following (in addition to creating the above .txt summary/diagnostic file):
 create a new directory for the files associated with this analysis (
mkdir
argument)  save the object from
rstan
as a .rds file for later use (save_obj
argument)  save the data used to fit this model and
sessionInfo
as .rds objects (add_obj
argument)  create a copy of the .stan model file and script used to fit this model (
cp_file
argument)
MCMCdiag(fit,
mkdir = 'blogmodel20210325',
file_name = 'blogmodelsummary',
dir = '~/project_directory/results',
save_obj = TRUE,
obj_name = 'blogmodeloutput',
add_obj = list(DATA, sessionInfo()),
add_obj_names = c('blogmodeldata',
'sessioninfo'),
cp_file = c('model.stan', 'blogmodel.R'))
My results/
directory now looks as follows:
If you aren’t familiar with .rds objects, they can be read into R in the following way:
fit < readRDS('blogmodeloutput.rds')
Additional arguments to MCMCdiag
can be used to save additional objects as .rds files, add additional fields to the .txt file (useful for making notes about a particular model run), and control how the summary information is presented (as with other functions in MCMCvis
). See the package vignette for a more complete overview of MCMCvis
functionality.
As a workflow
As I make changes to my analyses (e.g., data processing, model parameterization, priors), I can create an catalog of results by making MCMCdiag
calls after each model run. Importantly, all the information required to reproduce results from a given run (i.e., data to fit the model, code to fit the model, and environment information from sessionInfo
) is contained within each one of these directories, sparing one the pain of combing back through version control history to find out how the data were structured, what model was run, and what the results were at a given point in a project.
When looking back at model results, I generally want to know: 1) what the model was, 2) whether it fit properly, and 3) what the parameter estimates were. This organizational structure facilitated by MCMCdiag
makes this achievable with minimal code.
Code to simulate data and fit model
Simulate data
#set seed
set.seed(1)
#number of data points
N < 1000
#simulated predictor
x_sim < seq(5, 5, length.out = N)
#intercept generating value
alpha_sim < 2
#slope generating value
beta_sim < 3
#generated error (noise)
eps_sim < rnorm(length(x_sim), 0, 7)
#simulate response
y_sim < alpha_sim + beta_sim * x_sim + eps_sim
Stan model (model.stan)
data {
int<lower=0> N;
vector[N] y;
vector[N] x;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
model {
alpha ~ normal(0, 30);
beta ~ normal(0, 30);
sigma ~ normal(0, 30);
y ~ normal(alpha + beta * x, sigma);
}
Fit model
library(rstan)
DATA < list(N = length(y_sim),
y = y_sim,
x = x_sim)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
fit < stan('model.stan',
data = DATA,
iter = 2000,
chains = 4,
cores = 4,
pars = c('alpha',
'beta',
'sigma'),
control = list(adapt_delta = 0.95,
max_treedepth = 13,
stepsize = 0.01))
More information on the author: Casey Youngflesh
`MCMCvis` 0.13.1 – HPD intervals, multimodel visualization, and more!
MCMCvis
version 0.13.1 is now on CRAN, complete with new features for summarizing and visualizing MCMC output. See HERE and HERE for posts that cover general package features. Highlights since the last CRAN version include:

MCMCsummary
– Specify quantiles for equaltailed credible intervals and return highest posterior density (HPD) intervals. MCMCplot
– Visualize posteriors from two different models on a single caterpillar plot. Useful when comparing parameter estimates across models and for visualizing shrinkage.MCMCplot
– Produce guide lines for caterpillar plots to more easily match parameter names to caterpillars.MCMCplot
– Specify colors for caterpillar plots. All functions – Summarize and visualize model output from
stanreg
objects produced with therstanarm
package andbrms
objects produced with thebrms
package (in addition to output produced with therstan
,rjags
,R2jags
, andjagsUI
packages).
Below we demonstrate how one might use some of these features using simulated data and simple models fit in JAGS.
Simulate some data
Let’s simulate data for 12 different groups using a generating process that assumes the means of these groups arise from a normal distribution with a constant mean and variance. In other words, we treat group as a ‘random effect’. We will make the groups differ in size to highlight how a random intercepts model will borrow strength compared to a model where groups are not modeled hierarchically, i.e. groups are treated as fixed effects.
# adapted from Marc Kery's WinBUGS for Ecologists (2010) set.seed(1) n_groups < 12 n_obs < rpois(n_groups, 8) n < sum(n_obs) mu_group < 20 sigma_group < 5 sigma < 8 alpha < rnorm(n_groups, mu_group, sigma_group) epsilon < rnorm(n, 0, sigma) x < rep(1:n_groups, n_obs) X < as.matrix(model.matrix(~as.factor(x) 1)) y < as.numeric(X %*% as.matrix(alpha) + epsilon) boxplot(y~x)
Fit two models to these data
First, let’s fit a simple means model to these data in JAGS where the means are not modeled hierarchically. We will use diffuse priors and let JAGS pick the initial values for simplicity’s sake.
{ sink("model1.R") cat(" model { # priors for (i in 1:n_groups) { alpha[i] ~ dnorm(0, 0.001) } sigma ~ dunif(0, 100) tau < 1 / sigma^2 # likelihood for (i in 1:n) { y[i] ~ dnorm(alpha[x[i]], tau) } } ", fill = TRUE) sink() }
data < list(y = y, x = x, n = n, n_groups = n_groups) params < c("alpha", "sigma") gvals < c(alpha, sigma) jm < rjags::jags.model("model1.R", data = data, n.chains = 3, n.adapt = 5000) jags_out1 < rjags::coda.samples(jm, variable.names = params, n.iter = 10000
Next we fit a random intercepts model.
{ sink("model2.R") cat(" model { # priors for (i in 1:n_groups) { alpha[i] ~ dnorm(mu_group, tau_group) } mu_group ~ dnorm(0, 0.001) sigma_group ~ dunif(0, 100) tau_group < 1 / sigma_group^2 sigma ~ dunif(0, 100) tau < 1 / sigma^2 # likelihood for (i in 1:n) { y[i] ~ dnorm(alpha[x[i]], tau) } } ", fill = TRUE) sink() }
data < list(y = y, x = x, n = n, n_groups = n_groups) params < c("alpha", "mu_group", "sigma", "sigma_group") gvals < c(alpha, mu_group, sigma, sigma_group) jm < rjags::jags.model("model2.R", data = data, n.chains = 3, n.adapt = 5000) jags_out2 < rjags::coda.samples(jm, variable.names = params, n.iter = 10000)
MCMCsummary
– custom equaltailed and HPD intervals
Sample quantiles in MCMCsummary
can now be specified directly using the probs
argument (removing the need to define custom quantiles with the func
argument). The default behavior is to provide 2.5%, 50%, and 97.5% quantiles. These probabilities can be changed by supplying a numeric vector to the probs
argument. Here we ask for 90% equaltailed credible intervals.
MCMCsummary(jags_out1, params = 'alpha', probs = c(0.1, 0.5, 0.9), round = 2)
## mean sd 10% 50% 90% Rhat n.eff ## alpha[1] 23.82 3.02 19.97 23.82 27.68 1 30385 ## alpha[2] 23.80 2.81 20.25 23.81 27.37 1 30000 ## alpha[3] 21.83 2.63 18.49 21.83 25.18 1 29699 ## alpha[4] 20.04 2.16 17.30 20.04 22.79 1 30000 ## alpha[5] 29.47 3.02 25.58 29.50 33.30 1 29980 ## alpha[6] 23.01 2.14 20.28 23.00 25.75 1 30305 ## alpha[7] 16.20 2.06 13.60 16.20 18.84 1 30000 ## alpha[8] 10.13 2.49 6.95 10.13 13.33 1 28967 ## alpha[9] 26.95 2.48 23.79 26.96 30.12 1 30000 ## alpha[10] 16.85 3.71 12.05 16.86 21.59 1 30262 ## alpha[11] 26.10 3.03 22.24 26.11 29.96 1 30473 ## alpha[12] 23.63 3.32 19.40 23.63 27.85 1 29627
Highest posterior density (HPD) intervals can now be displayed instead of equaltailed intervals by using HPD = TRUE
. This uses the HPDinterval
function from the coda
package to compute intervals based on the probability specified in the hpd_prob
argument (this argument is different than probs
argument, which is reserved for quantiles). Below we request 90% highest posterior density intervals.
MCMCsummary(jags_out1, params = 'alpha', HPD = TRUE, hpd_prob = 0.9, round = 2)
## mean sd 90%_HPDL 90%_HPDU Rhat n.eff ## alpha[1] 23.82 3.02 18.86 28.75 1 30385 ## alpha[2] 23.80 2.81 19.20 28.45 1 30000 ## alpha[3] 21.83 2.63 17.58 26.17 1 29699 ## alpha[4] 20.04 2.16 16.53 23.63 1 30000 ## alpha[5] 29.47 3.02 24.45 34.37 1 29980 ## alpha[6] 23.01 2.14 19.66 26.63 1 30305 ## alpha[7] 16.20 2.06 12.79 19.54 1 30000 ## alpha[8] 10.13 2.49 6.12 14.31 1 28967 ## alpha[9] 26.95 2.48 22.96 31.10 1 30000 ## alpha[10] 16.85 3.71 10.79 22.98 1 30262 ## alpha[11] 26.10 3.03 21.29 31.23 1 30473 ## alpha[12] 23.63 3.32 18.01 28.94 1 29627
MCMCplot
– multimodel visualization
Posterior estimates from two different models can also now be visualized with MCMCplot
. Let’s visually compare the alpha parameter posterior samples from the fixed effect and random intercepts models to see how our choice of model affected the posterior estimates. Below we can see that the random intercepts model means (in red) are pulled towards the true (generating value) grand mean (the vertical dashed line) compared to the fixed intercepts model means (in black), as expected due to shrinkage.
MCMCplot(object = jags_out1, object2 = jags_out2, params = 'alpha', offset = 0.2, ref = mu_group)
Guide lines can also be produced (using the guide_lines
argument) to help guide to eye from the parameter name to the associated caterpillar, which can be useful when trying to visualize a large number of parameters on a single plot. Colors can also be specified for each parameter (or the entire set of parameters).
MCMCplot(object = jags_out1, params = 'alpha', offset = 0.2, guide_lines = TRUE, ref = NULL, col = topo.colors(12), sz_med = 2.5, sz_thick = 9, sz_thin = 4)
Check out the full package vignette HERE. The MCMCvis
source code can be found HERE. More information on authors: Christian CheCastaldo, Casey Youngflesh.
Congrats Reid!
Congratulations to former high school lab member Reid Biondo, who is heading to the University of Virginia this fall.
Happy Holidays from the Lynch Lab!
Image
Lynch Lab at AGU 2018
Heather Lynch and Casey Youngflesh, along with collaborator Mike Polito from LSU, reported on their latest research in a “Penguins! From Space” press conference at AGU.
Precipitation could spell peril for penguins
Lab member Casey Youngflesh has a photo featured as a part of the journal Frontiers in Ecology and the Environment’s new ‘Ecopics’ series. This series features wildlife photos that describe some interesting natural phenomenon. Casey found this Adélie penguin buried by snow near Joinville Island on the Antarctic Peninsula. Increased snow and rain may have negative consequences for penguin populations, as neither eggs nor young chicks can accommodate being wet!
Software Carpentry/HPC Workshop at POLAR 2018
Lynch Lab members Casey Youngflesh and Catherine Foley (in addition to some folks at TACC) kicked off this year’s POLAR 2018 conference by teaching a 2day Software Carpentry/High Performance Computing workshop. The workshop was made possible by an NSF Research Coordination Grant, which endeavors to develop better cyberinfrastructure resources for complex analyses conducted by polar scientists. Thanks to everyone for attending! All workshop materials can be found here.
Congratulations to Casey Youngflesh on the American Ornithological Society Presentation Prize
Lab graduate student Casey Youngflesh was awarded the American Ornithological Society Presentation Prize for his presentation at the World Seabird Twitter Conference 4. The conference challenged participants to communicate their research in four tweets, using anything from hashtags to gifs. Check out all the conference presentations (including Michael Schrimpf’s) on twitter using the #WSTC4 hashtag. Follow Casey on twitter @caseyyoungflesh