Upping your organizational and reproducibility game for Bayesian analyses with `MCMCvis` 0.15.0

The background

Results from Bayesian analyses are often the product of many hours of study design, data processing, model construction, evaluation, checking, and validation (see here for an excellent overview of a typical Bayesian workflow). Because model building/fitting/checking is an iterative process, we are often left with many versions of scripts, objects, and files associated with an analysis. While this issue isn’t specific to Bayesian analyses, that’s what I focus on here (as that is the focus of the MCMCvis R package, which aims to facilitate the visualization and wrangling of MCMC output).

The problem

This overflow of information can make it difficult to stay organized, keep track of the model development process, and refer back to past results. A number of strategies can help combat the deluge of code and results, one of which is version control (e.g., git). However, given the large files that model results often generate, version control alone is often not an effective strategy for tracking changes over time (the per-file size limit for many online version control platforms, such as GitHub, is on the order of 100 MB). Moreover, wading through one’s git history to find a particular set of model results and the code associated with them can be quite arduous (speaking from personal experience here).

As a user of Bayesian models, I want an archive of my model results, a general summary/list of diagnostics of those results, and the scripts used to fit these models in a single location. I’ve found it helpful to create a new directory (typically named with the date the analysis was run) for each model run and to store all the relevant files associated with that analysis there. A typical project directory for me might look like the following:

Archiving results in this way provides a quick and easy way to reference a given model run (how the model fit, what the results were). However, this structure is somewhat of a pain to produce by hand, and the code to achieve it can quickly clutter scripts!
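The manual version of this structure can be sketched in base R (the directory and file names below are placeholders, not part of any package):

```r
# Create a dated directory for one model run and archive the pieces
# described above: model output, data, and the fitting script.
run_dir <- file.path(tempdir(), paste0("model-run-", Sys.Date()))
dir.create(run_dir, showWarnings = FALSE)

# stand-ins for real model output and data
saveRDS(list(fit = "model output"), file.path(run_dir, "model-output.rds"))
saveRDS(list(y = rnorm(10)), file.path(run_dir, "model-data.rds"))
writeLines("# fitting script", file.path(run_dir, "fit-model.R"))

list.files(run_dir)
```

Doing this for every run means repeating this boilerplate in every analysis script, which is exactly the clutter MCMCdiag (below) is meant to remove.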

A solution

The MCMCdiag function in MCMCvis 0.15.0, now available on CRAN, aims to accomplish the tedious tasks outlined above with a single function call. Using the default options, MCMCdiag takes as an argument an object produced by one of the various packages used to fit Bayesian models in R (including rstan, nimble, rjags, jagsUI, R2jags, rstanarm, and brms), and produces a quick and basic summary of the model results and diagnostics in the form of a .txt file. This .txt file serves as a point of reference for the results from this particular model.

Before getting to the other bells and whistles in this function, let’s first take a look at that file, using results from a model fit using rstan to interface with Stan. Code to simulate the data and fit the model can be found at the end of this post.

library(MCMCvis)

#fit is an object produced from rstan::stan
#round = 2 used to round output for readability in this post
MCMCdiag(fit, round = 2)

Based on the information in the .txt file (below), it looks like our model ran quickly, with good numbers for some basic diagnostics. Note that some fields in this file will vary depending on which software was used to fit the model. That is, output from models fit with rstan will have different fields compared to output from models fit with nimble (as relevant information such as diagnostics differs slightly between these two programs). More on Stan-specific diagnostics can be found in the Stan documentation.

.txt file output:

Information and diagnostics 
=========================== 
Run time (min):                   0.01 
Total iter:                       2000 
Warmup:                           1000 
Thin:                             1 
Num chains:                       4 
Adapt delta (specified):          0.95 
Max tree depth (specified):       13 
Initial step size (specified):    0.01 
Mean accept stat:                 0.96 
Mean tree depth:                  2.6 
Mean step size:                   0.5856 
Num divergent transitions:        0 
Num max tree depth exceeds:       0 
Num chains with BFMI warnings:    0 
Max Rhat:                         1 
Min n.eff:                        1530 


Model summary 
============= 
          mean   sd     2.5%      50%    97.5% Rhat n.eff
alpha     1.92 0.23     1.49     1.92     2.37    1  4020
beta      2.93 0.08     2.78     2.93     3.09    1  3598
sigma     7.26 0.16     6.93     7.26     7.58    1  3006
lp__  -2478.96 1.22 -2482.08 -2478.65 -2477.55    1  1530

However, while this is helpful for summarizing the model output, we still have some work to do to reproduce the organizational structure outlined above. Fortunately, by giving MCMCdiag several arguments we can do the following (in addition to creating the above .txt summary/diagnostic file):

  1. create a new directory for the files associated with this analysis (mkdir argument)
  2. save the object from rstan as a .rds file for later use (save_obj argument)
  3. save the data used to fit this model and sessionInfo as .rds objects (add_obj argument)
  4. create a copy of the .stan model file and the script used to fit this model (cp_file argument)

MCMCdiag(fit, 
         mkdir = 'blog-model-2021-03-25',
         file_name = 'blog-model-summary',
         dir = '~/project_directory/results',
         save_obj = TRUE,
         obj_name = 'blog-model-output',
         add_obj = list(DATA, sessionInfo()),
         add_obj_names = c('blog-model-data', 
                           'session-info'),
         cp_file = c('model.stan', 'blog-model.R'))

My results/ directory now looks as follows:

If you aren’t familiar with .rds objects, they can be read into R in the following way:

fit <- readRDS('blog-model-output.rds')
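saveRDS and readRDS round-trip any single R object, which is what makes .rds files convenient for archiving fits and data; a minimal base-R illustration (with a stand-in object rather than a real model fit):

```r
# Save an arbitrary R object to .rds and read it back unchanged.
obj <- list(coefs = c(alpha = 1.92, beta = 2.93), note = "demo")
path <- file.path(tempdir(), "demo.rds")
saveRDS(obj, path)
obj2 <- readRDS(path)
identical(obj, obj2)  # TRUE
```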

Additional arguments to MCMCdiag can be used to save other objects as .rds files, add custom fields to the .txt file (useful for making notes about a particular model run), and control how the summary information is presented (as with other functions in MCMCvis). See the package vignette for a more complete overview of MCMCvis functionality.

As a workflow

As I make changes to my analyses (e.g., data processing, model parameterization, priors), I can create a catalog of results by making MCMCdiag calls after each model run. Importantly, all the information required to reproduce the results from a given run (i.e., the data used to fit the model, the code used to fit the model, and environment information from sessionInfo) is contained within each of these directories, sparing one the pain of combing back through version control history to find out how the data were structured, what model was run, and what the results were at a given point in a project.

When looking back at model results, I generally want to know: 1) what the model was, 2) whether it fit properly, and 3) what the parameter estimates were. The organizational structure facilitated by MCMCdiag makes this achievable with minimal code.

Code to simulate data and fit model

Simulate data

#set seed
set.seed(1)

#number of data points
N <- 1000

#simulated predictor
x_sim <- seq(-5, 5, length.out = N)

#intercept generating value
alpha_sim <- 2

#slope generating value
beta_sim <- 3

#generated error (noise)
eps_sim <- rnorm(length(x_sim), 0, 7)

#simulate response
y_sim <- alpha_sim + beta_sim * x_sim + eps_sim

Stan model (model.stan)

data {
int<lower=0> N;
vector[N] y;
vector[N] x;
}

parameters {
real alpha;
real beta;
real<lower=0> sigma;
}

model {
alpha ~ normal(0, 30);
beta ~ normal(0, 30);
sigma ~ normal(0, 30);

y ~ normal(alpha + beta * x, sigma);
}

Fit model

library(rstan)

DATA <- list(N = length(y_sim),
             y = y_sim,
             x = x_sim)

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

fit <- stan('model.stan',
            data = DATA,
            iter = 2000,
            chains = 4,
            cores = 4,
            pars = c('alpha',
                     'beta',
                     'sigma'), 
            control = list(adapt_delta = 0.95,
                           max_treedepth = 13,
                           stepsize = 0.01))

More information on the author: Casey Youngflesh

Posted in R

`MCMCvis` 0.13.1 – HPD intervals, multimodel visualization, and more!

MCMCvis version 0.13.1 is now on CRAN, complete with new features for summarizing and visualizing MCMC output. See HERE and HERE for posts that cover general package features. Highlights since the last CRAN version include:

  • MCMCsummary – Specify quantiles for equal-tailed credible intervals and return highest posterior density (HPD) intervals.
  • MCMCplot – Visualize posteriors from two different models on a single caterpillar plot. Useful when comparing parameter estimates across models and for visualizing shrinkage.
  • MCMCplot – Produce guide lines for caterpillar plots to more easily match parameter names to caterpillars.
  • MCMCplot – Specify colors for caterpillar plots.
  • All functions – Summarize and visualize model output from stanreg objects produced with the rstanarm package and brms objects produced with the brms package (in addition to output produced with the rstan, rjags, R2jags, and jagsUI packages).

Below we demonstrate how one might use some of these features using simulated data and simple models fit in JAGS.

Simulate some data

Let’s simulate data for 12 different groups using a generating process that assumes the means of these groups arise from a normal distribution with a constant mean and variance. In other words, we treat group as a ‘random effect’. We will make the groups differ in size to highlight how a random intercepts model will borrow strength compared to a model where groups are not modeled hierarchically, i.e. groups are treated as fixed effects.

# adapted from Marc Kery's WinBUGS for Ecologists (2010)
set.seed(1)
n_groups <- 12
n_obs <- rpois(n_groups, 8)
n <- sum(n_obs)
mu_group <- 20
sigma_group <- 5 
sigma <- 8
alpha <- rnorm(n_groups, mu_group, sigma_group)
epsilon <- rnorm(n, 0, sigma)
x <- rep(1:n_groups, n_obs)
X <- as.matrix(model.matrix(~as.factor(x) -1))
y <- as.numeric(X %*% as.matrix(alpha) + epsilon)
boxplot(y~x)

[Figure: boxplot of the simulated response by group]

Fit two models to these data

First, let’s fit a simple means model to these data in JAGS where the means are not modeled hierarchically. We will use diffuse priors and let JAGS pick the initial values for simplicity’s sake.

{
sink("model1.R")
cat("
model {
# priors
for (i in 1:n_groups) { 
alpha[i] ~ dnorm(0, 0.001) 
}
sigma ~ dunif(0, 100)
tau <- 1 / sigma^2

# likelihood
for (i in 1:n) {
y[i] ~ dnorm(alpha[x[i]], tau)
}
}
", fill = TRUE)
sink()
}
data <- list(y = y, x = x, n = n, n_groups = n_groups)
params <- c("alpha", "sigma")
gvals <- c(alpha, sigma)
jm <- rjags::jags.model("model1.R", data = data, n.chains = 3, n.adapt = 5000)
jags_out1 <- rjags::coda.samples(jm, variable.names = params, n.iter = 10000)

Next we fit a random intercepts model.

{
sink("model2.R")
cat("
model {
# priors
for (i in 1:n_groups) { 
alpha[i] ~ dnorm(mu_group, tau_group) 
}
mu_group ~ dnorm(0, 0.001)
sigma_group ~ dunif(0, 100)
tau_group <- 1 / sigma_group^2
sigma ~ dunif(0, 100)
tau <- 1 / sigma^2

# likelihood
for (i in 1:n) {
y[i] ~ dnorm(alpha[x[i]], tau)
}
}
", fill = TRUE)
sink()
}
data <- list(y = y, x = x, n = n, n_groups = n_groups)
params <- c("alpha", "mu_group", "sigma", "sigma_group")
gvals <- c(alpha, mu_group, sigma, sigma_group)
jm <- rjags::jags.model("model2.R", data = data, n.chains = 3, n.adapt = 5000)
jags_out2 <- rjags::coda.samples(jm, variable.names = params, n.iter = 10000)

MCMCsummary – custom equal-tailed and HPD intervals

Sample quantiles in MCMCsummary can now be specified directly using the probs argument (removing the need to define custom quantiles with the func argument). The default behavior is to provide 2.5%, 50%, and 97.5% quantiles. These probabilities can be changed by supplying a numeric vector to the probs argument. Here we ask for 90% equal-tailed credible intervals.

MCMCsummary(jags_out1, 
            params = 'alpha', 
            probs = c(0.1, 0.5, 0.9), 
            round = 2)
##            mean   sd   10%   50%   90% Rhat n.eff
## alpha[1]  23.82 3.02 19.97 23.82 27.68    1 30385
## alpha[2]  23.80 2.81 20.25 23.81 27.37    1 30000
## alpha[3]  21.83 2.63 18.49 21.83 25.18    1 29699
## alpha[4]  20.04 2.16 17.30 20.04 22.79    1 30000
## alpha[5]  29.47 3.02 25.58 29.50 33.30    1 29980
## alpha[6]  23.01 2.14 20.28 23.00 25.75    1 30305
## alpha[7]  16.20 2.06 13.60 16.20 18.84    1 30000
## alpha[8]  10.13 2.49  6.95 10.13 13.33    1 28967
## alpha[9]  26.95 2.48 23.79 26.96 30.12    1 30000
## alpha[10] 16.85 3.71 12.05 16.86 21.59    1 30262
## alpha[11] 26.10 3.03 22.24 26.11 29.96    1 30473
## alpha[12] 23.63 3.32 19.40 23.63 27.85    1 29627

Highest posterior density (HPD) intervals can now be displayed instead of equal-tailed intervals by using HPD = TRUE. This uses the HPDinterval function from the coda package to compute intervals based on the probability specified in the hpd_prob argument (this argument is different from the probs argument, which is reserved for quantiles). Below we request 90% highest posterior density intervals.
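MCMCsummary delegates this computation to coda::HPDinterval, but the idea behind an HPD interval is simple enough to sketch in base R: among all windows that contain a given fraction of the sorted draws, take the narrowest. This sketch is for illustration only, not the coda implementation:

```r
# Narrowest window containing prob * n of the sorted draws
hpd <- function(draws, prob = 0.9) {
  s <- sort(draws)
  n <- length(s)
  k <- ceiling(prob * n)               # number of draws the interval must cover
  widths <- s[k:n] - s[1:(n - k + 1)]  # width of every candidate window of k draws
  i <- which.min(widths)
  c(lower = s[i], upper = s[i + k - 1])
}

set.seed(1)
out <- hpd(rnorm(10000), 0.9)
out  # approximately c(-1.645, 1.645) for a standard normal
```

For a symmetric, unimodal posterior the HPD and equal-tailed intervals coincide; they differ when the posterior is skewed, which is when HPD = TRUE is most useful.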

MCMCsummary(jags_out1, 
            params = 'alpha', 
            HPD = TRUE, 
            hpd_prob = 0.9, 
            round = 2)
##            mean   sd 90%_HPDL 90%_HPDU Rhat n.eff
## alpha[1]  23.82 3.02    18.86    28.75    1 30385
## alpha[2]  23.80 2.81    19.20    28.45    1 30000
## alpha[3]  21.83 2.63    17.58    26.17    1 29699
## alpha[4]  20.04 2.16    16.53    23.63    1 30000
## alpha[5]  29.47 3.02    24.45    34.37    1 29980
## alpha[6]  23.01 2.14    19.66    26.63    1 30305
## alpha[7]  16.20 2.06    12.79    19.54    1 30000
## alpha[8]  10.13 2.49     6.12    14.31    1 28967
## alpha[9]  26.95 2.48    22.96    31.10    1 30000
## alpha[10] 16.85 3.71    10.79    22.98    1 30262
## alpha[11] 26.10 3.03    21.29    31.23    1 30473
## alpha[12] 23.63 3.32    18.01    28.94    1 29627

MCMCplot – multimodel visualization

Posterior estimates from two different models can also now be visualized with MCMCplot. Let’s visually compare the alpha parameter posterior samples from the fixed effect and random intercepts models to see how our choice of model affected the posterior estimates. Below we can see that the random intercepts model means (in red) are pulled towards the true (generating value) grand mean (the vertical dashed line) compared to the fixed intercepts model means (in black), as expected due to shrinkage.

MCMCplot(object = jags_out1, 
         object2 = jags_out2, 
         params = 'alpha', 
         offset = 0.2, 
         ref = mu_group)

[Figure: caterpillar plot comparing alpha estimates from the two models]

Guide lines can also be produced (using the guide_lines argument) to help guide the eye from the parameter name to the associated caterpillar, which can be useful when trying to visualize a large number of parameters on a single plot. Colors can also be specified for each parameter (or the entire set of parameters).

MCMCplot(object = jags_out1, 
         params = 'alpha', 
         offset = 0.2, 
         guide_lines = TRUE, 
         ref = NULL, 
         col = topo.colors(12),
         sz_med = 2.5,
         sz_thick = 9,
         sz_thin = 4)

[Figure: caterpillar plot with guide lines and custom colors]

Check out the full package vignette HERE. The MCMCvis source code can be found HERE. More information on authors: Christian Che-Castaldo, Casey Youngflesh.


Check your prior posterior overlap (PPO) – MCMC wrangling in R made easy with `MCMCvis`

When fitting a Bayesian model using MCMC (often via JAGS/BUGS/Stan), a number of checks are typically performed to make sure your model is worth interpreting without further manipulation (remember: all models are wrong, some are useful!):

  • R-hat (AKA Gelman-Rubin statistic) – used to assess convergence of chains in the model
  • Visual assessment of chains – used to assess whether posterior chains mixed well (convergence)
  • Visual assessment of posterior distribution shape – used to determine if the posterior distribution is constrained
  • Posterior predictive check (predicting data using estimated parameters) – used to make sure that the model can generate data resembling the data used to fit it

PPO

One check, however, is often missing: a robust assessment of the degree to which the prior is informing the posterior distribution. Substantial influence of the prior on the posterior may not be apparent through the use of R-hat and visual checks alone. Version 0.9.2 of MCMCvis (now available on CRAN) makes quantifying and plotting the prior posterior overlap (PPO) simple.

MCMCvis is an R package designed to streamline analysis of Bayesian model results derived from MCMC samplers (e.g., JAGS, BUGS, Stan). It can be used to easily visualize, manipulate, and summarize MCMC output. The newest version is full of new features – a full tutorial can be found here.

An example

To check PPO for a model, we will use the function MCMCtrace. As the function is used to generate trace and density plots, checking for PPO is barely more work than just doing the routine checks that one would ordinarily perform. The function plots trace plots on the left and density plots for both the posterior (black) and prior (red) distributions on the right. The function calculates the percent overlap between the prior and posterior and prints this value on the plot. See ?MCMCtrace in R for details regarding the syntax.

#install package
install.packages('MCMCvis', repos = "http://cran.case.edu")

#load package
require(MCMCvis)

#load example data
data(MCMC_data)

#simulate data from the prior used in your model
#number of iterations should equal the number of draws times the number of chains
#(the function will adjust if the correct number of iterations is not specified)
#in JAGS: parameter ~ dnorm(0, 0.001)
PR <- rnorm(15000, 0, 32)

#run the function for just beta parameters
MCMCtrace(MCMC_data, params = 'beta', priors = PR, pdf = FALSE)

[Figure: trace plots (left) and posterior/prior density plots with percent overlap (right)]
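The standard deviation of 32 used to simulate PR above comes from converting parameterizations: dnorm in JAGS takes a precision (1/variance) as its second argument, while rnorm in R takes a standard deviation.

```r
# JAGS prior: parameter ~ dnorm(0, 0.001), where 0.001 is a precision
tau <- 0.001
sd_prior <- 1 / sqrt(tau)
sd_prior  # 31.62..., rounded up to 32 in the rnorm call above
```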

Why check?

Checking the PPO has particular utility when trying to determine whether the parameters in your model are identifiable. If substantial PPO exists, the prior may simply be dictating the posterior distribution, with the data having little influence on the results. If a small degree of PPO exists, the data were informative enough to overcome the influence of the prior. In the field of ecology, nonidentifiability is a particular concern in some types of mark-recapture models. Gimenez (2009) developed quantitative guidelines to determine when parameters are robustly identifiable using PPO.
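MCMCtrace computes the overlap for you; conceptually, the percent overlap is the shared area under the two density curves. A rough base-R sketch of that calculation, using stand-in draws (this is the general idea, not the package's exact implementation):

```r
set.seed(1)
post <- rnorm(15000, 2, 0.5)    # stand-in posterior draws
prior <- rnorm(15000, 0, 32)    # draws from a diffuse prior

# evaluate both kernel density estimates on a common grid
grid <- seq(min(post, prior), max(post, prior), length.out = 1000)
d_post <- approx(density(post), xout = grid)$y
d_prior <- approx(density(prior), xout = grid)$y
d_post[is.na(d_post)] <- 0
d_prior[is.na(d_prior)] <- 0

# integrate the pointwise minimum of the two densities
overlap <- sum(pmin(d_post, d_prior)) * (grid[2] - grid[1])
round(100 * overlap, 1)  # percent overlap (small here: diffuse prior, tight posterior)
```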

While a large degree of PPO is not always a bad thing (e.g., substantial prior knowledge about the system may result in very informative priors), it is important to know where the data were and were not informative for parameter estimation. The degree of PPO that is acceptable for a particular model will depend on a great number of factors, and may be somewhat subjective (but see Gimenez [2009] for a less subjective case). Like other checks, PPO is just one of many tools to be used for model assessment. Finding substantial PPO where it is unexpected may suggest that further model manipulation is needed. Happy model building!

Other MCMCvis improvements

Check out the rest of the new package features, including the option to calculate the number of effective samples for each parameter, the ability to supply a ‘regular expression’ to the params argument, and the ability to retain the structure of all parameters in model output (e.g., parameters specified as matrices in the model are summarized as matrices).

Follow Casey Youngflesh on Twitter @caseyyoungflesh. The MCMCvis source code can be found on GitHub.

Visualizing and wrangling MCMC output in R with `MCMCvis`

Model results can be thought of as a reward for the many hours of model design, troubleshooting, re-design, etc. that analyses often require. Following this potentially exhausting mental exercise, I think we’d all like the interpretation of results to be as straightforward as possible. Analyzing MCMC output from Bayesian analyses, however, which may include hundreds of parameters and/or derived quantities, can often require a fair amount of code and (more importantly) time.

The MCMCvis package was designed to alleviate this problem, and streamline the analysis of Bayesian model results. The latest version (0.7.1) is now available on CRAN with bug fixes, speed improvements, and added functionality.


Why MCMCvis?

Using MCMCvis provides three principal benefits:

1) MCMC output fit with Stan, JAGS, or other MCMC samplers can be fed into all package functions as an argument with no further manipulation needed. No need to specify the type of object or how it was fit; the package does all of that behind the scenes.

2) Specific parameters or derived quantities of interest can be specified within each function, avoiding additional data processing steps. This works using a grep-like call for optimal efficiency.

3) The package creates ‘publication-ready’ posterior estimate visualizations (below). Parameters can now be plotted vertically or horizontally.

[Figure: example caterpillar plots, vertical and horizontal]
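The grep-like matching means a params string selects every parameter whose name contains it; a base-R analogue of what happens under the hood (illustrative only, with made-up parameter names):

```r
# All parameters whose names contain 'beta' are selected.
par_names <- c("alpha", "beta[1]", "beta[2]", "beta[10]", "sigma")
grep("beta", par_names, value = TRUE)
# "beta[1]"  "beta[2]"  "beta[10]"
```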

The package has four functions for basic MCMC output tasks:

MCMCsummary – summarize MCMC output for particular parameters of interest

MCMCtrace – create trace and density plots of MCMC chains for particular parameters of interest

MCMCchains – easily extract posterior chains from MCMC output for particular parameters of interest

MCMCplot – create caterpillar plots from MCMC output for particular parameters of interest

The vignette can be found here.


An example workflow may go as follows:

– summarize posterior estimates for just beta parameters

#install package
install.packages('MCMCvis')

#load package
require(MCMCvis)

#load example data
data(MCMC_data)

#run summary function
MCMCsummary(MCMC_data, 
             params = 'beta')
##            mean   2.5%    50% 97.5% Rhat
## beta[1]    0.16   0.06   0.15  0.25    1
## beta[2]   -7.77 -25.82  -7.68  9.78    1
## beta[3]   -5.64 -28.53  -5.76 17.23    1
## beta[4]  -10.39 -25.98 -10.63  5.27    1
## beta[5]    7.52   6.03   7.52  9.05    1
## beta[6]   10.89  10.10  10.89 11.68    1
## beta[7]   -1.91  -4.83  -1.92  1.08    1
## beta[8]    5.38  -6.86   5.45 17.67    1
## beta[9]   13.39   3.28  13.38 23.60    1
## beta[10]  17.63  14.41  17.63 20.86    1

– check posteriors for convergence

MCMCtrace(MCMC_data, 
           params = c('beta[1]', 'beta[2]', 'beta[3]'), 
           ind = TRUE)

[Figure: trace and density plots for beta[1] through beta[3]]

– extract chains for beta parameters so that they can be manipulated directly

ex <- MCMCchains(MCMC_data, params = 'beta')
#find 22nd quantile for all beta parameters
apply(ex, 2, function(x){round(quantile(x, probs = 0.22), digits = 2)})
##  beta[1]  beta[2]  beta[3]  beta[4]  beta[5]  beta[6]  beta[7]  beta[8]  beta[9] beta[10] 
##    0.12   -14.86   -14.80   -16.48     6.91    10.58    -3.09     0.68     9.29    16.36

– create caterpillar plots for posterior estimates. Shading represents whether the 50% CI (gray with open circle), the 95% CI (gray with closed circle), or neither (black) overlaps 0. This option can be turned off (as below). A variety of options exist, including the ability to plot posteriors vertically rather than horizontally.

MCMCplot(MCMC_data, 
          params = 'beta', 
          horiz = FALSE,
          rank = TRUE,
          ref_ovl = FALSE,
          xlab = 'My x-axis label', 
          main = 'MCMCvis plot', 
          labels = c('First param', 'Second param', 'Third param', 
          'Fourth param', 'Fifth param', 'Sixth param', 'Seventh param', 
          'Eighth param', 'Ninth param', 'Tenth param'), 
          labels_sz = 1.5, med_sz = 2, thick_sz = 7, thin_sz = 3, 
          ax_sz = 4, main_text_sz = 2)

[Figure: customized caterpillar plot]

Follow Casey Youngflesh on Twitter @caseyyoungflesh. The MCMCvis source code can be found on GitHub.

R package `MCMCvis` now on CRAN

MCMCvis, an R package for visualizing, manipulating, and summarizing MCMC output is now available on CRAN!

It can be installed with: install.packages('MCMCvis')

The vignette (tutorial) can be run with: vignette('MCMCvis')

MCMCvis was designed to perform key functions for MCMC analysis using minimal code, in order to free up time/brainpower for interpretation of analysis results. Functions support simple and straightforward subsetting of model parameters within the calls, and produce presentable and ‘publication-ready’ output. Model output can be from JAGS, Stan, or other MCMC samplers. The package deals with different data types behind the scenes so you don’t have to think about it. You can specify which parameters you want to visualize or extract within the functions.

For example, this plot can be made using just one line of code (after loading package/data, of course)!

library('MCMCvis') #load package

data(MCMC_data) #load example data

MCMCplot(MCMC_data, params = 'beta') #create plot


[Figure: caterpillar plot for beta parameters]

More information can be found within the package vignette. The package was created and authored by Lynch Lab PhD candidate, Casey Youngflesh. He can be found here on Github and here on Twitter.