Title: | Combining Subset MCMC Samples to Estimate a Posterior Density |
---|---|
Description: | See Miroshnikov and Conlon (2014) <doi:10.1371/journal.pone.0108425>. Recent Bayesian Markov chain Monto Carlo (MCMC) methods have been developed for big data sets that are too large to be analyzed using traditional statistical methods. These methods partition the data into non-overlapping subsets, and perform parallel independent Bayesian MCMC analyses on the data subsets, creating independent subposterior samples for each data subset. These independent subposterior samples are combined through four functions in this package, including averaging across subset samples, weighted averaging across subsets samples, and kernel smoothing across subset samples. The four functions assume the user has previously run the Bayesian analysis and has produced the independent subposterior samples outside of the package; the functions use as input the array of subposterior samples. The methods have been demonstrated to be useful for Bayesian MCMC models including Bayesian logistic regression, Bayesian Gaussian mixture models and Bayesian hierarchical Poisson-Gamma models. The methods are appropriate for Bayesian hierarchical models with hyperparameters, as long as data values in a single level of the hierarchy are not split into subsets. |
Authors: | Alexey Miroshnikov, Erin Conlon |
Maintainer: | Erin Conlon <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.0 |
Built: | 2025-02-15 05:25:04 UTC |
Source: | https://github.com/cran/parallelMCMCcombine |
Recent Bayesian Markov chain Monto Carlo (MCMC) methods have been developed for big data sets that are too large to be analyzed using traditional statistical methods. These methods partition the data into non-overlapping subsets, and perform parallel independent Bayesian MCMC analyses on the data subsets, creating independent subposterior samples for each data subset. These independent subposterior samples are combined through four functions in this package, including averaging across subset samples, weighted averaging across subsets samples, and kernel smoothing across subset samples. The four functions assume the user has previously run the Bayesian analysis and has produced the independent subposterior samples outside of the package; the functions use as input the array of subposterior samples. The methods have been demonstrated to be useful for Bayesian MCMC models including Bayesian logistic regression, Bayesian Gaussian mixture models and Bayesian hierarchical Poisson-Gamma models. The methods are appropriate for Bayesian hierarchical models with hyperparameters, as long as data values in a single level of the hierarchy are not split into subsets.
Package: | parallelMCMCcombine |
Type: | Package |
Version: | 1.0 |
Date: | 2021-06-18 |
License: | GPL-2 |
The package contains the following functions:
consensusMCcov
Consensus Monte Carlo Algorithm (for correlated parameters)
consensusMCindep
Consensus Monte Carlo Algorithm (for independent parameters)
sampleAvg
Sample Averaging Method
semiparamDPE
Semiparametric Method
Alexey Miroshnikov, Erin Conlon
Maintainer: Erin Conlon [email protected]
Scott, S.L., Blocker, A. W., Bonassi (2013) Bayes and Big Data: The consensus Monte Carlo Algorithm. Bayes 250.
Neiswanger, W., Wang, C., Xing, E. (2014) Asymptotically exact, embarrassingly parallel MCMC. arXiv:1311.4780v2.
Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC. pp. 7-11.
The function uses the Consensus Monte Carlo algorithm introduced by Scott et al. (see References) to combine the independent subset posterior samples subchains into the set of samples that estimate the posterior density given the full data set. The Consensus Monte Carlo algorithm uses a weighted average of the subset posterior samples to produce the combined posterior samples, where the weights are based on the inverse variance-covariance matrix of the subset posterior samples.
consensusMCcov(subchain, shuff = FALSE)
consensusMCcov(subchain, shuff = FALSE)
subchain |
array of subset posterior samples of the dimension c(d,sampT,M). Here d is the dimension of the parameter space, sampT is the number of samples, and M is the number of subposterior datasets. |
shuff |
shuff: logical; if TRUE, each of the M subsets of d dimensional parameters in subchain is shuffled. |
The array subchain must have dimension c(d,sampT,M). Here d is the dimension of the parameter space, sampT is the number of samples, and M is the number of subposterior datasets.
Returns an array of samples of dimension dim=c(d,sampT) representing an estimated (combined) full posterior density.
Scott, S.L., Blocker, A. W., Bonassi (2013) Bayes and Big Data: The consensus Monte Carlo Algorithm. Bayes 250.
d <- 2 # dimension of the parameter space sampT <- 1000 # number of subset posterior samples M <- 3 # total number of subsets ## simulate Gaussian subposterior samples theta <- array(NA,c(d,sampT,M)) norm.mean <- c(1.0, 2.0) norm.sd <- c(0.5, 1.0) for (i in 1:d) for (s in 1:M) theta[i,,s] <- rnorm(sampT, mean=norm.mean[i]+runif(1,-0.01,0.01), sd=norm.sd[i]) ## combine samples: full.theta <- consensusMCcov(subchain=theta, shuff=FALSE)
d <- 2 # dimension of the parameter space sampT <- 1000 # number of subset posterior samples M <- 3 # total number of subsets ## simulate Gaussian subposterior samples theta <- array(NA,c(d,sampT,M)) norm.mean <- c(1.0, 2.0) norm.sd <- c(0.5, 1.0) for (i in 1:d) for (s in 1:M) theta[i,,s] <- rnorm(sampT, mean=norm.mean[i]+runif(1,-0.01,0.01), sd=norm.sd[i]) ## combine samples: full.theta <- consensusMCcov(subchain=theta, shuff=FALSE)
The function uses the Consensus Monte Carlo algorithm introduced by Scott et al. (see References) to combine the independent subset posterior samples subchains into the set of samples that estimate the posterior density given the full data set. The Consensus Monte Carlo algorithm uses a weighted average of the subset posterior samples to produce the combined posterior samples, where the weights are based on the inverse variance-covariance matrix of the subset posterior samples. Here, the model parameters are assumed to be independent, so that the covariance between model parameters is equal to zero.
consensusMCindep(subchain, shuff = FALSE)
consensusMCindep(subchain, shuff = FALSE)
subchain |
array of subset posterior samples of the dimension c(d,sampT,M). Here d is the dimension of the parameter space, sampT is the number of samples, and M is the number of subposterior datasets. |
shuff |
shuff: logical; if TRUE, each of the M subsets of d dimensional parameters in subchain is shuffled. |
The array subchain must have dimension c(d,sampT,M). Here d is the dimension of the parameter space, sampT is the number of samples, and M is the number of subposterior datasets.
Returns an array of samples of dimension dim=c(d,sampT) representing an estimated (combined) full posterior density.
Scott, S.L., Blocker, A. W., Bonassi (2013) Bayes and Big Data: The consensus Monte Carlo Algorithm. Bayes 250 day.
d <- 2 # dimension of the parameter space sampT <- 1000 # number of subset posterior samples M <- 3 # total number of subsets ## simulate Gaussian subposterior samples theta <- array(NA,c(d,sampT,M)) norm.mean <- c(1.0, 2.0) norm.sd <- c(0.5, 1.0) for (i in 1:d) for (s in 1:M) theta[i,,s] <- rnorm(sampT, mean=norm.mean[i]+runif(1,-0.01,0.01), sd=norm.sd[i]) ## combine samples: full.theta <- consensusMCindep(subchain=theta, shuff=FALSE)
d <- 2 # dimension of the parameter space sampT <- 1000 # number of subset posterior samples M <- 3 # total number of subsets ## simulate Gaussian subposterior samples theta <- array(NA,c(d,sampT,M)) norm.mean <- c(1.0, 2.0) norm.sd <- c(0.5, 1.0) for (i in 1:d) for (s in 1:M) theta[i,,s] <- rnorm(sampT, mean=norm.mean[i]+runif(1,-0.01,0.01), sd=norm.sd[i]) ## combine samples: full.theta <- consensusMCindep(subchain=theta, shuff=FALSE)
The function combines the independent subset posterior samples subchains into the set of samples that estimate the posterior density given the full data set, by averaging the samples across subsets. Individual model parameters are assumed to be independent.
sampleAvg(subchain, shuff = FALSE)
sampleAvg(subchain, shuff = FALSE)
subchain |
array of subset posterior samples of the dimension c(d,sampT,M). Here d is the dimension of the parameter space, sampT is the number of samples, and M is the number of subposterior datasets. |
shuff |
shuff: logical; if TRUE, each of the M subsets of d dimensional parameters in subchain is shuffled. |
The array subchain must have dimension c(d,sampT,M). Here d is the dimension of the parameter space, sampT is the number of samples, and M is the number of subposterior datasets.
Returns an array of samples of dimension dim=c(d,sampT) representing an estimated (combined) full posterior density.
d <- 2 # dimension of the parameter space sampT <- 1000 # number of subset posterior samples M <- 3 # total number of subsets ## simulate Gaussian subposterior samples theta <- array(NA,c(d,sampT,M)) norm.mean <- c(1.0, 2.0) norm.sd <- c(0.5, 1.0) for (i in 1:d) for (s in 1:M) theta[i,,s] <- rnorm(sampT, mean=norm.mean[i]+runif(1,-0.01,0.01), sd=norm.sd[i]) ## combine samples: full.theta <- sampleAvg(subchain=theta, shuff=FALSE)
d <- 2 # dimension of the parameter space sampT <- 1000 # number of subset posterior samples M <- 3 # total number of subsets ## simulate Gaussian subposterior samples theta <- array(NA,c(d,sampT,M)) norm.mean <- c(1.0, 2.0) norm.sd <- c(0.5, 1.0) for (i in 1:d) for (s in 1:M) theta[i,,s] <- rnorm(sampT, mean=norm.mean[i]+runif(1,-0.01,0.01), sd=norm.sd[i]) ## combine samples: full.theta <- sampleAvg(subchain=theta, shuff=FALSE)
The function uses the Semiparametric Density Product Estimator method introduced by Neiswanger et al. (see References) to combine the independent subset posterior samples subchains into the set of samples that estimate the posterior density given the full data set. The semiparametric density product estimator method uses kernel smoothing techniques to estimate each subset posterior density; the subposterior densities are then multiplied together to approximate the posterior density based on the full data set.
semiparamDPE(subchain, bandw = rep(1.0, dim(subchain)[1]), anneal = TRUE, shuff = FALSE)
semiparamDPE(subchain, bandw = rep(1.0, dim(subchain)[1]), anneal = TRUE, shuff = FALSE)
subchain |
array of subset posterior samples of the dimension c(d,sampT,M). Here d is the dimension of the parameter space, sampT is the number of samples, and M is the number of subposterior datasets. |
bandw |
bandwidth vector of the length d=dim(subchain)[1]. It is a vector of tuning parameters used in kernel density approximation employed by the semiparametric method. When anneal=TRUE then one of the choices for bandw could be the vector consisting of standard deviations for each of the d parameters. When anneal=FALSE then one of the choices for bandw could be the diagonal of the optimal bandwidth matrix obtained via Silverman's rule of thumb; see Examples. By default bandw=rep(1.0,d). |
anneal |
logical; if TRUE, the bandwidth bandw (instead of being fixed) is annealed as bandw*i^(-1/(4+d)); here i is the index corresponding to a sample; see References. |
shuff |
logical; if TRUE, each of the M subsets of d dimensional parameters in subchain is shuffled. |
Returns an array of samples of dimension dim=c(d,sampT) representing an estimated (combined) full posterior density.
Neiswanger, W., Wang, C., Xing E. (2014) Asymptotically exact, embarrassingly parallel MCMC. arXiv:1311.4780v2.
Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC. pp. 7-11.
d <- 2 # dimension of the parameter space sampT <- 300 # number of subset posterior samples M <- 3 # total number of subsets ## simulate Gaussian subposterior samples theta <- array(NA,c(d,sampT,M)) norm.mean <- c(1.0, 2.0) norm.sd <- c(0.5, 1.0) for (i in 1:d) for (s in 1:M) theta[i,,s] <- rnorm(sampT, mean=norm.mean[i]+runif(1,-0.01,0.01), sd=norm.sd[i]) ## estimate (mean) standard deviations for each parameter across the subsets norm.var.est <- rep(0,d) for(i in 1:d) for(s in 1:M) norm.var.est[i] <- norm.var.est[i] + var(theta[i,,s]) norm.sd.est <- sqrt(norm.var.est/M) ## Compute the diagonal of the optimal bandwidth ## matrix according to Silverman's rule h_opt1 = (4/(d+2))^(1/(4+d)) * (sampT^(-1/(4+d))) * norm.sd.est ## Combine samples. The bandwidth matrix is fixed: full.theta1 <- semiparamDPE( subchain = theta, bandw = h_opt1 * 2, anneal = FALSE) ## Compute the diagonal of the optimal bandwidth ## matrix for the method that uses annealing h_opt2 = (4/(d+2))^(1/(4+d)) * norm.sd.est ## Combine samples. The bandwidth matrix will be annealed: full.theta2 <- semiparamDPE(subchain = theta, bandw = h_opt2 * 2, anneal = TRUE)
d <- 2 # dimension of the parameter space sampT <- 300 # number of subset posterior samples M <- 3 # total number of subsets ## simulate Gaussian subposterior samples theta <- array(NA,c(d,sampT,M)) norm.mean <- c(1.0, 2.0) norm.sd <- c(0.5, 1.0) for (i in 1:d) for (s in 1:M) theta[i,,s] <- rnorm(sampT, mean=norm.mean[i]+runif(1,-0.01,0.01), sd=norm.sd[i]) ## estimate (mean) standard deviations for each parameter across the subsets norm.var.est <- rep(0,d) for(i in 1:d) for(s in 1:M) norm.var.est[i] <- norm.var.est[i] + var(theta[i,,s]) norm.sd.est <- sqrt(norm.var.est/M) ## Compute the diagonal of the optimal bandwidth ## matrix according to Silverman's rule h_opt1 = (4/(d+2))^(1/(4+d)) * (sampT^(-1/(4+d))) * norm.sd.est ## Combine samples. The bandwidth matrix is fixed: full.theta1 <- semiparamDPE( subchain = theta, bandw = h_opt1 * 2, anneal = FALSE) ## Compute the diagonal of the optimal bandwidth ## matrix for the method that uses annealing h_opt2 = (4/(d+2))^(1/(4+d)) * norm.sd.est ## Combine samples. The bandwidth matrix will be annealed: full.theta2 <- semiparamDPE(subchain = theta, bandw = h_opt2 * 2, anneal = TRUE)