Through this project, participants will gain structured insights into the principles and applications of Bayesian inference using R. The curriculum is designed to offer hands-on experience and step-by-step guidance, from the basics of Bayesian theory to complex model building and diagnostics using R and Stan. Each unit is self-contained, focusing on specific aspects of Bayesian analysis and practical implementation using R scripting.
The original prompt:
I’d like to see a detailed real world example of using this is R - Bayesian Analysis
Bayesian Inference: Leverage packages like rstan and Stan.
Bayesian inference is a method of statistical inference in which Bayes' Theorem is used to update the probability for a hypothesis as more evidence or information becomes available. This approach is particularly useful in real-world scenarios where we need to incorporate prior knowledge and continuously update our beliefs.
Bayes’ Theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. The formula for Bayes' Theorem is as follows:
[ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} ]
Where:
Before we dive into Bayesian analysis using R, make sure you have the necessary packages installed. The rstan
package provides an interface to Stan, a platform for statistical modeling and high-performance statistical computation.
To install the necessary R packages, use the following commands:
install.packages("rstan", repos = "https://cloud.r-project.org/", dependencies = TRUE)
install.packages("StanHeaders", repos = "https://cloud.r-project.org/")
Once installed, you need to load the packages in your R environment:
library(rstan)
library(StanHeaders)
Suppose you want to estimate the mean ((\mu)) and standard deviation ((\sigma)) of a normal distribution using Bayesian inference. Here’s a practical example to get you started.
Create a Stan model file, e.g., normal_model.stan
:
data {
int N; // number of observations
real y[N]; // observed data
}
parameters {
real mu; // mean
real sigma; // standard deviation
}
model {
y ~ normal(mu, sigma); // likelihood
mu ~ normal(0, 10); // prior for mean
sigma ~ cauchy(0, 5); // prior for standard deviation
}
Generate some sample data in R:
set.seed(123)
N <- 100
y <- rnorm(N, mean = 5, sd = 2)
data <- list(N = N, y = y)
Use rstan's stan
function to fit the model:
fit <- stan(
file = 'normal_model.stan', # Stan model file
data = data, # named list of data
iter = 2000, # number of iterations
chains = 4 # number of Markov chains
)
View the summary of the fitted model to inspect the posterior distributions:
print(fit)
plot(fit)
This section provided a practical introduction to Bayesian Theory and demonstrated how to perform Bayesian analysis using R and the rstan
package. By following the provided code examples, you should now have a basic setup and an understanding of how to implement Bayesian inference for a normal distribution.
In this unit, we will provide a guide for setting up your R environment to perform Bayesian inference using the rstan
package. Here are the steps to get started with rstan
and Stan
for Bayesian analysis in R.
Ensure you have the latest versions of R and RStudio installed on your machine.
Launch RStudio and execute the following commands to install the necessary R packages for Bayesian analysis using rstan
:
# Install RStan
install.packages("rstan", dependencies = TRUE)
# Install additional required packages
install.packages("StanHeaders")
install.packages("ggplot2")
install.packages("bayesplot")
# Optional: Install the development version of rstan if necessary
# devtools::install_github("stan-dev/rstan", ref = "develop",
# subdir = "rstan/rstan")
Configure rstan
to avoid recompilation of unchanged models and to set up the number of cores for parallel execution:
# Load rstan and configure
library(rstan)
# Set up options for rstan
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
Run the following R code to verify that rstan
is correctly installed and functioning:
# Example Stan model code (eight schools example)
stan_code <- "
data {
int J; // number of schools
real y[J]; // estimated treatment effects
real sigma[J]; // standard error of effect estimates
}
parameters {
real mu; // overall treatment effect
real tau; // standard deviation in treatment effects
real eta[J]; // school-specific deviations from mu
}
transformed parameters {
real theta[J];
for (j in 1:J)
theta[j] = mu + tau * eta[j];
}
model {
eta ~ normal(0, 1);
y ~ normal(theta, sigma);
}
"
# Prepare data for the example (eight schools data)
schools_data <- list(
J = 8,
y = c(28, 8, -3, 7, -1, 1, 18, 12),
sigma = c(15, 10, 16, 11, 9, 11, 10, 18)
)
# Fit Stan model
fit <- stan(model_code = stan_code, data = schools_data)
# Print fit summary
print(fit)
Executing the above script should provide an output summarizing the posterior distributions of the parameters in the model, confirming that your setup works correctly.
With your environment correctly set up, you can now write your Bayesian inference models using rstan
. Here is an example of how you would encapsulate the steps into a repeatable function:
# Define a function to run a Bayesian inference model
run_bayesian_model <- function(model_code, data) {
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
fit <- stan(model_code = model_code, data = data)
return(fit)
}
# Example usage with eight schools data
stan_code <- "
data {
int J;
real y[J];
real sigma[J];
}
parameters {
real mu;
real tau;
real eta[J];
}
transformed parameters {
real theta[J];
for (j in 1:J)
theta[j] = mu + tau * eta[j];
}
model {
eta ~ normal(0, 1);
y ~ normal(theta, sigma);
}
"
schools_data <- list(
J = 8,
y = c(28, 8, -3, 7, -1, 1, 18, 12),
sigma = c(15, 10, 16, 11, 9, 11, 10, 18)
)
fit <- run_bayesian_model(stan_code, schools_data)
print(fit)
Following these steps ensures that your R environment is ready for Bayesian analysis using rstan
and Stan
. You can now proceed with writing and analyzing your own Bayesian models.
Unit 3: Basics of Bayesian Inference in R
We'll use a simple example to demonstrate Bayesian inference in R. Let's say we want to estimate the probability of success ( \theta ) in a series of Bernoulli trials (e.g., the probability of getting heads in a series of coin flips).
Ensure you have the required R packages:
install.packages("rstan")
library(rstan)
Create a model in the Stan language. Here is a simple Bernoulli-Beta model:
// bernoulli_beta.stan
data {
int N; // Number of trials
int y[N]; // Successes (0 or 1)
}
parameters {
real theta; // Probability of success
}
model {
theta ~ beta(1, 1); // Prior
y ~ bernoulli(theta); // Likelihood
}
Write the R code to prepare data and run the model:
# Simulated data: 10 coin flips with 7 heads
N <- 10
y <- c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0)
data <- list(N = N, y = y)
# Compile the Stan model
model_code <- "
data {
int N;
int y[N];
}
parameters {
real theta;
}
model {
theta ~ beta(1, 1);
y ~ bernoulli(theta);
}
"
stan_model <- stan(model_code = model_code, data = data, iter = 2000, chains = 4)
print(stan_model)
RStan provides tools to visualize and inspect the output:
# Extract samples
extracted <- extract(stan_model)
# Plot posterior distribution for theta
hist(extracted$theta, breaks=30, probability=TRUE,
main="Posterior distribution of theta",
xlab=expression(theta))
# Add a density estimate
lines(density(extracted$theta), col='blue')
This basic example introduces Bayesian inference using R and Stan. The key steps involve defining your model in Stan, preparing your data in R, running the model, and visualizing the results. This forms the foundation on which more complex models and data sets can be tackled in future units of your project.
This section covers practical data preprocessing and cleaning techniques using R. The goal is to prepare your data for Bayesian analysis.
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
# Replace 'file_path' with your actual dataset path
data <- read.csv('file_path.csv')
# View the first few rows of the dataset
head(data)
# Get summary statistics of the dataset
summary(data)
# Check the structure of the dataset
str(data)
# Identify missing values in the dataset
missing_values <- sapply(data, function(x) sum(is.na(x)))
print(missing_values)
# Example: Impute missing values with median for numeric columns
data <- data %>%
mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
# Example: Impute missing values with mode for categorical columns
mode_impute <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
data <- data %>%
mutate(across(where(is.factor), ~ ifelse(is.na(.), mode_impute(.), .)))
data <- data %>%
distinct()
# Example using IQR method for numerical columns
outliers <- function(x) {
Q1 <- quantile(x, 0.25)
Q3 <- quantile(x, 0.75)
IQR <- Q3 - Q1
x < (Q1 - 1.5 * IQR) | x > (Q3 + 1.5 * IQR)
}
# Apply to each numeric column
outliers_data <- sapply(data %>% select(where(is.numeric)), outliers)
print(outliers_data)
# Replace outliers with NA and then impute, or simply remove the rows
data <- data %>%
mutate(across(where(is.numeric), ~ ifelse(outliers(.), NA, .))) %>%
mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
# Standardize numerical columns
data <- data %>%
mutate(across(where(is.numeric), scale))
# Normalize numerical columns
normalize <- function(x) {
(x - min(x)) / (max(x) - min(x))
}
data <- data %>%
mutate(across(where(is.numeric), normalize))
# Identify and convert date columns
data <- data %>%
mutate(across(where(is.character), ~ parse_date_time(., orders = c('ymd', 'dmy', 'mdy'))))
data <- data %>%
mutate(across(where(is.factor), as.character)) %>%
pivot_wider(names_from = where(is.character), values_from = last_col(), names_sep = "_")
data <- data %>%
mutate(across(where(is.character), as.factor)) %>%
mutate(across(where(is.factor), as.integer))
data <- data %>%
rename_with(~ str_replace_all(., "[^[:alnum:]]", "_"))
# Inspect the cleaned dataframe
head(data)
summary(data)
str(data)
Now your data is cleaned and preprocessed, ready for Bayesian analysis using R packages such as rstan
and Stan
.
In this section, we will construct a simple Bayesian model using the rstan
package. We will create a Bayesian linear regression model as an example.
Make sure you have the rstan
package installed and loaded in your R environment:
install.packages("rstan")
library(rstan)
For this example, let's assume we have a simple dataset with one predictor variable x
and one response variable y
.
# Simulated Data
set.seed(123)
x <- rnorm(100, mean = 5, sd = 2)
y <- 3 + 2 * x + rnorm(100)
data <- list(
N = length(y),
x = x,
y = y
)
Save the following Stan model to a file named simple_linear_regression.stan
:
data {
int N; // number of observations
vector[N] x; // predictor variable
vector[N] y; // response variable
}
parameters {
real alpha; // intercept
real beta; // slope
real sigma; // standard deviation of errors
}
model {
y ~ normal(alpha + beta * x, sigma); // likelihood
}
model_code <- "
data {
int N; // number of observations
vector[N] x; // predictor variable
vector[N] y; // response variable
}
parameters {
real alpha; // intercept
real beta; // slope
real sigma; // standard deviation of errors
}
model {
y ~ normal(alpha + beta * x, sigma); // likelihood
}
"
stan_model <- stan_model(model_code = model_code)
fit <- sampling(
object = stan_model,
data = data,
iter = 2000,
chains = 4,
seed = 123
)
print(fit)
plot(fit)
posterior_samples <- extract(fit)
alpha_samples <- posterior_samples$alpha
beta_samples <- posterior_samples$beta
sigma_samples <- posterior_samples$sigma
summary(alpha_samples)
summary(beta_samples)
summary(sigma_samples)
hist(alpha_samples, main = "Posterior Distribution of Alpha", xlab = "Alpha")
hist(beta_samples, main = "Posterior Distribution of Beta", xlab = "Beta")
hist(sigma_samples, main = "Posterior Distribution of Sigma", xlab = "Sigma")
You now have a simple Bayesian linear regression model built using R and rstan
. This model can be extended to more complex cases by modifying the Stan code and the R implementation to handle additional variables and parameters.
This section covers the implementation of advanced Bayesian models using rstan
and Stan
in R for real-world applications.
A hierarchical Bayesian model allows us to handle data that has a nested structure. Here is a practical implementation using rstan
and Stan
.
Ensure you have the necessary packages:
install.packages("rstan")
library(rstan)
Create a file named hierarchical_model.stan
and define your model.
data {
int N; // number of data points
int J; // number of groups
int group[N]; // group indicator
vector[N] y; // observations
}
parameters {
real mu; // overall mean
real tau; // between-group standard deviation
vector[J] alpha; // group means offsets
real sigma; // within-group standard deviation
}
model {
tau ~ normal(0, 5);
alpha ~ normal(0, tau);
mu ~ normal(0, 10);
sigma ~ cauchy(0, 2.5);
for (n in 1:N)
y[n] ~ normal(mu + alpha[group[n]], sigma);
}
Suppose you have a dataset data
with columns y
for observations and group
for group indicators.
data <- read.csv("path_to_your_data.csv")
stan_data <- list(
N = nrow(data),
J = length(unique(data$group)),
group = as.integer(data$group),
y = data$y
)
Compile and fit the model using rstan
.
stan_model <- stan_model("hierarchical_model.stan")
fit <- sampling(stan_model, data = stan_data, iter = 2000, chains = 4)
print(fit)
Extract the results and visualize convergences and estimates.
library(ggplot2)
# Extract posterior samples
posterior <- extract(fit)
# Plot group means
alpha_means <- colMeans(posterior$alpha)
data$group_means <- alpha_means[data$group]
ggplot(data, aes(x = factor(group), y = y)) +
geom_point() +
geom_point(aes(y = group_means), color = "red", size = 3) +
labs(title = "Group Observations and Group Means",
x = "Group",
y = "Observations") +
theme_minimal()
By following these steps, you can implement an advanced hierarchical Bayesian model using rstan
and Stan
in R. This example can be adapted and extended based on the specific requirements of your real-world scenarios.
In this section, we will focus on practical techniques for diagnosing and validating Bayesian models implemented using the rstan
package in R. Proper diagnostics and validation are critical to ensuring that our models fit the data well and that the inferences we draw are reliable.
# Load necessary libraries
library(rstan)
library(bayesplot)
library(loo)
library(posterior)
Assuming that you have already prepared your data and defined your Stan model, you would typically fit your model as follows:
fit <- stan(file = 'model.stan', data = data, iter = 2000, warmup = 500, chains = 4, seed = 123)
Trace plots help you to visualize the Markov chains' behavior. Ideally, the chains should mix well and converge, appearing as a "hairy caterpillar."
traceplot(fit)
The potential scale reduction factor (R-hat) indicates how well the chains have converged. R-hat values close to 1 indicate convergence.
print(fit, pars = c("variable_name"), probs = c(0.025, 0.5, 0.975))
ESS measures the number of independent draws from the posterior distribution.
summary(fit)$summary[,"n_eff"]
Posterior predictive checks allow us to assess how well our model replicates the observed data.
# Extract posterior samples
posterior_samples <- extract(fit)
# Generate posterior predictive samples
y_rep <- posterior_predict(fit)
# Plot posterior predictive checks
ppc_dens_overlay(data$y, y_rep)
We can use the loo
package to perform approximate leave-one-out cross-validation to compare models.
# Compute LOO
loo_fit <- loo(fit)
# Print LOO results
print(loo_fit)
# Assuming we have another model fit
loo_fit2 <- loo(fit2)
# Compare models
loo_compare(loo_fit, loo_fit2)
Finally, summarizing the posterior distributions of the parameters gives insight into the estimated values and their uncertainties.
print(fit, probs = c(0.025, 0.5, 0.975))
# Visualize posterior distributions
mcmc_areas(as.array(fit), pars = c("parameter1", "parameter2", "parameter3"), prob = 0.8)
These steps provide a practical implementation for diagnosing and validating Bayesian models using rstan
in R. Following these practices ensures that your model is robust, reliable, and provides meaningful inferences from your data.
Once you've run your Bayesian models using rstan
or Stan, the next step is to interpret the results effectively. This includes understanding the posterior distributions, credible intervals, and key statistics. Below is a practical guide to interpreting and reporting Bayesian results:
First, assuming you have fitted a model using rstan
, you can extract the results using the summary
function.
library(rstan)
# Example: Extracting summary from the fitted Stan model
fit <- # (your stanfit object here)
fit_summary <- summary(fit, probs = c(0.025, 0.5, 0.975))$summary
print(fit_summary)
You should always check the convergence diagnostics like R-hat values and effective sample size (ESS).
print(fit_summary[,"Rhat"])
print(fit_summary[,"n_eff"])
Visualizing the posterior distribution helps in understanding the parameter estimates better.
# Plot posterior distribution of a parameter
library(bayesplot)
params <- extract(fit)
mcmc_areas(params$your_parameter, prob = 0.95)
This plots the posterior density with a 95% credible interval.
Create a summary table including key statistics like mean, median, and credible intervals.
results <- data.frame(
Parameter = rownames(fit_summary),
Mean = fit_summary[,"mean"],
SD = fit_summary[,"sd"],
`2.5%` = fit_summary[,"2.5%"],
`97.5%` = fit_summary[,"97.5%"]
)
print(results)
Here's how you might interpret and report these results in text:
"We estimated the posterior distribution for the parameter extremely_important_parameter
. The posterior mean was approximately x
with a 95% credible interval ranging from y
to z
. The convergence diagnostics (R-hat < 1.01) indicate that the model has adequately converged."
Include visualizations like trace plots, posterior density plots, and more to support your textual interpretations.
# Trace plot for checking convergence
mcmc_trace(params$your_parameter)
# Density plot
mcmc_dens(params$your_parameter)
library(shinystan)
# Launch ShinyStan for interactive exploration
launch_shinystan(fit)
ShinyStan provides a convenient way to interactively explore model diagnostics and results.
By following these steps, you can effectively interpret and report the results from your Bayesian inference models in R using rstan
or Stan. This ensures that your analysis is both accurate and comprehensible to a broader audience.
Remember to include key statistics, diagnostic checks, and visual aids to communicate your findings effectively.
In this section, we will demonstrate how to apply Bayesian inference to a real-world problem using R. We will use the rstan
package to perform Bayesian linear regression. Specifically, we'll predict house prices based on certain features like square footage and the number of bedrooms.
# Load required libraries
library(rstan)
library(ggplot2)
library(dplyr)
For this example, we'll use a sample dataset housing_data
which contains house prices and features.
# Sample dataset
housing_data <- data.frame(
square_footage = c(1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700),
bedrooms = c(3, 3, 3, 4, 2, 3, 4, 4, 3, 3),
price = c(245000, 312000, 279000, 308000, 199000, 219000, 405000, 324000, 319000, 255000)
)
# Standardize the data
housing_data <- housing_data %>%
mutate(
scale_sqrt_ft = scale(square_footage),
scale_bed = scale(bedrooms),
scale_price = scale(price)
)
Now, we'll specify the Stan model for Bayesian linear regression.
stan_model_code <- "
data {
int N;
vector[N] scale_sqrt_ft;
vector[N] scale_bed;
vector[N] scale_price;
}
parameters {
real alpha;
real beta_sqrt_ft;
real beta_bed;
real sigma;
}
model {
scale_price ~ normal(alpha + beta_sqrt_ft * scale_sqrt_ft + beta_bed * scale_bed, sigma);
}
"
# Compile the Stan model
stan_model <- stan_model(stan_model_code, data = housing_dat)
rstan
# Prepare data for Stan model
stan_data <- list(
N = nrow(housing_data),
scale_sqrt_ft = housing_data$scale_sqrt_ft,
scale_bed = housing_data$scale_bed,
scale_price = housing_data$scale_price
)
fit <- stan(
model_code = stan_model_code,
data = stan_data,
iter = 2000,
chains = 4
)
# Print the results
print(fit)
We'll conduct posterior predictive checks to ensure the quality of our model.
# Extract posterior samples
posterior_samples <- extract(fit)
# Posterior Prediction
posterior_predict <- function(x) {
alpha <- posterior_samples$alpha
beta_sqrt_ft <- posterior_samples$beta_sqrt_ft
beta_bed <- posterior_samples$beta_bed
sigma <- posterior_samples$sigma
alpha + beta_sqrt_ft * x[1] + beta_bed * x[2] + rnorm(length(alpha), 0, sigma)
}
# Example prediction for a new data point
new_data <- c(scale(housing_data$square_footage[1]),
scale(housing_data$bedrooms[1]))
predictions <- posterior_predict(new_data)
# Plot predictions
ggplot(data = housing_data, aes(x = scale_sqrt_ft, y = scale_price)) +
geom_point(color = "blue") +
geom_abline(intercept = median(posterior_samples$alpha),
slope = median(posterior_samples$beta_sqrt_ft),
color = "red")
This implementation provides a comprehensive, practical introduction to Bayesian inference for real-world scenarios using R. By utilizing the rstan
package, learners can fit Bayesian models to their data and make informed predictions and analyses.
The objective of this section is to apply the concepts learned in previous units. You will implement Bayesian inference on real-world datasets using R, focusing on practical implementation with code and case studies.
This section consists of two main parts:
Let's consider a case where we want to predict the sales of a retail store based on advertising budget. We'll use a simple linear regression model with Bayesian inference.
For this example, we'll use a hypothetical dataset. You can replace this with a real dataset in practice.
# Load required libraries
library(rstan)
library(ggplot2)
# Hypothetical dataset
data <- data.frame(
Advertising = c(50, 100, 150, 200, 250, 300, 350, 400, 450, 500),
Sales = c(20, 25, 30, 34, 40, 41, 42, 44, 45, 48)
)
Stan requires the data to be in a list format.
# Data preparation
stan_data <- list(
N = nrow(data),
x = data$Advertising,
y = data$Sales
)
We'll write a simple linear regression model in Stan.
# Stan model code
stan_code <- "
data {
int N;
vector[N] x;
vector[N] y;
}
parameters {
real alpha;
real beta;
real sigma;
}
model {
y ~ normal(alpha + beta * x, sigma);
}
"
Compile and fit the model using the stan
function.
# Compile and fit the model
fit <- stan(
model_code = stan_code,
data = stan_data,
iter = 2000,
chains = 4
)
Extract the results and visualize them.
# Extract results
print(fit)
# Plot the results
posterior <- extract(fit)
alpha_post <- posterior$alpha
beta_post <- posterior$beta
# Predictive plot
predicted_sales <- alpha_post + beta_post * data$Advertising
data$Predicted <- rowMeans(predicted_sales)
ggplot(data, aes(x = Advertising, y = Sales)) +
geom_point() +
geom_line(aes(y = Predicted), color = 'blue') +
labs(title = "Advertising Budget vs Sales",
x = "Advertising Budget (in 1000s)",
y = "Sales")
Use Bayesian inference to predict housing prices based on features like the number of bedrooms, square footage, etc.
Implement a Bayesian A/B test to determine the effectiveness of two different website designs on conversion rates.
Apply Bayesian techniques to forecast demand in a supply chain context, taking into account previous demand data and external factors.
By now, you should be able to implement Bayesian inference in R for various real-world problems. These case studies provide a practical approach to using Bayesian analysis. Feel free to expand on them or create your own case studies to gain deeper insights.