Introduction to Monte Carlo Simulation: The Math of Uncertainty
What is Monte Carlo Simulation?
Named after the famous casino in Monaco, a Monte Carlo Simulation is a mathematical technique that allows us to account for risk in quantitative analysis and decision-making. Instead of predicting a single outcome (e.g., "This project will cost $10M"), it runs thousands of "what-if" scenarios to produce a range of possible outcomes and the probability of each occurring.
The Foundation: From Simple to Complex
In this series, we begin with the fundamental building blocks of probability using two simple experiments:
The Coin Toss (Binomial Distribution): Exploring 50/50 outcomes across 1 million events.
The Dice Roll (Normal Distribution): Observing how the sum of many independent random variables naturally forms a "Bell Curve."
By mastering these, we establish the Law of Large Numbers and the Central Limit Theorem—the same engines that power risk models in the world's largest industries.
Real-World Applications of the Monte Carlo Method
While coin flips and dice rolls provide a controlled environment for understanding probability, the true power of Monte Carlo simulations lies in their ability to price uncertainty in complex, high-stakes industries. By shifting from deterministic (single-point) estimates to stochastic (probabilistic) modeling, professionals can make data-driven decisions.
Industry Use Cases
| Sector | Practical Application | Key Metric |
|---|---|---|
| Project Management | Schedule Risk Analysis (SRA): Simulating task durations to determine the "P80" or "P90" completion dates for large-scale infrastructure and PPP projects. | Completion Probability |
| Finance & Investing | Portfolio Optimization: Simulating thousands of price paths for stocks and options to calculate potential returns and Value at Risk (VaR). | Risk-Adjusted Return |
| Supply Chain | Lead Time Uncertainty: Modeling the ripple effect of shipping delays or raw material shortages on total production timelines and inventory levels. | Service Level % |
| Banking | Credit Risk Exposure: Stress-testing loan portfolios against various economic scenarios to ensure capital adequacy and solvency. | Expected Loss (EL) |
| Insurance | Actuarial Modeling: Predicting the frequency and severity of catastrophic events (e.g., floods or fires) to accurately set premiums and reserves. | Loss Ratio |
Moving from Theory to Practice
In a deterministic model, one might say: "This bridge will cost $500M and take 3 years to build." In a Monte Carlo-driven model, the conversation changes to:
- "There is a 15% probability of finishing within the $500M budget."
- "To reach a 90% confidence level ($P_{90}$), we require a contingency buffer of $45M."
By using the same mathematical engines we used to simulate 1,000,000 coin tosses, we can quantify these risks with mathematical rigour rather than intuition.
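The P90/contingency language above can be made concrete with a few lines of code. The following is a minimal, hypothetical sketch: the lognormal cost distribution and its parameters are assumptions chosen purely for illustration, not a claim about the bridge example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical illustration: 10,000 simulated total project costs (in $M),
# drawn from a lognormal distribution. The distribution and its parameters
# are assumptions for this sketch, not part of the bridge example above.
costs = rng.lognormal(mean=np.log(500), sigma=0.1, size=10_000)

p50, p80, p90 = np.percentile(costs, [50, 80, 90])
within_budget = (costs <= 500).mean() * 100

print(f"P50 cost: ${p50:,.0f}M")
print(f"P80 cost: ${p80:,.0f}M")
print(f"P90 cost: ${p90:,.0f}M  (contingency above $500M: ${p90 - 500:,.0f}M)")
print(f"Chance of finishing within $500M: {within_budget:.1f}%")
```

Reading the P90 off the simulated distribution, rather than a single-point estimate, is exactly the shift from deterministic to stochastic modeling described above.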
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Constants
TRIALS = 10000
TOSSES_PER_TRIAL = 1000000
print(f"Starting simulation: {TRIALS} trials of {TOSSES_PER_TRIAL} tosses each...")
# Using binomial distribution for efficiency
# np.random.binomial(n, p, size) gives the number of successes (Heads)
# n = number of tosses, p = probability of heads (0.5), size = number of trials
heads_count = np.random.binomial(TOSSES_PER_TRIAL, 0.5, TRIALS)
# Create the results table
simulation_cointoss_results = pd.DataFrame({
'trial_id': np.arange(1, TRIALS + 1),
'No_of_heads': heads_count
})
print("Simulation Complete.")
print(simulation_cointoss_results.head())

Starting simulation: 10000 trials of 1000000 tosses each...
Simulation Complete.
   trial_id  No_of_heads
0         1       500152
1         2       499216
2         3       499965
3         4       500025
4         5       500329
stats_coin_toss = simulation_cointoss_results['No_of_heads'].describe()
mean_heads = stats_coin_toss['mean']
std_dev = stats_coin_toss['std']
min_heads = stats_coin_toss['min']
max_heads = stats_coin_toss['max']
print("\n--- Statistical Analysis ---")
print(f"Mean Number of Heads: {mean_heads}")
print(f"Standard Deviation: {std_dev}")
print(f"Expected Mean (Theoretical): {TOSSES_PER_TRIAL * 0.5}")
print(f"Min Number of Heads: {min_heads}")
print(f"Max Number of Heads: {max_heads}")

--- Statistical Analysis ---
Mean Number of Heads: 499999.3159
Standard Deviation: 502.95435878009005
Expected Mean (Theoretical): 500000.0
Min Number of Heads: 497984.0
Max Number of Heads: 501758.0
plt.figure(figsize=(10, 6))
sns.histplot(simulation_cointoss_results['No_of_heads'], kde=True, color='skyblue')
plt.title(f'Monte Carlo Simulation: {TRIALS} Trials of 1M Coin Tosses')
plt.xlabel('Number of Heads')
plt.ylabel('Frequency')
plt.axvline(mean_heads, color='red', linestyle='--', label=f'Mean: {mean_heads:.2f}')
plt.legend()
plt.show()

# 1. scipy.stats is already imported as 'stats' in the first cell
# 2. Define the range for the PDF (X-axis) using the existing min_heads and max_heads
x_coin_toss = np.linspace(min_heads, max_heads, 100)
# 3. Calculate the theoretical normal PDF values over that range
pdf_coin_toss = stats.norm.pdf(x_coin_toss, mean_heads, std_dev)
# 4. Plotting the results
plt.figure(figsize=(10, 6))
# The smooth theoretical curve
plt.plot(x_coin_toss, pdf_coin_toss, color='red', lw=3, label='Normal PDF Curve')
# The actual simulation data, plotted on the density scale to match the PDF
sns.histplot(simulation_cointoss_results['No_of_heads'],
stat="density", bins=30, alpha=0.3, color='green')
plt.title('Probability Density Function of Coinflips')
plt.xlabel('Number of Heads in 1M Tosses')
plt.ylabel('Probability Density')
plt.legend()
plt.show()

lower_bound_coin_toss = 499500  # 500000 - 500
upper_bound_coin_toss = 500500 #(500000 + 500)
# stats.norm.cdf gives the area to the left of a point
prob_below_upper_toss = stats.norm.cdf(upper_bound_coin_toss, mean_heads, std_dev)
prob_below_lower_toss = stats.norm.cdf(lower_bound_coin_toss, mean_heads, std_dev)
total_probability_coin_toss = (prob_below_upper_toss - prob_below_lower_toss) * 100
print(f"There is a {total_probability_coin_toss:.2f}% chance the sum falls between {lower_bound_coin_toss} and {upper_bound_coin_toss}.")

There is a 67.98% chance the sum falls between 499500 and 500500.
Theory vs. Practice
- In this simulation, a range of $\pm$ 500 heads from the mean defines 1 Standard Deviation ($\sigma$). While the theoretical probability for a normal distribution within $1\sigma$ is 68.26%, our empirical simulation yielded 67.98%. This 0.28% variance represents the Sampling Error, which is expected in stochastic modeling. This gap indicates how closely our finite sample approximates the theoretical population.
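This comparison can be reproduced by counting trials directly, rather than reading it off the fitted curve. Below is a self-contained sketch; it uses fresh random draws, so the exact percentage will differ slightly from the 67.98% above.

```python
import numpy as np
from scipy import stats

# Re-run a small coin-toss experiment and count, trial by trial, how many
# results land within one standard deviation of the mean.
rng = np.random.default_rng(0)
heads = rng.binomial(1_000_000, 0.5, size=10_000)

mean_h = heads.mean()
std_h = heads.std(ddof=1)

# Empirical: fraction of trials within [mean - sigma, mean + sigma]
within_1sd = np.mean(np.abs(heads - mean_h) <= std_h) * 100
# Theoretical: area under the standard normal curve between -1 and +1
theoretical = (stats.norm.cdf(1) - stats.norm.cdf(-1)) * 100

print(f"Empirical  : {within_1sd:.2f}% of trials within 1 sigma")
print(f"Theoretical: {theoretical:.2f}%")
print(f"Sampling error: {abs(within_1sd - theoretical):.2f} percentage points")
```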
Rolling a Die
Progressing from a coin flip to a dice roll:
Moving from coin tosses (binary: 0 or 1) to dice rolls (discrete: 1 through 6) shifts the underlying distribution of outcomes from Binomial to Multinomial, since each roll now has six possible faces rather than two. The Central Limit Theorem still applies: the sum of many independent rolls forms an approximately normal distribution (bell curve) centered on the theoretical average.
Theoretical Background
For a fair 6-sided die:
- Average of one roll: 3.5, calculated as $(1+2+3+4+5+6)/6$
- Expected total for 500,000 rolls: $500{,}000 \times 3.5 = 1{,}750{,}000$
# 1. Define Adjusted Parameters
TRIALS_DICE = 5000
ROLLS_PER_TRIAL = 500000
print(f"Simulating {TRIALS_DICE} trials of {ROLLS_PER_TRIAL} dice rolls...")
# 2. Run the Simulation
TRIAL_DICE_sums = []
for i in range(TRIALS_DICE):
current_trial = np.random.randint(1, 7, size=ROLLS_PER_TRIAL)
TRIAL_DICE_sums.append(current_trial.sum())
# 3. Create the Results Table
dice_simulation_results = pd.DataFrame({
'TRIAL_ID': np.arange(1, TRIALS_DICE + 1),
'TOTAL_SUM': TRIAL_DICE_sums
})
print("Simulation Complete.")
print(dice_simulation_results.head())

Simulating 5000 trials of 500000 dice rolls...
Simulation Complete.
   TRIAL_ID  TOTAL_SUM
0         1    1748580
1         2    1749843
2         3    1750806
3         4    1749339
4         5    1750849
# Statistical Summary
actual_mean = dice_simulation_results['TOTAL_SUM'].mean()
actual_min = dice_simulation_results['TOTAL_SUM'].min()
actual_max = dice_simulation_results['TOTAL_SUM'].max()
theoretical_mean = ROLLS_PER_TRIAL * 3.5
print(f"Actual Mean Sum: {actual_mean:,.2f}")
print(f"Actual Min Sum: {actual_min:,.2f}")
print(f"Actual Max Sum: {actual_max:,.2f}")
print(f"Theoretical Mean: {theoretical_mean:,.2f}")
print(f"Difference from theoretical mean: {abs(actual_mean - theoretical_mean):,.2f}")
# Plotting
plt.figure(figsize=(12, 6))
sns.histplot(dice_simulation_results['TOTAL_SUM'], kde=True, color='forestgreen')
plt.axvline(theoretical_mean, color='orange', linestyle='--', label='Theoretical Mean')
plt.title(f'Monte Carlo: {TRIALS_DICE} Trials of {ROLLS_PER_TRIAL} Dice Rolls')
plt.xlabel('Sum of Dice Faces')
plt.ylabel('Frequency')
plt.legend()
plt.show()

Actual Mean Sum: 1,750,000.78
Actual Min Sum: 1,745,862.00
Actual Max Sum: 1,753,814.00
Theoretical Mean: 1,750,000.00
Difference from theoretical mean: 0.78
Statistical Transformation: From Frequency to PDF
In Monte Carlo simulations, we move from raw observations to predictive models by transforming our data from a Frequency Histogram to a Probability Density Function (PDF).
1. Frequency Histogram (The Observed Data)
The Frequency Histogram is a simple tally of how many times an outcome occurred within our $N$ trials.
- Y-Axis: Represents the Count (integers).
- Total Area: Summing the heights of the bars equals the total number of trials ($N$).
- Use Case: Historical reporting of simulation results.
2. Probability Mass Function (The Normalized Data)
By dividing each frequency count by the total number of trials, we "normalize" the data into a PMF.
- Formula: $P(x) = \frac{\text{Count}}{\text{Total Trials}}$
- Y-Axis: Represents Probability (decimals between 0 and 1).
- Total Area: The sum of all probabilities equals exactly 1.0.
3. Probability Density Function (The Continuous Model)
The PDF is used when we treat our data as a continuous range (like the sum of a million dice rolls). It accounts for the width of the histogram bins.
- Formula: $\text{Density} = \frac{\text{Count}}{\text{Total Trials} \times \text{Bin Width}}$
- Y-Axis: Represents Density (relative likelihood).
- Area as Probability: The probability of a result falling between a specific range $[a, b]$ is the integral (area) under the curve: $$P(a \le X \le b) = \int_{a}^{b} f(x) dx$$
Summary Comparison Table
| Feature | Frequency Histogram | Normalized (PMF) | PDF (Density) |
|---|---|---|---|
| Logic | What happened? | How often? | What is the likelihood? |
| Y-Axis | Integer Counts | Percentage / Probability | Probability Density |
| Bar Height | $f_i$ | $f_i / N$ | $f_i / (N \times \text{width})$ |
| Total Sum | Equals $N$ (Trials) | Equals 1.0 (100%) | Area equals 1.0 |
Key Takeaway: Normalization allows us to compare simulations of different sizes (e.g., 5,000 trials vs 1,000,000 trials) on the same scale, as the PDF will always converge toward the same "ideal" bell curve.
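The three scalings in the table can be verified numerically. The sketch below uses a smaller dice simulation (1,000 rolls per trial instead of 500,000, a choice made purely to keep the check fast); the identities hold regardless of the simulation size.

```python
import numpy as np

rng = np.random.default_rng(1)
# Each row is one trial of 1,000 dice rolls; sum along rows to get trial totals
sums = rng.integers(1, 7, size=(5_000, 1_000)).sum(axis=1)

counts, edges = np.histogram(sums, bins=30)   # frequency histogram
bin_width = edges[1] - edges[0]
N = counts.sum()                              # total number of trials

pmf = counts / N                              # normalized heights: sum to 1
density = counts / (N * bin_width)            # PDF-scale heights: area sums to 1

print(f"Total count (N)        : {N}")
print(f"Sum of PMF bars        : {pmf.sum():.4f}")
print(f"Area under density bars: {(density * bin_width).sum():.4f}")
```

This is exactly what `sns.histplot(..., stat="density")` does under the hood, which is why the histogram can be overlaid on the theoretical PDF curve.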
# import scipy.stats as stats - imported in the first cell
# 1. Get stats from your existing dataframe
data = dice_simulation_results['TOTAL_SUM']
mu = data.mean()
sigma = data.std()
# 2. Define the range for the PDF (X-axis)
# We create 100 points between the min and max sum found in the simulation
x = np.linspace(data.min(), data.max(), 100)
# 3. Calculate the PDF values
pdf = stats.norm.pdf(x, mu, sigma)
# 4. Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(x, pdf, color='red', lw=3, label='Normal PDF Curve')
sns.histplot(data, stat="density", bins=30, alpha=0.3, color='green')  # density scale matches the PDF
plt.title('Probability Density Function of Dice Sums')
plt.xlabel('Sum of 500k Dice Rolls')
plt.ylabel('Probability Density')
plt.legend()
plt.show()

Mathematical Calculation (Theoretical)
- For a sum of $n$ independent trials, the formula for the Standard Deviation of the total sum is:
$$\sigma_{Total} = \sqrt{n} \times \sigma_{Single}$$
For Coin Tosses (Binomial):
- A single coin toss has a variance of $p(1-p)$.
- Variance of 1 toss: $0.5 \times 0.5 = 0.25$
- Standard Deviation of 1 toss ($\sigma_{Single}$): $\sqrt{0.25} = 0.5$
- For 1,000,000 tosses: $$\sigma = \sqrt{1{,}000{,}000} \times 0.5 = 1{,}000 \times 0.5 = 500$$
- Result: 1 STD for the coin toss simulation is 500 heads.
For Dice Rolls:
- A single 6-sided die has a specific variance:
- Variance of 1 roll: $\frac{35}{12} \approx 2.917$
- Standard Deviation of 1 roll ($\sigma_{Single}$): $\sqrt{2.917} \approx 1.708$
- For 500,000 rolls: $$\sigma = \sqrt{500{,}000} \times 1.708 \approx 707.1 \times 1.708 \approx 1{,}208$$
- Result: 1 STD for the dice sum simulation is $\approx 1{,}208$.
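These hand calculations can be checked in a few lines. The die's variance, $35/12 \approx 2.917$, is just the average squared deviation of the six faces from 3.5:

```python
import numpy as np

# Theoretical sigma of a single outcome for each experiment
p = 0.5
sigma_coin = np.sqrt(p * (1 - p))               # 0.5

faces = np.arange(1, 7)
var_die = np.mean((faces - faces.mean()) ** 2)  # 35/12, about 2.917
sigma_die = np.sqrt(var_die)                    # about 1.708

# Scale to the sum of n independent trials: sigma_total = sqrt(n) * sigma_single
print(f"Coin, n = 1,000,000: sigma_total = {np.sqrt(1_000_000) * sigma_coin:.1f}")
print(f"Dice, n = 500,000  : sigma_total = {np.sqrt(500_000) * sigma_die:.1f}")
```

The coin value (500) matches the ~503 observed empirically earlier, and the dice value (~1,208) is used for the 68% interval below.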
Python Calculation (Empirical)
Standard Deviation (1 STD) Calculation
The Standard Deviation ($\sigma$) represents the "spread" of our trial results. In this simulation, we calculate it using the following formula:
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$
Where:
- $x_i$ = The result of an individual trial (e.g., TOTAL_SUM)
- $\mu$ = The mean of all trials
- $N$ = The number of trials (e.g., 5,000)
Calculated Values from Simulation:
- Mean ($\mu$): dice_simulation_results['TOTAL_SUM'].mean()
- 1 STD ($\sigma$): dice_simulation_results['TOTAL_SUM'].std()
- 68% Confidence Interval: $[\mu - \sigma, \mu + \sigma]$
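One subtlety worth flagging: the formula above divides by $N$ (the population standard deviation), while pandas' .std() defaults to ddof=1 and divides by $N - 1$ (the sample standard deviation). At $N = 5{,}000$ the difference is negligible, as a quick check shows; the synthetic normal data below is only a stand-in for the simulation results.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 5,000 trial sums; only the ddof behaviour matters
rng = np.random.default_rng(2)
df = pd.DataFrame({'TOTAL_SUM': rng.normal(1_750_000, 1_208, size=5_000)})

sigma_sample = df['TOTAL_SUM'].std()            # pandas default: divides by N - 1
sigma_population = df['TOTAL_SUM'].std(ddof=0)  # divides by N, matching the formula

print(f"Sample std (ddof=1)    : {sigma_sample:.3f}")
print(f"Population std (ddof=0): {sigma_population:.3f}")
print(f"Relative difference    : {(sigma_sample / sigma_population - 1) * 100:.4f}%")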
lower_bound = 1749000
upper_bound = 1751000
# stats.norm.cdf gives the area to the left of a point
prob_below_upper = stats.norm.cdf(upper_bound, mu, sigma)
prob_below_lower = stats.norm.cdf(lower_bound, mu, sigma)
total_probability = (prob_below_upper - prob_below_lower) * 100
print(f"There is a {total_probability:.2f}% chance the sum falls between {lower_bound} and {upper_bound}.")

There is a 59.65% chance the sum falls between 1749000 and 1751000.
lower_bound = 1748792
upper_bound = 1751208
# stats.norm.cdf gives the area to the left of a point
prob_below_upper = stats.norm.cdf(upper_bound, mu, sigma)
prob_below_lower = stats.norm.cdf(lower_bound, mu, sigma)
total_probability = (prob_below_upper - prob_below_lower) * 100
print(f"There is a {total_probability:.2f}% chance the sum falls between {lower_bound} and {upper_bound}.")

There is a 68.71% chance the sum falls between 1748792 and 1751208.
Simulation Stability Analysis
Between two subsequent runs, the observed probability within 1 Standard Deviation shifted from 69.04% to 68.71%.
- Observed Variance: 0.33%
- Nature of the Gap: This difference is known as Sampling Noise (or Stochastic Variation).
- Convergence: According to the Law of Large Numbers, as we increase the number of TRIALS ($N$), this noise will decrease, and the results of subsequent runs will converge toward the theoretical value of 68.26%.
How to "kill" the noise
To make the $0.33\%$ gap disappear (or at least shrink substantially), the most direct lever is:
- Increase TRIALS: By the $1/\sqrt{N}$ rule, moving from 5,000 trials to 50,000 trials shrinks the expected run-to-run gap by a factor of $\sqrt{10} \approx 3.2$, from $0.33\%$ to roughly $0.1\%$.
Such variations in the results (the gap) can be used as a basis for defining "Risk Buffers".
Summary
- Noise: The gap between the results of two subsequent simulations (69.04% and 68.71%) is casually called noise.
- Stochastic Variation: The inherent randomness in a simulation.
- Sampling Variance: How much a result (like the 1 STD percentage) changes from one sample (Simulation A) to the next (Simulation B).
- The Gap ($69.04\% - 68.71\% = 0.33\%$): This specific difference is the Realization Variance. It shows that the simulation hasn't fully "settled" yet.
- Sampling Error: The gap between a run's result (69.04% or 68.71%) and the theoretical prediction (68.26%) is known as Sampling Error (or Simulation Error).
- Law of Large Numbers: As we increase the number of trials, the Sampling Error decreases and the result gets closer and closer to 68.26%.
- Convergence (Empirical vs. Theoretical), the $\frac{1}{\sqrt{N}}$ Rule: In Monte Carlo simulations, the error decreases at a rate proportional to $1/\sqrt{N}$.
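A quick way to see this rule in action is to re-estimate the within-1$\sigma$ percentage at increasing trial counts. The sketch below uses smaller experiments (10,000 tosses per trial, an assumption made to keep the demo fast); any single run remains noisy, so the error only shrinks on average.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
theoretical = stats.norm.cdf(1) - stats.norm.cdf(-1)  # ~0.6827

# Estimate the within-1-sigma fraction at increasing trial counts
for n_trials in (500, 5_000, 50_000):
    sums = rng.binomial(10_000, 0.5, size=n_trials)
    mu, sd = sums.mean(), sums.std(ddof=1)
    within = np.mean(np.abs(sums - mu) <= sd)
    print(f"N = {n_trials:>6}: within 1 sigma = {within:.4f} "
          f"(error vs theory: {abs(within - theoretical):.4f})")
```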
