Simulations

simulations.Sim.MAR(ymis, xdep, miss, tail)

Introduces Missing At Random (MAR) missingness into a numeric array, dependent on a covariate.

The missingness in ymis is introduced based on the values of xdep. The probability of missingness is determined using a logistic model, where higher or lower values of xdep (depending on the tail) lead to increased missingness.

Parameters:

ymis – Array-like numeric input (e.g., outcome variable Y).
xdep – Covariate (X) that missingness depends on.
miss – Target overall missingness probability (float between 0 and 1).
tail – Direction of dependency; “left” (more missing for low X) or “right” (more missing for high X).

Returns:

A NumPy array with MAR-induced missing values based on xdep.

Return type:

np.ndarray
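
The logistic mechanism described above can be sketched as follows. This is only an illustration under stated assumptions, not the module's actual implementation: the name mar_sketch, the fixed slope, the seed argument, and the bisection used to calibrate the intercept are all hypothetical.

```python
import numpy as np

def mar_sketch(ymis, xdep, miss, tail, slope=2.0, seed=None):
    """Hypothetical MAR mechanism: P(missing) = expit(a + b * z(xdep))."""
    if tail not in ("left", "right"):
        raise ValueError("tail must be 'left' or 'right'")
    rng = np.random.default_rng(seed)
    y = np.asarray(ymis, dtype=float).copy()
    x = np.asarray(xdep, dtype=float)
    z = (x - x.mean()) / x.std()              # standardize the covariate
    b = slope if tail == "right" else -slope  # sign sets which tail loses data

    # Bisection on the intercept a so that the mean probability matches `miss`.
    lo, hi = -20.0, 20.0
    for _ in range(60):
        a = 0.5 * (lo + hi)
        p = 1.0 / (1.0 + np.exp(-(a + b * z)))
        if p.mean() < miss:
            lo = a
        else:
            hi = a

    y[rng.random(y.shape) < p] = np.nan       # one Bernoulli draw per value
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 0.5 * x + rng.normal(size=5000)
y_right = mar_sketch(y, x, miss=0.3, tail="right", seed=1)
```

With tail="right", the missing entries of y_right concentrate where x is large, while the overall share of NaN values stays close to miss.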

simulations.Sim.MCAR(ymis, miss)

Introduces Missing Completely at Random (MCAR) missingness into a numeric array.

Randomly sets a proportion of values in ymis to NaN based on the specified missingness probability miss.

Parameters:

ymis – Array-like numeric input Y.
miss – Proportion of values to be set as missing (float between 0 and 1).

Returns:

A NumPy array with MCAR-induced missing values.

Return type:

np.ndarray
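
A minimal sketch of an MCAR mechanism, assuming each value is deleted independently with probability miss. Whether the module instead removes an exact proportion of values is not stated here; mcar_sketch and its seed argument are hypothetical names.

```python
import numpy as np

def mcar_sketch(ymis, miss, seed=None):
    """Hypothetical MCAR mechanism: every value is dropped with probability `miss`."""
    rng = np.random.default_rng(seed)
    y = np.asarray(ymis, dtype=float).copy()   # copy so the input stays intact
    y[rng.random(y.shape) < miss] = np.nan     # mask is independent of the data
    return y

y = np.arange(10000, dtype=float)
y_mcar = mcar_sketch(y, miss=0.25, seed=0)
```

Under this assumption the realized share of NaN values varies slightly around miss from run to run.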

simulations.Sim.Simulate(dist, n, mp, miss, m, k, hmi, pilot, method, tail=None, pmass=None)

Runs a single simulation with the given settings and returns its performance metrics.

Parameters:

dist – Distribution of Y.
n – Number of observations.
mp – Missingness mechanism; “MCAR” or “MAR”.
miss – Missingness probability.
m – Number of multiple imputations.
k – Donor size for PMM.
hmi – Whether to use HowManyImputations (True/False).
pilot – Number of pilot imputations for HMI.
method – Imputation method: “pmm” or “midas”.
tail – Tail direction for MAR; “left” or “right”.
pmass – Point mass probability.

Returns:

Tuple of (cBias, Width, MSE, Coverage, sdBias, it, actual_missing, sdY)

simulations.Sim.corr_mean(n, sim)

Tests whether the target correlation is achieved across multiple simulated datasets.

Simulates n datasets of 5000 observations each from the distribution given by sim. The available distributions are those used in the thesis: semi-continuous, normal, or Poisson. The function returns the average empirical correlation across all simulated datasets.

Parameters:

n – Number of datasets to generate.
sim – Distribution type; must be one of “semi”, “norm”, or “pois”.

Returns:

Average empirical correlation across the n simulated datasets.

Return type:

float

simulations.Sim.data_norm(n, locY, scaleY, rho)

Simulates n normally distributed observations for response variable Y and a covariate X.

The covariate X is generated with a specified correlation rho to Y using Cholesky decomposition.

Parameters:

n – Number of observations.
locY – Mean (location) of Y.
scaleY – Standard deviation (scale) of Y.
rho – Correlation coefficient between Y and X (between -1 and 1).

Returns:

A tuple containing:

Y (np.ndarray) – Simulated response variable.
X (np.ndarray) – Simulated covariate with correlation rho to Y.
corr (float) – Empirical correlation between Y and X.
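
The Cholesky approach can be sketched as below. This is an illustration only: data_norm_sketch and its seed argument are hypothetical, and the sketch assumes that X is standard normal and that |rho| < 1, so that the correlation matrix is positive definite.

```python
import numpy as np

def data_norm_sketch(n, locY, scaleY, rho, seed=None):
    """Hypothetical sketch: correlated normal pair via a Cholesky factor."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    L = np.linalg.cholesky(cov)              # lower triangular, L @ L.T == cov
    z = rng.standard_normal((n, 2)) @ L.T    # rows now have correlation rho
    Y = locY + scaleY * z[:, 0]              # shift and scale the response
    X = z[:, 1]                              # covariate left standard normal
    corr = float(np.corrcoef(Y, X)[0, 1])
    return Y, X, corr

Y, X, corr = data_norm_sketch(5000, locY=5.0, scaleY=2.0, rho=0.6, seed=0)
```

Shifting and scaling Y does not change its correlation with X, so the empirical corr differs from rho only through sampling error.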

simulations.Sim.data_pois(n, lambda_poisson, rho, max_iter=30, tol=0.01)

Simulates a Poisson-distributed response variable Y and a continuous covariate X with a specified correlation using a Gaussian copula approach.

Y is generated from a Poisson distribution with mean lambda_poisson.
X is a standard normal variable correlated with Y using a Gaussian copula.
Iteratively adjusts the latent correlation to match the target rho within a tolerance tol.

Parameters:

n – Number of observations to simulate.
lambda_poisson – Mean (λ) of the Poisson distribution.
rho – Desired Pearson correlation between Y and X.
max_iter – Maximum number of iterations to adjust latent correlation.
tol – Tolerance threshold for the difference between target and achieved correlation.

Returns:

A tuple containing:

Y (np.ndarray) – Poisson-distributed response variable.
X (np.ndarray) – Covariate with approximately the specified correlation to Y.
corr (float) – Achieved empirical correlation between X and Y.
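
One way such an iterative Gaussian copula scheme might look is sketched below, using only NumPy and the standard library. The name data_pois_sketch, the seed argument, the additive update of the latent correlation, and the truncated Poisson CDF lookup are assumptions for illustration and may differ from the module's code.

```python
import math
import numpy as np

def data_pois_sketch(n, lambda_poisson, rho, max_iter=30, tol=0.01, seed=None):
    """Hypothetical sketch: Poisson Y and normal X linked by a Gaussian copula."""
    rng = np.random.default_rng(seed)
    e1, e2 = rng.standard_normal((2, n))     # base draws, reused in every iteration

    # Poisson CDF on a truncated grid, used as a lookup table for quantiles.
    kmax = int(lambda_poisson + 10 * math.sqrt(lambda_poisson) + 20)
    k = np.arange(kmax + 1)
    log_fact = np.array([math.lgamma(i + 1) for i in k])
    cdf = np.cumsum(np.exp(k * math.log(lambda_poisson) - lambda_poisson - log_fact))
    erf = np.vectorize(math.erf)

    X = e1
    latent = rho                             # start the latent correlation at the target
    for _ in range(max_iter):
        z = latent * e1 + math.sqrt(1.0 - latent**2) * e2
        u = 0.5 * (1.0 + erf(z / math.sqrt(2.0)))        # standard normal CDF
        Y = np.minimum(np.searchsorted(cdf, u), kmax)    # Poisson quantile of u
        corr = float(np.corrcoef(Y, X)[0, 1])
        if abs(corr - rho) < tol:
            break
        # Nudge the latent correlation by the remaining gap, staying inside (-1, 1).
        latent = float(np.clip(latent + (rho - corr), -0.999, 0.999))
    return Y, X, corr

Y, X, corr = data_pois_sketch(5000, lambda_poisson=3.0, rho=0.5, seed=0)
```

The adjustment is needed because discretizing the latent normal into Poisson counts shrinks the Pearson correlation, so the latent value has to be somewhat larger in magnitude than the target.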

simulations.Sim.data_semi(n, locY, scaleY, rho, pmass)

Simulates a semi-continuous response variable Y and a correlated covariate X.

Y is drawn from a normal distribution with mean locY and standard deviation scaleY.
Y is transformed via Y^4 / max(Y^3) to induce right skewness.
A point mass at zero is introduced by randomly setting values to zero with probability pmass.
Covariate X is generated to have correlation rho with Y using Cholesky decomposition.

Parameters:

n – Number of observations to simulate.
locY – Mean (location) of the normally distributed base Y.
scaleY – Standard deviation (scale) of Y.
rho – Desired Pearson correlation coefficient between Y and X (range -1 to 1).
pmass – Probability of setting a value in Y to zero (point mass at zero).

Returns:

A tuple containing:

Y (np.ndarray) – Semi-continuous response variable with right skewness and point mass.
X (np.ndarray) – Covariate with specified correlation to Y.
corr (float) – Empirical correlation between X and Y.
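
The steps listed above can be sketched as follows. The order of operations and the way X is tied to the transformed Y are assumptions, as are the name data_semi_sketch and the seed argument. The sketch also assumes the base draws are mostly positive, so that max(Y^3) is positive.

```python
import numpy as np

def data_semi_sketch(n, locY, scaleY, rho, pmass, seed=None):
    """Hypothetical sketch: skewed Y with a point mass at zero, plus a correlated X."""
    rng = np.random.default_rng(seed)
    base = rng.normal(locY, scaleY, size=n)
    Y = base**4 / np.max(base**3)            # skewing transform from the description
    Y[rng.random(n) < pmass] = 0.0           # point mass at zero

    # The Cholesky factor of [[1, rho], [rho, 1]] is [[1, 0], [rho, sqrt(1 - rho^2)]],
    # so X combines the standardized Y with independent normal noise.
    Ys = (Y - Y.mean()) / Y.std()
    X = rho * Ys + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    corr = float(np.corrcoef(Y, X)[0, 1])
    return Y, X, corr

Y, X, corr = data_semi_sketch(5000, locY=5.0, scaleY=1.0, rho=0.5, pmass=0.2, seed=0)
```

Building X from the final Y, after the transform and the point mass, keeps the empirical correlation close to rho even though Y is no longer normal.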

simulations.Sim.expit(x)

simulations.Sim.logit(p)
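
Neither helper is described here. Assuming they follow the standard definitions, logit maps a probability to its log-odds and expit is the inverse (the logistic function). The sketch below shows those standard definitions, not necessarily the module's code.

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def expit(x):
    """Logistic function, the inverse of logit."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))

p = np.array([0.1, 0.5, 0.9])
roundtrip = expit(logit(p))   # recovers p up to floating-point error
```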

simulations.Sim.miss_mean(Y, miss, n, md, X=None, tail=None)

Estimates the average proportion of missing values introduced under a specified missingness mechanism.

Repeats the missingness process n times on the variable Y using either Missing Completely At Random (MCAR) or Missing At Random (MAR) mechanism. For MAR, a covariate X and a direction tail must be provided.

Parameters:

Y – Array-like outcome variable.
miss – Target missingness probability (float between 0 and 1).
n – Number of repetitions to simulate missingness.
md – Missingness mechanism; must be “MCAR” or “MAR”.
X – Covariate X (required if md is “MAR”).
tail – Direction of missingness for MAR; “left” or “right”.

Returns:

Average proportion of missing values across n repetitions.

Return type:

float
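
The averaging step can be sketched as below. To keep the example self-contained, the missingness mechanism is passed in as a function rather than selected with md. The names miss_mean_sketch and mcar and the seed argument are hypothetical.

```python
import numpy as np

def mcar(Y, miss, rng):
    """Stand-in mechanism: drop each value independently with probability `miss`."""
    y = np.asarray(Y, dtype=float).copy()
    y[rng.random(y.shape) < miss] = np.nan
    return y

def miss_mean_sketch(Y, miss, n, mechanism, seed=None):
    """Average realized share of NaN values over n repetitions of `mechanism`."""
    rng = np.random.default_rng(seed)
    props = []
    for _ in range(n):
        y_mis = mechanism(Y, miss, rng)
        props.append(np.isnan(y_mis).mean())   # realized proportion in this run
    return float(np.mean(props))

avg = miss_mean_sketch(np.zeros(1000), miss=0.4, n=200, mechanism=mcar, seed=0)
```

Comparing avg with miss is a quick check that a mechanism hits its target missingness rate on average.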

simulations.Sim.plot_density(Y, X)

Plots the histogram-based density of two variables, Y and X.

Overlays histograms of Y and X using 100 bins each for visual comparison.

Parameters:

Y – Array-like variable (e.g., response variable).
X – Array-like variable (e.g., covariate).

Returns:

None. Displays the plot.

Return type:

None

simulations.Sim.plot_missingness_tails(Y, X)

Visualizes the distribution of the covariate X, conditional on whether the corresponding Y values are observed or missing.

This plot is useful for illustrating Missing At Random (MAR) mechanisms where missingness in Y depends on X.

Parameters:

Y – Array-like response variable containing missing values.
X – Array-like covariate variable used to model missingness.

Returns:

None. Displays the plot.

Return type:

None

simulations.Sim.repeat_sim(dist, n, mp, miss, m, k, hmi, pilot, method, tail=None, pmass=None)

Repeats the simulation experiment 500 times and stores aggregated performance metrics.

For each run, the function calls Simulate() with the given parameters and collects evaluation metrics. Results are aggregated and written to a CSV file named according to the simulation parameters.

Parameters:

dist – Distribution type of Y (“norm”, “semi”, or “pois”).
n – Number of observations in each simulation.
mp – Missingness mechanism (“MCAR” or “MAR”).
miss – Probability of missingness (float between 0 and 1).
m – Number of multiple imputations.
k – Donor pool size for PMM.
hmi – Boolean; whether to use How-Many-Imputations (HMI) approach.
pilot – Number of pilot imputations for HMI.
method – Imputation method (“pmm” or “midas”).
tail – Direction of missingness for MAR (“left” or “right”). Required if mp == “MAR”.
pmass – Probability of inducing a point mass at zero (for semi-continuous Y).

Returns:

None. Saves results to a CSV file in the project directory.

Return type:

None

Example

>>> repeat_sim(
...     dist="norm",
...     n=500,
...     mp="MCAR",
...     miss=0.6,
...     m=5,
...     k=5,
...     hmi=False,
...     pilot=5,
...     method="pmm",
...     tail="left",
...     pmass=0.2
... )