Imputation
- class imputation.mice(data=None, m=5, maxit=5, predictorMatrix=None, initial='meanobs')
- HMI(pilot, alpha, cv, fml)
Runs the How-Many-Imputations (HMI) procedure, which uses the fraction of missing information (FMI) to determine how many imputations are needed for stable point estimates and standard errors.
The method first runs a pilot set of imputations to estimate the FMI, then calculates how many additional imputations are needed to reach a target coefficient of variation (cv) for the standard error estimates, following von Hippel (2020) [https://arxiv.org/pdf/1608.05406].
- Parameters:
pilot (int) – Number of pilot imputations to run for initial FMI estimation.
alpha (float) – Significance level for confidence interval calculation (e.g., 0.05 for a 95% CI).
cv (float) – Desired coefficient of variation for the standard error estimates.
fml (str) – Analysis model formula (in Patsy syntax) to fit during imputation.
- Returns:
None
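The calculation described above can be sketched roughly as follows. This is an illustration of von Hippel's quadratic rule, M = 1 + 0.5 * (FMI / cv)^2, applied to the upper confidence bound of a pilot FMI estimate; it is not this library's implementation. The function name `required_imputations` and the argument `fmi_hat` are hypothetical, and the confidence bound assumes that the logit of the FMI estimate is approximately normal with standard error sqrt(2 / pilot), as in the paper.

```python
import math
from statistics import NormalDist


def required_imputations(fmi_hat, pilot, alpha=0.05, cv=0.05):
    """Hypothetical helper: imputations needed under the quadratic rule.

    fmi_hat is the FMI estimated from `pilot` imputations (0 < fmi_hat < 1).
    """
    if not 0.0 < fmi_hat < 1.0:
        raise ValueError("fmi_hat must lie strictly between 0 and 1")
    # Two-sided normal quantile for the requested significance level.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Upper confidence bound for the FMI, built on the logit scale.
    logit = math.log(fmi_hat / (1 - fmi_hat))
    upper_logit = logit + z * math.sqrt(2 / pilot)
    fmi_upper = 1 / (1 + math.exp(-upper_logit))
    # Quadratic rule evaluated at the conservative (upper-bound) FMI.
    return math.ceil(1 + 0.5 * (fmi_upper / cv) ** 2)
```

Using the upper bound rather than the point estimate makes the result conservative: a small pilot gives a noisy FMI estimate, so the rule errs toward more imputations.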
- complete()
Performs a single-step imputation to produce a completed dataset.
Runs the initial imputation and then one iteration of the imputation analysis step to fill in missing values.
- Parameters:
None
- Returns:
iterdata – A completed dataset with imputed values after one imputation step.
- Return type:
pandas.DataFrame
- convergence_plot(fml, x='mean')
Generates convergence plots of parameter estimates over iterations for multiple imputations.
Runs multiple imputations and fits the analysis model for each iteration, then plots either the mean or standard error of model parameters across cycles.
- Parameters:
fml (str) – Analysis model formula in patsy syntax.
x (str, optional) – Metric to plot: “mean” for parameter estimates or “sd” for standard errors (default is “mean”).
- Returns:
Displays matplotlib plots of convergence diagnostics.
- Return type:
None
- fit(fml, history=True, HMI=False, alpha=0.05, cv=0.05, pilot=5, **kwargs)
Fits the imputation model and performs analysis using the specified formula.
- Parameters:
fml (str) – Analysis model formula in Patsy syntax. Supports variable transformations but does not allow dots in variable names (which may cause Patsy errors).
history (bool, optional (default=True)) – If True, saves all iterations of the imputation in a dictionary. If False, only final metrics are kept.
HMI (bool, optional (default=False)) – Whether to use the How-Many-Imputations procedure (von Hippel, 2020) to choose the number of imputations before pooling results. If True, the alpha, cv, and pilot parameters are passed to the HMI method.
alpha (float, optional (default=0.05)) – Significance level used in HMI pooling.
cv (float, optional (default=0.05)) – Coefficient of variation threshold for HMI pooling.
pilot (int, optional (default=5)) – Number of pilot imputations for HMI.
kwargs – Additional keyword arguments (currently unused).
- Returns:
Results of the imputation and analysis.
- Return type:
depends on self.results
- iterate()
Performs the iterative imputation procedure.
Starts with an initial imputation, then iteratively updates the imputations for a number of cycles (self.maxit) by applying the imputation model for each variable. After the final iteration, fits the analysis model to the imputed dataset and stores the fitted model results.
- Parameters:
None
- Returns:
iterdata – The imputed dataset after the final iteration.
- Return type:
pandas.DataFrame
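The cycle described above (initial fill, then repeated per-variable updates) can be sketched with a toy chained-equations loop. This is a deliberately simplified illustration, not the library's code: the function name `chained_imputation` is hypothetical, it works on a plain NumPy array, and it replaces missing entries with deterministic least-squares predictions instead of drawing them with pmm or midas.

```python
import numpy as np


def chained_imputation(data, maxit=5):
    """Toy chained-equations loop on a 2D float array containing NaNs."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    filled = data.copy()

    # Initial imputation: fill each missing entry with its column's observed mean.
    col_means = np.nanmean(data, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(maxit):
        for j in range(data.shape[1]):
            mis_j = missing[:, j]
            if not mis_j.any():
                continue  # nothing to impute in this column
            # Regress column j on the other (currently filled) columns,
            # fitting only on rows where column j was observed.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(filled)), others])
            beta, *_ = np.linalg.lstsq(design[~mis_j], filled[~mis_j, j], rcond=None)
            # Overwrite the missing entries of column j with the predictions.
            filled[mis_j, j] = design[mis_j] @ beta
    return filled
```

Observed values are never modified; only the entries that were NaN in the input are updated in each cycle.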
- pool(summ=False)
Pools parameter estimates and covariance matrices from multiple imputations to produce overall inference estimates following Rubin’s rules.
Aggregates results across multiple imputed datasets by combining within-imputation variance and between-imputation variance to estimate overall parameter uncertainty.
- Parameters:
summ (bool, optional) – If True, returns a summary of the pooled fit (default is False).
- Returns:
None. Pooled results are stored in self.results as a MICEResults object.
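For reference, the combination of within- and between-imputation variance under Rubin's rules can be sketched as below. This is a minimal illustration, not the library's `pool` method: the function name `pool_rubin` is hypothetical, and it pools per-parameter variances only (the diagonal), whereas the method above works with full covariance matrices.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Pool m sets of estimates and their variances with Rubin's rules.

    estimates, variances: array-likes of shape (m, n_params).
    Returns the pooled estimates and their total variances.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.shape[0]

    q_bar = estimates.mean(axis=0)          # pooled point estimate
    u_bar = variances.mean(axis=0)          # within-imputation variance
    b = estimates.var(axis=0, ddof=1)       # between-imputation variance
    total = u_bar + (1 + 1 / m) * b         # total variance
    return q_bar, total
```

The factor (1 + 1/m) inflates the between-imputation variance to account for using a finite number of imputations.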
- set_methods(d)
Assigns imputation methods to columns in the dataset.
Each column listed in the dictionary d is assigned the method given there. Any column not listed in d receives the default method, “pmm”, which is used for both categorical and numeric columns.
- Parameters:
d (dict) – Dictionary mapping column names to imputation methods.
- Returns:
None
- Raises:
ValueError – If any method in d is not supported (checked by _check_d).
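The assignment logic described above amounts to validating the user-supplied dictionary and filling in defaults. The sketch below shows one way this could look; `resolve_methods` and `SUPPORTED_METHODS` are hypothetical names, and the set of supported methods is an assumption made for illustration, not taken from the library.

```python
# Assumed set of method names, for illustration only.
SUPPORTED_METHODS = {"pmm", "midas"}


def resolve_methods(columns, d, default="pmm"):
    """Return a {column: method} mapping, using `default` for unlisted columns."""
    unknown = {col: meth for col, meth in d.items() if meth not in SUPPORTED_METHODS}
    if unknown:
        # Mirrors the documented behaviour of raising ValueError on bad methods.
        raise ValueError(f"Unsupported imputation methods: {unknown}")
    return {col: d.get(col, default) for col in columns}
```

For example, `resolve_methods(["age", "income"], {"income": "midas"})` would map "age" to "pmm" and "income" to "midas".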
- imputation.midas(y, ry, x, ridge=1e-05, midas_kappa=None, outout=True)
MIDAS Imputation: Multiple Imputation with Distant Average Substitution.
This function implements the MIDAS imputation algorithm for continuous variables, as introduced by Gaffert et al. (2018).
It operates by weighting observed donors based on the similarity between predicted values, with optional leave-one-out model estimation for increased fidelity.
- Parameters:
y (array-like of shape (n_samples,)) – The target variable with missing values to be imputed. Must be numeric.
ry (array-like of bool of shape (n_samples,)) – Logical array indicating observed values in y. True where y is observed, False where missing.
x (array-like of shape (n_samples, n_features)) – Design matrix of predictor variables. Must be fully observed.
ridge (float, default=1e-5) – Ridge penalty used in regularized regression to stabilize the solution in the presence of multicollinearity. Set it lower (e.g. 1e-6) to reduce bias in noisy data, or higher (e.g. 1e-4) if collinearity is suspected.
midas_kappa (float or None, default=None) – Controls the sharpness of donor weighting. If None, the optimal value is estimated based on R² as described by Siddique and Belin (2008). A common fallback is 3.
outout (bool, default=True) – If True, uses leave-one-out regression for each donor (slow but MI-proper). If False, a single model is estimated for all donors and recipients. WARNING: Setting outout=False may produce biased estimates and is not fully supported.
- Returns:
y_imp – The input array y with imputed values replacing the missing entries.
- Return type:
np.ndarray
Notes
Based on: Gaffert, P., Meinfelder, F., & van den Bosch, V. (2018). “Towards an MI-proper Predictive Mean Matching.”
Related: Siddique, J. & Belin, T. R. (2008). “Multiple Imputation Using an Iterative Hot-Deck with Distance-Based Donor Selection.”
Examples
>>> import numpy as np
>>> y = np.array([7, np.nan, 9, 10, 11])
>>> ry = ~np.isnan(y)
>>> x = np.array([[1, 2], [3, 4], [5, 6], [7, 13], [11, 10]])
>>> midas(y, ry, x)
array([ 7.,  9.,  9., 10., 11.])
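The donor weighting described above can be illustrated with a stripped-down sketch. It is not the `midas` function itself: the name `distance_weighted_draw` is hypothetical, and the sketch omits the leave-one-out estimation and the automatic choice of kappa. It also assumes, for illustration, that donor weights are inverse distances between predicted values raised to the power kappa, and that the ridge penalty is added directly to the diagonal of X'X.

```python
import numpy as np


def distance_weighted_draw(y, ry, x, kappa=3.0, ridge=1e-5, seed=None):
    """Impute missing y by sampling donors with distance-based probabilities."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    ry = np.asarray(ry, dtype=bool)
    design = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    x_obs, y_obs = design[ry], y[ry]

    # Ridge-stabilised least squares fitted on the observed rows only.
    xtx = x_obs.T @ x_obs
    beta = np.linalg.solve(xtx + ridge * np.eye(xtx.shape[0]), x_obs.T @ y_obs)
    pred_obs, pred_mis = x_obs @ beta, design[~ry] @ beta

    # Distance between every recipient (row) and every donor (column).
    dist = np.abs(pred_mis[:, None] - pred_obs[None, :])
    dist = np.maximum(dist, 1e-12)  # guard against division by zero
    weights = dist ** (-kappa)      # closer donors get larger weights
    probs = weights / weights.sum(axis=1, keepdims=True)

    y_imp = y.copy()
    y_imp[~ry] = [rng.choice(y_obs, p=p) for p in probs]
    return y_imp
```

A larger kappa concentrates the probability mass on the closest donors, while kappa near zero approaches uniform sampling from all donors.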
- imputation.pmm(y, ry, x, wy=None, donors=5, matchtype=1, quantify=True, ridge=1e-05, matcher='NN', **kwargs)
Predictive Mean Matching (PMM) imputation.
This function imputes missing values in a variable y using predictive mean matching. The method is based on Rubin’s (1987) Bayesian linear regression and mimics the behavior of the R mice package’s PMM imputation method.
- Parameters:
y (array-like (1D), shape (n_samples,)) – Target variable to be imputed. Can be numeric or categorical.
ry (array-like of bool, shape (n_samples,)) – Logical array indicating which elements of y are observed (True) or missing (False).
x (array-like (2D), shape (n_samples, n_features)) – Numeric design matrix of predictors. Must have no missing values.
wy (array-like of bool, shape (n_samples,), optional) – Logical array indicating which values should be imputed. If None, wy is set to the complement of ry.
donors (int, default=5) – Number of donors to draw from the observed cases when imputing missing values.
matchtype (int, default=1) – Type of matching: 0 matches the predicted value of y_obs against the predicted value of y_mis; 1 (default) matches the predicted value of y_obs against the drawn value of y_mis; 2 matches the drawn value of y_obs against the drawn value of y_mis.
quantify (bool, default=True) – If True and y is categorical, factor levels are replaced by the first canonical variate (via CCA). If False, categorical values are replaced by integer codes (less accurate).
ridge (float, default=1e-5) – Ridge regularization parameter used in norm_draw() to stabilize estimation. Increase for multicollinear data, decrease to reduce bias.
matcher (str, default="NN") – Matching method. Currently only “NN” (nearest neighbor) is supported.
**kwargs (dict) – Additional arguments passed to norm_draw(), such as ls_meth.
- Returns:
y_imp – Imputed version of y with missing values filled via PMM. Returns object array if y was categorical, else float array.
- Return type:
np.ndarray
Notes
Based on: Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys.
Also: Van Buuren, S. & Groothuis-Oudshoorn, K. (2011). mice R package.
Examples
>>> import numpy as np
>>> y = np.array([7, np.nan, 9, 10, 11])
>>> ry = ~np.isnan(y)
>>> x = np.array([[1, 2], [3, 4], [5, 7], [7, 8], [9, 10]])
>>> pmm(y=y, ry=ry, x=x, donors=3)
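The core idea of predictive mean matching can be sketched in a few lines. This is a simplified illustration, not the `pmm` function documented above: the name `pmm_sketch` is hypothetical, it handles numeric y only, and it matches predicted values against predicted values (roughly the matchtype 0 case) using ordinary least squares, without the Bayesian parameter draw or ridge penalty.

```python
import numpy as np


def pmm_sketch(y, ry, x, donors=5, seed=None):
    """Impute missing y by sampling from the nearest observed donors."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    ry = np.asarray(ry, dtype=bool)
    design = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])

    # Fit a linear model on observed rows and predict for all rows.
    beta, *_ = np.linalg.lstsq(design[ry], y[ry], rcond=None)
    pred_obs, pred_mis = design[ry] @ beta, design[~ry] @ beta

    y_obs = y[ry]
    k = min(donors, len(y_obs))  # cannot use more donors than observed cases
    imputed = []
    for p in pred_mis:
        # Indices of the k observed cases whose predictions are closest to p.
        nearest = np.argsort(np.abs(pred_obs - p))[:k]
        # Borrow the observed value of one randomly chosen donor.
        imputed.append(y_obs[rng.choice(nearest)])

    y_imp = y.copy()
    y_imp[~ry] = imputed
    return y_imp
```

Because imputed values are always copied from observed cases, the result stays within the range of the observed data.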