misspi is designed to significantly improve imputation speed for high dimensional data. It is a stable framework built upon missForest (Stekhoven and Bühlmann 2012) and MICE (Azur et al. 2011), but, in addition, it can parallelize over the features of a single dataset, making full use of available computation cores while keeping the imputation results consistent thanks to its embarrassingly parallel nature. It leverages LightGBM (Ke et al. 2017) by default and seamlessly integrates with various other machine learning algorithms, even if they are not intrinsically parallel.
Parallel imputation in misspi consists of three main steps:
Step 1. Fill all missing values with initial predictions, e.g., tree predictions, means, or medians.
Step 2. Iterate through all variables; the missing part of each variable is predicted from the values of all other variables taken from the previous iteration. This procedure is inherently embarrassingly parallel.
Step 3. Repeat until the convergence criterion is satisfied.
The workflow is as follows: the missing part of the \(k\)-th variable in iteration \(j\), \(x_k^{(j)}\), is imputed from all other variables (excluding \(k\)) of the \((j-1)\)-th iteration, \(x_{-k}^{(j-1)}\). This distinguishes misspi from the missForest procedure, where imputation is performed sequentially in place.
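To make the scheme concrete, below is a minimal conceptual sketch of one iteration in R (illustrative only, not misspi's internal code; the helper name impute.iteration and the use of lm are assumptions for demonstration):
# Conceptual sketch of one iteration (not misspi's code): variable k is imputed
# from all other variables of the previous iteration, so the per-variable model
# fits could run in parallel within an iteration
impute.iteration <- function(x.prev, na.idx.list, fit.fun = lm) {
  x.new <- x.prev
  for (k in seq_len(ncol(x.prev))) {              # parallelizable over k
    miss.k <- na.idx.list[[k]]                    # rows missing in variable k
    if (length(miss.k) == 0) next
    df <- data.frame(y = x.prev[, k], x.prev[, -k, drop = FALSE])
    fit <- fit.fun(y ~ ., data = df[-miss.k, , drop = FALSE])
    x.new[miss.k, k] <- predict(fit, newdata = df[miss.k, , drop = FALSE])
  }
  x.new
}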
# Install if not yet installed
if (!require("misspi")) {
install.packages("misspi")
}
## Loading required package: misspi
# Load Package
library(misspi)
We will use iris and a high dimensional dataset, toxicity (Gul et al. 2021), for demonstration. We use missar(x, miss.rate, miss.var) to generate missing values, where miss.var selects a random percentage of variables to contain missing values and miss.rate is the missing rate within those variables. Throughout the tutorial, we will use iris for general illustration and toxicity to exhibit the superior speed on high dimensional data.
# Load a small data
data(iris)
# Keep numerical columns
num.col <- which(sapply(iris, is.numeric))
iris.numeric <- as.matrix(iris[, num.col])
# Load a high dimensional data
data(toxicity, package="misspi")
# Generate some missing values in both datasets
# missar introduces missing values at random, as described above
set.seed(0)
iris.miss <- missar(iris.numeric, 0.3, 1)
set.seed(0)
toxicity.miss <- missar(toxicity, 0.4, 0.2)
We illustrate using the two datasets. A single call to the misspi() function is sufficient to impute the data in parallel, using random forest (Breiman 2001) as the default. We highly recommend keeping the default progress = T to monitor progress with fun shapes, but here we turn it off to accommodate the markdown output.
# Imputation
# For iris data
# recommend
# iris.impute <- misspi(iris.miss)
iris.impute <- misspi(iris.miss, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0114309991586331 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.0116804139169135 ...
## Early stopping invoked ...
# For toxicity data
# recommend
# toxicity.impute <- misspi(toxicity.miss)
toxicity.impute <- misspi(toxicity.miss, progress = F)
## We highly recommend activate viselect since your data is in high dimension. This may highly improve the speed and imputation accuracy ...
## Parallel computing using 10 cores ...
## Imputing a matrix with 171 rows and 1203 columns ...
## Highest missing rate is 0.491228070175439 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.00289110202106119 ...
##
## Algorithm converged at iter 1 ...
We take the toxicity imputation result as an example. For this dataset of dimension 171 * 1203, with 240 variables having missing values at a missing rate of 0.4, the imputation completed in about 20 seconds on a MacBook Pro with only 10 cores. The speed gain will be dramatic when using an HPC cluster for large dimensions.
# The imputed result is stored in the `x.imputed` matrix.
toxicity.impute$x.imputed[1:5, 1:5]
## MATS3v nHBint10 MATS3s MATS3p nHBDon_Lipinski
## [1,] 0.0908 0 0.0075 0.0173 0
## [2,] 0.0213 0 0.1144 -0.0410 0
## [3,] 0.0018 0 -0.0156 -0.0765 2
## [4,] -0.0251 0 -0.0064 -0.0894 3
## [5,] 0.0135 0 0.0424 -0.0353 0
# Check imputation time
toxicity.impute$time.elapsed
## Time difference of 15.80384 secs
# Check resources
detectCores()
## [1] 10
In a simulation study where the true values of the missing entries are known, we can evaluate the result by comparing the imputed values with the true values. misspi provides the function evaliq to calculate RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and NRMSE (Normalized Root Mean Squared Error). In the default non-interactive mode, it will automatically fit a regression line and report the \(R^2\). The function also provides an interactive visualization if interactive = TRUE: users can select regions to zoom in for inspection, and hover the mouse over points of interest to compare the true value with the imputed value.
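For reference, a common definition of NRMSE (the one used by missForest; we assume evaliq follows a similar convention, but check its documentation to confirm) normalizes the squared error by the variance of the true values:
\[
\mathrm{NRMSE} = \sqrt{\frac{\operatorname{mean}\big((x_{\mathrm{true}} - x_{\mathrm{imp}})^2\big)}{\operatorname{var}(x_{\mathrm{true}})}}.
\]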
# Get the NA indexes for iris
na.idx.iris <- which(is.na(iris.miss))
# Interactive plot
# evaliq(iris.numeric[na.idx.iris], iris.impute$x.imputed[na.idx.iris], interactive=T)
# Non Interactive plot
iris.error <- evaliq(iris.numeric[na.idx.iris], iris.impute$x.imputed[na.idx.iris])
## `geom_smooth()` using formula = 'y ~ x'
iris.error
## $rmse
## [1] 0.643342
##
## $mae
## [1] 0.4155556
##
## $nrmse
## [1] 0.3300936
# Get the NA indexes for toxicity
na.idx.toxicity <- which(is.na(toxicity.miss))
# Non interactive plot
toxicity.error <- evaliq(toxicity[na.idx.toxicity], toxicity.impute$x.imputed[na.idx.toxicity])
## `geom_smooth()` using formula = 'y ~ x'
toxicity.error
## $rmse
## [1] 2744.697
##
## $mae
## [1] 65.5142
##
## $nrmse
## [1] 0.5477324
We have explored the basic functionalities of the package, which are readily available for use as shown above. In addition, misspi offers a variety of parameters to fine-tune the imputation process.
By default, misspi requests all cores detected on the current host. This can be changed through the parameter ncore. misspi will automatically reduce the number of cores to what is necessary if it exceeds the number of variables to be imputed, and it also protects against excessive requests:
iris.impute.change.core <- misspi(iris.miss, ncore=10000000000, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0114309991586331 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.0116804139169135 ...
## Early stopping invoked ...
In the iterative imputation procedure, initialization fills in the missing values in the first step. misspi currently supports several options for initialization through the argument init.method: rf for random forest initialization, mean for mean initialization, and median for median initialization. In addition, users can specify the maximum number of iterations through maxiter.
Here we provide two special examples for demonstration:
Only use mean imputation
iris.impute.mean <- misspi(iris.miss, init.method = "mean", maxiter = 0, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## No iterations, imputation completed ...
Random forest imputation without iteration
iris.impute.noiter <- misspi(iris.miss, maxiter = 1, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0114309991586331 ...
misspi introduces an early stopping mechanism by default to avoid unnecessary iterations when the algorithm's performance has stabilized but is oscillating. The relative error is monitored at each iteration, and the default earlystopping = T automatically terminates further iterations when the relative error increases compared with the previous iteration. Setting earlystopping = F lets the algorithm iterate until it converges or reaches maxiter (10 by default), whichever comes first.
Without early stopping, the example below takes more iterations than the previous one, but the performance is actually very similar.
# Example of turning off early stopping
iris.impute.earlystopping <- misspi(iris.miss, earlystopping = F, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0114309991586331 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.0116804139169135 ...
##
## Iteration 3 ...
##
## Relative squared difference is 0.00883859701393717 ...
##
## Algorithm converged at iter 3 ...
The built-in methods in misspi include rf for random forest (the default) and gbm for gradient boosted machines, both implemented through the fast LightGBM algorithm. We also include LASSO (Tibshirani 1996) as a method (method = 'lasso'), which could be particularly beneficial in a linear scenario.
Users can pass parameters for the LightGBM methods through lgb.params (when method = 'rf' or method = 'gbm'). The parameters for the random forest initialization can likewise be adjusted through lgb.params0. To quickly change the number of trees used for tree-based inference from the default of 100, simply set ntree and init.ntree for imputation and initialization, respectively.
# Change parameters following the LightGBM manual
params <- list(
objective = "regression",
boosting_type = "rf", # Use Random Forest boosting type
learning_rate = 0.05, # learning rate
max_depth = 5, # maximum depth
bagging_freq = 5, # Frequency for bagging
bagging_fraction = 0.6, # Fraction of data to be used for bagging
feature_fraction = 0.33 # Fraction of features to be used for building trees
)
# Use new parameter for random forest
iris.rf.new.params <- misspi(iris.miss, lgb.params = params, lgb.params0 = params, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0214084890259387 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.0147656150106396 ...
##
## Iteration 3 ...
##
## Relative squared difference is 0.0075542325521034 ...
##
## Algorithm converged at iter 3 ...
# GBM as the method
# You can similarly change the parameters for gbm in the same way as above
iris.gbm <- misspi(iris.miss, method = "gbm", progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0123299464073526 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.00900182123212257 ...
##
## Algorithm converged at iter 2 ...
# Just reduce the number of trees
iris.less.tree <- misspi(iris.miss, ntree = 20, init.ntree = 20, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0177419977067842 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.019877686167096 ...
## Early stopping invoked ...
For imputation using LASSO, you can use nlassofold to change the number of folds for cross validation, and set isis = T to screen variables with iterative sure independence screening (ISIS) (Fan and Lv 2008) before LASSO.
iris.lasso <- misspi(iris.miss, method="lasso", progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0155097380422926 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.00888575773578426 ...
##
## Algorithm converged at iter 2 ...
iris.sis.lasso <- misspi(iris.miss, method="lasso", isis = T, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0154415536261172 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.00803501950853286 ...
##
## Algorithm converged at iter 2 ...
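The number of cross-validation folds can be adjusted in the same call. For example (not run here; the value 5 is an arbitrary choice):
# Use 5-fold cross validation for LASSO (not run)
# iris.lasso.cv5 <- misspi(iris.miss, method = "lasso", nlassofold = 5, progress = F)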
Instead of directly using the model prediction for imputation, misspi by default uses predictive mean matching (PMM), which averages observed values with similar predictions to improve accuracy. The number of neighbors from which it borrows information is controlled by the parameter nn. PMM can be turned off by setting pmm = F.
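To make the PMM idea concrete, below is a minimal standalone sketch in R (illustrative only, not misspi's internal code; the function name pmm.impute and the toy numbers are made up for demonstration):
# Standalone sketch of the PMM idea (not misspi's internals): for a missing
# entry, take the nn observed cases whose predictions are closest to the missing
# entry's prediction, and average their observed values
pmm.impute <- function(pred.miss, pred.obs, y.obs, nn = 5) {
  donors <- order(abs(pred.obs - pred.miss))[seq_len(nn)]
  mean(y.obs[donors])
}
# Toy usage: observed values and their model predictions
y.obs    <- c(5.1, 4.9, 6.3, 5.8, 6.7)
pred.obs <- c(5.0, 5.0, 6.2, 5.9, 6.5)
pmm.impute(pred.miss = 6.0, pred.obs = pred.obs, y.obs = y.obs, nn = 3)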
# Use 10 neighbors
iris.pmm10 <- misspi(iris.miss, nn=10, progress=F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.00694693784737903 ...
##
## Algorithm converged at iter 1 ...
# Turn off PMM and use the predicted value directly
iris.no.pmm <- misspi(iris.miss, pmm=F, progress=F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.00504984172014011 ...
##
## Algorithm converged at iter 1 ...
# Evaluate errors
iris.pmm10.error <- evaliq(iris.numeric[na.idx.iris], iris.pmm10$x.imputed[na.idx.iris], plot = F)
iris.pmm10.error
## $rmse
## [1] 0.5701369
##
## $mae
## [1] 0.3841667
##
## $nrmse
## [1] 0.2925327
iris.no.pmm.error <- evaliq(iris.numeric[na.idx.iris], iris.no.pmm$x.imputed[na.idx.iris], plot = F)
iris.no.pmm.error
## $rmse
## [1] 0.606254
##
## $mae
## [1] 0.4293527
##
## $nrmse
## [1] 0.3110641
You can modify the progress bar shape (when progress = T) from the default to other shapes using Unicode characters, e.g., a chess king, through the char option.
# iris.impute.king <- misspi(iris.miss, char=" \u2654")
After going through the basic parameters, we dive into some special features that misspi offers: variable selection to further speed up imputation for high dimensional data, and customizing your own machine learning algorithm, even if it is not intrinsically parallel, for parallel imputation.
In addition to its parallel nature, misspi provides variable selection using the variable importance obtained during random forest initialization (init.method = 'rf'). In all subsequent iterations, each variable is imputed using only its "important" regressors, i.e., those with high variable importance selected at initialization. The number of regressors kept can be customized by viselect. This procedure significantly boosts the speed for high dimensional data with multiple iterations.
The following example demonstrates the speed gain across iterations, which is more intuitive when progress = T.
toxicity.impute.vi <- misspi(toxicity.miss, viselect=128, progress = F)
## Parallel computing using 10 cores ...
## Imputing a matrix with 171 rows and 1203 columns ...
## Highest missing rate is 0.491228070175439 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0145663584042424 ...
##
## Iteration 2 ...
##
## Relative squared difference is 4.45331305499396e-05 ...
##
## Algorithm converged at iter 2 ...
toxicity.impute.vi$time.elapsed
## Time difference of 30.13338 secs
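One could also compare the imputation accuracy with and without variable selection using evaliq, for example (not run here):
# Compare with the earlier result stored in toxicity.error (not run)
# evaliq(toxicity[na.idx.toxicity], toxicity.impute.vi$x.imputed[na.idx.toxicity], plot = F)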
misspi seamlessly integrates with a wide range of algorithms and can accommodate external machine learning models via the model.train option, requiring only the function name. The supplied model should be able to fit from a formula of the form y ~ x, where y and x are matrices, and its fitted object should also work with predict for prediction.
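As a quick sanity check of this interface, here is a hedged standalone sketch with toy data (not part of misspi; the names x, y, and fit are made up for illustration):
# Standalone interface check (toy data; not part of misspi): a candidate for
# model.train should fit from a formula like y ~ x with matrix inputs, and its
# fitted object should work with predict()
set.seed(1)
x <- matrix(rnorm(60), ncol = 3)
y <- x %*% c(1, -1, 0.5) + rnorm(20, sd = 0.1)
fit <- lm(y ~ x)     # fits with a matrix response and matrix predictors
head(predict(fit))   # predictions are available via predict()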
We will use the iris data for illustration. We can simply provide the built-in model lm for linear models.
# Simply set model.train
iris.impute.lm <- misspi(iris.miss, model.train = lm, progress = F)
## Since your model.train is not null, we have set method to customize ...
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0151947375063306 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.0071021751073966 ...
##
## Algorithm converged at iter 2 ...
Below is an example using SVM from the external package e1071 (Meyer et al. 2023). Note that you can pass any additional arguments to the model through the ... argument of misspi().
# Load package
if (!require("e1071")) {
install.packages("e1071")
}
## Loading required package: e1071
library(e1071)
# Simply set model.train
iris.impute.svm.radial <- misspi(iris.miss, model.train = svm, progress = F)
## Since your model.train is not null, we have set method to customize ...
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0106045631340212 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.00469874434957308 ...
##
## Algorithm converged at iter 2 ...
# Change the kernel
iris.impute.svm.poly <- misspi(iris.miss, model.train = svm, progress = F, kernel = "polynomial")
## Since your model.train is not null, we have set method to customize ...
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.015098512782166 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.0104062134259554 ...
##
## Iteration 3 ...
##
## Relative squared difference is 0.00671787380197917 ...
##
## Algorithm converged at iter 3 ...
We can even try a neural network from the neuralnet package (Fritsch, Guenther, and Wright 2019) as the method, but note that it may require heavy tuning in practice.
# Load package
if (!require("neuralnet")) {
install.packages("neuralnet")
}
## Loading required package: neuralnet
library(neuralnet)
# Simply set model.train
iris.impute.nn <- misspi(iris.miss, model.train = neuralnet, progress = F)
## Since your model.train is not null, we have set method to customize ...
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0136444643359174 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.00746389372107591 ...
##
## Algorithm converged at iter 2 ...
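For instance, arguments such as neuralnet's hidden layer sizes could be passed through ... for tuning (not run here; the chosen sizes are arbitrary):
# Tune the network by passing neuralnet arguments through ... (not run)
# iris.impute.nn2 <- misspi(iris.miss, model.train = neuralnet, hidden = c(5, 3), progress = F)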