misspi is designed to significantly improve imputation speed for high dimensional data. It is a stable framework built upon missForest (Stekhoven and Bühlmann 2012) and MICE (Azur et al. 2011), but, in addition, it can parallelize over the features of a single dataset, making full use of available computation cores while keeping the imputation results consistent thanks to its embarrassingly parallel nature. It leverages LightGBM (Ke et al. 2017) by default and seamlessly integrates with various other machine learning algorithms, even if they are not intrinsically parallel.
Parallel imputation in misspi consists of three main steps:
Step 1. Fill all missing values with initial predictions, e.g., tree predictions, means, or medians.
Step 2. Iterate through all variables; the missing part of each variable is predicted from the values of all other variables taken from the previous iteration. This procedure is inherently embarrassingly parallel.
Step 3. Repeat until the convergence criterion is satisfied.
The workflow is as follows: the missing part of the \(k\)-th variable in iteration \(j\), \(x_k^{(j)}\), is imputed from all other variables (excluding \(k\)) of the \((j-1)\)-th iteration, \(x_{-k}^{(j-1)}\). This distinguishes misspi from the missForest procedure, where imputation is performed sequentially in place.
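To make the scheme concrete, below is a minimal conceptual sketch of one iteration in R (illustrative only, not misspi's internal code; the helper name impute.iteration and the use of lm are assumptions for demonstration):
# Conceptual sketch of one iteration (not misspi's code): variable k is imputed
# from all other variables of the previous iteration, so the per-variable model
# fits could run in parallel within an iteration
impute.iteration <- function(x.prev, na.idx.list, fit.fun = lm) {
  x.new <- x.prev
  for (k in seq_len(ncol(x.prev))) {              # parallelizable over k
    miss.k <- na.idx.list[[k]]                    # rows missing in variable k
    if (length(miss.k) == 0) next
    df <- data.frame(y = x.prev[, k], x.prev[, -k, drop = FALSE])
    fit <- fit.fun(y ~ ., data = df[-miss.k, , drop = FALSE])
    x.new[miss.k, k] <- predict(fit, newdata = df[miss.k, , drop = FALSE])
  }
  x.new
}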
# Install if not yet installed
if (!require("misspi")) {
install.packages("misspi")
}
## Loading required package: misspi
# Load Package
library(misspi)
We will use iris and a high dimensional dataset, toxicity (Gul et al. 2021), for demonstration. We use missar(x, miss.rate, miss.var) to generate missing values, where miss.var selects a random percentage of variables to contain missing values and miss.rate is the missing rate within those variables. Throughout the tutorial, we will use iris for general illustration and toxicity to exhibit the superior speed on high dimensional data.
# Load a small data
data(iris)
# Keep numerical columns
num.col <- which(sapply(iris, is.numeric))
iris.numeric <- as.matrix(iris[, num.col])
# Load a high dimensional data
data(toxicity, package="misspi")
# Generate some missing values in both datasets
# missar introduces missing values at random, as described above
set.seed(0)
iris.miss <- missar(iris.numeric, 0.3, 1)
set.seed(0)
toxicity.miss <- missar(toxicity, 0.4, 0.2)
We illustrate using the two datasets. A single call to the misspi() function is sufficient to impute the data in parallel, using random forest (Breiman 2001) as the default. We highly recommend keeping the default progress = T to monitor progress with fun shapes, but here we turn it off to accommodate the markdown output.
# Imputation
# For iris data
# recommend
# iris.impute <- misspi(iris.miss)
iris.impute <- misspi(iris.miss, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0114309991586331 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.0116804139169135 ...
## Early stopping invoked ...
# For toxicity data
# recommend
# toxicity.impute <- misspi(toxicity.miss)
toxicity.impute <- misspi(toxicity.miss, progress = F)
## We highly recommend activate viselect since your data is in high dimension. This may highly improve the speed and imputation accuracy ...
## Parallel computing using 10 cores ...
## Imputing a matrix with 171 rows and 1203 columns ...
## Highest missing rate is 0.491228070175439 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.00289110202106119 ...
##
## Algorithm converged at iter 1 ...
We take the toxicity imputation result as an example. For this dataset of dimension 171 * 1203, with 240 variables having missing values at a missing rate of 0.4, the imputation completed in about 20 seconds on a MacBook Pro with only 10 cores. The speed gain will be dramatic when using an HPC cluster for large dimensions.
# The imputed result is stored in the `x.imputed` matrix.
toxicity.impute$x.imputed[1:5, 1:5]
## MATS3v nHBint10 MATS3s MATS3p nHBDon_Lipinski
## [1,] 0.0908 0 0.0075 0.0173 0
## [2,] 0.0213 0 0.1144 -0.0410 0
## [3,] 0.0018 0 -0.0156 -0.0765 2
## [4,] -0.0251 0 -0.0064 -0.0894 3
## [5,] 0.0135 0 0.0424 -0.0353 0
# Check imputation time
toxicity.impute$time.elapsed
## Time difference of 15.80384 secs
# Check resources
detectCores()
## [1] 10
In a simulation study where the true values of the missing entries are known, we can evaluate the result by comparing the imputed values with the true values. misspi provides the function evaliq to calculate RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and NRMSE (Normalized Root Mean Squared Error). In the default non-interactive mode, it will automatically fit a regression line and report the \(R^2\). The function also provides an interactive visualization if interactive = TRUE: users can select regions to zoom in for inspection, and hover the mouse over points of interest to compare the true value with the imputed value.
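For reference, a common definition of NRMSE (the one used by missForest; we assume evaliq follows a similar convention, but check its documentation to confirm) normalizes the squared error by the variance of the true values:
\[
\mathrm{NRMSE} = \sqrt{\frac{\operatorname{mean}\big((x_{\mathrm{true}} - x_{\mathrm{imp}})^2\big)}{\operatorname{var}(x_{\mathrm{true}})}}.
\]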
# Get the NA indexes for iris
na.idx.iris <- which(is.na(iris.miss))
# Interactive plot
# evaliq(iris.numeric[na.idx.iris], iris.impute$x.imputed[na.idx.iris], interactive=T)
# Non Interactive plot
iris.error <- evaliq(iris.numeric[na.idx.iris], iris.impute$x.imputed[na.idx.iris])
## `geom_smooth()` using formula = 'y ~ x'
iris.error
## $rmse
## [1] 0.643342
##
## $mae
## [1] 0.4155556
##
## $nrmse
## [1] 0.3300936
# Get the NA indexes for toxicity
na.idx.toxicity <- which(is.na(toxicity.miss))
# Non interactive plot
toxicity.error <- evaliq(toxicity[na.idx.toxicity], toxicity.impute$x.imputed[na.idx.toxicity])
## `geom_smooth()` using formula = 'y ~ x'
toxicity.error
## $rmse
## [1] 2744.697
##
## $mae
## [1] 65.5142
##
## $nrmse
## [1] 0.5477324
We have explored the basic functionalities of the package, which are readily available for use as shown above. In addition, misspi offers a variety of parameters to fine-tune the imputation process.
By default, misspi requests all cores detected on the current host. This can be changed through the parameter ncore. misspi will automatically reduce the number of cores to what is necessary if it exceeds the number of variables to be imputed, and it also protects against excessive requests:
iris.impute.change.core <- misspi(iris.miss, ncore=10000000000, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0114309991586331 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.0116804139169135 ...
## Early stopping invoked ...
In the iterative imputation procedure, initialization fills in the missing values in the first step. misspi currently supports several options for initialization through the argument init.method: rf for random forest initialization, mean for mean initialization, and median for median initialization. In addition, users can specify the maximum number of iterations through maxiter.
Here we provide two special examples for demonstration:
Only use mean imputation
iris.impute.mean <- misspi(iris.miss, init.method = "mean", maxiter = 0, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## No iterations, imputation completed ...
Random forest imputation without iteration
iris.impute.noiter <- misspi(iris.miss, maxiter = 1, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0114309991586331 ...
misspi introduces an early stopping mechanism by default to avoid unnecessary iterations when the algorithm's performance has stabilized but is oscillating. The relative error is monitored at each iteration, and the default earlystopping = T automatically terminates further iterations when the relative error increases compared with the previous iteration. Setting earlystopping = F lets the algorithm iterate until it converges or reaches maxiter (10 by default), whichever comes first.
Without early stopping, the example below takes more iterations than the previous one, but the performance is actually very similar.
# Example of turning off early stopping
iris.impute.earlystopping <- misspi(iris.miss, earlystopping = F, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0114309991586331 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.0116804139169135 ...
##
## Iteration 3 ...
##
## Relative squared difference is 0.00883859701393717 ...
##
## Algorithm converged at iter 3 ...
The built-in methods in misspi include rf for random forest (the default) and gbm for gradient boosted machines, both implemented through the fast LightGBM algorithm. We also include LASSO (Tibshirani 1996) as a method (method = 'lasso'), which could be particularly beneficial in a linear scenario.
Users can pass parameters for the LightGBM methods through lgb.params (when method = 'rf' or method = 'gbm'). The parameters for the random forest initialization can likewise be adjusted through lgb.params0. To quickly change the number of trees used for tree-based inference from the default of 100, simply set ntree and init.ntree for imputation and initialization, respectively.
# Change parameters following the LightGBM manual
params <- list(
objective = "regression",
boosting_type = "rf", # Use Random Forest boosting type
learning_rate = 0.05, # learning rate
max_depth = 5, # maximum depth
bagging_freq = 5, # Frequency for bagging
bagging_fraction = 0.6, # Fraction of data to be used for bagging
feature_fraction = 0.33 # Fraction of features to be used for building trees
)
# Use new parameter for random forest
iris.rf.new.params <- misspi(iris.miss, lgb.params = params, lgb.params0 = params, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0214084890259387 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.0147656150106396 ...
##
## Iteration 3 ...
##
## Relative squared difference is 0.0075542325521034 ...
##
## Algorithm converged at iter 3 ...
# GBM as the method
# You can similarly change the parameters for gbm in the same way as above
iris.gbm <- misspi(iris.miss, method = "gbm", progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0123299464073526 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.00900182123212257 ...
##
## Algorithm converged at iter 2 ...
# Just reduce the number of trees
iris.less.tree <- misspi(iris.miss, ntree = 20, init.ntree = 20, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0177419977067842 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.019877686167096 ...
## Early stopping invoked ...
For imputation using LASSO, you can use nlassofold to change the number of folds for cross validation, and set isis = T to screen variables with iterative sure independence screening (ISIS) (Fan and Lv 2008) before LASSO.
iris.lasso <- misspi(iris.miss, method="lasso", progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0155097380422926 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.00888575773578426 ...
##
## Algorithm converged at iter 2 ...
iris.sis.lasso <- misspi(iris.miss, method="lasso", isis = T, progress = F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0154415536261172 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.00803501950853286 ...
##
## Algorithm converged at iter 2 ...
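The number of cross-validation folds can be adjusted in the same call. For example (not run here; the value 5 is an arbitrary choice):
# Use 5-fold cross validation for LASSO (not run)
# iris.lasso.cv5 <- misspi(iris.miss, method = "lasso", nlassofold = 5, progress = F)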
Instead of directly using the model prediction for imputation, misspi by default uses predictive mean matching (PMM), which averages observed values with similar predictions to improve accuracy. The number of neighbors from which it borrows information is controlled by the parameter nn. PMM can be turned off by setting pmm = F.
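To make the PMM idea concrete, below is a minimal standalone sketch in R (illustrative only, not misspi's internal code; the function name pmm.impute and the toy numbers are made up for demonstration):
# Standalone sketch of the PMM idea (not misspi's internals): for a missing
# entry, take the nn observed cases whose predictions are closest to the missing
# entry's prediction, and average their observed values
pmm.impute <- function(pred.miss, pred.obs, y.obs, nn = 5) {
  donors <- order(abs(pred.obs - pred.miss))[seq_len(nn)]
  mean(y.obs[donors])
}
# Toy usage: observed values and their model predictions
y.obs    <- c(5.1, 4.9, 6.3, 5.8, 6.7)
pred.obs <- c(5.0, 5.0, 6.2, 5.9, 6.5)
pmm.impute(pred.miss = 6.0, pred.obs = pred.obs, y.obs = y.obs, nn = 3)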
# Use 10 neighbors
iris.pmm10 <- misspi(iris.miss, nn=10, progress=F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.00694693784737903 ...
##
## Algorithm converged at iter 1 ...
# Turn off PMM and use the predicted value directly
iris.no.pmm <- misspi(iris.miss, pmm=F, progress=F)
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.00504984172014011 ...
##
## Algorithm converged at iter 1 ...
# Evaluate errors
iris.pmm10.error <- evaliq(iris.numeric[na.idx.iris], iris.pmm10$x.imputed[na.idx.iris], plot = F)
iris.pmm10.error
## $rmse
## [1] 0.5701369
##
## $mae
## [1] 0.3841667
##
## $nrmse
## [1] 0.2925327
iris.no.pmm.error <- evaliq(iris.numeric[na.idx.iris], iris.no.pmm$x.imputed[na.idx.iris], plot = F)
iris.no.pmm.error
## $rmse
## [1] 0.606254
##
## $mae
## [1] 0.4293527
##
## $nrmse
## [1] 0.3110641
You can modify the progress bar shape (when progress = T) from the default to other shapes using Unicode characters, e.g., a chess king, through the char option.
# iris.impute.king <- misspi(iris.miss, char=" \u2654")
After going through the basic parameters, we dive into some special features that misspi offers: variable selection to further speed up imputation for high dimensional data, and customizing your own machine learning algorithm, even if it is not intrinsically parallel, for parallel imputation.
In addition to its parallel nature, misspi provides variable selection using the variable importance obtained during random forest initialization (init.method = 'rf'). In all subsequent iterations, each variable is imputed using only its "important" regressors, i.e., those with high variable importance selected at initialization. The number of regressors kept can be customized by viselect. This procedure significantly boosts the speed for high dimensional data with multiple iterations.
The following example demonstrates the speed gain across iterations, which is more intuitive when progress = T.
toxicity.impute.vi <- misspi(toxicity.miss, viselect=128, progress = F)
## Parallel computing using 10 cores ...
## Imputing a matrix with 171 rows and 1203 columns ...
## Highest missing rate is 0.491228070175439 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0145663584042424 ...
##
## Iteration 2 ...
##
## Relative squared difference is 4.45331305499396e-05 ...
##
## Algorithm converged at iter 2 ...
toxicity.impute.vi$time.elapsed
## Time difference of 30.13338 secs
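One could also compare the imputation accuracy with and without variable selection using evaliq, for example (not run here):
# Compare with the earlier result stored in toxicity.error (not run)
# evaliq(toxicity[na.idx.toxicity], toxicity.impute.vi$x.imputed[na.idx.toxicity], plot = F)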
misspi seamlessly integrates with a wide range of algorithms and can accommodate external machine learning models via the model.train option, requiring only the function name. The supplied model should be able to fit from a formula of the form y ~ x, where y and x are matrices, and its fitted object should also work with predict for prediction.
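As a quick sanity check of this interface, here is a hedged standalone sketch with toy data (not part of misspi; the names x, y, and fit are made up for illustration):
# Standalone interface check (toy data; not part of misspi): a candidate for
# model.train should fit from a formula like y ~ x with matrix inputs, and its
# fitted object should work with predict()
set.seed(1)
x <- matrix(rnorm(60), ncol = 3)
y <- x %*% c(1, -1, 0.5) + rnorm(20, sd = 0.1)
fit <- lm(y ~ x)     # fits with a matrix response and matrix predictors
head(predict(fit))   # predictions are available via predict()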
We will use the iris data for illustration. We can simply provide the built-in model lm for linear models.
# Simply set model.train
iris.impute.lm <- misspi(iris.miss, model.train = lm, progress = F)
## Since your model.train is not null, we have set method to customize ...
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0151947375063306 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.0071021751073966 ...
##
## Algorithm converged at iter 2 ...
Below is an example using SVM from the external package e1071 (Meyer et al. 2023). Note that you can pass any additional arguments to the model through the ... argument of misspi().
# Load package
if (!require("e1071")) {
install.packages("e1071")
}
## Loading required package: e1071
library(e1071)
# Simply set model.train
iris.impute.svm.radial <- misspi(iris.miss, model.train = svm, progress = F)
## Since your model.train is not null, we have set method to customize ...
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0106045631340212 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.00469874434957308 ...
##
## Algorithm converged at iter 2 ...
# Change the kernel
iris.impute.svm.poly <- misspi(iris.miss, model.train = svm, progress = F, kernel = "polynomial")
## Since your model.train is not null, we have set method to customize ...
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.015098512782166 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.0104062134259554 ...
##
## Iteration 3 ...
##
## Relative squared difference is 0.00671787380197917 ...
##
## Algorithm converged at iter 3 ...
We can even try a neural network from the neuralnet package (Fritsch, Guenther, and Wright 2019) as the method, but note that it may require heavy tuning in practice.
# Load package
if (!require("neuralnet")) {
install.packages("neuralnet")
}
## Loading required package: neuralnet
library(neuralnet)
# Simply set model.train
iris.impute.nn <- misspi(iris.miss, model.train = neuralnet, progress = F)
## Since your model.train is not null, we have set method to customize ...
## Set number of cores used to 4 necessary ...
## Parallel computing using 4 cores ...
## Imputing a matrix with 150 rows and 4 columns ...
## Highest missing rate is 0.326666666666667 ...
## Initializing ...
##
## Iteration 1 ...
##
## Relative squared difference is 0.0136444643359174 ...
##
## Iteration 2 ...
##
## Relative squared difference is 0.00746389372107591 ...
##
## Algorithm converged at iter 2 ...
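For instance, arguments such as neuralnet's hidden layer sizes could be passed through ... for tuning (not run here; the chosen sizes are arbitrary):
# Tune the network by passing neuralnet arguments through ... (not run)
# iris.impute.nn2 <- misspi(iris.miss, model.train = neuralnet, hidden = c(5, 3), progress = F)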