Jeffrey C. Wong: Sparse Model Matrix Optimizations in C++

Jeffrey C. Wong

In this post we will add optimizations to the Matrix::sparse.model.matrix function to increase throughput and decrease memory usage. The optimizations can allow users to create sparse model matrices with interactions 5x faster.

The sparse.model.matrix function creates a model matrix that is sparse. This is especially important when there are categorical variables in the data, which will be one-hot encoded and are, by definition, sparse. In addition, interactions with categorical variables are also inherently sparse. Since one-hot encoding generates multiple columns in the feature matrix, we want to make sure storage costs are minimized.

Data and Model Matrices

Let’s start with a data.frame that has 3 categorical variables, each has 10 unique values.

n = 1e5
data = data.frame(
  x1 = sample(LETTERS[1:10], n, replace = TRUE),
  x2 = sample(LETTERS[1:10], n, replace = TRUE),
  x3 = sample(LETTERS[1:10], n, replace = TRUE)
)
data %>% head()

  x1 x2 x3
1  H  D  B
2  C  H  H
3  J  G  B
4  F  I  E
5  D  E  F
6  A  F  D

A model matrix with all interactions has 1000 columns! However, approximately 8 of those columns are nonzero. The storage cost for such a matrix should be small. Yet, from the RStudio profiler, the memory cost for calling sparse.model.matrix is roughly 700 MB. That is quite large. The bulk of the allocations come from the function Matrix:::sparse2int. Below, we will make a copy of the function and add some optimizations.

X1 = sparse.model.matrix(~ x1*x2*x3, data)
dim(X1)

[1] 100000   1000

Matrix sparse2int

The role of the Matrix:::sparse2int function is to compute interactions. It receives two blocks of features, \(X\) and \(Y\). These features are transpose of the original features in the dataframe, so columns in \(X\) and \(Y\) represent observations and rows represent features. Say \(x_1\) is a feature from the \(X\) block, and \(y_1\) is feature from the \(Y\) block. The interaction is simply the product \(x_1 \cdot y_1\). So, for every observation, we want to compute all pairwise products of the features in \(X\) and \(Y\).

This feels like a kronecker product: get feature \(y_1\) for every observation and multiply it with everything in \(X\); repeat for all features in \(Y\). In fact, this operation can be expressed as a kronecker product.

r_naive_interaction_kronecker = function(X, Y) {
  kronecker(rep(1, nrow(Y)), X) * kronecker(Y, rep(1, nrow(X)))
}

The kronecker product kronecker(Y, rep(1, nrow(X))) effectively stretches rows in Y, so that they have the same shape as the X matrix, and can be multipled elementwise. However, it’s not necessary to physically stretch the matrices, we should keep the features compact and recycle instead of stretching them.

Below is an example for how to do this using RcppEigen.

Eigen::SparseMatrix<double> sparse_sparse_interaction_kronecker(
    Eigen::SparseMatrix<double> X,
    Eigen::SparseMatrix<double> Y
  ) {

  SparseMatrix<double, RowMajor> output(X.rows() * Y.rows(), X.cols());
  SparseMatrix<double> Y_transpose = Y.transpose();
  for (int k=0; k<Y_transpose.outerSize(); ++k) {
    SparseMatrix<double> X_scaled = X * VectorXd(Y_transpose.col(k)).asDiagonal();
    X_scaled.prune(0.0);
    output.middleRows(k * X.rows(), X.rows()) = X_scaled;
  }
  return output;
}

Overwriting Functions in Matrix Package

RcppSparse2int

We’ll introduce a new function, RcppSparse2int, which implements the interaction kronecker from Rcpp. This is largely a copy of the original Matrix::sparse2int. Next, we will fork a new function: RcppSparse.model.matrix that will invoke RcppSparse2int.

#' Faster version of sparse2int.r
#' 
#' For most cases, we want to compute
#' kronecker(rep(1, ny), X) * kronecker(Y, rep(1, nx)).
#' However, we should do this without materializing the kronecker.
#' This represents a major speedup and reduction in memory
RcppSparse2int <- function(X, Y, do.names = TRUE, forceSparse = FALSE, verbose = FALSE) {
  if (do.names) {
    dnx <- dimnames(X)
    dny <- dimnames(Y)
  }
  dimnames(Y) <- dimnames(X) <- list(NULL, NULL)
  nx <- nrow(X)
  ny <- nrow(Y)
  r <-
    if ((nX <- is.numeric(X)) | (nY <- is.numeric(Y))) {
      # At least one was dense
      if (nX) {
        # X is dense
        if (nY) {
          # Y is also dense
          dense_dense_interaction_kronecker(X, Y)
        } else if (nx > 1) {
          # X has more than 1 row and is dense.
          # Y is sparse.
          dense_sparse_interaction_kronecker(X, Y)
        }
        # if (nY || nx > 1) { # both numeric, or X >=2 "columns"
        #   F <- if (forceSparse) function(m) .Call(Matrix:::dense_to_Csparse, m) else identity
        #   # F((if(ny == 1) X else X[rep.int(seq_len(nx),  ny)  , ]) *
        #   #   (if(nx == 1) Y else Y[rep     (seq_len(ny),each=nx), ]))
        #   
        # }
        else { ## numeric X (1 "column"),  sparseMatrix Y
          r <- Y
          dp <- Y@p[-1] - Y@p[-(Y@Dim[2] + 1L)]
          ## stopifnot(all(dp %in% 0:1)) # just for now
          ## if(nx == 1)
          ## FIXME: similar trick would be applicable for nx > 2
          r@x <- X[dp == 1L] * Y@x
          r
        }
      }
      else { ## sparseMatrix X, dense Y
        if (ny == 1) {
          ## FIXME: similar trick would be applicable for ny > 2
          r <- X
          dp <- X@p[-1] - X@p[-(X@Dim[2] + 1L)]
          ## stopifnot(all(dp %in% 0:1)) # just for now - drop! - FIXME
          r@x <- Y[dp == 1L] * X@x
          r
        }
        else { ## ny > 1 -- *larger* matrix
          # X is sparse, Y is dense
          sparse_dense_interaction_kronecker(X, Y)
        }
      }
    }
  else { ## X & Y are both sparseMatrix
    sparse_sparse_interaction_kronecker(X, Y)
  }
  if (verbose) {
    cat(sprintf(
      " sp..2int(%s[%d],%s[%d]) ",
      if (nX) "<N>" else "<sparse>", nx,
      if (nY) "<N>" else "<sparse>", ny
    ))
  }
  
  if (do.names) {
    ## FIXME: This names business needs a good solution..
    ##        but maybe "up in the caller"
    if (!is.null(dim(r)) &&
        !is.null(nX <- dnx[[1]]) &&
        !is.null(nY <- dny[[1]])) {
      rownames(r) <- outer(nX, nY, paste, sep = ":")
    }
  }
  r
}

#' Faster version of sparseInt.r
#' 
#' This version invokes RcppSparse2int instead of sparse2int,
#' which in tern invokes Rcpp code to do fast matrix operations.
RcppSparseInt.r <- function(rList, do.names = TRUE, forceSparse = FALSE, verbose = FALSE) {
  nl <- length(rList)
  if (forceSparse) {
    F <- function(m) if (is.matrix(m)) .Call(Matrix:::dense_to_Csparse, m) else m
  }
  if (verbose) {
    cat("RcppSparseInt.r(<list>[1:", nl, "], f.Sp=", forceSparse, "): is.mat()= (",
        paste(symnum(vapply(rList, is.matrix, NA)), collapse = ""),
        ")\n",
        sep = ""
    )
  }
  if (nl == 1) {
    if (forceSparse) F(rList[[1]]) else rList[[1]]
  } else {
    ## 'recursion' free:
    r <- rList[[1]]
    for (j in 2:nl)
      r <- RcppSparse2int(r, rList[[j]],
                          do.names = do.names, verbose = verbose
      )
    if (forceSparse) F(r) else r
  }
}

#' A faster version of model.spmatrix
#' 
#' This version calls RcppSparseInt.r instead of sparseint.r.
#' It also uses RcppSparseMatrixRbindList to bind multiple matrices vertically.
RcppModel.spmatrix <- function(trms, mf, transpose=FALSE,
                               drop.unused.levels = FALSE, row.names=TRUE, sep="", verbose=FALSE) {
  ## Author: Martin Maechler, Date:  7 Jul 2009
  
  ## mf is a model frame or a "simple" data.frame [after reorder !]
  stopifnot(is.data.frame(mf))
  n <- nrow(mf)
  if(row.names)
    rnames <- row.names(mf)
  ## mf:  make into list, dropping all attributes (but the names)
  ### FIXME: for poly(., 5)  mf has a 5-column matrix as "one column" => looses names here
  fnames <- names(mf <- unclass(mf))
  attributes(mf) <- list(names = fnames)
  
  if(length(factorPattern <- attr(trms, "factors"))) {
    d <- dim(factorPattern)
    nVar <- d[1]
    nTrm <- d[2]
    n.fP <- dimnames(factorPattern)
    fnames <- n.fP[[1]] # == names of variables {incl. "F(var)"} in the model
    Names  <- n.fP[[2]] # == colnames == names of terms:  "a", "b:c", ...
  } else { ## degenerate, e.g.  'Y ~ 1'
    nVar <- nTrm <- 0L
    fnames <- Names <- character(0)
  }
  ## all the "variables in the model" are also in "mf", including "sin(x)";
  ## actually, ..../src/main/model.c even assumes
  stopifnot((m <- length(mf)) >= nVar)
  if(verbose)
    cat(sprintf("model.spm..(): (n=%d, nVar=%d (m=%d), nTrm=%d)\n",
                n, nVar,m, nTrm))
  if(m > nVar) mf <- mf[seq_len(nVar)]
  stopifnot(fnames == names(mf), allow.logical0 = TRUE)
  noVar <- nVar == 0
  ##>> this seems wrong; we use  1:nVar for indexing mf[] below ..
  ##>> if(noVar) nVar <- 1L # (as in ~/R/D/r-devel/R/src/main/model.c)
  ## Note: "character" variables have been changed to factor in the caller;
  ##     hence: both factor and *logical*  should be dealt as factor :
  is.f <- if(noVar) logical(0) else vapply(mf, function(.)
    is.factor(.) | is.logical(.), NA)
  indF <- which(is.f)
  if(verbose) { cat(" --> indF =\n"); print(indF) }
  hasInt <- attr(trms, "intercept") == 1
  ## the degree of interaction:
  ## intOrder <- attr(trms, "order")
  ##
  if(!hasInt && length(indF)) {
    ## change the '1' of the first factor into a '2' :
    if(any(i1 <- factorPattern[indF, ] == 1))
      ## replace at the first '1' location:
      factorPattern[indF,][which.max(i1)] <- 2L
    else {}
    ## nothing to do
  }
  ## Convert "factors" to "Rowwise- sparseMatrix ("dummy"-matrix) -----------
  ## Result: a list of sparse model matrices for the "factor"s :
  f.matr <- structure(vector("list", length = length(indF)),
                      names = fnames[indF])
  i.f <- 0
  ## ---- For each variable in the model -------------------
  for(i in seq_len(nVar)) {
    nam <- fnames[i]
    f <- mf[[i]]
    if(is.f[i]) {
      fp <- factorPattern[i,] ## == factorPattern[nam,]
      contr <- attr(f, "contrasts")
      f.matr[[(i.f <- i.f + 1)]] <- # a list of 2
        lapply(fac2Sparse(f, to = "d",
                          drop.unused.levels=drop.unused.levels,
                          factorPatt12 = 1:2 %in% fp,
                          contrasts.arg = contr),
               function(s) {
                 if(is.null(s)) return(s)
                 ## else
                 rownames(s) <- ## for some contr.*(), have lost rownames; hmm..
                   paste(nam, rownames(s) %||% seq_len(nrow(s)), sep=sep)
                 s
               })
    } else { ## continuous variable --> "matrix" - for all of them
      if(any(iA <- (cl <- class(f)) == "AsIs")) # drop "AsIs" class
        class(f) <- if(length(cl) > 1L) cl[!iA]
      nr <- if(is.matrix(f)) nrow(f <- t(f)) else (dim(f) <- c(1L, length(f)))[1]
      if(is.null(rownames(f)))
        rownames(f) <- if(nr == 1) nam else paste(nam, seq_len(nr), sep=sep)
      mf[[i]] <- f
    }
  }
  if(verbose) {
    cat(" ---> f.matr list :\n")
    str(f.matr, max = as.integer(verbose))
    fNms <- format(dQuote(Names))
    dim.string <- gsub('5', as.character(floor(1+log10(n))),
                       " -- concatenating (r, rj): dim = (%5d,%5d) | (%5d,%5d)\n")
  }
  
  ## FIXME: do all this in C --
  
  getR <- function(N)      # using 'nm'
    if(!is.null(r <- f.matr[[N]])) r[[factorPattern[N, nm]]] else mf[[N]]
  vNms <- "(Intercept)"[hasInt]
  counts <- integer(nTrm)
  r <-
    if(hasInt) ## column of 1's - as sparse
      new("dgCMatrix", i = 0:(n-1L), p = c(0L, n),
          Dim = c(n, 1L), x = rep.int(1, n))
  else new("dgCMatrix", Dim = c(n, 0L))
  if(transpose) r <- t(r)
  iTrm <- seq_len(nTrm)
  r.list <- list(r)
  for(j in iTrm) { ## j-th term
    nm <- Names[j]
    if(verbose) cat(sprintf("term[%2d] %s .. ", j, fNms[j]))
    nmSplits <- strsplit(nm, ":", fixed=TRUE)[[1]]
    ## NOTA BENE: This can be very slow when many terms are involved
    ## FIXME ??? why does it use *much* memory in those cases ??
    # rj <- sparseInt.r(lapply(nmSplits, getR), do.names=TRUE,
    #                   forceSparse = TRUE, verbose=verbose)# or just (verbose >= 2))
    rj <- RcppSparseInt.r(lapply(nmSplits, getR),
                          do.names = TRUE,
                          forceSparse = TRUE, verbose = verbose
    )
    r.list[[j + 1]] <- rj
    if(verbose) cat(sprintf(dim.string, nrow(r), ncol(r), nrow(rj),ncol(rj)))
    ## fast version of cbind2() / rbind2(), w/o checks, dimnames, etc
    # r <- if(transpose) .Call(Csparse_vertcat, r, rj)
    # else     .Call(Csparse_horzcat, r, t(rj))
    ## if(verbose) cat(" [Ok]\n")
    vNms <- c(vNms, dimnames(rj)[[1]])
    counts[j] <- nrow(rj)
  }
  r <- RcppSparseMatrixRbindList(r.list) %>% t()
  rns <- if(row.names) rnames
  dimnames(r) <- if (transpose) list(rns, vNms) else list(vNms, rns)
  attr(r, "assign") <- c(if(hasInt) 0L, rep(iTrm, counts))
  r
}

#' Sparse Model Matrix
#'
#' Rcpp replacement for sparse.model.matrix.
#' This function replaces the call to model.spmatrix with RcppModel.spmatrix
RcppSparse.model.matrix <-
  function(object, data = environment(object), contrasts.arg = NULL,
           xlev = NULL, transpose = TRUE,
           drop.unused.levels = FALSE, row.names = TRUE
           , sep = ""
           , verbose = FALSE, ...)
  {
    t <- if(missing(data)) terms(object) else terms(object, data=data)
    if (is.null(attr(data, "terms")))
      data <- model.frame(object, data, xlev=xlev)
    else {
      reorder <- match(sapply(attr(t,"variables"),deparse,
                              width.cutoff=500)[-1L],
                       names(data))
      if (anyNA(reorder))
        stop("model frame and formula mismatch in model.matrix()")
      if(!isSeq(reorder, ncol(data), Ostart=FALSE))
        data <- data[,reorder, drop=FALSE]
    }
    int <- attr(t, "response")
    if(length(data)) {      # otherwise no rhs terms, so skip all this
      contr.funs <- as.character(getOption("contrasts"))
      namD <- names(data)
      ## turn any character columns into factors
      for(i in namD)
        if(is.character(data[[i]]))
          data[[i]] <- factor(data[[i]])
      isF <- vapply(data, function(x) is.factor(x) || is.logical(x), NA)
      isF[int] <- FALSE
      isOF <- vapply(data, is.ordered, NA)
      for(nn in namD[isF])            # drop response
        if(is.null(attr(data[[nn]], "contrasts")))
          contrasts(data[[nn]]) <- contr.funs[1 + isOF[nn]]
      ## it might be safer to have numerical contrasts:
      ##    get(contr.funs[1 + isOF[nn]])(nlevels(data[[nn]]))
      if (!is.null(contrasts.arg) && is.list(contrasts.arg)) {
        if (is.null(namC <- names(contrasts.arg)))
          stop("invalid 'contrasts.arg' argument")
        for (nn in namC) {
          if (is.na(ni <- match(nn, namD)))
            warning(gettextf("variable '%s' is absent, its contrast will be ignored", nn),
                    domain = NA)
          else {
            ca <- contrasts.arg[[nn]]
            ## FIXME: work for *sparse* ca
            if(is.matrix(ca)) contrasts(data[[ni]], ncol(ca)) <- ca
            else contrasts(data[[ni]]) <- contrasts.arg[[nn]]
          }
        }
      }
    } else {               # internal model.matrix needs some variable
      isF <-  FALSE
      data <- cbind(data, x = 0)
    }
    ## <Sparse> src/library/stats/R/models.R has
    ##    ans <- .Internal(model.matrix(t, data))
    if(verbose) {
      cat("RcppModel.spmatrix(t, data, ..)  with t =\n"); str(t,give.attr=FALSE) }
    ans <- RcppModel.spmatrix(t, data, transpose=transpose,
                          ##     ==============
                          drop.unused.levels=drop.unused.levels,
                          row.names=row.names, sep=sep, verbose=verbose)
    ## </Sparse>
    attr(ans, "contrasts") <-
      lapply(data[isF], function(x) attr(x, "contrasts"))
    ans
  } ## {sparse.model.matrix}

Benchmark

Below we benchmark the results for accuracy, and for performance. The new Rcpp interaction kronecker produces the correct answer, and needs only 20% of the time.

X1 = sparse.model.matrix(~ x1*x2*x3, data)
X2 = RcppSparse.model.matrix(~ x1*x2*x3, data)
identical(X1, X2)

[1] TRUE

system.time(sparse.model.matrix(~ x1*x2*x3, data, row.names = FALSE))

   user  system elapsed 
  2.034   0.384   2.423

system.time(RcppSparse.model.matrix(~ x1*x2*x3, data, row.names = FALSE))

   user  system elapsed 
  0.451   0.040   0.492

Sparse Model Matrix Optimizations in C++

Data and Model Matrices

Matrix sparse2int

Overwriting Functions in Matrix Package

RcppSparse2int

Benchmark

Citation