CS 6241: Numerics for Data Science

Latent factor models

Author: David Bindel

Published: February 11, 2025

Introduction

Last week, we considered least squares and related regression models. In these models, we want to predict some dependent variable \(Y\) (also called an outcome variable) as a function of some independent variables \(X\) (also called features, regressors, or explanatory variables). In the simplest case, we look for a linear prediction: \[Y \approx Xw\] where \(X\) is a row vector of features and \(w\) is a column vector of weights. We can make the model a little more expressive with a linear prediction in a higher-dimensional space: \[ Y \approx \psi(X) w \] where \(\psi\) is a (nonlinear) feature map from the original vector of independent variables to a new, expanded set of variables.¹ In either case, we choose \(w\) by minimizing a sum of loss functions over examples drawn from a population, possibly together with a regularization term. For the least squares loss function and some standard regularizers, we can solve the resulting optimization problem using factorization methods from numerical linear algebra.
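
To make this concrete, here is a minimal sketch (not from the notes; the synthetic data and the particular feature map \(\psi(x) = [1, x, x^2]\) are assumptions for illustration) of solving an unregularized least squares problem with a feature map via a QR factorization:

```python
# Sketch: least squares with a polynomial feature map, solved by thin QR.
# The data below are synthetic; the "true" weights are [1.0, 2.0, -0.5].
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)                                   # independent variable samples
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * rng.standard_normal(100)    # noisy outcomes

Psi = np.column_stack([np.ones_like(x), x, x**2])  # feature map applied to each sample

# Solve min_w ||Psi w - y||_2 using the thin QR factorization Psi = QR.
Q, R = np.linalg.qr(Psi)
w = np.linalg.solve(R, Q.T @ y)   # triangular system R w = Q^T y
print(w)                          # close to [1.0, 2.0, -0.5]
```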

In some cases, though, the distinction between explanatory and output variables is not clear. Sometimes we may know different subsets of the variables for each experiment, and we want to fill in the rest. Or we may know all the variables in our experiment, and want to look for relationships between them rather than trying to make predictions. Or we might want to use the attributes associated with different objects not for prediction, but to cluster the objects, or to find outliers. Remarkably, we can view these tasks as well through the lens of matrix factorization.

Matrix factorizations and latent variables

We can use matrices to encode many types of data: images, graphs, user preferences, and distributions of jointly distributed random variables are but a few. Often, as in our linear regression examples, the rows of the matrix represent objects or experiments and the columns represent associated attributes. In other cases, as when encoding a graph, both rows and columns represent objects, and the entries of the matrix represent pairwise interactions. Factoring these data matrices can help us compress and denoise data, separate effects of different factors, find similarities between objects, or fill in missing data.

We generally seek to approximately factor a data matrix \(A \in \mathbb{R}^{m \times n}\) as \[A \approx LMR, \quad L \in \mathbb{R}^{m \times r}, M \in \mathbb{R}^{r \times r}, R \in \mathbb{R}^{r \times n}.\] In this view, we can think of \(R\) (or \(MR\)) as a map from the original attributes to a smaller set of latent factors. Different factorization methods differ both in the structural constraints on \(L\), \(M\), and \(R\) and in how the approximation is chosen.
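
One concrete instance of this template (an assumed example, not one singled out in the notes) is the rank-\(r\) truncated SVD, with \(L = U_r\), \(M = \Sigma_r\), and \(R = V_r^T\). A short NumPy sketch on a synthetic low-rank data matrix:

```python
# Sketch: truncated SVD as one instance of A ~= L M R.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 50))  # rank-8 data matrix
r = 8                                                             # number of latent factors

U, s, Vt = np.linalg.svd(A, full_matrices=False)
L, M, R = U[:, :r], np.diag(s[:r]), Vt[:r, :]   # L: m x r, M: r x r, R: r x n

print(np.linalg.norm(A - L @ M @ R))   # essentially zero, since A has rank 8
```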

It is worth being a little careful in how we think of these factorizations! In most of linear algebra, we are interested in a matrix as a representation of a linear map or a quadratic form with respect to some particular basis — but we have a rich set of bases we can choose. For data analysis, though, we may want to restrict the bases we consider for the sake of interpretability. For example, if we want to choose as our latent factors a subset of the original factors, or if we want to enforce non-negativity in the factors, we cannot consider arbitrary changes of basis. Because of this, some of the methods that we like best for interpretability are also the most difficult to compute, as they involve an optimization problem that is combinatorial rather than continuous in nature. We will see this issue repeatedly this week.

Some basic factorization tools

Next time, we will start our discussion of factorizations used in data analysis with the symmetric eigenvalue problem. This is a useful building block on its own, particularly for low-rank approximation of symmetric matrices. It is also useful as a prelude to another discussion of the singular value decomposition (SVD), that Swiss Army knife of matrix factorizations. But both the symmetric eigenvalue decomposition and the singular value decomposition involve a very flexible choice of bases; as we have mentioned, this is not always ideal when we want an interpretable model. For interpretability, it is helpful to talk again about pivoted QR and the closely related interpolative decomposition (ID), in which we choose a subset of the data matrix columns as a basis for the column space. We will also mention the related CUR decomposition, in which both the left and right factors in our approximation are drawn from the columns and rows of the data matrix.
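
As a preview, here is a hedged sketch (on a synthetic data matrix, and showing only one standard way to form an ID) of a one-sided interpolative decomposition built from SciPy's column-pivoted QR, in which the skeleton \(C\) consists of actual columns of \(A\):

```python
# Sketch: one-sided interpolative decomposition via column-pivoted QR.
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 10)) @ rng.standard_normal((10, 40))  # ~rank-10 data matrix
k = 10                                                             # target rank

# Column-pivoted QR: A[:, piv] = Q @ R, with "important" columns pivoted first.
Q, R, piv = qr(A, mode='economic', pivoting=True)

C = A[:, piv[:k]]                            # skeleton: k actual columns of A
T = solve_triangular(R[:k, :k], R[:k, k:])   # coefficients expressing the rest

# Interpolative approximation: A[:, piv] ~= C @ [I  T]
Z = np.hstack([np.eye(k), T])
A_id = np.empty_like(A)
A_id[:, piv] = C @ Z

print(np.linalg.norm(A - A_id) / np.linalg.norm(A))   # small relative error
```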

Footnotes

  1. We have not yet talked about feature maps in this class. But many of you will have seen feature maps and “the kernel trick” in a machine learning class, and you are implicitly using feature maps in the polynomial fitting homework problem from last week. We will discuss these ideas in more detail next week when we discuss function approximation.↩︎
