Using Conjugate Gradient to Solve Classical Matrix Equations

In linear algebra, the conjugate gradient method is an iterative algorithm for numerically approximating the solution of a system of linear equations whose coefficient matrix is symmetric and positive-definite. This post focuses on the conjugate gradient method and its applications to solving matrix equations. The whole story covers the following contents:

  • Introduction
  • Preliminaries
    • Vectorization operator
    • Kronecker product
  • Classical matrix equations
    • Lyapunov equation
    • Sylvester equation
    • Generalized Sylvester equation
  • Conjugate gradient (CG) method
    • Algorithm
    • Solving linear equation
    • Solving matrix equation
    • Solving matrix factorization on incomplete data
  • CG’s potential for large-scale and sparse problems
  • Conclusion

Preliminaries

We introduce the vectorization operator and the Kronecker product, which are the basic building blocks for the following content.

Vectorization Operator

Vectorization is a basic operator that converts a matrix (or tensor) into a vector by stacking the columns of the matrix into a single “long” vector. For instance, the vectorization of

$$\boldsymbol{X} = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{bmatrix} \in \mathbb{R}^{2\times 2}$$

is

$$\operatorname{vec}(\boldsymbol{X}) = \begin{bmatrix} x_{11} \\ x_{21} \\ x_{12} \\ x_{22} \end{bmatrix} \in \mathbb{R}^{4}.$$
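
In NumPy, column-wise stacking corresponds to reshaping in Fortran (column-major) order, which is also how the code later in this post vectorizes matrices. A minimal check:

import numpy as np

X = np.array([[1, 2], [3, 4]])
# stack the columns of X into a single vector (column-major order)
print(np.reshape(X, -1, order='F'))  # [1 3 2 4]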

Kronecker Product

The Kronecker product is important and widely used in matrix computations; it is named after the German mathematician Leopold Kronecker (1823–1891). By definition, the Kronecker product between two matrices $\boldsymbol{A} = [a_{ij}] \in \mathbb{R}^{m\times n}$ and $\boldsymbol{B} \in \mathbb{R}^{p\times q}$ is

$$\boldsymbol{A} \otimes \boldsymbol{B} = \begin{bmatrix} a_{11}\boldsymbol{B} & \cdots & a_{1n}\boldsymbol{B} \\ \vdots & \ddots & \vdots \\ a_{m1}\boldsymbol{B} & \cdots & a_{mn}\boldsymbol{B} \end{bmatrix} \in \mathbb{R}^{mp\times nq},$$

where the symbol $\otimes$ denotes the Kronecker product.

For example, the Kronecker product between the two matrices

$$\boldsymbol{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad \boldsymbol{B} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

is

$$\boldsymbol{A} \otimes \boldsymbol{B} = \begin{bmatrix} 0 & 1 & 0 & 2 \\ 1 & 0 & 2 & 0 \\ 0 & 3 & 0 & 4 \\ 3 & 0 & 4 & 0 \end{bmatrix}.$$

Putting the vectorization operator and the Kronecker product together, there exists an important property for the linear system $\boldsymbol{A}\boldsymbol{X}\boldsymbol{B} = \boldsymbol{C}$, which is given by

$$\operatorname{vec}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = (\boldsymbol{B}^\top \otimes \boldsymbol{A})\operatorname{vec}(\boldsymbol{X}).$$

This property allows one to express a matrix equation like $\boldsymbol{A}\boldsymbol{X}\boldsymbol{B} = \boldsymbol{C}$ in the form of a standard linear equation: $(\boldsymbol{B}^\top \otimes \boldsymbol{A})\operatorname{vec}(\boldsymbol{X}) = \operatorname{vec}(\boldsymbol{C})$.
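
As a quick numerical check of this property, here is a small sketch with arbitrary random matrices:

import numpy as np

# arbitrary example matrices
A = np.random.randn(3, 4)
X = np.random.randn(4, 5)
B = np.random.randn(5, 2)

# vec(A X B), using column-major (Fortran-order) stacking
left = (A @ X @ B).flatten(order='F')

# (B^T ⊗ A) vec(X)
right = np.kron(B.T, A) @ X.flatten(order='F')

print(np.allclose(left, right))  # expected: True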

Classical Matrix Equations

Sylvester Equation

The Sylvester equation is a linear matrix equation that usually arises in control theory applications. It is named after the English mathematician James Joseph Sylvester (1814–1897), who studied the homogeneous version of the equation in 1884.

Given matrices $\boldsymbol{A} \in \mathbb{R}^{m\times m}$, $\boldsymbol{B} \in \mathbb{R}^{n\times n}$, and $\boldsymbol{C} \in \mathbb{R}^{m\times n}$, a Sylvester equation is a matrix equation of the form

$$\boldsymbol{A}\boldsymbol{X} + \boldsymbol{X}\boldsymbol{B} = \boldsymbol{C},$$

where the problem is to find the solution $\boldsymbol{X} \in \mathbb{R}^{m\times n}$.

Since the Sylvester equation is indeed a linear equation, it is possible to write it in the form of a standard linear equation $\tilde{\boldsymbol{A}}\boldsymbol{x} = \boldsymbol{b}$. This can be done by using both the vectorization operator and the Kronecker product, which yields

$$(\boldsymbol{I}_n \otimes \boldsymbol{A} + \boldsymbol{B}^\top \otimes \boldsymbol{I}_m)\operatorname{vec}(\boldsymbol{X}) = \operatorname{vec}(\boldsymbol{C}).$$

This formula allows one to obtain the closed-form solution by solving an $mn \times mn$ linear system; however, it would cost $\mathcal{O}(m^3 n^3)$. Can we do better? One famous method with an acceptable complexity of $\mathcal{O}(m^3 + n^3)$ is the Bartels–Stewart algorithm, which was developed by Bartels and Stewart in 1972.

For instance, given

$$\boldsymbol{A} = \begin{bmatrix} 1 & 0 & 2 & 3 \\ 4 & 1 & 0 & 2 \\ 0 & 5 & 5 & 6 \\ 1 & 7 & 9 & 0 \end{bmatrix}, \quad \boldsymbol{B} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}, \quad \boldsymbol{C} = \begin{bmatrix} 1 & 0 \\ 2 & 0 \\ 0 & 3 \\ 1 & 1 \end{bmatrix},$$

what is the solution $\boldsymbol{X}$ of $\boldsymbol{A}\boldsymbol{X} + \boldsymbol{X}\boldsymbol{B} = \boldsymbol{C}$? In Python, the solution can be computed with scipy.linalg.solve_sylvester:

import numpy as np
from scipy import linalg

A = np.array([[1, 0, 2, 3], [4, 1, 0, 2], [0, 5, 5, 6], [1, 7, 9, 0]])
B = np.array([[0, -1], [1, 0]])
C = np.array([[1, 0], [2, 0], [0, 3], [1, 1]])

# solve AX + XB = C (SciPy implements the Bartels-Stewart algorithm)
X = linalg.solve_sylvester(A, B, C)
print(X)

The output would be

[[ 0.47316381 -0.36642543]
 [-0.40056724  0.35311491]
 [ 0.33053301 -0.11422005]
 [ 0.07739853  0.35600978]]

Generalized Sylvester Equation

The Sylvester equation has many variants and special cases. It can be generalized to multiple terms and to have coefficient matrices on both sides of $\boldsymbol{X}$, yielding

$$\sum_{k=1}^{K} \boldsymbol{A}_k \boldsymbol{X} \boldsymbol{B}_k = \boldsymbol{C},$$

which reduces to the standard Sylvester equation when $K = 2$, $\boldsymbol{A}_1 = \boldsymbol{A}$, $\boldsymbol{B}_1 = \boldsymbol{I}$, $\boldsymbol{A}_2 = \boldsymbol{I}$, and $\boldsymbol{B}_2 = \boldsymbol{B}$.
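
Using the vectorization property above, the generalized equation can still be written as a standard linear system $\left(\sum_{k} \boldsymbol{B}_k^\top \otimes \boldsymbol{A}_k\right)\operatorname{vec}(\boldsymbol{X}) = \operatorname{vec}(\boldsymbol{C})$ and, for small sizes, solved directly. The brute-force sketch below uses arbitrary random coefficients and assumes the resulting Kronecker system is nonsingular; it is only for illustration, since it costs $\mathcal{O}(m^3 n^3)$.

import numpy as np

m, n = 4, 3
# two-term example: A1 X B1 + A2 X B2 = C, with random coefficient matrices
A1, A2 = np.random.randn(m, m), np.random.randn(m, m)
B1, B2 = np.random.randn(n, n), np.random.randn(n, n)
X_true = np.random.randn(m, n)
C = A1 @ X_true @ B1 + A2 @ X_true @ B2

# (B1^T ⊗ A1 + B2^T ⊗ A2) vec(X) = vec(C)
K = np.kron(B1.T, A1) + np.kron(B2.T, A2)
x = np.linalg.solve(K, C.flatten(order='F'))
X = np.reshape(x, (m, n), order='F')
print(np.allclose(X, X_true))  # expected: True when the system is nonsingular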

Conjugate Gradient (CG) Method

Algorithm

For solving the linear equation $\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b}$ in which $\boldsymbol{A}$ is a real, symmetric, and positive-definite matrix, the input vector $\boldsymbol{x}_0$ is the initial value (e.g., an all-zero vector).

  • $\boldsymbol{r}_0 := \boldsymbol{b} - \boldsymbol{A}\boldsymbol{x}_0$
  • if $\boldsymbol{r}_0$ is sufficiently small, then return $\boldsymbol{x}_0$ as the result
  • $\boldsymbol{p}_0 := \boldsymbol{r}_0$, $k := 0$
  • repeat
    • $\alpha_k := \frac{\boldsymbol{r}_k^\top \boldsymbol{r}_k}{\boldsymbol{p}_k^\top \boldsymbol{A}\boldsymbol{p}_k}$
    • $\boldsymbol{x}_{k+1} := \boldsymbol{x}_k + \alpha_k \boldsymbol{p}_k$
    • $\boldsymbol{r}_{k+1} := \boldsymbol{r}_k - \alpha_k \boldsymbol{A}\boldsymbol{p}_k$
    • if $\boldsymbol{r}_{k+1}$ is sufficiently small, then exit loop
    • $\beta_k := \frac{\boldsymbol{r}_{k+1}^\top \boldsymbol{r}_{k+1}}{\boldsymbol{r}_k^\top \boldsymbol{r}_k}$
    • $\boldsymbol{p}_{k+1} := \boldsymbol{r}_{k+1} + \beta_k \boldsymbol{p}_k$
    • $k := k + 1$
  • end repeat
  • return $\boldsymbol{x}_{k+1}$ as the result
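
Solving Linear Equation

Here is a minimal NumPy sketch of the pseudocode above (the function and variable names are my own, not from any particular library), applied to a small symmetric positive-definite system:

import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=1000):
    # solve Ax = b for a real, symmetric, positive-definite A
    x = np.zeros_like(b, dtype=float) if x0 is None else x0.astype(float)
    r = b - A @ x
    p = r.copy()
    rold = np.inner(r, r)
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rold / np.inner(p, Ap)
        x += alpha * p
        r -= alpha * Ap
        rnew = np.inner(r, r)
        if np.sqrt(rnew) < tol:
            break
        p = r + (rnew / rold) * p
        rold = rnew
    return x

# a small SPD test problem: M^T M + I is symmetric positive-definite
M = np.random.randn(5, 5)
A = M.T @ M + np.eye(5)
b = np.random.randn(5)
x = conjugate_gradient(A, b)
print(np.allclose(A @ x, b))  # expected: True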

Solving Matrix Equation

To apply CG to the Sylvester equation $\boldsymbol{A}\boldsymbol{X} + \boldsymbol{X}\boldsymbol{B} = \boldsymbol{C}$, note that the coefficient matrix $\boldsymbol{M} = \boldsymbol{I}_n \otimes \boldsymbol{A} + \boldsymbol{B}^\top \otimes \boldsymbol{I}_m$ of the vectorized system is in general not symmetric, so CG is applied to the normal equations $\boldsymbol{M}^\top\boldsymbol{M}\operatorname{vec}(\boldsymbol{X}) = \boldsymbol{M}^\top\operatorname{vec}(\boldsymbol{C})$ instead. The function compute_Ax below evaluates $\boldsymbol{M}^\top\boldsymbol{M}\operatorname{vec}(\boldsymbol{X})$ in its matrix form, without ever forming the Kronecker products explicitly.

import numpy as np

def compute_Ax(A, B, X):
    # return M^T M vec(X) in matrix form, where M = I ⊗ A + B^T ⊗ I
    return np.reshape(A.T @ A @ X + A.T @ X @ B + A @ X @ B.T + X @ B @ B.T, -1, order='F')

def solve_sylvester(A, B, C, max_iter=1000, tol=1e-10):
    # solve AX + XB = C by applying CG to the normal equations
    dim1 = A.shape[1]
    dim2 = B.shape[0]
    X = np.random.randn(dim1, dim2)                    # random initial value
    x = np.reshape(X, -1, order='F')
    b = np.reshape(A.T @ C + C @ B.T, -1, order='F')   # right-hand side M^T vec(C)
    r = b - compute_Ax(A, B, X)                        # initial residual
    p = r.copy()
    rold = np.inner(r, r)
    for it in range(max_iter):
        Ap = compute_Ax(A, B, np.reshape(p, (dim1, dim2), order='F'))
        alpha = rold / np.inner(p, Ap)
        x += alpha * p
        r -= alpha * Ap
        rnew = np.inner(r, r)
        if np.sqrt(rnew) < tol:
            break
        p = r + (rnew / rold) * p
        rold = rnew
    return np.reshape(x, (dim1, dim2), order='F')

A = np.array([[1, 0, 2, 3], [4, 1, 0, 2], [0, 5, 5, 6], [1, 7, 9, 0]])
B = np.array([[0, -1], [1, 0]])
C = np.array([[1, 0], [2, 0], [0, 3], [1, 1]])
print(solve_sylvester(A, B, C))
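
As a sanity check, reusing the matrices and the solve_sylvester function defined above, the CG-based solution can be compared against SciPy's direct solver:

from scipy import linalg

X_cg = solve_sylvester(A, B, C)
X_ref = linalg.solve_sylvester(A, B, C)
print(np.max(np.abs(X_cg - X_ref)))  # should be close to zero once CG has converged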

Solving Matrix Factorization on Incomplete Data

Matrix factorization is an important tool for many real-world applications, e.g., recommender systems. The idea behind matrix factorization is to represent a partially observed matrix in a lower-dimensional latent space. For any partially observed data matrix $\boldsymbol{Y} \in \mathbb{R}^{m\times n}$ with the observed index set $\Omega$, one can approximately factorize the data matrix into latent factor matrices $\boldsymbol{W} \in \mathbb{R}^{r\times m}$ and $\boldsymbol{X} \in \mathbb{R}^{r\times n}$ of rank $r$. Element-wise, any $(i, j)$th entry of $\boldsymbol{Y}$ can be reconstructed by the combination of latent factors:

$$y_{ij} \approx \boldsymbol{w}_i^\top \boldsymbol{x}_j,$$

where $\boldsymbol{w}_i$ and $\boldsymbol{x}_j$ are the $i$th and $j$th columns of $\boldsymbol{W}$ and $\boldsymbol{X}$, respectively. They are referred to as latent factors.

If $\mathcal{P}_\Omega(\cdot)$ denotes the orthogonal projection supported on the observed index set $\Omega$, then the above matrix factorization can be rewritten as follows,

$$\mathcal{P}_\Omega(\boldsymbol{Y}) \approx \mathcal{P}_\Omega(\boldsymbol{W}^\top \boldsymbol{X}).$$

To achieve such an approximation, one fundamental problem is how to determine the latent factor matrices so that the partially observed data matrix is matched as closely as possible. One can formulate an optimization problem with respect to the factor matrices, and it follows that

$$\min_{\boldsymbol{W}, \boldsymbol{X}} \; \frac{1}{2} \left\| \mathcal{P}_\Omega(\boldsymbol{Y} - \boldsymbol{W}^\top \boldsymbol{X}) \right\|_F^2.$$

Let $f$ denote the objective function above; then the first-order partial derivative with respect to $\boldsymbol{W}$ is given by

$$\frac{\partial f}{\partial \boldsymbol{W}} = -\boldsymbol{X}\left[\mathcal{P}_\Omega(\boldsymbol{Y} - \boldsymbol{W}^\top \boldsymbol{X})\right]^\top.$$

With respect to $\boldsymbol{X}$, we have

$$\frac{\partial f}{\partial \boldsymbol{X}} = -\boldsymbol{W}\,\mathcal{P}_\Omega(\boldsymbol{Y} - \boldsymbol{W}^\top \boldsymbol{X}).$$

Setting these partial derivatives to zero gives, for each factor matrix with the other one fixed, a linear system of exactly the kind that CG can solve.
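
The sketch below illustrates this idea under some assumptions of mine: the factors are updated in an alternating fashion, each subproblem is solved by a matrix-free CG on its stationarity condition, and a small ridge term rho (not part of the objective above) is added so that each subproblem is strictly positive-definite. All function names are made up for illustration.

import numpy as np

def cg_matrix(op, b, x0, max_iter=100, tol=1e-8):
    # matrix-free CG: `op` is a symmetric positive-definite linear operator on matrices
    x = x0.copy()
    r = b - op(x)
    p = r.copy()
    rold = np.sum(r * r)
    for _ in range(max_iter):
        Ap = op(p)
        alpha = rold / np.sum(p * Ap)
        x += alpha * p
        r -= alpha * Ap
        rnew = np.sum(r * r)
        if np.sqrt(rnew) < tol:
            break
        p = r + (rnew / rold) * p
        rold = rnew
    return x

def mf_cg(Y, mask, rank, rho=1e-2, num_iter=50):
    # alternating minimization for P_Omega(Y) ≈ P_Omega(W^T X); each update uses CG
    m, n = Y.shape
    W = 0.1 * np.random.randn(rank, m)
    X = 0.1 * np.random.randn(rank, n)
    P = lambda M: mask * M              # orthogonal projection onto the observed set
    for _ in range(num_iter):
        # stationarity in W: X P(W^T X)^T + rho W = X P(Y)^T
        W = cg_matrix(lambda V: X @ P(V.T @ X).T + rho * V, X @ P(Y).T, W)
        # stationarity in X: W P(W^T X) + rho X = W P(Y)
        X = cg_matrix(lambda V: W @ P(W.T @ V) + rho * V, W @ P(Y), X)
    return W, X

# toy test: a rank-3 matrix with roughly 40% of its entries observed
m, n, r = 50, 40, 3
Y = np.random.randn(r, m).T @ np.random.randn(r, n)
mask = (np.random.rand(m, n) < 0.4).astype(float)
W, X = mf_cg(Y, mask, rank=r)
print(np.linalg.norm(mask * (Y - W.T @ X)) / np.linalg.norm(mask * Y))  # small if the fit is good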

Conclusion

In recent years, many machine learning problems arising from large-scale and sparse data are closely associated with solving linear systems. There remain open problems in modeling matrix equations with certain structures (e.g., learning matrix factorization from partially observed data) and in deriving efficient numerical methods for them.