Taking the derivative of a matrix expression works by the same logic as ordinary calculus: you find how a small change in one quantity affects another. The difference is that your inputs and outputs can be vectors or matrices instead of single numbers, so the result is often a matrix of partial derivatives rather than a single value. Once you learn a handful of core identities and two key rules, you can differentiate most matrix expressions you’ll encounter in practice.
What “Matrix Derivative” Actually Means
In single-variable calculus, the derivative of f(x) is one number: the rate of change. When your function takes a vector or matrix as input, there are suddenly many variables to differentiate with respect to, and the derivative becomes a structured collection of partial derivatives.
For a scalar function of a vector (like a loss function that takes a vector of weights and outputs a single number), the derivative is the gradient: a vector of partial derivatives, one for each input element. If your function maps a vector to another vector, the derivative is the Jacobian matrix, where each row is the gradient of one output component. So for a vector-valued function with m outputs and n inputs, the Jacobian is an m × n matrix.
For a scalar function of an entire matrix X (say, the determinant or the trace), the derivative is a matrix the same shape as X, where each entry tells you how the output changes when you nudge that one element of X.
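A finite-difference check makes these shapes concrete. The sketch below (NumPy; the helper name `numerical_jacobian` is ours, not a library function) estimates the Jacobian of a function from R³ to R² and confirms it comes out as a 2 × 3 matrix, one row per output:

```python
import numpy as np

# Finite-difference Jacobian of a vector-valued function f: R^n -> R^m.
# Illustrative helper, not part of NumPy.
def numerical_jacobian(f, x, eps=1e-6):
    m, n = f(x).size, x.size
    J = np.zeros((m, n))
    for i in range(n):
        dx = np.zeros(n)
        dx[i] = eps
        # Central difference approximates the partial derivatives in column i.
        J[:, i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

# f maps 3 inputs to 2 outputs, so the Jacobian is 2 x 3 (numerator layout).
f = lambda x: np.array([x[0] * x[1], np.sin(x[2])])
J = numerical_jacobian(f, np.array([1.0, 2.0, 0.5]))
print(J.shape)  # (2, 3)
```

Row 0 is the gradient of the first output, x₀x₁, which works out to [x₁, x₀, 0] = [2, 1, 0] at this point.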
Layout Conventions: Numerator vs. Denominator
Before you look up any identity, you need to know which convention the source is using. There are two, and they produce results that are transposes of each other. In numerator layout, the dimensions of the result match the numerator first, then the denominator. So if y is a scalar and x is a column vector with N entries, the derivative ∂y/∂x comes out as a 1 × N row vector. In denominator layout, the same derivative would be an N × 1 column vector.
Neither convention is “correct.” Textbooks, Wikipedia, and different course notes switch between them freely, which is the single biggest source of confusion in matrix calculus. When a formula you found online gives you a result that’s the transpose of what you expected, check which layout the source uses. Throughout this article, we’ll use numerator layout, which is common in machine learning contexts.
The Product Rule and Chain Rule Still Apply
The familiar rules from single-variable calculus carry over, with one important caveat: matrix multiplication isn’t commutative, so you have to keep terms in the right order.
The chain rule for matrices is clean. If you have a composition of two differentiable functions, the Jacobian of the composition is the product of the individual Jacobians:
J(F ∘ G) = J(F) · J(G)
This is exactly what you’d expect from single-variable calculus, just with matrix multiplication replacing scalar multiplication. For a concrete example: if z = f(x, y), x = g(s, t), and y = h(s, t), then ∂z/∂s = (∂f/∂x)(∂x/∂s) + (∂f/∂y)(∂y/∂s), and similarly for ∂z/∂t. The Jacobian packages all of these into one matrix product.
The product rule also generalizes, but you must preserve the order of multiplication. For two matrix-valued functions U(t) and V(t), the derivative of their product is: d(UV)/dt = (dU/dt)V + U(dV/dt). Swapping the order would give a wrong answer.
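You can verify the ordered product rule numerically. This sketch (our own illustrative choice of U(t) = tA and V(t) = B + tC) compares a finite-difference derivative of UV against the product-rule formula:

```python
import numpy as np

# Check d(UV)/dt = (dU/dt)V + U(dV/dt) for matrix-valued U(t), V(t).
rng = np.random.default_rng(0)
A, B, C = rng.standard_normal((3, 2, 2))  # three fixed 2x2 matrices
U = lambda t: t * A                        # dU/dt = A
V = lambda t: B + t * C                    # dV/dt = C

t, eps = 0.7, 1e-6
# Central-difference estimate of d(UV)/dt.
lhs = (U(t + eps) @ V(t + eps) - U(t - eps) @ V(t - eps)) / (2 * eps)
# Product rule, keeping the factors in order.
rhs = A @ V(t) + U(t) @ C
print(np.allclose(lhs, rhs, atol=1e-5))  # True
```

Swapping either term's order (e.g. V(t) @ A) would generally fail this check, since matrix multiplication isn't commutative.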
Essential Identities for Vectors
Most practical matrix calculus boils down to a short list of identities. Here are the ones you’ll use constantly.
Linear Forms
If y = Wx, where W is a constant matrix and x is a column vector, then dy/dx = W: the Jacobian is simply the weight matrix W. (Watch out for the transposed form: for y = xᵀW, the numerator-layout Jacobian is Wᵀ, not W, so keep track of which form you are differentiating.) This identity is the backbone of computing gradients in linear models and neural networks.
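A quick numerical confirmation that the Jacobian of y = Wx is W itself, column by column:

```python
import numpy as np

# Estimate the Jacobian of y = Wx by finite differences and compare to W.
rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)

eps = 1e-6
J = np.zeros((3, 4))
for i in range(4):
    dx = np.zeros(4)
    dx[i] = eps
    # Column i of the Jacobian: how all outputs respond to nudging x[i].
    J[:, i] = (W @ (x + dx) - W @ (x - dx)) / (2 * eps)

print(np.allclose(J, W))  # True: dy/dx = W in numerator layout
```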
Quadratic Forms
A quadratic form looks like f(x) = xᵀAx, where A is a square matrix and x is a column vector. The gradient is:
∇f(x) = (Aᵀ + A)x
When A is symmetric (meaning Aᵀ = A, which is common in practice), this simplifies to ∇f(x) = 2Ax. You’ll run into this constantly in least-squares problems and regularization terms.
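The quadratic-form gradient is easy to check the same way. The sketch below uses a general (non-symmetric) A, so the full (Aᵀ + A)x formula is needed:

```python
import numpy as np

# Check the quadratic-form gradient: f(x) = x^T A x  =>  grad f = (A^T + A) x.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)
f = lambda v: v @ A @ v

eps = 1e-6
# One central difference per coordinate direction e.
g = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
              for e in np.eye(4)])

print(np.allclose(g, (A.T + A) @ x, atol=1e-5))  # True
# If A were symmetric, (A.T + A) @ x would reduce to 2 * A @ x.
```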
Essential Identities for Matrices
The Trace
The trace of a matrix (the sum of its diagonal entries) shows up frequently because it converts matrix expressions into scalars, making differentiation easier. The key identity is: ∂ tr(AX) / ∂X = Aᵀ. (Here we follow the common convention that the derivative of a scalar with respect to a matrix has the same shape as that matrix.) This is useful because many scalar-valued matrix functions can be rewritten using the trace.
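An entrywise finite-difference check of the trace identity:

```python
import numpy as np

# Check ∂ tr(AX)/∂X = A^T, one matrix entry at a time.
rng = np.random.default_rng(3)
A = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3))

eps = 1e-6
G = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3))
        E[i, j] = eps  # perturb only entry (i, j) of X
        G[i, j] = (np.trace(A @ (X + E)) - np.trace(A @ (X - E))) / (2 * eps)

print(np.allclose(G, A.T, atol=1e-5))  # True
```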
The Determinant
The derivative of a determinant with respect to the matrix itself involves the cofactor matrix. For a square matrix A:
∂ det(A) / ∂A_{ij} = C_{ij}
where C_{ij} is the cofactor of element (i, j). Written more compactly, the gradient of the determinant is the entire cofactor matrix. For an invertible matrix, this equals det(A) times the transpose of the inverse: det(A) · A⁻ᵀ.
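For an invertible matrix, the det(A) · A⁻ᵀ form can be verified numerically (the +3I shift below is just an illustrative trick to keep the random matrix well-conditioned):

```python
import numpy as np

# Check ∂ det(A)/∂A = det(A) * A^{-T} for an invertible matrix.
rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3)) + 3 * np.eye(3)  # shifted to stay invertible

eps = 1e-6
G = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3))
        E[i, j] = eps
        G[i, j] = (np.linalg.det(A + E) - np.linalg.det(A - E)) / (2 * eps)

print(np.allclose(G, np.linalg.det(A) * np.linalg.inv(A).T, atol=1e-4))
```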
The Inverse
The differential of a matrix inverse follows a clean pattern. If X is an invertible matrix and you perturb it slightly:
∂(X⁻¹) = −X⁻¹ (∂X) X⁻¹
This means if X depends on some scalar variable t, then dX⁻¹/dt = −X⁻¹ (dX/dt) X⁻¹. Notice the inverse appears on both sides of the perturbation, which is a consequence of matrix multiplication not being commutative.
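A sketch verifying the sandwich formula for dX⁻¹/dt, using an illustrative parameterization X(t) = B + tC:

```python
import numpy as np

# Check dX^{-1}/dt = -X^{-1} (dX/dt) X^{-1} with X(t) = B + t*C.
rng = np.random.default_rng(5)
B = rng.standard_normal((3, 3)) + 3 * np.eye(3)  # keep B comfortably invertible
C = rng.standard_normal((3, 3))
X = lambda t: B + t * C                          # dX/dt = C

t, eps = 0.1, 1e-6
# Central-difference estimate of the derivative of the inverse.
lhs = (np.linalg.inv(X(t + eps)) - np.linalg.inv(X(t - eps))) / (2 * eps)
Xi = np.linalg.inv(X(t))
rhs = -Xi @ C @ Xi                               # the inverse sandwiches dX/dt
print(np.allclose(lhs, rhs, atol=1e-4))
```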
A Practical Approach: The Differential Method
When you encounter a matrix expression that doesn’t match a known identity, the differential method gives you a systematic way to find the derivative. The idea: take the differential of both sides (treating d as an operator, like in single-variable calculus), then rearrange the result into a standard form that lets you read off the derivative.
For example, suppose you want the derivative of f(X) = tr(X²). First, take the differential: d(tr(X²)) = tr(d(X²)) = tr((dX)X + X(dX)) = tr(XdX + XdX) = tr(2XdX). Since the result has the form tr(B · dX), the derivative is ∂f/∂X = Bᵀ = 2Xᵀ. For symmetric X, this simplifies to 2X.
This technique works because the trace has a useful property: tr(AB) = tr(BA). You can cycle matrices inside a trace until the dX factor lands at the end, then read off the answer.
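The tr(X²) result derived above can be confirmed entrywise:

```python
import numpy as np

# Check the differential-method result: ∂ tr(X^2)/∂X = 2 X^T.
rng = np.random.default_rng(6)
X = rng.standard_normal((3, 3))
f = lambda M: np.trace(M @ M)

eps = 1e-6
G = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3))
        E[i, j] = eps
        G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(G, 2 * X.T, atol=1e-5))  # True; equals 2X when X is symmetric
```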
How This Applies to Machine Learning
The reason most people encounter matrix derivatives today is backpropagation in neural networks. Training a network means minimizing a scalar loss function with respect to weight matrices, and every step of that process is matrix calculus.
Consider a simple two-layer network. In the forward pass, you compute z₁ = XW₁ (input times first weight matrix), apply an activation function to get h₁, then compute the output ŷ = h₁W₂. To keep the algebra short, take the loss to be the sum of squared outputs; in practice you'd measure squared error against targets, but the gradient mechanics are the same.
In the backward pass, you work through the chain rule in reverse. Writing dŷ for the gradient of the loss with respect to the output, the gradient with respect to the second weight matrix is dW₂ = h₁ᵀ · dŷ, which comes from differentiating the linear product h₁W₂ with respect to W₂. The gradient then flows backward through the activation function to give dz₁, the gradient at the first layer's pre-activation, and the gradient with respect to the first weight matrix is dW₁ = Xᵀ · dz₁. Each step uses the same core identities: the derivative of a linear product and the chain rule for Jacobians.
Deep learning frameworks compute these derivatives automatically, but understanding the underlying matrix calculus helps you debug gradient issues, design custom layers, and reason about why certain architectures train well or poorly.
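The backward pass described above can be written out in a few lines and checked against finite differences. This is a sketch, not any framework's implementation; the names (W1, W2, relu) and the ReLU activation are our illustrative choices:

```python
import numpy as np

# Two-layer network with manual backprop, checked by finite differences.
rng = np.random.default_rng(7)
X = rng.standard_normal((5, 4))            # 5 samples, 4 features
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((3, 2))
relu = lambda z: np.maximum(z, 0.0)

def loss(W1, W2):
    return 0.5 * np.sum((relu(X @ W1) @ W2) ** 2)  # sum-of-squared-outputs loss

# Forward pass, keeping intermediates for the backward pass.
z1 = X @ W1
h1 = relu(z1)
y_hat = h1 @ W2

# Backward pass: chain rule in reverse.
d_yhat = y_hat               # dL/d(y_hat) for L = 0.5 * sum(y_hat^2)
dW2 = h1.T @ d_yhat          # gradient w.r.t. second weight matrix
dh1 = d_yhat @ W2.T          # gradient flowing back through W2
dz1 = dh1 * (z1 > 0)         # through the ReLU's elementwise derivative
dW1 = X.T @ dz1              # gradient w.r.t. first weight matrix

# Finite-difference check on one entry of dW2.
eps = 1e-6
E = np.zeros_like(W2)
E[0, 0] = eps
fd2 = (loss(W1, W2 + E) - loss(W1, W2 - E)) / (2 * eps)
print(np.isclose(dW2[0, 0], fd2, atol=1e-4))  # True
```

Note that both weight gradients are just the transposed layer input times the upstream gradient, which is the linear-form identity from earlier applied twice.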
Quick Reference Table
- ∂(Ax)/∂x = A
- ∂(xᵀA)/∂x = A (denominator layout) or Aᵀ (numerator layout)
- ∂(xᵀAx)/∂x = (Aᵀ + A)x, or 2Ax when A is symmetric
- ∂ tr(AX)/∂X = Aᵀ
- ∂ det(X)/∂X = cofactor matrix of X
- ∂(X⁻¹)/∂t = −X⁻¹ (∂X/∂t) X⁻¹
- Chain rule: J(F ∘ G) = J(F) · J(G)
The single most useful reference for looking up additional identities is “The Matrix Cookbook,” a free PDF that catalogs hundreds of matrix derivative results. It’s widely used in machine learning research and engineering. It states its layout conventions up front, and they don’t always match the ones used here, so transpose results as needed when the shapes disagree.

