Factor Graph Optimization & Sensor Fusion
Everything you need to understand factor graph optimization from first principles: probabilistic estimation theory, graphical model foundations, nonlinear least squares, numerical linear algebra, and the Lie group mathematics required for 3D rotation handling. Part 2 applies all of this to a real GNSS/IMU fusion pipeline.
01 The State Estimation Problem
Before any equations, let us understand what problem we are actually solving. A robot, autonomous vehicle, or any mobile platform needs to know where it is, how fast it is moving, and how it is oriented in the world. Together these constitute the state — a vector that fully describes the relevant aspects of the system at a point in time.
The state is never directly observable. Sensors give us noisy, partial, and sometimes inconsistent glimpses of reality. A GPS receiver outputs a position with meter-level uncertainty. An IMU measures acceleration and angular velocity but accumulates error over time. A camera sees the world but cannot directly tell you your latitude. The goal of state estimation is to combine all available sensor data — with their known noise characteristics — to compute the best possible estimate of the state.
What makes this hard?
Three fundamental difficulties arise:
1. Nonlinearity. The relationship between state and sensor measurement is rarely linear. IMU integration involves multiplying velocity by time and rotating acceleration vectors. GNSS pseudorange depends on the Euclidean distance between satellite and receiver. These nonlinearities prevent closed-form solutions in general.
2. Noise on a manifold. Orientation — which is part of the state — does not live in Euclidean space. Rotations belong to the special orthogonal group SO(3), a curved manifold. Standard addition does not apply; you cannot write R_new = R_old + δ. This requires special mathematical machinery.
3. Scale. A practical GNSS/IMU system runs for minutes to hours. At 200 Hz IMU and 10 Hz GNSS, that is tens of thousands of state variables coupled by hundreds of thousands of measurements. Naïve matrix operations are computationally impossible; we need structure-exploiting algorithms.
Two estimation paradigms: filtering vs. smoothing
Historically, two philosophies have emerged for handling temporal state estimation:
| Aspect | Filtering (EKF, UKF) | Smoothing (Factor Graphs) |
|---|---|---|
| Data used | Only current + past (online) | All data, past and future |
| Estimate | Marginal posterior p(xₖ \| z₁:ₖ) | Full joint posterior p(x₁:ₖ \| z₁:ₖ) |
| Memory | O(n²) — state covariance only | O(n) sparse — full history |
| Loop closures | Expensive, approximate | Natural, exact |
| Accuracy | Good for linear/near-linear | Optimal under correct model |
| Latency | Zero (online) | Fixed-lag or batch |
| Use case | Real-time control, resource-constrained | Offline, post-processing, SLAM |
The Extended Kalman Filter (EKF) has dominated robotics since the 1990s due to its simplicity and low memory footprint. It works by maintaining only the current state estimate and its covariance, updating them as new measurements arrive. The cost is that it discards all past information except through its summarized covariance — an approximation that loses accuracy over time, especially in nonlinear systems.
Factor graph smoothing keeps every state variable and every measurement in a graph simultaneously, then finds the global maximum a posteriori (MAP) estimate over the full trajectory. This is more accurate because: (a) future measurements can retroactively improve past estimates, (b) loop closures in SLAM are handled exactly, and (c) past states can be relinearized as estimates improve, avoiding the accumulated linearization errors of the EKF's one-shot linearization. Modern implementations (iSAM2, GTSAM) make this computationally feasible by exploiting the sparse structure of the graph.
Why GNSS + IMU?
Together, GNSS and IMU are complementary: GNSS gives absolute position at low rate; IMU gives high-rate relative motion whose dead-reckoning error grows without bound over time. A factor graph can fuse them optimally — the GNSS factors anchor the trajectory globally while the IMU preintegration factors provide tight high-frequency constraints between keyframes.
The full pipeline — preview
Part 1 of this guide builds all the mathematical machinery needed to understand the optimization block (Chapters 2–7). Part 2 fills in the sensor models and assembles the full pipeline with working code.
02 Probabilistic Foundations
Factor graph optimization is, at its core, Bayesian inference expressed as a least squares problem. To understand why optimization is the right tool, you need to understand the probabilistic reasoning underneath it.
The Bayesian framework
We model the unknown state as a random variable x and sensor measurements as random variables z. We have two pieces of information:
1. Prior belief p(x): what we know about the state before seeing any measurements. For the first pose of a trajectory, this is often a delta function (known starting point) or a broad Gaussian (uncertain starting position).
2. Likelihood p(z|x): the sensor model — given state x, how likely is measurement z? This encodes the noise model of the sensor. A GNSS receiver measuring position with 1 m standard deviation contributes a Gaussian likelihood.
Bayes' theorem combines these into the posterior:

p(x | z) = p(z | x) p(x) / p(z) ∝ p(z | x) p(x)    (2.1)
The posterior encodes everything we know about x after seeing z. For multiple independent measurements z₁, …, zₙ, the joint likelihood factorizes:

p(x | z₁, …, zₙ) ∝ p(x) ∏ᵢ p(zᵢ | x)    (2.2)
This is the fundamental equation that factor graphs compute. The graph structure makes the product over factors explicit and computationally exploitable.
Maximum A Posteriori (MAP) estimation
For continuous state spaces, we cannot enumerate all possibilities. Instead, we find the single most probable state — the mode of the posterior. This is the MAP estimate:

x̂_MAP = argmaxₓ p(x | z) = argmaxₓ p(z | x) p(x)    (2.3)
Taking the logarithm (which is monotone, so does not change the argmax) and negating (turning maximization into minimization):

x̂_MAP = argminₓ [ −log p(z | x) − log p(x) ]    (2.4)
This is a scalar function we can minimize with standard optimization tools. The log transformation is crucial: it turns products into sums, exponentials into quadratics, and multiplicative uncertainty into additive cost.
Gaussian distributions and their log-likelihood
In almost all sensor fusion problems, we assume Gaussian noise. This is motivated by the Central Limit Theorem (many small independent errors sum to Gaussian), mathematical tractability, and practical effectiveness. For a multivariate Gaussian with mean μ and covariance Σ:

p(e) = (2π)^(−k/2) |Σ|^(−1/2) exp( −½ (e − μ)ᵀ Σ⁻¹ (e − μ) )    (2.5)

Taking the negative log:

−log p(e) = ½ (e − μ)ᵀ Σ⁻¹ (e − μ) + const
The expression ‖e‖²_Σ = eᵀ Σ⁻¹ e is called the Mahalanobis distance squared. It generalizes Euclidean distance by scaling each dimension by its uncertainty — a 1 cm error in a 1 mm-precision sensor costs 10,000× more than in a 10 cm-precision sensor.
The matrix Σ⁻¹ is the information matrix (also called the precision matrix), denoted Ω = Σ⁻¹. High confidence (small Σ) means high Ω — the factor contributes strongly to the objective. This is the natural weighting: we trust accurate sensors more. In factor graphs, each factor contributes its information matrix Ωᵢ to the Hessian of the cost.
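To make the weighting concrete, here is a small numpy sketch (with illustrative values) computing the squared Mahalanobis distance of the same residual under two different noise models:

```python
import numpy as np

# Residual of 1 cm on each axis (units: meters)
e = np.array([0.01, 0.01])

# Two hypothetical sensors: 1 mm precision vs 10 cm precision
Sigma_precise = np.diag([0.001**2, 0.001**2])  # covariance, 1 mm std
Sigma_coarse  = np.diag([0.1**2,   0.1**2])    # covariance, 10 cm std

for name, Sigma in [("1 mm sensor", Sigma_precise), ("10 cm sensor", Sigma_coarse)]:
    Omega = np.linalg.inv(Sigma)   # information matrix
    d2 = e @ Omega @ e             # squared Mahalanobis distance
    print(f"{name}: ||e||^2 = {d2:g}")

# 1 mm sensor:  200   (residual is 10 sigma per axis: heavily penalized)
# 10 cm sensor: 0.02  (residual is 0.1 sigma per axis: nearly free)
```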
The least squares connection
Now substitute Gaussian noise models into Equation (2.3). Each measurement factor p(zᵢ | x) = 𝒩(hᵢ(x), Σᵢ) where hᵢ is the measurement function (how the state maps to expected measurement). The negative log-likelihood of this factor is:

−log p(zᵢ | x) = ½ (zᵢ − hᵢ(x))ᵀ Σᵢ⁻¹ (zᵢ − hᵢ(x)) + const
Summing over all factors, the MAP problem becomes:

x̂_MAP = argminₓ ½ ∑ᵢ ‖zᵢ − hᵢ(x)‖²_Ωᵢ    (2.6)

This is nonlinear weighted least squares. Every factor graph problem with Gaussian noise reduces to this form. The prior factor contributes ½ ‖x − μ₀‖²_Ω₀ and each measurement factor contributes ½ ‖zᵢ − hᵢ(x)‖²_Ωᵢ.
Whitening: reducing to standard least squares
Any weighted problem can be converted to standard (unweighted) least squares via the whitening transformation. If Ω = Lᵀ L (Cholesky factorization with upper triangular factor L), then:

‖r‖²_Ω = rᵀ Ω r = rᵀ Lᵀ L r = ‖L r‖²
Defining the whitened residual r̃ = L r, the problem becomes standard least squares ½ ‖r̃‖². Whitening is not just a mathematical trick — it has practical importance: the square root factor L (not Ω itself) is what iSAM2 maintains and updates incrementally. Working with L instead of Ω improves numerical stability and enables efficient rank-one updates.
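A quick numpy sanity check of this identity, with an arbitrary SPD information matrix (illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random SPD information matrix and residual (illustrative)
A = rng.standard_normal((3, 3))
Omega = A @ A.T + 3 * np.eye(3)        # SPD information matrix
r = rng.standard_normal(3)

# Whitening: Omega = L^T L with L upper triangular
L = np.linalg.cholesky(Omega).T        # numpy returns lower; transpose -> upper
r_white = L @ r                        # whitened residual

weighted = r @ Omega @ r               # Mahalanobis form
standard = r_white @ r_white           # plain squared norm
print(np.isclose(weighted, standard))  # True
```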
Think of each factor as a spring connecting variables to their "target" values or to each other. The spring stiffness is Ωi — a tight spring (high Ω) strongly enforces the constraint. The MAP estimate is the configuration where all springs are in equilibrium: no spring can be relaxed without tightening another. Optimization finds this equilibrium configuration. IMU factors are stiff (very precise relative motion); GNSS factors are looser (meter-level position noise).
Maximum Likelihood vs MAP
When there is no prior (p(x) = const, or equivalently Ω0 → 0), MAP reduces to Maximum Likelihood Estimation (MLE): just minimize the measurement residuals. In practice for sensor fusion, we always include at least a prior on the first state (to fix the reference frame) and priors on IMU biases (from calibration). Pure MLE on a GNSS/IMU system would produce an under-constrained problem in the orientation space.
03 Graphical Models
We established that the MAP problem involves a product of factors. How we represent and organize those factors determines how efficiently we can solve the problem. Graphical models provide the representation; graph structure exposes the sparsity that makes optimization tractable.
Why graphs?
The joint distribution over all state variables and measurements is typically intractable to reason about directly. A vehicle navigating for 10 minutes at 10 Hz has 6000 pose states. The full covariance matrix would be 36,000 × 36,000 — over a billion entries, impossible to store or invert. But in reality, state x5 only directly constrains states x4 and x6 (through the motion model). The GNSS measurement at time 5 constrains only x5. The joint distribution has sparse conditional independence structure, and graphs make this structure explicit.
Bayesian Networks (Directed Graphical Models)
A Bayesian Network (BN) is a directed acyclic graph (DAG) where each node represents a random variable and each directed edge encodes a conditional dependency. The joint distribution factorizes as:

p(x₁, …, xₙ) = ∏ᵢ p(xᵢ | parents(xᵢ))
For a robot motion model with states x₀:T and IMU controls u₁:T:

p(x₀:T | u₁:T) = p(x₀) ∏ₖ₌₁..T p(xₖ | xₖ₋₁, uₖ)
This is the standard state-space model underlying the Kalman filter. Bayesian networks are powerful for representing causal, temporal structure.
Markov Random Fields (Undirected)
Some constraints are symmetric — a distance between two poses is the same whether you measure it from pose A to pose B or B to A. Undirected graphical models (Markov Random Fields / MRFs) handle these naturally. The joint distribution is a product of potential functions over cliques:

p(x) = (1/Z) ∏_C ψ_C(x_C)
where Z is the partition function (normalizing constant). MRFs are natural for loop-closure constraints in SLAM, where closing a loop creates a symmetric constraint between two poses.
Factor Graphs — the best of both worlds
Factor graphs are bipartite graphs with two types of nodes:
- ○ Variable nodes (circles): represent the unknowns — poses, velocities, biases. These are what we optimize over.
- ■ Factor nodes (squares): represent constraints between variables — measurement factors, motion factors, prior factors. Each factor encodes a probability distribution involving its connected variables.
Every BN and MRF can be converted to a factor graph. The power of factor graphs is that they make the factorization structure of the joint distribution visually and computationally explicit. The joint probability is:

p(x₁, …, xₙ) ∝ ∏ᵢ fᵢ(Xᵢ)    (3.1)
Conditional independence from graph structure
The most powerful property of factor graphs is that the graph structure directly encodes conditional independence. Two variables xi and xj are conditionally independent given a set of variables S if all paths between them in the factor graph pass through S. This is the graphical model analog of the Markov property, and it is what allows local computations (variable elimination, belief propagation) to produce globally correct results.
In a GNSS/IMU fusion graph, state x5 is conditionally independent of state x2 given x4 — because the only path from x2 to x5 passes through x3 and x4. This means when we update x5 based on a new GNSS measurement, we only need to process the local neighborhood of the graph, not reprocess everything from the beginning. This is the key insight that makes incremental smoothing (iSAM2) efficient.
04 The Factor Graph Formalism
With the probabilistic motivation and graphical model foundation in place, we can now define factor graphs precisely and show how they encode the GNSS/IMU state estimation problem.
Formal definition
A factor graph G = (V, F, E) consists of:
- Variable nodes V = {x₁, …, xₙ}: the unknowns we want to estimate. In GNSS/IMU fusion, these are navigation states — pose, velocity, IMU biases at each keyframe.
- Factor nodes F = {f₁, …, fₘ}: the probabilistic constraints. Each factor fᵢ is a function of a subset of variables Xᵢ ⊆ V. It encodes the probability distribution of a measurement or a prior.
- Edges E ⊆ V × F: undirected edges connecting each factor to its variable nodes. A factor fᵢ is connected to exactly those variables Xᵢ it depends on. There are no edges between variable nodes or between factor nodes (bipartite).
The graph represents the factorized joint distribution of Equation (3.1). Finding the MAP estimate means finding the variable assignment that maximizes this joint, or equivalently minimizes the sum of negative log-factors.
Factor types in sensor fusion
| Factor Type | Variables | Measurement / Role | Noise Model |
|---|---|---|---|
| Prior factor | x0 | Initial pose from GNSS fix or known start | Gaussian, user-specified Σ |
| IMU preintegration | xk, xk+1, bk | Integrated accel + gyro over Δt | Gaussian, from IMU specs |
| GNSS position | xk | Absolute lat/lon/alt from receiver | Gaussian, dilution of precision |
| Bias random walk | bk, bk+1 | IMU bias evolves slowly | Gaussian, σ from IMU datasheet |
| Zero-velocity update | xk | Detected stationary interval | Tight Gaussian on velocity = 0 |
The objective function
For Gaussian factors, the MAP problem (Equation 2.6) with factor graph structure becomes:

F(x) = ½ ∑ᵢ ‖rᵢ(x)‖²_Ωᵢ,   x̂ = argminₓ F(x)    (4.1)
Each term is a factor's contribution to the total cost. Minimizing F over all state variables simultaneously yields the global MAP estimate — the trajectory that best explains all measurements given the noise models.
The information matrix and sparsity
When we linearize F around a current estimate (Chapter 5), the resulting Hessian H has exactly the same sparsity pattern as the factor graph connectivity. If two variables xi and xj are not connected to any common factor, the corresponding Hij block is zero. This is why factor graph optimization is tractable at scale: for a chain-structured GNSS/IMU graph (no loop closures), H is block-tridiagonal — only O(n) nonzero blocks — and can be solved in O(n) time via forward/backward substitution.
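To see this pattern emerge, here is a small illustration: a hypothetical 4-state chain with unit-information prior, odometry, and unary factors, whose assembled Hessian comes out block-tridiagonal (all values are made up for the demo):

```python
import numpy as np

n = 4                       # hypothetical chain of 4 scalar states
H = np.zeros((n, n))

# Prior on x0 (unit information, illustrative)
H[0, 0] += 1.0

# Between factors x_k -- x_{k+1}: J = [-1, 1] couples only neighbors
for k in range(n - 1):
    J = np.zeros(n); J[k], J[k + 1] = -1.0, 1.0
    H += np.outer(J, J)     # J^T Omega J with Omega = 1

# Unary (GNSS-like) factor on each state touches only its diagonal block
for k in range(n):
    H[k, k] += 1.0

print((H != 0).astype(int))
# [[1 1 0 0]
#  [1 1 1 0]
#  [0 1 1 1]
#  [0 0 1 1]]   <- tridiagonal: x_j and x_k couple only if they share a factor
```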
A minimal working example in Python (GTSAM)
Before diving into the optimization algorithms, here is a complete minimal factor graph: 1D robot position estimated from two noisy measurements. This makes the abstract formalism concrete.
```python
import gtsam
import numpy as np

# --- Define variable keys ---
# GTSAM uses integer keys to identify variables
X0 = gtsam.symbol('x', 0)  # position at time 0
X1 = gtsam.symbol('x', 1)  # position at time 1

# --- Create an empty nonlinear factor graph ---
graph = gtsam.NonlinearFactorGraph()

# --- Prior factor: x0 ~ N(0.0, sigma=0.5) ---
prior_noise = gtsam.noiseModel.Isotropic.Sigma(1, 0.5)
graph.add(gtsam.PriorFactorDouble(X0, 0.0, prior_noise))

# --- Between factor: x1 - x0 ~ N(1.0, sigma=0.1) (odometry) ---
odom_noise = gtsam.noiseModel.Isotropic.Sigma(1, 0.1)
graph.add(gtsam.BetweenFactorDouble(X0, X1, 1.0, odom_noise))

# --- Unary measurement: x1 ~ N(1.2, sigma=0.3) (GPS) ---
gps_noise = gtsam.noiseModel.Isotropic.Sigma(1, 0.3)
graph.add(gtsam.PriorFactorDouble(X1, 1.2, gps_noise))

# --- Initial values (linearization point) ---
init = gtsam.Values()
init.insert(X0, 0.0)
init.insert(X1, 1.0)

# --- Optimize with Gauss-Newton ---
params = gtsam.GaussNewtonParams()
params.setRelativeErrorTol(1e-6)
optimizer = gtsam.GaussNewtonOptimizer(graph, init, params)
result = optimizer.optimize()

# --- Extract result ---
print(f"x0 = {result.atDouble(X0):.4f}")  # ≈ 0.1429
print(f"x1 = {result.atDouble(X1):.4f}")  # ≈ 1.1486

# --- Marginal covariances ---
marginals = gtsam.Marginals(graph, result)
print(f"cov(x1) = {marginals.marginalCovariance(X1)[0,0]:.4f}")  # ≈ 0.0669
```
The result is the information-weighted average of the constraints on x1. Marginalizing out x0, the prior-plus-odometry chain gives x1 ~ 𝒩(1.0, 0.5² + 0.1² = 0.26), i.e. Ω ≈ 3.85, while the GPS says x1 = 1.2 with σ = 0.3 (Ω ≈ 11.11). The MAP estimate weights by information: (3.85·1.0 + 11.11·1.2) / (3.85 + 11.11) ≈ 1.149. The weak prior on x0 (σ = 0.5) also lets x0 shift to ≈ 0.143 to absorb part of the disagreement. This is exactly what the factor graph optimizer computes.
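The same answer can be checked by assembling the 2×2 normal equations by hand; a numpy sketch using the same noise values as above:

```python
import numpy as np

# Information (1/sigma^2) of each factor
w_prior, w_odom, w_gps = 1/0.5**2, 1/0.1**2, 1/0.3**2

# Normal equations H x = rhs for residuals: x0 - 0, (x1 - x0) - 1.0, x1 - 1.2
H = np.array([[w_prior + w_odom, -w_odom],
              [-w_odom,           w_odom + w_gps]])
rhs = np.array([-w_odom * 1.0,                  # from the odometry residual
                 w_odom * 1.0 + w_gps * 1.2])   # odometry + GPS targets on x1

x = np.linalg.solve(H, rhs)
print(x)                        # [0.1429 1.1486] -- matches GTSAM
print(np.linalg.inv(H)[1, 1])   # 0.0669 -- marginal covariance of x1
```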
05 Nonlinear Least Squares
The objective function F(x) in Equation (4.1) is nonlinear because the measurement functions hi(x) are nonlinear (rotation matrices, trigonometric functions, etc.). We cannot solve for the minimum analytically. We iteratively linearize and solve linear systems — this is the foundation of all factor graph solvers.
Residuals and the cost function
For each factor i, define the residual:

rᵢ(x) = zᵢ − hᵢ(x)

The total cost is:

F(x) = ½ ∑ᵢ rᵢ(x)ᵀ Ωᵢ rᵢ(x)    (5.1)
For a Gaussian measurement factor, rᵢ = 0 when the measurement is perfectly predicted. For a GNSS factor, r_GNSS = p_measured − R_body→world · p_state measures how much the current trajectory state disagrees with the GNSS fix.
Jacobians and linearization
To linearize, expand rᵢ around the current estimate x₀ using a first-order Taylor expansion:

rᵢ(x₀ ⊕ δ) ≈ rᵢ(x₀) + Jᵢ δ,   Jᵢ = ∂rᵢ/∂δ evaluated at x₀

Substituting into (5.1) and minimizing over δ:

F(x₀ ⊕ δ) ≈ F(x₀) + bᵀ δ + ½ δᵀ H δ

where:

H = ∑ᵢ Jᵢᵀ Ωᵢ Jᵢ (Gauss-Newton approximation of the Hessian)
b = ∑ᵢ Jᵢᵀ Ωᵢ rᵢ (gradient of cost at x₀)
The Gauss-Newton method
Setting the gradient of the quadratic approximation to zero gives the normal equations:

H δ* = −b,   x ← x₀ ⊕ δ*
The "⊕" notation is intentional: for variables living on manifolds (like rotations), the update is not simple addition but a retraction (Chapter 7). For Euclidean variables, ⊕ is just addition.
The full Gauss-Newton iteration:
```python
import numpy as np

def gauss_newton(factors, x0, max_iter=50, tol=1e-8):
    """Pseudocode sketch. retract() is the manifold update from Chapter 7
    (plain addition for Euclidean states)."""
    x = x0.copy()
    n = x0.size  # total state dimension (Euclidean case)
    for iteration in range(max_iter):
        # Step 1: Evaluate all residuals and Jacobians at current x
        H = np.zeros((n, n))  # Hessian accumulator
        b = np.zeros(n)       # Gradient accumulator
        cost = 0.0
        for factor in factors:
            r, J = factor.evaluate(x)  # residual and Jacobian
            Omega = factor.information_matrix()
            H += J.T @ Omega @ J       # accumulate Hessian
            b += J.T @ Omega @ r       # accumulate gradient
            cost += 0.5 * r @ Omega @ r

        # Step 2: Solve normal equations H·δ = -b
        delta = np.linalg.solve(H, -b)

        # Step 3: Update state (retract on manifold)
        x = retract(x, delta)

        # Step 4: Check convergence
        if np.linalg.norm(delta) < tol:
            break
    return x, cost
```
Gauss-Newton is a local method. It finds a local minimum near the starting point. For sensor fusion, a poor initialization (e.g., starting at zero velocity when the vehicle is actually moving at 20 m/s) will produce a wrong local minimum or divergence. Always initialize from a reasonable first guess: dead reckoning with IMU integration, or the previous optimizer solution. The IMU preintegration makes this natural — the propagated state is an excellent initial guess.
Levenberg-Marquardt (LM)
Gauss-Newton can diverge when the linear approximation is poor — this happens far from the minimum (early iterations) or in regions of high curvature. The Levenberg-Marquardt algorithm adds a damping term that interpolates between Gauss-Newton and gradient descent:

(H + λ I) δ = −b
The damping parameter λ is adapted automatically:
- ↑ If the step increased cost (step was bad): increase λ → move toward gradient descent (smaller, safer step).
- ↓ If the step decreased cost (step was good): decrease λ → move toward Gauss-Newton (larger, faster step).
LM has a beautiful trust-region interpretation: λ implicitly controls the radius of a ball around the current point within which the quadratic model is trusted. When things go wrong, shrink the ball; when they go right, expand it. This makes LM robust to poor initialization and widely used in robotics (Ceres Solver, GTSAM both use LM variants).
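A minimal sketch of the LM damping loop, reusing the H, b accumulation idea from the Gauss-Newton pseudocode above; the `linearize`, `total_cost`, and `retract` helpers are hypothetical stand-ins, not a real library API:

```python
import numpy as np

def levenberg_marquardt(linearize, total_cost, retract, x0, n,
                        max_iter=50, lam=1e-4):
    """Sketch: linearize(x) -> (H, b); total_cost(x) -> scalar;
    retract(x, delta) applies the update. All assumed helpers."""
    x = x0
    cost = total_cost(x)
    for _ in range(max_iter):
        H, b = linearize(x)
        # Damped normal equations: (H + lambda*I) delta = -b
        delta = np.linalg.solve(H + lam * np.eye(n), -b)
        x_new = retract(x, delta)
        new_cost = total_cost(x_new)
        if new_cost < cost:           # good step: accept, trust GN more
            x, cost = x_new, new_cost
            lam = max(lam / 10, 1e-12)
        else:                         # bad step: reject, damp harder
            lam = min(lam * 10, 1e12)
    return x, cost
```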
Convergence criteria
Three standard checks determine convergence. All three should pass before declaring success:

- ‖δ‖ < ε₁ — the update step is negligibly small
- ‖b‖ = ‖JᵀΩr‖ < ε₂ — the gradient is negligibly small (at a minimum)
- |ΔF / F| < ε₃ — the relative cost change is negligible
Typical values: ε₁ = 10⁻⁸, ε₂ = 10⁻⁵, ε₃ = 10⁻⁶. For sensor fusion with well-conditioned problems, convergence in 5–15 Gauss-Newton iterations is typical. The cost of one iteration is dominated by solving the linear system, which we address next.
Analytic vs automatic differentiation
Computing Jacobians Jᵢ = ∂rᵢ/∂x has three practical approaches:
| Method | How | Speed | Accuracy | Effort |
|---|---|---|---|---|
| Analytic | Derive ∂r/∂x by hand or with a CAS, implement in code | Fastest | Exact (up to floating point) | High — error-prone derivation |
| Automatic diff | Dual numbers or adjoint (Ceres, GTSAM with Expression) | Fast (1–3× slower) | Machine precision | Low — write residual once |
| Numerical | Finite differences: (r(x+ε) − r(x)) / ε | Slow (~2n evaluations) | Limited by step size ε | Zero — use for testing only |
For performance-critical sensor fusion (real-time SLAM), analytic Jacobians are standard. For prototyping and research, GTSAM's Expression and Ceres's automatic differentiation are excellent. Numerical Jacobians are invaluable for checking your analytic implementation: if the analytic and numerical Jacobians disagree by more than ~10−5, your analytic derivation has a bug.
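The numerical check described above is easy to implement. Here is a small sketch; the range-measurement residual and its analytic Jacobian are illustrative placeholders:

```python
import numpy as np

def numerical_jacobian(residual, x, eps=1e-6):
    """Central finite differences: column j is d(residual)/d(x_j)."""
    r0 = residual(x)
    J = np.zeros((r0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x); dx[j] = eps
        J[:, j] = (residual(x + dx) - residual(x - dx)) / (2 * eps)
    return J

# Example: range measurement to a beacon at the origin (illustrative)
def residual(p):
    return np.array([5.0 - np.linalg.norm(p)])   # z - h(p), with z = 5.0

def analytic_jacobian(p):
    return (-p / np.linalg.norm(p)).reshape(1, -1)

p = np.array([3.0, 4.0])
err = np.abs(analytic_jacobian(p) - numerical_jacobian(residual, p)).max()
print(f"max Jacobian error: {err:.2e}")   # should be well below 1e-5
```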
06 Solving the Linear System
Each Gauss-Newton iteration reduces to solving the linear system H·δ = −b, where H is a large sparse symmetric positive definite (SPD) matrix. The efficiency of this solve — exploiting the sparsity from the factor graph structure — is what makes large-scale sensor fusion tractable.
The sparsity of H
H = ∑ᵢ Jᵢᵀ Ωᵢ Jᵢ inherits the sparsity of the Jacobians. Block Hⱼₖ is nonzero only when variables xⱼ and xₖ appear together in at least one factor. For a sequential GNSS/IMU trajectory with no loop closures, H is block-tridiagonal (each state only directly coupled to its neighbors). This sparsity is the key property that separates factor graph backends from brute-force matrix inversion.
Cholesky factorization
For an SPD matrix H, the standard solver is Cholesky factorization: H = L Lᵀ where L is lower triangular. Then:

L Lᵀ δ = −b
L y = −b (forward substitution, O(nnz))
Lᵀ δ = y (backward substitution, O(nnz))

where nnz is the number of nonzero entries in L. For a tridiagonal H of size n×n, nnz = O(n) and the entire solve is O(n). Compare this to O(n³) for dense Cholesky — the difference for a 10,000-state trajectory is roughly 10¹² vs 10⁴ operations.
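In practice one calls a library routine rather than hand-coding the substitutions. A scipy sketch on a small dense SPD system (illustrative values; sparse problems would use a sparse Cholesky such as CHOLMOD):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Small SPD system H·delta = -b (illustrative tridiagonal H)
H = np.array([[ 4., -1.,  0.],
              [-1.,  4., -1.],
              [ 0., -1.,  4.]])
b = np.array([1., 2., 3.])

c, low = cho_factor(H)            # Cholesky factorization of H
delta = cho_solve((c, low), -b)   # forward + backward substitution

print(np.allclose(H @ delta, -b)) # True
```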
Variable ordering and fill-in
When Cholesky-factoring a sparse H, the triangular factor L can be denser than H — this is called fill-in. The amount of fill-in depends critically on the variable ordering (the order in which variables appear in the vector x, which determines the row/column order in H).
Consider a factor graph with a loop closure: state x0 and state x100 are connected by a factor. This creates a nonzero block H₀,₁₀₀, and under a poor ordering Cholesky propagates this coupling, producing fill-in across the blocks Lᵢ,ⱼ for 0 ≤ i, j ≤ 100. A good ordering (nested dissection, COLAMD, AMD) minimizes fill-in by clever reordering. In practice, GTSAM and g2o use COLAMD or AMD automatically — the programmer rarely needs to think about this.
There is a deep connection between variable elimination order and graph structure. Eliminating variables one by one from the factor graph produces a Bayes net (the directed graphical model equivalent) — this is the Cholesky factorization viewed as a graphical operation. The Cholesky factor L is the information matrix of this Bayes net. Good elimination orderings correspond to trees or near-tree structures in the graph, which minimize fill-in.
QR factorization and the square root information form
An alternative to Cholesky is to factor the whitened Jacobian matrix J̃ directly, before ever forming H = J̃ᵀJ̃. With the QR factorization J̃ = Q R (Q orthogonal, R upper triangular):

H = J̃ᵀ J̃ = Rᵀ Qᵀ Q R = Rᵀ R

The upper triangular R is the square root information matrix (also called the R-factor or ℛ). Solving the normal equations becomes Rᵀ R δ = −b, equivalent to Cholesky but numerically more stable (H's condition number is the square of J's condition number — QR avoids this squaring). GTSAM internally maintains R and updates it incrementally.
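A numpy check that the R-factor from QR matches the Cholesky factor of H on a random illustrative problem (rows of R may differ in sign, so compare magnitudes and RᵀR):

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.standard_normal((20, 5))   # whitened Jacobian: 20 residuals, 5 variables

# Route 1: form H explicitly and Cholesky-factor it
H = J.T @ J
L = np.linalg.cholesky(H)          # H = L L^T

# Route 2: QR on J directly -- never forms H (avoids squaring the conditioning)
R = np.linalg.qr(J, mode='r')      # J = Q R, R upper triangular

print(np.allclose(R.T @ R, H))              # True: R^T R = H
print(np.allclose(np.abs(R), np.abs(L.T)))  # True up to row signs
```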
iSAM: Incremental Smoothing and Mapping
In online sensor fusion, new states and factors are added continuously. Rerunning Cholesky from scratch every time is wasteful — most of the existing factorization is still valid. iSAM (Kaess et al., 2008) exploits this by maintaining the Cholesky factor and updating it incrementally using rank-one (or rank-k) updates when new factors arrive.
The cost of adding a new factor that involves m variables is O(m · nnz_affected) rather than O(n · nnz_total). For a chain-structured problem, this is O(1) amortized per new factor — the same cost as the EKF, but with globally consistent estimates.
iSAM2 and the Bayes tree
iSAM had a limitation: when a non-leaf variable is affected by a new measurement (e.g., an old pose receives a loop closure), the entire subtree below it in the elimination tree must be refactored. iSAM2 (Kaess et al., 2012) solves this with the Bayes tree — a data structure that organizes the factorized Bayes net into cliques, enabling truly local updates.
When a new measurement arrives, iSAM2:
1. Identifies the affected cliques — those in the subtree rooted at the clique containing the new variables.
2. Removes the affected cliques from the Bayes tree (converting them back to factors).
3. Adds the new factor and re-eliminates only the affected variables (a local Cholesky on a small subproblem).
4. Updates the Bayes tree with the new cliques and propagates back-substitution (fluid relinearization) only through affected variables.
For a chain graph (no loop closures), only one clique is ever affected per new measurement — O(1) work. For loop closures, multiple cliques may need re-elimination, but the tree structure keeps this local. In practice, iSAM2 achieves near-constant time per factor addition for most SLAM/fusion problems.
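In GTSAM this incremental workflow looks roughly like the following sketch, extending the 1D example from Chapter 4 (parameter choices are illustrative):

```python
import gtsam

isam = gtsam.ISAM2()

# --- t=0: prior on x0 arrives ---
graph = gtsam.NonlinearFactorGraph()
values = gtsam.Values()
X0 = gtsam.symbol('x', 0)
graph.add(gtsam.PriorFactorDouble(
    X0, 0.0, gtsam.noiseModel.Isotropic.Sigma(1, 0.5)))
values.insert(X0, 0.0)
isam.update(graph, values)    # incremental factorization update

# --- t=1: a new state and an odometry factor arrive ---
graph = gtsam.NonlinearFactorGraph()
values = gtsam.Values()
X1 = gtsam.symbol('x', 1)
graph.add(gtsam.BetweenFactorDouble(
    X0, X1, 1.0, gtsam.noiseModel.Isotropic.Sigma(1, 0.1)))
values.insert(X1, 1.0)        # initial guess for the new state only
isam.update(graph, values)    # only affected cliques are re-eliminated

print(isam.calculateEstimate().atDouble(X1))
```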
Library comparison
| Library | Backend | Incremental | Languages | Best for |
|---|---|---|---|---|
| GTSAM | iSAM2, Bayes tree, QR/Cholesky | Yes (iSAM2) | C++, Python | Sensor fusion, SLAM, smoothing |
| g2o | Sparse Cholesky (CHOLMOD, csparse) | No (batch) | C++ | Pose graph SLAM, fast batch |
| Ceres | Sparse/dense Schur, LM | No (batch) | C++ | General NLS, bundle adjustment |
| manif | Thin wrapper, user provides solver | No | C++, Python | Manifold operations only |
| SE-Sync | Riemannian optimization | No | MATLAB, C++ | Rotation averaging, global minimum |
For GNSS/IMU fusion, GTSAM is the standard choice: it has prebuilt IMU preintegration factors, a Python API, active development, and iSAM2 for real-time use. The examples in Part 2 use GTSAM.
07 Lie Groups & Manifold Optimization
Everything in Chapters 2–6 assumed variables live in Euclidean space ℝn, where adding a correction δ to a state x is well-defined. Rotations break this assumption. This chapter explains why, and provides the mathematical tools to fix it.
The problem with representing rotations
A 3D rotation has exactly 3 degrees of freedom (pitch, roll, yaw). Yet every common representation has problems:
| Representation | DoF | Parameters | Problem |
|---|---|---|---|
| Euler angles (ZYX) | 3 | 3 | Gimbal lock: singularity at pitch = ±90° |
| Rotation matrix R ∈ SO(3) | 3 | 9 | 6 nonlinear constraints (RᵀR = I, det R = 1) — over-parameterized |
| Quaternion q ∈ ℝ⁴ | 3 | 4 | Unit norm constraint ‖q‖=1, double cover (q and −q represent the same rotation) |
| Rotation vector φ ∈ ℝ³ | 3 | 3 | Singular at ‖φ‖ = 2π, wrapping ambiguity |
For optimization, the rotation matrix is actually the best internal representation — it is closed under composition (R1R2 is always a rotation matrix) and easy to differentiate. The challenge is that we cannot perform unconstrained optimization over the 9 entries of R — we must ensure the result stays in SO(3). The solution is optimization on manifolds.
Groups and Lie groups
A group (G, ∘) is a set G with a binary operation ∘ satisfying: (1) closure (a∘b ∈ G), (2) associativity, (3) identity element e, (4) inverse for every element. The set of 3D rotations with composition forms the group SO(3).
A Lie group is a group that is simultaneously a smooth manifold — you can do calculus on it. The group operation and inverse are smooth (differentiable) maps. This is what enables the connection to optimization: we can define derivatives, tangent spaces, and exponential maps.
The special orthogonal group SO(3)
SO(3) is a 3-dimensional manifold embedded in ℝ⁹. It looks locally like ℝ³ near any point, but globally it is curved — think of the surface of a sphere, which locally looks flat but globally is not.
The Lie algebra so(3)
The Lie algebra g of a Lie group G is the tangent space at the identity element. For SO(3), the identity is I (the 3×3 identity matrix), and the tangent space is the set of skew-symmetric matrices:
Every element of so(3) has the form:

[φ]× = ⎡   0   −φ₃    φ₂ ⎤
       ⎢  φ₃     0   −φ₁ ⎥
       ⎣ −φ₂    φ₁     0 ⎦

where φ = (φ₁, φ₂, φ₃)ᵀ ∈ ℝ³ is the rotation vector (axis × angle).
The hat operator (∧): φ ∈ ℝ³ → [φ]× ∈ so(3) converts a 3D vector to a skew-symmetric matrix. The vee operator (∨) is its inverse: [φ]×∨ = φ.
The exponential map: Rodrigues' rotation formula
The exponential map takes a Lie algebra element (tangent vector at identity) to a Lie group element (a rotation):

Exp(φ) = I + (sin θ / θ) [φ]× + ((1 − cos θ) / θ²) [φ]×²,   θ = ‖φ‖

The inverse, the logarithmic map, maps a rotation back to its rotation vector:

θ = arccos( (tr(R) − 1) / 2 ),   Log(R) = (θ / (2 sin θ)) (R − Rᵀ)
These formulas are used constantly in factor graph optimization: the IMU preintegration integrates angular velocity by repeatedly applying Exp, and the optimizer computes linearization Jacobians using derivatives of Exp and Log.
```python
import numpy as np

def hat(phi):
    """R³ → so(3): skew-symmetric matrix from rotation vector."""
    return np.array([[      0, -phi[2],  phi[1]],
                     [ phi[2],       0, -phi[0]],
                     [-phi[1],  phi[0],       0]])

def Exp(phi):
    """so(3) → SO(3): Rodrigues' rotation formula."""
    theta = np.linalg.norm(phi)
    if theta < 1e-8:
        return np.eye(3) + hat(phi)  # first-order approx for small angles
    K = hat(phi / theta)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

def Log(R):
    """SO(3) → so(3): matrix logarithm."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1, 1))
    if theta < 1e-8:
        return (R - R.T) / 2  # first-order approx
    return (theta / (2 * np.sin(theta))) * (R - R.T)

def vee(Phi):
    """so(3) → R³: extract rotation vector from skew-symmetric matrix."""
    return np.array([Phi[2,1], Phi[0,2], Phi[1,0]])

# Example: 90-degree rotation about z-axis
phi = np.array([0, 0, np.pi/2])
R = Exp(phi)
phi_recovered = vee(Log(R))
print(phi_recovered)  # [0, 0, 1.5708] ✓
```
The special Euclidean group SE(3)
A full rigid body pose (position + orientation) is an element of SE(3):

T = ⎡ R  t ⎤ ,   R ∈ SO(3), t ∈ ℝ³
    ⎣ 0  1 ⎦
The 4×4 matrix T is the homogeneous transformation matrix. It acts on homogeneous points p = [x;y;z;1] as T·p = [Rp+t; 1]. The Lie algebra is se(3) — 4×4 matrices of the form [[φ]×, ρ; 0, 0] where ρ ∈ ℝ³. SE(3) has 6 degrees of freedom (3 rotation + 3 translation), corresponding to the 6 DoF state of a rigid body.
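A short sketch building T and applying it to a point, reusing Exp() from the SO(3) code block above (values are illustrative):

```python
import numpy as np
# assumes Exp() from the SO(3) code block above is in scope

def make_T(phi, t):
    """Assemble a 4x4 homogeneous transform from rotation vector and translation."""
    T = np.eye(4)
    T[:3, :3] = Exp(phi)
    T[:3, 3] = t
    return T

T = make_T(np.array([0, 0, np.pi/2]), np.array([1.0, 2.0, 0.0]))
p = np.array([1.0, 0.0, 0.0, 1.0])   # homogeneous point
print(T @ p)                          # [1. 3. 0. 1.]: rotated 90° about z, then translated
```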
Perturbation model: how to update rotations in optimization
In Gauss-Newton, we compute a correction δ ∈ ℝ⁶ (for SE(3)) and need to apply it to the current estimate T. The right perturbation model is:

T ← T · Exp(δ)
This is the retraction — the operation that takes a current point on the manifold and a tangent vector, and produces a new point on the manifold. It is what the optimizer uses instead of naive addition. The choice of left vs right perturbation affects the form of the Jacobians but not the final result; GTSAM uses right perturbation for its Pose3 type.
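For SO(3) alone the retraction is a single line. This sketch applies a small correction and verifies the result is still exactly a rotation (again reusing Exp() from the block above):

```python
import numpy as np
# assumes Exp() from the SO(3) code block above is in scope

R = Exp(np.array([0.3, -0.2, 0.5]))    # current orientation estimate
dphi = np.array([0.01, -0.02, 0.005])  # Gauss-Newton correction (tangent space)

R_new = R @ Exp(dphi)                  # right perturbation: retraction on SO(3)

# Still exactly on the manifold -- no re-orthogonalization needed
print(np.allclose(R_new.T @ R_new, np.eye(3)))  # True
print(np.isclose(np.linalg.det(R_new), 1.0))    # True
```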
Jacobians on manifolds
When a factor residual rᵢ depends on a rotation R, the Jacobian ∂rᵢ/∂R is not well-defined in the usual sense (R is constrained). Instead, we compute the perturbation Jacobian:

Jᵢ,R = ∂ rᵢ( R · Exp(δφ) ) / ∂δφ   evaluated at δφ = 0
The 3-column Jacobian Ji,R is what enters the normal equations. It accounts for the curved geometry of SO(3) and produces numerically stable updates that always stay on the rotation manifold. GTSAM computes these Jacobians automatically for its built-in types (Rot3, Pose3) and provides a framework for user-defined types.
The IMU state includes orientation Rk ∈ SO(3) at every keyframe. When the optimizer computes a Gauss-Newton step, it produces a 3D correction δφk ∈ ℝ³ for each orientation. Applying Rk,new = Rk,old · Exp(δφk) guarantees Rk,new is exactly a rotation matrix — there is no need for normalization or Gram-Schmidt re-orthogonalization. The manifold structure is maintained automatically by the optimizer.
The full state manifold for GNSS/IMU
The navigation state at keyframe k is:

xₖ = ( Rₖ, pₖ, vₖ, b_a,k, b_g,k ) ∈ SO(3) × ℝ³ × ℝ³ × ℝ³ × ℝ³

where Rₖ is orientation, pₖ is position, vₖ is velocity, b_a,k is accelerometer bias, and b_g,k is gyroscope bias. The state has 3+3+3+3+3 = 15 degrees of freedom per keyframe (a rotation matrix stores 9 parameters, but SO(3) has only 3 DoF). The full trajectory over K keyframes has 15K variables. A typical 1-minute drive at 10 Hz keyframes has 600×15 = 9,000 optimization variables — well within the reach of modern sparse solvers.
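In GTSAM this state is typically split across a pose/velocity NavState and a separate bias object. A minimal sketch (all numeric values illustrative):

```python
import numpy as np
import gtsam

# Navigation state at one keyframe: orientation + position + velocity
nav = gtsam.NavState(gtsam.Rot3(),                 # identity orientation
                     gtsam.Point3(0.0, 0.0, 0.0),  # position (m)
                     np.array([1.0, 0.0, 0.0]))    # velocity (m/s)

# IMU biases: accelerometer (m/s^2) and gyroscope (rad/s)
bias = gtsam.imuBias.ConstantBias(np.array([0.01, 0.0, -0.02]),
                                  np.array([0.001, 0.0, 0.0]))

print(nav.pose())      # Pose3: 6 DoF
print(nav.velocity())  # 3 DoF
print(bias)            # 6 DoF -> 15 DoF total per keyframe
```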