Factor Graph Optimization & Sensor Fusion
Everything you need to understand factor graph optimization from first principles: probabilistic estimation theory, graphical model foundations, nonlinear least squares, numerical linear algebra, and the Lie group mathematics required for 3D rotation handling. Part 2 applies all of this to a real GNSS/IMU fusion pipeline.
01 The State Estimation Problem
Before any equations, let us understand what problem we are actually solving. A robot, autonomous vehicle, or any mobile platform needs to know where it is, how fast it is moving, and how it is oriented in the world. Together these constitute the state — a vector that fully describes the relevant aspects of the system at a point in time.
The state is never directly observable. Sensors give us noisy, partial, and sometimes inconsistent glimpses of reality. A GPS receiver outputs a position with meter-level uncertainty. An IMU measures acceleration and angular velocity but accumulates error over time. A camera sees the world but cannot directly tell you your latitude. The goal of state estimation is to combine all available sensor data — with their known noise characteristics — to compute the best possible estimate of the state.
What makes this hard?
Three fundamental difficulties arise:
1. Nonlinearity. The relationship between state and sensor measurement is rarely linear. IMU integration involves multiplying velocity by time and rotating acceleration vectors. GNSS pseudorange depends on the Euclidean distance between satellite and receiver. These nonlinearities prevent closed-form solutions in general.
2. Noise on a manifold. Orientation — which is part of the state — does not live in Euclidean space. Rotations belong to the special orthogonal group SO(3), a curved manifold. Standard addition does not apply; you cannot write R_new = R_old + δ. This requires special mathematical machinery.
3. Scale. A practical GNSS/IMU system runs for minutes to hours. At 200 Hz IMU and 10 Hz GNSS, that is tens of thousands of state variables coupled by hundreds of thousands of measurements. Naïve matrix operations are computationally impossible; we need structure-exploiting algorithms.
Two estimation paradigms: filtering vs. smoothing
Historically, two philosophies have emerged for handling temporal state estimation:
| Aspect | Filtering (EKF, UKF) | Smoothing (Factor Graphs) |
|---|---|---|
| Data used | Only current + past (online) | All data, past and future |
| Estimate | Marginal posterior p(xₖ \| z₁:ₖ) | Full joint posterior p(x₁:ₖ \| z₁:ₖ) |
| Memory | O(n²) — state covariance only | O(n) sparse — full history |
| Loop closures | Expensive, approximate | Natural, exact |
| Accuracy | Good for linear/near-linear | Optimal under correct model |
| Latency | Zero (online) | Fixed-lag or batch |
| Use case | Real-time control, resource-constrained | Offline, post-processing, SLAM |
The Extended Kalman Filter (EKF) has dominated robotics since the 1990s due to its simplicity and low memory footprint. It works by maintaining only the current state estimate and its covariance, updating them as new measurements arrive. The cost is that it discards all past information except through its summarized covariance — an approximation that loses accuracy over time, especially in nonlinear systems.
Factor graph smoothing keeps every state variable and every measurement in a graph simultaneously, then finds the global maximum a posteriori (MAP) estimate over the full trajectory. This is more accurate because: (a) future measurements can retroactively improve past estimates, (b) loop closures in SLAM are handled exactly, and (c) past states can be relinearized as estimates improve, avoiding the accumulated linearization errors of the EKF's one-shot linearization. Modern implementations (iSAM2, GTSAM) make this computationally feasible by exploiting the sparse structure of the graph.
Why GNSS + IMU?
Together, GNSS and IMU are complementary: GNSS gives absolute position at low rate; IMU gives high-rate relative motion whose dead-reckoning error grows without bound over time. A factor graph can fuse them optimally — the GNSS factors anchor the trajectory globally while the IMU preintegration factors provide tight high-frequency constraints between keyframes.
The full pipeline — preview
Part 1 of this guide builds all the mathematical machinery needed to understand the optimization block (Chapters 2–7). Part 2 fills in the sensor models and assembles the full pipeline with working code.
02 Probabilistic Foundations
Factor graph optimization is, at its core, Bayesian inference expressed as a least squares problem. To understand why optimization is the right tool, you need to understand the probabilistic reasoning underneath it.
The Bayesian framework
We model the unknown state as a random variable x and sensor measurements as random variables z. We have two pieces of information:
1. Prior belief p(x): what we know about the state before seeing any measurements. For the first pose of a trajectory, this is often a delta function (known starting point) or a broad Gaussian (uncertain starting position).
2. Likelihood p(z|x): the sensor model — given state x, how likely is measurement z? This encodes the noise model of the sensor. A GNSS receiver measuring position with 1 m standard deviation contributes a Gaussian likelihood.
Bayes' theorem combines these into the posterior:

p(x | z) = p(z | x) p(x) / p(z) ∝ p(z | x) p(x)    (2.1)
The posterior encodes everything we know about x after seeing z. For multiple independent measurements z₁, …, zₙ, the joint likelihood factorizes:

p(x | z₁, …, zₙ) ∝ p(x) ∏ᵢ p(zᵢ | x)    (2.2)
This is the fundamental equation that factor graphs compute. The graph structure makes the product over factors explicit and computationally exploitable.
Maximum A Posteriori (MAP) estimation
For continuous state spaces, we cannot enumerate all possibilities. Instead, we find the single most probable state — the mode of the posterior. This is the MAP estimate:

x̂_MAP = argmaxₓ p(x | z) = argmaxₓ p(z | x) p(x)    (2.3)
Taking the logarithm (which is monotone, so does not change the argmax) and negating (turning maximization into minimization):

x̂_MAP = argminₓ [ −log p(z | x) − log p(x) ]    (2.4)
This is a scalar function we can minimize with standard optimization tools. The log transformation is crucial: it turns products into sums, exponentials into quadratics, and multiplicative uncertainty into additive cost.
Gaussian distributions and their log-likelihood
In almost all sensor fusion problems, we assume Gaussian noise. This is motivated by the Central Limit Theorem (many small independent errors sum to Gaussian), mathematical tractability, and practical effectiveness. For a multivariate Gaussian with mean μ and covariance Σ:

p(e) = (2π)^(−k/2) |Σ|^(−1/2) exp( −½ (e − μ)ᵀ Σ⁻¹ (e − μ) )    (2.5)

Taking the negative log:

−log p(e) = ½ (e − μ)ᵀ Σ⁻¹ (e − μ) + const
The expression ‖e‖²_Σ = eᵀ Σ⁻¹ e is called the Mahalanobis distance squared. It generalizes Euclidean distance by scaling each dimension by its uncertainty — a 1 cm error in a 1 mm-precision sensor costs 10,000× more than in a 10 cm-precision sensor.
The matrix Σ⁻¹ is the information matrix (also called the precision matrix), denoted Ω = Σ⁻¹. High confidence (small Σ) means high Ω — the factor contributes strongly to the objective. This is the natural weighting: we trust accurate sensors more. In factor graphs, each factor contributes its information matrix Ωᵢ to the Hessian of the cost.
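To make the weighting concrete, here is a small numpy sketch (with illustrative values) computing the squared Mahalanobis distance of the same residual under two different noise models:

```python
import numpy as np

# Residual of 1 cm on each axis (units: meters)
e = np.array([0.01, 0.01])

# Two hypothetical sensors: 1 mm precision vs 10 cm precision
Sigma_precise = np.diag([0.001**2, 0.001**2])  # covariance, 1 mm std
Sigma_coarse  = np.diag([0.1**2,   0.1**2])    # covariance, 10 cm std

for name, Sigma in [("1 mm sensor", Sigma_precise), ("10 cm sensor", Sigma_coarse)]:
    Omega = np.linalg.inv(Sigma)   # information matrix
    d2 = e @ Omega @ e             # squared Mahalanobis distance
    print(f"{name}: ||e||^2 = {d2:g}")

# 1 mm sensor:  200   (residual is 10 sigma per axis: heavily penalized)
# 10 cm sensor: 0.02  (residual is 0.1 sigma per axis: nearly free)
```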
The least squares connection
Now substitute Gaussian noise models into Equation (2.3). Each measurement factor p(zᵢ | x) = 𝒩(hᵢ(x), Σᵢ) where hᵢ is the measurement function (how the state maps to expected measurement). The negative log-likelihood of this factor is:

−log p(zᵢ | x) = ½ (zᵢ − hᵢ(x))ᵀ Σᵢ⁻¹ (zᵢ − hᵢ(x)) + const
Summing over all factors, the MAP problem becomes:

x̂_MAP = argminₓ ½ ∑ᵢ ‖zᵢ − hᵢ(x)‖²_Ωᵢ    (2.6)

This is nonlinear weighted least squares. Every factor graph problem with Gaussian noise reduces to this form. The prior factor contributes ½ ‖x − μ₀‖²_Ω₀ and each measurement factor contributes ½ ‖zᵢ − hᵢ(x)‖²_Ωᵢ.
Whitening: reducing to standard least squares
Any weighted problem can be converted to standard (unweighted) least squares via the whitening transformation. If Ω = Lᵀ L (Cholesky factorization with upper triangular factor L), then:

‖r‖²_Ω = rᵀ Ω r = rᵀ Lᵀ L r = ‖L r‖²
Defining the whitened residual r̃ = L r, the problem becomes standard least squares ½ ‖r̃‖². Whitening is not just a mathematical trick — it has practical importance: the square root factor L (not Ω itself) is what iSAM2 maintains and updates incrementally. Working with L instead of Ω improves numerical stability and enables efficient rank-one updates.
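A quick numpy sanity check of this identity, with an arbitrary SPD information matrix (illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random SPD information matrix and residual (illustrative)
A = rng.standard_normal((3, 3))
Omega = A @ A.T + 3 * np.eye(3)        # SPD information matrix
r = rng.standard_normal(3)

# Whitening: Omega = L^T L with L upper triangular
L = np.linalg.cholesky(Omega).T        # numpy returns lower; transpose -> upper
r_white = L @ r                        # whitened residual

weighted = r @ Omega @ r               # Mahalanobis form
standard = r_white @ r_white           # plain squared norm
print(np.isclose(weighted, standard))  # True
```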
Think of each factor as a spring connecting variables to their "target" values or to each other. The spring stiffness is Ωi — a tight spring (high Ω) strongly enforces the constraint. The MAP estimate is the configuration where all springs are in equilibrium: no spring can be relaxed without tightening another. Optimization finds this equilibrium configuration. IMU factors are stiff (very precise relative motion); GNSS factors are looser (meter-level position noise).
Maximum Likelihood vs MAP
When there is no prior (p(x) = const, or equivalently Ω0 → 0), MAP reduces to Maximum Likelihood Estimation (MLE): just minimize the measurement residuals. In practice for sensor fusion, we always include at least a prior on the first state (to fix the reference frame) and priors on IMU biases (from calibration). Pure MLE on a GNSS/IMU system would produce an under-constrained problem in the orientation space.
03 Graphical Models
We established that the MAP problem involves a product of factors. How we represent and organize those factors determines how efficiently we can solve the problem. Graphical models provide the representation; graph structure exposes the sparsity that makes optimization tractable.
Why graphs?
The joint distribution over all state variables and measurements is typically intractable to reason about directly. A vehicle navigating for 10 minutes at 10 Hz has 6000 pose states. The full covariance matrix would be 36,000 × 36,000 — over a billion entries, impossible to store or invert. But in reality, state x5 only directly constrains states x4 and x6 (through the motion model). The GNSS measurement at time 5 constrains only x5. The joint distribution has sparse conditional independence structure, and graphs make this structure explicit.
Bayesian Networks (Directed Graphical Models)
A Bayesian Network (BN) is a directed acyclic graph (DAG) where each node represents a random variable and each directed edge encodes a conditional dependency. The joint distribution factorizes as:

p(x₁, …, xₙ) = ∏ᵢ p(xᵢ | parents(xᵢ))
For a robot motion model with states x₀:T and IMU controls u₁:T:

p(x₀:T | u₁:T) = p(x₀) ∏ₖ₌₁..T p(xₖ | xₖ₋₁, uₖ)
This is the standard state-space model underlying the Kalman filter. Bayesian networks are powerful for representing causal, temporal structure.
Markov Random Fields (Undirected)
Some constraints are symmetric — a distance between two poses is the same whether you measure it from pose A to pose B or B to A. Undirected graphical models (Markov Random Fields / MRFs) handle these naturally. The joint distribution is a product of potential functions over cliques:

p(x) = (1/Z) ∏_C ψ_C(x_C)
where Z is the partition function (normalizing constant). MRFs are natural for loop-closure constraints in SLAM, where closing a loop creates a symmetric constraint between two poses.
Factor Graphs — the best of both worlds
Factor graphs are bipartite graphs with two types of nodes:
- ○ Variable nodes (circles): represent the unknowns — poses, velocities, biases. These are what we optimize over.
- ■ Factor nodes (squares): represent constraints between variables — measurement factors, motion factors, prior factors. Each factor encodes a probability distribution involving its connected variables.
Every BN and MRF can be converted to a factor graph. The power of factor graphs is that they make the factorization structure of the joint distribution visually and computationally explicit. The joint probability is:

p(x₁, …, xₙ) ∝ ∏ᵢ fᵢ(Xᵢ)    (3.1)
Conditional independence from graph structure
The most powerful property of factor graphs is that the graph structure directly encodes conditional independence. Two variables xi and xj are conditionally independent given a set of variables S if all paths between them in the factor graph pass through S. This is the graphical model analog of the Markov property, and it is what allows local computations (variable elimination, belief propagation) to produce globally correct results.
In a GNSS/IMU fusion graph, state x5 is conditionally independent of state x2 given x4 — because the only path from x2 to x5 passes through x3 and x4. This means when we update x5 based on a new GNSS measurement, we only need to process the local neighborhood of the graph, not reprocess everything from the beginning. This is the key insight that makes incremental smoothing (iSAM2) efficient.
04 The Factor Graph Formalism
With the probabilistic motivation and graphical model foundation in place, we can now define factor graphs precisely and show how they encode the GNSS/IMU state estimation problem.
Formal definition
A factor graph G = (V, F, E) consists of:
- Variable nodes V = {x₁, …, xₙ}: the unknowns we want to estimate. In GNSS/IMU fusion, these are navigation states — pose, velocity, IMU biases at each keyframe.
- Factor nodes F = {f₁, …, fₘ}: the probabilistic constraints. Each factor fᵢ is a function of a subset of variables Xᵢ ⊆ V. It encodes the probability distribution of a measurement or a prior.
- Edges E ⊆ V × F: undirected edges connecting each factor to its variable nodes. A factor fᵢ is connected to exactly those variables Xᵢ it depends on. There are no edges between variable nodes or between factor nodes (bipartite).
The graph represents the factorized joint distribution of Equation (3.1). Finding the MAP estimate means finding the variable assignment that maximizes this joint, or equivalently minimizes the sum of negative log-factors.
Factor types in sensor fusion
| Factor Type | Variables | Measurement / Role | Noise Model |
|---|---|---|---|
| Prior factor | x0 | Initial pose from GNSS fix or known start | Gaussian, user-specified Σ |
| IMU preintegration | xk, xk+1, bk | Integrated accel + gyro over Δt | Gaussian, from IMU specs |
| GNSS position | xk | Absolute lat/lon/alt from receiver | Gaussian, dilution of precision |
| Bias random walk | bk, bk+1 | IMU bias evolves slowly | Gaussian, σ from IMU datasheet |
| Zero-velocity update | xk | Detected stationary interval | Tight Gaussian on velocity = 0 |
The objective function
For Gaussian factors, the MAP problem (Equation 2.6) with factor graph structure becomes:

F(x) = ½ ∑ᵢ ‖rᵢ(x)‖²_Ωᵢ,   x̂ = argminₓ F(x)    (4.1)
Each term is a factor's contribution to the total cost. Minimizing F over all state variables simultaneously yields the global MAP estimate — the trajectory that best explains all measurements given the noise models.
The information matrix and sparsity
When we linearize F around a current estimate (Chapter 5), the resulting Hessian H has exactly the same sparsity pattern as the factor graph connectivity. If two variables xi and xj are not connected to any common factor, the corresponding Hij block is zero. This is why factor graph optimization is tractable at scale: for a chain-structured GNSS/IMU graph (no loop closures), H is block-tridiagonal — only O(n) nonzero blocks — and can be solved in O(n) time via forward/backward substitution.
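To see this pattern emerge, here is a small illustration: a hypothetical 4-state chain with unit-information prior, odometry, and unary factors, whose assembled Hessian comes out block-tridiagonal (all values are made up for the demo):

```python
import numpy as np

n = 4                       # hypothetical chain of 4 scalar states
H = np.zeros((n, n))

# Prior on x0 (unit information, illustrative)
H[0, 0] += 1.0

# Between factors x_k -- x_{k+1}: J = [-1, 1] couples only neighbors
for k in range(n - 1):
    J = np.zeros(n); J[k], J[k + 1] = -1.0, 1.0
    H += np.outer(J, J)     # J^T Omega J with Omega = 1

# Unary (GNSS-like) factor on each state touches only its diagonal block
for k in range(n):
    H[k, k] += 1.0

print((H != 0).astype(int))
# [[1 1 0 0]
#  [1 1 1 0]
#  [0 1 1 1]
#  [0 0 1 1]]   <- tridiagonal: x_j and x_k couple only if they share a factor
```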
A minimal working example in Python (GTSAM)
Before diving into the optimization algorithms, here is a complete minimal factor graph: 1D robot position estimated from two noisy measurements. This makes the abstract formalism concrete.
```python
import gtsam
import numpy as np

# --- Define variable keys ---
# GTSAM uses integer keys to identify variables
X0 = gtsam.symbol('x', 0)  # position at time 0
X1 = gtsam.symbol('x', 1)  # position at time 1

# --- Create an empty nonlinear factor graph ---
graph = gtsam.NonlinearFactorGraph()

# --- Prior factor: x0 ~ N(0.0, sigma=0.5) ---
prior_noise = gtsam.noiseModel.Isotropic.Sigma(1, 0.5)
graph.add(gtsam.PriorFactorDouble(X0, 0.0, prior_noise))

# --- Between factor: x1 - x0 ~ N(1.0, sigma=0.1) (odometry) ---
odom_noise = gtsam.noiseModel.Isotropic.Sigma(1, 0.1)
graph.add(gtsam.BetweenFactorDouble(X0, X1, 1.0, odom_noise))

# --- Unary measurement: x1 ~ N(1.2, sigma=0.3) (GPS) ---
gps_noise = gtsam.noiseModel.Isotropic.Sigma(1, 0.3)
graph.add(gtsam.PriorFactorDouble(X1, 1.2, gps_noise))

# --- Initial values (linearization point) ---
init = gtsam.Values()
init.insert(X0, 0.0)
init.insert(X1, 1.0)

# --- Optimize with Gauss-Newton ---
params = gtsam.GaussNewtonParams()
params.setRelativeErrorTol(1e-6)
optimizer = gtsam.GaussNewtonOptimizer(graph, init, params)
result = optimizer.optimize()

# --- Extract result ---
print(f"x0 = {result.atDouble(X0):.4f}")  # ≈ 0.1429
print(f"x1 = {result.atDouble(X1):.4f}")  # ≈ 1.1486

# --- Marginal covariances ---
marginals = gtsam.Marginals(graph, result)
print(f"cov(x1) = {marginals.marginalCovariance(X1)[0,0]:.4f}")  # ≈ 0.0669
```
The result is the information-weighted average of the constraints on x1. Marginalizing out x0, the prior-plus-odometry chain gives x1 ~ 𝒩(1.0, 0.5² + 0.1² = 0.26), i.e. Ω ≈ 3.85, while the GPS says x1 = 1.2 with σ = 0.3 (Ω ≈ 11.11). The MAP estimate weights by information: (3.85·1.0 + 11.11·1.2) / (3.85 + 11.11) ≈ 1.149. The weak prior on x0 (σ = 0.5) also lets x0 shift to ≈ 0.143 to absorb part of the disagreement. This is exactly what the factor graph optimizer computes.
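The same answer can be checked by assembling the 2×2 normal equations by hand; a numpy sketch using the same noise values as above:

```python
import numpy as np

# Information (1/sigma^2) of each factor
w_prior, w_odom, w_gps = 1/0.5**2, 1/0.1**2, 1/0.3**2

# Normal equations H x = rhs for residuals: x0 - 0, (x1 - x0) - 1.0, x1 - 1.2
H = np.array([[w_prior + w_odom, -w_odom],
              [-w_odom,           w_odom + w_gps]])
rhs = np.array([-w_odom * 1.0,                  # from the odometry residual
                 w_odom * 1.0 + w_gps * 1.2])   # odometry + GPS targets on x1

x = np.linalg.solve(H, rhs)
print(x)                        # [0.1429 1.1486] -- matches GTSAM
print(np.linalg.inv(H)[1, 1])   # 0.0669 -- marginal covariance of x1
```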
05 Nonlinear Least Squares
The objective function F(x) in Equation (4.1) is nonlinear because the measurement functions hi(x) are nonlinear (rotation matrices, trigonometric functions, etc.). We cannot solve for the minimum analytically. We iteratively linearize and solve linear systems — this is the foundation of all factor graph solvers.
Residuals and the cost function
For each factor i, define the residual:

rᵢ(x) = zᵢ − hᵢ(x)

The total cost is:

F(x) = ½ ∑ᵢ rᵢ(x)ᵀ Ωᵢ rᵢ(x)    (5.1)
For a Gaussian measurement factor, rᵢ = 0 when the measurement is perfectly predicted. For a GNSS factor, r_GNSS = p_measured − R_body→world · p_state measures how much the current trajectory state disagrees with the GNSS fix.
Jacobians and linearization
To linearize, expand rᵢ around the current estimate x₀ using a first-order Taylor expansion:

rᵢ(x₀ ⊕ δ) ≈ rᵢ(x₀) + Jᵢ δ,   Jᵢ = ∂rᵢ/∂δ evaluated at x₀

Substituting into (5.1) and minimizing over δ:

F(x₀ ⊕ δ) ≈ F(x₀) + bᵀ δ + ½ δᵀ H δ

where:

H = ∑ᵢ Jᵢᵀ Ωᵢ Jᵢ (Gauss-Newton approximation of the Hessian)
b = ∑ᵢ Jᵢᵀ Ωᵢ rᵢ (gradient of cost at x₀)
The Gauss-Newton method
Setting the gradient of the quadratic approximation to zero gives the normal equations:

H δ* = −b,   x ← x₀ ⊕ δ*
The "⊕" notation is intentional: for variables living on manifolds (like rotations), the update is not simple addition but a retraction (Chapter 7). For Euclidean variables, ⊕ is just addition.
The full Gauss-Newton iteration:
```python
import numpy as np

def gauss_newton(factors, x0, max_iter=50, tol=1e-8):
    """Pseudocode sketch. retract() is the manifold update from Chapter 7
    (plain addition for Euclidean states)."""
    x = x0.copy()
    n = x0.size  # total state dimension (Euclidean case)
    for iteration in range(max_iter):
        # Step 1: Evaluate all residuals and Jacobians at current x
        H = np.zeros((n, n))  # Hessian accumulator
        b = np.zeros(n)       # Gradient accumulator
        cost = 0.0
        for factor in factors:
            r, J = factor.evaluate(x)  # residual and Jacobian
            Omega = factor.information_matrix()
            H += J.T @ Omega @ J       # accumulate Hessian
            b += J.T @ Omega @ r       # accumulate gradient
            cost += 0.5 * r @ Omega @ r

        # Step 2: Solve normal equations H·δ = -b
        delta = np.linalg.solve(H, -b)

        # Step 3: Update state (retract on manifold)
        x = retract(x, delta)

        # Step 4: Check convergence
        if np.linalg.norm(delta) < tol:
            break
    return x, cost
```
Gauss-Newton is a local method. It finds a local minimum near the starting point. For sensor fusion, a poor initialization (e.g., starting at zero velocity when the vehicle is actually moving at 20 m/s) will produce a wrong local minimum or divergence. Always initialize from a reasonable first guess: dead reckoning with IMU integration, or the previous optimizer solution. The IMU preintegration makes this natural — the propagated state is an excellent initial guess.
Levenberg-Marquardt (LM)
Gauss-Newton can diverge when the linear approximation is poor — this happens far from the minimum (early iterations) or in regions of high curvature. The Levenberg-Marquardt algorithm adds a damping term that interpolates between Gauss-Newton and gradient descent:

(H + λ I) δ = −b
The damping parameter λ is adapted automatically:
- ↑ If the step increased cost (step was bad): increase λ → move toward gradient descent (smaller, safer step).
- ↓ If the step decreased cost (step was good): decrease λ → move toward Gauss-Newton (larger, faster step).
LM has a beautiful trust-region interpretation: λ implicitly controls the radius of a ball around the current point within which the quadratic model is trusted. When things go wrong, shrink the ball; when they go right, expand it. This makes LM robust to poor initialization and widely used in robotics (Ceres Solver, GTSAM both use LM variants).
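A minimal sketch of the LM damping loop, reusing the H, b accumulation idea from the Gauss-Newton pseudocode above; the `linearize`, `total_cost`, and `retract` helpers are hypothetical stand-ins, not a real library API:

```python
import numpy as np

def levenberg_marquardt(linearize, total_cost, retract, x0, n,
                        max_iter=50, lam=1e-4):
    """Sketch: linearize(x) -> (H, b); total_cost(x) -> scalar;
    retract(x, delta) applies the update. All assumed helpers."""
    x = x0
    cost = total_cost(x)
    for _ in range(max_iter):
        H, b = linearize(x)
        # Damped normal equations: (H + lambda*I) delta = -b
        delta = np.linalg.solve(H + lam * np.eye(n), -b)
        x_new = retract(x, delta)
        new_cost = total_cost(x_new)
        if new_cost < cost:           # good step: accept, trust GN more
            x, cost = x_new, new_cost
            lam = max(lam / 10, 1e-12)
        else:                         # bad step: reject, damp harder
            lam = min(lam * 10, 1e12)
    return x, cost
```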
Convergence criteria
Three standard checks determine convergence. All three should pass before declaring success:

- ‖δ‖ < ε₁ — the update step is negligibly small
- ‖b‖ = ‖JᵀΩr‖ < ε₂ — the gradient is negligibly small (at a minimum)
- |ΔF / F| < ε₃ — the relative cost change is negligible
Typical values: ε₁ = 10⁻⁸, ε₂ = 10⁻⁵, ε₃ = 10⁻⁶. For sensor fusion with well-conditioned problems, convergence in 5–15 Gauss-Newton iterations is typical. The cost of one iteration is dominated by solving the linear system, which we address next.
Analytic vs automatic differentiation
Computing Jacobians Jᵢ = ∂rᵢ/∂x has three practical approaches:
| Method | How | Speed | Accuracy | Effort |
|---|---|---|---|---|
| Analytic | Derive ∂r/∂x by hand or with a CAS, implement in code | Fastest | Exact (up to floating point) | High — error-prone derivation |
| Automatic diff | Dual numbers or adjoint (Ceres, GTSAM with Expression) | Fast (1–3× slower) | Machine precision | Low — write residual once |
| Numerical | Finite differences: (r(x+ε) − r(x)) / ε | Slow (~2n evaluations) | Limited by step size ε | Zero — use for testing only |
For performance-critical sensor fusion (real-time SLAM), analytic Jacobians are standard. For prototyping and research, GTSAM's Expression and Ceres's automatic differentiation are excellent. Numerical Jacobians are invaluable for checking your analytic implementation: if the analytic and numerical Jacobians disagree by more than ~10−5, your analytic derivation has a bug.
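The numerical check described above is easy to implement. Here is a small sketch; the range-measurement residual and its analytic Jacobian are illustrative placeholders:

```python
import numpy as np

def numerical_jacobian(residual, x, eps=1e-6):
    """Central finite differences: column j is d(residual)/d(x_j)."""
    r0 = residual(x)
    J = np.zeros((r0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x); dx[j] = eps
        J[:, j] = (residual(x + dx) - residual(x - dx)) / (2 * eps)
    return J

# Example: range measurement to a beacon at the origin (illustrative)
def residual(p):
    return np.array([5.0 - np.linalg.norm(p)])   # z - h(p), with z = 5.0

def analytic_jacobian(p):
    return (-p / np.linalg.norm(p)).reshape(1, -1)

p = np.array([3.0, 4.0])
err = np.abs(analytic_jacobian(p) - numerical_jacobian(residual, p)).max()
print(f"max Jacobian error: {err:.2e}")   # should be well below 1e-5
```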
06 Solving the Linear System
Each Gauss-Newton iteration reduces to solving the linear system H·δ = −b, where H is a large sparse symmetric positive definite (SPD) matrix. The efficiency of this solve — exploiting the sparsity from the factor graph structure — is what makes large-scale sensor fusion tractable.
The sparsity of H
H = ∑ᵢ Jᵢᵀ Ωᵢ Jᵢ inherits the sparsity of the Jacobians. Block Hⱼₖ is nonzero only when variables xⱼ and xₖ appear together in at least one factor. For a sequential GNSS/IMU trajectory with no loop closures, H is block-tridiagonal (each state only directly coupled to its neighbors). This sparsity is the key property that separates factor graph backends from brute-force matrix inversion.
Cholesky factorization
For an SPD matrix H, the standard solver is Cholesky factorization: H = L Lᵀ where L is lower triangular. Then:

L Lᵀ δ = −b
L y = −b (forward substitution, O(nnz))
Lᵀ δ = y (backward substitution, O(nnz))

where nnz is the number of nonzero entries in L. For a tridiagonal H of size n×n, nnz = O(n) and the entire solve is O(n). Compare this to O(n³) for dense Cholesky — the difference for a 10,000-state trajectory is roughly 10¹² vs 10⁴ operations.
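In practice one calls a library routine rather than hand-coding the substitutions. A scipy sketch on a small dense SPD system (illustrative values; sparse problems would use a sparse Cholesky such as CHOLMOD):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Small SPD system H·delta = -b (illustrative tridiagonal H)
H = np.array([[ 4., -1.,  0.],
              [-1.,  4., -1.],
              [ 0., -1.,  4.]])
b = np.array([1., 2., 3.])

c, low = cho_factor(H)            # Cholesky factorization of H
delta = cho_solve((c, low), -b)   # forward + backward substitution

print(np.allclose(H @ delta, -b)) # True
```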
Variable ordering and fill-in
When Cholesky-factoring a sparse H, the triangular factor L can be denser than H — this is called fill-in. The amount of fill-in depends critically on the variable ordering (the order in which variables appear in the vector x, which determines the row/column order in H).
Consider a factor graph with a loop closure: state x0 and state x100 are connected by a factor. This creates a nonzero block H₀,₁₀₀, and under a poor ordering Cholesky propagates this coupling, producing fill-in across the blocks Lᵢ,ⱼ for 0 ≤ i, j ≤ 100. A good ordering (nested dissection, COLAMD, AMD) minimizes fill-in by clever reordering. In practice, GTSAM and g2o use COLAMD or AMD automatically — the programmer rarely needs to think about this.
There is a deep connection between variable elimination order and graph structure. Eliminating variables one by one from the factor graph produces a Bayes net (the directed graphical model equivalent) — this is the Cholesky factorization viewed as a graphical operation. The Cholesky factor L is the information matrix of this Bayes net. Good elimination orderings correspond to trees or near-tree structures in the graph, which minimize fill-in.
QR factorization and the square root information form
An alternative to Cholesky is to factor the whitened Jacobian matrix J̃ directly, before ever forming H = J̃ᵀJ̃. With the QR factorization J̃ = Q R (Q orthogonal, R upper triangular):

H = J̃ᵀ J̃ = Rᵀ Qᵀ Q R = Rᵀ R

The upper triangular R is the square root information matrix (also called the R-factor or ℛ). Solving the normal equations becomes Rᵀ R δ = −b, equivalent to Cholesky but numerically more stable (H's condition number is the square of J's condition number — QR avoids this squaring). GTSAM internally maintains R and updates it incrementally.
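A numpy check that the R-factor from QR matches the Cholesky factor of H on a random illustrative problem (rows of R may differ in sign, so compare magnitudes and RᵀR):

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.standard_normal((20, 5))   # whitened Jacobian: 20 residuals, 5 variables

# Route 1: form H explicitly and Cholesky-factor it
H = J.T @ J
L = np.linalg.cholesky(H)          # H = L L^T

# Route 2: QR on J directly -- never forms H (avoids squaring the conditioning)
R = np.linalg.qr(J, mode='r')      # J = Q R, R upper triangular

print(np.allclose(R.T @ R, H))              # True: R^T R = H
print(np.allclose(np.abs(R), np.abs(L.T)))  # True up to row signs
```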
iSAM: Incremental Smoothing and Mapping
In online sensor fusion, new states and factors are added continuously. Rerunning Cholesky from scratch every time is wasteful — most of the existing factorization is still valid. iSAM (Kaess et al., 2008) exploits this by maintaining the Cholesky factor and updating it incrementally using rank-one (or rank-k) updates when new factors arrive.
The cost of adding a new factor that involves m variables is O(m · nnz_affected) rather than O(n · nnz_total). For a chain-structured problem, this is O(1) amortized per new factor — the same cost as the EKF, but with globally consistent estimates.
iSAM2 and the Bayes tree
iSAM had a limitation: when a non-leaf variable is affected by a new measurement (e.g., an old pose receives a loop closure), the entire subtree below it in the elimination tree must be refactored. iSAM2 (Kaess et al., 2012) solves this with the Bayes tree — a data structure that organizes the factorized Bayes net into cliques, enabling truly local updates.
When a new measurement arrives, iSAM2:
1. Identifies the affected cliques — those in the subtree rooted at the clique containing the new variables.
2. Removes the affected cliques from the Bayes tree (converting them back to factors).
3. Adds the new factor and re-eliminates only the affected variables (a local Cholesky on a small subproblem).
4. Updates the Bayes tree with the new cliques and propagates back-substitution (fluid relinearization) only through affected variables.
For a chain graph (no loop closures), only one clique is ever affected per new measurement — O(1) work. For loop closures, multiple cliques may need re-elimination, but the tree structure keeps this local. In practice, iSAM2 achieves near-constant time per factor addition for most SLAM/fusion problems.
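In GTSAM this incremental workflow looks roughly like the following sketch, extending the 1D example from Chapter 4 (parameter choices are illustrative):

```python
import gtsam

isam = gtsam.ISAM2()

# --- t=0: prior on x0 arrives ---
graph = gtsam.NonlinearFactorGraph()
values = gtsam.Values()
X0 = gtsam.symbol('x', 0)
graph.add(gtsam.PriorFactorDouble(
    X0, 0.0, gtsam.noiseModel.Isotropic.Sigma(1, 0.5)))
values.insert(X0, 0.0)
isam.update(graph, values)    # incremental factorization update

# --- t=1: a new state and an odometry factor arrive ---
graph = gtsam.NonlinearFactorGraph()
values = gtsam.Values()
X1 = gtsam.symbol('x', 1)
graph.add(gtsam.BetweenFactorDouble(
    X0, X1, 1.0, gtsam.noiseModel.Isotropic.Sigma(1, 0.1)))
values.insert(X1, 1.0)        # initial guess for the new state only
isam.update(graph, values)    # only affected cliques are re-eliminated

print(isam.calculateEstimate().atDouble(X1))
```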
Library comparison
| Library | Backend | Incremental | Languages | Best for |
|---|---|---|---|---|
| GTSAM | iSAM2, Bayes tree, QR/Cholesky | Yes (iSAM2) | C++, Python | Sensor fusion, SLAM, smoothing |
| g2o | Sparse Cholesky (CHOLMOD, csparse) | No (batch) | C++ | Pose graph SLAM, fast batch |
| Ceres | Sparse/dense Schur, LM | No (batch) | C++ | General NLS, bundle adjustment |
| manif | Thin wrapper, user provides solver | No | C++, Python | Manifold operations only |
| SE-Sync | Riemannian optimization | No | MATLAB, C++ | Rotation averaging, global minimum |
For GNSS/IMU fusion, GTSAM is the standard choice: it has prebuilt IMU preintegration factors, a Python API, active development, and iSAM2 for real-time use. The examples in Part 2 use GTSAM.
07 Lie Groups & Manifold Optimization
Everything in Chapters 2–6 assumed variables live in Euclidean space ℝn, where adding a correction δ to a state x is well-defined. Rotations break this assumption. This chapter explains why, and provides the mathematical tools to fix it.
The problem with representing rotations
A 3D rotation has exactly 3 degrees of freedom (pitch, roll, yaw). Yet every common representation has problems:
| Representation | DoF | Parameters | Problem |
|---|---|---|---|
| Euler angles (ZYX) | 3 | 3 | Gimbal lock: singularity at pitch = ±90° |
| Rotation matrix R ∈ SO(3) | 3 | 9 | 6 nonlinear constraints (RᵀR = I, det R = 1) — over-parameterized |
| Quaternion q ∈ ℝ⁴ | 3 | 4 | Unit norm constraint ‖q‖=1, double cover (q and −q represent the same rotation) |
| Rotation vector φ ∈ ℝ³ | 3 | 3 | Singular at ‖φ‖ = 2π, wrapping ambiguity |
For optimization, the rotation matrix is actually the best internal representation — it is closed under composition (R1R2 is always a rotation matrix) and easy to differentiate. The challenge is that we cannot perform unconstrained optimization over the 9 entries of R — we must ensure the result stays in SO(3). The solution is optimization on manifolds.
Groups and Lie groups
A group (G, ∘) is a set G with a binary operation ∘ satisfying: (1) closure (a∘b ∈ G), (2) associativity, (3) identity element e, (4) inverse for every element. The set of 3D rotations with composition forms the group SO(3).
A Lie group is a group that is simultaneously a smooth manifold — you can do calculus on it. The group operation and inverse are smooth (differentiable) maps. This is what enables the connection to optimization: we can define derivatives, tangent spaces, and exponential maps.
The special orthogonal group SO(3)
SO(3) is a 3-dimensional manifold embedded in ℝ⁹. It looks locally like ℝ³ near any point, but globally it is curved — think of the surface of a sphere, which locally looks flat but globally is not.
The Lie algebra so(3)
The Lie algebra g of a Lie group G is the tangent space at the identity element. For SO(3), the identity is I (the 3×3 identity matrix), and the tangent space is the set of skew-symmetric matrices:
Every element of so(3) has the form:

[φ]× = ⎡   0   −φ₃    φ₂ ⎤
       ⎢  φ₃     0   −φ₁ ⎥
       ⎣ −φ₂    φ₁     0 ⎦

where φ = (φ₁, φ₂, φ₃)ᵀ ∈ ℝ³ is the rotation vector (axis × angle).
The hat operator (∧): φ ∈ ℝ³ → [φ]× ∈ so(3) converts a 3D vector to a skew-symmetric matrix. The vee operator (∨) is its inverse: [φ]×∨ = φ.
The exponential map: Rodrigues' rotation formula
The exponential map takes a Lie algebra element (tangent vector at identity) to a Lie group element (a rotation):

Exp(φ) = I + (sin θ / θ) [φ]× + ((1 − cos θ) / θ²) [φ]×²,   θ = ‖φ‖

The inverse, the logarithmic map, maps a rotation back to its rotation vector:

θ = arccos( (tr(R) − 1) / 2 ),   Log(R) = (θ / (2 sin θ)) (R − Rᵀ)
These formulas are used constantly in factor graph optimization: the IMU preintegration integrates angular velocity by repeatedly applying Exp, and the optimizer computes linearization Jacobians using derivatives of Exp and Log.
```python
import numpy as np

def hat(phi):
    """R³ → so(3): skew-symmetric matrix from rotation vector."""
    return np.array([[      0, -phi[2],  phi[1]],
                     [ phi[2],       0, -phi[0]],
                     [-phi[1],  phi[0],       0]])

def Exp(phi):
    """so(3) → SO(3): Rodrigues' rotation formula."""
    theta = np.linalg.norm(phi)
    if theta < 1e-8:
        return np.eye(3) + hat(phi)  # first-order approx for small angles
    K = hat(phi / theta)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K

def Log(R):
    """SO(3) → so(3): matrix logarithm."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1, 1))
    if theta < 1e-8:
        return (R - R.T) / 2  # first-order approx
    return (theta / (2 * np.sin(theta))) * (R - R.T)

def vee(Phi):
    """so(3) → R³: extract rotation vector from skew-symmetric matrix."""
    return np.array([Phi[2,1], Phi[0,2], Phi[1,0]])

# Example: 90-degree rotation about z-axis
phi = np.array([0, 0, np.pi/2])
R = Exp(phi)
phi_recovered = vee(Log(R))
print(phi_recovered)  # [0, 0, 1.5708] ✓
```
The special Euclidean group SE(3)
A full rigid body pose (position + orientation) is an element of SE(3):

T = ⎡ R  t ⎤ ,   R ∈ SO(3), t ∈ ℝ³
    ⎣ 0  1 ⎦
The 4×4 matrix T is the homogeneous transformation matrix. It acts on homogeneous points p = [x;y;z;1] as T·p = [Rp+t; 1]. The Lie algebra is se(3) — 4×4 matrices of the form [[φ]×, ρ; 0, 0] where ρ ∈ ℝ³. SE(3) has 6 degrees of freedom (3 rotation + 3 translation), corresponding to the 6 DoF state of a rigid body.
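A short sketch building T and applying it to a point, reusing Exp() from the SO(3) code block above (values are illustrative):

```python
import numpy as np
# assumes Exp() from the SO(3) code block above is in scope

def make_T(phi, t):
    """Assemble a 4x4 homogeneous transform from rotation vector and translation."""
    T = np.eye(4)
    T[:3, :3] = Exp(phi)
    T[:3, 3] = t
    return T

T = make_T(np.array([0, 0, np.pi/2]), np.array([1.0, 2.0, 0.0]))
p = np.array([1.0, 0.0, 0.0, 1.0])   # homogeneous point
print(T @ p)                          # [1. 3. 0. 1.]: rotated 90° about z, then translated
```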
Perturbation model: how to update rotations in optimization
In Gauss-Newton, we compute a correction δ ∈ ℝ⁶ (for SE(3)) and need to apply it to the current estimate T. The right perturbation model is:

T ← T · Exp(δ)
This is the retraction — the operation that takes a current point on the manifold and a tangent vector, and produces a new point on the manifold. It is what the optimizer uses instead of naive addition. The choice of left vs right perturbation affects the form of the Jacobians but not the final result; GTSAM uses right perturbation for its Pose3 type.
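For SO(3) alone the retraction is a single line. This sketch applies a small correction and verifies the result is still exactly a rotation (again reusing Exp() from the block above):

```python
import numpy as np
# assumes Exp() from the SO(3) code block above is in scope

R = Exp(np.array([0.3, -0.2, 0.5]))    # current orientation estimate
dphi = np.array([0.01, -0.02, 0.005])  # Gauss-Newton correction (tangent space)

R_new = R @ Exp(dphi)                  # right perturbation: retraction on SO(3)

# Still exactly on the manifold -- no re-orthogonalization needed
print(np.allclose(R_new.T @ R_new, np.eye(3)))  # True
print(np.isclose(np.linalg.det(R_new), 1.0))    # True
```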
Jacobians on manifolds
When a factor residual rᵢ depends on a rotation R, the Jacobian ∂rᵢ/∂R is not well-defined in the usual sense (R is constrained). Instead, we compute the perturbation Jacobian:

Jᵢ,R = ∂ rᵢ( R · Exp(δφ) ) / ∂δφ   evaluated at δφ = 0
The 3-column Jacobian Ji,R is what enters the normal equations. It accounts for the curved geometry of SO(3) and produces numerically stable updates that always stay on the rotation manifold. GTSAM computes these Jacobians automatically for its built-in types (Rot3, Pose3) and provides a framework for user-defined types.
The IMU state includes orientation Rk ∈ SO(3) at every keyframe. When the optimizer computes a Gauss-Newton step, it produces a 3D correction δφk ∈ ℝ³ for each orientation. Applying Rk,new = Rk,old · Exp(δφk) guarantees Rk,new is exactly a rotation matrix — there is no need for normalization or Gram-Schmidt re-orthogonalization. The manifold structure is maintained automatically by the optimizer.
The full state manifold for GNSS/IMU
The navigation state at keyframe k is:

xₖ = ( Rₖ, pₖ, vₖ, b_a,k, b_g,k ) ∈ SO(3) × ℝ³ × ℝ³ × ℝ³ × ℝ³

where Rₖ is orientation, pₖ is position, vₖ is velocity, b_a,k is accelerometer bias, and b_g,k is gyroscope bias. The state has 3+3+3+3+3 = 15 degrees of freedom per keyframe (a rotation matrix stores 9 parameters, but SO(3) has only 3 DoF). The full trajectory over K keyframes has 15K variables. A typical 1-minute drive at 10 Hz keyframes has 600×15 = 9,000 optimization variables — well within the reach of modern sparse solvers.
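In GTSAM this state is typically split across a pose/velocity NavState and a separate bias object. A minimal sketch (all numeric values illustrative):

```python
import numpy as np
import gtsam

# Navigation state at one keyframe: orientation + position + velocity
nav = gtsam.NavState(gtsam.Rot3(),                 # identity orientation
                     gtsam.Point3(0.0, 0.0, 0.0),  # position (m)
                     np.array([1.0, 0.0, 0.0]))    # velocity (m/s)

# IMU biases: accelerometer (m/s^2) and gyroscope (rad/s)
bias = gtsam.imuBias.ConstantBias(np.array([0.01, 0.0, -0.02]),
                                  np.array([0.001, 0.0, 0.0]))

print(nav.pose())      # Pose3: 6 DoF
print(nav.velocity())  # 3 DoF
print(bias)            # 6 DoF -> 15 DoF total per keyframe
```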