Photogrammetry for the Categorically Minded

A visual primer — no graphics background required

The Premise

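You have played video games. You know that a 3D engine takes geometry and renders a flat image from a virtual camera. This is a projection \(\pi:\mathbb{R}^3\dashrightarrow\mathbb{R}^2\); the dashed arrow marks a partial map, undefined wherever the depth is zero and in particular at the camera center. It is surjective but not injective: every pixel is the image of an entire ray through the camera center, and the depth along that ray is destroyed.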
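
The loss of depth is easy to see numerically. The sketch below assumes the simplest pinhole model (camera at the origin, looking down the +z axis, unit focal length); the function name `project` and the sample points are illustrative, not taken from any particular engine.

```python
import numpy as np

def project(point, focal_length=1.0):
    """Pinhole projection of a 3D point; camera at the origin, looking down +z."""
    x, y, z = point
    if z <= 0:
        # The map is only partial: points at or behind the camera have no image.
        raise ValueError("point is not in front of the camera")
    return np.array([focal_length * x / z, focal_length * y / z])

p = np.array([1.0, 2.0, 4.0])
q = 3.0 * p  # same ray through the camera center, three times farther away

# Both points land on the same image coordinates: pi is not injective.
print(project(p), project(q))
```

Any positive multiple of `p` gives the same result, which is exactly the fiber that a single photograph cannot distinguish.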

Photogrammetry is the inverse problem: given a collection of flat images taken from different viewpoints, recover the 3D geometry. If rendering is the pullback \(p^*\), photogrammetry is descent.

One photograph is not enough — the depth fiber is killed. But multiple photographs from different positions each kill a different fiber direction. The collection of cameras forms a cover, and the problem becomes: given coherent local data on the cover, reconstruct the global object on the base.

The Capture Rig

Professional studios use static multi-camera rigs — 30 to 200+ cameras arranged in rings and domes around a central volume. All cameras fire simultaneously, freezing the subject. The rig geometry is fixed and pre-calibrated; only the subject changes between captures.

Below: cameras in three rings (low, mid, high) surround a bust. Drag to orbit. Click any camera to see its 2D projection — the pullback \(p^*E\).

Interactive · Static Capture Rig (legend: low ring, mid ring, high ring; inset: selected view)
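
A three-ring rig like the one described above can be sketched as a list of camera centers plus a rotation that aims each camera at the subject. The radius, heights, camera counts, and the helper names `ring_cameras` and `look_at` are all assumptions chosen for illustration; a real rig comes with its own calibration.

```python
import numpy as np

def ring_cameras(radius, height, count):
    """Camera centers evenly spaced on a horizontal ring at the given height."""
    angles = np.linspace(0.0, 2.0 * np.pi, count, endpoint=False)
    return np.stack([radius * np.cos(angles),
                     radius * np.sin(angles),
                     np.full(count, height)], axis=1)

def look_at(center, target, up=np.array([0.0, 0.0, 1.0])):
    """World-to-camera rotation with rows (right, down, forward).

    Assumes the viewing direction is not parallel to `up`.
    """
    forward = target - center
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    down = np.cross(forward, right)
    return np.stack([right, down, forward])

# Hypothetical rig: low, mid, and high rings of 12 cameras each, 2 m from the axis.
target = np.array([0.0, 0.0, 1.2])
rig = np.concatenate([ring_cameras(2.0, h, 12) for h in (0.5, 1.2, 1.9)])
rotations = [look_at(c, target) for c in rig]
print(rig.shape, len(rotations))
```

Because the rig is fixed, these poses are computed (or calibrated) once and reused for every capture.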

The Reconstruction Pipeline

Once the photographs are captured, reconstruction proceeds through stages — each with a categorical analog. Step through them below.

Interactive · Reconstruction Pipeline

Triangulation: Recovering the Fiber

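Two cameras, two rays, one intersection. Each matched feature determines a ray from each camera center through the matched pixel on that camera's image plane. With exact correspondences the two rays meet in a unique 3D point: effective descent. With noisy correspondences the rays are generically skew, and the point is estimated as the one closest to both.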

Drag the scene point. Watch the rays converge.

Interactive · Triangulation (Effective Descent) (legend: cameras, scene point P, reconstructed P̂, rays; drag the amber dot to move the scene point)
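
One simple way to compute the reconstructed point from two rays is the midpoint method: find the closest points on the two rays and average them. This is a minimal sketch of that method, not necessarily what the interactive uses; the camera centers and scene point are made-up values.

```python
import numpy as np

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2.

    Returns the estimated point and the gap between the two rays.
    Assumes the rays are not parallel.
    """
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    # Least-squares solution of s*d1 - t*d2 = c2 - c1.
    A = np.stack([d1, -d2], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)
    p1 = c1 + s * d1
    p2 = c2 + t * d2
    return 0.5 * (p1 + p2), np.linalg.norm(p1 - p2)

c1 = np.array([-1.0, 0.0, 0.0])
c2 = np.array([1.0, 0.0, 0.0])
P = np.array([0.3, 0.2, 5.0])

# Exact rays meet at P, so the gap is (numerically) zero.
P_hat, gap = triangulate_midpoint(c1, P - c1, c2, P - c2)

# A slightly perturbed second ray no longer meets the first: the rays are skew.
noisy_dir = (P - c2) + np.array([0.0, 0.01, 0.0])
P_noisy, gap_noisy = triangulate_midpoint(c1, P - c1, c2, noisy_dir)
print(P_hat, gap, P_noisy, gap_noisy)
```

The nonzero gap in the perturbed case is a crude, per-point measure of how badly the two views disagree.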

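When the cameras are too close together, the rays become nearly parallel and the intersection becomes ill-conditioned: a small error in either ray moves the reconstructed point a long way along the depth direction. This is the analog of a cover that is faithful but not "flat enough."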
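
The sensitivity to baseline can be made quantitative in the simplest case. As an illustration only, assume a rectified stereo pair (a special case of the rig, not the general setup) with baseline \(b\), focal length \(f\) in pixels, disparity \(d\) in pixels, and depth \(Z\):

```latex
\[
Z = \frac{f\,b}{d}
\quad\Longrightarrow\quad
\lvert\delta Z\rvert \approx \left|\frac{\partial Z}{\partial d}\right|\delta d
= \frac{f\,b}{d^{2}}\,\delta d
= \frac{Z^{2}}{f\,b}\,\delta d .
\]
```

With illustrative numbers \(f = 1000\) px, \(Z = 2\) m, and a matching error of \(\delta d = 0.5\) px, a baseline of \(b = 0.5\) m gives a depth error of about 4 mm, while \(b = 0.05\) m gives about 40 mm. Depth error grows like \(1/b\) as the cameras approach each other.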

The Cocycle Condition

With three or more views, correspondences must be transitive:

\(\varphi_{ik}=\varphi_{jk}\circ\varphi_{ij}\)
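
At the level of raw feature matches, this condition can be checked directly. The sketch below represents each \(\varphi_{ij}\) as a dictionary from keypoint indices in view \(i\) to keypoint indices in view \(j\); the representation and the function names are assumptions for illustration, and the match tables are invented.

```python
def compose(phi_ij, phi_jk):
    """Compose match tables: keypoint in view i -> view j -> view k, where defined."""
    return {a: phi_jk[b] for a, b in phi_ij.items() if b in phi_jk}

def cocycle_violations(phi_ij, phi_jk, phi_ik):
    """Keypoints of view i where the direct match i->k disagrees with i->j->k."""
    via_j = compose(phi_ij, phi_jk)
    return sorted(a for a in via_j if a in phi_ik and phi_ik[a] != via_j[a])

phi_12 = {0: 5, 1: 7, 2: 9}
phi_23 = {5: 2, 7: 4, 9: 6}
phi_13 = {0: 2, 1: 4, 2: 8}  # keypoint 2 goes to 8 directly but to 6 via view 2

print(cocycle_violations(phi_12, phi_23, phi_13))  # [2]
```

Matches that fail this check are candidates for rejection as outliers before any geometry is estimated.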

Failure is measured by reprojection error: project each 3D point back into every camera and measure the residual. Bundle adjustment minimizes this globally — enforcing the cocycle condition in a least-squares sense. Crank the noise slider:

Interactive · Cocycle Condition & Bundle Adjustment (legend: cameras, true points, reconstructed, error)
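
Reprojection error and its minimization can be sketched in a few lines. To stay short, this example makes two simplifying assumptions: camera poses are held fixed (as in a pre-calibrated rig), and only a single 3D point is refined, using Gauss-Newton with a finite-difference Jacobian. Full bundle adjustment optimizes poses and all points jointly; the rig, focal length, and noise level here are invented.

```python
import numpy as np

def project(P, R, t, f=1000.0):
    """Pinhole projection of world point P into a camera with pose (R, t); f in pixels."""
    Xc = R @ P + t
    return f * Xc[:2] / Xc[2]

def residuals(P, cameras, observations):
    """Stacked reprojection residuals: predicted minus observed pixel, per camera."""
    return np.concatenate([project(P, R, t) - uv
                           for (R, t), uv in zip(cameras, observations)])

def rms(P, cameras, observations):
    """Root-mean-square reprojection error in pixels."""
    return np.sqrt(np.mean(residuals(P, cameras, observations) ** 2))

def refine_point(P0, cameras, observations, iterations=10, eps=1e-6):
    """Gauss-Newton on the reprojection error with a finite-difference Jacobian."""
    P = np.asarray(P0, dtype=float).copy()
    for _ in range(iterations):
        r = residuals(P, cameras, observations)
        J = np.stack([(residuals(P + eps * e, cameras, observations) - r) / eps
                      for e in np.eye(3)], axis=1)
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)
        P = P + step
    return P

# Hypothetical rig: three cameras one metre apart on a line, all facing +z.
centers = [np.array([x, 0.0, 0.0]) for x in (-1.0, 0.0, 1.0)]
cameras = [(np.eye(3), -c) for c in centers]

P_true = np.array([0.2, -0.1, 5.0])
rng = np.random.default_rng(0)
observations = [project(P_true, R, t) + rng.normal(scale=1.0, size=2)
                for R, t in cameras]

P_start = P_true + np.array([0.3, -0.2, 0.5])
P_hat = refine_point(P_start, cameras, observations)
print("RMS before:", rms(P_start, cameras, observations))
print("RMS after: ", rms(P_hat, cameras, observations))
```

The residual that remains after refinement is the part of the pixel noise that no single 3D point can explain.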

The Dictionary

The correspondence is not an analogy — it is a functor.

Descent Theory | Photogrammetry | Gloss
Base \(X\) | 3D scene | The unknown geometry
Cover \(Y\to X\) | Camera rig | Viewpoints whose union covers the scene
Faithful flatness | ≥ 60% overlap | Every point visible to ≥ 2 cameras
Pullback \(p^*E\) | Photograph | Depth fiber killed
\(Y\!\times_X\! Y\) | Overlap region | Points visible to cameras \(i\) and \(j\)
Descent datum \(\varphi_{ij}\) | Feature correspondence | Matched keypoints across views
Cocycle on \(Y^3_X\) | Trifocal tensor | Three-view consistency
Cocycle condition | Bundle adjustment | Transitivity of correspondences
Effective descent | 3D reconstruction | Point cloud recovered
\(H^1\) obstruction | Reprojection error | The red residuals
\(\mathrm{SE}(3)\) | Camera pose group | Rigid motions
Gauge ambiguity | Scale ambiguity | Up to global similarity
Section | Ground control point | Fixes the gauge
Sheaf, no descent | NeRF | \(T\mapsto\mathrm{view}(T)\) directly

The last row is the punchline. A NeRF learns \((\mathrm{pos},\mathrm{dir})\mapsto(\mathrm{color},\mathrm{density})\) — the functor of views — without ever constructing a mesh. It is a sheaf that refuses to descend.
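
To see how a view comes out of such a function without any mesh, here is a minimal sketch of the volume-rendering quadrature used with radiance fields. The field `toy_field` is a hand-written stand-in for a trained network (a uniformly dense unit ball with view-dependent color), and the sampling range and sample count are arbitrary choices.

```python
import numpy as np

def toy_field(pos, direction):
    """Stand-in for a trained network: (pos, dir) -> (rgb, density)."""
    inside = np.linalg.norm(pos, axis=-1) < 1.0
    density = np.where(inside, 10.0, 0.0)
    # Color depends only on the viewing direction, mapped into [0, 1].
    rgb = np.broadcast_to(0.5 + 0.5 * direction, pos.shape)
    return rgb, density

def render_ray(origin, direction, field=toy_field, near=0.0, far=6.0, samples=128):
    """Composite color along one ray: sum of T_i * (1 - exp(-sigma_i * delta_i)) * c_i."""
    direction = direction / np.linalg.norm(direction)
    edges = np.linspace(near, far, samples + 1)
    t = 0.5 * (edges[:-1] + edges[1:])   # sample at the middle of each interval
    delta = np.diff(edges)               # interval lengths
    pos = origin + t[:, None] * direction
    rgb, sigma = field(pos, direction)
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance: fraction of light surviving all earlier intervals.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0), weights.sum()

origin = np.array([0.0, 0.0, -3.0])
color_hit, opacity_hit = render_ray(origin, np.array([0.0, 0.0, 1.0]))    # through the ball
color_miss, opacity_miss = render_ray(origin, np.array([0.0, 1.0, 0.0]))  # misses it
print(color_hit, opacity_hit, opacity_miss)
```

Rendering a full image is just this function evaluated once per pixel ray; at no point is a surface extracted.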