Photogrammetry for the Categorically Minded
The Premise
You have played video games. You know that a 3D engine takes geometry and renders a flat image from a virtual camera. This is a projection \(\pi:\mathbb{R}^3\dashrightarrow\mathbb{R}^2\). It is surjective but not injective: every pixel corresponds to a ray, and the depth along that ray is destroyed.
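A minimal sketch of the lost depth, assuming an idealized pinhole camera at the origin looking down the +z axis with unit focal length. The function name `project` and these conventions are illustrative and not taken from any particular engine. Two points on the same ray land on the same pixel:

```python
def project(point, focal=1.0):
    """Pinhole projection of a camera-frame point (x, y, z) onto the plane z = focal."""
    x, y, z = point
    if z <= 0:
        # The map is only partially defined: nothing at or behind the camera center.
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal * x / z, focal * y / z)


near = (1.0, 2.0, 4.0)
far = tuple(2.5 * c for c in near)  # same ray through the camera center, 2.5x farther

print(project(near))  # (0.25, 0.5)
print(project(far))   # (0.25, 0.5)
```

Both calls return the same pixel, so the image alone cannot tell `near` from `far`.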
Photogrammetry is the inverse problem: given a collection of flat images taken from different viewpoints, recover the 3D geometry. If rendering is the pullback \(p^*\), photogrammetry is descent.
One photograph is not enough — the depth fiber is killed. But multiple photographs from different positions each kill a different fiber direction. The collection of cameras forms a cover, and the problem becomes: given coherent local data on the cover, reconstruct the global object on the base.
The Capture Rig
Professional studios use static multi-camera rigs — 30 to 200+ cameras arranged in rings and domes around a central volume. All cameras fire simultaneously, freezing the subject. The rig geometry is fixed and pre-calibrated; only the subject changes between captures.
Below: cameras in three rings (low, mid, high) surround a bust. Drag to orbit. Click any camera to see its 2D projection — the pullback \(p^*E\).
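For concreteness, a ring layout of this kind can be sketched as follows. The three rings match the description above, but the radius, ring heights, and 12 cameras per ring are made-up illustrative values, and the helper names are hypothetical. A real rig's geometry comes from calibration rather than from a formula.

```python
import math


def ring_positions(n_cameras, radius, height):
    """Evenly spaced camera centers on a horizontal circle (z is up)."""
    positions = []
    for k in range(n_cameras):
        angle = 2.0 * math.pi * k / n_cameras
        positions.append((radius * math.cos(angle), radius * math.sin(angle), height))
    return positions


def look_direction(position, target=(0.0, 0.0, 0.0)):
    """Unit vector pointing from a camera center toward the subject."""
    offset = tuple(t - p for p, t in zip(position, target))
    norm = math.sqrt(sum(c * c for c in offset))
    return tuple(c / norm for c in offset)


# Assumed values: low, mid, and high rings around a subject at the origin.
RING_HEIGHTS = (-0.5, 0.0, 0.5)
rig = [
    (pos, look_direction(pos))
    for height in RING_HEIGHTS
    for pos in ring_positions(12, 2.0, height)
]

print(len(rig))  # 36
```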
The Reconstruction Pipeline
Once the photographs are captured, reconstruction proceeds through stages — each with a categorical analog. Step through them below.
Triangulation: Recovering the Fiber
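Two cameras, two rays, one intersection. Each matched feature determines a ray from each camera center through the image plane. When the match is correct and the calibration exact, the two rays meet in a single 3D point: effective descent. With measurement noise the rays are generically skew, and triangulation takes the point closest to both.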
Drag the scene point. Watch the rays converge.
When cameras are too close, the rays become nearly parallel and the intersection is ill-conditioned: a small angular error in either ray moves the triangulated point a long way in depth. This is the analog of a cover that is faithful but not "flat enough."
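A sketch of two-ray triangulation using the midpoint method, which is one simple choice among several and not necessarily what any given pipeline uses. It assumes each ray is given as a camera center plus a direction vector. The function name, the camera positions, and the scene point are all illustrative. The returned `sin2` is the squared sine of the angle between the rays, which shrinks toward zero as the baseline narrows.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def along(origin, direction, s):
    return tuple(o + s * d for o, d in zip(origin, direction))


def triangulate_midpoint(c1, d1, c2, d2, eps=1e-12):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2."""
    w = tuple(a - b for a, b in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b  # equals |d1|^2 |d2|^2 sin^2(angle between rays)
    if denom <= eps * a * c:
        raise ValueError("rays are (nearly) parallel; depth is not determined")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    q1, q2 = along(c1, d1, s), along(c2, d2, t)
    midpoint = tuple((u + v) / 2.0 for u, v in zip(q1, q2))
    gap = math.dist(q1, q2)  # zero when the rays truly intersect
    return midpoint, gap, denom / (a * c)


def ray_to(center, point):
    return tuple(p - c for p, c in zip(point, center))


scene = (0.5, 0.25, 4.0)

wide = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]      # baseline 2
narrow = [(-0.01, 0.0, 0.0), (0.01, 0.0, 0.0)]  # baseline 0.02

point, gap, sin2_wide = triangulate_midpoint(
    wide[0], ray_to(wide[0], scene), wide[1], ray_to(wide[1], scene))
_, _, sin2_narrow = triangulate_midpoint(
    narrow[0], ray_to(narrow[0], scene), narrow[1], ray_to(narrow[1], scene))

print(point, gap)
print(sin2_wide, sin2_narrow)
```

With exact rays the midpoint recovers the scene point and the gap is zero up to rounding. The narrow-baseline pair reports a much smaller `sin2`, which is the quantity the solver divides by.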
The Cocycle Condition
With three or more views, correspondences must be transitive:
\(\varphi_{ik}=\varphi_{jk}\circ\varphi_{ij}\)
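A sketch of checking this condition on discrete matches, assuming each \(\varphi_{ij}\) is stored as a dictionary from keypoint indices in view \(i\) to keypoint indices in view \(j\). This representation, the function name, and the keypoint IDs are hypothetical. Only keypoints matched in all three pairs can be tested.

```python
def cocycle_violations(phi_ij, phi_jk, phi_ik):
    """Keypoints of view i where phi_ik disagrees with phi_jk composed with phi_ij."""
    violations = []
    for f_i, f_j in phi_ij.items():
        # Skip keypoints that are not tracked through all three views.
        if f_j in phi_jk and f_i in phi_ik:
            if phi_jk[f_j] != phi_ik[f_i]:
                violations.append(f_i)
    return violations


phi_ij = {0: 10, 1: 11, 2: 12}
phi_jk = {10: 20, 11: 21, 12: 22}
phi_ik = {0: 20, 1: 21, 2: 25}  # keypoint 2 is mismatched on the direct i -> k path

print(cocycle_violations(phi_ij, phi_jk, phi_ik))  # [2]
```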
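Failure is measured by reprojection error: project each 3D point back into every camera that observed it and measure the residual against the detected keypoint. Bundle adjustment minimizes the sum of squared residuals jointly over all camera poses and 3D points, enforcing the cocycle condition in a least-squares sense. Crank the noise slider to see the effect.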
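A sketch of the residual itself, under simplifying assumptions that are not from the original: cameras share one orientation (no rotation), have unit focal length, and differ only in their centers. The function names and numbers are illustrative. This only evaluates the error for a candidate 3D point. A bundle adjuster would additionally search over points and poses to reduce it.

```python
import math


def project_from(center, point, focal=1.0):
    """Pinhole projection of a world point into an axis-aligned camera at `center`."""
    x, y, z = (p - c for p, c in zip(point, center))
    return (focal * x / z, focal * y / z)


def rms_reprojection_error(centers, point, observations):
    """Root-mean-square pixel residual of `point` against one observation per camera."""
    total = 0.0
    for center, (u_obs, v_obs) in zip(centers, observations):
        u, v = project_from(center, point)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return math.sqrt(total / len(observations))


centers = [(-1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
true_point = (0.5, 0.25, 4.0)
observations = [project_from(c, true_point) for c in centers]  # noise-free keypoints

consistent = rms_reprojection_error(centers, true_point, observations)
wrong_depth = rms_reprojection_error(centers, (0.5, 0.25, 4.5), observations)

print(consistent)   # 0.0
print(wrong_depth)  # strictly positive: the three views no longer agree
```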
The Dictionary
The correspondence is not an analogy — it is a functor.
| Descent Theory | Photogrammetry | Gloss |
|---|---|---|
| Base \(X\) | 3D scene | The unknown geometry |
| Cover \(Y\to X\) | Camera rig | Viewpoints whose union covers the scene |
| Faithful flatness | ≥ 60% overlap | Every point visible to ≥ 2 cameras |
| Pullback \(p^*E\) | Photograph | Depth fiber killed |
| \(Y\!\times_X\! Y\) | Overlap region | Points visible to cameras \(i\) and \(j\) |
| Descent datum \(\varphi_{ij}\) | Feature correspondence | Matched keypoints across views |
| Cocycle on \(Y^3_X\) | Trifocal tensor | Three-view consistency |
| Cocycle condition | Bundle adjustment | Transitivity of correspondences |
| Effective descent | 3D reconstruction | Point cloud recovered |
| \(H^1\) obstruction | Reprojection error | The red residuals |
| \(\mathrm{SE}(3)\) | Camera pose group | Rigid motions |
| Gauge ambiguity | Scale ambiguity | Up to global similarity |
| Section | Ground control point | Fixes the gauge |
| Sheaf, no descent | NeRF | \(T\mapsto\mathrm{view}(T)\) directly |
The last row is the punchline. A NeRF learns \((\mathrm{pos},\mathrm{dir})\mapsto(\mathrm{color},\mathrm{density})\) — the functor of views — without ever constructing a mesh. It is a sheaf that refuses to descend.