Part I of II — Descent via Photogrammetry

The Cover

Projection, Fibers, and the Data Before Descent
Taking photographs is easy. Any tourist does it. But reconstruction — bringing the flat images back into three dimensions, descending from the many projections to the single source — that requires work, consistency, mathematics. The ascent is a shutter click. The descent is an optimization problem. — after Grothendieck

You already know how photogrammetry works. You have shot objects from a dozen angles, fed the images into Metashape or COLMAP, and watched a point cloud materialize from coherent matches. What you may not know is that the mathematical skeleton underneath this workflow — the reason it works at all — is one of the deepest structures in 20th-century mathematics: Grothendieck’s theory of descent.

This is not an analogy. It is not a metaphor. Photogrammetry is descent, performed on real geometry, using the same axioms that govern vector bundles and Galois extensions. This essay will make the dictionary explicit. By the end of Part I you will understand what a cover, a fiber, a fiber product, and a descent datum are — because you already use them every time you click a shutter.

Section 01

The Scene You Cannot Touch

In category theory, descent begins with a base object $X$. This is the thing you want to understand — a scheme, a space, a geometric structure. The crucial condition: you cannot access $X$ directly. You can only interrogate it through maps into it.

In photogrammetry, $X$ is the 3D scene. The geometry is out there — vertices, edges, surfaces, textures — but you cannot reach in and read off coordinates. There is no God-view. You are locked behind cameras.

The Base Object — $X$
You can rotate the scene, but you cannot measure it. No coordinates are revealed. This is $X$ — opaque.
Category Theory
The base $X$ lives in a category $\mathcal{C}$. We want to understand objects "over" $X$ — sheaves, bundles, modules — but we can only do so by pulling back along morphisms $Y \to X$. The scene is the object. The cameras are the morphisms.
Section 02

The Shutter Click: Pullback $p^*$

Each camera defines a map $\pi_i : X \dashrightarrow \mathbb{P}^2$ — a projection from three dimensions into two. A perspective projection is a pullback: it takes geometry living on the base $X$ and produces a flat representation on the image plane.

The key insight: projection destroys information. The depth axis — the direction from the camera to the scene — is collapsed. Every ray through the camera center maps to a single pixel. This is exactly the pullback functor $p^*$: it lifts structure from the base to the cover, but in doing so kills the fiber direction.
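The loss of depth can be checked numerically. The following is a minimal sketch, assuming an idealized pinhole camera with an identity pose and a hypothetical focal length `f`; the function name `project` is illustrative and does not come from Metashape or COLMAP.

```python
import numpy as np

def project(points, R, t, f=1.0):
    """Pinhole projection of Nx3 world points to Nx2 image coordinates.

    R (3x3) and t (3,) map world coordinates into the camera frame.
    f is a hypothetical focal length in arbitrary units.
    """
    cam = points @ R.T + t                 # world frame -> camera frame
    return f * cam[:, :2] / cam[:, 2:3]    # perspective divide discards depth

# Assumed pose: camera at the origin, looking down the +z axis.
R = np.eye(3)
t = np.zeros(3)

# Two distinct scene points on the same ray through the camera center.
Q_near = np.array([[0.5, -0.25, 2.0]])
Q_far = 3.0 * Q_near

pixel_near = project(Q_near, R, t)
pixel_far = project(Q_far, R, t)
print(pixel_near, pixel_far)   # both are (0.25, -0.125)
```

The two points differ only in depth, and the division by the third camera coordinate is the step where that difference disappears.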

Projection — $\pi_i : X \dashrightarrow \mathbb{P}^2$
Legend: scene vertices; camera and image plane; projection rays.
Slide to orbit the camera. Toggle rays to see the projection map $\pi_i$. Each ray collapses a line of 3D points to one 2D pixel.

Notice that each photograph gives you complete information within its field of view — colors, edges, texture — but it is irredeemably flat. The photograph is a local trivialization: on the patch of $X$ visible to camera $i$, everything is split and readable. But the splitting only works locally. Move to a different view and the depth relationships change.

Section 03

What the Lens Destroys: The Fiber

Pick a pixel in the image. Which 3D point produced it? You don't know. Any point along the ray from the camera through that pixel could have been responsible. This set of candidate 3D points — the entire ray — is the fiber over that pixel.

In the language of fibrations: over each point $p$ in the image, the map $\pi_i : X \dashrightarrow \mathbb{P}^2$ has the fiber $\pi_i^{-1}(p)$, the set of scene points lying on a single line in 3D space through the camera center. Loosely, the fiber plays the role of a kernel: it is exactly the information the projection forgets. It is what the lens destroys.

$$\text{fiber}(p) = \pi_i^{-1}(p) = \{Q \in X : \pi_i(Q) = p\} \cong \mathbb{A}^1$$
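The fiber can be parametrized by depth. The sketch below assumes the same idealized pinhole model (focal length `f`, pose `R`, `t`); the helper names `fiber_point`, `project`, and `rot_y` are hypothetical. It walks along the ray over one pixel and confirms that every point on it projects back to that pixel.

```python
import numpy as np

def rot_y(deg):
    """Rotation about the y axis by `deg` degrees."""
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(Q, R, t, f=1.0):
    """Project one world point Q (3,) to image coordinates (2,)."""
    cam = R @ Q + t
    return f * cam[:2] / cam[2]

def fiber_point(pixel, depth, R, t, f=1.0):
    """World point at the given camera-frame depth on the ray over `pixel`."""
    u, v = pixel
    cam = depth * np.array([u / f, v / f, 1.0])   # point on the ray, camera frame
    return R.T @ (cam - t)                        # camera frame -> world frame

# Assumed camera pose and probe pixel, chosen only for illustration.
R, t = rot_y(30.0), np.array([0.1, 0.0, 0.5])
pixel = np.array([0.2, -0.1])

# One parameter (depth) sweeps out the whole fiber.
fiber = [fiber_point(pixel, d, R, t) for d in (1.0, 2.0, 5.0)]
reprojected = [project(Q, R, t) for Q in fiber]
```

The single free parameter `depth` is the coordinate on the $\mathbb{A}^1$ in the formula above.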
Fibers — $\pi_i^{-1}(p)$
Hover over the image plane (right side) to see the ray of 3D points — the fiber — that maps to that pixel.
The Fiber in Descent
In descent theory, the cover $Y \to X$ is a morphism that is faithfully flat, that is, flat and surjective. Surjectivity (faithfulness) means each fiber $Y \times_X \{x\}$ over a point $x \in X$ is non-empty; flatness means the fibers vary in a controlled way as $x$ moves. In photogrammetry, each visible scene point maps to at least one pixel, and the fiber over that pixel contains the actual depth.

A single view cannot resolve the fiber. It sees the pixel but not the depth. To collapse the fiber — to determine which point on the ray is the real one — you need a second view. This is the first hint that reconstruction requires multiple cameras. One projection creates ambiguity. Two projections can begin to resolve it.
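How a second view collapses the fiber can be sketched with linear (DLT) triangulation. This is a minimal, noise-free illustration assuming calibrated cameras with identity intrinsics; the camera poses and the scene point are made up for the example.

```python
import numpy as np

def camera_matrix(R, t):
    """3x4 projection matrix [R | t] (identity intrinsics assumed)."""
    return np.hstack([R, t.reshape(3, 1)])

def project(P, Q):
    """Project world point Q (3,) with camera matrix P (3x4)."""
    x = P @ np.append(Q, 1.0)
    return x[:2] / x[2]

def triangulate(P1, P2, x1, x2):
    """Linear triangulation: intersect the two rays in a least-squares sense."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # null vector of A, in homogeneous coordinates
    return X[:3] / X[3]

# Two assumed cameras: one at the origin, one shifted 1 unit along +x.
P1 = camera_matrix(np.eye(3), np.zeros(3))
P2 = camera_matrix(np.eye(3), np.array([-1.0, 0.0, 0.0]))

Q_true = np.array([0.3, -0.2, 4.0])
Q_est = triangulate(P1, P2, project(P1, Q_true), project(P2, Q_true))
```

Each view contributes two linear equations; one view leaves a one-dimensional family of solutions (the fiber), and the second view cuts it down to a point. With real, noisy matches the rays no longer meet exactly and the result is only a least-squares estimate.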

Section 04

The Cover Must Be Faithful

Not just any collection of cameras will do. The cover $\{U_i \to X\}$ must satisfy a crucial property: faithful flatness. In photogrammetry terms, this means:

Surjectivity (faithfulness): every point of the scene must be visible to at least one camera. If a region is never photographed, it cannot be reconstructed. It falls outside the cover.

Overlap (flatness): for reconstruction to work, most points should be visible to multiple cameras. The overlapping regions are where feature matching happens — where descent data lives.

Coverage Map — Faithful Flatness
Legend: 0 views (uncovered); 1 view only; 2 views (overlap); 3+ views (rich overlap).
Toggle cameras to see coverage change. Red regions — visible to only one camera — have ambiguous depth. Dark regions are invisible: descent fails there.

The coverage map is a heatmap of the multiplicity of the cover. Points with zero coverage are outside the domain of reconstruction. Points with exactly one view retain depth ambiguity. Points with two or more views are where the fiber product lives — where we can begin to triangulate.

$$\text{coverage}(x) = |\{i : x \in U_i\}| \quad\geq 2 \;\text{ for effective descent}$$
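The coverage count can be sketched directly. The example assumes a deliberately crude visibility model: each camera sees everything inside a cone of half-angle `half_fov_deg` around its viewing direction, with occlusion ignored. The function names and the camera layout are hypothetical.

```python
import numpy as np

def visible(points, center, forward, half_fov_deg):
    """Mask of points inside a camera's viewing cone (occlusion ignored)."""
    d = points - center
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    cos_angle = d @ (forward / np.linalg.norm(forward))
    return cos_angle >= np.cos(np.radians(half_fov_deg))

def coverage(points, cameras, half_fov_deg=30.0):
    """Number of cameras that see each point: |{i : x in U_i}|."""
    masks = [visible(points, c, f, half_fov_deg) for c, f in cameras]
    return np.sum(masks, axis=0)

# Three assumed cameras on the x axis, all looking down +z.
look = np.array([0.0, 0.0, 1.0])
cameras = [
    (np.array([0.0, 0.0, 0.0]), look),
    (np.array([4.0, 0.0, 0.0]), look),
    (np.array([2.0, 0.0, 0.0]), look),
]

scene = np.array([
    [0.0, 0.0, 5.0],
    [2.0, 0.0, 5.0],
    [7.0, 0.0, 5.0],
    [6.5, 0.0, 5.0],
])

counts = coverage(scene, cameras)
print(counts)                    # [2 3 0 1]
can_triangulate = counts >= 2    # the condition for effective descent
```

Only the first two points satisfy the $\geq 2$ condition; the third lies outside the cover entirely, and the fourth is seen once and keeps its depth ambiguity.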
Section 05

Where Views Meet: The Fiber Product $Y \times_X Y$

Two cameras see overlapping regions of the scene. The set of points visible to both camera $i$ and camera $j$ is the fiber product $U_i \times_X U_j$. This is where the magic happens.

In the fiber product, every scene point has two representations: a pixel in image $i$ and a pixel in image $j$. These two pixels correspond to the same 3D point. The act of identifying them — "this pixel here is the same scene point as that pixel there" — is the descent datum. But we are getting ahead of ourselves. First, let's see the fiber product.

Fiber Product — $U_i \times_X U_j$
Legend: visible to camera $i$ only; visible to camera $j$ only; in the fiber product $U_i \times_X U_j$; not visible to either.
Adjust camera angles. Green points live in the fiber product — they have two projections, one in each view. This is where correspondences can be established.
Fiber Product
The fiber product $U_i \times_X U_j$ is the categorical pullback: $$U_i \times_X U_j = \{(p_i, p_j) : \exists\, Q \in X,\ \pi_i(Q) = p_i,\ \pi_j(Q) = p_j\}$$ It consists of all pairs of image points (one from each view) that correspond to the same scene point $Q$. The projections $\text{pr}_1, \text{pr}_2$ give you back the individual views.
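As a rough illustration, the fiber product can be enumerated for a synthetic scene: keep only the scene points that land inside both images and record the pair of pixels each one produces. The square image of half-width 1, the camera poses, and the function names are all assumptions made for this sketch.

```python
import numpy as np

def project(P, Q):
    """Project world point Q (3,) with camera matrix P (3x4)."""
    x = P @ np.append(Q, 1.0)
    return x[:2] / x[2]

def in_view(P, Q, half_width=1.0):
    """True if Q is in front of the camera and inside a square image."""
    x = P @ np.append(Q, 1.0)
    if x[2] <= 0:
        return False
    return bool(np.all(np.abs(x[:2] / x[2]) <= half_width))

def fiber_product(P_i, P_j, scene):
    """Pairs (p_i, p_j) of pixels that come from one common scene point."""
    return [
        (project(P_i, Q), project(P_j, Q))
        for Q in scene
        if in_view(P_i, Q) and in_view(P_j, Q)
    ]

# Assumed cameras: view i at the origin, view j shifted 3 units along +x.
P_i = np.hstack([np.eye(3), np.zeros((3, 1))])
P_j = np.hstack([np.eye(3), np.array([[-3.0], [0.0], [0.0]])])

scene = np.array([
    [0.0, 0.0, 2.0],    # seen by i only
    [1.5, 0.0, 2.0],    # seen by both
    [4.0, 0.0, 2.0],    # seen by j only
    [0.0, 0.0, -1.0],   # behind both cameras
])

pairs = fiber_product(P_i, P_j, scene)
pr_1 = [p for p, _ in pairs]   # back to view i
pr_2 = [q for _, q in pairs]   # back to view j
```

Here the scene points are known, so the pairing is free. In a real pipeline the scene is exactly what is unknown, which is why the pairs have to be found by matching.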
Section 06

The Descent Datum: Feature Matching as $\varphi_{ij}$

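Now we reach the heart. On the fiber product, the overlap between views, we establish correspondences. Point $A$ in image $i$ matches point $B$ in image $j$. This identification $\varphi_{ij}: \text{pr}_1^* E \xrightarrow{\sim} \text{pr}_2^* E$ is the descent datum, where $E$ is the object living on the cover (unrelated to the essential matrix $E_{ij}$ introduced below) and $\text{pr}_1, \text{pr}_2$ are the two projections out of the fiber product.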

In photogrammetry, this is feature matching — SIFT, ORB, SuperPoint, whatever your detector of choice. You find keypoints in each image, compute descriptors, and match them across views. Each verified match says: "these two pixels see the same piece of reality."
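The matching step can be sketched without any particular detector. The example below uses synthetic descriptors in place of SIFT, ORB, or SuperPoint output, and applies two common filters, a nearest-neighbor ratio test and a mutual-consistency check. The threshold `ratio=0.8` and the function name are assumptions for illustration, not the settings of any specific tool.

```python
import numpy as np

def match_descriptors(desc_i, desc_j, ratio=0.8):
    """Return (a, b) index pairs passing a ratio test and a mutual check.

    Assumes desc_j has at least two rows.
    """
    # Pairwise Euclidean distances, shape (len(desc_i), len(desc_j)).
    dists = np.linalg.norm(desc_i[:, None, :] - desc_j[None, :, :], axis=2)
    matches = []
    for a in range(dists.shape[0]):
        order = np.argsort(dists[a])
        best, second = order[0], order[1]
        if dists[a, best] >= ratio * dists[a, second]:
            continue                      # ambiguous: runner-up is too close
        if np.argmin(dists[:, best]) != a:
            continue                      # not mutual: b prefers another a
        matches.append((a, int(best)))
    return matches

# Synthetic stand-in for real descriptors: view j sees the same five
# features as view i, shuffled and slightly perturbed.
rng = np.random.default_rng(0)
desc_i = rng.normal(size=(5, 32))
perm = [3, 0, 4, 1, 2]
desc_j = desc_i[perm] + 0.01 * rng.normal(size=(5, 32))

matches = match_descriptors(desc_i, desc_j)
```

Each returned pair `(a, b)` asserts that keypoint `a` in view $i$ and keypoint `b` in view $j$ see the same scene point. Real pipelines follow this with geometric verification, since descriptor similarity alone can be fooled by repeated texture.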

Feature Matching — Descent Datum $\varphi_{ij}$
Panels: view $i$ (camera left) and view $j$ (camera right). Click matching points in both views; each correspondence established is an element of the descent datum $\varphi_{ij}$.

Taken together, the correspondences determine the essential matrix $E_{ij}$ (or the fundamental matrix $F_{ij}$ for uncalibrated cameras). This $3 \times 3$ rank-2 matrix captures the epipolar geometry: the constraint on where a point in image $i$ can appear in image $j$.

$$\mathbf{p}_j^\top \, E_{ij} \, \mathbf{p}_i = 0$$

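This epipolar constraint is the algebraic expression of the descent datum. The essential matrix $E_{ij} = [\mathbf{t}_{ij}]_\times R_{ij}$ encodes the transition function $\varphi_{ij} = (R_{ij}, \mathbf{t}_{ij}) \in \text{SE}(3)$, the relative camera pose, as a bilinear form on image coordinates. Because the constraint is homogeneous, it fixes the translation $\mathbf{t}_{ij}$ only up to scale.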
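The epipolar constraint can be verified on a synthetic pair of views. The sketch assumes calibrated cameras, the standard construction $E = [\mathbf{t}]_\times R$ from a relative pose $(R, \mathbf{t})$, and normalized homogeneous image coordinates; the pose and scene point are made up for the example.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([
        [0.0, -t[2], t[1]],
        [t[2], 0.0, -t[0]],
        [-t[1], t[0], 0.0],
    ])

def rot_y(deg):
    """Rotation about the y axis by `deg` degrees."""
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

# Assumed relative pose: a point X_i in camera i's frame has coordinates
# X_j = R @ X_i + t in camera j's frame.
R = rot_y(10.0)
t = np.array([1.0, 0.0, 0.2])
E = skew(t) @ R

# One scene point, expressed in camera i's frame.
X_i = np.array([0.4, -0.3, 5.0])
X_j = R @ X_i + t

# Normalized homogeneous image coordinates in each view.
p_i = X_i / X_i[2]
p_j = X_j / X_j[2]

residual = p_j @ E @ p_i              # zero up to rounding
rank = np.linalg.matrix_rank(E)       # 2

# Scaling the translation leaves the constraint satisfied, so the
# baseline length cannot be read off from E alone.
residual_scaled = p_j @ (skew(5.0 * t) @ R) @ p_i
```

With noisy matches the residual is no longer exactly zero, and estimators fit the matrix that makes it small across many correspondences.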

What Comes Next
We now have a cover (cameras), fibers (depth rays), a fiber product (overlap regions), and descent data (feature correspondences). Part II will ask: when do these data cohere? The cocycle condition $\varphi_{jk} \circ \varphi_{ij} = \varphi_{ik}$ will become bundle adjustment. Its failure will become reprojection error. And effective descent will become 3D reconstruction.
Section 07

The Dictionary So Far

Every term on the left you have used, perhaps without knowing it, every time you ran a photogrammetry pipeline. The mathematics is not imposed from outside. It was there all along.

| Descent Theory | Photogrammetry | Why It's Not Metaphor |
|---|---|---|
| Base scheme $X$ | 3D scene | The unknown geometry you cannot directly access |
| Cover $Y \to X$ | Camera array | Multiple viewpoints whose union covers the scene |
| Faithful flatness | Sufficient coverage & overlap | Every point visible to $\geq 1$ camera; most to $\geq 2$ |
| Pullback $p^*E$ | 2D projection | Lossy map: the depth fiber is killed |
| Fiber $\pi^{-1}(p)$ | Ray through pixel | All 3D points mapping to one 2D location |
| Fiber product $Y \times_X Y$ | Overlap region | Points visible to both cameras $i$ and $j$ |
| Descent datum $\varphi_{ij}$ | Feature correspondence | Essential matrix / matched keypoints |
Coming in Part II

The Descent: Cocycles, Reconstruction, and the Sheaf That Refuses

Bundle adjustment as the cocycle condition • Trifocal tensors on triple overlaps • Reprojection error as obstruction in $H^1$ • Gauge ambiguity and SE(3) • NeRF as the sheaf that never descends