The Cover
You already know how photogrammetry works. You have shot objects from a dozen angles, fed the images into Metashape or COLMAP, and watched a point cloud materialize from coherent matches. What you may not know is that the mathematical skeleton underneath this workflow — the reason it works at all — is one of the deepest structures in 20th-century mathematics: Grothendieck’s theory of descent.
This is not an analogy. It is not a metaphor. Photogrammetry is descent, performed on real geometry, using the same axioms that govern vector bundles and Galois extensions. This essay will make the dictionary explicit. By the end of Part I you will understand what a cover, a fiber, a fiber product, and a descent datum are — because you already use them every time you click a shutter.
The Scene You Cannot Touch
In category theory, descent begins with a base object $X$. This is the thing you want to understand — a scheme, a space, a geometric structure. The crucial condition: you cannot access $X$ directly. You can only interrogate it through maps into it.
In photogrammetry, $X$ is the 3D scene. The geometry is out there — vertices, edges, surfaces, textures — but you cannot reach in and read off coordinates. There is no God-view. You are locked behind cameras.
The Shutter Click: Pullback $p^*$
Each camera defines a map $\pi_i : X \dashrightarrow \mathbb{P}^2$ — a projection from three dimensions into two. A perspective projection is a pullback: it takes geometry living on the base $X$ and produces a flat representation on the image plane.
The key insight: projection destroys information. The depth axis — the direction from the camera to the scene — is collapsed. Every ray through the camera center maps to a single pixel. This is exactly the pullback functor $p^*$: it lifts structure from the base to the cover, but in doing so kills the fiber direction.
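The collapse of depth is easy to see numerically. The sketch below assumes an idealized pinhole camera at the origin looking down the $+z$ axis with unit focal length; `project` is a hypothetical helper written for this illustration, not a function from any photogrammetry library.

```python
def project(point, focal_length=1.0):
    """Pinhole projection of a 3D point (x, y, z), z > 0, onto the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)

# Two different scene points on the same ray through the camera center.
near = (0.5, -0.25, 2.0)
far = tuple(4.0 * c for c in near)  # same direction, four times farther away

print(project(near))  # (0.25, -0.125)
print(project(far))   # (0.25, -0.125)
```

Both points land on the same pixel, so nothing in the image distinguishes them.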
Notice that each photograph gives you complete information within its field of view — colors, edges, texture — but it is irredeemably flat. The photograph is a local trivialization: on the patch of $X$ visible to camera $i$, everything is split and readable. But the splitting only works locally. Move to a different view and the depth relationships change.
What the Lens Destroys: The Fiber
Pick a pixel in the image. Which 3D point produced it? You don't know. Any point along the ray from the camera through that pixel could have been responsible. This set of candidate 3D points — the entire ray — is the fiber over that pixel.
In the language of fibered categories: the map $\pi_i : X \dashrightarrow \mathbb{P}^2$ gives a fibered structure where the fiber over each point $p$ in the image is $\pi_i^{-1}(p)$ — a line in 3D space. This is the kernel of the projection. It is what the lens destroys.
$$\text{fiber}(p) = \pi_i^{-1}(p) = \{Q \in X : \pi_i(Q) = p\} \cong \mathbb{A}^1$$
A single view cannot resolve the fiber. It sees the pixel but not the depth. To collapse the fiber (to determine which point on the ray is the real one) you need a second view. This is the first hint that reconstruction requires multiple cameras. One projection creates ambiguity. Two projections can begin to resolve it.
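Here is a minimal sketch of two fibers resolving each other. It assumes two unrotated pinhole cameras with unit focal length that differ only in position, and it uses the midpoint of the closest points on the two rays as the estimate. Real pipelines handle rotation and noise with more careful solvers, and the helper names (`back_project`, `triangulate_midpoint`) are invented for this example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def back_project(center, pixel, focal_length=1.0):
    """The fiber over a pixel: the ray center + t * direction (axis-aligned camera)."""
    u, v = pixel
    return center, (u, v, focal_length)

def point_on_ray(ray, t):
    center, direction = ray
    return tuple(c + t * d for c, d in zip(center, direction))

def triangulate_midpoint(ray1, ray2):
    """Midpoint of the shortest segment joining two rays."""
    (c1, d1), (c2, d2) = ray1, ray2
    w = tuple(p - q for p, q in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel: the second view adds no depth information")
    s = (b * e - c * d) / denom  # parameter along ray1
    t = (a * e - b * d) / denom  # parameter along ray2
    p, q = point_on_ray(ray1, s), point_on_ray(ray2, t)
    return tuple((x + y) / 2 for x, y in zip(p, q))

# The scene point (1, 2, 4) seen from camera centers (0, 0, 0) and (2, 0, 0).
ray1 = back_project((0.0, 0.0, 0.0), (0.25, 0.5))
ray2 = back_project((2.0, 0.0, 0.0), (-0.25, 0.5))

estimate = triangulate_midpoint(ray1, ray2)
print(estimate)  # (1.0, 2.0, 4.0)
```

Each ray alone is a one-parameter family of candidates. The pair pins down a single point, provided the rays are not parallel.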
The Cover Must Be Faithful
Not just any collection of cameras will do. The cover $\{U_i \to X\}$ must satisfy a crucial property: faithful flatness. In photogrammetry terms, this means:
Surjectivity (faithfulness): every point of the scene must be visible to at least one camera. If a region is never photographed, it cannot be reconstructed. It falls outside the cover.
Overlap (flatness): for reconstruction to work, most points should be visible to multiple cameras. The overlapping regions are where feature matching happens — where descent data lives.
The coverage map is a heatmap of the multiplicity of the cover. Points with zero coverage are outside the domain of reconstruction. Points with exactly one view retain depth ambiguity. Points with two or more views are where the fiber product lives — where we can begin to triangulate.
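The multiplicity count itself takes only a few lines. This sketch assumes you already have a visibility table (which camera sees which scene point); the camera names and point labels are made up for illustration.

```python
from collections import Counter

def coverage(visibility):
    """Map each scene point to the number of cameras that see it."""
    counts = Counter()
    for points in visibility.values():
        counts.update(points)
    return counts

def classify(n):
    if n == 0:
        return "outside the cover"
    if n == 1:
        return "depth ambiguous"
    return "triangulable"

# Hypothetical visibility table: camera name -> set of visible point labels.
visibility = {
    "cam0": {"a", "b", "c"},
    "cam1": {"b", "c", "d"},
    "cam2": {"c"},
}
scene_points = ["a", "b", "c", "d", "e"]

counts = coverage(visibility)
for name in scene_points:
    print(name, counts[name], classify(counts[name]))
```

Point `e` is never photographed and falls outside the cover, `a` and `d` are seen once and keep their depth ambiguity, and `b` and `c` are seen at least twice.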
$$\text{coverage}(x) = |\{i : x \in U_i\}| \quad\geq 2 \;\text{ for effective descent}$$
Where Views Meet: The Fiber Product $Y \times_X Y$
Two cameras see overlapping regions of the scene. The set of points visible to both camera $i$ and camera $j$ is the fiber product $U_i \times_X U_j$. This is where the magic happens.
In the fiber product, every scene point has two representations: a pixel in image $i$ and a pixel in image $j$. These two pixels correspond to the same 3D point. The act of identifying them — "this pixel here is the same scene point as that pixel there" — is the descent datum. But we are getting ahead of ourselves. First, let's see the fiber product.
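One way to see the fiber product concretely is as a set of pixel pairs lying over a common scene point. The sketch below cheats on purpose: it assumes each observation is already labeled with the scene point that produced it, which is exactly the information a real pipeline does not have and must recover by matching. The labels and pixel values are invented.

```python
def fiber_product(obs_i, obs_j):
    """Pairs of pixels (one per image) that lie over the same scene point."""
    shared = obs_i.keys() & obs_j.keys()
    return {point: (obs_i[point], obs_j[point]) for point in sorted(shared)}

# Hypothetical observations: scene point label -> pixel in that image.
obs_i = {"a": (0.25, 0.5), "b": (0.1, -0.2), "c": (-0.3, 0.0)}
obs_j = {"b": (-0.4, -0.2), "c": (-0.8, 0.0), "d": (0.2, 0.6)}

overlap = fiber_product(obs_i, obs_j)
for point, (pixel_i, pixel_j) in overlap.items():
    print(point, pixel_i, pixel_j)
```

Points `a` and `d` drop out because only one camera sees them. Each surviving entry is a single scene point carrying two pixel representations.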
The Descent Datum: Feature Matching as $\varphi_{ij}$
Now we reach the heart. On the fiber product — the overlap between views — we establish correspondences. Point $A$ in image $i$ matches point $B$ in image $j$. This identification $\varphi_{ij}: p_1^* E \xrightarrow{\sim} p_2^* E$ is the descent datum.
In photogrammetry, this is feature matching — SIFT, ORB, SuperPoint, whatever your detector of choice. You find keypoints in each image, compute descriptors, and match them across views. Each verified match says: "these two pixels see the same piece of reality."
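As a rough sketch of the matching step, here is nearest-neighbour matching with Lowe's ratio test in plain Python. The two-dimensional descriptors are toy values chosen for readability. Real SIFT descriptors are 128-dimensional, and binary descriptors such as ORB are usually compared with Hamming distance rather than the Euclidean distance used here.

```python
import math

def match_descriptors(desc_i, desc_j, ratio=0.75):
    """Match each descriptor in image i to image j, keeping unambiguous matches only."""
    matches = []
    for a, da in enumerate(desc_i):
        ranked = sorted((math.dist(da, db), b) for b, db in enumerate(desc_j))
        if len(ranked) < 2:
            continue  # the ratio test needs a second-best candidate
        (best, b), (second, _) = ranked[0], ranked[1]
        if best < ratio * second:  # best match must clearly beat the runner-up
            matches.append((a, b))
    return matches

# Toy descriptors, one tuple per keypoint.
desc_i = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
desc_j = [(1.0, 0.1), (0.1, 1.0), (0.0, 0.0)]

matches = match_descriptors(desc_i, desc_j)
print(matches)  # [(0, 1), (1, 0)]
```

The third keypoint in image $i$ is almost equally close to every candidate in image $j$, so the ratio test rejects it rather than guess. A rejected match costs you one correspondence; a wrong one corrupts the descent datum.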
Each correspondence carries geometric information encoded in the essential matrix $E_{ij}$ (or fundamental matrix $F_{ij}$ for uncalibrated cameras). This $3 \times 3$ rank-2 matrix captures the epipolar geometry — the constraint on where a point in image $i$ can appear in image $j$.
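$$\mathbf{p}_j^\top \, E_{ij} \, \mathbf{p}_i = 0$$
This epipolar constraint is the algebraic expression of the descent datum, with $\mathbf{p}_i$ and $\mathbf{p}_j$ the matched pixels in homogeneous coordinates. The essential matrix encodes the transition function $\varphi_{ij} \in \text{SE}(3)$, the relative camera pose, as a bilinear form on image coordinates. Because the constraint is homogeneous in $E_{ij}$, the translation part of that pose is recovered only up to scale.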
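The constraint can be checked by hand on synthetic data. The sketch below assumes calibrated cameras with normalized image coordinates and the convention $X_j = R\,X_i + t$ for moving a point from camera $i$'s frame to camera $j$'s, under which $E_{ij} = [t]_\times R$. The pose and scene point are arbitrary test values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def mat_vec(m, v):
    return tuple(dot(row, v) for row in m)

def essential_matrix(rotation, translation):
    """E = [t]_x R: each column of E is t crossed with the matching column of R."""
    columns = [cross(translation, col) for col in zip(*rotation)]
    return [list(row) for row in zip(*columns)]

def homogeneous_pixel(point):
    x, y, z = point
    return (x / z, y / z, 1.0)

def epipolar_residual(E, p_i, p_j):
    return dot(p_j, mat_vec(E, p_i))

# Arbitrary relative pose: 10 degrees about the y axis plus a sideways shift.
angle = math.radians(10.0)
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
t = (-1.0, 0.0, 0.2)
E = essential_matrix(R, t)

X_i = (0.3, -0.4, 5.0)  # scene point in camera i's frame
X_j = tuple(a + b for a, b in zip(mat_vec(R, X_i), t))
p_i, p_j = homogeneous_pixel(X_i), homogeneous_pixel(X_j)
wrong = (p_j[0], p_j[1] + 0.1, 1.0)  # a deliberately bad correspondence

print(abs(epipolar_residual(E, p_i, p_j)))    # numerically zero
print(abs(epipolar_residual(E, p_i, wrong)))  # clearly nonzero
```

A true correspondence satisfies the constraint to floating-point precision, while the shifted pixel does not. This is the kind of test robust estimators use to separate verified matches from outliers.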
The Dictionary So Far
You have used every concept in the left column, perhaps without knowing it, every time you ran a photogrammetry pipeline. The mathematics is not imposed from outside. It was there all along.
| Descent Theory | Photogrammetry | Why It's Not Metaphor |
|---|---|---|
| Base scheme $X$ | 3D scene | The unknown geometry you cannot directly access |
| Cover $Y = \bigsqcup_i U_i \to X$ | Camera array | Multiple viewpoints whose union covers the scene |
| Faithful flatness | Sufficient coverage & overlap | Every point visible to $\geq 1$ camera; most to $\geq 2$ |
| Pullback $p^*E$ | 2D projection | Lossy map: the depth fiber is killed |
| Fiber $\pi^{-1}(p)$ | Ray through pixel | All 3D points mapping to one 2D location |
| Fiber product $Y \times_X Y$ | Overlap region | Points visible to both cameras $i$ and $j$ |
| Descent datum $\varphi_{ij}$ | Feature correspondence | Essential matrix / matched keypoints |