February 26, 2026

Semantic Spatial Memory for Robots using Gaussian Splatting

3 min read

Big Idea

Robots need spatial memory, not just per-frame detection. Detection is stateless — every time the robot looks at a scene it starts from scratch. The goal here is a system that captures a space once and makes it permanently queryable: ask "where is the charging dock" and get back 3D coordinates a planner can act on directly.

The approach uses Gaussian Splatting as a metric scene representation and builds a language-queryable semantic layer on top of it. Crucially, semantics stay separate from geometry. Tightly coupling them — baking features into the Gaussians themselves — creates a representation that's hard to update when objects move, and that constraint matters for anything beyond a static demo.

Dynamic scenes are out of scope for now. Building something like SplaTAM or MonoGS is a separate project. But the architecture should not foreclose it.

Current Progress

The geometry pipeline is solid. A Swift app captures RGB frames and exports a session bundle. COLMAP runs full SfM on the frames — ARKit poses were tried early but triangulation quality was poor, so COLMAP handles pose estimation entirely. 3DGS trains on the COLMAP output on a Colab T4. The resulting scene is pure geometry and appearance, no semantics attached.
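
For concreteness, a minimal sketch of that stage, assuming pycolmap for the SfM step and the reference 3DGS trainer's CLI on the other end; the paths, session layout, and training command are placeholders, not the actual capture app's output format.

```python
from pathlib import Path
import subprocess
import pycolmap

scene = Path("session_bundle")   # placeholder for the exported capture bundle
images = scene / "images"        # RGB frames from the Swift app
database = scene / "database.db"
sparse = scene / "sparse"
(sparse / "0").mkdir(parents=True, exist_ok=True)

# Full SfM: COLMAP handles pose estimation end to end (no ARKit poses).
pycolmap.extract_features(database, images)
pycolmap.match_exhaustive(database)
maps = pycolmap.incremental_mapping(database, images, sparse)
maps[0].write(sparse / "0")

# Train 3D Gaussian Splatting on the COLMAP output (reference repo's train.py,
# which reads <scene>/images and <scene>/sparse/0).
subprocess.run(
    ["python", "train.py", "-s", str(scene), "-m", str(scene / "gs_model")],
    check=True,
)
```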

The question was where and how to attach semantics.


The natural first attempt was to assign CLIP features directly to Gaussians — extract patch-level features per training frame, project into 3D, aggregate per Gaussian across views, optionally compress with an MLP. This is what LERF does for NeRF. It failed fast: inter-Gaussian cosine similarity of 0.22, queries for "desk" and "keyboard" producing identical responses.
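
For reference, a stripped-down sketch of that aggregation step, assuming per-frame feature maps already upsampled to pixel resolution and a precomputed map from each Gaussian to the pixel it lands on in each frame; a real implementation would weight by rendering contribution, and all names here are illustrative.

```python
import numpy as np

def aggregate_gaussian_features(patch_feats, gaussian_pixels, num_gaussians, dim):
    """Naive per-Gaussian feature aggregation (the approach that collapsed).

    patch_feats: list of (H, W, dim) per-frame feature maps (CLIP/DINOv2 patches
                 upsampled to pixel resolution).
    gaussian_pixels: list (one entry per frame) of dicts mapping
                     gaussian_id -> (u, v) pixel that Gaussian projects to.
    """
    acc = np.zeros((num_gaussians, dim), dtype=np.float32)
    count = np.zeros(num_gaussians, dtype=np.float32)
    for feats, proj in zip(patch_feats, gaussian_pixels):
        for gid, (u, v) in proj.items():
            acc[gid] += feats[v, u]      # accumulate whatever context this view shows
            count[gid] += 1.0
    acc /= np.maximum(count[:, None], 1.0)
    # L2-normalize so the result can be compared against CLIP text embeddings.
    return acc / np.maximum(np.linalg.norm(acc, axis=1, keepdims=True), 1e-8)
```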

The failure wasn't the choice of CLIP. It was aggregation. A Gaussian visible across 50%+ of training frames accumulates features from dozens of different semantic contexts — near the cup in some frames, near the keyboard in others, near nothing in most. Averaging those produces a vector that represents nothing in particular. The MLP had nothing to compress.
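
A toy illustration of the mechanism, with synthetic stand-in features rather than real measurements: as the fraction of views in which a Gaussian actually sees its own object shrinks, the aggregated features of a "desk" Gaussian and a "keyboard" Gaussian converge, and a desk query stops separating them.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512

def unit(v):
    return v / np.linalg.norm(v)

# Synthetic stand-ins for the CLIP features of three semantic contexts.
desk, keyboard, clutter = (unit(rng.normal(size=dim)) for _ in range(3))

def aggregated_feature(own, own_fraction, n_views=50):
    """Average what a Gaussian 'sees' across views: some views show its own
    object, the rest show whatever else happened to be nearby."""
    others = np.stack([desk, keyboard, clutter])
    feats = [own if rng.random() < own_fraction else others[rng.integers(3)]
             for _ in range(n_views)]
    return unit(np.mean(feats, axis=0))

for frac in (1.0, 0.5, 0.2):
    g_desk = aggregated_feature(desk, frac)
    g_kbd = aggregated_feature(keyboard, frac)
    print(f"own-context fraction {frac:.1f}: "
          f"desk-Gaussian vs keyboard-Gaussian similarity {g_desk @ g_kbd:+.2f}, "
          f"'desk' query scores {g_desk @ desk:.2f} (desk) vs {g_kbd @ desk:.2f} (keyboard)")
# As the own-context fraction drops, the two Gaussians' features converge
# and the query response no longer tells them apart.
```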

Switching to DINOv2, whose patch features are far more spatially coherent than CLIP's, didn't save it. Per-frame features were better; the aggregated features were just as collapsed. The collapse happened in the accumulation step, not in the feature extractor. Per-Gaussian crop supervision and a uniformity loss both failed to recover signal that was already gone.


The deeper issue is that per-Gaussian semantics is the wrong abstraction. A Gaussian doesn't correspond to an object — it's a rendering primitive that participates in many views at many scales. Trying to assign a stable semantic identity to something that has none is solving the wrong problem. And stepping back: the queries this system needs to answer are object-level. "Where is the cup" doesn't require knowing which individual Gaussian belongs to the cup — it requires knowing where the cup is in 3D space.

The current architecture reflects that. SAM3 runs across training frames to produce consistent instance masks. Each object gets a single CLIP embedding from a clean, unambiguous crop — not averaged across views. Masks are projected into Gaussian space to assign object membership. A scene graph is built over the resulting object clusters with spatial relation edges. The scene graph was always part of the plan; what changed is that it's now the primary semantic representation rather than a layer on top of a dense feature field.
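
A structural sketch of that layer, with the SAM3 tracking, crop selection, and mask-to-Gaussian projection assumed to be done upstream and passed in as plain data; the names, fields, and relation edges are illustrative, not the final schema.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectNode:
    obj_id: int
    clip_embedding: np.ndarray          # single embedding from one clean crop
    gaussian_ids: set[int] = field(default_factory=set)
    centroid: np.ndarray | None = None  # 3D position from member Gaussians

def build_object_layer(crop_embeddings, mask_memberships, gaussian_means):
    """Assemble object nodes and a small scene graph over a trained splat.

    crop_embeddings: {obj_id: CLIP embedding of that object's cleanest crop}
                     (SAM3 tracking and crop selection assumed done upstream).
    mask_memberships: {obj_id: iterable of Gaussian ids covered by its masks}.
    gaussian_means: (N, 3) array of Gaussian centers from the trained scene.
    """
    objects = {}
    for obj_id, emb in crop_embeddings.items():
        node = ObjectNode(obj_id, emb, set(mask_memberships[obj_id]))
        node.centroid = gaussian_means[sorted(node.gaussian_ids)].mean(axis=0)
        objects[obj_id] = node

    # Scene graph: one node per object, pairwise spatial-relation edges.
    edges = []
    ids = sorted(objects)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            ca, cb = objects[a].centroid, objects[b].centroid
            edges.append((a, b, {
                "distance": float(np.linalg.norm(ca - cb)),
                "above": bool(ca[2] > cb[2]),   # crude relation; real edges are richer
            }))
    return objects, edges
```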

The aggregation problem disappears. CLIP embeddings come from crops where the object is clearly the subject, not from per-pixel features mixed across hundreds of frames with different contexts.
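
And the query path from the Big Idea becomes a lookup: embed the text, take the best-matching object node, return its 3D centroid to the planner. A sketch assuming open_clip and the object layer above; the model name is one possible choice, and the crop embeddings are assumed to come from the same CLIP model and be unit-normalized.

```python
import numpy as np
import torch
import open_clip

# Text encoder; must match whatever encoded the object crops.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def locate(query: str, objects: dict) -> tuple[int, np.ndarray, float]:
    """Return (obj_id, 3D centroid, score) of the object best matching the query."""
    with torch.no_grad():
        text = model.encode_text(tokenizer([f"a photo of a {query}"]))
        text = (text / text.norm(dim=-1, keepdim=True)).squeeze(0).numpy()
    best = max(objects.values(), key=lambda o: float(o.clip_embedding @ text))
    return best.obj_id, best.centroid, float(best.clip_embedding @ text)

# obj_id, xyz, score = locate("charging dock", objects)  # xyz goes straight to the planner
```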