February 26, 2026

Semantic Spatial Memory for Robots using Gaussian Splatting

6 min read

Big Idea

While many challenges must be solved before robots can operate effectively in homes, stores, warehouses, and other human environments, one of the most important is spatial memory. Most robotics systems today are optimized for understanding what is visible in the current frame: detecting objects, estimating poses, and planning actions from immediate observations. While these capabilities are crucial, they do not provide a persistent understanding of an environment over time.

Unlike some autonomous vehicles operating on roads, robots such as humanoids cannot rely on detailed maps created before deployment. Homes, offices, and warehouses exhibit a massive long tail of layouts, object arrangements, and environmental changes. Robots must therefore build their understanding online, incrementally learning about a space as they interact with it.

Despite this, indoor environments exhibit strong semantic regularities. Kitchens contain counters, sinks, appliances, and cabinets; bedrooms contain beds and dressers; living rooms contain seating and entertainment furniture. While exact layouts vary considerably, these recurring structures provide useful priors for building and organizing spatial memory.

Object detection tells a robot what is visible right now. Spatial memory allows it to retain what it has already learned about a space. The goal of this project is to create a persistent, queryable 3D model that can be constructed during deployment. The system should capture geometry, localize the robot within that geometry, attach semantic labels to objects and regions, and maintain that knowledge over time. This enables queries like “Which cabinet contains the coffee mugs?” to return actionable 3D poses or regions that downstream planners can use directly.

Existing Gaussian-SLAM systems such as SplaTAM and MonoGS focus on reconstruction and localization. This project treats geometric reconstruction as a solved prerequisite and instead focuses on how semantic information should be attached to, organized within, and queried from a persistent spatial representation.

Gaussian Splatting is not the only possible scene representation for this problem, nor is it necessarily the most computationally efficient. Traditional approaches such as occupancy grids, TSDFs, point clouds, and voxel-based maps could also provide metric localization and support spatial queries. Gaussian Splatting was selected because it provides a mature, metric, and visually interpretable representation that allows the project to focus on semantic memory and querying rather than reconstruction research. It produces high-fidelity reconstructions that remain directly queryable in 3D space, making it a useful substrate for grounding semantic information and language queries.

A potential advantage of a persistent geometric memory is that most human environments are far more stable than they initially appear. While homes and offices contain dynamic objects, the majority of their geometry changes infrequently. Walls, floors, cabinets, countertops, appliances, shelving, and large furniture often remain fixed for long periods of time. In practice, most environmental changes involve a relatively small set of movable objects such as mugs, bags, boxes, chairs, doors, and people.

This suggests that geometry and semantics evolve at different rates. Once a high-quality geometric representation has been established, large portions of the environment may require little or no recomputation. Semantic state, however, changes continuously as objects are observed, moved, reclassified, or related to one another. The architecture therefore maintains geometry and semantics separately. The geometric representation serves as persistent environmental memory, while semantic labels, object identities, and spatial relationships evolve independently on top of that substrate.
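A minimal sketch of what this split could look like as data structures. It is an illustration only, not the project's implementation: the class names (`GeometryLayer`, `SemanticLayer`, `ObjectRecord`) and fields are hypothetical, and each Gaussian is reduced to its center position.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GeometryLayer:
    """Slow-changing layer: Gaussian centers, written once after reconstruction."""
    means: tuple  # tuple of (x, y, z) Gaussian centers, in meters
    version: int = 1


@dataclass
class ObjectRecord:
    label: str
    gaussian_ids: list  # indices into GeometryLayer.means
    last_seen: float    # timestamp of the latest observation, in seconds


@dataclass
class SemanticLayer:
    """Fast-changing layer: object records keyed by object id."""
    objects: dict = field(default_factory=dict)

    def observe(self, object_id, label, gaussian_ids, timestamp):
        # Insert or overwrite the record; geometry is never touched here.
        self.objects[object_id] = ObjectRecord(label, list(gaussian_ids), timestamp)


def centroid(geometry, record):
    """Mean position of the Gaussians currently assigned to an object."""
    pts = [geometry.means[i] for i in record.gaussian_ids]
    return tuple(sum(p[axis] for p in pts) / len(pts) for axis in range(3))


geometry = GeometryLayer(means=((0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (1.0, 1.0, 0.0)))
semantics = SemanticLayer()
semantics.observe("obj-1", "mug", [0, 1], timestamp=10.0)
# A later observation relabels the same object without touching geometry.
semantics.observe("obj-1", "coffee mug", [0, 1], timestamp=42.0)

print(semantics.objects["obj-1"].label, centroid(geometry, semantics.objects["obj-1"]))
```

Making the geometry layer frozen is one way to make the different update rates explicit: semantic updates are cheap dictionary writes, while any change to geometry has to go through a deliberate rebuild.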

Under this model, dynamic observations are grounded within an existing 3D representation rather than requiring reconstruction of the entire environment whenever local changes occur. This provides a path toward more efficient long-term spatial memory systems as environments evolve over time. More importantly, it shifts the focus of the project away from reconstruction and toward memory itself. The use of Gaussian Splatting is ultimately an implementation choice rather than the central contribution. The primary objective is to develop a memory architecture capable of grounding semantic knowledge in physical space and exposing that knowledge through language and spatial queries. While a dense 3D representation provides a useful foundation for localization, planning, and metric reasoning, the broader architecture could in principle be adapted to alternative scene representations.

Current Progress

(As of March 19th)

I think my geometry pipeline is working well: I have a Swift app that captures RGB frames and exports a session bundle. I tried using ARKit poses early on so I could include them in the export, but the triangulation quality was poor, so I ended up using COLMAP for pose estimation; COLMAP runs full SfM on the frames.

Before decoupling semantics from geometry, I wanted to keep them tied together for an MVP. I attempted to assign CLIP features directly to Gaussians: extract patch-level features per training frame, project them into 3D, aggregate per Gaussian across views, and optionally compress with an MLP. This is roughly what LERF does for NeRF. It failed very quickly. Inter-Gaussian cosine similarity was 0.22, and queries for “desk” and “keyboard” produced identical responses.
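The aggregation step can be sketched roughly as follows. This is not the code from the project: it assumes a pinhole camera, world-to-camera poses, and a hypothetical per-frame `patch_features` grid standing in for CLIP patch features. It also ignores occlusion and the optional MLP compression.

```python
def project(point, pose, fx, fy, cx, cy):
    """Project a world point to pixel coordinates. pose = (R, t) maps world to camera."""
    R, t = pose
    x, y, z = (
        sum(R[r][c] * point[c] for c in range(3)) + t[r] for r in range(3)
    )
    if z <= 0:
        return None  # behind the camera
    return (fx * x / z + cx, fy * y / z + cy)


def aggregate_features(means, frames, intrinsics, patch_size):
    """Average the patch feature each Gaussian center lands on, over all frames that see it."""
    fx, fy, cx, cy = intrinsics
    out = []
    for mean in means:
        total, count = None, 0
        for frame in frames:
            uv = project(mean, frame["pose"], fx, fy, cx, cy)
            if uv is None:
                continue
            col, row = int(uv[0] // patch_size), int(uv[1] // patch_size)
            grid = frame["patch_features"]  # grid[row][col] -> feature vector
            if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
                continue  # projects outside the image
            feat = grid[row][col]
            total = list(feat) if total is None else [a + b for a, b in zip(total, feat)]
            count += 1
        out.append(None if count == 0 else [v / count for v in total])
    return out


# Toy setup: a 32x32 image split into a 2x2 grid of 16-pixel patches, 2-D features.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frames = [
    {"pose": (identity, (0.0, 0.0, 0.0)),
     "patch_features": [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]},
    {"pose": (identity, (0.2, 0.0, 0.0)),
     "patch_features": [[[0.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]},
]
means = [(-0.1, -0.1, 1.0), (0.05, 0.05, 1.0)]
features = aggregate_features(means, frames, intrinsics=(100.0, 100.0, 16.0, 16.0), patch_size=16)
print(features)
```

In this toy setup the first Gaussian lands on a different patch in each frame, so its stored feature is a blend of two unrelated patch features. The second Gaussian is visible in only one frame and keeps that frame's feature unchanged.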

I think the failure was in the aggregation, not in CLIP. A Gaussian visible across 50%+ of training frames accumulates features from dozens of different semantic contexts: near the cup in some frames, near the keyboard in others, near nothing in most. Averaging those produces a vector that represents nothing in particular, and the MLP has nothing meaningful to compress.
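A toy illustration of this dilution effect, under strong simplifying assumptions: the three "features" below are hypothetical, mutually orthogonal stand-ins for CLIP features of each context, and the frame counts are made up for the example rather than measured.

```python
import math


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def mean_feature(context_counts, prototypes):
    """Average one prototype feature per frame, weighted by how often each context occurs."""
    total = sum(context_counts.values())
    dim = len(next(iter(prototypes.values())))
    return [
        sum(count * prototypes[name][i] for name, count in context_counts.items()) / total
        for i in range(dim)
    ]


# Hypothetical stand-ins for the CLIP features of three contexts.
prototypes = {
    "cup": [1.0, 0.0, 0.0],
    "keyboard": [0.0, 1.0, 0.0],
    "background": [0.0, 0.0, 1.0],
}

# One Gaussian seen in 50 frames: near the cup in 5, near the keyboard in 5, near nothing in 40.
gaussian = mean_feature({"cup": 5, "keyboard": 5, "background": 40}, prototypes)

for query in ("cup", "keyboard", "background"):
    print(query, round(cosine(gaussian, prototypes[query]), 3))
```

Here the averaged feature scores identically for the "cup" and "keyboard" queries and is dominated by the background context, which is the same qualitative failure described above.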

I think the deeper issue is that per-Gaussian semantics is the wrong abstraction. A Gaussian doesn’t correspond to an object; it’s just a rendering primitive that participates in many views at many scales. Trying to assign a stable semantic identity to something without one is solving the wrong problem. Stepping back, the queries the system needs to answer are object-level. “Where is the cup?” doesn’t require knowing which individual Gaussians belong to the cup; it just requires knowing where the cup is in 3D space.

I think I could run SAM3 across training frames to produce consistent instance masks. Each object would get a single CLIP embedding from a clean, unambiguous crop, one that’s not averaged across views. Masks would then be projected into Gaussian space to assign object membership, and a scene graph would be built over the resulting object clusters with spatial relation edges.
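A rough sketch of the membership and scene-graph steps, assuming the per-frame mask lookups have already been done. The `votes` input, the function names, the 0.5 majority ratio, the distance thresholds, and the z-up convention are all assumptions made for illustration, not details of the planned pipeline.

```python
from collections import Counter, defaultdict
import math


def assign_membership(votes, min_ratio=0.5):
    """Give each Gaussian the instance id that wins enough of its per-frame mask lookups."""
    membership = []
    for ids in votes:
        hits = [i for i in ids if i is not None]
        if not hits:
            membership.append(None)
            continue
        winner, count = Counter(hits).most_common(1)[0]
        membership.append(winner if count / len(ids) >= min_ratio else None)
    return membership


def build_objects(means, membership):
    """Group Gaussian centers by instance id and compute one centroid per object."""
    groups = defaultdict(list)
    for mean, obj in zip(means, membership):
        if obj is not None:
            groups[obj].append(mean)
    return {
        obj: tuple(sum(p[a] for p in pts) / len(pts) for a in range(3))
        for obj, pts in groups.items()
    }


def build_scene_graph(centroids, near_dist=0.5, above_gap=0.05):
    """Emit (subject, relation, object) edges from centroids. Assumes z is up, in meters."""
    edges = []
    for a, pa in centroids.items():
        for b, pb in centroids.items():
            if a == b:
                continue
            horizontal = math.hypot(pa[0] - pb[0], pa[1] - pb[1])
            if a < b and math.dist(pa, pb) <= near_dist:
                edges.append((a, "near", b))  # symmetric relation, stored once
            if horizontal <= near_dist and pa[2] - pb[2] > above_gap:
                edges.append((a, "above", b))
    return edges


# votes[i] lists the instance id found under Gaussian i in each frame (None = no mask).
means = [(0.0, 0.0, 0.75), (0.1, 0.0, 0.75), (0.05, 0.0, 0.85), (2.0, 2.0, 0.4)]
votes = [
    ["desk", "desk", "desk"],
    ["desk", "desk", None],
    ["cup", "cup", "desk"],
    [None, None, "chair"],
]

membership = assign_membership(votes)
centroids = build_objects(means, membership)
edges = build_scene_graph(centroids)
print(membership)
print(edges)
```

Voting across frames lets a single bad mask be outvoted, and Gaussians without a clear winner stay unassigned instead of polluting an object cluster.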

The aggregation problem should disappear, because CLIP embeddings would come from crops where the object is clearly the subject, not from per-pixel features mixed across hundreds of frames with different contexts.
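With one embedding per object, answering a query such as “Where is the cup?” reduces to a nearest-neighbor lookup over a short list of objects. The sketch below assumes the embeddings are already computed; the 3-D vectors are hypothetical stand-ins for CLIP embeddings, and `query_objects` is an illustrative name rather than part of any library.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def query_objects(text_embedding, objects, top_k=1):
    """Rank objects by similarity to the query embedding; return (label, centroid, score)."""
    scored = [
        (obj["label"], obj["centroid"], cosine(text_embedding, obj["embedding"]))
        for obj in objects
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_k]


# Hypothetical 3-D stand-ins for CLIP embeddings; real ones have hundreds of dimensions.
objects = [
    {"label": "cup", "centroid": (0.05, 0.0, 0.85), "embedding": [0.9, 0.1, 0.0]},
    {"label": "keyboard", "centroid": (0.40, 0.1, 0.76), "embedding": [0.1, 0.9, 0.1]},
    {"label": "desk", "centroid": (0.30, 0.0, 0.75), "embedding": [0.2, 0.3, 0.9]},
]

# Stand-in for the text embedding of "Where is the cup?".
best = query_objects([1.0, 0.0, 0.0], objects)[0]
print(best[0], best[1])
```

The returned centroid is the kind of 3D position a downstream planner could consume directly, and the search cost scales with the number of objects rather than the number of Gaussians.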