How the memory retrieval pipeline delivers sub-two-second responses across 100,000 stored memories using approximate search, caching, and layered filtering.
Each memory stored in the companion system is represented as a
768-dimensional vector — a point in high-dimensional space where
nearby points represent semantically similar concepts. To find relevant memories,
the system must compare an incoming query vector against every stored memory.
A naive exhaustive search at this scale takes 5 to 10 seconds. Users expect conversational responses in under 2 seconds total, and memory retrieval is only one step in a longer pipeline. The search itself needs to complete in tens of milliseconds, not seconds.
The solution is to avoid searching every memory. Instead, the system pre-organizes memories into groups of similar content, then only searches the groups most likely to contain relevant results.
All stored memory vectors are partitioned into roughly 100 clusters based on semantic similarity. Memories about combat group together. Memories about a specific friend group together. Memories about locations group together. This happens once during index construction, not at query time.
When a search query arrives, the system first compares it against the 100 cluster centers — one comparison per cluster — to find the 5 most relevant groups. This narrows the search space before examining individual memories.
Only the memories inside those 5 clusters are searched exhaustively. With roughly 1,000 memories per cluster, this means examining about 5,000 vectors instead of the full 100,000.
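The two-stage search can be sketched in plain NumPy. This is a simplified illustration, not the production code: the real index is built with FAISS, the dimensions here are shrunk (64 instead of 768, 10,000 memories instead of 100,000), and cluster centers are sampled at random instead of learned with k-means. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_clusters, n_probe = 64, 10_000, 100, 5  # reduced sizes for the sketch

# Stored memory vectors, normalized so a dot product is cosine similarity.
memories = rng.normal(size=(n, d)).astype("float32")
memories /= np.linalg.norm(memories, axis=1, keepdims=True)

# Index construction (done once, not at query time): pick cluster centers
# and assign every memory to its nearest center. A real index would learn
# the centers with k-means; random data points keep the sketch short.
centers = memories[rng.choice(n, size=n_clusters, replace=False)]
assignments = (memories @ centers.T).argmax(axis=1)

def search(query, k=10):
    query = query / np.linalg.norm(query)
    # Coarse step: one comparison per cluster center, keep the best n_probe.
    top_clusters = np.argsort(centers @ query)[-n_probe:]
    # Fine step: exhaustive scan over only the memories in those clusters.
    candidates = np.flatnonzero(np.isin(assignments, top_clusters))
    scores = memories[candidates] @ query
    return candidates[np.argsort(scores)[-k:][::-1]]

hits = search(rng.normal(size=d).astype("float32"))
```

The coarse step costs 100 comparisons; the fine step scans only the few thousand memories inside the probed clusters, which is where the order-of-magnitude saving comes from.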
This is an approximate search. By probing only 5 of the 100 clusters, the system can miss a relevant memory that happens to sit in an unprobed cluster. In practice, approximate nearest-neighbor search of this kind finds roughly 95% of the true top results.
For companion conversation, this trade-off is more than acceptable. The companion does not need the single mathematically optimal memory. It needs a handful of relevant memories to ground its response. Missing one out of twenty barely changes the conversation. Cutting search time by an order of magnitude changes the user experience dramatically.
Clustering is one layer of a four-tier optimization strategy. Each layer targets a different bottleneck, and they stack multiplicatively.
The first tier is a query cache: recent results are held in memory. If the same or a very similar query arrives again (common in ongoing conversations that stay on one topic), the entire search is skipped. This achieves roughly a 30% hit rate in typical conversation patterns.
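A minimal version of such a cache might look like the sketch below. The class and parameter names are hypothetical; the sketch keys on normalized query text with a time-to-live, whereas catching "very similar" queries in the full system presumably also involves comparing query embeddings.

```python
import hashlib
import time

class QueryCache:
    """Hypothetical cache of recent retrieval results, keyed on query text."""

    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (insertion time, results)

    @staticmethod
    def _key(text):
        # Collapse case and whitespace so trivial rephrasings still hit.
        return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

    def get(self, text):
        entry = self._entries.get(self._key(text))
        if entry is None or time.monotonic() - entry[0] > self.ttl:
            return None  # miss or expired: fall through to the full search
        return entry[1]

    def put(self, text, results):
        self._entries[self._key(text)] = (time.monotonic(), results)
```

On a hit, every downstream stage (filtering, clustering, exhaustive scan) is skipped entirely, which is why even a 30% hit rate pays for itself.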
The second tier deduplicates perception. The perception system tracks which entities (players, objects) have already been reported; if nothing in a companion's surroundings has changed, the spatial context update is skipped entirely. This eliminates roughly 70% of redundant spatial processing.
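The change-detection logic reduces to comparing the current surroundings against the last reported snapshot. This is a sketch under assumed data shapes (an entity id mapped to some hashable observed state); the names are not from the actual codebase.

```python
class PerceptionTracker:
    """Hypothetical change detector: skip spatial updates when nothing moved."""

    def __init__(self):
        self._last_seen = {}  # companion id -> last reported snapshot

    def needs_update(self, companion_id, entities):
        # entities: mapping of entity id -> hashable observed state
        snapshot = frozenset(entities.items())
        if self._last_seen.get(companion_id) == snapshot:
            return False  # surroundings unchanged, skip the spatial update
        self._last_seen[companion_id] = snapshot
        return True
```

Because the snapshot comparison is a single set equality check, the skip decision itself costs almost nothing.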
The third tier is significance filtering. Before vector search begins, low-significance memories are excluded from the candidate set: casual greetings and filler exchanges (scored below 0.3) never enter the search pipeline. This reduces the working dataset by approximately 40%.
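The filter itself is a one-liner; the 0.3 floor is the threshold quoted above, while the function and field names here are hypothetical.

```python
SIGNIFICANCE_FLOOR = 0.3  # memories scored below this never enter the index

def searchable_memories(memories):
    """Drop low-significance entries before they reach the vector search."""
    return [m for m in memories if m["significance"] >= SIGNIFICANCE_FLOOR]
```

Applied at indexing time rather than query time, this shrinks every later stage: fewer vectors to cluster, fewer candidates to scan.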
The fourth tier is the clustering approach described above, powered by FAISS. Instead of exhaustive comparison against all remaining memories, only the most promising clusters are searched. This provides the largest single speedup in the stack.
From the moment a user sends a message to the moment the companion begins responding, nine steps execute in sequence. Most complete in single-digit milliseconds. The two most expensive steps — embedding generation and AI response generation — are bounded by external service latency rather than application logic.
Steps 3 through 8 — the retrieval pipeline proper — account for roughly 85 milliseconds combined. The latency budget is dominated by two external calls: generating the query embedding (step 2) and waiting for the language model to produce its first token (step 9). Everything the application controls directly runs in well under 100 milliseconds.
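Claims like "well under 100 milliseconds" are only trustworthy if the pipeline measures itself. One lightweight way to do that is a per-step timing helper; this is an illustrative sketch, not the project's actual instrumentation, and the step names are placeholders.

```python
import time
from contextlib import contextmanager

class LatencyBudget:
    """Hypothetical helper: accumulate per-step timings against a budget."""

    def __init__(self, budget_ms=100.0):
        self.budget_ms = budget_ms
        self.steps = {}  # step name -> elapsed milliseconds

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.steps[name] = (time.perf_counter() - start) * 1000.0

    @property
    def total_ms(self):
        return sum(self.steps.values())

    def within_budget(self):
        return self.total_ms <= self.budget_ms
```

Wrapping each retrieval step in `budget.step("...")` makes regressions visible per stage instead of as an undifferentiated slowdown at the end of the pipeline.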
Every optimization in this stack reflects a single principle: trade perfectionism for practicality. Each decision sacrifices a small amount of theoretical precision in exchange for a large improvement in user-perceived responsiveness.
None of these optimizations are individually groundbreaking. Caching, filtering, deduplication, and approximate search are well-established techniques. The contribution is in their composition: four layers, each targeting a different cost center, stacking to deliver a retrieval pipeline that runs in under 100 milliseconds on a dataset that would take seconds to search naively.
The user does not see any of this. They see a companion that responds quickly, remembers what matters, and never stumbles over its own memory. That is the point. The best infrastructure is the kind no one notices.