Chapter 5

Vector Search & Optimization

How the memory retrieval pipeline delivers sub-two-second responses across 100,000 stored memories using approximate search, caching, and layered filtering.

The Scale Problem

Each memory stored in the companion system is represented as a 768-dimensional vector — a point in high-dimensional space where nearby points represent semantically similar concepts. To find relevant memories, the system must compare an incoming query vector against every stored memory.

100,000 stored memories · 768 dimensions each · 76.8M operations per exhaustive search · under 2 s expected response time

A naive exhaustive search — comparing the query against every single memory — takes 5 to 10 seconds at this scale. Users expect conversational responses in under 2 seconds total, and memory retrieval is only one step in a longer pipeline. The search itself needs to complete in tens of milliseconds, not seconds.
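To make the cost concrete, here is a minimal NumPy sketch of the naive approach, assuming L2-normalized 768-dimensional embeddings so that a dot product gives cosine similarity. The data is random stand-in; the array names and sizes are illustrative, not the system's actual code.

```python
import numpy as np

# Sizes from the text; random vectors stand in for real embeddings.
N, D = 100_000, 768
rng = np.random.default_rng(0)

# Stored memory embeddings, L2-normalized so a dot product equals cosine similarity.
memories = rng.standard_normal((N, D), dtype=np.float32)
memories /= np.linalg.norm(memories, axis=1, keepdims=True)

def exhaustive_search(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Compare the query against every memory: N x D = 76.8M multiply-adds."""
    scores = memories @ query          # one dot product per stored memory
    return np.argsort(-scores)[:k]     # ids of the k most similar memories

# Sanity check: a query identical to memory 42 should return 42 first.
top = exhaustive_search(memories[42])
```

Every query pays the full N × D cost, which is what the clustering below is designed to avoid.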

How Clustering Works

The solution is to avoid searching every memory. Instead, the system pre-organizes memories into groups of similar content, then only searches the groups most likely to contain relevant results.

1. Training: Group by Similarity

All stored memory vectors are partitioned into roughly 100 clusters based on semantic similarity. Memories about combat group together. Memories about a specific friend group together. Memories about locations group together. This happens once during index construction, not at query time.

2. Query: Find Nearest Clusters First

When a search query arrives, the system first compares it against the 100 cluster centers — one comparison per cluster — to find the 5 most relevant groups. This narrows the search space before examining individual memories.

100 cluster comparisons → top 5 selected

3. Search: Exhaustive Within Top Clusters

Only the memories inside those 5 clusters are searched exhaustively. With roughly 1,000 memories per cluster, this means examining about 5,000 vectors instead of the full 100,000.

5 clusters × ~1,000 memories ≈ 5,000 comparisons
Total comparisons: 100 + 5,000 = 5,100 instead of 100,000, roughly a 20× speedup
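The three steps can be sketched end to end. This is a toy NumPy version with scaled-down sizes and a simple k-means loop, not the system's actual index; it only illustrates the two-stage structure.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 10_000, 64                  # toy sizes; the real system is 100,000 x 768
N_CLUSTERS, N_PROBE = 100, 5

vectors = rng.standard_normal((N, D), dtype=np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Step 1 (training): a few rounds of k-means to build the clusters.
centroids = vectors[rng.choice(N, N_CLUSTERS, replace=False)].copy()
for _ in range(10):
    assign = np.argmax(vectors @ centroids.T, axis=1)      # nearest centroid per vector
    for c in range(N_CLUSTERS):
        members = vectors[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

# Final assignment and inverted lists: which memory ids live in each cluster.
assign = np.argmax(vectors @ centroids.T, axis=1)
inverted = {c: np.where(assign == c)[0] for c in range(N_CLUSTERS)}

def clustered_search(query: np.ndarray, k: int = 5) -> np.ndarray:
    # Step 2: compare against the 100 cluster centers, keep the top 5 clusters.
    nearest = np.argsort(-(centroids @ query))[:N_PROBE]
    # Step 3: exhaustive search only inside those clusters.
    ids = np.concatenate([inverted[c] for c in nearest])
    scores = vectors[ids] @ query
    return ids[np.argsort(-scores)[:k]]
```

Note that the expensive work (the k-means loop) happens once at index-construction time; each query pays only the 100 centroid comparisons plus a few small cluster scans.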

The Accuracy Trade-Off

This is an approximate search. By only examining 5 out of 100 clusters, the system might miss a relevant memory that happens to sit in cluster number 6. In practice, approximate nearest-neighbor search finds roughly 95% of the true top results.

For companion conversation, this trade-off is more than acceptable. The companion does not need the single mathematically optimal memory. It needs a handful of relevant memories to ground its response. Missing one out of twenty barely changes the conversation. Cutting search time by an order of magnitude changes the user experience dramatically.

The Complete Optimization Stack

Clustering is one layer of a four-tier optimization strategy. Each layer targets a different bottleneck, and they stack multiplicatively.

L1: LRU Cache

Recent query results are cached in memory. If the same or very similar query arrives again — common in ongoing conversations about a single topic — the entire search is skipped. Achieves roughly a 30% hit rate in typical conversation patterns.

Cost: O(1). Impact: ~30% of queries eliminated.
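A minimal sketch of such a cache. The keying scheme here (strip plus lowercase) is an illustrative stand-in for however the real system decides two queries are "the same"; the class and method names are hypothetical.

```python
from collections import OrderedDict

class LRUQueryCache:
    """Tiny LRU cache for search results, keyed by a normalized query string."""

    def __init__(self, capacity=256):
        self.capacity = capacity
        self._store = OrderedDict()

    @staticmethod
    def _key(query):
        # Illustrative normalization so trivially-different queries share a key.
        return query.strip().lower()

    def get(self, query):
        key = self._key(query)
        if key not in self._store:
            return None                      # miss: fall through to a full search
        self._store.move_to_end(key)         # mark as most recently used
        return self._store[key]

    def put(self, query, results):
        key = self._key(query)
        self._store[key] = results
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
```

Because both `get` and `put` are O(1), the cache adds essentially no latency on a miss while eliminating the entire search on a hit.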
L2: Entity History Tracking

The perception system tracks which entities (players, objects) have already been reported. If nothing has changed in a companion's surroundings, the spatial context update is skipped entirely. This eliminates roughly 70% of redundant spatial processing.

Cost: O(1). Impact: ~70% reduction in spatial processing.
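One way to implement this kind of change detection, assuming each companion can report the set of entity ids it currently sees. The class and field names are hypothetical.

```python
class EntityHistoryTracker:
    """Per-companion snapshot of visible entities; unchanged surroundings
    mean the spatial context update can be skipped."""

    def __init__(self):
        self._last_seen = {}                 # companion id -> frozenset of entity ids

    def should_update(self, companion_id, visible_entities):
        snapshot = frozenset(visible_entities)
        if self._last_seen.get(companion_id) == snapshot:
            return False                     # nothing changed: skip the update
        self._last_seen[companion_id] = snapshot
        return True
```

Set comparison makes the check order-independent: the same entities reported in a different order still count as "no change".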
L3: Significance Filtering

Before vector search begins, low-significance memories are excluded from the candidate set. Casual greetings and filler exchanges (scored below 0.3) never enter the search pipeline. This reduces the working dataset by approximately 40%.

Cost: O(1). Impact: ~40% of candidates removed.
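The filter itself can be as simple as a threshold check; the 0.3 floor comes from the text, but the `significance` field name is an assumption about the memory schema.

```python
SIGNIFICANCE_FLOOR = 0.3   # threshold from the text

def searchable_memories(memories):
    """Drop low-significance memories before vector search ever sees them."""
    return [m for m in memories if m["significance"] >= SIGNIFICANCE_FLOOR]
```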
L4: Approximate Nearest Neighbor

The clustering approach described above, powered by FAISS. Instead of exhaustive comparison against all remaining memories, only the most promising clusters are searched. Provides the largest single speedup in the stack.

Cost: O(√n). Impact: ~20× search speedup.

The Complete Pipeline

From the moment a user sends a message to the moment the companion begins responding, nine steps execute in sequence. Most complete in single-digit milliseconds. The two most expensive steps — embedding generation and AI response generation — are bounded by external service latency rather than application logic.

1. Cache check: ~2 ms
2. Embedding generation: ~180 ms
3. Vector search (approximate nearest neighbor): ~30 ms
4. Keyword search (complementary exact match): ~20 ms
5. Hybrid scoring (70% vector + 30% keyword): ~5 ms
6. Significance and age filters: ~5 ms
7. Content fetch (full memory text retrieval): ~15 ms
8. Context assembly (identity + memories + recent): ~10 ms
9. AI generation begins (first token): ~500 ms

Keystroke to first response token: ~767 ms

Steps 3 through 8 — the retrieval pipeline proper — account for roughly 85 milliseconds combined. The latency budget is dominated by two external calls: generating the query embedding (step 2) and waiting for the language model to produce its first token (step 9). Everything the application controls directly runs in well under 100 milliseconds.
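Step 5's hybrid scoring, for instance, reduces to a weighted blend. A minimal sketch, assuming both input scores are already normalized to the range [0, 1]:

```python
def hybrid_score(vector_score: float, keyword_score: float) -> float:
    """Step 5's blend: 70% semantic similarity, 30% exact keyword match."""
    return 0.7 * vector_score + 0.3 * keyword_score

# A memory with a strong semantic match outranks one with only keyword overlap.
semantic_hit = hybrid_score(0.9, 0.1)   # 0.66
keyword_hit = hybrid_score(0.2, 1.0)    # 0.44
```

The 70/30 weighting favors meaning over surface wording while still letting exact matches (names, item ids) surface memories the embedding alone might miss.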

The Engineering Philosophy

Every optimization in this stack reflects a single principle: trade perfectionism for practicality. Each decision sacrifices a small amount of theoretical precision in exchange for a large improvement in user-perceived responsiveness.

Practical Over Perfect

95% accuracy: Approximate search finds 19 out of 20 optimal results. The missing one would not change the conversation.

~2K tokens: A small, curated context window outperforms a large, noisy one. Quality of context matters more than quantity.

70% dedup: Entity history tracking eliminates seven out of ten redundant spatial updates. Companions only process what has changed.

30% cache hits: Conversations naturally revisit topics. A simple LRU cache eliminates nearly a third of all search operations for free.

None of these optimizations are individually groundbreaking. Caching, filtering, deduplication, and approximate search are well-established techniques. The contribution is in their composition: four layers, each targeting a different cost center, stacking to deliver a retrieval pipeline that runs in under 100 milliseconds on a dataset that would take seconds to search naively.

The user does not see any of this. They see a companion that responds quickly, remembers what matters, and never stumbles over its own memory. That is the point. The best infrastructure is the kind no one notices.