The Problem with Text-Only Memory
Traditional chatbots store conversations as flat text and retrieve them with keyword search. This works for exact matches, but language is not exact. A user who mentions "sunny weather" expects the companion to recall conversations about "beautiful days" or "warm afternoons," even though neither phrase shares a single keyword with "sunny weather."
Keyword search fails silently. It does not tell you it missed something — it just returns nothing, and the companion sounds forgetful. For a virtual world where relationships develop over weeks and months, silent forgetting is a dealbreaker.
That is why poqpoq companions use embedding-based semantic memory: a system that understands meaning, not just spelling.
What Are Embeddings?
An embedding is a list of 768 numbers — a 768-dimensional vector — that represents the meaning of a piece of text. The embedding model reads a sentence and compresses everything it understands about that sentence into those 768 numbers.
The critical property: similar meanings produce nearby vectors. Two sentences about the same topic will have vectors that point in roughly the same direction, even if they use completely different words. Unrelated sentences will point in different directions.
```javascript
// Conceptual example — not actual vectors (768D is hard to draw)
embed("sunny weather")   // → [0.82, 0.41, -0.12, ...] region A
embed("beautiful day")   // → [0.79, 0.44, -0.09, ...] region A (close!)
embed("quantum physics") // → [-0.31, 0.67, 0.55, ...] region B (far away)

similarity("sunny weather", "beautiful day")   // → 0.93 (very similar)
similarity("sunny weather", "quantum physics") // → 0.11 (unrelated)
```
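The similarity score in that sketch is cosine similarity: the dot product of the two vectors divided by the product of their magnitudes. A minimal TypeScript sketch, with toy 3-dimensional vectors standing in for the real 768-dimensional ones (the `embed` and `similarity` calls above are conceptual; `cosineSimilarity` here is a hypothetical helper):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Mathematically it ranges
// from -1 to 1; for embedding vectors it is usually treated as 0.0-1.0.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3D stand-ins for the 768D embeddings in the example above.
const sunny = [0.82, 0.41, -0.12];
const beautiful = [0.79, 0.44, -0.09];
const quantum = [-0.31, 0.67, 0.55];

console.log(cosineSimilarity(sunny, beautiful)); // high (near 1.0)
console.log(cosineSimilarity(sunny, quantum));   // low (near 0.0)
```

Note that the score depends only on direction, not on vector length, which is why two sentences of very different lengths can still score as near-identical in meaning.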
This is how a companion can hear "remember when it rained?" and retrieve a memory stored as "the weather turned bad during our walk." No keyword overlap, perfect semantic match.
The Memory System
Companion memory operates in three stages: generate an embedding, store it alongside the original text, and retrieve using a hybrid search that combines vector similarity with keyword matching.
Stage 1: Embedding Generation
When a user sends a message, the system generates a 768-dimensional vector representing its meaning. This happens server-side using a dedicated embedding model. Critically, this work starts before the user presses Enter — the type-ahead system detects pauses in typing and begins precomputing the embedding. By the time the message arrives, the vector is usually already cached.
Stage 2: Memory Storage
Every memory is stored as both human-readable text (so the AI model can incorporate it into conversation) and a numeric embedding (so the search system can find it by meaning). The data model:
| Field | Type | Purpose |
|---|---|---|
| id | uuid | Unique memory identifier |
| content | text | The original message or event, readable by the AI |
| embedding | vector(768) | Semantic vector for similarity search |
| significance | float | How important this memory is (0.0 to 1.0) |
| memory_type | enum | Category: conversation, observation, event, relationship |
| created_at | timestamp | When this memory was formed |
The embedding column uses a vector data type backed by a specialized index for fast approximate nearest-neighbor search. A query like "find memories similar to this sentence" runs against the index and returns results in single-digit milliseconds, even across tens of thousands of stored memories.
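As an illustration only, the record shape can be expressed as a TypeScript type. The field names mirror the table above; the validation helper is a hypothetical addition, not part of the described system:

```typescript
// Record shape mirroring the data model table (illustrative types only;
// the real storage layer is a database table with a vector(768) column).
type MemoryType = "conversation" | "observation" | "event" | "relationship";

interface MemoryRecord {
  id: string;            // uuid
  content: string;       // original message or event, readable by the AI
  embedding: number[];   // 768 numbers (the vector(768) column)
  significance: number;  // 0.0 to 1.0
  memory_type: MemoryType;
  created_at: Date;
}

// Hypothetical pre-insert sanity check a storage layer might run.
function isValidMemory(m: MemoryRecord): boolean {
  return (
    m.embedding.length === 768 &&
    m.significance >= 0.0 &&
    m.significance <= 1.0
  );
}
```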
Stage 3: Hybrid Search
Neither pure vector search nor pure keyword search is sufficient on its own. Vector search captures meaning but can miss exact references (proper nouns, specific numbers). Keyword search catches exact matches but misses paraphrases. The system combines both:
```javascript
final_score = (0.70 * vector_similarity) + (0.30 * keyword_match)

// Vector similarity: cosine similarity between the query embedding
// and each stored memory's embedding (0.0 to 1.0)
// Keyword match: normalized overlap between query tokens
// and memory text tokens (0.0 to 1.0)
```
The 70/30 weighting was determined empirically. It biases toward semantic understanding while keeping a strong enough keyword signal that a companion never "forgets" a name or a place that was mentioned explicitly.
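To make the weighting concrete, here is a small self-contained sketch of the scoring function. The tokenizer and the in-memory cosine computation are illustrative stand-ins; in the described system the vector half runs against the database index:

```typescript
// Vector half: cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Keyword half: fraction of query tokens that appear in the memory text
// (one simple way to get a normalized 0.0-1.0 overlap score).
function keywordMatch(query: string, memoryText: string): number {
  const tokenize = (s: string) => s.toLowerCase().split(/\W+/).filter(Boolean);
  const queryTokens = tokenize(query);
  const memoryTokens = new Set(tokenize(memoryText));
  if (queryTokens.length === 0) return 0;
  const hits = queryTokens.filter((t) => memoryTokens.has(t)).length;
  return hits / queryTokens.length;
}

// Combined score with the 70/30 weighting from the formula above.
function hybridScore(
  queryEmbedding: number[], memoryEmbedding: number[],
  queryText: string, memoryText: string,
): number {
  return (
    0.7 * cosineSimilarity(queryEmbedding, memoryEmbedding) +
    0.3 * keywordMatch(queryText, memoryText)
  );
}
```

With this shape, a paraphrase scores well through the vector term even when `keywordMatch` is near zero, while an exact proper-noun match still gets a guaranteed boost from the keyword term.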
Significance Scoring
Not every utterance deserves to be remembered. "hmm" and "ok" are conversational filler. "I promise I'll help you with your garden tomorrow" is a commitment that should persist. The significance scorer assigns a weight to each incoming message based on its content.
Scores are additive and capped at 1.0. A worked example for "I promise I'll help you with your garden tomorrow":

Commitment language ("I promise", "I'll"): +0.15
Emotional connotation of helping: +0.20
Total: 0.35
The significance score is stored alongside the memory and used again during retrieval. Low-significance memories are the first to be filtered out when the system assembles context for the AI model.
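A sketch of what an additive, capped scorer could look like. The cue patterns and the third weight are invented for illustration; only the commitment (+0.15) and emotional (+0.20) weights come from the example above:

```typescript
// Additive significance scoring, capped at 1.0.
// The cue list below is illustrative, not the production set.
const CUES: Array<{ pattern: RegExp; weight: number }> = [
  { pattern: /\b(i'll|i will|i promise)\b/i, weight: 0.15 },      // commitment
  { pattern: /\b(help|love|miss|thank)\w*\b/i, weight: 0.2 },     // emotional
  { pattern: /\b(tomorrow|tonight|next week)\b/i, weight: 0.1 },  // future plan
];

function significance(message: string): number {
  let score = 0;
  for (const cue of CUES) {
    if (cue.pattern.test(message)) score += cue.weight;
  }
  return Math.min(score, 1.0); // capped at 1.0
}

console.log(significance("hmm")); // → 0 (filler matches no cue)
console.log(significance("I promise I'll help you with your garden tomorrow"));
```

The same idea scales to a larger cue set, or to a small classifier model; the important property is that filler bottoms out at zero while commitments and emotional content accumulate weight.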
Three Layers of Decay
Human memory is not a perfect archive. We forget minor details quickly, recall important events for years, and keep recent conversations easily accessible. Companion memory mimics this with three layers of decay.
Together, these three layers ensure that a companion's working memory stays focused and relevant. It remembers what matters, forgets what does not, and keeps the most recent context instantly available.
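The exact layers are internal to the system, but as a rough sketch, decay can be applied at retrieval time. The 0.3 significance floor and 90-day age cutoff below match the filters in the pipeline summary later in this article; the exponential half-life is purely illustrative:

```typescript
// Sketch of retrieval-time decay. The significance floor (0.3) and hard
// age cutoff (90 days) mirror the pipeline's filters; the 30-day
// half-life is an assumed value, not the production constant.
interface StoredMemory {
  content: string;
  significance: number; // 0.0 to 1.0 at storage time
  created_at: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Significance fades exponentially with age.
function decayedSignificance(m: StoredMemory, now: Date): number {
  const ageDays = (now.getTime() - m.created_at.getTime()) / DAY_MS;
  const halfLifeDays = 30; // illustrative
  return m.significance * Math.pow(0.5, ageDays / halfLifeDays);
}

// A memory survives only if it is both recent enough and still
// significant enough after decay.
function survivesDecay(m: StoredMemory, now: Date): boolean {
  const ageDays = (now.getTime() - m.created_at.getTime()) / DAY_MS;
  return ageDays <= 90 && decayedSignificance(m, now) >= 0.3;
}
```

Under these assumed constants, a 0.9-significance memory survives for about two months before fading below the floor, while a low-significance one drops out within days, which matches the forget-the-minor, keep-the-important behavior described above.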
The Type-Ahead Optimization
Latency is the enemy of conversational flow. If a companion takes three seconds to respond, the illusion of a living character breaks. The type-ahead optimization attacks this problem by starting work before the user finishes typing.
The system monitors the WebSocket connection for keystroke activity. When it detects a 500-millisecond pause in typing, it assumes the user is composing a thought and begins precomputing:
The partial message text is sent to the embedding model. A 768D vector is generated (roughly 200ms). That vector is immediately used to run a hybrid search against the memory store. The top candidate memories are pulled and cached.
By the time the user presses Enter, the system already has both the embedding and the relevant context assembled. The only remaining work is sending the final message to the AI model for response generation.
If the user keeps typing and changes the meaning of their message, the cache is invalidated and regenerated on the next pause. The 60-second TTL on the type-ahead cache prevents stale results from lingering.
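A condensed sketch of the pause detection and cache logic described above. Timestamps are passed in explicitly to keep the logic easy to test, and the embedding call is stubbed out; the real system does this work server-side against keystroke events on the WebSocket:

```typescript
// Type-ahead precompute cache: a 500ms typing pause triggers
// precomputation, and cached results expire after 60 seconds.
const PAUSE_MS = 500;
const CACHE_TTL_MS = 60_000;

interface CacheEntry {
  text: string;        // the partial message the entry was built from
  embedding: number[]; // precomputed 768D vector (stubbed below)
  createdAt: number;   // ms timestamp
}

class TypeAheadCache {
  private lastKeystrokeAt = 0;
  private entry: CacheEntry | null = null;

  onKeystroke(now: number): void {
    this.lastKeystrokeAt = now;
  }

  // Called periodically; precomputes once the user has paused for 500ms
  // and the cached entry no longer matches what they have typed.
  maybePrecompute(partialText: string, now: number): void {
    const paused = now - this.lastKeystrokeAt >= PAUSE_MS;
    const stale = this.entry === null || this.entry.text !== partialText;
    if (paused && stale) {
      // Stand-in for the real embedding call (~200ms server-side).
      const embedding = new Array(768).fill(0);
      this.entry = { text: partialText, embedding, createdAt: now };
    }
  }

  // A cache hit requires matching text (the meaning has not changed)
  // and an entry younger than the 60-second TTL.
  get(finalText: string, now: number): CacheEntry | null {
    if (!this.entry) return null;
    if (this.entry.text !== finalText) return null;             // invalidated
    if (now - this.entry.createdAt > CACHE_TTL_MS) return null; // expired
    return this.entry;
  }
}
```

Passing `now` as a parameter rather than calling `Date.now()` internally is a common design choice that makes timing-dependent logic like this deterministic under test.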
Complete Flow: Keystroke to Response
Here is the full pipeline from the moment a user starts typing to the moment the companion begins responding.
1. Type-ahead detects a 500ms pause and generates the embedding (~200ms)
2. Hybrid search runs: 70% vector similarity + 30% keyword match (~15ms)
3. Results filtered: significance >= 0.3, age <= 90 days (<1ms)
4. Context assembled: top 10 memories + user identity + companion personality (<5ms)
5. User presses Enter; message sent over WebSocket with pre-computed context (<50ms)
6. AI model generates a response, streamed back token by token (~1.5s to first token)
7. New memory stored: original text + embedding + significance score (~100ms, async)
8. Type-ahead cache updated for follow-up queries (60s TTL)
Steps 1 through 4 happen before the user sends the message. The perceived latency is dominated by step 6 (the AI model's generation time), which is outside the memory system's control. Everything the memory system can control is already complete before the message leaves the client.