Chapter 2: Memory & Embeddings

Semantic memory, significance scoring, and the three layers of decay that keep companion recall human-like

The Problem with Text-Only Memory

Traditional chatbots store conversations as flat text and retrieve them with keyword search. This works for exact matches, but language is not exact. A user who mentions "sunny weather" expects the companion to recall conversations about "beautiful days" or "warm afternoons" or "rainfall," even though none of those phrases share a single keyword.

Keyword search fails silently. It does not tell you it missed something — it just returns nothing, and the companion sounds forgetful. For a virtual world where relationships develop over weeks and months, silent forgetting is a dealbreaker.

That is why poqpoq companions use embedding-based semantic memory: a system that understands meaning, not just spelling.

What Are Embeddings?

An embedding is a list of 768 numbers — a 768-dimensional vector — that represents the meaning of a piece of text. The embedding model reads a sentence and compresses everything it understands about that sentence into those 768 numbers.

The critical property: similar meanings produce nearby vectors. Two sentences about the same topic will have vectors that point in roughly the same direction, even if they use completely different words. Unrelated sentences will point in different directions.

// Conceptual example — not actual vectors (768D is hard to draw)

embed("sunny weather")     // → [0.82, 0.41, -0.12, ...]  region A
embed("beautiful day")     // → [0.79, 0.44, -0.09, ...]  region A  (close!)
embed("quantum physics")   // → [-0.31, 0.67, 0.55, ...]  region B  (far away)

similarity("sunny weather", "beautiful day")    // → 0.93  (very similar)
similarity("sunny weather", "quantum physics")  // → 0.11  (unrelated)

This is how a companion can hear "remember when it rained?" and retrieve a memory stored as "the weather turned bad during our walk." No keyword overlap, perfect semantic match.
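"Nearby" is usually measured with cosine similarity: the angle between two vectors, ignoring their length. Here is a minimal sketch in Python, using toy 3-dimensional vectors in place of real 768-dimensional embeddings (the numbers mirror the conceptual example above and are illustrative, not actual model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional stand-ins for real 768-dimensional embeddings.
sunny   = [0.82, 0.41, -0.12]   # region A
day     = [0.79, 0.44, -0.09]   # region A (close to sunny)
physics = [-0.31, 0.67, 0.55]   # region B (far away)

close = cosine_similarity(sunny, day)      # near 1.0: same direction
far   = cosine_similarity(sunny, physics)  # near 0.0: unrelated
```

Because cosine similarity depends only on direction, two sentences embedded into the same semantic "region" score high even when their raw numbers differ.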

The Memory System

Companion memory operates in three stages: generate an embedding, store it alongside the original text, and retrieve using a hybrid search that combines vector similarity with keyword matching.

Stage 1: Embedding Generation

When a user sends a message, the system generates a 768-dimensional vector representing its meaning. This happens server-side using a dedicated embedding model. Critically, this work starts before the user presses Enter — the type-ahead system detects pauses in typing and begins precomputing the embedding. By the time the message arrives, the vector is usually already cached.

Stage 2: Memory Storage

Every memory is stored as both human-readable text (so the AI model can incorporate it into conversation) and a numeric embedding (so the search system can find it by meaning). The data model:

Field          Type           Purpose
id             uuid           Unique memory identifier
content        text           The original message or event, readable by the AI
embedding      vector(768)    Semantic vector for similarity search
significance   float          How important this memory is (0.0 to 1.0)
memory_type    enum           Category: conversation, observation, event, relationship
created_at     timestamp      When this memory was formed

The embedding column uses a vector data type backed by a specialized index for fast approximate nearest-neighbor search. A query like "find memories similar to this sentence" runs against the index and returns results in single-digit milliseconds, even across tens of thousands of stored memories.

Stage 3: Hybrid Search

Neither pure vector search nor pure keyword search is sufficient on its own. Vector search captures meaning but can miss exact references (proper nouns, specific numbers). Keyword search catches exact matches but misses paraphrases. The system combines both:

final_score = (0.70 * vector_similarity) + (0.30 * keyword_match)

// Vector similarity: cosine distance between the query embedding
// and each stored memory's embedding (0.0 to 1.0)

// Keyword match: normalized overlap between query tokens
// and memory text tokens (0.0 to 1.0)

The 70/30 weighting was determined empirically. It biases toward semantic understanding while keeping a strong enough keyword signal that a companion never "forgets" a name or a place that was mentioned explicitly.
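As a sketch, the blended score might be computed as follows. The `keyword_overlap` helper is a deliberately simplified token-overlap stand-in (a production system would normalize punctuation, stem words, and weight rare terms), and the place name in the second example is hypothetical:

```python
def keyword_overlap(query: str, memory_text: str) -> float:
    """Fraction of query tokens that appear verbatim in the memory (0.0 to 1.0).
    Simplified stand-in for real lexical matching."""
    q = set(query.lower().split())
    m = set(memory_text.lower().split())
    return len(q & m) / len(q) if q else 0.0

def hybrid_score(vector_similarity: float, query: str, memory_text: str) -> float:
    """70/30 blend of semantic and lexical signals."""
    return 0.70 * vector_similarity + 0.30 * keyword_overlap(query, memory_text)

# A paraphrase: high vector similarity carries the score
# even with zero keyword overlap.
paraphrase = hybrid_score(0.93, "remember when it rained",
                          "the weather turned bad during our walk")

# An exact proper noun: keyword overlap boosts a weaker semantic match.
exact = hybrid_score(0.40, "tell me about Oakvale",
                     "we visited Oakvale market last week")
```

The first query scores 0.70 × 0.93 from the vector term alone; the second picks up a lexical bonus for the shared name, which is exactly the signal that keeps explicitly mentioned names and places retrievable.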

Significance Scoring

Not every utterance deserves to be remembered. "hmm" and "ok" are conversational filler. "I promise I'll help you with your garden tomorrow" is a commitment that should persist. The significance scorer assigns a weight to each incoming message based on its content.

Signal                                                            Weight
High-significance phrases (promise, trust, remember, important)   +0.30
Emotional content (happy, sad, afraid, love, angry)               +0.20
Commitments (will, shall, decided, going to)                      +0.15
Questions (any interrogative)                                     +0.10
Substantial length (more than 100 characters)                     +0.10

Scores are additive and capped at 1.0. Two examples:

Example A: "hmm"
No high-significance phrases, no emotion, no commitment, not a question, only 3 characters.
Significance: 0.00 — not stored.

Example B: "I promise I'll help you with your garden tomorrow"
"promise" (high-significance phrase): +0.30
"I'll" (commitment): +0.15
Emotional connotation of helping: +0.20
Significance: 0.65 — stored and will persist.

The significance score is stored alongside the memory and used again during retrieval. Low-significance memories are the first to be filtered out when the system assembles context for the AI model.
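A word-list version of the scorer, built directly from the signal table above, might look like this. Note its limits: it captures only surface keywords, so it credits Example B's "promise" and "I'll" but not the +0.20 emotional connotation of helping, which requires richer detection than a flat word list:

```python
HIGH_SIGNIFICANCE = {"promise", "trust", "remember", "important"}
EMOTIONAL = {"happy", "sad", "afraid", "love", "angry"}
COMMITMENT = {"will", "shall", "decided", "going"}

def significance(message: str) -> float:
    """Additive significance score from the signal table, capped at 1.0."""
    text = message.lower().replace("'ll", " will")  # expand contractions crudely
    tokens = set(text.split())
    score = 0.0
    if tokens & HIGH_SIGNIFICANCE:
        score += 0.30   # high-significance phrase
    if tokens & EMOTIONAL:
        score += 0.20   # emotional content
    if tokens & COMMITMENT:
        score += 0.15   # commitment
    if "?" in text:
        score += 0.10   # question
    if len(message) > 100:
        score += 0.10   # substantial length
    return min(score, 1.0)

filler = significance("hmm")                                          # 0.0
promise = significance("I promise I'll help you with your garden tomorrow")
# ≈ 0.45 here: keyword signals only, without the emotional-connotation bonus
```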

Three Layers of Decay

Human memory is not a perfect archive. We forget minor details quickly, recall important events for years, and keep recent conversations easily accessible. Companion memory mimics this with three layers of decay.

Layer 1: Type-Ahead Cache (TTL: 60 seconds)
The fastest layer. Holds recently computed embeddings and search results for rapid follow-up queries. If a user asks about the weather and then immediately asks a related question, the cache provides instant recall without re-searching the database.

Layer 2: Age-Based Filtering (window: 90 days)
Memories older than 90 days are excluded from standard search results. This balances recency (the companion prioritizes what happened last week over what happened six months ago) with continuity (90 days is long enough to span a story arc or relationship development).

Layer 3: Significance Decay (soft decay, weighted)
Within the 90-day window, memories are further ranked by significance. A 0.65-significance promise from last month will surface ahead of a 0.10-significance "ok" from yesterday. High-significance memories effectively persist; trivial ones fade.

Together, these three layers ensure that a companion's working memory stays focused and relevant. It remembers what matters, forgets what does not, and keeps the most recent context instantly available.
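Layers 2 and 3 can be sketched as a single retrieval filter. The `Memory` dataclass is a hypothetical stand-in for the stored rows, and the 0.3 significance floor matches the filtering step in the pipeline at the end of this chapter:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Memory:
    content: str
    significance: float
    created_at: datetime

def eligible(memories, now, max_age_days=90, min_significance=0.3):
    """Layers 2 and 3: drop memories past the age window or below the
    significance floor, then rank survivors by significance."""
    cutoff = now - timedelta(days=max_age_days)
    fresh = [m for m in memories
             if m.created_at >= cutoff and m.significance >= min_significance]
    return sorted(fresh, key=lambda m: m.significance, reverse=True)

now = datetime(2025, 6, 1)
memories = [
    Memory("I promise I'll help you with your garden tomorrow",
           0.65, now - timedelta(days=30)),                  # significant, in window
    Memory("ok", 0.10, now - timedelta(days=1)),             # recent but trivial
    Memory("festival story", 0.90, now - timedelta(days=120)),  # too old
]
recalled = eligible(memories, now)  # only the 0.65 promise survives
```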

The Type-Ahead Optimization

Latency is the enemy of conversational flow. If a companion takes three seconds to respond, the illusion of a living character breaks. The type-ahead optimization attacks this problem by starting work before the user finishes typing.

The system monitors the WebSocket connection for keystroke activity. When it detects a 500-millisecond pause in typing, it assumes the user is composing a thought and begins precomputing:

On pause detection

The partial message text is sent to the embedding model. A 768D vector is generated (roughly 200ms). That vector is immediately used to run a hybrid search against the memory store. The top candidate memories are pulled and cached.

By the time the user presses Enter, the system already has both the embedding and the relevant context assembled. The only remaining work is sending the final message to the AI model for response generation.

If the user keeps typing and changes the meaning of their message, the cache is invalidated and regenerated on the next pause. The 60-second TTL on the type-ahead cache prevents stale results from lingering.
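The pause detection is a classic debounce. Here is a minimal asyncio sketch under stated assumptions: the 500 ms threshold comes from the text, while `precompute` is a hypothetical stand-in for the embed-search-cache work:

```python
import asyncio

PAUSE_SECONDS = 0.5  # the 500 ms typing-pause threshold

class TypeAhead:
    """Debounced precompute: run work only after a pause in typing."""

    def __init__(self, precompute):
        self._precompute = precompute  # async fn: partial text -> None
        self._task = None

    def on_keystroke(self, partial_text: str):
        # Each keystroke cancels the pending timer, so precompute
        # fires only after PAUSE_SECONDS of silence.
        if self._task and not self._task.done():
            self._task.cancel()
        self._task = asyncio.ensure_future(self._debounced(partial_text))

    async def _debounced(self, partial_text: str):
        await asyncio.sleep(PAUSE_SECONDS)
        await self._precompute(partial_text)

async def demo():
    computed = []

    async def precompute(text):
        # Stand-in for: embed(text), run hybrid search, cache results.
        computed.append(text)

    ta = TypeAhead(precompute)
    ta.on_keystroke("remember wh")              # superseded before the pause elapses
    await asyncio.sleep(0.1)
    ta.on_keystroke("remember when it rained?")
    await asyncio.sleep(1.0)                    # long enough for the pause to fire
    return computed

results = asyncio.run(demo())  # only the final, settled text was precomputed
```

Cancelling the pending task on every keystroke is what makes stale precomputes cheap: work only ever runs against text the user actually paused on.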

Complete Flow: Keystroke to Response

Here is the full pipeline from the moment a user starts typing to the moment the companion begins responding.

  1. Type-ahead detects a 500ms pause and generates the embedding (~200ms)
  2. Hybrid search runs: 70% vector similarity + 30% keyword match (~15ms)
  3. Results filtered: significance >= 0.3, age <= 90 days (<1ms)
  4. Context assembled: top 10 memories + user identity + companion personality (<5ms)
  5. User presses Enter; message sent over WebSocket with pre-computed context (<50ms)
  6. AI model generates a response, streamed back token by token (~1.5s to first token)
  7. New memory stored: original text + embedding + significance score (~100ms, async)
  8. Type-ahead cache updated for follow-up queries (60s TTL)
Keystroke to first response token: approximately 2 seconds

Steps 1 through 4 happen before the user sends the message. The perceived latency is dominated by step 6 (the AI model's generation time), which is outside the memory system's control. Everything the memory system can control is already complete before the message leaves the client.
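The budget above can be checked with simple arithmetic: only the post-Enter steps count toward perceived latency, and the full sum lands near the stated 2-second figure. (Timings are the approximate values from the list; the async memory write in step 7 is off the critical path and excluded.)

```python
# Approximate step timings from the pipeline above, in milliseconds.
pre_enter  = {"embedding": 200, "hybrid_search": 15, "filtering": 1, "context": 5}
post_enter = {"websocket_send": 50, "first_token": 1500}

perceived_ms = sum(post_enter.values())            # what the user waits after Enter
total_ms = perceived_ms + sum(pre_enter.values())  # all work, including precompute

print(perceived_ms)  # 1550
print(total_ms)      # 1771 — roughly the "approximately 2 seconds" figure
```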