Chapter 8

RAG Architecture & Knowledge Seeding

How to give AI companions verifiable, updatable world knowledge without retraining the model -- and why retrieval-augmented generation is the right trade-off for dynamic virtual environments.

The Breakthrough

AI companions need to answer factual questions about the world they inhabit. "What is this platform?" "How much does it cost?" "How do I get started?" Without a knowledge foundation, companions either hallucinate answers or deflect with "I don't know."

The solution is Retrieval-Augmented Generation (RAG): a pipeline that retrieves relevant facts from a knowledge base and injects them into the language model's prompt. Companions answer from verified sources, not imagination. Knowledge updates happen through editing a file, not retraining a model.

Three Approaches to AI Knowledge

When you need an AI system to "know" something, you have three main architectural options. Each involves a different trade-off between cost, speed, accuracy, and maintainability.

1. Fine-Tuning (Bake It Into the Weights)

Retrain the language model on your specific data, encoding facts directly into billions of parameters. Inference is fast because there are no external lookups, but every knowledge update requires expensive retraining. You cannot easily verify what the model "learned," and it may confidently produce wrong answers. For a platform FAQ that changes monthly, fine-tuning is overkill.

2. Raw Embeddings (Search Without Grounding)

Store knowledge entries as embedding vectors and search for semantically relevant results, but do not inject them into the prompt. The AI still generates answers from its own training data. Search is fast and storage is cheap, but the model may ignore retrieved results entirely and hallucinate its own answer. There is no way to trace responses back to authoritative sources.

3. RAG: Retrieval-Augmented Generation

Retrieve relevant knowledge via semantic search, augment the prompt with the retrieved facts, then generate an answer grounded in that context. Every answer traces back to its source. Updates require editing a knowledge file and re-seeding, not retraining. The cost is minimal: embedding generation is fractions of a cent per query, and vector search completes in single-digit milliseconds.

Why RAG Wins for Virtual Worlds

RAG occupies the sweet spot for dynamic, verifiable, updatable knowledge at scale. Five factors drove the decision to build on RAG.

  1. Knowledge changes frequently -- pricing, features, and platform details evolve. Editing a structured file is trivial; retraining a model is not.
  2. Multiple companion personalities -- each AI character needs access to shared platform knowledge plus their own unique perspective. RAG supports scoped and global knowledge in the same search query.
  3. Verifiability matters -- developers need to audit exactly what companions "know." Every RAG answer traces to a specific source entry.
  4. Cost efficiency -- the only costs are embedding generation and vector search, both negligible compared to fine-tuning.
  5. Hybrid search -- RAG naturally combines user conversation history with system knowledge, returning both personal memories and factual answers in a single query.

Architecture: From Knowledge File to Semantic Search

The RAG pipeline has five stages. Knowledge starts as a structured file, passes through an embedding pipeline, lands in a vector database, and surfaces in the language model's prompt at query time.

  1. Knowledge Creation -- structured knowledge files holding question/answer pairs, keywords, and significance scores.
  2. Seeding Pipeline -- parse entries, generate embeddings, compute significance, deduplicate by hash.
  3. Vector Database -- embedding vectors, a full-text index, significance scores, and companion scoping.
  4. Hybrid Search -- vector similarity (70%) plus keyword matching (30%), returning user and system results together.
  5. Prompt Augmentation -- companion identity, retrieved knowledge, and the user question combine into a grounded response.
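The seeding stage above can be sketched in a few lines. This is a minimal illustration, not the platform's actual code: `embed` and `store` are stand-ins for a real embedding model and vector database, and the field names are assumptions.

```python
import hashlib

def content_hash(question: str, answer: str) -> str:
    """Stable hash of an entry, used to skip anything already seeded."""
    return hashlib.sha256(f"{question}\n{answer}".encode("utf-8")).hexdigest()

def seed(entries, embed, store):
    """Parse -> embed -> score -> dedupe, mirroring the pipeline stages.

    `embed` and `store` stand in for a real embedding model and vector
    database; returns the number of newly inserted entries.
    """
    seen = set()
    inserted = 0
    for entry in entries:
        h = content_hash(entry["question"], entry["answer"])
        if h in seen or store.exists(h):  # deduplicate by hash
            continue
        seen.add(h)
        store.insert(
            hash=h,
            vector=embed(entry["question"] + " " + entry["answer"]),
            keywords=entry.get("keywords", []),
            significance=entry.get("significance", 0.75),
        )
        inserted += 1
    return inserted
```

Hashing question and answer together means re-running the seeder after editing the knowledge file is idempotent: unchanged entries are skipped, edited ones are inserted fresh.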

The Knowledge Schema

Knowledge is stored in structured files that are human-readable, version-controlled, and editable without touching code. Each entry contains a question, an answer, keywords for hybrid search, and a significance score.

Why structured files? Non-technical team members can update knowledge without deploying code. Version control tracks every change with clean diffs. Comments explain why entries exist. The format is structured enough for automation yet readable enough for humans.

Entry Anatomy

Each knowledge entry carries several fields that serve different parts of the pipeline.
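A sketch of one entry's shape, assuming hypothetical field names (the chapter describes the fields but not their exact names):

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeEntry:
    """One knowledge-base entry; field names are illustrative guesses."""
    question: str                  # what a user might ask
    answer: str                    # the authoritative response
    keywords: list = field(default_factory=list)  # feeds keyword half of hybrid search
    significance: float = 0.75     # ranking weight; platform facts sit in 0.75-0.95
    companion_id: str = "global"   # "global" or a specific companion's scope
```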

Hybrid Search: Vectors + Keywords

The retrieval query combines two strategies in a single database operation. Vector similarity (70% weight) finds semantically related entries through approximate nearest-neighbor search. Keyword matching (30% weight) ensures exact terms are never missed. The combined score determines ranking.
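The weighted combination can be expressed directly; a minimal sketch, assuming both signals are normalized to [0, 1] and mirroring the top-10 ranking described below:

```python
VECTOR_WEIGHT = 0.70   # semantic similarity
KEYWORD_WEIGHT = 0.30  # exact-term matching

def hybrid_score(vector_similarity: float, keyword_rank: float) -> float:
    """Weighted combination of the two retrieval signals."""
    return VECTOR_WEIGHT * vector_similarity + KEYWORD_WEIGHT * keyword_rank

def rank(results, limit=10):
    """Order candidates by combined score and keep the top entries."""
    return sorted(results,
                  key=lambda r: hybrid_score(r["sim"], r["kw"]),
                  reverse=True)[:limit]
```

In production this combination typically runs inside the database query itself rather than in application code, so only the top-ranked rows cross the wire.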

The Scoping Problem

A critical early bug: when searching for a specific companion (say, "Apollo"), the query filter used companion_id = 'apollo', which excluded all global knowledge entries. Apollo could not find platform FAQs even though they existed in the database.

The fix was an OR condition: (companion_id = 'apollo' OR companion_id = 'global'). This ensures every companion sees both their own scoped knowledge and the shared global knowledge base.
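The corrected filter looks roughly like this. Column names are illustrative (the chapter gives the bug and fix, not the schema), and real code should use bound parameters rather than string interpolation:

```python
def companion_scope_filter(companion_id: str) -> str:
    """Build the scope clause for the knowledge search.

    Illustrative column names; use bound parameters in real code.
    """
    if companion_id == "global":
        return "companion_id = 'global'"
    # Buggy version: "companion_id = '<id>'" -- that filter silently
    # hid every global entry from scoped companions.
    return f"(companion_id = '{companion_id}' OR companion_id = 'global')"
```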

Search Architecture

The search query filters on three dimensions simultaneously.

Results are ordered by the weighted combination of vector similarity and keyword rank, limited to the top 10 entries.

Performance Profile

  - Vector search: ~2ms
  - Embedding generation: ~12ms
  - Total RAG overhead: ~18ms
  - Share of total pipeline: 2.5%

The complete RAG overhead -- embedding the query, searching the vector index, and returning results -- adds roughly 18 milliseconds to a pipeline where model inference alone takes 500ms or more. RAG is effectively free in terms of user-perceived latency.

Why the Vector Index is Fast

An IVFFlat (Inverted File with Flat quantization) index partitions the vector space into clusters. At query time, only the nearest clusters are searched rather than every vector in the table. For a dataset of hundreds to tens of thousands of entries, this delivers consistent sub-5ms search times.
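A toy pure-Python illustration of the idea, not pgvector's actual implementation: vectors are bucketed under the nearest of a few fixed centroids, and a query probes only the closest bucket instead of scanning every vector (a real index learns its centroids from the data and probes several lists).

```python
import math

class TinyIVF:
    """Toy inverted-file index demonstrating cluster-then-probe search."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = {i: [] for i in range(len(centroids))}

    def _nearest_centroid(self, v):
        return min(range(len(self.centroids)),
                   key=lambda i: math.dist(v, self.centroids[i]))

    def add(self, v):
        # Each vector lands in exactly one cluster's bucket.
        self.buckets[self._nearest_centroid(v)].append(v)

    def search(self, q):
        """Scan only the bucket whose centroid is closest to the query."""
        bucket = self.buckets[self._nearest_centroid(q)]
        return min(bucket, key=lambda v: math.dist(q, v), default=None)
```

Probing one cluster trades a little recall for a large speedup; production indexes expose a "probes" knob to tune that trade-off.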

Embedding Dimension Trade-offs

A practical decision every RAG system faces: what dimensionality for embeddings? Higher dimensions capture more semantic nuance but cost more in compute, memory, and latency.

Factor               Lower Dimension (384D)           Higher Dimension (768D)
Generation speed     ~12ms per embedding              ~50ms cached, much slower cold
Memory per vector    1.5 KB                           3.0 KB
Semantic quality     Excellent for FAQ matching       Marginally better (~2%)
At 100K entries      ~150 MB                          ~300 MB
Best for             Real-time queries, production    Research, batch processing

For real-time virtual world companions where every query must complete in under 20 milliseconds, the lower-dimensional model is the right choice. The marginal quality improvement of higher dimensions does not justify the latency and memory cost. Save 768D for offline batch analysis or research contexts where speed is not a constraint.
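The memory figures follow directly from 4-byte float32 storage: 384 dims x 4 bytes = 1,536 bytes (~1.5 KB) per vector, and 100K such vectors come to roughly 146 MiB (the table's ~150 MB). A quick back-of-envelope check:

```python
BYTES_PER_FLOAT32 = 4  # assumes float32 storage, typical for vector stores

def vector_bytes(dims: int) -> int:
    """Raw storage for one embedding vector."""
    return dims * BYTES_PER_FLOAT32

def dataset_mb(entries: int, dims: int) -> float:
    """Vector storage for a whole dataset, in mebibytes."""
    return entries * vector_bytes(dims) / (1024 * 1024)
```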

Core Knowledge vs. User Memory

The knowledge base coexists with user conversation memories in the same database table and search query. Two mechanisms prevent them from interfering with each other.

1. System Identifier Isolation

Knowledge base entries are stored under a reserved system identifier, separate from any real user. The search query includes both the current user's ID and the system ID, returning personal memories and factual knowledge in a single pass. No second query needed.
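A sketch of the single-pass retrieval, with `SYSTEM_ID` as a placeholder for whatever reserved identifier the platform actually uses:

```python
SYSTEM_ID = "system-knowledge"  # reserved identifier; actual value is an assumption

def retrieve_candidates(rows, user_id):
    """Single pass returns the user's personal memories plus the shared
    knowledge base -- no second query needed."""
    wanted = {user_id, SYSTEM_ID}
    return [r for r in rows if r["user_id"] in wanted]
```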

2. Core Tier Protection

Knowledge entries are tagged with a core tier that the memory management system respects absolutely. While old user memories may be compressed or archived over time, core knowledge is never deleted, never compressed, and never expires. Platform FAQs remain available regardless of how much time passes.
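The protection rule reduces to one guard in the pruning path; a minimal sketch, with hypothetical field names:

```python
def prune_memories(rows, cutoff_age_days):
    """Age out old user memories, but never touch core knowledge."""
    kept = []
    for r in rows:
        if r.get("tier") == "core":
            kept.append(r)  # core tier: never deleted, never expires
        elif r["age_days"] <= cutoff_age_days:
            kept.append(r)
    return kept
```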

Significance Scoring

Without proper significance scoring, a casual conversational memory ("we chatted about the weather") might outrank a critical FAQ entry ("what is this platform"). The scoring system ensures that important knowledge always surfaces first.

The result: platform knowledge entries land in the 0.75 to 0.95 range, ensuring they rank above casual conversation memories when both are relevant.
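That band can be enforced with a simple clamp; a sketch, assuming scores are already on a 0-1 scale:

```python
def knowledge_significance(base: float) -> float:
    """Clamp platform-knowledge scores into the 0.75-0.95 band so they
    outrank casual conversational memories when both are relevant."""
    return max(0.75, min(0.95, base))
```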

RAG vs. Tool Teaching

RAG answers questions. The next architectural layer -- tool teaching -- takes action. Understanding the boundary between them is critical for system design.

Aspect          RAG (Knowledge)               Tool Teaching (Agency)
Purpose         Answer questions              Manipulate the world
Data flow       Database to prompt            Prompt to API call
Side effects    None (read-only)              World state changes
Failure mode    "I don't know"                "Operation failed"
Security model  Read-only, inherently safe    Requires permission system
Complexity      Low (search + inject)         High (NLP + execution + sync)

RAG is the foundation. Once companions can reliably retrieve and communicate knowledge, the same retrieval pipeline can teach them tool syntax -- bridging the gap from passive knowledge to active world manipulation.

Evolution Path

The RAG architecture is designed to grow in several directions without requiring fundamental changes.

1. Companion-Specific Knowledge

Different companions know different things. A wisdom-oriented companion has philosophical knowledge; a trickster has different lore. The search query already filters by companion scope, so adding personality-specific knowledge is simply a matter of creating scoped entries alongside global ones.

2. Knowledge Versioning

Track when knowledge changes and allow rollback. Each entry carries a version identifier. When pricing changes from one tier to another, the old version is retained in history while the new version becomes active.

3. User-Suggested Entries

When users ask questions that have no existing match, the system can flag these gaps. A human reviewer approves an answer, it enters the knowledge base, and future users get instant answers to the same question. The knowledge base learns from real usage patterns.
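The gap-detection step is a one-line check on the best retrieval score; a sketch with an illustrative threshold (the source does not specify one):

```python
MATCH_THRESHOLD = 0.5  # illustrative cutoff, not from the source

def flag_knowledge_gap(question, results, gap_log):
    """Record unanswerable questions for human review instead of
    letting the model invent an answer."""
    best = max((r["score"] for r in results), default=0.0)
    if best < MATCH_THRESHOLD:
        gap_log.append(question)
        return True
    return False
```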

4. Multi-Language Support

Knowledge entries can carry language tags. The search query filters by user language with a fallback to the default language, ensuring coverage even when translations are incomplete.
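The fallback logic is small; a sketch assuming a `lang` tag per entry and English as the default:

```python
DEFAULT_LANG = "en"  # assumed platform default

def pick_localized(entries, user_lang):
    """Prefer the user's language; fall back to the default when the
    translation is missing."""
    by_lang = {e["lang"]: e for e in entries}
    return by_lang.get(user_lang, by_lang.get(DEFAULT_LANG))
```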

Lessons Learned

Dynamic Filters Can Override Base Clauses

When building search queries with dynamic filter conditions, an AND clause added later can silently negate an earlier inclusion. The fix: always use OR conditions for "include additional scope" logic. This pattern applies broadly to any system where multiple scopes need to coexist in a single query.

Knowledge Creation Is Not Knowledge Availability

Writing a knowledge file does not make it searchable. The seeding pipeline must run to generate embeddings and insert entries into the database. Establishing a clear deployment checklist -- write, seed, verify -- prevents the gap between authoring and availability.

Test Both Scoped and Global Queries

Testing only global-scope queries can mask filtering bugs that appear when searching as a specific companion. The test matrix should cover global scope, each companion scope, and user memory retrieval independently.

Significance Scoring Prevents Priority Inversion

Without explicit significance, the most recent conversational memory may outrank a critical FAQ. Assigning importance scores to knowledge entries ensures that foundational facts always surface when relevant, regardless of how many casual memories accumulate.

Key Takeaway

RAG is the sweet spot for dynamic, verifiable, updatable AI knowledge at scale. Fine-tuning is expensive and opaque. Raw embeddings lack grounding. RAG combines the strengths of both: semantic search finds relevant knowledge, and prompt augmentation ensures the model answers from verified sources rather than imagination.

The overhead is negligible -- roughly 18 milliseconds added to a pipeline measured in hundreds of milliseconds. The knowledge base is a structured file that anyone can edit. And the same retrieval infrastructure that delivers factual knowledge today can teach companions tool syntax tomorrow.