The Green Screen Problem
Early integration with computer vision seemed straightforward: analyze every frame, extract what the companion can "see," and feed it to the AI. Simple in theory. Disastrous in practice.
The problem is that virtual worlds are mostly static. A character standing in a plaza sees the same buildings, the same sky, the same nearby objects from one second to the next. Analyzing every frame at 30fps produces 1,800 perception updates per minute, almost all of them identical.
Naive frame-by-frame analysis generates massive redundancy. In a scene where nothing moves for ten minutes, that is 18,000 identical updates — consuming compute, bandwidth, and API calls for zero new information.
The solution is to make perception intelligent: instead of capturing everything all the time, capture only what matters, only when it changes. The idea borrows from biology: human eyes saccade to areas of interest rather than sweeping uniformly across the visual field. Our perception system works the same way.
Attention-Driven Sampling
Rather than a fixed polling rate, the perception system uses three attention levels that adapt to what is happening in the world. When nothing is going on, sampling slows to a crawl. When something important happens, it ramps up.
| Level | Interval | Updates / Hour | When |
|---|---|---|---|
| Baseline | Every 120 seconds | 30 | Nothing happening nearby |
| Engaged | Every 12 seconds | 300 | Normal player activity |
| High Urgency | Every 5 seconds | 720 | Important events unfolding |
A naive approach — scanning every frame and sending every result — would produce roughly 108,000 updates per hour. With attention-based sampling, even the busiest scenario tops out at 720. That is a 99% reduction in perception overhead, and the AI loses nothing meaningful.
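The three levels and the arithmetic behind them fit in a few lines of Python. A minimal sketch; the `AttentionLevel` name is illustrative, and the intervals come from the table above:

```python
from enum import Enum

class AttentionLevel(Enum):
    BASELINE = 120.0      # seconds between scans: nothing happening nearby
    ENGAGED = 12.0        # normal player activity
    HIGH_URGENCY = 5.0    # important events unfolding

def updates_per_hour(level: AttentionLevel) -> int:
    """Number of perception scans per hour at a given attention level."""
    return int(3600 / level.value)

# Even the busiest level is a small fraction of naive 30fps analysis.
NAIVE_PER_HOUR = 30 * 3600   # 108,000 frame analyses per hour
reduction = 1 - updates_per_hour(AttentionLevel.HIGH_URGENCY) / NAIVE_PER_HOUR
```

Running the numbers confirms the claim in the text: 720 scans per hour at High Urgency against 108,000 naive captures is better than a 99% reduction.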
Escalation Triggers
Attention level is not static — it responds to events in the world. Certain activities cause the system to escalate from one level to the next, increasing the companion's awareness when it matters most.
When someone walks up and starts talking, the companion immediately shifts to High Urgency — scanning the scene every five seconds to stay current on who is present and what is happening. If a player is moving through the area without interacting, the system settles at Engaged, scanning every twelve seconds. This mirrors how a real person would pay more attention to someone speaking directly to them than to a passerby.
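The escalation rules can be sketched as a mapping from trigger type to target level, where escalation only ever raises attention; lowering is left to the decay timers described next. The trigger names (`voice`, `movement`) are illustrative, not the system's actual event taxonomy:

```python
from enum import Enum

class AttentionLevel(Enum):
    BASELINE = 120.0      # scan interval in seconds
    ENGAGED = 12.0
    HIGH_URGENCY = 5.0

# Hypothetical trigger names mapped to the levels described in the text.
TRIGGER_TARGETS = {
    "voice": AttentionLevel.HIGH_URGENCY,   # someone starts talking
    "movement": AttentionLevel.ENGAGED,     # a player passes through
}

def escalate(current: AttentionLevel, trigger: str) -> AttentionLevel:
    """Raise attention in response to a world event; never lower it here."""
    target = TRIGGER_TARGETS.get(trigger, current)
    # A smaller scan interval means higher attention, so keep the minimum.
    return min(current, target, key=lambda lvl: lvl.value)
```

Keeping the minimum interval ensures that a passerby trigger cannot accidentally demote a companion that is already at High Urgency.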
Attention Decay
What goes up must come down. Without decay, a single voice event would leave the companion at High Urgency forever. The system follows a natural decay pattern that mimics biological attention:
- **High Urgency → Engaged:** after 30 seconds of silence with no new triggers, attention drops one level. The companion is still paying attention, but no longer on high alert.
- **Engaged → Baseline:** after 2 minutes with no activity, attention drops to its resting state. Scanning continues, but at the minimum rate of once every two minutes.
This creates a natural rhythm. A companion that was in an active conversation stays attentive for a short while afterward — the way a person might glance up after someone walks away — before gradually settling back into idle awareness. The decay timers reset every time a new trigger fires, so sustained activity keeps attention elevated without manual intervention.
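The decay timers can be sketched as a stepwise drop. One assumption is made explicit here: quiet time carries over when attention steps down a level (the text does not say whether the two-minute timer restarts after leaving High Urgency):

```python
from enum import Enum

class AttentionLevel(Enum):
    BASELINE = 120.0      # scan interval in seconds
    ENGAGED = 12.0
    HIGH_URGENCY = 5.0

# Seconds of quiet before dropping one level (30s and 2min, per the text).
DECAY_AFTER = {
    AttentionLevel.HIGH_URGENCY: 30.0,
    AttentionLevel.ENGAGED: 120.0,
}
NEXT_LOWER = {
    AttentionLevel.HIGH_URGENCY: AttentionLevel.ENGAGED,
    AttentionLevel.ENGAGED: AttentionLevel.BASELINE,
}

def decayed(level: AttentionLevel, quiet_seconds: float) -> AttentionLevel:
    """Step attention down one level at a time as quiet time accumulates.

    Any new trigger resets quiet_seconds to zero, which is what keeps
    sustained activity at an elevated level.
    """
    while level in DECAY_AFTER and quiet_seconds >= DECAY_AFTER[level]:
        quiet_seconds -= DECAY_AFTER[level]
        level = NEXT_LOWER[level]
    return level
```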
Deduplication
Even with intelligent sampling, many snapshots will be identical to the last one. The same characters standing in the same positions, the same buildings visible from the same angles. Sending duplicate data wastes resources and clutters the AI's context with noise.
The deduplication layer applies five rules to every snapshot before deciding whether to send it downstream. If none of these conditions are met, the snapshot is silently dropped.
- **Send** if the entity count changed: someone arrived or left
- **Send** if any entity moved more than 0.5 units since the last snapshot
- **Send** if the crowd level transitioned (empty / intimate / social / crowded)
- **Send** if an entity has been absent for over 30 seconds (confirmed departure)
- **Skip** everything else: positions unchanged, same entities, same crowd level
Combined with attention-based sampling, deduplication achieves a further 70% reduction in perception traffic. Of the snapshots that do get taken, the majority are dropped because nothing meaningful changed. The AI only receives data when there is actually something new to know about.
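The five rules reduce to a short predicate over consecutive snapshots. A minimal sketch, assuming snapshots map entity names to 2D positions and crowd levels arrive as precomputed labels; the real snapshot carries more fields:

```python
import math

def should_send(prev, curr, prev_crowd, curr_crowd, departed,
                movement_threshold=0.5):
    """Apply the five deduplication rules to a new snapshot.

    prev/curr: dicts of entity name -> (x, y) position, or None for prev
    on the very first scan. departed: set of entities confirmed absent
    for over 30 seconds by the object-permanence layer.
    """
    if prev is None:
        return True                      # nothing to compare against yet
    if set(prev) != set(curr):
        return True                      # entity count changed
    if any(math.dist(prev[n], curr[n]) > movement_threshold for n in curr):
        return True                      # an entity moved > 0.5 units
    if prev_crowd != curr_crowd:
        return True                      # crowd level transitioned
    if departed:
        return True                      # confirmed departure to report
    return False                         # nothing meaningful changed: skip
```

Anything that returns `False` here is silently dropped before it ever reaches the backend.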
Object Permanence
In a 3D engine, entities can vanish from the scene for reasons that have nothing to do with the world state. A character might disappear because they walked behind a building (frustum culling), dropped below a level-of-detail threshold, or simply moved to the edge of the render distance. These are rendering artifacts, not real departures.
The perception system maintains a short history of recently seen entities. When an entity disappears from a scene scan, it is not immediately reported as "gone." Instead, the system waits. If the entity reappears within 30 seconds, the disappearance is treated as a rendering artifact and never reported. Only after the 30-second window expires does the system confirm the departure and notify the AI.
Without object permanence, a companion standing near a corner would constantly report "Dana appeared... Dana disappeared... Dana appeared..." as the character walked in and out of the camera frustum. The 30-second timeout eliminates this noise, so the AI only hears about genuine arrivals and departures.
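The grace window can be sketched as a reconciliation step run after each scan. A minimal version, assuming entities are identified by name and timestamps are in seconds:

```python
def reconcile(visible, last_seen, now, grace=30.0):
    """Track entities across scans with a 30-second grace window.

    visible: set of entity names in the current scan.
    last_seen: dict of name -> timestamp of most recent sighting
    (mutated in place). Returns the set of confirmed departures.
    """
    for name in visible:
        last_seen[name] = now    # reappearance cancels any pending departure
    departed = {name for name, ts in last_seen.items()
                if name not in visible and now - ts > grace}
    for name in departed:
        del last_seen[name]      # report the departure once, then forget
    return departed
```

An entity that walks behind a building and reappears within the window simply has its timestamp refreshed, so the AI never hears about the culling artifact.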
This is a deliberate parallel to developmental psychology — specifically Piaget's concept of object permanence in infants. Just as a baby eventually learns that a toy hidden behind a blanket still exists, our perception system learns that an entity hidden behind geometry probably still exists too.
Spatial Snapshots
When attention triggers a scan and deduplication confirms something has changed, the system assembles a spatial snapshot — a structured summary of everything the companion can currently perceive. This is what the AI actually receives as context.
What Gets Captured
Each snapshot contains three categories of information: nearby characters, visible architecture, and an overall scene summary. Distances are measured in world units, directions are cardinal (north, southeast, etc.), and activities are inferred from animation state.
```
Dana — 3.2m north, idle
Brock — 8.7m southeast, walking
Whimsical Cottage — 12.4m west, building
Open Directions: East, South
```
The natural language summary at the bottom is generated alongside the structured data. It gives the AI a quick, human-readable overview that can be injected directly into a conversation prompt. The structured fields are available for programmatic queries — for example, "who is the closest character?" — without requiring the AI to parse free text.
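A snapshot with both representations could look like the following sketch. The field names, class names, and summary wording are assumptions for illustration, not the system's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Percept:
    name: str
    distance: float       # world units from the companion
    direction: str        # cardinal: "north", "southeast", ...
    activity: str         # inferred from animation state: "idle", "walking", ...

@dataclass
class SpatialSnapshot:
    characters: list = field(default_factory=list)
    buildings: list = field(default_factory=list)
    open_directions: list = field(default_factory=list)

    def closest_character(self):
        """Programmatic query: no free-text parsing needed."""
        return min(self.characters, key=lambda p: p.distance, default=None)

    def summary(self) -> str:
        """Human-readable overview for injection into a prompt."""
        parts = [f"{p.name} is {p.distance}m {p.direction}, {p.activity}"
                 for p in self.characters]
        return "; ".join(parts) or "No one nearby."
```

Queries like "who is the closest character?" become a one-line method call, while `summary()` produces the prose that gets spliced into the conversation prompt.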
The Complete Pipeline
Each component described above is a stage in a single pipeline. Data flows through them in sequence, with each stage reducing volume and increasing signal quality.
1. **Attention determines sampling rate.** Based on current triggers and decay timers, the system decides how frequently to scan: every 5, 12, or 120 seconds.
2. **Scene scan generates a spatial snapshot.** The 3D engine is queried for all entities within perception range. Positions, distances, directions, and activities are recorded.
3. **Deduplication filters unchanged data.** The new snapshot is compared against the previous one. If nothing meaningful changed, the snapshot is dropped silently.
4. **Significant changes trigger delivery.** Only snapshots that pass deduplication are sent to the backend. Object permanence filters confirm departures before reporting them.
5. **Backend stores spatial context.** The snapshot is stored as the companion's current spatial awareness, replacing the previous one. Historical snapshots are retained briefly for change detection.
6. **AI retrieves context during response generation.** When the companion needs to respond, the latest spatial snapshot is included in its context. The AI knows who is nearby, where things are, and what the scene looks like, without processing a single video frame.
The pipeline transforms raw 3D scene data into lean, meaningful spatial context. From a potential 108,000 raw captures per hour, the AI receives perhaps a few dozen meaningful updates. Each one tells it something new about the world.
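Glued together, one tick of the pipeline might look like the sketch below. This is illustrative wiring only: the attention, permanence, and storage stages are collapsed to their simplest forms to show the data flow:

```python
def pipeline_tick(now, state, scan_scene, deliver):
    """Run one iteration of the perception pipeline.

    state carries interval, last_scan, and prev across ticks;
    scan_scene queries the 3D engine; deliver sends a snapshot to
    the backend, where the AI later reads it as context.
    """
    if now - state["last_scan"] < state["interval"]:
        return                            # stage 1: attention gates sampling
    state["last_scan"] = now
    snap = scan_scene()                   # stage 2: query the 3D engine
    if snap == state["prev"]:
        return                            # stage 3: dedup drops it silently
    deliver(snap)                         # stages 4-5: send and store
    state["prev"] = snap                  # stage 6: AI reads stored context
```

Called in a loop with simulated time, most ticks return early at stage 1 or stage 3, which is exactly where the 99% and further 70% reductions come from.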