Chapter 3

Attention & Perception

How AI companions sense the world around them without drowning in redundant data — adaptive sampling, deduplication, and spatial awareness.

The Green Screen Problem

Early integration with computer vision seemed straightforward: analyze every frame, extract what the companion can "see," and feed it to the AI. Simple in theory. Disastrous in practice.

The problem is that virtual worlds are mostly static. A character standing in a plaza sees the same buildings, the same sky, the same nearby objects from one second to the next. Analyzing every frame at 30fps produces 1,800 perception updates per minute, almost all of them identical.

The Problem

Naive frame-by-frame analysis generates massive redundancy. In a scene where nothing moves for ten minutes, that is 18,000 identical updates — consuming compute, bandwidth, and API calls for zero new information.

The Solution

Make perception intelligent. Instead of capturing everything all the time, capture only what matters, only when it changes. Borrow from biology: human eyes saccade to areas of interest rather than sweeping uniformly across the visual field. Our perception system works the same way.

Attention-Driven Sampling

Rather than a fixed polling rate, the perception system uses three attention levels that adapt to what is happening in the world. When nothing is going on, sampling slows to a crawl. When something important happens, it ramps up.

Level         Interval            Updates / Hour   When
Baseline      Every 120 seconds   30               Nothing happening nearby
Engaged       Every 12 seconds    300              Normal player activity
High Urgency  Every 5 seconds     720              Important events unfolding

A naive approach — scanning every frame and sending every result — would produce roughly 108,000 updates per hour. With attention-based sampling, even the busiest scenario tops out at 720. That is a 99% reduction in perception overhead, and the AI loses nothing meaningful.

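The three attention levels can be sketched as a small enum; the names and the helper function below are illustrative, not the actual implementation, but the intervals are the ones from the table above.

```python
from enum import Enum

class AttentionLevel(Enum):
    """Seconds between scans at each attention level."""
    BASELINE = 120      # nothing happening nearby
    ENGAGED = 12        # normal player activity
    HIGH_URGENCY = 5    # important events unfolding

def updates_per_hour(level: AttentionLevel) -> int:
    """Scans per hour at a given level: 3600 s divided by the interval."""
    return 3600 // level.value

# Even High Urgency tops out at 3600 / 5 = 720 scans per hour,
# versus 30 fps * 3600 s = 108,000 for naive frame-by-frame analysis.
```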

Escalation Triggers

Attention level is not static — it responds to events in the world. Certain activities cause the system to escalate from one level to the next, increasing the companion's awareness when it matters most.

Trigger                    Escalates To
Voice activity detected    High Urgency
Player within 5 units      High Urgency
Player moving nearby       Engaged
New entity enters area     Engaged

When someone walks up and starts talking, the companion immediately shifts to High Urgency — scanning the scene every five seconds to stay current on who is present and what is happening. If a player is moving through the area without interacting, the system settles at Engaged, scanning every twelve seconds. This mirrors how a real person would pay more attention to someone speaking directly to them than to a passerby.

Attention Decay

What goes up must come down. Without decay, a single voice event would leave the companion at High Urgency forever. The system follows a natural decay pattern that mimics biological attention:

Decay Rules

High Urgency to Engaged — after 30 seconds of silence with no new triggers, attention drops one level. The companion is still paying attention, but no longer on high alert.

Engaged to Baseline — after 2 minutes with no activity, attention drops to its resting state. Scanning continues, but at the minimum rate of once every two minutes.

This creates a natural rhythm. A companion that was in an active conversation stays attentive for a short while afterward — the way a person might glance up after someone walks away — before gradually settling back into idle awareness. The decay timers reset every time a new trigger fires, so sustained activity keeps attention elevated without manual intervention.
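One reading of the decay rules as a pure function: given the current level and how long it has been since the last trigger, return the decayed level. This assumes decay steps down one level at a time, with the second timer starting after the first expires.

```python
HIGH_TO_ENGAGED_S = 30.0      # silence before High Urgency drops
ENGAGED_TO_BASELINE_S = 120.0 # further quiet before resting state

def decayed_level(level: str, idle_seconds: float) -> str:
    """Attention level after idle_seconds with no new triggers."""
    if level == "high_urgency":
        if idle_seconds < HIGH_TO_ENGAGED_S:
            return "high_urgency"
        # Past the first threshold, keep decaying as if at Engaged.
        return decayed_level("engaged", idle_seconds - HIGH_TO_ENGAGED_S)
    if level == "engaged" and idle_seconds >= ENGAGED_TO_BASELINE_S:
        return "baseline"
    return level
```

Because the caller resets `idle_seconds` on every trigger, sustained activity keeps attention elevated exactly as described above.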

Deduplication

Even with intelligent sampling, many snapshots will be identical to the last one. The same characters standing in the same positions, the same buildings visible from the same angles. Sending duplicate data wastes resources and clutters the AI's context with noise.

The deduplication layer applies five rules to every snapshot before deciding whether to send it downstream. If none of these conditions are met, the snapshot is silently dropped.

Combined with attention-based sampling, deduplication achieves a further 70% reduction in perception traffic. Of the snapshots that do get taken, the majority are dropped because nothing meaningful changed. The AI only receives data when there is actually something new to know about.
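The chapter does not enumerate the five rules here, so the sketch below substitutes three illustrative conditions (arrival/departure, and movement beyond a threshold) to show the shape of the check: compare the new snapshot against the previous one and drop it unless something meaningful changed.

```python
MOVE_THRESHOLD = 1.0  # world units an entity must move to count as a change

def should_send(prev: dict[str, tuple[float, float]],
                curr: dict[str, tuple[float, float]]) -> bool:
    """prev/curr map entity name -> (x, z) position in world units."""
    if prev.keys() != curr.keys():
        return True  # someone arrived or departed
    for name, (x, z) in curr.items():
        px, pz = prev[name]
        if ((x - px) ** 2 + (z - pz) ** 2) ** 0.5 >= MOVE_THRESHOLD:
            return True  # meaningful movement
    return False  # nothing changed; drop the snapshot silently
```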

Object Permanence

In a 3D engine, entities can vanish from the scene for reasons that have nothing to do with the world state. A character might disappear because they walked behind a building (occlusion culling), dropped below a level-of-detail threshold, or simply moved to the edge of the render distance. These are rendering artifacts, not real departures.

The perception system maintains a short history of recently seen entities. When an entity disappears from a scene scan, it is not immediately reported as "gone." Instead, the system waits. If the entity reappears within 30 seconds, the disappearance is treated as a rendering artifact and never reported. Only after the 30-second window expires does the system confirm the departure and notify the AI.

Why This Matters

Without object permanence, a companion standing near a corner would constantly report "Dana appeared... Dana disappeared... Dana appeared..." as the character walked in and out of the camera frustum. The 30-second timeout eliminates this noise, so the AI only hears about genuine arrivals and departures.

This is a deliberate parallel to developmental psychology — specifically Piaget's concept of object permanence in infants. Just as a baby eventually learns that a toy hidden behind a blanket still exists, our perception system learns that an entity hidden behind geometry probably still exists too.
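One way to sketch the 30-second window: track when each entity was last seen, and only report a departure once an entity has been missing longer than the grace period. The class and method names here are hypothetical.

```python
GRACE_SECONDS = 30.0  # how long a missing entity is presumed to still exist

class PermanenceTracker:
    def __init__(self) -> None:
        self.last_seen: dict[str, float] = {}  # entity name -> timestamp

    def update(self, visible: set[str], now: float) -> list[str]:
        """Record one scene scan; return entities confirmed as departed."""
        for name in visible:
            self.last_seen[name] = now
        departed = [name for name, t in self.last_seen.items()
                    if name not in visible and now - t > GRACE_SECONDS]
        for name in departed:
            del self.last_seen[name]  # forget once the departure is confirmed
        return departed
```

An entity that blinks out and back within the window (Dana rounding the corner) is never reported; only a genuine 30-second absence produces a departure.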

Spatial Snapshots

When attention triggers a scan and deduplication confirms something has changed, the system assembles a spatial snapshot — a structured summary of everything the companion can currently perceive. This is what the AI actually receives as context.

What Gets Captured

Each snapshot contains three categories of information: nearby characters, visible architecture, and an overall scene summary. Distances are measured in world units, directions are cardinal (north, southeast, etc.), and activities are inferred from animation state.

Spatial Snapshot
Characters:
  Dana — 3.2m north, idle
  Brock — 8.7m southeast, walking
Architecture:
  Whimsical Cottage — 12.4m west, building
Crowd Level: Intimate (2 characters)
Open Directions: East, South
"2 characters nearby, 1 building visible. Dana is closest at 3.2m to the north."

The natural language summary at the bottom is generated alongside the structured data. It gives the AI a quick, human-readable overview that can be injected directly into a conversation prompt. The structured fields are available for programmatic queries — for example, "who is the closest character?" — without requiring the AI to parse free text.

The Complete Pipeline

Each component described above is a stage in a single pipeline. Data flows through them in sequence, with each stage reducing volume and increasing signal quality.

  1. Attention determines sampling rate
    Based on current triggers and decay timers, the system decides how frequently to scan: every 5, 12, or 120 seconds.
  2. Scene scan generates spatial snapshot
    The 3D engine is queried for all entities within perception range. Positions, distances, directions, and activities are recorded.
  3. Deduplication filters unchanged data
    The new snapshot is compared against the previous one. If nothing meaningful changed, the snapshot is dropped silently.
  4. Significant changes trigger delivery
    Only snapshots that pass deduplication are sent to the backend. Object permanence filters confirm departures before reporting them.
  5. Backend stores spatial context
    The snapshot is stored as the companion's current spatial awareness, replacing the previous one. Historical snapshots are retained briefly for change detection.
  6. AI retrieves context during response generation
    When the companion needs to respond, the latest spatial snapshot is included in its context. The AI knows who is nearby, where things are, and what the scene looks like — without processing a single video frame.

The pipeline transforms raw 3D scene data into lean, meaningful spatial context. From a potential 108,000 raw captures per hour, the AI receives perhaps a few dozen meaningful updates. Each one tells it something new about the world.
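The six stages above can be tied together in a single tick function. This is a minimal sketch: `scan_scene`, `changed`, and `send_to_backend` stand in for the engine query, the deduplication check, and backend delivery, none of which are specified here.

```python
INTERVALS = {"baseline": 120.0, "engaged": 12.0, "high_urgency": 5.0}

def perception_tick(state: dict, now: float,
                    scan_scene, changed, send_to_backend) -> None:
    """One pass through the pipeline.

    state holds the current attention level, the time of the last scan,
    and the last snapshot that passed deduplication.
    """
    interval = INTERVALS[state["level"]]          # 1. attention sets the rate
    if now - state["last_scan"] < interval:
        return                                    # not time to scan yet
    state["last_scan"] = now
    snapshot = scan_scene()                       # 2. scene scan
    if not changed(state["last_snapshot"], snapshot):
        return                                    # 3. dedup: drop silently
    state["last_snapshot"] = snapshot
    send_to_backend(snapshot)                     # 4-5. deliver and store
```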