A Term Getting Named in Real Time

In the past ninety days, three respected engineers — working separately, apparently unaware of one another — reached for new vocabulary to describe the same phenomenon. In March 2026, Jouke Waleson, speaking at the OpenAI Codex Meetup in Amsterdam, called it dark code: “lines of software that no human has written, read, or even reviewed.” Two weeks later, Addy Osmani — a Chrome engineer at Google — called it comprehension debt: the hidden tax of AI-generated code that nobody on the team can fully explain. In April, Nate B. Jones extended Waleson’s framing for a business audience and gave the phenomenon a second name: the comprehension crisis.

Dark code — software running in production that nobody on the team can explain

Three labels. One problem. And the fact that competent engineers reached for new vocabulary in the same ninety days is the actual news story. The industry does not give a phenomenon a new name while an existing name still fits.

Dark code. Comprehension debt. The comprehension crisis. Three names for a single problem: code running in production that nobody on the payroll can fully explain.

This piece uses dark code — with credit to Waleson for the coinage and to Jones for bringing it to a business audience — because the umbrella term is still usable and not yet saturated. Osmani’s comprehension debt is the more precise instrument; it captures the accumulation and interest-compounding quality of the phenomenon. They are not in tension. They are naming the same thing from different vantage points.

What the Term Actually Names

Dark code is code running in production that nobody can explain. Not the engineer who shipped it. Not the team that owns the service. Not the AI session that produced it. Not the executive who approved the architecture three years ago. The code works. It passes tests. It clears CI. It deploys without incident. And no human on the payroll fully understands what it does, why it does it, or what would happen if it did something else.

Jones and Osmani are both careful to distinguish dark code from three adjacent categories it is often confused with.

It is not buggy code. Buggy code fails visibly. Dark code passes.

It is not spaghetti code. Spaghetti code is hard to read. Dark code may be perfectly readable — the comprehension step just never happened.

It is not technical debt in the traditional sense. Technical debt is code you understand and have consciously chosen not to pay down. Dark code is code no one understood in the first place. Osmani’s “comprehension debt” is a deliberate cousin to the classic Cunningham metaphor: interest accrues, but the principal is understanding you never had, not a shortcut you once took.

Dark Code, Precisely

Code that was generated, checked, and shipped — without the comprehension step ever occurring. In the AI era, we are producing it at a velocity the profession has never seen.

The honest skeptical reading is that this is not new. Enterprise codebases have always contained code nobody fully understands: acquired companies, departed contractors, legacy systems nobody in the building wrote. Joel Spolsky observed in 2000 that “it’s harder to read code than to write it.” The skeptic’s point is correct in spirit and wrong about the delta. Where organizations used to accumulate dark code at the rate of staff turnover and acquisitions, they now accumulate it at the rate of API calls. The pattern is old. The velocity is not.

What the Data Says

A few numbers worth carrying into the rest of the piece.

  • An Anthropic-authored randomized trial published in January 2026 gave fifty-two engineers a new library to learn. The AI-assisted group completed tasks in roughly the same time as the control group — and scored 17 percentage points lower on comprehension of what they’d built (50% vs. 67%), with the steepest drops in debugging ability.
  • Stack Overflow’s 2025 Developer Survey found professional AI adoption at 84% — and trust in AI output accuracy collapsing from 40% in 2024 to 29% in 2025. 46% of developers now actively distrust the accuracy of AI-generated code; only 3% highly trust it. Two-thirds report spending more time fixing “almost-right” AI output than they save from generating it.
  • Google’s DORA research across 2024–2025 shows AI use correlating with a 7.2% decrease in delivery stability. Pull requests per developer are up 20%, but incidents per PR are up 23.5%. Review time per PR is up 91%.
  • GitClear’s 2025 analysis of 211 million lines of code found copy-pasted code doubled (from 8% to 18%), while refactored-and-moved code collapsed (from 25% to under 10%). Heavy AI users produce nine times more code churn.
  • A Lightrun survey in early 2026 reports that 43% of AI-generated code changes require manual debugging in production even after passing QA and staging.

The skeptical read is that these numbers describe a discipline in transition: new tools, old metrics, short-run turbulence that will normalize. That may be right. But every one of those data points tracks exactly what the comprehension-crisis framing predicts. Code is flowing through pipelines faster. The review layer is not expanding to match. And what ships is not more bugs; it is more code that goes out without anyone fully understanding what it does.

Why Tooling Doesn’t Fix It

The first reflex — especially in organizations that have spent a decade investing in platform tooling — is to treat this as a tooling problem. More observability. Better agent pipelines. Richer static analysis. An LLM-powered reviewer that summarizes every PR.

None of those things are bad. Several help. But none resolves the core issue, because the core issue is not visibility — it is comprehension. Observability tells you what dark code is breaking. It does not make the code understood. Agentic review pipelines add layers to troubleshoot, not layers of understanding. The claim that “the AI knows what its code does” is a bet on a system that cannot reliably tell you when it’s overconfident.

Dark code is an organizational capability problem. Which means the fix has to be organizational.

Three Layers of Defense — as Outcomes

Jones names three layers: spec-driven development, context engineering, and comprehension gates. The framework is his; I’ll credit it accordingly. In the BBWorlds / poqpoq ecosystem we adopted all three and renamed the middle one to self-describing systems — a reframing that emphasizes where the context lives (in the code and its manifests) rather than what activity you perform to produce it.

It helps to restate each layer as an outcome — what it actually prevents.

Layer 1 — spec-driven development. The outcome: no code is written without someone — human or AI — first writing down what it should do, what it depends on, and what success looks like. The form doesn’t matter. An ADR. A paragraph in a ticket. A JSON manifest. What matters is that the comprehension step precedes the generation step.

This prevents the failure mode every AI-assisted team knows: a developer types “make this work,” ships whatever came out, and now owns dark code with a confident-looking author line.
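Kept small, the spec can even be machine-checkable. A minimal sketch in TypeScript, with invented field names (the standards referenced in this piece define no such interface), using an advisory edit-locking feature as the example:

```typescript
// Hypothetical shape for a minimal machine-readable spec. The field
// names are illustrative, not drawn from any published standard.
interface MiniSpec {
  purpose: string;      // what the change should do
  dependsOn: string[];  // what it relies on
  success: string[];    // observable criteria for "done"
}

const spec: MiniSpec = {
  purpose: "Add advisory edit locking to placed-object CRUD",
  dependsOn: ["session service", "object store"],
  success: [
    "Second editor sees a lock warning within 1s",
    "Locks expire after 10 seconds of inactivity",
  ],
};

// The generation step can now be checked against the spec:
// every success criterion should map to at least one test.
```

The point is not the format. A paragraph in a ticket carries the same information; the structured version just makes the "does every criterion have a test?" question mechanical.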

Layer 2 — self-describing systems. The outcome: any person or agent reading any file, with no prior context, can answer three questions. What does this do? What does it depend on? What depends on it? And for the non-obvious pieces — the counterintuitive decisions, the load-bearing workarounds — the code carries the why with it.

This prevents comprehension from living in one person’s head. Six months from now, that person leaves, forgets, or asks an AI to refactor. The codebase is either self-describing, or it becomes mute.

Layer 3 — comprehension gates. The outcome: no code ships without someone being able to answer five questions. What changed? Why? What could break? Is the spec updated? Can someone unfamiliar with this code understand what it does by reading it?

This prevents the failure mode where tests are green and the PR gets rubber-stamped. Tests verify functional correctness. They do not verify comprehension.

A test suite can be completely green while the code is completely dark.
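The gate does not have to remain a policy document; the five questions can be checked mechanically in CI. A minimal sketch, with hypothetical section headings (no existing tool or standard defines these names):

```typescript
// Sketch of a comprehension gate as a CI check: the merge is blocked
// unless the PR description answers all five questions. Heading names
// are invented for illustration.
const REQUIRED_SECTIONS = [
  "What changed",
  "Why",
  "What could break",
  "Spec updated",
  "Readable cold", // can someone unfamiliar understand it?
];

function missingSections(prBody: string): string[] {
  return REQUIRED_SECTIONS.filter(
    (heading) => !prBody.toLowerCase().includes(heading.toLowerCase())
  );
}

const prBody = `
## What changed
Extracted the health module from the monolith.
## Why
Phase 1 of the modularization plan.
`;
// missingSections(prBody)
//   -> ["What could break", "Spec updated", "Readable cold"]
```

A real check would parse headings rather than substring-match, but even this crude version converts "reviewers should ask these questions" into "the pipeline refuses until someone answers them."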

First, a Frame Adjustment

Before segmenting by role, there’s an elephant worth naming. A meaningful part of the conversation about AI-assisted coding — especially inside educational institutions — still treats the practice itself as a form of cheating. The argument has a familiar shape: the AI does the work, so the human isn’t learning; therefore using the tool short-circuits the learning that should be happening. It is the same argument calculators faced in the 1970s, and word processors in the 1980s, and search engines in the 2000s. In each case the institutional answer was eventually the same. The tool redefines what “skill” means. A student who cannot use the tool well is at a disadvantage; an educational system that pretends the tool doesn’t exist is producing graduates for a world that no longer exists.

AI-assisted code generation is at that moment now, and the arithmetic is starker than in any prior case. A competent coding LLM produces correct code faster, and in greater volume, than any human typing at twenty to forty words per minute. That is not a close call. The productive question is no longer should people use AI to write code. It is how does a person use AI to write code in a way they remain accountable for, able to evaluate, and able to change six months later.

Dark code is what happens when the second half of that question gets skipped. It is not an argument against AI-assisted development. It is an argument for comprehension-preserving AI-assisted development — which is entirely different from working unassisted, and entirely different from handing the steering wheel to the model.

The productive question is no longer "should people use AI to write code." It is "how does a person use AI to write code in a way they remain accountable for, able to evaluate, and able to change six months later."

That frame adjustment matters because it determines which problem you are solving. In the “AI is cheating” frame, every practice described below reads as scolding or overhead. In the “AI is legitimate; comprehension is the new scarce input” frame, every practice below reads as what professional use of the tool actually requires.

What This Looks Like From Where You Sit

With the frame established, the rest of this piece looks at the problem through four lenses, the four most common vantage points from which a piece like this gets read:

If you’re the executive or AI program owner

Dark code shows up in your P&L first as a metric that looks great — until it doesn’t. PRs-per-week are up. Incidents-per-PR are up faster. The engineers you hired to understand the systems are spending less of their time on comprehension and more on cleanup. The fix is not a tool purchase. It is a shipping criterion: no code merges without someone — human or agent — able to answer the five comprehension questions. That costs paper velocity. It buys you a codebase your organization still owns in three years.

The metric change is also leadership’s responsibility, and nobody else’s. “Velocity” measured as PRs merged produces dark code. Velocity measured as explainable PRs merged does not. Until the C-suite and the engineering leadership both shift the measure, any comprehension-gate policy will be quietly overridden by the velocity gradient beneath it.

If you’re a senior engineer

You already know this problem. You’ve lived it with outsourced code, contractor code, legacy systems nobody in the building wrote. What’s new isn’t the pattern — it’s the velocity. Where you used to accumulate dark code at the rate of handoffs and turnover, you now accumulate it at the rate of API calls. The instinct to “clean it up later” was always slightly fictional; at AI-generation velocity, later is never.

The practical moves: insist on the five comprehension questions in review. Add structural headers to every file you touch. Stop merging code you can’t explain in a sentence. Your scarce resource has flipped from typing speed to review thoughtfulness — which means the skill you’ve been developing for fifteen years is newly valuable, not obsolete. Pair the AI’s generation velocity with your comprehension velocity, and the combination outperforms either alone. Skip the second half and you produce dark code faster than any human ever could.

If you’re building with AI but don’t think of yourself as an engineer — the “vibe coder”

You are doing legitimate work. The version of what you do that goes wrong is not the one where a non-engineer uses AI; it is the one where anyone — engineer or not — accepts AI output without being able to evaluate it. The vibe coder who specifies clearly and probes the output thoughtfully frequently produces less dark code than the overconfident engineer who types “fix this” and ships whatever came back.

Your highest-leverage practice is this: before asking the AI to generate something, write down in one paragraph what the thing should do and what you’d consider broken. That paragraph is the spec. If you can’t write it, you don’t yet know what you want — and the AI will fill that vacuum with confident-sounding code you have no way to evaluate. It feels the most like slowing down. It pays back fastest.

When the AI produces code, ask it to explain its choices. Not the what: the why. “Why this pattern and not the other one?” “What happens if this function is called twice?” “What assumption are we making about the input?” The answers are often worth more than the code they accompany.

If you’re an instructional designer or learning leader

This is your turf, but with a new landscape under it. The traditional model — train engineers once, assume comprehension arrives with experience — no longer matches how code gets produced. What learners now need is metacognition about working with AI: how to recognize the moment they’re accepting code without comprehending it, how to write specifications that make comprehension checkable, how to ask the AI to explain choices in ways that actually transfer understanding.

The skills depreciating fastest right now are closest to implementation — the typing, the boilerplate, the syntax lookup. The skills appreciating fastest are architectural clarity, behavioral specification, and comprehension evaluation. That last one — how to know whether you understood what just got produced — barely appears in most curricula. It needs to. It is to the AI era what reading comprehension was to the industrial-era classroom: the skill that determines whether the other skills are usable.

For enterprise L&D, the shift is from “train developers on new tools” toward “teach everyone in the AI-adjacent workflow — not just engineers — how to evaluate AI-produced artifacts.” Product managers writing AI-assisted specs. Analysts using AI-generated SQL. Marketers editing AI-drafted copy. The meta-skill is the same in all of them: recognize when comprehension was skipped, and close the loop before the artifact gets used.

One Week, One Ecosystem: What Adoption Actually Looks Like

Between April 15 and April 20, 2026 — roughly a week of evenings and weekends after our own standards document’s publication across the BBWorlds / poqpoq repositories — the commit graph produced concrete evidence of what adoption looks like in practice. None of these are hypothetical. They shipped.

Example 1: The NEXUS modularization (ADR-074 Phase 1)

NEXUS is the central server underneath the platform — sessions, persistence, realtime sync, marketplace transactions. It had grown into a 6,993-line monolith: one file holding 18 HTTP routes, socket handlers, and the kind of accumulated history that makes senior engineers sigh. It worked. Nobody could explain it on a single screen.

The extraction was not a rewrite. It was a forced act of comprehension. Phase 1 pulled three domain modules — health, marketplace, AI — into their own files. Each ships with a manifest.json declaring, in machine-readable form, the routes it exposes, the modules it depends on, the modules that depend on it, and a contract covering invariants, failure modes, and performance expectations.

For the health module, that contract reads in part: “All routes are read-only. /health must respond within 100ms (systemd dependency). Database disconnected → /health reports false but still 200.” Three lines. They would have lived in a senior engineer’s head, or been rediscovered on an outage call at 3am. They now live with the code, in a format any agent can read without parsing source.
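That contract is small enough to encode directly. A framework-agnostic sketch of its one non-obvious invariant, with invented names (the real NEXUS route is not reproduced here, and the caching rationale in the comment is an assumption):

```typescript
// Sketch of the health contract: a disconnected database is reported
// as unhealthy, but the route itself still answers 200. The names and
// shapes here are illustrative, not the actual NEXUS implementation.
interface HealthReport {
  status: number;   // HTTP status the route would return
  healthy: boolean; // whether the service considers itself healthy
}

function healthCheck(dbConnected: boolean): HealthReport {
  // The route must answer within 100ms (systemd depends on it), so a
  // real implementation would report cached state rather than probing
  // the database inline. Either way: degraded, but still 200.
  return { status: 200, healthy: dbConnected };
}

// healthCheck(false) -> { status: 200, healthy: false }
```

The subtlety is exactly the kind of thing that lives in one engineer's head until an outage: a 500 from /health would make systemd restart a service whose only problem is a flaky database.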

There is also an introspection endpoint — GET /nexus/modules returns the live module surface. A new team member or an AI agent can discover what the server does without reading a single line of source code.

Example 2: Structural headers in BlackBoxAvatar

BlackBoxAvatar is the character-creation tool in the suite. Four days into adoption, a commit went in titled “docs: add structural file headers per Human-AI Codebase Standards.” Every file touched that day got a doc block at the top: what the module does, what it depends on, what depends on it, and behavioral notes for any non-obvious pattern.

One such note documented a bone-naming convention — the animation system uses an L_ / R_ prefix for source bones and a Left / Right suffix for target bones. Trivially discoverable if you know to look for it, and a permanent source of bugs if you don’t. That is a twelve-word comment. It probably saves a future developer — or a future AI session — a full afternoon of debugging. Self-describing doesn’t mean baroque; it means the comment lives with the thing it describes.
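A convention like that is cheapest to enforce once it is written down. A hypothetical sketch of the mapping the comment describes, with the function name and bone names invented (the actual retargeting code is not shown):

```typescript
// Hypothetical illustration of the documented convention: source bones
// carry an L_/R_ prefix, target bones a Left/Right suffix. All names
// here are invented for the example.
function sourceToTargetBone(source: string): string {
  if (source.startsWith("L_")) return source.slice(2) + "Left";
  if (source.startsWith("R_")) return source.slice(2) + "Right";
  return source; // center bones map through unchanged
}

// sourceToTargetBone("L_UpperArm") -> "UpperArmLeft"
// sourceToTargetBone("Spine")      -> "Spine"
```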

Example 3: The Legacy importer — spec as eval

The Legacy component imports asset data from older regions. Before adoption, it classified every non-phantom object as “static,” producing tens of thousands of Havok rigid bodies that choked avatar movement in dense scenes. The fix was not a hot patch. It was a tiered classification spec:

  • Tier 1 reads explicit flags: PhysicsShapeType=None → phantom, Convex → static, Physics flag set → dynamic
  • Tier 2 applies heuristics: very small objects → phantom; light sources → phantom; unflagged imported meshes → phantom

The spec was written first. Then the code. Then twelve new tests — one per classification path — that enumerate the contract exactly. Read the tests and you understand the classification system. That is spec-as-eval in its cheapest form: no heavy process, no new tooling, just spec → tests → code. Six months from now, an AI agent asked to add a new classification rule has an unambiguous reference for what “correct” means.
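The decision table is small enough to read as code. A sketch of the classification paths quoted above, with an invented input shape; the real spec covers twelve tested paths, and only the ones listed in the bullets appear here:

```typescript
// Sketch of the tiered classification. Input shape and threshold are
// illustrative; the actual Legacy importer is not reproduced.
type PhysicsClass = "phantom" | "static" | "dynamic";

interface ImportedObject {
  physicsShapeType?: "None" | "Convex"; // explicit flag, when present
  physicsFlag?: boolean;                // explicit "dynamic" flag
  volume: number;                       // bounding volume, illustrative unit
  isLightSource: boolean;
}

function classify(obj: ImportedObject): PhysicsClass {
  // Tier 1: explicit flags always win.
  if (obj.physicsShapeType === "None") return "phantom";
  if (obj.physicsShapeType === "Convex") return "static";
  if (obj.physicsFlag) return "dynamic";
  // Tier 2: heuristics for unflagged imports. Every quoted path
  // resolves to phantom, which is the point of the fix: the old
  // importer defaulted these objects to static and flooded the
  // physics engine with rigid bodies.
  if (obj.volume < 0.001) return "phantom"; // very small objects
  if (obj.isLightSource) return "phantom";  // light sources
  return "phantom";                         // unflagged imported meshes
}
```

Each branch is a classification path, so one test per branch enumerates the contract, which is what makes the tests readable as the spec.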

Example 4: The instance switcher — state machine before code

ADR-081 documented an instance-switch lifecycle as a state machine before a single line of the switcher was written: IDLE → FADING_OUT → TEARING_DOWN → LOADING → REBUILDING → VALIDATING → FADING_IN. An ordered 16-system teardown. Post-switch validation. Failure recovery.

The implementation commit then reduced main.ts by 391 lines, extracting three modules with documented responsibilities. The ADR preceded the code; the code references the ADR; any reader — human or agent — can follow the breadcrumb instead of reverse-engineering a monolith. This is the full loop in a single shipping unit: spec, self-describing modules, comprehension gate.
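A state machine written down first can also be enforced at runtime. A sketch of the ADR-081 lifecycle as an explicit transition table; the happy path is taken from the ADR, while the any-state-back-to-IDLE recovery edge is an assumption added for illustration:

```typescript
// The seven lifecycle states from ADR-081, as a transition table.
type SwitchState =
  | "IDLE" | "FADING_OUT" | "TEARING_DOWN" | "LOADING"
  | "REBUILDING" | "VALIDATING" | "FADING_IN";

const NEXT: Record<SwitchState, SwitchState[]> = {
  IDLE: ["FADING_OUT"],
  FADING_OUT: ["TEARING_DOWN"],
  TEARING_DOWN: ["LOADING"],
  LOADING: ["REBUILDING"],
  REBUILDING: ["VALIDATING"],
  VALIDATING: ["FADING_IN"],
  FADING_IN: ["IDLE"], // switch complete
};

function canTransition(from: SwitchState, to: SwitchState): boolean {
  // Failure recovery (assumed here, not quoted from the ADR): any
  // non-idle state may abort back to IDLE.
  if (to === "IDLE" && from !== "IDLE") return true;
  return NEXT[from].includes(to);
}

// canTransition("IDLE", "FADING_OUT") -> true
// canTransition("IDLE", "LOADING")    -> false (no skipping states)
```

Ten lines of table make the spec executable: an implementation that tries to jump from IDLE straight to LOADING fails loudly instead of silently skipping teardown.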

The aggregate signal

A week of evenings and weekends. Fifteen module manifests across the ecosystem. Eight ADRs authored or updated. Four repositories with structural headers added. A CLAUDE.md update that now points new contributors — and new AI sessions — at the standards document before they touch a file. Not theater. Load-bearing governance, visible in the commit graph.

But Won’t This Slow You Down?

Two critiques deserve engagement. Neither is the “process theater” hand-wave.

The Osmani objection. In the same March 2026 essay that popularized “comprehension debt,” Addy Osmani offered the sharpest rebuttal to spec-first development: specs thorough enough to fully constrain behavior become the program itself. A specification detailed enough to replace review is code in a different syntax. Osmani is right in the narrow sense, and the weakest version of spec-driven development collapses on contact. The strongest version replies: the point of a spec is not to replace the code. The point is to make comprehension happen before the code exists, so the code can be checked against it. An ADR that says “this module handles placed object CRUD with advisory edit locking, 10-second expiry, no pagination” is not the implementation. It is the contract the implementation is accountable to. That distinction is load-bearing — and it is what separates useful specs from ceremony.

The incentives objection. The second critique, surfacing in every HN thread on this topic, is structural: organizations measure velocity by shipped-PR counts. Engineers who pause for comprehension are outperformed by engineers who don’t; over time the careful ones get reorged out, and the remaining population adapts to the gradient. No policy survives that incentive. This argument is serious. The correct response is not to wave it off — it is to change the metric. Velocity measured as “PRs merged” produces dark code. Velocity measured as “explainable PRs merged” does not. Organizations that make the measurement swap early will compound. Organizations that don’t will pay in the incident rates DORA keeps documenting.

What we observed in our own adoption window: velocity did not drop. In several places it measurably improved. AI agents got faster at producing correct code when the spec was clear. Review cycles got shorter when the PR description answered the comprehension questions up front. Debugging sessions compressed because the behavioral comment in the code told you what you were looking at. The meta-work of figuring out what someone else — or some earlier AI session — meant shrank substantially. That meta-work is the single largest time sink in most engineering organizations.

The “process theater” objection is legitimate at the edge. Sixteen-artifact hurdles before code can be written deserve the critique. But a paragraph is a spec. A ten-line ADR is a spec. A manifest written in a minute is a spec. The test is not length — it is whether comprehension happened before generation.

What Leadership Actually Has to Do

This is the uncomfortable section. The practices described here are cheap to adopt at small scale and expensive to retrofit at large scale. A five-person team with strong practice sustains a remarkably large, comprehensible codebase. A five-hundred-person team that accumulated dark code for a decade is going to pay — in migration effort, in training cost, in the political weight of telling teams “we do not ship what we don’t understand.”

What leadership actually has to do:

Make comprehension a shipping criterion. Not a stretch goal. Not a review nice-to-have. A criterion. If a PR cannot answer the five comprehension questions, it does not merge. That requires the organization to accept that velocity is not measured by PRs-per-week but by explainable PRs-per-week.

Budget for the first layer explicitly. The spec step is where cost is most visible and where teams cut first under pressure. If there is no budget for the paragraph, the paragraph will not be written. Put it in the definition of done.

Hold AI agents to the same standard as humans. The Co-Authored-By: Claude line is not a formality. It is a declaration that the human co-author understood what was committed. Teams that rubber-stamp AI output because the tests pass are not using AI — they are producing dark code at scale with plausible deniability.

Invest in machine-readable artifacts. Manifests, introspection endpoints, structured ADRs. These are not engineering vanity. They are the substrate that keeps a codebase legible to the next AI session, which will not have the institutional memory of the current one.

The codebase that describes itself to agents is the codebase that agents stay useful in. Everything else is a slow drift into illegibility.

Train for comprehension, not just generation. This is the L&D move most organizations have not yet built for. Teach the people in your AI-adjacent workflows — engineers, PMs, analysts, marketers, designers — how to evaluate AI-produced artifacts. The meta-skill is not prompt engineering. It is recognizing when comprehension was skipped and closing the loop before the artifact gets used.

Accept the small-team asymmetry. A small team with strong practice produces codebases that stay comprehensible at scale. A large team without these habits accumulates dark code faster than any documentation effort can recover. Some of the most durable engineering organizations of the next decade will be smaller than the market expects — precisely because comprehension is the scarce input, and it does not benefit from headcount.

Closing: The Comprehension Input

None of this is optional.

The AI agents deployed across enterprises right now are not going to get slower. They will produce more code, with less prompting, at higher quality, more autonomously. Every quarter. That is a direct organizational benefit — until the moment someone in production needs to change code nobody remembers writing, and the organization discovers that its engineering capability is not the capacity to generate code, but the capacity to understand it.

Organizations that build for comprehension will ship faster, with less fragility, in five years than organizations optimizing for generation volume now. The bet is not a close call.

The code your team is writing this quarter is either being written with the comprehension step or without it. That choice is the scarce input. Everything downstream — quality, velocity, hiring, compliance, the ability to adopt the next generation of AI tools without technical bankruptcy — follows from it.

Dark code is an organizational choice. So is the alternative.


This article is a companion to the Human-AI Codebase Standards adopted across the BBWorlds / poqpoq ecosystem on April 15, 2026. The terminology is drawn from Jouke Waleson’s Three Thoughts on Dark Code, Addy Osmani’s Comprehension Debt, and Nate B. Jones’ Your Codebase Is Full of Code Nobody Understood — all published between March and April 2026. The framework adapted here is Jones’; the reframing of the middle layer as “self-describing systems” is our own.