
Structural Divergence: The Metric That Doesn't Exist Yet

Geoff Godwin

Part 2 of a series on navigating the AI tooling landscape as an enterprise engineering leader. Part 1: Two Different Promises established an honest baseline about what LLMs are and aren’t. This post names what accumulates when AI-assisted development scales without structural awareness.


I’ve spent the better part of fifteen years building customer-facing systems at large financial technology organizations. The kind of environments where a single production incident can touch millions of accounts and the architecture decisions made three years ago are still shaping what’s possible today. I mention this not to credential the argument but to establish where I’m standing when I make it: inside the wave, not above it. I use AI-assisted development tools daily and am building an agentic pipeline called Tekhton as both a practical tool and a public portfolio artifact. I am naturally a skeptic of the very tools I use and build, but this isn’t some cautionary tale from someone who hasn’t touched them.

I am simply someone who has been around long enough to recognize a particular pattern, the one where an entire industry is about to learn the same expensive lesson at roughly the same time, and the leading indicator is always the same: every team in the building is celebrating the same metric while nobody is watching the thing the metric doesn’t capture.

Right now, every enterprise engineering organization I’ve interacted with is measuring AI adoption through throughput: PRs merged, stories completed, sprint velocity, lines of code assisted. The dashboards are as green as springtime, the quarterly reports look excellent, and yet the feeling in the room when someone tries to onboard into a module that was heavily agent-assisted three months ago is decidedly not green, though nobody has figured out quite how to say that in a standup yet.

In Part 1, I argued that the AI productivity promise is real and that the people dismissing it entirely are making a mistake. This post accepts that premise and asks a different question: what are the second-order consequences of that promise being kept at enterprise scale? Because second-order consequences are where enterprise engineering gets expensive, and my honest assessment is that we’re accumulating a category of cost that none of our current instruments are designed to detect.


The Conversation That’s Already Happening

Before I lay out what I think is missing, I want to be clear about what’s already been said, and said well, by people whose work I’ve learned from.

Neil Kakkar, an engineer at Tano, coined the term “agentic debt” in February 2026 to describe something he experienced firsthand: a codebase where AI agents had independently produced three nearly identical but structurally distinct frontend implementations across different modules, each one individually reasonable, collectively incoherent. His core insight is that agentic debt is self-reinforcing: An agent writes code that works, you ship it, the next agent reads that code as context and makes its own locally optimal choices, with the compounding effect that each successive agent’s reasoning gets worse because the landscape it’s reasoning about has become muddier. Kakkar names the mechanism correctly, and his observation that the best way to improve agent performance is to keep the code simple enough for humans to model is, in my experience, exactly right.

Margaret-Anne Storey, a researcher whose writing on cognitive debt was subsequently amplified by Simon Willison, makes a complementary and equally important argument: the real debt lives in the developers’ minds, not in the code. Even if AI-generated code is technically readable, the humans involved may have lost the plot on what the system is supposed to do, how their intentions were implemented, and how the program can be changed over time. Storey draws on Peter Naur’s insight that a program is more than its source code; it’s a theory held in the minds of its developers. When agents do the implementation, the theory can become orphaned. She names where the debt lands, and she names it correctly.

Technical debt lives in the code; cognitive debt lives in developers' minds.

The empirical evidence is accumulating quite rapidly. A large-scale study published on arXiv in March 2026 tracked over 110,000 surviving AI-introduced issues across open-source repositories, with code smells being the most common type and the cumulative count continuing to rise. CodeRabbit’s analysis, summarized in a Stack Overflow blog post, found that while PRs per developer increased 20% with AI assistance, incidents per PR increased 23.5%. Ox Security’s research characterized AI-generated code as “highly functional but systematically lacking in architectural judgment,” which is about as precise a summary of the problem as I’ve encountered.

So the problem has been named and the problem has been measured after the fact. What none of these contributions have proposed is a way to detect the drift while it’s accumulating, in real time, before it manifests as the regression, the outage, or the rewrite. The mechanism is understood and the destination is well documented. What’s missing is the instrument that reads the pressure rise before the pipe bursts.

That’s what the rest of this post is about.


Why Your Current Tools Don’t Catch It

I find the most useful way to understand this gap is to think about cities rather than code. Anyone who’s met me knows I love a good metaphor, and this one maps with surprising precision: it makes the blind spot immediately visible.

Imagine a city where every individual building passes inspection. Every permit is approved, every structure is up to code, meets fire safety requirements, and has been signed off by a qualified inspector. From the perspective of any individual structure, everything is fine. Nobody is looking at the whole city though. Nobody has compared the master plan, the one that specified residential zones and commercial corridors and the flow of traffic through the arterial roads, against an aerial photograph of what’s actually being built. And if they did, they’d find that three different water systems are now running in parallel because different contractors made different reasonable assumptions about the municipal connection points from their local perspective. The road network has become a maze because each development added its own access roads without reference to the ones next door. Commercial buildings have slowly bled into what was supposed to be residential, not because anyone violated zoning but because the variance requests were each individually justifiable and nobody was tracking the aggregate.

Every building is fine and yet the city is becoming unlivable.

Code Quality #1513
I honestly didn't think you could even USE emoji in variable names. Or that there were so many different crying ones.

This is what’s happening inside enterprise codebases under high-concurrency AI-assisted development, and the reason it’s invisible is that every tool in the current quality stack evaluates code at the wrong unit of analysis.

Linters are building inspectors: they check individual structures for code violations, and they’re good at it. But a building inspector doesn’t tell you that the city’s road network has become a maze, because that’s not their job and they never see the aerial view.

Static analyzers are structural engineers. They evaluate classes and methods for complexity, coupling, and cohesion, the equivalent of checking whether a building’s foundation will hold. They’re valuable, but they’re evaluating individual structures in isolation. Two buildings can each have perfect structural integrity while being positioned in a way that makes the street between them impassable.

Code review is the zoning board. It approves individual permits, individual diffs, in context. The context is the PR, not the city. A zoning board that reviews each variance request in isolation, without tracking how many variances have been granted in the same neighborhood over the past six months, will approve a slow-motion disaster one reasonable decision at a time.

What’s missing is the urban planner who periodically overlays the master plan onto an aerial photograph of what’s actually been built and says, “we have a problem in the northeast quadrant.” Not because any individual structure failed inspection, but because the collective shape of what’s been built no longer resembles the city that was intended.

And here’s where the city analogy extends in a way I think is particularly useful for enterprise engineers and other technical leaders: infrastructure needs change as a system grows. An intersection where a stop sign was perfectly adequate at first may later need a traffic light. Sometimes a roundabout would have been the better call, but it wasn’t obvious until more infrastructure grew around the intersection and the actual traffic patterns revealed themselves, at which point the non-functional requirements have shifted underneath you and the original design choice, even though it was correct at the time, is now working against the system rather than for it. This is a normal part of how systems evolve. The problem isn’t that it happens; the problem is that nobody is watching for it.

Two specific failure modes fall through the cracks of every tool I’ve just described:

The first is dialect proliferation. Two modules handle API errors using structurally different approaches, not because anyone decided they should, but because two different agent sessions read two different files for context and made two different reasonable choices. Each approach passes linting and code review. Together, they represent a codebase that has quietly developed two dialects for the same operation, and the next contributor, human or agent, who encounters both will have to guess which one is canonical. This happens at every level of granularity: at the architectural level you get two different data access patterns coexisting in what’s supposed to be a unified domain layer, and at the implementation level you get four subtly different approaches to request validation, three styles of logging configuration, and two conventions for pagination, all of which technically work but none of which agree with each other. Multiply that across a large codebase and you get a city where every neighborhood speaks a slightly different language: not just different accents, but different grammars.
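To make dialect proliferation concrete, here is a hypothetical sketch (every name here is invented for illustration) of two error-handling dialects for the same conceptual operation. Each would pass linting and a per-PR review; together they force the next contributor to guess which one is canonical:

```python
import logging

logger = logging.getLogger(__name__)

class ApiError(Exception):
    """Hypothetical error type raised by the API client."""

# Dialect A: try/except with log-and-rethrow. The caller is expected
# to handle the exception.
def fetch_account_a(client, account_id):
    try:
        return client.get(f"/accounts/{account_id}")
    except ApiError:
        logger.exception("account fetch failed")
        raise

# Dialect B: result-tuple style. The client signals failure by
# returning None, and no exception ever escapes this function.
def fetch_account_b(client, account_id):
    response = client.get(f"/accounts/{account_id}")
    if response is None:
        return None, "account fetch failed"
    return response, None
```

Both functions work. Neither is wrong in isolation. The cost only appears in the aggregate, when a caller has to interoperate with both.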

The second is boundary erosion. Most enterprise codebases have implicit or explicit module boundaries: layered architecture, domain separation, service contracts. Each individual PR that adds a new cross-boundary dependency looks reasonable in context. The diff is clean. The reviewer approves it. But over the course of fifty PRs, each one adding a single new dependency that crosses a boundary, the intended separation of concerns has been quietly dissolved. No single change broke the architecture, but the aggregate of all changes did.


Measuring the Thing That Matters

There’s a useful distinction in medicine between a thermometer and a fever chart. A thermometer tells you the patient’s temperature right now. A fever chart tells you whether the patient is getting better or worse, and how fast. A single reading of 100.2°F could mean the patient is recovering (it was 102 yesterday) or deteriorating (it was 98.6 this morning). The number alone tells you nothing about trajectory, it’s the trend that tells you everything.

Thermometer vs. Fever Chart

A diagram comparing a thermometer showing a single reading of 100.2°F with no trajectory information, versus a fever chart plotting the same temperature across Mon through Fri — showing it peaked on Tuesday and is now trending down, making the recovery clear.
A thermometer tells you the current reading. A fever chart tells you whether the patient is improving or deteriorating. Every existing code quality metric is a thermometer.

Every existing code quality metric is a thermometer. Cyclomatic complexity, LCOM (Lack of Cohesion of Methods, a measure of how closely related the internals of a class are to each other), coupling between objects, the Maintainability Index: these all measure the state of the code at a single point in time. They’re valuable but they’re insufficient for the problem I’ve been describing, because the problem isn’t that the codebase is unhealthy at any given moment. It’s that the codebase is drifting, and the rate of that drift is accelerating in ways that point-in-time measurements can’t detect alone.

What enterprise engineering teams actually need is the fever chart: a way to measure the rate of change of structural coherence over time. Not “how healthy is the code right now” but “is the code getting more or less coherent, and how fast?” We’ve been measuring whether individual changes are good. We haven’t been measuring whether the collective direction of all changes is coherent.

We've been measuring whether individual changes are good. We haven't been measuring whether the collective direction of all changes is coherent.
On structural divergence

I’ve been calling this concept the Structural Divergence Index, and my understanding is that nothing like it currently exists as a named, measured, trackable quantity in any widely adopted engineering practice. The idea is a composite measurement built from four observable dimensions, each tracked over time, that together tell you whether independent contributors, human or agent, are collectively pulling a codebase toward coherence or away from it.

The first measurement is of Pattern Entropy. How many structurally distinct implementations exist for the same conceptual operation within the codebase? If your project has one canonical way to handle API errors, your pattern entropy for that operation is low. If it has four different approaches that all work but are structurally different, it’s high. The measurement itself is a count of distinct structural shapes, identified through AST (Abstract Syntax Tree) analysis, for each category of operation. What matters isn’t the absolute number so much as the direction: when pattern entropy is rising, it means new implementation dialects are being introduced faster than existing ones are being consolidated. When it’s stable or falling, someone is actively gardening.
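As a rough sketch of how the shape count could work, here is the idea using Python’s stdlib ast module as a stand-in for a full tree-sitter pipeline. The normalization strategy (keep node types, drop identifiers and literals) is an assumption for illustration, not a specification:

```python
import ast

def structural_shape(source: str) -> str:
    """Reduce a snippet to its structural skeleton: node types only.

    Identifiers and literals are discarded, so two implementations
    that differ only in naming collapse to the same shape.
    """
    tree = ast.parse(source)
    return "/".join(type(node).__name__ for node in ast.walk(tree))

def pattern_entropy(snippets_by_category: dict) -> dict:
    """Count distinct structural shapes per operation category."""
    return {
        category: len({structural_shape(s) for s in snippets})
        for category, snippets in snippets_by_category.items()
    }
```

A renamed copy of the canonical pattern contributes nothing to the count; a structurally different approach to the same operation raises it by one, which is exactly the signal you want to trend.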

The second measurement is the Convention Drift Rate. This is the velocity dimension: how many new structural patterns were introduced in the last N commits, minus how many existing patterns were consolidated or removed? A net positive drift rate means the codebase is becoming more internally inconsistent over time. A net negative rate means someone is doing the unglamorous work of convergence. The distinction matters because a high absolute pattern entropy might be perfectly acceptable for a mature, complex system, while a rising entropy is a warning signal regardless of where it started.
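The drift-rate arithmetic itself is simple enough to sketch directly. Assuming each snapshot carries the set of distinct structural shapes it observed (an assumption about the snapshot contents, consistent with the pattern-entropy idea above):

```python
def net_convention_drift(shape_sets: list) -> int:
    """Net drift over a window of successive snapshots.

    shape_sets: one set of observed structural shapes per snapshot,
    oldest first. Returns shapes introduced minus shapes consolidated:
    positive means the codebase is diverging, negative means someone
    is doing the unglamorous work of convergence.
    """
    introduced = consolidated = 0
    for prev, curr in zip(shape_sets, shape_sets[1:]):
        introduced += len(curr - prev)    # new dialects that appeared
        consolidated += len(prev - curr)  # old dialects that were removed
    return introduced - consolidated
```

The same absolute entropy can sit on either side of this number, which is the whole point: the sign and magnitude of the drift carry the signal, not the level.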

The third measurement is the Coupling Topology Delta. Not “how coupled is the code” but “how is the shape of the coupling changing between measurements?” This is a graph-theoretic property: you take the dependency graph at two points in time and compare its structure. Is the number of circular dependencies increasing? Are hub nodes, modules that everything depends on, getting more concentrated? Is the maximum dependency path length growing? Each of these individually might be unremarkable. The aggregate trend of all of them tells you whether independent contributors are collectively tightening the dependency graph into a knot.
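A minimal sketch of the delta idea, comparing a couple of easily computable graph properties between two dependency snapshots. The metric choices are illustrative, not exhaustive (cycle detection is omitted for brevity):

```python
def topology_metrics(edges) -> dict:
    """edges: iterable of (src, dst) module dependencies."""
    edges = set(edges)
    nodes = {n for edge in edges for n in edge}
    indegree = {}
    for _, dst in edges:
        indegree[dst] = indegree.get(dst, 0) + 1
    biggest_hub = max(indegree.values(), default=0)
    return {
        "node_count": len(nodes),
        "edge_count": len(edges),
        # share of all edges pointing at the single biggest hub
        "hub_concentration": biggest_hub / len(edges) if edges else 0.0,
    }

def topology_delta(before, after) -> dict:
    """Diff the metrics of two snapshots: the trend is the signal."""
    m0, m1 = topology_metrics(before), topology_metrics(after)
    return {key: m1[key] - m0[key] for key in m0}
```

Any single reading of these numbers is unremarkable; a hub-concentration delta that stays positive across many snapshots is the dependency graph tightening into a knot.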

The fourth and final measurement is of Boundary Violation Velocity. How fast are new cross-boundary dependencies appearing? Each individual boundary crossing might be the right call in context, a pragmatic shortcut to ship a feature, approved by a reviewer who understood the tradeoff. But the rate at which new boundary crossings accumulate over time tells you whether the architecture’s intended separation of concerns is being respected or quietly abandoned. Think of it as the zoning variance tracker: one variance is a judgment call, ten variances in the same quarter is a pattern that suggests the zoning itself needs to be reconsidered, or that contributors aren’t aware it exists.
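Given a module-to-boundary map, the velocity computation is a small diff over successive snapshots. A sketch, with all names hypothetical:

```python
def boundary_crossings(deps, boundary_of) -> int:
    """Count dependencies whose endpoints live in different boundaries.

    deps: iterable of (src_module, dst_module) pairs.
    boundary_of: map from module name to its architectural boundary.
    """
    return sum(
        1 for src, dst in deps if boundary_of[src] != boundary_of[dst]
    )

def crossing_velocity(dep_history, boundary_of) -> list:
    """Per-snapshot change in total crossings: the rate, not the level.

    dep_history: one dependency list per snapshot, oldest first.
    """
    totals = [boundary_crossings(deps, boundary_of) for deps in dep_history]
    return [later - earlier for earlier, later in zip(totals, totals[1:])]
```

A velocity that is consistently positive is the zoning variance tracker lighting up: ten variances in a quarter, not one.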



Why This Is Different Now

Here’s the dimension that makes all of this specific to the current moment rather than a restatement of problems the industry has always had: every one of these measurements accelerates in a non-linear fashion with the number of independent context windows generating code simultaneously.

I want to be precise about what I mean by that, because I can already hear the counterargument: “our agents aren’t starting from zero, we use RAG (Retrieval-Augmented Generation) pipelines and AGENTS.md files and architecture documentation in the system prompt, so the context problem is solved.” And I’d push back on that characterization, not because those approaches aren’t valuable (they absolutely are, and I use them myself), but because there’s a meaningful difference between giving an agent access to information and giving it access to every other contributor’s reasoning in real time. Even with retrieval-augmented context, each agent session is reconstructing its understanding of the project from whatever materials happen to be retrieved for its specific query, and that reconstruction is lossy, selective, and bounded by context window limits. Two agent sessions operating on the same codebase at the same time, given the same RAG corpus, will still make different implementation choices because the retrieval is query-dependent, the context window forces prioritization, and there is no shared working memory between concurrent sessions. RAG gives agents access to documentation. It doesn’t give them access to each other’s decision-making. It’s in that gap that structural divergence accumulates.

Human teams coordinate naturally through standups, hallway conversations, code review discussions, and the shared mental model that builds up over months of working together on the same system. Agent sessions don’t. Each one begins by reconstructing context from static artifacts, does its work, and exits. The next session does the same, with no memory of the first session’s reasoning or choices. For small teams and small codebases, this is manageable. For enterprise-scale development with dozens of concurrent agent sessions across a large monorepo, the drift rate becomes a function of contributor concurrency in a way that human-only teams simply never experienced.

One more thing worth saying about technical feasibility, because I don’t want this to read as a concept that can’t be built. Modern AST parsing, and specifically tree-sitter, provides the primitive for structural fingerprinting, the ability to parse source code into a concrete syntax tree and query it for structural patterns across languages. I’m aware that tree-sitter is having a moment right now and that much of the conversation around it focuses on the obvious use cases: better syntax highlighting, giving LLMs richer code context, powering editor features. My argument is narrower and, I believe, more interesting: tree-sitter gives you the ability to identify structurally equivalent operations across a codebase and count the distinct shapes they take, which is the foundation for pattern entropy measurement. A recent research project called Codebase-Memory, published in March 2026, demonstrates that tree-sitter-based knowledge graphs exposed via MCP are technically feasible right now across dozens of languages. The parsing primitive exists. The gap isn’t the tool; it’s that nobody has pointed it at the drift measurement problem specifically.

The detailed treatment of what a structural divergence measurement looks like in practice, including a proposed snapshot format, the AST query patterns that enable each dimension, and some initial thinking on how the component dimensions might be weighted, will come in a companion technical paper I’m still noodling on. This post is about establishing why the instrument is needed and what it would measure. That paper will be about how to actually implement it.


Drift Control Snapshots

Measurement only matters if it leads to action, and I think the most practical mechanism for turning the Structural Divergence Index from a concept into a habit is something I’ve been calling a drift control snapshot: a periodic, automated capture of the project’s structural fingerprint that gets committed to the repository and diffed against previous snapshots.

Think of it as the structural equivalent of what database migration snapshots do for schema. A migration snapshot captures the shape of your data model at a point in time so you can track how it evolves and detect when something has changed in a way that needs attention. A drift control snapshot does the same thing for your codebase’s architectural shape: its pattern catalog, its dependency topology, its boundary health, and its convention consistency.

The capture cadence should be tied to your version control workflow, not to arbitrary time-boxed ceremonies. My recommendation is to generate a snapshot on every merge to your primary integration branch. The snapshot itself is cheap: it’s an automated AST analysis pass, not a manual review. The cost of generating it is trivial compared to the cost of missing drift, and tying it to merges rather than calendars means the measurement frequency naturally scales with your development velocity. Teams that merge ten times a day get ten data points. Teams that merge twice a week get two. The measurement cadence matches the rate of change, which is exactly what you want.
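A sketch of what the merge-triggered snapshot assembly might look like. Here `analyze` is a hypothetical callable standing in for the AST analysis pass, and the surrounding CI wiring (the trigger on merge, committing the output file) is assumed rather than shown:

```python
import json
from datetime import datetime, timezone

def build_snapshot(commit: str, analyze) -> dict:
    """Assemble one drift control snapshot for a merge commit.

    `analyze` is assumed to return the structural fields
    (pattern catalog, dependency topology, boundary status);
    the CI job would serialize the result and commit it
    alongside the merge so successive snapshots can be diffed.
    """
    snapshot = {
        "snapshot_version": "0.1.0",
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "commit": commit,
    }
    snapshot.update(analyze())
    return snapshot
```

Because the payload is plain JSON, it diffs cleanly in version control, which is the property the whole mechanism depends on.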


What goes in a snapshot, at the napkin-math level:

{
  "snapshot_version": "0.1.0",
  "timestamp": "2026-04-01T14:32:00Z",
  "commit": "a1b2c3d",
  "pattern_catalog": {
    "error_handling": { "distinct_shapes": 3, "canonical": "try-catch-log-rethrow" },
    "data_access": { "distinct_shapes": 2, "canonical": "repository-pattern" },
    "api_validation": { "distinct_shapes": 4, "canonical": null }
  },
  "dependency_topology": {
    "node_count": 142,
    "edge_count": 387,
    "cycle_count": 3,
    "max_depth": 7,
    "hub_concentration": 0.12
  },
  "boundary_status": {
    "defined_boundaries": 8,
    "active_cross_boundary_deps": 23
  },
  "divergence_summary": {
    "pattern_entropy_delta": "+1 (api_validation)",
    "net_convention_drift": "+2 since last snapshot",
    "new_boundary_crossings": 1
  }
}

This is deliberately simplified, a napkin sketch rather than a specification. The companion technical paper goes deeper on the schema, the query patterns that populate each field, and the edge cases that emerge when you try to define “structurally equivalent operations” rigorously across different languages and paradigms. For now, the point is that the shape of this data is straightforward and the content is inexpensive to generate.


Reading the Chart and Writing the Prescription

What you do with successive snapshots:

Step one is to diff them. The delta between snapshot N and snapshot N-1 tells you what changed structurally in that merge window: a new distinct shape appeared in error handling, a boundary crossing was added, or the cycle count in the dependency graph went from 3 to 4. Each of these is a fact, not a judgment, and the diff gives you a factual record of structural evolution.
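A sketch of that diffing step, against the napkin-sketch snapshot format shown earlier. Which fields to surface is an illustrative choice, not a fixed schema:

```python
def diff_snapshots(prev: dict, curr: dict) -> list:
    """Produce factual structural deltas between consecutive snapshots."""
    facts = []
    prev_catalog = prev.get("pattern_catalog", {})
    for operation, entry in curr.get("pattern_catalog", {}).items():
        before = prev_catalog.get(operation, {}).get("distinct_shapes", 0)
        after = entry["distinct_shapes"]
        if after != before:
            facts.append(f"{operation}: distinct shapes {before} -> {after}")
    for key in ("cycle_count", "max_depth"):
        before = prev.get("dependency_topology", {}).get(key, 0)
        after = curr.get("dependency_topology", {}).get(key, 0)
        if after != before:
            facts.append(f"{key}: {before} -> {after}")
    return facts
```

The output is deliberately a list of statements, not scores: the diff records what happened, and the trending step decides whether it matters.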

Step two is to trend them. It takes three data points to make a trend, and three consecutive snapshots showing rising pattern entropy in the same operation category are a signal worth investigating. A single spike in distinct shapes during a planned migration or a major refactor is expected and healthy. The trend is what separates signal from noise, and the ability to look at a trendline over twenty or fifty snapshots is what turns structural awareness from an occasional gut feeling into a data-informed practice.

Step three is to alert on rate of change instead of absolute values. “Pattern entropy for error handling rose by 3 in the last 10 merges” is an actionable alert that tells you something specific is happening and where to look. “Pattern entropy is 7” is not actionable, because 7 might be perfectly healthy for your project’s stage and complexity. The thresholds that matter are rates, not snapshots, which is the entire point of measuring the fever chart rather than the thermometer.
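Both the trend rule and the rate rule are mechanical once you have the per-snapshot readings as a series. A sketch, with the window and threshold values as placeholder assumptions, not recommendations:

```python
def rate_alerts(series, window=10, threshold=3, name="pattern entropy"):
    """Alert on rate of change, never on absolute level.

    series: one metric reading per snapshot, oldest first.
    """
    alerts = []
    # Rate rule: how far did the metric move over the last `window` merges?
    if len(series) > window:
        rise = series[-1] - series[-1 - window]
        if rise >= threshold:
            alerts.append(
                f"{name} rose by {rise} in the last {window} merges"
            )
    # Trend rule: three data points make a trend.
    if len(series) >= 3 and series[-3] < series[-2] < series[-1]:
        alerts.append(f"{name} has risen for three consecutive snapshots")
    return alerts
```

Note what this never does: it never fires on the level itself, so a mature system with a legitimately high baseline stays quiet until it starts moving.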

But measurement and alerting are only half the picture. A doctor doesn’t just read the chart at the foot of the patient’s bed and walk out of the room. The chart tells you what to investigate, and the investigation tells you what to prescribe. So let’s talk about what you actually do when the numbers start moving in the wrong direction.

If your pattern entropy is climbing for a specific operation, say error handling, what that’s really telling you is that your contributors are inventing new approaches to the same problem because they don’t know there’s already a right way to do it, or they can’t find it. The fix is straightforward and boring: pick the canonical pattern, document it somewhere that agents will actually read (an AGENTS.md or a conventions file in the repo root, not a Confluence page three clicks deep), and then carve out time to bring the variants back into line. Nobody will celebrate this work in a sprint review. It will save your team more hours than most feature work does.

If your coupling topology is tightening, more circular dependencies creeping in, more modules becoming hubs that everything depends on, then what’s happening is that contributors are taking the shortest path between two points and that path keeps going through walls. Go look at the recent cross-module dependencies and ask yourself a simple question for each one: did we decide this should be a dependency, or did it just happen because it was the fastest way to ship? The ones that “just happened” need to be pulled back before they become load-bearing, because once other code starts depending on a shortcut, removing it stops being a cleanup task and starts being a project.

If your boundary crossings are accelerating, that’s your zoning being eroded. Your architectural boundaries exist in someone’s head, or maybe in a wiki page, and contributors are quietly redrawing them one PR at a time. The intervention here is to make boundaries enforceable rather than aspirational: fitness functions in your CI pipeline that actually prevent a new cross-boundary dependency from merging without explicit approval. If the only thing standing between your architecture and dissolution is the hope that every contributor reads the design doc first, that’s a bet I wouldn’t take even with a fully human team, and it’s a losing bet entirely in an agentic environment.
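A minimal sketch of such a fitness function in Python, using stdlib ast to scan a module’s imports. The boundary map and approved-crossings list are invented for illustration; in practice they would live in a conventions file at the repo root and the check would run in CI against changed files:

```python
import ast

# Hypothetical architecture: which boundary each top-level package belongs to,
# and which crossings have been explicitly approved.
BOUNDARY_OF = {"web": "ui", "orders": "domain", "storage": "infra"}
APPROVED_CROSSINGS = {("ui", "domain")}

def imported_top_levels(source: str) -> set:
    """Top-level package names imported by a module's source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names

def boundary_violations(module_pkg: str, source: str) -> list:
    """Cross-boundary imports not on the approved list: these fail the build."""
    here = BOUNDARY_OF[module_pkg]
    violations = []
    for target in sorted(imported_top_levels(source)):
        there = BOUNDARY_OF.get(target)
        if there and there != here and (here, there) not in APPROVED_CROSSINGS:
            violations.append((module_pkg, target))
    return violations
```

Wired into CI as a required check, this turns the boundary from a wiki page into a gate: a new crossing merges only after someone consciously adds it to the approved list, which is precisely the decision record that boundary erosion erases.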

And if your overall convention drift rate is positive over multiple measurement periods, meaning you’re accumulating dialects faster than you’re consolidating them, what you need is a gardener. Someone, or an agent specifically tasked with this role, whose job it is to regularly survey the codebase for variant implementations and converge them toward the canonical ones. Neil Kakkar calls this the “gardener agent” concept, and I think the framing is exactly right: a contributor whose job isn’t to add features but to tend the structural coherence of the thing so that everyone else’s features can grow in soil that makes sense.


What This Costs If You Ignore It

I want to ground this in something concrete, because the temptation with architectural arguments is to stay safely in the realm of principle where nobody has to write a check.

The most instructive example I know of, for the specific phenomenon this post describes, actually predates the current AI era by more than a decade. When HealthCare.gov launched in October 2013, it was the product of 55 independent contractors building components in parallel, each one working to their own specifications, each one producing code that passed its own quality checks, and nobody watching whether the aggregate system was structurally coherent. The result, for once, is well documented: a system that cost an estimated $1.7 billion, required five million lines of code to be rewritten, and failed so comprehensively at launch that paper applications had to be routed to a call center that was trying to use the same broken system manually. The individual components weren’t all bad. It was the collective shape of the thing that was so catastrophically incoherent, and nobody had an instrument that would have warned them of that before launch day.

HealthCare.gov was, in effect, 55 independent human context windows building in parallel without structural coordination. The reason I bring it up isn’t because it’s an AI story, it isn’t, but because it’s the pre-agentic version of the exact dynamic I’ve been describing. What AI-assisted development does is create the same multi-contributor coordination problem, except instead of 55 contractors you might have 55 agent sessions per day, the accumulation rate is orders of magnitude faster, and the structural coordination problem is fundamentally harder because agents don’t attend standups, don’t build shared mental models through hallway conversations, and don’t develop the institutional intuition that tells a human contributor “we don’t do it that way here.”

HealthCare.gov was, in effect, 55 independent human context windows building in parallel without structural coordination.
On the HealthCare.gov debacle

The broader industry data tells the rest of the story. Estimates from Accenture’s Digital Core report put the annual cost of technical debt in the United States alone at $2.41 trillion. A McKinsey survey of CIOs found that technical debt amounts to roughly 40% of an enterprise’s total technology estate. Organizations carrying high levels of technical debt spend 40% more on maintenance and deliver features 25 to 50% slower than their competitors. And a Carnegie Mellon University study, one that I think deserves more attention than it gets, found that architectural issues are the single most significant source of technical debt, more costly and more persistent than code-level bugs, insufficient testing, or missing documentation.

That last finding is directly on point for this argument. Architectural debt, the kind that arises from structural divergence across modules and boundaries, is already the most expensive category of technical debt the industry contends with. What I’m suggesting is that agentic development is about to compound it at a rate the industry has never experienced, while the quality instruments currently deployed are evaluating at a unit of analysis that can’t see it happening.

If structural divergence is real, unmeasured, and accelerating with agent concurrency, then the current enterprise strategy of “maximize AI-assisted throughput” is building toward a correction that nobody has budgeted for. Not a single dramatic incident, but a gradual deceleration: the codebase becomes so internally inconsistent that feature velocity slows despite AI assistance, because every new feature has to navigate an increasingly incoherent landscape. The senior engineers stop building new things and start spending their time on archaeological expeditions through agent-generated code, trying to figure out which of four error-handling patterns is the one they’re supposed to extend. The rewrite gets proposed. The budget gets approved. And the postmortem, if anyone bothers to write one, will say “technical debt,” because we don’t yet have a more precise name for what actually happened. “Technical debt” will become the catch-all “IBS” term of the coding world.

So what I’m proposing is that we give this phenomenon a proper name, and that we build the instrument to measure it before the postmortem forces the conversation.

What this means for the people doing the work, and specifically what it suggests about where senior engineering judgment becomes most valuable in an increasingly automated landscape, is a question I’ll take up directly in Part 4 of this series.

Good Code #844
You can either hang out in the Android Loop or the HURD loop.

Carrying This Forward

If I can leave you with one practical thing, it should be this: the next time your team celebrates a sprint velocity milestone, or your engineering leadership reports on AI-assisted throughput metrics, ask whether anyone can tell you what happened to the shape of the codebase during that period. Not what was added to it. Not what was shipped. What changed about its structural coherence, about the number of ways the same operation is implemented across modules, about the density and direction of its dependency graph, about whether its architectural boundaries are still intact or have been quietly renegotiated by a hundred individually reasonable PRs.

If nobody can answer that question, you’re flying on instruments that don’t include an altimeter. The ground might be far away. It might not be. The fact that you can’t tell the difference is the problem.

In the companion technical paper, I’ll go deeper on what a structural divergence measurement looks like in practice: the AST parsing primitives that enable it, a proposed snapshot schema, and some initial thinking on how to weight the component dimensions against each other. That piece is for the people who want to build the instrument. This one was for the people who need to know the instrument is missing.

And the drift problem doesn’t stop at the repository boundary. When your system isn’t a codebase but a constellation of services, elastic infrastructure that reconfigures itself under load, pods that come and go, the structural divergence problem moves up a level of abstraction and the measurement challenge gets fundamentally harder. How you maintain a coherent understanding of a system that is itself dynamic, and how you give agentic tooling meaningful context about something that’s already changed by the time you’ve described it, is what Part 3 of this series will address.

We built a whole generation of quality tools around one core assumption: that if every individual change is good, the whole system is healthy. That assumption worked fine when the people making those changes were also the people holding the big picture in their heads. That’s no longer a given, and the instruments haven’t caught up. So let’s build the instrument that catches what the others don’t.


If this resonated, or if you think I got something wrong, I’d genuinely like to hear it. Come find me at LinkedIn.com/in/geoffgodwin or GitHub.com/geoffgodwin
