I put agentic AI through a real engineering stress test. Here’s what I learned.

Mar 11, 2026

A lot of the conversation around AI and software engineering is still missing the point.

Some people are stuck on whether AI will replace engineers. Others are captivated by flashy demos where a disposable prototype gets spun up in 10 minutes. Both conversations are too shallow. They do not really explain what is changing in the work itself.

I usually prefer to write at a higher level of abstraction than I’m going to here. But sometimes a shift in how work gets done is significant enough that you have to zoom all the way in before you can say anything useful about it. This is one of those cases.

A few years ago, models started to write buggy code in small bursts. That was interesting, but not enough to change the work. The latest agentic tools are different. Claude Code, Codex, and similar systems on the latest models can now participate in much more of the engineering loop than most people appreciate. They can inspect environments, read logs, reason through failures, propose next steps, generate scripts, document discoveries, and help turn messy progress into cleaner systems. That is a very different thing than a code assistant that mostly helps you type faster.

The project: a real modern system, not a toy

I wanted to pressure test that for myself.

Not with a cute prototype app or a carefully staged demo. I wanted a project that was both relevant and difficult, one that captured the complex architecture underlying the kind of end-user value many EPD (engineering, product, and design) teams are chasing today: enterprise search, internal knowledge systems, support copilots, and agent memory layers.

So I spent the better part of a Sunday building a complete system that could pull information out of tools like Jira, Notion, and Readwise Reader, store it in one place, make it searchable by semantic meaning rather than just keywords, and expose it through a simple API for apps and agents to use. OpenAI’s Codex helped drive the engineering work from start to finish.

What I actually built

Although no one piece of this architecture was novel, the project crossed several engineering layers at once — infrastructure, container setup, data syncing, database connectivity, schema inspection, search and retrieval design, API exposure, and day-to-day operator workflow.

Starting from an empty repo, I ended up with a repeatable environment-bootstrapping toolset and operator workflow built around:

A data-ingestion tool (Airbyte)
Containerized services (Docker)
A database (Postgres) as the system of record
A shared search layer (pgvector) across multiple sources
A small local API (written using Flask)
A repo-local setup for AI skills and documentation, so that onboarding to and maintaining the project didn’t require the AI to constantly re-learn the environment from scratch
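
To make "searchable by semantic meaning" concrete, here is a minimal sketch of the ranking step a vector search layer performs. It is illustrative only: in the system described here that work happens inside Postgres via pgvector, and the three-dimensional vectors, chunk IDs, and function names below are made-up stand-ins for real embeddings and real tables.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector, chunks, top_k=2):
    """Return the top_k (chunk_id, score) pairs, most similar first."""
    scored = [
        (chunk_id, cosine_similarity(query_vector, vector))
        for chunk_id, vector in chunks.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real ones have hundreds or thousands of dimensions.
CHUNKS = {
    "jira:PROJ-1": [0.9, 0.1, 0.0],
    "notion:page-a": [0.1, 0.9, 0.0],
    "readwise:doc-x": [0.0, 0.2, 0.9],
}

# A query vector that points mostly in the same direction as the Jira chunk.
results = search([0.8, 0.2, 0.0], CHUNKS)
```

The point of the sketch is that ranking depends on direction in embedding space, not on shared keywords, which is why one search layer can serve several sources at once.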

It was a great stress test because there were plenty of potentially messy failure modes.

And agentic AI rose to the challenge. Only 1 day, 17 chat threads, and a few dollars later, I went from “let’s see what happens” to a working local system that could pull scattered information into one place and let apps or AI search and use it. That local system also had a clear path to running on cloud infrastructure at scale.

The ground that is shifting beneath our feet

The scope of engineering work AI can now meaningfully help drive is much larger than most people think, as long as AI is used properly.

This was not a one-shot miracle where AI delivered a shiny UI that only half-worked under the hood. Across just 17 chat threads, AI did something more useful: it helped bring robust backend plumbing to life.

So, if your model of AI-assisted development is still “prompt in, code out,” you are behind the curve.

The real advance is that these tools can now work with you inside the actual loop of engineering: inspect the current state, form a grounded hypothesis, make a targeted change, verify what happened, react to the evidence, and help turn the discovery into something durable.[1] That loop is where real engineering lives. Most projects are not about writing isolated code snippets. They are about figuring out what is true in a messy environment, reducing uncertainty, and making the next good decision.

Yet again, this project meaningfully advanced my mental model of what AI is capable of.

The biggest ‘aha’: compressed exploration

Normally, a project like this is full of dead time. You hit an issue, lose momentum, start cross-referencing docs, try a few fixes, realize the problem is somewhere else, and slowly bleed context as you go. The newest agentic tools dramatically reduce that friction. They help you hold context longer, narrow the search space faster, and keep moving.[2] A lot of leverage in engineering comes not from raw code production, but from compressing the time between “something is wrong” and “I now know the next best step.”

That is the part many people still do not understand.

Here are the AI-first practices that I found mattered most:

1. Use AI as an engineering collaborator, not a code generator.

That sounds obvious, but it is a very different stance in practice. I was not asking for giant blobs of output and hoping they worked. I was using AI to help understand the current state, propose the next step, implement a focused change, and reason about the result.

Early on, for instance, the challenge was getting my Airbyte install to behave reliably. The wrong approach would have been to ask for one massive setup script and pray. The right one was to work in short loops: install, verify, inspect failures, patch the setup, rerun checks. That is how issues like stateful login behavior, system resource requirements, and install detection quirks actually surfaced.
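
The install, verify, inspect, patch, rerun rhythm can be sketched as a tiny check runner. This is an illustration under stated assumptions, not the project's actual tooling: the `Check` type, the check names, and the hint strings are hypothetical, and real probes would shell out to the tools involved rather than read a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Check:
    name: str                  # hypothetical label for the check
    probe: Callable[[], bool]  # returns True when this part of the setup is healthy
    next_step: str             # the concrete thing to try if the probe fails


def first_failure(checks: List[Check]) -> Optional[Check]:
    """Run checks in order and stop at the first one that fails."""
    for check in checks:
        if not check.probe():
            return check
    return None


def run_loop(checks, apply_fix, max_rounds=5):
    """Verify, patch only the first failure, then re-verify, one small step at a time."""
    history = []
    for _ in range(max_rounds):
        failed = first_failure(checks)
        if failed is None:
            return history
        history.append((failed.name, failed.next_step))
        apply_fix(failed)
    if first_failure(checks) is None:
        return history
    raise RuntimeError(f"still failing after {max_rounds} rounds: {history}")


# Simulated environment state so the sketch runs without any real services.
state = {"installed": False, "logged_in": False}
checks = [
    Check("install detected", lambda: state["installed"], "rerun the installer"),
    Check("login works", lambda: state["logged_in"], "reset stored credentials"),
]


def apply_fix(check):
    """Hypothetical fix: flip the simulated flag that the failed check reads."""
    key = "installed" if check.name == "install detected" else "logged_in"
    state[key] = True


history = run_loop(checks, apply_fix)
```

Each round touches exactly one failure, so every change is followed by fresh evidence before the next change is made.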

2. Force AI to inspect, not speculate.

This may be the most important habit of all.

AI gets dangerous when it is allowed to answer from generic memory instead of the real system in front of it. So throughout the project, I kept grounding it in evidence: command output, logs, row counts, actual database schemas, service behavior, failed sync details.

That changed the quality of the work substantially. When one sync failed, for example, the easy answer would have been the usual suspects: bad credentials, bad networking, bad destination config. But the live evidence showed the source and destination connections were both fine. The actual issue was a stale cursor configuration on one data stream. That is a very different diagnosis, and it leads to a very different fix.[3]

3. Keep the work in short, testable loops.

This sounds simple because it is simple, but people ignore it constantly.

When AI is involved, the temptation is to hand it a huge objective and hope. That usually backfires. The better pattern is to define crisp checkpoints. If you are adding a new source to a data retrieval pipeline, do not start with “index everything perfectly.” Start with “can one real document make it all the way through the path and show up correctly in the retrieval tables?”

That kind of smoke test reduces ambiguity, limits the surface area, and gives both you and the model something concrete to reason against.
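
A smoke test of that shape can be sketched as follows. The assumptions are worth stating loudly: an in-memory SQLite database stands in for Postgres so the example runs anywhere, the table and column names are simplified and hypothetical, and `index_document` is a placeholder for whatever the real pipeline does.

```python
import sqlite3


def make_db():
    """Create an in-memory stand-in for the retrieval tables (simplified, hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents (
            document_uid TEXT PRIMARY KEY,
            title TEXT NOT NULL
        );
        CREATE TABLE document_chunks (
            document_uid TEXT NOT NULL REFERENCES documents(document_uid),
            chunk_index INTEGER NOT NULL,
            content TEXT NOT NULL
        );
        """
    )
    return conn


def index_document(conn, document_uid, title, body, chunk_size=40):
    """Placeholder pipeline: store one document and split its body into fixed-size chunks."""
    conn.execute("INSERT INTO documents VALUES (?, ?)", (document_uid, title))
    chunks = [body[i:i + chunk_size] for i in range(0, len(body), chunk_size)]
    conn.executemany(
        "INSERT INTO document_chunks VALUES (?, ?, ?)",
        [(document_uid, index, text) for index, text in enumerate(chunks)],
    )
    conn.commit()
    return len(chunks)


def smoke_test(conn, document_uid):
    """Did exactly one document land, with at least one non-empty chunk?"""
    doc_count = conn.execute(
        "SELECT COUNT(*) FROM documents WHERE document_uid = ?", (document_uid,)
    ).fetchone()[0]
    chunk_count = conn.execute(
        "SELECT COUNT(*) FROM document_chunks "
        "WHERE document_uid = ? AND length(content) > 0",
        (document_uid,),
    ).fetchone()[0]
    return doc_count == 1 and chunk_count >= 1


conn = make_db()
chunk_count = index_document(conn, "doc-1", "Example title", "word " * 20)
passed = smoke_test(conn, "doc-1")
```

The check is deliberately narrow: one document, one yes-or-no answer. Broader questions about ranking quality can wait until this passes.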

4. Give AI local context so it stops acting like a tourist.

General model knowledge is helpful, but it is not enough once you are deep into a specific workspace.

In this project, I encoded repo-specific operating knowledge into AI skills and documentation: what the repo actually was, how the operator flow worked, what the landed data schemas looked like, how the data retrieval layer was organized, and which commands actually mattered. The difference between an AI that vaguely understands Airbyte and one that understands this exact workspace’s Airbyte setup is enormous.

Same model. Totally different leverage.[4]

5. Turn discoveries and misses into durable assets immediately.

Every real project uncovers annoying truths. The normal temptation is to solve them once and move on. That is a mistake.

When you discover the exact sequence needed to make a host database reachable from containers, the right defaults for a setup flow, or the specific verification command that tells you whether the environment is actually healthy, capture that right away in a script, a check, a skill, or documentation.[5]

The same goes for misses. When the model makes a bad assumption, or when a specific type of task clearly requires grounding in live data rather than generic knowledge, the answer is not only to fix the current issue. The answer is to improve the surrounding guidance so future work starts from a stronger place.

This is one of the most underappreciated ways to compound gains with AI. Most people treat every session like a clean slate. That leaves a lot of value on the table. AI is very good at helping turn what you just learned into something cleaner and more reusable. The win is not merely solving a problem once. It is making the next feature, the next bug fix, and the next session cheaper.

6. Refactor aggressively. It’s cheap.

One of the more interesting architectural lessons here was how helpful AI was in moving from point solutions to shared patterns.

It would have been easy for the semantic search layer to stay tightly coupled to the first data source it worked against. That is how a lot of internal tooling evolves: one hyper-specific success at a time, followed by quiet duplication, with no one stepping back to extract a shared (DRY) pattern. I avoided that by asking AI to refactor, quickly and efficiently, once the same concepts had been duplicated two or three times.[6]

AI did not proactively surface those refactoring opportunities for me. But it was very helpful in accelerating the refactoring itself when prompted.

That is a bigger deal than people think. If refactoring and generalization get cheaper, teams should expect better internal architecture, not just faster delivery.

What this means for engineers

For individual engineers, I think the takeaway is straightforward: the highest-leverage way to use AI is not as a faster typist. It is as a collaborator in debugging, exploration, and system design.

That matters because the highest-value part of engineering isn’t writing code. It is inspecting reality, forming hypotheses, reducing uncertainty, choosing the next best step, and turning discoveries into durable systems. AI can now participate meaningfully in that loop. But the loop still requires judgment, prioritization, tradeoff-making, and interpretation. That is why engineering remains a human domain.

If you use these tools like a vending machine, you will get vending-machine outcomes. If you use them inside a disciplined loop — grounded in evidence, broken into short checkpoints, and constantly converted into reusable knowledge — you can move through real engineering work much faster without making the work sloppier. That is not a replacement for all aspects of the broader engineering discipline. But it does meaningfully change the repetitive daily loop.

What this means for engineering leaders

For leaders, the question is larger than developer productivity.

If you evaluate AI only through the lens of lines of code, tickets closed, or hours saved, you are asking too small a question. The more important question is whether your team knows how to use these tools to reduce uncertainty, ground decisions in evidence, structure work into short loops, turn discoveries into reusable assets, recognize when point solutions should become shared systems, and build the kind of project-specific context that helps AI stop guessing and start working from reality. As the underlying models improve, these skills will only matter more.

AI greatly amplifies strong human habits. But if your team has weak ones, AI often gives you more noise than signal. “Garbage in, garbage out.”

That is why I came away from this project less interested in the usual “AI replaces engineers” conversation and more interested in operational fluency. The organizations that get the most out of this moment will not be the ones ignoring it, chasing the loudest demos, or freeing up dollars paid to engineering staff. No, it will be the organizations that encourage engineers to guide these systems well, verify them, shape the context around them, and work with them in a disciplined, grounded, compounding way.

That is the real advantage.

Next up: The EPD operating model

What I’ve written here is mostly about the engineering loop itself: how agentic AI changes the day-to-day mechanics of building, debugging, and refining software. The even bigger story is organizational, not just individual. Soon, I’ll be doing a deeper dive on what I believe this means for the EPD operating model: what becomes the new unit of execution, where judgment and accountability move, how teams should be structured, what happens to juniors, and why shared context, trust, and evaluation start to matter more than ever.

[1] Real engineering loop: inspect, patch, verify, then encode the result

The repo quickly accumulated scripts that do real operational work: detect the current state, make an environment change, and print the next concrete step. That is much closer to engineering collaboration than autocomplete.

See: Environment verification part 1 and part 2

[2] Compressed exploration showed up as rapid same-day architectural movement

The commit history on March 8, 2026 compresses what would normally be a slower sequence: initial setup, host Postgres connectivity, first Jira vector path, generic retrieval refactor, then additional source adapters.

94f214c 10:53 first commit
2eb9dab 13:19 postgres instructions
1d8beb4 15:37 Enable the Airbyte builder
3847ec8 17:53 Jira vector search
6b3d7b2 18:44 Generic vector search
1e47302 19:40 New Notion skill
cd7887f 20:28 Notion AI vector search
9a5c040 20:43 Readwise Reader AI vector search
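
For the arithmetic behind that timeline, the gaps can be computed directly from the commit list. The hashes, times, and messages are copied from the log above; the helper name is only illustrative, and the calculation assumes every commit falls on the same day, as stated.

```python
from datetime import datetime

# (short hash, local time, message), copied from the commit list above.
COMMITS = [
    ("94f214c", "10:53", "first commit"),
    ("2eb9dab", "13:19", "postgres instructions"),
    ("1d8beb4", "15:37", "Enable the Airbyte builder"),
    ("3847ec8", "17:53", "Jira vector search"),
    ("6b3d7b2", "18:44", "Generic vector search"),
    ("1e47302", "19:40", "New Notion skill"),
    ("cd7887f", "20:28", "Notion AI vector search"),
    ("9a5c040", "20:43", "Readwise Reader AI vector search"),
]


def minutes_between(start, end):
    """Minutes from one HH:MM time to a later HH:MM time on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


times = {sha: time for sha, time, _ in COMMITS}

# Jira-specific vector search to the generic refactor.
jira_to_generic = minutes_between(times["3847ec8"], times["6b3d7b2"])  # 51

# First commit to the last source adapter.
first_to_last = minutes_between(COMMITS[0][1], COMMITS[-1][1])  # 590, i.e. 9 h 50 min
```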

[3] Feeding logs back into an agent gives it the context it needs to actually debug

For example, with real logs attached, I asked:

$Airbyte Local CE Expert    I'm running into issues using the Jira source. See the attached logs. What seems to be the problem?

The answer gave me the information I needed about the runtime error itself, not just hypothetical runtime errors based on code/config:

The main problem is not Jira connectivity. The Jira source check passed, the Postgres destination check passed, and schema discovery also passed.

The sync then fails during destination initialization because Airbyte thinks stream `jira.issues` is configured with a cursor field that is not present in that stream’s schema. The key error is: `For stream jira.issues: The cursor does not exist in the schema`.

So the likely issue is:
- that stream is set to incremental / deduped mode with an invalid cursor field, or
- the stream schema changed after discovery and the saved cursor selection is now stale.

What to do in Airbyte:
- Refresh the source schema.
- Open the `jira.issues` stream config.
- Re-select a valid cursor field if incremental is intended.
- If that stream has no usable cursor, switch it to full refresh / append instead.

The Postgres connector is just where the validation trips; it is not the underlying problem.
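
The same grounding habit can be partly automated. Below is a minimal sketch that scans log text for the specific error sentence quoted above and reports which streams it names. Only that error sentence comes from the source. The surrounding log lines are hypothetical placeholders, not real Airbyte output, and the function name is made up for illustration.

```python
import re

# Pattern built from the error sentence quoted above; other message formats are not handled.
CURSOR_ERROR = re.compile(
    r"For stream (?P<stream>[\w.]+): The cursor does not exist in the schema"
)


def find_stale_cursor_streams(log_text):
    """Return unique stream names flagged with the missing-cursor error, in first-seen order."""
    seen = []
    for match in CURSOR_ERROR.finditer(log_text):
        stream = match.group("stream")
        if stream not in seen:
            seen.append(stream)
    return seen


# Hypothetical excerpt: the "placeholder" prefixes and the first line are invented for the demo.
SAMPLE_LOG = """\
placeholder line: source check ok
placeholder line: For stream jira.issues: The cursor does not exist in the schema
placeholder line: For stream jira.issues: The cursor does not exist in the schema
"""

streams = find_stale_cursor_streams(SAMPLE_LOG)
```

A helper like this does not replace reading the logs. It just makes the one fact that matters hard to miss the next time the same failure appears.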

[4] Local context was turned into project-specific AI operating knowledge

The repo did not just add code. It added reusable skills that explicitly tell the model to inspect the landed schema instead of guessing from generic API memory. For example, consider the skill for interpreting Notion data that has landed in a Postgres destination via Airbyte:

Start from the `ntn` schema in Postgres. Confirm the actual tables, row counts, column types, and nested storage shape before writing SQL or making schema claims.

and

- Treat the landed destination schema as the source of truth for implementation details.
- Use Notion API docs to interpret meanings, not to override what actually landed in Postgres.

See: Notion skill commit
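
What "confirm the actual tables, row counts, column types" might look like in code is sketched below. It is not the skill's implementation. An in-memory SQLite database stands in for Postgres so the example runs anywhere, the `pages` table is a hypothetical placeholder rather than the real landed Notion schema, and against Postgres the equivalent lookup would go through `information_schema.columns` instead of SQLite's catalog.

```python
import sqlite3


def describe_schema(conn):
    """Map each table to its row count and its {column name: declared type}."""
    summary = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Table names come from the database catalog, not from user input.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        summary[table] = {
            "rows": row_count,
            # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
            "columns": {column[1]: column[2] for column in columns},
        }
    return summary


# Hypothetical landed table with a single row, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id TEXT PRIMARY KEY, properties TEXT)")
conn.execute("INSERT INTO pages VALUES (?, ?)", ("page-a", "{}"))
conn.commit()

summary = describe_schema(conn)
```

Handing a summary like this to the model before it writes SQL is one way to make "inspect, do not guess" the default rather than a reminder.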

[5] Annoying chores were turned into scripts instead of being re-solved later

A good example is host-to-container Postgres connectivity. Rather than leaving that as a one-time chore, the repo captured it as a reusable setup script that patches config, restarts Postgres, creates credentials, and prints the exact Airbyte destination values.

See: Postgres setup script
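
The script itself is not reproduced here, so the following is only a sketch of the general pattern such a script tends to rely on: a config patch that is safe to rerun. `listen_addresses` is a real `postgresql.conf` setting, but whether the author's script touches it is an assumption, and `ensure_setting` is a hypothetical helper name.

```python
import re


def ensure_setting(config_text, key, value):
    """Return config_text with `key = value` present exactly once (safe to apply repeatedly)."""
    # Matches the key whether it is active or commented out, e.g. "#listen_addresses = ...".
    pattern = re.compile(rf"^\s*#?\s*{re.escape(key)}\s*=")
    output, done = [], False
    for line in config_text.splitlines():
        if pattern.match(line):
            if not done:
                output.append(f"{key} = {value}")
                done = True
            continue  # drop any later duplicate of the same key
        output.append(line)
    if not done:
        output.append(f"{key} = {value}")
    return "\n".join(output) + "\n"


# Illustrative input resembling a default config with the setting commented out.
original = "#listen_addresses = 'localhost'\nport = 5432\n"
patched = ensure_setting(original, "listen_addresses", "'*'")
patched_again = ensure_setting(patched, "listen_addresses", "'*'")
```

Idempotence is what turns a one-time chore into a reusable asset: running the patch twice leaves the file exactly as running it once does.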

[6] The first working retrieval design was quickly refactored into a shared system

The project first shipped a Jira-specific vector layer, then generalized it less than an hour later into a source-agnostic search server that could accommodate multiple SaaS systems.

This:

CREATE TABLE IF NOT EXISTS jira_ai.vect_issues (
  issue_id text PRIMARY KEY,
  ...
);

CREATE TABLE IF NOT EXISTS jira_ai.vect_issue_chunks (
  issue_id text NOT NULL,
  ...
);

quickly turned into this:

CREATE TABLE IF NOT EXISTS ai.documents (
  document_uid text PRIMARY KEY,
  source_type text NOT NULL,
  source_instance text NOT NULL,
  source_document_id text NOT NULL,
  ...
);

CREATE TABLE IF NOT EXISTS ai.document_chunks (
  document_uid text NOT NULL REFERENCES ai.documents(document_uid) ON DELETE CASCADE,
  source_type text NOT NULL,
  source_instance text NOT NULL,
  source_document_id text NOT NULL,
  ...
);
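
One way to read the generalized schema: each source adapter only has to map its native identifier onto the same four identity columns. The sketch below illustrates that idea. The column names come from the tables above, but the `document_uid` format, the instance names, and the helper are assumptions made for illustration, since the source elides how the real value is built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    document_uid: str
    source_type: str
    source_instance: str
    source_document_id: str


def make_document(source_type, source_instance, source_document_id):
    """Build a source-agnostic identity row. The uid format here is hypothetical."""
    uid = f"{source_type}:{source_instance}:{source_document_id}"
    return Document(uid, source_type, source_instance, source_document_id)


# Hypothetical instances and IDs; the same function serves every source.
jira_doc = make_document("jira", "example-site", "PROJ-1")
notion_doc = make_document("notion", "example-workspace", "page-a")
```

Because identity is expressed the same way for every source, adding a new one becomes a matter of writing a small adapter rather than a new set of tables.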