
The Harness Is the Goldmine

Agents become useful when the system around them lets them observe, act, verify, and recover. The harness decides how far we can trust the work.


Most conversations about agents start with the model.

Which one is smarter? Which one writes better code? Which one has the better benchmark score?

Those things matter. I care about them too.

The more practical question is this:

What can the agent actually do?

Can it read the codebase? Can it run the app? Can it open the browser? Can it inspect logs? Can it query traces? Can it run tests, apply a fix, check the result, and explain what changed?

Can it recover when the first idea is wrong?

That surrounding system is the harness. And I think the harness is the goldmine.

A model by itself is a chat box

A model in a text box can be impressive.

It can explain code. It can sketch an architecture. It can write a function. It can suggest a fix.

It is still working only from the context you gave it.

Real engineering work needs more than that.

It follows a loop:

  1. Understand the problem.
  2. Inspect the system.
  3. Form a theory.
  4. Make the smallest useful change.
  5. Run the checks.
  6. Look at what happened.
  7. Repeat until the system behaves correctly.
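The loop above can be sketched as a small control loop. Everything here is a hypothetical placeholder — `inspect`, `propose_change`, and `run_checks` stand in for whatever real capabilities a harness exposes:

```python
# A minimal sketch of the engineering loop; every callable here is a
# hypothetical placeholder standing in for real harness capabilities.

def engineering_loop(problem, inspect, propose_change, run_checks, max_iters=10):
    """Iterate: inspect, change, verify, until the checks pass."""
    for attempt in range(1, max_iters + 1):
        observation = inspect()                        # step 2: inspect the system
        change = propose_change(problem, observation)  # steps 3-4: theory, smallest change
        result = run_checks(change)                    # steps 5-6: run checks, observe
        if result["passed"]:                           # step 7: stop when behavior is correct
            return {"attempts": attempt, "change": change}
    raise RuntimeError("exhausted attempts without passing checks")

# Toy run: a "bug" that takes three shrinking patches to fix.
state = {"bug": 3}
def inspect_toy():
    return state["bug"]
def propose_toy(problem, obs):
    state["bug"] -= 1
    return f"patch-{state['bug']}"
def checks_toy(change):
    return {"passed": state["bug"] == 0}

result = engineering_loop("flaky checkout test", inspect_toy, propose_toy, checks_toy)
```

The point is not the code; it is that steps 5 and 6 are gates the environment enforces, not claims the agent makes.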

Humans do this all day.

We move between the editor, terminal, browser, logs, traces, database, tickets, docs, and deployment tools. The code is one part of the work.

Agents need access to that loop if we want them to do real engineering work.

They need enough access to observe, act, and verify. A good harness gives them that access with boundaries.

Why we are not agentic enough

Agents already do real work.

People are running agents with access to code, logs, browsers, terminals, review comments, test suites, and deployment systems. The frontier has moved past the chat box.

I still think we are not agentic enough.

The gap is environmental control.

There is a big difference between an agent reading logs and an agent creating the conditions it needs to solve the problem.

Can it add the missing test harness? Can it create test data? Can it add temporary instrumentation? Can it spin up the right services? Can it reproduce the bug without waiting for a human to package the world for it?

Can it keep going when the first path fails?

This is where agentic work gets interesting. Useful agents need enough control to make progress. Safe agents need that control shaped by the environment around them.

Control is the bottleneck

Small tasks need little control.

Change a label. Fix a type error. Update a dependency. Move a button.

The agent can work in a narrow part of the repo, run a check, and stop.

Bigger tasks need more room.

"Improve onboarding" touches product behavior.

"Make this flow reliable" crosses code, data, and runtime.

"Reduce checkout failures" requires evidence before changes.

Those tasks require exploration. The agent has to inspect the product, trace the behavior, understand the data model, make changes across boundaries, and verify the experience from the user's point of view.

That takes control.

It needs to run the app. Open the page. Create an account. Trigger a workflow. Inspect the database. Read the logs. Change the code. Restart the service. Check again.

The risk grows with that control.

An agent with file access can make a bad edit.

An agent with environment access can delete the wrong files, kill the wrong process, rotate the wrong secret, migrate the wrong database, or deploy something that should never have left the branch.

The goal is practical control with practical safety.

The harness shapes the blast radius

The harness is where we decide what the agent can touch.

Can it read production logs? Maybe.

Can it write to production data? Usually no.

Can it delete files? Only inside a disposable workspace.

Can it restart services? Local services first. Shared services need a higher bar.

Can it deploy? Only after a clean build, passing checks, a smoke test, and human approval.

These boundaries should live in the environment.

A good harness gives the agent room to move while keeping risky actions explicit and reviewable.

That means isolated workspaces. Clear permissions. Reversible operations. Dry runs where possible. Approval gates where needed. Logs of what happened.
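One way to make those boundaries concrete is a small policy table the harness consults before every action. The action names, scopes, and decisions below are invented for illustration, not a real permission system:

```python
# Hypothetical harness policy: each (action, target) pair maps to a decision.
# "allow" runs immediately, "approve" pauses for a human, "deny" is refused.
POLICY = {
    ("read", "production_logs"): "allow",
    ("write", "production_data"): "deny",
    ("delete", "workspace_files"): "allow",    # only inside a disposable workspace
    ("restart", "local_service"): "allow",
    ("restart", "shared_service"): "approve",  # shared services need a higher bar
    ("deploy", "production"): "approve",
}

def check(action, target):
    """Return the decision for an action; unknown actions take the cautious path."""
    return POLICY.get((action, target), "approve")
```

The useful property is the default: anything the policy has not named falls back to human approval rather than silent permission.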

It also means review surfaces.

Show me the diff. Show me the commands. Show me the tests. Show me the browser result. Show me the logs after the change. Show me what you tried and why you stopped.

The target is an agent with the right power inside a system that knows how to constrain it.

Verification is the contract

If an agent says "done," what does that mean?

Did it compile? Did the tests pass? Did it test the actual behavior? Did it check the browser? Did it inspect logs after the change? Did it improve the system?

This is where we still depend too much on humans.

We ask an agent to make a change, then we become the verification layer. We read the diff. We run the app. We click around. We decide whether the work is actually good.

That process works for small tasks. It breaks down as the work gets larger.

Larger work needs stronger verification layers.

Unit tests are one layer. Type checks are one layer. Lint is one layer. Browser checks are one layer. Screenshots are one layer. Log inspection is one layer. Trace queries are one layer. A production smoke check is one layer.

Each layer catches a different kind of failure.

The agent can say it worked. The environment needs to prove enough of it.
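Those layers can be modeled as an ordered pipeline where each layer is a named check and the harness reports which ones failed. The layer names mirror the text; the checks themselves are stand-ins:

```python
# Sketch: verification as ordered layers, each catching a different failure class.
# The lambdas are stand-ins for real tools (type checker, test runner, browser).
def run_layers(layers):
    """Run each (name, check) pair; return the names of the layers that failed."""
    return [name for name, check in layers if not check()]

layers = [
    ("types", lambda: True),       # type check passed
    ("unit_tests", lambda: True),  # unit tests passed
    ("browser", lambda: False),    # browser check caught a regression
    ("logs", lambda: True),        # no new errors in the logs
]

failed = run_layers(layers)
```

Here the unit tests and type checker are green, and only the browser layer catches the regression — exactly the case where a single layer would have let "done" through.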

That is the contract.

When the environment can prove more, we can trust the agent with more. When it proves less, the human stays closer to every step.

We should specify outcomes clearly

There is another shift hidden in all of this.

As agents get more capable, we should spend more time specifying outcomes and less time specifying implementation.

Most of us still talk to agents like we are assigning narrow tickets:

"Change this function."

"Add this prop."

"Move this button."

That works. It also keeps the human responsible for decomposing everything. The agent becomes a faster pair of hands.

The more interesting version sounds different:

"Users should be able to publish a post by dropping a markdown file into the configured folder. Make it work locally, in Docker, and in production. Verify it."

That instruction gives the agent a goal, a boundary, and a standard for completion. It leaves the implementation open.

Outcome-based work needs a strong harness. The agent needs room to explore, plus checks that catch plausible mistakes.

This is judgment at a higher level.

The better instruction describes the outcome, the constraints, the checks that must pass, and the points that require approval.
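An instruction like that can be written down as a small structured spec: goal, boundaries, required checks, approval points. The field names here are illustrative, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class OutcomeSpec:
    """Hypothetical shape for an outcome-based task: what done means, not how."""
    goal: str
    boundaries: list                               # environments the result must work in
    checks: list                                   # verifications that must pass
    approvals: list = field(default_factory=list)  # steps that require a human

spec = OutcomeSpec(
    goal="Publish a post by dropping a markdown file into the configured folder",
    boundaries=["local", "docker", "production"],
    checks=["unit_tests", "browser_flow", "smoke_test"],
    approvals=["production_deploy"],
)
```

The implementation stays open; the spec only pins down the outcome, where it must hold, and which gates a human still owns.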

That is when the work starts to feel properly agentic.

How big can the task be?

This is the question I keep coming back to.

How big a task can an agent take on while I am away?

An hour? An afternoon? A day? A week?

The answer depends on continuity as much as intelligence.

Can the agent keep track of what it tried? Can it recover from bad assumptions? Can it split work into phases? Can it stop when the risk changes? Can it ask for approval only when needed?

Can it leave enough evidence that I can re-enter the work without replaying everything?

Longer-running work needs memory, checkpoints, rollback points, and review surfaces. It needs a way to separate progress from wandering.
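A minimal shape for that evidence is an append-only run log: one entry per phase, each with what was tried, what the proof was, and where to roll back to. The field names are invented; the point is checkpoints plus evidence, not a real schema:

```python
# Sketch: an append-only run log so a human can re-enter the work later
# without replaying everything. Field names are hypothetical.
run_log = []

def checkpoint(phase, tried, evidence, rollback_ref):
    run_log.append({
        "phase": phase,
        "tried": tried,            # what the agent attempted
        "evidence": evidence,      # test output, screenshot path, log excerpt
        "rollback": rollback_ref,  # commit or snapshot to return to
    })

checkpoint("reproduce", "ran the failing flow in docker", "logs/repro.txt", "commit-a1")
checkpoint("fix", "patched the retry logic", "tests 42/42 passing", "commit-b2")

latest = run_log[-1]
```

Progress versus wandering then becomes a question you can answer from the log: did each phase leave evidence and a rollback point, or just more diffs?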

That is when the harness starts to look less like tooling and more like an operating environment.

The agent needs a place to work, a way to observe, a way to act, a way to verify, and a way to explain itself.

With that, bigger tasks become manageable delegations.

The harness decides how far we can trust the work

Everyone will use AI.

The advantage will go to teams that build environments where agents can safely do more of the loop.

Good context. Clear permissions. Fast checks. Isolated workspaces. Observable runtime behavior. Strong verification.

That is where engineering gets stronger in the age of AI.

The fundamentals still matter. The harness turns them into rails that agents can run on.

A powerful model helps. A strong harness turns that power into trustworthy work.