The harness is the key to unlocking maximum agent accuracy.

The Problem

An agent's output is determined by its harness: the interplay of prompts, memory, tools, guardrails, and control logic around the core models.

The optimal harness is unique to every application and business, and evolves constantly as new frontier models, research, and tooling are released.

There is a tremendous opportunity to improve reliability, intelligence, and token efficiency for AI agents across every domain. Most teams aren’t capturing it: they can’t know in advance which changes will be most effective, they lack the engineering resources to run experiments at scale, and they have no framework for refining hypotheses progressively.

As a result, agents ship with known gaps, and end users experience the failures.

Why the Harness Matters

For Anthropic, experimenting with a new harness raised Claude Opus 4.5’s accuracy on CORE-bench from 42% to 78%. Cursor cut token usage by 46% by implementing a new lazy-loading MCP tool harness. Vercel improved agent reliability by 20% and cut token usage by 37% by removing 13 tools from its harness.

We believe there is a critical piece missing in the agent development stack: an effective layer to experiment on harnesses at scale.

Our Solution

SquareDiff is building the autonomous experimentation platform to help teams find the optimal harness for their agent.

It generates informed hypotheses for improving performance from baseline evaluation benchmarks, traces, and research. It then autonomously codes, deploys, and evaluates hundreds of harness variants in parallel to find the best-performing version.
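To make the idea of evaluating variants in parallel concrete, here is a minimal Python sketch. It is an illustration only, not SquareDiff's implementation: `HarnessVariant`, `EvalResult`, `evaluate`, `best_variant`, and the `run_task` callback are all hypothetical names, and a real harness would carry far more than a prompt and a tool list.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class HarnessVariant:
    """Hypothetical, simplified harness: just a prompt and a tool list."""
    name: str
    system_prompt: str
    tools: Tuple[str, ...]


@dataclass(frozen=True)
class EvalResult:
    variant: HarnessVariant
    accuracy: float   # fraction of benchmark tasks passed
    tokens_used: int  # total tokens spent across all tasks


# Assumed callback: runs one task under a variant and returns
# (passed, tokens_used). In practice this would call the agent.
RunTask = Callable[[HarnessVariant, str], Tuple[bool, int]]


def evaluate(variant: HarnessVariant, run_task: RunTask,
             tasks: Sequence[str]) -> EvalResult:
    """Score a single variant against every task in the benchmark."""
    if not tasks:
        raise ValueError("at least one task is required")
    passed = 0
    tokens = 0
    for task in tasks:
        ok, used = run_task(variant, task)
        passed += int(ok)
        tokens += used
    return EvalResult(variant, passed / len(tasks), tokens)


def best_variant(variants: Sequence[HarnessVariant], run_task: RunTask,
                 tasks: Sequence[str], max_workers: int = 8) -> EvalResult:
    """Evaluate variants concurrently; prefer accuracy, then fewer tokens."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda v: evaluate(v, run_task, tasks),
                                variants))
    return max(results, key=lambda r: (r.accuracy, -r.tokens_used))
```

The tie-break on token usage is one possible design choice; a team that cares more about latency or cost could swap in a different ranking key without changing the rest of the loop.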

As the frontier evolves, teams can use our platform to fit their harness with the latest models 3x faster than rewriting the harness every time.

Our Mission & Team

Our mission is to unlock the model intelligence and reliability that make agentic experiences delightful, by getting the most out of the harness.

We’re a team of engineers and experimenters combining deep agent engineering expertise with statistical testing rigor, shaped by building agents for the world’s largest technology companies and scaling experimentation programs at leading startups.

Join Us

If this resonates with you, we'd love to hear from you.