The autonomous agent
experimentation platform.

SquareDiff generates informed hypotheses to improve agent performance based on evaluation scores, traces, and frontier research. It then autonomously codes, deploys, and evaluates 100+ variants in parallel to continuously discover the best-performing harness.
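As a rough illustration of evaluating several variants in parallel and keeping the best one, here is a minimal sketch. It is not SquareDiff's API: `evaluate`, `best_variant`, the variant names, and the pass/fail results are all hypothetical stand-ins, and a real scorer would run the agent against an eval set instead of reading precomputed results.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate(variant: dict) -> float:
    """Hypothetical scorer: fraction of eval cases the variant passed."""
    results = variant["results"]
    return sum(results) / len(results)


def best_variant(variants: list[dict], max_workers: int = 8) -> tuple[str, float]:
    """Score every variant concurrently and return (name, score) of the best."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores[i] belongs to variants[i].
        scores = list(pool.map(evaluate, variants))
    best_index = max(range(len(variants)), key=scores.__getitem__)
    return variants[best_index]["name"], scores[best_index]


# Made-up data for illustration only (1 = case passed, 0 = case failed).
variants = [
    {"name": "baseline", "results": [1, 0, 1, 1]},
    {"name": "variant-a", "results": [1, 1, 1, 1]},
    {"name": "variant-b", "results": [0, 1, 1, 0]},
]
winner = best_variant(variants)
```

A thread pool suits this sketch because real evaluations would mostly wait on model and tool calls; CPU-bound scoring would call for a process pool instead.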

Customer Success Agent / Experiment Run #52
240 hypotheses · n=1,240 · LIVE

# · Variant · Wtd. Eval Score ± CI · p-val
1 · ReAct vs CoT reasoning strategies (WINNER) · 94.2% ±1.1% · p<0.001
2 · Skills toolkit vs MCP tool serving · 91.8% ±2.3% · p=0.008
3 · Parallel subagent delegation patterns · 87.3% ±1.6% · p=0.041
4 · Multi-agent: orchestrator + 3 workers · 83.1% ±3.2% · p=0.119
5 · Memory compression + struct. evals · running... · -
Eval Scores · by iteration (chart: Trajectory, Hallucination, and Adversarial series; y-axis 20% to 100%; x-axis iterations #1 to #12)
Weighted Eval Breakdown · Winner EXP-047
Trajectory: 78.95% (w=0.40)
Adversarial: 48.5% (w=0.35)
Hallucination: 92.1% (w=0.25)
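
The page does not say how component scores and weights are combined. One common reading is a weighted mean, sketched below using the three scores and weights from the breakdown above. The function name `weighted_eval` and the formula itself are assumptions, not SquareDiff's documented method, so the result need not match any headline score on the dashboard.

```python
def weighted_eval(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of component scores (assumed formula, not a documented one)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the weight total keeps the result on the same scale as the
    # inputs even when the weights do not sum exactly to 1.
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Component scores (in percent) and weights as shown in the breakdown.
scores = {"trajectory": 78.95, "adversarial": 48.5, "hallucination": 92.1}
weights = {"trajectory": 0.40, "adversarial": 0.35, "hallucination": 0.25}
combined = weighted_eval(scores, weights)
```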

Built & advised by talent from

Meta
Apple
Stripe
Productboard

Generate new agent variants with multiple modes:

Autonomous: Let our agents generate and choose what ideas to test in a loop.
Suggested: Choose from generated ideas based on your agent's eval scores and traces.
Template: Pick from our curated, evolving library of proven improvement ideas.
Manual: Describe an idea with natural language ("Try using a swarm architecture...").

Import your existing evaluation criteria or build a new set with our eval generation suite.

We work closely with our customers to audit and establish base evals that define what great performance means for their unique agent.

Track the impact of every experiment. See accuracy, cost, and latency deltas with statistical significance, surface regressions instantly, and measure total improvement across your full experimentation history.
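
To make "deltas with statistical significance" concrete, here is a minimal sketch of one standard way to compare a baseline and a variant on a pass/fail eval: a two-sided two-proportion z-test. Which test SquareDiff actually uses is not stated on this page. The function name and the example counts are hypothetical.

```python
import math


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple[float, float, float]:
    """Return (delta, z, two-sided p-value) for pass rate B minus pass rate A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled pass rate under the null hypothesis that both rates are equal.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups passed (or failed) every case: no evidence of a difference.
        return p_b - p_a, 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical counts: baseline passes 800 of 1000 cases, variant passes 850 of 1000.
delta, z, p_value = two_proportion_z_test(800, 1000, 850, 1000)
```

The same comparison with a negative delta and a small p-value is what a regression would look like under this test.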

Ship winning variants as GitHub PRs in a single click. Stage, deploy, and roll back versions at any time.


Connect from any
agent framework

Get started in minutes. SquareDiff connects with the agent frameworks your team already uses, with no migration or rewrites needed.

LangChain
CrewAI
AutoGen
LlamaIndex
Semantic Kernel
Haystack
OpenAI Agents
Vercel AI SDK
DSPy
Mastra
Agno
Custom