The autonomous agent experimentation platform.

SquareDiff generates informed hypotheses for improving agent performance, drawing on evaluation scores, traces, and frontier research. It then autonomously codes, deploys, and evaluates 100+ variants in parallel, continuously searching for the best-performing harness.
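The generate, evaluate, and keep-the-best cycle described above can be sketched roughly as follows. This is a minimal illustration, not SquareDiff's implementation: `propose`, `evaluate`, `run_round`, and `experiment_loop` are hypothetical names, and the greedy "keep the top scorer" rule is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def run_round(variants, evaluate, max_workers=8):
    """Score all variants in parallel and return the best (variant, score)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(evaluate, variants))
    best_index = max(range(len(variants)), key=scores.__getitem__)
    return variants[best_index], scores[best_index]


def experiment_loop(baseline, propose, evaluate, rounds=3, variants_per_round=4):
    """Hypothetical loop: propose variants of the current best, keep improvements."""
    best, best_score = baseline, evaluate(baseline)
    for _ in range(rounds):
        candidates = propose(best, variants_per_round)
        candidate, score = run_round(candidates, evaluate)
        if score > best_score:  # assumption: greedy acceptance of higher scores
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    # Toy stand-ins: a "variant" is an integer and the eval peaks at 10.
    toy_evaluate = lambda x: -(x - 10) ** 2
    toy_propose = lambda best, k: [best + d for d in range(1, k + 1)]
    print(experiment_loop(0, toy_propose, toy_evaluate))
```

In a real setting `evaluate` would deploy a variant and run the eval suite, so a thread pool (or a job queue) is what lets many variants be scored at once.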

Customer Success Agent / Experiment Run #52

240 hypotheses · n=1,240 · LIVE

Variant · Score (Wtd. Eval ± CI) · p-val

ReAct vs CoT reasoning strategies · 94.2% ±1.1% · p<0.001 · WINNER
Skills toolkit vs MCP tool serving · 91.8% ±2.3% · p=0.008
Parallel subagent delegation patterns · 87.3% ±1.6% · p=0.041
Multi-agent: orchestrator + 3 workers · 83.1% ±3.2% · p=0.119
Memory compression + struct. evals · running... · -

Chart: Eval Scores by iteration (series: Trajectory, Hallucination, Adversarial)

Weighted Eval Breakdown · Winner EXP-047

Trajectory · 78.95% · w=0.40
Adversarial · 48.5% · w=0.35
Hallucination · 92.1% · w=0.25
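As a rough illustration of how per-category scores and weights like those in the breakdown above might roll up into a single number, here is a minimal sketch. It assumes the weighted eval is a plain weighted mean; the page does not state the actual formula, and `weighted_eval` is a hypothetical helper.

```python
# Category scores (percent) and weights copied from the breakdown above.
breakdown = {
    "Trajectory": (78.95, 0.40),
    "Adversarial": (48.5, 0.35),
    "Hallucination": (92.1, 0.25),
}


def weighted_eval(scores):
    """Weighted mean of (score, weight) pairs. Assumed formula, not confirmed."""
    total_weight = sum(weight for _, weight in scores.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(score * weight for score, weight in scores.values()) / total_weight


if __name__ == "__main__":
    print(f"{weighted_eval(breakdown):.2f}%")
```

Dividing by the total weight keeps the result on the same 0-100 scale even if the weights do not sum exactly to 1.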

Experiment

Generate new agent variants with multiple modes:

Autonomous: Let our agents generate and choose what ideas to test in a loop.

Suggested: Choose from generated ideas based on your agent's eval scores and traces.

Template: Pick from our curated, evolving library of proven improvement ideas.

Manual: Describe an idea with natural language ("Try using a swarm architecture...").

Evaluate

Import your existing evaluation criteria or build a new set with our eval generation suite.

We work closely with our customers to audit and establish base evals that define what great performance means for their unique agent.

Analyze

Track the impact of every experiment. See accuracy, cost, and latency deltas with statistical significance, surface regressions instantly, and measure total improvement across your full experimentation history.
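One common way to attach statistical significance to an accuracy delta is a two-proportion z-test. The sketch below is an assumption about how such a check could look, not a description of SquareDiff's method. `two_proportion_z_test` is a hypothetical helper and the pass counts in the example are made up.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (delta, two-sided p-value) for accuracy B minus accuracy A."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        # Both runs passed (or failed) every case: no evidence of a difference.
        return p_b - p_a, 1.0
    z = (p_b - p_a) / standard_error
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, p_value


if __name__ == "__main__":
    # Illustrative numbers only: baseline passes 800/1000, variant 850/1000.
    delta, p_value = two_proportion_z_test(800, 1000, 850, 1000)
    print(f"delta={delta:+.3f}, p={p_value:.4f}")
```

The same pattern extends to cost and latency, though continuous metrics would usually call for a t-test or a bootstrap rather than a proportion test.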