Kryon

The Kryon Kata

May 2026

型

kata (型) — way of doing

the quality bar

As RL rollouts grow in token count and price, the cost of low-quality data compounds. Most vendors today set a low bar for quality: does the oracle solution pass the verifier? Is the task fair? These should be table stakes, not standards.
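To make the table-stakes check concrete, here is a minimal sketch of "does the oracle solution pass the verifier?" The `Task` record, its field names, and the toy `add_verifier` are hypothetical, chosen only for illustration; they do not describe any particular vendor's schema. Passing this check says nothing about whether the task produces training gain.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative assumptions."""
    prompt: str
    oracle_solution: str
    verifier: Callable[[str], bool]  # True if a submission passes


def passes_table_stakes(task: Task) -> bool:
    """Baseline check only: the oracle solution must pass its own verifier."""
    return task.verifier(task.oracle_solution)


def add_verifier(submission: str) -> bool:
    """Toy verifier: execute the submission and check that add(2, 3) == 5."""
    namespace = {}
    try:
        exec(submission, namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        # Any crash or missing function counts as a failed submission.
        return False


task = Task(
    prompt="Write add(a, b) returning the sum of a and b.",
    oracle_solution="def add(a, b):\n    return a + b",
    verifier=add_verifier,
)

print(passes_table_stakes(task))
```

A check like this is cheap to run across an entire task set, which is why it works as a floor rather than as a standard.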

The quality bar needs to shift. Does this data produce a demonstrable post-training gain? Are complex environments vulnerable to reward hacking? Even when gains occur, are they happening for the intended reasons? Most vendors today lack post-training talent or investment and outsource this work, even for their core offerings. Keeping it in-house is what makes the higher bar achievable.

the synthetic data tarpit

It's straightforward for existing vendors to help labs hillclimb benchmarks by synthetically scaling data creation in-domain. The harder problem is scaling out-of-distribution: underrepresented languages, enterprise-fidelity bugs, edge cases that never appear in pretraining, tribal workflows that live in Slack threads and internal wikis.

Synthetic generation without verification creates weak data flywheels where models train on their own outputs and gradually collapse. Catching this requires research-first approaches like mechanistic interpretability to distinguish when generation is teaching the model something genuinely new versus when it's reinforcing existing patterns or gaming the evaluation.

the incentive problem

RL data vendors today have backwards incentives. The standard pricing model is price per task multiplied by number of tasks, which rewards raw volume over signal quality. One lever for boosting short-term revenue is weak harness design that inflates perceived task difficulty: intentionally poor system prompts, limited state management, restricted tool access, and other choices that don't reflect real-world tooling. What actually matters is scaling quality per task, not raw volume of low-signal tasks.
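The arithmetic behind this incentive can be sketched in a few lines. All numbers below are made up for illustration, and "gain" stands in for whatever capability metric a customer tracks; neither the figures nor the function names come from real pricing data.

```python
def volume_revenue(price_per_task: float, num_tasks: int) -> float:
    """Vendor revenue under the standard model: price per task x task count."""
    return price_per_task * num_tasks


def gain_per_task(total_gain: float, num_tasks: int) -> float:
    """Hypothetical quality measure: capability gain attributed to each task."""
    return total_gain / num_tasks


# Two illustrative batches that deliver the same total gain (made-up numbers).
PRICE = 50.0
high_volume = {"num_tasks": 1000, "total_gain": 2.0}
high_signal = {"num_tasks": 200, "total_gain": 2.0}

for name, batch in (("high_volume", high_volume), ("high_signal", high_signal)):
    revenue = volume_revenue(PRICE, batch["num_tasks"])
    quality = gain_per_task(batch["total_gain"], batch["num_tasks"])
    print(f"{name}: revenue={revenue:.0f}, gain_per_task={quality:.4f}")
```

Under these assumptions, the high-volume batch earns the vendor five times the revenue while delivering one fifth of the gain per task, so the pricing formula pays for exactly the wrong thing.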

Harness and system prompt design alone can often account for more variability in measured performance than post-training itself. A weak harness doesn't just inflate difficulty; it produces fundamentally misleading signal. The vendor's pitch is "your model doesn't pass our environment" when in reality the environment itself is the problem.

The cost of this misalignment surfaces late. Customers can't screen for weak harness quality until after procurement, when expensive rollouts in low-quality environments fail to produce the expected capability gains. By then, time and resources have already been wasted.

Kryon aligns incentives with the people who need actual capability gains: the labs training frontier models, applied AI companies shipping agents into production, and enterprises adopting AI for real workflows. Our north star is measurable performance improvement per task, not task volume.

the contractor army ceiling

Traditional human data companies expanded into code offerings by relying heavily on lower-skilled labor whose day-to-day workflows often just involve prompting the very models they're training. As model performance keeps climbing, this approach hits a ceiling fast.

As verifiers move beyond rigid unit testing for RLVR, verifying individual tasks becomes a research bottleneck rather than a contractor-level side job done with minimal attention. This requires full-time engineers deeply tied to environment and task creation, people who understand the models well enough to find their actual failure modes and construct tasks that push capability forward.