Kryon


Open Research Areas

May 2026

multi-agent coordination

We're studying what happens when agents need to plan together, divide work, and recover from each other's failures. Current single-model benchmarks don't capture any of this. We're building grounded task environments that measure coordination rather than individual accuracy, and we're actively exploring what evaluation frameworks look like when the agent isn't working alone.
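As a rough illustration of scoring coordination separately from individual accuracy, the sketch below summarizes one episode from an ordered attempt log. The `Attempt` record, the metric names, and the example episode are all hypothetical, chosen only to make the distinction concrete; this is not a description of our environments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One agent's try at one subtask (hypothetical log record)."""
    agent: str
    subtask: str
    succeeded: bool


def coordination_report(attempts, subtasks):
    """Summarize one episode from an ordered, non-empty attempt log."""
    succeeded_by = {s: set() for s in subtasks}  # subtask -> agents that completed it
    first_failure = {}                           # subtask -> first agent that failed it
    recovered = set()                            # failed subtasks finished by another agent
    for a in attempts:
        if a.succeeded:
            succeeded_by[a.subtask].add(a.agent)
            failer = first_failure.get(a.subtask)
            if failer is not None and failer != a.agent:
                recovered.add(a.subtask)
        else:
            first_failure.setdefault(a.subtask, a.agent)
    n = len(subtasks)
    return {
        # What a single-model benchmark would see: per-attempt success.
        "individual_accuracy": sum(a.succeeded for a in attempts) / len(attempts),
        # Did the team finish the whole job?
        "coverage": sum(1 for s in subtasks if succeeded_by[s]) / n,
        # Share of subtasks completed redundantly by more than one agent.
        "duplicated_work": sum(1 for s in subtasks if len(succeeded_by[s]) > 1) / n,
        # Share of failed subtasks that a different agent later completed.
        "recovery_rate": len(recovered) / len(first_failure) if first_failure else None,
    }


# Hypothetical episode: two agents, four subtasks, "test" is never picked up.
subtasks = ["parse", "plan", "write", "test"]
attempts = [
    Attempt("A", "parse", True),
    Attempt("B", "parse", True),   # redundant work
    Attempt("A", "plan", False),
    Attempt("B", "plan", True),    # B recovers A's failure
    Attempt("A", "write", True),
]
report = coordination_report(attempts, subtasks)
```

In this made-up episode most individual attempts succeed, yet one subtask is never covered and another is done twice, which is the kind of gap per-agent accuracy alone would hide.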

human–agent collaboration

We're investigating what happens when the unit of productive work shifts from the individual to the human–agent pair. Right now we're building instrumented environments that capture how people and models actually collaborate on real tasks, tracking the handoffs, corrections, and breakdowns that determine whether a model is useful in practice.

user simulation

We're developing dedicated user simulation models and systems that go beyond backstory prompts. The problem we're working on is how to build simulated users with domain-specific knowledge, realistic interaction patterns, and the kind of variability that exposes failure modes static personas miss. Getting this right is what makes collaborative evaluation possible at scale without losing ecological validity.
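To make the contrast with a static persona concrete, here is a minimal sketch that samples simulated-user profiles instead of fixing a single backstory. Everything in it is an assumption made for illustration: the `UserProfile` fields, the value ranges, and the example vocabulary are hypothetical, and a real simulator would drive a model rather than a random number generator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class UserProfile:
    """Hypothetical traits of one simulated user."""
    domain: str
    known_terms: frozenset  # domain vocabulary this user actually understands
    patience: int           # turns before the user gives up
    vagueness: float        # probability that a needed detail is left out


def sample_profile(rng, domain, vocabulary):
    """Draw one user; `vocabulary` must be a non-empty list of domain terms."""
    k = rng.randint(1, len(vocabulary))
    return UserProfile(
        domain=domain,
        known_terms=frozenset(rng.sample(vocabulary, k)),
        patience=rng.randint(2, 8),
        vagueness=rng.uniform(0.0, 0.6),
    )


def opening_request(profile, rng, details):
    """Return the task details this user volunteers in the first turn."""
    return [d for d in details if rng.random() >= profile.vagueness]


def sample_population(seed, n, domain, vocabulary):
    """Seeded so an evaluation run can be reproduced exactly."""
    rng = random.Random(seed)
    return [sample_profile(rng, domain, vocabulary) for _ in range(n)]


# Hypothetical usage: five payroll users with differing knowledge and habits.
vocabulary = ["accrual", "garnishment", "W-2", "gross-up"]
population = sample_population(0, 5, "payroll", vocabulary)
```

Sampling traits per user is one simple way to get the variability a fixed persona lacks, and seeding the generator keeps that variability reproducible across runs.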

enterprise data procurement

We're working on procuring and constructing datasets grounded in enterprise workflows, professional tooling, and domain-specific tasks. RL rollouts are getting expensive, the marginal value of harder data keeps going up, and benchmark performance increasingly reflects pretraining distributions rather than latent model capability. We're focused on sourcing the kind of signal that moves capability on the problems that matter commercially.

data efficiency & interpretability

We're focused on understanding which data yields the most model improvement for the least volume, rather than hill-climbing on low-fidelity benchmarks with more of the same. The open problems we're working on are how to reliably identify which training signal actually shifts capability, how to catch reward hacking before it compounds downstream, and how to build this verification into the data pipeline itself rather than treating it as an afterthought.