Most AI agents are frozen at deploy time. The prompts, routing logic, and tool configurations that ship on day one are the same ones running six months later. If the agent makes a bad decision, a human has to notice, diagnose the issue, and manually adjust the system prompt or parameters. That doesn't scale.
We wanted Odigos agents to get better on their own. Not through fine-tuning (expensive, brittle, requires curated datasets) but through a lightweight evolution loop that runs continuously in the background.
The Core Loop
After every conversation, an evaluator scores the interaction using implicit feedback signals and rubric-based assessment. These scores feed into a "dream" cycle that runs during idle time:
- Evaluate -- Score each response. Was the tool selection efficient? Did the agent ask unnecessary clarifying questions? Did it resolve the task?
- Analyze -- Look for patterns across recent conversations. Which classification routes are underperforming? Which tools get used together?
- Propose -- Generate candidate changes: adjust a classification rule, modify a prompt section, create a new skill from a repeated pattern.
- Trial -- Run the change as a time-boxed experiment. Track scores during the trial period.
- Promote or revert -- If scores improve, the change becomes permanent. If they drop, it's rolled back automatically.
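The last two steps can be sketched in a few lines of Python. This is a minimal illustration, not the Odigos implementation: the `Trial` class, its field names, and the `min_samples` and `min_gain` thresholds are all assumptions made for the example. It also assumes that a trial with too little data, or with no measurable change in score, is reverted, which the steps above leave unspecified.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Trial:
    """One time-boxed experiment for a candidate change (hypothetical structure)."""
    change_id: str
    baseline_scores: list[float]          # scores collected before the change
    trial_scores: list[float] = field(default_factory=list)
    min_samples: int = 5                  # assumed: evidence needed to promote
    min_gain: float = 0.0                 # assumed: required gain over baseline

    def record(self, score: float) -> None:
        """Store the evaluator's score for one conversation during the trial."""
        self.trial_scores.append(score)

    def decide(self) -> str:
        """Call when the trial window closes; returns 'promote' or 'revert'."""
        if len(self.trial_scores) < self.min_samples:
            return "revert"  # assumed: too little evidence, keep the old behavior
        gain = mean(self.trial_scores) - mean(self.baseline_scores)
        return "promote" if gain > self.min_gain else "revert"
```

Reverting by default keeps the sketch conservative: a change has to earn its place with a measured improvement before it becomes permanent.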
Research Foundations
This approach draws from several lines of research that converged in late 2025:
Automating Skill Acquisition introduced evolvable classification rules -- the idea that an agent's routing logic (which requests get fast-tracked vs. decomposed into sub-tasks) should itself be a learnable parameter, not a hardcoded decision tree.
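To make "routing logic as a learnable parameter" concrete, here is a small Python sketch in which the rules are plain data rather than branches in code. The `RoutingRule` type, the route names, and the keyword matching are all hypothetical stand-ins for illustration; the paper's actual mechanism is not reproduced here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RoutingRule:
    """A single classification rule stored as data (hypothetical shape)."""
    name: str
    keywords: tuple[str, ...]
    route: str  # assumed route names: "fast_track" or "decompose"


def classify(request: str, rules: list[RoutingRule], default: str = "decompose") -> str:
    """Return the route of the first rule whose keyword appears in the request."""
    text = request.lower()
    for rule in rules:
        if any(keyword in text for keyword in rule.keywords):
            return rule.route
    return default


# Current rules, and a candidate that widens one rule by a single keyword.
rules = [RoutingRule("lookup", ("status", "what is"), "fast_track")]
candidate = [replace(rules[0], keywords=rules[0].keywords + ("show me",))]
```

Because a candidate is just another list of rules, it can be swapped in for a trial and discarded without touching the surrounding code.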
XSkill: Continual Learning in Multimodal Agents formalized the concept of "tactical experiences" -- structured lessons extracted from past tool interactions. Instead of just logging what happened, the agent captures why a tool call succeeded or failed, creating a knowledge base that prevents repeating mistakes.
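One way to picture a tactical experience is as a small record that stores the reason alongside the outcome. The Python sketch below is an assumption about what such a record could look like; the `TacticalExperience` fields and the `ExperienceStore` class are invented for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TacticalExperience:
    """A structured lesson from one tool call (hypothetical fields)."""
    tool: str
    situation: str   # what the agent was trying to do
    succeeded: bool
    reason: str      # why the call succeeded or failed


class ExperienceStore:
    """In-memory knowledge base of past tool interactions."""

    def __init__(self) -> None:
        self._items: list[TacticalExperience] = []

    def add(self, experience: TacticalExperience) -> None:
        self._items.append(experience)

    def lessons_for(self, tool: str, failures_only: bool = False) -> list[str]:
        """Return 'situation: reason' lessons for a tool, optionally failures only."""
        return [
            f"{e.situation}: {e.reason}"
            for e in self._items
            if e.tool == tool and not (failures_only and e.succeeded)
        ]
```

Before calling a tool, an agent could look up `lessons_for(tool, failures_only=True)` and add the results to its context so the same mistake is less likely to recur.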
The evolution engine itself is inspired by autoresearch by Andrej Karpathy, which demonstrated that automated experimentation with score-based promotion can iteratively improve complex systems without human intervention.
What This Means in Practice
An Odigos agent deployed today will handle certain requests poorly. That's expected -- no system is perfect at launch. The difference is that by next week, those weak spots should be smaller. The agent will have adjusted its own routing, refined its tool selection patterns, and possibly authored new skills to handle cases it previously stumbled on.
No one has to touch a prompt: the work of improving the system happens automatically. The agent still reports what it changed and why, so you keep full visibility.