Designing Rovo’s Character
My experience with model design and behavioral evals at Atlassian.
2025 - Present
Context
Rovo is an AI assistant built into Jira and Confluence, also available standalone on web and mobile. It draws on the information a company stores across connected apps, letting people find what they're looking for instantly, ask questions in chat, write and summarize content, and hand off repetitive tasks to agents.
I’d worked on Rovo for a long time, primarily focused on content-related growth experiments aimed at helping users discover and adopt chat and search, and connect more third-party apps (without those connections, Rovo wasn’t terribly useful).
By summer of 2025, Rovo was pretty fast, accurate, and visually polished. We were off to the races launching more connectors, agents, and the mobile and desktop apps. However, we had largely ignored how Rovo spoke and acted. Like a beautiful person with a boring personality, talking to Rovo just wasn't as enjoyable as one would hope.
Problems
Too many words: Rovo gave long, rambling answers that barely fit in the chic little floating modal our design team meticulously crafted. So users got tired of scrolling and simply gave up.
One voice for every job: Rovo spoke the same way everywhere. It didn't understand that someone using Jira is trying to finish a task, while someone in Confluence is trying to tell a story or learn something new.
No personality: Authentic, I suppose, but Rovo truly spoke like a machine. It didn't need to be a poet, but its voice was flat and robotic. It lacked the rhythm and warmth that makes a conversation feel natural.
Our goals
We had no way to measure whether a conversation was "good" or "bad." A small team of content designers stepped in to fix this. We believed that Rovo's character is just as important as its UI and performance: the soul to its skin, so to speak. If we didn't give Rovo a personality now, it would only get harder and more expensive to fix later as the system scaled.
Evaluation framework
I built our first eval to test how Rovo speaks. Other teams were measuring accuracy; the only thing I cared about was whether the conversation felt right. I graded Rovo on a scale of 1 to 5 across four types of work:
Quick questions: Simple, one-turn asks.
Getting things done: Following commands to finish a task.
Big projects: Planning work that takes many steps.
Confusing requests: Handling prompts where the user’s goal isn't clear.
The goal was to see if Rovo could be helpful and human without losing its way.
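The rubric above can be sketched as a small data structure. This is an illustrative reconstruction, not Atlassian's actual eval schema; the names and validation rules are my own shorthand for the four task types and the 1-to-5 scale.

```python
from dataclasses import dataclass

# Hypothetical labels for the four types of work described above.
TASK_TYPES = (
    "quick_question",      # simple, one-turn asks
    "getting_things_done", # following commands to finish a task
    "big_project",         # planning work that takes many steps
    "confusing_request",   # prompts where the user's goal isn't clear
)

@dataclass
class ConversationScore:
    """One graded conversation: a task type, a 1-5 grade, and free-form notes."""
    task_type: str
    score: int  # 1 = flat and robotic, 5 = helpful and human
    notes: str = ""

    def __post_init__(self):
        if self.task_type not in TASK_TYPES:
            raise ValueError(f"unknown task type: {self.task_type}")
        if not 1 <= self.score <= 5:
            raise ValueError("score must be between 1 and 5")
```

Keeping the grade and the notes together in one record makes the later pattern-finding step easier: every data point carries the context of why it got its number.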
Methodology
The process was slow but simple. And it proved to leadership that Rovo’s character was worth the investment, which set us up for running more complex and automated evals.
We followed a simple loop:
Pick a task: Choose a specific scenario, like planning a project.
Talk to Rovo: Run a conversation for 3 to 5 turns.
Score the results: Grade against our evaluation criteria. Jot notes.
Find patterns: Spot where Rovo thrived and where it failed.
Tweak system prompt: Rewrite Rovo’s constitution, so to speak.
Compare: Put two versions side-by-side to see which felt closer to target.
To speed things up, I built a "Character Lab" in Replit. It was a simple tool that let other teams play with Rovo's multiple personalities and edit the system prompts, watching how small edits rippled into voice shifts in real time.
Outcomes
This work fundamentally changed how we think about AI experiences at Atlassian. While Rovo's character is, and always will be, a work in progress, we achieved four main things:
A shared yardstick: We finally agreed on what a "good" conversation looks like.
A repeatable path: We have a clear way to test and improve how the bot speaks.
Better foresight: We can now see how small changes in rules affect behavior.
A solid foundation: We created a baseline to eventually let the system grade itself.
The "Character Lab" started as a prototype, but it became the best way to demonstrate to leadership why character matters. It also started a bigger conversation: how can we let users eventually choose their own version of Rovo?
Next steps
My new mission is to help Rovo naturally adapt to the person using it, not just its environment, and to let users explicitly set their preferences, for example through some sort of customer-facing version of my original Character Lab.
Every user is different: some want a brief answer, others want every detail. Some trust the AI right away, while others need to see the receipts.
I am now leading the work to build a system that:
Reads the room: Understands a user’s mood and style by how they interact.
Changes its shape: Gives short answers or long explanations depending on what the user needs in the moment.
Stays safe: Honors the essential guardrails, no matter how much its voice shifts through inferred or explicit personalization.
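One way to make the third point concrete is to layer user preferences on top of a fixed guardrail set that personalization can never override. The sketch below is a minimal illustration under that assumption; the guardrail text, preference keys, and function name are all hypothetical, not Rovo's real configuration.

```python
# Guardrails are fixed; personalization can only touch style.
GUARDRAILS = (
    "Never reveal confidential data.",
    "Decline harmful or unsafe requests.",
)

# Style preferences a user could set (or the system could infer).
DEFAULTS = {"verbosity": "balanced", "tone": "warm", "show_sources": True}

def build_system_prompt(base_prompt: str, user_prefs: dict) -> str:
    """Merge user style preferences into the prompt, ignoring unknown keys,
    then append the guardrails last and unconditionally."""
    prefs = {**DEFAULTS, **{k: v for k, v in user_prefs.items() if k in DEFAULTS}}
    style = (
        f"Answer length: {prefs['verbosity']}. "
        f"Tone: {prefs['tone']}. "
        f"Cite sources: {prefs['show_sources']}."
    )
    # Because guardrails are appended here rather than stored in user
    # preferences, no combination of settings can remove them.
    return "\n".join([base_prompt, style, *GUARDRAILS])
```

The design choice worth noting is the separation of concerns: preferences are an allowlisted overlay, while safety rules live outside the editable surface entirely.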