PolicyAI: a system that scaled faster than trust

UX consulting, conversation design, and user research for a document intelligence platform at a major financial institution

role

Conversational UX design consultant (via Slalom)

duration

4 months

methods

Design audit, semi-structured interviews, live observation sessions

participants

8 associates across underwriting, risk, and compliance

tools

Custom AI heuristic framework, interview protocols, prioritization matrix

deliverables

AI-adapted UX heuristic audit, user research report, strategic recommendations

impact

Surfaced systemic trust and governance gaps before enterprise rollout; delivered detailed recommendations for rebuilding trust in high-stakes AI workflows

the short version: I was brought in to improve the conversational UX for a document intelligence tool that had an 80% technical accuracy rate but only 30% perceived accuracy among users. A design audit revealed systematic trust-breaking patterns in the interface, and user research confirmed that people had reduced a six-figure AI investment to a Ctrl+F machine. The root cause wasn’t the model — it was missing content architecture: no taxonomy, no metadata, no governance. I left the team with detailed recommendations for rebuilding trust. They decided to scale 3x first.


This is less of a dazzling success story than it is a cautionary tale. It's a story about what happens when foundational work (governance, user research, trust-building) gets deferred in favor of scaling quickly. But honestly, the team that put me on this assignment did me a solid, because I learned a ton.

the tool

PolicyAI is an LLM-powered conversational AI assistant that lets commercial banking underwriters ask questions about internal policy documents. It returns a synthesized answer plus source documents. If you’ve used ChatGPT or Gemini, you know what it looks like. Friendly greeting, input bar, “ask PolicyAI…”

It was originally built as a technical proof of concept without a dedicated design resource, and I was brought in to improve the conversational UX as the team refined the model and planned to scale the MVP to more users. Right in my wheelhouse. But PolicyAI was not what it appeared. There were some surprising things happening under the hood.

the numbers

The adoption rate was 30%. The retention rate was even worse. My job was to find out why.

The team wasn’t worried. They were sending out a ten-page prompting manual (!!) and “fixing accuracy on the back end.” Which told me everything I needed to know about how badly they needed to hear directly from the people using this tool.

the audit

I’ve been working on conversational AI for a long time, and these patterns were familiar — but seeing them in a high-stakes financial product in 2025 clarified exactly why foundational work cannot be treated as optional.

I started a formal design audit using Nielsen’s UX heuristics — tried and true usability principles — as a foundation. But those heuristics were designed for traditional interfaces, not conversational AI. So I built my own adapted framework that adds AI-specific dimensions: transparency of AI reasoning, conversational agency, context preservation, graceful misunderstanding recovery.
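
To make the framework concrete, here's a minimal sketch of how a rubric like this can be structured. The four AI-specific dimensions are the real ones from my framework; the dataclass, severity scale, and example ratings are illustrative scaffolding, not the actual audit artifact.

```python
# Illustrative sketch only. The four AI-specific dimensions are from the
# adapted framework described above; the dataclass, severity scale, and
# example ratings are hypothetical scaffolding, not the audit artifact.
from dataclasses import dataclass

# Nielsen-style 0-4 severity ratings
SEVERITY_SCALE = {
    0: "not a usability problem",
    1: "cosmetic",
    2: "minor",
    3: "major",
    4: "usability catastrophe",
}

AI_SPECIFIC_DIMENSIONS = [
    "transparency of AI reasoning",
    "conversational agency",
    "context preservation",
    "graceful misunderstanding recovery",
]

@dataclass
class HeuristicFinding:
    dimension: str    # a traditional heuristic or an AI-specific dimension
    observation: str  # what was actually observed in the interface
    severity: int     # 0-4, per SEVERITY_SCALE

# Two findings from the audit, expressed in this structure:
findings = [
    HeuristicFinding("transparency of AI reasoning",
                     "no confidence indicators on synthesized answers", 4),
    HeuristicFinding("context preservation",
                     "no conversational memory behind a multi-turn chat UI", 3),
]
```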

The audit revealed a technically functional system that was experientially broken: no confidence indicators, no conversational memory, overzealous content moderation that flagged “how sure are you about this?” as a high-risk query, and dual AI confusion where users had to mentally switch between Gemini and PolicyAI with no visual cues.

I presented the audit with recommendations, and the team’s response was understandable — they had a feature roadmap, tight timelines, and I was essentially asking them to slow down and rebuild trust infrastructure. But heuristic violations are abstract. I needed to show them what this looked like for actual users.

So I shifted from “here's what's broken” to “let me show you what your users are actually experiencing.”

the research

I worked with the PM and research ops to put together a research plan. They have a killer research ops team: once I filed my plan, they recruited and scheduled eight interviews in a couple of days. I conducted interviews with three main user segments: underwriters, risk, and compliance associates. I watched them use the tool. I documented the workarounds. I listened for the hesitations.

These are people working on nine-figure deals. If a policy answer is wrong, careers are on the line and the legal exposure is real.

what users actually said

“I don't trust the summaries... I don't know how often I’m gonna use this if the hit rate’s gonna be less than 50%.”

To a person, everyone I talked to said they skipped the synthesized answers. They were verbose and meandering. Sometimes the right answer was buried in a bunch of jibber-jabber. Sometimes it took the model three paragraphs to conclude with “in short, I don't have an answer.”

“I can't trust the answers, but I can trust that it’ll point me to the right document. From there, I’ll do my own Ctrl+F.”

This quote stopped me in my tracks. This is a senior underwriter describing how they actually use an AI tool their company spent hundreds of thousands of dollars to build. They’ve reduced it to a search engine because they can’t trust the answers. That’s not a tech failure — that’s a trust failure.

“Policy is an absolute no-fail scenario. You cannot get policy wrong.”

This was from a senior director of underwriting. He was also one of the business's “AI champions”: someone who volunteered to test PolicyAI because he was excited about the concept. His frustration was visceral, and completely justified given what was at stake. These underwriters are working deals in the hundreds of millions of dollars, with serious legal and career repercussions if they get policy wrong. You do not get to be casual about accuracy in that environment.

the trust gap

Internal validation showed 80% accuracy. But with only 30% of users actively adopting the tool — and even those users bypassing the AI answers to Ctrl+F the source documents — perceived accuracy was closer to 30%. That fifty-point spread is where trust lives and dies.

Here’s what was killing trust:

mismatched mental model. No conversational memory means single-turn dialogues, but the UI implies multi-turn conversation threads. You look like Gemini, you better act like Gemini.

overzealous guardrails. The system flagged reasonable follow-up questions as off-topic or potentially harmful. Asking “how sure are you about this?” was treated as a high-risk query. Think about that.

verbose, meandering responses. Answers were unclear and difficult to parse, burying useful information under paragraphs of hedge language.

mismatched citations. Cited source documents often didn’t correspond to the synthesized answer. Users couldn’t verify what the AI told them because the receipts didn't match.

the jargon gap. The model required formal policy terminology, not banking vernacular. If an underwriter asked about a “borrower,” the model got confused. But ask about an “obligor” and you'd get a decent answer. The people using this tool every day had to become prompt engineers for their own domain.
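
To make that last gap concrete, here's a minimal sketch of the kind of translation layer that closes it: vernacular-to-policy term normalization applied before retrieval. The borrower/obligor pair came straight from the research; the function, mapping table, and everything else here are hypothetical.

```python
# Illustrative sketch: normalize banking vernacular to formal policy terms
# before the query hits retrieval. The borrower/obligor pair came straight
# from the research; the function and mapping table are hypothetical.
import re

VERNACULAR_TO_POLICY = {
    "borrower": "obligor",
    # additional pairs would come from interviews and query logs
}

def normalize_query(query: str) -> str:
    """Rewrite user phrasing into the terminology the policy corpus uses."""
    for vernacular, policy_term in VERNACULAR_TO_POLICY.items():
        query = re.sub(rf"\b{re.escape(vernacular)}\b", policy_term,
                       query, flags=re.IGNORECASE)
    return query

print(normalize_query("Is the borrower required to provide audited financials?"))
# -> "Is the obligor required to provide audited financials?"
```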

what I recommended

transparent source highlighting. Show the exact text snippets that informed the AI’s answer. Don’t make people hunt.

terminology mapping. Translate natural banking language to formal policy terms so users don’t have to become prompt engineers for their own domain.

document-level feedback loops. Replace the smiley/frowny emojis for the synthesized response with ✅ ❌ on each source document. That generates actionable data the team can actually use to improve the model (see the sketch after this list).

workflow integration. Embed capabilities into existing tools instead of forcing people to a standalone destination. PolicyAI was an occasional lookup tool, not an integrated workflow partner, and context-switching was killing sustained adoption.

governance guardrails. Define corpus owners, update cadence, and “do-not-ship” criteria tied to user trust metrics — not just model accuracy. Without governance, every other fix is temporary.
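
Here's a sketch of what the document-level feedback recommendation could generate, assuming a hypothetical schema (none of these names come from the actual system). The point is that each verdict attaches to a specific source document, which is what makes the signal actionable.

```python
# Illustrative sketch of document-level feedback: one verdict per cited
# source instead of one emoji for the whole answer. All names and fields
# are hypothetical, not the team's actual schema.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class SourceFeedback:
    query_id: str      # which question this verdict belongs to
    document_id: str   # the cited policy document being judged
    relevant: bool     # True = this source actually supported the answer
    user_role: str     # underwriting, risk, or compliance
    recorded_at: datetime

def record_feedback(query_id: str, document_id: str,
                    relevant: bool, user_role: str) -> SourceFeedback:
    """Aggregated, these verdicts show which documents retrieval surfaces
    wrongly for which questions, not just that users are unhappy."""
    return SourceFeedback(query_id, document_id, relevant, user_role,
                          datetime.now(timezone.utc))
```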

the root cause

The trust gap wasn’t just a UX problem. It was a content architecture problem. AI products have foundational requirements that traditional software doesn’t — and none of them were in place.

no taxonomy. Documents were organized by internal filing systems, not by how people actually do their jobs. No mapping between user language and policy language. No hierarchy of document relationships. No connection to deal lifecycle stages.

no metadata. No way to filter by document type, know document freshness, or scope by loan type or business line. No way to surface authority level or know which user roles a policy applies to. The model knew a document pattern-matched some words, but had no way to know why it mattered. (See the sketch after this list.)

no governance. No ownership of content quality. No process for corpus updates. No feedback loops. Metrics measured model accuracy, not user success. No criteria for when not to ship. No definition of “done” that included users’ confidence in the results. And instead of system fixes, they wrote a ten-page prompting manual.
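
To show what “no metadata” means in practice, here's a sketch of the kind of record each policy document was missing. Each field maps to a gap named in the list above; the schema itself is hypothetical.

```python
# Illustrative only: each field below corresponds to a gap named above.
# The schema is hypothetical scaffolding, not an actual system artifact.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class PolicyDocumentMetadata:
    document_id: str
    document_type: str   # policy vs. procedure vs. guidance memo
    effective_date: date # enables freshness checks and update cadence
    authority_level: str # board-approved policy vs. departmental guidance
    loan_types: list[str] = field(default_factory=list)        # scope by loan type
    business_lines: list[str] = field(default_factory=list)    # scope by business line
    applies_to_roles: list[str] = field(default_factory=list)  # underwriting, risk, compliance
```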

In other words: they built an impressive LLM interface on top of an information architecture that wasn’t ready to support it.

accuracy ≠ trust

You can have the most amazing data scientists and engineers and product folks in the world, but if you don’t deeply understand what people are actually trying to accomplish with your product, you won't earn their trust. The decisions we make from the very beginning of this process are amplified and multiplied at scale, and users feel it. They can tell when a product wasn’t built with their actual work in mind.

what actually happened

I was told they were addressing these issues “on the back end,” and that “once accuracy is fixed, this will all be solved.”

They opted to stick to their original feature roadmap and scale up 3x before addressing any of it.

I couldn’t change the roadmap, but I left them with a clear, prioritized path to rebuild trust once they're ready.

Here’s the thing about AI products right now: we’re all still learning the best way to build them. Especially in enterprise, where software development processes were codified decades ago. With emerging tech like LLMs, it’s crucial to keep in mind that we’re bushwhacking, not following a paved road. Talking to users is not a box to check. It's our navigation. When we learn something that challenges the plan, we have to be willing to adapt.

The recommendations are still there when they're ready. It might just be a longer journey to get where they're headed: a tool their users can confidently trust as part of their high-stakes workflows.

Trustworthy AI is built on content architecture. If we want to earn trust, we can’t skip the foundational work.

The best time to build trust infrastructure is before you scale. The second best time is now.

Enterprise AI adoption through strategic clarity, not magical thinking.

© 2026 Christy Carroll