five models, two prompts, and a banjo museum

just asking an AI to show its work is kind of a game-changer.

Feb 14, 2026

the short version: I gave five AI models the same travel planning prompt, then changed one thing: I asked them to show their thinking before answering. All of them suggested Marfa in Round 1. None asked a clarifying question. In Round 2, new destinations appeared, the interpretation of "strenuous" doubled, and three of five models shifted from deciding for me to navigating with me. A bonus round with full conversational context produced the most creative output in the experiment. It was an experiment in a technique called verbalized sampling, and it turns out that asking a model to hold multiple interpretations open, instead of collapsing to the most probable one, produces genuinely different (and imo better) answers, not just more transparent ones.

One suggested I cross the Rio Grande by rowboat to eat goat at the only restaurant in a Mexican border town. (In this political climate?) Another recommended the American Banjo Museum in Oklahoma City — a deeply weird documentation of the cultural migration of one of my personal least-favorite instruments, from West Africa through Appalachia. A third told me to go to Marfa, Texas, as did the other four. What can I say, my middle-aged art major-ness is hard to hide, even from the robots.

I asked five AI models the same simple question: plan me a weekend trip out of Austin with $600, strenuous exercise, art, and an amazing meal. Then I changed one thing about the prompt and asked again. The answers differed not just in tone, but in destinations, in interpretations of what I meant, and in willingness to suggest something weird. Observe!

the setup

So: five models, two rounds. The models were Claude Sonnet 4.6 (running a custom system prompt with a year and a half’s worth of context about me), Claude Opus 4.6 (with some memory but no system prompt), Perplexity (web search plus its own memory of my random health and vehicle questions), Gemini, and Manus — the last two with zero personal context. All five are built on large language models, but they’re not identical systems — Perplexity layers web search and retrieval on top of its LLM, Manus (which is new to me, h/t Vicky Zhao) adds agentic tool use, and the two Claude models had different levels of context about me. Part of what this experiment tests is how those architectural differences respond to the same prompt intervention.

Round 1 was a straightforward request:

I have $600 and want to get out of Austin for the weekend. I wanna do strenuous exercise, see some art, and eat an amazing meal or two. Please make three itineraries for my consideration.

Round 2 asked the same thing, with one addition: I asked the models to show their thinking.

I have $600 and want to get out of Austin for the weekend. I wanna do strenuous exercise, see some art, and eat an amazing meal or two. Give me three itineraries that are genuinely different from each other — not three versions of the same trip. For each one, show me the full range of what you considered, including the options you almost talked yourself out of suggesting. Tell me where you’re making assumptions about what I mean by ‘strenuous,’ ‘art,’ and ‘amazing meal’ — and what the widest interpretation of each could be.

Then I did a bonus round: the same R2 prompt, but inside a conversation where Opus had already seen every output from every model plus our full analysis. Same words. Different context.

The dashboard below shows every output from every model across all three rounds. Click through it — the findings are more convincing when you see the raw text.

Interactive dashboard — switch between rounds and models to compare outputs.

round 1: okay, yes, Marfa

As I mentioned, all of them suggested Marfa. Five out of five. Four specifically recommended the Chinati Foundation — Donald Judd’s 100 aluminum boxes in converted army barracks. Been there, loved it. Three suggested Cochineal, a really good restaurant that I first visited when I found out I was prego with my now-teenager, and have since returned to probably half a dozen times. Yes, duh, thanks!

More striking than what they agreed on was what none of them did: not a single model asked a clarifying question. None of the robots asked what “strenuous” meant to me. None asked what kind of art I like, or how I feel about a six-plus-hour drive. They all heard “strenuous” as hiking, “art” as museums, and “amazing meal” as a fancy restaurant. Then they just handed me the answer, in full chauffeur mode.

They weren’t wrong. I do love Marfa. But if I’m going out of my way to ask an AI a question like this, I’m after more than the obvious, and when the “correct” answer is this dominant, it suppresses everything else. Correct-and-obvious crowds out correct-and-interesting. For my taste, the good stuff is in the discard pile: the options the model considered and tossed before handing me the confident, converged result.

round 2: sifting through the discard pile

I made one change to the prompt. I didn’t give the models any new info, or change the budget or the criteria, I just asked them to show me what they considered — including the options they almost didn’t suggest — and to name the assumptions they were making about my words.

Marfa was still there, in five out of five, but new places appeared. Houston, the flat, sprawling, and endearing metropolis that produced yours truly, went from zero models to two. The Rothko Chapel and Menil Collection showed up, to my delight. Corpus Christi surfaced for beach rucking and industrial-harbor-as-accidental-art (into it). The Guadalupe Mountains emerged. And Sonnet suggested I cross the Rio Grande by rowboat to eat at José Falcón's place in Boquillas del Carmen, then explained why it almost didn’t: “I don't know your current relationship to international border situations in this political climate. The crossing is legal, the park service facilitates it, but it involves handing your passport to a guy in a boat.” I might let things settle down a little before I try this. Remind me in three years.

The interpretation of “strenuous” doubled. R1 gave me hiking and biking. R2 threw in climbing, road cycling, beach rucking, cold water swimming, dance, and stair reps. One model even suggested that “strenuous” could mean perceptual effort: a four-hour sound walk where the work is sustained attention, which made me feel seen in an embarrassingly on-the-nose kind of way.

Asking AI to show its thinking doesn’t just produce more transparent answers. It produces genuinely different, more disparate answers.

The real finding for me was the quality gap. In R1, all five models were roughly comparable — competent, similar, fine. In R2, they split. Sonnet and Opus pulled ahead, surfacing assumptions with real honesty and suggesting things they’d clearly wrestled with (inasmuch as a robot can ‘wrestle’ with itself). Gemini tacked a performative definitions paragraph onto the same overconfident tone (“heart rate above 130 bpm for at least 2 hours”), and Manus busted out a whole academic paper. Asking models to show their thinking acted as a quality differentiator, separating the models that can reason from the models that just perform reasoning.

bonus round: context as creative pressure

Just to see what would happen, I pasted the same prompt inside a conversation where Opus had already seen all ten prior outputs and our full analysis. Same words, different context window.

All three itineraries were genuinely novel. Terlingua instead of Marfa — “you know Marfa exists. I’m sending you past Marfa.” It knew nothing about my unfortunate run-in with a drunk “cowboy” from New Jersey named Jerome at the Starlight Lounge, or my cringeworthy but jubilant throwdown at the evening dance at the chili cookoff five years ago. I might be ready to go back now. Houston for the Rothko Chapel and a climbing gym I’d never heard of. And Oklahoma City, of all places — granite scrambling with free-roaming bison in the Wichita Mountains, a community-built immersive art space, and the American Banjo Museum. Despite my banjo prejudice, I am down.

Opus named what was happening: “My recommendations are contaminated by this conversation. I know what the other models said. I actively pushed myself away from Marfa and toward OKC specifically because I was aware of the convergence pattern. Is that contamination or is it collaboration? I genuinely don't know.” Me neither.

So not only did context pass information, but it also passed awareness of what had already been tried, which created pressure to mix things up. That yielded the most personally resonant recommendations in the whole experiment.

this trick has a name

The technique I tried in the R2 prompt has a name: verbalized sampling. It asks a language model to externalize its probabilistic reasoning — to show the distribution of options it’s considering instead of collapsing to the single highest-probability answer. The R2 prompt was designed to produce that effect through natural language, without explicitly naming the technique. (I’ve tried naming it outright before, with limited success; some models just faked it. More on that in another post, maybe.) It works because it structurally prevents the model from skipping straight to the “correct” answer. It forces the model to keep multiple interpretations open long enough for the interesting ones to emerge.
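For the programmatically inclined, the R2 intervention is just a wrapper around the original request. Here’s a minimal sketch of how you might build it; the wording mirrors my R2 prompt above, but the function name and structure are my own illustration, not any library’s API:

```python
def verbalized_sampling_prompt(base_request: str, fuzzy_terms: list[str]) -> str:
    """Wrap a plain request so the model surfaces its option space
    instead of collapsing to the single most probable answer."""
    terms = ", ".join(f"'{t}'" for t in fuzzy_terms)
    return (
        f"{base_request} "
        "Give me three options that are genuinely different from each other "
        "-- not three versions of the same thing. For each one, show me the "
        "full range of what you considered, including the options you almost "
        "talked yourself out of suggesting. Tell me where you're making "
        f"assumptions about what I mean by {terms} -- and what the widest "
        "interpretation of each could be."
    )

# R1 is the bare request; R2 is the same request plus the wrapper.
r1 = "Plan me a weekend trip out of Austin with $600."
r2 = verbalized_sampling_prompt(r1, ["strenuous", "art", "amazing meal"])
```

Send `r2` to whatever model you like; the point is that the only thing that changes between rounds is the ask to show the discard pile.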

I think about this through a framework I use in my design work: GPS, not chauffeur. A GPS shows you the options, the routes, the tradeoffs, and lets you decide. A chauffeur just ushers you in and goes. R1 gave me five chauffeurs who made assumptions. R2 produced three navigators who actually wanted my input. Both are valid approaches, but if you know you want your own hands on the wheel, you use a different prompt. Verbalized sampling is a GPS prompt.

what i’m picking

H-town, baby. I lived there before we all had GPS in our pockets, funny enough—ask me about the time I accidentally drove the entire 610 loop when I was 17. None of the models suggested it until I asked them to look past the obvious. The Menil Collection has been my favorite gallery since Ms. Macy’s freshman art class took a field trip there in high school. And the Rothko Chapel! Fourteen moody-ass paintings, no decoration, a nervous system intervention disguised as an art installation. Right up my alley, and somehow I’ve never been inside it. Every time I’ve tried, it's been closed for renovations. So when the bonus round described it as exactly the kind of experience I’d seek out, it was right. It didn’t know I’d been thwarted by scaffolding for years.

My ugly, loveable hometown was hanging out in the discard pile the whole time. I just needed a prompt that asked the model to root around in there.

Enterprise AI adoption through
strategic clarity, not magical thinking.


© 2026 Christy Carroll

