ai · standard
"How would you make an LLM's output more creative?"
How would you make an LLM's output more creative?
This question tests whether you know the difference between an ML engineer’s answer and a PM’s answer. The ML engineer lists parameters. The PM defines “creative” for their specific surface first, then explains which controls to pull and why, then describes how they’d know creativity actually improved. Candidates who jump straight to “raise the temperature” are answering the wrong version of the question.
What the interviewer is testing
The underlying check is product judgment on a three-axis tradeoff: creativity up means coherence down, cost up, and safety risk up. The PM has to decide where on that surface the product should sit, justify it in terms of user value, and build the eval to confirm it landed there. Any answer that treats “creative” as self-evident, or that ignores cost and risk, or that has no measurement plan, fails the rubric.
The three-layer control model
Think of creativity controls in three layers, ordered by cost and reversibility.
Layer 1: Inference-time controls (immediate, tunable, reversible). Temperature scales the logit probability distribution before softmax. Below 1, it sharpens the distribution toward high-probability tokens (more predictable). Above 1, it flattens it (more diverse). For frontier models in 2026, the practical creative range is 0.8 to 1.1. Above 1.2, coherence degrades faster than originality improves, and you often get incoherent or sycophantic outputs. “Crank temperature” is 2023 advice.
Pair temperature with nucleus sampling (top-p). At each token step, top-p considers only the smallest set of tokens whose cumulative probability exceeds p. Setting p to 0.92 to 0.95 expands the candidate pool while excluding the implausible long tail. Top-k is a simpler alternative: cap the candidate set at the k highest-probability tokens (k = 40 to 100 for creative tasks). Min-p, a 2024 addition now common in open-source inference servers, sets a dynamic probability floor relative to the top token, avoiding both the rigidity of top-k and the incoherence that comes from high top-p on a flat distribution.
The highest-leverage inference-time pattern is sampling-and-ranking: generate three to five candidates at moderate temperature, then rank them with a smaller reward model or an LLM-as-judge scoring originality and quality. This reliably outperforms single-shot high-temperature generation on creative benchmarks. The cost is three to five times the inference cost per request, which is a product decision about whether the engagement lift justifies the spend. For on-device deployment (Apple Intelligence), latency and memory constraints make n greater than one sampling largely infeasible, so temperature and top-p are the only practical levers.
Layer 2: Prompt-time controls (zero marginal cost, highest leverage for most surfaces). A strong creative persona in the system prompt, two to three few-shot examples of the register you want, and explicit constraints (“avoid clichés,” “use unexpected analogies,” “vary sentence rhythm”) are routinely underused. Constraints paradoxically increase originality by forcing the model off its default high-probability patterns. Style anchors work too: “write in the register of a food critic who spent a decade abroad” gives the model a concrete attractor without requiring any model changes. This is the first place to look before touching any inference knob.
Layer 3: Model-time controls (high upfront cost, permanent distribution shift). Fine-tuning on a curated creative corpus or using RLHF with human raters scoring originality shifts the model’s base distribution. Inference cost stays constant after training. This is the right call when creativity is a core product differentiator: Character.AI’s character voice consistency and Midjourney’s stylistic coherence are maintained this way. The barrier is labeled data quality and the cost of annotation, not compute.
The 2026 reframe
Frontier models already have very high creative floors. The real PM challenge is bounded creativity: creative within brand voice, within safety rails, within format constraints, within a latency budget. The question is no longer “can the model be creative?” but “creative enough to feel surprising and human, while staying coherent, on-brand, and affordable to serve at scale?” That is a product problem. Candidates who answer with parameter values alone have not made the shift.
Strong and weak answers
strong
"Before touching any controls, I'd define what 'creative' means for this surface, because the right lever depends entirely on the use case. For ad copy, creative might mean unexpected word choice and tonal range. For a character AI, it might mean behavioral unpredictability and emotional register shifts. For a poetry tool, it might mean structural variety and avoidance of cliché. I'd pick a definition, write an eval that measures it, and then tune.
Layer one is inference-time: raise temperature from a conservative 0.3–0.5 to 0.8–1.0, pair with top-p around 0.93 to keep coherence while expanding the candidate pool. Avoid going above 1.1 on a frontier model; past that, coherence collapses faster than originality improves. For a surface where quality matters more than latency, I'd use sampling-and-ranking: generate three to five candidates at moderate temperature and rank with a quality judge. That reliably outperforms single-shot high-temperature output, but costs three to five times per request, so I'd validate that the save or share rate lift justifies the infra cost before shipping it.
Layer two is prompt-time and it's the most underused: a strong creative persona in the system prompt, two to three few-shot examples of the quality and register I want, and explicit constraints. Constraints are counterintuitive but they work; forcing the model off the highest-probability default patterns is where originality comes from.
Layer three is model-time: fine-tuning on a curated creative corpus or RLHF with raters scoring originality. High upfront cost but inference cost stays constant. Worth it only when creativity is a core differentiator and I have the labeled data to justify it.
For measurement: I'd set up an eval tracking fluidity, originality, and quality, either via human raters or an LLM-as-judge scoring those dimensions (Google's VANTAGE protocol formalizes this across six dimensions), plus product-side proxies like save rate, share rate, and re-generation rate. If creative outputs lift saves but also raise safety flags, that's a product decision about where we want to sit on the tradeoff, not a model decision.
If this is a Google context: Gemini's structured generation and grounding constraints mean the problem is bounded creativity, creative within factual and format rails. That makes prompt-layer controls more important than raw temperature tuning, and the eval has to include groundedness alongside originality."
weak
"I'd increase the temperature parameter. That adds randomness to the output and makes it more creative. You can also use top-p sampling for more diverse responses." This answers the ML engineer's version of the question. It treats "creative" as self-evident when a PM must define it for their surface. It ignores prompt-layer controls, which are the highest-leverage and zero marginal cost. It shows no understanding of the tradeoffs (coherence, safety, cost, latency). It has no measurement plan. Interviewers at Google and Apple are checking whether the candidate can reason about product tradeoffs. This answer has none.
Follow-up questions to prepare for
“How do you measure whether creativity actually improved?” Human raters scoring on originality and fluency, an LLM-as-judge evaluating against those dimensions (Google’s VANTAGE protocol covers six: fluidity, originality, quality, building on ideas, elaborating, and selecting), and product-side proxies like re-generation rate (high re-gen = the output wasn’t good enough) and save or share rate.
“What’s the cost tradeoff of sampling-and-ranking?” Three to five times inference cost per request. Worth it when the product surface has high output value per generation (a personalized greeting card, a key scene in a game) and lower request volume. Not worth it for high-volume, latency-sensitive surfaces.
“What happens to safety when you increase creativity?” Higher temperature and wider sampling expand the probability mass toward lower-probability tokens, including off-policy content. This is a product decision: define acceptable creative risk, build an output classifier or constitutional AI layer to screen, and set the creativity controls to the maximum that passes your safety eval on your test set. Don’t treat it as an engineering problem to solve after launch.