Prompting for Reasoning Models

The new generation of reasoning models is much more capable, but it requires some upgrades to your prompting techniques

Artificial intelligence is undergoing a significant transformation.

We are moving beyond large language models (LLMs) primarily focused on generating text or retrieving information towards a new generation of models explicitly designed for enhanced reasoning.

Systems like OpenAI's o-series (o3, o4-mini), Anthropic's Claude 3.7 Sonnet, and specialized tools such as ChatGPT DeepResearch represent this evolution.

These models are engineered to tackle complex tasks that demand logic, planning, multi-step problem-solving, analysis, and even a degree of self-reflection.

They aim not just to provide answers but to work through problems in ways that resemble human cognitive processes: breaking down challenges, exploring potential solutions, and evaluating different paths.

This advancement signifies more than just incremental improvement; it reflects a qualitative shift in AI capabilities. The focus is increasingly on the process of arriving at an answer – the model's ability to "think" – rather than solely on the final output. This suggests AI is being developed to handle ambiguity and complexity with greater autonomy, moving closer to performing tasks that require genuine analytical capabilities.

Unlocking the full potential of these sophisticated models requires a corresponding evolution in how users interact with them. Basic instructions or simple questions often fall short.

Prompting is evolving from simple instruction-giving into a more nuanced form of cognitive guidance, demanding a deeper understanding from the user to effectively harness these advanced capabilities.  

Meet the New Generation of Thinkers

Several new models and features exemplify the trend toward enhanced AI reasoning. Understanding their specific characteristics is key to selecting the right tool for a given task.

OpenAI o-series (o3, o4-mini, o4-mini-high)

These models are part of OpenAI's series explicitly trained to "think for longer" before responding. They possess agentic capabilities, meaning they can independently decide to use and combine various tools within the ChatGPT environment, such as searching the web, analyzing data using Python code, interpreting visual inputs, and generating images, to solve complex, multi-faceted problems. They can integrate images directly into their reasoning process, not just viewing them but thinking with them.
A key distinction within the series: o3 delivers the highest reasoning quality, o4-mini balances speed and affordability, and o4-mini-high is o4-mini run at a higher reasoning-effort setting, trading some speed for more thorough answers.
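If you consume these models through the API rather than ChatGPT, the same quality/speed trade-off is exposed as a parameter. A minimal sketch, assuming the openai Python SDK and its reasoning_effort option for o-series models (the model name and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o4-mini",          # illustrative choice; any o-series model that supports the parameter
    reasoning_effort="high",  # "low" | "medium" | "high"; roughly what "o4-mini-high" means in ChatGPT
    messages=[
        {
            "role": "user",
            "content": "Plan the migration of a monolithic billing system to microservices in five concrete steps.",
        }
    ],
)

print(response.choices[0].message.content)
```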

ChatGPT DeepResearch

ChatGPT DeepResearch is an agentic capability within ChatGPT, powered by an o3 variant optimized for research. It autonomously browses the web, potentially analyzing hundreds of sources (including text, images, PDFs), synthesizes findings, and generates comprehensive, structured reports with citations. The process typically takes between 5 and 30 minutes. Its strengths lie in conducting deep dives into topics, finding niche or non-intuitive information, and providing documented outputs. However, it has limitations: the process is relatively slow, the quality is dependent on the information available online, and despite claims of lower hallucination rates, it can still invent facts, miss key details, or struggle with very recent information. Access is typically limited and associated with premium ChatGPT subscriptions.

Anthropic Claude 3.7 Sonnet

This model introduces a hybrid approach, functioning as both a standard LLM and a reasoning model through its "extended thinking" mode. When activated, the model engages in self-reflection before answering, improving performance on tasks like math, physics, complex instruction-following, and coding. A unique aspect is the API control over the "thinking budget": users can specify the maximum number of tokens allocated for this reasoning phase, allowing a direct trade-off between response quality, speed, and cost. Anthropic emphasizes optimization for real-world business tasks and coding capabilities. (Note that the thinking-budget control is available only through the Claude API, not the chatbot itself.)
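As a concrete illustration of that budget control, here is a minimal sketch assuming the anthropic Python SDK and its extended-thinking parameter; the model alias and token figures are illustrative:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # assumed model alias; check Anthropic's current model list
    max_tokens=8000,                   # must be larger than the thinking budget
    thinking={"type": "enabled", "budget_tokens": 4000},  # cap on tokens spent reasoning before answering
    messages=[{"role": "user", "content": "Prove that the sum of any two even integers is even."}],
)

# The response interleaves "thinking" blocks with the final "text" blocks; print only the answer.
for block in response.content:
    if block.type == "text":
        print(block.text)
```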

Gemini 2.5

Gemini 2.5, part of Google's Gemini family, brings major improvements in structured reasoning and memory integration. It is specifically engineered for "multi-modal" reasoning, handling text, code, images, audio, and video inputs with a unified framework. Unlike previous iterations, Gemini 2.5 demonstrates better "stateful" thinking—it can track complex chains of logic, reference prior context more effectively, and synthesize insights across multiple types of media. A key differentiator is Gemini 2.5's use of adaptive compute strategies: the model dynamically adjusts the depth of its reasoning based on task complexity, aiming to optimize cost and responsiveness. Early reports show Gemini 2.5 competing strongly with the best o-series and Claude 3.5/3.7 models on benchmarks involving long-context tasks, advanced mathematics, and multi-document summarization. Access is positioned primarily through Google's Workspace integrations (Docs, Gmail) and select API partners.
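Where the default adaptive behavior is not what you want, the Gemini API also exposes an explicit control. A minimal sketch, assuming the google-genai Python SDK and its ThinkingConfig option; the model name and budget value are illustrative, and availability varies by 2.5 variant:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY / GOOGLE_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",  # illustrative; use whichever 2.5 variant you have access to
    contents="Compare Raft and Paxos and recommend one for a five-node cluster, with reasons.",
    config=types.GenerateContentConfig(
        # Optional cap on reasoning depth; omit it to let the model adapt on its own.
        thinking_config=types.ThinkingConfig(thinking_budget=1024)
    ),
)

print(response.text)
```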

Gemini Deep Research

Gemini Deep Research is Google's answer to autonomous research capabilities, built on Gemini 2.5 infrastructure. It operates similarly to ChatGPT DeepResearch but with distinct differences: Gemini Deep Research is designed for cross-modal research—synthesizing information from text, web pages, PDFs, images, and video transcripts. It emphasizes high-fidelity source tracking, allowing users to trace claims back to original documents with embedded citations. Early user reports suggest it performs especially well at aggregating dispersed information from large datasets and public knowledge graphs, giving it an edge in enterprise, technical, and scientific research contexts.
However, it shares familiar trade-offs: research sessions can be lengthy, outputs must still be verified for factual accuracy, and deep integrations with Google Workspace suggest a bias toward Google's ecosystem for full functionality. Availability is currently limited to enterprise customers and developers using the Gemini API.

The development of distinct models like OpenAI’s o3 vs. o4-mini, Claude’s extended thinking mode, and Gemini’s adaptive compute points toward a clear trend: specialization. Vendors increasingly recognize that not all tasks demand the same intensity or cost of reasoning.

Today’s models offer configurable options, allowing users to balance performance, cost, and speed, and giving more agency in tailoring AI’s cognitive effort.

At the same time, the rise of agentic features like ChatGPT DeepResearch and Gemini Deep Research blurs the traditional line between AI as a passive responder and AI as an autonomous worker.
These systems can independently take multiple steps, make intermediate decisions, and integrate information over extended periods without constant human intervention.

This shift represents a major step toward AI agents capable of executing multi-step tasks autonomously. However, it also raises important new considerations: ensuring the reliability of these processes, requiring critical human verification, and redefining what we mean by AI-driven “research.” Interaction is evolving — moving from simply prompting for answers to delegating tasks and critically verifying the outputs.

Reasoning Model Features

OpenAI o3
  • Key characteristic: High-reasoning capability, tool integration, multimodal
  • Strengths: Strongest overall reasoning, complex tasks, long reasoning chains, tool chaining
  • Potential weaknesses/trade-offs: Slower, significantly more expensive than o4-mini
  • Ideal use cases: High-precision tasks, research problems, scientific workflows, critical reasoning where cost/speed are secondary

OpenAI o4-mini / o4-mini-high
  • Key characteristic: Faster, cheaper reasoning model, tool integration, multimodal by default
  • Strengths: Speed, cost-efficiency, optimized for coding and visual tasks, high throughput
  • Potential weaknesses/trade-offs: Potentially less consistent on highly complex reasoning vs. o3; user reports of issues with o4-mini-high coding reliability
  • Ideal use cases: Production environments, real-time tools, cost-sensitive applications, tasks prioritizing speed/volume, specific coding/visual tasks

Anthropic Claude 3.7 Sonnet (extended thinking)
  • Key characteristic: Hybrid model with controllable "extended thinking" mode via API
  • Strengths: Self-reflection improves complex tasks (math, coding, instructions), controllable thinking budget (quality/cost trade-off)
  • Potential weaknesses/trade-offs: Requires explicit API configuration for extended thinking, potential cost increase with larger budget
  • Ideal use cases: Complex coding, math/physics problems, detailed instruction following, tasks benefiting from deliberate reflection, real-world business tasks

ChatGPT DeepResearch
  • Key characteristic: Agentic research capability using an optimized o3 variant
  • Strengths: Autonomous deep dives across web sources, synthesis, structured reports with citations, finding niche information
  • Potential weaknesses/trade-offs: Slow (5–30 minutes), output quality depends on web data, potential for errors/hallucinations, limited access/cost
  • Ideal use cases: In-depth market analysis, literature reviews, complex product comparisons, tasks requiring synthesis from numerous online sources

Foundational Prompting for Clearer Reasoning

While advanced techniques are powerful, mastering the fundamentals of prompting remains essential for guiding even the most sophisticated reasoning models.

Clarity and Precision: This is non-negotiable. Reasoning models, despite their advancements, cannot read minds or resolve ambiguity effectively. Vague prompts like "Analyze the data" are likely to yield generic or unhelpful responses. Instead, be explicit about the objective, the subject, the desired action, and any constraints. A better prompt would be: "Analyze the Q4 2024 sales data in the attached CSV file. Identify the top 3 performing regions by total revenue. Provide the output as a bulleted list." Clear, direct language minimizes misinterpretation and sets the model on the right path.  

Providing Context: Context is the bedrock upon which reasoning is built. Models need sufficient background information to understand the nuances of a request. However, providing excessive, unstructured information can be counterproductive, potentially overwhelming the model. The best practice is often to summarize lengthy background materials, highlighting the key points relevant to the task. Include essential constraints, assumptions, or parameters within the prompt itself.  

Role-Playing/Persona: Assigning a role or persona to the model can effectively frame its reasoning process and output style. Instructing the model with "Act as a senior financial analyst reviewing this investment proposal" primes it to adopt a specific perspective, vocabulary, and set of analytical priorities relevant to that role.  
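These three fundamentals combine naturally into one structured prompt: a persona, summarized context, an explicit task, and constraints. A minimal sketch, assuming an OpenAI-style chat API; the model name, persona, and figures are invented for illustration:

```python
from openai import OpenAI

client = OpenAI()

# The system message sets the persona; the user message states context, task, and constraints explicitly.
messages = [
    {
        "role": "system",
        "content": "Act as a senior financial analyst reviewing an investment proposal.",
    },
    {
        "role": "user",
        "content": (
            "Context: Series B SaaS company, $12M ARR, 40% YoY growth, 18-month runway.\n"
            "Task: Identify the three biggest risks in the proposal summary below.\n"
            "Constraints: Base every risk on the figures given; answer as a numbered list, one sentence per risk.\n\n"
            "Proposal summary: <paste summary here>"
        ),
    },
]

response = client.chat.completions.create(model="o4-mini", messages=messages)
print(response.choices[0].message.content)
```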

Zero-Shot, Few-Shot, and Many-Shot Prompting

Choosing the right prompting strategy is essential to maximizing model performance for different tasks:

  • Zero-Shot Prompting
    The model is asked to perform a task without any examples, relying solely on its pretraining knowledge. It works well for straightforward, commonly encountered tasks where instructions are enough. Example: “Summarize this article in one paragraph.”

  • Few-Shot Prompting
    The model is provided with one to five examples within the prompt to demonstrate the desired input-output relationship. Few-shot is effective when tasks require specific formatting, style, or handling of subtle nuances that pure instructions may not capture. Example: Showing two formatted summaries before asking for a third (a sketch follows after this list).

  • Many-Shot Prompting
    This method uses more than five examples, sometimes dozens or hundreds, supplied directly in the prompt as large in-context demonstrations. Many-shot approaches are increasingly practical with long-context models (e.g., o4-mini-high, Gemini 2.5) that can process tens or hundreds of thousands of tokens. The approach excels at complex pattern replication and advanced reasoning but increases cost and latency. Example: Feeding an entire dataset of customer complaints with corresponding responses before asking for a new one.

Each method trades off between token efficiency, model generalization, and the need for precision. Selecting the right strategy depends on task complexity, model capacity, and operational constraints (speed, cost, context window limits).
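As an illustration of the few-shot pattern referenced above, here is a minimal sketch that embeds two worked examples before the real request, assuming an OpenAI-style chat API; the tickets, format, and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Two in-context examples demonstrate the exact output format we want for the third summary.
few_shot_prompt = """Summarize each support ticket as: [Product] - [Issue] - [Severity: low/medium/high].

Ticket: "The mobile app crashes every time I open the camera feature."
Summary: [Mobile app] - [Crash when opening camera] - [Severity: high]

Ticket: "The invoice PDF uses last year's logo."
Summary: [Billing] - [Outdated logo on invoice PDF] - [Severity: low]

Ticket: "Exported CSV files drop the last column when opened in Excel."
Summary:"""

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": few_shot_prompt}],
)
print(response.choices[0].message.content)
```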

A Note of Caution: Interestingly, some research involving predecessors to the current o-series models indicated that few-shot prompting could sometimes hinder performance on complex reasoning tasks compared to zero-shot prompts. This observation suggests that for models with highly developed internal reasoning capabilities, overly prescriptive examples might conflict with or constrain their own, potentially more effective, learned strategies. The implication is that for the most advanced reasoning models, prompt engineering might focus more on clearly defining the what (goal, context, constraints, output format) rather than prescribing the how (the exact steps of reasoning) unless deliberately employing techniques like Chain-of-Thought. However, this requires empirical testing for specific models and tasks, as few-shot remains valuable for format control.  

Specifying Output Structure (Basic): Clearly defining the desired output format from the outset is crucial for usability, especially when the AI's response needs to be integrated into other processes.  

  • "Use direct instructions, like: 'Provide the answer as a numbered list.' or 'Generate the summary in a single paragraph.'”

  • Provide examples (few-shot): Show the model exactly how the output should look.  

  • Reference common formats: LLMs understand structures like JSON, CSV, XML, or Markdown. Asking for output "in JSON format" or "as a CSV with columns 'Company' and 'Address'" leverages this knowledge.  

  • Specify data types: Instruct the model on the type of data expected for specific fields, such as Text, Number, Date (e.g., "YYYY-MM-DD"), or Boolean.  

The consistent emphasis on specifying structured output formats like JSON, CSV, and tables, even in foundational prompting advice, signals a significant trend. Reasoning models are increasingly being utilized not just as conversational partners or text generators, but as functional components within larger data processing pipelines and automated workflows. This requires their outputs to be machine-readable, consistent, and easily parsable, making explicit format instructions a critical element of effective prompting for practical, enterprise applications.  
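A minimal sketch of format-constrained prompting, assuming an OpenAI-style chat API; the schema and field names are illustrative, and production pipelines should still validate the parsed result:

```python
import json
from openai import OpenAI

client = OpenAI()

prompt = (
    "Extract the companies mentioned in the text below.\n"
    "Return ONLY valid JSON: a list of objects with keys "
    '"company" (text), "address" (text), and "founded" (date, YYYY-MM-DD or null).\n\n'
    "Text: Acme Corp, founded on 1998-03-14, is headquartered at 12 Main St, Springfield."
)

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": prompt}],
)

# Parse and validate before passing downstream; the model can still deviate from the schema.
records = json.loads(response.choices[0].message.content)
print(records)
```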

Refining Your Interaction: Iteration and Control

Achieving optimal results from reasoning models often requires more than a single, perfectly crafted prompt.

Iterative refinement and precise control over the output are crucial aspects of effective interaction.

Iterative Prompting

The fundamental idea is simple: treat prompting as a cycle of trial, evaluation, and adjustment. Start with an initial prompt, carefully review the generated output for accuracy, relevance, completeness, and adherence to instructions, identify shortcomings, and then modify the prompt to address those issues in the next attempt.

It is uncommon to achieve the desired output on the first try, especially for complex tasks. Documenting changes between prompt versions and comparing the corresponding outputs is essential for systematic improvement.  

Self-Refine

This is a specific, structured technique for iterative refinement where the LLM itself plays a role in the critique process. The cycle involves:

  1. Generate: Obtain an initial output from the LLM based on the task prompt.

  2. Feedback: Feed the original prompt and the generated output back into the same LLM, asking it to provide constructive feedback on how the output could be improved (e.g., regarding accuracy, clarity, efficiency, or adherence to constraints). A stopping criterion is needed, often asking the model to state if no further improvements are possible.

  3. Refine: Provide the original output and the generated feedback to the LLM, instructing it to produce a revised output based on the critique.

  4. Repeat: Steps 2 and 3 can be repeated until the output meets the desired standard or the model indicates no further refinement is needed.  
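A minimal sketch of that generate-feedback-refine loop, assuming an OpenAI-style chat API; the stopping phrase, iteration cap, task, and model name are illustrative:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "o4-mini"  # illustrative model choice


def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


task = "Write a Python function that returns the n-th Fibonacci number efficiently."

# 1. Generate an initial answer.
output = ask(task)

for _ in range(3):  # cap iterations to bound cost and latency
    # 2. Ask the model to critique its own output, with an explicit stopping criterion.
    feedback = ask(
        f"Task: {task}\n\nCurrent answer:\n{output}\n\n"
        "Give concrete feedback on correctness, clarity, and efficiency. "
        "If no further improvement is possible, reply exactly: NO FURTHER IMPROVEMENTS."
    )
    if "NO FURTHER IMPROVEMENTS" in feedback:
        break
    # 3. Refine the answer using the critique.
    output = ask(
        f"Task: {task}\n\nPrevious answer:\n{output}\n\nFeedback:\n{feedback}\n\n"
        "Produce an improved answer that addresses the feedback."
    )

print(output)
```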

The effectiveness of techniques like Self-Refine suggests that advanced LLMs possess latent self-assessment capabilities. They appear capable, to some extent, of recognizing flaws or areas for improvement in their outputs when prompted correctly. This implies an internal capacity for critique that goes beyond simple generation, which structured prompting can activate and leverage for better results.  

Key Takeaways & Your Prompting Toolkit

Navigating advanced reasoning models requires a blend of technical understanding and strategic interaction. Mastering the art and science of prompting is key to unlocking their potential while mitigating their risks.

Ultimately, effectively leveraging these powerful reasoning models necessitates a shift in user mindset. It moves beyond simply "asking questions" towards actively "designing interactions." This involves understanding the nuances of different models and techniques, strategically structuring the problem for the AI, providing appropriate guidance and constraints, and rigorously evaluating the results.

I appreciate your support.

Your AI Sherpa,

Mark R. Hinkle
Publisher, The AIE Network
Connect with me on LinkedIn
Follow Me on Twitter
