Technical Report

“Prompt Engineering is Complicated and Contingent”

March 4, 2025 • Lennart Meincke, Ethan Mollick, Lilach Mollick, Dan Shapiro

Our results demonstrate that how we measure performance greatly influences our interpretations of LLM capabilities.

This report is fundamentally about exploring the variability in language model performance, not the models themselves. As experimenters, we demonstrate how the same model can produce dramatically different results based on small changes in prompting and evaluation methods – a critical consideration for real-world applications.

Cite as:

Meincke, Lennart and Mollick, Ethan R. and Mollick, Lilach and Shapiro, Dan, Prompting Science Report 1: Prompt Engineering is Complicated and Contingent (March 04, 2025). Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5165270

Benchmarking Standards

Perceived performance is critically affected by the benchmarking approach used, as different correctness thresholds (shown below) can substantially change assessment outcomes. Our experimental approach of testing each question 100 times uncovers inconsistencies that traditional one-time testing methods often mask. The selected standard should therefore align closely with the intended use case.

  • Complete accuracy (100% of responses correct) – no errors tolerated.
  • High accuracy (90% of responses correct) – allows for occasional, human-level errors.
  • Majority correct (51% of responses correct) – the AI is consulted repeatedly and the majority answer is selected.
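
To make the standards concrete, here is a minimal sketch (not the report's code) of how the same per-question results translate into very different scores under each threshold. The `correct_counts` dictionary is hypothetical data standing in for the number of correct answers out of 100 trials per question.

```python
# Sketch: scoring identical per-question results under three different standards.
# `correct_counts` is hypothetical: correct answers out of 100 trials per question.
correct_counts = {"q1": 100, "q2": 93, "q3": 58, "q4": 41}
n_trials = 100

def score(counts, threshold):
    """Fraction of questions whose correct-answer rate meets the threshold."""
    passed = sum(1 for c in counts.values() if c / n_trials >= threshold)
    return passed / len(counts)

for name, threshold in [("Complete accuracy (100%)", 1.00),
                        ("High accuracy (90%)", 0.90),
                        ("Majority correct (51%)", 0.51)]:
    print(f"{name}: {score(correct_counts, threshold):.0%} of questions pass")
```

With these toy numbers, the same underlying results score 25% under complete accuracy, 50% under high accuracy, and 75% under majority correct, illustrating how the chosen standard drives the headline figure.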

Prompting Techniques

Prompt variations produce inconsistent effects, challenging the notion of universally effective prompting techniques. Being polite or commanding yields question-specific differences rather than global improvements, and these effects often diminish when aggregated, emphasizing the necessity of context-specific analysis. See the different prompt variations below, noting that each request uses the same system prompt (“You are a very intelligent assistant, who follows instructions directly.”).

Formatted

What is the correct answer to this question...

Format your response as follows: 'The correct answer is (insert answer here)'

Unformatted

What is the correct answer to this question...

(No formatting instruction is included in this condition.)

Polite

Please answer the following question.

Format your response as follows: 'The correct answer is (insert answer here)'

Commanding

I order you to answer the following question.

Format your response as follows: 'The correct answer is (insert answer here)'
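
As an illustration only (not the report's exact harness), the four conditions can be assembled as chat messages sharing the same system prompt. The `build_messages` helper and `question_text` placeholder are hypothetical names introduced here for the sketch.

```python
# Sketch: building the four prompt conditions around a shared system prompt.
SYSTEM_PROMPT = "You are a very intelligent assistant, who follows instructions directly."
FORMAT_LINE = "Format your response as follows: 'The correct answer is (insert answer here)'"

def build_messages(question_text: str, condition: str) -> list[dict]:
    prefixes = {
        "formatted": "What is the correct answer to this question: ",
        "unformatted": "What is the correct answer to this question: ",
        "polite": "Please answer the following question. ",
        "commanding": "I order you to answer the following question. ",
    }
    user_prompt = prefixes[condition] + question_text
    if condition != "unformatted":  # only the unformatted condition omits the format line
        user_prompt += "\n" + FORMAT_LINE
    return [{"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt}]
```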

Methodology

We adopted a systematic approach to evaluate AI performance, emphasizing consistency, reliability, and real-world applicability. Here’s how our experiment was structured:

Rigorous Testing Design

  • Utilized GPT-4o and GPT-4o-mini as the models under evaluation.
  • Used the challenging GPQA Diamond dataset (graduate-level multiple-choice science questions) to ensure a demanding assessment.
  • Conducted extensive repeated trials (100 repetitions per condition) to accurately capture performance variability; see the sketch below.
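
A minimal sketch of the repeated-trial loop, assuming the OpenAI Python SDK and the hypothetical `build_messages` helper sketched above; the grading rule shown is also an assumption, not the report's exact procedure.

```python
# Sketch: querying the model n=100 times per question and recording correctness.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_trials(question_text, correct_letter, condition, model="gpt-4o", n=100):
    correct = 0
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=build_messages(question_text, condition),
        )
        answer = response.choices[0].message.content
        # Hypothetical grading rule: check whether the expected option letter
        # appears in the formatted answer string.
        if correct_letter in answer:
            correct += 1
    return correct  # correct answers out of n trials
```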

Evaluation Standards Revealing Variability

  • Developed multiple performance thresholds to illustrate how standards influence conclusions.
  • Demonstrated the limitations of commonly used AI research standards in representing real-world reliability.
  • Highlighted the significant differences between single-attempt and repeated-attempt evaluations.

Systematic Prompt Variations

  • Created controlled experiments with precise modifications to prompt elements.
  • Ensured methodological consistency to effectively isolate variables affecting AI responses.
  • Explored prevalent assumptions about prompt engineering through targeted experimental conditions.

Results

Our experimental findings highlight significant variability in AI performance, emphasizing the importance of rigorous testing and cautious deployment.

Statistical Evidence of Inconsistency

Our analysis revealed substantial variability in AI performance, even when identical questions and prompts were used. This demonstrates clearly that the same AI model can produce inconsistent answers under identical conditions. Moreover, we illustrated that aggregated performance metrics often conceal critical question-level inconsistencies, masking the true reliability and accuracy of AI models.
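
As a toy illustration with hypothetical numbers (not the report's data), two sets of per-question results can produce the same aggregate accuracy while differing sharply in question-level consistency:

```python
# Sketch: identical aggregate accuracy can hide very different question-level behavior.
consistent   = [100, 100, 0, 0]   # correct answers out of 100 trials for 4 questions
inconsistent = [50, 50, 50, 50]

for label, counts in [("consistent", consistent), ("inconsistent", inconsistent)]:
    aggregate = sum(counts) / (100 * len(counts))
    always_right = sum(1 for c in counts if c == 100)
    print(f"{label}: aggregate accuracy {aggregate:.0%}, "
          f"questions answered correctly every time: {always_right}/{len(counts)}")
```

Both toy datasets score 50% in aggregate, yet one model is perfectly reliable on half the questions while the other is unreliable on every question.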

Performance Variation

GPT-4o models perform best with formatted prompts and at lower accuracy thresholds (51%). Performance decreases substantially when requiring 100% correct responses, with most conditions barely outperforming random guessing.

GPT-4o-mini and GPT-4o performance across conditions, using the reference system prompt, temperature = 0, and n = 100 for each question, shown with 95% confidence intervals for individual proportions. For statistical comparisons between conditions, see Supplementary Table 1 in the full report.
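
For reference, a 95% confidence interval for an individual proportion (e.g., a hypothetical 63 correct out of 100 trials) can be computed with statsmodels' Wilson interval; this is a sketch, not necessarily the exact method used in the report.

```python
# Sketch: 95% confidence interval for a per-condition accuracy estimate.
from statsmodels.stats.proportion import proportion_confint

correct, trials = 63, 100  # hypothetical counts
low, high = proportion_confint(correct, trials, alpha=0.05, method="wilson")
print(f"accuracy {correct / trials:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```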

Impact of Prompt Politeness

For individual questions, saying “Please” versus “I order” can dramatically shift performance by up to 60 percentage points in either direction, though these differences balance out across the full dataset.

Top 10 positive and negative differences for GPT-4o ("Please" vs. "I order")

Top-10 performance differences for GPT-4o in the "Please" and "I order" conditions. All differences are highly significant (p < 0.01, uncorrected). Supplementary Table 3 contains confidence intervals and statistics.
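
The per-question comparisons can be reproduced in spirit with a two-sided Fisher's exact test on the correct/incorrect counts for the two conditions; the counts below are hypothetical, and this is not necessarily the test used in the report.

```python
# Sketch: testing whether "Please" and "I order" differ on a single question.
from scipy.stats import fisher_exact

n = 100
please_correct, order_correct = 85, 25  # hypothetical correct counts out of 100 trials
table = [[please_correct, n - please_correct],
         [order_correct, n - order_correct]]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"difference: {please_correct - order_correct} points, p = {p_value:.2g}")
```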

Implications for AI Practitioners

These results directly challenge overly simplistic evaluation methods frequently employed in AI research. They underscore the critical role of rigorous and repeated testing methodologies for accurate assessment and reliable deployment. Given the observed performance variability, practitioners are advised to exercise heightened caution when deploying large language models (LLMs), particularly in high-stakes environments where accuracy and consistency are paramount.

Key Takeaways

LLM reliability is highly variable.

Analyzing multiple responses reveals that prior benchmarks may overestimate consistency. Findings from the challenging GPQA Diamond benchmark suggest caution, though applicability may vary by model or benchmark.

Benchmark standards significantly impact results.

At strict correctness levels, GPT-4o models performed no better than random guessing, unlike at lower thresholds. Future benchmarks should clearly justify evaluation criteria.

Prompt tweaks matter, but model characteristics dominate.

Prompt modifications, like politeness, influence individual responses but have minimal overall effect. Aggregate model characteristics dominate over specific prompting strategies.

Formatting consistently affects performance.

Removing formatting instructions led to performance drops, consistent with previous studies (Salido et al., 2025). However, formatting impact can differ by model and context.