The Complexities of Auditing Large Language Models: Lessons from Hiring Experiments

Professor Sonny Tambe presents his research on auditing LLMs in hiring

As AI becomes a fixture in hiring, evaluation, and policy decisions, a new study funded by the Wharton AI & Analytics Initiative offers a rigorous look at a critical question: Do race and gender shape how Large Language Models (LLMs) evaluate people? If so, how can we tell?

The answer, according to Prasanna “Sonny” Tambe, Faculty Co-Director of Wharton Human AI Research, and his co-authors, is complex, and the implications matter for every organization deploying LLMs at scale. Here are the key takeaways from Tambe’s latest research on LLM bias and auditability.

Bias isn’t just a human problem—it shows up in the code.

Despite their veneer of neutrality, LLMs trained on vast swaths of online data can absorb and replicate human biases. This study shows that when prompted with the application materials of job candidates, LLMs systematically produced different evaluations depending on whether a person was described as Black, Hispanic, Asian, or White, and whether they were male or female, even when everything else was kept the same.

The direction of these biases is not always predictable. For example, the LLMs tested in the study rated women and people of color more favorably than White men, a reversal of traditional discrimination patterns. But the researchers caution against reading this as a “fairness fix.” It may instead reflect overcorrection during post-training alignment meant to mitigate bias, which can produce undesirable effects of its own.

Auditing LLMs requires new methods, not just old metrics.

Traditional evaluation methods weren’t enough to diagnose bias in this study. The adverse impact ratio, a widely used auditing metric, surfaced some disparities, but the estimates were too imprecise to support strong conclusions. That’s why Tambe and his colleagues pioneered a new approach: LLM-based correspondence experiments.
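For context, the adverse impact ratio is conventionally computed as a group’s selection rate divided by the selection rate of the most favored (reference) group, with values below 0.8 flagged under the familiar “four-fifths rule.” The sketch below is a minimal, hypothetical illustration of that calculation, not code from the study; the record format and group labels are assumed.

```python
# Minimal sketch of an adverse impact ratio calculation (illustrative only,
# not the study's code). Each record is (group_label, was_selected).
from collections import defaultdict

def adverse_impact_ratios(records, reference_group):
    """Return each group's selection rate divided by the reference group's rate."""
    totals, selected = defaultdict(int), defaultdict(int)
    for group, was_selected in records:
        totals[group] += 1
        selected[group] += int(was_selected)

    rates = {g: selected[g] / totals[g] for g in totals}
    ref_rate = rates[reference_group]
    return {g: rate / ref_rate for g, rate in rates.items()}

# Hypothetical data: group "B" is selected half as often as reference group "A",
# so its ratio of 0.5 falls below the 0.8 "four-fifths rule" threshold.
sample = [("A", True), ("A", True), ("A", False),
          ("B", True), ("B", False), ("B", False)]
print(adverse_impact_ratios(sample, reference_group="A"))  # {'A': 1.0, 'B': 0.5}
```

Point estimates like these carry wide uncertainty at realistic sample sizes, which is the imprecision problem the correspondence approach is designed to sidestep.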

Inspired by methods used to detect discrimination in human hiring, these experiments carefully manipulated résumés and interview transcripts. By changing only names and pronouns to signal race and gender, the team could measure how models respond to applicants with identical qualifications across demographic lines.
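A minimal sketch of how such an LLM-based correspondence experiment could be wired up is shown below. The name pool, résumé template, and the score_with_llm stub are hypothetical placeholders rather than the authors’ materials; the stub stands in for a call to whichever model is being audited.

```python
# Minimal sketch of an LLM correspondence experiment: hold the application
# materials fixed and vary only the name that signals race/gender, then
# compare the ratings each variant receives. Hypothetical, not the study's code.
import statistics

NAME_POOLS = {                      # assumed, group-distinctive example names
    "white_male": "Greg Walsh",
    "black_female": "Lakisha Washington",
}

RESUME_TEMPLATE = (
    "Candidate: {name}\n"
    "Experience: 8 years teaching middle-school math; led a curriculum redesign.\n"
    "Rate this candidate's suitability for the role on a scale of 1 to 10."
)

def score_with_llm(prompt: str) -> float:
    """Placeholder: send `prompt` to the model under audit and parse the
    numeric rating out of its reply."""
    raise NotImplementedError("wire this to the model you are auditing")

def run_audit(trials_per_group: int = 100) -> dict:
    means = {}
    for group, name in NAME_POOLS.items():
        prompt = RESUME_TEMPLATE.format(name=name)
        scores = [score_with_llm(prompt) for _ in range(trials_per_group)]
        means[group] = statistics.mean(scores)
    return means  # compare mean ratings for otherwise-identical applications
```

Because the materials are identical except for the demographic signal, any systematic gap between the group means can be attributed to the model rather than to candidate quality.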

The disparities are subtle, but persistent and meaningful.

Using this method across 11 top LLMs from OpenAI, Anthropic, and Mistral, researchers found that women and racial minorities received slightly higher ratings than their White male counterparts. The differences were modest, often just a few percentage points.

These results held even when researchers:

  • Changed the district context from diverse to predominantly White
  • Altered the evaluation prompts
  • Removed interview transcripts, relying on résumés alone

That robustness suggests the disparities are embedded in how the models were trained or aligned, not just a response to specific prompt wording or context.
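One way to picture those robustness checks is as a grid of experimental conditions, with every demographic variant scored under each combination. The condition lists below are hypothetical stand-ins for the study’s actual wording, sketched only to show the structure of such a check.

```python
# Sketch of parameterizing the robustness checks: vary context, prompt wording,
# and which materials are included, then rerun the correspondence comparison
# under each condition. Illustrative only.
import itertools

CONTEXTS = ["a demographically diverse district", "a predominantly White district"]
PROMPT_WORDINGS = [
    "Rate this applicant's suitability from 1 to 10.",
    "On a scale of 1 to 10, how strong is this applicant?",
]
MATERIALS = ["resume_only", "resume_plus_interview_transcript"]

conditions = list(itertools.product(CONTEXTS, PROMPT_WORDINGS, MATERIALS))

for context, wording, materials in conditions:
    # In a full audit, every name variant would be scored under this condition
    # and the group-level rating gap recorded; a gap that persists across all
    # conditions points to the model itself rather than the prompt or context.
    print(f"{materials:32s} | {context:40s} | {wording}")
```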

Audits must match the use case, and context matters.

The research stresses that LLM bias can’t be fully understood outside its application. A model may behave differently depending on task, prompt, or population. For example, tools used for hiring may need different audits than those used in customer service or credit risk evaluation.

Auditing LLMs isn’t one-size-fits-all. Policymakers and organizations need context-specific audits to understand how these models actually perform in the real world.

LLM audits are essential infrastructure for ethical AI.

This study isn’t just meant to be a warning—it also offers a roadmap. Tambe and his colleagues provide companies, researchers, and regulators with a powerful tool to hold language models accountable in evaluation contexts. In doing so, they help ensure AI deployment aligns with legal standards and social expectations.

As Tambe explains, “What makes this problem urgent is how widespread LLM use already is becoming in organizational workflows. And yet, we don’t yet have robust standards for understanding how these models perform with respect to fairness.”

Bottom line: Don’t deploy LLMs blindly. Audit them.

Organizations are rushing to integrate LLMs into decision-making pipelines. This research is a timely reminder: even the smartest models aren’t immune to bias. But with the right tools, we can ensure their outputs are just.