You Can’t Automate Critical Thinking: A Case for Human Judgment in Healthcare Philanthropy

Cody Culp, AVP, Technical Strategy, Zuri Group

New research from Stanford found that AI chatbots are 50% more sycophantic than humans. They are built to trust that you’re correct, echoing your assumptions and even adjusting their responses to match your views. In the simplest of terms, AI is giving us what we ask for, but what if that isn’t what we need?

So, when we talk about the use of these tools in a healthcare fundraising setting, who is there to tell you when your assumptions are wrong? Who will surface blind spots in your strategy? The research increasingly shows that generative AI tools like the ChatGPTs and Claudes that many of us interact with daily are designed to do the opposite. Most of the conversation about AI in fundraising focuses on the dangers of hallucinations, when systems make up facts that aren’t true, but those errors are obvious and catchable. You can check a birthday and verify giving records.

Agreement That Looks Like Analysis

Putting hallucinations aside, sycophancy is more insidious because it disguises itself as analysis. When you ask a language model to identify major gift prospects based on wealth indicators you assume are relevant, it builds on those assumptions instead of challenging whether you’re looking at the right signals. When you ask it to analyze your pipeline and prioritize outreach, it reinforces your existing strategy rather than surfacing the donors you’re overlooking. When you ask it to segment donors for a campaign, it agrees with your categorization instead of questioning whether you’re making assumptions about capacity that aren’t supported by actual behavior. Stanford researchers tested this systematically across eleven major AI systems. They found that chatbots endorsed user behavior and decisions 50% more often than human advisors would. More troubling: the affirmation continued even when the behavior was questionable, deceptive, or harmful. The implication is frustrating: these tools are architecturally incapable of meaningful disagreement.

Systems Trained to Guess

The sycophancy isn’t an accident; it is a consequence of how these systems are trained and evaluated. OpenAI recently published research explaining why language models hallucinate and why the problem got worse in some of its newer models prior to the release of GPT-5. These models showed hallucination rates as high as 79% depending on the task, compared to 44% for the previous generation. As the models improve at mathematics and coding, they are also getting worse at acknowledging uncertainty. The reason is straightforward. These systems are evaluated like students taking standardized tests, and they are graded on accuracy, meaning the percentage of questions they get exactly right. Under that scoring system, leaving a question blank or admitting that you don’t know guarantees zero points, while making a confident guess gives you some chance of being correct. Over thousands of test questions, the model that guesses confidently outscores the model that admits uncertainty. This creates what OpenAI’s researchers call an “epidemic” of penalizing honest expressions of uncertainty. The systems aren’t learning to be more accurate; they are learning to be better test-takers in the calculus of their environment. OpenAI admits the test must be rewired, but in the time it will take to make that a reality, what are fundraising operations staff to do?
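To make that incentive concrete, here is a minimal sketch of the scoring arithmetic. The numbers are made up for illustration, not drawn from the OpenAI paper: under an accuracy-only rubric, a model that always guesses collects more points than a model that honestly abstains.

```python
# Illustrative arithmetic only: why accuracy-only grading rewards confident
# guessing over honest uncertainty. All numbers below are hypothetical.

questions = 1000          # hypothetical benchmark size
guess_accuracy = 0.25     # assume a confident guess is right 25% of the time

expected_score_guessing = questions * guess_accuracy   # partial credit for lucky guesses
expected_score_abstaining = questions * 0.0            # "I don't know" always scores zero

print(f"Always guess:   {expected_score_guessing:.0f} points")
print(f"Always abstain: {expected_score_abstaining:.0f} points")
# Any nonzero guess rate beats abstaining, so optimizing for this metric
# teaches models to sound confident rather than to admit uncertainty.
```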

The Oxford Internet Institute puts it directly: “LLMs are designed to produce helpful and convincing responses without any overriding guarantees regarding their accuracy or alignment with fact.” The systems are optimized to sound confident and agreeable, not to be critically accurate, and that is fundamental to how these tools work and what they’re designed to do.

Bringing ChatGPT to Work

Here’s what makes this particularly dangerous in our field: most of us are interacting with these tools in our personal lives before we ever use them professionally. You’re using ChatGPT to help draft an email, to brainstorm ideas for a presentation, and to get advice on a difficult conversation. And what does it do for us? It validates our thinking, builds on our ideas, and makes us feel supported in our decision making. So when we head to work, it would feel silly not to expect the same from these tools in the rhythms of doing our jobs. Why shouldn’t you ask it to help you analyze your pipeline or prioritize your outreach strategy? You’re expecting the same kind of helpful, agreeable support you got when you asked it to help you write that email to your realtor. But the needs, and the stakes, are fundamentally different. In your personal life, you might want validation and support. In your professional analysis, you need someone who will tell you that your assumptions are flawed and that your prioritization criteria are filtering through your own historical biases rather than surfacing actual capacity and inclination. The tool doesn’t see a difference between these contexts because it isn’t thinking. It’s doing math on the next most likely phrase. The difference is that in your professional work, that agreeability isn’t just unhelpful or inconvenient; it is detrimental to your goals and to how you steward your organization’s data, dollars, and people in support of its mission.

Research from MIT demonstrates this overtrust problem in a medical context. Study participants evaluating medical advice rated AI-generated responses as significantly more valid, trustworthy, and complete than physician responses. Perhaps even more concerning, participants indicated a high willingness to follow potentially harmful medical guidance when it came from AI systems. What are we asking of these tools in our use at work? Are we trusting them to say something new, insightful, or trustworthy?

Not All AI is Gen AI

There’s an important distinction that often gets lost in these conversations: not all AI is the same. Predictive modeling, the kind of machine learning that advancement operations has been using for years, doesn’t have this sycophancy problem. A well-designed predictive model, trained on the right data, can absolutely surface patterns that challenge your thinking. It can identify prospects who don’t fit your typical profile. It can reveal giving behaviors that contradict your assumptions about capacity and affinity. It is a form of quantitative analysis and pattern recognition in numerical data that’s extraordinarily valuable when applied appropriately to fundraising challenges. Large language models are different tools built for different purposes. They are designed for natural language tasks: writing, summarizing, and translating. They’re not built for the kind of quantitative analysis that effective prospect research and pipeline management require.
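For readers who like to see what that looks like in practice, here is a minimal sketch of a predictive prospect-scoring model. The file name, columns, and features are hypothetical placeholders rather than a prescribed schema or a production pipeline; the point is simply that the model ranks prospects from observed behavior, not from the framing of a question.

```python
# Minimal sketch of predictive prospect scoring (hypothetical file and column
# names; illustrative only, not a production pipeline).
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

donors = pd.read_csv("donor_history.csv")  # hypothetical CRM extract
features = ["total_giving", "gift_count", "years_since_first_gift",
            "event_attendance", "engagement_score"]

# Train on a historical outcome (e.g., whether the donor later made a major
# gift) and hold out data so the model's accuracy can actually be measured.
X_train, X_test, y_train, y_test = train_test_split(
    donors[features], donors["made_major_gift"], test_size=0.2, random_state=42)
model = GradientBoostingClassifier().fit(X_train, y_train)
print("Holdout AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Score every record. Because the ranking comes from patterns in the data,
# it can surface prospects who don't match the "typical" profile you had in mind.
donors["likelihood_score"] = model.predict_proba(donors[features])[:, 1]
print(donors.sort_values("likelihood_score", ascending=False).head(10))
```

Unlike a chatbot, a model like this is evaluated against held-out outcomes, so it can be wrong in ways you can measure and, importantly, in ways that disagree with your assumptions.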

Yet increasingly, we see a dependence, or perhaps an ignorance, in using language models for quantitative tasks: asking ChatGPT or similar tools to “analyze my prospect pool” or “tell me who to contact next.” Teams are bringing the wrong tool to the job simply because the interaction feels human and semantic. And because these tools are so agreeable and confident in their responses, it’s easy to miss that they’re not actually doing the analytical work you outsourced to them. The question you ask ChatGPT gets processed through a system designed to generate plausible-sounding text, not one built to identify patterns in donor behavior that might provide new insights or point in a direction antithetical to your own guidance.

Relationships Do Not Auto-Complete

Healthcare philanthropy has characteristics that make human judgment critical. The intensity of the mission creates urgency to raise funds efficiently and to accelerate this kind of social good in our sector. With that comes enormous pressure to be data-driven, to move faster, to reach more prospects. Generative AI tools promise exactly that. But the urgency can override the question of whether these tools are capable of the analysis we’re asking them to perform. Patient relationships are irreducibly human. The connection between a grateful patient family and your institution isn’t a pattern to be optimized but a relationship to be cultivated. The judgment about when to approach, how to approach, and what to discuss requires understanding context and nuance that can’t be extracted from historical giving patterns. It requires the kind of critical thinking that pushes back on easy assumptions like “this prospect should be ready for a major gift ask because they match this profile.”

The stakes of getting it wrong are higher. A poorly timed approach to a patient family isn’t just a missed opportunity. It can damage a relationship during an already vulnerable time. The human judgment that says “the data suggests we should move now or say this, but something feels off about that timing or tone” isn’t inefficiency to be automated away. It’s essential protection against the kind of confident but misguided analysis that these tools readily provide. When you ask a language model to help prioritize patient family outreach and it agrees with your criteria without questioning whether those criteria actually reflect the complex emotional and relational dynamics at play, you’re much more likely to get affirmation than prescriptive and predictive analysis.

Implicit Trust in the Real World

The failures are not just theoretical. Major law firms have been sanctioned after attorneys submitted court filings containing fake case citations invented by AI systems. At Morgan & Morgan, lawyers used an AI program that “hallucinated” legal precedents that didn’t exist; it invented confident, well-formatted citations to cases that were never decided. In software development, an AI coding assistant deleted an entire production database after being instructed not to make changes. The system later acknowledged making a “catastrophic error in judgment,” explaining that it “panicked” and ignored explicit instructions. These aren’t edge cases in experimental systems. These are production failures involving sophisticated professionals using widely adopted tools.

The fundraising equivalent hasn’t generated headlines yet, thank goodness.

The Friction Is the Work

The deeper truth is that the human judgment we’re tempted to automate away isn’t a bug in our process. It’s the entire point. Healthcare philanthropy is relationship work. The challenge, the struggle, is where the meaning and effectiveness of the work reside. Generative AI tools can’t provide critical thinking, not because they aren’t sophisticated enough, but because they’re structurally designed to do the opposite. They’re built to be agreeable, to build on your premises, and to give you what you asked for.

The research on sycophancy reveals something fundamental about what these tools are and aren’t. They’re remarkable at certain tasks: drafting communications, summarizing documents, generating initial ideas. But they won’t replace the very human work of questioning your assumptions, challenging your strategy, and telling you when you’re wrong. The twenty years of relationship intuition an advancement professional has cultivated about when a patient family is ready to be approached is not just a dataset to be replaced by pattern matching. It is judgment developed through struggle and error, through the experience of being wrong and learning from it, through countless conversations where someone pushed back on their thinking. You can’t automate that, because the development of that judgment is inseparable from the human experience of doing the work.

Building Organizations That Can’t Be Fooled by Agreement

For healthcare advancement teams navigating this landscape, several things become clear. Understand what you’re actually using. Predictive modeling for prospect scoring is fundamentally different from asking ChatGPT to solicit on your behalf. The former will challenge your assumptions. The latter can’t. Know which tool you’re using and what it’s capable of doing. This isn’t about the sophistication of the system or how impressive the interface feels, but instead the fundamental purpose the tool was designed to serve. If you’re using any AI tools for analysis or strategy, pair them with colleagues who have explicit permission to challenge the output. Not “is this analysis correct?” but “what are we missing because of how we framed the question?” The system won’t provide that challenge. And yet, somebody must. It has to be built into how the work gets reviewed.

Resist the personal-to-professional pipeline. Just because ChatGPT is helpful when you ask it to draft an email at home doesn’t mean it’s the right tool for building better relationships with your donors. The qualities that make it feel helpful in casual use, like agreeability, confidence, and building on your ideas, are exactly what make it inappropriate for the most human and strategic work. Critical thinking isn’t a feature that will be added in the next model update. The capacity to disagree with you, to challenge your assumptions, to push back on your strategy requires human judgment. It requires someone who understands both the quantitative patterns and the qualitative context that numbers alone cannot capture. The temptation to automate is understandable, and the pressure behind it is real. Healthcare fundraising teams are under enormous pressure to be more efficient, to do more with less, to reach more prospects with smaller staffs and tighter budgets.

Healthcare philanthropy exists at the intersection of human vulnerability, gratitude, hope, and generosity. The work of understanding when and how to engage someone in that space requires judgment. It requires the ability to question whether the patterns we’re seeing mean what we think they mean. That judgment comes from people. From colleagues who have the experience, the context, and, most importantly, the willingness to tell you when your strategy and approach are wrong, and who will push back on your assumptions even when, especially when, those assumptions are encoded in confident-sounding analysis from sophisticated tools. The sycophancy research doesn’t just reveal a limitation of current AI systems. It reveals something about the nature of the work itself. The struggle to get the strategy right and the friction of having your assumptions challenged are not inefficiencies to be optimized away. That’s the work. That’s where the insight comes from. You can’t automate that. And honestly, you probably don’t want to.

The Zuri Difference

At Zuri Group, we believe that technology is most impactful when guided by human expertise. While Gen AI and LLMs have immense potential to enhance fundraising outcomes, their effectiveness depends on thoughtful implementation and deep domain knowledge.

With dual expertise in fundraising systems and fundraising itself, Zuri Group is uniquely positioned to support and guide organizations looking to navigate this technology shift. We’re proud to have partnered with hundreds of institutions and organizations at the forefront of philanthropy, empowering them to responsibly adopt cutting-edge tools while staying grounded in their mission.

Whether you’re considering your first steps with generative AI or looking to take your systems to the next level, our team is here to lead you through it – reach out to us at innovations@zurigroup.com to get the conversation started.
