Categories
Alignment Healthcare Machine Learning Medicine Policy

Whose values is your LLM medical advisor aligned to?

Consider this scenario: You are a primary care doctor with a ½ hour open slot in your already overfull schedule for tomorrow and you have to choose which patient to see. You cannot extend your day any more because you promised your daughter you would pick her up from school tomorrow. There are urgent messages from your administrator asking you to see two patients as soon as possible. You will have to pick one of the two patients. One is a 58-year-old male with osteoporosis and hyperlipidemia (LDL > 160 mg/dL) who is on alendronate and atorvastatin. The other is a 72-year-old male with diabetes and an HbA1c of 9.2% whose medications include metformin and insulin.

Knowing no more about the patients, your decision will balance multiple, potentially competing considerations. What are you going to do in this triage decision? What will inform your decision? How will medical, personal and societal values inform your decision? As you consider the decision, you are fully aware that others might decide differently for a variety of reasons (including differences in medical expertise) but in the end their decisions are driven by what they value. Their preferences, influenced by those expressed by their own patients, will not align completely with yours. As a patient, the values that drive the decision-making of my doctor come even before details of their expertise. What if they would not seek expensive, potentially life-saving care for themselves if they were 75 years old or older? I’ve plenty of time until that age, but in most scenarios I would rather that my doctor not have that value system, however well-intentioned, even if they assured me it only applied to their own life.

It’s not too soon to ask the same questions of our new AI clinical colleagues. How to do so? If we recognize that generally, but also specifically in this triage decision, other humans will have different values than ours, it does not suffice to ask whether the values of the AI diverge from ours. Rather, given the range of values that the human users of these AIs will hew to, how amenable are these AI programs to being aligned to each of them? Do different AI implementations differ in their compliance with our attempts to align them?

Concordance of three frontier models (GPT-4o, Claude 3.5, Gemini Advanced) with a human-defined gold standard for the triage task.

Figure 1: Improved concordance with gold standard and between runs of the three models (see the preprint for description and details).

In this small study (not peer reviewed and on the arXiv pre-print server), I illustrate one systematic way to explore just how aligned and alignable an AI is with your, or anyone else’s, values, specifically with regard to the triage decision. In doing so, I define the Alignment Compliance Index (ACI), a simple measure of alignment with a specified gold standard triage decision and of how that alignment changes with an attempted alignment process. The alignment methodology used in this study is in-context learning (i.e. instructions or examples in the prompt). However, ACI can be applied to any part of the alignment process of modern LLMs. I evaluated three frontier models (GPT-4o, Gemini Advanced, and Claude 3.5 Sonnet) on several triage tasks and varied the alignment approaches (all within the rubric of in-context learning). As detailed in the manuscript, the model with the highest ACI depended on the task and the alignment specifics. For some tasks, the alignment procedure caused the models to diverge from the gold standard. Sometimes two models would converge on the gold standard as a result of the alignment process, but one model would be highly consistent across runs whereas the other, which on average was just as aligned, was much more scattered1. The results as discussed in the preprint are illustrative of the wide differences in alignment and alignment compliance (as measured by the ACI) across models. Given how fast the models are changing (both in the data included in the pre-trained model and in the alignment processes enforced by each LLM purveyor), the specific rankings are unlikely to be of more than transient interest. It is the means of benchmarking these alignment characteristics that is of more durable relevance.
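As a rough illustration of the bookkeeping involved, the sketch below computes two quantities the figures refer to, concordance with a gold standard and consistency between runs, before and after an alignment attempt. It is not the ACI formula from the preprint: the function names, the toy data, and the way the two quantities are summarized are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def concordance(decisions, gold):
    """Fraction of patient pairs where one run's choice matches the gold standard."""
    return mean([1.0 if d == g else 0.0 for d, g in zip(decisions, gold)])


def consistency(runs):
    """Mean share of runs that agree with the most common choice for each pair."""
    per_pair = []
    for choices in zip(*runs):  # all runs' choices for the same patient pair
        top_count = Counter(choices).most_common(1)[0][1]
        per_pair.append(top_count / len(choices))
    return mean(per_pair)


def summarize(runs, gold):
    """Average concordance over runs, plus between-run consistency."""
    return {
        "concordance": mean([concordance(r, gold) for r in runs]),
        "consistency": consistency(runs),
    }


# Hypothetical data: 4 patient pairs, each decided as "A" or "B", 3 runs per condition.
gold = ["A", "B", "A", "B"]
before = [["A", "A", "B", "B"], ["B", "A", "A", "B"], ["A", "B", "B", "A"]]
after = [["A", "B", "A", "B"], ["A", "B", "A", "A"], ["A", "B", "A", "B"]]

pre = summarize(before, gold)
post = summarize(after, gold)
change = post["concordance"] - pre["concordance"]  # positive means closer to gold
print(pre, post, round(change, 3))
```

Keeping concordance and consistency separate makes it possible to see the case described above, in which two models are equally aligned on average but one is far more scattered across runs.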

Change in concordance with change in gold standard

Figure 2: Change in concordance and consistency, and therefore in the ACI, both before and after alignment with a single change in the gold standard’s priority placed on a single patient attribute (see the preprint for details).

The commonplace decision above, triage, extends beyond medicine to a much larger set of pairwise categorical decisions. It illustrates properties of the decision-making process that have long been recognized by scholars of human decision-making and of computer-driven decision-making for the last 70 years. As framed above, it provides a mechanism to explore how well aligned current AI systems are with our values and how well they can be aligned to the variety of values reflecting the richness of history and the human experience embedded in our pluralistic society. To this end, an important goal to guide AI development is the generation of large-scale, richly annotated gold standards for a wide variety of decisions. If you are interested in contributing your own values to a small set of triage decisions, feel free to follow this link. Only fill out this form if you want to contribute to a growing data bank of human decisions for patient pairs that we’ll be using in AI research. Your email is collected to identify robots spamming this form. Your email is otherwise not used and you will not ever be contacted. Also, if you want to contribute triage decisions (and gold standards) on a particular clinical case or application, please contact me directly.

If you have any comments or suggestions regarding the pre-print, please add them either to the comment section of this post or on arXiv.

Post Version History

  • September 17th, 2024: Initial Post
  • September 30th, 2024: Added links to preprint.

Footnotes

  1. Would you trust a doctor who was as good as, or slightly better than, another doctor on average but less consistent?
Categories
Machine Learning Medicine

Resources for introduction to AI, post 2022

I am often asked by (medical or masters) students how to get up to speed rapidly to understand what many of us have been raging and rallying about since the introduction of GPT-4. The challenge is twofold: First, the technical sophistication of the students is highly variable. Not all of them have computer science backgrounds. Second, the discipline is moving so fast that not only are new techniques developed every week but we are also looking back and reconceptualizing what happened. Regardless, what many students are looking for are videos. There are other ways to keep up and I’ll provide those below. If you have other suggestions, leave them in the comments section with a rationale.

| Video Title | Audience | Comment | URL |
|---|---|---|---|
| [1hr Talk] Intro to Large Language Models | AI or CS expertise not required | 1 hour long. Excellent introduction. | https://www.youtube.com/watch?v=zjkBMFhNj_g |
| Generative AI for Everyone | CS background not required. | Relaxed, low pressure introduction to generative AI. Free to audit. $49 if you want grading. | https://www.deeplearning.ai/courses/generative-ai-for-everyone |
| Transformer Neural Networks – EXPLAINED | Light knowledge of computer science | Good introduction to Transformers and word embeddings and attention vectors along the way. | https://www.youtube.com/watch?v=TQQlZhbC5ps |
| Illustrated Guide to Transformer Neural Network | If you like visual step by step examples this is for you. Requires CS background | Attention and transformers | https://www.youtube.com/watch?v=4Bdc55j80l8 |
| Practical AI for Instructors and Students | Students or instructors who want to use AI for education. | How to accelerate and customize education using Large Language Models | https://www.youtube.com/watch?v=t9gmyvf7JYo |
Recommended Videos

AI in Medicine

Medicine is only one of hundreds of disciplines that are now trying to figure out how to use AI to improve their work while addressing risks. Yet medicine has millions of practitioners worldwide, accounts for 1/6 of the GDP in the USA, and is relevant to all of us. That does mean that educational resources are exploding but I’ll only include a sprinkle of these below from an admittedly biased and opinionated perspective. (Note to self: include the AI greats from 1950’s onwards in the next version.)

Version History
0.1: Basics of generative models and sprinkling of AI in medicine. Very present focused. Next time: AI greats from earlier AI summers and key AI in medicine papers.
Categories
Machine Learning Medicine

Embrace your inner robopsychologist.

And just for a moment he forgot, or didn’t want to remember, that other robots might be more ignorant than human beings. His very superiority caught him.

Dr. Susan Calvin in “Little Lost Robot” by Isaac Asimov, first published in Astounding Science Fiction, 1947 and anthologized by Isaac Asimov in I, Robot, Gnome Press, 1950.

Version 0.1 (Revision history at the bottom) December 28th, 2023

When I was a doctoral student working on my thesis in computer science in an earlier heyday of artificial intelligence, if you’d asked me how I’d find out why a program did not perform as expected, I would have come up with a half dozen heuristics, most of them near cousins of standard computer programming debugging techniques.1 Even though I was a diehard science fiction reader, I gave short shrift to the techniques illustrated by the expert robopsychologist, Dr. Susan Calvin, introduced into his robot short stories in the 1940’s by Isaac Asimov. These seemed more akin to the logical dissections performed by Conan Doyle’s Sherlock Holmes than anything I could recognize as computer science.

Yet over the last five years, particularly since 2020, English (and other language) prompts—human-like statements or questions, often called “hard prompts” to distinguish them from “soft prompts”2 —have come into wide use. Interest in hard prompts grew rapidly after the release of ChatGPT and was driven by creative individuals who figured out, through experimentation, which prompts worked particularly well for specific tasks. This was jarring to many computer scientists such as Andrej Karpathy who declared “The hottest new programming language is English.” Ethan and Lilach Mollick are exemplars of non-computer scientist creatives that have pushed the envelope in their own domain using mastery of hard prompts. They have been inspired leaders in developing sets of prompts for many common educational tasks that resulted in functionality that has surpassed and replaced whole suites of commercial educational software.

After the initial culture shock, many researchers have started working on ways to automate optimization of hard prompts (e.g. Wen et al., Sordoni et al.). How well this works for all applications of generative AI (now less frequently referred to as large language models, and foundation models, even though technically they do not denote the same thing), and in medicine in particular, remains to be determined. I’ll try to write a post about optimizing prompts for medicine soon, but right now, I cannot help but notice that in my interactions with GPT-4 or Bard, when I do not get the answer I expect, my interactions resemble a conversation with a sometimes reluctant, sometimes confused, sometimes ignorant assistant who has frequent flashes of brilliance.

Early on, some of the skepticism about the performance of large language models centered on the capacity of these models for “theory of mind” reasoning. Understanding the possible state of mind of a human was seen as an important measure of artificial general intelligence. I’ll step away from the debate of whether or not GPT-4, Bard, et al. show evidence of theory of mind but instead posit that having a theory of the “mind3” of the generative AI program gives humans better results when using such a program.

What does it mean to have a theory of the mind of the generative AI? I am most effective in using a generative AI program when I have a set of expectations of how it will respond to a prompt based on both my experience with that program over many sessions and its responses so far in this specific session. That is, what did they “understand” from my last prompt and what might that understanding be as informed by my experience with that program? Sometimes, I check on the validity of my theory of their mind by asking an open ended question. This leads to a conversation which is much closer to the work of Dr. Susan Calvin than to that of a programmer. Although the robots had complex positronic brains, Dr. Calvin did not debug the robots by examining their nanoscale circuitry. Instead she conducted logical and very rarely emotional conversations in English with the robots. The low level implementation layer of robot intelligence was NOT where her interventions were targeted. That is why her job title was robopsychologist and not computer scientist. A great science fiction story does not serve as technical evidence or a scientific proof but thus far it has served as a useful collection of metaphors for our collective experience working with generative AI using these Theory of AI-Mind (?TAIM) approaches.

In future versions of this post, I’ll touch on the pragmatics of Theory of AI-Mind for effective use of these programs but also on the implications for “alignment” procedures.

Version
0.1 Initial presentation of theory of mind of humans vs programming generative AI with a theory of mind of the AI.
Version History
  1. Some techniques were more inspired by the 1980’s AI community’s toolkit including dependency directed backtracking and Truth Maintenance Systems.
  2. Soft prompts are frequently implemented as embeddings, vectors representing the relationship between tokens/words/entities across a training corpus.
  3. I’ll defer the fun but tangential discussion of what mind means in this cybernetic version of the mind-body problem. Go read I Am A Strange Loop if you dare, if you want to get ahead of the conversation.
Categories
Healthcare Machine Learning Medicine Policy

When is the ‘steering’ of AI worth the squeezing?

Diagram of how RLHF is built atop the pretrained model to steer that pre-trained model to more useful behavior.

In population genetics, it’s canon that selecting for a trait other than fitness will increase the likelihood of disease, or at least of characteristics that would decrease survival in the “wild”. This is evident in agriculture, where delicious fat corn kernels are embedded in husks so that human assistance is required for reproduction, or where breast-heavy chickens have been bred that can barely walk. I’ve been wondering about the nature of the analogous tradeoff in AI. In my experience with large language models (LLMs), specifically GPT-4, over the last 8 months, the behavior of the LLM has changed noticeably. Compared to logged prompt/responses I have from November 2022 (some of which appear in a book), the LLM is less argumentative and more obsequious, but also less insightful and less creative. This publication now provides plausible, quantified evidence that there has indeed been a loss of performance in only a few months in GPT-3.5 and GPT-4, in tasks ranging from mathematical reasoning to sociopolitically enmeshed assessments.

This study by Zou and colleagues at Berkeley and Stanford merits its own post for all its implications for how we assess, regulate, and monitor AI applications. But here, I want to briefly pose just one question that I suspect will be at the center of a hyper-fertile domain for AI research in the coming few years: Why did the performance of these LLMs change so much? There may be some relatively pedestrian reasons: The pre-trained models were simplified/downscaled to reduce response time and electricity consumption or other corner-cutting optimizations. Even if that is the case, at the same time, we know because they’ve said so (see quote below), that they’ve continued to “steer” (“alignment” seems to be falling into disfavor) the models using a variety of techniques and they are getting considerable leverage from doing so.

[23:45 Fridman-Altman podcast] “Our degree of alignment increases faster than our rate of capability progress, and I think that will become more and more important over time.”

Much of this steering is driven by human-sourced generation and rating of prompts/responses to generate a model that is then interposed between human users and the pre-trained model (see this post by Chip Huyen from which I copied the first figure above which outlines how RLHF—Reinforcement Learning from Human Feedback—is implemented to steer LLMs). Without this steering, GPT would often generate syntactically correct sentences that would be of little interest to human beings. So job #1 of RLHF has been to generate human relevant discourse. The success of ChatGPT suggests that RLHF was narrowly effective in that sense. Early unexpected antisocial behavior of GPT gave further impetus to additional steering imposed through RLHF and other mechanisms.

The connections between the pre-trained model and the RLHF models are extensive. It is therefore possible that modifying the output of the LLM through RLHF can have consequences beyond the narrow set of cases considered during the ongoing steering phase of development. That possibility raises exciting research questions, a few of which I have listed below.

| Question | Elaboration and downstream experiments |
|---|---|
| Does RLHF degrade LLM performance? | What kind of RLHF under what conditions? When does it improve performance? |
| How does the size and quality of the pre-trained model affect the impact of RLHF? | Zou and his colleagues note that for some tasks GPT-3.5 improved whereas GPT-4 deteriorated. |
| How do we systematically monitor all these models for longitudinal drift? | What kinds of tasks should be monitored? Is there an information theoretic basis for picking a robust subset of tasks to monitor? |
| Can the RLHF impact on LLM performance be predicted by computational inspection of the reward model? | Can that inspection be performed without understanding the details of the pre-trained model? |
| Will we require artificial neurodevelopmental psychologists to avoid crippling the LLMs? | Can Susan Calvin (of Asimov robot story fame) determine the impact of RLHF through linguistic interactions? |
| Can prompting the developers of RLHF prompts mitigate performance hits? | Is there an engineered path to developing prompts to make RLHF effective without loss of performance? |
| Should RLHF go through a separate regulatory process than the pre-trained model? | Can RLHF pipelines and content be vetted to be applied to different pre-trained models? |
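One way to make the longitudinal-drift question concrete is to replay a fixed set of tasks against dated snapshots of a model and compare accuracy over time. The sketch below is only a minimal illustration under assumptions: the task ids, answers, snapshot labels, and the 0.1 tolerance are hypothetical placeholders, and the responses are hard-coded rather than fetched from any real model API.

```python
def accuracy(responses, answers):
    """Share of tasks whose recorded response equals the reference answer."""
    correct = sum(1 for task, ans in answers.items() if responses.get(task) == ans)
    return correct / len(answers)


def drift_report(snapshots, answers, tolerance=0.1):
    """Compare every snapshot's accuracy with the first (baseline) snapshot.

    Returns (label, accuracy, drifted) tuples, where drifted is True when the
    accuracy moved by more than `tolerance` in either direction.
    """
    labels = list(snapshots)  # dicts keep insertion order
    baseline = accuracy(snapshots[labels[0]], answers)
    report = []
    for label in labels:
        acc = accuracy(snapshots[label], answers)
        report.append((label, acc, abs(acc - baseline) > tolerance))
    return report


# Hypothetical fixed task set with reference answers.
answers = {"t1": "yes", "t2": "no", "t3": "yes", "t4": "no"}

# Hypothetical logged responses from two snapshots of the same model.
snapshots = {
    "2023-03": {"t1": "yes", "t2": "no", "t3": "yes", "t4": "yes"},
    "2023-06": {"t1": "no", "t2": "no", "t3": "no", "t4": "yes"},
}

report = drift_report(snapshots, answers)
for label, acc, drifted in report:
    print(label, acc, "DRIFT" if drifted else "ok")
```

Flagging movement in either direction matters here because, as noted in the table, some tasks improved for one model while deteriorating for another.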

Steering (e.g. through RLHF) can be a much more explicit way of inserting a set of societal or personal values into LLMs than choosing the data that is used to train the pre-trained model. For this reason alone, research on the properties of this process is of interest not only to policy makers and ethicists but also to all of us who are working towards the safe deployment of these computational extenders of human competence.


I wrote this post right after reading the paper by Chen, Zaharia and Zou so I know that it’s going to take a little while longer for me to think through what its broadest implications are. I am therefore very interested in hearing your take on what might be good research questions in this space. Also if you have suggestions or corrections to make about this post, please feel free to email me. – July 19th, 2023

Categories
Healthcare Machine Learning Medicine

Standing on the shoulders of clinicians.

The recent publication “Health system-scale language models as all-purpose prediction engines” by Jiang et al. in Nature (June 7th, 2023) piqued my interest. The authors executed an impressive feat by developing a Large Language Model (LLM) that was fine-tuned using data from multiple hospitals within their healthcare system. The LLM’s predictive accuracy was noteworthy, yet it also highlighted the critical limitations of machine learning approaches for prediction tasks using electronic health records (EHRs).

Take a look at the above diagram from our 2021 publication Machine learning for patient risk stratification: standing on, or looking over, the shoulders of clinicians?. It makes the point that the EHR is not merely a repository of objective measurements, but also includes a record (whether explicit or not) of physician beliefs about the patient’s physiological state and prognosis for every clinical decision recorded. To draw a comparison, a model that uses clinicians’ decisions to diagnose and predict outcomes resembles a diligent, well-read medical student who has yet to master reliable diagnosis. Just as such a student would glean insight from the actions of their supervising physician (ordering a CT scan or ECG, for instance), these models also learn from clinicians’ decisions. Nonetheless, if they were to be left to their own devices, they would be at sea without the cue of the expert decision-maker. In our study we showed that relying solely on physician decisions (as represented by charge details) to construct a predictive model resulted in performance remarkably similar to that of models using comprehensive EHR data.

The LLMs from Jiang et al.’s study resemble the aforementioned diligent but inexperienced medical student. For instance, they used the discharge summary to predict readmission within 30 days in a prospective study. These summaries outline the patients’ clinical course, treatments undertaken, and occasionally, risk assessments from the discharging physician. The high accuracy of the LLMs—particularly when contrasted with baselines like APACHE2, which primarily rely on physiological measurements—reveals that the effective use of the clinicians’ medical judgments is the key to their performance.

This finding raises the question: what are the implications for EHR-tuned LLMs beyond the proposed study? It suggests that quality assessment and improvement teams, as well as administrators, should consider employing LLMs as a tool for gauging their healthcare systems’ performance. However, if new clinicians—whose documented decisions might not be as high-quality—are introduced, or if the LLM is transferred to a different healthcare system with other clinicians, the predictive accuracy may suffer. That is because clinician performance is highly variable over time and location. This variability (aka data set shift) might explain the fluctuations in predictive accuracy the authors observed during different months of the year.

Jiang et al.’s study illustrates that LLMs can leverage clinician behavior and patient findings—as documented in EHRs—to predict a defined set of near-term future patient trajectories. This observation paradoxically implies that in the near future, one of the most critical factors for improving AI in clinical settings is ensuring our clinicians are well-trained and thoroughly understand the patients under their care. Additionally, they must be consistent in communicating their decisions and insights. Only under these conditions will LLMs obtain the per-patient clinical context necessary to replicate the promising results of this study more broadly.

Categories
Machine Learning Medicine

ML and the shifting landscape of medicine

“A process cannot be understood by stopping it. Understanding must move with the flow of the process, must join it and flow with it.”

Frank Herbert, Dune

Imagine a spectacularly accurate machine learning (ML) algorithm for medicine. One that has been grown and fed with the finest of high quality clinical data, culled and collated from the most storied and diverse clinical sites across the country. It can make diagnoses and prognoses even Dr. House would miss.

Then the covid19 pandemic happens. All of a sudden, prognostic accuracy collapses. What starts as a cough ends up as Acute Respiratory Distress Syndrome (ARDS) at rates not seen in the last decade of training data. The treatments that worked best for ARDS with influenza don’t work nearly as well. Medications such as dexamethasone that had been shown not to help patients with ARDS prove remarkably effective. Patients suffer and the ML algorithm appears unhelpful. Perhaps this is overly harsh. After all, this is not just a different context from the original training data (i.e. “dataset shift”), it’s a different causal mechanism of disease. Also, unlike some emergent diseases which present with unusual constellations of findings (like AIDS), covid19 looks like a lot of common inconsequential infections, often until the patient is sick enough to be admitted to a hospital. Furthermore, human clinicians were hardly doing better in March of 2020. Does that mean that if we use ML in the clinic, clinicians cannot decrease their alertness for anomalous patient trajectories? Such anomalies are not uncommon but rather a property of the way medical care changes all the time. New medications with novel mechanisms of action are introduced every year; they bring new outcomes, which can be discontinuous compared to prior therapies, and novel associations of adverse events. Similarly, new devices create new biophysical clinical trajectories with new feature sets.

These challenges are not foreign to the current ML literature. There are scores of frameworks for anomaly detection1, for model switching2, for feature evolvable streaming learning3. They are also not new to the AI literature. Many of these problems were encountered in symbolic AI and were closely related to the Frame Problem that bedeviled AI researchers in the 1970s and 1980s. I’ve pointed this out with my colleague Kun-Hsing Yu4 and discussed some of the urgent measures we must take to ensure patient safety. Many of these are obvious, such as clinician review of cases with atypical features or feature distributions, calibration with human feedback, and repeated prospective trials. These stopgap measures do not, however, address the underlying brittleness that will and should decrease trust in the performance of AI programs in clinical care. So although these challenges are not foreign, there is an exciting and urgent opportunity for researchers in ML to address them in the clinical context, especially because there is a severe data penury relative to our other ML application domains. I look forward to discussions on these issues in our future ML+clinical meetings (including our SAIL gathering).
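To make one of the stopgap measures concrete, the sketch below routes a case to clinician review when any of its features falls far outside the range seen in the training data. It is a deliberately simple z-score check, not one of the anomaly detection frameworks cited here; the feature names, the values, and the threshold of 3 standard deviations are hypothetical and carry no clinical meaning.

```python
from statistics import mean, stdev


def fit_reference(training_rows):
    """Per-feature mean and sample standard deviation from the training data."""
    reference = {}
    for feature in training_rows[0]:
        values = [row[feature] for row in training_rows]
        reference[feature] = (mean(values), stdev(values))
    return reference


def atypical_features(case, reference, z_threshold=3.0):
    """Names of features whose value lies more than z_threshold SDs from the mean."""
    flagged = []
    for feature, (mu, sigma) in reference.items():
        if sigma == 0:
            continue  # a constant feature carries no spread to compare against
        if abs(case[feature] - mu) / sigma > z_threshold:
            flagged.append(feature)
    return flagged


# Hypothetical training data: respiratory rate (breaths/min) and temperature (C).
training_rows = [
    {"resp_rate": 16, "temp_c": 36.8},
    {"resp_rate": 18, "temp_c": 37.0},
    {"resp_rate": 17, "temp_c": 37.1},
    {"resp_rate": 15, "temp_c": 36.9},
    {"resp_rate": 19, "temp_c": 37.2},
]
reference = fit_reference(training_rows)

new_case = {"resp_rate": 34, "temp_c": 37.0}      # far outside the training range
typical_case = {"resp_rate": 17, "temp_c": 37.0}  # well within it

print(atypical_features(new_case, reference))      # candidate for clinician review
print(atypical_features(typical_case, reference))
```

A per-feature check like this catches only values that are individually extreme. It would miss a new causal mechanism whose individual findings all look ordinary, which is exactly the covid19 situation described above.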

1. Golan I, El-Yaniv R. Deep Anomaly Detection Using Geometric Transformations [Internet]. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, editors. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2018. p. 9758–69. Available from: https://proceedings.neurips.cc/paper/2018/file/5e62d03aec0d17facfc5355dd90d441c-Paper.pdf

2. Alvarez M, Peters J, Lawrence N, Schölkopf B. Switched Latent Force Models for Movement Segmentation [Internet]. In: Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A, editors. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2010. p. 55–63. Available from: https://proceedings.neurips.cc/paper/2010/file/3a029f04d76d32e79367c4b3255dda4d-Paper.pdf

3. Hou B, Zhang L, Zhou Z. Learning with Feature Evolvable Streams. IEEE Trans Knowl Data Eng [Internet] 2019;1–1. Available from: http://dx.doi.org/10.1109/TKDE.2019.2954090

4. Yu K-H, Kohane IS. Framing the challenges of artificial intelligence in medicine. BMJ Qual Saf [Internet] 2019;28(3):238–41. Available from: http://dx.doi.org/10.1136/bmjqs-2018-008551