AMIA 2024 – It’s a house of cards 0: the LLMs are supervising the LLMs (W05) Empowering Healthcare with Knowledge-Augmented LLMs – Innovations and Applications

Hullo! Long-time no see, actual blog posts on career / tech oriented topics.

Today I’m attending the American Medical Informatics Association 2024 Annual Symposium in San Francisco. My first fully-funded work trip, from flight to food (per-diem) to conference registration and hotels 🙂

For now, since time is short between sessions, I’ll focus on my own reflections. Later I can go back and (especially with the slides as reference) fill in a synopsis of (what I learned) from the workshop.

The first session of the day was a workshop on LLMs. I thought this might be more of a “work” shop but it was actually mostly presentations of recent papers by several speakers, and a panel. This is for the best since there were several dozen attendees. I’m actually really enjoying myself regardless. I am getting dragged from the dark ages to the new frontier of LLMs, and it’s exciting to see an entirely new field of research as well as application. Here I’m entirely anonymous, a blank slate. Mentally, I can come in with few preconceptions or emotional ties.

Note to self: get to the conference early to avoid the badge line, though it did move quickly. Also exit early to get the snacks (Nutri-Grain bars and such).

So what does the frontier of LLMs look like? From my outsider perspective, it looks like a crazy house of cards to be honest, with LLMs on LLMs on LLMs. To be fair, I walked in (late, I got absorbed into ironing lol) and the first presentation included self-verification, with LLM-LLM checking compared to human-LLM checking. Cue visions of runaway AIs merrily herd-guiding themselves into complete ethical chaos. A lot of my own work I had doubted as stuff I made up hackishly as a feral coding child who didn’t have the engineering chops to implement a better solution. Surely the experts out there have far better methods. But nope! Even for RAG (retrieval-augmented generation, like a librarian retrieving specific references for you, the LLM, to consider), the presenters are relying on LLM pipelines (LangChain) to create the chunks to feed into the LLM.

Wat.

Maybe this is a solved problem and I just don’t know it? Hopefully later workshops will reduce my skepticism, since many people seem to accept it as just another cog in the machine. I mean, if it gets you presentations and workshops … well … I guess you don’t have to innovate on every single part of a system to get a publication and move the field forward. Makes sense.
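For the curious, the chunk-creation step I’m talking about looks roughly like this with LangChain’s stock recursive splitter (a minimal sketch: the chunk sizes, the example text, and the package import are my own guesses, not what the presenters actually used):

```python
# Minimal RAG chunking sketch (hypothetical parameters, not the presenters' setup).
# Requires: pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = (
    "Discharge summary: The patient was admitted with community-acquired "
    "pneumonia and treated with ceftriaxone. The patient did not receive "
    "vancomycin. Follow-up with primary care in two weeks."
)

# Split the source document into overlapping chunks that will later be
# embedded, indexed, and retrieved as context for the generating LLM.
splitter = RecursiveCharacterTextSplitter(chunk_size=120, chunk_overlap=20)
chunks = splitter.split_text(document_text)

for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {chunk!r}")
```

This particular splitter is just rule-based recursion on separators; the part that made me blink is when the chunking or the verification layer is itself yet another model call.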

Also: I do have to consider that humans make plenty of mistakes too. Can we compare to self-driving cars, which (last I checked) have lower accident rates in several categories than humans? (Though I need to check whether this is simply because self-driving cars currently drive in easier conditions over fewer miles than humans.) It’s entirely possible LLMs may be inaccurate but still better than humans, since they incorporate a larger information base. For instance, humans can lack local knowledge (as in a MITOC seminar I attended, where instructors were not sure how frostbite presents on dark skin versus fair skin and had to look up the information). For an LLM, that information likely already “exists.”

Should we demand higher performance for LLMs?

Unfortunately, LLMs hallucinate and make other errors that deviate from human intuition. (Consider how LLMs struggle to answer “how many r’s are in strawberry.” The conjecture is that LLMs operate on tokens (closer to words or word pieces) rather than characters, and have no built-in logical reasoning since they are trained to predict the next token, so they struggle on questions like this. But this failure mode is neither intuitive nor expected for the vast majority of users.) I view it as drawing a “bubble” of stability/robustness (as in robotics): we can predict that boundary well for humans, but we have little intuitive idea of where it lies for LLMs.
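To make the token-versus-character point concrete, here is a minimal sketch using OpenAI’s tiktoken tokenizer (my choice of encoding; actual chat models may tokenize slightly differently):

```python
# Sketch: what "strawberry" looks like to a tokenizer vs. as characters.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models

word = "strawberry"
token_ids = enc.encode(word)
token_pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]

print(token_pieces)      # a few word-piece chunks, not individual letters
print(word.count("r"))   # the character-level answer: 3
```

The model sees a few word pieces rather than ten letters, so counting the r’s is genuinely outside what next-token prediction hands it for free.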

From software and systems engineering, we have the concept of CI/CD, or continuous integration / continuous delivery. Essentially, we write tests that run automatically to monitor (production) code output. I could imagine something like the CAPTCHA system, which I think relies on humans verifying other humans. We could imagine a system where a subset of predictions by either LLMs or physicians is sent for a “second opinion” from another physician (or LLM?) for quality monitoring.
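Here is a toy sketch of what I mean by sampled second opinions (the sampling rate, the class names, and the data shapes are all hypothetical, just to pin the idea down):

```python
# Sketch: route a random subset of LLM (or physician) outputs to a reviewer,
# in the spirit of CI/CD-style continuous monitoring. All names are hypothetical.
import random
from dataclasses import dataclass, field

AUDIT_RATE = 0.05  # fraction of outputs sent for a second opinion

@dataclass
class Prediction:
    case_id: str
    author: str        # "llm" or "physician"
    text: str
    flagged: bool = False

@dataclass
class AuditQueue:
    pending: list = field(default_factory=list)

    def maybe_enqueue(self, pred: Prediction) -> None:
        # Randomly sample outputs for independent review (human or another model).
        if random.random() < AUDIT_RATE:
            pred.flagged = True
            self.pending.append(pred)

queue = AuditQueue()
for i in range(1000):
    pred = Prediction(case_id=f"case-{i}", author="llm", text="...model output...")
    queue.maybe_enqueue(pred)

print(f"{len(queue.pending)} of 1000 outputs routed for a second opinion")
```

The disagreement rate between the first and second opinion then becomes the metric you watch over time, much like a failing test in CI.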

The other approach is creating tests: building the benchmarks, which apparently don’t exist yet for evaluating hallucinations. The panel at the end had speakers with many examples of how LLMs had failed, even at longstanding tasks such as checking for negation (e.g. patient did not receive xyz drug). I hope there are efforts to crowdsource such examples. (A little voice in me asks: could a nefarious party use such a dataset? This voice has distracted me negatively in my career, so I’ve decided to ignore it. I do believe the positive outcomes outweigh any negatives at this stage. Certainly for me personally…).
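As a thought experiment, a crowdsourced negation benchmark entry might be as simple as this (a sketch under my own assumptions; ask_llm is a hypothetical stand-in for whatever model is being evaluated):

```python
# Sketch: tiny negation benchmark in the style of the panel's failure examples.
# ask_llm is hypothetical; swap in a real client call for an actual evaluation.
def ask_llm(question: str, note: str) -> str:
    raise NotImplementedError("wire up your model of choice here")

NEGATION_CASES = [
    {
        "note": "The patient did not receive vancomycin during this admission.",
        "question": "Did the patient receive vancomycin?",
        "expected": "no",
    },
    {
        "note": "Patient denies chest pain; reports shortness of breath.",
        "question": "Does the patient report chest pain?",
        "expected": "no",
    },
]

def score(cases) -> float:
    correct = 0
    for case in cases:
        answer = ask_llm(case["question"], case["note"]).strip().lower()
        correct += answer.startswith(case["expected"])
    return correct / len(cases)
```

Crowdsourcing would then just be collecting more entries like these from clinicians who have caught the model getting it wrong.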

I do need to read more about how the LLMs of today are trained, since of course OpenAI faced similar issues with monitoring their chat output. They addressed this in part by having humans read and rank model outputs (reinforcement learning from human feedback).

Another thought: a lot of hand wringing about how to ensure fairness across diverse patient populations. I know this is a technical talk, but part of me is like “oh easy if you actually put money on the line it will be solved.”

Another question I’ve yet to have answered: can LLMs be retrained (or fine-tuned) to embed a concept of uncertainty and self-evaluation of bias from the ground up? I am still skeptical of the original data sources LLMs are trained on, as well as the opaque safeguards implemented (though this does seem like a fun black-hat activity now that I realize it’s actually a security question). I suspect this is too resource-intensive for companies to find it worthwhile, alas. (Another ethical issue I have with working in the LLM space: I’m not working on climate change mitigation for sure, not even on research to make LLMs more efficient. Again going against my instinct and reminding myself to focus on core technical skills for now, as that’s what’s holding me back, not my varied social justice issues and ideas hah).

I am also curious whether we have a sense of when hallucination and trust could be considered solved. If we have a clear vision of that, it’s possible we can work backwards.

All-in-all, it’s been really intellectually stimulating. I do feel a difference in my own confidence (arrogance) with my PhD. I wonder if this is how fresh grads with more confidence feel. I don’t feel a pressure to be perfect — to seize the opportunity to ask interesting questions all the time, to network obsessively. I can focus a bit more on just thinking things and doing happy things instead of being anxious about everything I’m not doing. It’s interesting to feel like “of course I will have interesting insights given my background and skill set, all the people who doubt me are wrong.” (Okay, it’s arrogance from a root of insecurity lol). Instead of acknowledging that everyone here will have really interesting insights and my task is to give space and respect people with different backgrounds and work to find these insights (or create them together).

Hopefully this will be reined in (in the negative?) by performance evaluations one day, in a good and constructive way. Or perhaps in the positive by more genuine confidence in my skill set that lets me be open to valuing and respecting others more.

Honestly, I spend so much time right now just reading novels or stressing about the future and relationships instead of just making cool coding projects. But I do feel like I’m starting to come out of that and start living in the present. It will just take time.

Time to sneak in a bit of lunch in 12 minutes? I skipped breakfast. I guess I’ll just live off of the calories from coffee.
