AMIA 2024 – Monday (11/11/24) Recap

oracle

i started off with the oracle session as i hadn’t really attended industry panels. i enjoyed some of the jargon, like “stormed and normed.” also, the idea that oracle as a big company was late to the cloud and others had a years-long head start, so to catch up they hire the engineers who built other clouds and ask for better, faster, cheaper. (?!)

from words to wonder

i dropped by there – so it appears that for some tasks fine-tuned bert still performs better than even the latest llms on token-level NER vs. document-level NER (not entirely sure what this means). this makes sense since llms are trained for causal prediction (next token), whereas bert-style embeddings are bidirectional, so they may have more sense of individual words and be better suited to such classification tasks. i found it interesting that they pointed out as a limitation that there are other ways to instruct llms – for my own research i found it difficult to figure out how many prompts i should use to feel confident in my results. i guess … just one lol

food entities

this paper looked more at zero shot and one shot for identifying food entities from patient generated data. chatgpt performed a lot better than rule-based systems built around fairly clean data, since it could handle stuff like “bbq” for barbecue and “happy meal” for mcdonalds. the researcher was excited that we can now use more varied data without tailoring algorithms across datasets.

bioner

i finally got an order of magnitude sense: the presenter said that fine tuning the llm directly took around 30 mins on 4x A100 18 GB gpus. but other methods include lora and other parameter-efficient fine tuning (peft) approaches. interesting that they just said outright that there was no statistical significance for some of the comparisons.

nih common data element normalization

again emphasis that ensemble methods perform better. i guess this is true for humans too!

nabla

i also went to Q&A part of nabla, it was cool to sit in a room of practicing clinicians vs. rooms of fellow researchers that are slightly adjacent to my work. (i also miss the extremely technical nerds sometimes). i created a mastodon account (to use every half year or so between conferences? next time i’ll add it to my badge):

TIL: paperwork burden big driver of provider burnout (!), and “Ambient AI” shortens from 90 mins to 30 mins = can go home and spend time with family in evening, can see more patients #amia2024 nrobot@mastodon.social

https://mastodon.social/@nrobot/113467025691843019

i had no idea that the paperwork was so severe that it actually drives burnout. but it makes sense if it’s taking an extra 1.5 hrs of your life after work to catch up on all the paperwork !!

also this interesting quote “gen AI scribing [note to summary] is like ice cream — there’s some people who like it, but not many!”

WIG NLP

then i learned a bit about working groups, still not entirely sure what they are, but i have to be an AMIA member to access anyway (vs just attending the conference). they were trying to set up a mentor / mentee network but kind of struggling. in general i’ve noticed there’s less willingness to randomly connect and mentor than i expected, especially given the fabled west coast startup willingness to help each other out that i’d heard about. maybe it’s that there are many high level industry execs here.

yale – emulated clinical trial

trying to find cohort retrospectively: realized the biggest issue (spent years) is the messiness of the data. normal clinical trial is 100% accurate. if you have 6 parameters that have to be pulled out with nlp, and the current state of art is 80-90% accuracy per parameter, the errors compound (0.8^6 ≈ 26%, 0.9^6 ≈ 53%, assuming they’re independent), so that comes down to roughly 30-50% accuracy, which is unacceptable for running an emulated clinical trial.
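the compounding is easy to check. this is just a sketch, and it assumes the errors on each of the 6 parameters are independent (my assumption, the talk didn't say):

```python
# Sketch: if each of n extracted parameters is correct with probability p,
# and the errors are independent (an assumption), then all n are correct
# together with probability p ** n.
def cohort_accuracy(per_field_accuracy: float, n_fields: int) -> float:
    return per_field_accuracy ** n_fields


for p in (0.80, 0.90):
    print(f"{p:.0%} per field, 6 fields -> {cohort_accuracy(p, 6):.0%} overall")
# 80% per field, 6 fields -> 26% overall
# 90% per field, 6 fields -> 53% overall
```

if the errors are correlated (e.g. the same messy notes break several fields at once) the real number could be better than this.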

generally

i’ve heard a lot about how finding patients (and other aspects) of clinical trials is slowing down research in a major way. so there’s a ton of research into how to address this.

posters

then i caught up with my coworkers’ posters! since i have the most reference material on this i’ll post more later. basically presenting on the system side of a pilot to rank patients for screening, with an anecdote about catching a patient with cancer that wouldn’t have been caught by the old system. rolling out pilot to 28 hospitals soon. the clinicians present on the outcomes side of impact on patients, vs the informatics group presenting on rapid system roll-out

here i’ve started talking to random posters and connecting with people. i think i was just talking to people that were too busy / high level before. humbling for sure, and also confusing coming from socializing in startup circles i guess as i never integrated well into higher education :'(

generally

the nicest seminar was still the keynote on sunday, not sure if that’s because i’m flitting in and out of sessions instead of sticking in one session and talking to people

i skipped out on an entire half session, just went to sit and read a book. the sensory room is just a small conference room set on a different floor, but it was chairs at tables, no comfy couches, so i went to the lobby and sat in a comfy couch instead and put in headphones. actually that place was incredibly loud but it worked haha

dei event

i would not have gone on my own, at this conference i feel like not-minority lmao. but i went with my coworkers and the industry sponsor provided good food. at first it was just a few people and we stood by ourselves, but the chair / co-chair came to talk to us, and by the time that finished the room was incredibly full. i was really impressed by one of them who could tell that ray and rui are pronounced differently (rui has the tongue further back), legit thought he did linguistics. he connected it to christian beliefs which i found interesting, that the lord knew our true name before we were born, so really pronouncing names is important to him. i didn’t connect with anyone new (too burnt out) but i reconnected with folks from WINE.

career talk

i talked to one person about my life goals, and that was a good talk. i should likely not treat my supervisor as anything adversarial. i admit that it threw me off for him to say that i would be working under my coworker so that was the more important opinion for performance evals. and it has made me mrrr to hear that i should be doing the work of the position i want when i am being paid half the cost of that position to do that much work. but i do not think my supervisor meant to imply i would be working under my coworker no matter what.

VA event

there was also a VA event that we dropped by, but everyone there was super tired so we talked to one group of people and left. i again felt some regional politics at play slightly.

personal

okay, i feel like i’m totally failing at connecting with people, but have to remind myself i’m like … 2 days into getting to know people here. and two months into having some new blurb to write about myself.

it’s been incredibly relaxing even if possibly foolhardy to not go around the exhibition hall trying to connect with companies.

also there is so much i need to learn, that i can feel free to list now that i have a job and a foot in the door (and emotional reserves). likert, rouge, lora, spearman, these are all terms i need to learn. maybe i’ll install chat on my phone just so i can learn those in the moment as i’m motivated and there’s context for how it’s used …

prior self

i can see how i must have come across to people before i got a job now. just kind of emotionally disconnected from everyone else, not wanting people’s pity and knowing they can’t help right away, and feeling so hopelessly caught up in my own life issues. i keep trying to remind myself. one day when i struggle to go to the bathroom on my own i’ll look back on this time in my life and be like “wow that was such a good time.”

or maybe i’ll get cancer earlier and just wish i had such normal problems where i thought i would live to old age. i’ll miss the time when i could still just call my parents and they didn’t have major health problems and i still had hope for them to live to 120. who knows, who knows.

general state of the world

i feel surprisingly detached from all my anger and fear about the elections in the united states. it barely takes up 1% of my mental headspace. it’s like 50% industry career new job, 30% technical skills worry, 15% relationships, 4% excitement about extremely random stuff on my backburner, 1% state of domestic and international world affairs / bitterness (how could we have someone who molests women as our president? and millions of people turned out to vote for him? i feel such shame and horror and despair and hopelessness that kamala harris is not our next president. all my degrees and accomplishments feel utterly useless. i feel so angry, so angry at anyone in my life that doesn’t hate trump. but remember — the world is not ending today or tomorrow, my friends can ally with me on other things, and my life will be okay).

high level planner

i’ve been trying to move in the opposite direction of my natural instincts. if i’m drawn to the climate and diversity and equity talks, i’ve been forcing myself to go in more purely technical directions. i would like to find my own pure happiness in technical implementation for the next few years independent of the impact. of course, from my perspective any work i do on AI is work I’m not doing on equity in AI and contributing to climate change in data centers. i’m not living up to my principles and the emotional energy i directed at it. but reminding myself that the alternate pathway is: get rich, make changes, donate to causes and empower others to work on these topics. i feel after my thesis on illicit massage industry and the energy i put into supporting my ex partner, it’s time to focus on just me. and nothing has to be permanent. it’s like i constantly have a higher-level planner — what should i be spending my time thinking about? usually the answer is, not what i’m thinking about right now. and it’s important but hopefully i can start tuning it down some day soon. i want to be thinking about interesting technical papers, where the field is going, cool packages i learned about recently, learning about monads and the latest bug in some code i wrote. hopefully next week i can finally start doing that.

AMIA 2024 – [Official Day 1] quick thoughts

my brain is totally fried

right, saturday was workshops where i was upskilling and connecting with people on linkedin and sort of my coworkers

sunday — when i really started to dive into informatics. sunday was infinitely more exhausting, just way more people. interesting thoughts but no one to talk to about them, not possible to go to all the sessions. but i did learn yesterday (saturday at WINE, the women .. informatics… networking thing) you can swap between sessions during paper presentations, so i marked mine out carefully and have been going in and out of sessions.

sunday morning

morning i went to sessions on social media research. as usual i was late, insomnia and turning off my alarm and i find it really hard to stop ironing once i start but i’m also bad at it. but it was freeing that no one knew or cared if i showed up on time, since my lab group went to alcatraz without me … first half of workshop was a bunch of paper presentations. afterward was group discussion. and at the end there was coffee break, so you could stick around and talk to people.

i learned through brief chat after: best practice is to contact a forum moderator and explain what research you want to do. if they don’t respond, tell university irb. university irb will likely say exempt since it’s social media research. prior big discussion on whether this is reasonable, observation that junior faculty are encouraged to use social media if irb approvals taking too long. note that irbs are overworked, so either have inter-institutional irb focused on this, or perhaps have subject matter experts to call in on these topics. dataset release: also occurs, and just run an off-the-shelf name stripper. (i stressed that names will slip through, but i guess… it’s… fine?)

i went over my idea of how to flip the legal power dynamic: companies agree to users’ privacy policy when we agree to theirs. incentive: perhaps just that

keynote

the keynote in the early afternoon was a really good crash course on theoretical ethics. no longer felt so obscure and hard to understand. i should start watching more talks.

> differential privacy: the simplest example is randomized response. for a survey about behavior people might lie about, e.g. their social distancing habits, we can give people plausible deniability. tell them to flip a coin, if it’s heads tell the truth, if it’s tails, flip the coin again: if it’s heads, tell the truth, if it’s tails, tell the predetermined response. that default response is yes to illegal behavior, so you have plausible deniability. on the individual level, who knows; at the aggregate level, you can work out mathematically an estimate of the truth.

i’d heard of this technique long ago, but never realized it counted as a differential privacy algorithm.
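to convince myself the aggregate math works, here's a small simulation sketch of the two-flip version as i wrote it down above (truth with probability 3/4, forced "yes" with probability 1/4). the 30% true rate and the function names are made up for illustration, not from the keynote:

```python
import random

P_TRUTH = 0.75       # heads, or tails-then-heads: answer truthfully
P_FORCED_YES = 0.25  # tails-then-tails: give the predetermined "yes"


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """One respondent following the two-flip protocol described above."""
    if rng.random() < 0.5:   # first flip is heads
        return truth
    if rng.random() < 0.5:   # second flip is heads
        return truth
    return True              # second flip is tails: predetermined "yes"


def estimate_true_rate(observed_yes_rate: float) -> float:
    # observed = P_TRUTH * true_rate + P_FORCED_YES, solved for true_rate
    return (observed_yes_rate - P_FORCED_YES) / P_TRUTH


rng = random.Random(0)
true_rate = 0.30  # hypothetical share of people whose honest answer is "yes"
answers = [randomized_response(rng.random() < true_rate, rng) for _ in range(100_000)]
observed = sum(answers) / len(answers)
estimate = estimate_true_rate(observed)
print(f"observed yes rate: {observed:.3f}, estimated true rate: {estimate:.3f}")
```

any single "yes" could be a forced one, but the estimate across everyone should land close to the true rate.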

throw money at it, aka bug bounties, and concatenate models for more accuracy without less fairness

bounty a: find a subgroup with different outcomes. bounty b: find not only a subgroup where outcomes are different for the current model, but also a model with optimal performance on that subgroup.

then you can just concatenate models! and don’t have to trade fairness with accuracy — strict (theoretically proven) increase in accuracy. if it’s in subgroup z then use model B, otherwise use model A. downside: do not know full model complexity ahead of time so risk overfitting.
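a minimal sketch of what "concatenating" could look like in code. the subgroup rule, the two threshold models, and the six data rows are all toy examples i made up, not anything from the talk:

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Model = Callable[[Row], int]


def concatenate(in_subgroup: Callable[[Row], bool], model_b: Model, model_a: Model) -> Model:
    """If the row is in subgroup z use model B, otherwise use model A."""
    def combined(row: Row) -> int:
        return model_b(row) if in_subgroup(row) else model_a(row)
    return combined


def accuracy(model: Model, data: List[Tuple[Row, int]]) -> float:
    return sum(model(row) == label for row, label in data) / len(data)


# Hypothetical pieces: model A is the current model, model B was submitted
# for a bounty because it does better on the subgroup "age >= 65".
model_a: Model = lambda row: int(row["score"] > 0.5)
model_b: Model = lambda row: int(row["score"] > 0.3)
in_z = lambda row: row["age"] >= 65

toy_data = [
    ({"age": 30, "score": 0.70}, 1),
    ({"age": 40, "score": 0.20}, 0),
    ({"age": 50, "score": 0.40}, 0),
    ({"age": 70, "score": 0.40}, 1),
    ({"age": 72, "score": 0.35}, 1),
    ({"age": 80, "score": 0.10}, 0),
]

combined = concatenate(in_z, model_b, model_a)
print(f"model A alone: {accuracy(model_a, toy_data):.2f}")
print(f"A + B on subgroup z: {accuracy(combined, toy_data):.2f}")
```

rows outside the subgroup get exactly the same prediction as before, so overall accuracy can only go up if B really is better inside the subgroup.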

bug bounties in traditional security: avoid adversarial interaction (propublica, facebook, compas) and shift to more collaborative interaction.

(will insert rest of summary of talk tomorrow)

sunday afternoon

i flitted between sessions.

demos: people’s screens were illegibly tiny, and i couldn’t get my camera to zoom enough. next time i’m bringing binoculars or a phone camera binocular attachment. one presentation was mostly a webapp llm, but the enthusiasm was catchy. (the topic was connecting pregnant women to sources of info). little practical tidbits, like they tried conversation history but the llm tended to stop referring to primary srcs and start reflecting user inaccuracies. makes sense since the user input will be the most recent input. the other was a recommendation for LLM engineers handbook. also, models can run on laptops (macbook m series i’m guessing?).

-> lost-in-the-middle: issue with RAG where references at the beginning (ranked relevant) and end (due to how llms are trained to predict next token) are highly rated but ones in the middle are ignored.

epic: packed. the ten minutes i heard were a lot of “physicians hate changes usually but they actually sometimes liked this!” a lot of focus on a. we have so many users! and b. they’re happy with us! and not a lot of mention of hallucinations and integrity checking. i will have to find someone who went to the epic session to debrief me. otherwise, it matched my fears of industry: focused on putting gen AI in everything regardless of if reasonable or not and not really caring about unknown bugs.

~ epic: my own thoughts on ethics: i find there are multiple “choice” frames to approach this from. one: if it makes care 1% better for 1000 patients and 50% worse for one patient, is it worth it? two: if taking an extra month to implement some quality assurance avoids that one bad outcome, is the extra month worth it? three: what if it means we can provide care for one person we couldn’t see at all before due to lack of physician time? etc.

JAMA: i learned that physicians are exhausted and frankly rather scared of how AI is taking over. the FDA has approved over a thousand devices with AI components the past year. there is work on a podcast to explain recent research to physicians.

JAMA – methods move fast: if used same methods as five years ago in a paper submitted today, paper would be returned with tons of methods critiques

child maltreatment public (state-level) policy analysis: how different state laws correlated with different pediatric outcomes.

used LLMs to summarize a bunch of public policies. does maltreatment definition include physical abuse? is reporting centralized or not (hotline)? then factor analysis to find factors to make cohorts (find underlying patterns in policies): mandated reporter? reporter training? then cluster around to find specific factor combinations (few/moderate/high levels of training, or penalties, etc.). then look at outcome differences. (will insert screenshots later)

policy -> difference-in-difference: check if results still hold between 2019 and 2021 when policies were in-place and not-in-place.
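for my own reference, the basic difference-in-differences arithmetic, as a sketch. the numbers are invented placeholders, not results from the paper:

```python
def diff_in_diff(treated_before: float, treated_after: float,
                 control_before: float, control_after: float) -> float:
    """Change in the policy group minus change in the comparison group."""
    return (treated_after - treated_before) - (control_after - control_before)


# Made-up outcome rates, for illustration only:
# "treated" = states where the policy came into place between the two years,
# "control" = states where it did not.
effect = diff_in_diff(treated_before=12.0, treated_after=9.0,
                      control_before=11.0, control_after=10.5)
print(effect)  # -2.5
```

the subtraction of the control group's change is what removes trends that hit every state over the same period.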

dinner / expo

they have free headshots! i desperately need one, but [tmi] i have a giant pimple right now :'( maybe that’s what AI is for lol. the expo … somehow i was completely uninterested. i think i felt rejected by my coworkers. i called a friend for moral support lol

personal thoughts

overall the conference is more diverse than i expected (which isn’t saying that much though tbh). i wonder how the stats compare to IROS/RSS/ICRA.

i need to remind myself i’m utterly new to this field and i’m not here presenting a paper or poster. of course, i will have to work harder to connect technically with people, and surface a whole different set of my interests than i usually nerd out about. people here are not excited about welding robots or maker faire or engineering education. nor are they nerding out about the latest theoretical advances and architectures and benchmark leaderboards. people here talk about weird interesting healthcare things.

i’ve only officially been to robotics conferences as a presenter

so there’s multiple factors at play here for my social sore thumb feeling.

mostly, i was really surprised when a fellow attendee who also went to the boston VA for many years declined to connect on linkedin. but i guess we exchanged conference app qr codes? in any case – the main thing i learned talking to her is that there is a huge disparity in resources in regional VAs. i thought it was a national organization but apparently not. we in boston have the resources to figure out how to get at data sets but researchers in other VA regional sites do not. even though the data sets are VA datasets! at least, this was her experience a decade ago. also, generally feeling a bit surprised that my coworkers are not meeting up between sessions and discussing topics in real time constantly. i guess we are not really in the same lab. i gave up on organizing meals with my labmates and just randomly messaged people (around my career level) to find someone to go to dinner with. a person from WINE invited me to a dinner with a bunch of student volunteers for the conference. note: if you join the working group, there’s a chance you can be invited to volunteer for the conference and the registration fee is waived. (but travel, hotel, food not covered).

people expect postdocs to have a research idea and focus, so i need a different self-intro hook.

that makes sense. i’m not really working with a professor directly on a specific idea. as a result of my lack of topic, i need a different self blurb than “i do data science at the VA.” i think the only thing i’ve done of interest to folks is my thesis research topic.

alternatively, i could just not talk to anyone and do my own work. but oh, i felt a bit sad to not have people to compare notes with after the sessions. i really wanted those interesting and thought-provoking technical conversations i would have late at night in undergrad. (i never really got that in grad school except for classes, not research). i really did yesterday. but y’know, this is like one day into the field for me depending on how you count lol

grad school – i wish that after leave of absence they’d invested in my success instead of throwing me in and asking me to perform better than other people with a handicap already. i learned so much in so little time from this conference ! !

i guess for many of the older/senior folks, AMIA is like putzgiving, where people fly in every year and it’s a reunion.

AMIA 2024 – It’s a house of cards 0: the LLMs are supervising the LLMs (W05) Empowering Healthcare with Knowledge-Augmented LLMs – Innovations and Applications

Hullo! Long-time no see, actual blog posts on career / tech oriented topics.

Today I’m attending the American Medical Informatics Association 2024 Annual Symposium in San Francisco. My first fully-funded work trip, from flight to food (per-diem) to conference registration and hotels 🙂

For now, since time is short between sessions, I’ll focus on my own reflections. Later I can go back and (especially with the slides as reference) fill in a synopsis of (what I learned) from the workshop.

The first session of the day was a workshop on LLMs. I thought this might be more of a “work” shop but it was actually mostly presentations of recent papers by several speakers, and a panel. This is for the best since there were several dozen attendees. I’m actually really enjoying myself regardless. I am getting dragged from the dark ages to the new frontier of LLMs, and it’s exciting to see an entirely new field of research as well as application. Here I’m entirely anonymous, a blank slate. Mentally, I can come in with few preconceptions or emotional ties.

Note to self: get to conference early in order to avoid badge line, though it did move quick. Also exit early to get the snacks (nutri-grain bars and such).

So what does the frontier of LLMs look like? From my outsider perspective, it looks like a crazy house of cards to be honest, with LLMs on LLMs on LLMs. To be fair, I walked in (late, I got absorbed into ironing lol) and the first presentation included self-verification, with LLM-LLM checking compared to human-LLM checking. Cue visions of runaway AIs merrily herd-guiding themselves into complete ethical chaos. A lot of my work I doubted as being stuff I made up hackishly as a feral coding child that didn’t have the engineering chops to implement a better solution. Surely the experts out there have far better methods. But nope! Indeed even for RAG (retrieval-augmented generation, like a librarian retrieving specific references for you, the LLM, to consider) the presenters are relying on LLMs (LangChain) to create the chunks to feed into the LLM.

Wat.

Maybe this is a solved problem and I don’t know it? Hopefully later workshops will reduce my skepticism since many people seem to just take it as part of the cog. I mean, if it gets you presentations and workshops … well … I guess you don’t have to innovate on every single part of a system to get a publication and move the field forward. Makes sense.

Also: I do have to consider that humans make plenty of mistakes also. Can we compare to self-driving cars, which (last I checked) have lower accident rates in several categories than humans? (though I need to check if this is simply because self-driving cars currently drive in easier conditions over fewer miles than humans). It’s entirely possible LLMs may be inaccurate but still better than humans. They incorporate a larger information base. For instance, locally humans may make a mistake (as in a MITOC seminar I attended) where e.g. instructors were not sure of signs of frostbite on dark skin instead of fair skin and had to look up the information. For LLMs, likely the information already “exists.”

Should we demand higher performance for LLMs?

Unfortunately, LLMs have hallucination and other errors that deviate from human intuition. (Consider how LLMs struggle to answer “how many r’s are in strawberry.” The conjecture is that LLMs are trained at the chunk-level (more like words) rather than character level, and contain no sense of logical reasoning since they are trained to predict the next token, so will struggle on these. However, this is clearly not intuitive nor expected to a vast majority of users). I view it as drawing a “bubble” of stability/robustness (as in robotics) that we can predict well with humans. But we have little intuitive idea of the boundaries with LLMs.

From software and systems engineering, we have the concept of CI/CD, or continuous integration / continuous delivery (or deployment). Essentially, we write tests that are automatically run to monitor (production) code output. I could imagine something like the CAPTCHA system which I think relies on humans verifying other humans. We could imagine a system where a subset of predictions by either LLMs or physicians are sent for a “second opinion” by another physician (or LLM?) for quality monitoring.
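Here is a rough sketch of what that “second opinion” monitor could look like. Everything in it (the function names, the 5% review fraction, the note IDs and labels) is hypothetical; it's just me thinking out loud in code, not a description of any system presented at the workshop.

```python
import random
from typing import Dict, List, Sequence


def sample_for_review(prediction_ids: Sequence[str], fraction: float, seed: int = 0) -> List[str]:
    """Pick a random subset of predictions to send for a second opinion."""
    rng = random.Random(seed)
    k = max(1, round(len(prediction_ids) * fraction))
    return rng.sample(list(prediction_ids), k)


def agreement_rate(first: Dict[str, str], second: Dict[str, str]) -> float:
    """Share of reviewed items where the second opinion matches the first."""
    shared = first.keys() & second.keys()
    if not shared:
        raise ValueError("no items were reviewed by both parties")
    return sum(first[i] == second[i] for i in shared) / len(shared)


# Hypothetical first-pass predictions (from an LLM or a physician).
first_pass = {f"note-{i}": "negated" if i % 2 else "affirmed" for i in range(200)}
to_review = sample_for_review(list(first_pass), fraction=0.05)

# Hypothetical second opinions: here the reviewer disagrees on one item.
second_pass = {note_id: first_pass[note_id] for note_id in to_review}
second_pass[to_review[0]] = "uncertain"

print(f"reviewed {len(to_review)} of {len(first_pass)} predictions")
print(f"agreement: {agreement_rate(first_pass, second_pass):.0%}")
```

Like a CI test, the agreement rate could be tracked over time and raise an alarm when it drops below some threshold.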

The other approach is creating tests – creating the benchmarks, which apparently don’t exist for evaluating hallucinations. The panel at the end had speakers with many examples of how LLMs had failed, even at longstanding tasks such as checking for negation (e.g. patient did not receive xyz drug). I hope that there are efforts to crowdsource such examples. (A little voice in me is like: could a nefarious party use such a dataset? This voice has distracted me negatively in my career so I’ve decided to ignore it. I do believe the positive outcomes outweigh any negatives at this stage. Certainly for me personally…).

I do need to read more in detail on how the LLMs of today are trained, since of course OpenAI faced similar issues with monitoring their chat output. They solved this in part by having humans read the output and answer multiple choice questions about the LLM output.

Another thought: a lot of hand wringing about how to ensure fairness across diverse patient populations. I know this is a technical talk, but part of me is like “oh easy if you actually put money on the line it will be solved.”

Another question I’ve yet to have answered: can LLMs be retrained (or finetuned) to embed a concept of uncertainty and self-evaluation of bias from the ground up? I am still skeptical of the original data sources LLMs are trained on, as well as the opaque safeguards implemented (though this does seem like a fun black hat activity now that I realized it’s actually a security question). I suspect this is too resource-intensive for companies to find it worthwhile, alas. (Another ethical issue I have with working in LLM space: I’m not working on climate change mitigation for sure, not even on research to make LLMs more efficient. Again going against my instinct and reminding myself to focus on core technical skills for now, as that’s what’s holding me back, not my varied social justice issues and ideas hah).

I am also curious if we have a sense of when hallucination and trust could be considered solved. If we have a clear vision of that, it’s possible we can work backwards.

All-in-all, it’s been really intellectually stimulating. I do feel a difference in my own confidence (arrogance) with my PhD. I wonder if this is how fresh grads with more confidence feel. I don’t feel a pressure to be perfect — to seize the opportunity to ask interesting questions all the time, to network obsessively. I can focus a bit more on just thinking things and doing happy things instead of being anxious about everything I’m not doing. It’s interesting to feel like “of course I will have interesting insights given my background and skill set, all the people who doubt me are wrong.” (Okay, it’s arrogance from a root of insecurity lol). Instead of acknowledging that everyone here will have really interesting insights and my task is to give space and respect people with different backgrounds and work to find these insights (or create them together).

Hopefully this will be reined in by (in the negative?) performance evaluations one day, in a good and constructive way. Or perhaps in the positive by more genuine confidence in my skill set that lets me be open to valuing and respecting others more.

Honestly, I spend so much time right now just reading novels or stressing about the future and relationships instead of just making cool coding projects. But I do feel like I’m starting to come out of that and start living in the present. It will just take time.

Time to sneak in a bit of lunch in 12 minutes? I skipped breakfast. I guess I’ll just live off of the calories from coffee.

projects blog (nouyang)