Published: Sun May 31 2026

Should you really trust health advice from an AI chatbot?

There's a recent article on BBC News: Should you really trust health advice from an AI chatbot?

I mentioned in a recent post that I had a virus about 7 weeks ago. The recovery has not been going swimmingly.

A point that's not always made when comparing the effectiveness of chatbots to humans is that the chatbot doesn't have to be perfect; it just has to beat the alternative, which in my case is the NHS. So the article's question should really be: Is an AI chatbot better for health advice than the NHS?

I've had blood tests done and a follow-up phone appointment with a nurse, not a doctor, because my GP surgery didn't let me see a doctor. The nurse's view was that my blood tests were all normal, so it's just standard viral recovery, which they don't consider a problem until 12 weeks have elapsed. The advice was to build up exercise gradually and take a multivitamin (I do already). This appointment did not provide me with anything of value.

Obviously, a chatbot can't perform blood tests for me, but it can analyse the results. I also have a (self-created) training readiness dashboard with activity logs and biometrics from my Garmin plus written diary entries on anything that seems relevant. I've expanded this so that I can export the last N days to CSV and feed it into an LLM and get advice.
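A minimal sketch of what that export step could look like in Python. The column names (`date`, `resting_hr`, `hrv`, `diary`) and the prompt wording are assumptions for illustration, not the actual dashboard schema, and the call to the model itself is left out.

```python
import csv
import io


def last_n_days_csv(rows, n):
    """Return the n most recent rows as CSV text, oldest first.

    Assumes each row is a dict with an ISO-format "date" string
    (hypothetical schema), so plain string sorting is chronological.
    """
    if not rows or n <= 0:
        return ""
    recent = sorted(rows, key=lambda r: r["date"])[-n:]
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(recent[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(recent)
    return buf.getvalue()


def build_prompt(csv_text):
    """Wrap the CSV export in a plain-text prompt (wording is illustrative)."""
    return (
        "Below are my daily biometrics and diary entries as CSV.\n"
        "Each row is one day.\n\n"
        f"{csv_text}\n"
        "Based on this, what level of activity would you advise today?"
    )
```

The resulting string can then be pasted or sent to whichever model is being tested.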

On the blood test, in contrast with the NHS, the chatbot said my vitamin D is classified as "insufficient", which is the grey area between "deficient" and "good", and this is especially significant given that 1. I'm very active and 2. I'm dealing with viral recovery and vitamin D is important for the immune system. I've checked it, and it's right, my vitamin D is below optimal levels even though the NHS considers it normal. And that's with a multivitamin. So that's a significant performance improvement of the chatbot over the NHS.

On the day to day issues, the chatbot has given me a lot of information. It's diagnosed me with post viral autonomic dysfunction, meaning my autonomic nervous system is not functioning correctly. The autonomic nervous system has two main branches: the parasympathetic, which is responsible for being calm, relaxed, and letting the body heal, and the sympathetic, which primes the body for fight or flight scenarios. In a healthy nervous system they work in tandem, where one activates and the other backs off at appropriate times. In post viral recovery the sympathetic branch can become overactive, which means you stay physically stressed for longer after exertion, and the parasympathetic less active, which means even during complete physical rest (like sleep) you struggle to reach a state where your body really starts recovering effectively. This is terrible for recovery and manifests in ways including exercise intolerance and crashes/relapses after doing too much. I suspect that trying to push through this leads to some forms of post-viral fatigue or chronic fatigue syndrome, as the nervous system runs hot for too long and eventually the body shuts down to protect itself, but that's just speculation on my part.

The chatbot checks my biometrics and diary entries daily and has mostly recommended total rest with gentle walks. It wants to see more stability and upwards trends in my biometrics before reintroducing exercise, because 1. fitness will come back quickly once I'm able to train, and 2. every relapse I have prolongs recovery. Further actionable advice includes things like:

Control my hay fever, because the allergy response causes inflammation and the release of cytokines, which stresses the nervous system.
Take frequent breaks from deep concentration tasks because these stimulate the sympathetic nervous system.
Practice diaphragmatic breathing daily and especially if I feel tense, because this activates the parasympathetic branch.
Supplement magnesium glycinate and omega 3, because these are calming for the nervous system, plus timing recommendations.

This is all useful detail that I didn't get from the NHS.

The quality of the output is pretty good in terms of broad diagnosis. It explains exactly what I'm dealing with and gives me enough details that I can check it independently. It also gives me actionable advice on how to proceed from here, with supporting arguments. It's given me a better understanding of my nervous system and has let me build the self awareness to feel what it's doing at any given time. It's one thing to be sat working and think "I feel a bit off", it's much better to think "I feel a bit off and my heart beat is a bit heavy, that means my sympathetic nervous system is going too hard, maybe I should close my eyes and do some breathing exercises until I feel it soften and avoid a crash later". Any athlete will probably already be aware of the feeling of sympathetic dominance; it's what you get when you're 12 hours after a hard race and you can feel that heaviness in your heart beat and the vague sense of unease as your body is working overtime to repair all the damage. So it's just a matter of being aware of that class of feeling, noticing when it's appearing at inappropriate times, and trying to nudge it back to normal.

Is it perfect? No. It's much worse than an intelligent human with the necessary domain knowledge. Beyond very broad output, I'd actually regard it as so poor as to be useless.

Large points of frustration are:

It's far too agreeable. Initially it was giving me very precise HRV conditions for resuming running, e.g. HRV above 50 three days in a row, and was effectively treating HRV as the holy grail of recovery tracking. My physio told me to ignore HRV and focus more on subjective measurements and overnight heart rate. After I put this in the diary section, it has never once tried to use HRV as a condition again. This gives me very little confidence in its analytical abilities.
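For illustration, the kind of rule it was proposing can be written down in a few lines of Python. The threshold of 50 and the three-day streak come from the example above; the function name and the assumption that readings are ordered oldest to newest are mine. This only sketches the rule, it isn't an endorsement of it.

```python
def hrv_streak_met(hrv_values, threshold=50, days=3):
    """Return True if the most recent `days` readings are all above `threshold`.

    `hrv_values` is assumed to be one reading per day, oldest first.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    if len(hrv_values) < days:
        return False  # not enough history to satisfy the streak
    return all(v > threshold for v in hrv_values[-days:])
```

A hard gate like this flips on a single noisy reading, which is part of why a subjective measure or overnight heart rate may be the more useful signal.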

Logical reasoning - I had an interesting situation where I'd done a (very) short run in the afternoon, then had good biometrics until about 9PM when it all went downhill until I woke up at 3AM, then improved massively until the morning. This is hard for me to interpret because I basically have two completely opposing readings for the two halves of the night. The AI confidently diagnosed it as post exertional malaise. I then added the information that I was suffering from hay fever and it was raining heavily (which has pollen implications) at 9PM when I took my dog out, and I'd woken up at 3AM. It told me that this strengthened the argument it was post exertional malaise. This is a clear logical fallacy; the extra information can be dismissed as irrelevant or weaken the argument but it cannot strengthen it. I pushed back with the theory that the rain concentrated the pollen and led to breathing issues overnight until I woke up and shifted and cleared my airways, which let my body relax. The AI wasn't having it. I then asked again in a fresh context and it said "The overnight pattern you described is a near-perfect fingerprint of upper airway resistance being acutely relieved." Alright then.

...Which leads me to the next flaw, inconsistency in decisions - I'm currently at the point where I could probably start reintroducing running, because the biometrics are looking promising and I'm feeling good. A week ago it was consistent in advising me to rest. But now I can press regenerate 4 times and get 4 different pieces of advice ranging from full rest to a brisk walk to attempting run/walk intervals to a 45 minute continuous run (which is absurdly bad advice given my last 15 minute run set me back a week, and it knows this).

More egregiously, it can also start disagreeing with its own advice. I've had a few situations where it's given me advice, I've followed the advice, and then it's told me off because it's since decided it was a silly nonsensical idea. When I questioned it on this, it told me that "the wellness industry loves to give simple physical, but not scientifically backed, solutions to complex, invisible problems" and that "LLMs are trained on the vast expanse of the internet, which includes alternative wellness blogs." So there you go - the LLM is telling me it has been trained to give the wrong answers.

Chronological comprehension - I can see that it gets confused sometimes on details and chronology. I feed it a CSV of my data and it sometimes gets muddled, mixing data from one record with another. The biggest frustration is that no matter how I prompt, it will often fail to understand that the record for a given day's biometric data is taken before any activities are recorded for that day, so often it will tell me that my recorded activity for today caused my HRV to dip. No it didn't, you don't have an HRV measurement post-activity yet.
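One way to take that ordering problem out of the model's hands would be to do the alignment before the data reaches it. The sketch below pairs each day's activity with the following morning's reading, which is the first measurement that can reflect it. The record layout (a dict keyed by date with `hrv` and `activity` fields) is a hypothetical stand-in for the real export.

```python
from datetime import date, timedelta


def pair_activity_with_next_morning(records):
    """Pair each day's activity with the NEXT day's morning HRV.

    `records` maps a date to {"hrv": ..., "activity": ...}, where "hrv"
    is assumed to be measured before that day's activity takes place.
    Returns (activity_date, activity, next_morning_hrv) tuples; the HRV
    is None when the following morning has not been recorded yet.
    """
    pairs = []
    for day in sorted(records):
        activity = records[day].get("activity")
        if not activity:
            continue  # rest day, nothing to attribute
        next_day = records.get(day + timedelta(days=1))
        next_hrv = next_day.get("hrv") if next_day else None
        pairs.append((day, activity, next_hrv))
    return pairs
```

With the data in this shape, today's activity shows up with an empty response column instead of sitting next to a reading that was taken before it happened.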

Total failure - Sometimes the LLM is just terrible at evaluating what you give it. Sometimes it will get columns in the CSV mixed up, sometimes it will make up numbers and even invent metrics to analyze.

I've tried various different models, including some bigger ones through OpenRouter, and while some are better than others, Qwen 3.6 and Gemma 4 running locally both produce output at about the same quality as anything else. The bigger models in OpenRouter are no better, they just run faster than local models on my laptop. There are interesting personality differences between the models though; Qwen 3.6 is consistently more conservative in its activity recommendations than Gemma 4.

I think the summary is that the models are really good at broad diagnosis, because they have a wealth of medical information in their training datasets and can easily match symptom sets to conditions, but they're bad at analysis of subtle details, because they're just text prediction engines and the more subtlety is involved in the input the less likely the input is to match anything they were trained on.

In the end, I have actually disabled it for day to day advice because the output is too volatile to be helpful. It might be useful to use week by week and smooth out the input data to 7-day means, so it's not having to analyse noisy daily details, but day by day the output is too unstable to take seriously. The vast diversity in its output reminds the user that it really is just a monkeys-with-typewriters situation.
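The 7-day smoothing is simple enough to sketch with the Python standard library. This version uses a trailing window and only emits a value once a full window of data is available; both of those choices, and the function name, are assumptions and not something I've settled on.

```python
from statistics import mean


def rolling_mean(values, window=7):
    """Trailing mean over `window` consecutive daily values.

    Returns one mean per full window, so the output is shorter than the
    input by window - 1 entries (empty if there is less than one window).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        mean(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]
```

For example, eight days of readings produce two smoothed values: one for days 1 to 7 and one for days 2 to 8.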

The bottom line is: Is it good? No. Is it better than the NHS? Yes, as long as you respect the limitations.

Mark Watkinson