Published: Sun May 31 2026

Should you really trust health advice from an AI chatbot?

There's a recent article on BBC News: Should you really trust health advice from an AI chatbot?

I mentioned in a recent post that I had a virus about 7 weeks ago. The recovery has not been going swimmingly.

A point that's not always made when comparing the effectiveness of chatbots to humans is that the chatbot doesn't have to be perfect; it just has to beat the alternative, which in my case is the NHS. So the article should really be: Is an AI chatbot better for health advice than the NHS?

I've had blood tests done and had a follow-up phone appointment with a nurse, not a doctor, because my GP surgery didn't let me see a doctor. The nurse's view was that my blood tests were all normal, so it's just standard viral recovery, which they don't consider a problem until 12 weeks have elapsed. The advice was to build up exercise gradually and take a multivitamin (I do already). This appointment did not provide me with anything of value.

Obviously, a chatbot can't perform blood tests for me, but it can analyse the results. I also have a (self-created) training readiness dashboard with activity logs and biometrics from my Garmin plus written diary entries on anything that seems relevant. I've expanded this so that I can export the last N days to CSV and feed it into an LLM and get advice.
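As a rough illustration of that export step, here is a minimal sketch in Python. It is not my actual dashboard code: the function names, the column names (`hrv_ms`, `resting_hr`, `activity`, `diary`) and the prompt wording are all made up for the example, and it assumes each record is a dict with a `date` field.

```python
import csv
import io
from datetime import date, timedelta

# Hypothetical column names; a real dashboard will have its own.
FIELDS = ["date", "hrv_ms", "resting_hr", "activity", "diary"]


def export_last_n_days(records, n, today):
    """Return CSV text for the records dated within the last n days, oldest first."""
    cutoff = today - timedelta(days=n - 1)
    recent = sorted(
        (r for r in records if cutoff <= r["date"] <= today),
        key=lambda r: r["date"],
    )
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for r in recent:
        # Write dates in ISO format so the ordering is unambiguous to the model.
        writer.writerow({**r, "date": r["date"].isoformat()})
    return buf.getvalue()


def build_prompt(csv_text):
    """Wrap the CSV in a short instruction (the wording is only an example)."""
    return (
        "Below are my recent biometrics, activities and diary entries as CSV.\n\n"
        + csv_text
        + "\nBased on this, what should I do today?"
    )
```

The resulting prompt string can then be pasted into, or sent to, whichever model is being used.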

On the blood test, in contrast with the NHS, the chatbot said my vitamin D is classified as "insufficient", which is the grey area between "deficient" and "good", and this is especially significant given that 1. I'm very active and 2. I'm dealing with viral recovery and vitamin D is important for the immune system. I've checked it, and it's right, my vitamin D is below optimal levels even though the NHS considers it normal. And that's with a multivitamin. So that's a significant performance improvement of the chatbot over the NHS.

On the day to day issues, the chatbot has given me a lot of information. It's diagnosed me with post viral autonomic dysfunction, meaning my autonomic nervous system is not functioning correctly. The autonomic nervous system has two branches: the parasympathetic, which is responsible for being calm, relaxed, and letting the body heal, and the sympathetic, which primes the body for fight or flight scenarios. In a healthy nervous system they work in tandem, with one activating and the other backing off at appropriate times. In post viral recovery the sympathetic branch can become overactive, which means you stay physically stressed for longer after exertion, and the parasympathetic branch less active, which means even during complete physical rest (like sleep) you struggle to reach a state where your body really starts recovering effectively. This is terrible for recovery and manifests in ways including exercise intolerance and crashes/relapses after doing too much.

The chatbot checks my biometrics and diary entries daily and has mostly recommended total rest with gentle walks. It wants to see more stability and upwards trends in my biometrics before reintroducing exercise, because 1. fitness will come back quickly once I'm able to train, and 2. every relapse I have prolongs recovery. Further actionable advice includes things like:

  • Control my hay fever, because the allergy response causes inflammation and the release of cytokines, which stresses the nervous system.
  • Take frequent breaks from deep concentration tasks because these stimulate the sympathetic nervous system.
  • Practice diaphragmatic breathing daily and especially if I feel tense, because this activates the parasympathetic branch.
  • Supplement with magnesium glycinate and omega 3, because these are calming for the nervous system. It also gives recommendations on timing.

This is all useful detail that I didn't get from the NHS.

Example output

The quality of the output is pretty good in terms of broad diagnosis. It explains exactly what I'm dealing with and gives me enough detail that I can check it independently. It also gives me actionable advice on how to proceed from here, with supporting arguments. It's given me a better understanding of my nervous system and has let me build the self awareness to feel what it's doing at any given time. It's one thing to be sat working and think "I feel a bit off"; it's much better to think "I feel a bit off and my heartbeat is a bit heavy, that means my sympathetic nervous system is going too hard, maybe I should close my eyes and do some breathing exercises until I feel it soften and avoid a crash later". Any athlete will probably already be aware of the feeling of sympathetic dominance; it's what you get when you're 12 hours after a hard race and you can feel that heaviness in your heartbeat and the vague sense of unease as your body is working overtime to repair all the damage. So it's just a matter of being aware of that class of feeling, noticing when it's appearing at inappropriate times, and trying to nudge it back to normal.

Is it perfect? No. It's worse than an intelligent human with the necessary domain knowledge. Large points of frustration are:

Logical reasoning - I had an interesting situation where I'd done a (very) short run in the afternoon, then had good biometrics until about 9PM when it all went downhill until I woke up at 3AM, then improved massively until the morning. This is hard for me to interpret because I basically have two completely opposing readings for the two halves of the night. The AI confidently diagnosed it as post exertional malaise. I then added the information that I was suffering from hay fever and it was raining heavily at 9PM when I took my dog out, and I'd woken up at 3AM. It told me that this strengthened the argument it was post exertional malaise. This is a clear logical fallacy; the extra information can be dismissed as irrelevant or weaken the argument but it cannot strengthen it. I pushed back with the theory that the rain concentrated the pollen and led to breathing issues overnight until I woke up and shifted and cleared my airways, which let my body relax. The AI wasn't having it. I then asked again in a fresh context and it said "The overnight pattern you described is a near-perfect fingerprint of upper airway resistance being acutely relieved." Alright then.

...Which leads me to the next flaw, inconsistency in decisions - I'm currently at the point where I could probably start reintroducing running, because the biometrics are looking promising and I'm feeling good. A week ago it was consistent in advising me to rest. But now I can press regenerate 4 times and get 4 different pieces of advice, ranging from full rest to a brisk walk to attempting run/walk intervals to a 45 minute continuous run (which is absurdly bad advice given my last 15 minute run set me back a week, and it knows this).

Chronological comprehension - I can see that it gets confused sometimes on details and chronology. I feed it a CSV of my data and it sometimes gets muddled, mixing data from one record with another. The biggest frustration is that no matter how I prompt, it will often fail to understand that the record for a given day's biometric data is taken before any activities are recorded for that day, so often it will tell me that my recorded activity for today caused my HRV to dip. No it didn't, you don't have an HRV measurement post activity yet.
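One thing that might be worth trying here is spelling the chronology out in the data itself rather than relying on the prompt. The sketch below is only an idea under assumptions, not something I can claim fixes the problem: the function name and the fields (`date`, `hrv_ms`, `activity`) are hypothetical, and it assumes the rows are already sorted by date. It rewrites each record as a sentence that pairs the day's HRV reading with the previous day's activity, since that is the activity that could actually have influenced it.

```python
def annotate_chronology(rows):
    """Turn date-sorted rows into sentences that state the order of events.

    Each row is a dict with hypothetical keys: date, hrv_ms, activity.
    The HRV reading for a day is taken before that day's activity, so the
    only activity that can have affected it is the previous day's.
    """
    lines = []
    previous_activity = None  # None means "outside the exported window"
    for row in rows:
        if previous_activity is None:
            before = "unknown"
        else:
            before = previous_activity or "rest"
        lines.append(
            f"{row['date']}: HRV {row['hrv_ms']} ms "
            f"(measured BEFORE this day's activity; preceded by: {before}). "
            f"Activity later that day: {row['activity'] or 'none'}."
        )
        previous_activity = row["activity"]
    return lines


rows = [
    {"date": "2026-05-29", "hrv_ms": 52, "activity": "15 min run"},
    {"date": "2026-05-30", "hrv_ms": 41, "activity": ""},
]
print("\n".join(annotate_chronology(rows)))
```

The design choice is to remove the ambiguity before the model ever sees the data, so it never has to infer which activity came before which reading.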

I've tried various models, including some bigger ones through OpenRouter, and while some are better than others, Qwen 3.6 and Gemma 4 running locally both produce output at about the same quality as anything else. The bigger models in OpenRouter are no better; they just run faster than local models on my laptop. There are interesting personality differences between the models though; Qwen 3.6 is consistently more conservative in its activity recommendations than Gemma 4. This may be an indication that Qwen is the better model, but we'll have to see what its recommendations are when I'm running again.

I think the summary is that the models are really good at broad diagnosis, because they have a wealth of medical information in their dataset and can easily match symptom sets to conditions, but they're bad at analysis of subtle details, because they're just text prediction engines and the more subtlety is involved in the input the less likely the input is to match anything they were trained on.

The bottom line is: Is it good? Not really. Is it better than the NHS? Yes.