Jason Bowen via TUHS <tuhs(a)tuhs.org> writes:

> May 26, 2025 11:57:18 Henry Bent <henry.r.bent(a)gmail.com>:
>
>> It's like Wikipedia.
>
> No, Wikipedia has (at least historically) human editors who supposedly
> have some knowledge of reality and history.
>
> An LLM response is going to be a series of tokens predicted based on
> probabilities from its training data. The output may correspond to a
> ground truth in the real world, but only because it was trained on
> data which contained that ground truth.
>
> Assuming the sources it cites are real works, it seems fine as a
> search engine, but the text that it outputs should absolutely not be
> thought of as something arrived at by similar means as text produced
> by supposedly knowledgeable and well-intentioned humans.
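
That "predicted based on probabilities" bit is the whole mechanism: at
each step the model emits a probability distribution over its
vocabulary, one token is sampled, appended, and fed back in. A toy
sketch of the loop (the "model" here is just random scores over a
made-up vocabulary, not any real LLM):

    import math
    import random

    # Toy "model": returns a raw score (logit) for each token in a tiny
    # vocabulary, given the context so far. A real LLM computes these
    # scores with a neural network over a vocabulary of ~100k tokens,
    # but the generation loop below is essentially the same.
    VOCAB = ["Unix", "was", "written", "in", "C", "assembler", "<eos>"]

    def toy_logits(context):
        # Purely illustrative: random scores stand in for whatever the
        # model learned from its training data.
        return [random.uniform(-1.0, 1.0) for _ in VOCAB]

    def softmax(logits, temperature=1.0):
        exps = [math.exp(x / temperature) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def generate(prompt, max_tokens=8, temperature=1.0):
        tokens = list(prompt)
        for _ in range(max_tokens):
            probs = softmax(toy_logits(tokens), temperature)
            # Sample the next token according to the predicted
            # probabilities.
            nxt = random.choices(VOCAB, weights=probs, k=1)[0]
            if nxt == "<eos>":
                break
            tokens.append(nxt)
        return " ".join(tokens)

    print(generate(["Unix", "was", "written", "in"]))

Nothing in that loop consults the world: whether "C" or "assembler"
comes out depends entirely on which continuation the training data
made more probable, which is why the output can read fluently and
still be wrong.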

LLMs are known to hallucinate sources. Here's a database of "legal
decisions in cases where generative AI produced hallucinated content":
https://www.damiencharlotin.com/hallucinations/

Here's a research paper about LLMs hallucinating software packages:
https://arxiv.org/abs/2406.10279
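
A cheap partial defence against hallucinated package names is to check
that a suggested name actually exists in the index before installing
it. A rough sketch against PyPI's public JSON API (existence says
nothing about whether it's the package you wanted, or a squatted
look-alike registered to catch exactly these hallucinated names):

    import sys
    import urllib.error
    import urllib.request

    def exists_on_pypi(name):
        # PyPI's JSON API returns 200 for known packages, 404 otherwise.
        url = "https://pypi.org/pypi/" + name + "/json"
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.status == 200
        except urllib.error.HTTPError as e:
            if e.code == 404:
                return False
            raise

    for name in sys.argv[1:]:
        if exists_on_pypi(name):
            print(name + ": exists")
        else:
            print(name + ": not on PyPI (possibly hallucinated)")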

Not to mention LLMs hallucinating 'facts' about people:
https://www.abc.net.au/news/2025-03-21/norwegian-man-files-complaint-chatgp…

As a result of what they're trained on, chatbots can be "confidently
wrong":

  [T]he chatbots often failed to retrieve the correct articles.
  Collectively, they provided incorrect answers to more than 60 percent
  of queries. Across different platforms, the level of inaccuracy
  varied, with Perplexity answering 37 percent of the queries
  incorrectly, while Grok 3 had a much higher error rate, answering 94
  percent of the queries incorrectly.

  -- https://www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-a…

and can contradict themselves:
https://catless.ncl.ac.uk/Risks/34/04/#subj2.1

Coming back to LLMs in the context of software:

  AI coding tools ‘fix’ bugs by adding bugs

  ...

  What happens when you give an LLM buggy code and tell it to fix it?
  It puts in bugs! It might put back the same bug!

  Worse yet, 44% of the bugs the LLMs make are previously known bugs.
  That number’s 82% for GPT-4o.

  ...

  I know good coders who find LLM-based autocomplete quite okay —
  because they know what they’re doing. If you don’t know what you’re
  doing, you’ll just do it worse. But faster.

  -- https://pivot-to-ai.com/2025/03/19/ai-coding-tools-fix-bugs-by-adding-bugs/

A research paper found that:

  participants who had access to an AI assistant based on OpenAI's
  codex-davinci-002 model wrote significantly less secure code than
  those without access. Additionally, participants with access to an AI
  assistant were more likely to believe they wrote secure code than
  those without access to the AI assistant. Furthermore, we find that
  participants who trusted the AI less and engaged more with the
  language and format of their prompts (e.g. re-phrasing, adjusting
  temperature) provided code with fewer security vulnerabilities.

  -- https://arxiv.org/abs/2211.03622
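
For anyone who hasn't met it, "temperature" is just a knob on the
sampling step: the model's raw scores are divided by it before being
turned into probabilities, so values below 1 sharpen the distribution
towards the most likely tokens and values above 1 flatten it. A
self-contained illustration with made-up scores:

    import math

    def softmax(logits, temperature=1.0):
        exps = [math.exp(x / temperature) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    logits = [2.0, 1.0, 0.1]  # arbitrary scores for three tokens
    for t in (0.5, 1.0, 2.0):
        print("temperature", t, "->",
              [round(p, 3) for p in softmax(logits, t)])

Lower temperature makes the output more deterministic, not more
correct; the distribution it sharpens is still just the one learned
from the training data.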

As someone who has spent a lot of time and energy working on wikis, I
would say that a big difference between LLMs and wikis is that one can
directly fix misinformation on wikis in a way that one can't do with
LLMs. And wikis typically provide public access to the trail of what
changes were made to the information, when, and by whom (or at least
from what IP address), unlike the information provided by LLMs.
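
That trail is machine-readable, too. A rough sketch of pulling a page's
recent revision history (who, when, edit summary) from the public
MediaWiki API, with an arbitrary page title as the example:

    import json
    import urllib.parse
    import urllib.request

    # Ask the MediaWiki API for the last few revisions of a page:
    # timestamp, user, and edit summary.
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "titles": "Unix",
        "rvprop": "timestamp|user|comment",
        "rvlimit": "5",
        "format": "json",
    })
    url = "https://en.wikipedia.org/w/api.php?" + params
    req = urllib.request.Request(
        url, headers={"User-Agent": "revision-trail-example/0.1"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        data = json.load(resp)

    for page in data["query"]["pages"].values():
        for rev in page.get("revisions", []):
            print(rev["timestamp"], rev["user"], "--",
                  rev.get("comment", ""))

There is nothing comparable one can query to find out where a given
LLM answer came from.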

The LLM cat is well and truly out of the bag. But the combination of
LLMs hallucinating information, together with the human tendency to
correlate the confident conveyance of information with the veracity of
that information, means that people should be encouraged to take LLM
output with a cellar of salt, and to check the output against other
sources.

Alexis.