Language models (LMs) improve with increased size and training data, yet the relationship between model scale and hallucinations remains underexplored. Defining hallucinations in LMs is difficult because they manifest in varied ways. A new study from Google DeepMind focuses on hallucinations where the correct answer appears verbatim in the training data. The researchers find that achieving low hallucination rates demands larger models and more compute than previously thought, and that hallucinations become harder to detect as LM size grows. Knowledge graphs (KGs) offer a promising way to provide structured, factual training data for LMs, potentially mitigating hallucinations.
The study investigates the relationship between LM scale and hallucinations, focusing on instances where correct answers are present verbatim in the training data. Using a knowledge graph (KG)-based dataset, the researchers train increasingly large LMs while retaining full control over the training content. The findings indicate that larger, longer-trained LMs hallucinate less, but achieving low hallucination rates requires far more compute than previously assumed. The study also reveals an inverse relationship between LM scale and hallucination detectability.
Precisely defining and quantifying hallucinations in natural language settings remains challenging because language is ambiguous and it is rarely clear what knowledge the training data actually contains. Despite advances in generative capabilities, hallucinations persist as a significant challenge for LMs, and little is known about how they depend on model scale. Knowledge graphs offer a structured alternative for LM training: every generated fact can be checked directly against the dataset, yielding a quantifiable measure of hallucination.
Traditional LMs trained on natural-language corpora often produce hallucinated and repetitive content, and the semantic ambiguity of such data makes it hard to pin down exactly how a model misrepresents what it was trained on. The study instead adopts a knowledge graph (KG) approach, representing each fact as a structured triplet, which gives a clearer picture of when LMs misrepresent their training data and allows a more precise evaluation of hallucinations and their relationship to model scale.
The study constructs a dataset from knowledge graph triplets of the form (subject, predicate, object), enabling precise control over the training data and a quantifiable measure of hallucination. LMs are trained from scratch on this dataset by optimizing the standard auto-regressive log-likelihood. Evaluation prompts each model with a subject and predicate and checks whether the completed object matches the knowledge graph. Hallucination detection is assessed with token-level detection tasks and detector heads. Throughout, the methodology focuses on hallucinations where the correct answer appears verbatim in the training set, and on how their frequency changes with LM scale.
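To make this measurable setup concrete, here is a minimal sketch, assuming a toy knowledge graph and a stub model rather than the paper's actual pipeline, of why a triplet dataset lets hallucinations be counted exactly: every (subject, predicate) prompt has a known set of correct objects, so any other completion counts as a hallucination. The names (triplet_to_text, hallucination_rate) and the triplets themselves are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

# Toy knowledge graph of (subject, predicate, object) triplets.
KG = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("Mount Everest", "located_in", "Nepal"),
]

def triplet_to_text(subject: str, predicate: str, obj: str) -> str:
    """Serialize a triplet into the plain-text form an LM would be trained on."""
    return f"{subject} {predicate} {obj}"

# Training corpus: every fact appears verbatim, so the ground truth is fully known.
train_corpus = [triplet_to_text(s, p, o) for s, p, o in KG]

# Index of correct objects for each (subject, predicate) prompt.
truth = defaultdict(set)
for s, p, o in KG:
    truth[(s, p)].add(o)

def hallucination_rate(model_complete, kg=KG) -> float:
    """Prompt with (subject, predicate) and count completions not backed by the KG.

    `model_complete` is any callable mapping a prompt string to a completion string.
    """
    errors = 0
    for s, p, _ in kg:
        completion = model_complete(f"{s} {p}").strip()
        if completion not in truth[(s, p)]:
            errors += 1
    return errors / len(kg)

# Example with a stub "model" that gets one fact wrong:
stub = {
    "Paris capital_of": "France",
    "Berlin capital_of": "Germany",
    "Mount Everest located_in": "China",  # hallucinated object
}
print(hallucination_rate(lambda prompt: stub[prompt]))  # -> 0.333...
```

Because the knowledge graph fully specifies what counts as correct, this kind of check scales to every prompt in the training set, which is exactly what natural-language corpora do not allow.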
The research trains increasingly large LMs to investigate how scale affects hallucination rates and detectability. The analysis shows that larger, longer-trained LMs hallucinate less, although larger datasets can increase hallucination rates. The authors acknowledge limitations: the results may not generalize to every type of hallucination, and the models studied are smaller than current state-of-the-art systems. Even so, the controlled setup offers concrete insight into LM hallucinations and how detectable they are.
The study finds that larger models and longer training reduce hallucinations on a fixed dataset, while increasing the dataset size raises hallucination rates. Hallucination detectors achieve high accuracy and improve with model size, and token-level detection generally outperforms other methods. There is a trade-off between fact recall and generalization: extended training minimizes hallucinations on facts seen during training but risks overfitting on unseen data. AUC-PR serves as the measure of detector performance. Together, these findings highlight the interplay between model scale, dataset size, and hallucination rates, and the need to balance model size and training duration against the challenges posed by larger datasets.
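As a brief illustration of the metric (not the paper's code), the snippet below scores a hypothetical hallucination detector with AUC-PR via scikit-learn's average_precision_score. Because hallucinations become rare for large, well-trained models, a precision-recall summary is more informative than plain accuracy on such an imbalanced task; the labels and scores here are made up.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# 1 = the generated statement is a hallucination, 0 = it is supported by the KG.
labels = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])

# Hypothetical detector scores (e.g., from a probe on the LM's hidden states).
scores = np.array([0.05, 0.10, 0.02, 0.20, 0.08, 0.15, 0.03, 0.90, 0.55, 0.40])

# average_precision_score summarizes the precision-recall curve (AUC-PR).
print(f"AUC-PR: {average_precision_score(labels, scores):.3f}")  # ~0.833 here
```

With only two positives among ten examples, a detector that ranks most hallucinations highly still gets penalized for the one negative it scores above a positive, which is the behavior AUC-PR is meant to capture.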
In conclusion, the study shows that larger, longer-trained language models exhibit lower hallucination rates, but driving hallucinations to a minimum requires substantial compute. With model size and training epochs held constant, larger datasets correlate with higher hallucination rates. There is also a trade-off between memorization and generalization: extended training improves fact retention but can hinder adaptation to new data. Paradoxically, as models grow larger and hallucinate less, the remaining hallucinations become harder to detect. Future work should focus on improving hallucination detection in larger models and on the practical implications of these findings for real-world language model applications.
Check out the Paper. All credit for this research goes to the researchers of this project.
Shoaib Nazir is a consulting intern at MarktechPost and has completed his M.Tech dual degree from the Indian Institute of Technology (IIT), Kharagpur. With a strong passion for Data Science, he is particularly interested in the diverse applications of artificial intelligence across various domains. Shoaib is driven by a desire to explore the latest technological advancements and their practical implications in everyday life. His enthusiasm for innovation and real-world problem-solving fuels his continuous learning and contribution to the field of AI.