
Meet Occiglot: A Large-Scale Research Collective for Open-Source Development of Large Language Models by and for Europe


A team of researchers in Europe has introduced Occiglot, a research collective formed to address the need for dedicated language modeling solutions for European languages. The collective aims to maintain Europe’s academic and economic competitiveness, AI sovereignty, and digital language equality. Its models are meant to reflect European values such as linguistic diversity and cultural richness, which are lacking in current large language models from big tech companies and deep tech startups, most of which focus on English.

Currently, the field of language modeling is dominated by a few major players, leaving European languages and cultural diversity underrepresented. In response, Occiglot has published Model Release v0.1, a set of intermediary 7B model checkpoints focused on the five largest European languages: English, German, French, Spanish, and Italian. The release is the result of bilingual continual pre-training and instruction tuning for each language, together with a multilingual model covering all five languages. The models are available under an open-source license on Hugging Face, with the aim of democratizing access to language models.
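To make the shape of the release concrete, the short Python sketch below enumerates the checkpoint variants the paragraph above describes: one bilingual model per language plus one five-language model, each in a continually pre-trained and an instruction-tuned variant. It is only an illustration. The assumption that each bilingual model pairs a language with English, the "eu5" label, the "example-7b" prefix, and the "-instruct" suffix are all hypothetical and are not taken from the release itself.

```python
from itertools import product

# Languages named in the article for Model Release v0.1.
LANGUAGES = ["en", "de", "fr", "es", "it"]

# Assumption: each bilingual model pairs one non-English language with English.
BILINGUAL = [f"{lang}-en" for lang in LANGUAGES if lang != "en"]

# Assumption: "eu5" is a placeholder label for the model covering all five languages.
MULTILINGUAL = ["eu5"]

# The two training stages the article describes.
STAGES = ["base", "instruct"]


def list_checkpoints(prefix="example-7b"):
    """Return hypothetical checkpoint names, one per (coverage, stage) pair."""
    names = []
    for coverage, stage in product(BILINGUAL + MULTILINGUAL, STAGES):
        # Base checkpoints get no suffix; instruction-tuned ones get "-instruct".
        suffix = "" if stage == "base" else "-instruct"
        names.append(f"{prefix}-{coverage}{suffix}")
    return names


if __name__ == "__main__":
    for name in list_checkpoints():
        print(name)
```

Under these assumptions the sketch lists ten names: four bilingual and one multilingual configuration, each in two variants. The actual identifiers on Hugging Face should be taken from the collective's own model pages.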

Occiglot’s approach combines continual pre-training and instruction tuning of transformer-based language models for each target language, starting from an existing model pre-trained on English. The models are then fine-tuned for each specific language, with attention to linguistic diversity and cultural nuance. This iterative process is intended to produce high-quality language models tailored to the European context. The collective also emphasizes collaboration within the community to gather large-scale training data, curate instruction-tuning datasets, and evaluate model performance accurately.

The performance of Occiglot’s language models is evaluated based on their ability to support diverse linguistic tasks and applications across different European languages. The release of intermediary model checkpoints marks a significant step towards achieving the long-term goal of creating a cohesive language modeling approach covering all official languages within the European Union and beyond. Furthermore, the commitment of hessian.AI to provide computing resources supports the initiative’s scalability and sustainability.

In conclusion, Occiglot’s initiative addresses the pressing need for accessible and culturally sensitive language models in Europe. By releasing open-source LLM checkpoints and fostering collaboration within the research community, the collective is paving the way for advances in language technology that align with European values of linguistic diversity and cultural richness.


Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a tech enthusiast with a keen interest in software and data science applications, and she follows developments across different fields of AI and ML.



