
Meet Occiglot: A Large-Scale Research Collective for Open-Source Development of Large Language Models by and for Europe


A team of researchers in Europe has introduced Occiglot, a research collective formed to address the need for dedicated European language modeling solutions. The collective aims to maintain Europe’s academic and economic competitiveness, AI sovereignty, and digital language equality. Its models are meant to reflect European values such as linguistic diversity and cultural richness, which are underrepresented in the current large language models from big tech companies and deep tech startups, most of which concentrate on English.

The field of language modeling is currently dominated by a few major players, leaving European languages and cultural diversity underrepresented. In response, Occiglot has published Model Release v0.1, a set of intermediary 7B model checkpoints focused on the five largest European languages: English, German, French, Spanish, and Italian. The release is the result of bilingual continual pre-training and instruction tuning for each language, together with a multilingual model covering all five languages. The models are available under an open-source license on Hugging Face, with the aim of democratizing access to language models.
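
For readers who want to try the checkpoints, here is a minimal sketch of one way to load one with the Hugging Face `transformers` library. The repository names are assumptions and are not given in this article, so check the Occiglot organization page on Hugging Face for the exact ids. The helper names `checkpoint_id` and `load_checkpoint` are illustrative only.

```python
from typing import Optional

# Assumed naming scheme for the Hub repositories (not stated in the article).
ORG = "occiglot"
BILINGUAL = ("de", "fr", "es", "it")  # each assumed to be paired with English


def checkpoint_id(lang: Optional[str] = None, instruct: bool = False) -> str:
    """Return an assumed repo id: bilingual if `lang` is given, else five-language."""
    if lang is None:
        name = "occiglot-7b-eu5"
    elif lang in BILINGUAL:
        name = f"occiglot-7b-{lang}-en"
    else:
        raise ValueError(f"no assumed checkpoint for language {lang!r}")
    if instruct:
        name += "-instruct"
    return f"{ORG}/{name}"


def load_checkpoint(repo_id: str):
    """Download tokenizer and weights (needs `transformers`, PyTorch, and network access)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model


if __name__ == "__main__":
    # Only builds the ids; call load_checkpoint(...) explicitly to download weights.
    print(checkpoint_id("de"))
    print(checkpoint_id(instruct=True))
```

A 7B checkpoint is a download of several gigabytes, which is why the sketch keeps the loading step in a separate function that runs only when called.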

Occiglot’s approach relies on continual pre-training and instruction tuning of transformer-based language models for each target language, starting from an existing pre-trained model for English. The models are then fine-tuned and optimized for each specific language, with a focus on linguistic diversity and cultural nuances. This iterative process is intended to yield high-quality language models tailored to the European context. The collective also emphasizes collaboration within the community to gather large-scale training data, curate instruction-tuning datasets, and evaluate model performance accurately.
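
The difference between the two training stages named above can be illustrated with a small sketch of how training examples are commonly prepared. This is a generic illustration and not Occiglot's actual pipeline: the toy tokenizer and the function names are hypothetical, and masking the prompt during instruction tuning is one common convention that the collective may or may not use.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one integer id per whitespace-separated word."""
    return [sum(map(ord, word)) % 1000 for word in text.split()]


def pretraining_example(text):
    """Continual pre-training: raw target-language text, every token contributes to the loss."""
    ids = toy_tokenize(text)
    return {"input_ids": ids, "labels": list(ids)}


def instruction_example(prompt, response):
    """Instruction tuning: prompt tokens are masked so only the response is learned."""
    prompt_ids = toy_tokenize(prompt)
    response_ids = toy_tokenize(response)
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + response_ids,
    }


if __name__ == "__main__":
    print(pretraining_example("Guten Morgen Europa"))
    print(instruction_example("Übersetze ins Deutsche: good morning", "Guten Morgen"))
```

In both cases the one-position shift needed for next-token prediction is left to the training code. The only difference shown here is which positions are allowed to contribute to the loss.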

Occiglot’s language models are evaluated on how well they support diverse linguistic tasks and applications across different European languages. The release of intermediary model checkpoints marks a significant step toward the long-term goal of a cohesive language modeling approach covering all official languages of the European Union and beyond. In addition, hessian.AI has committed to providing computing resources, which supports the initiative’s scalability and sustainability.

In conclusion, Occiglot’s initiative addresses the pressing need for accessible and culturally sensitive language models in Europe. By releasing open-source LLM checkpoints and fostering collaboration within the research community, the collective is paving the way for advances in language technology that align with European values of linguistic diversity and cultural richness.


Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a tech enthusiast with a keen interest in the scope of software and data science applications, and she is always reading about developments in different fields of AI and ML.



