The European Union has moved to reduce its dependence on US tech giants by launching an ambitious open-source language model project. Dubbed OpenEuroLLM, the initiative aims to develop a series of "truly" open-source language models that cover all 24 official EU languages, as well as the languages of countries currently negotiating to join the EU market. The move is seen as a key step towards digital sovereignty in the region.
The project is a collaboration between more than 20 organizations, co-led by Jan Hajič, a computational linguist at Charles University in Prague, and Peter Sarlin, CEO and co-founder of Finnish AI lab Silo AI, which AMD acquired in 2024 for $665 million. The project's budget is €37.4 million, with roughly €20 million coming from the EU's Digital Europe Programme. While this is a modest sum compared to the billions being poured into AI research by corporate giants, the project's partners believe that their collective expertise and resources will be sufficient to achieve their goals.
The OpenEuroLLM project is part of a broader push in Europe to treat digital sovereignty as a priority. In recent years, the EU has signed an $11 billion deal to create a sovereign satellite constellation to rival Elon Musk's Starlink, and cloud giants have been investing in European infrastructure so that EU data stays in the region. OpenAI also recently unveiled an offering that allows customers to process and store data in Europe, further underscoring how much weight the region places on digital sovereignty.
However, some have questioned whether the OpenEuroLLM project's goals are achievable, given the sheer number of disparate participating parties. Anastasia Stasenko, co-founder of LLM company Pleias, noted that a "sprawling consortia of 20+ organizations" may struggle to match the focus and agility of a private AI firm. Nevertheless, Hajič is confident that the project's diverse range of partners will bring a unique set of skills and perspectives to the table.
The project is building on the foundations laid by the High Performance Language Technologies (HPLT) project, which has been developing free and reusable datasets, models, and workflows using high-performance computing (HPC) since 2022. Hajič expects the first version of the OpenEuroLLM model to be released by mid-2026, with the final iteration arriving by the project's conclusion in 2028. While the project has only just started, Hajič is optimistic that the collective expertise of the partners will enable them to get up to speed quickly.
The project's top-line goal is to create a series of foundation models for transparent AI in Europe, preserving the linguistic and cultural diversity of all EU languages. The models will be designed for general-purpose tasks where accuracy is paramount, with smaller "quantized" versions potentially being developed for edge applications where efficiency and speed are more important. Hajič emphasized that the project's focus is on creating high-quality models that meet the needs of European users, rather than simply trying to outmaneuver Big Tech.
The project's definition of "open source" is likely to be a topic of debate, with some arguing that true open-source AI requires not only the final models but also the datasets, pretrained models, and weights to be freely available. The OpenEuroLLM project may need to make some compromises on this front: Hajič is committed to ensuring that the models are of the highest quality possible, even if that means some of the training data cannot be redistributed.
The launch of OpenEuroLLM has also sparked comparisons with a similar project, EuroLLM, which launched its first model in September and a follow-up in December. EuroLLM has goals similar to those of OpenEuroLLM, but a smaller consortium of nine partners. Hajič acknowledged the similarities, but stressed that OpenEuroLLM is restricted in its collaborations with non-EU entities because of its EU funding.
Despite the challenges ahead, the OpenEuroLLM project represents a significant step towards promoting digital sovereignty in Europe. By developing open-source language models that meet the needs of European users, the project aims to reduce the region's dependence on US tech giants and promote a more diverse and resilient AI ecosystem.