Bielik – the first Polish language model developed at the AGH University

28-08-2024

The AGH University Academic Computer Centre Cyfronet has provided the computational resources of the two fastest supercomputers in Poland, Helios and Athena, for the purpose of creating Bielik, the first Polish language model.

Bielik-11B-v2 – a new Polish large language model

Bielik has been developed through the joint efforts of SpeakLeash and the AGH University Academic Computer Centre Cyfronet. It is a Polish large language model (LLM) with 11 billion parameters.

SpeakLeash – a group of enthusiasts and creators of Bielik

SpeakLeash is a foundation connecting people of various professions. This group of enthusiasts has decided to aim high and create the largest Polish text database, following the example of foreign initiatives like The Pile. The project team includes employees of Polish enterprises, researchers, and students of AI-related fields of study. The work on the Polish language model took over a year, and its initial scope covered, among other things, data collection, processing, and classification.

“The most challenging task was to obtain data in Polish. We must operate only on source data and we must know where it comes from,” explains Sebastian Kondracki, Bielik’s originator.

Currently, the resources of SpeakLeash are the largest, best described and documented collection of data in Polish.

Helios and Athena – computational power for science

Supercomputers from the AGH University Academic Computer Centre Cyfronet allowed Bielik to spread its wings.

The cooperation between the AGH University staff and SpeakLeash provided the computing power needed to create the model and gave the SpeakLeash team the expertise and scientific knowledge needed for the project to succeed.

Cyfronet supported the project in the optimisation and scaling of training processes, the work on data processing pipelines, and the development and operation of synthetic data generation methods and model testing methods. This work resulted in the Polish ranking of models (Polish OpenLLM Leaderboard). The experience and knowledge gained through this cooperation allowed the team of PLGrid experts to prepare guidelines and optimised solutions for scientific users, including computing environments for working with language models on the Athena and Helios clusters.

“We used the capacity and resources of Helios, currently the fastest machine in Poland, to train language models,” says Marek Magryś, Deputy Director of AGH University Cyfronet for High Performance Computers. “Our role is to provide support with our expertise, experience, and above all with computational power in data cataloguing, collecting, and processing, as well as in training language models. Thanks to the joint efforts of SpeakLeash and the AGH University, we have managed to create Bielik, an LLM which handles our language and cultural context perfectly well and which may be a key element of text data processing pipelines for our language in scientific and business uses. High positions on ranking lists for Polish are only a confirmation of Bielik’s quality.”

The computational power of Helios and Athena amounts to over 44 PFLOPS in traditional computer simulations and reaches as much as 2 EFLOPS in lower-precision AI calculations.

“If we operate on such extensive data, as in the case of Bielik, the infrastructure required for this purpose exceeds the capacity of a regular computer. We must have adequate computational power at our disposal, so that we can prepare and compare data and train models. Availability of such supercomputers is the issue here, and only a small number of companies may perform such actions on their own. Luckily, the AGH University has that kind of resources at hand,” adds Professor Kazimierz Wiatr, Director of Cyfronet.

Several thousand researchers representing multiple fields take advantage of Cyfronet’s supercomputers on a daily basis. Advanced modelling and numerical computations are used mainly in chemistry, biology, physics, medicine, and materials technology, as well as astronomy, geology, and environmental protection. Available as part of the PLGrid infrastructure, the supercomputers in Cyfronet are also used for high energy physics (projects: ATLAS, LHCb, ALICE, CMS), astrophysics (CTA, LOFAR), earth sciences (EPOS), the European Spallation Source (ESS), gravitational waves research (LIGO/Virgo), and biology (WeNMR).

“To train Bielik, we use the two fastest supercomputers in Poland, Athena and Helios, but in comparison to the infrastructure of world leaders there is still much room for improvement. On top of that, several hundred other users are using supercomputer resources at the same time,” Marek Magryś explains. “However, our systems complete in a few hours or days computations which could take years or even hundreds of years on regular computers.”
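The scale of that gap can be illustrated with back-of-the-envelope arithmetic. The sketch below takes the 44 PFLOPS quoted above for traditional simulations and compares it with a regular computer assumed to sustain about 1 TFLOPS. That desktop figure is an illustrative assumption, not a number from the article, and both machines are assumed to scale ideally. Real workloads reach only a fraction of peak performance, so the output should be read as an order-of-magnitude estimate.

```python
# Rough comparison of run time on a supercomputer and on a regular computer.
# SUPERCOMPUTER_FLOPS is the "over 44 PFLOPS" figure quoted in the article.
# DESKTOP_FLOPS is an assumed, illustrative value for a regular computer.

SUPERCOMPUTER_FLOPS = 44e15  # 44 PFLOPS
DESKTOP_FLOPS = 1e12         # assumption: 1 TFLOPS sustained

HOURS_PER_YEAR = 24 * 365


def desktop_years(supercomputer_hours: float) -> float:
    """Years a desktop would need for a job that takes the given number of
    hours on the supercomputer, assuming ideal scaling on both machines."""
    total_operations = supercomputer_hours * 3600 * SUPERCOMPUTER_FLOPS
    desktop_hours = total_operations / DESKTOP_FLOPS / 3600
    return desktop_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    for hours in (1, 24, 72):
        years = desktop_years(hours)
        print(f"{hours:>2} h on the supercomputer ~ {years:,.0f} years on the desktop")
```

Under these assumptions, one hour of supercomputer time corresponds to roughly five years on the desktop, and one day to more than a century.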

Bielik vs ChatGPT

The creators make it abundantly clear that “the collection of data powering Bielik is continuously growing, yet it will be difficult for us to compete with resources used by other models which operate in English. Besides, the amount of Polish content online is substantially smaller than that of English content.”

The most popular product taking advantage of a large language model is ChatGPT, based on the resources of OpenAI. However, the need to develop language models in other languages is justified.

As stressed by Marek Magryś:

“While ChatGPT is able to speak Polish, it is saturated with content in English, so its understanding of Polish culture and the nuances of Polish literature is limited. It also does not truly cope with understanding the logic of more complex texts, such as legal or medical ones. If we want to use it in specialist fields and have a language model that thinks well in Polish and responds using correct Polish, we cannot rely only on foreign language models.”

The publicly available version that users can test is free of charge and is still being improved. In addition to the full versions of the models, the authors have made available a range of quantised versions in the most popular formats, so that the model can be run on your own computer.
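To see why quantised versions matter for running the model on a personal computer, it helps to estimate the memory needed just to hold the weights. The sketch below uses the 11 billion parameters mentioned earlier. The bit widths are common illustrative choices, not a list of the formats the authors actually published, and the estimate ignores activations, the context cache, and file format overhead.

```python
# Approximate memory needed to store the weights of an 11-billion-parameter
# model at different precisions. Bit widths are illustrative assumptions.

PARAMETERS = 11e9  # 11 billion parameters, as stated in the article


def weight_memory_gb(bits_per_weight: float, parameters: float = PARAMETERS) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return parameters * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: about {weight_memory_gb(bits):.1f} GB")
```

Under these assumptions, 16-bit weights need about 22 GB, while 4-bit weights need about 5.5 GB, which is within reach of many consumer machines.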

“It is worth knowing that Bielik will perform well in terms of providing summaries or short descriptions. At this moment, our language model is useful in science and business; it may be used to improve communication with users when handling helpdesk requests,” says Szymon Mazurek from Cyfronet.

Why are Polish language models worth developing?

The creators of Bielik explain that artificial intelligence services available online, such as the most popular one, ChatGPT, run on external servers. If a company or industry develops a solution operating on specialist data, for example medical data or confidential texts which for various reasons may not leave the company, the only option is to run such a model on site. The model may not be as perfect as ChatGPT, but it does not have to be as broad.

An additional benefit stemming from the launch of language models such as Bielik is the reinforcement of Poland’s position in terms of innovation in the AI sector. Moreover, as emphasised by Bielik’s creators, it is worth building our own tools and becoming independent of external companies, which, in the event of market turbulence, regulations, or legal restrictions, may hinder access to their resources. By developing and improving tools in Poland, we are building a stable base and are able to secure many of our sectors, whether banking, administrative, medical, or legal.

“Intensified actions on the development of AI, language models, and other AI-based tools are in the best interest of all well-functioning economies. We observe increased efforts in the development of similar solutions in multiple countries,” adds Jan Maria Kowalski from SpeakLeash.

***

The explanation, generated by Bielik, of why it is worth developing language models in languages other than English:

Developing language models in languages other than English may bring plenty of benefits, both in linguistic studies and in practical applications of technology. These are some of the reasons why it is worth it:
Better understanding of language diversity: English-based language models may not fully take into account the specificity of other languages, which introduces prejudice, stigma, or inaccuracy in the understanding and translation of content in different languages. Developing models for other languages allows for better understanding of their grammar, vocabulary, and cultural context.
Improved quality of translations: language models play a key role in automatic translation of texts. Developing models for less popular languages may improve the quality of translations by eliminating errors and ambiguities resulting from a lack of understanding of the specificity of those languages.
Support for language communities: developing language models for minority languages may have a positive impact on these communities, facilitating communication and access to information in their mother tongue.
Opening doors to research on language diversity: language models may help study similarities and differences between languages, which is essential for understanding the evolution and divergence of languages worldwide.
Improvement of AI tools: the development of language models in various languages broadens the scope of the applications of AI technology by making it possible to create more advanced systems using natural languages in a more precise and accurate manner.
Protection of global language diversity: popularisation of foreign languages often leads to the marginalisation of local dialects and languages. Developing language models of less-known languages may help protect language diversity in the entire world.

Useful links:

Test Bielik in Polish: bielik.ai
Current amount of data: speakleash.org/dashboard
