The AGH University Academic Computer Centre Cyfronet has provided the computational resources of the two fastest supercomputers in Poland, Helios and Athena, for the purpose of creating Bielik, the first Polish language model.
Bielik is the result of joint efforts by SpeakLeash and the AGH University Academic Computer Centre Cyfronet. It is a Polish large language model (LLM) with 11 billion parameters.
SpeakLeash is a foundation connecting people of various professions. This group of enthusiasts decided to aim high and create the largest Polish text database, following the example of foreign initiatives like The Pile. The project team includes employees of Polish enterprises, researchers, and students of AI-related fields of study. The work on the Polish language model took over a year, and its initial scope entailed, among many other tasks, data collection, processing, and classification.
“The most challenging task was to obtain data in Polish. We must operate only on source data and we must know where it comes from,” explains Sebastian Kondracki, Bielik’s originator.
Currently, the resources of SpeakLeash are the largest, best described and documented collection of data in Polish.
Supercomputers at the AGH University Academic Computer Centre Cyfronet allowed Bielik to spread its wings.
The cooperation between the AGH University staff and SpeakLeash gave the project access to the computing power needed to create the model and provided the SpeakLeash team with the expertise and scientific knowledge needed for the project to succeed.
Cyfronet supported the project in optimising and scaling the training processes, in the work on data processing pipelines, and in developing and operating methods for synthetic data generation and model testing. One result of this work is the Polish ranking of models (Polish OpenLLM Leaderboard). The experience and knowledge gained through this cooperation enabled the team of PLGrid experts to prepare guidelines and optimised solutions, including computing environments for working with language models on the Athena and Helios clusters, for the needs of scientific users.
“We used the capacity and resources of Helios, currently the fastest machine in Poland, to train language models,” says Marek Magryś, Deputy Director of AGH University Cyfronet for High Performance Computers. “Our role is to provide support with our expertise, experience, and above all with computational power in data cataloguing, collecting, and processing, as well as in training language models. Thanks to the joint efforts of SpeakLeash and the AGH University, we have managed to create Bielik, an LLM which handles our language and cultural context very well and which may be a key element of text data processing pipelines for our language in scientific and business uses. High positions on ranking lists for Polish only confirm Bielik’s quality.”
The combined computational power of Helios and Athena in traditional computer simulations amounts to over 44 PFLOPS, and for lower-precision AI calculations it reaches as much as 2 EFLOPS.
“If we operate on data as extensive as in the case of Bielik, the infrastructure required for this purpose exceeds the capacity of a regular computer. We must have adequate computational power at our disposal so that we can prepare and compare data and train models. Availability of such supercomputers is the issue here, and only a small number of companies can perform such work on their own. Luckily, the AGH University has that kind of resources at hand,” adds Professor Kazimierz Wiatr, Director of Cyfronet.
A few thousand researchers representing multiple fields take advantage of Cyfronet’s supercomputers on a daily basis. Advanced modelling and numerical computations are used mainly in chemistry, biology, physics, medicine, and materials technology, as well as astronomy, geology, and environmental protection. Available as part of the PLGrid infrastructure, the supercomputers at Cyfronet are also used for high energy physics (projects: ATLAS, LHCb, ALICE, CMS), astrophysics (CTA, LOFAR), earth sciences (EPOS), the European Spallation Source (ESS), gravitational wave research (LIGO/Virgo), and biology (WeNMR).
“To train Bielik, we use the two fastest supercomputers in Poland, Athena and Helios, but in comparison to the infrastructure of world leaders there is still much room for improvement. On top of that, several hundred other users are using supercomputer resources at the same time,” Marek Magryś explains. “However, our systems complete in a few hours or days computations which could take years or even hundreds of years on regular computers.”
The creators make it abundantly clear that “the collection of data powering Bielik is continuously growing, yet it will be difficult for us to compete with the resources used by other models which operate in English. Besides, the amount of Polish content online is substantially smaller than that of English content.”
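To give a sense of the scale behind this comparison, the short sketch below converts time spent on the clusters into the time an ordinary computer would need for the same number of operations. It is a back-of-the-envelope illustration only: the 44 PFLOPS figure is the combined simulation performance quoted in this article, while the 1 TFLOPS desktop, the function name `equivalent_years`, and the assumption that a job uses the whole machine at full efficiency are all hypothetical simplifications.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def equivalent_years(cluster_flops: float, desktop_flops: float, cluster_days: float) -> float:
    """Years a desktop would need to match `cluster_days` of cluster work.

    Assumes both machines sustain their nominal speed (an idealisation).
    """
    total_operations = cluster_flops * cluster_days * SECONDS_PER_DAY
    return total_operations / desktop_flops / SECONDS_PER_YEAR


CLUSTER_FLOPS = 44e15  # "over 44 PFLOPS", as stated in the article
DESKTOP_FLOPS = 1e12   # assumed 1 TFLOPS desktop, purely illustrative

if __name__ == "__main__":
    years = equivalent_years(CLUSTER_FLOPS, DESKTOP_FLOPS, cluster_days=1)
    print(f"One cluster-day is about {years:.0f} desktop-years")
```

Under these assumptions, a single day of the clusters' combined simulation performance corresponds to roughly 120 years on the assumed desktop, which is consistent with the order of magnitude mentioned in the quote. Real jobs share the machines with other users and never reach peak performance, so the actual ratio is smaller.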
The most popular product taking advantage of a large language model is ChatGPT, based on the resources of OpenAI. However, the need to develop language models in other languages is justified.
As stressed by Marek Magryś:
“While ChatGPT is able to speak Polish, it is saturated with content in English, so its understanding of Polish culture and the nuances of Polish literature is limited. It also does not really cope with the logic of more complex texts, such as legal or medical ones. If we want to use a language model in specialist fields and have one that thinks well in Polish and responds in correct Polish, we cannot rely only on foreign language models.”
The publicly available version that users may test is free of charge and is still being improved. In addition to the full versions of the models, the authors have made available a range of quantised versions in the most popular formats, so that the model can be run on your own computer.
“It is worth knowing that Bielik will perform well in terms of providing summaries or short descriptions. At this moment, our language model is useful in science and business. It may be used, for example, for improving communication with users when handling helpdesk requests,” says Szymon Mazurek from Cyfronet.
The creators of Bielik explain that artificial intelligence services available online, such as the most popular one, ChatGPT, run on external servers. If a given company or industry develops a solution operating on specialist data, for example medical data or confidential texts which for various reasons may not leave the walls of a company, the only possibility is to run such a model on site. The model may not be as perfect as ChatGPT, but it does not have to be as broad.
An additional benefit stemming from the launch of language models such as Bielik is the reinforcement of Poland’s position in terms of innovation in the AI sector. Moreover, as emphasised by Bielik’s creators, it is worth building our own tools and becoming independent of external companies, access to whose resources may be hindered by market turbulence, regulations, or legal restrictions. By developing and improving tools in Poland, we are building a stable base and are able to secure many of our sectors, whether banking, administrative, medical, or legal.
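Why quantisation matters for running the model at home can be shown with simple arithmetic: the memory needed to hold the weights is roughly the number of parameters multiplied by the number of bits stored per parameter. The sketch below applies this to the 11 billion parameters mentioned in this article. The bit widths are illustrative assumptions rather than a list of the formats the authors actually published, the function name `weight_memory_gib` is hypothetical, and the estimate ignores activations, context cache, and format overhead.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


N_PARAMS = 11e9  # parameter count stated in the article

# Illustrative precisions only; not the formats confirmed by the authors.
PRECISIONS = [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]

if __name__ == "__main__":
    for label, bits in PRECISIONS:
        print(f"{label}: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights take about 20.5 GiB at 16 bits, 10.2 GiB at 8 bits, and 5.1 GiB at 4 bits, which is why lower-precision versions are the ones that tend to fit on consumer hardware.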
“Intensified actions on the development of AI, language models, and other AI-based tools are in the best interest of all well-functioning economies. We observe increased efforts in the development of similar solutions in multiple countries,” adds Jan Maria Kowalski from SpeakLeash.