Pat Brans Associates/Grenoble Ecole de Management
Published: 07 Jan 2022
At the time it was installed in the summer of 2018, Tetralith was more than just the fastest of the six traditional supercomputers at the National Supercomputer Centre (NSC) at Linkoping University – it was the Nordic region’s most powerful supercomputer.
But three years later, Tetralith needed to be complemented by a new system that would specifically address the needs of machine learning and fast-evolving artificial intelligence (AI). Tetralith wasn’t designed for machine learning – it didn’t have the parallel processing power needed to handle the increasingly large datasets used to train AI algorithms.
To support research programmes that rely on AI in Sweden, the Knut and Alice Wallenberg Foundation donated €29.5m to have a bigger supercomputer built. Berzelius was delivered in 2021 and began operation in the summer. The supercomputer has twice the computing power of Tetralith and takes its name from Jacob Berzelius, a well-known scientist who was born in Ostergotland, Sweden, where the NSC is situated.
Atos delivered and installed Berzelius, which includes 60 of Nvidia’s latest and most powerful servers – DGX systems with eight graphics processing units (GPUs) each. Nvidia networking links the servers together, with 1.5PB (petabytes) of storage shared between them. Atos also provided its Codex AI Suite application tool set to assist researchers. The entire system is housed in 17 racks, which when placed side by side extend to about 10 metres.
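The headline figures above multiply out to the machine’s total GPU count – a trivial check, using only the numbers stated in the article:

```python
# Configuration described above: 60 Nvidia DGX servers, eight GPUs each,
# sharing 1.5 PB of storage.
servers = 60
gpus_per_server = 8

total_gpus = servers * gpus_per_server
print(total_gpus)  # prints 480
```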
The system will be used for AI research – not only the large programmes funded by the Knut and Alice Wallenberg Foundation, but also other academic projects whose researchers apply for time on the system. While most users will reside in Sweden, some may come from other countries whose scientists collaborate with Swedish researchers. The biggest areas of Swedish research that will use the system in the near future are autonomous systems and data-driven life sciences – both of which require a lot of machine learning on large datasets.
NSC plans to hire staff to assist users – not to replace core programmers, but to help users put together existing parts. Many software libraries are available for AI, and they must be understood and used correctly. Researchers using the system often do their own programming, hire assistants or adapt existing open-source projects to meet their needs.
“So far, around 50 projects have been granted time on Berzelius,” says Niclas Andresson, technology manager at NSC. “The system isn’t yet fully used, but utilisation is growing. Some problems require a significant portion of the system. We had a hackathon for NLP [natural language processing], which used the system well. Nvidia provided a toolbox for NLP that scales up to the big machine.”
In fact, the greatest challenge researchers face is scaling the software they use to keep up with the new computing power. Many of them have only one or two GPUs in their desktop computers, and scaling their algorithms to hundreds of GPUs can be difficult.
Now, the Swedish researchers have the chance to think big.
Swedish AI researchers have used supercomputer resources for many years. In the beginning they used CPUs, but in recent years GPUs have made their way from the gaming industry into supercomputing, where their massively parallel structure has taken number crunching to a whole new level. While the original GPUs were intended for image rendering, they are now used for machine learning and other applications.
“Without the availability of supercomputing resources for machine learning, we couldn’t be successful in our experiments,” says Michael Felsberg, professor at the Computer Vision Laboratory at Linkoping University. “Although the supercomputer does not solve all our problems, it is an important ingredient. We wouldn’t be able to get anywhere without it. It would be like a chemist without a Petri dish, or a physicist without a clock.”
“Without the availability of supercomputing resources for machine learning, we couldn’t be successful in our experiments. Although the supercomputer does not solve all our problems, it is an important ingredient. We wouldn’t be able to get anywhere without it”
Michael Felsberg, Linkoping University
Felsberg was part of the group that helped define the requirements for Berzelius. He also serves on the allocation committee, which decides which projects get time on the cluster, how that time is allocated and how usage is calculated.
He insists that a large supercomputer is necessary – but it also has to be the right kind of supercomputer. “We have terabytes of data and need to process them thousands of times. We have a very regular computational structure, which lets us use a single instruction on multiple data. This is exactly where GPUs are very powerful,” says Felsberg.
“It’s not just the number of calculations that matters, but how the calculations are organised,” he says, adding that modern GPUs do exactly what is needed: large-scale matrix product calculations. “GPU-based systems were first introduced in Sweden in 2005. However, at the time they were small and difficult to access. Now we have what we need.”
Massively parallel processing and huge data transfers
“Our research is not a one-off run lasting over a month. Instead, we might have as many as 100 runs, each lasting two days,” says Felsberg, adding that during those two days, huge memory bandwidth is used and local filesystems are crucial.
When machine learning algorithms run on GPU-equipped supercomputers, a lot of calculations are performed – but a lot of data is also moved. The bandwidth and throughput between the storage system and the computational nodes must be very high. Machine learning requires terabyte-scale datasets, and a given dataset may need to be read up to 1,000 times during a single run over a period of two days. All nodes and memory must be on the same bus.
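A back-of-envelope calculation shows why this bandwidth matters. Assuming a 1TB dataset (the article says only “terabyte datasets”, so the exact size here is an assumption) read 1,000 times over a two-day run:

```python
# Rough estimate of the sustained read bandwidth implied above.
# Assumption: a 1 TB dataset, read 1,000 times over a two-day run.
dataset_bytes = 1e12           # 1 TB (assumed size)
reads = 1000                   # dataset read up to 1,000 times per run
run_seconds = 2 * 24 * 3600    # two days

bandwidth_gbs = dataset_bytes * reads / run_seconds / 1e9
print(f"{bandwidth_gbs:.1f} GB/s")  # prints 5.8 GB/s
```

Nearly 6GB/s of sustained reads for 48 hours straight is far beyond what ordinary network-attached storage delivers, which is why the storage and compute nodes need to share a high-bandwidth interconnect.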
“Modern GPUs have thousands of cores,” adds Felsberg. “They all run simultaneously on different data, but they all execute the same instruction – the single-instruction/multiple-data model. That is what you have on each chip. Then you can have multiple boards with the same chips, and sets of boards in the machine, which gives you huge resources on one bus. And machine learning can often be split across multiple nodes.”
“We use many GPUs simultaneously and share data and learning among them all, which gives a significant speed increase. Imagine if this were done on one chip – it would take more than a month. But if you split it across a massively parallel architecture – let’s say, 128 chips – you get the result of the machine learning much, much faster, which means you can analyse the result and see the outcome. Based on the outcome, you run the next experiment,” he says.
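The pattern Felsberg describes – many processors each computing on their own slice of the data, then sharing what they learned – can be sketched in a few lines. This is a toy simulation in plain Python (the model, data and learning rate are made up for illustration); real systems would use a framework such as PyTorch’s DistributedDataParallel, with each worker running on its own GPU:

```python
# Toy sketch of data-parallel training: each "worker" (simulated here as a
# function call) computes gradients on its own shard of the data, and the
# gradients are averaged before every update -- the sharing step that real
# systems implement as an all-reduce across GPUs.

def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w*x on one shard."""
    n = len(shard)
    return sum(2 * (w * x - y) * x for x, y in shard) / n

def data_parallel_step(w, shards, lr=0.005):
    # Each worker computes its gradient independently (in parallel on GPUs).
    grads = [local_gradient(w, shard) for shard in shards]
    # Average the gradients across workers, then apply one shared update.
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad

# Made-up dataset following y = 3*x, split across four simulated workers.
data = [(x, 3.0 * x) for x in range(1, 17)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
print(round(w, 2))  # prints 3.0 -- all workers converge on the same weight
```

Because every worker applies the same averaged gradient, the final weight is identical to what a single machine training on the full dataset would produce – the work is simply divided.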
“Another challenge is the large parameter spaces, which make it impossible to cover everything in our experiments. We need to use heuristics and smarter search strategies to find what we are looking for in the parameter spaces, and that requires knowing the results of previous runs. It makes the work more like a series of experiments than a single one. Therefore, it’s very important that each run be as short as possible, to squeeze out as many runs as possible, one after the other.”
“Now, with Berzelius in place, this is the first time in the 20 years I’ve been working on machine learning for computer vision that we really have sufficient resources in Sweden to do our experiments,” says Felsberg. “Before, there was always a bottleneck in the computer. Now, the bottleneck is somewhere else – a bug in the code, a flawed algorithm, or a problem with the dataset.”
The beginning of a new era in life sciences research
“We do research in structural biology,” says Bjorn Wallner, professor at Linkoping University and head of the bioinformatics division. “This involves trying to understand how the elements of a molecule are organised in three-dimensional space. Once you understand that, you can develop drugs that target specific molecules and bind to them.”
Most of the research is tied to a specific disease, because that addresses an immediate problem. Sometimes, however, the Linkoping bioinformatics group conducts pure research to gain a better understanding of biological structures.
The group uses AI to make predictions about specific proteins. DeepMind, a Google-owned company, has done work that has given rise to a revolution in structural biology – and it relies on supercomputers.
DeepMind developed AlphaFold, an AI algorithm trained on very large datasets from biological experiments. The supervised training produced the “weights” – the neural network parameters that can then be used to make predictions. AlphaFold is now available as an open-source project and can be used by researchers such as Linkoping University’s Bjorn Wallner.
“Berzelius gives us a lot more throughput and lets us push the boundaries in our research. Google is huge and can do many big things. But now we are able to compete”
Bjorn Wallner, Linkoping University
There is still much to learn in structural biology. AlphaFold offers a novel way to find the 3D structures of proteins, but this is only the tip of the iceberg – digging deeper will require supercomputing power. It is one thing to understand a single protein in a static state; it is quite another to understand how different proteins interact and what happens as they move.
Any given human cell contains around 20,000 proteins – and they interact, and they can be flexible. Everything that regulates the machinery within a cell, including how cells manufacture proteins, comes down to these molecules, so understanding the basics of that machinery can lead to breakthroughs.
“Now, with Berzelius, we get a lot more throughput and can break new ground in our research,” says Wallner, who adds that the new supercomputer even gives the group the potential to retrain AlphaFold’s algorithm. “Google is huge and can do many big things. But now we might be able to compete.”
“We have just begun to use the supercomputer, and to make it work optimally, we need to adapt our algorithms,” he says. “We need to create new methods, new software and new training data in order to use the machine optimally.”
“Researchers will expand on what DeepMind has done and train new models to make predictions. We can move into protein interactions, beyond just single proteins and on to how proteins interact and how they change.”