A new trend in artificial intelligence (AI) has been discovered: Chinchilla (70B) outperforms GPT-3 (175B) and Gopher (280B)

Tram Ho

DeepMind has found a way to scale large language models cheaply. Its latest paper breaks the trend of building ever-larger language models to improve performance. The company figured out an important aspect of scaling large language models that no one had exploited before. Big tech companies like OpenAI, Google, Microsoft, Nvidia, Facebook, and even DeepMind itself have all been doing it wrong: increasing the model size is not the best or most efficient approach.

DeepMind’s research shows that increasing the number of training tokens (i.e. the amount of text data the model is fed) is just as important as increasing the model size. This means a smaller model can outperform a larger, suboptimally trained model if it is trained on significantly more tokens.

DeepMind demonstrated this with Chinchilla, a model with 70 billion parameters, 4 times smaller than Gopher (also built by DeepMind) but trained on 4 times as much data. Chinchilla outperformed Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG “uniformly and significantly” on a large series of language benchmarks.

The conclusion is clear: existing large language models are significantly undertrained, a consequence of blindly pursuing ever-larger models; making a model bigger is not the only way to improve its performance. And because Chinchilla is smaller, inference and fine-tuning are also cheaper, putting such models within reach of small companies and universities that may not have the budget or latest-generation hardware to run larger models. “The benefits of a more optimally trained smaller model, therefore, extend beyond the immediate benefits of its improved performance.”

The DeepMind paper also shows that to obtain a compute-optimal large language model, researchers should allocate the compute budget equally between increasing the model size and increasing the number of training tokens: “For every doubling of model size the number of training tokens should also be doubled.” This applies to organizations with a fixed compute budget that want the best-performing model for it.
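This doubling rule can be sketched numerically. Below is a toy Python illustration using the common approximation that training a dense transformer costs about 6 × N × D FLOPs (N parameters, D tokens); the function name and the 1:1 split are my simplifications, not the paper’s method.

```python
import math

# Rough training cost of a dense transformer: C ≈ 6 * N * D FLOPs,
# where N = parameter count and D = number of training tokens.
FLOPS_PER_PARAM_TOKEN = 6.0

def compute_optimal_allocation(flops_budget):
    """Split a FLOPs budget so parameters and tokens scale equally.

    With C = 6 * N * D and N, D grown in proportion, both scale like
    sqrt(C). The 1:1 ratio here is a toy choice; Chinchilla itself
    lands near 20 tokens per parameter (70B params, 1.4T tokens).
    """
    n = math.sqrt(flops_budget / FLOPS_PER_PARAM_TOKEN)
    return n, n  # (parameters, tokens)

n1, _ = compute_optimal_allocation(1e21)
n2, _ = compute_optimal_allocation(4e21)  # 4x the compute budget
print(n2 / n1)  # quadrupling compute doubles both N and D
```

Note how the doubling rule falls out of the square root: doubling both N and D quadruples the cost 6 × N × D.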

As such, increasing the model size is no longer the only way to improve the performance of large language models; increasing the number of training tokens matters just as much. Researchers can use DeepMind’s results to build better large language models at lower computational cost.

In a nutshell, DeepMind has found a new way to enhance the performance of large language models by increasing the number of training tokens. This can help create more optimized large language models with lower computational cost, and make it easier for small companies or universities to use these models. This is an important step forward in the field of artificial intelligence and could open up opportunities for new applications and improve the performance of existing applications.

Compute-optimal large language models

In the paper, DeepMind also studied how various factors affect the performance of large language models. The compute budget is usually the limiting, independent factor: the model size and the number of training tokens are determined by how much the company can spend on hardware. To study how these variables affect performance, the DeepMind researchers asked: “Given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?”

Models such as GPT-3, Gopher, and MT-NLG follow the scaling laws of Kaplan et al. (Table 1). For example, if the compute budget is increased 10x, Kaplan’s law predicts optimal performance when the model size grows 5.5x and the number of training tokens grows 1.8x.

These results show that the way to optimize large language models is not only to increase the model size, but also to increase the number of training tokens. This research provides a new approach for creating more optimal large language models, saving computational costs and opening up a lot of potential for artificial intelligence applications.

In their paper, DeepMind also points out that the study by Kaplan and colleagues reached the wrong conclusion because it assumed a fixed number of training tokens in its analysis. That assumption hid DeepMind’s answer: model size and the number of training tokens should grow together, each by roughly 3.16x (the square root of 10) for every 10x increase in compute.
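The difference between the two prescriptions is easy to see numerically. The exponents below (roughly 0.73/0.27 for Kaplan et al., 0.5/0.5 for DeepMind) are approximate values consistent with the multipliers quoted above, not exact figures from either paper.

```python
# For a 10x larger compute budget:
#  - Kaplan et al.: model size ~5.5x, training tokens ~1.8x
#    (compute exponents of roughly 0.73 and 0.27)
#  - DeepMind's revision: both ~3.16x (exponents of 0.5 each,
#    since sqrt(10) ≈ 3.16)
def scale_up(budget_multiplier, model_exp, data_exp):
    return budget_multiplier ** model_exp, budget_multiplier ** data_exp

kaplan = scale_up(10, 0.73, 0.27)    # model grows ~5.4x, data ~1.9x
chinchilla = scale_up(10, 0.5, 0.5)  # both grow ~3.16x
print(kaplan, chinchilla)
```

Under Kaplan’s exponents, almost all extra compute goes into parameters; under DeepMind’s, it is split evenly between parameters and data.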

To study the relationship between compute budget, model size, and number of training tokens, the researchers used three methods (see section 3 of the paper for more details).

  1. Fixed model sizes: They define a range of model sizes (from 70 million to 16 billion parameters) and vary the number of training tokens (4 variants) for each. They then determine the optimal combination for each compute budget. With this method, a compute-optimal model trained with the same compute as Gopher would have 67 billion parameters and 1.5 trillion tokens.
  2. IsoFLOP curves: They fix the compute budget (9 variants between 6×10¹⁸ and 3×10²¹ FLOPs) and vary the model size (the number of tokens then follows automatically). With this approach, a compute-optimal model trained with the same compute as Gopher would have 63 billion parameters and 1.4 trillion tokens.
  3. Fitting a parametric loss function: Using the results from methods 1 and 2, they model the loss as a parametric function of model size and number of training tokens. With this approach, a compute-optimal model trained with the same compute as Gopher would have 40 billion parameters.

In total, they evaluated more than 400 models, ranging from 70 million to 16 billion parameters and from 5 billion to 500 billion training tokens. All three methods give similar predictions for the optimal model size and number of training tokens, predictions that differ significantly from the results of Kaplan’s study.

These results suggest that current models are too large for their compute budgets (Figure 1). The findings could help scientists and companies create better-optimized language models at lower computational cost, and open opportunities to improve the performance and accuracy of AI applications.

As shown in Table 3 (first method), a 175-billion-parameter model (the size of GPT-3) should be trained with a compute budget of 3.85×10²⁴ FLOPs on 3.7 trillion tokens (10 times more than OpenAI used for GPT-3). A 280-billion-parameter model (the size of Gopher) should be trained with a compute budget of 9.90×10²⁴ FLOPs on over 5.9 trillion training tokens (20 times the number of tokens DeepMind used for Gopher).
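These budgets can be sanity-checked with the common approximation that training costs about 6 FLOPs per parameter per token; the helper below is illustrative, not from the paper.

```python
# Cross-checking the quoted budgets with C ≈ 6 * params * tokens:
def train_flops(params, tokens):
    return 6 * params * tokens

gpt3_scale = train_flops(175e9, 3.7e12)    # ≈ 3.9e24 FLOPs
gopher_scale = train_flops(280e9, 5.9e12)  # ≈ 9.9e24 FLOPs
print(f"{gpt3_scale:.2e} {gopher_scale:.2e}")
```

Both values line up with the 3.85×10²⁴ and 9.90×10²⁴ FLOPs figures quoted from the paper.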

These results show the importance of considering both model size and number of training tokens to optimize the performance of large language models.

The DeepMind team used the more conservative estimates (methods 1 and 2) to determine the size and number of training tokens of a compute-optimal model trained with the budget they had used for Gopher. The result is Chinchilla: 70 billion parameters trained on 1.4 trillion tokens (4x smaller than Gopher and trained on 4x more data). Chinchilla outperforms Gopher, and every language model before it, “uniformly and significantly”.
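The “4x smaller, 4x more data” tradeoff can be checked with the same rough 6 × N × D cost estimate; the token counts (300 billion for Gopher, 1.4 trillion for Chinchilla) are from the paper, the rest is an illustrative sketch.

```python
# Same 6 * N * D training-cost estimate applied to the two models:
def train_flops(params, tokens):
    return 6 * params * tokens

gopher = train_flops(280e9, 300e9)      # Gopher: 280B params, 300B tokens
chinchilla = train_flops(70e9, 1.4e12)  # Chinchilla: 70B params, 1.4T tokens
print(chinchilla / gopher)  # close to 1: essentially the same budget
```

Shrinking the model 4x while growing the data ~4.7x keeps the budget roughly constant, which is what makes the comparison between the two models fair.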

Their hypothesis is confirmed: increasing the number of training tokens and the model size at the same rate yields the best results, as Chinchilla demonstrates.

Comparing results: Chinchilla vs. Gopher and other models

To say that Chinchilla outperforms Gopher is an understatement once we look at the per-benchmark results. To avoid overloading the article with graphs, below I show only the results for the two most important benchmarks, Massive Multitask Language Understanding (MMLU) and BIG-bench (80% of the tests), plus the ethics evaluations, which always deserve extra scrutiny. (See Section 4 of the paper for a detailed analysis covering reading comprehension, common-sense reasoning, and question answering.)

MMLU & BIG-bench

Chinchilla sets new SOTA scores on both benchmarks: on average, 67.6% accuracy on MMLU and 65.1% on BIG-bench, while Gopher achieved 60% and 54.4% respectively (Figures 2, 3). On MMLU, Chinchilla even surpasses the 63.4% mark that experts had predicted would be the SOTA in June 2023. No one expected such a significant improvement so soon.

Chinchilla also uniformly outperformed previous language models on other benchmarks such as common-sense reasoning and reading comprehension, decisively taking the throne of language AI.

However, Chinchilla’s dominance was short-lived. Just a week after its launch it was surpassed by Google’s newest model, PaLM (at 540 billion parameters, the largest and most capable language model at the time). This succession of new models from different companies exemplifies the field’s rapid pace. Notably, Google did not build PaLM on DeepMind’s findings; they were testing a different approach. (Expect a new article on PaLM soon!)

Gender bias and toxicity

As expected, Chinchilla, which shares its dataset and architecture with Gopher, shows similar behavior on gender bias and toxicity. It improves somewhat over Gopher on the Winogender dataset of gender and occupation bias (Table 7), but not uniformly across all groups.

DeepMind found a new relationship between compute budget, model size, and number of training tokens. But those are not the only parameters that affect performance and efficiency.

A major problem when training large models is finding optimal hyperparameters (HPs). Current language models are so large that companies can only afford to train them once: searching for the best hyperparameter set is infeasible, so researchers must make difficult, often wrong, assumptions to choose them.

Recently, Microsoft and OpenAI developed a new parameterization (μP) whose hyperparameters transfer across models of different sizes within the same family: the optimal HPs for a small model can be reused on a larger one, yielding significantly better results.

DeepMind’s paper cites previous work on hyperparameter tuning but not this particular paper, published a few weeks earlier. Combining compute-optimal scaling with μP could yield even better results for any large language model.

Another possible improvement is a retrieval mechanism. RETRO achieves performance comparable to GPT-3 while being 25 times smaller. Its retrieval ability lets the model query a large text database (trillions of tokens) at inference time, similar to how we search the internet.
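As a rough intuition for how retrieval works, here is a toy sketch in the spirit of RETRO, not DeepMind’s actual architecture (which retrieves neighbors via frozen BERT embeddings): an external database is searched for chunks similar to the input, and the retrieved text is given to the model alongside it. Word-overlap similarity here is a stand-in for real learned embeddings, and the tiny database is invented for illustration.

```python
# Toy retrieval: score database chunks against the query by word
# overlap and prepend the best matches to the model input.
def similarity(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))  # Jaccard overlap

def retrieve(query, database, k=2):
    """Return the k chunks most similar to the query."""
    return sorted(database, key=lambda c: similarity(query, c),
                  reverse=True)[:k]

database = [
    "Chinchilla is a 70 billion parameter language model",
    "Gophers are burrowing rodents found in North America",
    "Scaling laws relate compute, parameters and training tokens",
]
query = "how many parameters does the Chinchilla model have"
context = retrieve(query, database, k=1)
augmented_input = "\n".join(context) + "\n" + query  # what the LM would see
print(context[0])
```

The point of the design is that facts live in the database rather than in the model weights, which is why a much smaller model can stay competitive.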

Finally, alignment techniques can improve results not only on language benchmarks but also in real-world use. OpenAI used such a method to turn GPT-3 into InstructGPT, with excellent performance results. However, aligning AI is very complex, and InstructGPT does not appear to improve over previous models in terms of safety or toxicity.

If a company combined all these techniques in one model, it would create the best overall model possible with what we now know about large language models.

A new trend

Chinchilla’s performance is impressive not only as an improvement, but because the model is smaller than every major language model developed in the last two years yet achieves the best benchmark results. Instead of focusing on making models ever larger, as many AI experts have criticized, companies and researchers should focus on optimizing the resources and budgets they already have; otherwise, they are wasting their money.

In terms of performance and efficiency, Chinchilla is a breakthrough.

Chinchilla’s performance is no longer the best in the field, as Google’s PaLM has achieved better results on many benchmarks. But Chinchilla’s main influence is not being the best model; it is being very good while breaking the trend of ever-larger models.

The consequences will shape the future of the field. First, companies should realize that model size is not the only variable that matters for performance; it is just one of many. Second, it could temper the public’s excitement at seeing ever-larger models, which is often read as a sign that we are closer to AGI than we actually are. Finally, it can help reduce the environmental impact of large models and lower the barriers for smaller companies that cannot keep up with Big Tech.

These points bring me to a few reflections.

Four important reflections on Chinchilla

Limits on reproducibility. Although smaller than other models, Chinchilla is still not feasible for most companies and universities to train or study. Calling a 70-billion-parameter model “small” shows how skewed the field has become: most organizations lack the resources to run the required experiments. As a result, current AI is being built on a fragile foundation, with a few large companies deciding the direction of the science.

However, this issue is not only about money. Companies like DeepMind, Google, and OpenAI do not plan to release models like Chinchilla, PaLM, and DALL·E for others to do research on; the models are often published mainly to show who is making the most progress in the field. DeepMind is one of the AI companies that has done the most to advance science by letting others build on its discoveries (for example, offering AlphaFold predictions for free), but the tendency to show off still prevails in this area.

DeepMind is trying to reverse this harmful trend by building a model that is both better and smaller. But with Chinchilla still a large model, we should recognize how far we remain from universalizing a technology that will redefine our future. If only a few companies control the resources for scientific research, they also control its direction, and the resulting breakthroughs may not benefit everyone.

Additionally, to build better models that remain smaller, companies will need larger datasets than those used today. High-quality, large-scale text datasets will be essential in the near future.

Emily M. Bender, a professor of linguistics at the University of Washington, criticized Google’s approach to building PaLM because its 780 billion training tokens are too much data to be documented well, which makes the model “too big to deploy safely”. Chinchilla was trained on nearly double that number of tokens. If we extend Bender’s criticism (which also depends on the process DeepMind used to train the model), we might conclude that Chinchilla is not safe enough to deploy either.

To make models better while keeping them smaller, they need more data; but using more data makes the models harder to audit. We face a difficult choice: make models larger (so they become increasingly inaccessible to most researchers while increasing the field’s environmental impact), or train them on more tokens (making data auditing harder and the models less safe). Saying that Chinchilla is better simply because it is smaller seems, in this light, a stretch.

Another option is to focus on lines of research that do not involve training large models on large datasets. But because Big Tech funds the research directions it prefers, only those directions produce results; not because the others would not work, but because they are not explored as thoroughly.

Finally, there is no indication that compute-optimal training will solve the ethical problems of language models. Transformer-based language models can exhibit gender discrimination, racial bias, or toxicity regardless of model size, dataset size, hyperparameter quality, or compute budget. Researchers and developers will need to keep devising new solutions to these problems to ensure that language models are built to high ethical standards.

Source: https://congdongchatgpt.com/d/46-research-mot-xu-huong-moi-ve-tri-tue-nhan-tao-ai-da-duoc-phat-hien

Source : Viblo