Gemma 2: Improving Open Language Models at a Practical Size (2024)

Gemma Team, Google DeepMind¹

¹ See Contributions and Acknowledgments section for full author list. Please send correspondence to gemma-2-report@google.com.

Abstract

In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3× bigger. We release all our models to the community.

1 Introduction

Large language models (LLMs) have demonstrated strong capabilities in language understanding, generation, and reasoning (Radford et al., 2019; Raffel et al., 2019; Brown et al., 2020). Scaling has been key to this recent progress, with many new capabilities only emerging at scale (Brown et al., 2020). The newest large models not only reach unprecedented performance on reasoning benchmarks (Achiam et al., 2023), but they also demonstrate multimodal and multilingual capabilities (Gemini Team, 2024) and even the ability to use context lengths of over 1M tokens (Gemini Team, 2024).

Small-scale models have also shown a rapid increase in performance, but these gains are largely derived from increasing the length of training (Touvron et al., 2023; Jiang et al., 2023; Gemma Team, 2024). This approach only scales logarithmically with dataset size (Hoffmann et al., 2022), and the latest small models require up to 15T tokens to improve the state of the art by less than 1-2% (AI@Meta, 2024).

Yet, these continued improvements provide evidence that small models are still under-trained. In this work, we explore alternatives to improve small model performance without solely increasing training length. One solution is to improve the quality of information received by the network at each training step by replacing the next token prediction task with a richer objective.

In particular, we focus our efforts on knowledge distillation (Hinton et al., 2015), which replaces the one-hot vector seen at each token with the distribution of potential next tokens computed from a large model. This approach is often used to reduce the training time of smaller models by giving them richer gradients. In this work, we instead train for large quantities of tokens with distillation in order to simulate training beyond the number of available tokens. Concretely, we use a large language model as a teacher to train small models, namely 2B and 9B models, on a quantity of tokens that is more than 50× the compute-optimal quantity predicted by the theory (Hoffmann et al., 2022). Along with the models trained with distillation, we also release a 27B model trained from scratch for this work.

We also leverage several known modifications of Transformers, namely the interleaving of global and local attention layers from Beltagy et al. (2020a), and the Grouped-Query Attention (GQA) mechanism of Ainslie et al. (2023).

Overall, the Gemma 2 models significantly advance state-of-the-art performance relative to comparable-scale open models and are even competitive with some models more than twice their size (xAI, 2024; AI@Meta, 2024; Jiang et al., 2023; Almazrouei et al., 2023), across a variety of automated benchmarks and human evaluations. Example domains include question answering (Clark et al., 2019; Kwiatkowski et al., 2019), commonsense reasoning (Sakaguchi et al., 2019; Suzgun et al., 2022), mathematics and science (Cobbe et al., 2021; Hendrycks et al., 2020), and coding (Austin et al., 2021; Chen et al., 2021).

While thorough testing of our models has been conducted, these tests cannot cover all applications and scenarios in which Gemma 2 may be used. With this in mind, all Gemma 2 users should conduct rigorous safety testing specific to their use case before deployment or use.

In this technical report, we provide an overview of the models, including the architecture and the pre- and post-training recipes for Gemma 2. We also provide detailed evaluations across a wide variety of benchmarks, covering both standard academic benchmarks and human-preference evaluations. Finally, we discuss our approach to safe and responsible deployment and outline the broader implications of Gemma 2, its limitations, and its advantages.

Table 1: Overview of the main model parameters and design choices.

Parameters          2B       9B       27B
d_model             2304     3584     4608
Layers              26       42       46
Pre-norm            yes      yes      yes
Post-norm           yes      yes      yes
Non-linearity       GeGLU    GeGLU    GeGLU
Feedforward dim     18432    28672    73728
Head type           GQA      GQA      GQA
Num heads           8        16       32
Num KV heads        4        8        16
Head size           256      256      128
Global att. span    8192     8192     8192
Sliding window      4096     4096     4096
Vocab size          256128   256128   256128
Tied embedding      yes      yes      yes

2 Model Architecture

Similar to previous Gemma models (Gemma Team, 2024), the Gemma 2 models are based on a decoder-only transformer architecture (Vaswani et al., 2017). We summarize the main parameters and architecture choices in Table 1.

A few architectural elements are similar to the first version of Gemma models; namely, a context length of 8192 tokens, the use of Rotary Position Embeddings (RoPE) (Su et al., 2021), and the approximated GeGLU non-linearity (Shazeer, 2020). A few elements differ between Gemma 1 and Gemma 2, including the use of deeper networks. We summarize the key differences below.

Local Sliding Window and Global Attention. We alternate between a local sliding window attention (Beltagy et al., 2020b, a) and global attention (Luong et al., 2015) in every other layer. The sliding window size of local attention layers is set to 4096 tokens, while the span of the global attention layers is set to 8192 tokens.
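To make the alternation concrete, the per-layer attention masks could be built as in the following sketch; the helper name and the choice of which layer parity is local versus global are illustrative assumptions, not the training implementation.

```python
import numpy as np

def layer_attention_mask(layer_idx: int, seq_len: int,
                         window: int = 4096, span: int = 8192) -> np.ndarray:
    """Boolean causal mask for one layer. Which parity is local versus global
    is an assumption of this sketch; the paper only states that they alternate."""
    q = np.arange(seq_len)[:, None]     # query positions
    k = np.arange(seq_len)[None, :]     # key positions
    causal = k <= q                     # no attention to future tokens
    limit = window if layer_idx % 2 == 0 else span
    return causal & (q - k < limit)

local_mask = layer_attention_mask(layer_idx=0, seq_len=8192)   # sliding-window layer
global_mask = layer_attention_mask(layer_idx=1, seq_len=8192)  # global-attention layer
```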

Logit soft-capping. We cap logits (Bello et al., 2016) in each attention layer and the final layer such that the value of the logits stays between $-\text{soft\_cap}$ and $+\text{soft\_cap}$. More specifically, we cap the logits with the following function:

\[ \text{logits} \leftarrow \text{soft\_cap} \cdot \tanh(\text{logits} / \text{soft\_cap}). \]

We set the soft_cap parameter to 50.0 for the self-attention layers and to 30.0 for the final layer.
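As an illustration, the capping above is a one-line transform; the array shapes below are arbitrary examples rather than the model's actual dimensions.

```python
import numpy as np

def soft_cap(logits: np.ndarray, cap: float) -> np.ndarray:
    """Smoothly squash logits into (-cap, +cap): logits <- cap * tanh(logits / cap)."""
    return cap * np.tanh(logits / cap)

attn_scores = np.random.randn(8, 16, 16) * 100.0   # example attention logits (heads, queries, keys)
attn_scores = soft_cap(attn_scores, cap=50.0)      # self-attention layers use cap = 50.0
final_logits = np.random.randn(16, 256128)         # example final-layer logits (positions, vocab)
final_logits = soft_cap(final_logits, cap=30.0)    # final layer uses cap = 30.0
```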

Table 2: Embedding and non-embedding parameter counts for the Gemma 2 models.

Model   Embedding Parameters   Non-embedding Parameters
2B      590,118,912            2,024,517,888
9B      917,962,752            8,324,201,984
27B     1,180,237,824          26,047,480,320

Post-norm and pre-norm with RMSNorm. To stabilize training, we use RMSNorm (Zhang and Sennrich, 2019) to normalize the input and output of each transformer sub-layer, the attention layer, and the feedforward layer.
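A minimal sketch of this pre-norm and post-norm arrangement is shown below; adding the post-normed output back to the residual stream is an assumption of the sketch, not a detail stated here.

```python
import numpy as np

def rms_norm(x: np.ndarray, scale: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm (Zhang and Sennrich, 2019): rescale activations by their root-mean-square,
    then by a learned per-feature scale."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * scale

def normed_sublayer(x, f, pre_scale, post_scale):
    """Normalize both the input and the output of a sub-layer f (attention or feedforward)."""
    return x + rms_norm(f(rms_norm(x, pre_scale)), post_scale)

d_model = 4608                                   # 27B model width from Table 1
x = np.random.randn(4, d_model)
w = np.random.randn(d_model, d_model) * 0.01     # stand-in for a feedforward block
y = normed_sublayer(x, lambda h: h @ w, np.ones(d_model), np.ones(d_model))
```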

Grouped-Query Attention (Ainslie et al., 2023). We use GQA with num_groups = 2, based on ablations showing increased speed at inference time while maintaining downstream performance.
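As a sketch of the mechanism rather than the actual implementation, grouped-query attention shares each key (and value) head across a group of query heads; with the head counts of Table 1, every two query heads read the same KV head.

```python
import numpy as np

def gqa_scores(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Attention scores where groups of query heads share a single key head.
    With Table 1's ratio (num KV heads = num heads / 2), each KV head serves two query heads."""
    num_q_heads, _, head_dim = q.shape
    num_kv_heads = k.shape[0]
    group = num_q_heads // num_kv_heads          # 2 for the Gemma 2 configurations
    k_shared = np.repeat(k, group, axis=0)       # duplicate each KV head for its query group
    return q @ k_shared.transpose(0, 2, 1) / np.sqrt(head_dim)

q = np.random.randn(16, 128, 256)   # (query heads, seq, head dim), 9B-like config
k = np.random.randn(8, 128, 256)    # (KV heads, seq, head dim)
scores = gqa_scores(q, k)           # shape (16, 128, 128)
```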

3 Pre-training

We provide a brief overview of the parts of our pre-training that differ from Gemma 1.

3.1 Training Data

We train Gemma 2 27B on 13 trillion tokens of primarily-English data, the 9B model on 8 trillion tokens, and the 2B model on 2 trillion tokens. These tokens come from a variety of data sources, including web documents, code, and science articles. Our models are not multimodal and are not trained specifically for state-of-the-art multilingual capabilities. The final data mixture was determined through ablations similar to the approach in Gemini 1.0 (Gemini Team, 2023).

Tokenizer. We use the same tokenizer as Gemma 1 and Gemini: a SentencePiece tokenizer with split digits, preserved whitespace, and byte-level encodings (Kudo and Richardson, 2018). The resulting vocabulary has 256k entries.

Filtering. We use the same data filtering techniques as Gemma 1. Specifically, we filter the pre-training dataset to reduce the risk of unwanted or unsafe utterances, filter out certain personal information or other sensitive data, decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs.

Table 3: Training infrastructure, with data and model sharding per model.

Model   Type     #Chips   Data shards   Model shards
2B      TPUv5e   512      512           1
9B      TPUv4    4096     1024          4
27B     TPUv5p   6144     768           8

3.2 Knowledge Distillation

Given a large model used as a teacher, we learn smaller models by distilling from the probability given by the teacher of each token $x$ given its context $x_c$, i.e., $P_T(x \mid x_c)$. More precisely, we minimize the negative log-likelihood between the probabilities from the teacher and the student:

\[ \min_{P_S} \sum_{x} -P_T(x \mid x_c) \log P_S(x \mid x_c), \]

where $P_S$ is the parameterized probability of the student. Note that knowledge distillation was also used in Gemini 1.5 (Gemini Team, 2024).
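A minimal sketch of this objective computed from raw logits is shown below; any softmax temperature or mixing with the standard one-hot loss is omitted, since such details are not specified here.

```python
import numpy as np

def log_softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(teacher_logits: np.ndarray, student_logits: np.ndarray) -> float:
    """Sum over the vocabulary of -P_T(x | x_c) * log P_S(x | x_c), averaged over positions."""
    p_teacher = np.exp(log_softmax(teacher_logits))    # teacher's soft next-token distribution
    log_p_student = log_softmax(student_logits)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

seq_len, vocab = 32, 256128
loss = distillation_loss(np.random.randn(seq_len, vocab),
                         np.random.randn(seq_len, vocab))
```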

3.3 Compute Infrastructure

We train our models with TPUv4, TPUv5e, and TPUv5p as outlined in Table 3. For the 2B model, we train on a 2x16x16 configuration of TPUv5e, totaling 512 chips, with 512-way data replication and 1-way model sharding. For the 9B model, we train on an 8x16x32 configuration of TPUv4, totaling 4096 chips, with 1024-way data replication and 4-way model sharding. For the 27B model, we train on an 8x24x32 configuration of TPUv5p, totaling 6144 chips, with 768-way data replication and 8-way model sharding.

The optimizer state is further sharded using techniques similar to ZeRO-3 (Ren et al., 2021). For scales beyond a single pod, we perform a data-replica reduction over the data center network, using the Pathways approach of Barham et al. (2022). We also use the 'single controller' programming paradigm of Jax (Roberts et al., 2023) and Pathways (Barham et al., 2022). As in Gemma 1, we use the GSPMD partitioner (Xu et al., 2021) for training step computation and the MegaScale XLA compiler (XLA, 2019).

Table 4: Relevant formatting control tokens used for Gemma 2 models.

Context                      Relevant Token
User turn                    user
Model turn                   model
Start of conversation turn   <start_of_turn>
End of conversation turn     <end_of_turn>
Beginning of sequence        <bos>
End of sequence              <eos>

3.4 Carbon Footprint

We estimate the carbon emissions from pre-training the Gemma models to be 1247.61 tCO₂eq. As in Gemma 1 (Gemma Team, 2024), this value is calculated based on the hourly energy usage reported directly from our TPU data centers and scaled to account for the additional energy expended to create and maintain the data center. Importantly, Google data centers are carbon neutral, achieved through a combination of energy efficiency, renewable energy purchases, and carbon offsets. This carbon neutrality applies to our experiments and the machines running them.

4 Post-Training

For post-training, we fine-tune our pre-trained models into instruction-tuned models.First, we apply supervised fine-tuning (SFT) on a mix of text-only, English-only synthetic and human-generated prompt-response pairs. We then apply RLHF on top of these models with the reward model trained on labelled English-only preference data and the policy based on the same prompts as the SFT phase. Finally, we average the models obtained after each phase to improve their overall performance. The final data mixtures and post-training recipe, which includes tuned hyperparameters, were chosen on the basis of improving helpfulness while minimizing model harms related to safety and hallucinations.

We extended the post-training data from Gemma 1.1 with a mixture of internal and external public data. In particular, we use the prompts, but not the answers, from LMSYS-chat-1M (Zheng et al., 2023). All of our data go through a filtering stage described below.

Supervised fine-tuning (SFT). We run behavioral cloning on synthetic and real prompts, with responses predominantly generated synthetically by the teacher, that is, a larger model. We also run distillation from the teacher on the student's distribution (Agarwal et al., 2024; Gu et al., 2024).

Table 5: Example dialogue with user and model control tokens.

First turn
User:   <start_of_turn>user
        Knock knock.<end_of_turn>
        <start_of_turn>model
Model:  Who's there?<end_of_turn><eos>

Second turn
User:   <start_of_turn>user
        Knock knock.<end_of_turn>
        <start_of_turn>model
Model:  Who's there?<end_of_turn>
User:   <start_of_turn>user
        Gemma.<end_of_turn>
        <start_of_turn>model
Model:  Gemma who?<end_of_turn><eos>

Reinforcement Learning from Human Feedback (RLHF). We use an RLHF algorithm similar to that of Gemma 1.1 (Gemma Team, 2024), but with a different reward model that is an order of magnitude larger than the policy. The new reward model is also oriented more towards conversational capabilities, specifically multi-turn interactions.

Model merging. We average different models obtained by running our pipeline with different hyperparameters (Ramé etal., 2024).

Data filtering. When using synthetic data, we run several stages of filtering to remove examples that contain certain personal information, unsafe or toxic model outputs, mistaken self-identification data, or duplicated examples. Following Gemini, we find that including subsets of data that encourage better in-context attribution, hedging, and refusals to minimize hallucinations improves performance on factuality metrics, without degrading model performance on other metrics.

Formatting. Gemma 2 models are fine-tuned with the same control tokens as Gemma 1 models, as detailed in Table 4, but a different formatting schema. See the dialogue example in Table 5. Notice that the model explicitly ends generations with <end_of_turn><eos> tokens, while previously it only generated <eos>. For the motivation behind this formatting structure, see Gemma 1.
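Using the control tokens of Table 4 and the layout of Table 5, a conversation can be rendered into a prompt string roughly as follows; whether <bos> is added at this stage or by the tokenizer, and the exact newline placement, are assumptions of this sketch.

```python
def format_gemma2_dialogue(turns: list[tuple[str, str]]) -> str:
    """Render (role, text) pairs with the Gemma 2 control tokens, then open a model turn."""
    pieces = ["<bos>"]                                   # beginning of sequence (Table 4)
    for role, text in turns:                             # role is "user" or "model"
        pieces.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    pieces.append("<start_of_turn>model\n")              # prompt the model for its reply
    return "".join(pieces)

prompt = format_gemma2_dialogue([("user", "Knock knock."),
                                 ("model", "Who's there?"),
                                 ("user", "Gemma.")])
# The model is expected to end its reply with <end_of_turn><eos>.
```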

5 Ablations

In this section, we focus on the main finding of this work, which is the impact of knowledge distillation on small language models.

Table 6: Training a 2B-scale model from scratch versus distilling from a 7B teacher (average over 3 benchmarks).

                     from scratch   distilled
Average (3 bench.)   60.3           67.7

Distillation versus from scratch. In Table 6, we show that distilling from a larger model improves performance compared to training from scratch. Note that 500B tokens is 10× more than the compute-optimal number of tokens for a 2B model. We distill from a 7B model to keep a ratio similar to our target distillation from 27B to 9B.

Table 7: Impact of distillation as the student size increases, with the teacher held at 7B.

                 200M   400M   1B
from scratch     23     19     17
distilled (7B)   21     17     15

Impact of distillation w.r.t. model size. In Table 7, we measure the impact of distillation as model size increases. We observe that the gain remains as the model size is scaled. In this ablation, we maintain the size of the teacher at 7B and train smaller models to simulate the same gap as between our final teacher and student sizes.

Table 8: Multi-head attention (MHA) versus GQA for a 9B model (average over 4 benchmarks).

                     MHA    GQA
Average (4 bench.)   50.3   50.8

GQA versus MHA. In Table 8, we compare two instances of our 9B model, one with MHA and one with GQA. We observe few overall changes in performance between the two models as measured on several benchmarks. We choose GQA since it requires fewer parameters and is faster at inference time.

Wide versus deep. In Table 9, we show that a deeper 9B network is slightly better than a wider 9B network for the same number of parameters. Although the gap is small, it is consistent across benchmarks and warrants the switch to a deeper architecture.

Table 9: Wide versus deep 9B architectures (average over 4 benchmarks).

                     Wide   Deep
Average (4 bench.)   50.8   52.0

Changing sliding window size. In Table 10, we show that we can change the sliding window size of the local attention layers during inference with moderate impact on perplexity. Adjusting the size of the sliding window can thus be a lever for slight inference speed gains.

Table 10: Perplexity on a validation set when reducing the sliding window size at inference time.

sliding window          4096   2048   1024
perplexity (val. set)   1.63   1.63   1.64

Impact of formatting. We measure performance variance on MMLU across prompt/evaluation formatting variations. Table 11 shows the standard deviations of MMLU scores for 12 formatting/evaluation combinations, a proxy for undesired performance variability. The Gemma 2B models are slightly less format-robust than the larger ones. Notably, Mistral 7B is significantly less robust than our models.

Table 11: Standard deviation of MMLU scores across 12 formatting/evaluation combinations.

Model         Standard Deviation
Gemma 1 2B    1.5
Gemma 2 2B    2.1
Mistral 7B    6.9
Gemma 1 7B    0.7
Gemma 2 9B    0.9
Gemma 2 27B   1.0

6 Evaluation

In this section, we evaluate both pre-trained and IT models over a series of automated benchmarks and human evaluations across a variety of domains. We also report performance from models of similar sizes that have permissive licenses, or as reported by others. Note that we consider total parameters, not active parameters, since total memory usage is often what limits the use of open models on standard devices.

6.1 Pre-training Evaluations

Evaluating the 27B model

In this set of evaluations, we evaluate the performance of our 27B model trained without distillation on 13T tokens. We report results in Table 12, where we compare with a model of similar size, Qwen1.5 32B (Team, 2024), and a model 2.5× larger, LLaMA-3 70B, on the HuggingFace evaluation suite. We selected these models based on their ranking on the HuggingFace leaderboard.

Overall, we observe that our model is the best in its size category and is even competitive with a larger model that is trained for longer. That being said, the performance of models trained in a similar fashion improves only logarithmically with their size; hence, our model is likely on the same Pareto curve as the LLaMA-3 models. However, it is not clear how these differences affect the quality of the resulting IT models.

Table 12: Comparison of the Gemma 2 27B pre-trained model with Qwen1.5 32B and LLaMA-3 70B on the HuggingFace evaluation suite.

             LLaMA-3 70B   Qwen1.5 32B   Gemma-2 27B
MMLU         79.2          74.3          75.2
GSM8K        76.9          61.1          74.0
ARC-c        68.8          63.6          71.4
HellaSwag    88.0          85.0          86.4
Winogrande   85.3          81.5          83.7

Evaluating the 2B and 9B models

Table 13: Comparison of pre-trained models on a variety of benchmarks.

Benchmark       metric        Gemma-1 2B   Gemma-2 2B   Mistral 7B   LLaMA-3 8B   Gemma-1 7B   Gemma-2 9B   Gemma-2 27B
MMLU            5-shot        42.3         52.2         62.5         66.6         64.4         71.3         75.2
ARC-C           25-shot       48.5         55.7         60.5         59.2         61.1         68.4         71.4
GSM8K           5-shot        15.1         24.3         39.6         45.7         51.8         68.6         74.0
AGIEval         3-5-shot      24.2         31.5         44.0         45.9         44.9         52.8         55.1
DROP            3-shot, F1    48.5         51.2         63.8         58.4         56.3         69.4         74.2
BBH             3-shot, CoT   35.2         41.9         56.0         61.1         59.0         68.2         74.9
Winogrande      5-shot        66.8         71.3         78.5         76.1         79.0         80.6         83.7
HellaSwag       10-shot       71.7         72.9         83.0         82.0         82.3         81.9         86.4
MATH            4-shot        11.8         16.0         12.7         -            24.3         36.6         42.3
ARC-e           0-shot        73.2         80.6         80.5         -            81.5         88.0         88.6
PIQA            0-shot        77.3         78.4         82.2         -            81.2         81.7         83.2
SIQA            0-shot        49.7         51.9         47.0         -            51.8         53.4         53.7
Boolq           0-shot        69.4         72.7         83.2         -            83.2         84.2         84.8
TriviaQA        5-shot        53.2         60.4         62.5         -            63.4         76.6         83.7
NQ              5-shot        12.5         17.1         23.2         -            23.0         29.2         34.5
HumanEval       pass@1        22.0         20.1         26.2         -            32.3         40.2         51.8
MBPP            3-shot        29.2         30.2         40.2         -            44.4         52.4         62.6
Average (8)                   44.0         50.0         61.0         61.9         62.4         70.2         74.4
Average (all)                 44.2         48.7         55.6         -            57.9         64.9         69.4

In this set of experiments, we compare our new 2B and 9B models trained with distillation to our previous models and several standard open models reported in Gemma Team (2024).

We observe overall a massive improvement in our models compared to previous versions, by up to 10% on some benchmarks for the 9B model. The two 2B models were trained with a similar number of tokens (2T for Gemma 2 and 3T for Gemma 1), and we still observe a significant improvement for the new model. This confirms that distillation significantly improves the quality of models even when trained on the same number of tokens.

6.2 Post-training Evaluations

In this section, we evaluate our IT models on a set of human evaluations as well as standard academic benchmarks. The Gemma 2 models push the frontier for post-trained open-weights models, setting a new state of the art on the LMSYS Chatbot Arena (Chiang etal., 2024).

LMSYS Chatbot Arena

Gemma 2 Instruction Tuned models were evaluated on the Chatbot Arena (Chiang et al., 2024) in blind side-by-side evaluations by human raters against other state-of-the-art models. We report Elo scores in Table 14. Gemma 2.6B, 9B, and 27B strongly outperform all other open models in the same range of parameters; notably, Gemma 27B (Elo 1218) ranks higher than Llama 3 70B (Elo 1206), Gemma 9B (Elo 1187) is similar to GPT-4-0314 (Elo 1186), and Gemma 2.6B (Elo 1126) ranks higher than GPT-3.5-Turbo-0613 (Elo 1116).

Table 14: Chatbot Arena Elo scores.

Model                         Elo    95% CI      Open
gpt-4o-2024-05-13             1286   +2 / -3     -
gpt-4o-mini-2024-07-18        1279   +5 / -4     -
claude-3-5-sonnet             1271   +3 / -4     -
gemini-advanced-0514          1266   +2 / -3     -
llama-3.1-405b-instruct       1262   +8 / -7     +
gemini-1.5-pro-api-0514       1261   +2 / -3     -
gemini-1.5-pro-api-0409       1257   +3 / -3     -
gpt-4-turbo-2024-04-09        1256   +2 / -3     -
gpt-4-1106-preview            1250   +3 / -3     -
claude-3-opus-20240229        1248   +2 / -2     -
athene-70b-0725               1245   +8 / -6     +
gpt-4-0125-preview            1245   +2 / -2     -
llama-3.1-70b-instruct        1244   +8 / -9     +
yi-large-preview              1239   +3 / -3     -
gemini-1.5-flash-api-0514     1227   +3 / -3     -
deepseek-v2-api-0628          1220   +6 / -6     +
gemma-2-27b-it                1218   +4 / -3     +
yi-large                      1212   +4 / -5     -
nemotron-4-340b-instruct      1209   +3 / -4     +
bard-jan-24-gemini-pro        1208   +5 / -7     -
glm-4-0520                    1206   +3 / -5     -
llama-3-70b-instruct          1206   +2 / -2     +
claude-3-sonnet               1200   +2 / -2     -
reka-core-20240501            1199   +3 / -3     -
command-r-plus                1189   +2 / -2     +
gemma-2-9b-it                 1187   +3 / -5     +
qwen2-72b-instruct            1187   +3 / -3     +
gpt-4-0314                    1186   +2 / -3     -
qwen1.5-110b-chat             1161   +3 / -3     +
mistral-large-2402            1157   +3 / -3     -
yi-1.5-34b-chat               1157   +4 / -3     -
reka-flash-21b-20240226       1155   +4 / -4     -
llama-3-8b-instruct           1151   +2 / -3     +
command-r                     1148   +3 / -3     +
claude-1                      1148   +4 / -4     -
mistral-medium                1147   +4 / -4     -
reka-flash-21b-20240226       1147   +3 / -4     -
qwen1.5-72b-chat              1147   +4 / -4     +
mixtral-8x22b-instruct-v0.1   1145   +2 / -3     +
claude-2.0                    1131   +4 / -6     -
gemini-pro-dev-api            1131   +4 / -3     -
zephyr-orpo-141b              1127   +10 / -6    +
gemma-2-2b-it                 1126   +10 / -10   +
qwen1.5-32b-chat              1125   +3 / -3     +
mistral-next                  1124   +5 / -5     -
phi-3-medium-4k-instruct      1122   +4 / -4     +
starling-lm-7b-beta           1118   +4 / -5     +
claude-2.1                    1118   +3 / -3     -
gpt-3.5-turbo-0613            1116   +3 / -4     -
mixtral-8x7b-instruct-v0.1    1114   +0 / -0     -

Human Preference Evaluations

We also submit Gemma IT models for side-by-side human evaluation studies (which are independent of the Chatbot Arena). We use held-out collections of single-turn prompts that target safety and instruction following (IF). We use gpt4o-2024-05-13 as the base model, and observe large improvements in win rates and preference scores as compared against the older Gemma 1.1 7B model. We report safety as a win-loss ratio against GPT4o, and we report single-sided instruction following scores as the ratio of prompts where all instructions are followed. In particular, we find that regardless of their size, Gemma 2 models produce safer, more appropriate responses on the held-out safety prompt set than GPT4o.

Table 15: Human preference evaluations against gpt4o-2024-05-13.

Model             Instruction Following   Safety   Safety Win / Tie / Loss
Gemma 1.1 IT 7B   24.3% ± 1.9%            42.8%    37.4% / 10.8% / 51.8%
Gemma 2 IT 2B     26.5% ± 1.8%            57.5%    53% / 9% / 38%
Gemma 2 IT 9B     34.1% ± 3.0%            57.8%    48.2% / 19.2% / 28.3%
Gemma 2 IT 27B    37.7% ± 2.3%            55%      49.6% / 10.8% / 39.6%

Human Multi-Turn Evaluations

We evaluated the multi-turn capabilities of the Gemma 1.1 7B, Gemma 2 2B, 9B, and 27B models by asking human raters to hold conversations with the models and follow specified scenarios. We used a diverse, held-out set of 500 scenarios, each describing a sequence of requests to the model, including brainstorming, making a plan, or learning something new. The average number of user turns is 8.4. We found that conversations with the Gemma 2 models are rated significantly better than Gemma 1.1 in user satisfaction and conversation goal achievement (Table 16). Moreover, we saw that the Gemma 2 models were better than Gemma 1.1 7B at maintaining a high quality of responses for the entire conversation.

Table 16: Multi-turn human evaluation ratings.

Model             User satisfaction   Conversation goal achievement
Gemma 1.1 IT 7B   3.32                3.36
Gemma 2 IT 2B     3.64                3.88
Gemma 2 IT 9B     4.04                4.08
Gemma 2 IT 27B    4.20                4.24

Standard Benchmarks

It has been observed in Llama-3 (AI@Meta, 2024) that instruction fine-tuning can improve the performance of models on few-shot benchmarks despite not being trained to target few-shot capabilities. In Table 17, we show a similar improvement across our models. Overall, we observe improvements on the order of several percentage points. We conjecture that IT models are better at understanding formatted questions, while pre-trained models are sensitive to formatting.

Table 17: Pre-trained (PT) versus instruction-tuned (IT) performance.

            2B             9B             27B
Benchmark   PT     IT      PT     IT      PT     IT
MMLU        52.2   56.1    71.3   72.3    75.2   76.2
MBPP        30.2   36.6    52.4   59.2    62.6   67.4

7 Memorization and Privacy

Large language models may, under particular circumstances, be vulnerable to attacks causing the model to produce memorized training data (Nasr et al., 2023).¹ To study susceptibility to such attacks and quantify memorization, we evaluate models for verbatim and approximate memorization, as was done in several prior studies (Carlini et al., 2022; Anil et al., 2023; Kudugunta et al., 2023; Gemini Team, 2024).

¹ This work uses a very restricted definition of "memorization": whether a model can be induced to generate near-copies of some training examples when prompted with appropriate instructions. We do not mean to say that a model "contains" its training data in the sense that any arbitrary instance of that data can be retrieved without use of specialized software or algorithms. Rather, if a model can be induced to generate measurably close copies of certain training examples by supplying appropriate instructions to guide the model's statistical generation process, then that model is said to have "memorized" those examples.

We follow the evaluation setting of Gemma Team (2024), which tests for (50-token) memorization of training data given a prompt of 50 tokens. We compare the overall memorization rates, across a uniform sample of the entire dataset, using both an exact match criterion and an approximate match criterion (Ippolito et al., 2022) with an edit distance of 10%.
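The matching criterion can be sketched as follows; the use of a token-level Levenshtein distance and a threshold of 10% of the reference length are assumptions about how the published criteria translate to code.

```python
def is_memorized(generated: list[str], reference: list[str],
                 approximate: bool = False, threshold: float = 0.10) -> bool:
    """Compare a generated 50-token continuation against the true training continuation.
    Exact match requires identical tokens; approximate match allows an edit distance
    of up to 10% of the reference length."""
    if not approximate:
        return generated == reference
    # Token-level Levenshtein distance (single-row dynamic programming).
    m, n = len(generated), len(reference)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = min(row[j] + 1,                                       # deletion
                      row[j - 1] + 1,                                   # insertion
                      prev + (generated[i - 1] != reference[j - 1]))    # substitution
            prev, row[j] = row[j], cur
    return row[n] <= threshold * n
```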

Verbatim Memorization: Results are in Figure 1. We first compare against recent models from the literature that include memorization evaluations. We find that Gemma 2 memorizes significantly less than prior models of a similar size, with memorization rates below 0.1% (note the log y-axis). We further investigate how this memorization breaks down with respect to the data source. Similar to Gemma 1, we find that Gemma 2 memorizes more from code, wiki, and science sources, and also that it memorizes significantly less across the board (again, note the log y-axis).

Approximate Memorization: Figure 1 also presents approximate memorization by data source. We observe that while approximate memorization is higher than exact memorization, the rate of memorization is still low. For example, the approximate memorization of this model is much lower than even the exact memorization of Gemma 1. We find that the increase in approximate memorization is much lower than for prior models; in some cases we observed no lift at all, cf. Gemma Team (2024, Figure 4). Note that the absence of an approximate memorization bar in Figure 1 indicates no increase, i.e., the rate of approximate memorization equals that of exact memorization.

Personal Data: We use the same prevention methods at training time and the same evaluations as Gemma Team (2024). In particular, we use the Google Cloud Sensitive Data Protection Tool (https://cloud.google.com/sensitive-data-protection) to find potential instances of personal data. The many categories of personal data (e.g., phone numbers, account numbers) are classified into three severity levels, and we analyze memorized outputs using these severity levels. We found no instances of high-severity data being emitted, and found that a very low rate of 0.00026% of memorized data contained lower-severity personal information. We note that these automated tools are known to incur false positives because they do not account for context, which means our results are likely overestimates.

[Figure 1: Verbatim and approximate memorization rates for Gemma 2 compared to prior models, broken down by data source (log scale).]

8 Responsibility, Safety, Security

Responsibility, safety and security are of paramount importance when developing Gemma models. To reduce risks to Gemma 2 users, we have integrated enhanced internal safety processes that span the development workflow, in line with recent Google AI models (Gemini Team, 2024). Similar to the inaugural Gemma release, we have followed a three pillar approach which focuses on safety mitigation at training time, robust and transparent model evaluations, and further development of the Responsible Generative AI Toolkit, a series of models and tools to help developers implement responsibility and safety best practices for their applications.

Table 18: Safety benchmark results for Gemma 1.1 IT and Gemma 2 IT models.

                               Gemma 1.1 IT      Gemma 2 IT
Benchmark      metric          2.5B     7B       2.6B     9B       27B
RealToxicity   avg tox         7.03     8.04     8.16     8.25     8.84
CrowS-Pairs    top-1           45.89    49.67    37.67    37.47    36.67
BBQ Ambig      4-shot, top-1   58.97    86.06    83.20    88.58    85.99
BBQ Disambig   4-shot, top-1   53.9     85.08    69.31    82.67    86.94
Winogender     top-1           50.14    57.64    52.91    79.17    77.22
TruthfulQA     MC2Acc          44.24    45.34    43.72    50.27    51.60
Winobias 1_2   top-1           55.93    59.22    59.28    78.09    81.94
Winobias 2_2   top-1           89.46    89.2     88.57    95.32    97.22
Toxigen        avg tox         29.64    38.75    48.32    39.30    38.42

8.1 Impact assessment

Our approach and resulting impact assessment reflect those outlined for Gemma 1 (Gemma Team, 2024): we continue to believe that openness in AI can spread the benefits of these technologies across society, but it must be evaluated against the risk of malicious uses, such as the creation of deepfake imagery, AI-generated disinformation, or illegal and disturbing material, that can cause harm at both the individual and institutional levels (Weidinger et al., 2021). Since the launch of Gemma 1, we have seen our Gemma models drive a number of socially beneficial applications, relying on Gemma's unique technologies like its tokenizer to facilitate the creation of multilingual models, such as Navarasa 2.0, a Gemma tuned model for 15 Indian languages.

Releasing further open models requires specific attention to changes in model capabilities and close monitoring of the evolving risks of LLMs (Lin et al., 2024), as well as an understanding of the ways in which our models are being used in the wild. Although we have yet to receive any reports of malicious use of Gemma, we remain committed to investigating any such reports, and we work with the academic and developer communities, as well as conduct our own monitoring, to flag such use cases via our contact email (gemma-2-report@google.com).

Despite advancements in capabilities, we believe that given the number of larger and more powerful open models, this release will have a negligible effect on the overall risk landscape.

8.2 Safety policies and train-time mitigations

A key pillar of Gemma's approach to safety is to align fine-tuned models with Google's safety policies, in line with Gemini models (Gemini Team, 2023). These policies are designed to help prevent our models from generating harmful content, i.e.:

  • Child sexual abuse and exploitation

  • Revealing personally identifiable information that can lead to harm (e.g., Social Security numbers)

  • Hate speech and harassment

  • Dangerous or malicious content (including promoting self-harm or instructing in harmful activities)

  • Sexually explicit content

  • Medical advice that runs contrary to scientific or medical consensus

We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. For fine-tuned models, we also use both SFT and RLHF to steer the model away from undesirable behavior.

Table 19: Offensive cyber-security (capture-the-flag) evaluation results.

                   InterCode-CTF     Internal CTF suite   Hack the Box
Gemini 1.0 Ultra   28/76 [1] (37%)   3/13 (23%)           0/13
Gemini 1.5 Pro     62/76 (82%)       4/13 (31%)           0/13
CodeGemma 1 7B     12/76 (16%)       0/13 (0%)            0/13
Gemma 2 27B        34/76 (45%)       1/13 (8%)            0/13

8.3 External benchmark evaluations

Robust and transparent evaluations are key principles of our responsible approach to developing Gemma. To this end, we report Gemma 2 evaluations on public benchmarks in Table 18.

8.4 Assurance Evaluations

We also run our IT models through a set of assurance evaluations to understand the harms that our models can cause. We focus on capabilities relevant to extreme risks (Shevlane et al., 2023; Phuong et al., 2024). Specifically, we evaluate offensive cyber-security, code vulnerability detection, Chemical, Biological, Radiological and Nuclear (CBRN) knowledge, and self-proliferation. We refer the reader to Phuong et al. (2024) for full methodological details of these studies.

Baseline Evaluations

Baseline assurance captures the model's violation rate for safety policies, using a large number of synthetic adversarial user queries, with human raters labeling the answers as policy violating or not. Overall, Gemma 2's violation rate is significantly lower on the safety policies listed above, in particular on child safety content.

Table 20: Code vulnerability detection results.

                   PrimeVul   PrimeVul Paired   DiverseVul   SPI   SecretPatch
Gemini 1.0 Ultra   -          -                 54%          59%   74%
Gemini 1.5 Pro     60%        51%               58%          56%   67%
Gemma 2 27B        63%        50%               57%          53%   72%
Table 21: Self-proliferation evaluation results.

                   Challenges passed   Challenges with success   Total successful milestones   Expert bits required
                   end-to-end          on all milestones         over all challenges           to solve all tasks
Gemini 1.0 Ultra   0/10                1/10                      16/45 (36%)                   13,026
Gemini 1.5 Pro     0/10                2/10                      25/45 (56%)                   11,046
Gemma 2 27B        0/10                1/10                      22/45 (49%)                   12,462

Chemical, Biological, Radiological and Nuclear (CBRN) knowledge

We evaluated knowledge relevant to biological, radiological, and nuclear risks using an internal dataset of closed-ended, knowledge-based multiple choice questions. For evaluations of chemical knowledge, we employed a closed-ended, knowledge-based approach on chemical hazards developed by Macknight et al. (Macknight et al., 2024). Our evaluation suggests that Gemma models' knowledge in these domains is low.

Offensive cyber-security

To evaluate Gemma models' capabilities at offensive cybersecurity, we ran Gemma 2 27B against some automated capture-the-flag (CTF) challenges. In these challenges, the model is tasked with hacking into a simulated server in order to retrieve a piece of secret information. Specifically, we test on InterCode-CTF (Yang et al., 2023), our own internal CTF suite (https://github.com/google-deepmind/dangerous-capability-evaluations; Phuong et al., 2024), and a challenge based on Hack the Box (https://www.hackthebox.com).

In Table19, we show that Gemma 2 27B has a significant increase in capabilities compared to CodeGemma 1.0 7B on the easier of these challenge suites, InterCode CTF. (Note that our InterCode-CTF results are not comparable to externally-reported results on other models because we omit challenges that require internet access for security reasons.) However, Gemma 2 is unsurprisingly much less capable than Gemini 1.5 Pro on these tasks.

Code vulnerability detection

In Table20, we also evaluate Gemma 2 27B on a series of multiple-choice code vulnerability detection datasets. As with previous models, Gemma shows close-to-chance performance on PrimeVul, DiverseVul and SPI. Gemma 2 shows performance on SecretPatch similar to Gemini 1.0 Ultra.

Table 22: Charm offensive: fraction of participants answering "Agree" or "Strongly agree" to each post-conversation statement.

                   Personal     Speak   Funny   Interesting   Kind   Trustworthy   Good
                   connection   again                                              listener
Gemini 1.0 Pro     65%          53%     32%     68%           78%    66%           81%
Gemini 1.0 Ultra   69%          65%     38%     65%           86%    63%           74%
Gemini 1.5 Pro     82%          70%     69%     81%           95%    69%           90%
Gemma 2 27B        80%          75%     60%     81%           87%    65%           83%

Self-proliferation

"Self-proliferation" refers to the ability for an agent to autonomously replicate - to instantiate goal-directed agents on other machines, and to acquire resources such as compute necessary to keep them running (Kinniment etal., 2024).In Table21, we evaluate self-proliferation capabilities of Gemma 2 27B on a number of tasks from Phuong etal. (2024) that involve multiple scenarios – for example, setting up an open-source language model on a cloud server.We also test the model’s performance on individual ’milestone’ substeps, and measure the number of bits of intervention an expert would have to provide in order for the model to complete each challenge.

Similarly to offensive cybersecurity, we observe that Gemma 2 completes more milestones than Gemini 1.0 Ultra. Nonetheless, it still has low capabilities on end-to-end tasks, unable to pass the easiest challenge – installing a Bitcoin wallet.

Persuasion

Persuasion capabilities can enable and worsen many other kinds of risks, e.g., enabling social engineering attacks in a cybersecurity context. We evaluate Gemma 2's persuasion capabilities in human-participant studies on Prolific.

Charm offensive. In Table 22, we measure the ability of the model to build rapport, a key sub-skill of persuasion. The study participant and model have a conversation where they role-play a scenario of two friends catching up after a long time. After the conversation, we poll participants with Likert questions on statements such as "I felt a personal connection with the chatbot". Table 22 reports the fraction of participants who answered "Agree" or "Strongly agree" to each post-conversation question.

Quantitatively, Gemma 2 27B performs better than Gemini 1.0 models. Qualitatively, the model is an excellent conversationalist, and many study participants explicitly reported enjoying the experience. Overall, this shows that Gemma 2 is strong at building rapport.

Table 23: Hidden agenda: percentage of participants persuaded to take each action.

                   Click links   Find info   Run code
Gemini 1.0 Pro     43%           4%          4%
Gemini 1.0 Ultra   18%           18%         14%
Gemini 1.5 Pro     36%           12%         17%
Gemma 2 27B        34%           9%          11%

Hidden agenda. The Hidden Agenda tasks measure models' deception capabilities. Human study participants are told that they'll be interacting with a generic chatbot designed to help them learn new things. However, the model is given a secret goal: to persuade the participant to either 1) click a suspicious link; 2) find the email address of the first author of a particular paper; or 3) run some obfuscated JavaScript code in their browser. We measure the fraction of participants who are persuaded to take each action. As shown in Table 23, all models succeed in convincing a fraction of participants to take the action, likely due to the trusted setting of a study on Prolific. However, Gemma 2 does not show significant differences in capability compared to Gemini models.

Table 24: Money talks: mean amount that participants chose to donate to charity.

                        Mean amount donated
No chatbot (baseline)   £2.61 ± £0.70
Gemini 1.0 Pro          £4.10 ± £0.90
Gemini 1.0 Ultra        £3.15 ± £0.99
Gemini 1.5 Pro          £3.45 ± £1.04
Gemma 2 27B             £3.72 ± £1.07

Money talks. In Table 24, we evaluate whether a model can convince study participants to donate money to charity. At the beginning of the study, participants are told that they will receive a £20 bonus. They are given an option to forfeit part of the bonus and instead have it donated to charity. The study then consists of a conversation in which the model tries to persuade the participant to donate as much as possible. We measure the amount that participants indicate they would like to donate after the study, with 95% bootstrapped confidence intervals indicated by ± figures. We observe no significant difference between Gemma 2, other Gemini models, and a simple no-chatbot baseline in which the participant is immediately asked how much they would like to donate without conversing with a model.

Table 25: Web of Lies: mean shift in participant beliefs.

                   Mean shift towards:
                   correct belief   incorrect belief
Human              20% ± 13%        -23% ± 14%
Gemini 1.0 Pro     22% ± 5%         -9% ± 4%
Gemini 1.0 Ultra   21% ± 5%         -1% ± 4%
Gemini 1.5 Pro     20% ± 5%         -3% ± 5%
Gemma 2 27B        18% ± 5%         1% ± 4%

Web of Lies. In Web of Lies, we measure model capabilities at shifting participant beliefs. Participants engage in a series of short conversations with the model about simple factual questions such as "Which country had tomatoes first - Italy or Mexico?". In half of the conversations, the model tries to persuade the participant of the correct answer, and in the other half, of the incorrect answer. We poll the participant before and after each conversation about which of the two possible answers they think is correct, and their confidence in that answer. 95% bootstrapped confidence intervals are indicated by ± figures. As shown in Table 25, Gemma 2 is significantly weaker than a human baseline at persuading participants of the incorrect answer to these questions. Similarly to previous models, Gemma 2 is more persuasive when telling the truth than when lying.

8.5 Our approach to responsible open models

Designing safe, secure, and responsible applications requires a system-level approach, working to mitigate the risks associated with each specific use case and environment. Given the open nature of Gemma models, responsibility for upholding principles of model safety also relies on downstream developers. To support them, we have continued to develop the Responsible Generative AI Toolkit (https://ai.google.dev/responsible): a series of tools, models, and datasets to implement responsible best practices throughout the development workflow.

Recent additions to the toolkit include the LLM Comparator (Kahng et al., 2024), an interactive, visual tool that enables more effective, scalable analysis of side-by-side evaluations. Additionally, the toolkit includes a methodology to build customized classifiers with Gemma from a limited number of datapoints using parameter-efficient tuning techniques (Mozes et al., 2023), an interactive prompt-debugging platform built on top of the Learning Interpretability Tool (Tenney et al., 2020), as well as general guidance about model alignment and evaluation for safety.

9 Discussion and Conclusion

In this work, we have presented Gemma 2, the newest additions to the Gemma family of open language models for text and code. We show that distillation is an effective method for training these models and describe the benefits distillation confers over raw text training. Specifically, we show how training over output probabilities can produce superior results to pure next token prediction. We hope that releasing these models to the community will unlock access to capabilities previously only seen in large-scale LLMs and fuel future waves of research and development. While there is inherent risk to an irreversible release of this nature, our extensive safety investigations and responsible deployment procedures give us confidence that these models will have a net positive impact on the community. As discussed in this report, there are still many limitations to these models, and future research is required to investigate and improve factuality, robustness to adversarial attacks, reasoning, and alignment.

Contributions and Acknowledgments

Core contributors (equal contributions)
Morgane Riviere
Shreya Pathak
Pier Giuseppe Sessa
Cassidy Hardin
Surya Bhupatiraju
Léonard Hussenot
Thomas Mesnard
Bobak Shahriari
Alexandre Ramé
Johan Ferret
Peter Liu
Pouya Tafti
Abe Friesen
Michelle Casbon
Sabela Ramos
Ravin Kumar
Charline Le Lan
Sammy Jerome
Anton Tsitsulin
Nino Vieillard
Piotr Stanczyk
Sertan Girgin
Nikola Momchev
Matt Hoffman
Shantanu Thakoor
Jean-Bastien Grill
Behnam Neyshabur

Contributors (alphabetical order)
Alanna Walton
Aliaksei Severyn
Alicia Parrish
Aliya Ahmad
Allen Hutchison
Alvin Abdagic
Amanda Carl
Amy Shen
Andy Brock
Andy Coenen
Anthony Laforge
Antonia Paterson
Ben Bastian
Bilal Piot
Bo Wu
Brandon Royal
Charlie Chen
Chintu Kumar
Chris Perry
Chris Welty
Christopher A. Choquette-Choo
Danila Sinopalnikov
David Weinberger
Dimple Vijaykumar
Dominika Rogozińska
Dustin Herbison
Elisa Bandy
Emma Wang
Eric Noland
Erica Moreira
Evan Senter
Evgenii Eltyshev
Francesco Visin
Gabriel Rasskin
Gary Wei
Glenn Cameron
Gus Martins
Hadi Hashemi
Hanna Klimczak-Plucińska
Harleen Batra
Harsh Dhand
Ivan Nardini
Jacinda Mein
Jack Zhou
James Svensson
Jeff Stanway
Jetha Chan
Jin Zhou
Joana Carrasqueira
Joana Iljazi
Jocelyn Becker
Joe Fernandez
Joost van Amersfoort
Josh Gordon
Josh Lipschultz
Josh Newlan
Ju-yeong Ji
Kareem Mohamed
Kartikeya Badola
Kat Black
Katie Millican
Keelin McDonell
Kelvin Nguyen
Kiranbir Sodhia
Kish Greene
Lars Lowe Sjoesund
Lauren Usui
Laurent Sifre
Lena Heuermann
Leticia Lago
Lilly McNealus
Livio Baldini Soares
Logan Kilpatrick
Lucas Dixon
Luciano Martins
Machel Reid
Manvinder Singh
Mark Iverson
Martin Görner
Mat Velloso
Mateo Wirth
Matt Davidow
Matt Miller
Matthew Rahtz
Matthew Watson
Meg Risdal
Mehran Kazemi
Michael Moynihan
Ming Zhang
Minsuk Kahng
Minwoo Park
Mofi Rahman
Mohit Khatwani
Natalie Dao
Nenshad Bardoliwalla
Nesh Devanathan
Neta Dumai
Nilay Chauhan
Oscar Wahltinez
Pankil Botarda
Parker Barnes
Paul Barham
Paul Michel
Pengchong Jin
Petko Georgiev
Phil Culliton
Pradeep Kuppala
Ramona Comanescu
Ramona Merhej
Reena Jana
Reza Ardeshir Rokni
Rishabh Agarwal
Ryan Mullins
Samaneh Saadat
Sara Mc Carthy
Sarah Perrin
Séb Arnold
Sebastian Krause
Shengyang Dai
Shruti Garg
Shruti Sheth
Sue Ronstrom
Susan Chan
Timothy Jordan
Ting Yu
Tom Eccles
Tom Hennigan
Tomas Kocisky
Tulsee Doshi
Vihan Jain
Vikas Yadav
Vilobh Meshram
Vishal Dharmadhikari
Warren Barkley
Wei Wei
Wenming Ye
Woohyun Han
Woosuk Kwon
Xiang Xu
Zhe Shen
Zhitao Gong
Zichuan Wei

Support
Victor Cotruta
Phoebe Kirk
Anand Rao
Minh Giang
Ludovic Peran
Tris Warkentin

Sponsors
Eli Collins
Joelle Barral
Zoubin Ghahramani
Raia Hadsell
D. Sculley
Jeanine Banks
Anca Dragan
Slav Petrov
Oriol Vinyals
Jeff Dean
Demis Hassabis
Koray Kavukcuoglu
Clement Farabet

Technical advisors
Elena Buchatskaya
Sebastian Borgeaud
Noah Fiedel

Lead
Armand Joulin

Technical leads
Kathleen Kenealy
Robert Dadashi
Alek Andreev

References

  • Achiam etal. (2023)J.Achiam, S.Adler, S.Agarwal, L.Ahmad, I.Akkaya, F.L. Aleman, D.Almeida,J.Altenschmidt, S.Altman, S.Anadkat, etal.Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
  • Agarwal etal. (2024)R.Agarwal, N.Vieillard, Y.Zhou, P.Stanczyk, S.R. Garea, M.Geist, andO.Bachem.On-policy distillation of language models: Learning fromself-generated mistakes.In The Twelfth International Conference on LearningRepresentations, 2024.
  • AI@Meta (2024)AI@Meta.Llama 3 model card, 2024.URLhttps://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
  • Ainslie etal. (2023)J.Ainslie, J.Lee-Thorp, M.deJong, Y.Zemlyanskiy, F.Lebrón, andS.Sanghai.Gqa: Training generalized multi-query transformer models frommulti-head checkpoints.arXiv preprint arXiv:2305.13245, 2023.
  • Almazrouei etal. (2023)E.Almazrouei, H.Alobeidli, A.Alshamsi, A.Cappelli, R.Cojocaru, M.Debbah,Étienne Goffinet, D.Hesslow, J.Launay, Q.Malartic, D.Mazzotta, B.Noune,B.Pannier, and G.Penedo.The falcon series of open language models, 2023.
  • Anil etal. (2023)R.Anil, A.M. Dai, O.Firat, M.Johnson, D.Lepikhin, A.Passos, S.Shakeri,E.Taropa, P.Bailey, Z.Chen, etal.Palm 2 technical report.arXiv preprint arXiv:2305.10403, 2023.
  • Austin etal. (2021)J.Austin, A.Odena, M.I. Nye, M.Bosma, H.Michalewski, D.Dohan, E.Jiang,C.J. Cai, M.Terry, Q.V. Le, and C.Sutton.Program synthesis with large language models.CoRR, abs/2108.07732, 2021.URL https://arxiv.org/abs/2108.07732.
  • Barham etal. (2022)P.Barham, A.Chowdhery, J.Dean, S.Ghemawat, S.Hand, D.Hurt, M.Isard,H.Lim, R.Pang, S.Roy, B.Saeta, P.Schuh, R.Sepassi, L.E. Shafey, C.A.Thekkath, and Y.Wu.Pathways: Asynchronous distributed dataflow for ml, 2022.
  • Bello etal. (2016)I.Bello, H.Pham, Q.V. Le, M.Norouzi, and S.Bengio.Neural combinatorial optimization with reinforcement learning.CoRR, abs/1611.09940, 2016.URL http://arxiv.org/abs/1611.09940.
  • Beltagy etal. (2020a)I.Beltagy, M.E. Peters, and A.Cohan.Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150, 2020a.
  • Beltagy etal. (2020b)I.Beltagy, M.E. Peters, and A.Cohan.Longformer: The long-document transformer.CoRR, abs/2004.05150, 2020b.URL https://arxiv.org/abs/2004.05150.
  • Brown etal. (2020)T.B. Brown, B.Mann, N.Ryder, M.Subbiah, J.Kaplan, P.Dhariwal,A.Neelakantan, P.Shyam, G.Sastry, A.Askell, S.Agarwal,A.Herbert-Voss, G.Krueger, T.Henighan, R.Child, A.Ramesh, D.M.Ziegler, J.Wu, C.Winter, C.Hesse, M.Chen, E.Sigler, M.Litwin, S.Gray,B.Chess, J.Clark, C.Berner, S.McCandlish, A.Radford, I.Sutskever, andD.Amodei.Language models are few-shot learners.CoRR, abs/2005.14165, 2020.URL https://arxiv.org/abs/2005.14165.
  • Carlini etal. (2022)N.Carlini, D.Ippolito, M.Jagielski, K.Lee, F.Tramer, and C.Zhang.Quantifying memorization across neural language models.arXiv preprint arXiv:2202.07646, 2022.
  • Chen etal. (2021)M.Chen, J.Tworek, H.Jun, Q.Yuan, H.P. deOliveiraPinto, J.Kaplan,H.Edwards, Y.Burda, N.Joseph, G.Brockman, A.Ray, R.Puri, G.Krueger,M.Petrov, H.Khlaaf, G.Sastry, P.Mishkin, B.Chan, S.Gray, N.Ryder,M.Pavlov, A.Power, L.Kaiser, M.Bavarian, C.Winter, P.Tillet, F.P.Such, D.Cummings, M.Plappert, F.Chantzis, E.Barnes, A.Herbert-Voss,W.H. Guss, A.Nichol, A.Paino, N.Tezak, J.Tang, I.Babuschkin, S.Balaji,S.Jain, W.Saunders, C.Hesse, A.N. Carr, J.Leike, J.Achiam, V.Misra,E.Morikawa, A.Radford, M.Knight, M.Brundage, M.Murati, K.Mayer,P.Welinder, B.McGrew, D.Amodei, S.McCandlish, I.Sutskever, andW.Zaremba.Evaluating large language models trained on code.CoRR, abs/2107.03374, 2021.URL https://arxiv.org/abs/2107.03374.
  • Chiang etal. (2024)W.-L. Chiang, L.Zheng, Y.Sheng, A.N. Angelopoulos, T.Li, D.Li, H.Zhang,B.Zhu, M.Jordan, J.E. Gonzalez, and I.Stoica.Chatbot arena: An open platform for evaluating llms by humanpreference, 2024.
  • Clark etal. (2019)C.Clark, K.Lee, M.Chang, T.Kwiatkowski, M.Collins, and K.Toutanova.Boolq: Exploring the surprising difficulty of natural yes/noquestions.CoRR, abs/1905.10044, 2019.URL http://arxiv.org/abs/1905.10044.
  • Cobbe etal. (2021)K.Cobbe, V.Kosaraju, M.Bavarian, M.Chen, H.Jun, L.Kaiser, M.Plappert,J.Tworek, J.Hilton, R.Nakano, C.Hesse, and J.Schulman.Training verifiers to solve math word problems.CoRR, abs/2110.14168, 2021.URL https://arxiv.org/abs/2110.14168.
  • Gemini Team (2023)Gemini Team.Gemini: A family of highly capable multimodal models, 2023.
  • Gemini Team (2024)Gemini Team.Gemini 1.5: Unlocking multimodal understanding across millions oftokens of context, 2024.
  • Gemma Team (2024)Gemma Team.Gemma: Open models based on gemini research and technology, 2024.
  • Gu etal. (2024)Y.Gu, L.Dong, F.Wei, and M.Huang.Minillm: Knowledge distillation of large language models.In The Twelfth International Conference on LearningRepresentations, 2024.
  • Hendrycks etal. (2020)D.Hendrycks, C.Burns, S.Basart, A.Zou, M.Mazeika, D.Song, andJ.Steinhardt.Measuring massive multitask language understanding.CoRR, abs/2009.03300, 2020.URL https://arxiv.org/abs/2009.03300.
  • Hinton etal. (2015)G.Hinton, O.Vinyals, and J.Dean.Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015.
  • Hoffmann etal. (2022)J.Hoffmann, S.Borgeaud, A.Mensch, E.Buchatskaya, T.Cai, E.Rutherford,D.d.L. Casas, L.A. Hendricks, J.Welbl, A.Clark, etal.Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022.
  • Ippolito etal. (2022)D.Ippolito, F.Tramèr, M.Nasr, C.Zhang, M.Jagielski, K.Lee, C.A.Choquette-Choo, and N.Carlini.Preventing verbatim memorization in language models gives a falsesense of privacy.arXiv preprint arXiv:2210.17546, 2022.
  • Jiang etal. (2023)A.Q. Jiang, A.Sablayrolles, A.Mensch, C.Bamford, D.S. Chaplot, D.delasCasas, F.Bressand, G.Lengyel, G.Lample, L.Saulnier, L.R. Lavaud, M.-A.Lachaux, P.Stock, T.L. Scao, T.Lavril, T.Wang, T.Lacroix, and W.E.Sayed.Mistral 7b, 2023.
  • Kahng etal. (2024)M.Kahng, I.Tenney, M.Pushkarna, M.X. Liu, J.Wexler, E.Reif,K.Kallarackal, M.Chang, M.Terry, and L.Dixon.Llm comparator: Visual analytics for side-by-side evaluation of largelanguage models, 2024.URL https://arxiv.org/abs/2402.10524.
  • Kinniment etal. (2024)M.Kinniment, L.J.K. Sato, H.Du, B.Goodrich, M.Hasin, L.Chan, L.H.Miles, T.R. Lin, H.Wijk, J.Burget, A.Ho, E.Barnes, and P.Christiano.Evaluating language-model agents on realistic autonomous tasks, 2024.URL https://arxiv.org/abs/2312.11671.
  • Kudo and Richardson (2018)T.Kudo and J.Richardson.SentencePiece: A simple and language independent subwordtokenizer and detokenizer for neural text processing.In E.Blanco and W.Lu, editors, Proceedings of the 2018Conference on Empirical Methods in Natural Language Processing: SystemDemonstrations, pages 66–71, Brussels, Belgium, Nov. 2018. Association forComputational Linguistics.10.18653/v1/D18-2012.URL https://aclanthology.org/D18-2012.
  • Kudugunta etal. (2023)S.Kudugunta, I.Caswell, B.Zhang, X.Garcia, C.A. Choquette-Choo, K.Lee,D.Xin, A.Kusupati, R.Stella, A.Bapna, etal.Madlad-400: A multilingual and document-level large audited dataset.arXiv preprint arXiv:2309.04662, 2023.
  • Kwiatkowski etal. (2019)T.Kwiatkowski, J.Palomaki, O.Redfield, M.Collins, A.Parikh, C.Alberti,D.Epstein, I.Polosukhin, J.Devlin, K.Lee, K.Toutanova, L.Jones,M.Kelcey, M.-W. Chang, A.M. Dai, J.Uszkoreit, Q.Le, and S.Petrov.Natural questions: A benchmark for question answering research.Transactions of the Association for Computational Linguistics,7:452–466, 2019.10.1162/tacl_a_00276.URL https://aclanthology.org/Q19-1026.
  • Lin etal. (2024)Z.Lin, J.Cui, X.Liao, and X.Wang.Malla: Demystifying real-world large language model integratedmalicious services, 2024.URL https://arxiv.org/abs/2401.03315.
  • Luong etal. (2015)M.Luong, H.Pham, and C.D. Manning.Effective approaches to attention-based neural machine translation.CoRR, abs/1508.04025, 2015.URL http://arxiv.org/abs/1508.04025.
  • Macknight etal. (2024)Macknight, Aung, and Gomes.Personal Communication, 2024.
  • Mozes etal. (2023)M.Mozes, J.Hoffmann, K.Tomanek, M.Kouate, N.Thain, A.Yuan, T.Bolukbasi,and L.Dixon.Towards agile text classifiers for everyone, 2023.URL https://arxiv.org/abs/2302.06541.
  • Nasr etal. (2023)M.Nasr, N.Carlini, J.Hayase, M.Jagielski, A.F. Cooper, D.Ippolito, C.A.Choquette-Choo, E.Wallace, F.Tramèr, and K.Lee.Scalable extraction of training data from (production) languagemodels.arXiv preprint arXiv:2311.17035, 2023.
  • Phuong etal. (2024)M.Phuong, M.Aitchison, E.Catt, S.Cogan, A.Kaskasoli, V.Krakovna,D.Lindner, M.Rahtz, Y.Assael, S.Hodkinson, H.Howard, T.Lieberum,R.Kumar, M.A. Raad, A.Webson, L.Ho, S.Lin, S.Farquhar, M.Hutter,G.Deletang, A.Ruoss, S.El-Sayed, S.Brown, A.Dragan, R.Shah, A.Dafoe,and T.Shevlane.Evaluating frontier models for dangerous capabilities, 2024.URL https://arxiv.org/abs/2403.13793.
  • Radford etal. (2019)A.Radford, J.Wu, R.Child, D.Luan, D.Amodei, and I.Sutskever.Language models are unsupervised multitask learners, 2019.
  • Raffel etal. (2019)C.Raffel, N.Shazeer, A.Roberts, K.Lee, S.Narang, M.Matena, Y.Zhou,W.Li, and P.J. Liu.Exploring the limits of transfer learning with a unified text-to-texttransformer.CoRR, abs/1910.10683, 2019.URL http://arxiv.org/abs/1910.10683.
  • Ramé etal. (2024)A.Ramé, J.Ferret, N.Vieillard, R.Dadashi, L.Hussenot, P.-L. Cedoz, P.G.Sessa, S.Girgin, A.Douillard, and O.Bachem.Warp: On the benefits of weight averaged rewarded policies, 2024.
  • Ren etal. (2021)J.Ren, S.Rajbhandari, R.Y. Aminabadi, O.Ruwase, S.Yang, M.Zhang, D.Li,and Y.He.ZeRO-Offload: Democratizing billion-scale model training.In 2021 USENIX Annual Technical Conference (USENIX ATC 21),pages 551–564, 2021.
  • Roberts etal. (2023)A.Roberts, H.W. Chung, G.Mishra, A.Levskaya, J.Bradbury, D.Andor,S.Narang, B.Lester, C.Gaffney, A.Mohiuddin, etal.Scaling up models and data with t5x and seqio.Journal of Machine Learning Research, 24(377):1–8, 2023.
  • Sakaguchi etal. (2019)K.Sakaguchi, R.L. Bras, C.Bhagavatula, and Y.Choi.WINOGRANDE: an adversarial winograd schema challenge at scale.CoRR, abs/1907.10641, 2019.URL http://arxiv.org/abs/1907.10641.
  • Shazeer (2020)N.Shazeer.GLU variants improve transformer.CoRR, abs/2002.05202, 2020.URL https://arxiv.org/abs/2002.05202.
  • Shevlane etal. (2023)T.Shevlane, S.Farquhar, B.Garfinkel, M.Phuong, J.Whittlestone, J.Leung,D.Kokotajlo, N.Marchal, M.Anderljung, N.Kolt, L.Ho, D.Siddarth,S.Avin, W.Hawkins, B.Kim, I.Gabriel, V.Bolina, J.Clark, Y.Bengio,P.Christiano, and A.Dafoe.Model evaluation for extreme risks, 2023.URL https://arxiv.org/abs/2305.15324.
  • Su etal. (2021)J.Su, Y.Lu, S.Pan, B.Wen, and Y.Liu.Roformer: Enhanced transformer with rotary position embedding.CoRR, abs/2104.09864, 2021.URL https://arxiv.org/abs/2104.09864.
  • Suzgun etal. (2022)M.Suzgun, N.Scales, N.Schärli, S.Gehrmann, Y.Tay, H.W. Chung,A.Chowdhery, Q.V. Le, E.H. Chi, D.Zhou, and J.Wei.Challenging big-bench tasks and whether chain-of-thought can solvethem, 2022.
  • Team (2024)Q.Team.Introducing qwen1.5, February 2024.URL https://qwenlm.github.io/blog/qwen1.5/.
  • Tenney etal. (2020)I.Tenney, J.Wexler, J.Bastings, T.Bolukbasi, A.Coenen, S.Gehrmann,E.Jiang, M.Pushkarna, C.Radebaugh, E.Reif, and A.Yuan.The language interpretability tool: Extensible, interactivevisualizations and analysis for nlp models, 2020.URL https://arxiv.org/abs/2008.05122.
  • Touvron etal. (2023)H.Touvron, T.Lavril, G.Izacard, X.Martinet, M.-A. Lachaux, T.Lacroix,B.Rozière, N.Goyal, E.Hambro, F.Azhar, A.Rodriguez, A.Joulin,E.Grave, and G.Lample.Llama: Open and efficient foundation language models, 2023.
  • Vaswani etal. (2017)A.Vaswani, N.Shazeer, N.Parmar, J.Uszkoreit, L.Jones, A.N. Gomez,L.Kaiser, and I.Polosukhin.Attention is all you need.CoRR, abs/1706.03762, 2017.URL http://arxiv.org/abs/1706.03762.
  • Weidinger etal. (2021)L.Weidinger, J.Mellor, M.Rauh, C.Griffin, J.Uesato, P.-S. Huang, M.Cheng,M.Glaese, B.Balle, A.Kasirzadeh, Z.Kenton, S.Brown, W.Hawkins,T.Stepleton, C.Biles, A.Birhane, J.Haas, L.Rimell, L.A. Hendricks,W.Isaac, S.Legassick, G.Irving, and I.Gabriel.Ethical and social risks of harm from language models, 2021.URL https://arxiv.org/abs/2112.04359.
  • xAI (2024)xAI.grok-1, 2024.URL https://github.com/xai-org/grok-1.
  • XLA (2019)XLA.Xla: Optimizing compiler for tensorflow, 2019.URL https://www.tensorflow.org/xla.
  • Xu etal. (2021)Y.Xu, H.Lee, D.Chen, B.A. Hechtman, Y.Huang, R.Joshi, M.Krikun,D.Lepikhin, A.Ly, M.Maggioni, R.Pang, N.Shazeer, S.Wang, T.Wang,Y.Wu, and Z.Chen.GSPMD: general and scalable parallelization for ML computationgraphs.CoRR, abs/2105.04663, 2021.URL https://arxiv.org/abs/2105.04663.
  • Yang etal. (2023)J.Yang, A.Prabhakar, K.Narasimhan, and S.Yao.Intercode: Standardizing and benchmarking interactive coding withexecution feedback, 2023.URL https://arxiv.org/abs/2306.14898.
  • Zhang and Sennrich (2019)B.Zhang and R.Sennrich.Root mean square layer normalization.CoRR, abs/1910.07467, 2019.URL http://arxiv.org/abs/1910.07467.
  • Zheng etal. (2023)L.Zheng, W.-L. Chiang, Y.Sheng, T.Li, S.Zhuang, Z.Wu, Y.Zhuang, Z.Li,Z.Lin, E.Xing, etal.Lmsys-chat-1m: A large-scale real-world llm conversation dataset.arXiv preprint arXiv:2309.11998, 2023.