What is a large language model (LLM)?
A large language model (LLM) is a type of machine learning model designed for
natural language processing tasks such as language generation. LLMs are language
models with many parameters, and are trained with self-supervised learning on a
vast amount of text.
The largest and most capable LLMs are generative pretrained transformers (GPTs).
Modern models can be fine-tuned for specific tasks or guided by prompt
engineering.[1] These models acquire predictive power regarding syntax, semantics,
and ontologies[2] inherent in human language corpora, but they also inherit
inaccuracies and biases present in the data they are trained on.[3]
History
[Figure: Training compute of notable large models in FLOPs vs. publication date, 2010-2024, shown for notable models overall (top left), frontier models (top right), top language models (bottom left), and top models within leading companies (bottom right). The majority of these models are language models.]
[Figure: Training compute of notable large AI models in FLOPs vs. publication date, 2017-2024. The majority of large models are language models or multimodal models with language capacity.]
Before 2017, there were a few language models that were large relative to the
capacities then available. In the 1990s, the IBM alignment models pioneered
statistical language modelling. In 2001, a smoothed n-gram model trained on 0.3
billion words achieved state-of-the-art perplexity.[4] In the 2000s, as Internet use
became prevalent, some researchers constructed Internet-scale language datasets
("web as corpus"[5]), upon which they trained statistical language models.[6][7] By
2009, statistical language models dominated over symbolic language models in most
language-processing tasks, because they could usefully ingest large datasets.[8]
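To make the terms "smoothed n-gram model" and "perplexity" concrete, the following minimal Python sketch trains a bigram model with add-one (Laplace) smoothing and scores text by perplexity. It is an illustration only: the 2001 model cited above used a far larger corpus and more sophisticated smoothing techniques.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count unigrams and bigrams from a token sequence."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def bigram_prob(w_prev, w, unigrams, bigrams, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(w | w_prev)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def perplexity(tokens, unigrams, bigrams, vocab_size):
    """Perplexity = exp(mean negative log-probability per bigram)."""
    logp = sum(
        math.log(bigram_prob(a, b, unigrams, bigrams, vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )
    return math.exp(-logp / (len(tokens) - 1))

corpus = "the cat sat on the mat and the dog sat on the rug".split()
unigrams, bigrams = train_bigram(corpus)
held_out = "the cat sat on the rug".split()
print(perplexity(held_out, unigrams, bigrams, len(unigrams)))  # lower is better
```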
After neural networks became dominant in image processing around 2012,[9] they
were applied to language modelling as well. Google converted its translation service
to Neural Machine Translation in 2016. Because this predated the transformer, it was
done with seq2seq deep LSTM networks.
[Figure: Main components of the transformer model from the original paper, in which layers were normalized after (instead of before) multi-headed attention.]
At the 2017 NeurIPS conference, Google researchers introduced the transformer
architecture in their landmark paper "Attention Is All You Need". The paper's goal
was to improve upon 2014 seq2seq technology,[10] and it was based mainly on the
attention mechanism developed by Bahdanau et al. in 2014.[11] The following year,
in 2018, BERT was introduced and quickly became "ubiquitous".[12] Though the
original transformer has both encoder and decoder blocks, BERT is an encoder-only
model. Academic and research usage of BERT began to decline in 2023, following
rapid improvements in the ability of decoder-only models (such as GPT) to solve
tasks via prompting.[13]
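At the heart of the transformer is scaled dot-product attention, defined in the paper as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The NumPy sketch below implements that single formula, omitting the learned projections and multi-head splitting of the full architecture; the causal mask shown is how decoder-only models such as GPT restrict each position to attend only to itself and earlier positions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # query-key similarities
    if mask is not None:
        scores = np.where(mask, scores, -1e9)       # blocked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key axis
    return weights @ V                              # weighted average of value vectors

# Toy input: 4 token positions, dimension 8; self-attention uses x as Q, K, and V.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
causal = np.tril(np.ones((4, 4), dtype=bool))  # lower-triangular causal mask
out = scaled_dot_product_attention(x, x, x, mask=causal)
print(out.shape)  # (4, 8)
```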
Although the decoder-only GPT-1 was introduced in 2018, it was GPT-2 in 2019 that
caught widespread attention, because OpenAI at first deemed it too powerful to
release publicly, out of fear of malicious use.[14] GPT-3 in 2020 went a step further
and, as of 2024, is available only via API, with no option to download the model for
local execution. But it was the consumer-facing, browser-based ChatGPT, released in
2022, that captured the imagination of the general population and generated
considerable media hype and online buzz.[15] The 2023 GPT-4 was praised for its
increased accuracy and as a "holy grail" for its multimodal capabilities.[16] OpenAI
did not reveal GPT-4's high-level architecture or number of parameters. The release
of ChatGPT led to an uptick in LLM usage across several research subfields of
computer science, including robotics, software engineering, and societal impact
work.[17] In 2024, OpenAI released the reasoning model OpenAI o1, which generates
long chains of thought before returning a final answer.
Competing language models have for the most part been attempting to equal the
GPT series, at least in terms of number of parameters.[18]
Since 2022, source-available models have been gaining popularity, especially at first
with BLOOM and LLaMA, though both have restrictions on the field of use. Mistral
AI's models Mistral 7B and Mixtral 8x7B are released under the more permissive
Apache License.
In January 2025, DeepSeek released DeepSeek R1, a 671-billion-parameter open-
weight model that performs comparably to OpenAI o1 but at a much lower cost.[19]
Since 2023, many LLMs have been trained to be multimodal, able to process or
generate other types of data as well, such as images or audio. Such LLMs are also
called large multimodal models (LMMs).[20]
As of 2024, the largest and most capable models are all based on the transformer
architecture. Some recent implementations are based on other architectures, such
as recurrent neural network variants and Mamba (a state space model).[21][22][23]