Blog

5 Seminal Papers to Kickstart Your Journey Into Large Language Models


Author: Dylan, Research Scientist II
Five-minute read

Large language models (LLMs) have introduced an exciting new paradigm to the machine learning world in the past few years.

Research groups like AIS’s Advanced Research Concepts (ARC) team have been quick to explore the range of possibilities opened by this new technology. However, aspiring AI developers and scientists often don’t know where to start learning about the science of large language models. In this blog post, we’ll review five seminal papers from the field and provide a brief reading guide so you know which details to look for when you read them yourself.

1

Attention Is All You Need

“Attention Is All You Need” (Vaswani et al., 2017) introduced the transformer, the neural network architecture that underlies all modern LLM designs. Transformers differed from previous neural networks in their use of self-attention, a mechanism that computes how each element of the input sequence influences the representation of every other element. While the architecture was originally presented for machine translation, it was adapted and scaled up by OpenAI into their first GPT model and by Google into BERT and T5.

What Will This Paper Teach You?

  • The self-attention mechanism.
  • How the transformer architecture is organized into layers that update a residual stream.
  • Autoregressive language generation and how it fits into the sequence-to-sequence paradigm.
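
To make the first bullet concrete, here is a minimal NumPy sketch of the scaled dot-product self-attention described in the paper. The shapes, variable names and toy inputs are our own illustrative choices, not the authors’ code, and real transformers run many such attention heads in parallel.

```python
# Minimal single-head self-attention sketch (illustrative, not the paper's code)
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model) token embeddings; W_q/W_k/W_v: (d_model, d_head) projections."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how strongly each token attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                               # each output is a weighted mix of value vectors

# Toy usage: 4 tokens, 8-dimensional embeddings, a single 8-dimensional head
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = self_attention(x, *(rng.normal(size=(8, 8)) for _ in range(3)))
print(out.shape)  # (4, 8)
```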

2

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Another paper to come out of the early transformer literature, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (Devlin et al., 2019), set the paradigm for transformer embedding models, which map sequences of text into vectors. The BERT model uses the encoder part of the transformer architecture and is trained with unsupervised masked language modeling on a large corpus of unstructured text. AIS used MiniLM, a successor to BERT, as part of an explainable reinforcement learning technique in our paper “CODEX: A Cluster-Based Method for Explainable Reinforcement Learning”.

What Will This Paper Teach You?

  • How the encoder part of the transformer architecture can be isolated for embedding models.
  • The paradigm of pre-training and fine-tuning with language models.
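
As a companion to these points, the sketch below shows one common way to pull sentence embeddings out of a pre-trained BERT encoder with the Hugging Face transformers library. The model name and the mean-pooling step are our assumptions for illustration, not the exact setup used in the CODEX paper.

```python
# Hedged sketch: sentence embeddings from a pre-trained BERT encoder
# (bert-base-uncased and mean pooling are illustrative choices)
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["Transformers embed text into vectors.", "BERT uses only the encoder stack."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state       # (batch, seq_len, hidden_dim)

# Mean-pool over non-padding tokens to get one vector per sentence
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, 768])
```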

3

Language Models are Few-Shot Learners

While all of OpenAI’s GPT papers have been influential, the paper covering GPT-3, “Language Models are Few-Shot Learners” (Brown et al., 2020), marked the entry into the modern era of LLMs. The paper reviews the training process and architecture of GPT-3, but mostly focuses on the efficacy of few-shot learning, a technique where a pre-trained LLM is given worked examples in its context window instead of being fine-tuned for a task.

What Will This Paper Teach You?

  • How the decoder part of the transformer architecture can be isolated for pure autoregressive generation.
  • The advantages of few-shot learning over fine-tuning.
  • The test sets used to evaluate capabilities of large language models, like HellaSwag, Winograd schemas and TriviaQA.
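
Few-shot learning is easier to appreciate with a concrete prompt in hand. The sketch below, adapted from the English-to-French translation examples shown in the paper, builds a few-shot prompt in plain Python; the resulting string can be sent to any completion-style LLM endpoint.

```python
# Few-shot prompting in the style of the GPT-3 paper: the "training" happens
# entirely inside the prompt, with no weight updates.
examples = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe en peluche"),
]
query = "cheese"

prompt = "Translate English to French:\n"
for english, french in examples:      # the k "shots" supplied in-context
    prompt += f"{english} => {french}\n"
prompt += f"{query} =>"               # the model completes this final line

print(prompt)
```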

4

ReAct: Synergizing Reasoning and Acting in Language Models

One area of focus within LLM research is using prompt scaffolding and reasoning schemes to elicit more powerful capabilities from LLMs. A core paper in this subfield is “ReAct: Synergizing Reasoning and Acting in Language Models” (Yao et al., 2023), which builds on chain-of-thought prompting by interleaving reasoning traces with actions, letting the model consult resources like a Wikipedia API before deciding what to do next. The ReAct framework is evaluated on direct question-answering tasks as well as text-based decision-making environments and shows improvements over baseline prompting in both.

What Will This Paper Teach You?

  • Prompt scaffolding frameworks and the degree to which they augment LLM abilities.
  • The evaluation environments for LLM agents like ALFWorld and WebShop.
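
The core of ReAct is a loop in which the model alternates free-text reasoning (“Thought”) with tool calls (“Action”) and sees tool results fed back as “Observation”. The schematic sketch below captures that loop; llm() and wikipedia_search() are hypothetical stand-ins, and the Search/Finish parsing is simplified relative to the paper’s actual prompts.

```python
# Schematic ReAct loop (simplified; llm() and wikipedia_search() are hypothetical stand-ins)
def react_loop(question, llm, wikipedia_search, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")            # model emits a thought plus a proposed action
        transcript += "Thought:" + step + "\n"
        if "Finish[" in step:                          # model decided it has the answer
            return step.split("Finish[")[1].split("]")[0]
        if "Search[" in step:                          # model asked to consult Wikipedia
            query = step.split("Search[")[1].split("]")[0]
            observation = wikipedia_search(query)
            transcript += f"Observation: {observation}\n"  # tool result goes back into the context
    return None  # gave up after max_steps
```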

5

The Llama 3 Herd of Models

To get a better idea of how modern LLMs are trained, we recommend “The Llama 3 Herd of Models” (Dubey et al., 2024). The Llama models are among the strongest open-weight models available today, alongside the Qwen and Mistral series, and the AIS team uses Llama models for several of our internal projects. This paper goes into depth on the technical engineering details of the training process, especially post-training, where techniques like supervised fine-tuning (SFT) and direct preference optimization (DPO) shape the base model into a coherent, instruction-following assistant.

What Will This Paper Teach You?

  • The post-training process for modern LLMs, including alignment, multilinguality and coding specialization.
  • How scaling laws are used during training to make the best use of a fixed compute budget (specifically, the data size/model size tradeoff).
  • The hardware and software used to support the platforms for training LLMs.
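
To see why the data size/model size tradeoff matters, the back-of-the-envelope sketch below uses the common approximation that training cost is roughly C ≈ 6 · N · D FLOPs for N parameters and D tokens. The 6ND rule and the candidate model sizes are heuristics from the scaling-law literature rather than the Llama 3 team’s exact fits; the budget is the roughly 3.8 × 10²⁵ FLOPs reported for the 405B flagship model.

```python
# Back-of-the-envelope compute budgeting with the C ≈ 6*N*D approximation
# (a rule of thumb from the scaling-law literature, not the paper's exact fits)
def training_flops(params, tokens):
    return 6 * params * tokens

budget = 3.8e25                       # approximate budget reported for Llama 3 405B
for params in (8e9, 70e9, 405e9):     # candidate model sizes
    tokens = budget / (6 * params)    # tokens affordable at this size under the budget
    print(f"{params / 1e9:>5.0f}B params -> {tokens / 1e12:6.1f}T tokens "
          f"({tokens / params:.0f} tokens per parameter)")
```
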
While these papers should provide a solid foundation for any novice LLM enthusiast, reading will only get you so far.

It’s important to engage with these papers and others in a more hands-on way: re-implementing architectures, reproducing results and running experiments on top of prior work are all great ways to learn more about LLMs and machine learning more generally. In particular, work that relies on frontier model APIs, like the ReAct paper above, doesn’t require massive GPU servers and can be reproduced from a consumer-grade laptop. We hope that the resources we’ve provided give you a nice jumpstart!
