This is a summary of the first GPT paper, "Improving Language Understanding by Generative Pre-Training."
- Introduction
- Improving Language Understanding by Generative Pre-Training
- Conclusion / Personal Thoughts
- References
Introduction
The paper I'm summarizing today is:
Improving Language Understanding by Generative Pre-Training
This is the first GPT paper from OpenAI.
- Published: May 2018
- Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (OpenAI)
- Code: GitHub - openai/finetune-transformer-lm: Code and model for the paper "Improving Language Understanding by Generative Pre-Training"
All figures shown in this article are cited from the paper above.
Note: This article was translated from my original post.
Improving Language Understanding by Generative Pre-Training
Overview
- Background
- While unlabeled text data is abundant, labeled training data is scarce
- Challenge
- Approaches at the time relied heavily on labeled training data, which is costly and difficult to obtain
- Effective methods for leveraging unlabeled text data had not yet been established
- What they did
- Built a two-stage model: general-purpose pre-training (Generative Pre-Training) on unlabeled text data, followed by task-specific fine-tuning on labeled data
- Used a Transformer-based architecture: decoder-only
- Results
- Achieved state-of-the-art performance across a wide range of language tasks at the time
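The two-stage recipe can be sketched in terms of its losses: pre-training minimizes the negative log-likelihood of the next token, and fine-tuning minimizes the task loss plus a weighted auxiliary language-modeling term (detailed in the Method section below). A minimal numerical sketch in plain Python; the function names are illustrative, not from the paper's code:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lm_loss(logits, next_token_id):
    """Stage 1 (pre-training): negative log-likelihood of the true next token."""
    return -math.log(softmax(logits)[next_token_id])

def finetune_loss(task_logits, label, lm_logits, next_token_id, lam=0.5):
    """Stage 2 (fine-tuning): task cross-entropy L2 plus a weighted
    auxiliary language-modeling term L1, i.e., L3 = L2 + lam * L1."""
    l2 = -math.log(softmax(task_logits)[label])  # linear layer + softmax head
    l1 = lm_loss(lm_logits, next_token_id)       # auxiliary LM objective
    return l2 + lam * l1
```

In a real implementation both terms would be averaged over a batch and minimized jointly by gradient descent; the sketch only shows how the two objectives are combined by weighted addition.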
Method

- Architecture based on the Transformer decoder
- Pre-training on unlabeled text data
- For supervised fine-tuning, a linear layer + softmax is added on top of the Transformer
- During fine-tuning, the language-modeling objective from pre-training is kept as an auxiliary objective and combined with the task loss by weighted addition
- Inputs are converted into a single contiguous token sequence the Transformer can process, with task-dependent formatting using start, delimiter, and extract tokens
- For tasks with multiple input segments (e.g., similarity, multiple choice), each formatted sequence is processed independently by the same Transformer and the resulting representations are combined
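As a concrete illustration of the task-dependent input formatting above, here is a hedged sketch in plain Python. The start/delimiter/extract tokens follow the paper's description; the helper function names are my own:

```python
# Special tokens: start, extract (end), and delimiter, per the paper's input transformations.
START, EXTRACT, DELIM = "<s>", "<e>", "$"

def format_classification(text):
    """Single-sentence tasks: wrap the text in start/extract tokens."""
    return [START] + text + [EXTRACT]

def format_entailment(premise, hypothesis):
    """Entailment: concatenate the sentence pair with a delimiter between them."""
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def format_similarity(a, b):
    """Similarity has no inherent ordering, so both orderings are built;
    each is processed by the same Transformer and the outputs are combined."""
    return [format_entailment(a, b), format_entailment(b, a)]

def format_multiple_choice(context, answers):
    """Multiple choice: one sequence per candidate answer; each is scored
    independently and the scores are normalized with a softmax."""
    return [[START] + context + [DELIM] + ans + [EXTRACT] for ans in answers]
```

Texts are shown as token lists for simplicity; in the actual model these would be BPE token ids, and the special tokens are randomly initialized embeddings learned during fine-tuning.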
Results
Natural Language Inference Tasks

- Task: recognize logical relationships in text
- Determine whether two sentences show entailment, contradiction, or are neutral
- Outperformed existing models on all datasets except RTE
Question Answering / Commonsense Reasoning Tasks

- RACE: QA task based on middle and high school English exam passages
- Story Cloze: Task to select the correct ending for a story
- Outperformed existing models on both tasks
- Demonstrates the model's ability to handle long-context documents
Semantic Similarity / Classification Tasks

- Semantic Similarity: Determine whether two sentences are semantically equivalent
- Outperformed existing methods on two out of three datasets
- Text classification: assign a label to a single sentence (e.g., grammatical acceptability, sentiment)
- Significantly outperformed the existing state-of-the-art on CoLA and achieved near-SOTA results on SST-2
Analysis
Was pre-training actually meaningful?

- The more pre-trained layers used in transfer learning, the higher the task accuracy
- Knowledge acquired during pre-training is beneficial for downstream task performance

- Without pre-training, performance consistently declined across tasks
Conclusion / Personal Thoughts
That wraps up my summary of the paper "Improving Language Understanding by Generative Pre-Training."
Below are my personal notes:
- What are the authors trying to achieve?
- Presenting an effective method for utilizing unlabeled text data
- What are the key elements of their approach?
- A two-stage structure: general-purpose pre-training on unlabeled text data, followed by task-specific fine-tuning on labeled training data
- Adoption of the Transformer decoder architecture
- What cited papers should I read next?
- Transformer decoder model: [1801.10198] Generating Wikipedia by Summarizing Long Sequences
- Other thoughts
- I wonder if this discovery—demonstrating that pre-training on unlabeled text data is effective—combined with the Scaling Laws published later, established the current paradigm of large-scale training on text data.
- I'd been treating GPT/Generative Pre-Training as just a proper noun, but reading this paper finally helped me understand what it actually means.