Understanding the First GPT | Paper Notes: Improving Language Understanding by Generative Pre-Training

This is a summary of the first GPT paper, "Improving Language Understanding by Generative Pre-Training."

Introduction

The paper I'm summarizing today is:

Improving Language Understanding by Generative Pre-Training

This is the first GPT paper from OpenAI.

All figures shown in this article are cited from the paper above.

Note: This article was translated from my original post.


Overview

  • Background
    • While unlabeled text data is abundant, labeled training data is scarce
  • Challenge
    • Current approaches rely heavily on labeled training data, which is difficult to obtain
    • Effective methods for leveraging unlabeled text data had not yet been established
  • What they did
    • Built a two-stage model: general-purpose pre-training (Generative Pre-Training) on unlabeled text data, followed by task-specific fine-tuning on labeled data
    • Used a Transformer-based architecture: decoder-only
  • Results
    • Achieved state-of-the-art performance across a wide range of language tasks at the time

Method

Model architecture and fine-tuning configuration

  • Architecture based on the Transformer decoder
  • Pre-training on unlabeled text data
  • For supervised fine-tuning, a linear layer + softmax is added on top of the Transformer
    • The fine-tuning objective keeps the pre-training language-modeling loss as an auxiliary objective, combined with the supervised task loss by weighted addition (L3 = L2 + λ·L1)
    • Each task's input is converted into a single contiguous token sequence the Transformer can process; the exact formatting depends on the task
    • For tasks with multiple candidate inputs (e.g., multiple choice, similarity), each sequence is processed independently by the same shared Transformer and the resulting representations are combined
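The weighted combination of the two objectives can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `cross_entropy` helper and the toy probability vectors are mine; only the structure L3 = L2 + λ·L1 (with λ = 0.5 in the paper) comes from the paper.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under a probability vector."""
    return -math.log(probs[target])

def finetune_loss(task_probs, task_label, lm_probs_per_step, lm_targets, lam=0.5):
    """Combine the supervised task loss (L2) with the language-modeling
    auxiliary loss (L1) by weighted addition: L3 = L2 + lam * L1."""
    l2 = cross_entropy(task_probs, task_label)                # supervised loss
    l1 = sum(cross_entropy(p, t)                              # LM loss, summed
             for p, t in zip(lm_probs_per_step, lm_targets))  # over positions
    return l2 + lam * l1

# Toy example: a 3-class task head and a 2-step LM continuation.
loss = finetune_loss(
    task_probs=[0.7, 0.2, 0.1], task_label=0,
    lm_probs_per_step=[[0.6, 0.4], [0.3, 0.7]], lm_targets=[0, 1],
)
```

The paper reports that this auxiliary LM loss helps mainly on larger fine-tuning datasets.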

Results

Natural Language Inference Tasks

Natural Language Inference task results

  • Task: recognize logical relationships in text
    • Determine whether two sentences show entailment, contradiction, or are neutral
  • Outperformed existing models on all datasets except RTE
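For entailment, the paper's input transformation simply concatenates the two sentences into one sequence. A minimal sketch, assuming the start, delimiter, and extract tokens the paper describes (the helper function name is mine):

```python
def format_nli(premise_tokens, hypothesis_tokens):
    """Concatenate premise and hypothesis into a single token sequence,
    with start (<s>), delimiter ($), and extract (<e>) tokens."""
    return ["<s>"] + premise_tokens + ["$"] + hypothesis_tokens + ["<e>"]

seq = format_nli(["a", "man", "sleeps"], ["a", "person", "rests"])
# → ['<s>', 'a', 'man', 'sleeps', '$', 'a', 'person', 'rests', '<e>']
```

The final token's hidden state is then fed to the linear + softmax layer to predict entailment, contradiction, or neutral.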

Question Answering / Commonsense Reasoning Tasks

  • RACE: QA task based on middle and high school English exam passages
  • Story Cloze: Task to select the correct ending for a story
  • Outperformed existing models on both tasks
    • Demonstrates the model's ability to handle long-context documents
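For multiple-choice tasks like RACE and Story Cloze, the paper builds one sequence per candidate answer, scores each with the same shared Transformer, and normalizes the scores with a softmax. A hedged sketch (the function names and toy scores are mine):

```python
import math

def format_choices(context_tokens, answer_token_lists):
    """Build one input sequence per candidate answer; each is processed
    independently by the shared Transformer."""
    return [["<s>"] + context_tokens + ["$"] + ans + ["<e>"]
            for ans in answer_token_lists]

def softmax(scores):
    """Normalize per-answer scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

seqs = format_choices(["the", "story"], [["ending", "a"], ["ending", "b"]])
probs = softmax([1.2, 0.3])  # hypothetical per-answer scores from the model
```

The answer with the highest probability is taken as the prediction.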

Semantic Similarity / Classification Tasks

Semantic similarity and classification task results

  • Semantic Similarity: Determine whether two sentences are semantically equivalent
    • Outperformed existing methods on two out of three datasets
  • Text classification
    • Significantly outperformed the existing state-of-the-art on CoLA and achieved near-SOTA results on SST-2
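For similarity, there is no natural ordering of the two sentences, so the paper encodes both orderings and adds the two final representations elementwise before the output layer. A minimal sketch under those assumptions (the helper names are mine):

```python
def similarity_inputs(sent_a, sent_b):
    """Build both orderings of the sentence pair; each is encoded
    independently by the shared Transformer."""
    return (["<s>"] + sent_a + ["$"] + sent_b + ["<e>"],
            ["<s>"] + sent_b + ["$"] + sent_a + ["<e>"])

def combine(rep_ab, rep_ba):
    """Add the two final sequence representations elementwise; the sum
    is fed to the linear output layer."""
    return [x + y for x, y in zip(rep_ab, rep_ba)]

ab, ba = similarity_inputs(["hello"], ["hi"])
merged = combine([1.0, 2.0], [3.0, 4.0])  # toy representation vectors
```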

Analysis

Was pre-training actually meaningful?

Number of pre-trained layers transferred and task accuracy

  • The more pre-trained layers used in transfer learning, the higher the task accuracy
    • Knowledge acquired during pre-training is beneficial for downstream task performance

Model ablation study results

  • Without pre-training, performance consistently declined across tasks

Conclusion / Personal Thoughts

That wraps up my summary of the paper "Improving Language Understanding by Generative Pre-Training."

Below are my personal notes:

  • What are the authors trying to achieve?
    • Presenting an effective method for utilizing unlabeled text data
  • What are the key elements of their approach?
    • A two-stage structure: general-purpose pre-training on unlabeled text data, followed by task-specific fine-tuning on labeled training data
    • Adoption of the Transformer decoder architecture
  • What cited papers should I read next?
  • Other thoughts
    • I wonder whether this finding, that pre-training on unlabeled text data is effective, combined with the Scaling Laws published later, established the current paradigm of large-scale training on text data.
    • I'd been treating GPT/Generative Pre-Training as just a proper noun, but reading this paper finally helped me understand what it actually means.

