How do LLMs Handle Out-of-vocabulary (OOV) Words?

LLMs handle out-of-vocabulary (OOV) words or tokens by leveraging their tokenization process, which ensures that even unfamiliar or rare inputs are represented in a way the model can understand. Here’s how it works:

1. Tokenization Techniques:

  • Subword Tokenization (e.g., Byte Pair Encoding (BPE), WordPiece, Unigram):
    • Models such as GPT and BERT use subword tokenization methods that break words down into smaller units (subwords). For example:
    • The word “unfamiliar” might be split into ["un", "familiar"].
    • A rare word like “autodidactism” might be split into ["auto", "did", "act", "ism"].
    • This ensures that even if a word isn’t in the model’s vocabulary, its components are, allowing the model to process it effectively.
  • Byte-Level Tokenization:
    • Considers each byte (0-255) as a potential token.
    • Models tokenize inputs at the byte level, allowing them to handle any text input, including misspellings, rare words, or text in various languages.
    • For instance, “🤖AI🌍” would be tokenized into individual byte-level tokens representing the emojis and characters (see the tokenizer sketch after this list).
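
A minimal sketch of both approaches, using the Hugging Face transformers library. The model names here are just convenient stand-ins, and the behavior noted in the comments is illustrative; exact splits depend on the tokenizer and its vocabulary.

```python
from transformers import AutoTokenizer

# WordPiece (subword) tokenizer, as used by BERT
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")
# Byte-level BPE tokenizer, as used by GPT-2
byte_bpe = AutoTokenizer.from_pretrained("gpt2")

for word in ["unfamiliar", "autodidactism", "🤖AI🌍"]:
    # WordPiece marks word-internal pieces with a "##" prefix;
    # characters outside its vocabulary may fall back to [UNK]
    print(word, "->", wordpiece.tokenize(word))
    # Byte-level BPE covers all 256 byte values, so any input,
    # including emojis, decomposes into byte-level pieces
    print(word, "->", byte_bpe.tokenize(word))
```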

2. Embedding and Context Understanding:

  • Each token (subword or byte) is mapped to an embedding in a high-dimensional space.
  • Even if the model encounters a novel word or token combination, the embeddings of its components allow it to infer meaning based on contextual patterns learned during training.
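
As a toy illustration (not how any particular LLM is implemented), an embedding table simply maps token ids to learned vectors. The vocabulary size, embedding dimension, and token ids below are made-up values.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; real LLMs use much larger, trained embedding tables
vocab_size, embed_dim = 50_000, 768
embedding = nn.Embedding(vocab_size, embed_dim)

# Hypothetical ids for the subword pieces of a rare word,
# e.g. ["auto", "did", "act", "ism"]
token_ids = torch.tensor([[1012, 734, 8842, 301]])
vectors = embedding(token_ids)   # shape: (1, 4, 768) — one vector per piece
print(vectors.shape)
```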

3. Handling Entirely Novel Inputs:

  • Misspelled or Noisy Inputs:
    • These are broken into tokens the model already knows, which may still capture some of the intended meaning. For example, the misspelling “autdoidactism” could be tokenized into pieces that overlap with those of “autodidactism” (see the sketch after this list).
  • Code or Specialized Notation:
    • LLMs trained on diverse datasets (e.g., code snippets, technical writing) can handle uncommon tokens by identifying familiar patterns or structures.
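
A small sketch of the misspelling case, again using the GPT-2 tokenizer as a stand-in; the exact pieces are tokenizer-dependent, so the overlap is illustrative rather than guaranteed.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

correct = tok.tokenize("autodidactism")
misspelled = tok.tokenize("autdoidactism")

print("correct:   ", correct)
print("misspelled:", misspelled)
# Pieces shared between the two spellings hint at how some semantic
# signal can survive the typo
print("shared pieces:", set(correct) & set(misspelled))
```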

4. Limitations:

  • Loss of Semantic Specificity:
    • If a rare or completely novel word is tokenized into many smaller units, its specific meaning might not be fully captured.
  • Tokenization Overhead:
    • Rare or OOV words may be split into several tokens, increasing sequence length and computational cost (see the sketch after this list).
  • Dependence on Training Data:
    • If the model hasn’t encountered similar patterns in training, its ability to infer meaning may be limited.
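
The overhead point can be checked directly by counting tokens. This sketch uses the GPT-2 tokenizer as one example; the counts will differ for other tokenizers.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# A common word usually maps to a single token, while a rarer word
# tends to expand into several subword pieces
for word in ["great", "unfamiliar", "autodidactism"]:
    ids = tok.encode(word)
    print(f"{word!r}: {len(ids)} token(s)")
```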

Example:

  • Input: “ChatGPT is great for autodidactism!”
  • Illustrative tokenization (BPE): ["Chat", "GPT", "is", "great", "for", "auto", "did", "act", "ism"] (exact pieces, including how spaces and punctuation are encoded, depend on the tokenizer’s vocabulary)
  • The model then uses each token’s embedding, together with the surrounding context, to process the sentence.
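
To see an actual split rather than the illustrative one above, the example sentence can be run through tiktoken, OpenAI's open-source BPE tokenizer. The choice of encoding below is illustrative, and real pieces include leading spaces and punctuation.

```python
import tiktoken

# cl100k_base is the byte-level BPE encoding used by several OpenAI chat models
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("ChatGPT is great for autodidactism!")
pieces = [enc.decode([i]) for i in ids]
print(pieces)   # the rare word appears as several subword pieces
```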

By tokenizing inputs into manageable units and leveraging context-aware embeddings, LLMs can handle OOV words effectively, even if not perfectly.
