```mermaid
flowchart LR
    A[Raw unlabeled text data] --> B[[pre-training]]
    B --> C[foundation model]
    C --> E[[fine-tuning]]
    D[labeled data] --> E
    E --> F[fine-tuned LLM]
```
The information provided below is based on Sebastian Raschka's book “Build a Large Language Model (From Scratch)”.
Intro
LLM stands for Large Language Model: a (large) deep neural network trained on a vast amount of text data. At its core, an LLM uses the “transformer” architecture, which allows it to selectively pay attention to different parts of the input.
What can an LLM do for us?

* Text generation: an LLM can generate text based on a given prompt, which is useful for creative writing, content generation, and more.
* Controlled text generation: an LLM can generate text with specific attributes, such as style, tone, and sentiment, which is useful for personalized content, marketing, and more.
* Text completion: an LLM can complete a given text based on its context, which is useful for auto-completion, code generation, and more.
* Text classification: an LLM can classify text into different categories, which is useful for sentiment analysis, topic classification, and more.
* Text summarization: an LLM can summarize a given text, which is useful for news and document summarization.
* Text translation: an LLM can translate text from one language to another, which is useful for machine translation, localization, and more.
* Question answering: an LLM can answer questions based on a given text, which is useful for information retrieval, chatbots, and more.
* Sentiment analysis: an LLM can analyze the sentiment of a given text, which is useful for social media and customer feedback analysis.
In short, LLMs are useful whenever we need to parse or generate text.
Stages of building an LLM
There are several ready-to-use LLMs available, such as OpenAI’s GPT-3 or Google’s BERT. However, you may want to build a custom LLM for specific tasks. One example is BloombergGPT, an LLM trained on financial data. Why would you want to build your own LLM?
- Customization: You can tailor the model to your specific needs and requirements.
- Control: You have full control over the model architecture, training data, and hyperparameters.
- Privacy: You can keep your data private and secure, without relying on third-party providers.
- Cost: You can save costs by using your own hardware and resources, rather than paying for cloud-based services.
- Performance: You can optimize the model for your specific use case, which can lead to better performance and accuracy.
The general process of building an LLM consists of two stages:
1. Pre-training: This stage involves training the model on a large corpus of text data to learn the underlying patterns and relationships in the language. The model learns to predict the next word in a sentence given the previous words, which helps it understand the structure and semantics of the language.
2. Fine-tuning: This stage involves training the pre-trained model (a.k.a. the foundation model) on a smaller, task-specific dataset to adapt it to a specific task or domain. Fine-tuning allows the model to learn the nuances and specifics of the task, improving its performance and accuracy.
Raw unlabeled text data: This is the data used to pre-train the model. It can be any text data, such as books, articles, or web pages. For instance, consider the simple sentence “The cat sat on the mat.” The model learns to predict the next word given the previous words, e.g., “The cat sat on the” -> “mat”. So, in the pre-training stage, the model learns the structure and semantics of the language by predicting the next word in a sentence from the words that precede it.
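As a rough illustration (not the book's code), here is a minimal Python sketch of how next-word-prediction training pairs can be formed from raw text. The whitespace tokenization is a deliberate simplification; real LLMs use subword tokenizers such as BPE.

```python
# Minimal sketch: turning raw text into (context -> next word) training pairs.
# Whitespace splitting is a simplification; real LLMs use subword tokenizers (e.g., BPE).

text = "The cat sat on the mat."
tokens = text.split()  # ['The', 'cat', 'sat', 'on', 'the', 'mat.']

# Each prefix of the token sequence is paired with the token that follows it.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(" ".join(context), "->", target)
# e.g. "The cat sat on the" -> "mat."
```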
foundation model: This is the pre-trained model that has learned the underlying patterns and relationships in the language. It can perform text completion and has “few-shot” capabilities, meaning it can pick up a new task when shown only a few examples, for instance directly in the prompt, without additional training.
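As a toy illustration of few-shot behavior (not an example from the book), the snippet below builds a hypothetical prompt in which the task is demonstrated by a couple of made-up examples; a foundation model asked to complete this text would be expected to continue with the sentiment label.

```python
# Hypothetical few-shot prompt: the task is "taught" via examples inside the prompt,
# without updating the model's weights (in contrast to fine-tuning).
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: "I loved this movie, the acting was superb."
Sentiment: positive

Review: "A complete waste of two hours."
Sentiment: negative

Review: "The soundtrack alone made it worth watching."
Sentiment:"""

print(few_shot_prompt)
```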
labeled data: This is the data used to fine-tune the model. It can be any text data labeled for a specific task, such as sentiment analysis, text classification, or question answering. For instance, if we want to fine-tune the model for sentiment analysis, we can use a dataset of movie reviews labeled as positive or negative.
fine-tuned LLM: This is the final model that has been fine-tuned on the labeled data for a specific task. It can be used for various applications, such as sentiment analysis, text classification, or question answering.
Note on fine-tuning:
There are two popular approaches for fine-tuning LLMs: (i) instruction fine-tuning and (ii) classification fine-tuning. In instruction fine-tuning, the data consists of (instruction, answer) pairs. In classification fine-tuning, the data consists of (text, label) pairs.
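To make the two data formats concrete, here is a small illustrative sketch; the examples are made up for illustration and not taken from any real dataset.

```python
# Illustrative (hypothetical) examples of the two fine-tuning data formats.

# (i) Instruction fine-tuning: (instruction, answer) pairs.
instruction_data = [
    {"instruction": "Translate 'Good morning' into French.", "answer": "Bonjour."},
    {"instruction": "Summarize: 'The cat sat on the mat.'", "answer": "A cat sat on a mat."},
]

# (ii) Classification fine-tuning: (text, label) pairs.
classification_data = [
    {"text": "I loved this movie, the acting was superb.", "label": "positive"},
    {"text": "A complete waste of two hours.", "label": "negative"},
]
```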
transformer
At a very high level, the transformer can be viewed as follows:
```mermaid
flowchart LR
    A[Input text] --> B[Encoder]
    B --> C[encoded vectors]
    C --> D[Decoder]
    D --> E[Output text]
```
A key component is the self-attention mechanism (not shown in the diagram above). It allows the model to assign different weights to different words (or tokens) in the input, depending on their importance in the context of the sentence. Both GPT and BERT are based on the transformer architecture, but they use different parts of it: GPT builds on the decoder and is trained for next-word prediction (text completion), while BERT builds on the encoder and is trained for masked-word prediction. As a result, BERT tends to perform better on tasks such as text classification, while GPT tends to perform better on text generation.
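To make the idea concrete, here is a minimal, simplified sketch of scaled dot-product self-attention in PyTorch. It uses a single head, randomly initialized weights, and no causal masking, so it only illustrates how attention weights are computed, not a full transformer layer.

```python
import torch

torch.manual_seed(123)

# Toy input: a sequence of 6 tokens, each embedded in 4 dimensions.
x = torch.randn(6, 4)          # (seq_len, embed_dim)

# Projection matrices (randomly initialized here; learned during training in a real model).
d_out = 4
W_q = torch.randn(4, d_out)
W_k = torch.randn(4, d_out)
W_v = torch.randn(4, d_out)

queries = x @ W_q              # (6, d_out)
keys    = x @ W_k
values  = x @ W_v

# Scaled dot-product attention: each token weighs every other token.
scores  = queries @ keys.T / d_out ** 0.5   # (6, 6) attention scores
weights = torch.softmax(scores, dim=-1)     # each row sums to 1
context = weights @ values                  # (6, d_out) context vectors

print(weights)        # attention weights per token
print(context.shape)  # torch.Size([6, 4])
```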
Further reading:
* See Appendix B of the book “Build a Large Language Model (From Scratch)” by Sebastian Raschka.
* Read the original paper “Attention Is All You Need” by Vaswani et al. (2017).
large data set for pre-training
Many LLMs are pre-trained on very large datasets, and the associated cost can be very high. For instance, a large part of OpenAI’s GPT-3 training data is a filtered version of the publicly available Common Crawl web-page dataset, which alone amounts to roughly 570 GB of text, and the pre-training is estimated to have cost around $4.6 million. However, for learning purposes you can pre-train on much smaller datasets, and there are also many open-source pre-trained models (foundation models) available. For the sake of education, we will first pre-train a small LLM on a small dataset. Once that is done, we will switch to using an open-source pre-trained model (foundation model) and fine-tune it on a small dataset.
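To get a feel for the scale difference, here is a minimal sketch that inspects a small local text file used as a toy pre-training corpus. The filename `tiny_corpus.txt` is a placeholder assumption; substitute any plain-text file you have, such as a public-domain book.

```python
# Minimal sketch: inspect a small local text file used as a toy pre-training corpus.
# "tiny_corpus.txt" is a placeholder filename; use any plain-text file you have.
with open("tiny_corpus.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print(f"Characters: {len(raw_text):,}")
print(f"Whitespace-separated tokens (rough estimate): {len(raw_text.split()):,}")
```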
GPT architecture
TBC