How to Train LLMs to “Think” (o1 & DeepSeek-R1)

Advanced reasoning models explained

In September 2024, OpenAI released its o1 model, trained on large-scale reinforcement learning, giving it “advanced reasoning” capabilities. Unfortunately, the details of how they pulled this off were never shared publicly. Today, however, DeepSeek (an AI research lab) has replicated this reasoning behavior and published the full technical details of their approach. In this article, I will discuss the key ideas behind this innovation and describe how they work under the hood.

OpenAI’s o1 model marked a new paradigm for training large language models (LLMs). It introduced so-called “thinking” tokens, which enable a sort of scratch pad that the model can use to think through problems and user queries.

The major insight from o1 was that performance improved with increased test-time compute. This is just a fancy way of saying that the more tokens a model generates, the better its response. The figure below, reproduced from OpenAI’s blog, captures this point nicely.

AIME accuracy scaling with train-time and test-time compute, respectively. Plots reillustrated from [1].

In the plots above, the y-axes are model performance on AIME (math problems), while the x-axes are various compute times. The left plot depicts the well-known neural scaling laws that kicked off the LLM rush of 2023. In other words, the longer a model is trained (i.e. train-time compute), the better its performance.

On the right, however, we see a new type of scaling law. Here, the more tokens a model generates (i.e. test-time compute), the better its performance.

“Thinking” tokens

A key feature of o1 is its so-called “thinking” tokens. These are special tokens introduced during post-training, which delimit the model’s chain of thought (CoT) reasoning (i.e., thinking through the problem). These special tokens are important for two reasons.

One, they clearly demarcate where the model’s “thinking” starts and stops so it can be easily parsed when spinning up a UI. And two, they produce a human-interpretable readout of how the model “thinks” through the problem.
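
To make the first point concrete, here is a minimal sketch of how an application might parse out the “thinking” portion of a response. It assumes the reasoning is wrapped in <think> tags (matching the prompt template shown later in this article); the function name and example strings are illustrative.

import re

def split_thinking(response: str) -> tuple[str, str]:
    # separate the chain of thought (inside <think> tags) from the final answer
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    thinking = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return thinking, answer

# example usage
raw = "<think>2 + 2 = 4 because ...</think> The answer is 4."
thoughts, final_answer = split_thinking(raw)
print(thoughts)      # >> 2 + 2 = 4 because ...
print(final_answer)  # >> The answer is 4.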

Although OpenAI disclosed that they used reinforcement learning to produce this ability, the exact details of how they did it were not shared. Today, however, we have a pretty good idea thanks to a recent publication from DeepSeek.

DeepSeek’s paper

In January 2025, DeepSeek published “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning” [2]. While this paper caused its fair share of pandemonium, its central contribution was unveiling the secrets behind o1.

It introduces two models: DeepSeek-R1-Zero and DeepSeek-R1. The former was trained exclusively on reinforcement learning (RL), while the latter was trained with a mixture of supervised fine-tuning (SFT) and RL.

Although the headlines (and title of the paper) were about DeepSeek-R1, the former model is important because, one, it generated training data for R1, and two, it demonstrates striking emergent reasoning abilities that were not taught to the model.

In other words, R1-Zero discovers CoT and test-time compute scaling through RL alone! Let’s discuss how it works.

DeepSeek-R1-Zero (RL only)

Reinforcement learning (RL) is a machine learning approach in which, rather than training models on explicit examples, models learn through trial and error [3]. It works by passing the model a reward signal that has no explicit functional relationship with the model’s parameters.

This is similar to how we often learn in the real world. For example, if I apply for a job and don’t get a response, I have to figure out what I did wrong and how to improve. This is in contrast to supervised learning, which, in this analogy, would be like the recruiter giving me specific feedback on what I did wrong and how to improve.

While using RL to train R1-Zero consists of many technical details, I want to highlight 3 key ones: the prompt template, reward signal, and GRPO (Group Relative Policy Optimization).

1) Prompt template

The template used for training is given below, where {prompt} is replaced with a question from a dataset of (presumably) complex math, coding, and logic problems. Notice the inclusion of <answer> and <think> tags via simple prompting.

A conversation between User and Assistant. The user asks a question, and the 
Assistant solves it. The assistant first thinks about the reasoning process in 
the mind and then provides the user with the answer. The reasoning process and 
answer are enclosed within <think> </think> and <answer> </answer> tags, 
respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: {prompt}. Assistant:

Something that stands out here is the minimal and relaxed prompting strategy. This was an intentional choice by DeepSeek to avoid biasing model responses and to observe its natural evolution during RL.

2) Reward signal

The RL reward has two components: accuracy and format rewards. Since the training dataset consists of questions with clear right answers, a simple rule-based strategy is used to evaluate response accuracy. Similarly, a rule-based formatting reward is used to ensure reasoning tokens are generated in between the thinking tags.

It’s noted by the authors that a neural reward model isn’t used (i.e. rewards are not computed by a neural net), because these may be prone to reward hacking. In other words, the LLM learns how to trick the reward model into maximizing rewards while decreasing downstream performance.

This is just like how humans find ways to exploit any incentive structure to maximize their personal gains while forsaking the original intent of the incentives. This highlights the difficulty of producing good rewards (whether for humans or computers).
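
To make this concrete, a rule-based reward along these lines might look like the sketch below. This is a simplified illustration of the idea, not DeepSeek’s actual implementation; the exact rules and the equal weighting of the two components are assumptions.

import re

def format_reward(response: str) -> float:
    # reward responses that wrap reasoning in <think> tags and give an <answer>
    has_think = bool(re.search(r"<think>.+?</think>", response, flags=re.DOTALL))
    has_answer = bool(re.search(r"<answer>.+?</answer>", response, flags=re.DOTALL))
    return 1.0 if (has_think and has_answer) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    # reward responses whose final answer matches the known correct answer
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # simple sum of the two components (the relative weighting is an assumption)
    return accuracy_reward(response, ground_truth) + format_reward(response)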

3) GRPO (Group Relative Policy Optimization)

The final detail is how rewards are translated into model parameter updates. This section is quite technical, so the enlightened reader can feel free to skip ahead.

GRPO is an RL approach that combines a collection of responses to update model parameters. To encourage stable training, the authors also incorporate clipping and KL-divergence regularization terms into the loss function. Clipping ensures optimization steps are not too big, and regularization ensures the model predictions do not change too abruptly.

Here is the complete loss function with some (hopefully) helpful annotations.

Annotated GRPO loss function. Image by author.
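
For readers who prefer equations to annotations, the GRPO objective has roughly the following form, where a group of G responses is sampled per prompt and their rewards are normalized into advantages. This is a simplified rendering based on the descriptions in [2]; the notation may differ slightly from the paper.

\[
\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}
\]

\[
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}\left[
\frac{1}{G}\sum_{i=1}^{G}
\min\left(
\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\,\hat{A}_i,\;
\mathrm{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_i
\right)
- \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
\right]
\]

The clipping term corresponds to the “optimization steps are not too big” constraint above, and the KL term is the regularization against a reference policy.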

Results (emergent abilities)

The most striking result of R1-Zero is that, despite its minimal guidance, it develops effective reasoning strategies that we might recognize.

For example, it learns implicitly through RL to improve responses through test-time compute (recall the earlier insight from o1). This is depicted via the plot below from the R1 paper [2].

Average response length increases with additional training (and improved performance). Screenshot from DeepSeek R1 paper [2].

Another notable ability is reflecting on past responses with an uncanny human-like internal monologue. Here is an example from R1-Zero. Notice the long CoT and additional verification step before generating the final answer (I omitted some parts because the response was very long).

User prompt: If it takes me 1.5 hours to write a codebase and my intern 8 
hours. How long will it take us both to write 3 codebases?

<think> To solve for how long it will take both you and your intern to write 
three codebases together, we need to first find out how many codebases each of 
you can write per hour individually.

(long CoT) ...

So, it will take approximately 3.7895 hours for you and your intern to write 
three codebases together.

To make sure everything makes sense, let us round the answer to a reasonable 
number of decimal places (e.g., round to two decimal places):

t ≈ 3.79 hours

However, if you need a more "human-friendly" answer, you could convert the 
decimal part into minutes. Since 0.7895 of an hour is:

0.7895 × 60 minutes ≈ 47.37 minutes

So, 3.7895 hours is approximately 3 hours and 47.37 minutes (or roughly 3 
hours and 47 minutes if rounding to the nearest minute).

## Verification (optional step for verification)
To verify, let us check how many codebases you can individually write in t = 
72/19 hours:

Your work in 72/19 hours
= (2/3 codebases per hour) × (72/19 hours)
= (2 × 72) / (3 × 19)
= 144 / 57
= 48/19 codebases.

(verification process) ...

However, a more "exact" answer should be kept in fractional form unless 
specified otherwise. Thus, the answer is: 72/19hours which is approximately
$3.7895$ hours or about $3.79$ hours when rounded to two decimal places).
</think>

72/19 hours (or approximately 3.79 hours).

Problems with R1-Zero

Although the thinking tokens from R1-Zero give a human-readable window into the model’s “thought process,” the authors report some issues. Namely, the learned CoT sometimes suffers from readability issues and language mixing, suggesting (perhaps) that its reasoning begins to veer away from something easily interpretable by humans.

DeepSeek-R1 (SFT + RL)

To mitigate R1-Zero’s interpretability issues, the authors explore a multi-step training strategy that utilizes both supervised fine-tuning (SFT) and RL. This strategy results in DeepSeek-R1, a better-performing model that is getting more attention today. The entire training process can be broken down into 4 steps.

Step 1: SFT with reasoning data

To help get the model on the right track when it comes to learning how to reason, the authors start with SFT. This leverages 1000s of long CoT examples from various sources, including few-shot prompting (i.e., showing examples of how to think through problems), directly prompting the model to use reflection and verification, and refining synthetic data from R1-Zero [2].

The two key advantages of this are, one, the desired response format can be explicitly shown to the model, and two, seeing curated reasoning examples unlocks better performance for the final model.

Step 2: R1-Zero style RL (+ language consistency reward)

Next, an RL training step is applied to the model after SFT. This is done in the same way as for R1-Zero, with an added component to the reward signal that incentivizes language consistency. This was added to the reward because R1-Zero tended to mix languages, making it difficult to read its generations.
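
The paper does not spell out every implementation detail of this reward, but conceptually it can be thought of as the proportion of the chain of thought written in the target language. A rough sketch is shown below; the ASCII-based language check is purely an illustrative assumption, not DeepSeek’s actual method.

def language_consistency_reward(cot_text: str) -> float:
    # crude proxy: fraction of whitespace-separated tokens that look like the
    # target language (English here, detected via an ASCII-only check)
    tokens = cot_text.split()
    if not tokens:
        return 0.0
    target_like = sum(1 for tok in tokens if all(ord(ch) < 128 for ch in tok))
    return target_like / len(tokens)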

Step 3: SFT with mixed data

At this point, the model likely has on par (or better) performance than R1-Zero on reasoning tasks. However, this intermediate model wouldn’t be very practical because it wants to reason about any input it receives (e.g., “hi there”), which is unnecessary for factual Q&A, translation, and creative writing. That’s why another SFT round is performed with both reasoning (600k examples) and non-reasoning (200k examples) data.

The reasoning data here is generated from the resulting model from Step 2. Additionally, examples are included which use an LLM judge to compare model predictions to ground truth answers.

The non-reasoning data comes from two places. First, the SFT dataset used to train DeepSeek-V3 (the base model). Second, synthetic data generated by DeepSeek-V3. Note that examples are included that do not use CoT so that the model doesn’t use thinking tokens for every response.

Step 4: RL + RLHF

Finally, another RL round is done, which includes (again) R1-Zero style reasoning training and RL from human feedback. This latter component helps improve the model’s helpfulness and harmlessness.

The result of this entire pipeline is DeepSeek-R1, which excels at reasoning tasks and is an AI assistant you can chat with normally.

Accessing R1-Zero and R1

Another key contribution from DeepSeek is that the weights of the two models described above (and many other distilled versions of R1) were made publicly available. This means there are many ways to access these models, whether using an inference provider or running them locally.

Here are a few places that I’ve seen these models.

  • DeepSeek (DeepSeek-V3 and DeepSeek-R1)
  • Together (DeepSeek-V3, DeepSeek-R1, and distillations)
  • Hyperbolic (DeepSeek-V3, DeepSeek-R1-Zero, and DeepSeek-R1)
  • Ollama (local) (DeepSeek-V3, DeepSeek-R1, and distillations)
  • Hugging Face (local) (all of the above)

Conclusions

The release of o1 introduced a new dimension by which LLMs can be improved: test-time compute. Although OpenAI did not release its secret sauce for doing this, 5 months later, DeepSeek was able to replicate this reasoning behavior and publish the technical details of its approach.

While current reasoning models have limitations, this is a promising research direction because it has demonstrated that reinforcement learning (without humans) can produce models that learn independently. This (potentially) breaks the implicit limitations of current models, which can only recall and remix information previously seen on the internet (i.e., existing human knowledge).

The promise of this new RL approach is that models can surpass human understanding (on their own), leading to new scientific and technological breakthroughs that might take us decades to discover (on our own).

🗞 Get exclusive access to AI resources and project ideas: https://the-data-entrepreneurs.kit.com/shaw

🧑‍🎓 Learn AI in 6 weeks by building it: https://maven.com/shaw-talebi/ai-builders-bootcamp

References

[1] Learning to reason with LLMs

[2] DeepSeek-AI, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv:2501.12948 [cs.CL]

[3] Deep Dive into LLMs Like ChatGPT

Fine-tuning Multimodal Embedding Models

Adapting CLIP to YouTube Data (with Python Code)

This is the 4th article in a larger series on multimodal AI. In the previous post, we discussed multimodal RAG systems, which can retrieve and synthesize information from different data modalities (e.g. text, images, audio). There, we saw how we could implement such a system using CLIP. One issue with this approach, however, is that vector search results from a general-purpose embedding model (like CLIP) may perform poorly in domain-specific use cases. In this article, I’ll discuss how we can mitigate these issues via fine-tuning multimodal embedding models.

Photo by Markus Winkler on Unsplash

Multimodal embeddings represent multiple data modalities in the same vector space such that similar concepts are co-located. A visual example of this is shown below, where semantically similar items (e.g. a picture of a dog and its corresponding caption) are close, while dissimilar items (e.g. a picture of a cat and a caption describing a dog) are far apart.

Stock photos from Canva. Image by author.

A popular multimodal embedding model is CLIP, which was trained on a massive corpus of image-caption pairs using contrastive learning. The key insight from CLIP was that such a model unlocks 0-shot abilities such as image classification, search, and captioning [1].

One limitation here is that CLIP’s 0-shot abilities may not transfer well to domains involving specialized information e.g. architectural drawings, medical imaging, and technical jargon. In such cases, we can improve CLIP’s performance through fine-tuning.
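
To make the 0-shot idea concrete, here is a minimal sketch of zero-shot image classification with CLIP using the Hugging Face Transformers library. The candidate labels and image path are placeholders.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# load pre-trained CLIP
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# candidate captions and an image to classify (placeholders)
labels = ["a photo of a dog", "a photo of a cat"]
image = Image.open("example.jpg")

# embed both modalities and score each caption against the image
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)

print(dict(zip(labels, probs[0].tolist())))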

Fine-tuning CLIP

Fine-tuning involves adapting a model to a particular use case through additional training. This is powerful because it enables us to build on top of existing state-of-the-art models to develop powerful specialized models with relatively small data.

We can do this with CLIP through the following key steps.

  1. Collect text-image training pairs
  2. Pre-process training data
  3. Define Evals
  4. Fine-tune the model
  5. Evaluate the model

I will discuss each of these steps in the context of a concrete example. If you are curious about what this looks like for text embedding (i.e. text-text pairs), I did that in a previous blog post.

Fine-Tuning Text Embeddings For Domain-Specific Search

Example: Fine-tuning CLIP on YouTube Titles and Thumbnails

Here, I will fine-tune CLIP on titles and thumbnails from my YouTube channel. At the end of this, we will have a model that can take title-thumbnail pairs and return a similarity score. This can be used for practical applications such as matching title ideas to an existing thumbnail or performing search over a thumbnail library.

The example code is freely available on GitHub, and the dataset and fine-tuned model are on the Hugging Face Hub. You can use this code and data to train your own models. If you end up publishing any work using this dataset, please cite the original source 🙂

GitHub Repo | Dataset | Fine-tuned Model


Step 1: Collect Text-Image Training Pairs

The first (and most important) step of any fine-tuning process is data collection. Here, I extracted title-thumbnail pairs from my channel in a 2-step process.

First, I used YouTube’s search API to extract the video IDs for all the videos on my channel. Second, I used YouTube’s video API to extract the title and thumbnail URL of each of my long-form videos (i.e. longer than 3 min).

# imports
from top_secret import my_key
import requests
from isodate import parse_duration

import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from datasets import DatasetDict, Dataset
channel_id = 'UCa9gErQ9AE5jT2DZLjXBIdA' # my YouTube channel ID
page_token = None # initialize page token
url = 'https://www.googleapis.com/youtube/v3/search' # YouTube search API 

# extract video data across multiple search result pages
video_id_list = []

while page_token != 0:
    params = {
        "key": my_key, 
        'channelId': channel_id, 
        'part': ["snippet","id"], 
        'order': "date", 
        'maxResults':50, 
        'pageToken': page_token
    }
    response = requests.get(url, params=params)

    for raw_item in dict(response.json())['items']:

        # only execute for youtube videos
        if raw_item['id']['kind'] != "youtube#video":
            continue

        # grab video ids
        video_id_list.append(raw_item['id']['videoId'])

    try:
        # grab next page token
        page_token = dict(response.json())['nextPageToken']
    except:
        # if no next page token kill while loop
        page_token = 0

Note that you will need a YouTube API key to run the above Python code, which you can create using the Google Cloud Console. To adapt this to your channel, you just need to change the channel_id variable.

# extract video titles and thumbnails
url = "https://www.googleapis.com/youtube/v3/videos"
video_data_list = []

for video_id in video_id_list:

    params = {
        "part": ["snippet","contentDetails"],
        "id": video_id,  
        "key": my_key,  
    }
    response = requests.get(url, params=params)

    raw_dict = dict(response.json())['items'][0]

    # only process videos longer than 3 minutes
    iso_duration = raw_dict['contentDetails']["duration"]
    if parse_duration(iso_duration).total_seconds() < 180:
        continue

    # extract video data
    video_data = {}
    video_data['video_id'] = video_id
    video_data['title'] = raw_dict['snippet']['title']
    video_data['thumbnail_url'] = raw_dict['snippet']['thumbnails']['high']['url']

    # append data to list
    video_data_list.append(video_data)

As an additional step, I created negative thumbnail-title pairs. We can use these during the training process to not only guide the model with examples of which embeddings should be close together (i.e. positive pairs), but also which embeddings should be far apart (i.e. negative pairs).

To do this, I computed the similarity between all possible title pairs using the sentence transformer library. Then for each positive pair, I matched the least similar title as a negative example (ensuring there were no duplicates).

# store data in dataframe
df = pd.DataFrame(video_data_list)

# Load the model
model = SentenceTransformer("all-mpnet-base-v2")

# Encode all titles
embeddings = model.encode(df['title'].to_list())

# compute similarities
similarities = model.similarity(embeddings, embeddings)

# match the least similar title to each positive pair as the negative match
similarities_argsorted = np.argsort(similarities.numpy(), axis=1)
negative_pair_index_list = []

for i in range(len(similarities)):

    # Start with the smallest similarity index for the current row
    j = 0
    index = int(similarities_argsorted[i][j])

    # Ensure the index is unique
    while index in negative_pair_index_list:
        j += 1  # Move to the next smallest index
        index = int(similarities_argsorted[i][j])  # Fetch next smallest index

    negative_pair_index_list.append(index)

# add negative pairs to df
df['title_neg'] = df['title'].iloc[negative_pair_index_list].values

Finally, I created a train-valid-test split and pushed the dataset to the Hugging Face Hub.

# Shuffle the dataset
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Split into train, validation, and test sets
train_frac = 0.7
valid_frac = 0.15
test_frac = 0.15

# define train and validation size
train_size = int(train_frac * len(df))
valid_size = int(valid_frac * len(df))

# create train, validation, and test datasets
df_train = df[:train_size]
df_valid = df[train_size:train_size + valid_size]
df_test = df[train_size + valid_size:]

# Convert the pandas DataFrames back to Hugging Face Datasets
train_ds = Dataset.from_pandas(df_train)
valid_ds = Dataset.from_pandas(df_valid)
test_ds = Dataset.from_pandas(df_test)

# Combine into a DatasetDict
dataset_dict = DatasetDict({
    'train': train_ds,
    'valid': valid_ds,
    'test': test_ds
})
# push data to hub
dataset_dict.push_to_hub("shawhin/yt-title-thumbnail-pairs")

Step 2: Pre-process Training Pairs

Although we have all the data we need for fine-tuning, it is still not a suitable format for training. More specifically, we need to convert our image URLs to PIL image objects and organize our data into (anchor, positive, negative) triplets, i.e., a thumbnail, its corresponding title, and a negative title, respectively.

We can process all three data splits (i.e. train, valid, and test) in the following way using the Hugging Face Datasets library.

import requests

from PIL import Image
from datasets import load_dataset

# load dataset
dataset = load_dataset("shawhin/yt-title-thumbnail-pairs")

# define preprocessing function
def preprocess(batch):
    """
        Preprocessing data without augmentations for test set
    """
    # get images from urls
    image_list = [Image.open(requests.get(url, stream=True).raw) 
                      for url in batch["thumbnail_url"]]

    # return columns with standard names
    return {
        "anchor": image_list,       
        "positive": batch["title"],  
        "negative": batch["title_neg"]
    }

# remove columns not relevant to training
columns_to_remove = [col for col in dataset['train'].column_names 
                        if col not in ['anchor', 'positive', 'negative']]
# apply transformations
dataset = dataset.map(preprocess, batched=True, 
                         remove_columns=columns_to_remove)

It’s important that we order our columns as (anchor, positive, negative) triplets because this is the format expected by the loss function we will use during training (which I learned the hard way).

Step 3: Define Evals

Training involves optimizing a model’s parameters to minimize a loss function. However, this value (i.e. a contrastive loss) is rarely helpful in assessing the model’s performance on a downstream task (e.g. matching titles to thumbnails).

A quantity that is more insightful, in this case, is the model’s ability to correctly match a given thumbnail to the correct title among several candidates. This is denoted Recall@1.

We can implement an evaluator compatible with the Sentence Transformers library to compute this metric. Since the code is quite long, I won’t paste it here, but the curious reader can find it in Cell 12 of this notebook.
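
For intuition, a stripped-down version of the Recall@k computation might look like the sketch below. The full evaluator in the notebook integrates with the Sentence Transformers API and handles batching; this only shows the core idea and assumes row i of the image embeddings corresponds to row i of the text embeddings.

import numpy as np

def recall_at_k(image_embeds: np.ndarray, text_embeds: np.ndarray, k: int = 1) -> float:
    # fraction of thumbnails whose true title lands in the top-k most similar titles
    sims = image_embeds @ text_embeds.T       # pairwise similarity matrix
    ranked = np.argsort(-sims, axis=1)        # titles sorted by similarity, per thumbnail
    hits = [i in ranked[i, :k] for i in range(len(ranked))]
    return float(np.mean(hits))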

# function to create new evaluator given data split
def create_recall_evaluator(set_name, k=1):
    """
        Create triplet evaluator for "train", "valid", or "test" split
    """

    return ImageTextRetrievalEvaluator(
        images=dataset[f"{set_name}"]["anchor"],
        texts=dataset[f"{set_name}"]["positive"],
        name=f"yt-title-thumbnail-{set_name}",
        k=k
    )

# Create new evaluator with Recall@k
evaluator_recall_train = create_recall_evaluator("train", k=1)
evaluator_recall_valid = create_recall_evaluator("valid", k=1)

print("Train:", evaluator_recall_train(model))
print("Valid:", evaluator_recall_valid(model))

# >> Train: {'yt-title-thumbnail-train_Recall@1': 0.660377358490566}
# >> Valid: {'yt-title-thumbnail-valid_Recall@1': 0.6363636363636364}

We can see the model already has decent performance out-of-the-box, with correct titles being matched 66% of the time.

Step 4: Fine-tune the Model

There are 3 key things we must do before training the model. Namely, choose which parameters to train, pick a loss function, and set hyperparameters.

Trainable Parameters

The key limitation of this project is that I’ve only posted 76 YouTube videos (as of writing this). With the validation and test splits, this leaves only 53 examples for training.

Since we have so few training examples, limiting the number of parameters we train is a good idea. In this case, I only train the final projection layer of the model, which maps the text and image embeddings into a shared vector space. This is about 1M parameters total.

# import model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/clip-ViT-L-14")

# pick specific layers to train (note: you can add more layers to this list)
trainable_layers_list = ['projection']

# Apply freezing configuration
for name, param in model.named_parameters():

    # freeze all params
    param.requires_grad = False

    # unfreeze layers in trainable_layers_list
    if any(layer in name for layer in trainable_layers_list):
        param.requires_grad = True
# Count total and trainable parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"% of trainable parameters: {100*trainable_params/total_params:.2f}%")

# >> Total parameters: 427,616,513
# >> Trainable parameters: 1,376,256
# >> % of trainable parameters: 0.32%

Loss function

Here, I use the Multiple Negatives Ranking Loss from the Sentence Transformers library (which works with single negatives like in this case). It works by maximizing the similarity between positive pairs while minimizing the similarity between negative pairs. Here’s what the loss function looks like for the single negative case [2].

Multiple negatives loss function (with only 1 negative). Image by author.
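
Written out, the single-negative case of this loss is roughly the following, where s(·,·) is the similarity between embeddings, a is the anchor (thumbnail), p the positive title, and n the negative title. This is a simplified rendering of the loss in [2] (the library also applies a scale/temperature factor to the similarities).

\[
\mathcal{L} = -\log \frac{e^{\,s(a,\,p)}}{e^{\,s(a,\,p)} + e^{\,s(a,\,n)}}
\]
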
from sentence_transformers.losses import MultipleNegativesRankingLoss

# define loss
loss = MultipleNegativesRankingLoss(model)

Hyperparameters

For hyperparameters, I experimented with a handful of choices manually and picked the choice with the best validation loss and Recall@1 performance. Here are the final choices.

from sentence_transformers import SentenceTransformerTrainingArguments

# hyperparameters
num_epochs = 2
batch_size = 16
lr = 1e-4
finetuned_model_name = "clip-title-thumbnail-embeddings"

train_args = SentenceTransformerTrainingArguments(
    output_dir=f"models/{finetuned_model_name}",
    num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=lr,
    # Evaluation settings
    eval_strategy="epoch",
    eval_steps=1,
    logging_steps=1,
)

With our loss and hyperparameters defined, we can train the model using the SentenceTransformerTrainer().

from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model,
    args=train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["valid"],
    loss=loss,
    evaluator=[evaluator_recall_train, evaluator_recall_valid],
)
trainer.train()

Model training is an iterative process where you may explore dozens of models for different choices of trainable parameters, loss functions, and hyperparameters.

However, I highly recommend keeping these experiments as simple as possible. If you find yourself spending too much time tweaking training args to get your model to converge, there’s probably something fundamentally wrong with your data (speaking from experience 😅 ).

Step 5: Evaluate the Model

As a final step, we can evaluate the model’s Recall@1 score on the testing set. These data were not used for training or hyperparameter tuning, so it gives us an unbiased assessment of the model.

evaluator_recall_test = create_recall_evaluator("test")

print("Train:", evaluator_recall_train(model))
print("Valid:", evaluator_recall_valid(model))
print("Test:", evaluator_recall_test(model))

# >> Train: {'yt-title-thumbnail-train_Recall@1': 0.8490566037735849}
# >> Valid: {'yt-title-thumbnail-valid_Recall@1': 0.9090909090909091}
# >> Test: {'yt-title-thumbnail-test_Recall@1': 0.75}

We see that the model performs well across all three datasets with 75% Recall@1 on the test set. In other words, 75% of the time, the model correctly matches a given thumbnail to its original title. Additionally, the recall for the validation dataset increases by 27%!

What’s Next?

Multimodal embedding models, like CLIP, unlock countless 0-shot use cases such as image classification and retrieval. Here, we saw how we can fine-tune such a model to adapt it to a specialized domain (i.e. my YouTube titles and thumbnails).

Although CLIP is a small model by today’s standards (~500M parameters) and our training dataset was tiny, the final model still demonstrated strong performance on this task. This highlights the power of fine-tuning.

If you have any questions or suggestions for future content, let me know in the comments 🙂

More on Multimodal AI 👇

Multimodal AI


🗞 Get exclusive access to AI resources and project ideas: https://the-data-entrepreneurs.kit.com/shaw

🧑‍🎓 Learn AI in 6 weeks by building it: https://maven.com/shaw-talebi/ai-builders-bootcamp?promoCode=AI25

References

[1] arXiv:2103.00020 [cs.CV]

[2] arXiv:1705.00652 [cs.CL]

How to Build a Resume Optimizer with AI

Step-by-step guide with example Python code

In a previous blog post, I shared 5 AI Projects You Can Build This Weekend, where the first project idea was a resume optimization tool. Since then, many people have asked for more guidance on implementing this project. In this article, I’ll walk through an example implementation using Python and OpenAI’s API.

Image from Canva.

Adapting your resume to different job descriptions is an effective yet tedious part of finding a job. Even if you’ve tailored your resume for a specific role, companies may have varying expectations for similar job titles.

For instance, below are two comparable data scientist roles at Netflix and Google with slightly different job descriptions. Lucky for us, with today’s AI tools, we can build an app to streamline this process.

Comparison of similar data scientist roles. Image by author.

Can’t I Use ChatGPT?

Hearing this idea, you might think, can’t I just use a no-code AI tool like ChatGPT or Claude for this?

The answer is yes! In fact, I’d recommend starting with no-code solutions anytime you want to build a project like this. However, there are two limitations to consider.

First, working with long prompts in a chat UI can get cumbersome, especially when you have several jobs you want to apply to. Second, if you want to scale this process up from 5 resumes to 50, using ChatGPT (or the like) may become impractical.

The Project Workflow

At a high level, this project aims to take a resume and job description (JD) and return an optimized version of the resume based on the JD. While one could implement such a tool in many ways, here I’ll use the following 4-step workflow.

  1. Import the Resume and JD: Load your resume and define the job description you want to target.
  2. Construct a Prompt: Create a prompt to guide the AI in rewriting your resume.
  3. Make API Call: Use OpenAI’s API to prompt GPT-4o-mini to rewrite your resume.
  4. Display and Save the Results: Save the new resume as a PDF.

4-step workflow. Image by author.

Example Code: Automated Resume Optimizer

With a basic understanding of what we are trying to build and why, let’s see how to implement this project in Python. The example code is freely available on GitHub.

AI-Builders-Bootcamp-2/lightning-lesson at main · ShawhinT/AI-Builders-Bootcamp-2


Imports

We start by importing a few handy Python libraries. The key ones are openai for accessing GPT-4o-mini, as well as markdown and weasyprint to create a PDF version of the final resume. Note: An OpenAI API key is required for this project, which I imported from a separate Python script.

from IPython.display import display, Markdown
from openai import OpenAI
from top_secret import my_sk

from markdown import markdown
from weasyprint import HTML

Step 1: Input Resume & JD

Next, we will load our input resume into Python as a string and use Python’s input() function so we can paste in any job description when we run the script.

# open and read the markdown file
with open("resumes/resume.md", "r", encoding="utf-8") as file:
    resume_string = file.read()

# input job description
jd_string = input()

A detail here is that the resume is saved in a markdown format. This is important because it will encourage GPT-4o-mini to generate a new resume in markdown, which we can easily style into a PDF. Note: ChatGPT (or the like) can convert your PDF resume to markdown.

Step 2: Construct Prompt

With our resume and JD imported, we can now craft a prompt to instruct the model to optimize the resume. A pro tip here is to use ChatGPT to write an initial version of this prompt because 1) it’s pretty long, and 2) LLMs tend to write instructions more aligned with the expectations of other LLMs.

After some experimentation, I ended up with the following prompt template, which rewrites the resume and makes additional suggestions for improvement if skill gaps exist.

prompt_template = lambda resume_string, jd_string : f"""
You are a professional resume optimization expert specializing in tailoring 
resumes to specific job descriptions. Your goal is to optimize my resume and 
provide actionable suggestions for improvement to align with the target role.

### Guidelines:
1. **Relevance**:  
   - Prioritize experiences, skills, and achievements **most relevant to the 
job description**.  
   - Remove or de-emphasize irrelevant details to ensure a **concise** and 
**targeted** resume.
   - Limit work experience section to 2-3 most relevant roles
   - Limit bullet points under each role to 2-3 most relevant impacts

2. **Action-Driven Results**:  
   - Use **strong action verbs** and **quantifiable results** (e.g., 
percentages, revenue, efficiency improvements) to highlight impact.  

3. **Keyword Optimization**:  
   - Integrate **keywords** and phrases from the job description naturally to 
optimize for ATS (Applicant Tracking Systems).  

4. **Additional Suggestions** *(If Gaps Exist)*:  
   - If the resume does not fully align with the job description, suggest:  
     1. **Additional technical or soft skills** that I could add to make my 
profile stronger.  
     2. **Certifications or courses** I could pursue to bridge the gap.  
     3. **Project ideas or experiences** that would better align with the role.  

5. **Formatting**:  
   - Output the tailored resume in **clean Markdown format**.  
   - Include an **"Additional Suggestions"** section at the end with 
actionable improvement recommendations.  

---

### Input:
- **My resume**:  
{resume_string}

- **The job description**:  
{jd_string}

---

### Output:  
1. **Tailored Resume**:  
   - A resume in **Markdown format** that emphasizes relevant experience, 
skills, and achievements.  
   - Incorporates job description **keywords** to optimize for ATS.  
   - Uses strong language and is no longer than **one page**.

2. **Additional Suggestions** *(if applicable)*:  
   - List **skills** that could strengthen alignment with the role.  
   - Recommend **certifications or courses** to pursue.  
   - Suggest **specific projects or experiences** to develop.
"""

Step 3: Make API Call

Using the above prompt template, we can dynamically construct a prompt using the input resume and JD and then send it to OpenAI via their API.

# create prompt
prompt = prompt_template(resume_string, jd_string)

# setup api client
client = OpenAI(api_key=my_sk)

# make api call
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Expert resume writer"},
        {"role": "user", "content": prompt}
    ], 
    temperature = 0.7
)

# extract response
response_string = response.choices[0].message.content

Step 4: Save New Resume

Finally, we can extract the optimized resume and suggestions for improvement.

# separate new resume from improvement suggestions
response_list = response_string.split("## Additional Suggestions")

For the resume, we can convert the markdown output to HTML using the markdown library. Then, convert the HTML to a PDF using weasyprint.

# save as PDF
output_pdf_file = "resumes/resume_new.pdf"

# Convert Markdown to HTML
html_content = markdown(response_list[0])

# Convert HTML to PDF and save
HTML(string=html_content).write_pdf(output_pdf_file, 
                                    stylesheets=['resumes/style.css'])

Here’s what the final result looks like.

Final PDF version of resume. Image by author.

For the improvement suggestions, we can print those directly.

display(Markdown(response_list[1]))

Improvement suggestions. Image by author.

Bonus: Build a GUI

While the code above streamlines this process to some extent, we can do better. To improve the usability of this tool, we can create a simple web interface using Gradio.

The final product is shown below. A user can upload a markdown file of their resume and paste it into any job description more straightforwardly. I also added an area where users can edit the new resume before exporting it as a PDF.

Demo of final GUI. GIF by author.
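
For reference, a bare-bones version of such an interface might look like the sketch below. It reuses the prompt_template and OpenAI client defined in Steps 2-3 and is only a simplified approximation of the app in the repo (which also lets you edit the result before exporting to PDF).

import gradio as gr

def optimize_resume(resume_path, jd_string):
    # gr.File with type="filepath" passes the path of the uploaded markdown resume
    with open(resume_path, "r", encoding="utf-8") as f:
        resume_string = f.read()

    # reuse the prompt template and OpenAI client from Steps 2-3
    prompt = prompt_template(resume_string, jd_string)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Expert resume writer"},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
    )
    response_string = response.choices[0].message.content

    # split the optimized resume from the improvement suggestions
    parts = response_string.split("## Additional Suggestions")
    suggestions = parts[1] if len(parts) > 1 else "No additional suggestions."
    return parts[0], suggestions

demo = gr.Interface(
    fn=optimize_resume,
    inputs=[
        gr.File(label="Resume (.md)", type="filepath"),
        gr.Textbox(label="Job Description", lines=10),
    ],
    outputs=[
        gr.Markdown(label="Optimized Resume"),
        gr.Markdown(label="Improvement Suggestions"),
    ],
    title="Resume Optimizer",
)

demo.launch()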

The example code is available at the GitHub repository here. Check out the YouTube video to see me talk through the code.

What’s Next?

While tailoring one’s resume to specific job descriptions is an effective way to make an application stand out, it can be quite tedious. Here, we walk through the implementation of an AI-powered resume optimization tool using Python and OpenAI’s API.

If you have any questions or want to dive deeper into any of the topics covered, let me know in the comments 🙂

  • y2b.io helped me write this article.

👉 Get exclusive access to AI resources and project ideas: Learn more

How I’d Learn AI in 2025 (If I Could Start Over)

A 5-step roadmap for today's landscape

Today, more people than ever are trying to learn AI. Although there are countless free learning resources online, navigating this rapidly evolving landscape can be overwhelming (especially as a beginner). In this article, I discuss how I’d approach learning AI, given what I know now and the tools available today.

Image from Canva.

Who is this for?

Given the wide range of backgrounds interested in AI these days, I’ve tried to make this guide widely accessible. However, no guide can help everyone. Here are a few specific groups I have in mind.

  • Technical professionals trying to up-skill for career advancement
  • Business leaders who haven’t coded in years, trying to keep up with the changing tech landscape
  • Entrepreneurs building AI-native products
  • Students trying to develop their technical AI skills

About me – I’ve worked in AI for the past 6 years. I started as an AI researcher while getting my PhD, then eventually worked as a data scientist at Toyota. Although I still have a lot to learn, the approach below covers (what I think are) the essentials based on my personal experience.

My 5-Step Approach

The guiding principle of this framework is to learn by doing. Each step outlines a clear and specific objective whose completion will naturally develop key skills. In other words, rather than reviewing a list of concepts and courses, each step is a task designed to force me to learn essential skills by completing it.

Here’s an overview of the 5-step approach. Each step builds upon the ones before it.

  1. Use ChatGPT (or the like)
  2. Install Python
  3. Build an Automation
  4. Build an ML Project
  5. Build a Real-world Project

Step 1: Use ChatGPT (or the like)

If starting from zero, the first thing I would do is familiarize myself with modern AI tools i.e. ChatGPT, Claude, and the like. This is important because frequently using these models will give me a practical understanding of what they can and can’t do and develop my ability to use them effectively through prompting.

On a more meta level, these chat interfaces are incredible tools for learning AI (or anything else, really). I’d use it to explain confusing buzzwords and technical concepts (e.g. LLM, tokens, API, RAG) and be sure to ask follow-up questions until I have a solid understanding of each idea. For those that don’t click, I’d seek alternative resources using Google search and YouTube.

Step 2: Install Python

Although I could go far with today’s no-code AI tools, they are fundamentally limited. Namely, these tools can’t be easily used to build custom solutions or process information in bulk. That’s why the next thing I would do is install Python on my computer.

Python is the industry standard programming language for AI development. To get it installed, I’d ask ChatGPT for step-by-step instructions. If I get stuck, I’d come back to ChatGPT, explain the issue, and ask for additional guidance.

While using ChatGPT (or any other AI assistant) in this way can streamline the process significantly, I would still take the time to understand each step of the process and ask follow-up questions as needed. This is an important habit to develop because it will avoid accumulating technical debt, which I’ll have to pay later when something goes wrong.

Python QuickStart for People Learning AI

Step 3: Build an Automation (Beginner)

Once I’ve become comfortable using ChatGPT and installed Python on my machine, my next step would be to build a simple automation using Python. My approach to generating project ideas would be to think of things I consistently use ChatGPT for (e.g. summarizing research articles), then try and automate it with Python.

This would require me to become familiar with OpenAI’s Python API. So, I’d start by reading their documentation and reviewing the example code there. Once I felt comfortable with the API, I’d start writing Python code.

My first step would be to think through the steps of my automation. For example, if summarizing research papers, the steps might be:

  1. Read paper contents into Python
  2. Construct prompt for GPT-4o
  3. Make an OpenAI API call

If I got stuck, I’d turn to ChatGPT for assistance. For instance, if I didn’t know how to read PDFs into Python, I could ask ChatGPT for help. If it spits out code I don’t understand, I’d ask follow-up questions until I understand each line.

It (again) is important that I take this approach to coding with ChatGPT because blindly copy-pasting code from it wouldn’t teach me much. It would also accrue unforgiving technical debt. In other words, I’d get short-term gains but would have to pay for them later via technical difficulties and headaches.

Task: Use OpenAI API (or the like) to build a simple automation

Resources: OpenAI API Intro | Paper Summarizer Example
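
Putting those three steps together, a minimal version of the paper summarizer might look something like the sketch below. The file path is a placeholder, and pypdf is just one of several libraries that can extract text from a PDF.

from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1) read paper contents into Python (path is a placeholder)
reader = PdfReader("papers/example_paper.pdf")
paper_text = "\n".join(page.extract_text() for page in reader.pages)

# 2) construct prompt for the model
prompt = f"Summarize the key findings of the following research paper:\n\n{paper_text}"

# 3) make an OpenAI API call
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.3,
)

print(response.choices[0].message.content)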

Step 4: Build an ML Project (Intermediate)

After Step 3 gets easy for me, I’d seek out more sophisticated projects. Rather than simply making ChatGPT-like API calls, I’d build a project that required me to use embedding models or to train a model myself.

Potential project ideas would be things like a retrieval-augmented generation (RAG) system, a semantic search tool built on embedding models, or fine-tuning a small model on my own data.

For example, if I went with the RAG project, I’d first educate myself on RAG by watching YouTube videos and reading blog posts. Then, I’d break down the system’s basic components and the steps to implement it. Finally, I’d start coding the project, using ChatGPT as a co-pilot like Step 3.

  • Task: Build an ML project that goes beyond ChatGPT-like API call
  • Resources: More Project Ideas

5 AI Projects You Can Build This Weekend (with Python)

Step 5: Build a Real-world Project (Advanced)

Although I would have learned a lot about the technical side of AI from Steps 3 and 4, this is not sufficient for generating value with it. For that, I’d need to use what I learned to solve real-world problems.

There are two ways to do this. I could, one, solve my own problem, or two, solve someone else’s problem. Since I (hopefully) already did the former in Steps 3 and 4, here are a few different ways I’d approach the latter.

  1. Reach out to business owners and professionals in my network
  2. Join a research group at my University (if I was a student)
  3. Find an internship (if I was a student)
  4. Find a freelance gig on Upwork

Let’s say I had graduated from college and wasn’t quite confident enough to freelance yet, so that leaves Option 1. I’d start by making a list of people to reach out to. Ideal contacts would be small business owners or professionals working at a small to medium-sized business.

Then, I would craft a message like the one below and send it to everyone on my list via LinkedIn DM or email. If I struggle to find the right wording, I’d use ChatGPT (yet again) to help out.

Subject: Offering Free Help with AI Projects

Hi [Name],

Your work at [Company Name] caught my attention—[insert a detail from their 
LinkedIn or company page you find interesting].

Over the past few months, I've been building practical AI projects to develop 
my skills. You can see some examples [here](link to portfolio).

Now, I'm looking to apply my learnings to solve real-world problems 
by helping businesses like yours—**completely free of charge**. If there's a 
task you've been looking to automate or improve with AI, I'd be happy to 
explore how I can contribute.

Would you be opposed to a short conversation to discuss this?

Best regards,
Shaw

Final Thoughts

Although AI entails an interdisciplinary collection of technical skills and knowledge, with today’s tools and resources, it’s never been more accessible. Here, I shared the 5-step approach I’d take to learning it today.

That said, it’s important to remember that learning (itself) is hard. You will get confused, you will get frustrated, and you will question why you’re putting yourself through this. However, if you are willing to see it through, you will be rewarded with clarity and knowledge, which is an amazing gift.

If you have questions or want feedback on project ideas, feel free to share them in the comments 🙂


I’m hosting a 6-week (live) AI Bootcamp starting Jan 10 (Learn more)

👉 Get a 40% discount using promo code SAVE40

Multimodal RAG: Process Any File Type with AI

A beginner-friendly guide with example (Python) code

This is the third article in a larger series on multimodal AI. In the previous posts, we discussed multimodal LLMs and embedding models, respectively. In this article, we will combine these ideas to enable the development of multimodal RAG systems. I’ll start by reviewing key concepts and then share example code for implementing such a system.

Image from Canva.

Language models like GPT, LLaMA, and Claude learn a tremendous amount of world knowledge via their pre-training. This makes them powerful tools for solving custom problems and answering complex questions.

However, there is knowledge that even the most advanced language models are ignorant of. This includes proprietary information within organizations, events that occurred after a model’s pre-training data collection, and specialized knowledge that is not prevalent on the internet.

Although this ignorance limits a model’s out-of-the-box capabilities, there is a popular technique to overcome these limitations: retrieval augmented generation (or RAG for short).

What is RAG?

RAG is an approach for improving a model’s response quality by dynamically providing the relevant context for a given prompt. Here’s an example of when this might be helpful.

Say, I forgot the name of a Python library a colleague mentioned in yesterday’s meeting. This isn’t something ChatGPT can help me with because it does not know the meeting’s contents.

However, RAG could help with this by taking my question (e.g. "What was the name of that Python library that Rachel mentioned in yesterday’s meeting?"), automatically pulling the meeting transcript, then providing my original query and the transcript to an LLM.

Basic design of a RAG system. Image by author.

Multimodal RAG

Although improving LLMs with RAG unlocks several practical use cases, there are some situations where relevant information exists in non-text formats, e.g., images, videos, charts, and tables. In such cases, we can go one step further and build multimodal RAG systems, AI systems capable of processing text and non-text data.

Multimodal RAG enables more sophisticated inferences beyond what is conveyed by text alone. For example, it could analyze someone’s facial expressions and speech tonality to give a richer context to a meeting’s transcription.

Basic design of a Multimodal RAG system. Image by author.

3 Levels of MRAG

While there are several ways to implement a multimodal RAG (MRAG) system, here I will focus on three basic strategies at increasing levels of sophistication.

  1. Translate modalities to text.
  2. Text-only retrieval + MLLM
  3. Multimodal retrieval + MLLM

The following discussion assumes you already have a basic understanding of RAG and multimodal models. The following articles discussed these topics: RAG, Multimodal LLMs, and Multimodal Embeddings.

Level 1: Translate modalities to text

A simple way to make a RAG system multimodal is by translating new modalities to text before storing them in the knowledge base. This could be as simple as converting meeting recordings into text transcripts, using an existing multimodal LLM (MLLM) to generate image captions, or converting tables to a readable text format (e.g., .csv or .json).

Visual overview of Level 1 of MRAG. Image by author.

The key upside of this approach is that it requires minimal changes to an existing RAG system. Additionally, by explicitly generating text representations of non-text modalities, one has better control over the features of the data to extract. For instance, captions of analytical figures may include both a description and key insights.

Of course, the downside of this strategy is that the model’s responses cannot directly use non-textual data, which means that the translation from, say, image to text can create a critical information bottleneck.
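
As an example of the image-to-text translation step, here is a minimal sketch of generating a caption for an image with a multimodal LLM via OpenAI’s API. The image URL is a placeholder, and the prompt can be adjusted to also pull out key insights from analytical figures.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

image_url = "https://example.com/figure.png"  # placeholder

# ask a multimodal LLM to describe the image so the text can go into the knowledge base
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this figure and summarize its key insight in 2-3 sentences."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
)

caption = response.choices[0].message.content
print(caption)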

Level 2: Text-only retrieval + MLLM

Another approach is to generate text representations of all items in the knowledge base, e.g., descriptions and meta-tags, for retrieval, but to pass the original modality to a multimodal LLM (MLLM). For example, image metadata is used for the retrieval step, and the associated image is passed to a model for inference.

Visual overview of Level 2 of MRAG. Image by author.

This maintains many of the benefits of Level 1 while mitigating its limitations. Namely, text features of items in the knowledge base can be optimized for search, but the downstream model can use the full richness of each item’s original modality.

The key difference with this approach is that it requires an MLLM, which is an LLM capable of processing non-text data. This unlocks more advanced reasoning capabilities, as demonstrated by models like GPT-4o or LLaMA 3.2 Vision.

Level 3: Multimodal retrieval + MLLM

Although we could use keyword-based search in the retrieval processes for Level 1 and Level 2, it is a common practice to use so-called vector search. This consists of generating vector representations (i.e., embeddings) of items in the knowledge base and then performing a search by computing similarity scores between an input query and each item in the knowledge base.

Traditionally, this requires that the query and knowledge base items are text-based. However, as we saw in the previous article of this series, there exist multimodal embedding models that generate aligned vector representations of both text and non-text data.

Therefore, we can use multimodal embeddings to perform multimodal retrieval. This works the same way as text-based vector search, but now the embedding space co-locates similar concepts independent of their original modality. The results of such a retrieval strategy can then be passed directly to an MLLM.

Visual overview of Level 3 of MRAG. Image by author.

Example Code: Multimodal Blog Question-Answering Assistant

With a basic understanding of how Multimodal RAG works, let’s see how we can build such a system. Here, I will create a question-answering assistant that can access the text and figures from the previous two blogs in this series.

The Python code for this example is freely available at the GitHub repo.


Imports & Data Loading

We start by importing a few handy libraries and modules.

import json
from transformers import CLIPProcessor, CLIPTextModelWithProjection
from torch import load, matmul, argsort
from torch.nn.functional import softmax

Next, we’ll import text and image chunks from the Multimodal LLMs and Multimodal Embeddings blog posts. These are saved in .json files, which can be loaded into Python as a list of dictionaries.

# load text chunks
with open('data/text_content.json', 'r', encoding='utf-8') as f:
        text_content_list = json.load(f)

# load images
with open('data/image_content.json', 'r', encoding='utf-8') as f:
        image_content_list = json.load(f)

While I won’t review the data preparation process here, the code I used is on the GitHub repo.

We will also load the multimodal embeddings (from CLIP) for each item in text_content_list and image_content_list. These are saved as PyTorch tensors.

# load embeddings
text_embeddings = load('data/text_embeddings.pt', weights_only=True)
image_embeddings = load('data/image_embeddings.pt', weights_only=True)

print(text_embeddings.shape)
print(image_embeddings.shape)

# >> torch.Size([86, 512])
# >> torch.Size([17, 512])

Printing the shape of these tensors, we see they are represented via 512-dimensional embeddings. And we have 86 text chunks and 17 images.

Multimodal Search

With our knowledge base loaded, we can now define a query for vector search. This will consist of translating an input query into an embedding using CLIP. We do this similarly to the examples from the previous post.

# query
query = "What is CLIP's contrastive loss function?"

# embed query (4 steps)
# 1) load model
model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch16")
# 2) load data processor
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
# 3) pre-process text
inputs = processor(text=[query], return_tensors="pt", padding=True)
# 4) compute embeddings with CLIP
outputs = model(**inputs)

# extract embedding
query_embed = outputs.text_embeds
print(query_embed.shape)

# >> torch.Size([1, 512])

Printing the shape, we see we have a single vector representing the query.

To perform a vector search over the knowledge base, we need to do the following.

  1. Compute similarities between the query embedding and all the text and image embeddings.
  2. Rescale the similarities to range from 0 to 1 via the softmax function.
  3. Sort the scaled similarities and return the top k results.
  4. Finally, filter the results to only keep items above a pre-defined similarity threshold.

Here’s what that looks like in code for the text chunks.

# define k and similarity threshold
k = 5
threshold = 0.05

# multimodal search over articles
text_similarities = matmul(query_embed, text_embeddings.T)

# rescale similarities via softmax
temp=0.25
text_scores = softmax(text_similarities/temp, dim=1)

# return top k filtered text results
isorted_scores = argsort(text_scores, descending=True)[0]
sorted_scores = text_scores[0][isorted_scores]

itop_k_filtered = [idx.item() 
                    for idx, score in zip(isorted_scores, sorted_scores) 
                    if score.item() >= threshold][:k]
top_k = [text_content_list[i] for i in itop_k_filtered]

print(top_k)
# top k results

[{'article_title': 'Multimodal Embeddings: An Introduction',
  'section': 'Contrastive Learning',
  'text': 'Two key aspects of CL contribute to its effectiveness'}]

Above, we see the top text results. Notice we only have one item, even though k=5. This is because the 2nd-5th items were below the 0.05 threshold.

Interestingly, this item doesn’t seem helpful to our initial query of "What is CLIP’s contrastive loss function?" This highlights one of the key challenges of vector search: items similar to a given query may not necessarily help answer it.

One way to mitigate this issue is to relax the restrictions on our search results by increasing k and lowering the similarity threshold, then letting the LLM sort out what is helpful and what is not.

To do this, I’ll first package the vector search steps into a Python function.

def similarity_search(query_embed, target_embeddings, content_list, 
                      k=5, threshold=0.05, temperature=0.5):
    """
       Perform similarity search over embeddings and return top k results.
    """
    # Calculate similarities (matmul was imported from torch earlier)
    similarities = matmul(query_embed, target_embeddings.T)

    # Rescale similarities via softmax (imported from torch.nn.functional earlier)
    scores = softmax(similarities/temperature, dim=1)

    # Get sorted indices and scores
    sorted_indices = scores.argsort(descending=True)[0]
    sorted_scores = scores[0][sorted_indices]

    # Filter by threshold and get top k
    filtered_indices = [
        idx.item() for idx, score in zip(sorted_indices, sorted_scores) 
        if score.item() >= threshold
    ][:k]

    # Get corresponding content items and scores
    top_results = [content_list[i] for i in filtered_indices]
    result_scores = [scores[0][i].item() for i in filtered_indices]

    return top_results, result_scores

Then, set more inclusive search parameters.

# search over text chunks
text_results, text_scores = similarity_search(query_embed, text_embeddings, 
                    text_content_list, k=15, threshold=0.01, temperature=0.25)

# search over images
image_results, image_scores = similarity_search(query_embed, image_embeddings, 
                    image_content_list, k=5, threshold=0.25, temperature=0.5)

This results in 15 text results and 1 image result.

1 - Two key aspects of CL contribute to its effectiveness
2 - To make a class prediction, we must extract the image logits and evaluate 
which class corresponds to the maximum.
3 - Next, we can import a version of the clip model and its associated data 
processor. Note: the processor handles tokenizing input text and image 
preparation.
4 - The basic idea behind using CLIP for 0-shot image classification is to 
pass an image into the model along with a set of possible class labels. Then, 
a classification can be made by evaluating which text input is most similar to 
the input image.
5 - We can then match the best image to the input text by extracting the text 
logits and evaluating the image corresponding to the maximum.
6 - The code for these examples is freely available on the GitHub repository.
7 - We see that (again) the model nailed this simple example. But let's try 
some trickier examples.
8 - Next, we'll preprocess the image/text inputs and pass them into the model.
9 - Another practical application of models like CLIP is multimodal RAG, which 
consists of the automated retrieval of multimodal context to an LLM. In the 
next article of this series, we will see how this works under the hood and 
review a concrete example.
10 - Another application of CLIP is essentially the inverse of Use Case 1. 
Rather than identifying which text label matches an input image, we can 
evaluate which image (in a set) best matches a text input (i.e. query)—in 
other words, performing a search over images.
11 - This has sparked efforts toward expanding LLM functionality to include 
multiple modalities.
12 - GPT-4o - Input: text, images, and audio. Output: text.FLUX - Input: text. 
Output: images.Suno - Input: text. Output: audio.
13 - The standard approach to aligning disparate embedding spaces is 
contrastive learning (CL). A key intuition of CL is to represent different 
views of the same information similarly [5].
14 - While the model is less confident about this prediction with a 54.64% 
probability, it correctly implies that the image is not a meme.
15 - [8] Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex 
Capabilities
Image search result.

Prompting MLLM

Although most of these text item results do not seem helpful to our query, the image result is exactly what we’re looking for. Nevertheless, given these search results, let’s see how LLaMA 3.2 Vision responds to this query.

We first will structure the search results as well-formatted strings.

# format text results
text_context = ""
for text in text_results:
    text_context += "**Article title:** " + text['article_title'] + "\n"
    text_context += "**Section:** " + text['section'] + "\n"
    text_context += "**Snippet:** " + text['text'] + "\n\n"

# format image results
image_context = ""
for image in image_results:
    image_context += "**Article title:** " + image['article_title'] + "\n"
    image_context += "**Section:** " + image['section'] + "\n"
    image_context += "**Image Path:** " + image['image_path'] + "\n"
    image_context += "**Image Caption:** " + image['caption'] + "\n\n"

Note the metadata that accompanies each text and image item. This will help LLaMA better understand the context of the content.

Next, we interleave the text and image results in a prompt.

# construct prompt template
prompt = f"""Given the query "{query}" and the following relevant snippets:

{text_context}
{image_context}

Please provide a concise and accurate answer to the query, incorporating 
relevant information from the provided snippets where possible.

"""

The final prompt is quite long, so I won’t print it here. However, it is fully displayed in the example notebook on GitHub.

Finally, we can use ollama to pass this prompt to LLaMA 3.2 Vision.

import ollama

ollama.pull('llama3.2-vision')

response = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': prompt,
        'images': [image["image_path"] for image in image_results]
    }]
)

print(response['message']['content'])
The image depicts a contrastive loss function for aligning text and image 
representations in multimodal models. The function is designed to minimize the 
difference between the similarity of positive pairs (text-image) and negative 
pairs (text-text or image-image). This loss function is commonly used in CLIP, 
which stands for Contrastive Language-Image Pre-training.

**Key Components:**

*   **Positive Pairs:** Text-image pairs where the text describes an image.
*   **Negative Pairs:** Text-text or image-image pairs that do not belong to 
the same class.
*   **Contrastive Loss Function:** Calculates the difference between positive 
and negative pairs' similarities.

**How it Works:**

1.  **Text-Image Embeddings:** Generate embeddings for both text and images 
using a multimodal encoder (e.g., CLIP).
2.  **Positive Pair Similarity:** Calculate the similarity score between each 
text-image pair.
3.  **Negative Pair Similarity:** Calculate the similarity scores between all 
negative pairs.
4.  **Contrastive Loss Calculation:** Compute the contrastive loss by 
minimizing the difference between positive and negative pairs' similarities.

**Benefits:**

*   **Multimodal Alignment:** Aligns text and image representations for better 
understanding of visual content from text descriptions.
*   **Improved Performance:** Enhances performance in downstream tasks like 
image classification, retrieval, and generation.

The model correctly picks up that the image contains the information it needs and explains the general intuition of how it works. However, it misunderstands the meaning of positive and negative pairs, thinking that a negative pair corresponds to a pair of the same modality.

While we went through the implementation details step-by-step, I packaged everything into a nice UI using Gradio in this notebook on the GitHub repo.

YouTube-Blog/multimodal-ai/3-multimodal-rag at main · ShawhinT/YouTube-Blog

Conclusion

Multimodal RAG systems can synthesize knowledge stored in a variety of formats, expanding what’s possible with AI. Here, we reviewed 3 simple strategies for developing such a system and then saw an example implementation of a multimodal blog QA assistant.

Although the example worked well enough for this demonstration, there are clear limitations to the search process. A few techniques that may improve it include using a reranker to refine the similarity search results and fine-tuning the multimodal embeddings to improve search quality.
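As a sketch of the reranking idea (using the sentence-transformers library and a generic cross-encoder checkpoint, neither of which appears elsewhere in this post), the retrieved text chunks could be re-scored against the query and only the best ones kept.

from sentence_transformers import CrossEncoder

# re-score each retrieved chunk against the query with a cross-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative checkpoint
pairs = [(query, result['text']) for result in text_results]
rerank_scores = reranker.predict(pairs)

# keep the 3 highest-scoring chunks
reranked = [result for _, result in
            sorted(zip(rerank_scores, text_results), key=lambda x: -x[0])][:3]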

If you want to see future posts on these topics, let me know in the comments 🙂

More on Multimodal models 👇

Multimodal AI


My website: https://www.shawhintalebi.com/

[1] RAG

[2] Multimodal LLMs

[3] Multimodal Embeddings

The post Multimodal RAG: Process Any File Type with AI appeared first on Towards Data Science.

Multimodal Embeddings: An Introduction https://towardsdatascience.com/multimodal-embeddings-an-introduction-5dc36975966f/ Fri, 29 Nov 2024 15:02:12 +0000 https://towardsdatascience.com/multimodal-embeddings-an-introduction-5dc36975966f/ Mapping text and images into a common space

The post Multimodal Embeddings: An Introduction appeared first on Towards Data Science.

This is the 2nd article in a larger series on multimodal AI. In the previous post, we saw how to augment large language models (LLMs) to understand new data modalities (e.g., images, audio, video). One such approach relied on encoders that generate vector representations (i.e. embeddings) of non-text data. In this article, I will discuss multimodal embeddings and share what they can do via two practical use cases.

Image from Canva.

AI research is traditionally split into distinct fields: NLP, computer vision (CV), robotics, human-computer interaction (HCI), etc. However, countless practical tasks require the integration of these different research areas, e.g., autonomous vehicles (CV + robotics), AI agents (NLP + CV + HCI), personalized learning (NLP + HCI), etc.

Although these fields aim to solve different problems and work with different data types, they all share a fundamental process. Namely, generating useful numerical representations of real-world phenomena.

Historically, this was done by hand, meaning researchers and practitioners would use their (or other people’s) expertise to explicitly transform data into a more helpful form. Today, however, these representations can be derived another way.

Embeddings

Embeddings are (useful) numerical representations of data learned implicitly through model training. For example, by learning to predict text, BERT learned representations of text that are helpful for many NLP tasks [1]. Another example is the Vision Transformer (ViT), trained for image classification on ImageNet, which can be repurposed for other applications [2].

A key point here is that these learned embedding spaces will have some underlying structure so that similar concepts are located close together, as shown in the toy examples below.

Toy representation of text and image embeddings, respectively. Image by author.
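To make this concrete, here is a minimal sketch of that closeness using the sentence-transformers library (which is not used elsewhere in this article); the model name and example strings are illustrative.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# similar concepts should end up close together in embedding space
model = SentenceTransformer("all-MiniLM-L6-v2")  # small text embedding model (illustrative choice)
embeddings = model.encode(["a photo of a dog", "a photo of a puppy", "a quarterly financial report"])

print(cosine_similarity([embeddings[0]], [embeddings[1]]))  # relatively high
print(cosine_similarity([embeddings[0]], [embeddings[2]]))  # relatively low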

One key limitation of the previously mentioned models is that they are restricted to a single data modality, e.g., text or images. This prevents cross-modal applications like image captioning, content moderation, image search, and more. But what if we could merge these representations?

Multimodal Embeddings

Although text and images may look very different to us, in a neural network, both are represented via the same mathematical object, i.e., a vector. Therefore, in principle, text, images, or any other data modality can be processed by a single model.

This fact underlies multimodal embeddings, which represent multiple data modalities in the same vector space such that similar concepts are co-located (independent of their original representations).

Toy representation of multimodal embedding space. Image by author.

For example, CLIP encodes text and images into a shared embedding space [3]. A key insight from CLIP is that by aligning text and image representations, the model is capable of 0-shot image classification on an arbitrary set of target classes since any input text can be treated as a class label (we will see a concrete example of this later).

However, this idea is not limited to text and images. Virtually any data modalities can be aligned in this way, e.g., text-audio, audio-image, text-EEG, image-tabular, and text-video, unlocking use cases such as video captioning, advanced OCR, audio transcription, video search, and EEG-to-text [4].

Contrastive Learning

The standard approach to aligning disparate embedding spaces is contrastive learning (CL). A key intuition of CL is to represent different views of the same information similarly [5].

This consists of learning representations that maximize the similarity between positive pairs and minimize the similarity of negative pairs. In the case of an image-text model, a positive pair might be an image with an appropriate caption, while a negative pair would be an image with an irrelevant caption (as shown below).

Example positive and negative pairs used in contrastive training. Image by author.

Two key aspects of CL contribute to its effectiveness:

  1. Since positive and negative pairs can be curated from the data’s inherent structure (e.g., metadata from web images), CL training data do not require manual labeling, which unlocks larger-scale training and more powerful representations [3].
  2. It simultaneously maximizes positive and minimizes negative pair similarity via a special loss function, as demonstrated by CLIP [3].
CLIP’s contrastive loss for text-image representation alignment [3]. Image by author.
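Here is a minimal PyTorch sketch of this symmetric contrastive loss (an illustration of the idea rather than CLIP's exact training code), assuming the image and text embeddings are already L2-normalized:

import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # pairwise similarity matrix; matching image-text pairs sit on the diagonal
    logits = image_embeds @ text_embeds.T / temperature

    # each image should match its own caption, and vice versa
    targets = torch.arange(logits.shape[0])
    loss_images = F.cross_entropy(logits, targets)   # image -> text direction
    loss_texts = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_images + loss_texts) / 2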

Example Code: Using CLIP for 0-shot classification and image search

With a high-level understanding of how multimodal embeddings work, let’s see two concrete examples of what they can do. Here, I will use the open-source CLIP model to perform two tasks: 0-shot image classification and image search.

The code for these examples is freely available on the GitHub repository.


Use case 1: 0-shot Image Classification

The basic idea behind using CLIP for 0-shot image classification is to pass an image into the model along with a set of possible class labels. Then, a classification can be made by evaluating which text input is most similar to the input image.

We’ll start by importing the Hugging Face Transformers library so that the CLIP model can be downloaded locally. Additionally, the PIL library is used to load images in Python.

from transformers import CLIPProcessor, CLIPModel
from PIL import Image

Next, we can import a version of the CLIP model and its associated data processor. Note: the processor handles tokenizing input text and image preparation.

# import model
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")

# import processor (handles text tokenization and image preprocessing)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16") 

We load in the below image of a cat and create a list of two possible class labels: "a photo of a cat" or "a photo of a dog".

# load image
image = Image.open("images/cat_cute.png")

# define text classes
text_classes = ["a photo of a cat", "a photo of a dog"]
Input cat photo. Image from Canva.

Next, we’ll preprocess the image/text inputs and pass them into the model.

# pass image and text classes to processor
inputs = processor(text=text_classes, images=image, return_tensors="pt", 
                                                    padding=True)

# pass inputs to CLIP
outputs = model(**inputs) # note: "**" unpacks dictionary items

To make a class prediction, we must extract the image logits and evaluate which class corresponds to the maximum.

# image-text similarity score
logits_per_image = outputs.logits_per_image 
# convert scores to probs via softmax
probs = logits_per_image.softmax(dim=1) 

# print prediction
predicted_class = text_classes[probs.argmax()]
print(predicted_class, "| Probability = ", 
                       round(float(probs[0][probs.argmax()]),4))
>> a photo of a cat | Probability =  0.9979

The model nailed it with a 99.79% probability that it’s a cat photo. However, this was a super easy one. Let’s see what happens when we change the class labels to: "ugly cat" and "cute cat" for the same image.

>> cute cat | Probability =  0.9703

The model easily identified that the image was indeed a cute cat. Let’s do something more challenging like the labels: "cat meme" or "not cat meme".

>> not cat meme | Probability =  0.5464

While the model is less confident about this prediction with a 54.64% probability, it correctly implies that the image is not a meme.

Use case 2: Image Search

Another application of CLIP is essentially the inverse of Use Case 1. Rather than identifying which text label matches an input image, we can evaluate which image (in a set) best matches a text input (i.e. query)—in other words, performing a search over images.

We start by storing a set of images in a list. Here, I have three images of a cat, dog, and goat, respectively.

# create list of images to search over
image_name_list = ["images/cat_cute.png", "images/dog.png", "images/goat.png"]

image_list = []
for image_name in image_name_list:
    image_list.append(Image.open(image_name))

Next, we can define a query like "a cute dog" and pass it and the images into CLIP.

# define a query
query = "a cute dog"

# pass images and query to CLIP
inputs = processor(text=query, images=image_list, return_tensors="pt", 
                                                  padding=True)

We can then match the best image to the input text by extracting the text logits and evaluating the image corresponding to the maximum.

# compute logits and probabilities
outputs = model(**inputs)
logits_per_text = outputs.logits_per_text
probs = logits_per_text.softmax(dim=1)

# print best match
best_match = image_list[probs.argmax()]
prob_match = round(float(probs[0][probs.argmax()]),4)

print("Match probability: ",prob_match)
display(best_match)
>> Match probability:  0.9817
Best match for query "a cute dog". Image from Canva.

We see that (again) the model nailed this simple example. But let’s try some trickier examples.

query = "something cute but metal 🤘"
>> Match probability:  0.7715
Best match for query "something cute but metal 🤘". Image from Canva.
query = "a good boy"
>> Match probability:  0.8248
Best match for query "a good boy". Image from Canva.
query = "the best pet in the world"
>> Match probability:  0.5664
Best match for query "the best pet in the world". Image from Canva.

Although this last prediction is quite controversial, all the other matches were spot on! This is likely since images like these are ubiquitous on the internet and thus were seen many times in CLIP’s pre-training.

YouTube-Blog/multimodal-ai/2-mm-embeddings at main · ShawhinT/YouTube-Blog

What’s Next?

Multimodal embeddings unlock countless AI use cases that involve multiple data modalities. Here, we saw two such use cases, i.e., 0-shot image classification and image search using CLIP.

Another practical application of models like CLIP is multimodal RAG, which consists of the automated retrieval of multimodal context to an LLM. In the next article of this series, we will see how this works under the hood and review a concrete example.

More on Multimodal models 👇

Multimodal AI


My website: https://www.shawhintalebi.com/

The post Multimodal Embeddings: An Introduction appeared first on Towards Data Science.

Multimodal Models – LLMs that can see and hear https://towardsdatascience.com/multimodal-models-llms-that-can-see-and-hear-5c6737c981d3/ Tue, 19 Nov 2024 11:01:59 +0000 https://towardsdatascience.com/multimodal-models-llms-that-can-see-and-hear-5c6737c981d3/ An introduction with example Python code

The post Multimodal Models – LLMs that can see and hear appeared first on Towards Data Science.

Multimodal Models – LLMs That Can See and Hear

This is the first post in a larger series on Multimodal AI. A Multimodal Model (MM) is an AI system capable of processing or generating multiple data modalities (e.g., text, image, audio, video). In this article, I will discuss a particular type of MM that builds on top of a large language model (LLM). I’ll start with a high-level overview of such models and then share example code for using LLaMA 3.2 Vision to perform various image-to-text tasks.

Photo by Sincerely Media on Unsplash

Large language models (LLMs) have marked a fundamental shift in AI research and development. However, despite their broader impacts, they are still fundamentally limited.

Namely, LLMs can only process and generate text, making them blind to other modalities such as images, video, audio, and more. This is a major limitation since some tasks rely on non-text data, e.g., analyzing engineering blueprints, reading body language or speech tonality, and interpreting plots and infographics.

This has sparked efforts toward expanding LLM functionality to include multiple modalities.

What is a Multimodal Model?

A Multimodal Model (MM) is an AI system that can process multiple data modalities as input or output (or both) [1]. Below are a few examples.

  • GPT-4o – Input: text, images, and audio. Output: text.
  • FLUX – Input: text. Output: images.
  • Suno – Input: text. Output: audio.
Example multimodal models. Image by author.

While there are several ways to create models that can process multiple data modalities, a recent line of research seeks to use LLMs as the core reasoning engine of a multimodal system [2]. Such models are called multimodal large language models (or large multimodal models) [2][3].

One benefit of using an existing LLM as a starting point for MMs is that LLMs have demonstrated a strong ability to acquire world knowledge through large-scale pre-training, which can be leveraged to process concepts appearing in non-textual representations.

3 Paths to Multimodality

Here, I will focus on multimodal models developed from an LLM. Three popular approaches are described below.

  1. LLM + Tools: Augment LLMs with pre-built components
  2. LLM + Adapters: Augment LLMs with multi-modal encoders or decoders, which are aligned via adapter fine-tuning
  3. Unified Models: Expand LLM architecture to fuse modalities at pre-training

Path 1: LLM + Tools

The simplest way to make an LLM multimodal is by adding external modules that can readily translate between text and an arbitrary modality. For example, a transcription model (e.g. Whisper) can be connected to an LLM to translate input speech into text, or a text-to-image model can generate images based on LLM outputs.
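A rough sketch of that speech-to-text path might look like the following (assuming the openai-whisper package, imported as whisper, is installed and a text LLM has been pulled locally with Ollama; the audio file name and model choice are placeholders).

import whisper
import ollama

# transcribe speech to text with a pre-built tool
stt_model = whisper.load_model("base")
transcript = stt_model.transcribe("meeting_audio.mp3")["text"]  # placeholder audio file

# pass the transcript to the LLM as plain text
response = ollama.chat(
    model="llama3.2",  # assumes this text model has been pulled locally
    messages=[{"role": "user", "content": f"Summarize this meeting:\n{transcript}"}],
)
print(response["message"]["content"])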

The key benefit of such an approach is simplicity. Tools can quickly be assembled without any additional model training.

The downside, however, is that the quality of such a system may be limited. Just like when playing a game of telephone, messages mutate when passed from person to person. Information may degrade going from one module to another via text descriptions only.

An example of information degradation during message passing. Image by author.

Path 2: LLM + Adapters

One way to mitigate the "telephone problem" is by optimizing the representations of new modalities to align with the LLM’s internal concept space. For example, ensuring an image of a dog and the description of one look similar to the LLM.

This is possible through the use of adapters, a relatively small set of parameters that appropriately translate a dense vector representation for a downstream model [2][4][5].

Adapters can be trained using, for example, image-caption pairs, where the adapter learns to translate an image encoding into a representation compatible with the LLM [2][4][6]. One way to achieve this is via contrastive learning [2], which I will discuss more in the next article of this series.

A simple strategy for integrating images into an LLM via an image encoding adapter. Image by author.
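As a rough illustration (not the actual adapter used by LLaVA or LLaMA 3.2 Vision), an adapter can be as simple as a small projection network that maps image encoder features into the LLM's embedding space; the dimensions below are hypothetical.

import torch
import torch.nn as nn

class ImageAdapter(nn.Module):
    """Project image encoder features into the LLM's token embedding space."""
    def __init__(self, vision_dim=768, llm_dim=4096):  # hypothetical dimensions
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features):
        return self.proj(image_features)

# e.g., a batch of 16 image feature vectors from a frozen vision encoder
adapter = ImageAdapter()
image_tokens = adapter(torch.randn(16, 768))  # shape: (16, 4096)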

The benefit of using adapters to augment LLMs is that they align novel modality representations with the LLM in a data-efficient way. Since many pre-trained embedding, language, and diffusion models are available in today’s AI landscape, one can readily fuse models based on their needs. Notable examples from the open-source community are LLaVA, LLaMA 3.2 Vision, Flamingo, MiniGPT4, Janus, Mini-Omni2, and IDEFICS [3][5][7][8].

However, this data efficiency comes at a price. Just like how adapter-based fine-tuning approaches (e.g. LoRA) can only nudge an LLM so far, the same holds in this context. Additionally, pasting various encoders and decoders to an LLM may result in overly complicated model architectures.

Path 3: Unified Models

The final way to make an LLM multimodal is by incorporating multiple modalities at the pre-training stage. This works by adding modality-specific tokenizers (rather than pre-trained encoder/decoder models) to the model architecture and expanding the embedding layer to accommodate new modalities [9].
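For intuition only (this is not how GPT-4o or Chameleon are actually implemented), here is a sketch of what expanding an LLM's vocabulary with discrete image tokens could look like in Hugging Face Transformers; the base model and codebook size are placeholders.

from transformers import AutoModelForCausalLM, AutoTokenizer

# stand-in base LLM
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# add special tokens representing codes from a (hypothetical) image tokenizer
num_image_tokens = 1024  # placeholder codebook size
tokenizer.add_tokens([f"<img_{i}>" for i in range(num_image_tokens)])

# expand the embedding layer so the new tokens get trainable embeddings
model.resize_token_embeddings(len(tokenizer))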

While this approach comes with significantly greater technical challenges and computational requirements, it enables the seamless integration of multiple modalities into a shared concept space, unlocking better reasoning capabilities and efficiencies [10].

The preeminent example of this unified approach is (presumably) GPT-4o, which processes text, image, and audio inputs to enable expanded reasoning capabilities at faster inference times than previous versions of GPT-4. Other models that follow this approach include Gemini, Emu3, BLIP, and Chameleon [9][10].

Training these models typically entails multi-step pre-training on a set of (multimodal) tasks, such as language modeling, text-image contrastive learning, text-to-video generation, and others [7][9][10].

Example: Using LLaMA 3.2 Vision for Image-based Tasks

With a basic understanding of how LLM-based multimodal models work under the hood, let’s see what we can do with them. Here, I will use LLaMA 3.2 Vision to perform various image-to-text tasks.

To run this example, download Ollama and its Python library. This enables the model to run locally i.e. no need for external API calls.

The example code is freely available on GitHub.


Importing model

We start by importing ollama.

import ollama

Next, we’ll download the model locally. Here, we use LLaMA 3.2 Vision 11B.

ollama.pull('llama3.2-vision')

Visual QA

Now, we’re ready to use the model! Here’s how we can do basic visual question answering.

# pass image and question to model
response = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': 'What is in this image?',
        'images': ['images/shaw-sitting.jpeg']
    }]
)

# print response
print(response['message']['content'])

The image is of me from a networking event (as shown below).

Image of me from networking event at Richardson IQ. Image by author.

The model’s response is shown below. While it has trouble reading what’s on my hat, it does a decent job inferring the context of the photo.

This image shows a man sitting on a yellow ottoman with his hands clasped 
together. He is wearing a black polo shirt with a name tag that says "Shaw" 
and a black baseball cap with white text that reads, "THE DATA ENREPRENEUR." 
The background of the image appears to be an office or lounge area, with a 
large screen on the wall behind him displaying a presentation slide. There are 
also several chairs and tables in the background, suggesting that this may be 
a meeting room or common area for employees to gather and work.

If you run this on your machine, you may run into a long wait time until the model generates a response. One thing we can do to make this less painful is to enable streaming.

# create stream
stream = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': 'Can you write a caption for this image?',
        'images': ['images/shaw-sitting.jpeg']
    }],
    stream=True,
)

# print chunks in stream as they become available
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

Interestingly, we get a qualitatively different response when prompting the model in a slightly different way for the same image.

This image features a man sitting on a yellow chair. He is wearing a black 
polo shirt with a blue name tag that says "Shaw", khaki pants, and a black 
baseball cap with white text that reads "THE DATA ENTHUSIAST". The man has his 
hands clasped together in front of him and appears to be smiling.

The background of the image consists of a room with various pieces of 
furniture. There is a green ottoman to the left of the yellow chair, and two 
blue chairs on the right side of the image. A brown table or desk sits behind 
the man, along with a fireplace. The walls are painted teal blue and have a 
wooden accent wall featuring holes for hanging items.

The overall atmosphere suggests that this may be a modern office space or 
co-working area where people can come to work, relax, or socialize.

Explaining Memes

Objectively describing a scene is simpler than understanding and explaining humor. Let’s see how the model explains the meme below.

Building with AI meme. Image by author.
# ask model to explain meme
stream = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': 'Can you explain this meme to me?',
        'images': ['images/ai-meme.jpeg']
    }],
    stream=True,
)

# print stream
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
The meme depicts Patrick Star from SpongeBob SquarePants, surrounded by 
various AI tools and symbols. The caption reads "Trying to build with AI 
today..." The image humorously illustrates the challenges of using AI in 
building projects, implying that it can be overwhelming and frustrating.

The model does a good job here. It understands that the image is funny while also conveying the pain that people face.

OCR

The last use case is optical character recognition (OCR). This involves extracting text from images, which is valuable in a wide range of contexts. Here, I’ll see if the model can translate a screenshot from my notes app to a markdown file.

Screenshot of 5 AI project ideas. Image by author.
# ask model to read screenshot and convert to markdown
stream = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': 'Can you transcribe the text from this screenshot in a markdown format?',
        'images': ['images/5-ai-projects.jpeg']
    }],
    stream=True,
)

# read stream
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
Here is the transcription of the text in markdown format:

5 AI Projects You Can Build This Weekend (with Python)

1. **Resume Optimization (Beginner)**
 * Idea: build a tool that adapts your resume for a specific job description
2. **YouTube Lecture Summarizer (Beginner)**
 * Idea: build a tool that takes YouTube video links and summarizes them
3. **Automatically Organizing PDFs (Intermediate)**
 * Idea: build a tool to analyze the contents of each PDF and organize them 
into folders based on topics
4. **Multimodal Search (Intermediate)**
 * Idea: use multimodal embeddings to represent user queries, text knowledge, 
and images in a single space
5. **Desktop QA (Advanced)**
 * Idea: connect a multimodal knowledge base to a multimodal model like 
Llama-3.2-11B-Vision

Note that I've added some minor formatting changes to make the text more 
readable in markdown format. Let me know if you have any further requests.

Again, the model does a decent job out of the box. While it missed the header, it accurately captured the content and formatting of the project ideas.

YouTube-Blog/multimodal-ai/1-mm-llms at main · ShawhinT/YouTube-Blog

What’s next?

Multimodal models are AI systems that can process multiple data modalities as inputs or outputs (or both). A recent trend for developing these systems involves adding modalities to large language models (LLMs).

However, there are other types of multimodal models. In the next article of this series, I will discuss multimodal embedding models, which encode multiple data modalities (e.g. text and images) into a shared representation space.

More on Multimodal models 👇

Multimodal AI


👉 Get FREE access to every new story I write (Learn More)

[1] Multimodal Machine Learning: A Survey and Taxonomy

[2] A Survey on Multimodal Large Language Models

[3] Visual Instruction Tuning

[4] GPT-4o System Card

[5] Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

[6] Learning Transferable Visual Models From Natural Language Supervision

[7] Flamingo: a Visual Language Model for Few-Shot Learning

[8] Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

[9] Emu3: Next-Token Prediction is All You Need

[10] Chameleon: Mixed-Modal Early-Fusion Foundation Models

The post Multimodal Models – LLMs that can see and hear appeared first on Towards Data Science.

Fine-Tuning BERT for Text Classification https://towardsdatascience.com/fine-tuning-bert-for-text-classification-a01f89b179fc/ Thu, 17 Oct 2024 05:31:47 +0000 https://towardsdatascience.com/fine-tuning-bert-for-text-classification-a01f89b179fc/ A hackable example with Python code

The post Fine-Tuning BERT for Text Classification appeared first on Towards Data Science.

Although today’s 100B+ parameter transformer models are state-of-the-art in AI, there’s still much we can accomplish with smaller (< 1B parameter) models. In this article, I will walk through one such example, fine-tuning BERT (110M parameters) to classify phishing URLs. I’ll start by covering key concepts and then share example Python code.

Image from Canva.

Fine-tuning

Fine-tuning involves adapting a pre-trained model to a particular use case through additional training.

Pre-trained models are developed via unsupervised learning, which precludes the need for large-scale labeled datasets. Fine-tuned models can then exploit pre-trained model representations to significantly reduce training costs and improve model performance compared to training from scratch [1].

Fine-Tuning Large Language Models (LLMs)

Splitting the training process into multiple phases has led to today’s state-of-the-art transformer models, such as GPT-4o, Claude, and Llama 3.2. It also enables the democratization of AI since the expensive undertaking of model pre-training can be done by specialized research labs, who can then make these models publicly available for fine-tuning.

BERT

While model fine-tuning gained tremendous popularity post-ChatGPT, it’s been around since (at least) 2015 [2]. One of the early language models developed specifically for fine-tuning was Google’s BERT model, which was pre-trained on two unsupervised tasks: 1) masked language modeling (MLM) and 2) next sentence prediction [1].

The MLM pre-training task consists of predicting arbitrarily masked words in a sequence. This is in contrast to causal language modeling, which is restricted to predicting the word at the end of a sequence. Therefore, MLM enables models to leverage more context (i.e. text before AND after the masked word) to make predictions [1].
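To see masked language modeling in action, we can use the Hugging Face pipeline API with the same base model used later in this post (a quick illustration, separate from the fine-tuning example).

from transformers import pipeline

# BERT fills in the masked word using context on both sides
fill_mask = pipeline("fill-mask", model="google-bert/bert-base-uncased")

for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))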

Next sentence prediction is important for downstream tasks that require understanding the relationship between two sentences (e.g., Question Answering and Semantic Similarity). This is implemented using special input tokens to distinguish the sentence prediction task from the MLM [1].

These pre-training tasks enable BERT to be fine-tuned on a wide range of tasks such as sentiment analysis, sentence similarity, question answering, named entity recognition, common sense reasoning, and many others [1].

Text Classification

Many of the tasks mentioned above (e.g. sentiment analysis, sentence similarity, named entity recognition) fall under the category of text classification, i.e., assigning a label to input text sequences.

There are countless practical applications of text classification, such as detecting spam in emails, categorizing IT support tickets, detecting toxic or harmful speech, and analyzing the sentiment of customer reviews. While each of these tasks is practically very different, their implementations are almost identical from a technical standpoint.

Example Code: Fine-tuning BERT for Phishing URL Identification

Here, we will walk through an example of BERT fine-tuning to classify phishing URLs. We will use the bert-base-uncased model freely available on the Hugging Face (HF) hub.

The model consists of 110M parameters, of which we will only train a small percentage. Therefore, this example should easily run on most consumer hardware (no GPU required).

The fine-tuned model is also available on the HF hub, and an example notebook is available on GitHub.


We’ll start by importing a few handy libraries.

from datasets import load_dataset, DatasetDict, Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer, DataCollatorWithPadding)
import evaluate
import numpy as np

Next, we’ll load the training dataset. It consists of 3,000 text-label pairs with a 70–15–15 train-test-validation split. The data are originally from here (open database license).

dataset_dict = load_dataset("shawhin/phishing-site-classification")

The Transformer library makes it super easy to load and adapt pre-trained models. Here’s what that looks like for the BERT model.

# define pre-trained model path
model_path = "google-bert/bert-base-uncased"

# load model tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# load model with binary classification head
id2label = {0: "Safe", 1: "Not Safe"}
label2id = {"Safe": 0, "Not Safe": 1}
model = AutoModelForSequenceClassification.from_pretrained(model_path, 
                                                           num_labels=2, 
                                                           id2label=id2label, 
                                                           label2id=label2id,)

When we load a model like this, all the parameters will be set as trainable by default. However, training all 110M parameters will be computationally costly and potentially unnecessary.

Instead, we can freeze most of the model parameters and only train the model’s final layer and classification head.

# freeze all base model parameters
for name, param in model.base_model.named_parameters():
    param.requires_grad = False

# unfreeze base model pooling layers
for name, param in model.base_model.named_parameters():
    if "pooler" in name:
        param.requires_grad = True

Next, we will need to preprocess our data. This will consist of two key operations: tokenizing the URLs (i.e., converting them into integers) and truncating them.

# define text preprocessing
def preprocess_function(examples):
    # return tokenized text with truncation
    return tokenizer(examples["text"], truncation=True)

# preprocess all datasets
tokenized_data = dataset_dict.map(preprocess_function, batched=True)

Another important step is creating a data collator that will dynamically pad token sequences in a batch during training so they have the same length. We can do this in one line of code.

# create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

As a final step before training, we can define a function to compute a set of metrics to help us monitor training progress. Here, we will consider model accuracy and AUC.

# load metrics
accuracy = evaluate.load("accuracy")
auc_score = evaluate.load("roc_auc")

def compute_metrics(eval_pred):
    # get predictions
    predictions, labels = eval_pred

    # apply softmax to get probabilities
    probabilities = np.exp(predictions) / np.exp(predictions).sum(-1, 
                                                                 keepdims=True)
    # use probabilities of the positive class for ROC AUC
    positive_class_probs = probabilities[:, 1]
    # compute auc
    auc = np.round(auc_score.compute(prediction_scores=positive_class_probs, 
                                     references=labels)['roc_auc'],3)

    # predict most probable class
    predicted_classes = np.argmax(predictions, axis=1)
    # compute accuracy
    acc = np.round(accuracy.compute(predictions=predicted_classes, 
                                     references=labels)['accuracy'],3)

    return {"Accuracy": acc, "AUC": auc}

Now, we are ready to fine-tune our model. We start by defining hyperparameters and other training arguments.

# hyperparameters
lr = 2e-4
batch_size = 8
num_epochs = 10

training_args = TrainingArguments(
    output_dir="bert-phishing-classifier_teacher",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

Then, we pass our training arguments into a trainer class and train the model.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

The training results are shown below. We can see that the training and validation loss are monotonically decreasing while the accuracy and AUC increase with each epoch.

Training results. Image by author.

As a final test, we can evaluate the performance of the model on the independent validation data, i.e., data not used for training or setting hyperparameters.

# apply model to validation dataset
predictions = trainer.predict(tokenized_data["validation"])

# Extract the logits and labels from the predictions object
logits = predictions.predictions
labels = predictions.label_ids

# Use your compute_metrics function
metrics = compute_metrics((logits, labels))
print(metrics)

# >> {'Accuracy': 0.889, 'AUC': 0.946}

Bonus: Although a 110M parameter model is tiny compared to modern language models, we can reduce its computational requirements using model compression techniques. I cover how to reduce the model’s memory footprint by 7X in the article below.

Compressing Large Language Models (LLMs)

Conclusion

Fine-tuning pre-trained models is a powerful paradigm for developing better models at a lower cost than training them from scratch. Here, we saw how to do this with BERT using the Hugging Face Transformers library.

While the example code was for URL classification, it can be readily adapted to other text classification tasks.

More on LLMs 👇

Large Language Models (LLMs)


My website: https://www.shawhintalebi.com/

[1] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

[2] Semi-supervised Sequence Learning

The post Fine-Tuning BERT for Text Classification appeared first on Towards Data Science.

5 AI Projects You Can Build This Weekend (with Python) https://towardsdatascience.com/5-ai-projects-you-can-build-this-weekend-with-python-c57724e9c461/ Wed, 09 Oct 2024 05:09:44 +0000 https://towardsdatascience.com/5-ai-projects-you-can-build-this-weekend-with-python-c57724e9c461/ From beginner-friendly to advanced

The post 5 AI Projects You Can Build This Weekend (with Python) appeared first on Towards Data Science.

The best way to develop your AI skills is by building projects. However, figuring out what to build can be difficult if you’re just getting started. Here, I share 5 AI projects you can build fast at three levels of sophistication. I’ll break down the steps and Python libraries needed to implement each idea.


The number one mistake beginners make when thinking of project ideas is starting with the question, "How can I use this new tech?" While this can be a fine way to learn a new tool, there is a better way.

Good project ideas start with the question, "What problem can I solve?" This not only makes for a nice story when sharing with potential employers but solving problems is how you translate technical skills into value.

The following projects all take this problem-first approach. You can take these ideas and implement them directly or (even better) use them as inspiration for solving a problem that you are personally facing.

1) Resume Optimization (Beginner)

An effective yet time-consuming part of applying for jobs is adapting your resume to different job descriptions. While automating this task would have been an advanced project a few years ago, with today’s large language models, it is as simple as an API call.

Here’s a step-by-step breakdown of how to implement such an automation.

  1. Create a markdown version of your resume (Note: ChatGPT can do this for you).
  2. Experiment with different prompt templates that take your markdown resume and a job description and output a new resume in markdown.
  3. Use OpenAI’s Python API to prompt GPT-4o-mini to rewrite your resume dynamically.
  4. Convert the markdown file to HTML and then to PDF with the markdown and pdfkit libraries, respectively.

Libraries: openai, markdown, pdfkit

While we could readily use ChatGPT for this, the upside of implementing this with Python is that we can easily scale up the process. Here’s some starter code for Step 3.

import openai
openai.api_key = "your_sk"

# prompt (assuming md_resume and job_description have been defined)
prompt = f"""
I have a resume formatted in Markdown and a job description. 
Please adapt my resume to better align with the job requirements while 
maintaining a professional tone. Tailor my skills, experiences, and 
achievements to highlight the most relevant points for the position. 
Ensure that my resume still reflects my unique qualifications and strengths 
but emphasizes the skills and experiences that match the job description.

### Here is my resume in Markdown:
{md_resume}

### Here is the job description:
{job_description}

Please modify the resume to:
- Use keywords and phrases from the job description.
- Adjust the bullet points under each role to emphasize relevant skills and achievements.
- Make sure my experiences are presented in a way that matches the required qualifications.
- Maintain clarity, conciseness, and professionalism throughout.

Return the updated resume in Markdown format.

"""

# make api call
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ], 
    temperature = 0.25
)

# extract response
resume = response.choices[0].message.content

Note: ChatGPT is super helpful for writing short code snippets (and prompts) like this. If you get stuck, try it for Step 4.
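For reference, a minimal sketch of Step 4 might look like this (assuming pdfkit's wkhtmltopdf dependency is installed; the output file name is a placeholder).

import markdown
import pdfkit

# convert the markdown resume (from the API response above) to HTML
html_resume = markdown.markdown(resume)

# render the HTML to a PDF (pdfkit requires the wkhtmltopdf binary to be installed)
pdfkit.from_string(html_resume, "resume_tailored.pdf")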

2) YouTube Lecture Summarizer (Beginner)

Although I love adding technical talks to my YouTube "watch later" playlist, it might be a while before I watch them (if I ever get around to it 😅 ). A project that can help with this is a tool that watches the videos for me and generates concise summaries with key points.

Here’s one way to do that:

  1. Extract YouTube video ID from video link using regex
  2. Use video ID to extract transcript using youtube-transcript-api
  3. Experiment with different ChatGPT prompts that effectively summarize the transcript
  4. Use OpenAI’s Python API to automate the process

Libraries: openai, youtube-transcript-api

From a technical perspective, this is very similar to the first project. A key difference, however, is that we will need to automatically extract video transcripts, which we can feed into the LLM.

Here’s some starter code for that.

import re
from youtube_transcript_api import YouTubeTranscriptApi

youtube_url = "video link here"

# extract video ID with regex
video_id_regex = r'(?:v=|/)([0-9A-Za-z_-]{11}).*'
match = re.search(video_id_regex, youtube_url)
video_id = match.group(1) if match else None

# extract transcript as a list of text segments
transcript = YouTubeTranscriptApi.get_transcript(video_id)
text_list = [segment['text'] for segment in transcript]
transcript_text = '\n'.join(text_list)

3) Automatically Organizing PDFs (Intermediate)

My watch later playlist is not the only place I hoard technical information. Another cache is my desktop, which is riddled with (118) research papers. Since manually reviewing these papers would be (very) time-consuming, let’s see how AI can help.

One could build a tool that analyzes the contents of each PDF on my desktop and organize them into folders based on topics. Text embeddings can translate each paper into a dense vector representation, from which similar articles could be clustered using a traditional Machine Learning algorithm like K-Means.

Here’s a more detailed breakdown:

  1. Read the abstract of each research article using PyMuPDF
  2. Use the sentence-transformers library to translate abstracts into text embeddings and store them in a Pandas DataFrame
  3. Use your favorite clustering algorithm from sklearn to group the embeddings based on similarity
  4. Create folders for each cluster and move the files into the appropriate folder.

Libraries: PyMuPDF, sentence-transformers, pandas, sklearn

The key step for this project is generating the text embeddings. Here’s a code snippet for doing that with sentence_transformers.

from sentence_transformers import SentenceTransformer

# load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# store abstracts in a list
abstract_list = ["abstract 1", "abstract 2"]

# calculate embeddings
embeddings = model.encode(abstract_list)
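From there, the clustering step (Step 3) might look something like the following sketch; the number of clusters is an arbitrary choice.

from sklearn.cluster import KMeans

# group the abstract embeddings into topic clusters
# (assumes embeddings covers your full set of papers, with more samples than clusters)
kmeans = KMeans(n_clusters=5, random_state=0)  # k is a hypothetical choice
cluster_labels = kmeans.fit_predict(embeddings)

# preview which cluster each abstract landed in
for label, abstract in zip(cluster_labels, abstract_list):
    print(label, abstract[:60])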

4) Multimodal Search (Intermediate)

A couple of months ago, I helped a company create a basic RAG system for a set of technical reports. One of the challenges with searching such reports is that key information is often presented in plots and figures rather than text.

One way to incorporate this visual information into the search process is to use a multimodal embedding model to represent text and images in a shared space.

Here’s a basic breakdown:

  1. Given a PDF, chunk it into sections and extract the images using PyMuPDF
  2. Use a multimodal embedding model (e.g. nomic-ai/nomic-embed-text-v1.5) to represent the chunks and images as dense vectors and store them in a dataframe
  3. Repeat for all PDFs in the knowledge base
  4. Given a user query, pass it through the same embedding model used for the knowledge base
  5. Compute the cosine similarity score between the query embedding and every item in the knowledge base
  6. Return top k results

Libraries: PyMuPDF, transformers, pandas, sklearn

The most important part of this project is how the PDFs are chunked. The simplest way would be to use a fixed character count with some overlap between chunks. It is also helpful to capture metadata such as filename and page number for each chunk.

Here’s some basic boilerplate code to do that (courtesy of ChatGPT). If you get stuck, try asking it to extract the images.

import fitz  # PyMuPDF

def extract_text_chunks(pdf_path, chunk_size, overlap_size):
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)
    chunks = []

    # Iterate through each page in the PDF
    for page_num in range(len(pdf_document)):
        page = pdf_document[page_num]
        page_text = page.get_text()

        # Split the text from the current page into chunks with overlap
        start = 0
        while start < len(page_text):
            end = start + chunk_size
            chunk = page_text[start:end]

            # Store the page number with the chunk
            chunks.append((page_num + 1, chunk))
            # Move to the next chunk with the overlap
            start += chunk_size - overlap_size

    return chunks

# Parameters for extraction
pdf_path = "your_file.pdf"
chunk_size = 1000  # Size of each text chunk in characters
overlap_size = 200  # Overlap size in characters

text_chunks = extract_text_chunks(pdf_path, chunk_size, overlap_size)

# Display the chunks with page numbers
for i, (page_number, chunk) in enumerate(text_chunks):
    print(f"Chunk {i + 1} (Page {page_number}):\n{chunk}\n{'-' * 50}")
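For Steps 5 and 6, a minimal sketch could look like the following (assuming query_embedding and chunk_embeddings were produced by the same multimodal embedding model and stored as 2D arrays).

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# similarity between the query and every chunk/image in the knowledge base
scores = cosine_similarity(query_embedding, chunk_embeddings)[0]

# indices of the top k most similar items
k = 5
top_k_indices = np.argsort(scores)[::-1][:k]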

5) Knowledge Base QA (Advanced)

Over the past year, I’ve helped almost 100 businesses and individuals build AI projects. By far, the most common project people ask about is a document question-answering system. Building on the previous project, we can implement this in a straightforward way.

If we’ve already chunked and stored our documents in a DataFrame, we can convert the multimodal search tool into a multimodal RAG system.

Here are the steps:

  1. Perform a search over the knowledge base (like the one created in Project 4)
  2. Combine user query with top k search results and pass them to a multimodal model.
  3. Create a simple Gradio user interface for the QA system.

Libraries: PyMuPDF, transformers, pandas, sklearn, together/openai, Gradio

Note: Llama 3.2 Vision is free until 2025 via Together AI’s API

This project essentially combines projects 2 and 4. However, it includes the essential component of a user interface. For that, we can use a dashboarding tool like Gradio, which allows us to create a chat UI with a few lines of code.

Here’s an example snippet adapted from Gradio’s doc for doing this.

import gradio as gr
import time

def generate_response(message, history):
    """
        Your code for generating a response
    """
    response = f"You said: {message['text']}"  # placeholder; swap in your RAG pipeline here
    return response

demo = gr.ChatInterface(
    fn=generate_response, 
    examples=[{"text": "Hello", "files": []}], 
    title="Echo Bot", 
    multimodal=True)

demo.launch()

What’s next?

Thanks to tools like ChatGPT and Cursor, it’s never been easier to build AI projects fast. Things that used to block me for hours (if not days) a few years ago can now be resolved in minutes with advanced coding assistants.

My parting advice is to use these tools to learn faster and be bold in your project choices. For projects, find problems and time-box the implementation into a weekend.

Drop your questions in the comments 🙂


I’m hosting a (free) live workshop on how to build Project 1 – Resume Optimizer. 👉 Learn more

The post 5 AI Projects You Can Build This Weekend (with Python) appeared first on Towards Data Science.
