Exploring Embeddings and Clustering Techniques for Computer Vision

The Engineer

31 Oct 2023 · 3 min read

This article looks at how image embeddings support computer vision tasks such as clustering and dataset quality assessment, starting with a basic example that uses pixel brightness in the MNIST dataset.

Embeddings have been a game-changer in natural language processing (NLP), and they're now making waves in computer vision. This article delves into how embeddings can be used to cluster images, assess dataset quality, and identify duplicates. We’ll also walk through a practical example using the MNIST dataset to illustrate these concepts.

Clustering MNIST Images Using Pixel Brightness

Before diving into more complex examples with embeddings, let's start with a simpler case: clustering MNIST images based on pixel brightness. The MNIST dataset consists of 70,000 grayscale images of handwritten digits (60,000 for training and 10,000 for testing), each 28x28 pixels. Each image can therefore be represented by 784 values, one brightness value per pixel. Our goal is to reduce these 784 dimensions to three using dimensionality reduction techniques like t-SNE and UMAP, allowing us to visualize the clusters in 3D space.
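
As a rough illustration of the 784-value representation, the sketch below flattens a stack of 28x28 images into one row per image. The helper name `flatten_images`, the random stand-in images, and the scaling of brightness to the range [0, 1] are assumptions made for this example, not part of the original walkthrough.

```python
import numpy as np

def flatten_images(images: np.ndarray) -> np.ndarray:
    """Turn (n, 28, 28) uint8 images into (n, 784) float32 rows scaled to [0, 1]."""
    n = images.shape[0]
    # reshape keeps row-major order: pixel (row r, column c) lands at index r * 28 + c
    return images.reshape(n, -1).astype(np.float32) / 255.0

# Random pixels stand in for real MNIST digits so the snippet runs offline
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

features = flatten_images(fake_images)
print(features.shape)  # (5, 784)
```

Each row of `features` is one image, which is the 2D layout that t-SNE and UMAP implementations generally expect as input.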

Steps to Cluster MNIST Images

1. Load the Data:
   - Load the images of each class.
   - Reshape the data into a 2D NumPy array with 784 features per image.
2. Apply Dimensionality Reduction:
   - Use t-SNE or UMAP to reduce the dimensions from 784 to 3.
   - t-SNE (t-Distributed Stochastic Neighbor Embedding) is great for visualizing high-dimensional data but can be computationally expensive.
   - UMAP (Uniform Manifold Approximation and Projection) is typically faster than t-SNE and scales better to larger datasets.
3. Visualize the Clusters:
   - Plot the reduced dimensions in a 3D scatter plot to see how similar images cluster together.
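
The reduction and plotting steps above can be sketched as follows. This is a minimal example under stated assumptions: it uses scikit-learn's `TSNE` (UMAP could be swapped in), the helper names `reduce_to_3d` and `plot_3d` are hypothetical, and random values stand in for the real (n, 784) MNIST feature array so the snippet runs without downloading anything.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def reduce_to_3d(features: np.ndarray, random_state: int = 0) -> np.ndarray:
    """Project (n, d) features down to (n, 3) with t-SNE."""
    # perplexity must be smaller than the number of samples
    tsne = TSNE(n_components=3, perplexity=30, init="pca", random_state=random_state)
    return tsne.fit_transform(features)

def plot_3d(points: np.ndarray, labels: np.ndarray):
    """Draw a 3D scatter plot of the reduced points, colored by label."""
    fig = plt.figure(figsize=(7, 6))
    ax = fig.add_subplot(projection="3d")
    scatter = ax.scatter(points[:, 0], points[:, 1], points[:, 2],
                         c=labels, cmap="tab10", s=5)
    fig.colorbar(scatter, ax=ax, label="digit")
    return fig

# Stand-in data: replace with real flattened MNIST images and their digit labels
rng = np.random.default_rng(0)
features = rng.random((200, 784), dtype=np.float32)
labels = rng.integers(0, 10, size=200)

points_3d = reduce_to_3d(features)
print(points_3d.shape)  # (200, 3)
```

Calling `plot_3d(points_3d, labels)` followed by `plt.show()` displays the scatter plot. With random stand-in data no clusters will appear; with real digits, images of the same class would be expected to group together. t-SNE on the full dataset can take a long time, so starting with a subsample of a few thousand images is a reasonable first step.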

Visualizing High-Dimensional Data

Working with high-dimensional data can be challenging because it's hard to visualize and understand the underlying structure. Dimensionality reduction techniques like t-SNE and UMAP help simplify these complex datasets, making them more manageable and interpretable.

t-SNE: Focuses on preserving local structures, meaning points that are close in high-dimensional space will remain close in the reduced space.
UMAP: Balances the preservation of local and global structure, often providing a clearer visualization for larger datasets.
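
One way to check how well local structure survives the reduction is scikit-learn's `trustworthiness` score, which ranges from 0 to 1 and is higher when each point's nearest neighbors in the reduced space were also neighbors in the original space. The sketch below is an illustration under assumptions: it uses scikit-learn's bundled 8x8 digits dataset as an offline stand-in for MNIST, and the subset size and `n_neighbors=10` are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, trustworthiness

# Small bundled digits dataset (64 features per image) as a stand-in for MNIST
X = load_digits().data[:500]

# Reduce to 2D with t-SNE
embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(X)

# Fraction-like score: how well the 10 nearest neighbors are preserved
score = trustworthiness(X, embedding, n_neighbors=10)
print(f"t-SNE trustworthiness: {score:.3f}")
```

The same call works on any reduced array with one row per sample, so an embedding produced by UMAP could be scored in the same way for comparison.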

Clustering with CLIP Embeddings

Now, let's move to a more advanced example using OpenAI’s CLIP (Contrastive Language–Image Pretraining) embeddings. CLIP is a model trained on a large dataset of image-text pairs, making it capable of generating rich, semantic embeddings for images and text.

Steps to Cluster Images Using CLIP Embeddings

1. Generate Embeddings:
   - Use the CLIP model to generate embeddings for your images.
   - These embeddings capture more meaningful features compared to raw pixel values.
2. Apply Dimensionality Reduction:
   - Reduce the dimensionality of the CLIP embeddings using t-SNE or UMAP.
   - The reduced dimensions can be visualized in 2D or 3D plots.
3. Visualize and Analyze Clusters:
   - Plot the reduced dimensions to see how images with similar content cluster together.
   - Use these clusters to assess dataset quality, identify duplicates, and discover patterns.
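
Duplicate detection can also be done directly on the embeddings, before any dimensionality reduction. The sketch below is one possible approach, not the article's own method: it flags pairs of embeddings whose cosine similarity exceeds a threshold. The function name `find_near_duplicates` and the 0.98 threshold are assumptions and would need tuning on real data.

```python
import numpy as np

def find_near_duplicates(embeddings: np.ndarray, threshold: float = 0.98):
    """Return index pairs (i, j), i < j, whose cosine similarity is >= threshold."""
    # Normalize rows to unit length so the dot product equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T
    # Look only above the diagonal: each pair is reported once, self-matches are skipped
    i_idx, j_idx = np.triu_indices(len(embeddings), k=1)
    mask = similarity[i_idx, j_idx] >= threshold
    return list(zip(i_idx[mask].tolist(), j_idx[mask].tolist()))

# Toy 3-dimensional "embeddings"; real CLIP vectors have hundreds of dimensions
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.999, 0.01, 0.0],  # points in almost the same direction as row 0
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
print(find_near_duplicates(toy))  # [(0, 1)]
```

The full similarity matrix needs memory proportional to the square of the number of images, so for large datasets this would have to be computed in batches or replaced with an approximate nearest-neighbor search.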

Practical Example: MNIST with CLIP Embeddings

To illustrate this, we can use the same MNIST dataset but generate embeddings using CLIP instead of raw pixel values. Here’s a simplified version of how you might do this in a Google Colab notebook:

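import numpy as np
import torch
from PIL import Image
from torchvision.transforms import ToTensor
from sklearn.manifold import TSNE
import umap
import matplotlib.pyplot as plt

# Load the CLIP model and preprocessor on the GPU if one is available
import clip
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load MNIST images (preprocess expects PIL images)
mnist_images = ...  # Your code to load MNIST images

# Generate embeddings using CLIP
def generate_clip_embeddings(images):
    embeddings = []
    for image in images:
        # Resize and normalize the image, add a batch dimension, move it to the model's device
        image_tensor = preprocess(image).unsqueeze(0).to(device)
        with torch.no_grad():
            embedding = model.encode_image(image_tensor)
        # Drop the batch dimension and convert to a float32 NumPy vector on the CPU
        embeddings.append(embedding.squeeze(0).float().cpu().numpy())
    return np.array(embeddings)

clip_embeddings = generate_clip_embeddings(mnist_images)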

# Apply UMAP to reduce dimensions