
This article explores how image embeddings can revolutionize computer vision tasks like clustering and quality assessment, starting with a basic example using pixel brightness in the MNIST dataset.
Embeddings have been a game-changer in natural language processing (NLP), and they're now making waves in computer vision. This article delves into how embeddings can be used to cluster images, assess dataset quality, and identify duplicates. We’ll also walk through a practical example using the MNIST dataset to illustrate these concepts.
Before diving into more complex examples with embeddings, let's start with a simpler case: clustering MNIST images based on pixel brightness. The MNIST training set consists of 60,000 grayscale images of handwritten digits (a further 10,000 are held out for testing), each 28x28 pixels. Each image can therefore be represented by 784 values, one brightness value per pixel. Our goal is to reduce these 784 dimensions to three using dimensionality reduction techniques like t-SNE and UMAP, allowing us to visualize the clusters in 3D space.
Load the Data:
Apply Dimensionality Reduction:
Visualize the Clusters:
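The three steps above can be sketched roughly as follows. This is a minimal sketch, not the original notebook: it assumes scikit-learn and matplotlib are installed, loads MNIST through `fetch_openml("mnist_784")`, and runs t-SNE on a random subsample because t-SNE is slow on the full dataset. The function names (`load_mnist_pixels`, `reduce_to_3d`, `plot_3d`), the subsample size, and the perplexity are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.manifold import TSNE


def load_mnist_pixels(n_samples=5000, seed=0):
    """Download MNIST and return a random subsample as (pixels, labels).

    Pixels are rows of 784 brightness values scaled to [0, 1].
    The subsample size is an assumption chosen to keep t-SNE fast.
    """
    X, y = fetch_openml("mnist_784", version=1, as_frame=False, return_X_y=True)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_samples, replace=False)
    return X[idx].astype(np.float32) / 255.0, y[idx].astype(int)


def reduce_to_3d(features, perplexity=30.0, seed=0):
    """Project (n, d) feature rows to (n, 3) coordinates with t-SNE."""
    tsne = TSNE(n_components=3, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(features)


def plot_3d(points, labels):
    """Draw a 3D scatter plot of the reduced points, colored by digit label."""
    fig = plt.figure(figsize=(7, 7))
    ax = fig.add_subplot(projection="3d")
    scatter = ax.scatter(points[:, 0], points[:, 1], points[:, 2],
                         c=labels, cmap="tab10", s=4)
    fig.colorbar(scatter, ax=ax, label="digit")
    return fig
```

Calling `X, y = load_mnist_pixels()`, then `plot_3d(reduce_to_3d(X), y)` followed by `plt.show()` should produce the 3D scatter; the first call downloads the dataset, so it may take a while.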
Working with high-dimensional data can be challenging because it's hard to visualize and understand the underlying structure. Dimensionality reduction techniques like t-SNE and UMAP help simplify these complex datasets, making them more manageable and interpretable.
Now, let's move to a more advanced example using OpenAI’s CLIP (Contrastive Language–Image Pretraining) embeddings. CLIP is a model trained on a large dataset of image-text pairs, making it capable of generating rich, semantic embeddings for images and text.

Generate Embeddings:
Apply Dimensionality Reduction:
Visualize and Analyze Clusters:
To illustrate this, we can use the same MNIST dataset but generate embeddings using CLIP instead of raw pixel values. Here’s a simplified version of how you might do this in a Google Colab notebook:
import numpy as np
import torch
from PIL import Image
from torchvision.transforms import ToTensor
from sklearn.manifold import TSNE
import umap
import matplotlib.pyplot as plt
import clip

# Load the CLIP model and its preprocessing pipeline on the available device
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load MNIST images (preprocess expects PIL images)
mnist_images = ... # Your code to load MNIST images

# Generate embeddings using CLIP
def generate_clip_embeddings(images):
    embeddings = []
    for image in images:
        # Resize and normalize the image, add a batch dimension, move to the model's device
        image_tensor = preprocess(image).unsqueeze(0).to(device)
        with torch.no_grad():
            embedding = model.encode_image(image_tensor)
        # Move the result back to the CPU before converting it to NumPy
        embeddings.append(embedding.squeeze().float().cpu().numpy())
    return np.array(embeddings)

clip_embeddings = generate_clip_embeddings(mnist_images)
# Apply UMAP to reduce dimensions
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
31 October 2023