
Share
Today, I tackle building a SQLite C extension for vector similarity calculations, leveraging ChatGPT Code Interpreter to automate testing and deployment of the custom SQL function.
23rd March 2024
Yesterday, I explored using Claude and ChatGPT Code Interpreter for simple tasks like converting shapefiles to GeoJSON. Today, I’m diving into a more ambitious project: building a SQLite C extension to enhance vector similarity calculations.
Vector similarity is a common requirement in many applications, especially when dealing with large datasets and machine learning models. Storing encoded floating-point vectors in BLOB columns in a SQLite database table and using custom SQL functions to decode them and calculate cosine similarity is an efficient pattern. I’ve implemented this in Python before (check out this example from my LLM tool), but the overhead of making function calls from SQLite to Python can be significant.
My goal was to create a C extension for SQLite that performs these operations faster, leveraging the low-level efficiency of C. The catch? I wanted to build and test this extension entirely within ChatGPT Code Interpreter’s environment.
One of the quirks of working with ChatGPT Code Interpreter is that it sometimes denies capabilities it actually has. To get it into a cooperative state, I followed these steps:
Check Python and SQLite Versions:
import sqlite3
import sys
python_version = sys.version
sqlite_version = sqlite3.sqlite_version
python_version, sqlite_version
('3.11.8 (main, Mar 12 2024, 11:41:52) [GCC 12.2.0]', '3.40.1')
Verify GCC Availability:
gcc --version and tell me that version."import subprocess

result = subprocess.run(["gcc", "--version"], capture_output=True, text=True)
gcc_version = result.stdout.split('\n')[0]
gcc_version
```
'gcc (Debian 12.2.0-14) 12.2.0'
These steps are crucial because they ensure that ChatGPT Code Interpreter is in a state where it’s willing to compile C code.
With the environment set up, I could proceed with writing and compiling the C extension. Here’s a simplified version of what I did:
#include <sqlite3ext.h>
SQLITE_EXTENSION_INIT1
double cosine_similarity(const float *a, const float *b, int len) {
double dot_product = 0.0;
double norm_a = 0.0;
double norm_b = 0.0;
for (int i = 0; i < len; ++i) {
dot_product += a[i] * b[i];
norm_a += a[i] * a[i];
norm_b += b[i] * b[i];
}
return dot_product / (sqrt(norm_a) * sqrt(norm_b));
}
static void cosine_similarity_func(sqlite3_context *[context](/articles/mistrallite-a-fine-tuned-7b-model-for-enhanced-long-context-handling), int argc, sqlite3_value **argv) {
if (argc != 2) {
sqlite3_result_error(context, "Usage: COSINE_SIMILARITY(blob1, blob2)", -1);
return;
}
const unsigned char *blob1 = sqlite3_value_blob(argv[0]);
const unsigned char *blob2 = sqlite3_value_blob(argv[1]);
int len =
Tags
Original Sources
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
25 March 2024
88 articles
Related Articles
Related Articles
More Stories