Experimenting with Vector Search in MySQL

Vector search (or semantic search) can yield better results than traditional keyword matching, but not always. To get a better understanding of the concept and howto implement it in a useful way, I decided to get my hands dirty and try to implement it as simple as possible using mysql instead of a specialised vector database like Milvus, Pinecone or Qdrant (there are many more). Yes it might be slow and inefficient but it will be interesting.

TLDR; I made a module for Search API vector search in MySQL leveraging the Ai module in Drupal.

If you are familiar with Drupal and the Search API, then you probably know about the mysql database solution which is included if you don't want (or need) specialised search databases like Solr or ElasticSearch. In a similar manner I wanted to use a "local" vector database in MySQL.

Semantic and Vector Search
A quick recap on semantic search: first, the user's search input is transformed into a vector representation using a Large Language Model (LLM). This representation carries semantic meaning, but essentially, it's a multidimensional vector with hundreds or thousands of dimensions. Don't worry about visualizing such a high-dimensional vector; the key point is that more dimensions generally lead to a more accurate representation.

The vector search is the second step, where relevant data is identified in your database by finding other data with similar vector representations. Common metrics for determining similarity include cosine similarity (which measures similarity in direction) and Euclidean distance (which measures the distance between points in the vector space).

To facilitate this process, all searchable content in your database must also be stored as vectors, using the same embedding model as the one used for the input. This ensures consistency and allows each vector to be mapped to a specific piece of content, which can then be retrieved and displayed in the search results.

Vector Search using embeddings — Visual representation of how a vector search works

Drupal "Ai" Search
Drupal supports using Ai in Search API using plugins supporting various backend vector database servers. This was previously done using Search API Ai but that module has been included in the Ai module. Since the ai_search module already does a lot of the heavy lifting here (for indexing content using embeddings), I decided to create a new module which implements a AiVdbProvider plugin for MySQL.

Existing Ai search VDB provider modules are:

Following this naming pattern I decided to name mine:

ai_vdb_provider_mysql

Using it currently requires a patch for the ai module as noted on the project page, because I wanted to be able to have separate tables per index, in order to keep performance as best as possible. Currently the design of the plugin system does not allow for this kind of flexibility which might change in a future update of the Ai module.

The embedding part is easy and handled by any provider that is supported by the Ai module. The hard part is getting decent performance in MySQL calculating the similarity.

Luckily, after doing some research I stumbled upon this post on Reddit by someone who created a PHP implementation for fast vector search in MySQL. The library is on github and I currently am using a fork with some modifications in my module.

Quantization and Hamming Distance
The clever part about this library is how first finds a small subset of most relevant vector data and then only does the expensice cosine similarity calculation on that data. How exactly is this done?

Consider this vector:

[0.037807886, 0.054347273, -0.05825067, 0.028324638, 0.04211163, -0.03260336, -0.016389256, 0.01886641]

First, quantization. Quantization in the context of vector databases refers to a technique used to reduce the memory footprint and computational cost of storing and processing high-dimensional vectors. In this case, each vector component is represented by a bit (0 or 1).

So it becomes:

11011111001

The Hamming distance is a measure used to quantify the difference between two strings of equal length. It is defined as the number of positions at which the corresponding symbols (characters, bits, etc.) are different.

For example, if you have two binary strings:

String 1: 11011111001
String 2: 11110111001

The Hamming distance between these two strings is 2, because there are two positions where the corresponding bits are different (positions 3 and 5, if we start counting from 1).

This distance is calculated using a fast bitwise operation in a mysql query and results in a number of results that are similar.

The results are then passed to a stored mysql cosine similarity function which loops through the vectors in the JSON field (mysql 8 required).

Below is a video of a simple setup (using composite strategy to limit amount of vectors) and showing some basic searches using the Ai Vector DB Explorer (submodule of Ai).

Example using the Ai Vector DB Explorer to test the search

Mysql 9 Vector field?
But wait, doesn't MySQL 9 support vector type fields? Yes it does, and that might improve performance even more. Altough you shouldn't expect it to come close to any of the specialised vector database solutions but I haven't played around with mysql 9 yet.

Conclusion
This was a fun excercise to better understand vector search and different ways how it can be optimized using techniques like quantization and hamming distance using just PHP and MySQL. It might not be suitable for large datasets, but certainly viable for small projects and for learning purposes.
Scott Euser (ai module maintainer) also mentioned Typesense which seems like a promising open source vector database solution (and very fast), might be worth checking out next.

Experimenting with Vector Search in MySQL

About the author

How I switched from Gatsby to Astro (While Keeping Drupal in the Mix)

Using MCP to Manage Content with Ai Agents