This week is National Coding Week. The theme for 2023 (at least according to codingweek.org) is Artificial Intelligence. I heard about it in a LinkedIn post from a former colleague who was using the week as an opportunity to learn Go.
As it happens, Go is used quite a lot at my new employer, PingCAP, in the development of TiDB (a modern distributed SQL database), so I thought that making a few “experimental” changes to TiDB would be a good way to learn a new language and scratch the AI itch at the same time.
Deciding on a Project
For my project I wanted to accomplish the following:
- Learn some Go programming
- Learn a bit more about TiDB
- Do something AI related
After seeing a recent blog post by Daniël van Eeden about extending TiDB with custom functions, I decided to follow that example and add some custom functions of my own. I have been looking at some of the capabilities offered by vector databases, and thought it would be worthwhile to add some experimental functions that let users calculate how close two vectors are to each other.
These kinds of functions are useful for users who are storing vector embeddings and want to be able to calculate the distance between them. Measuring the distance between vectors can be used to calculate similarity between two vector embeddings persisted in a database, or it could also compare persisted vector embeddings with a vector embedding generated by a search query.
While there are definitely more state-of-the-art approaches to optimizing performance and scaling for these kinds of vector similarity searches, for this experimental effort I have stuck with the more straightforward cosine similarity and dot product functions.
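For reference, the dot product of two vectors is the sum of their element-wise products, and cosine similarity is that dot product divided by the product of the two vector lengths. A minimal standalone sketch of the dot product in plain Go follows. The name Dot and its signature are assumptions made for illustration and are not taken from the TiDB change itself.

```go
package main

import (
	"errors"
	"fmt"
)

// Dot returns the dot product of two equal-length, non-empty vectors.
// The name and signature are illustrative, not taken from the TiDB change.
func Dot(a, b []float64) (float64, error) {
	if len(a) != len(b) {
		return 0, errors.New("vectors must have the same length")
	}
	if len(a) == 0 {
		return 0, errors.New("vectors must not be empty")
	}
	sum := 0.0
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum, nil
}

func main() {
	d, err := Dot([]float64{1, 2, 3}, []float64{4, 5, 6})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(d) // 1*4 + 2*5 + 3*6 = 32
}
```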
Getting Started
To get started with the project I relied on the detailed write-up from Daniël van Eeden, as well as TiDB’s excellent developer guide.
I followed the getting started guide to set up my IDE (Visual Studio Code) on my Mac.
It was remarkably easy to get the TiDB database up and running by following the guide, so I won’t replicate that documentation here.
Learning by Doing
My previous programming experience was primarily in Java, so adjusting to some of the syntactic differences between the two languages was interesting. Looking at existing examples in the TiDB code base was very helpful in figuring things out, as was being able to dive into the Go documentation as needed.
The first thing I did was follow the example from the TiDB developer documentation, since these custom functions are compiled into TiDB. My initial changes were to functions.go, so that TiDB would recognize the function names in SQL statements. At the same time I updated builtin.go to reference the function implementations I was going to write (as new builtin functions, compiled into TiDB).
Next, I set about defining the functions themselves. This helped me to learn a bit about how functions and methods are defined in Go. I particularly liked the ability to define multiple return values (somewhat reminiscent of returning multiple values as tuples in Haskell) to help with handling error cases. This was an interesting change of pace from Java and its use of exceptions to handle some of these cases.
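As a small illustration of these two ideas, here is a minimal sketch of a method with a receiver that returns a value together with an error. The Vector type and its Norm method are hypothetical names invented for this example and do not appear in the TiDB change.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// Vector is a hypothetical type used only for this illustration.
type Vector []float64

// Norm is a method: the receiver (v Vector) comes before the method name.
// It returns the Euclidean length plus an error value instead of throwing.
func (v Vector) Norm() (float64, error) {
	if len(v) == 0 {
		return 0, errors.New("empty vector")
	}
	sum := 0.0
	for _, x := range v {
		sum += x * x
	}
	return math.Sqrt(sum), nil
}

func main() {
	// The caller checks the error explicitly, where Java would use try/catch.
	n, err := Vector{3, 4}.Norm()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(n) // sqrt(9 + 16) = 5

	// The error path: an empty vector has no meaningful length here.
	if _, err := (Vector{}).Norm(); err != nil {
		fmt.Println("error:", err)
	}
}
```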
I decided to factor out the cosine similarity calculation so that it operates directly on float64 slices. I originally did this because I was thinking of importing the functions from a library (which would likely have operated on float64 values rather than the internal TiDB types). After an initial investigation, though, the functions seemed easy enough to implement directly (for the purposes of this experimental project), so I just went ahead and did that:
// Cosine returns the cosine similarity of two equal-length, non-empty vectors.
func Cosine(a []float64, b []float64) (cosine float64, err error) {
	if len(a) != len(b) {
		return 0.0, errors.New("Invalid vectors: two arrays of the same length were expected")
	}
	if len(a) == 0 {
		return 0.0, errors.New("Invalid vectors: two non-zero length arrays were expected")
	}
	sum := 0.0 // dot product of a and b
	s1 := 0.0  // squared length of a
	s2 := 0.0  // squared length of b
	for i := range a {
		sum += a[i] * b[i]
		s1 += a[i] * a[i]
		s2 += b[i] * b[i]
	}
	return sum / (math.Sqrt(s1) * math.Sqrt(s2)), nil
}
I also needed to decide how to store the vector embeddings in the database. To make life easy on myself, I chose to forgo adding a custom type and to use the JSON data type that is already available in TiDB, so the functions operate on JSON arrays of numbers. To do this I used some of the capabilities exposed by the TiDB types package to convert from the JSON type to a float64 slice in Go:
// AsFloat64Array converts a TiDB binary JSON array of numbers to a []float64.
func AsFloat64Array(binJson types.BinaryJSON) (values []float64, err error) {
	if binJson.TypeCode != types.JSONTypeCodeArray {
		err = errors.New("Invalid JSON Array: an array of numbers were expected")
		return nil, err
	}
	arrCount := binJson.GetElemCount()
	values = make([]float64, arrCount)
	// Convert each element in turn, stopping at the first conversion error.
	for i := 0; i < arrCount && err == nil; i++ {
		elem := binJson.ArrayGetElem(i)
		// fakeSctx is defined elsewhere in the package (not shown here).
		values[i], err = types.ConvertJSONToFloat(fakeSctx, elem)
	}
	return values, err
}
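To experiment with the same idea outside of TiDB, a rough standard-library analogue can be sketched with encoding/json. This is an assumption-laden stand-in for illustration only: ParseVector is a hypothetical name, and it works on a plain JSON string rather than on TiDB's BinaryJSON type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ParseVector decodes a JSON array of numbers into a []float64.
// It is a standard-library analogue for illustration, not TiDB's BinaryJSON API.
func ParseVector(s string) ([]float64, error) {
	var values []float64
	if err := json.Unmarshal([]byte(s), &values); err != nil {
		return nil, fmt.Errorf("invalid JSON array of numbers: %w", err)
	}
	return values, nil
}

func main() {
	v, err := ParseVector("[1.0, 2.0, 3.0]")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(v), v[0], v[1], v[2])

	// Non-numeric elements are rejected by the decoder.
	if _, err := ParseVector(`["a", "b"]`); err != nil {
		fmt.Println("error:", err)
	}
}
```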
All the changes I made (including some basic test cases) are available on GitHub (see initial commit).
Phase 1 Complete
With these changes in place, I can easily calculate the cosine similarity and dot product of two JSON arrays in TiDB using SQL. For example, using the MySQL command line client (TiDB is wire compatible with MySQL), I can run a query like the following:
mysql> SELECT 1 AS r, x_cosine_sim('[1.0, 2.0, 3.0, 4.0 ,5.0]','[1.0, 2.0, 3.0, 4.0, 5.0]') AS result
    -> UNION
    -> SELECT 2 AS r, x_cosine_sim('[1.0, 2.0, 3.0, 4.0 ,5.0]','[-1.0, -2.0, -3.0, -4.0, -5.0]') AS result
    -> UNION
    -> SELECT 3 as r, x_cosine_sim('[1.0, 2.0, 3.0, 4.0 ,5.0]','[-1.0, 2.0, 3.0, 4.0, -5.0]') AS result
    -> ORDER BY r ASC;
+------+---------------------+
| r    | result              |
+------+---------------------+
|    1 |                   1 |
|    2 |                  -1 |
|    3 | 0.05454545454545454 |
+------+---------------------+
3 rows in set (0.00 sec)
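The third row can be checked by hand. With a = [1, 2, 3, 4, 5] and b = [-1, 2, 3, 4, -5], the arithmetic works out as follows (rows 1 and 2 are the cases where the vectors point in exactly the same and exactly opposite directions):

```latex
\begin{aligned}
a \cdot b &= (1)(-1) + (2)(2) + (3)(3) + (4)(4) + (5)(-5) = -1 + 4 + 9 + 16 - 25 = 3 \\
\lVert a \rVert^2 &= \lVert b \rVert^2 = 1 + 4 + 9 + 16 + 25 = 55 \\
\cos\theta &= \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
            = \frac{3}{\sqrt{55}\,\sqrt{55}} = \frac{3}{55} \approx 0.0545
\end{aligned}
```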
Reflections…
For some people (myself included), a small project is a useful way to learn a new programming language. While the code I have shared is not professional quality (it could definitely be improved and made more idiomatic), it served its purpose of getting me more familiar with the Go language and its tooling. Thanks to Chris and Daniël for the indirect inspiration for this project!