Wednesday 23 August 2023

Show HN: Fast vector similarity using Rust and Python https://bit.ly/3Pbhsla

I recently found myself computing the similarity between lots of very high-dimensional vectors (i.e., sentence embedding vectors from LLMs), and I wanted to try some more powerful measures of similarity/dependency than just cosine similarity, which seems to be the default for everything nowadays because of its computational efficiency. There are many more involved measures that can detect subtler relationships, but the problem is that some of them are quite slow to compute, especially if you're trying to do it in Python. For my favorite measure of statistical dependency, Hoeffding's D, that's true even if you use NumPy.

Since I recently learned Rust and wanted to learn how to make Python packages using Rust, I put together this new library, which I call Fast Vector Similarity. I was blown away by the performance of Rust and the quality of the tooling while making it. And even though it required a lot of fussing with GitHub Actions, I was also really impressed with just how easy it was to make a Python library using Rust that could be automatically compiled into wheels for every combination of platform (Linux, Windows, Mac) and Python version (3.8 through 3.11) and uploaded to PyPI, all triggered by a commit to the repo and handled by GitHub's servers, and all for free if you're working on a public repo!

The library can be installed with `pip install fast_vector_similarity`, and the readme includes some simple demo Python code showing how to use it. Aside from exposing very high-performance implementations of some very nice similarity measures, I also included the ability to get robust estimates of these measures using the bootstrap method. Basically, if you have two very high-dimensional vectors, instead of using the entire vectors to measure similarity, you take the same random subset of indices from both vectors and compute the similarity of just those elements. Then you repeat the process hundreds or thousands of times and look at the robust average (i.e., throw away the results outside the 25th to 75th percentile and average the remaining ones, to reduce the impact of outliers) and the standard deviation of the results. Obviously this is very demanding of performance, but it's still reasonable if you're not trying to compute it for too many vectors.

Everything is designed to fully saturate the performance of multi-core machines through extensive use of broadcasting/vectorization and parallel processing via the Rayon library. I was really impressed with how easy and low-overhead it is to write highly parallelized code in Rust, especially coming from Python, where you have to jump through a lot of hoops to use multiprocessing and there is a ton of overhead.

Anyway, please let me know what you think. I'm looking to add more measures of similarity if I can find ones that can be computed efficiently (I already gave up on including HSIC because I couldn't get it to go fast enough, even using BLAS/LAPACK).

https://bit.ly/3QY7Lrc August 23, 2023 at 01:34PM
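To make the bootstrap procedure from the post concrete, here is a minimal NumPy sketch of the idea. This is not the fast_vector_similarity API; the function name, parameters, and the use of plain cosine similarity in place of fancier measures like Hoeffding's D are all illustrative assumptions. It only shows the subsampling, repetition, and interquartile averaging the post describes.

import numpy as np

def robust_bootstrap_similarity(x, y, n_draws=1000, subset_frac=0.1, seed=0):
    """Hypothetical sketch (not the library's API) of the bootstrap estimate:
    sample the same random subset of indices from both vectors, compute a
    similarity on just those elements, repeat, then summarize with an
    interquartile ("robust") mean and a standard deviation. Cosine similarity
    stands in here for the more involved measures the library implements."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    k = max(2, int(subset_frac * n))  # size of each random subset of indices

    sims = np.empty(n_draws)
    for i in range(n_draws):
        idx = rng.choice(n, size=k, replace=False)  # same indices for both vectors
        xs, ys = x[idx], y[idx]
        sims[i] = xs @ ys / (np.linalg.norm(xs) * np.linalg.norm(ys) + 1e-12)

    # Robust average: keep only results between the 25th and 75th percentiles.
    lo, hi = np.percentile(sims, [25, 75])
    kept = sims[(sims >= lo) & (sims <= hi)]
    return kept.mean(), sims.std()

# Example with two random 3072-dimensional "embedding" vectors.
a, b = np.random.randn(3072), np.random.randn(3072)
print(robust_bootstrap_similarity(a, b))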
