No project description provided
Project description
vecy
Vector search in rust with filtering
TODO
- get it to work with brute force
- simd, brute force
- add clustring
- concurrent add clustring
- add benchmark
- add test
- add filtering
Linear algebra
The goal is to split the data in to a binary-tree structure where hyperplanes separate the data based upon two random points. Before we
The equation of a hyper plane according(normal form):
n * (x - x_0) = 0
x and x_0 are vectors in an N-dimensional space. The vector x represents any point on the hyperplane, and x_0 represents a specific known point on the hyperplane. n is a normal vector, which is perpendicular to the hyperplane. The direction of n gives the orientation of the hyperplane in space. The expression (x - x_0) is the vector from x_0 to x.
Compute the midpoint c of a and b:
c = (a + b) / 2
Compute the vector d pointing from a to b:
d = b - a
Randomly selecting two points in an N-dimensional space, a and b described as vectors, then the hyperplane that lies perfectly between these two points will pass through the midpoint of the line segment joining a and b, and will be normal to the vector that points from a to b. In order to find this hyperplane we will take the following steps:
The equation of the hyperplane that is perfectly between a and b is thus given by the following formula:
d * (x - c) = 0
where x represent any point on the hyper plane and c is he midpoint.
This equation describes a hyperplane that passes through the midpoint of a and b and is perpendicular to the direction from a to b, thus it lies perfectly in between these two points.
The vector d serves as the normal vector for the hyperplane, and the vector c represents a specific point on the hyperplane (the midpoint between a and b). Remember that "x" represents any point on the hyperplane.
The "*" represents the dot product operation in the equation above. The equation d * (x - c) = 0 indicates that the dot product of the normal vector d and the vector from c to any point x on the hyperplane is zero, which signifies that these two vectors are orthogonal or perpendicular to each other. This is the property that defines a hyperplane.
To find if a point is "above"(in the direction of the normal vector) or "below". We can check using:
d * (x - c) > 0
where x is the point of interest.
Fearless concurrency
Modern computers are made faster due to more cores and thus to make things faster the implementation is threaded. Currently there is one thread per tree, a better solution to not create to many threads and lose performance due to CPU scheduling would potentially be to have a worker pool. This is left for a later stage. The trees are Box:ed due to that we dont know the length at compile time and thus is needs to end up on the heap to allow it to freely grow. Since they are boxed, which result in that we only have a fixed size pointer on the stack which are note allowed to be copied we utilize to use a reference counter to keep track in order to use the trees in a threading set up. THis is a good tutorial on it.
Filtering support
By adding metadata to each node, we can allow for more efficient filtering, metadata would also allow to indicate when we need to read from multiple nodes since the top_k
is larger than the values in the sub nodes.
Also suggest having very large leafs, and then do linear search over each leaf. This could prove fast with super scalar cpus since. I think this is they way we will go. This will also result in that we have large enough clusters so that we can filter efficiently.
Since we use concurrency this should be goooood.
Python virtual env
source .env/bin/activate
Performance
Performance is primarly evaluated through the python code using the py-spy.
FFI for python
In order to make the package available from python the py03 package is used. The rust code will be exposed through the Trees
struct:
#[pyclass]
pub struct Trees{
trees: Vec<Tree>,
all_data: Vec<Vec<f32>>
}
in order to expose it to python through py03 #[pyclass]
is added, the new function is also exposed.
#[pymethods]
impl Trees{
#[new]
pub fn new ...
}
finally the python class(rust struct) has to be added to the module
#[pymodule]
fn vecy(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
m.add_class::<Trees>()?;
Ok(())
}
and that is all that is needed to expose the rust code to python.
It can now be imported as(after it has been built using maturin develop
):
import vecy
io_ring
Andy pavlo talked about this, it might be cooool to check it out further!
Arrow? Should it be backed by arrow instead???
Scoped threads
Blog post about this as well! Will be awesome to do! Will make a small one :)
https://itsallaboutthebit.com/arc-mutex/
Parallize the index job as well!
- Send all the work to a work pool
- Need intra index parallism to make it run fast enough.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distributions
Hashes for vecy-0.1.0-cp311-cp311-macosx_10_7_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 5ffc5d56773564d25a89ecbaca482d299f36bf6b65be7aaac8b8d0865f3740c6 |
|
MD5 | b46492f3e7812f99bb23deac66655e21 |
|
BLAKE2b-256 | a96733ef3d91055725bd675dc6275fd99ccddd3f7472a590cbbca4a0f51f970e |
Hashes for vecy-0.1.0-cp39-none-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 116faae47ff567e719b663fbed5d68f05b180d2f0a320a1c7f52bcd187f18506 |
|
MD5 | 62ecbbdea87ecd859242d84fb5e2224b |
|
BLAKE2b-256 | eb36f6a1821f5a99c9f782ec5eb2517a27d338bb0be59a27ea1da8caa6d67c8e |
Hashes for vecy-0.1.0-cp39-cp39-macosx_11_0_arm64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7f6e5a639fc064c4f32ed1c0ecb484c68a10ec0133645eb6de5ebb94d0e5bd94 |
|
MD5 | e6c243239e15e904ab7cf7a816b0746d |
|
BLAKE2b-256 | d295dbd5410382244a737872f468c5107f251f71fa6f301c796454347dcb7332 |
Hashes for vecy-0.1.0-cp39-cp39-macosx_10_7_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | e85b171ce98a206d667cd2903338fdca23c9a22f8ee1808284bae983ceac23f4 |
|
MD5 | 3229cd386aac4833763253ea1f4767ec |
|
BLAKE2b-256 | ce6d0286532f99002b493c2c7ccf7e81a98c6a4462eef8d6b91231cb0eda87eb |