Skip to main content

No project description provided

Project description

vecy

Vector search in rust with filtering

TODO

  • get it to work with brute force
  • simd, brute force
  • add clustring
  • concurrent add clustring
  • add benchmark
  • add test
  • add filtering

Linear algebra

The goal is to split the data in to a binary-tree structure where hyperplanes separate the data based upon two random points. Before we

The equation of a hyper plane according(normal form):

n * (x - x_0) = 0 

x and x_0 are vectors in an N-dimensional space. The vector x represents any point on the hyperplane, and x_0 represents a specific known point on the hyperplane. n is a normal vector, which is perpendicular to the hyperplane. The direction of n gives the orientation of the hyperplane in space. The expression (x - x_0) is the vector from x_0 to x.

Compute the midpoint c of a and b:

c = (a + b) / 2

Compute the vector d pointing from a to b:

d = b - a

Randomly selecting two points in an N-dimensional space, a and b described as vectors, then the hyperplane that lies perfectly between these two points will pass through the midpoint of the line segment joining a and b, and will be normal to the vector that points from a to b. In order to find this hyperplane we will take the following steps:

The equation of the hyperplane that is perfectly between a and b is thus given by the following formula:

d * (x - c) = 0

where x represent any point on the hyper plane and c is he midpoint.

This equation describes a hyperplane that passes through the midpoint of a and b and is perpendicular to the direction from a to b, thus it lies perfectly in between these two points.

The vector d serves as the normal vector for the hyperplane, and the vector c represents a specific point on the hyperplane (the midpoint between a and b). Remember that "x" represents any point on the hyperplane.

The "*" represents the dot product operation in the equation above. The equation d * (x - c) = 0 indicates that the dot product of the normal vector d and the vector from c to any point x on the hyperplane is zero, which signifies that these two vectors are orthogonal or perpendicular to each other. This is the property that defines a hyperplane.

To find if a point is "above"(in the direction of the normal vector) or "below". We can check using:

d * (x - c) > 0

where x is the point of interest.

Fearless concurrency

Modern computers are made faster due to more cores and thus to make things faster the implementation is threaded. Currently there is one thread per tree, a better solution to not create to many threads and lose performance due to CPU scheduling would potentially be to have a worker pool. This is left for a later stage. The trees are Box:ed due to that we dont know the length at compile time and thus is needs to end up on the heap to allow it to freely grow. Since they are boxed, which result in that we only have a fixed size pointer on the stack which are note allowed to be copied we utilize to use a reference counter to keep track in order to use the trees in a threading set up. THis is a good tutorial on it.

Filtering support

By adding metadata to each node, we can allow for more efficient filtering, metadata would also allow to indicate when we need to read from multiple nodes since the top_k is larger than the values in the sub nodes.

Also suggest having very large leafs, and then do linear search over each leaf. This could prove fast with super scalar cpus since. I think this is they way we will go. This will also result in that we have large enough clusters so that we can filter efficiently.

Since we use concurrency this should be goooood.

Python virtual env

source .env/bin/activate

Performance

Performance is primarly evaluated through the python code using the py-spy.

FFI for python

In order to make the package available from python the py03 package is used. The rust code will be exposed through the Trees struct:

#[pyclass]
pub struct Trees{
    trees: Vec<Tree>,
    all_data: Vec<Vec<f32>>
}

in order to expose it to python through py03 #[pyclass] is added, the new function is also exposed.

#[pymethods]
impl Trees{

    #[new]
    pub fn new ...
}

finally the python class(rust struct) has to be added to the module

#[pymodule]
fn vecy(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_class::<Trees>()?;
    Ok(())
}

and that is all that is needed to expose the rust code to python.

It can now be imported as(after it has been built using maturin develop):

import vecy

io_ring

Andy pavlo talked about this, it might be cooool to check it out further!

https://tylerneely.com/

Arrow? Should it be backed by arrow instead???

Performance analyzer rust

Scoped threads

Blog post about this as well! Will be awesome to do! Will make a small one :)

https://itsallaboutthebit.com/arc-mutex/

Parallize the index job as well!

  • Send all the work to a work pool
  • Need intra index parallism to make it run fast enough.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

vecy-0.1.0-cp311-cp311-macosx_10_7_x86_64.whl (259.2 kB view details)

Uploaded CPython 3.11 macOS 10.7+ x86-64

vecy-0.1.0-cp39-none-win_amd64.whl (192.9 kB view details)

Uploaded CPython 3.9 Windows x86-64

vecy-0.1.0-cp39-cp39-macosx_11_0_arm64.whl (244.3 kB view details)

Uploaded CPython 3.9 macOS 11.0+ ARM64

vecy-0.1.0-cp39-cp39-macosx_10_7_x86_64.whl (268.1 kB view details)

Uploaded CPython 3.9 macOS 10.7+ x86-64

File details

Details for the file vecy-0.1.0-cp311-cp311-macosx_10_7_x86_64.whl.

File metadata

File hashes

Hashes for vecy-0.1.0-cp311-cp311-macosx_10_7_x86_64.whl
Algorithm Hash digest
SHA256 5ffc5d56773564d25a89ecbaca482d299f36bf6b65be7aaac8b8d0865f3740c6
MD5 b46492f3e7812f99bb23deac66655e21
BLAKE2b-256 a96733ef3d91055725bd675dc6275fd99ccddd3f7472a590cbbca4a0f51f970e

See more details on using hashes here.

Provenance

File details

Details for the file vecy-0.1.0-cp39-none-win_amd64.whl.

File metadata

  • Download URL: vecy-0.1.0-cp39-none-win_amd64.whl
  • Upload date:
  • Size: 192.9 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.1.0

File hashes

Hashes for vecy-0.1.0-cp39-none-win_amd64.whl
Algorithm Hash digest
SHA256 116faae47ff567e719b663fbed5d68f05b180d2f0a320a1c7f52bcd187f18506
MD5 62ecbbdea87ecd859242d84fb5e2224b
BLAKE2b-256 eb36f6a1821f5a99c9f782ec5eb2517a27d338bb0be59a27ea1da8caa6d67c8e

See more details on using hashes here.

Provenance

File details

Details for the file vecy-0.1.0-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for vecy-0.1.0-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 7f6e5a639fc064c4f32ed1c0ecb484c68a10ec0133645eb6de5ebb94d0e5bd94
MD5 e6c243239e15e904ab7cf7a816b0746d
BLAKE2b-256 d295dbd5410382244a737872f468c5107f251f71fa6f301c796454347dcb7332

See more details on using hashes here.

Provenance

File details

Details for the file vecy-0.1.0-cp39-cp39-macosx_10_7_x86_64.whl.

File metadata

File hashes

Hashes for vecy-0.1.0-cp39-cp39-macosx_10_7_x86_64.whl
Algorithm Hash digest
SHA256 e85b171ce98a206d667cd2903338fdca23c9a22f8ee1808284bae983ceac23f4
MD5 3229cd386aac4833763253ea1f4767ec
BLAKE2b-256 ce6d0286532f99002b493c2c7ccf7e81a98c6a4462eef8d6b91231cb0eda87eb

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page