Skip to main content

No project description provided

Project description

VectorFlow

Open source, high-throughput, fault-tolerant vector embedding pipeline

Simple API endpoint that ingests large volumes of raw data, processes, and stores or returns the vectors quickly and reliably

Join our Discord | Website | Get in touch | Docs | Free Hosted Version

IMAGE ALT TEXT HERE

Introduction

VectorFlow is an open source, high throughput, fault tolerant vector embedding pipeline. With a simple API request, you can send raw data that will be chunked, embedded and stored in any vector database or returned back to you. VectorFlow is multi-modal and can ingest both textual and image data.

This current version is an MVP. We recommend using it with Kubernetes in production. Right now the system only supports uploading single files at a time, up to 25 MB. For text-based files, it supports TXT, PDF and DOCX. For image files, it support JPG, JPEG, and PNG.

Run it Locally

With two simple commands you can set up VectorFlow locally:

pip install vectorflow-ai
vectorflow

To start embedding documents, add your embedding and database keys to the env_scrips/env_vars.sh script that was generated and set the filepath variable in clients/standard_upload_client.py to point to the file you want to embed. Then run:

source env_scrips/env_vars.sh
python clients/standard_upload_client.py

Below is a more detailed description of how to manually set up and configure the system. Please note that the pip command will not create a development environment on your machine, it only sets up and runs the docker-compose.

Docker-Compose

The best way to run VectorFlow is via docker compose. If you are running this on Mac, please grant Docker permissions to read from your Documents folder as instructed here. If this fails, remove the volume section from the docker-compose.yml.

1) Set Environment Variables

First create a folder in the root for all the environment variables:

mkdir env_scripts
cd env_scripts
touch env_vars.env

This creates a file called env_vars.env in the env_scripts folder to add all the environment variables mentioned below. You only need to set the LOCAL_VECTOR_DB variable if you are running qdrant, Milvus or Weaviate locally.

INTERNAL_API_KEY=your-choice
POSTGRES_USERNAME=postgres
POSTGRES_PASSWORD=your-choice
POSTGRES_DB=vectorflow
POSTGRES_HOST=postgres
RABBITMQ_USERNAME=guest
RABBITMQ_PASSWORD=guest
RABBITMQ_HOST=rabbitmq
EMBEDDING_QUEUE=embeddings
VDB_UPLOAD_QUEUE=vdb-upload
LOCAL_VECTOR_DB=qdrant | milvus | weaviate

You can choose a variable for INTERNAL_API_KEY, POSTGRES_PASSWORD, and POSTGRES_DB, but they must be set.

2) Run Docker-Compose

Make sure you pull Rabbit MQ and Postgres into your local docker repo. We also recommend running a vector DB in locally, so make sure to pull the image of the one you are using. Our docker-compose file will spin up qdrant by default and create two index/collections. If you plan to run Milvus or Weaviate, you will have to configure them on your own.

docker pull rabbitmq
docker pull postgres
docker pull qdrant/qdrant | docker pull milvusdb/milvus | docker pull semitechnologies/weaviate

Then run:

docker-compose build --no-cache
docker-compose up -d

Note that the db-init container is running a script that sets up the database schema and will stop after the script completes.

3) (optional) Configure Sentence Transformer open face models.

VectorFlow can run any Sentence Transformer model but the docker-compose file will not spin it up automatically. Either run app.py --model_name your-sentence-transformer-model, or build and run the docker image in src/hugging_face with:

docker build --file hugging_face/Dockerfile -t vectorflow_hf:latest .
docker run --network=vectorflow --name=vectorflow_hf -d --env-file=/path/to/.env vectorflow_hf:latest --model_name "your_model_name_here"

Note that the Sentence Transformer models can be large and take several minutes to download from Hugging Face. VectorFlow does not provision hardware, so you must ensure your hardware has enough RAM/VRAM for the model. By default, VectorFlow will run models on GPU with CUDA if available.

Using VectorFlow

To use VectorFlow in a live system, make an HTTP request to your API's URL at port 8000 - for example, localhost:8000 from your development machine, or vectorflow_api:8000 from within another docker container.

Request & Response Payload

All requests require an HTTP Header with Authorization key which is the same as your INTERNAL_API_KEY env var that you defined before (see above). You must pass your vector database api key with the HTTP Header X-VectorDB-Key if you are running a connecting to a cloud-based instance of a vector DB, and the embedding api key with X-EmbeddingAPI-Key if you are using OpenAI. HuggingFace Sentence Transformer embeddings do not require an api key, but you must follow the above steps to run the container with the model you need.

VectorFlow currently supports Pinecone, Qdrant, Weaviate, and Milvus vector databases.

To check the status of a job, make a GET request to this endpoint: /jobs/<int:job_id>/status. The response will be in the form:

{
    'JobStatus': job_status.value
}

To submit a job for embedding, make a POST request to the /embed endpoint with a file attached, the 'Content-Type: multipart/form-data' header and the following payload:

{
    'SourceData=path_to_txt_file'
    'LinesPerBatch=4096'
    'EmbeddingsMetadata={
        "embeddings_type": "OPEN_AI | HUGGING_FACE | IMAGE",
        "chunk_size": 512,
        "chunk_overlap": 128,
        "chunk_strategy": "EXACT | PARAGRAPH | SENTENCE",
        "hugging_face_model_name": "sentence-transformer-model-name-here"
    }'
    'VectorDBMetadata={
        "vector_db_type": "PINECONE | QDRANT | WEAVIATE | MILVUS | REDIS",
        "index_name": "index_name",
        "environment": "env_name"
    }'
    'DocumentID=your-optional-internal-tracking-id'
}

You will get the following payload back:

{
    message': f"Successfully added {batch_count} batches to the queue",
    'JobID': job_id
}

VectorFlow API Client

The easiest way to use VectorFlow is with the our clients, located in clients/ directory. There are several scripts, with different configurations for qickly uploading data. We recommmend starting with the clients/standard_upload_client.py - Running this script will submit a document to VectorFlow for embedding with Open AI ADA and upload to the local qdrant instance. You can change the values to match your configuration.

Note that the TESTING_ENV variable is the equivalent of the enviroment field in the VectorDBMetadata, which corresponds to an environment in Pincone, a class in Weaviate, a collection in qdrant, etc.

Vector Database Schema Standard

VectorFlow enforces a standardized schema for uploading data to a vector store:

id: string
source_data: string
source_document: string
embeddings: float array

The id can be used for deduplication and idempotency. Please note for Weaviate, the id is called vectorflow_id. We plan to support dynamically detected and/or configurable schemas down the road.

Raw Embeddings Webhook

If you wish to use VectorFlow only for chunking and generating embeddings, pass a WebhookURL parameter in the body of the /embed request and a X-Webhook-Key as a header. VectorFlow assumes a webhook key is required for writing back to any endpoint. The embeddings are sent back along with the source chunks in tuple or lis of (source_chunk, embedding). This is sent as json with the following form:

{
    'Embeddings': list[tuple(str, list[float])],
    'DocumentID': str,
    'JobID': int
}

Chunk Validation Webhook

If you wish to validate which chunks you wish to embed, pass a ChunkValidationURL parameter in the body of the /embed request. This will send the request with the following json payload, {"chunks": chunked_data}, where chunked_data is a list of strings. It will expected back json containing key valid_chunks with a list of valid chunks for embedding. This endpoint will timeout after 30 seconds by default but can be configured in the application code.

S3 Endpoint

VectorFlow is integrated with AWS s3. You can pass a pre-signed s3 URL in the body of the HTTP instead of a file. Use the form field PreSignedURL and hit the endpoint /s3. This endpoint has the same configuration and restrictions as the /embed endpoint.

Image Embedding & Similarity Search

VectorFlow can embed images using the Image2Vec library. It will create a 512 dimensional vector by default from any image. You can also perform a Top-K image similarity search.

How to Use Image Upload

Send a POST request to the /images endpoint to utilize this capability. You pass the same set of fields in VectorDBMetadata as you would for the /embed or /s3 endpoints but for EmbeddingsMetadata you only need to pass "embeddings_type": "IMAGE"

The docker-compose file will not spin this capability up automatically. Either run app.py --model_name your-sentence-transformer-model, or build and run the docker image in src/images with:

docker build --file images/Dockerfile -t vectorflow_image_embedding:latest .
docker run --network=vectorflow --name=vectorflow_image_embedding -d --env-file=/path/to/.env vectorflow_image_embedding:latest

Note that the Image2Vec uses the ResNet 18 by default. You can configure it to run larger models with higher dimensionality for more fine-grained embeddings. These take several minutes to download the weights from the source. VectorFlow does not provision hardware, so you must ensure your hardware has enough RAM/VRAM for the model. By default, VectorFlow will run models on GPU with CUDA if available.

How to Use Image Search

Performing an image similarity search requires running another service. You can build and run it using docker:

docker build --file images/Dockerfile.search -t vectorflow_image_search:latest .
docker run --network=vectorflow --name=vectorflow_image_search -d --env-file=/path/to/.env vectorflow_image_search:latest

Request

To perform a search, send a POST request to /images/search endpoint with an image file attached, the 'Content-Type: multipart/form-data' header and the following body:

{
    'ReturnVectors': boolean,
    'TopK': integer, less than 1000,
    'VectorDBMetadata={
        "vector_db_type": "PINECONE | QDRANT | WEAVIATE | MILVUS | REDIS",
        "index_name": "index_name",
        "environment": "env_name"
    }'
}

All requests require an HTTP Header with Authorization key which is the same as your INTERNAL_API_KEY env var that you defined before (see above). You must pass your vector database api key with the HTTP Header X-VectorDB-Key if you are running a connecting to a cloud-based instance of a vector DB.

Response

An image similarity search will return a response object containing the top K matches, plus the raw vectors if requested, with of the following form:

{
    "similar_images": list of match objects
    "vectors": list of list of floats
}

where `match`` objects are defined as:

{
    "id": str,
    "score": float,
    "metadata": {"source_document" : str}
}

Contributing

We love feedback from the community. If you have an idea of how to make this project better, we encourage you to open an issue or join our Discord. Please tag dgarnitz and danmeier2.

Our roadmap is outlined in the section below and we would love help in building it out. Our open issues are a great place to start and can be viewed here. If you want to work on something not listed there, we recommend you open an issue with a proposed approach in mind before submitting a PR.

Please tag dgarnitz on all PRs and update the README to reflect your changes.

Testing

When submitting a PR, please add units tests to cover the functionality you have added. Please re-run existing tests to ensure there are no regressive bugs. Run from the src directory. To run an individual test use:

python -m unittest module.tests.test_file.TestClass.test_method

To run all the tests in the file use:

python -m unittest module.tests.test_file

For end-to-end testing, it is recommend to build and run using the docker-compose, but take down the container you are altering and run it locally on your development machine. This will avoid the need to constantly rebuild the images and re-run the containers. Make sure to change the environment variables in your development machine terminal to the correct values (i.e. localhost instead of rabbitmq or postgres) so that the docker containers can communicate with your development machine. Once it works locally you can perform a final test with everything in docker-compose.

Verifying

Please verify that all changes work with docker-compose before opening a PR.

We also recommend you add verification evidence, such as screenshots, that show that your code works in an end to end flow.

Roadmap

  • Support for multi-file, directory data ingestion from sources such as Salesforce, Google Drive, etc
  • Retry mechanism
  • Langchain & Llama Index integrations
  • Support callbacks for writing object metadata to a separate store
  • Dynamically configurable vector DB schemas
  • Deduplication capabilities
  • Vector version control
  • "Smart" chunker
  • "Smart" metadata extractor

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vectorflow-ai-0.1.1.tar.gz (11.9 kB view details)

Uploaded Source

Built Distribution

vectorflow_ai-0.1.1-py3-none-any.whl (11.9 kB view details)

Uploaded Python 3

File details

Details for the file vectorflow-ai-0.1.1.tar.gz.

File metadata

  • Download URL: vectorflow-ai-0.1.1.tar.gz
  • Upload date:
  • Size: 11.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.9

File hashes

Hashes for vectorflow-ai-0.1.1.tar.gz
Algorithm Hash digest
SHA256 90f04f605971ccf366fe1378e9d0fd492c94fb106e51aafac2f5cac58f4a4e4f
MD5 771be8042a1fad6e51a6e83034198d6a
BLAKE2b-256 a0ffe1db56c4dc394d990bcdd7b5d04d79f06803d852880df30656bc370ab259

See more details on using hashes here.

File details

Details for the file vectorflow_ai-0.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for vectorflow_ai-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 056f234a890f1f6c4391b75f487014aa67b3b591a496984bea96cdbf4fdd2a81
MD5 0f088d87d882c695d04ee93e6c1063b4
BLAKE2b-256 e10daed242aa6ca0e104591cc6bfd4d333a11c682b0d50780e612ec4fed9630f

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page