
LakeFlow Backend

FastAPI backend and data pipelines for LakeFlow: ingest, staging, processing, embedding, and semantic search.


Overview

  • API: FastAPI app (lakeflow.main:app) — auth, search, embed, pipeline trigger, Qdrant proxy, system.
  • Data Lake: Layered zones under LAKEFLOW_DATA_BASE_PATH: 000_inbox → 100_raw → 200_staging → 300_processed → 400_embeddings → 500_catalog.
  • Vector store: Qdrant (default collection lakeflow_chunks). Embeddings via sentence-transformers (e.g. all-MiniLM-L6-v2).
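
For orientation, here is a minimal sketch of the embedding side, assuming the default all-MiniLM-L6-v2 model named above (the actual model is configurable):

from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional vectors
model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("What is a data lake?")
print(vector.shape)  # (384,)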

Requirements

  • Python ≥ 3.10
  • Qdrant (e.g. Docker: docker compose up -d qdrant)
  • See requirements.txt for Python dependencies

Install & run

With Docker (from the LakeFlow repo root, where docker-compose.yml is located):

docker compose up --build
# API: http://localhost:8011
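
Once the containers are up, a quick smoke test (a minimal sketch using requests; any HTTP client works):

import requests

# Swagger UI is served at /docs, so a 200 here means the API is up
response = requests.get("http://localhost:8011/docs", timeout=5)
print(response.status_code)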

Local dev (from the repo root, change into lakeflow):

cd lakeflow
python3 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
# Create/copy .env (repo root or lakeflow) with LAKEFLOW_DATA_BASE_PATH, QDRANT_HOST, etc.
python -m uvicorn lakeflow.main:app --reload --port 8011
  • If you get bad interpreter (the venv points to the wrong Python): remove .venv, run python3 -m venv .venv again, then rerun pip install -r requirements.txt and pip install -e .

  • If you get Address already in use (port 8011 is taken): free the port, then restart the server: lsof -ti :8011 | xargs kill -9

  • Swagger: http://localhost:8011/docs

  • ReDoc: http://localhost:8011/redoc

  • Embed API: docs/API_EMBED.md (POST /search/embed); a quick example follows below.
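
As a quick test of the embed endpoint, a sketch using requests (this assumes the endpoint is reachable without auth; add an Authorization header if your deployment requires one):

import requests

# POST /search/embed takes {"text": "..."} and returns vector, embedding, dim
response = requests.post(
    "http://localhost:8011/search/embed",
    json={"text": "hello data lake"},
    timeout=30,
)
response.raise_for_status()
print(response.json()["dim"])  # e.g. 384 for all-MiniLM-L6-v2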


Pipeline steps (CLI)

Run from the lakeflow directory (with venv activated and LAKEFLOW_DATA_BASE_PATH set in .env or environment).

Step              Command                                             Output
0 – Inbox → Raw   python -m lakeflow.scripts.step0_inbox              Hash, dedup, catalog
1 – Staging       python -m lakeflow.scripts.step1_raw                pdf_profile.json, validation.json
2 – Processed     python -m lakeflow.scripts.step2_staging            clean_text.txt, chunks.json, tables.json
3 – Embeddings    python -m lakeflow.scripts.step3_processed_files    embeddings.npy, chunks_meta.json
4 – Qdrant        python -m lakeflow.scripts.step3_processed_qdrant   Points in Qdrant

Or use the Streamlit UI (Pipeline Runner) when LAKEFLOW_MODE=DEV.
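
To run the whole chain unattended, a minimal sketch that shells out to the step modules from the table above (assuming the venv is active and LAKEFLOW_DATA_BASE_PATH is set):

import subprocess
import sys

STEPS = [
    "lakeflow.scripts.step0_inbox",
    "lakeflow.scripts.step1_raw",
    "lakeflow.scripts.step2_staging",
    "lakeflow.scripts.step3_processed_files",
    "lakeflow.scripts.step3_processed_qdrant",
]

for module in STEPS:
    # check=True aborts the run as soon as one step fails
    subprocess.run([sys.executable, "-m", module], check=True)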


Main APIs

  • POST /auth/login – Demo login (e.g. admin / admin123), returns a JWT; see the example after this list.
  • POST /search/embed – Body {"text": "..."} → returns vector, embedding, dim.
  • POST /search/semantic – Body {"query": "...", "top_k": 5, "qdrant_url": "...", "collection_name": "..."}.
  • POST /search/qa – RAG-style Q&A (semantic search + LLM). Optional.
  • POST /pipeline/run – Run a pipeline step (auth required).
  • GET/POST /qdrant/ – Qdrant collections and points (proxy).
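
Putting two of these together, a sketch of login followed by semantic search (the login payload and token field names are assumptions; check /docs for the exact schema):

import requests

BASE = "http://localhost:8011"

# Demo login; the field names ("username"/"password", "access_token")
# are assumptions, verify them against /docs
login = requests.post(
    BASE + "/auth/login",
    json={"username": "admin", "password": "admin123"},
)
login.raise_for_status()
token = login.json()["access_token"]

# Semantic search with the documented body shape
result = requests.post(
    BASE + "/search/semantic",
    json={"query": "what is the staging zone?", "top_k": 5},
    headers={"Authorization": f"Bearer {token}"},
)
result.raise_for_status()
print(result.json())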

Design notes

  • Idempotent pipelines; deterministic UUIDs for Qdrant (see the sketch after this list).
  • SQLite without WAL (NAS-friendly).
  • No full-file load for large files; streaming where applicable.
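
On the deterministic-UUID point, one common approach is uuid5 over a stable key (the exact key material LakeFlow derives IDs from is an assumption here):

import uuid

def chunk_point_id(doc_hash: str, chunk_index: int) -> str:
    # uuid5 is deterministic: the same (doc_hash, chunk_index) always
    # yields the same UUID, so re-running the pipeline upserts the same
    # Qdrant points instead of duplicating them, keeping the step idempotent.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_hash}:{chunk_index}"))

print(chunk_point_id("abc123", 0))  # stable across runs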

License

Same as the root repository.
