
A MinerU tool for enhancing your RAG workflow.


MinerU Flow

English | 简体中文

MinerU Flow is a document processing tool built around MinerU's document understanding capabilities. It helps you:

  • Manage MinerU parsing configurations (SaaS or self-hosted deployments).
  • Ingest documents from local directories, HTTP, or S3-compatible object storage.
  • Run multi-phase jobs — parsing → chunking → knowledge base import — with retries and status monitoring.
  • Inspect job progress, system information, and artifacts in a visual dashboard.

Installation

pip install mineru-flow
mineru-flow

Using conda:

conda create -n mineru-flow python=3.11
conda activate mineru-flow
pip install mineru-flow
mineru-flow

Local Development

The backend REST APIs are exposed under /api/v1, metadata is stored in SQLite by default, and job artifacts live under the user data directory (for example ~/Library/Application Support/mineru-flow on macOS when no data directory is specified). The mineru-flow CLI launches both the HTTP service and the worker system in a single process.

  • Project structure
    • mineru_flow/: FastAPI app, business logic, storage adapters, worker management.
    • frontend/: Vite + React single-page app (TanStack Router, Radix UI, Tailwind CSS).
    • tests/: Backend Pytest suites.
    • mineru_flow/internal/processor/: Phase implementations for parsing, chunking, and knowledge base import.

Backend dependencies

  • Python ≥ 3.11
  • Poetry ≥ 1.8 (creates a virtual environment automatically)
  • GCC or Clang toolchain (needed for some native packages such as python-magic)
  • Optional: Docker for containerized deployment support

Frontend dependencies

  • Node.js 20+ (or Bun 1.1+)
  • Any package manager (npm, pnpm, bun; examples use npm)

Optional external services

  • S3-compatible object storage (MinIO, Amazon S3, etc.) for remote file ingestion.
  • An existing MinerU deployment (SaaS API key or self-hosted service URL).
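
When ingesting from S3-compatible storage, you typically want to parse only object keys that look like documents. The helper below is an illustrative sketch, not part of mineru-flow, and the extension list is an assumption for illustration; check your MinerU deployment for the formats it actually accepts.

```python
# Sketch: pick ingestible documents from an S3 object listing.
# The suffix filter is an assumption for illustration only.
SUPPORTED_SUFFIXES = (".pdf", ".docx", ".pptx", ".md")

def ingestible_keys(keys, prefix=""):
    """Return object keys under `prefix` that look like parseable documents."""
    return [
        k for k in keys
        if k.startswith(prefix) and k.lower().endswith(SUPPORTED_SUFFIXES)
    ]

# With boto3 (not required by this sketch), the listing itself would look like:
#   s3 = boto3.client("s3", endpoint_url="http://minio:9000")
#   keys = [o["Key"] for o in s3.list_objects_v2(Bucket="docs")["Contents"]]
```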

Startup & Configuration

Backend (FastAPI + worker)

poetry install
poetry run mineru-flow --host 0.0.0.0 --port 8001 --open

This command will:

  1. Apply database migrations (SQLite file is created under the app data directory).
  2. Start the HTTP API server on the configured host and port.
  3. Launch the asynchronous worker manager that polls for jobs.
  4. Optionally open the default browser when --open is provided.
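
The poll-and-retry behaviour in steps 3 above can be sketched roughly as follows. This is an illustrative outline only; the real worker manager lives inside mineru-flow, and the function names and job shape here are assumptions.

```python
import asyncio

POLL_INTERVAL_MS = 5000     # mirrors WORKER_POLLING_INTERVAL_MS
MAX_RETRY_ATTEMPTS = 3      # mirrors WORKER_MAX_RETRY_ATTEMPTS

async def run_phase_with_retries(phase, max_attempts=MAX_RETRY_ATTEMPTS):
    """Run one job phase, retrying up to max_attempts times on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await phase()
        except Exception:
            if attempt == max_attempts:
                raise

async def worker_loop(fetch_job, stop_event):
    """Poll for pending jobs until stop_event is set."""
    while not stop_event.is_set():
        job = await fetch_job()
        if job is None:
            # No pending work: sleep for the polling interval, then retry.
            await asyncio.sleep(POLL_INTERVAL_MS / 1000)
            continue
        for phase in job:   # parsing -> chunking -> knowledge base import
            await run_phase_with_retries(phase)
```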

You can also start the application without the CLI by running:

poetry run python -m mineru_flow.main --host 127.0.0.1 --port 8001

Frontend (Vite React dashboard)

cd frontend
npm install
npm run dev -- --port 3000

The Vite dev server proxies API requests to the backend (default /api/v1). For production, build and serve the static assets:

npm run build
npm run serve

If you prefer Bun:

bun install
bun run dev

Environment configuration

Set environment variables before starting the backend (e.g. in a .env file or via the shell). Key variables include:

Variable                    Default                                  Description
HOST                        0.0.0.0                                  HTTP bind address.
PORT                        8001                                     HTTP port.
DATABASE_URL                sqlite:///<data_dir>/mineru_flow.sqlite  Override to use PostgreSQL/MySQL if desired.
LOG_LEVEL                   INFO                                     Log level for backend and workers.
LOG_JSON                    False                                    Enable JSON-structured logs.
LOG_FILE                    None                                     Path to an additional log file.
WORKER_CONCURRENCY          4                                        Number of concurrent worker coroutines.
WORKER_POLLING_INTERVAL_MS  5000                                     Polling interval (ms) for new jobs.
WORKER_MAX_RETRY_ATTEMPTS   3                                        Automatic retry limit per job phase.
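
Reading the worker-related settings above might look like the sketch below. mineru-flow's actual settings loader is not shown here; this only mirrors the documented variable names and defaults.

```python
import os

def load_worker_settings(env=os.environ):
    """Read worker settings from the environment, falling back to the
    documented defaults. Values arrive as strings, so numeric ones are cast."""
    return {
        "concurrency": int(env.get("WORKER_CONCURRENCY", "4")),
        "polling_interval_ms": int(env.get("WORKER_POLLING_INTERVAL_MS", "5000")),
        "max_retry_attempts": int(env.get("WORKER_MAX_RETRY_ATTEMPTS", "3")),
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }
```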

Frontend-specific values use the VITE_ prefix (see frontend/src/env.ts). Create a .env or .env.local file under frontend/ if you need to override defaults, for example:

VITE_APP_TITLE="Mineru Flow"
VITE_API_BASE_URL="http://localhost:8001/api/v1"

Docker

Build and run the all-in-one container (serves both API and static UI):

docker build -t mineru-flow .
docker run --rm -p 8000:8000 \
  -e HOST=0.0.0.0 \
  -e PORT=8000 \
  -v $(pwd)/media:/app/media \
  mineru-flow

The image defaults BASE_DATA_DIR to /app/media, so mounting that path preserves the SQLite database, uploaded files, and job artifacts across restarts. Override it by supplying a different BASE_DATA_DIR (or MINERU_FLOW_DATA_DIR) if you prefer another mount point.
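
The data-directory resolution described above can be sketched as follows. The precedence between BASE_DATA_DIR and MINERU_FLOW_DATA_DIR is an assumption here; consult the mineru-flow source for the actual lookup order.

```python
import os

def resolve_data_dir(env=os.environ, default="/app/media"):
    """Resolve the data directory: BASE_DATA_DIR first (assumed precedence),
    then MINERU_FLOW_DATA_DIR, then the container default /app/media."""
    return (
        env.get("BASE_DATA_DIR")
        or env.get("MINERU_FLOW_DATA_DIR")
        or default
    )
```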

Additional Notes

  • Common commands

    • Backend tests: poetry run pytest
    • Backend static analysis: poetry run ruff check
    • Frontend tests: cd frontend && npm run test
    • Frontend formatting / linting: npm run format, npm run lint, npm run check
  • Database migrations

    • Migrations run automatically when the app starts. To trigger them manually, call mineru_flow.alembic.run_migrate.run_db_migrations().
  • Processing pipeline extensions

    • Each phase inherits from BasePhaseProcessor and is registered in mineru_flow/internal/processor/registry.py. Add new processors or replace existing ones as needed.
    • MinerU parsing strategies, chunking logic, and knowledge-base targets can be configured through /api/v1/configs or the frontend UI.
    • Artifacts are stored under <data_dir>/media/artifacts/<task_id>/<phase>/ for debugging.
  • Debugging tips

    1. Start the backend and worker with poetry run mineru-flow --open.
    2. Launch the frontend dev server in another terminal: npm run dev.
    3. Configure MinerU, S3, and knowledge base settings under System Settings before creating tasks.
    4. Track phase progress and logs in the task detail page; /api/v1/system/worker exposes worker status.
    5. Inspect logs (LOG_FILE if configured) and artifact directories for intermediate results when diagnosing failures.
  • Further development ideas

    • Swap out the database for an alternative that suits your deployment.
    • Create custom processors to add new workflow stages or override defaults.
    • Reuse or extend frontend components under frontend/src/components to build additional UI.
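
The phase-processor registry pattern described under "Processing pipeline extensions" can be sketched as below. BasePhaseProcessor's real interface is not reproduced here, so a minimal stand-in class and registry are defined purely for illustration; the actual registration mechanism lives in mineru_flow/internal/processor/registry.py.

```python
class BasePhaseProcessor:
    """Minimal stand-in for mineru-flow's base class (illustrative only)."""
    name = "base"

    def process(self, payload):
        raise NotImplementedError

REGISTRY = {}

def register(cls):
    """Register a processor class under its phase name (decorator)."""
    REGISTRY[cls.name] = cls
    return cls

@register
class DeduplicateChunks(BasePhaseProcessor):
    """Hypothetical custom phase: drop exact-duplicate chunks, keep order."""
    name = "dedupe"

    def process(self, payload):
        seen, out = set(), []
        for chunk in payload:
            if chunk not in seen:
                seen.add(chunk)
                out.append(chunk)
        return out
```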

Download files

Download the file for your platform.

Source Distribution

mineru_flow-1.0.0a4.tar.gz (1.8 MB)


Built Distribution


mineru_flow-1.0.0a4-py3-none-any.whl (1.8 MB)


File details

Details for the file mineru_flow-1.0.0a4.tar.gz.

File metadata

  • Download URL: mineru_flow-1.0.0a4.tar.gz
  • Upload date:
  • Size: 1.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.0.1 CPython/3.11.13 Linux/6.11.0-1018-azure

File hashes

Hashes for mineru_flow-1.0.0a4.tar.gz:

  • SHA256: 808acef4b6546336089a4945464b52fe1298932685d18be402851d5e3b096e31
  • MD5: 5a0e31f74b6e6ee2bc1bbcb74a342de2
  • BLAKE2b-256: 6f31f4f407d7a157418b5dc1b4e9864570c54d8a86e314ed9a524ef08ec4957e


File details

Details for the file mineru_flow-1.0.0a4-py3-none-any.whl.

File metadata

  • Download URL: mineru_flow-1.0.0a4-py3-none-any.whl
  • Upload date:
  • Size: 1.8 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.0.1 CPython/3.11.13 Linux/6.11.0-1018-azure

File hashes

Hashes for mineru_flow-1.0.0a4-py3-none-any.whl:

  • SHA256: 9b3ba01bec2557e3d2c442a752237af1bbaf3f0e609887606ec556fa437575b5
  • MD5: fd877e8e3145b100a6d90ba2a46c6bcb
  • BLAKE2b-256: 9f95dadb9f7724f7378f855ef4194db8aaf78151eea9c072a166e55fa884aea0

