Fine-tune BAAI/bge-m3 for retrieval + reranking on a specific domain
🎯 BGE Auto-Tune
Fine-tune BAAI/bge-m3 on your domain, from scratch.
BGE Auto-Tune automates the entire pipeline: generates a training dataset from your data, runs unified fine-tuning (dense + sparse + ColBERT), tests retrieval quality, re-indexes your collections, and publishes the model to HuggingFace.
- `bge-auto-tune generate` → create dataset from Qdrant or from documents (Docling)
- `bge-auto-tune finetune` → train the model
- `bge-auto-tune test` → compare base vs fine-tuned
- `bge-auto-tune reindex` → re-embed and update vectors in Qdrant
- `bge-auto-tune publish` → upload model to HuggingFace Hub
- `bge-auto-tune run` → generate + finetune + test in sequence
Two ways to generate the dataset
The generate command supports two data sources. Choose the one that fits your situation:
| Source | Flag | When to use |
|---|---|---|
| Qdrant | (default) | You already have a Qdrant collection with chunked and vectorized documents. The tool reads chunks directly from your collection. |
| Docling | --docling | You're starting from scratch or don't have vectorized data yet. Point the tool at a folder of documents (PDF, DOCX, PPTX, HTML, images, etc.) and Docling handles parsing + semantic chunking. |
Our recommendation: if you're building a retrieval system from zero and your documents live on disk, start with --docling. It gives you high-quality semantic chunks without having to set up an ingestion pipeline first. If you already have a well-populated Qdrant collection, use the default mode — it's faster since chunks are already there.
```bash
# From Qdrant (default)
bge-auto-tune generate --collection my_docs --min-pairs 3000

# From documents via Docling
bge-auto-tune generate --docling --docs-dir ./my_docs --min-pairs 3000
```
Both paths produce the same bge_m3_training.jsonl output. Everything downstream (finetune, test, reindex, publish) is identical regardless of how the dataset was generated.
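Each line of that file is one JSON object. As a hedged illustration, a record plausibly follows the query/pos/neg triplet schema that FlagEmbedding's fine-tuning scripts consume; the exact field names are an assumption, so inspect your generated file to confirm:

```python
import json

# Hypothetical training record, assuming the FlagEmbedding-style
# query / pos / neg triplet schema (verify against your own output file).
record = {
    "query": "How do I reset my home banking password?",
    "pos": ["To reset your password, open the app and select Security..."],
    "neg": ["Corporate account password policies require rotation every..."],
}

# Serialize exactly as one JSONL line, then parse it back.
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
```

If your file uses different keys, the downstream commands still work as long as you keep the format the generator produced.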
Prerequisites
BGE Auto-Tune relies on external services depending on which mode you use.
Always required
Local LLM (OpenAI-compatible API)
A language model accessible via an OpenAI-compatible API. Used to filter low-quality chunks and generate realistic synthetic queries.
Any model served by vLLM, Ollama (with OpenAI endpoint), llama.cpp, TGI, etc. will work.
http://localhost:8001/v1 (default)
Tested with Qwen/Qwen3-4B-Instruct-2507 — 4-8B models work great for this task.
BGE-M3 embedding server
A server exposing the BGE-M3 model for generating embeddings. Used to search for hard negatives during dataset generation and to re-index collections after fine-tuning.
http://localhost:8004 (default)
The server must respond on POST /v1/embeddings with the standard format.
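The "standard format" here is the OpenAI-style embeddings contract. The shapes below are a sketch with made-up values, not a live request:

```python
# Illustrative request/response bodies for an OpenAI-compatible
# POST /v1/embeddings call (model name and vectors are invented).
request_body = {
    "model": "bge-m3",
    "input": ["first chunk of text", "second chunk of text"],
}

response_body = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.01, -0.02, 0.03]},
        {"object": "embedding", "index": 1, "embedding": [0.04, 0.00, -0.05]},
    ],
    "model": "bge-m3",
}

# One embedding comes back per input string, matched by index.
vectors = [
    item["embedding"]
    for item in sorted(response_body["data"], key=lambda d: d["index"])
]
```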
For Qdrant mode (default)
Qdrant
A Qdrant instance with a collection populated with text chunks. Chunks must have a text field in their payload (configurable with --text-field).
http://localhost:6333 (default)
For Docling mode (--docling)
Docling serve
A running instance of docling-serve. The tool uses the /v1/chunk/hybrid/file endpoint for semantic chunking — Docling handles document parsing, OCR, table extraction, and intelligent chunking all in one step.
http://localhost:5001 (default)
Start it with:
```bash
# Python package
pip install "docling-serve[ui]"
docling-serve run

# Or Docker
docker run -p 5001:5001 quay.io/docling-project/docling-serve
```
Services overview
| Service | Default endpoint | Env var | Required for |
|---|---|---|---|
| LLM | http://localhost:8001/v1 | LLM_URL | Both modes |
| Embedding | http://localhost:8004 | EMBED_URL | Both modes |
| Qdrant | http://localhost:6333 | QDRANT_URL | Qdrant mode |
| Docling | http://localhost:5001 | DOCLING_URL | Docling mode |
All endpoints can be configured via env vars or CLI flags.
Hardware
Fine-tuning requires a GPU with at least 16 GB VRAM (24 GB recommended). Dataset generation and testing can run on CPU but are much faster on GPU.
Installation
```bash
pip install bge-auto-tune
```
Recommended workflow (manual pipeline)
If this is your first time, use the individual commands. This lets you inspect and validate each step before moving on.
Step 1 — Generate the dataset
From Qdrant (you already have chunks in a collection):
```bash
bge-auto-tune generate \
  --collection my_docs \
  --min-pairs 3000 \
  --queries-per-chunk 3 \
  --hard-negatives 7
```
From documents via Docling (starting from files on disk):
```bash
bge-auto-tune generate \
  --docling \
  --docs-dir ./my_docs \
  --min-pairs 3000 \
  --queries-per-chunk 3 \
  --hard-negatives 7
```
Docling supports PDF, DOCX, PPTX, HTML, images (PNG, JPG, TIFF, BMP), Markdown, CSV, and XLSX. It scans the directory recursively, so subfolders are included.
Both commands produce a bge_m3_training.jsonl file. Open it and review it before proceeding:
```bash
# How many pairs?
wc -l bge_m3_training.jsonl

# Look at some examples
head -5 bge_m3_training.jsonl | python -m json.tool
```
Check that:
- Queries are realistic and diverse
- Positives are actually relevant to the query
- Negatives are "hard" (similar but incorrect)
If something looks off, adjust parameters and regenerate. It's much better to spend time here than to fine-tune on dirty data.
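A few lines of Python can automate the basic checks. This sketch assumes the query/pos/neg field names; swap them for whatever your file actually contains:

```python
import io
import json

# Stand-in for open("bge_m3_training.jsonl") so the sketch is self-contained.
sample = io.StringIO(
    '{"query": "q1", "pos": ["relevant passage"], "neg": ["close but wrong"]}\n'
    '{"query": "q2", "pos": ["also relevant"], "neg": []}\n'
)

n_pairs = 0
empty_neg = 0
for raw in sample:
    rec = json.loads(raw)
    assert rec["query"].strip(), "empty query"
    assert rec["pos"], "record without a positive"
    n_pairs += 1
    if not rec["neg"]:
        empty_neg += 1  # no hard negatives: weak training signal
```

Records with empty negative lists are worth flagging: they contribute little to contrastive training.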
How many pairs do you need?
| Pairs | Expected quality |
|---|---|
| < 500 | Nearly useless — too little signal |
| 500 – 1,000 | Works only with a very specific domain and clean data |
| 1,000 – 3,000 | Safe zone for most use cases |
| 3,000 – 10,000 | Ideal — robust results |
| > 10,000 | Diminishing returns |
The absolute number matters less than coverage: if your corpus has 5,000 chunks but the dataset only covers 200 of them, you'll have huge blind spots. Aim to cover at least 30-50% of your chunks.
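The arithmetic is simple enough to sanity-check before generating. Using the numbers above (and the default of 3 queries per chunk as an example):

```python
total_chunks = 5000
covered_chunks = 200

# Coverage of the corpus, independent of how many pairs you have.
coverage = covered_chunks / total_chunks  # only 4% of chunks seen

# How many pairs it takes to reach the 30% coverage floor,
# assuming --queries-per-chunk 3.
pairs_per_chunk = 3
chunks_needed = total_chunks * 30 // 100
pairs_at_30pct = chunks_needed * pairs_per_chunk
```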
Step 2 — Fine-tune
```bash
bge-auto-tune finetune \
  --dataset bge_m3_training.jsonl \
  --epochs 2 \
  --lr 1e-5 \
  --batch-size 4
```
The model is saved to ./bge-m3-finetuned/.
Note: fine-tuning always starts from the base model BAAI/bge-m3. If you rerun the command, the previous output is overwritten. To keep different versions, change `--output`:

```bash
bge-auto-tune finetune --output ./bge-m3-finetuned-v2
```
Step 3 — Test
```bash
bge-auto-tune test \
  --model ./bge-m3-finetuned \
  --test-queries 200 \
  --top-k 10
```
The test automatically holds out 20% of the dataset as unseen queries and compares the base model against the fine-tuned one:
- Recall@K — is the positive in the top K results?
- MRR — at what average position does the positive end up?
- NDCG@10 — ranking quality considering position
- Rerank accuracy — multi-mode (dense+sparse+colbert) scoring
- Qualitative examples — queries where fine-tuning improved (or worsened) the ranking
If results aren't good, go back to Step 1: more data, better data, or different parameters.
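The first three metrics are easy to compute by hand, which helps when reading the report. A toy sketch with one positive per query (the ranks below are invented):

```python
import math

# 1-based rank of the positive for each test query; None = not retrieved.
ranks = [1, 3, None, 2]
k = 10

# Recall@K: fraction of queries whose positive lands in the top K.
recall_at_k = sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

# MRR: average of 1/rank (0 when the positive is missing).
mrr = sum(1 / r for r in ranks if r is not None) / len(ranks)

# With a single relevant document per query, NDCG@K reduces to
# 1 / log2(rank + 1), since the ideal DCG is 1.
ndcg_at_k = sum(
    1 / math.log2(r + 1) for r in ranks if r is not None and r <= k
) / len(ranks)
```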
Step 4 — Re-index Qdrant
After testing, you need to re-embed all documents with the new model. Point your BGE-M3 server to the fine-tuned model, then:
```bash
# Test first — creates a separate collection (my_docs_test_finetuned)
bge-auto-tune reindex --collections my_docs --test

# When satisfied, update the original collection in-place
bge-auto-tune reindex --collections my_docs
```

Multiple collections at once:

```bash
bge-auto-tune reindex --collections docs,faq,policies --test
```
The command iterates all points, reads the text field, calls the embedding service, and updates the vectors. It supports named vectors (dense + sparse), unnamed vectors, and ColBERT.
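The loop structure can be sketched in a few lines. The Qdrant client and the embedding call are replaced with in-memory stubs here; the real tool talks to the running services:

```python
# Toy stand-ins for points scrolled out of a Qdrant collection.
points = [
    {"id": 1, "payload": {"text": "chunk one"}, "vector": None},
    {"id": 2, "payload": {"text": "chunk two"}, "vector": None},
]

def embed(texts):
    # Stand-in for POST /v1/embeddings against the BGE-M3 server;
    # returns one fake 2-d vector per input text.
    return [[float(len(t)), 0.0] for t in texts]

# Iterate in batches, read the text field, embed, write vectors back
# (the real tool would push the updated vectors to Qdrant here).
batch_size = 2
for start in range(0, len(points), batch_size):
    batch = points[start:start + batch_size]
    vectors = embed([p["payload"]["text"] for p in batch])
    for point, vec in zip(batch, vectors):
        point["vector"] = vec
```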
Step 5 — Publish (optional)
Share the fine-tuned model on HuggingFace Hub:
```bash
# Login once
huggingface-cli login

# Publish
bge-auto-tune publish --repo your-user/bge-m3-your-domain
```

A model card is generated automatically from your test results. Add `--private` for private repos.

Anyone can then use your model:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("your-user/bge-m3-your-domain")
```
run command (automatic pipeline)
For users who have already calibrated their parameters and want to run generate → finetune → test in one shot:
```bash
# From Qdrant
bge-auto-tune run \
  --collection my_docs \
  --min-pairs 3000 \
  --epochs 2 \
  --lr 1e-5 \
  --verbose

# From documents via Docling
bge-auto-tune run \
  --docling \
  --docs-dir ./my_docs \
  --min-pairs 3000 \
  --epochs 2 \
  --lr 1e-5 \
  --verbose
```
If any step fails, it stops. Re-indexing and publishing are not included in run — they require service configuration and confirmation.
⚠️ Use this only when you know your services are working and your parameters are dialed in. For the first time, use the manual pipeline.
All parameters
bge-auto-tune generate
Common parameters (both modes)
| Parameter | Default | Description |
|---|---|---|
| --llm-url | http://localhost:8001/v1 | Local LLM endpoint |
| --llm-model | Qwen/Qwen3-4B-Instruct-2507 | LLM model name |
| --embed-url | http://localhost:8004 | BGE-M3 embedding server |
| --output | bge_m3_training.jsonl | Output dataset file |
| --queries-per-chunk | 3 | Queries generated per chunk |
| --hard-negatives | 7 | Hard negatives per query |
| --min-pairs | 2000 | Minimum target pairs |
| --max-chunks | (all) | Limit chunks processed (for testing) |
| --min-chunk-length | 100 | Minimum chunk length (chars) |
| --min-words | 20 | Minimum words per chunk |
| --min-alpha-ratio | 0.4 | Minimum alphabetic character ratio |
| --skip-llm-filter | false | Skip LLM quality filter |
| --seed | 42 | Random seed |
| --resume | false | Resume from partial output |
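The three deterministic filter flags (--min-chunk-length, --min-words, --min-alpha-ratio) compose naturally into a single predicate. A sketch of what such a filter plausibly does, with thresholds mirroring the CLI defaults (the tool's actual implementation may differ):

```python
def keep_chunk(text, min_len=100, min_words=20, min_alpha_ratio=0.4):
    """Return True if a chunk passes the deterministic quality gates."""
    if len(text) < min_len:            # --min-chunk-length
        return False
    if len(text.split()) < min_words:  # --min-words
        return False
    alpha = sum(c.isalpha() for c in text)
    # --min-alpha-ratio: reject chunks dominated by digits/punctuation
    return alpha / max(len(text), 1) >= min_alpha_ratio
```

Chunks that survive these gates can still be dropped by the LLM quality filter unless `--skip-llm-filter` is set.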
Qdrant mode (default)
| Parameter | Default | Description |
|---|---|---|
| --qdrant-url | http://localhost:6333 | Qdrant instance URL |
| --collection | docs | Collection name |
| --text-field | text | Payload field containing text |
| --scroll-batch | 100 | Batch size for Qdrant scroll |
| --vector-name | (none) | Named vector in Qdrant (e.g. dense) |
Docling mode (--docling)
| Parameter | Default | Description |
|---|---|---|
| --docling | false | Enable Docling mode |
| --docs-dir | (required) | Directory with documents to parse |
| --docling-url | http://localhost:5001 | Docling serve URL |
| --docling-timeout | 300 | Timeout per document (seconds) |
| --embed-batch | 32 | Batch size for embedding chunks |
| --chunking-max-tokens | 512 | Max tokens per chunk (semantic chunking) |
| --chunking-tokenizer | BAAI/bge-m3 | Tokenizer used for chunking |
| --docling-ocr / --no-docling-ocr | true | Enable/disable OCR |
| --docling-ocr-engine | easyocr | OCR engine |
| --docling-ocr-lang | ["it", "en"] | OCR languages (JSON array) |
| --docling-pdf-backend | dlparse_v2 | PDF parsing backend |
| --docling-table-mode | accurate | Table extraction mode |
| --docling-table-cell-matching | true | Table cell matching |
bge-auto-tune finetune
| Parameter | Default | Description |
|---|---|---|
| --dataset | bge_m3_training.jsonl | Input dataset |
| --model | BAAI/bge-m3 | Base model |
| --output | ./bge-m3-finetuned | Output directory |
| --epochs | 2 | Training epochs |
| --lr | 1e-5 | Learning rate |
| --batch-size | 4 | Batch size per GPU |
| --temperature | 0.02 | InfoNCE temperature |
| --warmup-ratio | 0.05 | Warmup ratio |
| --max-passage-len | 1024 | Max passage length |
| --max-query-len | 256 | Max query length |
| --train-group-size | 8 | Passages per query in batch |
| --save-steps | 500 | Save checkpoint every N steps |
| --gradient-checkpointing | true | Save VRAM |
bge-auto-tune test
| Parameter | Default | Description |
|---|---|---|
| --base-model | BAAI/bge-m3 | Base model for comparison |
| --model | ./bge-m3-finetuned | Fine-tuned model |
| --dataset | bge_m3_training.jsonl | Dataset with query/positive pairs |
| --test-queries | 200 | Number of test queries |
| --test-split | 0.2 | Fraction held out for testing |
| --top-k | 10 | Recall/NDCG cutoff |
| --dense-weight | 0.30 | Dense weight for reranking |
| --sparse-weight | 0.65 | Sparse weight for reranking |
| --colbert-weight | 0.05 | ColBERT weight for reranking |
| --batch-size | 16 | Encoding batch size |
| --output | test_results.json | Detailed JSON report |
| --verbose | false | Show all examples |
| --device | auto | Device: auto, cuda, cpu, mps |
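The three weights combine into a single rerank score as a weighted sum. A sketch with the default weights and invented per-mode scores (real scores come from BGE-M3's dense, sparse, and ColBERT scorers):

```python
# Default rerank weights from the test command.
weights = {"dense": 0.30, "sparse": 0.65, "colbert": 0.05}

# Made-up per-mode scores for two candidate documents.
candidates = {
    "doc_a": {"dense": 0.82, "sparse": 0.40, "colbert": 0.90},
    "doc_b": {"dense": 0.70, "sparse": 0.75, "colbert": 0.60},
}

def rerank_score(scores):
    # Weighted sum across the three scoring modes.
    return sum(weights[mode] * scores[mode] for mode in weights)

ranked = sorted(candidates, key=lambda d: rerank_score(candidates[d]), reverse=True)
```

Note how the high sparse weight lets doc_b overtake doc_a despite doc_a's stronger dense and ColBERT scores.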
bge-auto-tune reindex
| Parameter | Default | Description |
|---|---|---|
| --collections | (required) | Collections to re-index (comma-separated) |
| --qdrant-url | http://localhost:6333 | Qdrant instance URL |
| --embed-url | http://localhost:8004 | BGE-M3 embedding server |
| --text-field | text | Payload field containing text |
| --vectors | dense,sparse | Vector types to generate (comma-separated) |
| --vector-names | (auto) | Mapping type:name, e.g. dense:dense,sparse:sparse |
| --unnamed-vectors | false | Use unnamed vectors (dense only) |
| --test | false | Create test collection instead of updating in-place |
| --embed-batch | 32 | Batch size for embedding calls |
| --scroll-batch | 100 | Batch size for Qdrant scroll |
| --yes / -y | false | Skip confirmation prompt |
Re-index examples
```bash
# Default: named vectors dense + sparse (most common setup)
bge-auto-tune reindex --collections docs

# Test mode: creates docs_test_finetuned
bge-auto-tune reindex --collections docs --test

# Multiple collections
bge-auto-tune reindex --collections docs,faq,policies

# Only dense, unnamed vectors
bge-auto-tune reindex --collections docs --vectors dense --unnamed-vectors

# Dense + sparse + ColBERT with custom names
bge-auto-tune reindex --collections docs --vectors dense,sparse,colbert \
  --vector-names "dense:emb,sparse:lex,colbert:col"

# Different text field
bge-auto-tune reindex --collections docs --text-field content

# Skip confirmation
bge-auto-tune reindex --collections docs -y
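The `--vector-names` value maps each vector type to the name your collection actually uses. A guess at the parsing semantics (this is an illustration, not the tool's own code):

```python
def parse_vector_names(spec):
    """Parse a 'type:name,type:name' mapping like the --vector-names flag."""
    mapping = {}
    for pair in spec.split(","):
        vec_type, name = pair.split(":")
        mapping[vec_type.strip()] = name.strip()
    return mapping

mapping = parse_vector_names("dense:emb,sparse:lex,colbert:col")
```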
bge-auto-tune publish
| Parameter | Default | Description |
|---|---|---|
| --model | ./bge-m3-finetuned | Path to fine-tuned model |
| --repo | (required) | HuggingFace repo (user/model-name) |
| --private | false | Create private repo |
| --branch | main | Target branch |
| --message | (auto) | Custom commit message |
| --yes / -y | false | Skip confirmation prompt |
bge-auto-tune run
Accepts all parameters from generate, finetune, and test. Notable additions:
| Parameter | Default | Description |
|---|---|---|
| --model-output | ./bge-m3-finetuned | Output directory for fine-tuned model |
| --verbose | false | Detailed output for every step |
Environment variables
All endpoints can be set via env vars so you don't have to repeat them:
```bash
export QDRANT_URL=http://10.0.0.1:6333
export QDRANT_COLLECTION=my_docs
export LLM_URL=http://10.0.0.1:8001/v1
export LLM_MODEL=Qwen/Qwen3-8B
export LLM_API_KEY=none
export EMBED_URL=http://10.0.0.1:8004
export DOCLING_URL=http://10.0.0.1:5001
export DOCS_DIR=/data/my_documents

# Qdrant mode:
bge-auto-tune generate
bge-auto-tune finetune
bge-auto-tune test
bge-auto-tune reindex --collections my_docs

# Docling mode:
bge-auto-tune generate --docling
```
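A typical resolution order for such settings is CLI flag first, then env var, then built-in default. How bge-auto-tune resolves them exactly is an assumption here; check `--help` for the authoritative behavior:

```python
import os

def resolve_endpoint(cli_value, env_name, default):
    """Hypothetical precedence: explicit flag > env var > default."""
    if cli_value:
        return cli_value
    return os.environ.get(env_name, default)

os.environ["QDRANT_URL"] = "http://10.0.0.1:6333"
url = resolve_endpoint(None, "QDRANT_URL", "http://localhost:6333")
```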
Full examples
From Qdrant (existing vectorized data)
```bash
# 1. Generate 2000 pairs from banking docs
bge-auto-tune generate --collection docs --min-pairs 2000

# 2. Fine-tune for 4 epochs (batch 1 for 24GB GPU)
bge-auto-tune finetune --dataset bge_m3_training.jsonl --epochs 4 --batch-size 1

# 3. Test on held-out queries
bge-auto-tune test --model ./bge-m3-finetuned --test-queries 200

# 4. Point your BGE server to the fine-tuned model, then re-index
bge-auto-tune reindex --collections docs --test   # test first
bge-auto-tune reindex --collections docs          # production

# 5. Publish to HuggingFace
bge-auto-tune publish --repo your-user/bge-m3-bank-it
```
From documents via Docling (starting from scratch)
```bash
# 1. Generate 2000 pairs from a folder of PDFs, DOCX, etc.
bge-auto-tune generate \
  --docling \
  --docs-dir ./company_docs \
  --min-pairs 2000 \
  --chunking-max-tokens 512

# 2. Fine-tune
bge-auto-tune finetune --dataset bge_m3_training.jsonl --epochs 4 --batch-size 1

# 3. Test
bge-auto-tune test --model ./bge-m3-finetuned --test-queries 200

# 4. Publish
bge-auto-tune publish --repo your-user/bge-m3-company-it
```
One-shot pipeline with Docling
```bash
bge-auto-tune run \
  --docling \
  --docs-dir ./company_docs \
  --min-pairs 2000 \
  --epochs 4 \
  --batch-size 1 \
  --verbose
```
How Docling mode works
When you use --docling, the pipeline is:
- Scan — recursively find all supported documents in --docs-dir
- Chunk — send each document to Docling's /v1/chunk/hybrid/file endpoint, which handles parsing (OCR, table extraction, layout analysis) and semantic chunking in one step
- Filter — apply deterministic quality filters (length, alpha ratio, junk patterns) and optionally LLM-based quality scoring
- Embed — embed all chunks via the BGE-M3 server and build an in-memory index
- Generate — for each chunk, generate synthetic queries via the local LLM and find hard negatives using the in-memory index (cosine similarity)
- Save — write the standard JSONL dataset
The key difference from Qdrant mode: Docling mode doesn't need Qdrant at all for dataset generation. Hard negatives are computed in-memory using numpy instead of querying a vector database. This means you can fine-tune a model before you even have a Qdrant collection set up.
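In-memory hard-negative mining with numpy boils down to cosine similarity against every chunk embedding, then taking the closest chunks that are not the positive itself. A toy sketch with 3-d vectors:

```python
import numpy as np

# Toy chunk embeddings (row 0 is the positive for some query).
chunks = np.array([
    [1.0, 0.0, 0.0],   # 0: the positive chunk
    [0.9, 0.1, 0.0],   # 1: very similar -> good hard negative
    [0.0, 1.0, 0.0],   # 2: unrelated
    [0.8, 0.2, 0.1],   # 3: similar -> good hard negative
])

# L2-normalize so a dot product equals cosine similarity.
normed = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)

positive_idx = 0
sims = normed @ normed[positive_idx]   # cosine similarity to the positive
order = np.argsort(-sims)              # most similar first
hard_negatives = [i for i in order if i != positive_idx][:2]
```

The closest non-positive chunks win: "similar but not the answer" is exactly the signal contrastive training needs.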
License
MIT