Command-line tool to split documents into chunks and automatically generate question–answer datasets, designed for preparing data to fine-tune large language models (LLMs).
text2qna
text2qna is a Python toolkit and CLI for turning raw documents into instruction-style Q&A datasets for LLM fine-tuning.
- 📑 Chunk PDFs / TXT / HTML / MD into semantically split Markdown (sentence-aware or word-windowed).
- ❓ Generate Q&A pairs per section (supports positive and negative/trick pairs).
- 🎛 Steer style/coverage with --num-pairs, --negative-ratio, and --extra-prompt.
Where many tools stop at chunking, text2qna goes further: it systematically distills as many Q&A pairs as you need from each section, helping you build robust instruction datasets quickly.
Quick start
1) Chunk a document to Markdown
# Flags used below:
#   --backend         embedding backend: local | openai | ollama
#   --embed-model     embedding model name
#   --api-key         OpenAI-compatible API key
#   --embeddings-url  embeddings endpoint (Ollama backend only)
#   --window, --step  word window size and word stride between windows
#   --threshold       cosine similarity threshold for breaks
#   --sentence-split  use sentence splitting (requires nltk)
#   --output          output file
text2qna chunk ./paper.pdf \
--backend local \
--embed-model nomic-embed-text \
--api-key ollama \
--embeddings-url http://localhost:11434/api/embeddings \
--window 500 \
--step 400 \
--threshold 0.7 \
--sentence-split false \
--output ./chunks.md
2) Generate a Q&A dataset from the sections
# Environment defaults (optional but convenient)
export TEXT2QNA_API_KEY=your-api-key # OpenAI-compatible API key
export TEXT2QNA_BASE_URL=https://api.openai.com/v1 # or http://localhost:11434/v1 (Ollama/other)
export TEXT2QNA_MODEL=llama3.2
# Command with all commonly used flags:
#   --model           chat model (overrides TEXT2QNA_MODEL)
#   --num-pairs       Q&A pairs per section
#   --negative-ratio  fraction of negative/trick pairs (0.3 = 30%)
#   --extra-prompt    style/constraints
#   --output          output JSONL path
#   --api-key         explicit API key (overrides env)
#   --base-url        explicit endpoint (overrides env)
text2qna qna ./chunks.md \
--model llama3.2 \
--num-pairs 5 \
--negative-ratio 0.3 \
--extra-prompt "Keep questions concise; answers ≤3 sentences." \
--output ./dataset.jsonl \
--api-key "$TEXT2QNA_API_KEY" \
--base-url "$TEXT2QNA_BASE_URL"
Output (dataset.jsonl)
{"prompt":"What is X...?", "response":"X is ..."}
{"prompt":"Is Mars the closest planet to the Sun?", "response":"No. Mars is fourth; Mercury is closest."}
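To sanity-check a generated file before fine-tuning, the records can be parsed and validated with the standard library. This is a minimal sketch, not part of text2qna: the helper name load_pairs is hypothetical, and it only assumes the two keys shown in the sample output above.

```python
import json

def load_pairs(lines):
    """Parse JSONL lines into (prompt, response) tuples, skipping blank lines."""
    pairs = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Every record is expected to carry both keys shown in the sample output.
        missing = {"prompt", "response"} - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        pairs.append((record["prompt"], record["response"]))
    return pairs

sample = [
    '{"prompt":"What is X...?", "response":"X is ..."}',
    '{"prompt":"Is Mars the closest planet to the Sun?", "response":"No. Mars is fourth; Mercury is closest."}',
]
pairs = load_pairs(sample)
print(len(pairs))
```

For a real file, pass an open handle instead of the sample list, e.g. load_pairs(open("dataset.jsonl", encoding="utf-8")) inside a with block.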
Installation
PyPI (recommended)
pip install text2qna
From source
git clone https://github.com/nikosgiov/text2qna.git
cd text2qna
pip install -e .
Optional extras
pip install "text2qna[pdf]" # PDF text extraction (pdfplumber)
pip install "text2qna[local]" # Local sentence-transformers backend
pip install "text2qna[nltk]" # Sentence tokenization (punkt data)
Combine as needed, e.g.:
pip install "text2qna[pdf,local,nltk]"
Python: 3.9–3.12 supported.
Docker
You can run text2qna using Docker. We provide two versions of the Docker image:
- Full image (nikosgiov/text2qna): Includes all dependencies (PDF support, local embeddings, NLTK).
- Lite image (nikosgiov/text2qna-lite): Lightweight version with PDF support but without local embeddings and NLTK. Perfect for API-based usage.
You can find the pre-built Docker images on Docker Hub:
- Full version: nikosgiov/text2qna
- Lite version: nikosgiov/text2qna-lite
Using pre-built images:
# Pull and use the full version
docker pull nikosgiov/text2qna:latest
# Or pull and use the lite version
docker pull nikosgiov/text2qna-lite:latest
Building locally:
# Build the full version
docker build -t text2qna:local .
# Build the lite version
docker build -f Dockerfile.lite -t text2qna:local-lite .
Run commands:
Create input/output directories in your current directory and place your input files in the input directory:
mkdir -p input output
Examples:
Running the chunking functionality:
docker run --rm \
-v "$(pwd)/input:/app/input" \
-v "$(pwd)/output:/app/output" \
nikosgiov/text2qna \
chunk /app/input/sample.txt \
--backend ollama \
--embed-model nomic-embed-text \
--embeddings-url http://host.docker.internal:11434/api/embeddings \
--output /app/output/chunks.md
Running the Q&A generation:
docker run --rm \
-v "$(pwd)/input:/app/input" \
-v "$(pwd)/output:/app/output" \
-e TEXT2QNA_API_KEY=your-api-key \
-e TEXT2QNA_BASE_URL=https://api.openai.com/v1 \
text2qna:local \
qna /app/input/chunks.md \
--model gpt-4 \
--output /app/output/dataset.jsonl
The Docker setup includes:
- All optional dependencies (pdf, local, nltk) in the full image
- Volume mounts for input/output files
- Environment variable passing for API keys and URLs
Using with Ollama
If you want to use Ollama's API instead of OpenAI, you'll need to:
- Start Ollama separately: ollama serve
- Run text2qna with the Ollama configuration by pointing the URLs at your Ollama server. For example:
  - Embeddings: --embeddings-url http://host.docker.internal:11434/api/embeddings
  - Q&A: --base-url http://host.docker.internal:11434/v1
Note: There are two ways to connect to Ollama from the Docker container:
- Using host.docker.internal (works on Docker Desktop for Windows/Mac by default)
- Using localhost with the --network host flag
For Q&A with Ollama, you must also set a dummy API key (e.g. --api-key ollama) since the OpenAI client requires one.
text2qna features
- End-to-end: Go from raw documents → structured sections → instruction-style Q&A.
- Robustness: Support for negative pairs (misleading questions with corrective answers) helps fine-tuning resist falsehoods.
- Flexible embeddings: Choose local (sentence-transformers), OpenAI, or Ollama backends for chunking.
- Configurable: Tune chunk size, overlap, and break sensitivity; steer Q&A tone and constraints with --extra-prompt.
- CLI + API: Use in pipelines or as a library.
How it works
Chunking (semantic_split_markdown)
- Input is normalized to Markdown.
- Text is split into word windows (window, step) or sentence groups (--sentence-split true, requires NLTK).
- Each adjacent pair of windows is embedded; cosine similarity determines section boundaries.
  - Break when similarity < threshold.
  - Higher threshold ⇒ more breaks (smaller sections).
- Very short sections are merged forward to avoid tiny fragments.
Defaults: window=500, step=400, threshold=0.70. Implementation detail: min_section_words=60 prevents tiny sections (configurable in code).
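The windowing and break rule described above can be illustrated with a toy example. This is a sketch under stated assumptions, not the code behind semantic_split_markdown: the helper names are hypothetical, a bag-of-words count vector stands in for a real embedding model, a new section starts with only the words not yet covered, and the merge of very short sections is left out.

```python
import math
from collections import Counter

def word_windows(words, window, step):
    """Slide a `window`-word span forward by `step` words (assumes step <= window)."""
    spans = []
    for start in range(0, len(words), step):
        spans.append(words[start:start + window])
        if start + window >= len(words):
            break
    return spans

def toy_embed(words):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(word.lower() for word in words)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def split_sections(text, window=500, step=400, threshold=0.70):
    """Start a new section wherever adjacent windows are less similar than `threshold`."""
    spans = word_windows(text.split(), window, step)
    if not spans:
        return []
    sections = [list(spans[0])]
    for previous, current in zip(spans, spans[1:]):
        fresh = current[window - step:]  # words not already covered by `previous`
        if cosine(toy_embed(previous), toy_embed(current)) < threshold:
            sections.append(fresh)
        else:
            sections[-1].extend(fresh)
    return [" ".join(section) for section in sections]

text = "cat cat cat cat dog dog dog dog"
print(split_sections(text, window=4, step=4, threshold=0.7))
# ['cat cat cat cat', 'dog dog dog dog']
```

The two windows share no words, so their similarity is 0 and a break is placed between them; with a real embedding model the similarities would be graded rather than all-or-nothing.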
Q&A generation
- Markdown → HTML → sections (by h1/h2/h3).
- For each section:
  - Generate N positive Q&A pairs (covering uncaptured aspects).
  - Optionally generate M negative pairs (plausible but wrong questions; correct answers explain the error).
- Duplicate question texts are filtered out; basic retry logic included.
- Output is JSONL: {"prompt": "...", "response": "..."} per line.
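Two of the steps above lend themselves to a short illustration: dividing the per-section budget between positive and negative pairs, and filtering duplicate questions. The sketch below is hypothetical and may differ from what text2qna does internally; in particular, rounding the negative count down and comparing questions case-insensitively are assumptions.

```python
def split_counts(num_pairs, negative_ratio):
    """Return (positives, negatives) for one section.

    Assumption: the negative count is rounded down.
    """
    negatives = int(num_pairs * negative_ratio)
    return num_pairs - negatives, negatives

def drop_duplicate_questions(pairs):
    """Keep the first pair for each question, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for pair in pairs:
        key = " ".join(pair["prompt"].casefold().split())
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique

print(split_counts(5, 0.3))

pairs = [
    {"prompt": "What is X?", "response": "X is ..."},
    {"prompt": "what is  x?", "response": "A duplicate phrasing."},
    {"prompt": "Is Mars the closest planet to the Sun?", "response": "No."},
]
print(len(drop_duplicate_questions(pairs)))
```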
Section boundaries
When generating Q&A, text2qna treats each Markdown heading (#, ##, ###) as the start of a new section.
A section consists of the heading text plus all following content, stopping just before the next heading of any level (#, ##, or ###).
Child subsections are not merged into their parent: each heading defines its own independent section.
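The boundary rule can be sketched with a small line-based splitter. This is an illustration only: text2qna is described as going through HTML, whereas this hypothetical split_by_headings helper matches heading lines directly with a regular expression, ignores fenced code blocks, and drops any text that appears before the first heading.

```python
import re

# Matches #, ## or ### followed by whitespace; deeper headings count as body text.
HEADING = re.compile(r"^(#{1,3})\s+(.*)$")

def split_by_headings(md_text):
    """Return one section per h1/h2/h3 heading; children are not merged into parents."""
    sections = []
    current = None
    for line in md_text.splitlines():
        match = HEADING.match(line)
        if match:
            current = {
                "level": len(match.group(1)),
                "title": match.group(2).strip(),
                "lines": [],
            }
            sections.append(current)
        elif current is not None:
            current["lines"].append(line)
    return [
        {"level": s["level"], "title": s["title"], "body": "\n".join(s["lines"]).strip()}
        for s in sections
    ]

md = "# Guide\nIntro text.\n## Setup\nInstall it.\n### Details\nMore notes.\n#### Deep\nStill Details.\n"
sections = split_by_headings(md)
for section in sections:
    print(section["level"], section["title"])
```

In this sample the "Guide" section contains only "Intro text.", because the "Setup" heading starts a new section of its own.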
CLI reference
All commands support --quiet / --verbose for logging.
chunk — semantic chunking
Convert raw input to Markdown and split into semantically coherent sections.
Usage
text2qna [--quiet|--verbose] chunk <input> [options]
Key options
- --output <path>: Output Markdown (default: <input>.md)
- --backend <local|openai|ollama>: Embedding backend (default: local)
- --embed-model <name>: Embedding model (backend-specific)
- --embeddings-url <url>: Custom embeddings endpoint (Ollama; default /api/embeddings)
- --window <int>: Word window size (default: 500)
- --step <int>: Word stride/overlap (default: 400)
- --threshold <float>: Break when adjacent similarity < threshold (default: 0.70)
- --sentence-split <true|false>: Group by sentences (requires nltk punkt)
Notes
- Units for window/step are words, not characters.
- Higher threshold → more, smaller sections; lower threshold → fewer, larger sections.
- PDF parsing uses pdfplumber; image-only PDFs may need OCR beforehand.
qna — dataset-generation
Create Q&A pairs per Markdown section using an OpenAI-compatible chat API.
Usage
text2qna [--quiet|--verbose] qna <markdown> [options]
Key options
- --output <path>: JSONL output (default: dataset.jsonl)
- --model <name>: Chat model (e.g., gpt-4o-mini, llama3.2)
- --base-url <url>: OpenAI-compatible base URL (e.g., http://localhost:11434/v1)
- --api-key <key>: API key (or set TEXT2QNA_API_KEY)
- --num-pairs <int>: Pairs per section (default: 3)
- --negative-ratio <float>: Fraction of negative/trick pairs (e.g., 0.3)
- --extra-prompt <text>: Style/constraints (tone, length caps, etc.)
Global flags
- --quiet / --verbose: Adjust logging.
- -h, --help: Command help.
Version helper:
python -c "import text2qna; print(text2qna.__version__)"
Environment variables
Everything has CLI flags; env vars provide convenient defaults.
- TEXT2QNA_API_KEY: API key for any OpenAI-compatible API (OpenAI, Claude, local models)
- TEXT2QNA_BASE_URL: Base URL for the OpenAI-compatible API endpoint
- TEXT2QNA_MODEL: Default chat model for Q&A (default: llama3.2)
- TEXT2QNA_EMBED_BACKEND: openai | ollama | local (default: local)
- TEXT2QNA_EMBED_MODEL: Embedding model for the chosen backend
- TEXT2QNA_EMBED_URL: Embeddings URL for ollama (e.g., http://localhost:11434/api/embeddings)
- TEXT2QNA_DEVICE: Device for local embeddings (cpu | cuda | mps)
Note: For backward compatibility, OPENAI_API_KEY and OPENAI_BASE_URL are also supported, but TEXT2QNA_API_KEY and TEXT2QNA_BASE_URL are preferred as they better reflect that any OpenAI-compatible API can be used (OpenAI, Claude, local models, etc).
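One way to picture how a setting gets resolved is the lookup order sketched below. The order between an explicit flag and the environment follows the quick start ("overrides env"); checking the TEXT2QNA_* name before the legacy OPENAI_* name is an assumption based on the word "preferred", and resolve_setting is a hypothetical helper rather than a text2qna function.

```python
import os

def resolve_setting(cli_value, preferred_var, legacy_var=None, default=None, env=os.environ):
    """Return the CLI value if given, else the first non-empty env var, else the default."""
    if cli_value:
        return cli_value
    for name in (preferred_var, legacy_var):
        if name and env.get(name):
            return env[name]
    return default

# A plain dict stands in for the process environment in this demonstration.
fake_env = {"OPENAI_API_KEY": "legacy-key", "TEXT2QNA_API_KEY": "preferred-key"}
print(resolve_setting(None, "TEXT2QNA_API_KEY", "OPENAI_API_KEY", env=fake_env))
print(resolve_setting(None, "TEXT2QNA_MODEL", default="llama3.2", env={}))
```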
Embedding backends
- local (default): sentence-transformers models on your machine. Install: pip install "text2qna[local]"
- openai: Uses the official openai SDK's embeddings endpoint. Requires OPENAI_API_KEY and optionally OPENAI_BASE_URL.
- ollama: Calls a local /api/embeddings endpoint (JSON body: {"model": "...", "prompt": "..."}). Example: http://localhost:11434/api/embeddings
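For the ollama backend, the request shape described above can be reproduced with the standard library, which is handy for checking that the endpoint is reachable. This is a sketch, not text2qna's client: the helper names are hypothetical, and reading the vector from an "embedding" key in the response is an assumption about Ollama's reply format that should be checked against your Ollama version.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_embedding_request(text, model, url=OLLAMA_URL):
    """Build a POST request carrying the {"model", "prompt"} JSON body."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def fetch_embedding(text, model, url=OLLAMA_URL):
    """Send the request and return the vector (needs a running Ollama server)."""
    request = build_embedding_request(text, model, url)
    with urllib.request.urlopen(request) as response:
        # Assumption: the reply is a JSON object with an "embedding" list.
        return json.loads(response.read())["embedding"]

request = build_embedding_request("hello world", "mxbai-embed-large")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Only build_embedding_request runs here; call fetch_embedding once ollama serve is running and the model has been pulled.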
Programmatic usage
Chunking
from text2qna.chunker import load_file, to_markdown, semantic_split_markdown
from text2qna.embeddings import OllamaEmbeddings

# Load the source document and normalize it to Markdown.
raw = load_file("./paper.pdf")
md_text = to_markdown(raw)

# Embedding client that talks to a local Ollama server.
embedder = OllamaEmbeddings(
    model="mxbai-embed-large",
    base_url="http://localhost:11434/api/embeddings",
)

# Split the Markdown into semantically coherent sections.
chunks_md = semantic_split_markdown(
    md_text,
    embedder=embedder,
    window=500,
    step=400,
    threshold=0.7,
    sentence_split=False,
)

with open("chunks.md", "w", encoding="utf-8") as f:
    f.write(chunks_md)
Q&A dataset generation
from openai import OpenAI
from text2qna.qna import create_dataset_from_markdown, save_dataset_jsonl

# Any OpenAI-compatible endpoint works; this one points at a local Ollama server.
client = OpenAI(api_key="sk-...", base_url="http://localhost:11434/v1")

# Read the chunked Markdown produced by the chunking step.
with open("chunks.md", "r", encoding="utf-8") as f:
    md = f.read()

dataset = create_dataset_from_markdown(
    md_text=md,
    client=client,
    model="llama3.2",
    num_pairs_per_section=5,
    negative_ratio=0.3,
    extra_prompt="Keep questions concise and answers under 3 sentences.",
)
save_dataset_jsonl(dataset, "dataset.jsonl")
Recipes
Rick & Morty tone
Respond exactly like Rick from Rick and Morty: be rude, impatient, sarcastic, and brutally honest.
Keep content factually correct; use informal, unfiltered tone.
Use via --extra-prompt or the extra_prompt parameter.
OpenAI embeddings
export TEXT2QNA_API_KEY=your-api-key
text2qna chunk notes.txt \
--backend openai \
--embed-model text-embedding-3-small
Ollama embeddings
ollama pull mxbai-embed-large
text2qna chunk page.html \
--backend ollama \
--embed-model mxbai-embed-large \
--embeddings-url http://localhost:11434/api/embeddings
Local sentence-transformers
pip install "text2qna[local]"
text2qna chunk notes.md \
--backend local \
--embed-model sentence-transformers/all-MiniLM-L6-v2 \
--device cpu
Performance & quality tips
- Start simple: window=500, step=400, threshold=0.7.
- Smaller sections (for dense/varied content): increase threshold (e.g., 0.8–0.9) or reduce window.
- Larger sections (for long uniform prose): decrease threshold (e.g., 0.5–0.6) or increase window.
- Sentence alignment: --sentence-split true can improve boundaries; requires nltk + punkt data.
- PDFs: Image-only or messy PDFs benefit from OCR first; pdfplumber reads text layers only.
- Style control: Use --extra-prompt to enforce tone, length caps, format (bullets, JSON, etc.).
Troubleshooting
- NLTK sentence splitting error: download the punkt data with
  python -c "import nltk; nltk.download('punkt')"
- Ollama embeddings not found: Pull the model locally (e.g., ollama pull mxbai-embed-large) and ensure you use the /api/embeddings route.
- Poor PDF text: Run OCR; bad text layers cause garbage input.
- Missing deps: If you skipped extras, install what you need: pip install "text2qna[pdf,local,nltk]".
- No Q&A output: Ensure your model name and --base-url are correct and accessible; check --verbose logs.
FAQ
How do window, step, and threshold interact?
window/step slide over words to make adjacent spans, which are embedded and compared. A break occurs when cosine_similarity(span_i, span_{i+1}) < threshold. Thus, higher threshold ⇒ more breaks (smaller chunks); lower ⇒ fewer breaks.
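The effect of the threshold is easiest to see on a fixed list of similarities. In the sketch below the similarity values are made up for illustration, and break_positions is a hypothetical helper that only applies the break rule stated above.

```python
def break_positions(similarities, threshold):
    """Return the index of each span that starts a new chunk.

    similarities[i] compares span i with span i + 1, so a low value
    at position i puts a break in front of span i + 1.
    """
    return [i + 1 for i, value in enumerate(similarities) if value < threshold]

# Invented similarities between six consecutive spans.
similarities = [0.92, 0.65, 0.81, 0.48, 0.74]

for threshold in (0.5, 0.7, 0.9):
    print(threshold, break_positions(similarities, threshold))
# 0.5 [4]
# 0.7 [2, 4]
# 0.9 [2, 3, 4, 5]
```

Raising the threshold from 0.5 to 0.9 turns one break into four, which is the "higher threshold, smaller chunks" behavior described above.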
Does the tool require OpenAI?
No. You can chunk with local or Ollama embeddings. Q&A generation does require an OpenAI-compatible chat API, which can be local (e.g., an Ollama-compatible server) if it matches the API.
What’s in the JSONL?
Lines like {"prompt": "...", "response": "..."}. Internal fields (e.g., is_negative) are stripped before writing.
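Stripping internal fields before writing can be pictured as follows. This is a sketch of the idea only, not the body of save_dataset_jsonl: write_jsonl is a hypothetical helper, and the record is an invented example that carries the is_negative field mentioned above.

```python
import io
import json

def write_jsonl(records, handle, public_fields=("prompt", "response")):
    """Write one JSON object per line, keeping only the public fields."""
    for record in records:
        public = {field: record[field] for field in public_fields}
        handle.write(json.dumps(public, ensure_ascii=False) + "\n")

records = [
    {
        "prompt": "Is Mars the closest planet to the Sun?",
        "response": "No. Mars is fourth; Mercury is closest.",
        "is_negative": True,  # internal bookkeeping, not written out
    },
]

buffer = io.StringIO()  # stands in for an open file
write_jsonl(records, buffer)
print(buffer.getvalue(), end="")
```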
How do I get the version?
python -c "import text2qna; print(text2qna.__version__)"
Contributing
Contributions are welcome!
- Fork & create a feature branch
- pip install -e ".[pdf,local,nltk]"
- Add tests (if applicable)
- Open a PR with a clear description and examples
Bug reports & feature requests: Issues
License
MIT © Nikolaos Giovanopoulos
Security & privacy
- Keep your API keys secret. Prefer environment variables over hardcoding.
- Review your documents for sensitive information before generating datasets.