neuraparse
Production-grade agentic document-to-dataset pipeline with GraphRAG support.
⚠️ Alpha Release: This is an early alpha version (0.1.0a1). APIs may change. Feedback and contributions welcome!
What is neuraparse?
neuraparse transforms documents into high-quality datasets for:
- GraphRAG systems (entity extraction, graph neighborhoods, hierarchical summaries)
- Retrieval evaluation (graded relevance, cross-document ranking, multi-context ranking)
- LLM fine-tuning (QA pairs, instruction datasets, summarization)
- Agentic workflows (memory, tool usage, knowledge graphs)
Key Features
- Multi-format ingestion: web pages, PDFs, Office docs, Markdown, plain text
- Hierarchical parsing: layout-aware DocumentTree (sections, paragraphs, metadata)
- GraphRAG-ready: DocumentGraph with structural + semantic nodes (entities, summaries)
- 10+ dataset recipes: RAG chunks, QA pairs, entity knowledge, graded relevance, cross-document ranking
- Profile system: bundle recipes into workflows (graphrag, eval_ranking, eval_advanced)
- Real LLM integration: OpenAI, Anthropic (Claude), Ollama (local models)
- Production-ready: 39 tests, type hints, comprehensive error handling
Installation
```shell
# Basic installation
pip install neuraparse

# With LLM providers
pip install neuraparse[llm-openai]     # OpenAI GPT-4/3.5
pip install neuraparse[llm-anthropic]  # Anthropic Claude
pip install neuraparse[llm-ollama]     # Local Ollama models
pip install neuraparse[llm-all]        # All LLM providers

# With document parsing
pip install neuraparse[pdf]            # PDF support
pip install neuraparse[office]         # DOCX support
pip install neuraparse[recipes-yaml]   # YAML recipe configs

# Full installation
pip install neuraparse[llm-all,pdf,office,recipes-yaml]
```
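You can check which optional backends ended up in your environment with a small stdlib probe. This is a generic pattern, not part of neuraparse; the mapping from extras to import names below is an assumption (check the project's packaging metadata for the real dependency names):

```python
from importlib.util import find_spec

# Hypothetical mapping from extras to the import names of their backing SDKs.
extras = {
    "llm-openai": "openai",
    "llm-anthropic": "anthropic",
    "llm-ollama": "ollama",
}

# find_spec returns None when a top-level package is not installed.
available = {extra: find_spec(pkg) is not None for extra, pkg in extras.items()}
print(available)
```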
Quick Start
1. Ingest a document
```shell
# From a web page
neuraparse ingest https://example.com/article.html

# From a local file
neuraparse ingest path/to/document.pdf

# From markdown
neuraparse ingest path/to/notes.md
```
2. Build a document graph
```shell
neuraparse build-graph <document_id>
```
This creates a DocumentGraph with:
- Structural nodes: DOCUMENT → SECTION → PARAGRAPH hierarchy
- Semantic nodes: ENTITY (keywords), SUMMARY (section summaries)
- Edges: parent_of, next_sibling, mentions, summarizes
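The node and edge types above can be mirrored with a minimal adjacency-list sketch. This uses plain dicts and tuples for illustration only; it is not the library's actual DocumentGraph class, and all identifiers are made up:

```python
# Stand-in for the DocumentGraph described above (illustrative only).
nodes = {
    "doc:1": {"type": "DOCUMENT"},
    "sec:1": {"type": "SECTION", "title": "Intro"},
    "par:1": {"type": "PARAGRAPH", "text": "neuraparse builds datasets."},
    "ent:neuraparse": {"type": "ENTITY", "name": "neuraparse"},
    "sum:1": {"type": "SUMMARY", "text": "Overview of the pipeline."},
}
edges = [
    ("doc:1", "parent_of", "sec:1"),
    ("sec:1", "parent_of", "par:1"),
    ("par:1", "mentions", "ent:neuraparse"),
    ("sum:1", "summarizes", "sec:1"),
]

def children(node_id):
    """Follow parent_of edges one level down the hierarchy."""
    return [dst for src, rel, dst in edges if src == node_id and rel == "parent_of"]

print(children("doc:1"))  # ['sec:1']
```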
3. Generate datasets
Option A: Run a single recipe
```shell
# Generate RAG chunks
neuraparse run-recipe <document_id> --recipe examples/rag_chunks.json

# Generate QA pairs with OpenAI
neuraparse run-recipe <document_id> --recipe examples/recipe_with_openai.json

# Generate a graded relevance dataset
neuraparse run-recipe <document_id> --recipe examples/graded_relevance.json
```
Option B: Run a profile (multiple recipes)
```shell
# GraphRAG profile (6 recipes: chunks, QA, summaries, entities, neighborhoods, relevance)
neuraparse run-profile <document_id> --profile graphrag

# Evaluation ranking profile (2 recipes: section_relevance, multi_context_ranking)
neuraparse run-profile <document_id> --profile eval_ranking

# Advanced evaluation profile (3 recipes: graded_relevance, cross_doc_ranking, entity_context_ranking)
neuraparse run-profile <document_id> --profile eval_advanced
```
Available Recipes
| Recipe | Description | Output Format |
|---|---|---|
| `rag_chunks` | Paragraph chunks for RAG | `{chunk_id, text, metadata}` |
| `basic_qa` | QA pairs per paragraph | `{question, answer, context}` |
| `outline_summary` | Hierarchical section summaries | `{section, summary, level}` |
| `entity_knowledge` | Entity-centric knowledge aggregation | `{entity, mentions, contexts}` |
| `graph_neighborhood` | Paragraph + graph context | `{paragraph, siblings, summary}` |
| `section_relevance` | Binary relevance pairs | `{query, context, label}` |
| `multi_context_ranking` | Multi-context ranking | `{query, contexts: [{text, label}]}` |
| `graded_relevance` | Graded relevance (0-3) | `{query, context, grade}` |
| `cross_document_ranking` | Cross-document ranking | `{query, contexts: [{text, label, source_doc}]}` |
| `entity_context_ranking` | Entity + summary ranking | `{query, contexts: [{text, label, type}]}` |
LLM Integration
OpenAI
```json
{
  "kind": "basic_qa",
  "params": {
    "llm": {
      "provider": "openai",
      "model": "gpt-4",
      "api_key": "sk-...",
      "temperature": 0.7,
      "max_tokens": 512
    }
  }
}
```
Instead of hard-coding `api_key`, you can omit it and set the `OPENAI_API_KEY` environment variable.
Anthropic (Claude)
```json
{
  "kind": "outline_summary",
  "params": {
    "llm": {
      "provider": "anthropic",
      "model": "claude-3-5-sonnet-20241022",
      "temperature": 0.5,
      "max_tokens": 1024
    }
  }
}
```
Ollama (Local)
```json
{
  "kind": "basic_qa",
  "params": {
    "llm": {
      "provider": "ollama",
      "model": "llama3.2",
      "base_url": "http://localhost:11434",
      "temperature": 0.6
    }
  }
}
```
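The `provider` field in these config blocks suggests a simple dispatch on provider name. Here is a hedged sketch of that pattern; the stub functions stand in for real SDK calls and are not neuraparse's actual implementation:

```python
# Stubs standing in for real OpenAI / Anthropic / Ollama client calls.
def call_openai(cfg, prompt):
    return f"[openai:{cfg['model']}] {prompt}"

def call_anthropic(cfg, prompt):
    return f"[anthropic:{cfg['model']}] {prompt}"

def call_ollama(cfg, prompt):
    return f"[ollama:{cfg['model']}] {prompt}"

PROVIDERS = {"openai": call_openai, "anthropic": call_anthropic, "ollama": call_ollama}

def complete(llm_cfg, prompt):
    """Route a prompt to the handler named by llm_cfg['provider']."""
    try:
        handler = PROVIDERS[llm_cfg["provider"]]
    except KeyError:
        raise ValueError(f"unknown provider: {llm_cfg.get('provider')!r}")
    return handler(llm_cfg, prompt)

print(complete({"provider": "ollama", "model": "llama3.2"}, "hello"))  # [ollama:llama3.2] hello
```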
Architecture

```
Raw Documents   (Web, PDF, DOCX, Markdown, Text)
      |  Ingestion
      v
DocumentTree    (hierarchical: sections, paragraphs, metadata)
      |  Graph Building
      v
DocumentGraph   (nodes: DOCUMENT, SECTION, PARAGRAPH, ENTITY, SUMMARY)
      |  Recipe Execution
      v
Datasets        (RAG chunks, QA pairs, rankings, evaluations)
```
Advanced Usage
Custom Profiles
Create my_profiles.json:
```json
{
  "profiles": {
    "my_custom_profile": [
      "rag_chunks",
      "graded_relevance",
      "entity_context_ranking"
    ]
  }
}
```
Run it:
```shell
neuraparse run-profile <document_id> --profile my_custom_profile --profiles-config my_profiles.json
```
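Under the hood, resolving a profile name to its recipe list is a small lookup over the config file. A minimal sketch with the stdlib, assuming the `{"profiles": {...}}` layout shown above (this is not neuraparse's actual loader):

```python
import json

# Config text shaped like my_profiles.json above.
config_text = """
{"profiles": {"my_custom_profile": ["rag_chunks", "graded_relevance", "entity_context_ranking"]}}
"""
config = json.loads(config_text)

def resolve_profile(config, name):
    """Return the recipe list for a profile, with a helpful error if missing."""
    profiles = config.get("profiles", {})
    if name not in profiles:
        raise KeyError(f"profile {name!r} not found; known: {sorted(profiles)}")
    return profiles[name]

print(resolve_profile(config, "my_custom_profile"))
```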
Python API
```python
from neuraparse.core.ingestion import ingest_from_url
from neuraparse.core.graph_builder import build_document_graph
from neuraparse.recipes import execute_recipe, execute_profile

# Ingest
doc = ingest_from_url("https://example.com/article.html", base_dir="./data")

# Build graph
graph = build_document_graph(doc.id, base_dir="./data")

# Run a single recipe
output_path = execute_recipe(
    config_path="examples/rag_chunks.json",
    graph=graph,
    base_dir="./data",
    document_id=doc.id,
)

# Or run a profile
outputs = execute_profile(
    profile_name="graphrag",
    graph=graph,
    base_dir="./data",
    document_id=doc.id,
)
```
Testing
```shell
# Run all tests
pytest

# Run with coverage
pytest --cov=neuraparse --cov-report=html

# Run a specific test file
pytest tests/test_advanced_eval_recipes.py -v
```
Current status: 39 passed, 1 skipped.
Documentation
- Full Documentation (coming soon)
- Recipe Guide (coming soon)
- LLM Integration Guide (coming soon)
- Examples
Roadmap
- Core ingestion + parsing + graph building
- 10+ dataset recipes
- Profile system
- Real LLM integration (OpenAI, Anthropic, Ollama)
- Advanced evaluation recipes (graded relevance, cross-doc ranking)
- Multi-document graph merging
- Streaming ingestion for large documents
- Web UI for graph visualization
- PyPI package release
License
MIT License - see LICENSE for details.
Contributing
Contributions welcome! Please:
- Fork the repo
- Create a feature branch
- Add tests for new features
- Ensure all tests pass (`pytest`)
- Submit a pull request
Acknowledgments
Built with modern 2025 GraphRAG and agentic data pipeline patterns, inspired by:
- Microsoft GraphRAG
- LlamaIndex
- LangChain
- Recent ACL/NAACL/ICLR papers on retrieval evaluation
Download files
Source Distribution
Built Distribution
File details
Details for the file neuraparse-0.1.0a1.tar.gz.
File metadata
- Download URL: neuraparse-0.1.0a1.tar.gz
- Upload date:
- Size: 43.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `aa1d2d39aad10327b748007208ec9ed7e7581da73352024dc7b01152ebc54d72` |
| MD5 | `a875aaf0168356c3d4163841c7e2e16e` |
| BLAKE2b-256 | `fcb4917fc50eae420999bb52b0805705fc18ab0e3c26f014a7c0a40c732108b5` |
File details
Details for the file neuraparse-0.1.0a1-py3-none-any.whl.
File metadata
- Download URL: neuraparse-0.1.0a1-py3-none-any.whl
- Upload date:
- Size: 53.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `8f7f53a89d9df7aa4d2d6b65590ae9e6c864cbfafeb89f34d96856578730c8ba` |
| MD5 | `e2cbc8452117a8d8d33c230bd154e402` |
| BLAKE2b-256 | `a7a6fd3eb89cf422502785172b605cac7b77c1cc9e84f759c6bd3ca63b1a2f00` |