
neuraparse

Production-grade agentic document-to-dataset pipeline with GraphRAG support.

Python 3.10+ · MIT License

โš ๏ธ Alpha Release: This is an early alpha version (0.1.0a1). APIs may change. Feedback and contributions welcome!

🚀 What is neuraparse?

neuraparse transforms documents into high-quality datasets for:

  • GraphRAG systems (entity extraction, graph neighborhoods, hierarchical summaries)
  • Retrieval evaluation (graded relevance, cross-document ranking, multi-context ranking)
  • LLM fine-tuning (QA pairs, instruction datasets, summarization)
  • Agentic workflows (memory, tool usage, knowledge graphs)

Key Features

✅ Multi-format ingestion: Web pages, PDFs, Office docs, Markdown, plain text
✅ Hierarchical parsing: Layout-aware DocumentTree (sections, paragraphs, metadata)
✅ GraphRAG-ready: DocumentGraph with structural + semantic nodes (entities, summaries)
✅ 10+ dataset recipes: RAG chunks, QA pairs, entity knowledge, graded relevance, cross-doc ranking
✅ Profile system: Bundle recipes into workflows (graphrag, eval_ranking, eval_advanced)
✅ Real LLM integration: OpenAI, Anthropic (Claude), Ollama (local models)
✅ Production-ready: 39 tests, type hints, comprehensive error handling


📦 Installation

# Basic installation
pip install neuraparse

# With LLM providers
pip install neuraparse[llm-openai]      # OpenAI GPT-4/3.5
pip install neuraparse[llm-anthropic]   # Anthropic Claude
pip install neuraparse[llm-ollama]      # Local Ollama models
pip install neuraparse[llm-all]         # All LLM providers

# With document parsing
pip install neuraparse[pdf]             # PDF support
pip install neuraparse[office]          # DOCX support
pip install neuraparse[recipes-yaml]    # YAML recipe configs

# Full installation
pip install neuraparse[llm-all,pdf,office,recipes-yaml]

🎯 Quick Start

1. Ingest a document

# From a web page
neuraparse ingest https://example.com/article.html

# From a local file
neuraparse ingest path/to/document.pdf

# From markdown
neuraparse ingest path/to/notes.md

2. Build a document graph

neuraparse build-graph <document_id>

This creates a DocumentGraph with:

  • Structural nodes: DOCUMENT → SECTION → PARAGRAPH hierarchy
  • Semantic nodes: ENTITY (keywords), SUMMARY (section summaries)
  • Edges: parent_of, next_sibling, mentions, summarizes
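The node and edge vocabulary above can be sketched as plain data structures. This is an illustration of the schema only, not neuraparse's actual classes:

```python
from dataclasses import dataclass, field
from enum import Enum

class NodeType(Enum):
    DOCUMENT = "document"
    SECTION = "section"
    PARAGRAPH = "paragraph"
    ENTITY = "entity"      # semantic: keywords
    SUMMARY = "summary"    # semantic: section summaries

@dataclass
class Node:
    id: str
    type: NodeType
    text: str = ""

@dataclass
class Edge:
    source: str    # node id
    target: str    # node id
    relation: str  # parent_of, next_sibling, mentions, summarizes

@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # id -> Node
    edges: list = field(default_factory=list)

    def children(self, node_id: str) -> list:
        """Follow parent_of edges to a node's structural children."""
        return [self.nodes[e.target] for e in self.edges
                if e.source == node_id and e.relation == "parent_of"]

# A tiny DOCUMENT -> SECTION -> PARAGRAPH chain
g = Graph()
g.nodes = {
    "doc1": Node("doc1", NodeType.DOCUMENT),
    "sec1": Node("sec1", NodeType.SECTION, "Introduction"),
    "par1": Node("par1", NodeType.PARAGRAPH, "neuraparse transforms documents..."),
}
g.edges = [Edge("doc1", "sec1", "parent_of"), Edge("sec1", "par1", "parent_of")]
print([n.id for n in g.children("doc1")])  # → ['sec1']
```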

3. Generate datasets

Option A: Run a single recipe

# Generate RAG chunks
neuraparse run-recipe <document_id> --recipe examples/rag_chunks.json

# Generate QA pairs with OpenAI
neuraparse run-recipe <document_id> --recipe examples/recipe_with_openai.json

# Generate graded relevance dataset
neuraparse run-recipe <document_id> --recipe examples/graded_relevance.json

Option B: Run a profile (multiple recipes)

# GraphRAG profile (6 recipes: chunks, QA, summaries, entities, neighborhoods, relevance)
neuraparse run-profile <document_id> --profile graphrag

# Evaluation ranking profile (2 recipes: section_relevance, multi_context_ranking)
neuraparse run-profile <document_id> --profile eval_ranking

# Advanced evaluation profile (3 recipes: graded_relevance, cross_doc_ranking, entity_context_ranking)
neuraparse run-profile <document_id> --profile eval_advanced

📚 Available Recipes

Recipe Description Output Format
rag_chunks Paragraph chunks for RAG {chunk_id, text, metadata}
basic_qa QA pairs per paragraph {question, answer, context}
outline_summary Hierarchical section summaries {section, summary, level}
entity_knowledge Entity-centric knowledge aggregation {entity, mentions, contexts}
graph_neighborhood Paragraph + graph context {paragraph, siblings, summary}
section_relevance Binary relevance pairs {query, context, label}
multi_context_ranking Multi-context ranking {query, contexts: [{text, label}]}
graded_relevance Graded relevance (0-3) {query, context, grade}
cross_document_ranking Cross-doc ranking {query, contexts: [{text, label, source_doc}]}
entity_context_ranking Entity + summary ranking {query, contexts: [{text, label, type}]}
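
For concreteness, records with the shapes listed above might look like this. The field names come from the table; the values are invented, and one-record-per-line JSONL serialization is a common convention for such datasets rather than a documented guarantee:

```python
import json

# Hypothetical graded_relevance record: {query, context, grade} with grade in 0-3
graded = {"query": "What is GraphRAG?",
          "context": "GraphRAG systems combine knowledge graphs with retrieval.",
          "grade": 3}

# Hypothetical multi_context_ranking record: {query, contexts: [{text, label}]}
ranking = {"query": "What is GraphRAG?",
           "contexts": [{"text": "GraphRAG systems combine graphs with retrieval.", "label": 1},
                        {"text": "An unrelated paragraph.", "label": 0}]}

# One JSON object per line (JSONL)
for record in (graded, ranking):
    print(json.dumps(record))
```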

🧠 LLM Integration

OpenAI

{
  "kind": "basic_qa",
  "params": {
    "llm": {
      "provider": "openai",
      "model": "gpt-4",
      "api_key": "sk-...",
      "temperature": 0.7,
      "max_tokens": 512
    }
  }
}

Omit api_key to read it from the OPENAI_API_KEY environment variable instead.

Anthropic (Claude)

{
  "kind": "outline_summary",
  "params": {
    "llm": {
      "provider": "anthropic",
      "model": "claude-3-5-sonnet-20241022",
      "temperature": 0.5,
      "max_tokens": 1024
    }
  }
}

Ollama (Local)

{
  "kind": "basic_qa",
  "params": {
    "llm": {
      "provider": "ollama",
      "model": "llama3.2",
      "base_url": "http://localhost:11434",
      "temperature": 0.6
    }
  }
}
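
All three provider blocks share the same shape, differing only in provider-specific defaults. A minimal sketch of how such a config might be normalized before use; this is an illustration, not neuraparse's internal resolution logic, and the ANTHROPIC_API_KEY fallback is an assumption:

```python
import os

def resolve_llm_params(llm: dict) -> dict:
    """Fill in provider-specific defaults for an `llm` config block."""
    provider = llm["provider"]
    resolved = dict(llm)
    if provider == "openai":
        # Fall back to the environment when the recipe omits api_key.
        resolved.setdefault("api_key", os.environ.get("OPENAI_API_KEY", ""))
    elif provider == "anthropic":
        resolved.setdefault("api_key", os.environ.get("ANTHROPIC_API_KEY", ""))
    elif provider == "ollama":
        # Local models need no key, only a server URL.
        resolved.setdefault("base_url", "http://localhost:11434")
    else:
        raise ValueError(f"unknown provider: {provider}")
    return resolved

print(resolve_llm_params({"provider": "ollama", "model": "llama3.2"})["base_url"])
# → http://localhost:11434
```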

๐Ÿ—๏ธ Architecture

┌─────────────────┐
│  Raw Documents  │  (Web, PDF, DOCX, Markdown, Text)
└────────┬────────┘
         │ Ingestion
         ▼
┌─────────────────┐
│ DocumentTree    │  (Hierarchical: sections, paragraphs, metadata)
└────────┬────────┘
         │ Graph Building
         ▼
┌─────────────────┐
│ DocumentGraph   │  (Nodes: DOCUMENT, SECTION, PARAGRAPH, ENTITY, SUMMARY)
└────────┬────────┘
         │ Recipe Execution
         ▼
┌─────────────────┐
│   Datasets      │  (RAG chunks, QA pairs, rankings, evaluations)
└─────────────────┘

🔬 Advanced Usage

Custom Profiles

Create my_profiles.json:

{
  "profiles": {
    "my_custom_profile": [
      "rag_chunks",
      "graded_relevance",
      "entity_context_ranking"
    ]
  }
}

Run it:

neuraparse run-profile <document_id> --profile my_custom_profile --profiles-config my_profiles.json
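
Since a profile is just a list of recipe names, a quick sanity check against the recipe table above can catch typos before a long run. A minimal sketch, where load_profiles and KNOWN_RECIPES are illustrative helpers rather than part of neuraparse's API:

```python
import json

# Recipe names from the "Available Recipes" table above
KNOWN_RECIPES = {
    "rag_chunks", "basic_qa", "outline_summary", "entity_knowledge",
    "graph_neighborhood", "section_relevance", "multi_context_ranking",
    "graded_relevance", "cross_document_ranking", "entity_context_ranking",
}

def load_profiles(path: str) -> dict:
    """Load a profiles config and reject unknown recipe names early."""
    with open(path) as f:
        profiles = json.load(f)["profiles"]
    for name, recipes in profiles.items():
        unknown = set(recipes) - KNOWN_RECIPES
        if unknown:
            raise ValueError(
                f"profile {name!r} references unknown recipes: {sorted(unknown)}")
    return profiles
```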

Python API

from neuraparse.core.ingestion import ingest_from_url
from neuraparse.core.graph_builder import build_document_graph
from neuraparse.recipes import execute_recipe, execute_profile

# Ingest
doc = ingest_from_url("https://example.com/article.html", base_dir="./data")

# Build graph
graph = build_document_graph(doc.id, base_dir="./data")

# Run recipe
output_path = execute_recipe(
    config_path="examples/rag_chunks.json",
    graph=graph,
    base_dir="./data",
    document_id=doc.id
)

# Or run profile
outputs = execute_profile(
    profile_name="graphrag",
    graph=graph,
    base_dir="./data",
    document_id=doc.id
)

🧪 Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=neuraparse --cov-report=html

# Run specific test file
pytest tests/test_advanced_eval_recipes.py -v

Current status: ✅ 39 passed, 1 skipped


📖 Documentation


๐Ÿ›ฃ๏ธ Roadmap

  • Core ingestion + parsing + graph building
  • 10+ dataset recipes
  • Profile system
  • Real LLM integration (OpenAI, Anthropic, Ollama)
  • Advanced evaluation recipes (graded relevance, cross-doc ranking)
  • Multi-document graph merging
  • Streaming ingestion for large documents
  • Web UI for graph visualization
  • PyPI package release

📄 License

MIT License — see LICENSE for details.


๐Ÿค Contributing

Contributions welcome! Please:

  1. Fork the repo
  2. Create a feature branch
  3. Add tests for new features
  4. Ensure all tests pass (pytest)
  5. Submit a pull request

๐Ÿ™ Acknowledgments

Built with modern 2025 GraphRAG and agentic data pipeline patterns, inspired by:

  • Microsoft GraphRAG
  • LlamaIndex
  • LangChain
  • Recent ACL/NAACL/ICLR papers on retrieval evaluation
