A fast and light-weight library for ingesting and chunking files
Project description
easyparser
A fast and lightweight Python library for intelligent document parsing, chunking, and analysis.
📊 Library Responsibility
flowchart LR
Files([Documents]) -->|Input| Parse
Parse -->|Structured Data| Chunk
Chunk -->|Semantic Units| Index
Index -->|Indexed Data| Search
Search -->|Retrieved Chunks| Context[Represent Context]
Context -->|Context + Query| LLM([Language Model])
classDef primary fill:#4CAF50,stroke:#388E3C,color:black,stroke-width:2px,font-weight:bold
classDef secondary fill:#64B5F6,stroke:#1976D2,color:black,stroke-width:1px
classDef document fill:#E1BEE7,stroke:#9C27B0,color:black,stroke-width:1px,stroke-dasharray: 5 5
class Parse,Chunk,Context primary
class Index,Search secondary
class Files,LLM document
subgraph easyparserLibrary["easyparser Library Focus"]
Parse
Chunk
Context
end
The easyparser library focuses on the critical first steps in the document processing pipeline: Parsing documents into a unified format, Chunking content intelligently, and Representing retrieval context for LLMs.
🌟 Features
- Universal Document Processing: Parse PDFs, Word documents, PowerPoint, Excel, Markdown, HTML, images, audio, video, and more with a unified API
- Intelligent Structure Preservation: Keeps document hierarchy, tables, lists, and formatting intact
- Modular Architecture: Choose parsers and processing steps based on your needs
- Multimodal Support: Extract and process text, images, tables, and other elements
- Smart Chunking Strategies: Split content based on headers, semantic meaning, or custom rules
- LLM-Ready: Output chunks ready for embedding or use with language models
- Extensible: Easy to add custom parsers and processors
🚀 Quick Start
from easyparser import parse
# Parse a file or directory into chunks
chunks = parse("path/to/document.pdf")
# Print the text
print(chunks.render())
# Or get a hierarchical structure
for depth, chunk in chunks.walk():
print(" " * depth + f"{chunk.ctype}: {chunk.content[:30]}...")
Sample output:
# truncated output
Chunk(ctype=root, content=File assets/sample.docx)
Chunk(ctype=para, content=(abridged from Sam Gross’s doc: https://docs.googl... (1 more words))
Chunk(ctype=para, content=The goal of this project is to remove the global i... (31 more words))
Chunk(ctype=list, mimetype=text/plain)
Chunk(ctype=list, content=1. That it is feasible to remove the GIL from Python.)
Chunk(ctype=list, content=2. That the tradeoffs are manageable and the effor... (7 more words))
Chunk(ctype=list, content=3. That the main technical ideas of this project (... (16 more words))
Chunk(ctype=header, content=Why remove the GIL?)
Chunk(ctype=para, content=The GIL is a major obstacle to concurrency. For sc... (52 more words))
Chunk(ctype=header, content=Design)
Chunk(ctype=header, content=Reference counting)
Chunk(ctype=para, content=CPython’s reference counting system is not thread-... (48 more words))
...
📦 Installation
pip install easyparser
For specialized parsers, install with extras:
pip install easyparser[pdf,ocr,audio] # Install with PDF, OCR, and audio support
External Dependencies
Some parsers require external tools to be installed on your system:
- pandoc: Required for parsing markup languages (EPUB, HTML, RTF, RST, DOCX, etc.)
# Ubuntu/Debian sudo apt-get install pandoc # macOS brew install pandoc
For legacy Office documents support
-
libreoffice: Required for file conversion (DOC → DOCX, PPT → PPTX, PPTX → PDF, etc.)
# Ubuntu/Debian sudo apt-get install libreoffice # macOS brew install --cask libreoffice
🔍 Supported File Types
- Documents: PDF, DOCX, PPTX, XLSX, EPUB
- Markup: HTML, Markdown
- Data: CSV, JSON, YAML, TOML
- Media: Images (JPEG, PNG), Audio (MP3, WAV), Video (MP4)
- Code: Various programming languages
- Directories: Process entire folders of mixed documents
🧩 Components
Parsers
Parsers read different file formats and convert them into a unified chunk structure:
from easyparser.controller import get_controller
from easyparser.parser import FastPDF, Markdown, RapidOCRImageText
# Get a controller and parse a document
ctrl = get_controller()
chunk = ctrl.as_root_chunk("file.path")
# Parse a PDF with a faster parser
pdf_chunks = FastPDF.run(chunk)
# Parse Markdown into a structured tree
md_chunks = Markdown.run(chunk)
# Extract text from images using OCR
img_chunks = RapidOCRImageText.run(chunk)
Chunking Strategies
Split content into meaningful chunks with different strategies:
from easyparser.split import AgenticChunker, FlattenToMarkdown, LumberChunker
# Flatten hierarchical content while preserving structure
chunks = FlattenToMarkdown.run(doc, max_size=500)
# Use an LLM for semantic chunking (requires LLM setup)
chunks = LumberChunker.run(doc, chunk_size=800)
# Use an LLM for agentic chunking (requires LLM setup)
chunks = AgenticChunker.run(doc, chunk_size=800)
Processing Pipeline
Build custom processing pipelines:
from easyparser.controller import get_controller
from easyparser.parser.pdf import FastPDF
from easyparser.split import MarkdownSplitByHeading, Propositionizer
# Get a controller and parse a document
ctrl = get_controller()
chunk = ctrl.as_root_chunk("document.pdf")
FastPDF.run(chunk)
# Process the parsed document
sections = MarkdownSplitByHeading.run(chunk, min_chunk_size=200)
propositions = Propositionizer.run(sections)
🔧 Advanced Usage
Custom Parsers
Create your own parsers for specialized formats:
from easyparser.base import BaseOperation, Chunk, ChunkGroup, CType
class MyCustomParser(BaseOperation):
@classmethod
def run(cls, chunks: Chunk | ChunkGroup, **kwargs) -> ChunkGroup:
# Custom parsing logic here
return processed_chunks
LLM Integration
Add LLM support for semantic splitting and processing:
Add LLM support
By default, easyparser uses the llm (repo) with
alias easyparser-llm to interact with LLM. Please setup the desired LLM
provider according to their docs, and set the alias easyparser-llm to that
model. Example, using Gemini model (as of April 2025):
# Install the LLM gemini
$ llm install llm-gemini
# Set the Gemini API key
$ llm keys set gemini
# Alias LLM to 'easyparser-llm' (you can see other model ids by running `llm models`)
$ llm aliases set easyparser-llm gemini-2.5-flash-preview-04-17
# Check the LLM is working correctly
$ llm -m easyparser-llm "Explain quantum mechanics in 100 words"
Once LLM is set up, you can use LLM-based chunkers:
from easyparser.split import LumberChunker, AgenticChunker, Propositionizer
chunks = LumberChunker.run(doc) # Semantically split content
chunks = Propositionizer.run(doc) # Convert to atomic propositions
chunks = AgenticChunker.run(doc) # Group chunks by topic
📊 Example Applications
- Create knowledge bases from document collections
- Build RAG (Retrieval-Augmented Generation) systems
- Extract structured data from unstructured documents
- Generate document summaries with preserved structure
🤝 Contributing
Contributions are welcome! Ensure that you have git and git-lfs installed. git will be used for version control and git-lfs will be used for test data.
# Clone the repository
git clone git@github.com:easyparser/easyparser.git
cd easyparser
# Fetch the test data
git submodule update --init --recursive
# Install development dependnecy
pip install -e ".[dev]"
# Initialize pre-commit hooks
pre-commit install
📄 License
Apache 2.0 License. See the LICENSE file for details.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file easyparser-0.0.3.tar.gz.
File metadata
- Download URL: easyparser-0.0.3.tar.gz
- Upload date:
- Size: 90.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
430a515edb7673dfa0f24ab148370723997ad64ea0577c514653944c81bf46cf
|
|
| MD5 |
5e9793ac20ff79b4ed8a1977825976ff
|
|
| BLAKE2b-256 |
819d408fe5a6c4a6db5cf9bea78b50e3a3d5bf8547eb0cd61a01873f344c420a
|
Provenance
The following attestation bundles were made for easyparser-0.0.3.tar.gz:
Publisher:
publish.yml on easyparser/easyparser
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
easyparser-0.0.3.tar.gz -
Subject digest:
430a515edb7673dfa0f24ab148370723997ad64ea0577c514653944c81bf46cf - Sigstore transparency entry: 243620986
- Sigstore integration time:
-
Permalink:
easyparser/easyparser@f1b3abf4f15e8ee2540044e4eccea2729fed1f89 -
Branch / Tag:
refs/tags/v0.0.3 - Owner: https://github.com/easyparser
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@f1b3abf4f15e8ee2540044e4eccea2729fed1f89 -
Trigger Event:
push
-
Statement type:
File details
Details for the file easyparser-0.0.3-py3-none-any.whl.
File metadata
- Download URL: easyparser-0.0.3-py3-none-any.whl
- Upload date:
- Size: 103.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c8fc0651b3b1e041a982fbebb0a7a0de0b89569bb0c31087548e4bb39efeaecd
|
|
| MD5 |
6fa570cc93bfcc2f2f7f76792c70ce98
|
|
| BLAKE2b-256 |
11d2e2ad14bbe0b841d0faf4bec958c0cadc81bbb2cd411a5f5409d3c768c030
|
Provenance
The following attestation bundles were made for easyparser-0.0.3-py3-none-any.whl:
Publisher:
publish.yml on easyparser/easyparser
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
easyparser-0.0.3-py3-none-any.whl -
Subject digest:
c8fc0651b3b1e041a982fbebb0a7a0de0b89569bb0c31087548e4bb39efeaecd - Sigstore transparency entry: 243620987
- Sigstore integration time:
-
Permalink:
easyparser/easyparser@f1b3abf4f15e8ee2540044e4eccea2729fed1f89 -
Branch / Tag:
refs/tags/v0.0.3 - Owner: https://github.com/easyparser
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@f1b3abf4f15e8ee2540044e4eccea2729fed1f89 -
Trigger Event:
push
-
Statement type: