Parsers and ingestors for different file types and formats
About
This repo provides the server-side code that the llmsherpa API connects to. It contains custom RAG (retrieval-augmented generation) friendly parsers for the following file formats:
PDF
The PDF parser is a rule-based parser that uses text coordinates (bounding boxes), graphics, and font data from the nlmatics-modified version of Tika found here: https://github.com/nlmatics/nlm-tika. The PDF parser works off the text layer and also offers an OCR option (apply_ocr) to automatically use OCR if there are scanned pages in your PDFs. The OCR feature is based on the nlmatics-modified version of Tika, which uses Tesseract underneath. Check out the notebook pdf_visual_ingestor_step_by_step to experiment directly with the PDF parser.
The PDF parser offers the following features:
1. Sections and subsections along with their levels.
2. Paragraphs (combines lines).
3. Links between sections and paragraphs.
4. Tables along with the section the tables are found in.
5. Lists and nested lists.
6. Joining of content spread across pages.
7. Removal of repeating headers and footers.
8. Watermark removal.
9. OCR with bounding boxes.
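To illustrate what removing repeating headers and footers can involve, here is a minimal sketch. It is not the nlm-ingestor implementation: the function names, the `edge_lines` and `min_fraction` parameters, and the digit-masking heuristic are all assumptions made for this example. The idea is simply that a line which recurs at the top or bottom of most pages is probably a header or footer.

```python
import re
from collections import Counter


def _normalize(line):
    """Mask digits so that 'Page 3' and 'Page 4' count as the same line."""
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_repeating_lines(pages, edge_lines=1, min_fraction=0.6):
    """Drop lines at page edges that recur on most pages (hypothetical helper).

    pages: list of pages, each a list of text lines in reading order.
    edge_lines: how many lines at the top and bottom of a page are candidates.
    min_fraction: share of pages a line must appear on before it is removed.
    """
    if len(pages) < 2:
        # Nothing can "repeat" in a single-page document.
        return [list(page) for page in pages]

    # Count on how many pages each normalized edge line appears.
    counts = Counter()
    for page in pages:
        edges = page[:edge_lines] + page[-edge_lines:]
        counts.update({_normalize(line) for line in edges})

    threshold = min_fraction * len(pages)
    repeated = {text for text, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = []
        for i, line in enumerate(page):
            at_edge = i < edge_lines or i >= len(page) - edge_lines
            if at_edge and _normalize(line) in repeated:
                continue  # treat as a repeating header or footer
            kept.append(line)
        cleaned.append(kept)
    return cleaned
```

Only edge lines are candidates, so a sentence that happens to repeat in the body of several pages is left alone.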
HTML
A special HTML parser that creates layout-aware blocks, which improves RAG performance by producing higher-quality chunks.
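As a rough illustration of what "layout-aware blocks" can mean, the sketch below walks an HTML document with Python's standard `html.parser` and records, for each paragraph or list item, the chain of headings it sits under. This is an assumption-laden toy, not the HTML ingestor's actual code: the `BlockExtractor` class, the block dictionary keys, and `to_chunk` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li"}


class BlockExtractor(HTMLParser):
    """Collect text blocks together with their parent headings (toy example)."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._tag = None       # block tag currently being read, if any
        self._text = []        # text fragments of the current block
        self._headings = []    # stack of (level, heading text)

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # inline tag such as <b>, or a tag we do not track
        text = " ".join("".join(self._text).split())
        self._tag = None
        if not text:
            return
        if tag.startswith("h"):
            level = int(tag[1:])
            # A new heading closes all headings at the same or a deeper level.
            while self._headings and self._headings[-1][0] >= level:
                self._headings.pop()
            parents = [t for _, t in self._headings]
            self._headings.append((level, text))
            self.blocks.append(
                {"type": "header", "level": level, "text": text, "parents": parents}
            )
        else:
            kind = "list_item" if tag == "li" else "para"
            parents = [t for _, t in self._headings]
            self.blocks.append({"type": kind, "text": text, "parents": parents})


def html_to_blocks(html):
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def to_chunk(block):
    """Prefix a block with its heading path so the chunk carries its context."""
    return " > ".join(block["parents"] + [block["text"]])
```

Carrying the heading path into each chunk is one plausible way to give a retriever the context that a bare paragraph would lack.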
Text
A special text parser that tries to identify lists, tables, headers, etc. purely from the text, without any visual, font, or bounding-box information.
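The following sketch shows the general flavor of classifying lines from text alone. It is a simplified illustration under stated assumptions, not the text ingestor's real rule set: the `classify_line` function, the regular expression, and the thresholds (for example, "at most 8 words" for a header) are invented for this example.

```python
import re

# Bullets ("-", "*", "•"), numbered items ("1." or "1)"), lettered items ("a)" or "(a)").
LIST_ITEM = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)]|\(?[a-z]\))\s+")


def classify_line(line):
    """Guess the role of a line using only its characters (toy heuristic)."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if LIST_ITEM.match(line):
        return "list_item"
    # Pipe-delimited rows, or three or more cells separated by wide gaps or tabs.
    cells = re.split(r"\s{2,}|\t", stripped)
    if stripped.count("|") >= 2 or len(cells) >= 3:
        return "table_row"
    # Short lines without closing punctuation, in upper or title case.
    words = stripped.split()
    if (
        len(words) <= 8
        and not stripped.endswith((".", ",", ";", ":"))
        and (stripped.isupper() or stripped.istitle())
    ):
        return "header"
    return "para"
```

For instance, `classify_line("Name    Qty    Price")` returns `"table_row"` because the wide gaps split the line into three cells, while `classify_line("INTRODUCTION")` returns `"header"`. Heuristics like these will misfire on some inputs (a numbered heading such as "2. Methods" is classified as a list item here), which is the usual trade-off of text-only rules.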
DOCX, PPTX and any other format supported by Apache Tika
There are two ways to process these types of documents:
- HTML output from Tika for these file types is used and parsed by the HTML parser
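As a sketch of the first half of that pipeline, the snippet below asks a Tika server for HTML output. It relies on the stock Apache Tika Server convention of a `PUT` to `/tika` with an `Accept: text/html` header on the default port 9998; whether the nlmatics-modified jar keeps exactly these defaults, and how nlm-ingestor itself calls Tika, are assumptions not confirmed by this README. The function names are hypothetical.

```python
import urllib.request


def build_tika_html_request(file_bytes, tika_url="http://localhost:9998/tika"):
    """Build a request asking Tika to render a document as HTML.

    The URL and port are the stock Tika Server defaults (an assumption here).
    """
    return urllib.request.Request(
        tika_url,
        data=file_bytes,
        headers={"Accept": "text/html"},
        method="PUT",
    )


def tika_to_html(path, tika_url="http://localhost:9998/tika"):
    """Send a local file to a running Tika server and return the HTML it produces."""
    with open(path, "rb") as f:
        request = build_tika_html_request(f.read(), tika_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

The returned HTML string could then be handed to an HTML parser; `tika_to_html` needs a Tika server to be running, while `build_tika_html_request` can be inspected offline.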
Nlm Modified Tika
The nlmatics-modified version of Tika can be found in the 2.4.1-nlm branch here: https://github.com/nlmatics/nlm-tika/tree/2.4.1-nlm. For convenience, a compiled jar file of the code is included in this repo in the jars/ folder. In some cases, your PDFs may cause errors in the Java server; you will then need to modify the code there to resolve the issue and recompile the jar file.
Installation steps:
Run each step directly
- Install the latest version of Java from https://www.oracle.com/java/technologies/downloads/
- Run the tika server:
java -jar <path_to_nlm_ingestor>/jars/tika-server-standard-nlm-modified-2.4.1_v6.jar
- Install the ingestor
pip install nlm-ingestor
- Run the ingestor
python -m nlm_ingestor.ingestion_daemon
Run the docker file
A docker image is available via the GitHub container registry. Before running the following code, you may need to authenticate with docker first:
cat ~/TOKEN.txt | docker login https://ghcr.io -u USERNAME --password-stdin
where TOKEN.txt contains the token you create as described here: https://docs.github.com/en/enterprise-server@3.7/packages/working-with-a-github-packages-registry/working-with-the-docker-registry
docker pull ghcr.io/nlmatics/nlm-ingestor:latest
docker run ghcr.io/nlmatics/nlm-ingestor:latest
Test the ingestor server
Sample code to test the server with the llmsherpa parser is in this notebook.
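For a quick check without llmsherpa, a raw HTTP call can be sketched with the standard library alone. Everything specific here is an assumption rather than something this README states: the port `5001`, the `/api/parseDocument` path, the `renderFormat` and `applyOcr` query parameters, and the `file` upload field are taken from how llmsherpa-style URLs are commonly written, so verify them against your running server. The helper names are hypothetical.

```python
import json
import os
import urllib.parse
import urllib.request
import uuid


def build_parse_url(base_url="http://localhost:5001", render_format="all", apply_ocr=False):
    """Build the parse endpoint URL (path and parameter names are assumptions)."""
    query = {"renderFormat": render_format}
    if apply_ocr:
        query["applyOcr"] = "yes"
    return f"{base_url.rstrip('/')}/api/parseDocument?{urllib.parse.urlencode(query)}"


def encode_multipart(field, filename, content, boundary=None):
    """Encode a single file as a multipart/form-data body."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def parse_document(path, url):
    """Upload a local file to the ingestor and return the decoded JSON response."""
    with open(path, "rb") as f:
        # "file" as the upload field name is an assumption.
        body, content_type = encode_multipart("file", os.path.basename(path), f.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, a call such as `parse_document("sample.pdf", build_parse_url(apply_ocr=True))` would exercise the OCR path; `sample.pdf` is a placeholder for a file of your own.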
Rule based parser vs model based parser
Over the course of 4 years, the nlmatics team evaluated a variety of options, including a YOLO-based vision parser developed by Tom Liu and Yi Zhang. Ultimately, we settled on the rule-based parser for the following reasons:
- It is substantially (100x) faster than any vision parser because, at a bare minimum, a vision parser has to render every page of a PDF as an image (even pages that have a text layer). In our opinion, a vision parser is the better option for OCR'd PDFs without a text layer, or for small PDF files consisting of form-like data; for larger text-layer PDFs spanning hundreds of pages, a rule-based parser like ours is more practical.
- No special hardware is needed to run this parser if you are not using the PDF OCR feature. You can run this on hardware from the early 2000s!
- We found vision parsers (or any parser for that matter, including this one) to be error-prone, and the options for fixing errors in a model were not pretty:
  - Adding more examples to your training set may degrade the accuracy the model gained from previous learning and create uncertainty in previously working code.
  - Using rule-based ideas to fix model-based parser issues gets us back to writing a lot of rules again.
Credits
The PDF parser visual_ingestor and new_indent_parser were written by Ambika Sukla, with additional contributions from Reshav Abraham, who wrote the initial code to modify Tika; Tom Liu, who wrote the original Indent Parser; and Kiran Panicker, who made several improvements to parsing speed, table parsing accuracy, indent parsing accuracy, and reordering accuracy.
The HTML Ingestor was written by Tom Liu.
The Markdown Parser was written by Yi Zhang.
The Text Ingestor was written by Reshav Abraham.
The XML Ingestor was written by Ambika Sukla primarily to process PubMed XMLs.
The line_parser, which serves as a core sentence-processing utility for all the other parsers, was written by Ambika Sukla.
We are also thankful to the Apache PDFBox and Tika developer community for their years of work in providing the base for the PDF parser.