Document parsing tool for LLM training and RAG
DocParser 📄
DocParser is a tool that parses multiple file types into Markdown text for LLM training and other applications such as RAG.
Features 🎉
File types supported for parsing:
- PDF: Uses OCR to parse PDF documents and outputs the text in Markdown format. The results can be used for LLM pretraining, RAG, etc.
- HTML: Uses Jina to parse multiple HTML pages and outputs the text in Markdown format.
Install
From pip:
pip install docparser_feb
From repository:
pip install git+https://github.com/feb-co/DocParser.git
Or clone the repository and install it in editable mode:
git clone https://github.com/feb-co/DocParser.git
cd DocParser
pip install -e .
API/Functional
Pdf
From CLI
Run the following commands to parse a PDF:
export LOG_LEVEL="ERROR"
export DOC_PARSER_MODEL_DIR="xxx"
export DOC_PARSER_OPENAI_URL="xxx"
export DOC_PARSER_OPENAI_KEY="xxx"
export DOC_PARSER_OPENAI_MODEL="gpt-4-0125-preview"
export DOC_PARSER_OPENAI_RETRY="3"
# --inputs takes either a single PDF file or a directory of PDFs
docparser-pdf \
  --inputs path/to/file.pdf \
  --output_dir output_directory \
  --page_range '0:1' --mode 'figure latex' \
  --rendering --use_llm --overwrite_result
The following is a description of the relevant parameters:
usage: docparser-pdf [-h] --inputs INPUTS --output_dir OUTPUT_DIR [--page_range PAGE_RANGE] [--mode {plain,figure placehold,figure latex}] [--rendering] [--use_llm] [--overwrite_result]

options:
  -h, --help            show this help message and exit
  --inputs INPUTS       Directory where to store PDFs, or a file path to a single PDF
  --output_dir OUTPUT_DIR
                        Directory where to store the output results (md/json/images).
  --page_range PAGE_RANGE
                        The page range to parse the PDF, the format is 'start_page:end_page', that is, [start, end). Default: full.
  --mode {plain,figure placehold,figure latex}
                        The mode for parsing the PDF, to extract only the plain text or the text plus images.
  --rendering           Is it necessary to render the recognition results of the input PDF to output the recognition range? Default: False.
  --use_llm             Do you need to use LLM to format the parsing results? If so, please specify the corresponding parameters through the environment variables: DOC_PARSER_OPENAI_URL, DOC_PARSER_OPENAI_KEY, DOC_PARSER_OPENAI_MODEL. Default: False.
  --overwrite_result    If the parsed target file already exists, should it be rewritten? Default: False.
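The `--page_range` format 'start_page:end_page' is a half-open interval, so '0:1' selects a single page. The sketch below illustrates that convention only; `parse_page_range` is a hypothetical helper, not part of DocParser, and it assumes pages are counted from 0, as the '0:1' example above suggests.

```python
def parse_page_range(spec: str) -> range:
    """Turn 'start:end' into the half-open page range [start, end).

    Hypothetical helper for illustration; DocParser's own parsing of
    --page_range may differ.
    """
    start_text, end_text = spec.split(":")
    start, end = int(start_text), int(end_text)
    if start < 0 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    return range(start, end)


# '0:1' covers page 0 only, because the end page is excluded.
print(list(parse_page_range("0:1")))  # [0]
```

Under this reading, parsing the first three pages would use '0:3'.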
From Python
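The package's Python API is not documented in this section. As a stand-in, here is a minimal sketch that drives the documented `docparser-pdf` CLI from Python with `subprocess`. The helper names (`build_pdf_command`, `missing_llm_env`, `run_pdf_parser`) are assumptions made for illustration and are not part of DocParser; only the flags and environment variables come from the CLI description above.

```python
import os
import shlex
import subprocess

# Environment variables the CLI help lists as required for --use_llm.
LLM_ENV_VARS = (
    "DOC_PARSER_OPENAI_URL",
    "DOC_PARSER_OPENAI_KEY",
    "DOC_PARSER_OPENAI_MODEL",
)


def build_pdf_command(inputs, output_dir, page_range=None, mode=None,
                      rendering=False, use_llm=False, overwrite_result=False):
    """Hypothetical helper: assemble a docparser-pdf argument list."""
    cmd = ["docparser-pdf", "--inputs", inputs, "--output_dir", output_dir]
    if page_range is not None:
        cmd += ["--page_range", page_range]
    if mode is not None:
        cmd += ["--mode", mode]
    # Boolean switches are appended only when enabled.
    switches = (
        (rendering, "--rendering"),
        (use_llm, "--use_llm"),
        (overwrite_result, "--overwrite_result"),
    )
    cmd += [flag for enabled, flag in switches if enabled]
    return cmd


def missing_llm_env(env=None):
    """Return the LLM-related variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in LLM_ENV_VARS if not env.get(name)]


def run_pdf_parser(cmd, env=None):
    """Run the command, failing early if --use_llm lacks its variables."""
    if "--use_llm" in cmd:
        missing = missing_llm_env(env)
        if missing:
            raise RuntimeError("set these variables first: " + ", ".join(missing))
    return subprocess.run(cmd, check=True, env=env)


cmd = build_pdf_command(
    "path/to/file.pdf", "output_directory",
    page_range="0:1", mode="figure latex",
    rendering=True, use_llm=True, overwrite_result=True,
)
# Print the shell-equivalent command instead of executing it here.
print(shlex.join(cmd))
```

Calling `run_pdf_parser(cmd)` would then execute the parser, provided `docparser-pdf` is installed and on the `PATH`.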
Html
From CLI
Run the following command to parse an HTML page:
docparser-html https://github.com/mem0ai/mem0
The following is a description of the relevant parameters:
From Python
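The Python API for HTML parsing is not documented in this section either. One way to process several pages from Python is to invoke the documented `docparser-html` CLI once per URL. The sketch below assumes the command takes a single URL argument, as in the example above, and that a non-zero exit code signals failure; `build_html_command` and `parse_pages` are hypothetical helpers, and where the CLI writes its Markdown output is not specified here.

```python
import subprocess


def build_html_command(url):
    """Hypothetical helper: argument list for one docparser-html call."""
    return ["docparser-html", url]


def parse_pages(urls, runner=subprocess.run):
    """Run docparser-html once per URL and return the URLs that failed.

    `runner` is injectable so the loop can be exercised without the
    CLI installed; by default it is subprocess.run.
    """
    failed = []
    for url in urls:
        result = runner(build_html_command(url))
        if result.returncode != 0:
            failed.append(url)
    return failed
```

For example, `parse_pages(["https://github.com/mem0ai/mem0"])` would run the same command as the CLI example and return an empty list on success.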
File details
Details for the file docparser_feb-0.1.5.tar.gz.
File metadata
- Download URL: docparser_feb-0.1.5.tar.gz
- Upload date:
- Size: 424.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c4483878537bab79e9a77de2aa39536ad9bd48f16dd79f3073e4022211c15ed2 |
| MD5 | 53036b496cc2dec3bc6acec7f2507f15 |
| BLAKE2b-256 | 95e8279f26a16735d0e12e45d2249f290819da4c586f347036da52ddf83efbef |
File details
Details for the file docparser_feb-0.1.5-py3-none-any.whl.
File metadata
- Download URL: docparser_feb-0.1.5-py3-none-any.whl
- Upload date:
- Size: 437.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 30b5073ae113a410e93c36da8ae103e13d892f7188cb7a43361896c46c29e76e |
| MD5 | b6284b97d015439aa2e83faff250fd9a |
| BLAKE2b-256 | 4353deb7b830b2cb3f5ee73d1d4a473c615270a6beeea8f05646fbf2834819e0 |