Document parsing tool for LLM training and RAG

DocParser 📄

DocParser is a document parsing tool for LLM training and other applications such as RAG. It parses several file types and outputs the text as Markdown.

Features 🎉

File types supported for parsing:

  • PDF: Uses OCR to parse PDF documents and outputs the text in Markdown format. The results can be used for LLM pretraining, RAG, and similar applications.
  • HTML: Uses Jina to parse multiple HTML pages and outputs the text in Markdown.

Install

From pip:

pip install docparser_feb

From repository:

pip install git+https://github.com/feb-co/DocParser.git

Or clone the repository and install it in editable mode:

git clone https://github.com/feb-co/DocParser.git
cd DocParser
pip install -e .

API/Functional

PDF

From CLI

Run the following script to parse PDFs. The --inputs argument takes either a single PDF file or a directory of PDFs (the example passes a single file); the DOC_PARSER_OPENAI_* variables configure the LLM used by --use_llm:

export LOG_LEVEL="ERROR"
export DOC_PARSER_MODEL_DIR="xxx"
export DOC_PARSER_OPENAI_URL="xxx"
export DOC_PARSER_OPENAI_KEY="xxx"
export DOC_PARSER_OPENAI_MODEL="gpt-4-0125-preview"
export DOC_PARSER_OPENAI_RETRY="3"
docparser-pdf \
    --inputs path/to/file.pdf \
    --output_dir output_directory \
    --page_range '0:1' --mode 'figure latex' \
    --rendering --use_llm --overwrite_result
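
The same invocation can be assembled from Python before handing it to a process runner. The sketch below is illustrative only: build_pdf_command and missing_llm_env are hypothetical helpers, not part of DocParser, and they rely solely on the flags and environment variables documented in this README.

```python
import os
import shlex

# Environment variables that the options description names for --use_llm.
LLM_ENV_VARS = (
    "DOC_PARSER_OPENAI_URL",
    "DOC_PARSER_OPENAI_KEY",
    "DOC_PARSER_OPENAI_MODEL",
)


def missing_llm_env(env):
    """Return the names of LLM variables that are unset or empty in env."""
    return [name for name in LLM_ENV_VARS if not env.get(name)]


def build_pdf_command(inputs, output_dir, page_range=None, mode=None,
                      rendering=False, use_llm=False, overwrite_result=False):
    """Build the docparser-pdf argument list (hypothetical helper)."""
    cmd = ["docparser-pdf", "--inputs", inputs, "--output_dir", output_dir]
    if page_range is not None:
        cmd += ["--page_range", page_range]
    if mode is not None:
        cmd += ["--mode", mode]
    if rendering:
        cmd.append("--rendering")
    if use_llm:
        cmd.append("--use_llm")
    if overwrite_result:
        cmd.append("--overwrite_result")
    return cmd


if __name__ == "__main__":
    cmd = build_pdf_command("path/to/file.pdf", "output_directory",
                            page_range="0:1", mode="figure latex", use_llm=True)
    missing = missing_llm_env(os.environ)
    if missing:
        print("Set these before using --use_llm:", ", ".join(missing))
    else:
        # To execute, pass cmd to subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with values that contain spaces, such as the mode 'figure latex'.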

The following is a description of the relevant parameters:

usage: docparser-pdf [-h] --inputs INPUTS --output_dir OUTPUT_DIR [--page_range PAGE_RANGE] [--mode {plain,figure placehold,figure latex}] [--rendering] [--use_llm] [--overwrite_result]

options:
  -h, --help            show this help message and exit
  --inputs INPUTS       Directory where to store PDFs, or a file path to a single PDF
  --output_dir OUTPUT_DIR
                        Directory where to store the output results (md/json/images).
  --page_range PAGE_RANGE
                        The page range to parse the PDF, the format is 'start_page:end_page', that is, [start, end). Default: full.
  --mode {plain,figure placehold,figure latex}
                        The mode for parsing the PDF, to extract only the plain text or the text plus images.
  --rendering           Is it necessary to render the recognition results of the input PDF to output the recognition range? Default: False.
  --use_llm             Do you need to use LLM to format the parsing results? If so, please specify the corresponding parameters through the environment variables: DOC_PARSER_OPENAI_URL, DOC_PARSER_OPENAI_KEY, DOC_PARSER_OPENAI_MODEL. Default: False.
  --overwrite_result    If the parsed target file already exists, should it be rewritten? Default: False.
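
The half-open 'start_page:end_page' convention of --page_range can be illustrated with a small sketch. parse_page_range is a hypothetical helper, not DocParser's own code; it assumes zero-based page indices (suggested by the '0:1' example) and, as a design choice of this sketch, clamps the end to the document length.

```python
def parse_page_range(spec, total_pages):
    """Turn 'start:end' into a range over [start, end), clamped to the document.

    Hypothetical helper for illustration; DocParser's own parsing may differ.
    Assumes zero-based page indices.
    """
    start_text, sep, end_text = spec.partition(":")
    if not sep:
        raise ValueError(f"expected 'start_page:end_page', got {spec!r}")
    start, end = int(start_text), int(end_text)
    if start < 0 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    return range(start, min(end, total_pages))


print(list(parse_page_range("0:1", total_pages=10)))  # [0]
print(list(parse_page_range("2:5", total_pages=10)))  # [2, 3, 4]
```

Under these assumptions, '0:1' selects only the first page, because the end index is excluded.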

From Python

HTML

From CLI

Run the following command to parse an HTML page:

docparser-html https://github.com/mem0ai/mem0
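
To process several pages, one option is to generate one command per URL. The sketch below assumes only the single positional URL form shown in the example; build_html_commands is a hypothetical helper, and any further docparser-html options are not documented here.

```python
from urllib.parse import urlparse


def build_html_commands(urls):
    """Return one docparser-html argument list per valid http(s) URL.

    Hypothetical helper; it assumes only the single-positional-URL form
    shown in the README example.
    """
    commands = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an http(s) URL: {url!r}")
        commands.append(["docparser-html", url])
    return commands


if __name__ == "__main__":
    for cmd in build_html_commands(["https://github.com/mem0ai/mem0"]):
        # To execute, pass cmd to subprocess.run(cmd, check=True).
        print(" ".join(cmd))
```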

The following is a description of the relevant parameters:

From Python
