DecisionFacts Extraction Library extracts content from PDF, PPTX, DOCX, PNG, and JPG files and converts it to structured JSON data.
Project description
DF Extract Lib
Requirements
Python 3.10+ (the API is asynchronous and uses asyncio)
Installation
# Using pip
$ python -m pip install df-extract
# Manual install
$ python -m pip install .
1. To extract content from PDF
from df_extract.pdf import ExtractPDF
path = "/home/test/ABC.pdf"
extract_pdf = ExtractPDF(file_path=path)
# extract() is awaited, so call it from inside a coroutine
# By default, output as text
await extract_pdf.extract()  # Output is written to `/home/test/ABC.pdf.txt`
# Output as JSON
await extract_pdf.extract(as_json=True)  # Output is written to `/home/test/ABC.pdf.json`
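The snippets in this README use a bare `await`, which only works inside a coroutine or an async-aware REPL. The following is a minimal sketch of driving an extractor from an ordinary script with `asyncio.run`. `StubExtractor` is a hypothetical stand-in, not part of df_extract, so the example runs without the library installed. It only mimics the `extract(as_json=...)` call and the output naming described above; its return value is an assumption, since the real return value is not documented here.

```python
import asyncio


class StubExtractor:
    """Hypothetical stand-in for ExtractPDF; not part of df_extract."""

    def __init__(self, file_path: str):
        self.file_path = file_path

    async def extract(self, as_json: bool = False) -> str:
        # Assumption: mimic the documented naming (<file>.txt or <file>.json).
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return self.file_path + (".json" if as_json else ".txt")


async def main(extractor) -> list[str]:
    # Awaiting is only legal inside a coroutine such as this one.
    text_out = await extractor.extract()
    json_out = await extractor.extract(as_json=True)
    return [text_out, json_out]


# asyncio.run() starts an event loop, runs main() to completion, and closes the loop.
outputs = asyncio.run(main(StubExtractor("/home/test/ABC.pdf")))
print(outputs)
```

With the library installed, replacing `StubExtractor("/home/test/ABC.pdf")` with `ExtractPDF(file_path=path)` should give the same calling pattern.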
You can change the output directory by passing the `output_dir` parameter:
from df_extract.pdf import ExtractPDF
path = "/home/test/ABC.pdf"
extract_pdf = ExtractPDF(file_path=path, output_dir="/home/test/output")
await extract_pdf.extract()
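To locate the result afterwards, it can help to compute the expected output path up front. The helper below is a hypothetical sketch, not part of df_extract. The default naming (`<file>.txt` or `<file>.json` next to the source) follows the comments above; the behaviour with `output_dir` (same file name, placed in that directory) is an assumption that should be checked against the library.

```python
from pathlib import PurePosixPath
from typing import Optional


def expected_output_path(file_path: str,
                         output_dir: Optional[str] = None,
                         as_json: bool = False) -> str:
    """Guess where the extracted output will be written (hypothetical helper)."""
    source = PurePosixPath(file_path)
    suffix = ".json" if as_json else ".txt"
    # Assumption: with output_dir set, the file keeps its name and moves there.
    target_dir = PurePosixPath(output_dir) if output_dir else source.parent
    # The extension is appended to the full file name, e.g. ABC.pdf -> ABC.pdf.txt
    return str(target_dir / (source.name + suffix))


print(expected_output_path("/home/test/ABC.pdf"))
print(expected_output_path("/home/test/ABC.pdf", output_dir="/home/test/output"))
```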
Extract content from PDF with image data
This requires the `easyocr` package.
from df_extract.base import ImageExtract
from df_extract.pdf import ExtractPDF
path = "/home/test/ABC.pdf"
image_extract = ImageExtract(model_download_enabled=True)
extract_pdf = ExtractPDF(file_path=path, image_extract=image_extract)
await extract_pdf.extract()
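Because `easyocr` is an extra requirement, a script may want to check for it before enabling image extraction. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name, and falling back to text-only extraction when the package is missing is a suggested pattern rather than documented library behaviour.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


ocr_available = is_installed("easyocr")
if ocr_available:
    print("easyocr found: image extraction can be enabled")
else:
    print("easyocr missing: install it or extract without image_extract")
```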
2. To extract content from PPT and PPTx
from df_extract.pptx import ExtractPPTx
path = "/home/test/DEF.pptx"
extract_pptx = ExtractPPTx(file_path=path)
# By default, output as text
await extract_pptx.extract()  # Output is written to `/home/test/DEF.pptx.txt`
# Output as JSON
await extract_pptx.extract(as_json=True)  # Output is written to `/home/test/DEF.pptx.json`
3. To extract content from Doc and Docx
from df_extract.docx import ExtractDocx
path = "/home/test/GHI.docx"
extract_docx = ExtractDocx(file_path=path)
# By default, output as text
await extract_docx.extract()  # Output is written to `/home/test/GHI.docx.txt`
# Output as JSON
await extract_docx.extract(as_json=True)  # Output is written to `/home/test/GHI.docx.json`
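Since every extractor exposes an awaitable `extract()`, several files can in principle be processed concurrently with `asyncio.gather`. The sketch below uses a hypothetical `StubExtractor` in place of the real classes so it runs on its own. Whether the real extractors are safe and beneficial to run concurrently is an assumption, as is the return value of `extract()`.

```python
import asyncio


class StubExtractor:
    """Hypothetical stand-in for the library's extractor classes."""

    def __init__(self, file_path: str):
        self.file_path = file_path

    async def extract(self, as_json: bool = False) -> str:
        await asyncio.sleep(0.01)  # pretend to do some I/O
        return self.file_path + (".json" if as_json else ".txt")


async def extract_all(extractors) -> list[str]:
    # gather() schedules every extract() call and returns results in input order.
    return await asyncio.gather(*(e.extract(as_json=True) for e in extractors))


paths = ["/home/test/DEF.pptx", "/home/test/GHI.docx"]
results = asyncio.run(extract_all([StubExtractor(p) for p in paths]))
print(results)
```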
4. To extract content from PNG, JPEG and JPG
from df_extract.image import ExtractImage
path = "/home/test/JKL.png"
extract_png = ExtractImage(file_path=path)
# By default, output as text
await extract_png.extract()  # Output is written to `/home/test/JKL.png.txt`
# Output as JSON
await extract_png.extract(as_json=True)  # Output is written to `/home/test/JKL.png.json`
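The four sections above pair each file type with one extractor class. One way to choose between them automatically is a lookup table keyed by file extension, sketched below. The module and class names come from the imports shown above; `EXTRACTORS` and `pick_extractor` are hypothetical names, and mapping `.ppt` and `.doc` to the PPTx and Docx extractors follows the section headings rather than tested behaviour.

```python
from pathlib import PurePosixPath

# Extension -> (module, class), taken from the import lines in this README.
EXTRACTORS = {
    ".pdf": ("df_extract.pdf", "ExtractPDF"),
    ".ppt": ("df_extract.pptx", "ExtractPPTx"),
    ".pptx": ("df_extract.pptx", "ExtractPPTx"),
    ".doc": ("df_extract.docx", "ExtractDocx"),
    ".docx": ("df_extract.docx", "ExtractDocx"),
    ".png": ("df_extract.image", "ExtractImage"),
    ".jpg": ("df_extract.image", "ExtractImage"),
    ".jpeg": ("df_extract.image", "ExtractImage"),
}


def pick_extractor(file_path: str) -> tuple[str, str]:
    """Return the (module, class) names to use for a file (hypothetical helper)."""
    suffix = PurePosixPath(file_path).suffix.lower()  # case-insensitive match
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {file_path!r}") from None


print(pick_extractor("/home/test/JKL.png"))
```

The returned names can then be resolved with `importlib.import_module` and `getattr` once the library is installed.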
Download files
Source Distribution
df_extract-0.0.2.1.tar.gz (11.7 kB)
Built Distribution
df_extract-0.0.2.1-py3-none-any.whl (14.0 kB)
File details
Details for the file df_extract-0.0.2.1.tar.gz.
File metadata
- Download URL: df_extract-0.0.2.1.tar.gz
- Upload date:
- Size: 11.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c162dd85881866b8479a969135925ad7d0b881ade7e34ada88ff95b531bd632c |
| MD5 | 3621f1f975ac6a1946868b950c8cf59b |
| BLAKE2b-256 | 94b1203a533b073939f66c08b151b02f55a5be4298fd96fc9685175c343df86d |
File details
Details for the file df_extract-0.0.2.1-py3-none-any.whl.
File metadata
- Download URL: df_extract-0.0.2.1-py3-none-any.whl
- Upload date:
- Size: 14.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d46d15d8e08dce13904b9ac88ec25d28a40ddac3c8ccb3c39fb5189eb0743004 |
| MD5 | c84e49f3e542e12da9882eaab94b9072 |
| BLAKE2b-256 | af6dd33a87d3de22eccba3495ffa53057f27a027f0eda89d0c1736c18c03aabd |