DecisionFacts Extraction Library extracts content from PDF, PPTX, DOCX, PNG, and JPG files and converts it to structured JSON data.
DF Extract Lib
Requirements
Python 3.10+
Installation
# Using pip
$ python -m pip install df-extract
# Manual install
$ python -m pip install .
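To confirm that the install succeeded, a quick check against the installed package metadata can help. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of df-extract.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "df-extract") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    version = installed_version()
    print(version if version else "df-extract is not installed")
```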
1. To extract content from PDF
from df_extract.pdf import ExtractPDF
path = "/home/test/ABC.pdf"
extract_pdf = ExtractPDF(file_path=path)
# By default, the output is written as text
extract_pdf.extract()  # Output is written to `/home/test/ABC.pdf.txt`
# Output as JSON
extract_pdf.extract(as_json=True)  # Output is written to `/home/test/ABC.pdf.json`
You can change the output directory by passing the `output_dir` parameter:
from df_extract.pdf import ExtractPDF
path = "/home/test/ABC.pdf"
extract_pdf = ExtractPDF(file_path=path, output_dir="/home/test/output")
extract_pdf.extract()
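The output naming shown in the comments above (source file name plus `.txt` or `.json`) can be sketched as a small helper, which is handy when a script needs to locate the result afterwards. The function `expected_output_path` is hypothetical, and its handling of `output_dir` is an assumption: the examples do not state the file name used inside a custom output directory, so this sketch assumes the same name is kept.

```python
from pathlib import Path
from typing import Optional


def expected_output_path(
    file_path: str,
    as_json: bool = False,
    output_dir: Optional[str] = None,
) -> Path:
    """Guess where the extracted output lands, based on the documented examples."""
    source = Path(file_path)
    # The examples append the new extension to the full original file name.
    suffix = ".json" if as_json else ".txt"
    # Assumption: output_dir changes only the folder, not the file name.
    target_dir = Path(output_dir) if output_dir else source.parent
    return target_dir / (source.name + suffix)
```

For the default case, `expected_output_path("/home/test/ABC.pdf", as_json=True)` reproduces the documented `/home/test/ABC.pdf.json` location; verify the `output_dir` case against the files the library actually writes.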
Extract content from PDF with image data
This requires `easyocr`.
from df_extract.base import ImageExtract
from df_extract.pdf import ExtractPDF
path = "/home/test/ABC.pdf"
image_extract = ImageExtract(model_download_enabled=True)
extract_pdf = ExtractPDF(file_path=path, image_extract=image_extract)
extract_pdf.extract()
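Because `easyocr` is an extra requirement, a script may want to fall back to plain extraction when it is not installed. The following is a minimal sketch of that idea; `module_available` and `extract_pdf_with_optional_ocr` are hypothetical helper names, and only the `ExtractPDF` and `ImageExtract` calls shown in the examples above are used.

```python
from importlib.util import find_spec


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return find_spec(name) is not None


def extract_pdf_with_optional_ocr(path: str) -> None:
    """Extract a PDF, enabling image extraction only when easyocr is present."""
    # Imported lazily so this sketch can be loaded without df-extract installed.
    from df_extract.pdf import ExtractPDF

    if module_available("easyocr"):
        from df_extract.base import ImageExtract

        image_extract = ImageExtract(model_download_enabled=True)
        extractor = ExtractPDF(file_path=path, image_extract=image_extract)
    else:
        # Fall back to extraction without image data when easyocr is missing.
        extractor = ExtractPDF(file_path=path)
    extractor.extract()
```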
2. To extract content from PPT and PPTX
from df_extract.pptx import ExtractPPTx
path = "/home/test/DEF.pptx"
extract_pptx = ExtractPPTx(file_path=path)
# By default, the output is written as text
extract_pptx.extract()  # Output is written to `/home/test/DEF.pptx.txt`
# Output as JSON
extract_pptx.extract(as_json=True)  # Output is written to `/home/test/DEF.pptx.json`
3. To extract content from DOC and DOCX
from df_extract.docx import ExtractDocx
path = "/home/test/GHI.docx"
extract_docx = ExtractDocx(file_path=path)
# By default, the output is written as text
extract_docx.extract()  # Output is written to `/home/test/GHI.docx.txt`
# Output as JSON
extract_docx.extract(as_json=True)  # Output is written to `/home/test/GHI.docx.json`
4. To extract content from PNG, JPEG, and JPG
from df_extract.image import ExtractImage
path = "/home/test/JKL.png"
extract_png = ExtractImage(file_path=path)
# By default, the output is written as text
extract_png.extract()  # Output is written to `/home/test/JKL.png.txt`
# Output as JSON
extract_png.extract(as_json=True)  # Output is written to `/home/test/JKL.png.json`
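Since all four extractors share the same `file_path` constructor argument and `extract()` call, they can be wrapped in a single dispatcher keyed on file extension. This is only a sketch: `EXTRACTORS`, `extractor_target`, and `extract_any` are hypothetical names, and routing legacy `.ppt` and `.doc` files through the PPTX and DOCX classes is an assumption based on the section headings, not something the examples demonstrate.

```python
import importlib
from pathlib import Path

# Extension -> (module, class), taken from the import lines in the examples.
# Assumption: .ppt and .doc are handled by the same classes as .pptx and .docx.
EXTRACTORS = {
    ".pdf": ("df_extract.pdf", "ExtractPDF"),
    ".ppt": ("df_extract.pptx", "ExtractPPTx"),
    ".pptx": ("df_extract.pptx", "ExtractPPTx"),
    ".doc": ("df_extract.docx", "ExtractDocx"),
    ".docx": ("df_extract.docx", "ExtractDocx"),
    ".png": ("df_extract.image", "ExtractImage"),
    ".jpg": ("df_extract.image", "ExtractImage"),
    ".jpeg": ("df_extract.image", "ExtractImage"),
}


def extractor_target(file_path: str) -> tuple[str, str]:
    """Map a file path to the (module, class) pair that should handle it."""
    suffix = Path(file_path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or file_path!r}") from None


def extract_any(file_path: str, as_json: bool = False) -> None:
    """Import the matching extractor lazily and run it on the file."""
    module_name, class_name = extractor_target(file_path)
    extractor_cls = getattr(importlib.import_module(module_name), class_name)
    extractor_cls(file_path=file_path).extract(as_json=as_json)
```

The lazy import keeps the lookup table usable even when df-extract is not installed; only `extract_any` needs the library at call time.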