lemonpdf
Python 3 library for extracting URLs and domains from PDF files.
Install
sudo apt install tesseract-ocr poppler-utils
pip install lemonpdf
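Before running lemonpdf, it can help to confirm that the system packages are actually on PATH. The following is a minimal sketch, not part of lemonpdf: it assumes the relevant executables are `tesseract` (from tesseract-ocr) and `pdftotext` (from poppler-utils), and the helper name `missing_tools` is hypothetical. Which binaries lemonpdf really calls is not stated in this README.

```python
import shutil

# Assumption: these are the executables installed by the apt packages above
# (tesseract-ocr -> tesseract, poppler-utils -> pdftotext).
REQUIRED_TOOLS = ("tesseract", "pdftotext")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All system tools found.")
```

An empty list means every listed tool was found.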
Quickstart
Command-line interface (CLI)
Get URLs
lemonpdf -u file.pdf
Save the URL list to a text file
lemonpdf -u file.pdf -o urls.txt -s
Get domains
lemonpdf -d file.pdf
Save the domains to a text file
lemonpdf -d file.pdf -o domains.txt -s
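The saved files can be read back in Python for further processing. This is a rough sketch that assumes the output file contains one entry per line, which this README does not confirm; `read_entries` is a hypothetical helper, not a lemonpdf function.

```python
from pathlib import Path


def read_entries(path):
    """Read a saved list, assuming one entry per line; blank lines are skipped."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # 'urls.txt' is the file name used in the CLI example above.
    if Path("urls.txt").exists():
        for entry in read_entries("urls.txt"):
            print(entry)
```

If the actual file layout differs (for example, a different separator), the splitting step would need to be adjusted.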
Scripts
Get URLs and save them to a text file
from lemonpdf import Extractor
pdf_path = 'file.pdf'
output_txt_path = 'out_file.txt'
extractor = Extractor(pdf_path=pdf_path, output_txt_path=output_txt_path)
urls = extractor.extract_urls_from_pdf(save=True)
print(urls)
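A PDF often repeats the same link several times. The sketch below removes duplicates while keeping first-seen order. It assumes `extract_urls_from_pdf` returns a list of URL strings, which this README does not state explicitly; `unique_in_order` is a hypothetical helper and the sample list stands in for real output.

```python
def unique_in_order(urls):
    """Drop duplicate entries while preserving the order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))


if __name__ == "__main__":
    # Hypothetical sample standing in for the extractor's return value.
    sample = [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",
    ]
    print(unique_in_order(sample))
    # → ['https://example.com/a', 'https://example.com/b']
```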
Get domains and save them to a text file
from lemonpdf import Extractor
pdf_path = 'file.pdf'
output_txt_path = 'domains.txt'
extractor = Extractor(pdf_path=pdf_path, output_txt_path=output_txt_path)
domains = extractor.extract_domains_from_pdf(save=True)
print(domains)
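To cross-check the domain output against a list of URLs, host names can be derived with the standard library. This is only an illustration using `urllib.parse`; it is not necessarily how lemonpdf computes domains, and `hosts_from_urls` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def hosts_from_urls(urls):
    """Return the sorted, unique host names found in a list of URL strings."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname  # lowercased; None if the URL has no host
        if host:
            hosts.add(host)
    return sorted(hosts)


if __name__ == "__main__":
    # Hypothetical sample URLs.
    sample = ["https://example.com/a", "http://example.org:8080/b", "https://example.com/c"]
    print(hosts_from_urls(sample))
    # → ['example.com', 'example.org']
```

Note that `hostname` includes subdomains (for example `www.example.com`), so results may differ from a tool that reduces hosts to a registered domain.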