Downloads and extracts text and HTML from different formats

naval

Description

naval downloads (fetches) and extracts text and HTML from a URL, file path, file object and more. Downloading data from a URL is handled for you. PDF, DOCX, PPTX, HTML and text files are supported out of the box.

The input can be a URL, a file path, a file object or even bytes. A URL can point to a web page instead of a file, e.g. google.com, which will be treated as HTML. The type is determined by the file extension: file.html will be treated as HTML unless a different type is explicitly specified.
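
To make the extension-based rule concrete, here is a minimal sketch that uses only the Python standard library. The function name guess_content_type is hypothetical and is not part of naval's API; naval's actual detection logic may differ. It assumes that an http(s) URL without a recognizable extension should be treated as HTML, as described above.

```python
import mimetypes
from urllib.parse import urlparse


def guess_content_type(source):
    """Guess a MIME type from the extension of a URL or file path.

    Hypothetical helper for illustration only; not naval's implementation.
    """
    parsed = urlparse(source)
    is_url = parsed.scheme in ("http", "https")
    # For URLs, only the path component carries the file extension.
    path = parsed.path if is_url else source
    content_type, _encoding = mimetypes.guess_type(path)
    if content_type is None and is_url:
        # Assumption: a URL with no known extension points to a web page.
        return "text/html"
    return content_type


print(guess_content_type("http://example.com/"))
print(guess_content_type("http://example.com/sample.pdf"))
print(guess_content_type("file.html"))
```

A file path without a recognized extension yields None in this sketch, which is the case where the type would have to be specified explicitly.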

Install

naval can be installed with pip. Note that the package is published on PyPI as navaly:

pip install navaly

Usage

Downloading from a URL:

import naval
from io import BytesIO

# download to a file at the given path
naval.download("http://example.com/", "output.html")
naval.download("http://example.com/sample.pdf", "output.pdf")

# download to a file-like object
file_output = BytesIO()
naval.download("http://example.com/sample.pdf", file_output)

# download from multiple URLs into a folder
# the HTML files will be downloaded to the 'downloads/' folder
urls = ["http://example.com/", "https://www.google.com/"]
naval.download_all(urls, "downloads")

Extracting text and HTML:

# extract text or HTML from pdf, docx and pptx files, or from a URL
output_text = naval.extract_text("sample_file.pdf")
output_text = naval.extract_text("sample_file.pptx")
output_html = naval.extract_html("sample_file.docx")
output_text = naval.extract_text("http://example.com/")

# extract from a file-like object
with open("sample_file.pdf", mode="rb") as input_file:
    output_text = naval.extract_text(input_file)

# extract to a file (file path or file object)
naval.extract_text_to_file("sample_file.pdf", "output.txt")
naval.extract_html_to_file("sample_file.pdf", "output.html")

# a string can be passed directly
html = '''
<p> First paragraph </p>
<p> Second paragraph </p>
'''
output_text = naval.extract_text(html, source_locates_data=False, content_type="text/html")

# same with bytes
with open("sample_file.pdf", mode="rb") as file:
    pdf_bytes = file.read()
    output_html = naval.extract_html(pdf_bytes, source_locates_data=False, content_type="application/pdf")
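
For readers wondering what text extraction from an HTML string amounts to, the following is a rough standard-library sketch. TextCollector and html_to_text are hypothetical names introduced for illustration; this is not how naval is implemented, and a real extractor would likely also skip script and style content.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Drop whitespace-only text between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text nodes of `html`, one per line (illustrative only)."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "\n".join(parser.chunks)


html = '''
<p> First paragraph </p>
<p> Second paragraph </p>
'''
print(html_to_text(html))
```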

See the examples/ folder for more examples.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Don't forget to update tests.

Download files

Source Distribution

navaly-0.0.1.tar.gz (1.2 MB)

Built Distribution

navaly-0.0.1-py3-none-any.whl (36.2 kB, Python 3)
