Downloads files and extracts text and HTML from different formats

naval

Description

naval downloads files and extracts text and HTML from a URL, file path, file object, and more. Fetching data from a URL is handled for you. PDF, DOCX, PPTX, HTML, and plain-text files are supported out of the box.

A file can come from a URL, a file path, a file object, or even bytes. A URL can point to a webpage instead of a file, e.g. google.com, which will be treated as HTML. The content type is inferred from the file extension: file.html will be treated as HTML unless a content type is explicitly specified.
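The extension-based detection described above can be sketched with the standard library. This is only an illustration, not naval's actual implementation: the `guess_content_type` function, the `CONTENT_TYPES` table, and the choice of `text/html` as the fallback are assumptions made for this example.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical lookup table covering the formats the README lists as supported.
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".html": "text/html",
    ".txt": "text/plain",
}


def guess_content_type(source, content_type=None, default="text/html"):
    """Guess a content type from the extension of a URL or file path.

    An explicitly given content_type always wins. Sources without a known
    extension (e.g. a URL pointing to a webpage) fall back to `default`.
    """
    if content_type is not None:
        return content_type
    parsed = urlparse(source)
    # For URLs, look only at the path part so the domain (".com") is ignored.
    path = parsed.path if parsed.scheme in ("http", "https") else source
    extension = posixpath.splitext(path)[1].lower()
    return CONTENT_TYPES.get(extension, default)
```

Under these assumptions, `guess_content_type("http://example.com/sample.pdf")` resolves to the PDF type, while a bare webpage URL such as `"https://www.google.com/"` falls back to HTML.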

Install

naval can be installed with pip (the PyPI package name is navaly):

pip install navaly

Usage

Downloading from a URL:

from io import BytesIO

import naval

# download to a file at the given path
naval.download("http://example.com/", "output.html")
naval.download("http://example.com/sample.pdf", "output.pdf")

# download to a file-like object
file_output = BytesIO()
naval.download("http://example.com/sample.pdf", file_output)

# download from multiple URLs into a folder
# the HTML files will be downloaded to the 'downloads/' folder
urls = ["http://example.com/", "https://www.google.com/"]
naval.download_all(urls, "downloads")
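The examples above pass either a file path or a file-like object as the output. A minimal sketch of how a function can accept both is shown here. It uses only the standard library, and `write_output` is a hypothetical helper for illustration, not part of naval.

```python
import os
import tempfile
from io import BytesIO


def write_output(data, output):
    """Write bytes to a path (str or os.PathLike) or to a binary file-like object."""
    if isinstance(output, (str, os.PathLike)):
        # A path was given: open the file ourselves and close it when done.
        with open(output, "wb") as file:
            file.write(data)
    else:
        # A file-like object was given: the caller owns it, so only write.
        output.write(data)


sample = b"%PDF-1.4 sample"

# Writing to an in-memory buffer
buffer = BytesIO()
write_output(sample, buffer)
in_memory = buffer.getvalue()

# Writing to a file inside a temporary folder
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "output.pdf")
    write_output(sample, path)
    with open(path, "rb") as file:
        on_disk = file.read()
```

Leaving a caller-supplied file object open is a deliberate choice in this sketch, since the caller may want to read the contents back afterwards, as with `BytesIO`.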

Extracting text and HTML:

import naval

# extract text or HTML from pdf, docx and pptx files, or from a URL
output_text = naval.extract_text("sample_file.pdf")
output_text = naval.extract_text("sample_file.pptx")
output_html = naval.extract_html("sample_file.docx")
output_text = naval.extract_text("http://example.com/")

# extract from a file-like object
with open("sample_file.pdf", mode="rb") as file:
    output_text = naval.extract_text(file)

# extract to a file (file path or file object)
naval.extract_text_to_file("sample_file.pdf", "output.txt")
naval.extract_html_to_file("sample_file.pdf", "output.html")

# a string can be passed directly
html = '''
<p> First paragraph </p>
<p> Second paragraph </p>
'''
output_text = naval.extract_text(html, source_locates_data=False, content_type="text/html")

# same with bytes
with open("sample_file.pdf", mode="rb") as file:
    pdf_bytes = file.read()
    output_html = naval.extract_html(pdf_bytes, source_locates_data=False, content_type="application/pdf")
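In the last two examples, `source_locates_data=False` indicates that the argument is the data itself rather than a URL or path pointing to it. The sketch below illustrates that distinction with the standard library only. `open_source` is a hypothetical helper, URL fetching is left out, and none of this is taken from naval's code.

```python
from io import BytesIO


def open_source(source, source_locates_data=True):
    """Return a binary stream for `source`.

    With source_locates_data=False, `source` is treated as the content itself
    (str or bytes). Otherwise it is treated as something that locates the
    content: a file-like object or a local file path in this sketch.
    """
    if not source_locates_data:
        data = source.encode("utf-8") if isinstance(source, str) else source
        return BytesIO(data)
    if hasattr(source, "read"):
        # Already a file-like object, use it as is.
        return source
    return open(source, "rb")


html = "<p> First paragraph </p>"
raw = open_source(html, source_locates_data=False).read()
```

Without the flag, a string such as the HTML above would be interpreted as a path and the call would fail, which is why the flag has to be explicit.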

See the examples/ folder for more examples.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Don't forget to update the tests.

Download files

Source distribution: navaly-0.0.1.tar.gz (1.2 MB)

Built distribution: navaly-0.0.1-py3-none-any.whl (36.2 kB, Python 3)
