html2text_rs_py

A Python library backed by Rust's html2text to convert HTML to plain text. The project leverages the power of Rust to ensure fast and efficient operations, while providing an easy-to-use Python interface.

Note this entire thing was done with GPT-4 and it's my first time touching Rust -- just a bit of a weekend sidequest/learning experience. As a wise man once said: "I'm in the arena trying stuff. Some will work, some won't. But always learning."

Installation

Prerequisites (needed only for Option 2, building from source):

  1. Ensure you have both Rust and Python installed on your machine.
  2. Install maturin:
pip install maturin

Building and Installing:

Option 1: Use precompiled binaries from PyPI

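You can use the precompiled binaries available on PyPI. When a wheel matches your platform, you don't need to compile anything yourself, and the Rust toolchain is not required. For version 0.1.1 the only built distribution listed is for CPython 3.11 on macOS 11.0+ ARM64; on other platforms pip falls back to the source distribution, which does need the Rust toolchain.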

pip install html2text_rs_py

Option 2: Build from source

If you prefer to compile the Rust code yourself, or if you're interested in developing, you can build directly from the source code:

  1. First, ensure you have the Rust toolchain installed. If you don't have it, get it from rustup.rs.

  2. Clone this repo:

git clone https://github.com/mpr1255/html2text_rs_py.git
cd html2text_rs_py
  3. Build and install the Python package:
maturin develop --release

This will compile the Rust code and link it with the Python wrapper, making the module available for Python.
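After either install option, a quick way to confirm that Python can see the module is to ask the import system for it. This is only a small sketch: `is_installed` is a hypothetical helper name, and the module name `html2text_rs_py` is taken from the import shown in the Usage section.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate module_name."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("html2text_rs_py"):
        print("html2text_rs_py is importable")
    else:
        print("html2text_rs_py was not found in this environment")
```

If the check fails after building from source, the build may have been installed into a different Python environment than the one running the check.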

Usage

After installing, you can use the Rust functions directly in Python:

from html2text_rs_py import convert_html_directory_to_text, convert_html_file_to_text_py, convert_html_files_to_text_batch_py

# Convert every HTML file in a directory to text
convert_html_directory_to_text("./input_directory", "./output_directory")

# Convert a single HTML file to text
convert_html_file_to_text_py("input_file.html", "output_file.txt")

# Convert multiple HTML files to text in a batch
input_files = ["input1.html", "input2.html"]
output_files = ["output1.txt", "output2.txt"]
convert_html_files_to_text_batch_py(input_files, output_files)
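The batch function takes two lists, so it can help to build them from a directory instead of typing them out. The sketch below is one way to do that under a few assumptions: `plan_batch` and `run_batch` are hypothetical helper names, inputs and outputs are assumed to be paired by position (as the example above suggests), and the output directory is created up front because it is not documented whether the library creates it.

```python
from pathlib import Path


def plan_batch(input_dir, output_dir):
    """Pair every .html file in input_dir with a .txt path in output_dir."""
    in_root = Path(input_dir)
    out_root = Path(output_dir)
    # Sort so the pairing is stable from run to run.
    inputs = sorted(in_root.glob("*.html"))
    outputs = [out_root / (p.stem + ".txt") for p in inputs]
    return [str(p) for p in inputs], [str(p) for p in outputs]


def run_batch(input_dir, output_dir):
    """Build the two lists and hand them to the batch converter."""
    # Imported lazily so plan_batch works without the package installed.
    from html2text_rs_py import convert_html_files_to_text_batch_py

    inputs, outputs = plan_batch(input_dir, output_dir)
    # Assumption: create the output directory ourselves to be safe.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    convert_html_files_to_text_batch_py(inputs, outputs)
    return outputs
```

Calling `run_batch("./input_directory", "./output_directory")` would then convert every top-level `.html` file in the input directory; files in subdirectories are not picked up by this sketch.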

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you'd like to change. Please make sure to update tests as appropriate.

License

MIT

Note on benchmarks

Speed was the motivation for this little project. To keep the comparison 1:1, I generated a ~1 GB dataset of HTML files that do NOT contain links (because the Rust html2text library does not expose a flag to stop generating the hyperlinked URLs, and I don't know enough Rust to figure it out). In the multi-threaded runs, the Rust wrapper is only ~6.5x faster than the normal Python implementation (html2text) and only ~3.6x faster than Tika... Not that great... However, I will say there is a lot of boilerplate overhead in getting those two to run multi-threaded, whereas this wrapper has three very simple functions you can call, and the multithreading happens for free under the hood with Rust's rayon.
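To give a rough idea of the boilerplate in question, here is a minimal sketch of fanning per-file conversions out over a pool in pure Python. It is not the code used for the benchmarks: `convert_one` and `convert_many` are hypothetical names, and the converter is a stand-in built on the standard library's `html.parser` that simply keeps text nodes (including script and style contents) rather than the Python html2text or Tika.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from pathlib import Path


class _TextCollector(HTMLParser):
    """Stand-in converter: keeps text nodes and drops the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def convert_one(job):
    """Read one HTML file and write its text to the paired output path."""
    src, dst = job
    parser = _TextCollector()
    parser.feed(Path(src).read_text(encoding="utf-8"))
    parser.close()
    Path(dst).write_text("".join(parser.chunks), encoding="utf-8")
    return dst


def convert_many(input_files, output_files, workers=4):
    """Fan the per-file jobs out over a thread pool and collect results."""
    jobs = list(zip(input_files, output_files))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_one, jobs))
```

For a CPU-bound pure-Python converter, threads are limited by the GIL, so swapping in `ProcessPoolExecutor` is the usual next step, which adds a little more setup again.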

Benchmarks

Method     Threading        Documents Processed  Total Output Size (bytes)  Errors  Time (seconds)
tika       single-threaded  3007                 1500926103                 0       94.76
html2text  single-threaded  3007                 1500340646                 0       184.90
tika       multi-threaded   3007                 1500926103                 0       14.29
html2text  multi-threaded   3007                 1500340646                 0       25.65
rust       multi-threaded   3007                 1531829273                 0       3.92
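The speedup figures can be recomputed directly from the times in the table. The short sketch below does that arithmetic; `TIMES` and `speedups` are names introduced here for illustration, and the numbers are copied from the table as-is.

```python
# Wall-clock times in seconds, copied from the benchmark table.
TIMES = {
    ("tika", "single-threaded"): 94.76,
    ("html2text", "single-threaded"): 184.90,
    ("tika", "multi-threaded"): 14.29,
    ("html2text", "multi-threaded"): 25.65,
    ("rust", "multi-threaded"): 3.92,
}


def speedups(times, baseline=("rust", "multi-threaded")):
    """Return how many times slower each run is than the baseline run."""
    base = times[baseline]
    return {
        key: round(seconds / base, 1)
        for key, seconds in times.items()
        if key != baseline
    }


if __name__ == "__main__":
    for (method, threading), factor in speedups(TIMES).items():
        print(f"rust is {factor}x faster than {method} ({threading})")
```

Against the multi-threaded runs this gives roughly 6.5x over html2text and 3.6x over Tika; against the single-threaded runs the gap is about 47x and 24x.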

Download files

Source Distribution

html2text_rs_py-0.1.1.tar.gz (18.8 kB view details)

Uploaded Source

Built Distribution

html2text_rs_py-0.1.1-cp311-cp311-macosx_11_0_arm64.whl (452.1 kB view details)

Uploaded CPython 3.11 macOS 11.0+ ARM64

File details

Details for the file html2text_rs_py-0.1.1.tar.gz.

File metadata

  • Download URL: html2text_rs_py-0.1.1.tar.gz
  • Upload date:
  • Size: 18.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.3.0

File hashes

Hashes for html2text_rs_py-0.1.1.tar.gz
Algorithm Hash digest
SHA256 e4531b99ca469d893c7fffa6e9c9c28b8937bf01e8ffba550f086e955dbfe16e
MD5 ea48a05294fa53cc121682b5fed16f6c
BLAKE2b-256 5873a19badeea53906f405dad3733624ec3f21a1e3dc0bc4a6529e08ace1b470

File details

Details for the file html2text_rs_py-0.1.1-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for html2text_rs_py-0.1.1-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 eba2aee7308c732c2dff714547d8b0fe87d37187a88b051aaa1c7f9c3bf2a872
MD5 8b9e9b99bf3c9d418b12131e237b658d
BLAKE2b-256 131de43ff7712795f4d424ebe3235e183d633051ec5c83dfd48f163905115d99
