HTML2PARQUET Python Transform
html2parquet Transform
This transform iterates through zip archives of HTML files, or single HTML files, and generates Parquet files containing each converted document as a string.
The HTML conversion uses Trafilatura.
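The iteration over an archive can be sketched with the standard library alone. This is a minimal illustration, not the transform's actual code: `iter_html_members` and `build_demo_zip` are hypothetical helpers, the file-extension filter is an assumption, and the conversion step (handled by Trafilatura in the transform) is deliberately left out so the sketch has no third-party dependencies.

```python
import io
import zipfile


def iter_html_members(zip_bytes: bytes):
    """Yield (member_name, html_text) for each HTML file in a zip archive.

    Assumption: HTML members are recognised by a .html/.htm extension.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".html", ".htm")):
                raw = archive.read(name)
                yield name, raw.decode("utf-8", errors="replace")


def build_demo_zip() -> bytes:
    """Create a small in-memory archive so the sketch is self-contained."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("a.html", "<html><body><p>Hello</p></body></html>")
        archive.writestr("notes.txt", "not HTML, so it is skipped")
    return buffer.getvalue()


if __name__ == "__main__":
    for member_name, html in iter_html_members(build_demo_zip()):
        print(member_name, len(html))
```

Each yielded pair would then be passed to the HTML converter, producing one output row per member.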
Output format
The output will contain the following columns:
{
"title": "string", // the member filename
"document": "string", // the base of the source archive
"contents": "string", // the content of the HTML
"document_id": "string", // the document id, a hash of `contents`
"size": "string", // the size of `contents`
"date_acquired": "date", // the date when the transform was executed
}
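To make the columns concrete, here is a minimal sketch of how one row could be assembled. It is an illustration under stated assumptions, not the transform's implementation: `build_row` is a hypothetical helper, SHA-256 is assumed for the hash (the schema only says "a hash of `contents`"), and the size is assumed to be the UTF-8 byte length, stored as a string to match the schema above.

```python
import hashlib
from datetime import datetime, timezone


def build_row(title: str, document: str, contents: str) -> dict:
    """Assemble one output row with the six documented columns."""
    encoded = contents.encode("utf-8")
    return {
        "title": title,        # the member filename
        "document": document,  # the base of the source archive
        "contents": contents,  # the converted HTML content
        # Assumption: SHA-256; the hash algorithm is not stated in the schema.
        "document_id": hashlib.sha256(encoded).hexdigest(),
        # Assumption: size is the UTF-8 byte length, kept as a string
        # because the schema lists this column as "string".
        "size": str(len(encoded)),
        # Assumption: the acquisition date is taken in UTC.
        "date_acquired": datetime.now(timezone.utc).date().isoformat(),
    }


if __name__ == "__main__":
    print(build_row("a.html", "archive.zip", "# Hello"))
```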
Parameters
The transform can be initialized with the following parameters.
| Parameter | Default | Description |
|---|---|---|
| `output_format` | `markdown` | The output type for the `contents` column. Valid types are `markdown` and `text`. |
When invoking the CLI, the parameters must be set as `--html2parquet_<name>`, e.g. `--html2parquet_output_format='markdown'`.
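The prefixing convention can be illustrated with a small `argparse` sketch. This is not the transform's own argument handling: `build_parser` and `PREFIX` are hypothetical names used only to show how a parameter name maps to its `--html2parquet_<name>` flag, with the default and valid values taken from the table above.

```python
import argparse

# Prefix applied to every transform parameter on the command line.
PREFIX = "html2parquet_"


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser exposing the prefixed output_format flag."""
    parser = argparse.ArgumentParser(description="Illustrative parser only")
    parser.add_argument(
        f"--{PREFIX}output_format",
        choices=["markdown", "text"],
        default="markdown",
        help="Output type for the contents column.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--html2parquet_output_format=text"])
    print(args.html2parquet_output_format)
```

Because `choices` is set, any value other than `markdown` or `text` is rejected by the parser in this sketch.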