HTML2PARQUET Python Transform
html2parquet Transform
This transform iterates through zip archives of HTML files, or single HTML files, and generates Parquet files containing each converted document as a string.
The HTML conversion uses Trafilatura.
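As a rough illustration of the iteration step described above, the sketch below walks the HTML members of an in-memory zip archive and applies a caller-supplied converter to each one. The names `iter_html_members` and `convert_archive` are hypothetical and are not part of this package; the UTF-8 decoding and the `.html`/`.htm` filter are assumptions as well.

```python
import io
import zipfile
from typing import Callable, Iterator, List, Tuple


def iter_html_members(zip_bytes: bytes) -> Iterator[Tuple[str, str]]:
    """Yield (member_name, html_text) for each HTML file in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            # Skip directories and anything that does not look like HTML.
            if info.is_dir() or not info.filename.lower().endswith((".html", ".htm")):
                continue
            # Assumption: members are UTF-8; undecodable bytes are replaced.
            yield info.filename, archive.read(info).decode("utf-8", errors="replace")


def convert_archive(zip_bytes: bytes, convert: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Apply a caller-supplied HTML converter to every HTML member (hypothetical helper)."""
    return [(name, convert(html)) for name, html in iter_html_members(zip_bytes)]
```

With Trafilatura installed, `convert` could plausibly be a thin wrapper around `trafilatura.extract`, assuming a Trafilatura version that supports the requested output format. Note that `extract` may return `None` when nothing can be extracted, so a real wrapper would need to handle that case.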
Output format
The output contains the following columns:
{
"title": "string", // the member filename
"document": "string", // the base name of the source archive
"contents": "string", // the converted content of the HTML
"document_id": "string", // the document id, a hash of `contents`
"size": "string", // the size of `contents`
"date_acquired": "date" // the date when the transform was executed
}
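To make the column definitions concrete, here is a minimal sketch of how one output row might be assembled. The helper `build_row` is hypothetical. The source does not say which hash is used for `document_id` or which unit `size` is measured in, so SHA-256 and UTF-8 byte length are assumptions chosen for illustration.

```python
import hashlib
import os
from datetime import date


def build_row(member_name: str, archive_path: str, contents: str) -> dict:
    """Assemble one row with the documented columns (hypothetical helper)."""
    encoded = contents.encode("utf-8")
    return {
        "title": member_name,
        "document": os.path.basename(archive_path),
        "contents": contents,
        # Assumption: SHA-256 hex digest; the actual hash function is not stated.
        "document_id": hashlib.sha256(encoded).hexdigest(),
        # Assumption: size in UTF-8 bytes, kept as an integer here even though
        # the schema lists the column type as a string.
        "size": len(encoded),
        "date_acquired": date.today().isoformat(),
    }
```

A list of such dictionaries could then be handed to a Parquet writer of your choice.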
Parameters
The transform can be initialized with the following parameters.
| Parameter | Default | Description |
|---|---|---|
| `output_format` | `markdown` | The output type for the `contents` column. Valid types are `markdown` and `text`. |
When invoking the CLI, the parameters must be set as `--html2parquet_<name>`, e.g. `--html2parquet_output_format='markdown'`.
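The prefix convention can be illustrated with a small `argparse` sketch. This is not the transform's actual argument parser; `build_parser` and `transform_params` are hypothetical names, and the code only shows how a `--html2parquet_<name>` flag could map back to the unprefixed parameter name.

```python
import argparse
from typing import Dict, List

PREFIX = "html2parquet_"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the documented parameter under its CLI prefix."""
    parser = argparse.ArgumentParser(description="html2parquet CLI sketch")
    parser.add_argument(
        f"--{PREFIX}output_format",
        choices=["markdown", "text"],
        default="markdown",
        help="Output type for the contents column.",
    )
    return parser


def transform_params(argv: List[str]) -> Dict[str, str]:
    """Parse argv and strip the prefix to recover the transform parameters."""
    args = vars(build_parser().parse_args(argv))
    return {key[len(PREFIX):]: value for key, value in args.items() if key.startswith(PREFIX)}
```

For example, `transform_params(["--html2parquet_output_format=text"])` yields a dictionary keyed by `output_format`, which matches the parameter name in the table above.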
Download files
File details
Details for the file dpk_html2parquet_transform_python-0.2.1.tar.gz.
File metadata
- Download URL: dpk_html2parquet_transform_python-0.2.1.tar.gz
- Upload date:
- Size: 188.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 64d51213f79ea9585774a17f339b3b4520253760812688068523de3f90d197a6 |
| MD5 | b7092a60f54a5ff38ab8f6167bfd0c4e |
| BLAKE2b-256 | 0c0dc44d349a97566d1f8bbd0fb12fc85daa88aa078542fa47e436a2f15ac596 |
File details
Details for the file dpk_html2parquet_transform_python-0.2.1-py3-none-any.whl.
File metadata
- Download URL: dpk_html2parquet_transform_python-0.2.1-py3-none-any.whl
- Upload date:
- Size: 6.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 129be1bdcb075f6461bde10f2f7f2cda6935c1c2c4a177a739a4136685dbbcc7 |
| MD5 | 5fda4194c5b951a1e45e6547055aeb5c |
| BLAKE2b-256 | edd8c94e792190caafdf67212c8ee1f1c9093005c68ee8fe192e718868988300 |