HTML2PARQUET Python Transform

html2parquet Transform

This transform iterates through a zip archive of HTML files, or through single HTML files, and generates Parquet files containing each converted document as a string.

The HTML conversion uses Trafilatura.

Output format

The output will contain the following columns:

{
    "title": "string",            // the member filename
    "document": "string",         // the base of the source archive
    "contents": "string",         // the content of the HTML
    "document_id": "string",      // the document id, a hash of `contents`
    "size": "string",             // the size of `contents`
    "date_acquired": "date"       // the date when the transform was executed
}
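
As a rough illustration of how one row with these columns might be assembled, here is a minimal sketch that uses only the Python standard library. It is not the transform's actual code: the helper names `build_row` and `rows_from_zip` are hypothetical, the choice of SHA-256 for `document_id` and of the UTF-8 byte count for `size` are assumptions, and the raw HTML stands in for the Trafilatura conversion step.

```python
import hashlib
import io
import zipfile
from datetime import datetime, timezone
from pathlib import Path


def build_row(archive_name: str, member_name: str, contents: str) -> dict:
    """Assemble one output row with the columns listed above (hypothetical helper)."""
    encoded = contents.encode("utf-8")
    return {
        "title": member_name,
        "document": Path(archive_name).name,
        "contents": contents,
        # Assumption: the source only says "a hash of contents"; SHA-256 is used here.
        "document_id": hashlib.sha256(encoded).hexdigest(),
        # Assumption: size is measured in UTF-8 bytes and kept as an integer,
        # although the listing above declares the column as a string.
        "size": len(encoded),
        "date_acquired": datetime.now(timezone.utc).isoformat(),
    }


def rows_from_zip(archive_name: str, data: bytes) -> list[dict]:
    """Build one row per .html member of an in-memory zip archive."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(".html"):
                continue  # skip non-HTML members
            html = archive.read(member).decode("utf-8")
            # Placeholder: the real transform converts the HTML with Trafilatura;
            # this sketch stores the raw HTML unchanged.
            rows.append(build_row(archive_name, member, html))
    return rows
```

The resulting list of dictionaries could then be handed to a Parquet writer; that step is left out so the sketch stays dependency-free.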

Parameters

The transform can be initialized with the following parameters.

| Parameter | Default | Description |
| --- | --- | --- |
| output_format | markdown | The output type for the contents column. Valid types are markdown and text. |

When invoking the CLI, the parameters must be set as --html2parquet_<name>, e.g. --html2parquet_output_format='markdown'.
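
The prefixing convention can be sketched with `argparse` as follows. This is an illustration only, not the transform's own argument handling: `build_parser` and `transform_params` are hypothetical names, and stripping the prefix before passing the values on is an assumption about how such parameters are typically consumed.

```python
import argparse

PREFIX = "html2parquet_"
VALID_OUTPUT_FORMATS = ("markdown", "text")


def build_parser() -> argparse.ArgumentParser:
    """Create a parser that accepts the prefixed CLI flag (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="html2parquet parameter sketch")
    parser.add_argument(
        f"--{PREFIX}output_format",
        choices=VALID_OUTPUT_FORMATS,
        default="markdown",
        help="Output type for the contents column.",
    )
    return parser


def transform_params(argv: list[str]) -> dict:
    """Parse argv and return the parameters with the CLI prefix stripped."""
    args = vars(build_parser().parse_args(argv))
    return {
        key[len(PREFIX):]: value
        for key, value in args.items()
        if key.startswith(PREFIX)
    }


# Usage: the prefixed flag maps back to the plain parameter name.
params = transform_params(["--html2parquet_output_format=text"])
```

Here `params` holds the unprefixed name `output_format`, which matches the parameter name in the table above.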
