Python 3 library that adds MS Word .doc support to the llm-dataset-converter library.

Project description

The ldc-doc library is an extension to llm-dataset-converter with plugins for handling MS Word .doc files.

It requires antiword to be installed on the system, because textract uses antiword internally to extract the text from .doc files.
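Because the plugins depend on an external antiword executable, it can help to confirm that the tool is reachable before running a conversion. The following is a minimal sketch, not part of ldc-doc: the helper names `antiword_available` and `doc_to_text` are hypothetical, and the code calls antiword directly through `subprocess` rather than going through textract or the ldc-doc plugins.

```python
import shutil
import subprocess
from pathlib import Path


def antiword_available() -> bool:
    """Return True if an `antiword` executable can be found on the PATH."""
    return shutil.which("antiword") is not None


def doc_to_text(path: str) -> str:
    """Extract plain text from a .doc file by invoking antiword directly.

    Hypothetical helper for illustration only; ldc-doc itself relies on
    textract, which in turn uses antiword for .doc files.
    """
    doc = Path(path)
    if not doc.is_file():
        raise FileNotFoundError(f"no such file: {doc}")
    if not antiword_available():
        raise RuntimeError("antiword is not installed or not on the PATH")
    # antiword writes the extracted text to standard output.
    result = subprocess.run(
        ["antiword", str(doc)],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print("antiword found:", antiword_available())
```

If `antiword_available()` returns False, installing antiword through the system package manager is the likely fix before using the ldc-doc plugins.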

Changelog

0.0.5 (2025-07-11)

  • using textract_py3 as dependency instead of textract-py3

0.0.4 (2025-03-14)

  • added placeholder support

0.0.3 (2024-12-20)

0.0.2 (2024-07-05)

  • from-doc-pt now uses *.doc as default glob

0.0.1 (2024-05-06)

  • initial release

Download files

Source Distribution

ldc_doc-0.0.5.tar.gz (4.5 kB view details)

File details

Details for the file ldc_doc-0.0.5.tar.gz.

File metadata

  • Download URL: ldc_doc-0.0.5.tar.gz
  • Upload date:
  • Size: 4.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for ldc_doc-0.0.5.tar.gz
Algorithm    Hash digest
SHA256       1d3f287a04b5adefad3fea258557556ac92e83e3f06445ccd26206ef7f15d04e
MD5          4ef390995b72403568103d72415c7dae
BLAKE2b-256  99a0cb4396fe720b64a90cb01284de3329a2e818efa3b31564f151a26822f5f8
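
The published digests can be used to check a downloaded archive. The sketch below is one way to do that with Python's standard `hashlib` module; the function names `sha256_of` and `matches_expected` are hypothetical, and the local file path in the usage note is an assumption about where the archive was saved.

```python
import hashlib

# SHA256 digest listed above for ldc_doc-0.0.5.tar.gz.
EXPECTED_SHA256 = "1d3f287a04b5adefad3fea258557556ac92e83e3f06445ccd26206ef7f15d04e"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file's SHA256 digest with the expected hex string."""
    return sha256_of(path) == expected.strip().lower()
```

For example, `matches_expected("ldc_doc-0.0.5.tar.gz")` should return True for an intact download saved under that name in the current directory.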
