Easy and simple document to plain text tool. Supported formats: doc, docx, xls, xlsx, pdf, and many more!

Project description

Docat

GitHub: https://github.com/lluises/docat

How it works

Docat identifies each document by its MIME type and then selects a matching parser to extract all the text from it.

OCR is not currently implemented, so no text is extracted from images.
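
The identify-then-dispatch idea described above can be sketched as follows. This is only an illustration, not docat's actual implementation: the PARSERS registry, parse_plain_text, and extract_text are hypothetical names, and the MIME type is guessed from the file extension with the standard library's mimetypes module, which may differ from how docat detects types.

```python
import mimetypes
from pathlib import Path
from typing import Callable, Dict, Union


def parse_plain_text(path: Path) -> str:
    """Hypothetical parser: read a plain text file as UTF-8."""
    return path.read_text(encoding="utf-8", errors="replace")


# Hypothetical registry mapping a MIME type to the parser that handles it.
PARSERS: Dict[str, Callable[[Path], str]] = {
    "text/plain": parse_plain_text,
}


def extract_text(path: Union[str, Path]) -> str:
    """Guess the MIME type, pick the registered parser, and run it."""
    path = Path(path)
    # Assumption: the type is guessed from the file extension only.
    mime_type, _encoding = mimetypes.guess_type(path.name)
    parser = PARSERS.get(mime_type) if mime_type else None
    if parser is None:
        raise ValueError(
            f"No parser registered for {path.name!r} (MIME type: {mime_type})"
        )
    return parser(path)
```

Supporting another format in this sketch would mean adding one more entry to the registry, which is the main appeal of dispatching on the MIME type.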

Use from CLI

usage: docat [-h] [-o OUTPUT] [-l] [-nl] [documents ...]

Document to plain text transformation tool

positional arguments:
  documents            Documents to process

options:
  -h, --help           show this help message and exit
  -o, --output OUTPUT  Output file. By default outputs to stdout
  -l, --list           List all supported mime types and exit
  -nl, --newline       Ensure that the output ends with a newline (\n)

Example

docat myfile.pdf

This outputs all the text from myfile.pdf to the console (stdout).

Use as a python library

import docat

text = docat.process("path/to/myfile.pdf")

print(text)

Using Path from pathlib is also supported:

from pathlib import Path
import docat

file_path = Path(".") / "myfile.pdf"
text = docat.process(file_path)

print(text)
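
To process several files at once, a small helper along these lines may be convenient. It is a sketch under assumptions: extract_folder is a hypothetical name that is not part of docat, and the extraction function is passed in as an argument so the helper itself does not depend on any particular library.

```python
from pathlib import Path
from typing import Callable, Dict, Union


def extract_folder(
    folder: Union[str, Path],
    process: Callable[[Path], str],
    pattern: str = "*.pdf",
) -> Dict[str, str]:
    """Return {file name: extracted text} for files in `folder` matching `pattern`."""
    results: Dict[str, str] = {}
    for path in sorted(Path(folder).glob(pattern)):
        if path.is_file():
            # `process` is expected to take a path and return plain text.
            results[path.name] = process(path)
    return results
```

With docat installed, calling extract_folder("reports", docat.process) should return the text of every PDF in a folder named reports, since docat.process accepts Path objects as shown above.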

Supported files

Currently, docat supports:

  • Microsoft Word
  • Microsoft Excel
  • Microsoft PowerPoint
  • OpenDocument (LibreOffice, OpenOffice...)
  • PDF
  • Plain text files
  • SVG with plain text embedded

Suggestions for more documents are welcome.

MIME types

The following MIME types are currently supported by docat:

  • application/javascript
  • application/json
  • application/msword
  • application/pdf
  • application/vnd.ms-excel
  • application/vnd.ms-excel.sheet.macroEnabled.12
  • application/vnd.ms-powerpoint
  • application/vnd.ms-word.document.macroEnabled.12
  • application/vnd.oasis.opendocument.presentation
  • application/vnd.oasis.opendocument.spreadsheet
  • application/vnd.oasis.opendocument.text
  • application/vnd.openxmlformats-officedocument.presentationml.presentation
  • application/vnd.openxmlformats-officedocument.presentationml.slideshow
  • application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  • application/vnd.openxmlformats-officedocument.wordprocessingml.document
  • image/svg+xml
  • text
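
Before handing a file to docat, it can be useful to check whether its type appears in the list above. The following is a rough sketch, not part of docat: looks_supported is a hypothetical helper, the type is guessed from the file extension with the standard library's mimetypes module, and the bare "text" entry is assumed to cover every text/* subtype. The authoritative list is the one printed by docat -l.

```python
import mimetypes

# MIME types copied from the list above (the bare "text" entry is handled separately).
SUPPORTED_MIME_TYPES = {
    "application/javascript",
    "application/json",
    "application/msword",
    "application/pdf",
    "application/vnd.ms-excel",
    "application/vnd.ms-excel.sheet.macroEnabled.12",
    "application/vnd.ms-powerpoint",
    "application/vnd.ms-word.document.macroEnabled.12",
    "application/vnd.oasis.opendocument.presentation",
    "application/vnd.oasis.opendocument.spreadsheet",
    "application/vnd.oasis.opendocument.text",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/vnd.openxmlformats-officedocument.presentationml.slideshow",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "image/svg+xml",
}


def looks_supported(filename: str) -> bool:
    """Guess the MIME type from the extension and compare it with the list."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return False
    # Assumption: the "text" entry means any text/* subtype is accepted.
    return mime_type in SUPPORTED_MIME_TYPES or mime_type.startswith("text/")
```

Because the guess relies on the extension alone, a misnamed file can pass or fail this check incorrectly, so treat it as a quick pre-filter only.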

License

All the code in this repository is licensed under the Apache License, Version 2.0. A copy of the license is available in the LICENCE file or online at https://www.apache.org/licenses/LICENSE-2.0.txt.

This program depends on other software packages, which have their own licenses. Check them to ensure compatibility with your project.

Download files

Source Distribution

docat-1.0.0.tar.gz (9.5 kB)

Uploaded Source

Built Distribution

docat-1.0.0-py3-none-any.whl (13.5 kB)

Uploaded Python 3

File details

Details for the file docat-1.0.0.tar.gz.

File metadata

  • Download URL: docat-1.0.0.tar.gz
  • Size: 9.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for docat-1.0.0.tar.gz:

  Algorithm    Hash digest
  SHA256       9e410a20855ca08351200fd827919eb0863b537485faee151c90d0b5074226d5
  MD5          5e7c8a37451cc7459375e2eed30d916b
  BLAKE2b-256  9418124185566b641dd007792df18c6bce6617213598771295fa9c292a11e009
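
A downloaded file can be checked against its published SHA256 digest with a few lines of standard-library Python. This is a minimal sketch: sha256_of_file and matches_published_hash are hypothetical helper names, and the file is read in chunks so that large downloads do not need to fit in memory.

```python
import hashlib
from pathlib import Path
from typing import Union


def sha256_of_file(path: Union[str, Path], chunk_size: int = 65536) -> str:
    """Return the hexadecimal SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Union[str, Path], expected_hex: str) -> bool:
    """Compare the file's digest with a published value, ignoring case and spaces."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_published_hash("docat-1.0.0.tar.gz", "9e410a20855ca08351200fd827919eb0863b537485faee151c90d0b5074226d5") should return True for an intact copy of the source distribution. The same helpers work for the wheel with its own SHA256 value.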

File details

Details for the file docat-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: docat-1.0.0-py3-none-any.whl
  • Size: 13.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for docat-1.0.0-py3-none-any.whl:

  Algorithm    Hash digest
  SHA256       938e7bebaa2dfbc9ea1e21d6b0cd997e478d99925587cf13eee540df95d344f1
  MD5          59552b8c46937c3bade2e88a7b71aa92
  BLAKE2b-256  29efb10635b19af40d82bc3bb4bbefade8b4e12d9fda83e3a7d8bd7dd2a7de0a
