Easy and simple document to plain text tool. Supported formats: doc, docx, xls, xlsx, pdf, and many more!
Docat
GitHub: https://github.com/lluises/docat
How it works
Docat identifies each document by its MIME type and then selects a parser that extracts all the text from it.
OCR is not currently implemented, so no text is extracted from images.
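The identify-then-dispatch flow described above can be sketched roughly as follows. This is an illustration only, not docat's actual code: the `PARSERS` registry, `read_plain_text`, and `extract_text` are hypothetical names, and the standard-library `mimetypes` module (which guesses from the file extension) stands in for whatever detection docat really uses.

```python
import mimetypes
from pathlib import Path
from typing import Callable, Dict, Union


def read_plain_text(path: Path) -> str:
    """Hypothetical parser: return the file contents as text."""
    return path.read_text(encoding="utf-8", errors="replace")


# Hypothetical registry mapping a MIME type to the parser that handles it.
PARSERS: Dict[str, Callable[[Path], str]] = {
    "text/plain": read_plain_text,
    "application/json": read_plain_text,
}


def extract_text(path: Union[str, Path]) -> str:
    """Guess the MIME type, then hand the file to the matching parser."""
    path = Path(path)
    # Assumption: guess from the file name; a real tool might inspect the content.
    mime, _encoding = mimetypes.guess_type(path.name)
    parser = PARSERS.get(mime) if mime else None
    if parser is None:
        raise ValueError(f"Unsupported MIME type for {path.name}: {mime}")
    return parser(path)
```

Keeping the parsers in a dictionary means a new format only needs one more entry, which is one plausible reason to organize a tool like this around MIME types.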
Use from CLI
usage: docat [-h] [-o OUTPUT] [-l] [-nl] [documents ...]

Document to plain text transformation tool

positional arguments:
  documents            Documents to process

options:
  -h, --help           show this help message and exit
  -o, --output OUTPUT  Output file. By default outputs to stdout
  -l, --list           List all supported mime types and exit
  -nl, --newline       Ensure that the output ends with a newline (\n)
Example
docat myfile.pdf
This prints all the text from myfile.pdf to the console (stdout).
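When the CLI is driven from a script, the options in the usage text can be assembled programmatically. The sketch below is only one way to do it: `build_docat_command` and `run_docat` are hypothetical helper names, and it assumes the `docat` executable is installed and on the PATH.

```python
import subprocess
from typing import List, Optional, Sequence


def build_docat_command(
    documents: Sequence[str],
    output: Optional[str] = None,
    newline: bool = False,
) -> List[str]:
    """Assemble an argument list from the options in the usage text."""
    command = ["docat"]
    if output is not None:
        command += ["-o", output]  # write to a file instead of stdout
    if newline:
        command.append("-nl")  # ensure the output ends with "\n"
    command += list(documents)
    return command


def run_docat(
    documents: Sequence[str],
    output: Optional[str] = None,
    newline: bool = False,
) -> str:
    """Run docat and return whatever it printed to stdout."""
    result = subprocess.run(
        build_docat_command(documents, output, newline),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

For example, `run_docat(["myfile.pdf"], newline=True)` would run `docat -nl myfile.pdf` and return the extracted text.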
Use as a Python library
import docat
text = docat.process("path/to/myfile.pdf")
print(text)
Using Path from pathlib is also supported:
from pathlib import Path
import docat
file_path = Path(".") / "myfile.pdf"
text = docat.process(file_path)
print(text)
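To process several files in one go, the call can be wrapped in a small loop. The helper below is a sketch, not part of docat: `extract_many` is a hypothetical name, and because the exceptions raised by `docat.process` are not documented here, it catches any `Exception` and records it instead of stopping.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple, Union

PathLike = Union[str, Path]


def extract_many(
    paths: Iterable[PathLike],
    process: Callable[[PathLike], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run `process` on each path; return (texts, errors) keyed by path."""
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        key = str(path)
        try:
            texts[key] = process(path)
        except Exception as exc:  # assumption: failure types are not documented
            errors[key] = f"{type(exc).__name__}: {exc}"
    return texts, errors
```

With docat installed, this could be called as `extract_many(Path(".").glob("*.pdf"), docat.process)`. Passing the processing function as an argument keeps the helper easy to test without real documents.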
Supported files
Currently, docat supports:
- Microsoft Word
- Microsoft Excel
- Microsoft PowerPoint
- PDF
- OpenDocument (LibreOffice, OpenOffice...)
- Plain text files
- SVG with plain text embedded
Suggestions for more documents are welcome.
MIME types
The following MIME types are currently supported by docat:
- application/javascript
- application/json
- application/msword
- application/pdf
- application/vnd.ms-excel
- application/vnd.ms-excel.sheet.macroEnabled.12
- application/vnd.ms-powerpoint
- application/vnd.ms-word.document.macroEnabled.12
- application/vnd.oasis.opendocument.presentation
- application/vnd.oasis.opendocument.spreadsheet
- application/vnd.oasis.opendocument.text
- application/vnd.openxmlformats-officedocument.presentationml.presentation
- application/vnd.openxmlformats-officedocument.presentationml.slideshow
- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
- application/vnd.openxmlformats-officedocument.wordprocessingml.document
- image/svg+xml
- text
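A caller that already knows a file's MIME type can check it against the list above before handing the file to docat. The snippet is a sketch under two assumptions: the bare `text` entry is read as "any `text/...` type", and the set is a hand copy of the list, so `docat -l` remains the authoritative source. `SUPPORTED_MIME_TYPES` and `is_supported` are hypothetical names.

```python
# Hand-copied from the list above and lowercased, since MIME types
# are compared case-insensitively.
SUPPORTED_MIME_TYPES = frozenset(
    mime.lower()
    for mime in (
        "application/javascript",
        "application/json",
        "application/msword",
        "application/pdf",
        "application/vnd.ms-excel",
        "application/vnd.ms-excel.sheet.macroEnabled.12",
        "application/vnd.ms-powerpoint",
        "application/vnd.ms-word.document.macroEnabled.12",
        "application/vnd.oasis.opendocument.presentation",
        "application/vnd.oasis.opendocument.spreadsheet",
        "application/vnd.oasis.opendocument.text",
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
        "application/vnd.openxmlformats-officedocument.presentationml.slideshow",
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "image/svg+xml",
    )
)


def is_supported(mime_type: str) -> bool:
    """Return True if the MIME type appears in the list above."""
    normalized = mime_type.strip().lower()
    # Assumption: the bare "text" entry covers every "text/..." subtype.
    if normalized == "text" or normalized.startswith("text/"):
        return True
    return normalized in SUPPORTED_MIME_TYPES
```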
License
All the code in this repository is licensed under the Apache License, Version 2.0. A copy of the license is available in the LICENCE file or online at https://www.apache.org/licenses/LICENSE-2.0.txt.
This program depends on other software packages, each with its own license. Check them to ensure compatibility with your project.
File details
Details for the file docat-1.0.0.tar.gz.
File metadata
- Download URL: docat-1.0.0.tar.gz
- Upload date:
- Size: 9.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9e410a20855ca08351200fd827919eb0863b537485faee151c90d0b5074226d5 |
| MD5 | 5e7c8a37451cc7459375e2eed30d916b |
| BLAKE2b-256 | 9418124185566b641dd007792df18c6bce6617213598771295fa9c292a11e009 |
File details
Details for the file docat-1.0.0-py3-none-any.whl.
File metadata
- Download URL: docat-1.0.0-py3-none-any.whl
- Upload date:
- Size: 13.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 938e7bebaa2dfbc9ea1e21d6b0cd997e478d99925587cf13eee540df95d344f1 |
| MD5 | 59552b8c46937c3bade2e88a7b71aa92 |
| BLAKE2b-256 | 29efb10635b19af40d82bc3bb4bbefade8b4e12d9fda83e3a7d8bd7dd2a7de0a |