Converts a PDF into a LibreOffice Writer document, with images resized to A4 and anchored as character

What is pdf2odt

'pdf2odt' is a tool I developed to integrate PDF files into my university notes, which I take in LibreOffice.

Sometimes I need to edit a PDF's content while keeping the original document. So I convert its pages to images (anchored as character), add them to the document, and then insert their content as text after running it through OCR.

This tool is not intended to be a PDF converter that clones the original formatting.

It uses pdftoppm from poppler to do the conversion.
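As a rough illustration of that step, the sketch below shows one way pdftoppm could be called from Python to render each PDF page to a PNG file. It is not pdf2odt's actual code: the function names, the "page" file prefix, and the 150 dpi default are assumptions made for the example. Only the pdftoppm options `-png` and `-r` come from poppler itself.

```python
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, output_prefix, dpi=150):
    """Return the argument list that renders every page of a PDF to PNG.

    The dpi default is an assumption for this sketch, not a pdf2odt setting.
    """
    return [
        "pdftoppm",
        "-png",            # write PNG files instead of the default PPM
        "-r", str(dpi),    # rendering resolution in dots per inch
        str(pdf_path),     # input PDF
        str(output_prefix) # output root; pdftoppm appends "-<page>.png"
    ]


def render_pages(pdf_path, output_dir, dpi=150):
    """Run pdftoppm and return the generated PNG paths, sorted by name."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    prefix = output_dir / "page"  # hypothetical prefix for this example
    subprocess.run(build_pdftoppm_command(pdf_path, prefix, dpi), check=True)
    # pdftoppm zero-pads page numbers, so sorting by name keeps page order.
    return sorted(output_dir.glob("page-*.png"))
```

Calling `render_pages("doc.pdf", "pages")` would require pdftoppm to be available on PATH; the command builder is kept separate so it can be inspected without running anything.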

Links

Project main page https://github.com/turulomio/pdf2odt/

Pypi web page: https://pypi.org/project/pdf2odt/

Installation and use in Linux

Before installing, you must have poppler installed, because pdf2odt uses its pdftoppm command. You can install it with your distribution's package manager.

You also need LibreOffice with its Python bindings, because the unogenerator dependency uses them.
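If you are unsure whether these prerequisites are in place, a small check like the following may help. It is only a sketch and not part of pdf2odt: the helper names are hypothetical, and it assumes the LibreOffice Python bindings are importable as the `uno` module from the interpreter you run it with.

```python
import importlib.util
import shutil


def missing_tools(names, which=shutil.which):
    """Return the command names from `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


def has_uno_bindings():
    """Report whether the `uno` module is importable (assumed binding name)."""
    return importlib.util.find_spec("uno") is not None


if __name__ == "__main__":
    # tesseract is only needed when OCR is wanted.
    print("Missing commands:", missing_tools(["pdftoppm", "tesseract"]))
    print("LibreOffice Python bindings found:", has_uno_bindings())
```

The `which` parameter exists only so the lookup can be replaced when trying the function out; by default it uses `shutil.which`.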

Then just type:

pip install pdf2odt

Once installed, you can run it by typing:

pdf2odt --pdf doc.pdf doc.odt

If you want OCR, you have to install the tesseract application and then run:

pdf2odt --pdf doc.pdf --tesseract doc.odt
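The changelog notes that pdf2odt detects whether the selected tesseract language is supported. How it does so is not described here, but a minimal sketch of one possible approach is to parse the output of `tesseract --list-langs`. The function names below are hypothetical, and the header line format is an assumption based on common tesseract builds.

```python
import subprocess


def parse_tesseract_languages(listing):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    languages = []
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and the header line that precedes the codes.
        if not line or line.startswith("List of available languages"):
            continue
        languages.append(line)
    return languages


def installed_tesseract_languages():
    """Run tesseract and return the language codes it reports."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some tesseract versions print the list to stderr, so read both streams.
    return parse_tesseract_languages(result.stdout + result.stderr)
```

With this, a check such as `"spa" in installed_tesseract_languages()` could be made before starting OCR.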

Installation and use in Windows

You need Python installed. It works with the latest version. Don't forget to add the Python executables to PATH by ticking that option during the installation process.

Then just type:

pip install pdf2odt

Now download poppler for Windows from https://blog.alivate.com.au/poppler-windows/. Uncompress the downloaded file and add its installation directory to the Windows PATH environment variable. This guide explains how to do it: https://www.architectryan.com/2018/03/17/add-to-the-path-on-windows-10/

Now you can use it by typing in the Windows shell:

pdf2odt --pdf doc.pdf doc.odt

If you want OCR, you have to download tesseract for Windows from https://github.com/UB-Mannheim/tesseract/wiki. Then add its installation directory to the Windows PATH environment variable too, and run:

pdf2odt --pdf doc.pdf --tesseract doc.odt

Dependencies

Changelog

1.0.0 (2024-12-22)

  • Migrated to unogenerator
  • Updated to poetry

0.7.0

  • Fixed bug with tesseract parameter position. Thanks @maxlem-neuralium
  • Temporary files are now generated with the tempfile module.

0.6.0

  • Tesseract language is now shown in output
  • Now pdf2odt validates PDF document

0.5.0

  • Now pdf2odt detects if tesseract language selected is supported.

0.4.0

  • Added OCR support with tesseract
  • Now uses process concurrency and shows a progress bar

0.3.0

  • Fixed problem with paths containing white spaces in Windows.
  • Improved metadata information.

0.2.0

  • Now works on Windows with a poppler for Windows installation

0.1.0

  • Basic functionality
