Default parser for Paperless-ngx

This is a default parser which can be used by Paperless-ngx if there is no other suitable parser found for a given mime type.

It allows archiving documents of all mime types defined in /etc/mime.types. For every document consumed by this parser, the original file is archived and a PDF as well as a thumbnail are generated.

If a file with a known encoding is parsed, its content is read and stored in the document's content metadata, and a PDF showing this content is generated. Otherwise, the content metadata is left empty and a PDF containing the following note is generated:

This document was archived by a default parser for Paperless-ngx. 

original file name: $file_name
mime type: $mime_type

Download original file to work with it.
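The two cases described above can be sketched as follows. This is only an illustration, not the parser's actual code: the names NOTE_TEMPLATE, extract_content and pdf_text are hypothetical, and trying a UTF-8 decode is an assumed stand-in for however the parser decides that an encoding is "known".

```python
from __future__ import annotations

from string import Template

# Wording taken from the note above; the constant name is hypothetical.
NOTE_TEMPLATE = Template(
    "This document was archived by a default parser for Paperless-ngx.\n"
    "\n"
    "original file name: $file_name\n"
    "mime type: $mime_type\n"
    "\n"
    "Download original file to work with it.\n"
)


def extract_content(raw: bytes, encodings: tuple[str, ...] = ("utf-8",)) -> str | None:
    """Return the decoded text if a candidate encoding fits, else None.

    Assumption: decoding attempts stand in for the real encoding detection.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


def pdf_text(raw: bytes, file_name: str, mime_type: str) -> tuple[str, str]:
    """Return (content metadata, text to be rendered into the PDF)."""
    content = extract_content(raw)
    if content is not None:
        # Known encoding: the content is stored and shown in the PDF.
        return content, content
    # Unknown encoding: empty content metadata, the PDF carries the note.
    note = NOTE_TEMPLATE.substitute(file_name=file_name, mime_type=mime_type)
    return "", note
```

Rendering the returned text into a PDF (done via Gotenberg in the real parser) is left out of this sketch.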

Prerequisites

This parser requires Gotenberg to be configured for Paperless-ngx.

Installation

  1. Install from PyPI

    pip install paperlessngx-default-parser

    For Docker-based installations, use custom container initialization as described here: https://docs.paperless-ngx.com/advanced_usage/#custom-container-initialization

    Place a script with the following content in the directory for your container initialization scripts and make it executable:

    #!/bin/bash
    pip install paperlessngx-default-parser
    
  2. Add this parser to the PAPERLESS_APPS environment variable, e.g. in your paperless.conf: PAPERLESS_APPS="paperlessngx-default-parser.apps.DefaultParserConfig"

FAQ

Error: File type {mime-type} not supported

Paperless-ngx uses magic numbers to identify the mime type of a file which should be consumed/archived.

On the other hand, Paperless-ngx currently requires a custom parser to declare a dictionary of the mime types it supports, with one default extension per mime type. See also "Support for arbitrary binary files? #805" for a proposal to change this behaviour.

This default parser registers itself for all mime types defined in /etc/mime.types. It uses the first file extension defined in /etc/mime.types for a given mime type as the default extension for this mime type, or an empty string if no extension is defined at all.
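The mapping described above can be illustrated with a small sketch. It assumes the usual mime.types layout (one mime type per line, followed by zero or more extensions, with # starting a comment). The function name default_extensions and the leading dot on each extension are assumptions, not taken from the parser's source.

```python
from __future__ import annotations


def default_extensions(mime_types_text: str) -> dict[str, str]:
    """Map each mime type to its first listed extension, or "" if none is listed."""
    result: dict[str, str] = {}
    for line in mime_types_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        mime_type, *extensions = line.split()
        # Keep the first definition if a mime type appears more than once.
        result.setdefault(mime_type, "." + extensions[0] if extensions else "")
    return result


# Hypothetical sample in mime.types format.
SAMPLE = """\
# mime type             extensions
application/pdf         pdf
image/jpeg              jpeg jpg jpe
application/x-example
"""

mapping = default_extensions(SAMPLE)
```

With this sample, image/jpeg maps to its first extension and application/x-example maps to an empty string, which mirrors the rule stated above.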

Since the magic numbers database and /etc/mime.types don't have to be - and in fact are not - in sync, the following situation might occur:

Paperless-ngx identifies - by using magic numbers - a mime type which is not listed in /etc/mime.types. This results in the error File type {mime-type} not supported because the default parser could not register itself for this mime type.

Solution: Add the missing mime type to /etc/mime.types.

Error: Not consuming file {filepath}: Unknown file extension.

At the moment, Paperless-ngx handles files differently depending on whether they are imported via the consumption directory or via the UI.

When importing a file via the UI, Paperless-ngx only determines the mime type of the file using magic numbers and checks whether a parser is registered for this mime type.

When importing a file through the consumption directory, an additional check is performed first:

Paperless-ngx collects all file extensions for the given mime type by looking at

  • /etc/mime.types and
  • the default extension a parser for this mime type declares.

A file in the consumption directory is then only consumed if its file extension matches one of these extensions.

For example:

Consider a file test.yaml which has mime type text/plain.

Importing it via the UI successfully archives the document. Importing the same document via the consumption directory leads to the error Not consuming file /usr/src/paperless/consume/test.yaml: Unknown file extension.
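The extension check for the consumption directory can be sketched like this. It is a simplified model of the rule described above, not Paperless-ngx's actual code. The function name is_consumable and the sample extension lists are hypothetical.

```python
from __future__ import annotations

from pathlib import PurePath


def is_consumable(
    file_name: str,
    mime_type: str,
    mime_types_extensions: dict[str, list[str]],
    parser_default_extension: str,
) -> bool:
    """Check a file's extension against all extensions known for its mime type."""
    # Extensions listed for this mime type in /etc/mime.types ...
    allowed = set(mime_types_extensions.get(mime_type, []))
    # ... plus the default extension the parser declares for it.
    allowed.add(parser_default_extension)
    return PurePath(file_name).suffix in allowed


# Hypothetical excerpt of the extensions listed for text/plain.
known = {"text/plain": [".txt", ".text"]}

before = is_consumable("test.yaml", "text/plain", known, ".txt")

# After adding the extension to the mime type's entry, the check passes.
known["text/plain"].append(".yaml")
after = is_consumable("test.yaml", "text/plain", known, ".txt")
```

Here before is False and after is True, which corresponds to the error above and to the effect of adding the extension to /etc/mime.types.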

Solution: Either import the file via UI or add the unknown file extension to the file extensions for this mime type in /etc/mime.types.

File extension when downloading original file

At the moment Paperless-ngx uses a default extension per mime type when downloading an original file.

For example, files of mime type application/octet-stream get the extension .bin, and files of mime type text/plain get the extension .txt when downloaded.

Solution: To open a file with the program it originated from, you may therefore have to change the file extension of the downloaded file manually.
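If many files are affected, the renaming can be scripted. The following is a minimal sketch that assumes the original file name is still known, for example from the note in the generated PDF. The helper name restore_extension is hypothetical.

```python
from pathlib import PurePath


def restore_extension(downloaded_name: str, original_name: str) -> str:
    """Return the downloaded file name with the original file's extension."""
    original_suffix = PurePath(original_name).suffix
    return str(PurePath(downloaded_name).with_suffix(original_suffix))


restored = restore_extension("test.txt", "test.yaml")
```

The result can then be passed to a rename call such as pathlib.Path.rename on the downloaded file.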

How to modify /etc/mime.types used by Paperless-ngx

For example:

  • Add missing mime types using add_missing_mime_types.sh (see examples there)
  • Create your own custom container initialization script to add/modify mime types.
  • Use your own mime.types file and bind it to /etc/mime.types
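As an illustration of the second option, the following sketch adds an extension to the text of a mime.types file. It is not the add_missing_mime_types.sh script mentioned above, and the function name add_extension is hypothetical. It assumes the usual layout of one mime type per line followed by its extensions.

```python
def add_extension(mime_types_text: str, mime_type: str, extension: str) -> str:
    """Return mime.types text in which mime_type lists extension (given without dot)."""
    lines = mime_types_text.splitlines()
    for index, line in enumerate(lines):
        fields = line.split("#", 1)[0].split()
        if fields and fields[0] == mime_type:
            if extension not in fields[1:]:
                # Rebuild the entry; an inline comment on this line is dropped.
                lines[index] = fields[0] + "\t" + " ".join(fields[1:] + [extension])
            return "\n".join(lines) + "\n"
    # The mime type is not listed yet: append a new entry.
    lines.append(f"{mime_type}\t{extension}")
    return "\n".join(lines) + "\n"


# Hypothetical one-line mime.types content.
current = "text/plain\ttxt text\n"
updated = add_extension(current, "text/plain", "yaml")
```

Writing the result back to /etc/mime.types requires sufficient permissions inside the container, which is why a custom container initialization script is a natural place for such a step.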
