A local, offline document archive
filecabinet
filecabinet is a minimal document management system for your computer. It has metadata per document and supports fulltext search in various document types.
Installing
The easiest way to install is to use pip:
pip install filecabinet
Alternatively, you can get the source code from Codeberg and install it from the cloned directory:
git clone https://codeberg.org/vonshednob/filecabinet
pip install ./filecabinet
Requirements
filecabinet requires the xapian Python bindings, which cannot be installed through pip!
Other automatically installed required dependencies are:
Even though optional, I strongly recommend installing Tesseract OCR to enable fulltext search in scanned documents.
Quick start
To initialize your file cabinet, run filecabinet init and provide a new path where you would like to store your documents:
filecabinet init ~/Documents/cabinet
Now you can either copy files into ~/Documents/cabinet/inbox and run
filecabinet pickup
to process them, or add files manually via
filecabinet add ~/some_scanned_document.jpg
To get a basic overview of documents, you can use the Shell.
Workflow / Use cases
Here’s the usual workflow with filecabinet:
- Put some documents (PDF, scanned documents, etc.) into the inbox folder of your cabinet
- Run filecabinet pickup
- List all new documents with filecabinet list new
Other use cases are:
- Search for a specific document with filecabinet find "searchterm" "other search term"
- Edit the metadata of a document through the shell, filecabinet shell (see next section)
Shell
There’s a basic shell that allows you to inspect indexed documents, edit their metadata (by means of an external text editor), or view the documents.
To open the shell, run
filecabinet shell
Try help inside the shell to see what your options are.
Metadata editing
If you want to use a specific text editor to modify metadata, consider updating the Shell section of your configuration file and adding a document_editor, like this:
[Shell]
editor = subl -w
In this example we set up Sublime Text as the external editor. Note that the -w option is necessary to make filecabinet wait until you’re done editing the file before returning to the shell.
Visual Studio Code uses the -w or --wait flag to accomplish the same behaviour.
Searching
Tags are searched case-insensitively using the tag: prefix. For example, if you're looking for a document that's tagged with banana, you can search for it with tag:banana.
Searching new documents is accomplished by searching for tag:new. If you only want to find documents that are not new, you can also search for -tag:new. Unless specified, a search will ignore whether or not a document is new.
You can search for any metadata value, like title, author, or language, by searching with the metadata name and a colon, like title:gravity.
Everything else that does not match the special search terms will be used in the fulltext search.
If you want to search for terms with whitespace, you can use quotes: title:"brain surgery".
Examples:
- The title contains "brain", the author is "Gumby", and the date was set to some time before August 2015: title:brain author:gumby date:2015-08-01
- A newly added document with the title "The Larch": title:larch tag:new
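The search syntax described above (name:value pairs, quotes around terms containing whitespace, everything else treated as fulltext) can be composed programmatically. The following is a minimal sketch, not part of filecabinet: build_query is a hypothetical helper that only assembles the query string and does not run a search.

```python
def build_query(*fulltext, **fields):
    """Compose a search string from fulltext terms and name:value pairs.

    Hypothetical helper for illustration; it only builds the string.
    """
    def quote(term):
        # Terms containing whitespace have to be wrapped in quotes.
        return f'"{term}"' if any(ch.isspace() for ch in term) else term

    # Metadata terms first (in the order given), then fulltext terms.
    parts = [f"{name}:{quote(str(value))}" for name, value in fields.items()]
    parts += [quote(term) for term in fulltext]
    return " ".join(parts)


print(build_query(title="brain surgery", tag="new"))
print(build_query("larch", author="gumby"))
```

The resulting strings follow the same form as the examples in this section and could be typed wherever filecabinet accepts a search.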
Grouping of pages
Sometimes you will have a scanned document in the form of multiple pages, each page a .jpg file, like page1.jpg, page2.jpg, page3.jpg. Of course all these pages form the same document.
To tell filecabinet that these files all belong to the same document, you can put them in a folder inside the inbox before running pickup:
inbox/doc/page1.jpg
inbox/doc/page2.jpg
inbox/doc/page3.jpg
This will tell filecabinet that they all belong to the same document.
Here you can also hint at the language of the document for OCR (see Language hinting in the next section) by calling the folder, for example, doc-nl to indicate that all pages are written in Dutch.
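If you collect scanned pages with a script, the grouping step can be automated before running pickup. This is a minimal sketch using only the standard library; group_pages and its parameters are hypothetical names and not part of filecabinet. It assumes that moving the files into a subfolder of the inbox is all that is needed, as described above.

```python
import shutil
from pathlib import Path


def group_pages(pages, inbox, name, language=None):
    """Move page files into one inbox subfolder so they form one document.

    Hypothetical helper: `name` is the folder name, and `language` is an
    optional language code appended as a suffix (for example "doc-nl").
    """
    folder_name = f"{name}-{language}" if language else name
    folder = Path(inbox).expanduser() / folder_name
    folder.mkdir(parents=True, exist_ok=True)
    for page in pages:
        page = Path(page)
        # Keep the original file name inside the new folder.
        shutil.move(str(page), str(folder / page.name))
    return folder
```

For example, group_pages(["page1.jpg", "page2.jpg", "page3.jpg"], "~/Documents/cabinet/inbox", "doc", language="nl") would create inbox/doc-nl containing the three pages.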
OCR
filecabinet can use Tesseract OCR to do character recognition on pictures and scanned PDFs, so you can search the text of images.
In order for that to work, you have to install Tesseract and some language packages, depending on the languages of the documents you wish to scan.
If you don't have Tesseract OCR installed, filecabinet will still work, but it will be much less useful.
Language hinting
You can tell filecabinet what language a document is in while it is still in the inbox by adding the language as a suffix to the file name: a hyphen followed by the language code (ISO-639).
A few examples will help. Consider these files:
page-1.jpg
contract.png
Suppose your default language is set to English (default-language = eng in the configuration file); page-1.jpg is in English but contract.png is in German.
OCR will likely have difficulties with letters like öäü in contract.png unless you tell it what language the document is in:
contract-ger.png
ger is one of the ISO-639 language codes for German (others are de and deu; see Wikipedia for the full list).
With this -ger suffix, filecabinet will use the correct language package (if you have it installed) and the OCR will yield much better results.
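To make the naming convention concrete, here is a small sketch of how such a suffix could be read from a file name. It is an illustration only and not filecabinet's actual implementation: language_hint and the pattern are assumptions, and the sketch treats any trailing hyphen followed by two or three lowercase letters as a language code without checking it against the ISO-639 list.

```python
import re
from pathlib import Path

# Assumption: a hint is a trailing hyphen plus a 2- or 3-letter code.
HINT = re.compile(r"-([a-z]{2,3})$")


def language_hint(filename, default="eng"):
    """Return the language suffix of a file name, or the default language."""
    # Only the name without its extension is inspected.
    match = HINT.search(Path(filename).stem)
    return match.group(1) if match else default


for name in ("page-1.jpg", "contract.png", "contract-ger.png"):
    print(name, "->", language_hint(name))
```

Note that page-1.jpg falls back to the default because 1 is not a letter code, while a name such as my-doc.png would be misread as a hint by this simplified pattern. A real check would compare the suffix with the known language codes.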
Rule based tagging
By using metaindex, filecabinet inherits its powerful rule-based tagging. This allows you to automatically add metadata tags to documents based on their text (which might have come from OCR).
Rules are defined in text files, and you have to point filecabinet to the rule files that you want it to use. To do that, add a [Rules] section to your configuration file (usually at ~/.config/filecabinet/filecabinet.conf) and list your rule files like this:
[Rules]
base = ~/.config/filecabinet/basic_rules.txt
companies = ~/Document/company_rules.txt
The names (before the =) are somewhat free-form descriptors.
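A mistyped path in the [Rules] section is easy to overlook, so it can help to check that every listed rule file exists. The sketch below assumes the configuration file can be read by Python's configparser, which matches the [Section] and key = value form shown here but is not confirmed for every option; rule_files is a hypothetical helper and not part of filecabinet.

```python
import configparser
from pathlib import Path

# The [Rules] section from the example above.
EXAMPLE = """
[Rules]
base = ~/.config/filecabinet/basic_rules.txt
companies = ~/Document/company_rules.txt
"""


def rule_files(config_text):
    """Map each descriptor in [Rules] to its path with ~ expanded."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section("Rules"):
        return {}
    return {name: Path(value).expanduser()
            for name, value in parser["Rules"].items()}


for name, path in rule_files(EXAMPLE).items():
    status = "found" if path.is_file() else "missing"
    print(f"{name}: {path} ({status})")
```

To check a real configuration, the text of ~/.config/filecabinet/filecabinet.conf could be passed in place of EXAMPLE.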
To understand how to write these rule files, please have a look at the metaindex documentation.
To test your rules on documents, you can use the filecabinet test-rules command. It will run all indexers on a file and show you what tags have been found by your rules.
When using test-rules, the tested document will not be added to your cabinet.
Cabinet Directory Structure
Assuming a cabinet is set up at ~/cabinet, the directory structure is:
~/cabinet
├── inbox
├── metaindex.conf
├── metaindex.log
└── documents
    └── <partial document id>
        └── <full document id>
            ├── <document id>.yaml
            ├── <document id>.<suffix>
            └── <document id>.txt
- inbox will be processed (and emptied) when filecabinet pickup is being run
- documents contains the documents
- <document id>.yaml contains the metadata
- <document id>.<suffix> is the original document (usually a PDF)
- <document id>.txt is the extracted full text, if it could be extracted
- metaindex.conf is the configuration file for filecabinet's metaindexserver
- metaindex.log is the log file of filecabinet's metaindexserver
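Because the layout is plain files and folders, the stored documents can be inspected without filecabinet itself, for example for a backup script. This is a minimal read-only sketch that relies solely on the directory structure described above; list_documents is a hypothetical helper, and it assumes that the original document never has a .yaml or .txt suffix of its own.

```python
from pathlib import Path


def list_documents(cabinet):
    """Yield (document id, metadata file, original files) per stored document."""
    root = Path(cabinet).expanduser() / "documents"
    # Layout: documents/<partial id>/<full id>/<document id>.yaml
    for meta in sorted(root.glob("*/*/*.yaml")):
        doc_id = meta.stem
        # Everything with the same stem that is neither the metadata
        # nor the extracted text is treated as the original document.
        originals = sorted(
            p for p in meta.parent.iterdir()
            if p.stem == doc_id and p.suffix not in (".yaml", ".txt")
        )
        yield doc_id, meta, originals
```

Calling list(list_documents("~/cabinet")) would give one entry per document folder that contains a metadata file.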
Configuration
filecabinet itself as well as each individual cabinet can be configured through the user’s configuration file (usually in ~/.config/filecabinet/filecabinet.conf).
See example.conf for all configuration options!
Usage from Python
To use filecabinet from Python, you can use this boilerplate:
from filecabinet import Manager
manager = Manager()
manager.launch_server()
session = manager.new_session()
session will be an instance of Session which, together with manager, allows manipulation of metadata and querying of documents.