SisC is a tool to automatically separate annotations from the underlying text. SisC uses a fingerprint, that is, a masked version of the text to merge stand-off annotations with another version of the original text, for example, extracted from a PDF file. The fingerprint cannot be used on its own to recreate (meaningful parts of) the original text and can therefore be shared.
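To make the term "stand-off annotation" concrete, here is a minimal sketch in which annotations are stored as character offsets, separate from the text they refer to. The `Annotation` class, its field names, and the sample sentence are hypothetical illustrations, not SisC's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # inclusive character offset into the text
    end: int    # exclusive character offset into the text
    label: str


def resolve(text: str, annotations: list[Annotation]) -> list[tuple[str, str]]:
    """Return (label, covered text) pairs by slicing the text at each offset pair."""
    return [(a.label, text[a.start:a.end]) for a in annotations]


# Hypothetical sample sentence with one annotation covering the word "quote".
text = "Some text with an annotated quote."
annotations = [Annotation(start=28, end=33, label="quotation")]
print(resolve(text, annotations))
```

Because the annotations only hold offsets and labels, they can be shared without the text itself, as long as the recipient can map the offsets back onto their own copy.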
Installation
pip install sisc
Dependencies
For PDF processing, SisC uses pdf2image and Tesseract. Both need to be installed.
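A quick way to check whether the external tools are available is to look for their executables on the PATH. This is only a sketch: pdf2image relies on the Poppler utilities (for example `pdftoppm`), and Tesseract ships a `tesseract` executable, but which executables SisC actually calls is an assumption here.

```python
import shutil


def find_missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of executables that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = find_missing_tools()
if missing:
    print("Missing executables:", ", ".join(missing))
else:
    print("All external tools found.")
```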
Usage
SisC provides a command-line interface.
SisC currently supports TEI XML best as an input format. Other formats are partially supported, and more formats can easily be added.
Creating a fingerprint
For XML files, two types of masking are available: uniform and context. A fingerprint with uniform masking is created with:
sisc fingerprint uniform input_path output_path
If input_path is a folder, all files in that folder of the type specified with --file-type are processed.
By default, --file-type is set to xml.
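The folder behaviour described above can be pictured roughly as follows. This is a sketch under assumptions: the function name `collect_inputs` is hypothetical, and whether SisC recurses into subfolders or matches extensions case-insensitively is not stated, so this version does not recurse and ignores case.

```python
from pathlib import Path


def collect_inputs(input_path: str, file_type: str = "xml") -> list[Path]:
    """Return the files to process for a single file or a folder."""
    path = Path(input_path)
    if path.is_dir():
        suffix = f".{file_type}"
        # Only direct children with the requested extension are selected.
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() == suffix
        )
    # A single file is processed regardless of file_type.
    return [path]


print(collect_inputs("."))
```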
All command line options for uniform fingerprinting
usage: sisc fingerprint uniform [-h] [--file-type {txt,xml}]
                                [--move-notes | --no-move-notes]
                                [--add-quotation-marks | --no-add-quotation-marks]
                                [-s SYMBOL] [-d DISTANCE]
                                input-path output-path

Command to use uniform masking for the fingerprint.

positional arguments:
  input-path            Path to txt or xml file to create fingerprint from.
                        Can be a folder in which case all files will be
                        processed.
  output-path           Output folder path.

options:
  -h, --help            show this help message and exit
  --file-type {txt,xml}
                        The input file type to process. Only used when
                        input_path is a folder (default: xml).
  --move-notes, --no-move-notes
                        This will move footnotes and endnotes to the end of
                        their page/the whole text. Only works with XML files
                        that are annotated with footnotes/endnotes and
                        page breaks. (default: False)
  --add-quotation-marks, --no-add-quotation-marks
                        Add quotation marks in the fingerprint. Useful when
                        quotation marks are not present in the annotated XML
                        file. (default: False)
  -s SYMBOL, --symbol SYMBOL
                        The character to use for masking (default: _).
  -d DISTANCE, --distance DISTANCE
                        The number of characters to mask between not masked
                        characters (default: 10)
Masking
For TEI XML files, SisC supports moving footnotes to the end of the page if the TEI XML files contain annotations for
footnotes and page breaks. This is useful when the footnotes were moved to their anchor position during annotation.
To turn on moving of footnotes, the command line option --move-notes can be used.
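The idea behind moving footnotes can be sketched as follows: while flattening the XML to text, the content of each note is held back and appended once the next page break is reached. This is an illustration only, not SisC's implementation. It assumes the TEI element names `note` and `pb`, ignores attributes, and does not handle page breaks inside notes; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace such as '{http://www.tei-c.org/ns/1.0}note'."""
    return tag.rsplit("}", 1)[-1]


def pages_with_notes_at_end(xml: str) -> list[str]:
    """Flatten XML to one string per page, with note texts after the page text."""
    body: list[str] = []   # running text of the current page
    notes: list[str] = []  # note texts waiting for the page to end
    pages: list[str] = []

    def flush() -> None:
        pages.append("\n".join(["".join(body).strip(), *notes]))
        body.clear()
        notes.clear()

    def walk(element: ET.Element, target: list[str]) -> None:
        target.append(element.text or "")
        for child in element:
            name = local_name(child.tag)
            if name == "pb":
                flush()  # a page break closes the current page
            elif name == "note":
                note_parts: list[str] = []
                walk(child, note_parts)  # collect the note text separately
                notes.append("".join(note_parts))
            else:
                walk(child, target)
            # Text following a child element belongs to the surrounding text.
            target.append(child.tail or "")

    walk(ET.fromstring(xml), body)
    flush()
    return pages


sample = (
    "<text><p>First claim<note>Note one.</note> and more text.</p>"
    '<pb n="2"/><p>Second page.</p></text>'
)
for page in pages_with_notes_at_end(sample):
    print(page)
    print("---")
```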
We currently support two types of masking: Uniform masking and context masking.
Uniform masking keeps a certain number of characters, for example two, then masks a certain number of characters, for example five, then keeps two characters and so on. For example:
S___ _ex_ ___h __ __no_____ q____.
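Uniform masking as described above can be sketched in a few lines. This is a minimal illustration, not SisC's implementation: the parameters `keep` and `offset` are hypothetical, and the choice to copy whitespace and punctuation unchanged while still counting them as positions is an assumption inferred from the example string. The sample sentence is likewise an assumed original.

```python
def uniform_mask(text: str, keep: int = 2, distance: int = 5,
                 symbol: str = "_", offset: int = 0) -> str:
    """Keep `keep` characters, mask the next `distance` characters, and repeat.

    Whitespace and punctuation are never masked, but they still advance
    the position counter. `offset` shifts where the keep/mask cycle starts.
    """
    period = keep + distance
    out = []
    for i, ch in enumerate(text):
        in_keep_window = (i + offset) % period < keep
        if in_keep_window or not ch.isalnum():
            out.append(ch)
        else:
            out.append(symbol)
    return "".join(out)


sentence = "Some text with an annotated quote."  # assumed original text
print(uniform_mask(sentence))            # So__ __xt ____ a_ ___ot____ qu___.
print(uniform_mask(sentence, offset=1))  # S___ _ex_ ___h __ __no_____ q____.
```

With the assumed sentence and `offset=1`, the sketch happens to produce the example string shown above; that does not imply SisC computes its masks this way. On the command line, the number of masked characters corresponds to -d/--distance and the mask character to -s/--symbol.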
Context masking ... For example:
____ text with __ _________ _____.
Aligning Texts
sisc align content_path fingerprint_path output_path
content_path Path to a file or folder with the content to align (PDF or TXT)
fingerprint_path Path to a file or folder with the fingerprint file(s) (TXT or XML)
output_path Folder to store the result
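To show why a fingerprint is enough to locate a passage in another copy of the text, here is a deliberately simple sketch that turns a fingerprint into a regular expression and searches for it. SisC's actual alignment algorithm is not described here and is likely more robust; this version only finds exact matches and does not tolerate OCR errors, hyphenation, or literal mask characters in the text. The function names and the sample content are hypothetical.

```python
import re


def fingerprint_to_regex(fingerprint: str, symbol: str = "_") -> re.Pattern:
    """Build a pattern in which each masked position matches any word character."""
    parts = []
    for ch in fingerprint:
        if ch == symbol:
            parts.append(r"\w")
        elif ch.isspace():
            parts.append(r"\s+")  # tolerate different whitespace, e.g. line breaks
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def locate(fingerprint: str, content: str):
    """Return the (start, end) span of the first match, or None."""
    match = fingerprint_to_regex(fingerprint).search(content)
    return match.span() if match else None


content = "Page 3\nSome text with an annotated quote.\nMore text."
fingerprint = "S___ _ex_ ___h __ __no_____ q____."
span = locate(fingerprint, content)
print(span)

# A stand-off annotation at offsets 28-33 of the original text can now be
# shifted by the start of the match to point into the new content.
if span is not None:
    start = span[0]
    print(content[start + 28:start + 33])
```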
All command line options for aligning texts
usage: sisc align [-h] [--annotation-path ANNOTATION_PATH]
                  [--annotation-type {txt,json,xml}] [-f FIRST_PAGE]
                  [-l LAST_PAGE] [-k KEYS_TO_UPDATE [KEYS_TO_UPDATE ...]]
                  [--max-num-processes MAX_NUM_PROCESSES]
                  [--max-text-length MAX_TEXT_LENGTH]
                  content-path fingerprint-path output-path

Command to align fingerprint and PDF or text.

positional arguments:
  content-path          Path to the file (or folder) with the content for
                        alignment (txt or pdf).
  fingerprint-path      Path to the file (or folder) with the fingerprint
                        file(s) (txt or xml).
  output-path           Output folder path.

options:
  -h, --help            show this help message and exit
  --annotation-path ANNOTATION_PATH
                        Can be used to specify the path to the annotations.
                        Only needed when the annotations are not part of the
                        files specified in fingerprint_path.
  --annotation-type {txt,json,xml}
                        The type of the annotations to process. Only used when
                        content_path is a folder. (default: xml).
  -f FIRST_PAGE, --first-page FIRST_PAGE
                        Can be used to specify the first page to process. Only
                        used for PDF files and when processing a single PDF
                        file (default: 1).
  -l LAST_PAGE, --last-page LAST_PAGE
                        Can be used to specify the last page to process. Only
                        used for PDF files and when processing a single PDF
                        file (default: -1).
  -k KEYS_TO_UPDATE [KEYS_TO_UPDATE ...], --keys KEYS_TO_UPDATE [KEYS_TO_UPDATE ...]
                        TBD
  --max-num-processes MAX_NUM_PROCESSES
                        Maximum number of processes to use for parallel
                        processing (default: 1).
  --max-text-length MAX_TEXT_LENGTH
                        The maximum length (in characters) of a text to align
                        (default: 200000).
Supported formats
TBD
Adding new formats
Example coming soon!