SisC is a tool to automatically separate annotations from the underlying text. SisC uses a fingerprint, that is, a masked version of the text, to merge stand-off annotations with another version of the original text, for example, one extracted from a PDF file. The fingerprint cannot be used on its own to recreate (meaningful parts of) the original text and can therefore be shared.

Installation

pip install sisc

Dependencies

For PDF processing, SisC uses pdf2image and Tesseract. Both need to be installed.

Usage

SisC provides a command line interface for easy usage.

SisC currently works best with TEI XML as the input format. Other formats are partially supported, and more formats can easily be added.

Creating a fingerprint

For XML files, two types of masking are available: uniform and context. A fingerprint with uniform masking is created with:

sisc fingerprint uniform input_path output_path

If input_path is a folder, all files in that folder of the type specified with --file-type will be processed. By default, --file-type is set to xml. The results are written to the folder given as output_path.

All command line options for uniform fingerprinting
usage: sisc fingerprint uniform [-h] [--file-type {txt,xml}]
                                [--move-notes | --no-move-notes]
                                [--add-quotation-marks | --no-add-quotation-marks]
                                [-s SYMBOL] [-d DISTANCE]
                                input-path output-path

Command to use uniform masking for the fingerprint.

positional arguments:
  input-path            Path to txt or xml file to create fingerprint from.
                        Can be a folder in which case all files will be
                        processed.
  output-path           Output folder path.

options:
  -h, --help            show this help message and exit
  --file-type {txt,xml}
                        The input file type to process. Only used when
                        input_path is a folder (default: xml).
  --move-notes, --no-move-notes
                        This will move footnotes and endnotes to the end of
                        their page/the whole text. Only works with XML files
                        which are annotated with footnotes/endnotes and
                        pagebreaks. (default: False)
  --add-quotation-marks, --no-add-quotation-marks
                        Add quotation marks in the fingerprint. Useful when
                        quotation marks are not present in the annotated XML
                        file. (default: False)
  -s SYMBOL, --symbol SYMBOL
                        The character to use for masking (default: _).
  -d DISTANCE, --distance DISTANCE
                        The number of characters to mask between not masked
                        characters (default: 10)
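
For scripted or batch runs, the documented options can be assembled in Python and handed to the command line interface. The following is a minimal sketch that only builds the argument list from the flags shown in the help text above. The helper name build_fingerprint_command and the example folder names are hypothetical, and the sketch assumes that the sisc executable is on the PATH.

```python
import subprocess


def build_fingerprint_command(input_path, output_path, file_type="xml",
                              symbol="_", distance=10, move_notes=False,
                              add_quotation_marks=False):
    """Build the argument list for `sisc fingerprint uniform`.

    Hypothetical helper: it is not part of SisC and only uses the
    flags listed in the help text.
    """
    cmd = ["sisc", "fingerprint", "uniform",
           "--file-type", file_type,
           "--symbol", symbol,
           "--distance", str(distance)]
    if move_notes:
        cmd.append("--move-notes")
    if add_quotation_marks:
        cmd.append("--add-quotation-marks")
    # The two positional arguments come last.
    cmd += [input_path, output_path]
    return cmd


# Example folder names are placeholders.
cmd = build_fingerprint_command("tei_files", "fingerprints",
                                distance=5, move_notes=True)
print(" ".join(cmd))

# To actually run the command (requires SisC to be installed):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems, for example when a path contains spaces.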

Masking

For TEI XML files, SisC supports moving footnotes to the end of the page if the TEI XML files contain annotations for footnotes and page breaks. This can be useful when the footnotes were moved to their anchor position during annotation. To turn on moving of footnotes, use the command line option --move-notes.

We currently support two types of masking: uniform masking and context masking.

Uniform masking alternates between keeping a fixed number of characters (for example, two) and masking a fixed number of characters (for example, five), repeating this pattern over the whole text. For example:

S___ _ex_ ___h __ __no_____ q____.
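
The pattern in the example above can be sketched in a few lines of Python. This is an illustration of the idea, not SisC's actual implementation. The function uniform_mask and its parameters keep, mask, and phase are hypothetical, the input sentence is an assumption reconstructed from the word lengths in the example, and leaving whitespace and punctuation unmasked is inferred from the example output.

```python
def uniform_mask(text, keep=2, mask=5, symbol="_", phase=0):
    """Keep `keep` characters, mask `mask` characters, and repeat.

    Hypothetical sketch: whitespace and punctuation are never masked,
    and `phase` shifts where the keep/mask cycle starts.
    """
    period = keep + mask
    out = []
    for i, ch in enumerate(text):
        in_masked_part = (i + phase) % period >= keep
        if ch.isalnum() and in_masked_part:
            out.append(symbol)
        else:
            out.append(ch)
    return "".join(out)


# Assumed original sentence; it is not given in the SisC documentation.
sentence = "Some text with an annotated quote."
print(uniform_mask(sentence, keep=2, mask=5, phase=1))
# prints: S___ _ex_ ___h __ __no_____ q____.
```

With these assumptions, a phase of 1 reproduces the example line. The less text is kept between masked stretches, the harder it becomes to reconstruct meaningful parts of the original.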

Context masking ... For example:

____ text with __ _________ _____.

Aligning Texts

sisc align content_path fingerprint_path output_path

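content_path can be a file or a folder containing PDF or TXT files.

fingerprint_path is the path to the file (or folder) with the fingerprint file(s) (TXT or XML).

output_path is the folder in which to store the result.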
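
To illustrate why a fingerprint is enough to re-attach stand-off annotations, the following sketch locates a fingerprint inside another copy of the text and maps an annotation's offsets onto it. It is a deliberately naive stand-in and not the algorithm SisC uses: it expects the text to match the unmasked characters exactly, whereas text extracted from a PDF via OCR would need fuzzy matching. The function names, the sample content, and the annotation offsets are hypothetical.

```python
import re


def fingerprint_to_regex(fingerprint, symbol="_"):
    """Turn a fingerprint into a regex: masked positions match any character."""
    parts = ["." if ch == symbol else re.escape(ch) for ch in fingerprint]
    return re.compile("".join(parts), re.DOTALL)


def locate(fingerprint, content, symbol="_"):
    """Return the offset of the first exact match, or None if there is none."""
    match = fingerprint_to_regex(fingerprint, symbol).search(content)
    return match.start() if match else None


fingerprint = "S___ _ex_ ___h __ __no_____ q____."
# Hypothetical content, e.g. text extracted from a PDF page.
content = "Title page\nSome text with an annotated quote. More text."

offset = locate(fingerprint, content)

# Hypothetical stand-off annotation, given as offsets into the fingerprint.
start, end = 28, 33
print(content[offset + start:offset + end])
# prints: quote
```

The annotation itself never contains the original text. It only becomes meaningful once the fingerprint has been aligned with a copy of the text that the user already has.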

All command line options for aligning texts
usage: sisc align [-h] [--annotation-path ANNOTATION_PATH]
                  [--annotation-type {txt,json,xml}] [-f FIRST_PAGE]
                  [-l LAST_PAGE] [-k KEYS_TO_UPDATE [KEYS_TO_UPDATE ...]]
                  [--max-num-processes MAX_NUM_PROCESSES]
                  [--max-text-length MAX_TEXT_LENGTH]
                  content-path fingerprint-path output-path

Command to align fingerprint and PDF or text.

positional arguments:
  content-path          Path to the file (or folder) with the content for
                        alignment (txt or pdf).
  fingerprint-path      Path to the file (or folder) with the fingerprint
                        file(s) (txt or xml).
  output-path           Output folder path.

options:
  -h, --help            show this help message and exit
  --annotation-path ANNOTATION_PATH
                        Can be used to specify the path to the annotations.
                        Only needed when the annotations are not part of the
                        files specified in fingerprint_path.
  --annotation-type {txt,json,xml}
                        The type of the annotations to process. Only used when
                        content_path is a folder (default: xml).
  -f FIRST_PAGE, --first-page FIRST_PAGE
                        Can be used to specify the first page to process. Only
                        used for PDF files and when processing a single PDF
                        file (default: 1).
  -l LAST_PAGE, --last-page LAST_PAGE
                        Can be used to specify the last page to process. Only
                        used for PDF files and when processing a single PDF
                        file (default: -1).
  -k KEYS_TO_UPDATE [KEYS_TO_UPDATE ...], --keys KEYS_TO_UPDATE [KEYS_TO_UPDATE ...]
                        TBD
  --max-num-processes MAX_NUM_PROCESSES
                        Maximum number of processes to use for parallel
                        processing (default: 1).
  --max-text-length MAX_TEXT_LENGTH
                        The maximum length (in characters) of a text to align
                        (default: 200000).

Supported formats

TBD

Adding new formats

Example coming soon!
