Quid is a tool for quotation detection in texts and can deal with common properties of quotations, for example, ellipses or inaccurate quotations.

Overview

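Quid finds quotations shared by two texts, called source and target. If known, the source text should be the one that is quoted by the target text. This allows the algorithm to handle features such as ellipses in quotations. In the following example, each line gives the start and end character positions of a span, followed by its text (source first, then target):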

0	52	This is a long Text and the long test goes on and on
0	45	This is a long Text [...] test goes on and on

Installation

pip install Quid

Usage

The algorithm can be used in code or from the command line. The following two sections describe each.

In code

The algorithm can be found in the package quid. To use it, create a Quid object, which can be configured with the following arguments:

  • The minimum number of tokens of a match (default: 5)
  • The maximum number of tokens to skip when extending a match backwards (default: 10)
  • The maximum number of tokens to skip when extending a match forwards (default: 3)
  • The maximum distance in tokens between two matches considered for merging (default: 2)
  • The maximum distance in tokens between two matches considered for merging where the target text contains an ellipsis between the matches (default: 10)
  • Whether to include matched text in the returned data structure (default: True)
  • How to handle ambiguous matches. If False, for a match with multiple matched segments in the source text, multiple matches will be returned. Otherwise, only the first match will be returned. (default: False)
  • The threshold for the minimal Levenshtein similarity between tokens (and the initial n-grams) to be accepted as a match (default: 0.85)
  • Whether to split texts which are longer than the threshold (in words) defined with split_length for faster processing (default: False)
  • The threshold for splitting texts (in number of words) (default: 30000)
  • The maximum number of processes for parallel processing (default: 1)
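To give a feel for the similarity threshold in the list above, here is a minimal sketch of a normalized Levenshtein similarity between two tokens. The normalization used here (1 minus the edit distance divided by the length of the longer token) is one common definition and an assumption; Quid's own computation may differ. The function names are hypothetical and not part of Quid.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer token."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


# Two swapped letters count as two substitutions in plain Levenshtein distance.
print(similarity("quotation", "quotaiton"))
# A single substituted letter in a nine-letter token.
print(similarity("quotation", "quotatian"))
```

Under this definition, a nine-letter token with two edits scores about 0.78 and would fall below the default threshold of 0.85, while the same token with one edit scores about 0.89 and would pass.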

After creating the object, call its compare method, which expects the two texts to be compared. The method returns a list of matches (List[Match]). Each Match stores two MatchSpans, one for the source text and one for the target text. Each MatchSpan stores the start and end character positions of the matching span in its text.

from quid.core.Quid import Quid

quid = Quid()
matches = quid.compare('file 1 content', 'file 2 content')
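The returned spans can be used to slice the original texts, which is useful when the matched text is not included in the result. The sketch below uses stand-in dataclasses rather than Quid's own classes, so it runs without the package. The attribute names (source_span, target_span, start, end) are assumed to mirror the JSON keys shown in the command line section, and the end position is treated as exclusive, which is consistent with the README example (0 to 52 over a 52-character string).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MatchSpan:
    """Stand-in for Quid's MatchSpan (assumed attribute names)."""
    start: int
    end: int


@dataclass
class Match:
    """Stand-in for Quid's Match (assumed attribute names)."""
    source_span: MatchSpan
    target_span: MatchSpan


def span_text(text: str, span: MatchSpan) -> str:
    """Return the characters of text covered by span (end exclusive)."""
    return text[span.start:span.end]


source = "This is a long Text and the long test goes on and on"
target = "This is a long Text [...] test goes on and on"

# Hand-written result that mirrors the README example.
matches: List[Match] = [Match(MatchSpan(0, 52), MatchSpan(0, 45))]

for match in matches:
    print(span_text(source, match.source_span))
    print(span_text(target, match.target_span))
```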

Command line

The quid compare command provides a command line interface to the algorithm.

usage: quid compare [-h] [--text] [--no-text]
                    [--output-type {json,text,csv}] [--csv-sep CSV_SEP]
                    [--output-folder-path OUTPUT_FOLDER_PATH]
                    [--min-match-length MIN_MATCH_LENGTH]
                    [--look-back-limit LOOK_BACK_LIMIT]
                    [--look-ahead-limit LOOK_AHEAD_LIMIT]
                    [--max-merge-distance MAX_MERGE_DISTANCE]
                    [--max-merge-ellipsis-distance MAX_MERGE_ELLIPSIS_DISTANCE]
                    [--create-dated-subfolder]
                    [--no-create-dated-subfolder]
                    [--max-num-processes MAX_NUM_PROCESSES]
                    [--keep-ambiguous-matches]
                    [--no-keep-ambiguous-matches]
                    [--min-levenshtein-similarity MIN_LEVENSHTEIN_SIMILARITY]
                    source-file-path target-path

Quid compare allows the user to find quotations in two texts, a source text
and a target text. If known, the source text should be the one that is quoted
by the target text. This allows the algorithm to handle things like ellipsis
in quotations.

positional arguments:
  source-file-path      Path to the source text file
  target-path           Path to the target text file or folder

optional arguments:
  -h, --help            show this help message and exit
  --text                Include matched text in the returned data structure
  --no-text             Don't include matched text in the returned data
                        structure
  --output-type {json,text,csv}
                        The output type
  --csv-sep CSV_SEP     output separator for csv (default: '\t')
  --output-folder-path OUTPUT_FOLDER_PATH
                        The output folder path. If this option is set the
                        output will be saved to a file created in the
                        specified folder
  --min-match-length MIN_MATCH_LENGTH
                        The minimum number of tokens of a match (>= 1,
                        default: 5)
  --look-back-limit LOOK_BACK_LIMIT
                        The maximum number of tokens to skip when extending a
                        match backwards (>= 0, default: 10)
  --look-ahead-limit LOOK_AHEAD_LIMIT
                        The maximum number of tokens to skip when extending a
                        match forwards (>= 0, default: 3)
  --max-merge-distance MAX_MERGE_DISTANCE
                        The maximum distance in tokens between two matches
                        considered for merging (>= 0, default: 2)
  --max-merge-ellipsis-distance MAX_MERGE_ELLIPSIS_DISTANCE
                        The maximum distance in tokens between two matches
                        considered for merging where the target text contains
                        an ellipsis between the matches (>= 0, default: 10)
  --create-dated-subfolder
                        Create a subfolder named with the current date to
                        store the results
  --no-create-dated-subfolder
                        Don't create a subfolder named with the current date
                        to store the results
  --max-num-processes MAX_NUM_PROCESSES
                        Maximum number of processes to use for parallel
                        processing
  --keep-ambiguous-matches
                        For a match with multiple matched segments in the
                        source text, multiple matches will be returned.
  --no-keep-ambiguous-matches
                        For a match with multiple matched segments in the
                        source text, only the first match will be returned.
  --min-levenshtein-similarity MIN_LEVENSHTEIN_SIMILARITY
                        The threshold for the minimal levenshtein similarity
                        between tokens (and the initial n-grams) to be
                        accepted as a match (between 0 and 1, default: 0.85)
  --split-long-texts    Split texts longer than split-length words for
                        faster processing
  --no-split-long-texts
                        Do not split texts longer than 30000 tokens.
  --split-length SPLIT_LENGTH
                        If split-long-texts is set to True, texts longer (in
                        number of words) than this threshold will be split for
                        faster processing.
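As a rough sketch of how the command line interface could be driven from a script, the helper below assembles a quid compare call from the options listed in the help text. The helper name, its parameters, and the file names are hypothetical; only the flags themselves come from the help output above.

```python
from typing import List, Optional


def build_compare_command(
    source_file: str,
    target_path: str,
    output_folder: Optional[str] = None,
    min_match_length: int = 5,
    output_type: str = "json",
    include_text: bool = True,
) -> List[str]:
    """Assemble the argument list for `quid compare` (hypothetical helper)."""
    command = [
        "quid", "compare",
        "--output-type", output_type,
        "--min-match-length", str(min_match_length),
        "--text" if include_text else "--no-text",
    ]
    if output_folder is not None:
        command += ["--output-folder-path", output_folder]
    # Positional arguments come last: source file, then target file or folder.
    command += [source_file, target_path]
    return command


command = build_compare_command("source.txt", "target.txt", min_match_length=8)
print(" ".join(command))

# To actually run it (requires Quid to be installed and the files to exist):
# import subprocess
# result = subprocess.run(command, capture_output=True, text=True, check=True)
# print(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.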

By default, the result is returned as a JSON structure corresponding to List[Match]. Each Match stores two MatchSpans, one for the source text and one for the target text, and each MatchSpan stores the start and end character positions of the matching span. For example:

[
  {
    "source_span": {
      "start": 0,
      "end": 52,
      "text": "This is a long Text and the long test goes on and on"
    },
    "target_span": {
      "start": 0,
      "end": 45,
      "text": "This is a long Text [...] test goes on and on"
    }
  }
]
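A minimal sketch of reading this JSON output with Python's standard json module follows. The sample string is copied from the example above; the function name span_lengths is hypothetical.

```python
import json
from typing import List, Tuple

# Sample output copied from the example above.
raw_output = """
[
  {
    "source_span": {"start": 0, "end": 52,
                    "text": "This is a long Text and the long test goes on and on"},
    "target_span": {"start": 0, "end": 45,
                    "text": "This is a long Text [...] test goes on and on"}
  }
]
"""


def span_lengths(json_text: str) -> List[Tuple[int, int]]:
    """Return (source length, target length) in characters for each match."""
    lengths = []
    for match in json.loads(json_text):
        source, target = match["source_span"], match["target_span"]
        lengths.append((source["end"] - source["start"],
                        target["end"] - target["start"]))
    return lengths


for source_length, target_length in span_lengths(raw_output):
    print(f"source: {source_length} chars, target: {target_length} chars")
```

In this example the target span is shorter than the source span because the quotation replaces part of the source with an ellipsis.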

Alternatively, the result can be printed in a human-readable text format, e.g.:

0	52	This is a long Text and the long test goes on and on
0	45	This is a long Text [...] test goes on and on 

If the matching text is not needed, the option --no-text excludes it from the output.
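The text format can be read back with a few lines of Python. This sketch assumes what the example suggests but the README does not state explicitly: fields are separated by tabs, and each match occupies two consecutive lines, the source span followed by the target span. It also assumes that with --no-text the third field is simply absent. The function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, text); text is "" if omitted


def parse_span_line(line: str) -> Span:
    """Parse one 'start<TAB>end[<TAB>text]' line (assumed layout)."""
    parts = line.split("\t", 2)
    start, end = int(parts[0]), int(parts[1])
    text = parts[2] if len(parts) > 2 else ""
    return start, end, text


def parse_text_output(output: str) -> List[Tuple[Span, Span]]:
    """Pair consecutive non-empty lines as (source span, target span)."""
    spans = [parse_span_line(line)
             for line in output.splitlines() if line.strip()]
    return list(zip(spans[0::2], spans[1::2]))


sample = (
    "0\t52\tThis is a long Text and the long test goes on and on\n"
    "0\t45\tThis is a long Text [...] test goes on and on\n"
)

for source_span, target_span in parse_text_output(sample):
    print(source_span[:2], target_span[:2])
```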

Passager

The package passager contains code to extract key passages from the found matches. The passage command produces several JSON files. The resulting data structure is documented in the data structure readme.

Usage

usage: quid passage [-h]
                    source-file-path target-folder-path
                    matches-folder-path output-folder-path

Quid passage allows the user to extract key passages from the found
matches.

positional arguments:
  source-file-path     Path to the source text file
  target-folder-path   Path to the target texts folder path
  matches-folder-path  Path to the folder with the match files
  output-folder-path   Path to the output folder

Visualization

The package visualization contains code to create the content for a web page to visualize the key passages. For a white label version of the website, see QuidEx-wh.

Usage

usage: quid visualize [-h] [--title TITLE] [--author AUTHOR]
                      [--year YEAR] [--censor]
                      source-file-path target-folder-path
                      passages-folder-path output-folder-path

Quid visualize allows the user to create the files needed for a website that
visualizes the Quid algorithm results.

positional arguments:
  source-file-path      Path to the source text file
  target-folder-path    Path to the target texts folder path
  passages-folder-path
                        Path to the folder with the key passages files, i.e.
                        the resulting files from Quid passage
  output-folder-path    Path to the output folder

optional arguments:
  -h, --help            show this help message and exit
  --title TITLE         Title of the work
  --author AUTHOR       Author of the work
  --year YEAR           Year of the work
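The three commands can be chained: compare writes match files, passage turns them into key passages, and visualize builds the website content. The sketch below only assembles the three argument lists using the flags and positional arguments from the help texts. It rests on several assumptions that the README does not confirm: that passage reads the files written by compare with --output-type json, and that --no-create-dated-subfolder makes compare write directly into the given folder. The function name and all paths are hypothetical.

```python
from typing import List


def build_pipeline(
    source_file: str,
    target_folder: str,
    matches_folder: str,
    passages_folder: str,
    site_folder: str,
    title: str,
) -> List[List[str]]:
    """Return the compare, passage, and visualize commands in order."""
    compare = [
        "quid", "compare",
        "--output-type", "json",            # assumed input format for passage
        "--output-folder-path", matches_folder,
        "--no-create-dated-subfolder",      # assumed to write into the folder itself
        source_file, target_folder,
    ]
    passage = [
        "quid", "passage",
        source_file, target_folder, matches_folder, passages_folder,
    ]
    visualize = [
        "quid", "visualize",
        "--title", title,
        source_file, target_folder, passages_folder, site_folder,
    ]
    return [compare, passage, visualize]


for command in build_pipeline(
    "source.txt", "targets", "matches", "passages", "site", "Example Work"
):
    print(" ".join(command))
    # To run each step (requires Quid to be installed):
    # import subprocess; subprocess.run(command, check=True)
```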

Performance

For in-depth information on the evaluation, see our paper below. Performance of the current version of Quid is as follows:

Work              Precision  Recall  F-Score
Die Judenbuche    0.82       0.93    0.87
Michael Kohlhaas  0.71       0.93    0.80
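For readers who want to relate the three columns, the sketch below computes the F-score as the harmonic mean of precision and recall (the F1 score). That the table reports F1 is an assumption; see the paper for the exact evaluation setup. Because the table values are rounded to two decimals, recomputing from them can differ from the reported F-score in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F-Score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values taken from the table above.
results = [
    ("Die Judenbuche", 0.82, 0.93),
    ("Michael Kohlhaas", 0.71, 0.93),
]

for work, precision, recall in results:
    print(f"{work}: {f1_score(precision, recall):.3f}")
```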

History

Quid was formerly known as Lotte and later renamed. Earlier publications use the name Lotte.

Citation

If you use Quid or base your work on our code, please cite our paper:

@inproceedings{arnold2021lotte,
  title = {{L}otte and {A}nnette: {A} {F}ramework for {F}inding and {E}xploring {K}ey {P}assages in {L}iterary {W}orks},
  author = {Arnold, Frederik and Jäschke, Robert},
  booktitle = {Proceedings of the Workshop on Natural Language Processing for Digital Humanities},
  year = {2021},
  publisher = {NLP Association of India (NLPAI)},
  url = {https://aclanthology.org/2021.nlp4dh-1.7},
  pages = {55--63}
}

Acknowledgements

The algorithm is inspired by sim_text by Dick Grune and by "Similarity texter: A text-comparison web tool based on the 'sim_text' algorithm" by Sofia Kalaidopoulou (2016).
