This repository contains the tool IndiQuo for the detection of indirect quotations (summaries and paraphrases) between dramas from DraCor and scholarly works which interpret the drama.

Installation

pip install indiquo

Dependencies

The dependencies required to run the Rederwiedergabe tagger are not installed by default, as installing them can be a tricky process. The tagger is only used as a baseline, not for our approach, and is therefore not needed in most cases.

Usage

The following sections describe how to use IndiQuo on the command line.

Training

The library supports training of custom models for candidate identification and scene prediction.

Candidate Identification

indiquo train candidate
path_to_train_folder
path_to_the_output_folder
huggingface_model_name

path_to_train_folder has to contain two files named train_set.tsv and val_set.tsv, which contain one example per line in the form of a string and a label, tab-separated, for example:

Some positive example	1
Some negative example	0

huggingface_model_name is the name of the Hugging Face model to use for fine-tuning; deepset/gbert-large is used as the default.
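As a sketch of the expected file layout, the two training files can be written with Python's csv module. The folder path and the example sentences below are purely hypothetical; only the tab-separated "text, label" format comes from the description above:

```python
import csv
from pathlib import Path

# Hypothetical labeled examples: 1 marks an indirect quotation
# candidate, 0 a negative example.
train_examples = [
    ("Der Erzähler fasst die Eingangsszene des Dramas zusammen.", 1),
    ("Das Wetter in der Stadt war an diesem Tag regnerisch.", 0),
]
val_examples = [
    ("Die Figur paraphrasiert hier den Monolog aus dem ersten Akt.", 1),
]

train_folder = Path("path_to_train_folder")
train_folder.mkdir(parents=True, exist_ok=True)

for name, examples in [("train_set.tsv", train_examples),
                       ("val_set.tsv", val_examples)]:
    with open(train_folder / name, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t")
        for text, label in examples:
            writer.writerow([text, label])
```

The resulting files contain one tab-separated example per line, matching the format shown above.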

Scene Prediction

indiquo train scene
path_to_train_folder
path_to_the_output_folder
huggingface_model_name

path_to_train_folder has to contain two files named train_set.tsv and val_set.tsv, which contain one example per line in the form of two strings, a drama excerpt and a corresponding summary, tab-separated, for example:

Drama excerpt	Summary

huggingface_model_name is the name of the Hugging Face model to use for fine-tuning; deutsche-telekom/gbert-large-paraphrase-cosine is used as the default.
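Analogously to candidate identification, the scene-prediction training files hold one tab-separated pair per line. The excerpt/summary pair below is a hypothetical illustration; only the two-column format is taken from the description above:

```python
import csv
from pathlib import Path

# Hypothetical (drama excerpt, summary) pair for scene prediction.
pairs = [
    ("FAUST: Habe nun, ach! Philosophie, Juristerei und Medizin durchaus studiert.",
     "Faust beklagt, dass ihm sein Studium keine wahre Erkenntnis gebracht hat."),
]

folder = Path("path_to_train_folder")
folder.mkdir(parents=True, exist_ok=True)

# Write the same pair into both files purely for illustration; real
# training and validation sets would of course differ.
for name in ("train_set.tsv", "val_set.tsv"):
    with open(folder / name, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerows(pairs)
```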

Indirect Quotation Identification

To run IndiQuo inference with the default models, use the following command:

indiquo compare full path_to_drama_xml path_to_target_text output_path
All IndiQuo command line options
usage: indiquo compare full [-h] [--add-context | --no-add-context]
                            [--max-candidate-length MAX_CANDIDATE_LENGTH]
                            source-file-path target-path candidate-model
                            scene-model output-folder-path

Identify candidates and corresponding scenes.

positional arguments:
  source-file-path      Path to the source xml drama file
  target-path           Path to the target text file or folder
  candidate-model       Name of the model to load from Hugging Face or path to
                        the model folder (default: Fredr0id/indiquo-
                        candidate).
  scene-model           Name of the model to load from Hugging Face or path to
                        the model folder (default: Fredr0id/indiquo-scene).
  output-folder-path    The output folder path.

options:
  -h, --help            show this help message and exit
  --add-context, --no-add-context
                        If set, candidates are embedded in context up to a
                        total length of --max-candidate-length
  --max-candidate-length MAX_CANDIDATE_LENGTH
                        Maximum length in words of a candidate (default: 128)

The output folder will contain a TSV file for each txt file in the target path. The TSV files have the following structure:

start   end text        score   scenes
10      15  some text   0.5     1:1:0.2#2:5:0.5#...

The first three columns are the character start and end positions and the text of the quotation in the target text. The fourth column is the probability of the positive class, i.e., that the candidate is an indirect quotation. The last column contains the top 10 source scenes, separated by '#'; each part has the structure act:scene:probability.
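A minimal sketch of reading such an output file, assuming only the column layout described above (the sample row is the one from the example):

```python
import csv
from io import StringIO

# Hypothetical output matching the documented TSV structure.
tsv_output = (
    "start\tend\ttext\tscore\tscenes\n"
    "10\t15\tsome text\t0.5\t1:1:0.2#2:5:0.5\n"
)

rows = list(csv.DictReader(StringIO(tsv_output), delimiter="\t"))
for row in rows:
    start, end = int(row["start"]), int(row["end"])
    score = float(row["score"])
    # Each scene entry has the structure act:scene:probability.
    scenes = []
    for part in row["scenes"].split("#"):
        act, scene, prob = part.split(":")
        scenes.append((int(act), int(scene), float(prob)))
    # Pick the source scene with the highest probability.
    best = max(scenes, key=lambda s: s[2])
```

For the sample row, `best` is the scene tuple `(2, 5, 0.5)`, i.e., act 2, scene 5.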

Baselines and reproduction

It is possible to run only the candidate classification step with the command compare candidate. With the option --model-type, the baseline models can be run (rw = Rederwiedergabe, st = SentenceTransformer).

With the command compare sum, a SentenceTransformer can be used with summaries.

Citation

If you use the code in this repository or base your work on our code, please cite our paper:

TBD
