DPToie-Python is an Open Information Extraction system for the Portuguese language that employs dependency parsing and part-of-speech tagging models (via SpaCy and Stanza).

DPToie-Python

Open Information Extractor for Portuguese based on dependency analysis (SpaCy + Stanza).

This guide shows all ways to run the project via src/dptoie/main.py, with all argument variations, both locally (Poetry) and with Docker / Docker Compose.

  • Minimum requirements: Python 3.12+ and Poetry, or Docker (optional alternative)
  • Models: Stanza downloads models automatically on first run. You can set STANZA_RESOURCES_DIR to use a local models directory (e.g., ./models/.stanza_resources).
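For repeated runs, the model directory can be configured up front; a minimal setup sketch (the local path is the one suggested above, and pre-downloading is optional):

```shell
# Point Stanza at a local models directory so models are downloaded once
# and reused across runs (path taken from the example above):
export STANZA_RESOURCES_DIR="$(pwd)/models/.stanza_resources"

# Optionally pre-download the Portuguese models before the first run:
poetry run python3 -c "import stanza; stanza.download('pt')"
```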

Installation (Poetry)

poetry install

How to run (local, via Poetry)

General form:

poetry run python3 src/dptoie/main.py \
  -i <input_path> \
  -it <txt|conll> \
  -o <output_path> \
  -ot <json|csv|txt> \
  [-cc] [-sc] [-hs] [-a] [-t] [-debug]

Supported arguments

  • -i, --input: path to the input file. Default: ./inputs/teste.txt
  • -it, --input-type: input file type. Options: txt or conll. Default: txt
    • For txt input: each line in the file is a sentence; the system generates a temporary .conll.
    • For conll input: the input file is already in CoNLL-U format (one sentence per block, separated by an empty line).
  • -o, --output: path to the output file. Default: ./outputs/output.json
  • -ot, --output-type: output format. Options: json, csv, txt. Default: json
  • -cc, --coordinating_conjunctions: enable extractions using coordinating conjunctions
  • -sc, --subordinating_conjunctions: enable extractions using subordinating conjunctions
  • -hs, --hidden_subjects: enable extractions with hidden subjects (not yet implemented)
  • -a, --appositive: enable appositive extractions
  • -t, --transitive: enable transitivity for appositives (only has effect when -a is active)
  • -debug: verbose debug mode
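For reference, a CoNLL-U input block has one tab-separated token line per token, and sentences are separated by an empty line. An illustrative fragment (not taken from the project's sample files):

```
# text = O gato dorme.
1	O	o	DET	_	_	2	det	_	_
2	gato	gato	NOUN	_	_	3	nsubj	_	_
3	dorme	dormir	VERB	_	_	0	root	_	_
4	.	.	PUNCT	_	_	3	punct	_	_
```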

Important:

  • Extraction modules are disabled by default. Enable the ones you want using the flags -cc -sc -a -t.

Practical examples

  1. TXT input, JSON output (defaults):
poetry run python3 src/dptoie/main.py -i ./inputs/ceten-200.txt -it txt -o ./outputs/out.json -ot json
  2. TXT input, CSV output, enabling coordinating conjunctions:
poetry run python3 src/dptoie/main.py -i ./inputs/ceten-200.txt -it txt -o ./outputs/out.csv -ot csv -cc
  3. TXT input, human-readable text output:
poetry run python3 src/dptoie/main.py -i ./inputs/ceten-200.txt -it txt -o ./outputs/out.txt -ot txt -cc -sc -a -t
  4. Input already in CoNLL-U, JSON output:
poetry run python3 src/dptoie/main.py -i ./inputs/teste.conll -it conll -o ./outputs/out.json -ot json -cc -sc -a -t
  5. Only coordinating conjunctions:
poetry run python3 src/dptoie/main.py -i ./inputs/ceten-200.txt -it txt -o ./outputs/cc.json -ot json -cc
  6. Debug mode for detailed inspection:
poetry run python3 src/dptoie/main.py -i ./inputs/ceten-200.txt -it txt -o ./outputs/out.json -ot json -cc -debug
  7. Show the argument list:
poetry run python3 src/dptoie/main.py -h

Expected outputs:

  • JSON: a list with one object per sentence, whose extractions appear under the extractions key, with possible nested sub_extractions.
  • CSV: columns id, sentence, arg1, rel, arg2 (includes sub-extractions with hierarchical ids like 1.1).
  • TXT: the sentence followed by extractions and sub-extractions formatted as lines.
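As an illustration, a JSON output for one sentence might look like the following (the nesting shown is an assumption based on the field names above, not an actual output of the tool):

```json
[
  {
    "sentence": "O gato dorme.",
    "extractions": [
      {
        "arg1": "O gato",
        "rel": "dorme",
        "arg2": "",
        "sub_extractions": []
      }
    ]
  }
]
```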

How to run with Docker (without Compose)

Build the image (from the project root):

docker build -t dptoie_python .

Run a one-off command (mounting the current directory and pointing to files inside the container):

docker run --rm -it \
  -e STANZA_RESOURCES_DIR=/dptoie_python/models/.stanza_resources \
  -v "$(pwd)":/dptoie_python \
  -w /dptoie_python \
  dptoie_python \
  poetry run python3 src/dptoie/main.py -i /dptoie_python/inputs/teste.conll -it conll -o /dptoie_python/outputs/out.json -ot json -cc -sc -a -t

Note: adjust the -i and -o paths as needed; use -it txt when the input is line-by-line text.

How to run with Docker Compose

The docker-compose.yml file already defines the dptoie_python service. Edit its command: line for the desired scenario; a recommended example command:

command: poetry run python3 src/dptoie/main.py -i /dptoie_python/inputs/teste.conll -it conll -o /dptoie_python/outputs/out.json -ot json -cc -sc -a -t

Then run:

docker compose up --build

Use docker compose run to execute other one-off commands:

docker compose run dptoie_python poetry run python3 src/dptoie/main.py -i /dptoie_python/inputs/ceten-200.txt -it txt -o /dptoie_python/outputs/out.csv -ot csv -cc

Tips:

  • The volume .:/dptoie_python allows using local files inside the container.
  • STANZA_RESOURCES_DIR (exposed in the compose file) can point to models/.stanza_resources to avoid repeated downloads.
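Putting the tips together, a minimal service definition might look like this (a sketch consistent with the paths and options described above, not the project's actual docker-compose.yml):

```yaml
services:
  dptoie_python:
    build: .
    environment:
      # Reuse locally cached Stanza models across container runs
      - STANZA_RESOURCES_DIR=/dptoie_python/models/.stanza_resources
    volumes:
      # Make local inputs/outputs visible inside the container
      - .:/dptoie_python
    command: poetry run python3 src/dptoie/main.py -i /dptoie_python/inputs/teste.conll -it conll -o /dptoie_python/outputs/out.json -ot json -cc -sc -a -t
```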

Quick references

  • TXT input: each line is a sentence; the system creates a temporary .conll.
  • CoNLL-U input: use -it conll and ensure sentences are separated by an empty line.
  • Rule activation: all rules are disabled by default; add the desired flags.
  • Relative paths are interpreted from the project root; in Docker, use absolute paths inside the container (e.g., /dptoie_python/...).

How to cite

If you find this repo helpful, please consider citing:

@Article{dptoie2025,
  author  = {xxx xxx},
  title   = {xxxx},
  journal = {dddd},
  year    = {xxx},
  month   = {x},
  day     = {cc},
  issn    = {xxx},
  doi     = {xxxxx},
  url     = {asas}
}

Authors

Download files

Download the file for your platform.

Source Distribution

dptoie-0.1.6.tar.gz (16.4 kB)

Uploaded Source

Built Distribution


dptoie-0.1.6-py3-none-any.whl (15.1 kB)

Uploaded Python 3

File details

Details for the file dptoie-0.1.6.tar.gz.

File metadata

  • Download URL: dptoie-0.1.6.tar.gz
  • Size: 16.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dptoie-0.1.6.tar.gz
  • SHA256: 792090e6370da06821abf7c41b6d95520bde3aabccd5d1426ca39fa144439ab6
  • MD5: a68314fb7f47f57bfa324ec2ec753f03
  • BLAKE2b-256: 8969f315130999eedf05209bdc5538dedbcbb6b016ea5567bd57ef00691b5255


Provenance

The following attestation bundles were made for dptoie-0.1.6.tar.gz:

Publisher: python-publish.yml on FORMAS/DPToie-Python

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file dptoie-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: dptoie-0.1.6-py3-none-any.whl
  • Size: 15.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dptoie-0.1.6-py3-none-any.whl
  • SHA256: 550fecf11016d6adc1a5c7d2a92f8c6265f3bde2593f176fb0f32b38c19faff6
  • MD5: 94beff3702f918fd5211deb274f6b6b2
  • BLAKE2b-256: 6e2e3817382eda68a6726005494e6aec5cd36fc5361de8d4e668b6c7f3415dd2


Provenance

The following attestation bundles were made for dptoie-0.1.6-py3-none-any.whl:

Publisher: python-publish.yml on FORMAS/DPToie-Python

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
