DPToie-Python is an Open Information Extractor for the Portuguese language that employs dependency parser and part-of-speech tagger models with Stanford CoreNLP.
DPToie-Python
Open Information Extractor for Portuguese based on dependency analysis (SpaCy + Stanza).
This guide shows all ways to run the project via src/dptoie/main.py, with all argument variations, both locally (Poetry) and with Docker / Docker Compose.
- Minimum requirements: Python 3.12+, Poetry, or Docker (optional)
- Models: Stanza downloads models automatically on first run. You can set STANZA_RESOURCES_DIR to use a local models directory (e.g., ./models/.stanza_resources).
Table of Contents
- Installation (Poetry)
- How to run (local, via Poetry)
- How to run with Docker (without Compose)
- How to run with Docker Compose
- Quick references
- How to cite this project
- Authors
Installation (Poetry)
poetry install
How to run (local, via Poetry)
General form:
poetry run python3 src/dptoie/main.py \
-i <input_path> \
-it <txt|conll> \
-o <output_path> \
-ot <json|csv|txt> \
[-cc] [-sc] [-hs] [-a] [-t] [-debug]
Supported arguments
- -i, --input: path to the input file. Default: ./inputs/teste.txt
- -it, --input-type: input file type. Options: txt or conll. Default: txt
  - For txt input: each line in the file is a sentence; the system generates a temporary .conll.
  - For conll input: the input file is already in CoNLL-U format (one sentence per block, separated by an empty line).
- -o, --output: path to the output file. Default: ./outputs/output.json
- -ot, --output-type: output format. Options: json, csv, txt. Default: json
- -cc, --coordinating_conjunctions: enable extractions using coordinating conjunctions
- -sc, --subordinating_conjunctions: enable extractions using subordinating conjunctions
- -hs, --hidden_subjects: enable extractions with hidden subjects (not implemented)
- -a, --appositive: enable appositive extractions
- -t, --transitive: enable transitivity for appositives (only has effect when -a is active)
- -debug: verbose debug mode
Important:
- Extraction modules are disabled by default. Enable the ones you want using the flags -cc -sc -a -t.
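To make the defaults and flag semantics listed above concrete, here is a minimal sketch of a parser that accepts the same options. It assumes the CLI is built with argparse, which this guide does not confirm; build_parser is a hypothetical name and is not part of the project.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the project's code)."""
    parser = argparse.ArgumentParser(prog="dptoie")
    # Input and output options, with the defaults stated in this guide.
    parser.add_argument("-i", "--input", default="./inputs/teste.txt")
    parser.add_argument("-it", "--input-type", choices=["txt", "conll"], default="txt")
    parser.add_argument("-o", "--output", default="./outputs/output.json")
    parser.add_argument("-ot", "--output-type", choices=["json", "csv", "txt"], default="json")
    # Extraction modules: store_true means every module is off unless its flag is passed.
    parser.add_argument("-cc", "--coordinating_conjunctions", action="store_true")
    parser.add_argument("-sc", "--subordinating_conjunctions", action="store_true")
    parser.add_argument("-hs", "--hidden_subjects", action="store_true")
    parser.add_argument("-a", "--appositive", action="store_true")
    parser.add_argument("-t", "--transitive", action="store_true")
    parser.add_argument("-debug", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Per the guide, -t only has effect when -a is also active.
    transitive_active = args.transitive and args.appositive
    print(args, "transitive active:", transitive_active)
```

Running the sketch with no arguments shows every extraction module set to False, which matches the "disabled by default" behavior described above.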
Practical examples
- TXT input, JSON output (defaults):
poetry run python3 src/dptoie/main.py -i ./inputs/ceten-200.txt -it txt -o ./outputs/out.json -ot json
- TXT input, CSV output, enabling coordinating conjunctions (flag example):
poetry run python3 src/dptoie/main.py -i ./inputs/ceten-200.txt -it txt -o ./outputs/out.csv -ot csv -cc
- TXT input, human-readable text output:
poetry run python3 src/dptoie/main.py -i ./inputs/ceten-200.txt -it txt -o ./outputs/out.txt -ot txt -cc -sc -a -t
- Input already in CoNLL-U, JSON output:
poetry run python3 src/dptoie/main.py -i ./inputs/teste.conll -it conll -o ./outputs/out.json -ot json -cc -sc -a -t
- Only coordinating conjunctions:
poetry run python3 src/dptoie/main.py -i ./inputs/ceten-200.txt -it txt -o ./outputs/cc.json -ot json -cc
- Debug mode for detailed inspection:
poetry run python3 src/dptoie/main.py -i ./inputs/ceten-200.txt -it txt -o ./outputs/out.json -ot json -cc -debug
- Show arguments list:
poetry run python3 src/dptoie/main.py -h
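When several runs like the ones above need to be scripted, the command line can be assembled programmatically. The following is a small sketch under stated assumptions: build_command is a hypothetical helper, not something the project ships, and rejecting -t without -a is a choice made here for illustration (the guide only says -t has no effect without -a).

```python
import shlex

MAIN_SCRIPT = "src/dptoie/main.py"
# Flags documented in this guide; anything else is rejected by the helper.
KNOWN_FLAGS = {"-cc", "-sc", "-hs", "-a", "-t", "-debug"}


def build_command(input_path, output_path, input_type="txt", output_type="json", flags=()):
    """Return the argv list for one run (hypothetical helper)."""
    flags = list(flags)
    unknown = set(flags) - KNOWN_FLAGS
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    if "-t" in flags and "-a" not in flags:
        # Illustration-only policy: fail early instead of silently ignoring -t.
        raise ValueError("-t only has effect together with -a")
    return [
        "poetry", "run", "python3", MAIN_SCRIPT,
        "-i", input_path, "-it", input_type,
        "-o", output_path, "-ot", output_type,
        *flags,
    ]


if __name__ == "__main__":
    cmd = build_command("./inputs/ceten-200.txt", "./outputs/cc.json", flags=["-cc"])
    # Print the shell form; pass `cmd` to subprocess.run to actually execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps the arguments safe to pass to subprocess.run without shell quoting issues.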
Expected outputs:
- JSON: a list of objects per sentence, with extractions inside extractions and possible sub_extractions.
- CSV: columns id, sentence, arg1, rel, arg2 (includes sub-extractions with hierarchical ids like 1.1).
- TXT: the sentence followed by extractions and sub-extractions formatted as lines.
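The relation between the JSON and CSV outputs can be illustrated with a short sketch that flattens nested extractions into rows with hierarchical ids. The JSON key names sentence, arg1, rel, and arg2, and the numbering scheme (a running counter per extraction, with sub-extractions numbered n.1, n.2, ...) are assumptions made for illustration; only extractions, sub_extractions, the CSV column names, and ids like 1.1 come from the description above. Inspect a real output file before relying on this layout.

```python
import csv
import io

COLUMNS = ["id", "sentence", "arg1", "rel", "arg2"]


def _row(row_id, sentence, extraction):
    # Key names arg1/rel/arg2 in the JSON are assumed, not confirmed by the guide.
    return {
        "id": row_id,
        "sentence": sentence,
        "arg1": extraction.get("arg1", ""),
        "rel": extraction.get("rel", ""),
        "arg2": extraction.get("arg2", ""),
    }


def flatten_extractions(sentences):
    """Turn the nested per-sentence structure into flat CSV-style rows."""
    rows = []
    counter = 0  # assumed numbering: one running id per top-level extraction
    for item in sentences:
        sentence = item.get("sentence", "")
        for extraction in item.get("extractions", []):
            counter += 1
            rows.append(_row(str(counter), sentence, extraction))
            subs = extraction.get("sub_extractions", [])
            for sub_index, sub in enumerate(subs, start=1):
                rows.append(_row(f"{counter}.{sub_index}", sentence, sub))
    return rows


def to_csv(rows):
    """Serialize rows using the column order listed in this guide."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```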
How to run with Docker (without Compose)
Build the image (from the project root):
docker build -t dptoie_python .
Run a one-off command (mounting the current directory and pointing to files inside the container):
docker run --rm -it \
-e STANZA_RESOURCES_DIR=/dptoie_python/models/.stanza_resources \
-v "$(pwd)":/dptoie_python \
-w /dptoie_python \
dptoie_python \
poetry run python3 src/dptoie/main.py -i /dptoie_python/inputs/teste.conll -it conll -o /dptoie_python/outputs/out.json -ot json -cc -sc -a -t
Note: adjust the -i and -o paths as needed; use -it txt when the input is line-by-line text.
How to run with Docker Compose
The docker-compose.yml file already includes the dptoie_python service. You can edit its command: line for the desired scenario. Recommended example command:
command: poetry run python3 src/dptoie/main.py -i /dptoie_python/inputs/teste.conll -it conll -o /dptoie_python/outputs/out.json -ot json -cc -sc -a -t
Then run:
docker compose up --build
Use run to execute other custom commands:
docker compose run dptoie_python poetry run python3 src/dptoie/main.py -i /dptoie_python/inputs/ceten-200.txt -it txt -o /dptoie_python/outputs/out.csv -ot csv -cc
Tips:
- The volume .:/dptoie_python allows using local files inside the container.
- STANZA_RESOURCES_DIR (exposed in the compose file) can point to models/.stanza_resources to avoid repeated downloads.
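To avoid repeated downloads, the models directory can also be populated ahead of time. The sketch below resolves the directory from STANZA_RESOURCES_DIR and fetches the Portuguese models with stanza.download. The helper names are hypothetical, and the fallback path is simply the example directory from this guide; how the project itself consumes the variable is not described here.

```python
import os
from pathlib import Path

# Fallback taken from the example path in this guide (an assumption, not a project default).
FALLBACK_DIR = "./models/.stanza_resources"


def resolve_stanza_dir(fallback: str = FALLBACK_DIR) -> Path:
    """Return the models directory named by STANZA_RESOURCES_DIR, or the fallback."""
    return Path(os.environ.get("STANZA_RESOURCES_DIR", fallback)).expanduser()


def predownload_portuguese_models() -> Path:
    """Download the Portuguese Stanza models into the resolved directory."""
    import stanza  # imported lazily so the resolver works without Stanza installed

    target = resolve_stanza_dir()
    target.mkdir(parents=True, exist_ok=True)
    stanza.download("pt", model_dir=str(target))
    return target


if __name__ == "__main__":
    print("Models stored in:", predownload_portuguese_models())
```

Running this once on the host, with the project directory mounted as a volume, should let later container runs reuse the same files.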
Quick references
- TXT input: each line is a sentence; the system creates a temporary .conll.
- CoNLL-U input: use -it conll and ensure sentences are separated by an empty line.
- Rule activation: all rules are disabled by default; add the desired flags.
- Relative paths are interpreted from the project root; in Docker, use absolute paths inside the container (e.g., /dptoie_python/...).
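The two input conventions above can be checked before a run with a few lines of code. This is an independent sketch, not the project's reader: it splits CoNLL-U text into sentence blocks on empty lines and splits each token line into the ten tab-separated CoNLL-U columns. The function names are hypothetical.

```python
def split_conllu_blocks(text: str) -> list[list[str]]:
    """Group non-empty lines into sentence blocks separated by empty lines."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # An empty line closes the current sentence block.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def token_rows(block: list[str]) -> list[list[str]]:
    """Return the tab-separated columns of each token line, skipping # comments."""
    return [line.split("\t") for line in block if not line.startswith("#")]


def txt_sentences(text: str) -> list[str]:
    """For TXT input, each non-empty line is treated as one sentence."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

A quick sanity check for a CoNLL-U file is that every row returned by token_rows has exactly ten columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).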
How to cite this project
If you find this repo helpful, please consider citing:
@Article{dptoie2025, author={xxx xxx}, title={xxxx}, journal={dddd}, year={xxx}, month={x}, day={cc}, issn={xxx}, doi={xxxxx}, url={asas} }
Authors
- Andre Walker
- Rafael Glauber
- Daniela Barreiro Claro