ArchiTXT is a tool for structuring textual data into a valid database model. It is guided by a meta-grammar and uses an iterative process of tree rewriting.
ArchiTXT: Text-to-Database Structuring Tool
ArchiTXT converts unstructured textual data into structured formats that are ready for database storage. It automatically generates a database schema and the corresponding data instances, which simplifies the integration of text-based information into database systems.
Unstructured text is hard to store and query in a structured database. ArchiTXT bridges this gap by transforming raw text into organized, query-friendly structures.
Installation
To install ArchiTXT, make sure you have Python 3.10+ and pip installed. Then, run:
pip install architxt
To install the development version directly from Git, run:
pip install git+https://github.com/Neplex/ArchiTXT.git
Usage
ArchiTXT is built to work with BRAT-annotated corpora that include pre-labeled named entities. It also requires access to a CoreNLP server, which you can set up using the Docker configuration available in the source repository.
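For readers unfamiliar with BRAT, the sketch below shows what pre-labeled named entities look like in BRAT's standoff format, where each entity is a tab-separated line in a `.ann` file. It is an illustration only and not ArchiTXT's own loader: the `Entity` class, the `parse_brat_entities` function, and the sample annotations are hypothetical names and data introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    ident: str  # annotation ID, e.g. "T1"
    label: str  # entity type, e.g. "Person"
    start: int  # character offset where the mention starts
    end: int    # character offset just past the mention
    text: str   # surface text of the mention


def parse_brat_entities(ann_text: str) -> list[Entity]:
    """Parse text-bound ("T") annotations from the contents of a .ann file."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes, and notes
        ident, type_and_span, text = line.split("\t", 2)
        label, span = type_and_span.split(" ", 1)
        # Discontinuous mentions list several "start end" fragments
        # separated by ";"; this sketch keeps only the outermost offsets.
        fragments = [fragment.split() for fragment in span.split(";")]
        start = int(fragments[0][0])
        end = int(fragments[-1][1])
        entities.append(Entity(ident, label, start, end, text))
    return entities


# Hypothetical annotations for the sentence "Alice lives in Paris".
sample = "T1\tPerson 0 5\tAlice\nT2\tCity 15 20\tParis\nR1\tLivesIn Arg1:T1 Arg2:T2"
for entity in parse_brat_entities(sample):
    print(entity)
```

Only the `T` lines carry named entities; the relation line (`R1`) is skipped because the paragraph above only requires entity annotations.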
$ architxt --help
Usage: architxt [OPTIONS] COMMAND [ARGS]...
ArchiTXT is a tool for structuring textual data into a valid database model.
It is guided by a meta-grammar and uses an iterative process of tree rewriting.
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or customize the installation. │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ run Extract a database schema form a corpus. │
│ ui Launch the web-based UI. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
$ architxt run --help
Usage: architxt run [OPTIONS] CORPUS_PATH
Extract a database schema form a corpus.
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * corpus_path PATH Path to the input corpus. [default: None] [required] │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --tau FLOAT The similarity threshold. [default: 0.7] │
│ --epoch INTEGER Number of iteration for tree rewriting. [default: 100] │
│ --min-support INTEGER Minimum support for tree patterns. [default: 20] │
│ --corenlp-url TEXT URL of the CoreNLP server. [default: http://localhost:9000] │
│ --gen-instances INTEGER Number of synthetic instances to generate. [default: 0] │
│ --language TEXT Language of the input corpus. [default: French] │
│ --debug --no-debug Enable debug mode for more verbose output. [default: no-debug] │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
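The options listed above can also be assembled programmatically, for example when scripting several runs with different thresholds. The following is a minimal sketch that only builds the command line from the flags and defaults shown in the help output; `build_run_command` and the `corpus/` path are hypothetical names used for illustration.

```python
import shlex


def build_run_command(
    corpus_path: str,
    *,
    tau: float = 0.7,
    epoch: int = 100,
    min_support: int = 20,
    corenlp_url: str = "http://localhost:9000",
    gen_instances: int = 0,
    language: str = "French",
    debug: bool = False,
) -> list[str]:
    """Build the argument list for `architxt run` (defaults mirror the help text)."""
    return [
        "architxt", "run",
        "--tau", str(tau),
        "--epoch", str(epoch),
        "--min-support", str(min_support),
        "--corenlp-url", corenlp_url,
        "--gen-instances", str(gen_instances),
        "--language", language,
        "--debug" if debug else "--no-debug",
        corpus_path,  # the required CORPUS_PATH argument goes last
    ]


command = build_run_command("corpus/", tau=0.5, language="English")
print(shlex.join(command))
# To actually launch the run: subprocess.run(command, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when the corpus path contains spaces.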
To deploy the CoreNLP server, run Docker Compose from a checkout of the source repository:
docker compose up -d
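Before starting a run, it can help to confirm that the server is actually listening. The sketch below assumes the default URL from the `--corenlp-url` option (`http://localhost:9000`) and only checks that something answers HTTP requests there; it does not verify that the service is CoreNLP. The function name `is_server_reachable` is a hypothetical helper.

```python
import urllib.error
import urllib.request


def is_server_reachable(url: str = "http://localhost:9000", timeout: float = 5.0) -> bool:
    """Return True if an HTTP server answers at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just with an error status code.
        return True
    except OSError:
        # Connection refused, timeout, DNS failure, and similar errors.
        return False


if is_server_reachable():
    print("Server is reachable.")
else:
    print("Server is not reachable; is the container running?")
```

The server may need some time to load after the container starts, so a negative result right after `docker compose up -d` is not necessarily a failure.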
Development
Setting Up the Development Environment with Poetry
Make sure Poetry is installed; if it is not, follow the official installation instructions. Then set up the development environment by running:
poetry install
Enabling Pre-Commit Hook
This project uses pre-commit for managing Git hooks.
Poetry should already have installed it as a dev dependency.
To enable the pre-commit hooks locally, run the following command:
poetry run pre-commit install
Once set up, the pre-commit hooks will automatically run every time you make a commit, ensuring code standards are followed.
Meta-Grammar
ArchiTXT uses ANTLR (ANother Tool for Language Recognition) to generate the parser and lexer for the meta-grammar that verifies the database schema's validity. The meta-grammar ensures that the database schema conforms to the expected structure and semantics.
You can view the meta-grammar definition in the metagrammar.g4 file.
To regenerate the parser/lexer for the meta-grammar, run the following command:
$ poetry run antlr4 -Dlanguage=Python3 metagrammar.g4 -o architxt/grammar