ArchiTXT is a tool for structuring textual data into a valid database model. It is guided by a meta-grammar and uses an iterative process of tree rewriting.

ArchiTXT: Text-to-Database Structuring Tool

ArchiTXT is a robust tool designed to convert unstructured textual data into structured formats that are ready for database storage. It automates the generation of database schemas and creates corresponding data instances, simplifying the integration of text-based information into database systems.

Working with unstructured text can be challenging when you need to store and query it in a structured database. ArchiTXT bridges this gap by transforming raw text into organized, query-friendly structures. By automating both schema generation and data instance creation, it streamlines the entire process of managing textual information in databases.

Installation

To install ArchiTXT, make sure you have Python 3.10+ and pip installed. Then, run:

pip install architxt

To install the development version directly from the Git repository, run:

pip install git+https://github.com/Neplex/ArchiTXT.git

Usage

ArchiTXT works with BRAT-annotated corpora that include pre-labeled named entities. It also requires access to a CoreNLP server, which you can set up using the Docker configuration available in the source repository.
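To make the expected input more concrete, here is a minimal sketch of what named-entity annotations look like in the BRAT standoff format and how they can be read. It is not ArchiTXT's own loader: the `Entity` type, the `parse_brat_entities` function, and the sample annotations are hypothetical names introduced for illustration only.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """A text-bound annotation from a BRAT .ann file (hypothetical helper type)."""
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat_entities(ann_text: str) -> list[Entity]:
    """Collect the entity lines (IDs starting with 'T') from BRAT .ann content."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes, and notes
        ann_id, span, text = line.split("\t", 2)
        label, offsets = span.split(" ", 1)
        # Discontinuous spans are written "0 5;10 15"; keep only the outer bounds.
        fragments = [fragment.split() for fragment in offsets.split(";")]
        start = int(fragments[0][0])
        end = int(fragments[-1][1])
        entities.append(Entity(ann_id, label, start, end, text))
    return entities


# Made-up annotations for the sentence "Alice lives in Paris".
sample = "T1\tPerson 0 5\tAlice\nT2\tCity 15 20\tParis\nR1\tLivesIn Arg1:T1 Arg2:T2\n"
entities = parse_brat_entities(sample)
for entity in entities:
    print(entity.label, entity.text, entity.start, entity.end)
```

Each entity line carries an ID, a label with character offsets into the companion .txt file, and the covered text, separated by tabs.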

$ architxt --help

 Usage: architxt [OPTIONS] COMMAND [ARGS]...

 ArchiTXT is a tool for structuring textual data into a valid database model.
 It is guided by a meta-grammar and uses an iterative process of tree rewriting.

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion          Install completion for the current shell.                                        │
│ --show-completion             Show completion for the current shell, to copy it or customize the installation. │
│ --help                        Show this message and exit.                                                      │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ run   Extract a database schema form a corpus.                                                                 │
│ ui    Launch the web-based UI.                                                                                 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
$ architxt run --help

 Usage: architxt run [OPTIONS] CORPUS_PATH

 Extract a database schema form a corpus.

╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    corpus_path      PATH  Path to the input corpus. [default: None] [required]                               │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --tau                            FLOAT    The similarity threshold. [default: 0.7]                             │
│ --epoch                          INTEGER  Number of iteration for tree rewriting. [default: 100]               │
│ --min-support                    INTEGER  Minimum support for tree patterns. [default: 20]                     │
│ --corenlp-url                    TEXT     URL of the CoreNLP server. [default: http://localhost:9000]          │
│ --gen-instances                  INTEGER  Number of synthetic instances to generate. [default: 0]              │
│ --language                       TEXT     Language of the input corpus. [default: French]                      │
│ --debug            --no-debug             Enable debug mode for more verbose output. [default: no-debug]       │
│ --help                                    Show this message and exit.                                          │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
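As an illustration of how the options above combine, the following sketch assembles an `architxt run` command line from Python. It uses only the flags listed in the help output; the `build_run_command` helper, the corpus path, and the chosen values are assumptions made for the example, not recommended settings.

```python
import shlex


def build_run_command(
    corpus_path: str,
    *,
    tau: float = 0.7,
    epoch: int = 100,
    min_support: int = 20,
    corenlp_url: str = "http://localhost:9000",
    language: str = "French",
    debug: bool = False,
) -> list[str]:
    """Build the argument list for `architxt run` (hypothetical helper)."""
    command = [
        "architxt", "run",
        "--tau", str(tau),
        "--epoch", str(epoch),
        "--min-support", str(min_support),
        "--corenlp-url", corenlp_url,
        "--language", language,
    ]
    if debug:
        command.append("--debug")
    command.append(corpus_path)
    return command


# "path/to/corpus" is a placeholder for a directory of BRAT-annotated files.
command = build_run_command("path/to/corpus", tau=0.8, min_support=10)
print(shlex.join(command))

# To actually launch the run (requires ArchiTXT and a reachable CoreNLP server):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when the corpus path contains spaces.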

To deploy the CoreNLP server using the source repository, you can use Docker Compose with the following command:

docker compose up -d
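Before starting a run, it can help to confirm that the server is actually reachable. The sketch below does this with the Python standard library only; the `corenlp_is_up` function is a hypothetical helper, and it assumes the server answers plain HTTP requests on its base URL (the default `http://localhost:9000` from the options above).

```python
import urllib.error
import urllib.request


def corenlp_is_up(url: str = "http://localhost:9000", timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET on `url` succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    if corenlp_is_up():
        print("CoreNLP server is reachable.")
    else:
        print("CoreNLP server is not reachable; is the container running?")
```

A freshly started container may need some time to load its models, so a failed check right after `docker compose up -d` does not necessarily indicate a misconfiguration.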

Development

Setting Up the Development Environment with Poetry

To set up the development environment using Poetry, ensure that you have Poetry installed. You can install it by following the official installation instructions. Once installed, you can set up the development environment by running the following command:

poetry install

Enabling Pre-Commit Hook

This project uses pre-commit to manage Git hooks. Poetry installs it as a dev dependency, so it should already be available. To enable the pre-commit hooks locally, run the following command:

poetry run pre-commit install

Once set up, the pre-commit hooks will automatically run every time you make a commit, ensuring code standards are followed.

Meta-Grammar

ArchiTXT uses ANTLR (ANother Tool for Language Recognition) to generate the parser and lexer for the meta-grammar, which verifies that the database schema is valid, that is, that it conforms to the expected structure and semantics.

You can view the meta-grammar definition in the metagrammar.g4 file.

To regenerate the parser/lexer for the meta-grammar, run the following command:

$ poetry run antlr4 -Dlanguage=Python3 metagrammar.g4 -o architxt/grammar
