ArchiTXT is a tool for structuring textual data into a valid database model. It is guided by a meta-grammar and uses an iterative process of tree rewriting.

ArchiTXT: Text-to-Database Structuring Tool

Project Status: WIP. Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

ArchiTXT is a Python library and CLI tool that automatically converts unstructured text corpora into structured, database-ready data. It infers database schemas directly from text and generates corresponding structured instances using a meta-grammar and iterative tree-rewriting process.

ArchiTXT is designed for researchers, data engineers, and NLP practitioners who need a transparent and auditable process to transform raw textual data into storable, queryable and machine-learning-ready datasets.

Why ArchiTXT?

Working with unstructured text becomes complex when you need:

  • Structured storage
  • Queryable entities and relations
  • Reproducible data modeling

ArchiTXT bridges this gap by:

  • Discovering latent structural patterns in annotated corpora
  • Automatically generating database schemas
  • Producing structured instances aligned with the inferred schema
  • Ensuring transparency through rule-based rewriting

Installation

To install ArchiTXT, make sure you have Python 3.10+ and pip installed. Then, run:

pip install architxt

To get the development version, install it directly from the Git repository:

pip install git+https://github.com/Neplex/ArchiTXT.git

Usage

ArchiTXT works with BRAT-annotated corpora that include pre-labeled named entities. It can parse the texts using either CoreNLP or spaCy, depending on your preference and setup. See the documentation for more information.
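
For readers unfamiliar with the input format, BRAT stores annotations in standoff `.ann` files where each entity line holds an identifier, a label with character offsets, and the covered text, separated by tabs. The sketch below reads such entity lines with plain Python. It is an illustration of the file format only: the `Entity` class and `parse_brat_entities` function are hypothetical names and do not reflect ArchiTXT's own loader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    ident: str                          # annotation id, e.g. "T1"
    label: str                          # entity type, e.g. "Person"
    spans: tuple[tuple[int, int], ...]  # (start, end) offsets; several if discontinuous
    text: str                           # text covered by the annotation


def parse_brat_entities(ann_text: str) -> list[Entity]:
    """Extract text-bound (entity) annotations from the content of a .ann file."""
    entities = []
    for line in ann_text.splitlines():
        # Entity lines start with "T"; relations, events, notes, etc. are skipped.
        if not line.startswith("T"):
            continue
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # malformed line, ignored in this sketch
        ident, info, text = parts
        label, _, offsets = info.partition(" ")
        # Discontinuous annotations separate their fragments with ";".
        spans = tuple(
            (int(start), int(end))
            for start, end in (fragment.split() for fragment in offsets.split(";"))
        )
        entities.append(Entity(ident, label, spans, text))
    return entities


if __name__ == "__main__":
    # Invented sample annotating the sentence "Alice has severe diabetes."
    sample = "T1\tPerson 0 5\tAlice\nT2\tDisease 17 25\tdiabetes\nR1\tHas Arg1:T1 Arg2:T2\n"
    for entity in parse_brat_entities(sample):
        print(entity)
```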

The CoreNLP parser requires access to a running CoreNLP server, which you can set up using the Docker Compose configuration available in the source repository. To deploy it, run:

docker compose up -d corenlp
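
Before parsing, it can be useful to check that the server actually answers. The sketch below sends a plain HTTP request with the standard library. The address `http://localhost:9000` is an assumption (9000 is the CoreNLP server's usual default port); check the Docker Compose file in the repository for the port it really exposes. The function name is illustrative.

```python
import urllib.request


def corenlp_is_reachable(base_url: str = "http://localhost:9000", timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are OSError subclasses).
        return False


if __name__ == "__main__":
    # Assumed address; adjust to match your deployment.
    print("CoreNLP reachable:", corenlp_is_reachable())
```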

After parsing the annotated texts into ArchiTXT's internal representation, you can infer a database schema from the annotated entities and generate structured instances that conform to it. See the documentation for more information.

The result can be exported as a relational or property graph database. See the documentation for more information.
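
To make the idea of a relational export concrete, here is a toy sketch that writes one group of entity values into a SQLite table, with one column per entity type. It is independent of ArchiTXT's actual exporters: the function `export_group`, the table name, the column names, and the rows are all invented for illustration.

```python
import sqlite3


def export_group(conn: sqlite3.Connection, table: str, columns: list[str], rows: list[tuple]) -> None:
    """Create a table with one TEXT column per entity type and insert the rows."""
    # Identifiers cannot be bound as SQL parameters, so validate them first.
    for name in (table, *columns):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    column_defs = ", ".join(f'"{column}" TEXT' for column in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    # Hypothetical inferred group: each row is one structured instance.
    export_group(
        connection,
        "Diagnosis",
        ["Person", "Disease"],
        [("Alice", "diabetes"), ("Bob", "asthma")],
    )
    for row in connection.execute('SELECT "Person", "Disease" FROM "Diagnosis"'):
        print(row)
```

Once the data is in a table like this, it can be queried with ordinary SQL, which is the practical benefit of moving from annotated text to a database instance.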

ArchiTXT is available as a Python library but also provides a command-line interface (CLI) for users who prefer working in the terminal. You can run the CLI using:

architxt --help

Sponsors

This work has received support under the JUNON Program, with financial support from Région Centre-Val de Loire (France).

Download files

Source distribution: architxt-0.7.0.tar.gz (118.5 kB)

  • SHA256: 302eae63a247105bc4bf13b0fa47ab4a88795e29fe3b53fe989a57f9cc4edd1a
  • MD5: 6c774b5ba1f0d1bc21a817ef9d1f735f
  • BLAKE2b-256: 6c6159e8f50d4f54efafb6ff7833496c4f9f3393691d531d15b2686d7d2e8278

Built distribution: architxt-0.7.0-py3-none-any.whl (144.7 kB, Python 3)

  • SHA256: 6d4396b0074c64ac1ab1a0ac396bc8b33afa497dab97ef81d61d79f977dd5679
  • MD5: 5fc2e705384ce634027022d8b6e85002
  • BLAKE2b-256: b86de204ae6f9ad316c5766c2d5660d140a4abefd61e9e8712b7ce83eb002676

Both files were uploaded using Trusted Publishing via twine/6.1.0 CPython/3.13.7. Their attestation bundles were published by python-build.yml on Neplex/ArchiTXT.