ArchiTXT is a tool for structuring textual data into a valid database model. It is guided by a meta-grammar and uses an iterative process of tree rewriting.

ArchiTXT: Text-to-Database Structuring Tool

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

ArchiTXT is a Python library and CLI tool that automatically converts unstructured text corpora into structured, database-ready data. It infers database schemas directly from text and generates the corresponding structured instances using a meta-grammar and an iterative tree-rewriting process.

ArchiTXT is designed for researchers, data engineers, and NLP practitioners who need a transparent and auditable process to transform raw textual data into storable, queryable and machine-learning-ready datasets.

Why ArchiTXT?

Working with unstructured text becomes complex when you need:

  • Structured storage
  • Queryable entities and relations
  • Reproducible data modeling

ArchiTXT bridges this gap by:

  • Discovering latent structural patterns in annotated corpora
  • Automatically generating database schemas
  • Producing structured instances aligned with the inferred schema
  • Ensuring transparency through rule-based rewriting

Installation

To install ArchiTXT, make sure you have Python 3.10+ and pip installed. Then, run:

pip install architxt

To install the development version, install it directly from the Git repository:

pip install git+https://github.com/Neplex/ArchiTXT.git
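
After installing, it can help to confirm that the interpreter meets the stated Python 3.10+ requirement and that the package is visible to it. The following is a minimal sketch using only the standard library; the helper names `python_is_supported` and `installed_version` are made up for this example and are not part of ArchiTXT.

```python
import sys
from importlib import metadata
from typing import Optional

# Minimum interpreter version stated in the installation notes above.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given version tuple meets the documented minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(distribution: str = "architxt") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("architxt version:", installed_version() or "not installed")
```

If the second line reports "not installed", the package was probably installed into a different environment than the one running the script.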

Usage

ArchiTXT is built to work with BRAT-annotated corpora that include pre-labeled named entities. It can parse the texts using either CoreNLP or spaCy, depending on your preference and setup. See the documentation for more information.
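
For readers unfamiliar with BRAT, the sketch below shows what the entity lines of a BRAT standoff `.ann` file look like and how they can be read. It is an illustration of the input format only and does not use ArchiTXT's own loader; the `Entity` class and `parse_brat_entities` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: str                             # e.g. "T1"
    label: str                          # entity type, e.g. "Organization"
    spans: tuple[tuple[int, int], ...]  # (start, end) character offsets
    text: str                           # surface text covered by the spans


def parse_brat_entities(ann: str) -> list[Entity]:
    """Extract text-bound annotations (the T lines) from BRAT standoff content."""
    entities = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue  # skip relations (R), events (E), attributes (A), notes (#)
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed or incomplete line
        ident, middle, text = parts[0], parts[1], parts[2]
        label, _, offsets = middle.partition(" ")
        # Discontinuous entities list several fragments separated by ';'.
        spans = tuple(
            (int(start), int(end))
            for start, end in (fragment.split() for fragment in offsets.split(";"))
        )
        entities.append(Entity(ident, label, spans, text))
    return entities


# Annotations for the sentence "Sony is based in Tokyo".
SAMPLE = "T1\tOrganization 0 4\tSony\nT2\tLocation 17 22\tTokyo\nR1\tLocated Arg1:T1 Arg2:T2\n"

if __name__ == "__main__":
    for entity in parse_brat_entities(SAMPLE):
        print(entity)
```

Only the pre-labeled entities are read here; the syntactic parsing step performed with CoreNLP or spaCy is out of scope for this sketch.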

Parsing with CoreNLP requires access to a CoreNLP server, which you can set up using the Docker Compose configuration available in the source repository. To deploy it, run:

docker compose up -d corenlp
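
Once the container is up, a quick connectivity check can save debugging time later. The sketch below only tests whether a TCP port accepts connections. The host and port are assumptions: 9000 is the CoreNLP server's usual default, but the port actually published by the repository's Compose file may differ, so adjust it to your setup. `server_is_reachable` is a helper written for this example.

```python
import socket


def server_is_reachable(host: str = "localhost", port: int = 9000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    The defaults are assumptions for illustration, not values taken from
    the ArchiTXT repository.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    print("CoreNLP reachable:", server_is_reachable())
```

A successful connection shows only that something is listening on the port, not that the server has finished loading its models.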

After parsing the annotated texts into ArchiTXT's internal representation, you can infer a database schema from the annotated entities and generate structured instances that conform to it. See the documentation for more information.
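
To make "schema plus instances" concrete, here is a deliberately naive toy. It is not ArchiTXT's algorithm: it involves no meta-grammar and no tree rewriting, and simply groups annotated sentences by the set of entity labels they contain, treating each distinct label set as one relation. The function name `infer_toy_schema` and the sample data are invented for this illustration.

```python
from collections import defaultdict


def infer_toy_schema(records):
    """Group records by their entity labels.

    records: iterable of lists of (label, value) pairs, one list per sentence.
    Returns {relation_name: {"columns": [...], "rows": [...]}}.
    """
    tables = defaultdict(list)
    for record in records:
        row = dict(record)
        columns = tuple(sorted(row))  # the label set identifies the relation
        tables[columns].append(tuple(row[column] for column in columns))
    return {
        "_".join(columns): {"columns": list(columns), "rows": rows}
        for columns, rows in tables.items()
    }


if __name__ == "__main__":
    sentences = [
        [("Person", "Ada"), ("City", "London")],
        [("City", "Paris"), ("Person", "Marie")],
        [("Drug", "aspirin"), ("Dose", "100mg")],
    ]
    for name, table in infer_toy_schema(sentences).items():
        print(name, table["columns"], table["rows"])
```

The point of the toy is the shape of the output: a set of relations with named attributes (the schema) and rows that fit them (the instances). How ArchiTXT actually discovers those groupings is described in its documentation.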

The result can be exported as a relational or property graph database. See the documentation for more information.
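
As an illustration of what a relational export amounts to, the sketch below writes one relation (a name, its columns, and its rows) into an in-memory SQLite database using the standard `sqlite3` module. It does not call ArchiTXT's exporter; `export_to_sqlite` and the sample table are hypothetical, every column is stored as TEXT, and table and column names are assumed to be trusted identifiers.

```python
import sqlite3


def export_to_sqlite(conn, table, columns, rows):
    """Create a table with TEXT columns and insert the given rows.

    Assumes `table` and `columns` are trusted identifiers without double
    quotes; the values themselves are passed as bound parameters.
    """
    column_defs = ", ".join(f'"{name}" TEXT' for name in columns)
    column_list = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE "{table}" ({column_defs})')
    conn.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})', rows
    )
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    export_to_sqlite(
        connection,
        "Residence",
        ["City", "Person"],
        [("London", "Ada"), ("Paris", "Marie")],
    )
    for row in connection.execute('SELECT "Person", "City" FROM "Residence"'):
        print(row)
```

A property graph export would instead map entities to nodes and relations to edges; that variant is not sketched here.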

ArchiTXT is available as a Python library but also provides a command-line interface (CLI) for users who prefer working in the terminal. You can run the CLI using:

architxt --help

Sponsors

This work has received support under the JUNON Program, with financial support from Région Centre-Val de Loire (France).

Download files

Source Distribution

architxt-0.7.1.tar.gz (118.5 kB)

Uploaded Source

Built Distribution

architxt-0.7.1-py3-none-any.whl (144.7 kB)

Uploaded Python 3

File details

Details for the file architxt-0.7.1.tar.gz.

File metadata

  • Download URL: architxt-0.7.1.tar.gz
  • Upload date:
  • Size: 118.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for architxt-0.7.1.tar.gz
Algorithm    Hash digest
SHA256       b4e4c1e142be93a8d317d0d43fcbb1433f27c5097335cc0a5eb72c1044c706b0
MD5          feb0260ae54a89a32387a778db6a5c38
BLAKE2b-256  ed4bf0d6411630a4ab62ae9e6db30e99fca4c3e324c23f09dd2fbffc388e84b4
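
A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch; `sha256_of` and `matches_digest` are helper names chosen for this example, and the path in the usage section assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Digest copied from the listing above; the local path is an assumption.
    expected = "b4e4c1e142be93a8d317d0d43fcbb1433f27c5097335cc0a5eb72c1044c706b0"
    print(matches_digest("architxt-0.7.1.tar.gz", expected))
```

The same function works for the wheel by substituting its file name and SHA256 value.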

Provenance

The following attestation bundles were made for architxt-0.7.1.tar.gz:

Publisher: python-build.yml on Neplex/ArchiTXT

File details

Details for the file architxt-0.7.1-py3-none-any.whl.

File metadata

  • Download URL: architxt-0.7.1-py3-none-any.whl
  • Upload date:
  • Size: 144.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for architxt-0.7.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       b3caa900541cb3ec68bbce86bb27c37a91ecf0ddd45f445ead9d1183dae42bde
MD5          667ebd7f8481a2a41de7d49255f4e921
BLAKE2b-256  5fdda5ce9010ca1ec8dfb7500b1ef23f1d436ce3f6cf22f92ae91d32dbaf51b0

Provenance

The following attestation bundles were made for architxt-0.7.1-py3-none-any.whl:

Publisher: python-build.yml on Neplex/ArchiTXT
