ArchiTXT is a tool for structuring textual data into a valid database model. It is guided by a meta-grammar and uses an iterative process of tree rewriting.

ArchiTXT: Text-to-Database Structuring Tool

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

ArchiTXT is a Python library and CLI tool that automatically converts unstructured text corpora into structured, database-ready data. It infers database schemas directly from text and generates corresponding structured instances using a meta-grammar and iterative tree-rewriting process.

ArchiTXT is designed for researchers, data engineers, and NLP practitioners who need a transparent and auditable process to transform raw textual data into storable, queryable and machine-learning-ready datasets.

Why ArchiTXT?

Working with unstructured text becomes complex when you need:

  • Structured storage
  • Queryable entities and relations
  • Reproducible data modeling

ArchiTXT bridges this gap by:

  • Discovering latent structural patterns in annotated corpora
  • Automatically generating database schemas
  • Producing structured instances aligned with the inferred schema
  • Ensuring transparency through rule-based rewriting

Installation

To install ArchiTXT, make sure you have Python 3.10+ and pip installed. Then, run:

pip install architxt

To install the development version directly from the Git repository, run:

pip install git+https://github.com/Neplex/ArchiTXT.git

Usage

ArchiTXT is built to work with BRAT-annotated corpora that include pre-labeled named entities. It can parse the texts using either CoreNLP or spaCy, depending on your preference and setup. See the documentation for more information.
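
To make the expected input more concrete, here is a minimal sketch of how the entity lines of a BRAT standoff `.ann` file can be read. It is not ArchiTXT's loader: the `Entity` class and `parse_brat_entities` function are hypothetical names used for illustration, and the sketch only handles text-bound annotations (lines whose ID starts with `T`), skipping relations, events, and notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: str
    label: str
    spans: tuple  # tuple of (start, end) character offsets
    text: str


def parse_brat_entities(ann: str) -> list[Entity]:
    """Parse text-bound annotations (T lines) from BRAT standoff content."""
    entities = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue  # skip relations (R), events (E), notes (#), and so on
        ident, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous entities separate their fragments with ';'
        spans = tuple(
            tuple(int(n) for n in fragment.split())
            for fragment in offsets.split(";")
        )
        entities.append(Entity(ident, label, spans, text))
    return entities


# Annotations for the sentence "Alice lives in Paris"
sample = "T1\tPerson 0 5\tAlice\nT2\tCity 15 20\tParis\nR1\tLivesIn Arg1:T1 Arg2:T2\n"
for entity in parse_brat_entities(sample):
    print(entity.label, entity.spans, entity.text)
```

In a real corpus each `.ann` file sits next to a `.txt` file with the same base name, and the offsets index into that text.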

The CoreNLP parser requires access to a CoreNLP server, which you can set up using the Docker Compose configuration available in the source repository. To deploy it, run:

docker compose up -d corenlp
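
Once the container is running, a quick way to check that the server answers is to send it a sentence over HTTP. The sketch below assumes the server listens on `http://localhost:9000`, which is the CoreNLP server's default port; the port actually published by the repository's Compose file may differ, so adjust `base_url` accordingly. The function names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_corenlp_request(
    text: str,
    base_url: str = "http://localhost:9000",  # assumption: default CoreNLP port
    annotators: str = "tokenize,ssplit,pos,parse",
) -> Request:
    """Build a POST request asking a CoreNLP server to annotate `text`."""
    properties = {"annotators": annotators, "outputFormat": "json"}
    query = urlencode({"properties": json.dumps(properties)})
    return Request(f"{base_url}/?{query}", data=text.encode("utf-8"), method="POST")


def parse_with_corenlp(text: str, **kwargs) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urlopen(build_corenlp_request(text, **kwargs), timeout=30) as response:
        return json.load(response)
```

Calling `parse_with_corenlp("Alice lives in Paris.")` should return a dictionary with a `sentences` list when the server is reachable, and raise `urllib.error.URLError` when it is not.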

After the annotated texts are parsed into ArchiTXT's internal representation, you can infer a database schema from the annotated entities and generate the structured instances that conform to it. See the documentation for more information.
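
To give an intuition for what "inferring a schema from annotated entities" can mean, the toy sketch below counts which sets of entity labels co-occur in the same sentence and keeps the frequent ones as candidate relations. This is a deliberately simple frequency heuristic chosen for illustration; it is not ArchiTXT's meta-grammar-guided tree rewriting, and `candidate_relations` and `min_support` are hypothetical names.

```python
from collections import Counter


def candidate_relations(sentences, min_support=2):
    """Return label groups that co-occur in at least `min_support` sentences.

    `sentences` is an iterable of iterables of entity labels, one per sentence.
    """
    counts = Counter(frozenset(labels) for labels in sentences if labels)
    return {
        tuple(sorted(group)): support
        for group, support in counts.items()
        if support >= min_support
    }


annotated = [
    ["Person", "City"],
    ["City", "Person"],
    ["Person", "Disease"],
    ["Person", "City"],
]
print(candidate_relations(annotated))  # {('City', 'Person'): 3}
```

Each surviving group could then be read as a candidate table whose columns are the entity labels.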

The result can be exported as a relational or property graph database. See the documentation for more information.
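
As a rough picture of what a relational export amounts to, the sketch below writes one inferred relation to SQLite with the standard `sqlite3` module. It does not use ArchiTXT's export API; the `export_relation` helper, the table name, and the sample rows are all assumptions made for illustration.

```python
import sqlite3


def export_relation(conn, table, columns, rows):
    """Create a table with TEXT columns and insert the given rows.

    Table and column names are interpolated into the SQL, so they must come
    from a trusted source (here, an inferred schema), never from user input.
    """
    quoted = ['"' + column + '"' for column in columns]
    column_defs = ", ".join(name + " TEXT" for name in quoted)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_defs})')
    conn.executemany(
        f'INSERT INTO "{table}" ({", ".join(quoted)}) VALUES ({placeholders})',
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
export_relation(conn, "lives_in", ["person", "city"], [("Alice", "Paris"), ("Bob", "Tours")])
print(conn.execute('SELECT * FROM "lives_in" ORDER BY person').fetchall())
```

A property graph export would map the same data differently, for example with entities as nodes and relations as edges.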

ArchiTXT is available as a Python library but also provides a command-line interface (CLI) for users who prefer working in the terminal. You can run the CLI using:

architxt --help

Sponsors

This work has received support under the JUNON Program, with financial support from Région Centre-Val de Loire (France).

Download files

Source Distribution

architxt-0.6.1.tar.gz (112.6 kB)

Uploaded Source

Built Distribution

architxt-0.6.1-py3-none-any.whl (135.1 kB)

Uploaded Python 3

File details

Details for the file architxt-0.6.1.tar.gz.

File metadata

  • Download URL: architxt-0.6.1.tar.gz
  • Upload date:
  • Size: 112.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for architxt-0.6.1.tar.gz
Algorithm    Hash digest
SHA256       f4c8c9bea9e48642ffa00212f87d99e185326d40d47f14651e09dbe92b8e19dc
MD5          5abffe5e1e8d5919af4b720261a75906
BLAKE2b-256  53127fba0cda338487616d678598ccea6468ea2719792c9545d7cdd3c2dabeba
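
A downloaded file can be checked against the SHA256 digest listed above with the standard `hashlib` module. The sketch below is one way to do it; the helper names and the local file path in the usage note are assumptions, and the same approach applies to the wheel with its own digest.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for architxt-0.6.1.tar.gz
EXPECTED_SDIST_SHA256 = "f4c8c9bea9e48642ffa00212f87d99e185326d40d47f14651e09dbe92b8e19dc"


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected):
    """Return True when the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected.strip().lower()
```

For example, `matches_digest("architxt-0.6.1.tar.gz", EXPECTED_SDIST_SHA256)` should return `True` for an intact download saved under that name in the current directory.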

Provenance

The following attestation bundles were made for architxt-0.6.1.tar.gz:

Publisher: python-build.yml on Neplex/ArchiTXT

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file architxt-0.6.1-py3-none-any.whl.

File metadata

  • Download URL: architxt-0.6.1-py3-none-any.whl
  • Upload date:
  • Size: 135.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for architxt-0.6.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       1be27a75223c329fc5e2f7f59072bf98e76bbac53c36695908b4a0549383ea24
MD5          0671f0806fcfeed1ef5511ab4d5d88d4
BLAKE2b-256  73f08960dc6545d73669abecc1eb9c2f309aefc3105c0508086551ce8e404a43

Provenance

The following attestation bundles were made for architxt-0.6.1-py3-none-any.whl:

Publisher: python-build.yml on Neplex/ArchiTXT

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
