ArchiTXT is a tool for structuring textual data into a valid database model. It is guided by a meta-grammar and uses an iterative process of tree rewriting.
ArchiTXT: Text-to-Database Structuring Tool
ArchiTXT is a Python library and CLI tool that automatically converts unstructured text corpora into structured, database-ready data. It infers a database schema directly from the text and generates the corresponding structured instances, guided by a meta-grammar and an iterative tree-rewriting process.
ArchiTXT is designed for researchers, data engineers, and NLP practitioners who need a transparent and auditable process to transform raw textual data into storable, queryable and machine-learning-ready datasets.
Why ArchiTXT?
Working with unstructured text becomes complex when you need:
- Structured storage
- Queryable entities and relations
- Reproducible data modeling
ArchiTXT bridges this gap by:
- Discovering latent structural patterns in annotated corpora
- Automatically generating database schemas
- Producing structured instances aligned with the inferred schema
- Ensuring transparency through rule-based rewriting
Installation
To install ArchiTXT, make sure you have Python 3.10+ and pip installed. Then, run:
pip install architxt
For the development version, you can install it directly from the Git repository using:
pip install git+https://github.com/Neplex/ArchiTXT.git
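To confirm that the installation succeeded, a short check with the standard library can help. The sketch below is not part of ArchiTXT; the helper names `installed_version` and `python_is_supported` are hypothetical and rely only on `importlib.metadata` and `sys`.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def python_is_supported(minimum: Tuple[int, int] = (3, 10)) -> bool:
    """Check the running interpreter against the minimum version stated above."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python 3.10+:", python_is_supported())
    print("architxt version:", installed_version("architxt"))
```

If the second line prints `None`, the package is not visible to the interpreter that ran the script, which usually means pip installed it into a different environment.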
Usage
ArchiTXT is built to work with BRAT-annotated corpora that include pre-labeled named entities. It can parse the texts using either CoreNLP or spaCy, depending on your preference and setup. See the documentation for more information.
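For readers unfamiliar with BRAT, entities are stored in a standoff `.ann` file next to the text, one tab-separated annotation per line. The sketch below reads only the entity lines (IDs starting with `T`) and is an illustration of the format, not ArchiTXT's own loader; the `Entity` class, `parse_brat_entities` function, and sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: str
    label: str
    spans: tuple[tuple[int, int], ...]  # character offsets into the .txt file
    text: str


def parse_brat_entities(ann: str) -> list[Entity]:
    """Extract text-bound annotations (T lines) from BRAT standoff content."""
    entities = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue  # skip relations (R), events (E), attributes (A), notes (#)
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous entities list several "start end" fragments separated by ';'
        spans = tuple(
            (int(start), int(end))
            for start, end in (fragment.split(" ") for fragment in offsets.split(";"))
        )
        entities.append(Entity(ann_id, label, spans, text))
    return entities


# Hypothetical sample: a text file and its matching annotation file.
TEXT = "Marie Curie was born in Warsaw."
ANN = "T1\tPerson 0 11\tMarie Curie\nT2\tCity 24 30\tWarsaw\nR1\tBornIn Arg1:T1 Arg2:T2\n"

for entity in parse_brat_entities(ANN):
    start, end = entity.spans[0]
    print(entity.label, repr(TEXT[start:end]))
```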
The CoreNLP parser requires access to a running CoreNLP server, which you can set up with the Docker Compose configuration available in the source repository. To deploy it, run:
docker compose up -d corenlp
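Before parsing, it can be useful to confirm that the server accepts connections. The sketch below assumes the server is exposed on `localhost:9000`, which is the CoreNLP server's default port; the port actually published by the repository's Compose file may differ, so check it there. The function name `server_is_reachable` is hypothetical.

```python
import socket


def server_is_reachable(host: str = "localhost", port: int = 9000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or name resolution failure
        return False


if __name__ == "__main__":
    # Assumed address; adjust to match your Compose configuration.
    print("CoreNLP reachable:", server_is_reachable())
```

This only tests that the port is open. It does not verify that the service behind it is CoreNLP or that its models have finished loading.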
After parsing the annotated texts into ArchiTXT's internal representation, you can infer a database schema from the annotated entities and generate the structured instances that conform to it. See the documentation for more information.
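To give an intuition for what inferring a schema together with its instances means, here is a deliberately simplified toy. It is not ArchiTXT's algorithm, which rewrites syntax trees under a meta-grammar; it only groups sentences by the set of entity labels they contain and keeps the sets that recur. All names and data here are hypothetical.

```python
from collections import defaultdict

# One sentence = the (entity label, value) pairs annotated in it.
Sentence = list[tuple[str, str]]


def infer_groups(sentences: list[Sentence], min_support: int = 2) -> dict[tuple[str, ...], list[dict[str, str]]]:
    """Map each recurring set of entity labels (a candidate table) to its rows."""
    rows_by_signature = defaultdict(list)
    for sentence in sentences:
        row = dict(sentence)
        signature = tuple(sorted(row))  # column names, order-independent
        rows_by_signature[signature].append(row)
    # Keep only label sets seen often enough to count as a pattern
    return {
        signature: rows
        for signature, rows in rows_by_signature.items()
        if len(rows) >= min_support
    }


sentences = [
    [("Person", "Marie Curie"), ("City", "Warsaw")],
    [("City", "Ulm"), ("Person", "Albert Einstein")],
    [("Person", "Ada Lovelace")],
]

for columns, rows in infer_groups(sentences).items():
    print(columns, rows)
```

In this toy, the label set `("City", "Person")` plays the role of the schema and the two matching rows are its instances; the single-entity sentence is dropped because its pattern occurs only once.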
The result can be exported as a relational or property graph database. See the documentation for more information.
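As a rough picture of what a relational export amounts to, the sketch below writes one table and its rows to SQLite with the standard `sqlite3` module. It does not use ArchiTXT's export functionality, which is described in the documentation; the table name, columns, rows, and the `export_table` helper are all hypothetical.

```python
import sqlite3


def export_table(conn: sqlite3.Connection, table: str, columns: list[str], rows: list[dict[str, str]]) -> None:
    """Create `table` with TEXT columns and insert `rows` into it.

    Identifiers are interpolated into SQL, so they must come from a trusted
    schema, never from user input. Values are passed as bound parameters.
    """
    column_defs = ", ".join(f'"{name}" TEXT' for name in columns)
    column_list = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE "{table}" ({column_defs})')
    conn.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})',
        [tuple(row[name] for name in columns) for row in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
export_table(
    conn,
    table="PersonCity",
    columns=["Person", "City"],
    rows=[
        {"Person": "Marie Curie", "City": "Warsaw"},
        {"Person": "Albert Einstein", "City": "Ulm"},
    ],
)
print(conn.execute('SELECT "Person", "City" FROM "PersonCity"').fetchall())
```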
ArchiTXT is available as a Python library but also provides a command-line interface (CLI) for users who prefer working in the terminal. You can run the CLI using:
architxt --help
Sponsors
This work has received support under the JUNON Program, with financial support from Région Centre-Val de Loire (France).