TextGrid Import Modeller

What's the aim?

This project aims to simplify the import of text corpora (encoded in XML/TEI) into the TextGrid Repository by modeling the required metadata file structure.

NOTE: This is work in progress! Feedback on anything that does not work or needs to be modified is welcome.

Installation

The source code is maintained here: https://gitlab.gwdg.de/textplus/textplus-io/textgrid_import_modelling

Clone the project:

git clone https://gitlab.gwdg.de/textplus/textplus-io/textgrid_import_modelling.git {{ your/project/path/name }}

Installing the project in a local Python virtual environment is recommended. The necessary steps are outlined below:

Version 1 (recommended)

Create the virtual environment inside the project directory. The environment directory is named venv, and its prompt is set to the name of the current directory:

cd {{ your/project/path/name }}
python3 -m venv venv/ --prompt "$(pwd | grep -o "[^/]*$")"
. venv/bin/activate
pip install -e .

Version 2

Create the virtual environment at a path of your choice:

# create new virtual environment
python3 -m venv {path/to/your/virtEnv}
# activate virtual environment
. {path/to/your/virtEnv}/bin/activate
# install this project
pip install -e {{ your/project/path/name }}

What can be done? (so far)

Build metadata structure needed for TextGrid import

1. init a major config, defining the project and subprojects

You have different options to set the path(s) to your input data.

"Manual" option

Simply set the path to the directory containing your TEI files. You can also pass a list of paths, separated by commas.

single directory

tg_configs -n {projectname} project -i {path/to/tei/directory/containing/files}

multiple directories

tg_configs -n {projectname} project -i {1st/path/to/tei/files},{2nd/path/to/tei/files},{3rd/path/to/tei/files}

"Automatic" option

If you have many sub-directories or sub-projects, tgmodel can find the directories containing TEI files automatically. Set the base path containing all sub-projects (-s) and the name of the directory that contains the TEI files (-t). This directory name has to be identical in all sub-projects!

tg_configs -n {projectname} project -s {path/to/base/directory} -t {name/of/directory/containing/tei/files}
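The automatic lookup described above can be pictured with a short Python sketch. It is not tgmodel's actual implementation: the function name find_tei_dirs and the rule that a directory only counts if it directly contains at least one .xml file are assumptions made for illustration.

```python
from pathlib import Path


def find_tei_dirs(base_dir: str, tei_dir_name: str) -> list[Path]:
    """Return all directories below base_dir that are named tei_dir_name
    and directly contain at least one .xml file (assumed criterion)."""
    base = Path(base_dir)
    if not base.is_dir():
        return []
    hits = []
    # rglob searches base_dir recursively for entries with the given name.
    for candidate in base.rglob(tei_dir_name):
        if candidate.is_dir() and any(candidate.glob("*.xml")):
            hits.append(candidate)
    return sorted(hits)


# Mirrors a call like: tg_configs -n {projectname} project -s eltec-fra -t level1
for tei_dir in find_tei_dirs("eltec-fra", "level1"):
    print(tei_dir)
```

Because the search is by directory name only, a sub-project whose TEI directory is named differently would be skipped silently in this sketch, which is why the name has to be identical everywhere.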

2. init a collection config

tg_configs -n {projectname} collection

This creates the final config, which is needed to build the TextGrid metadata structure.

What the code does:

  • it tries to find the proposed xpaths in all given XML/TEI files
  • if a proposed xpath finds a node more often than a defined hit_rate (defined in project.yaml), this xpath is added to the "collection config"
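The two steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: hit_rate is read here as a percentage of files in which an xpath must match, the function name select_xpaths and the candidate xpaths are hypothetical, and the standard library's ElementTree (which supports only a subset of XPath) stands in for whatever XML library tgmodel actually uses.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def select_xpaths(documents, proposed_xpaths, hit_rate):
    """Keep every xpath that matches in more than hit_rate percent of documents."""
    selected = []
    for xpath in proposed_xpaths:
        # Count the documents in which the xpath finds at least one node.
        hits = sum(1 for root in documents if root.find(xpath, TEI_NS) is not None)
        if documents and 100 * hits / len(documents) > hit_rate:
            selected.append(xpath)
    return selected


# Two toy TEI documents: both have a title, only the first has an author.
docs = [ET.fromstring(xml) for xml in (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc><titleStmt>'
    '<title>A</title><author>X</author></titleStmt></fileDesc></teiHeader></TEI>',
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc><titleStmt>'
    '<title>B</title></titleStmt></fileDesc></teiHeader></TEI>',
)]
candidates = [
    ".//tei:titleStmt/tei:title",
    ".//tei:titleStmt/tei:author",
    ".//tei:publisher",
]
print(select_xpaths(docs, candidates, hit_rate=50))
```

With these toy documents and a hit_rate of 50, only the title xpath is kept: the author xpath matches in exactly half of the files, which is not more than the threshold.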

Mandatory

All attributes for "rights_holder" and "title" have to be filled in, as these attributes are validated (for existence only) before the code models the structure.
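An existence check of this kind could look roughly like the sketch below. The layout of the collection config is not described here, so the nested dictionary, its attribute names (short, long, fullname, url) and the function name missing_mandatory are all hypothetical.

```python
def missing_mandatory(config, sections=("rights_holder", "title")):
    """Return the names of mandatory sections or attributes that are empty."""
    missing = []
    for section in sections:
        attributes = config.get(section)
        # A section that is absent or has no attributes at all is reported as a whole.
        if not isinstance(attributes, dict) or not attributes:
            missing.append(section)
            continue
        for name, value in attributes.items():
            # Only existence is checked, not whether the value is plausible.
            if value is None or str(value).strip() == "":
                missing.append(f"{section}.{name}")
    return missing


# Hypothetical config content, e.g. as loaded from collection.yaml.
collection = {
    "title": {"short": "Example", "long": "Example corpus"},
    "rights_holder": {"fullname": "", "url": None},
}
print(missing_mandatory(collection))
```

An empty result would mean that the structure can be built; here both rights_holder attributes are reported as missing.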

3. build the collection

Finally, you can build the TextGrid metadata structure:

tgm_cli -n {projectname} build-collection

By default, this puts all files in ./output; the output location can be set manually (see tgmodel build-collection --help).

[Figure: overview of the whole workflow]

Exemplary executions

mkdir /tmp/FluffyModelling
cd /tmp/FluffyModelling

CoNSSA

# get corpus
git clone https://github.com/cligs/conssa.git conssa

# initialize all configs
tg_configs -n CoNSSA all -s conssa -t master

Now you can find the project config at /tmp/FluffyModelling/projects/CoNSSA and the related subproject, containing the collection config, at /tmp/FluffyModelling/projects/conssa_master_master.

For CoNSSA, there is no need to edit the configs manually, so you can go on and create the metadata files:

tgm_cli -n CoNSSA build-collection

Afterwards, you can find them at: /tmp/FluffyModelling/projects/CoNSSA/conssa_master_master/result

ELTeC-fra

# get corpus
git clone https://github.com/COST-ELTeC/ELTeC-fra eltec-fra

# initialize all configs
tg_configs -n ELTeC-fra all -s eltec-fra -t level1

Now you can find the project config at: /tmp/FluffyModelling/projects/ELTeC-fra

ELTeC needs modifications to the collection config:

nano /tmp/FluffyModelling/projects/ELTeC-fra/FluffyModelling_eltec-fra_level1/collection.yaml
# --> set all attributes of 'rights_holder'

# create the metadata files
tgm_cli -n ELTeC-fra build-collection

Afterwards, you can find them at: /tmp/FluffyModelling/projects/ELTeC-fra/tgm_output

Multi-project examples

textbox

git clone https://github.com/cligs/textbox

tg_configs -n textbox all -s textbox -t tei

tgm_cli -n textbox build-collection

ELTeC

tg_configs -n ELTeC all -s ELTeC -t level1

tgm_cli -n ELTeC build-collection
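When several corpora are processed the same way, the two commands can be generated from a small Python script. The sketch below only uses the command lines shown in the examples above; the helper names workflow_commands and run_workflow are hypothetical and not part of tgmodel.

```python
import shutil
import subprocess


def workflow_commands(project, base_dir, tei_dir_name):
    """Return the two workflow commands as argument lists."""
    return [
        ["tg_configs", "-n", project, "all", "-s", base_dir, "-t", tei_dir_name],
        ["tgm_cli", "-n", project, "build-collection"],
    ]


def run_workflow(project, base_dir, tei_dir_name):
    """Run both steps in order and stop at the first failing command."""
    for command in workflow_commands(project, base_dir, tei_dir_name):
        if shutil.which(command[0]) is None:
            raise RuntimeError(f"{command[0]} not found; is the virtual environment active?")
        subprocess.run(command, check=True)


# (project name, base directory, TEI directory name), taken from the examples above.
CORPORA = [("textbox", "textbox", "tei"), ("ELTeC", "ELTeC", "level1")]

# Print the commands instead of executing them; call run_workflow() to execute.
for corpus in CORPORA:
    for command in workflow_commands(*corpus):
        print(" ".join(command))
```

Running both steps back to back only makes sense when the generated collection config needs no manual editing. If mandatory attributes such as those of rights_holder have to be filled in first, run the two commands separately.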

For developers

This project uses a very simple click-based setup (see "python click").

All command-line entry points (e.g. tgm_cli, tg_configs, ...) are defined within the entry_points section of setup.py, while the specific implementations are located in tgmodel/cli.py.

Contribution

Please use separate branches for your changes. This will make it easier for us to review and merge your contributions.

Once you have made your changes, add an entry to the Changelog at the end of the '# Latest features and bugfixes' section. This will help us keep track of all the changes made to the project.

Finally, create a merge request to submit your changes. This will allow us to review your changes and merge them into the main branch once they have been approved. Thank you for your contributions!

License

Copyright [2024] [TU Dresden | CIDS | ZIH]

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
