A library for processing academic texts in Greek and other languages
GlossAPI
A library for processing academic texts in Greek and other languages, developed by ΕΕΛΛΑΚ.
Features
- Document Processing: Extract text content from academic PDFs, DOCX, XML, HTML, and other formats with structure preservation
- Robust Batch Processing: Process documents in batches with error isolation and automatic resumption
- Quality Control: Filter and cluster documents based on extraction quality
- Section Extraction: Identify and extract academic sections from documents
- Section Classification: Classify sections using machine learning models
- Greek Language Support: Specialized processing for Greek academic texts
- Metadata Handling: Process academic texts with accompanying metadata
- Customizable Annotation: Map section titles to standardized categories
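The batch processing feature above combines two ideas: a failure in one document should not stop the rest, and an interrupted run should pick up where it left off. The following is a minimal sketch of that pattern in plain Python, not GlossAPI's actual implementation. The function name `process_in_batches`, the JSON state file, and its `done`/`failed` keys are all hypothetical and chosen only for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def process_in_batches(
    paths: Iterable[Path],
    handler: Callable[[Path], None],
    state_file: Path,
    batch_size: int = 2,
) -> dict:
    """Run `handler` on each path, skipping paths already recorded as done."""
    # Load the record of earlier runs, if any (hypothetical JSON state format).
    state = {"done": [], "failed": []}
    if state_file.exists():
        state = json.loads(state_file.read_text(encoding="utf-8"))
    done = set(state["done"])

    # Resumption: only paths that have not succeeded before are processed.
    pending = [p for p in paths if str(p) not in done]
    for start in range(0, len(pending), batch_size):
        for path in pending[start:start + batch_size]:
            key = str(path)
            try:
                handler(path)
            except Exception:
                # Error isolation: one bad file does not abort the batch.
                if key not in state["failed"]:
                    state["failed"].append(key)
            else:
                state["done"].append(key)
                if key in state["failed"]:
                    state["failed"].remove(key)
        # Persist progress after every batch so a later run can resume.
        state_file.write_text(json.dumps(state), encoding="utf-8")
    return state
```

Calling the function a second time with the same `state_file` retries only the files that failed or were never reached, which is the behavior the feature list describes at a high level.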
Installation
pip install glossapi==0.0.9
Usage
The recommended way to use GlossAPI is through the Corpus class, which provides a complete pipeline for processing academic documents:
from glossapi import Corpus
import logging
# Configure logging (optional)
logging.basicConfig(level=logging.INFO)
# Initialize Corpus with input and output directories
corpus = Corpus(
    input_dir="/path/to/documents",
    output_dir="/path/to/output",
    # metadata_path="/path/to/metadata.parquet",  # Optional
    # annotation_mapping={
    #     'Κεφάλαιο': 'chapter',
    #     # Add more mappings as needed
    # },
)
# Step 1: Extract documents (quality control)
corpus.extract()
# Step 2: Extract sections from filtered documents
corpus.section()
# Step 3: Classify and annotate sections
corpus.annotate()
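The optional `annotation_mapping` argument shown above maps section titles to standardized categories. How GlossAPI matches titles internally is not documented here, so the following is only a sketch of one plausible approach: compare titles case-insensitively and without diacritics, then return the category of the first key found in the title. The helpers `normalize` and `categorize`, the `"other"` fallback, and the second mapping entry are assumptions made for illustration.

```python
import unicodedata


def normalize(title: str) -> str:
    """Casefold a title and strip diacritics so accented variants compare equal."""
    decomposed = unicodedata.normalize("NFD", title)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def categorize(title: str, mapping: dict[str, str], default: str = "other") -> str:
    """Return the category of the first mapping key contained in the title."""
    normalized_title = normalize(title)
    for key, category in mapping.items():
        if normalize(key) in normalized_title:
            return category
    return default


annotation_mapping = {
    "Κεφάλαιο": "chapter",           # taken from the usage example
    "Βιβλιογραφία": "bibliography",  # hypothetical extra entry
}

print(categorize("ΚΕΦΑΛΑΙΟ 1", annotation_mapping))
print(categorize("Εισαγωγή", annotation_mapping))
```

Stripping diacritics matters for Greek headings because uppercase titles are usually written without accents, so `ΚΕΦΑΛΑΙΟ` and `Κεφάλαιο` would otherwise fail to match.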
License
This project is licensed under the European Union Public Licence 1.2 (EUPL 1.2).
Download files
File details
Details for the file glossapi-0.0.13.tar.gz.
File metadata
- Download URL: glossapi-0.0.13.tar.gz
- Upload date:
- Size: 280.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3f0144b73fab34bde8ea681b6f570992bd77096f6182748f947f14b970c393b8 |
| MD5 | 1dbda381dca4dba2acab2b78f768ae11 |
| BLAKE2b-256 | 73e73656fc3852f84400041f1b13443b2f2cab3e31209e3e50807c2ba96c7093 |
File details
Details for the file glossapi-0.0.13-py3-none-any.whl.
File metadata
- Download URL: glossapi-0.0.13-py3-none-any.whl
- Upload date:
- Size: 281.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6c191aa85f36c7e6a633c31c2c9302c540ce41d8a5dc8089040e31ea08b1f26f |
| MD5 | 8851830848488a1c33c89f0faa9793cc |
| BLAKE2b-256 | 7029cfcac52d826ed21a41fbfe75d2cb66d052476fe619b0bff4d958741992d1 |
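A downloaded file can be checked against the SHA256 digests listed above before installing it. The sketch below uses only the standard library's `hashlib`. The helper names `sha256_of` and `matches_published_hash` are hypothetical, and the local file path passed to them is whatever location the file was saved to.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash(Path("glossapi-0.0.13.tar.gz"), "3f0144b73fab34bde8ea681b6f570992bd77096f6182748f947f14b970c393b8")` should return `True` for an intact copy of the source distribution.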