GlossAPI

A library for processing academic texts in Greek and other languages, developed by ΕΕΛΛΑΚ.

Features

  • PDF Processing: Extract text content from academic PDFs with structure preservation
  • Quality Control: Filter and cluster documents based on extraction quality
  • Section Extraction: Identify and extract academic sections from documents
  • Section Classification: Classify sections using machine learning models
  • Greek Language Support: Specialized processing for Greek academic texts
  • Metadata Handling: Process academic texts with accompanying metadata
  • Customizable Annotation: Map section titles to standardized categories

Installation

pip install glossapi
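
To confirm that the package landed in the active environment, a quick check along the following lines may help. This is only a sketch built on the standard library's importlib.metadata; the helper name installed_version is hypothetical and is not part of GlossAPI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str = "glossapi"):
    """Return the installed version string of `package`, or None if it is missing.

    Hypothetical helper for illustration; not provided by GlossAPI itself.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found:
        print(f"glossapi version: {found}")
    else:
        print("glossapi is not installed in this environment")
```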

Usage

The recommended way to use GlossAPI is through the Corpus class, which provides a complete pipeline for processing academic documents:

from glossapi import Corpus
import logging

# Configure logging (optional)
logging.basicConfig(level=logging.INFO)

# Initialize Corpus with input and output directories
corpus = Corpus(
    input_dir="/path/to/documents",
    output_dir="/path/to/output",
    metadata_path="/path/to/metadata.parquet",  # Optional
    annotation_mapping={
        # Key: a label from the document_type column.
        # Value: the text type to annotate it as (currently 'chapter' or 'text').
        'Κεφάλαιο': 'chapter',
        # Add more mappings as needed
    }
)

# Step 1: Extract text from the documents (with quality control)
corpus.extract()

# Step 2: Extract sections from filtered documents
corpus.section()

# Step 3: Classify and annotate sections
corpus.annotate()
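
The three calls above run in a fixed order, so they can be wrapped in a small driver that logs each step and stops at the first failure. The sketch below assumes only that the object exposes the extract(), section(), and annotate() methods shown in the example; run_pipeline and PIPELINE_STEPS are hypothetical names, not part of GlossAPI.

```python
import logging

logger = logging.getLogger(__name__)

# Method names taken from the usage example, in the order it calls them.
PIPELINE_STEPS = ("extract", "section", "annotate")


def run_pipeline(corpus, steps=PIPELINE_STEPS):
    """Call each named method on `corpus` in order and return the completed step names.

    Hypothetical helper: on the first exception it logs the failing step
    and re-raises, so later steps never run on incomplete output.
    """
    completed = []
    for name in steps:
        logger.info("Running step: %s", name)
        try:
            getattr(corpus, name)()
        except Exception:
            logger.exception("Step %r failed; completed so far: %s", name, completed)
            raise
        completed.append(name)
    return completed
```

With a Corpus instance built as in the example, run_pipeline(corpus) would stand in for the three separate calls.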

License

This project is licensed under the European Union Public Licence 1.2 (EUPL 1.2).
