GlossAPI

A library for processing academic texts in Greek and other languages, developed by ΕΕΛΛΑΚ.

Features

  • PDF Processing: Extract text content from academic PDFs with structure preservation
  • Quality Control: Filter and cluster documents based on extraction quality
  • Section Extraction: Identify and extract academic sections from documents
  • Section Classification: Classify sections using machine learning models
  • Greek Language Support: Specialized processing for Greek academic texts
  • Metadata Handling: Process academic texts with accompanying metadata
  • Customizable Annotation: Map section titles to standardized categories

Installation

pip install -i https://test.pypi.org/simple/ glossapi==0.0.3.5.2
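After installing, it can be useful to confirm which version actually landed in the environment, since the command above pins a specific release. The sketch below uses only the standard library's importlib.metadata; the helper name installed_version is hypothetical and is not part of GlossAPI.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; not provided by GlossAPI.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("glossapi")
    print(f"glossapi version: {found or 'not installed'}")
```

If the printed version differs from the one requested, the package was probably installed into a different environment or resolved from another index.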

Usage

The recommended way to use GlossAPI is through the Corpus class, which provides a complete pipeline for processing academic documents:

from glossapi import Corpus
import logging

# Configure logging (optional)
logging.basicConfig(level=logging.INFO)

# Initialize Corpus with input and output directories
corpus = Corpus(
    input_dir="/path/to/documents",
    output_dir="/path/to/output",
    metadata_path="/path/to/metadata.parquet",  # Optional
    annotation_mapping={
        'Κεφάλαιο': 'chapter',
        # Add more mappings as needed
    }
)

# Step 1: Filter documents (quality control)
corpus.filter()

# Step 2: Extract sections from filtered documents
corpus.section()

# Step 3: Classify and annotate sections
corpus.annotate()
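
The annotation_mapping argument above is a plain dictionary from section titles to standardized categories. The following sketch shows one way such a dictionary could be built and checked before passing it to Corpus. Only the 'Κεφάλαιο' to 'chapter' pair comes from the example above; the other entries, the category names, and the helpers normalize_title and categorize are assumptions made for illustration. Whether GlossAPI itself ignores case or accents when matching titles is not stated here, so treat the normalization as a standalone idea rather than a description of the library's behavior.

```python
import unicodedata
from typing import Dict

# Only the first pair appears in the usage example; the rest are assumed.
ANNOTATION_MAPPING: Dict[str, str] = {
    "Κεφάλαιο": "chapter",
    "Εισαγωγή": "introduction",      # assumed category name
    "Βιβλιογραφία": "bibliography",  # assumed category name
}


def normalize_title(title: str) -> str:
    """Trim whitespace, drop accents (tonos), and case-fold a title."""
    decomposed = unicodedata.normalize("NFD", title.strip())
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


def categorize(title: str, mapping: Dict[str, str], default: str = "other") -> str:
    """Look up a title in the mapping, ignoring case and accents."""
    lookup = {normalize_title(key): value for key, value in mapping.items()}
    return lookup.get(normalize_title(title), default)


if __name__ == "__main__":
    for heading in ["ΚΕΦΑΛΑΙΟ", " Εισαγωγη ", "Παράρτημα"]:
        print(repr(heading), "->", categorize(heading, ANNOTATION_MAPPING))
```

A check like this makes it easy to spot headings that would fall through to the default category before running the full pipeline.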

License

This project is licensed under the European Union Public Licence 1.2 (EUPL 1.2).
