Skip to main content

A Python library for generating discrete paragraph labels from concept extraction, graph communities, and interpretable assignment rules, with heuristic, spaCy, and provider-backed LLM extraction modes.

Project description

paralabelgen

paralabelgen is a Python library for generating discrete paragraph labels from concept extraction, graph communities, and interpretable assignment rules.

  • PyPI distribution: paralabelgen
  • Python import package: labelgen
  • Repository: https://github.com/HuRuilizhen/labelgen

Install

pip install paralabelgen

If you want to use the default spaCy extractor, install a compatible English pipeline such as:

python -m spacy download en_core_web_sm

en_core_web_sm is the recommended default model, but you can point spacy_model_name at another installed compatible spaCy pipeline.

Quick Start

Default spaCy pipeline

from labelgen import LabelGenerator, LabelGeneratorConfig

paragraphs = [
    "OpenAI builds language models for developers.",
    "Developers use language models in production systems.",
]

generator = LabelGenerator(LabelGeneratorConfig())
result = generator.fit_transform(paragraphs)

LLM extraction pipeline

from labelgen import LabelGenerator, LabelGeneratorConfig

config = LabelGeneratorConfig(
    extractor_mode="llm",
    use_graph_community_detection=False,
)
config.extraction.llm.provider = "openai"
config.extraction.llm.model = "gpt-5-mini"

generator = LabelGenerator(config)
result = generator.fit_transform(
    [
        "OpenAI builds language models and developer APIs for production systems.",
        "Production systems need monitoring and evaluation tooling.",
    ]
)

Extraction Modes

LabelGeneratorConfig.extractor_mode supports three modes:

  • spacy: default public extractor using spaCy noun chunks and entities
  • heuristic: deterministic fallback extractor using rule-based spans
  • llm: provider-backed concept extraction using a unified OpenAI-compatible chat-completions client

If extractor_mode is unset, the legacy use_nlp_extractor compatibility flag is still respected. New code should prefer extractor_mode.

LLM Provider Model

The LLM extraction path is opt-in and synchronous. The current provider layer is unified around one OpenAI-compatible client and supports:

  • openai
  • mistral
  • qwen

Configure the provider and model under config.extraction.llm:

  • provider
  • model
  • api_key_env_var
  • base_url
  • organization
  • timeout_seconds
  • max_retries
  • temperature
  • max_output_tokens
  • batch_size
  • max_concepts_per_paragraph

Set the corresponding API key in the expected environment variable:

  • OPENAI_API_KEY
  • MISTRAL_API_KEY
  • DASHSCOPE_API_KEY

Structured Output And Reliability

The LLM extractor now prefers provider-enforced structured output when the configured endpoint supports OpenAI-compatible JSON schema response formatting.

  • prompt guidance is still used, but it is no longer the only output contract
  • structured output is enforced first when available
  • if an OpenAI-compatible endpoint rejects JSON-schema response formatting, the client falls back to the prompt-only request on the same LLM path
  • the extractor does not silently fall back to spacy or heuristic

Recommended LLM Settings

Low-risk evaluation workflow

For routine evaluation runs, prefer a conservative configuration:

  • temperature = 0.0
  • batch_size = 1 or a small batch size
  • cache_enabled = True
  • record_extraction_artifacts = False

This keeps runs reproducible and avoids writing extra local artifacts unless you actually need them.

Debugging-oriented workflow

When you need to inspect provider behavior, you can enable artifacts:

  • record_extraction_artifacts = True
  • record_raw_response_text = True only when raw provider output is needed
  • record_paragraph_text = True only when paragraph text is safe to store
  • record_paragraph_metadata = True only when metadata is safe to store

Artifact recording is optional and should stay disabled by default for routine usage.

Cache And Artifact Notes

  • cache_enabled=True stores parsed concept lists on disk and avoids repeated provider calls for the same effective request
  • cache invalidation includes both prompt_version and the effective prompt text
  • artifacts are intended for local evaluation and debugging workflows, not as a default production feature

Optional Manual Smoke Test

For a small manual LLM-path verification outside the default test suite, run the example script with one provider/model pair and a valid API key in the expected environment variable:

OPENAI_API_KEY=... .venv/bin/python examples/llm_extraction.py

This is intended as a lightweight manual smoke test for provider connectivity and parsing, not as part of the default automated suite.

Public API

The main public entrypoints are:

  • LabelGenerator
  • LabelGeneratorConfig
  • Paragraph, Concept, ConceptMention, Community, ParagraphLabels
  • dump_result() and load_result()

Detailed API notes are available in docs/public_api.md.

Examples

Runnable examples are available in examples/:

Configuration Notes

  • fit() learns concepts and communities from a corpus
  • transform() applies previously learned communities to new paragraphs
  • fit_transform() learns and labels the same input in one pass
  • use_graph_community_detection=True uses Leiden community detection
  • use_graph_community_detection=False uses deterministic connected components
  • the default spaCy path requires the configured spaCy model to be installed
  • the LLM path requires valid provider configuration and credentials

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

paralabelgen-0.2.1.tar.gz (60.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

paralabelgen-0.2.1-py3-none-any.whl (39.3 kB view details)

Uploaded Python 3

File details

Details for the file paralabelgen-0.2.1.tar.gz.

File metadata

  • Download URL: paralabelgen-0.2.1.tar.gz
  • Upload date:
  • Size: 60.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for paralabelgen-0.2.1.tar.gz
Algorithm Hash digest
SHA256 b9f90e39f88ab3380c29a69335183116e9c42424a32b8d531cbcc835fe8e9d89
MD5 d603772fc1f28e9a215c11c270eb0ca2
BLAKE2b-256 077c626faa7ad45dc6b01a93417e6e92954485a2afadda6f12b0f6885dd80937

See more details on using hashes here.

File details

Details for the file paralabelgen-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: paralabelgen-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 39.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for paralabelgen-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 efc801a4b27454437047ef30d14f3afafffd7fee99443b56d99060151d9957fb
MD5 f6bbb71719907127a22d9e41f3f32425
BLAKE2b-256 b5ab6912738f0644f99f4f3833add4428b7b810eeccbecedb1cb9ac4b2745978

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page