A Python library for generating discrete paragraph labels from concept extraction, graph communities, and interpretable assignment rules, with heuristic, spaCy, and provider-backed LLM extraction modes.

paralabelgen

paralabelgen is a Python library for generating discrete paragraph labels from concept extraction, graph communities, and interpretable assignment rules.

  • PyPI distribution: paralabelgen
  • Python import package: labelgen
  • Repository: https://github.com/HuRuilizhen/labelgen

Install

pip install paralabelgen

If you want to use the default spaCy extractor, install a compatible English pipeline such as:

python -m spacy download en_core_web_sm

en_core_web_sm is the recommended default model, but you can point spacy_model_name at another installed compatible spaCy pipeline.
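
spaCy pipelines such as en_core_web_sm are installed as regular Python packages, so one way to check up front that the configured model is available is to look for an importable package of that name. The following is a minimal sketch using only the standard library. The helper name spacy_pipeline_installed is hypothetical and not part of labelgen, and the check says nothing about whether the installed pipeline is compatible with your spaCy version.

```python
from importlib.util import find_spec


def spacy_pipeline_installed(model_name: str) -> bool:
    """Return True if a package named `model_name` is importable.

    Hypothetical helper, not part of labelgen. It only checks that the
    package can be found, not that it is a compatible spaCy pipeline.
    """
    try:
        return find_spec(model_name) is not None
    except (ImportError, ValueError):
        # Raised for malformed names or missing parent packages.
        return False


if __name__ == "__main__":
    name = "en_core_web_sm"
    if not spacy_pipeline_installed(name):
        print(f"{name} is not installed; run: python -m spacy download {name}")
```

Running a check like this before constructing the generator gives a clearer message than a failure raised deep inside the extraction step.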

Quick Start

Default spaCy pipeline

from labelgen import LabelGenerator, LabelGeneratorConfig

paragraphs = [
    "OpenAI builds language models for developers.",
    "Developers use language models in production systems.",
]

# The default configuration uses the spaCy extractor.
generator = LabelGenerator(LabelGeneratorConfig())
result = generator.fit_transform(paragraphs)

# Concepts learned from the corpus.
for concept in result.concepts:
    print(concept.normalized, concept.kind, concept.document_frequency, sep=" | ")

# Labels assigned to each paragraph.
for assignment in result.paragraph_labels:
    print(assignment.paragraph_id, assignment.label_ids, assignment.label_scores)

LLM extraction pipeline

from labelgen import LabelGenerator, LabelGeneratorConfig

# Select the LLM extractor and use connected components instead of Leiden.
config = LabelGeneratorConfig(
    extractor_mode="llm",
    use_graph_community_detection=False,
)
config.extraction.llm.provider = "openai"  # API key is read from OPENAI_API_KEY
config.extraction.llm.model = "gpt-5-mini"

generator = LabelGenerator(config)
result = generator.fit_transform(
    [
        "OpenAI builds language models and developer APIs for production systems.",
        "Production systems need monitoring and evaluation tooling.",
    ]
)

The LLM extractor supports openai-, mistral-, and qwen-style providers. Set the API key for the chosen provider in the corresponding environment variable:

  • OPENAI_API_KEY
  • MISTRAL_API_KEY
  • DASHSCOPE_API_KEY
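
Since the LLM path depends on one of these variables being set, it can help to fail early with a clear message. The sketch below assumes the providers and variables pair up in the order listed above (qwen with DASHSCOPE_API_KEY). The helper require_api_key is hypothetical and not part of labelgen, and it ignores any override set through api_key_env_var.

```python
import os

# Provider-to-variable pairing, assumed from the order of the lists above.
API_KEY_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "qwen": "DASHSCOPE_API_KEY",
}


def require_api_key(provider: str, environ=None) -> str:
    """Return the env var name for `provider`, raising if it is unset or empty.

    Hypothetical helper, not part of labelgen.
    """
    env = os.environ if environ is None else environ
    try:
        var = API_KEY_ENV_VARS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    if not env.get(var):
        raise RuntimeError(f"{var} must be set to use the {provider!r} provider")
    return var
```

Calling require_api_key("openai") before building the LabelGenerator surfaces a missing key without making a provider request.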

Extraction Modes

LabelGeneratorConfig.extractor_mode supports three modes:

  • spacy: default public extractor using spaCy noun chunks and entities
  • heuristic: deterministic fallback extractor using rule-based spans
  • llm: provider-backed concept extraction using structured JSON output

If extractor_mode is unset, the legacy use_nlp_extractor compatibility flag is still respected. New code should prefer extractor_mode.
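
The precedence described above can be pictured as a small resolution function. This is only an illustration, not the library's code: it assumes the legacy flag is a boolean that chooses between the spaCy extractor (True) and the heuristic extractor (False), which the text above does not state. The function name resolve_extractor_mode is hypothetical.

```python
from typing import Optional

VALID_MODES = ("spacy", "heuristic", "llm")


def resolve_extractor_mode(
    extractor_mode: Optional[str],
    use_nlp_extractor: bool = True,
) -> str:
    """Pick the effective mode; an explicit extractor_mode always wins.

    Hypothetical sketch, not part of labelgen.
    """
    if extractor_mode is not None:
        if extractor_mode not in VALID_MODES:
            raise ValueError(f"unsupported extractor_mode: {extractor_mode!r}")
        return extractor_mode
    # Assumption: the legacy flag maps True -> spaCy, False -> heuristic.
    return "spacy" if use_nlp_extractor else "heuristic"
```

Under this assumption the legacy flag can never select the llm mode, which fits the description of LLM extraction as opt-in.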

LLM Configuration Notes

The LLM extraction path is opt-in and synchronous. Key settings live under config.extraction.llm:

  • provider
  • model
  • api_key_env_var
  • base_url
  • temperature
  • max_output_tokens
  • batch_size
  • max_concepts_per_paragraph
  • cache_enabled
  • cache_dir
  • record_extraction_artifacts
  • artifact_dir
  • prompt_version
  • prompt_template

Cache and artifact behavior:

  • cache_enabled=True stores parsed concept lists on disk and avoids repeated provider calls for the same effective request
  • record_extraction_artifacts=True writes structured per-batch extraction artifacts for audit and experiment analysis
  • both are optional and can be disabled independently
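
To illustrate what caching "the same effective request" can mean, here is a minimal sketch of a disk cache keyed by a hash of the request parameters. It is not the library's implementation. The choice of fields in the key, the file layout, the "v1" prompt version, and the names request_key and cached_extract are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def request_key(provider: str, model: str, prompt_version: str, paragraph: str) -> str:
    """Hash the fields assumed to define an 'effective request'."""
    payload = json.dumps(
        {
            "provider": provider,
            "model": model,
            "prompt_version": prompt_version,
            "paragraph": paragraph,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_extract(cache_dir: Path, key: str, extract):
    """Return cached concepts for `key`, calling `extract()` only on a miss."""
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    concepts = extract()
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(concepts), encoding="utf-8")
    return concepts


def demo() -> int:
    """Run the same request twice and return how often the provider was called."""
    calls = []

    def fake_provider_call():
        # Stand-in for a real provider request.
        calls.append(1)
        return ["language models", "developers"]

    with tempfile.TemporaryDirectory() as tmp:
        key = request_key(
            "openai", "gpt-5-mini", "v1",
            "OpenAI builds language models for developers.",
        )
        first = cached_extract(Path(tmp), key, fake_provider_call)
        second = cached_extract(Path(tmp), key, fake_provider_call)
        assert first == second
    return len(calls)
```

In this sketch, changing any field in the key (for example the model or the prompt version) produces a different hash and therefore a fresh provider call.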

Public API

The main public entrypoints are:

  • LabelGenerator
  • LabelGeneratorConfig
  • Paragraph, Concept, ConceptMention, Community, ParagraphLabels
  • dump_result() and load_result()

Detailed API notes are available in docs/public_api.md.

Examples

Runnable examples are available in the examples/ directory of the repository.

Configuration Notes

  • fit() learns concepts and communities from a corpus
  • transform() applies previously learned communities to new paragraphs
  • fit_transform() learns and labels the same input in one pass
  • use_graph_community_detection=True uses Leiden community detection
  • use_graph_community_detection=False uses deterministic connected components
  • the default spaCy path requires the configured spaCy model to be installed
  • the LLM path does not silently fall back to spaCy or heuristic extraction
