paralabelgen

paralabelgen is a Python library for generating discrete paragraph labels from concept extraction, graph communities, and interpretable assignment rules.

  • PyPI distribution: paralabelgen
  • Python import package: labelgen
  • Repository: https://github.com/HuRuilizhen/labelgen

Install

pip install paralabelgen

If you want to use the default spaCy extractor, install a compatible English pipeline such as:

python -m spacy download en_core_web_sm

en_core_web_sm is the recommended default model, but you can point spacy_model_name at any other compatible spaCy pipeline you have installed.
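A hypothetical configuration using a larger pipeline might look like this (this sketch assumes spacy_model_name is a top-level config field; adjust the attribute path if your installed version nests it differently):

```python
from labelgen import LabelGenerator, LabelGeneratorConfig

# Point the spaCy extractor at a different installed pipeline.
config = LabelGeneratorConfig(spacy_model_name="en_core_web_md")
generator = LabelGenerator(config)
```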

Quick Start

Default spaCy pipeline

from labelgen import LabelGenerator, LabelGeneratorConfig

paragraphs = [
    "OpenAI builds language models for developers.",
    "Developers use language models in production systems.",
]

generator = LabelGenerator(LabelGeneratorConfig())
result = generator.fit_transform(paragraphs)

LLM extraction pipeline

from labelgen import LabelGenerator, LabelGeneratorConfig

config = LabelGeneratorConfig(
    extractor_mode="llm",
    use_graph_community_detection=False,
)
config.extraction.llm.provider = "openai"
config.extraction.llm.model = "gpt-5-mini"

generator = LabelGenerator(config)
result = generator.fit_transform(
    [
        "OpenAI builds language models and developer APIs for production systems.",
        "Production systems need monitoring and evaluation tooling.",
    ]
)

Extraction Modes

LabelGeneratorConfig.extractor_mode supports three modes:

  • spacy: default public extractor using spaCy noun chunks and entities
  • heuristic: deterministic fallback extractor using rule-based spans
  • llm: provider-backed concept extraction using a unified OpenAI-compatible chat-completions client

If extractor_mode is unset, the legacy use_nlp_extractor compatibility flag is still respected. New code should prefer extractor_mode.
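For example, new code that wants the deterministic extractor can select it explicitly rather than relying on the legacy flag (a minimal sketch using the constructor keyword shown in the Quick Start above):

```python
from labelgen import LabelGenerator, LabelGeneratorConfig

# Select the rule-based heuristic extractor explicitly instead of
# relying on the legacy use_nlp_extractor compatibility flag.
config = LabelGeneratorConfig(extractor_mode="heuristic")
generator = LabelGenerator(config)
```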

LLM Providers

The LLM extraction path is opt-in and synchronous. The current provider layer is unified around one OpenAI-compatible client and supports:

  • openai
  • mistral
  • qwen
  • ollama
  • deepseek

Configure the provider and model under config.extraction.llm:

  • provider
  • model
  • api_key_env_var
  • base_url
  • organization
  • timeout_seconds
  • max_retries
  • temperature
  • max_output_tokens
  • batch_size
  • max_concepts_per_paragraph

Set the corresponding API key in the expected environment variable:

  • OPENAI_API_KEY for openai
  • MISTRAL_API_KEY for mistral
  • DASHSCOPE_API_KEY for qwen
  • DEEPSEEK_API_KEY for deepseek
  • OLLAMA_API_KEY for authenticated or proxied Ollama deployments
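The provider-to-variable mapping above can be captured in a small helper for pre-flight checks; the mapping itself is taken directly from the list above, and the helper function is illustrative, not part of the library:

```python
import os

# Provider name -> environment variable expected to hold the API key.
API_KEY_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "qwen": "DASHSCOPE_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "ollama": "OLLAMA_API_KEY",
}

def check_api_key(provider: str) -> bool:
    """Return True when the expected key is set (or not required)."""
    if provider == "ollama":
        return True  # local Ollama runs work without a key by default
    return bool(os.environ.get(API_KEY_ENV_VARS[provider]))
```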

For local Ollama usage, the default base URL is:

  • http://localhost:11434/v1

Local Ollama runs do not require an API key by default. When provider="ollama", the client also disables reasoning by default to preserve output budget for the final JSON payload.
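A local Ollama configuration might therefore look like the following sketch (the model name is hypothetical; base_url matches the default above and can usually be omitted):

```python
from labelgen import LabelGeneratorConfig

# Local Ollama: default base URL, no API key required.
config = LabelGeneratorConfig(extractor_mode="llm")
config.extraction.llm.provider = "ollama"
config.extraction.llm.model = "llama3.1"  # hypothetical local model name
config.extraction.llm.base_url = "http://localhost:11434/v1"
```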

Output Contract Modes

config.extraction.llm.output_contract_mode controls how aggressively the provider client tries to enforce a structured response:

  • auto: try stronger output contracts before falling back
  • json_schema: require JSON-schema structured output
  • json_object: require JSON-object mode
  • prompt_only: rely only on prompt instructions

auto is the recommended default. For OpenAI-compatible providers, the client tries:

  • json_schema
  • then json_object
  • then prompt_only

and only falls back when the provider clearly rejects the stronger contract.

DeepSeek follows a narrower auto sequence based on the official API documentation:

  • json_object
  • then prompt_only
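If you want to skip the fallback sequence entirely, you can pin a single contract mode; a sketch, reusing the provider and model fields from the Quick Start:

```python
from labelgen import LabelGeneratorConfig

config = LabelGeneratorConfig(extractor_mode="llm")
config.extraction.llm.provider = "openai"
config.extraction.llm.model = "gpt-5-mini"
# Require JSON-schema structured output; no degradation to weaker modes.
config.extraction.llm.output_contract_mode = "json_schema"
```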

Structured Output And Reliability

The LLM extractor now prefers provider-enforced structured output when the configured endpoint supports OpenAI-compatible JSON schema response formatting.

  • prompt guidance is still used, but it is no longer the only output contract
  • structured output is enforced first when available
  • if an OpenAI-compatible endpoint rejects a stronger contract, the client degrades to a weaker output contract on the same LLM path
  • the extractor does not silently fall back to spacy or heuristic

Recommended LLM Settings

Low-risk evaluation workflow

For routine evaluation runs, prefer a conservative configuration:

  • temperature = 0.0
  • batch_size = 1 or another small value
  • cache_enabled = True
  • record_extraction_artifacts = False

For local Ollama models, batch_size = 1 is the safest default for benchmark and smoke-test runs.

This keeps runs reproducible and avoids writing extra local artifacts unless you actually need them.
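A conservative evaluation setup might be assembled like this (a sketch assuming the cache and artifact flags live under config.extraction.llm; adjust the attribute path to match your installed version, and the model name is hypothetical):

```python
from labelgen import LabelGeneratorConfig

config = LabelGeneratorConfig(extractor_mode="llm")
llm = config.extraction.llm
llm.provider = "ollama"
llm.model = "llama3.1"          # hypothetical local model name
llm.temperature = 0.0           # reproducible outputs
llm.batch_size = 1              # safest default for local models
llm.cache_enabled = True        # reuse parsed concepts across runs
llm.record_extraction_artifacts = False  # no extra local artifacts
```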

Debugging-oriented workflow

When you need to inspect provider behavior, you can enable artifacts:

  • record_extraction_artifacts = True
  • record_raw_response_text = True only when raw provider output is needed
  • record_paragraph_text = True only when paragraph text is safe to store
  • record_paragraph_metadata = True only when metadata is safe to store

Artifact recording is optional and should stay disabled by default for routine usage.
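A debugging-oriented variant of the same configuration could enable the artifact flags selectively (again assuming they live under config.extraction.llm):

```python
from labelgen import LabelGeneratorConfig

config = LabelGeneratorConfig(extractor_mode="llm")
llm = config.extraction.llm
llm.record_extraction_artifacts = True
llm.record_raw_response_text = True   # only when raw output is needed
llm.record_paragraph_text = False     # keep paragraph text out of artifacts
llm.record_paragraph_metadata = False
```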

Cache And Artifact Notes

  • cache_enabled=True stores parsed concept lists on disk and avoids repeated provider calls for the same effective request
  • cache invalidation includes both prompt_version and the effective prompt text
  • artifacts are intended for local evaluation and debugging workflows, not as a default production feature

Benchmarking

The repository includes a local benchmark harness for extractor comparisons:

  • benchmark/run_benchmark.py
  • benchmark/summarize_results.py

Benchmark inputs are local development assets and should live under experiment/. The benchmark loader accepts:

  • .jsonl
  • .json

Each record must provide:

  • text

and may optionally provide:

  • id

Benchmark code is for development evaluation only and is excluded from release artifacts.
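A minimal .jsonl input can be produced with the standard library; the required text field and optional id field follow the record contract above:

```python
import json

# Two benchmark records: "text" is required, "id" is optional.
records = [
    {"id": "doc-1", "text": "OpenAI builds language models for developers."},
    {"text": "Production systems need monitoring and evaluation tooling."},
]

# Serialize one JSON object per line (.jsonl).
jsonl = "\n".join(json.dumps(r) for r in records)

# Round-trip check: every parsed record carries the required "text" field.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert all("text" in r for r in parsed)
```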

The current TechQA benchmark comparisons include:

  • heuristic
  • spacy
  • llm:ollama
  • llm:mistral
  • llm:deepseek

Optional Manual Smoke Test

For a small manual LLM-path verification outside the default test suite, run the example script with one provider/model pair and a valid API key in the expected environment variable:

OPENAI_API_KEY=... .venv/bin/python examples/llm_extraction.py

This is intended as a lightweight manual smoke test for provider connectivity and parsing, not as part of the default automated suite.

Public API

The main public entrypoints are:

  • LabelGenerator
  • LabelGeneratorConfig
  • Paragraph, Concept, ConceptMention, Community, ParagraphLabels
  • dump_result() and load_result()

Detailed API notes are available in docs/public_api.md.

Examples

Runnable examples are available in the examples/ directory.

Configuration Notes

  • fit() learns concepts and communities from a corpus
  • transform() applies previously learned communities to new paragraphs
  • fit_transform() learns and labels the same input in one pass
  • use_graph_community_detection=True uses Leiden community detection
  • use_graph_community_detection=False uses deterministic connected components
  • the default spaCy path requires the configured spaCy model to be installed
  • the LLM path requires valid provider configuration and credentials
  • local Ollama usage does not require credentials unless your deployment is explicitly authenticated
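The fit/transform split described above can be sketched as follows (assuming the default spaCy model is installed):

```python
from labelgen import LabelGenerator, LabelGeneratorConfig

corpus = [
    "OpenAI builds language models for developers.",
    "Developers use language models in production systems.",
]
new_paragraphs = ["Monitoring tooling supports production systems."]

generator = LabelGenerator(LabelGeneratorConfig())
generator.fit(corpus)                          # learn concepts and communities
labeled = generator.transform(new_paragraphs)  # apply learned communities
```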
