# paralabelgen

paralabelgen is a Python library for generating discrete paragraph labels from concept extraction, graph communities, and interpretable assignment rules.

- PyPI distribution: `paralabelgen`
- Python import package: `labelgen`
- Repository: https://github.com/HuRuilizhen/labelgen
## Install

```shell
pip install paralabelgen
```

If you want to use the default spaCy extractor, install a compatible English pipeline such as:

```shell
python -m spacy download en_core_web_sm
```

`en_core_web_sm` is the recommended default model, but you can point `spacy_model_name` at any other installed, compatible spaCy pipeline.
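Because `python -m spacy download` installs each pipeline as a regular importable package, you can check availability before configuring `spacy_model_name`. This is an illustrative helper, not part of the labelgen API:

```python
import importlib.util


def spacy_model_available(name: str = "en_core_web_sm") -> bool:
    # spaCy pipelines downloaded via `python -m spacy download` are installed
    # as importable packages, so find_spec() reports whether one is present.
    return importlib.util.find_spec(name) is not None
```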
## Quick Start

### Default spaCy pipeline

```python
from labelgen import LabelGenerator, LabelGeneratorConfig

paragraphs = [
    "OpenAI builds language models for developers.",
    "Developers use language models in production systems.",
]

generator = LabelGenerator(LabelGeneratorConfig())
result = generator.fit_transform(paragraphs)
```
### LLM extraction pipeline

```python
from labelgen import LabelGenerator, LabelGeneratorConfig

config = LabelGeneratorConfig(
    extractor_mode="llm",
    use_graph_community_detection=False,
)
config.extraction.llm.provider = "openai"
config.extraction.llm.model = "gpt-5-mini"

generator = LabelGenerator(config)
result = generator.fit_transform(
    [
        "OpenAI builds language models and developer APIs for production systems.",
        "Production systems need monitoring and evaluation tooling.",
    ]
)
```
## Extraction Modes

`LabelGeneratorConfig.extractor_mode` supports three modes:

- `spacy`: default public extractor using spaCy noun chunks and entities
- `heuristic`: deterministic fallback extractor using rule-based spans
- `llm`: provider-backed concept extraction using a unified OpenAI-compatible chat-completions client

If `extractor_mode` is unset, the legacy `use_nlp_extractor` compatibility flag is still respected. New code should prefer `extractor_mode`.
## LLM Provider Model

The LLM extraction path is opt-in and synchronous. The current provider layer is unified around one OpenAI-compatible client and supports:

- `openai`
- `mistral`
- `qwen`
- `ollama`
- `deepseek`

Configure the provider and model under `config.extraction.llm`:

- `provider`
- `model`
- `api_key_env_var`
- `base_url`
- `organization`
- `timeout_seconds`
- `max_retries`
- `temperature`
- `max_output_tokens`
- `batch_size`
- `max_concepts_per_paragraph`

Set the corresponding API key in the expected environment variable:

- `OPENAI_API_KEY`
- `MISTRAL_API_KEY`
- `DASHSCOPE_API_KEY`
- `DEEPSEEK_API_KEY`
- `OLLAMA_API_KEY` (for authenticated or proxied Ollama deployments)
For local Ollama usage, the default base URL is `http://localhost:11434/v1`.

Local Ollama runs do not require an API key by default. When `provider="ollama"`, the client also disables reasoning by default to preserve output budget for the final JSON payload.
## Output Contract Modes

`config.extraction.llm.output_contract_mode` controls how aggressively the provider client tries to enforce a structured response:

- `auto`: try stronger output contracts before falling back
- `json_schema`: require JSON-schema structured output
- `json_object`: require JSON-object mode
- `prompt_only`: rely only on prompt instructions

`auto` is the recommended default. For OpenAI-compatible providers, the client tries `json_schema`, then `json_object`, then `prompt_only`, and only falls back when the provider clearly rejects the stronger contract.
DeepSeek follows a narrower `auto` sequence based on the official API documentation: `json_object`, then `prompt_only`.
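The degradation behavior described above can be sketched as a loop over per-provider contract sequences. This is illustrative only: `ContractRejected` and `request_with_auto_contract` are assumed names, and labelgen's real client distinguishes clear contract rejections from other provider errors internally:

```python
# Ordered contract sequences for "auto" mode, strongest first.
CONTRACT_SEQUENCES = {
    "openai-compatible": ["json_schema", "json_object", "prompt_only"],
    "deepseek": ["json_object", "prompt_only"],
}


class ContractRejected(Exception):
    """Raised when a provider clearly rejects an output contract."""


def request_with_auto_contract(call, family="openai-compatible"):
    """Try each contract in order, degrading only on a clear rejection."""
    last_error = None
    for contract in CONTRACT_SEQUENCES[family]:
        try:
            return call(contract)
        except ContractRejected as exc:
            last_error = exc  # provider refused this contract; try a weaker one
    raise last_error
```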
## Structured Output And Reliability

The LLM extractor now prefers provider-enforced structured output when the configured endpoint supports OpenAI-compatible JSON-schema response formatting.

- prompt guidance is still used, but it is no longer the only output contract
- structured output is enforced first when available
- if an OpenAI-compatible endpoint rejects a stronger contract, the client degrades to a weaker output contract on the same LLM path
- the extractor does not silently fall back to `spacy` or `heuristic`
## Recommended LLM Settings

### Low-risk evaluation workflow

For routine evaluation runs, prefer a conservative configuration:

- `temperature = 0.0`
- `batch_size = 1` or a small batch size
- `cache_enabled = True`
- `record_extraction_artifacts = False`

For local Ollama models, `batch_size = 1` is the safest default for benchmark and smoke-test runs.

This keeps runs reproducible and avoids writing extra local artifacts unless you actually need them.

### Debugging-oriented workflow

When you need to inspect provider behavior, you can enable artifacts:

- `record_extraction_artifacts = True`
- `record_raw_response_text = True` only when raw provider output is needed
- `record_paragraph_text = True` only when paragraph text is safe to store
- `record_paragraph_metadata = True` only when metadata is safe to store

Artifact recording is optional and should stay disabled by default for routine usage.
## Cache And Artifact Notes

- `cache_enabled=True` stores parsed concept lists on disk and avoids repeated provider calls for the same effective request
- cache invalidation includes both `prompt_version` and the effective prompt text
- artifacts are intended for local evaluation and debugging workflows, not as a default production feature
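A cache key consistent with these notes would hash both the prompt version and the effective prompt text, so changing either invalidates the entry. This is a hypothetical sketch; labelgen's actual key derivation is internal and may cover additional request fields:

```python
import hashlib
import json


def cache_key(prompt_version: str, effective_prompt: str, model: str) -> str:
    # Serialize the fields deterministically so identical effective requests
    # always map to the same key, then hash the result.
    payload = json.dumps(
        {"prompt_version": prompt_version, "prompt": effective_prompt, "model": model},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```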
## Benchmarking

The repository includes a local benchmark harness for extractor comparisons:

- `benchmark/run_benchmark.py`
- `benchmark/summarize_results.py`

Benchmark inputs are local development assets and should live under `experiment/`. The benchmark loader accepts `.jsonl` and `.json` files.

Each record must provide `text` and may optionally provide `id`.
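A minimal loader for that record shape might look like the following. This is an illustrative sketch of the `.jsonl` case only; the repository's actual benchmark loader also handles `.json` and may perform additional validation:

```python
import json


def load_jsonl_records(path):
    """Load benchmark records with a required "text" and an optional "id"."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if "text" not in record:
                raise ValueError(f"line {line_no}: missing required 'text' field")
            records.append({"id": record.get("id"), "text": record["text"]})
    return records
```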
Benchmark code is for development evaluation only and is excluded from release artifacts.

The current TechQA benchmark comparisons include:

- `heuristic`
- `spacy`
- `llm:ollama`
- `llm:mistral`
- `llm:deepseek`
## Optional Manual Smoke Test

For a small manual LLM-path verification outside the default test suite, run the example script with one provider/model pair and a valid API key in the expected environment variable:

```shell
OPENAI_API_KEY=... .venv/bin/python examples/llm_extraction.py
```

This is intended as a lightweight manual smoke test for provider connectivity and parsing, not as part of the default automated suite.
## Public API

The main public entrypoints are:

- `LabelGenerator`
- `LabelGeneratorConfig`
- `Paragraph`, `Concept`, `ConceptMention`, `Community`, `ParagraphLabels`
- `dump_result()` and `load_result()`

Detailed API notes are available in `docs/public_api.md`.
## Examples

Runnable examples are available in `examples/`:

- `examples/basic_usage.py`
- `examples/custom_config.py`
- `examples/save_and_load.py`
- `examples/llm_extraction.py`
## Configuration Notes

- `fit()` learns concepts and communities from a corpus
- `transform()` applies previously learned communities to new paragraphs
- `fit_transform()` learns and labels the same input in one pass
- `use_graph_community_detection=True` uses Leiden community detection
- `use_graph_community_detection=False` uses deterministic connected components
- the default spaCy path requires the configured spaCy model to be installed
- the LLM path requires valid provider configuration and credentials
- local Ollama usage does not require credentials unless your deployment is explicitly authenticated