# paralabelgen

paralabelgen is a Python library for generating discrete paragraph labels from concept extraction, graph communities, and interpretable assignment rules.
- PyPI distribution: `paralabelgen`
- Python import package: `labelgen`
- Repository: https://github.com/HuRuilizhen/labelgen
## Install

```shell
pip install paralabelgen
```

If you want to use the default spaCy extractor, install a compatible English pipeline such as:

```shell
python -m spacy download en_core_web_sm
```
`en_core_web_sm` is the recommended default model, but you can point `spacy_model_name` at any other installed, compatible spaCy pipeline.
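As a sketch of switching pipelines, a larger installed model can be substituted for the default. Note the exact attribute path for `spacy_model_name` on the config object is an assumption here, not confirmed by this README:

```python
from labelgen import LabelGeneratorConfig

# Hypothetical sketch: point the spaCy extractor at a larger installed
# pipeline. The attribute location of spacy_model_name is assumed.
config = LabelGeneratorConfig()
config.spacy_model_name = "en_core_web_lg"
```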
## Quick Start

### Default spaCy pipeline

```python
from labelgen import LabelGenerator, LabelGeneratorConfig

paragraphs = [
    "OpenAI builds language models for developers.",
    "Developers use language models in production systems.",
]

generator = LabelGenerator(LabelGeneratorConfig())
result = generator.fit_transform(paragraphs)
```
### LLM extraction pipeline

```python
from labelgen import LabelGenerator, LabelGeneratorConfig

config = LabelGeneratorConfig(
    extractor_mode="llm",
    use_graph_community_detection=False,
)
config.extraction.llm.provider = "openai"
config.extraction.llm.model = "gpt-5-mini"

generator = LabelGenerator(config)
result = generator.fit_transform(
    [
        "OpenAI builds language models and developer APIs for production systems.",
        "Production systems need monitoring and evaluation tooling.",
    ]
)
```
## Extraction Modes

`LabelGeneratorConfig.extractor_mode` supports three modes:

- `spacy`: default public extractor using spaCy noun chunks and entities
- `heuristic`: deterministic fallback extractor using rule-based spans
- `llm`: provider-backed concept extraction using a unified OpenAI-compatible chat-completions client
If `extractor_mode` is unset, the legacy `use_nlp_extractor` compatibility flag is still respected. New code should prefer `extractor_mode`.
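The precedence described above can be sketched as a small resolution function. This is an illustration of the documented rule, not the library's actual resolution code, and the mapping of the legacy boolean onto `spacy`/`heuristic` is an assumption:

```python
def resolve_extractor_mode(extractor_mode=None, use_nlp_extractor=None):
    """Sketch of the documented precedence: an explicit extractor_mode
    wins; otherwise the legacy use_nlp_extractor flag is honored
    (assumed True -> spacy, False -> heuristic); spacy is the default."""
    if extractor_mode is not None:
        return extractor_mode
    if use_nlp_extractor is not None:
        return "spacy" if use_nlp_extractor else "heuristic"
    return "spacy"
```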
## LLM Providers And Models

The LLM extraction path is opt-in and synchronous. The current provider layer is unified around one OpenAI-compatible client and supports:

- `openai`
- `mistral`
- `qwen`
Configure the provider and model under `config.extraction.llm`:

- `provider`
- `model`
- `api_key_env_var`
- `base_url`
- `organization`
- `timeout_seconds`
- `max_retries`
- `temperature`
- `max_output_tokens`
- `batch_size`
- `max_concepts_per_paragraph`

Set the corresponding API key in the expected environment variable:

- `OPENAI_API_KEY`
- `MISTRAL_API_KEY`
- `DASHSCOPE_API_KEY`
## Structured Output And Reliability

The LLM extractor now prefers provider-enforced structured output when the configured endpoint supports OpenAI-compatible JSON-schema response formatting:

- prompt guidance is still used, but it is no longer the only output contract
- structured output is enforced first when available
- if an OpenAI-compatible endpoint rejects JSON-schema response formatting, the client falls back to a prompt-only request on the same LLM path
- the extractor does not silently fall back to `spacy` or `heuristic`
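The fallback order above can be sketched with a stand-in client. This is a minimal illustration of the documented behavior, not the library's real client class or method names:

```python
class UnsupportedResponseFormat(Exception):
    """Stand-in error for an endpoint rejecting json_schema response_format."""


def request_concepts(client, messages, schema):
    """Sketch of the documented fallback: try JSON-schema structured output
    first; if the endpoint rejects it, retry prompt-only on the same LLM
    path. `client` is a hypothetical object with a
    chat(messages, response_format) method."""
    try:
        return client.chat(
            messages,
            response_format={"type": "json_schema", "json_schema": schema},
        )
    except UnsupportedResponseFormat:
        # Retry against the same provider with prompt guidance only;
        # never fall back to the spacy or heuristic extractors here.
        return client.chat(messages, response_format=None)
```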
## Recommended LLM Settings

### Low-risk evaluation workflow

For routine evaluation runs, prefer a conservative configuration:

- `temperature = 0.0`
- `batch_size = 1` or a small batch size
- `cache_enabled = True`
- `record_extraction_artifacts = False`

This keeps runs reproducible and avoids writing extra local artifacts unless you actually need them.
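Put together, a conservative setup might look like the following. The option names come from this README, but exactly where `cache_enabled` and `record_extraction_artifacts` live on the config object is an assumption:

```python
from labelgen import LabelGeneratorConfig

# Hypothetical sketch of the conservative settings listed above; the
# attribute paths for the cache and artifact flags are assumed.
config = LabelGeneratorConfig(extractor_mode="llm")
config.extraction.llm.temperature = 0.0
config.extraction.llm.batch_size = 1
config.extraction.llm.cache_enabled = True
config.extraction.llm.record_extraction_artifacts = False
```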
### Debugging-oriented workflow

When you need to inspect provider behavior, you can enable artifacts:

- `record_extraction_artifacts = True`
- `record_raw_response_text = True` only when raw provider output is needed
- `record_paragraph_text = True` only when paragraph text is safe to store
- `record_paragraph_metadata = True` only when metadata is safe to store

Artifact recording is optional and should stay disabled by default for routine usage.
## Cache And Artifact Notes

- `cache_enabled=True` stores parsed concept lists on disk and avoids repeated provider calls for the same effective request
- cache invalidation includes both `prompt_version` and the effective prompt text
- artifacts are intended for local evaluation and debugging workflows, not as a default production feature
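The invalidation rule implies a cache key derived from every input that defines the effective request. A minimal sketch using `hashlib` (not the library's actual key format) looks like:

```python
import hashlib
import json


def cache_key(provider, model, prompt_version, effective_prompt, paragraph):
    """Sketch: hash all inputs that should invalidate the cache, including
    both prompt_version and the effective prompt text, per the notes above."""
    payload = json.dumps(
        {
            "provider": provider,
            "model": model,
            "prompt_version": prompt_version,
            "effective_prompt": effective_prompt,
            "paragraph": paragraph,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because both the version tag and the full prompt text go into the hash, editing the prompt without bumping `prompt_version` still produces a fresh key.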
## Optional Manual Smoke Test

For a small manual LLM-path verification outside the default test suite, run the example script with one provider/model pair and a valid API key in the expected environment variable:

```shell
OPENAI_API_KEY=... .venv/bin/python examples/llm_extraction.py
```

This is intended as a lightweight manual smoke test of provider connectivity and parsing, not as part of the default automated suite.
## Public API

The main public entrypoints are:

- `LabelGenerator`
- `LabelGeneratorConfig`
- `Paragraph`, `Concept`, `ConceptMention`, `Community`, `ParagraphLabels`
- `dump_result()` and `load_result()`

Detailed API notes are available in `docs/public_api.md`.
## Examples

Runnable examples are available in `examples/`:

- `examples/basic_usage.py`
- `examples/custom_config.py`
- `examples/save_and_load.py`
- `examples/llm_extraction.py`
## Configuration Notes

- `fit()` learns concepts and communities from a corpus
- `transform()` applies previously learned communities to new paragraphs
- `fit_transform()` learns and labels the same input in one pass
- `use_graph_community_detection=True` uses Leiden community detection
- `use_graph_community_detection=False` uses deterministic connected components
- the default spaCy path requires the configured spaCy model to be installed
- the LLM path requires valid provider configuration and credentials
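The deterministic connected-components path can be illustrated with a small union-find sketch over an undirected concept graph. This shows the general technique with a deterministic tie-break, not the library's implementation:

```python
def connected_components(nodes, edges):
    """Union-find over an undirected graph; returns a deterministic
    mapping from node to component id (smallest member as representative)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Deterministic union: the smaller representative wins,
            # so output does not depend on edge order.
            lo, hi = sorted((ra, rb))
            parent[hi] = lo

    return {n: find(n) for n in nodes}
```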