A Python library for generating discrete paragraph labels from concept extraction, graph communities, and interpretable assignment rules, with heuristic, spaCy, and provider-backed LLM extraction modes.
paralabelgen
paralabelgen is a Python library for generating discrete paragraph labels
from concept extraction, graph communities, and interpretable assignment rules.
- PyPI distribution: paralabelgen
- Python import package: labelgen
- Repository: https://github.com/HuRuilizhen/labelgen
Install
pip install paralabelgen
If you want to use the default spaCy extractor, install a compatible English pipeline such as:
python -m spacy download en_core_web_sm
en_core_web_sm is the recommended default model, but you can point
spacy_model_name at another installed compatible spaCy pipeline.
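spaCy pipelines installed with `spacy download` are ordinary Python packages, so one way to confirm a model is present before building a generator is to look it up by name. The sketch below uses only the standard library; `is_pipeline_installed` is a hypothetical helper and not part of labelgen.

```python
from importlib.util import find_spec


def is_pipeline_installed(model_name: str) -> bool:
    """Return True if a top-level package with this name can be found.

    Hypothetical helper (not part of labelgen). It relies on spaCy
    pipelines such as en_core_web_sm being installed as regular packages.
    """
    return find_spec(model_name) is not None


if __name__ == "__main__":
    name = "en_core_web_sm"
    if not is_pipeline_installed(name):
        print(f"Missing pipeline; run: python -m spacy download {name}")
```

The same check works for whichever pipeline name you assign to spacy_model_name.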
Quick Start
Default spaCy pipeline
from labelgen import LabelGenerator, LabelGeneratorConfig
paragraphs = [
    "OpenAI builds language models for developers.",
    "Developers use language models in production systems.",
]
generator = LabelGenerator(LabelGeneratorConfig())
result = generator.fit_transform(paragraphs)
for concept in result.concepts:
    print(concept.normalized, concept.kind, concept.document_frequency, sep=" | ")
for assignment in result.paragraph_labels:
    print(assignment.paragraph_id, assignment.label_ids, assignment.label_scores)
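A common follow-up is to invert the per-paragraph assignments into a label-to-paragraphs index. The sketch below does that with a stand-in class so it runs without the library installed. `ParagraphLabelsStub` and `paragraphs_by_label` are hypothetical names, and the field types (a string paragraph_id and a list of string label_ids) are assumptions rather than the documented schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ParagraphLabelsStub:
    """Stand-in for illustration only; the field types are assumptions."""

    paragraph_id: str
    label_ids: list[str] = field(default_factory=list)


def paragraphs_by_label(assignments) -> dict[str, list[str]]:
    """Group paragraph ids under each label id they were assigned."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for assignment in assignments:
        for label_id in assignment.label_ids:
            grouped[label_id].append(assignment.paragraph_id)
    return dict(grouped)


if __name__ == "__main__":
    demo = [
        ParagraphLabelsStub("p0", ["label-a"]),
        ParagraphLabelsStub("p1", ["label-a", "label-b"]),
    ]
    print(paragraphs_by_label(demo))
```

Because the function only reads the paragraph_id and label_ids attributes, it should also accept result.paragraph_labels if those attributes behave as assumed here.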
LLM extraction pipeline
from labelgen import LabelGenerator, LabelGeneratorConfig
config = LabelGeneratorConfig(
    extractor_mode="llm",
    use_graph_community_detection=False,
)
config.extraction.llm.provider = "openai"
config.extraction.llm.model = "gpt-5-mini"
generator = LabelGenerator(config)
result = generator.fit_transform(
    [
        "OpenAI builds language models and developer APIs for production systems.",
        "Production systems need monitoring and evaluation tooling.",
    ]
)
The LLM extractor supports openai, mistral, and qwen style providers.
Set the corresponding API key in the expected environment variable:
- OPENAI_API_KEY
- MISTRAL_API_KEY
- DASHSCOPE_API_KEY
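Checking for the key up front gives a clearer error than a failed provider call. The sketch below is a standalone pre-flight check. `require_api_key` is a hypothetical helper, and the provider-to-variable mapping is an assumption inferred from the order of the two lists above.

```python
import os
from collections.abc import Mapping

# Assumed mapping, inferred from the order in which providers and
# environment variables are listed; not taken from the library source.
API_KEY_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "qwen": "DASHSCOPE_API_KEY",
}


def require_api_key(provider: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the API key for a provider or raise a descriptive error."""
    try:
        env_var = API_KEY_ENV_VARS[provider]
    except KeyError:
        raise ValueError(f"Unknown provider: {provider!r}") from None
    key = environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"Set {env_var} before using the {provider} provider")
    return key
```

Passing a plain dict as `environ` makes the check easy to exercise in tests without touching the real environment.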
Extraction Modes
LabelGeneratorConfig.extractor_mode supports three modes:
- spacy: default public extractor using spaCy noun chunks and entities
- heuristic: deterministic fallback extractor using rule-based spans
- llm: provider-backed concept extraction using structured JSON output
If extractor_mode is unset, the legacy use_nlp_extractor compatibility flag
is still respected. New code should prefer extractor_mode.
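The precedence described here can be pictured as a small resolution function. This is a sketch and not the library's code: `resolve_extractor_mode` is a hypothetical name, and mapping the legacy flag to spacy when true and heuristic when false is an assumption made for illustration.

```python
from typing import Optional

VALID_MODES = ("spacy", "heuristic", "llm")


def resolve_extractor_mode(
    extractor_mode: Optional[str],
    use_nlp_extractor: bool = True,
) -> str:
    """Pick the effective mode, preferring an explicit extractor_mode."""
    if extractor_mode is not None:
        if extractor_mode not in VALID_MODES:
            raise ValueError(f"Unsupported extractor_mode: {extractor_mode!r}")
        return extractor_mode
    # Assumption: the legacy flag selects between the two non-LLM extractors.
    return "spacy" if use_nlp_extractor else "heuristic"
```

Under these assumptions an explicit extractor_mode always wins, so the legacy flag only matters for older configurations that never set it.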
LLM Configuration Notes
The LLM extraction path is opt-in and synchronous. Key settings live under
config.extraction.llm:
- provider
- model
- api_key_env_var
- base_url
- temperature
- max_output_tokens
- batch_size
- max_concepts_per_paragraph
- cache_enabled
- cache_dir
- record_extraction_artifacts
- artifact_dir
- prompt_version
- prompt_template
Cache and artifact behavior:
- cache_enabled=True stores parsed concept lists on disk and avoids repeated provider calls for the same effective request
- record_extraction_artifacts=True writes structured per-batch extraction artifacts for audit and experiment analysis
- both are optional and can be disabled independently
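One way to think about "the same effective request" is as a stable hash over every setting that can change the provider's answer. The sketch below illustrates that idea only. `cache_key` is a hypothetical function, and which fields the library actually includes in its cache key is not documented here.

```python
import hashlib
import json


def cache_key(
    *,
    provider: str,
    model: str,
    prompt_version: str,
    temperature: float,
    paragraph: str,
) -> str:
    """Derive a deterministic key from the inputs that shape a response."""
    payload = {
        "provider": provider,
        "model": model,
        "prompt_version": prompt_version,
        "temperature": temperature,
        "paragraph": paragraph,
    }
    # sort_keys makes the serialization independent of dict ordering.
    encoded = json.dumps(payload, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

With a key like this, changing the model or the prompt version yields a different key, so stale cache entries are not reused.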
Public API
The main public entrypoints are:
- LabelGenerator
- LabelGeneratorConfig
- Paragraph, Concept, ConceptMention, Community, ParagraphLabels
- dump_result() and load_result()
Detailed API notes are available in docs/public_api.md.
Examples
Runnable examples are available in examples/:
- examples/basic_usage.py
- examples/custom_config.py
- examples/save_and_load.py
- examples/llm_extraction.py
Configuration Notes
- fit() learns concepts and communities from a corpus
- transform() applies previously learned communities to new paragraphs
- fit_transform() learns and labels the same input in one pass
- use_graph_community_detection=True uses Leiden community detection
- use_graph_community_detection=False uses deterministic connected components
- the default spaCy path requires the configured spaCy model to be installed
- the LLM path does not silently fall back to spaCy or heuristic extraction
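To make the connected-components option concrete, the sketch below groups concepts into communities with a small union-find. It is an illustration under stated assumptions and not the library's implementation: it assumes two concepts are linked whenever they occur in the same paragraph, and `connected_components` is a hypothetical function name.

```python
from itertools import combinations


def connected_components(paragraph_concepts: list[list[str]]) -> list[list[str]]:
    """Group concepts that are linked by co-occurring in a paragraph."""
    parent: dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Always keep the alphabetically smaller root.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for concepts in paragraph_concepts:
        for concept in concepts:
            find(concept)  # register concepts that have no neighbours
        for a, b in combinations(concepts, 2):
            union(a, b)

    groups: dict[str, list[str]] = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    # Sorting members and groups makes the output order reproducible.
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    demo = [
        ["language models", "developers"],
        ["developers", "production systems"],
        ["monitoring", "evaluation"],
    ]
    print(connected_components(demo))
    # [['developers', 'language models', 'production systems'], ['evaluation', 'monitoring']]
```

Unlike Leiden, which can split a densely connected graph into several communities, this approach merges everything reachable through any chain of shared paragraphs, which keeps results deterministic but can produce coarse groups.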
File details
Details for the file paralabelgen-0.2.0.tar.gz.
File metadata
- Download URL: paralabelgen-0.2.0.tar.gz
- Upload date:
- Size: 48.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d053b171536eb64593906a1a63fe75851b5fe39f739691094aa988fd46cd546b |
| MD5 | 977cefe5c557d7b3cff33f1d89d85cfd |
| BLAKE2b-256 | 288dd9dcd43c8928d7fc7bed5fed43ff9293a30c31cb63e8d79216bb9509946b |
File details
Details for the file paralabelgen-0.2.0-py3-none-any.whl.
File metadata
- Download URL: paralabelgen-0.2.0-py3-none-any.whl
- Upload date:
- Size: 37.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ff66d4affae1fb96dd69ff5784b3b4a24037c807cef6cb61b3fc68dd390a7c99 |
| MD5 | c434b71ad3ff7a62394e3ee0169b6db2 |
| BLAKE2b-256 | 46f1aca14418e8362c944ffd05122640c480d6c8b724c6752c8aa63272af4e55 |