paralabelgen
paralabelgen is a Python library for generating discrete multi-label
annotations for text paragraphs from concept extraction, graph communities,
and interpretable assignment rules.
- PyPI distribution: paralabelgen
- Python import package: labelgen
- Repository: https://github.com/HuRuilizhen/labelgen
Install
pip install paralabelgen
python -m spacy download en_core_web_sm
en_core_web_sm is the recommended default model. If you already use another
compatible English spaCy pipeline, you can point spacy_model_name at that
installed model instead.
Quick Start
from labelgen import LabelGenerator, LabelGeneratorConfig

paragraphs = [
    "OpenAI builds language models for developers.",
    "Developers use language models in production systems.",
]

generator = LabelGenerator(LabelGeneratorConfig())
result = generator.fit_transform(paragraphs)

print("Concepts:")
for concept in result.concepts:
    print(concept.normalized, concept.kind, concept.document_frequency, sep=" | ")

print("Labels:")
for assignment in result.paragraph_labels:
    print(assignment.paragraph_id, assignment.label_ids, assignment.label_scores)
The default public pipeline uses spaCy extraction and Leiden community detection. Install the recommended spaCy model before running the example.
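Because each assignment's label_ids and label_scores run in parallel, a common follow-up step is inverting the per-paragraph assignments into a label-to-paragraphs index. The sketch below uses plain dicts in place of the library's result objects; the dict shape is an assumption for illustration, not part of paralabelgen's API:

```python
from collections import defaultdict

def index_by_label(assignments):
    """Invert paragraph -> labels assignments into a label -> paragraphs
    index, keeping the score each paragraph received for that label."""
    index = defaultdict(list)
    for a in assignments:
        for label_id, score in zip(a["label_ids"], a["label_scores"]):
            index[label_id].append((a["paragraph_id"], score))
    return dict(index)

# Illustrative stand-ins for ParagraphLabels objects.
assignments = [
    {"paragraph_id": 0, "label_ids": [3, 7], "label_scores": [0.9, 0.4]},
    {"paragraph_id": 1, "label_ids": [3], "label_scores": [0.8]},
]
print(index_by_label(assignments))
# {3: [(0, 0.9), (1, 0.8)], 7: [(0, 0.4)]}
```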
Public API
The main public entrypoints are:
- LabelGenerator
- LabelGeneratorConfig
- Paragraph, Concept, ConceptMention, Community, ParagraphLabels
- dump_result() and load_result()
Detailed API notes are available in docs/public_api.md.
Examples
Runnable examples are available in examples/.
Configuration Notes
- fit() learns concept communities from a corpus.
- transform() applies previously learned communities to new paragraphs.
- fit_transform() learns and labels the same input in one pass.
- The default pipeline uses spaCy extraction and Leiden community detection.
- The default NLP path requires the configured spaCy model to be installed; en_core_web_sm is the recommended default model name.
- If the configured model is missing, the library raises an explicit runtime error.
- Set use_nlp_extractor=False to switch to the deterministic heuristic extractor.
- Set use_graph_community_detection=False to switch to deterministic connected-components community detection.
- The heuristic extractor uses capitalized spans as lightweight entities and non-stopword spans as candidate noun phrases.
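The heuristic fallback described above can be approximated in a few lines of plain Python. This is an illustrative sketch only, not the library's implementation; the tokenizer and stopword list here are placeholder assumptions:

```python
import re

# Placeholder stopword list for illustration; the library's actual list is unknown.
STOPWORDS = {"a", "an", "the", "in", "of", "and", "to", "for"}

def heuristic_concepts(paragraph):
    """Approximate the heuristic extractor: capitalized tokens become
    lightweight entities, remaining non-stopword tokens become
    candidate noun phrases."""
    tokens = re.findall(r"[A-Za-z]+", paragraph)
    entities = [t for t in tokens if t[0].isupper()]
    candidates = [t.lower() for t in tokens
                  if not t[0].isupper() and t.lower() not in STOPWORDS]
    return {"entities": entities, "noun_phrase_candidates": candidates}

print(heuristic_concepts("OpenAI builds language models for developers."))
# {'entities': ['OpenAI'], 'noun_phrase_candidates': ['builds', 'language', 'models', 'developers']}
```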
Opt Out Of Enhanced Implementations
from labelgen import LabelGenerator, LabelGeneratorConfig
config = LabelGeneratorConfig(
    use_nlp_extractor=False,
    use_graph_community_detection=False,
)
generator = LabelGenerator(config)
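With use_graph_community_detection=False, communities are the connected components of the concept graph rather than Leiden partitions. A self-contained sketch of that idea, assuming co-occurrence edges between concepts (the graph construction here is an assumption for illustration, not paralabelgen's internals):

```python
from collections import defaultdict

def connected_components(edges):
    """Group concept nodes into communities by graph connectivity,
    using an iterative depth-first search."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            stack.extend(adj[n] - seen)
        components.append(comp)
    return components

# Illustrative edges between concepts that co-occur in the same paragraph.
edges = [("openai", "language models"),
         ("language models", "developers"),
         ("developers", "production systems")]
print(connected_components(edges))
```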
Use A Different spaCy Model
from labelgen import LabelGenerator, LabelGeneratorConfig
config = LabelGeneratorConfig()
config.extraction.spacy_model_name = "en_core_web_md"
generator = LabelGenerator(config)
Download files
Source Distribution
Built Distribution
File details
Details for the file paralabelgen-0.1.1.tar.gz.
File metadata
- Download URL: paralabelgen-0.1.1.tar.gz
- Upload date:
- Size: 39.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 629a4dd3757d043386b3f87e7fc3f1cf1be75912fe7bae3b355ba1fe976bd819 |
| MD5 | 599f893ad00040698fabd38accdf0250 |
| BLAKE2b-256 | 770cb2428f3d51e5739117785064f41f60b06f1525385bdc88d6cf05bd5b3045 |
File details
Details for the file paralabelgen-0.1.1-py3-none-any.whl.
File metadata
- Download URL: paralabelgen-0.1.1-py3-none-any.whl
- Upload date:
- Size: 28.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 29217b29f52b60a0b9405ab865a0502de70793f9aa38f8d969c2c75b4be2ca71 |
| MD5 | 0676204e7a09c9414b3c1ed8a0651295 |
| BLAKE2b-256 | c77548bed63b0f3f2d303d6553f96dc5ac6dd956ab805319930f7d1bf8864ae7 |