paralabelgen

paralabelgen is a Python library that generates discrete multi-label annotations for text paragraphs using concept extraction, graph community detection, and interpretable assignment rules.

  • PyPI distribution: paralabelgen
  • Python import package: labelgen
  • Repository: https://github.com/HuRuilizhen/labelgen

Install

pip install paralabelgen
python -m spacy download en_core_web_sm

en_core_web_sm is the recommended default model. If you already use another compatible English spaCy pipeline, you can point spacy_model_name at that installed model instead.
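Because spaCy pipelines installed with `python -m spacy download` are ordinary Python packages, you can check for the model before constructing a generator. The sketch below uses only the standard library; `is_package_installed` is a hypothetical helper written for this example and is not part of labelgen.

```python
import importlib.util


def is_package_installed(name: str) -> bool:
    """Return True if a top-level package called `name` can be imported.

    Hypothetical helper for illustration; not part of labelgen or spaCy.
    """
    return importlib.util.find_spec(name) is not None


# The spaCy model name doubles as the name of the installed package.
model_name = "en_core_web_sm"
if not is_package_installed(model_name):
    print(f"Missing spaCy model; run: python -m spacy download {model_name}")
```

The same check works for any other installed pipeline name you plan to pass as spacy_model_name.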

Quick Start

from labelgen import LabelGenerator, LabelGeneratorConfig

paragraphs = [
    "OpenAI builds language models for developers.",
    "Developers use language models in production systems.",
]

# Default configuration: spaCy extraction and Leiden community detection.
generator = LabelGenerator(LabelGeneratorConfig())
result = generator.fit_transform(paragraphs)

print("Concepts:")
for concept in result.concepts:
    print(concept.normalized, concept.kind, concept.document_frequency, sep=" | ")

print("Labels:")
for assignment in result.paragraph_labels:
    print(assignment.paragraph_id, assignment.label_ids, assignment.label_scores)

The default public pipeline uses spaCy extraction and Leiden community detection. Install the recommended spaCy model before running the example.

Public API

The main public entrypoints are:

  • LabelGenerator
  • LabelGeneratorConfig
  • Paragraph, Concept, ConceptMention, Community, ParagraphLabels
  • dump_result() and load_result()

Detailed API notes are available in docs/public_api.md.

Examples

Runnable examples are available in the examples/ directory of the repository.

Configuration Notes

  • fit() learns concept communities from a corpus.
  • transform() applies previously learned communities to new paragraphs.
  • fit_transform() learns and labels the same input in one pass.
  • The default pipeline uses spaCy extraction and Leiden community detection.
  • The default NLP path requires the configured spaCy model to be installed.
  • en_core_web_sm is the recommended default model name.
  • If the configured model is missing, the library raises an explicit runtime error.
  • Set use_nlp_extractor=False to switch to the deterministic heuristic extractor.
  • Set use_graph_community_detection=False to switch to deterministic connected-components community detection.
  • The heuristic extractor uses capitalized spans as lightweight entities and non-stopword spans as candidate noun phrases.
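To make the last note concrete, here is a minimal sketch of what a heuristic extractor of this kind could look like. It is not the library's implementation: the tokenizer regex, the stopword list, and the names `STOPWORDS` and `extract_candidates` are all assumptions chosen for illustration.

```python
import re
from itertools import groupby

# Tiny illustrative stopword list (an assumption; the library's list may differ).
STOPWORDS = {"a", "an", "the", "for", "in", "of", "and", "or", "to", "on", "with", "is", "are"}


def _runs(tokens, keep):
    """Yield maximal runs of consecutive tokens for which keep(token) is true."""
    for matched, group in groupby(tokens, key=keep):
        if matched:
            yield " ".join(group)


def extract_candidates(text):
    """Return (entities, phrases) found in `text` using two simple rules."""
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9-]*", text)
    # Rule 1: runs of capitalized tokens act as lightweight entities.
    # Note that sentence-initial words are capitalized too, so they also match.
    entities = list(_runs(tokens, lambda t: t[0].isupper()))
    # Rule 2: runs of non-stopword tokens act as candidate noun phrases.
    phrases = [p.lower() for p in _runs(tokens, lambda t: t.lower() not in STOPWORDS)]
    return entities, phrases


entities, phrases = extract_candidates("OpenAI builds language models for developers.")
print(entities)  # ['OpenAI']
print(phrases)   # ['openai builds language models', 'developers']
```

Rules like these involve no model or randomness, so the same input always yields the same candidates, which is the sense in which such an extractor is deterministic.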

Opt Out Of Enhanced Implementations

from labelgen import LabelGenerator, LabelGeneratorConfig

config = LabelGeneratorConfig(
    use_nlp_extractor=False,
    use_graph_community_detection=False,
)
generator = LabelGenerator(config)
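With use_graph_community_detection=False, communities come from connected components rather than Leiden. The sketch below shows the general idea with a union-find structure, under the assumption that two concepts are linked when they appear in the same paragraph. That linking rule and the `connected_components` function are illustrative and do not come from labelgen.

```python
from itertools import combinations


def connected_components(concepts_per_paragraph):
    """Group concepts that are linked, directly or transitively, by co-occurrence."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Always attach the larger root to the smaller one.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for concepts in concepts_per_paragraph:
        for concept in concepts:
            find(concept)  # register concepts that never co-occur with others
        for a, b in combinations(sorted(set(concepts)), 2):
            union(a, b)

    groups = {}
    for concept in list(parent):
        groups.setdefault(find(concept), []).append(concept)
    # Sorting makes the output independent of input order.
    return sorted(sorted(group) for group in groups.values())


communities = connected_components([
    ["language models", "developers"],
    ["developers", "production systems"],
    ["spacy"],
])
print(communities)
# [['developers', 'language models', 'production systems'], ['spacy']]
```

Connected components have a single correct answer for a given graph, so the result does not depend on a random seed, at the cost of merging any concepts joined by even one shared paragraph.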

Use A Different spaCy Model

from labelgen import LabelGenerator, LabelGeneratorConfig

config = LabelGeneratorConfig()
config.extraction.spacy_model_name = "en_core_web_md"

generator = LabelGenerator(config)
