Word segmentation with transformers

✂️ hashformers

Hashformers is a word segmentation library that fills a gap in the NLP ecosystem between heuristic-based splitters and LLM prompt-based segmentation. It can be used with any language model from the Hugging Face Model Hub, from auto-regressive models like GPT-2 to recent large language models (LLMs).

Hashformers combines a language model with a beam search algorithm to split unspaced text into words. Benchmarks show that it can outperform heuristic-based splitters, as well as prompt-based segmentation with LLMs of similar scale (0.5B parameters), on word segmentation tasks.
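
To make the idea concrete, here is a toy sketch of beam-style segmentation. It is not the Hashformers implementation: the `TOY_UNIGRAMS` table, its probabilities, and the `beam_segment` function are all made up for illustration, and a simple unigram lookup stands in for the language model score.

```python
import math

# Hypothetical toy vocabulary with made-up probabilities (assumption for this sketch).
# A real setup would score each candidate with a language model instead.
TOY_UNIGRAMS = {
    "we": 0.05, "need": 0.03, "a": 0.06, "national": 0.01,
    "park": 0.01, "ice": 0.01, "cold": 0.01,
}
UNKNOWN_CHAR_LOGPROB = math.log(1e-4)  # penalty per character of an unknown word


def word_logprob(word):
    """Log-probability of a word; unknown words pay a per-character penalty."""
    if word in TOY_UNIGRAMS:
        return math.log(TOY_UNIGRAMS[word])
    return UNKNOWN_CHAR_LOGPROB * len(word)


def beam_segment(text, beam_width=3):
    """Keep the best `beam_width` segmentations of every prefix of `text`."""
    if not text:
        return ""
    # beams[i] holds up to beam_width (score, words) hypotheses covering text[:i].
    beams = [[(0.0, [])]] + [[] for _ in text]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(end):
            word = text[start:end]
            for score, words in beams[start]:
                candidates.append((score + word_logprob(word), words + [word]))
        # Prune: only the highest-scoring hypotheses survive to the next step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams[end] = candidates[:beam_width]
    return " ".join(beams[len(text)][0][1])


print(beam_segment("weneedanationalpark"))  # we need a national park
print(beam_segment("icecold"))              # ice cold
```

With an additive unigram score like this one, pruning never discards the best hypothesis. Once the score depends on context, as it does with a language model, the beam width becomes a trade-off between speed and accuracy.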

✂️ Google Colab Tutorial

✂️ Evaluation Report


🚀 Quick Start

Installation

pip install hashformers

Basic Usage

from hashformers import TransformerWordSegmenter as WordSegmenter

ws = WordSegmenter(
    segmenter_model_name_or_path="distilgpt2"
) # You can use any model from the Hugging Face Model Hub

segmentations = ws.segment([
    "#weneedanationalpark",
    "#icecold"
])

print(segmentations)
# ['we need a national park', 'ice cold']
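
Hashtags usually appear inside longer texts. The sketch below shows one possible way to replace every hashtag in a sentence with its segmented form. The `expand_hashtags` helper and the `fake_segment` stand-in are hypothetical names introduced here, not part of the library. The helper accepts any callable that maps a list of hashtags to a list of segmentations, so `ws.segment` from the snippet above should fit, assuming it returns one string per input in the same order.

```python
import re

HASHTAG_PATTERN = re.compile(r"#\w+")


def expand_hashtags(text, segment_batch):
    """Replace each hashtag in `text` with its segmented form.

    `segment_batch` is any callable that takes a list of hashtag strings and
    returns a list of segmented strings in the same order.
    """
    # Deduplicate while keeping first-seen order, so each hashtag is segmented once.
    hashtags = list(dict.fromkeys(HASHTAG_PATTERN.findall(text)))
    if not hashtags:
        return text
    segmented = dict(zip(hashtags, segment_batch(hashtags)))
    return HASHTAG_PATTERN.sub(lambda match: segmented[match.group(0)], text)


def fake_segment(hashtags):
    """Stand-in segmenter so the sketch runs without downloading a model."""
    lookup = {
        "#icecold": "ice cold",
        "#weneedanationalpark": "we need a national park",
    }
    return [lookup[hashtag] for hashtag in hashtags]


print(expand_hashtags("Lemonade, #icecold, all summer", fake_segment))
# Lemonade, ice cold, all summer
```

Passing all hashtags of a text in a single call keeps the number of segmenter invocations low, which matters when a model sits behind it.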

Using Language-Specific Models

# Russian hashtags with RuGPT3
ws = WordSegmenter(
    segmenter_model_name_or_path="ai-forever/rugpt3small_based_on_gpt2"
)

segmentations = ws.segment(["#москвасити"])

print(segmentations)
# ['москва сити']

spaCy Integration

Hashformers can be used as a spaCy pipeline component:

import spacy
import hashformers.spacy  # registers the "hashformers" component

nlp = spacy.blank("en")
nlp.add_pipe("hashformers", config={"model": "distilgpt2"})

doc = nlp("#weneedanationalpark")
print(doc._.segmented)  # "we need a national park"

Install with spaCy support:

pip install "hashformers[spacy]"  # quotes keep shells such as zsh from expanding the brackets

When to Use Hashformers?

The table below outlines when to use Hashformers versus other approaches like heuristic-based splitters (e.g., SymSpell, WordNinja) or large LLMs.

| Approach | Examples | Recommended when... | Notes |
| --- | --- | --- | --- |
| Heuristic-based | SymSpell, Ekphrasis, WordNinja, Spiral (Ronin) | • Scalability is a primary requirement. • The segmentation domain works well with a standard pre-built vocabulary. | Fast and efficient, but requires a pre-built vocabulary, which can be limiting for niche domains or languages. |
| Hashformers | Hashformers | • Scalability is needed. • You are working in a domain or language where a language model is readily available, but compiling a manual vocabulary for your task is too burdensome. | Evidence shows Hashformers can be superior to LLMs of similar scale (0.5B parameters). |
| Large LLMs | OpenAI, local LLM deployment | • Cost, latency, and scalability are not concerns. • You are segmenting a low volume of items. | To gain an accuracy advantage over Hashformers, you generally need to use significantly larger LLMs. |
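
When choosing between these approaches, it can help to score each candidate on a small labelled sample from your own domain. The sketch below is one possible way to do that. The `evaluate` and `boundary_positions` helpers are hypothetical, and the two metrics (exact-match accuracy and F1 over inserted word boundaries) are common choices assumed here, not necessarily the ones used in the project's evaluation report.

```python
def boundary_positions(segmented):
    """Character offsets (in the unspaced string) where a space was inserted."""
    positions, offset = set(), 0
    for word in segmented.split()[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions


def evaluate(predictions, references):
    """Return exact-match accuracy and boundary F1 for parallel lists of strings."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    exact = sum(p == r for p, r in zip(predictions, references))
    tp = fp = fn = 0
    for p, r in zip(predictions, references):
        predicted, gold = boundary_positions(p), boundary_positions(r)
        tp += len(predicted & gold)   # boundaries found in both
        fp += len(predicted - gold)   # spurious boundaries
        fn += len(gold - predicted)   # missed boundaries
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": exact / len(references), "boundary_f1": f1}


references = ["we need a national park", "ice cold"]
predictions = ["we need a national park", "icecold"]  # second item left unsplit
print(evaluate(predictions, references))
```

Running the same function over the outputs of a heuristic splitter, Hashformers, and a prompted LLM gives a like-for-like comparison on the data you actually care about.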

📚 Research & Citations

Hashformers was recognized as state-of-the-art for hashtag segmentation at LREC 2022.

Citation

If you find Hashformers useful, please consider citing our paper:

@misc{rodrigues2021zeroshot,
      title={Zero-shot hashtag segmentation for multilingual sentiment analysis}, 
      author={Ruan Chaves Rodrigues and Marcelo Akira Inuzuka and Juliana Resplande Sant'Anna Gomes and Acquila Santos Rocha and Iacer Calixto and Hugo Alexandre Dantas do Nascimento},
      year={2021},
      eprint={2112.03213},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

🤝 Contributing

Pull requests are welcome! Read our paper for details on the framework architecture.

git clone https://github.com/ruanchaves/hashformers.git
cd hashformers
pip install -e .
