Word segmentation with transformers
✂️ hashformers
Hashformers is a word segmentation library that fills a gap in the NLP ecosystem between heuristic-based splitters and LLM prompt-based segmentation. It can be used with any language model from the Hugging Face Model Hub, from auto-regressive models like GPT-2 to recent large language models (LLMs).
Hashformers uses language models and a beam search algorithm to segment text without spaces into words. Benchmarks show that it can outperform heuristic-based splitters and LLM prompt-based approaches on word segmentation tasks.
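To make the idea concrete, here is a minimal sketch of beam search over space insertions. It is an illustration under stated assumptions, not the library's implementation: `toy_cost` and its tiny `VOCAB` are hypothetical stand-ins for a language model score, and the names `expand`, `beam_search_segment`, `topk`, and `steps` are chosen only for this example.

```python
from typing import Callable, List

# Hypothetical stand-in for a language model: known words are cheap,
# unknown character runs are expensive. A real setup would instead score
# each candidate string with a model such as GPT-2.
VOCAB = {"we", "need", "a", "national", "park", "ice", "cold"}


def toy_cost(segmentation: str) -> float:
    """Lower is better: 1.0 per known word, 2.0 per character otherwise."""
    return sum(1.0 if w in VOCAB else 2.0 * len(w) for w in segmentation.split(" "))


def expand(candidate: str) -> List[str]:
    """Return every string obtained by inserting one more space between two letters."""
    return [
        candidate[:i] + " " + candidate[i:]
        for i in range(1, len(candidate))
        if candidate[i - 1] != " " and candidate[i] != " "
    ]


def beam_search_segment(
    text: str,
    cost: Callable[[str], float] = toy_cost,
    topk: int = 3,
    steps: int = 5,
) -> str:
    """Add one space per step, keep the `topk` cheapest candidates, return the best seen."""
    beam = [text]
    best = text
    for _ in range(steps):
        candidates = {c for b in beam for c in expand(b)}  # set removes duplicates
        if not candidates:
            break
        # Ties are broken alphabetically so the result is deterministic.
        beam = sorted(candidates, key=lambda c: (cost(c), c))[:topk]
        if cost(beam[0]) < cost(best):
            best = beam[0]
    return best


print(beam_search_segment("icecold"))
# ice cold
print(beam_search_segment("weneedanationalpark"))
# we need a national park
```

The beam width bounds the number of candidates scored per step, which is what keeps the search tractable when each score requires a model forward pass.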
✂️ Google Colab Tutorial
✂️ Evaluation Report
🚀 Quick Start
Installation
pip install hashformers
Basic Usage
from hashformers import TransformerWordSegmenter as WordSegmenter

# You can use any model from the Hugging Face Model Hub
ws = WordSegmenter(
    segmenter_model_name_or_path="distilgpt2"
)

segmentations = ws.segment([
    "#weneedanationalpark",
    "#icecold"
])

print(segmentations)
# ['we need a national park', 'ice cold']
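Since `segment` takes a list of hashtags and returns a list of strings, one possible way to apply it to running text is to extract the hashtags, segment them in a single batch, and substitute the results back. The sketch below is hypothetical: `expand_hashtags` and `fake_segment` are names invented for this example, `fake_segment` stands in for `ws.segment` so the snippet runs without downloading a model, and it assumes the returned list is in the same order as the input.

```python
import re
from typing import Callable, List

HASHTAG = re.compile(r"#\w+")


def expand_hashtags(text: str, segment: Callable[[List[str]], List[str]]) -> str:
    """Replace each hashtag in `text` with its segmented form."""
    tags = list(dict.fromkeys(HASHTAG.findall(text)))  # unique, order preserved
    if not tags:
        return text
    # Assumes `segment` returns one string per input, in input order.
    mapping = dict(zip(tags, segment(tags)))
    return HASHTAG.sub(lambda m: mapping[m.group(0)], text)


def fake_segment(tags: List[str]) -> List[str]:
    """Hypothetical stand-in for `ws.segment`, backed by a fixed lookup table."""
    lookup = {
        "#icecold": "ice cold",
        "#weneedanationalpark": "we need a national park",
    }
    return [lookup.get(tag, tag.lstrip("#")) for tag in tags]


print(expand_hashtags("That lake was #icecold today", fake_segment))
# That lake was ice cold today
```

With a real segmenter, `ws.segment` would be passed in place of `fake_segment`.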
Using Language-Specific Models
# Russian hashtags with RuGPT3
ws = WordSegmenter(
    segmenter_model_name_or_path="ai-forever/rugpt3small_based_on_gpt2"
)

segmentations = ws.segment(["#москвасити"])
print(segmentations)
# ['москва сити']
spaCy Integration
Hashformers can be used as a spaCy pipeline component:
import spacy
import hashformers.spacy # registers the "hashformers" component
nlp = spacy.blank("en")
nlp.add_pipe("hashformers", config={"model": "distilgpt2"})
doc = nlp("#weneedanationalpark")
print(doc._.segmented) # "we need a national park"
Install with spaCy support (quote the extra so that shells such as zsh do not treat the brackets as a glob pattern):
pip install "hashformers[spacy]"
When to Use Hashformers?
The table below outlines when to use Hashformers versus other approaches like heuristic-based splitters (e.g., SymSpell, WordNinja) or large LLMs.
| Approach | Examples | Recommended When... | Notes |
|---|---|---|---|
| Heuristic-based | SymSpell, Ekphrasis, WordNinja, Spiral (Ronin) | • Scalability is a primary requirement. • The segmentation domain works well with a standard pre-built vocabulary. | Fast and efficient, but requires a pre-built vocabulary which can be limiting for niche domains or languages. |
| Hashformers | Hashformers | • Scalability is needed. • You are working in a domain or language where a Language Model is readily available, but compiling a manual vocabulary for your task is too burdensome. | Evidence shows Hashformers can be superior to LLMs of similar scale (0.5B parameters). |
| Large LLMs | OpenAI, Local LLM Deployment | • Cost, latency, and scalability are not concerns. • You are segmenting a low volume of items. | To gain an accuracy advantage over Hashformers, you generally need to use significantly larger LLMs. |
📚 Research & Citations
Hashformers was recognized as state-of-the-art for hashtag segmentation at LREC 2022.
Papers Using Hashformers
- Zero-shot hashtag segmentation for multilingual sentiment analysis
- Generalizability of Abusive Language Detection Models on Homogeneous German Datasets
- The problem of varying annotations to identify abusive language in social media content
- NUSS: An R package for mixed N-grams and unigram sequence segmentation
Citation
If you find Hashformers useful, please consider citing our paper:
@misc{rodrigues2021zeroshot,
  title={Zero-shot hashtag segmentation for multilingual sentiment analysis},
  author={Ruan Chaves Rodrigues and Marcelo Akira Inuzuka and Juliana Resplande Sant'Anna Gomes and Acquila Santos Rocha and Iacer Calixto and Hugo Alexandre Dantas do Nascimento},
  year={2021},
  eprint={2112.03213},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
🤝 Contributing
Pull requests are welcome! Read our paper for details on the framework architecture.
git clone https://github.com/ruanchaves/hashformers.git
cd hashformers
pip install -e .
📖 Resources