Tunisian Named Entity Recognition for French text using CamemBERT
🇹🇳 tun-camembert-ner
Tunisian Named Entity Recognition for French text, powered by a fine-tuned CamemBERT model.
📖 Table of Contents
- Overview
- Features
- Installation
- Quick Start
- Usage
- Model
- Project Pipeline
- Dataset
- Training
- Results
- Contributing
- License
Overview
tun-camembert-ner is an open-source Python library for Named Entity Recognition (NER) in Tunisian French text. It detects and classifies named entities into three categories:
| Entity | Description | Example |
|---|---|---|
| PER | Person names | Ahmed Karray, Samir Saied |
| LOC | Cities, regions, countries | Tunis, Sfax, Monastir |
| ORG | Organizations, companies, institutions | STEG, Tunisair, BIAT |
The model is built on top of CamemBERT (camembert-base), a French BERT model pre-trained on 138GB of French text, fine-tuned on a custom Tunisian French NER dataset.
Features
- ✅ Detects PER, LOC, and ORG entities in French Tunisian text
- ✅ Word-level predictions: entities are returned as whole words, not as sub-token pieces
- ✅ Built on CamemBERT — optimized for French
- ✅ Simple and clean Python API
- ✅ CPU and GPU support
- ✅ Lightweight and easy to install
Installation
pip install tun-camembert-ner
Requirements:
- Python 3.8+
- PyTorch 2.0+
- Transformers 4.40+
Quick Start
from tunisian_ner import NER
ner = NER()
result = ner("Ahmed Karray dirige la STEG à Tunis.")
print(result)
Output:
[
{"word": "Ahmed Karray", "entity_group": "PER", "score": 0.999},
{"word": "STEG", "entity_group": "ORG", "score": 0.997},
{"word": "Tunis", "entity_group": "LOC", "score": 0.999}
]
Usage
Basic usage
from tunisian_ner import NER
ner = NER()
result = ner("Le ministre Samir Saied a visité Sfax hier.")
print(result)
Multiple sentences
from tunisian_ner import NER
ner = NER()
sentences = [
    "Ahmed Karray dirige la STEG à Tunis.",
    "Tunisair a annoncé de nouveaux vols vers Paris.",
    "Fatma Mseddi représente Ennahdha à Monastir.",
    "La BIAT a ouvert une nouvelle agence à Sousse.",
]
# Run the model on each sentence and print one line per detected entity
for sent in sentences:
    print(f"\n📝 {sent}")
    for ent in ner(sent):
        print(f"  → {ent['word']:<25} {ent['entity_group']} ({ent['score']})")
Filter by entity type
text = "Riadh Bettaieb a signé un accord à Sousse avec la BIAT."
entities = ner(text)
persons = [e for e in entities if e["entity_group"] == "PER"]
locations = [e for e in entities if e["entity_group"] == "LOC"]
orgs = [e for e in entities if e["entity_group"] == "ORG"]
print(f"Persons : {[e['word'] for e in persons]}")
print(f"Locations : {[e['word'] for e in locations]}")
print(f"Orgs : {[e['word'] for e in orgs]}")
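The three list comprehensions above can also be folded into one helper. The following is a minimal sketch, not part of the library: `group_entities` and its `min_score` parameter are hypothetical names, and the sample scores are illustrative. It relies only on the output format shown in Quick Start.

```python
from collections import defaultdict

def group_entities(entities, min_score=0.0):
    """Group entity words by entity_group, dropping low-confidence ones."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)

# Sample data in the Quick Start output format (scores are made up).
sample = [
    {"word": "Riadh Bettaieb", "entity_group": "PER", "score": 0.998},
    {"word": "Sousse", "entity_group": "LOC", "score": 0.995},
    {"word": "BIAT", "entity_group": "ORG", "score": 0.62},
]
print(group_entities(sample, min_score=0.9))
# {'PER': ['Riadh Bettaieb'], 'LOC': ['Sousse']}
```

With a real model, `sample` would be replaced by the result of `ner(text)`.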
Use a custom model
ner = NER(model="your-username/your-custom-model")
Model
The model is hosted on HuggingFace Hub:
🤗 NourBesrour/tun-ner-camembert
| Property | Value |
|---|---|
| Base model | camembert-base |
| Task | Token Classification (NER) |
| Language | French (Tunisian) |
| Labels | O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG |
| Training epochs | 10 |
| Max sequence length | 128 tokens |
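
Because the checkpoint is a standard token-classification model, it can probably also be loaded without this library through the Transformers `pipeline` API. The sketch below assumes that; `load_ner_pipeline` and `simplify` are hypothetical helper names, and the choice of `aggregation_strategy="simple"` is an assumption, not necessarily what the library uses internally.

```python
def load_ner_pipeline(model_id="NourBesrour/tun-ner-camembert"):
    """Build a token-classification pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency
    # "simple" merges consecutive B-/I- tokens of one type into a single group.
    return pipeline("token-classification", model=model_id,
                    aggregation_strategy="simple")

def simplify(raw_entities, ndigits=3):
    """Keep only the fields shown in Quick Start, with rounded float scores."""
    return [
        {
            "word": ent["word"],
            "entity_group": ent["entity_group"],
            "score": round(float(ent["score"]), ndigits),
        }
        for ent in raw_entities
    ]

# Usage (needs network access plus the transformers and torch packages):
#   ner_pipe = load_ner_pipeline()
#   print(simplify(ner_pipe("Ahmed Karray dirige la STEG à Tunis.")))
```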
Project Pipeline
This library was built following a complete NLP pipeline from scratch. Here is the full process:
Step 1 — Data Collection (Web Scraping)
Tunisian French text was collected from multiple online sources using BeautifulSoup and Requests.
Sources used:
| Category | Sites |
|---|---|
| General news | lapresse.tn, kapitalis.com, leaders.com.tn, webdo.tn |
| Economy | businessnews.com.tn, ilboursa.com, leaders.com.tn |
| Politics | realites.com.tn, tap.info.tn, presidency.tn |
| Sport | sport.tn, tunisiesport.net |
| Tech | tunisienumerique.com, tekiano.com |
| Wikipedia | fr.wikipedia.org (Tunisia articles) |
The scraper collected raw French sentences from article paragraphs, filtered to keep only sentences with at least 3 words.
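The sentence filter described above could look roughly like this. It is a sketch under stated assumptions: the sentence splitter is a naive regex, `extract_sentences` is a hypothetical name, and the HTML download and parsing with Requests and BeautifulSoup are left out so the example runs on plain paragraph text.

```python
import re

MIN_WORDS = 3  # minimum sentence length stated in the text above

def extract_sentences(paragraph, min_words=MIN_WORDS):
    """Split a paragraph into sentences and keep those with enough words."""
    # Naive split after '.', '!' or '?' followed by whitespace (an assumption).
    candidates = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [s for s in candidates if len(s.split()) >= min_words]

paragraph = (
    "Tunisair a annoncé de nouveaux vols vers Paris. Lire aussi. "
    "La BIAT a ouvert une nouvelle agence à Sousse."
)
for sentence in extract_sentences(paragraph):
    print(sentence)
```

Here the two-word fragment "Lire aussi." is dropped and the two full sentences are kept.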
Step 2 — Data Annotation
Raw sentences were annotated in BIO (Beginning-Inside-Outside) format
using GLiNER (urchade/gliner_multi-v2.1), a zero-shot Named Entity
Recognition model that runs fully locally — no API key needed.
GLiNER predicts entity spans directly from raw text, then the results are
automatically converted to BIO format. A threshold of 0.4 was used to
filter low-confidence predictions.
How it works:
- Input: raw French sentence
- GLiNER detects spans for person, location, organization
- Spans are mapped to word-level BIO labels
- Output: .conll file ready for training
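The span-to-BIO mapping in the steps above can be sketched as follows. This is not the project's annotation script: `spans_to_bio`, `make_span`, and `LABEL_MAP` are hypothetical names, the regex tokenizer is an assumption, and the spans are hand-written stand-ins for GLiNER predictions (assumed here to be dicts with `start`, `end`, `label`, and `score` keys).

```python
import re

# Assumed mapping from the GLiNER prompt labels to this README's tag set.
LABEL_MAP = {"person": "PER", "location": "LOC", "organization": "ORG"}

def spans_to_bio(text, spans, threshold=0.4):
    """Convert character-level spans into (token, BIO tag) pairs."""
    # Words and punctuation marks become separate tokens.
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", text)]
    tags = ["O"] * len(tokens)
    for span in spans:
        if span["score"] < threshold or span["label"] not in LABEL_MAP:
            continue  # drop low-confidence or unexpected predictions
        covered = [i for i, (_, start, end) in enumerate(tokens)
                   if start >= span["start"] and end <= span["end"]]
        for position, i in enumerate(covered):
            prefix = "B-" if position == 0 else "I-"
            tags[i] = prefix + LABEL_MAP[span["label"]]
    return [(token, tag) for (token, _, _), tag in zip(tokens, tags)]

def make_span(text, surface, label, score):
    """Demo helper: build a span dict by locating `surface` in `text`."""
    start = text.index(surface)
    return {"start": start, "end": start + len(surface),
            "label": label, "score": score}

text = "Ahmed Karray dirige la STEG à Tunis."
spans = [
    make_span(text, "Ahmed Karray", "person", 0.97),  # scores are made up
    make_span(text, "STEG", "organization", 0.88),
    make_span(text, "Tunis", "location", 0.93),
    make_span(text, "dirige", "organization", 0.12),  # below threshold
]
for token, tag in spans_to_bio(text, spans):
    print(token, tag)
```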
BIO format example:
Ahmed B-PER
Karray I-PER
dirige O
la O
STEG B-ORG
à O
Tunis B-LOC
. O
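A file in this format can be read back with a few lines of Python. The sketch below assumes one `token label` pair per line and a blank line between sentences, which is the usual CoNLL convention but is not stated in this README; `read_conll` is a hypothetical helper name.

```python
def read_conll(text):
    """Parse 'token label' lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:  # close the sentence in progress
                sentences.append(current)
                current = []
            continue
        # The label is the last whitespace-separated field on the line.
        token, label = line.rsplit(maxsplit=1)
        current.append((token, label))
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences

sample = """Ahmed B-PER
Karray I-PER
dirige O
la O
STEG B-ORG
à O
Tunis B-LOC
. O

Tunisair B-ORG
recrute O
. O
"""
for sentence in read_conll(sample):
    print(sentence)
```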
Label meaning:
| Label | Meaning |
|---|---|
| B-PER | Beginning of a person name |
| I-PER | Inside (continuation) of a person name |
| B-LOC | Beginning of a location |
| I-LOC | Inside a location |
| B-ORG | Beginning of an organization |
| I-ORG | Inside an organization |
| O | Not an entity |
Step 3 — Data Validation & Fixing
The annotated .conll file was validated and fixed using custom scripts:
- validate_conll.py: checks for format errors, unknown labels, and BIO inconsistencies. Generates a visual report showing entity distribution, sentence length histogram, and top entities per type.
- fix_bio.py: automatically fixes consecutive B-X B-X sequences into correct B-X I-X I-X tagging.
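The two checks could be approximated as below. This is a sketch of the idea, not the contents of validate_conll.py or fix_bio.py: `find_bio_errors` and `fix_consecutive_b` are hypothetical names, and the reporting features are omitted.

```python
LABELS = {"O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"}

def find_bio_errors(tags):
    """Return (index, message) pairs for unknown labels and orphan I- tags."""
    errors = []
    previous = "O"
    for i, tag in enumerate(tags):
        if tag not in LABELS:
            errors.append((i, f"unknown label {tag!r}"))
        elif tag.startswith("I-") and previous[2:] != tag[2:]:
            errors.append((i, f"{tag} does not continue a {tag[2:]} entity"))
        previous = tag
    return errors

def fix_consecutive_b(tags):
    """Rewrite a B-X that directly follows B-X or I-X of the same type as I-X."""
    fixed = []
    for tag in tags:
        if tag.startswith("B-") and fixed and fixed[-1][2:] == tag[2:]:
            fixed.append("I-" + tag[2:])
        else:
            fixed.append(tag)
    return fixed

print(fix_consecutive_b(["B-PER", "B-PER", "B-PER", "O", "B-LOC"]))
# ['B-PER', 'I-PER', 'I-PER', 'O', 'B-LOC']
print(find_bio_errors(["O", "I-ORG", "B-FOO"]))
```

One consequence of this rule is that two distinct entities of the same type that are directly adjacent are merged into one.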
Step 4 — Dataset Split
The validated dataset was split into 3 subsets:
| Split | Size | Purpose |
|---|---|---|
| train.conll | 80% | Model learns from this |
| dev.conll | 10% | Monitors progress during training |
| test.conll | 10% | Final evaluation only |
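
An 80/10/10 split at sentence level can be sketched as follows. Whether the original split was shuffled or seeded is not stated, so the shuffle, the seed value, and the `split_dataset` name are all assumptions.

```python
import random

def split_dataset(sentences, train=0.8, dev=0.1, seed=42):
    """Shuffle sentences reproducibly and cut them into train/dev/test lists."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # local RNG, global state untouched
    n_train = int(len(items) * train)
    n_dev = int(len(items) * dev)
    train_set = items[:n_train]
    dev_set = items[n_train:n_train + n_dev]
    test_set = items[n_train + n_dev:]  # whatever remains
    return train_set, dev_set, test_set

# Stand-in data: 100 sentence ids instead of real annotated sentences.
train_set, dev_set, test_set = split_dataset(range(100))
print(len(train_set), len(dev_set), len(test_set))
# 80 10 10
```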
Step 5 — Model Fine-tuning
The model was fine-tuned on Google Colab (free T4 GPU) using the HuggingFace Trainer API.
Base model: camembert-base
- Pre-trained on 138GB of French text
- A stable, widely used baseline that is well suited to French NER
- Fully compatible with HuggingFace standard classes
Training configuration:
| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch size | 16 |
| Learning rate | 3e-5 |
| Scheduler | Cosine |
| Max length | 128 tokens |
| Mixed precision | fp16 |
| Best model metric | F1 score |
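
The table maps onto Transformers keyword arguments roughly as follows. This is a sketch, not the original training script: the output directory is hypothetical, it is assumed that the batch size is per device, and it is assumed that the metric function reports its score under the key "f1".

```python
# Keyword arguments for transformers.TrainingArguments mirroring the table.
training_kwargs = {
    "output_dir": "tun-ner-camembert",   # hypothetical local path
    "num_train_epochs": 10,
    "per_device_train_batch_size": 16,   # assumes "batch size" is per device
    "learning_rate": 3e-5,
    "lr_scheduler_type": "cosine",
    "fp16": True,                        # mixed precision, needs a CUDA GPU
    "metric_for_best_model": "f1",       # assumes compute_metrics returns "f1"
}

# The 128-token limit is applied when tokenizing, not in TrainingArguments.
tokenizer_kwargs = {
    "truncation": True,
    "max_length": 128,
    "is_split_into_words": True,         # inputs are pre-split word lists
}

# Usage (needs transformers and torch on a GPU machine):
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
```

Keeping the best checkpoint by F1 additionally requires `load_best_model_at_end=True` together with matching evaluation and save strategies. The evaluation-strategy keyword has been renamed across Transformers versions, so it is left out of this sketch.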
Tokenization strategy:
- Words are pre-split by space before tokenization
- Sub-tokens generated by CamemBERT tokenizer are aligned to their original word
- Only the first sub-token of each word receives the real label
- Other sub-tokens receive label -100 (ignored in loss computation)
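The alignment rule above can be sketched as a small function. It takes the list that a fast tokenizer's `word_ids()` method returns (one word index per sub-token, `None` for special tokens). To stay runnable without downloading a tokenizer, the example uses hand-written word ids, and the label ids follow a hypothetical label-to-id mapping.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def align_labels(word_ids, word_label_ids):
    """Give each word's label to its first sub-token and -100 to the rest."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:            # special tokens such as <s> and </s>
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:      # first sub-token of a new word
            aligned.append(word_label_ids[word_id])
        else:                          # continuation sub-token
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned

# Illustrative sub-tokens: <s>, "Ahmed", "Kar", "ray", "dirige", </s>
word_ids = [None, 0, 1, 1, 2, None]
word_label_ids = [1, 2, 0]  # B-PER, I-PER, O under a hypothetical mapping
print(align_labels(word_ids, word_label_ids))
# [-100, 1, 2, -100, 0, -100]
```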
Step 6 — Model Upload to HuggingFace Hub
After training, the model and tokenizer were pushed to HuggingFace Hub using trainer.push_to_hub():
https://huggingface.co/NourBesrour/tun-ner-camembert
Step 7 — Python Library Packaging
The model was wrapped in a clean Python API and published to PyPI as tun-camembert-ner.
Library structure:
tun-camembert-ner/
├── tunisian_ner/
│ ├── __init__.py ← from tunisian_ner import NER
│ └── ner.py ← NER class
├── pyproject.toml ← package metadata
├── README.md
└── LICENSE
Install:
pip install tun-camembert-ner
Results
Results on the test set after 10 epochs of fine-tuning:
| Entity | F1 Score |
|---|---|
| PER | ~0.99 |
| LOC | ~0.99 |
| ORG | ~0.99 |
| Overall | ~0.99 |
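
The README does not say how these F1 scores were computed. A common choice for NER is entity-level exact-match F1, where a prediction counts only if both its boundaries and its type match the reference. The sketch below illustrates that definition on toy data; `extract_spans` and `entity_f1` are hypothetical names, and orphan I- tags are simply ignored here, which may differ from other evaluation tools.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) spans encoded by a BIO tag list."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a last span
        continues = tag.startswith("I-") and tag[2:] == kind
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans

def entity_f1(gold_sentences, pred_sentences):
    """Micro-averaged exact-match F1 over all sentences."""
    true_pos = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        true_pos += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: the prediction misses the final LOC entity.
gold = [["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG", "O", "O"]]
print(round(entity_f1(gold, pred), 3))
# 0.8
```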
Contributing
Contributions are welcome! Here's how to contribute:
- Fork the repository
- Create a new branch: git checkout -b feature/my-feature
- Make your changes and commit: git commit -m "add my feature"
- Push to your branch: git push origin feature/my-feature
- Open a Pull Request
Ideas for contributions:
- Add more annotated training data
- Support Arabic script input
- Add confidence threshold filtering
- Improve entity boundary detection
License
This project is licensed under the MIT License — see the LICENSE file for details.
Citation
If you use this library in your research, please cite:
@software{besrour2025tunner,
author = {Nour Besrour},
title = {tun-camembert-ner: Tunisian NER for French text},
year = {2025},
publisher = {PyPI},
url = {https://pypi.org/project/tun-camembert-ner/}
}
Acknowledgements
- CamemBERT — French BERT model by Inria
- HuggingFace Transformers — model training and inference
- Google Colab — free GPU for training
- Tunisian news websites for providing the raw text data