Tunisian Named Entity Recognition for French text using CamemBERT

🇹🇳 tun-camembert-ner

Tunisian Named Entity Recognition for French text, powered by a fine-tuned CamemBERT model.

📖 Table of Contents

Overview
Features
Installation
Quick Start
Usage
Model
Project Pipeline
Dataset
Training
Results
Contributing
License
Citation
Acknowledgements

Overview

tun-camembert-ner is an open-source Python library for Named Entity Recognition (NER) in Tunisian French text. It detects and classifies named entities into three categories:

Entity	Description	Example
`PER`	Person names	`Ahmed Karray`, `Samir Saied`
`LOC`	Cities, regions, countries	`Tunis`, `Sfax`, `Monastir`
`ORG`	Organizations, companies, institutions	`STEG`, `Tunisair`, `BIAT`

The model builds on CamemBERT (camembert-base), a French language model based on the RoBERTa architecture and pre-trained on 138 GB of French text. It was then fine-tuned on a custom Tunisian French NER dataset.

Features

✅ Detects PER, LOC, and ORG entities in French Tunisian text
✅ Word-level output: entities are returned as whole words, not sub-token fragments
✅ Built on CamemBERT — optimized for French
✅ Simple and clean Python API
✅ CPU and GPU support
✅ Lightweight and easy to install

Installation

pip install tun-camembert-ner

Requirements:

Python 3.8+
PyTorch 2.0+
Transformers 4.40+

Quick Start

from tunisian_ner import NER

ner = NER()

result = ner("Ahmed Karray dirige la STEG à Tunis.")
print(result)

Output:

[
  {"word": "Ahmed Karray", "entity_group": "PER", "score": 0.999},
  {"word": "STEG",         "entity_group": "ORG", "score": 0.997},
  {"word": "Tunis",        "entity_group": "LOC", "score": 0.999}
]

Usage

Basic usage

from tunisian_ner import NER

ner = NER()

result = ner("Le ministre Samir Saied a visité Sfax hier.")
print(result)

Multiple sentences

# Reuses the `ner` instance created in the basic usage example.
sentences = [
    "Ahmed Karray dirige la STEG à Tunis.",
    "Tunisair a annoncé de nouveaux vols vers Paris.",
    "Fatma Mseddi représente Ennahdha à Monastir.",
    "La BIAT a ouvert une nouvelle agence à Sousse.",
]

for sent in sentences:
    print(f"\n📝 {sent}")
    for ent in ner(sent):
        print(f"   → {ent['word']:<25} {ent['entity_group']}  ({ent['score']})")

Filter by entity type

# Reuses the `ner` instance created in the basic usage example.
text = "Riadh Bettaieb a signé un accord à Sousse avec la BIAT."
entities = ner(text)

persons   = [e for e in entities if e["entity_group"] == "PER"]
locations = [e for e in entities if e["entity_group"] == "LOC"]
orgs      = [e for e in entities if e["entity_group"] == "ORG"]

print(f"Persons   : {[e['word'] for e in persons]}")
print(f"Locations : {[e['word'] for e in locations]}")
print(f"Orgs      : {[e['word'] for e in orgs]}")
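The same filtering can be done in one pass, together with a minimum confidence. The helper below is a small sketch that relies only on the output format shown in Quick Start (a list of dicts with `word`, `entity_group`, and `score`); `group_entities` and `min_score` are hypothetical names, not part of the library.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.0):
    """Group entity dicts by type, dropping those scored below min_score."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Sample input copied from the Quick Start output.
sample = [
    {"word": "Ahmed Karray", "entity_group": "PER", "score": 0.999},
    {"word": "STEG", "entity_group": "ORG", "score": 0.997},
    {"word": "Tunis", "entity_group": "LOC", "score": 0.999},
]

print(group_entities(sample, min_score=0.998))
```

With real data, pass `ner(text)` instead of `sample`.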

Use a custom model

ner = NER(model="your-username/your-custom-model")

Model

The model is hosted on HuggingFace Hub:

🤗 NourBesrour/tun-ner-camembert

Property	Value
Base model	`camembert-base`
Task	Token Classification (NER)
Language	French (Tunisian)
Labels	`O`, `B-PER`, `I-PER`, `B-LOC`, `I-LOC`, `B-ORG`, `I-ORG`
Training epochs	10
Max sequence length	128 tokens
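Token-classification models store their labels as integer IDs. The sketch below builds the two lookup tables from the label list in the table above. The ordering is an assumption made for illustration; the authoritative mapping is whatever the published model's configuration defines.

```python
# Label list taken from the table above; the ID order is assumed.
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

print(label2id["B-ORG"], id2label[0])
```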

Project Pipeline

This library is the result of an end-to-end NLP pipeline, from data collection to packaging. The steps were:

Step 1 — Data Collection (Web Scraping)

Tunisian French text was collected from multiple online sources using BeautifulSoup and Requests.

Sources used:

Category	Sites
General news	`lapresse.tn`, `kapitalis.com`, `leaders.com.tn`, `webdo.tn`
Economy	`businessnews.com.tn`, `ilboursa.com`, `leaders.com.tn`
Politics	`realites.com.tn`, `tap.info.tn`, `presidency.tn`
Sport	`sport.tn`, `tunisiesport.net`
Tech	`tunisienumerique.com`, `tekiano.com`
Wikipedia	`fr.wikipedia.org` (Tunisia articles)

The scraper collected raw French sentences from article paragraphs, filtered to keep only sentences with at least 3 words.
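The scraper itself is not shown here. As a rough sketch of the filtering step only, the function below splits already-extracted paragraph strings into sentences and keeps those with at least three words. The regex-based splitter and the name `extract_sentences` are assumptions, not the project's actual code.

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def extract_sentences(paragraphs, min_words=3):
    """Split paragraphs into sentences and keep those with >= min_words words."""
    sentences = []
    for paragraph in paragraphs:
        for sentence in _SENTENCE_END.split(paragraph.strip()):
            sentence = " ".join(sentence.split())  # normalize whitespace
            if len(sentence.split()) >= min_words:
                sentences.append(sentence)
    return sentences


paragraphs = [
    "Tunisair a annoncé de nouveaux vols. Bravo !  La BIAT ouvre une agence à Sousse.",
]
for sentence in extract_sentences(paragraphs):
    print(sentence)
```

A naive splitter like this breaks on abbreviations such as "M." or "etc."; a dedicated sentence tokenizer would be more robust.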

Step 2 — Data Annotation

Raw sentences were annotated in BIO (Beginning-Inside-Outside) format using GLiNER (urchade/gliner_multi-v2.1), a zero-shot Named Entity Recognition model that runs fully locally — no API key needed.

GLiNER predicts entity spans directly from raw text, then the results are automatically converted to BIO format. A threshold of 0.4 was used to filter low-confidence predictions.

How it works:

Input: raw French sentence
GLiNER detects spans for person, location, organization
Spans are mapped to word-level BIO labels
Output: .conll file ready for training
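The span-to-BIO mapping can be sketched as follows. The sketch assumes each predicted span is a dict with character offsets `start` and `end`, a `label`, and a `score`, which is the shape GLiNER's span predictions are assumed to take here. The tokenizer regex, the `TYPE_TO_TAG` table, and the scores in the example are illustrative, not the project's actual annotation script.

```python
import re

# Assumed mapping from GLiNER prompt labels to the dataset's tag names.
TYPE_TO_TAG = {"person": "PER", "location": "LOC", "organization": "ORG"}

# Words and single punctuation marks, so "Tunis." yields "Tunis" and ".".
_TOKEN = re.compile(r"\w+|[^\w\s]")


def spans_to_bio(text, spans, threshold=0.4):
    """Convert character-level entity spans into (token, BIO label) pairs."""
    tokens = [(m.group(), m.start(), m.end()) for m in _TOKEN.finditer(text)]
    labels = ["O"] * len(tokens)
    for span in spans:
        # Drop low-confidence predictions and unknown entity types.
        if span["score"] < threshold or span["label"] not in TYPE_TO_TAG:
            continue
        tag = TYPE_TO_TAG[span["label"]]
        inside = [
            i for i, (_, start, end) in enumerate(tokens)
            if start >= span["start"] and end <= span["end"]
        ]
        for position, i in enumerate(inside):
            labels[i] = ("B-" if position == 0 else "I-") + tag
    return [(token, label) for (token, _, _), label in zip(tokens, labels)]


text = "Ahmed Karray dirige la STEG à Tunis."
spans = [  # hand-written spans with made-up scores, for illustration only
    {"start": 0, "end": 12, "label": "person", "score": 0.98},
    {"start": 23, "end": 27, "label": "organization", "score": 0.91},
    {"start": 30, "end": 35, "label": "location", "score": 0.95},
]
for token, label in spans_to_bio(text, spans):
    print(f"{token}\t{label}")
```

Writing one `token<TAB>label` pair per line, with a blank line between sentences, gives the `.conll` layout shown in the BIO example.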

BIO format example:

Ahmed       B-PER
Karray      I-PER
dirige      O
la          O
STEG        B-ORG
à           O
Tunis       B-LOC
.           O

Label meaning:

Label	Meaning
`B-PER`	Beginning of a person name
`I-PER`	Inside (continuation) of a person name
`B-LOC`	Beginning of a location
`I-LOC`	Inside a location
`B-ORG`	Beginning of an organization
`I-ORG`	Inside an organization
`O`	Not an entity

Step 3 — Data Validation & Fixing

The annotated .conll file was validated and fixed using custom scripts:

validate_conll.py: checks for format errors, unknown labels, and BIO inconsistencies. It also generates a visual report showing the entity distribution, a sentence-length histogram, and the top entities per type.
fix_bio.py: automatically rewrites runs of consecutive B-X tags (B-X B-X B-X) into the correct B-X I-X I-X tagging.
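The two scripts are not reproduced here. The sketch below shows one way the checks and the repair described above could look for a single sentence's label sequence; `find_bio_errors` and `fix_consecutive_b` are hypothetical names, and the real scripts may apply different rules.

```python
VALID_LABELS = {"O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"}


def find_bio_errors(labels):
    """Return (index, message) pairs for unknown labels and orphan I- tags."""
    errors = []
    previous = "O"
    for i, label in enumerate(labels):
        if label not in VALID_LABELS:
            errors.append((i, f"unknown label {label!r}"))
        elif label.startswith("I-") and previous[2:] != label[2:]:
            # An I-X tag must follow a B-X or I-X tag of the same type.
            errors.append((i, f"{label} does not continue a {label[2:]} entity"))
        previous = label
    return errors


def fix_consecutive_b(labels):
    """Turn a B-X that directly follows another B-X into I-X."""
    fixed = []
    for i, label in enumerate(labels):
        if i > 0 and label.startswith("B-") and labels[i - 1] == label:
            fixed.append("I-" + label[2:])
        else:
            fixed.append(label)
    return fixed


print(fix_consecutive_b(["B-PER", "B-PER", "B-PER", "O", "B-LOC"]))
print(find_bio_errors(["O", "I-PER", "B-XYZ"]))
```

One trade-off of this repair: two genuinely distinct entities of the same type that sit next to each other are merged into one.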

Step 4 — Dataset Split

The validated dataset was split into 3 subsets:

Split	Size	Purpose
`train.conll`	80%	Model learns from this
`dev.conll`	10%	Monitors progress during training
`test.conll`	10%	Final evaluation only
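An 80/10/10 split at the sentence level can be sketched as below. Whether the original split was shuffled, and with which seed, is not stated; the shuffle and `seed=42` are assumptions.

```python
import random


def split_dataset(sentences, train=0.8, dev=0.1, seed=42):
    """Shuffle sentences and split them into train/dev/test subsets."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train)
    n_dev = round(len(shuffled) * dev)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],  # remainder goes to the test set
    )


train_set, dev_set, test_set = split_dataset(range(100))
print(len(train_set), len(dev_set), len(test_set))
```

Splitting whole sentences, rather than individual token lines, keeps each sentence's BIO sequence intact.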

Step 5 — Model Fine-tuning

The model was fine-tuned on Google Colab (free T4 GPU) using the HuggingFace Trainer API.

Base model: camembert-base

Pre-trained on 138GB of French text
Well suited to French NER, since it was pre-trained on French text only
Fully compatible with HuggingFace standard classes

Training configuration:

Parameter	Value
Epochs	10
Batch size	16
Learning rate	3e-5
Scheduler	Cosine
Max length	128 tokens
Mixed precision	fp16
Best model metric	F1 score
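To make the scheduler row concrete, the sketch below computes the learning rate under a cosine decay from the base rate of 3e-5 down to zero. It assumes no warmup and a single half-cosine over the whole run; warmup settings are not stated above, and the step counts are illustrative.

```python
import math


def cosine_lr(step, total_steps, base_lr=3e-5):
    """Learning rate at `step` under cosine decay from base_lr to 0."""
    progress = min(step, total_steps) / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # illustrative number of optimizer steps
for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step, total):.2e}")
```

The rate starts at the base value, reaches half of it at the midpoint, and decays smoothly toward zero at the end of training.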

Tokenization strategy:

Words are pre-split by space before tokenization
Sub-tokens generated by CamemBERT tokenizer are aligned to their original word
Only the first sub-token of each word receives the real label
Other sub-tokens receive label -100 (ignored in loss computation)
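The alignment rule above can be sketched as a small function. It takes the per-token word indices that a fast Hugging Face tokenizer exposes through `word_ids()` when called with `is_split_into_words=True` (with `None` for special tokens). The function name and the example `word_ids` list are hypothetical; the actual training script is not shown here.

```python
IGNORE_INDEX = -100  # label value ignored by the loss


def align_labels(word_ids, word_labels, label2id):
    """Give each word's first sub-token its label ID; mask everything else."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token
        elif word_id != previous_word:
            aligned.append(label2id[word_labels[word_id]])  # first sub-token
        else:
            aligned.append(IGNORE_INDEX)  # later sub-token of the same word
        previous_word = word_id
    return aligned


label2id = {"O": 0, "B-PER": 1, "I-PER": 2}
word_labels = ["B-PER", "I-PER", "O"]  # Ahmed, Karray, dirige
word_ids = [None, 0, 1, 1, 2, None]    # assumes "Karray" splits into two pieces
print(align_labels(word_ids, word_labels, label2id))
```

Masking the later sub-tokens means each word contributes exactly once to the loss, regardless of how many pieces the tokenizer splits it into.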

Step 6 — Model Upload to HuggingFace Hub

After training, the model and tokenizer were pushed to HuggingFace Hub using trainer.push_to_hub():

https://huggingface.co/NourBesrour/tun-ner-camembert

Step 7 — Python Library Packaging

The model was wrapped in a clean Python API and published to PyPI as tun-camembert-ner.

Library structure:

tun-camembert-ner/
├── tunisian_ner/
│   ├── __init__.py       ← from tunisian_ner import NER
│   └── ner.py            ← NER class
├── pyproject.toml        ← package metadata
├── README.md
└── LICENSE

Install:

pip install tun-camembert-ner

Results

Results on the held-out test set (test.conll) after 10 epochs of fine-tuning:

Entity	F1 Score
PER	~0.99
LOC	~0.99
ORG	~0.99
Overall	~0.99
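How these scores were computed is not stated. NER results are commonly reported as entity-level micro F1, where a predicted entity counts only if its type and boundaries both match the reference. The sketch below shows that computation under this assumption; the function names are hypothetical.

```python
def extract_entities(labels):
    """Return the set of (type, start, end) spans in one BIO sequence."""
    entities, start, kind = set(), None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes the last span
        continues = label.startswith("I-") and label[2:] == kind
        if kind is not None and not continues:
            entities.add((kind, start, i))
            start, kind = None, None
        if label.startswith("B-") or (label.startswith("I-") and kind is None):
            start, kind = i, label[2:]
    return entities


def entity_f1(gold_sequences, predicted_sequences):
    """Micro-averaged entity-level F1 over parallel lists of BIO sequences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sequences, predicted_sequences):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(round(entity_f1(gold, pred), 3))
```

In the example, the model finds one of the two reference entities and predicts nothing spurious, so precision is 1.0 and recall is 0.5.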

Contributing

Contributions are welcome! Here's how to contribute:

Fork the repository
Create a new branch: git checkout -b feature/my-feature
Make your changes and commit: git commit -m "add my feature"
Push to your branch: git push origin feature/my-feature
Open a Pull Request

Ideas for contributions:

Add more annotated training data
Support Arabic script input
Add confidence threshold filtering
Improve entity boundary detection

License

This project is licensed under the MIT License — see the LICENSE file for details.

Citation

If you use this library in your research, please cite:

@software{besrour2025tunner,
  author    = {Nour Besrour},
  title     = {tun-camembert-ner: Tunisian NER for French text},
  year      = {2025},
  publisher = {PyPI},
  url       = {https://pypi.org/project/tun-camembert-ner/}
}

Acknowledgements

CamemBERT — French BERT model by Inria
HuggingFace Transformers — model training and inference
Google Colab — free GPU for training
Tunisian news websites for providing the raw text data
