Linguistic Resources and Models for Moroccan Darija and Arabic — Building Moroccan AI, one word at a time.
moroccan_nlp
Natural Language Processing: Linguistic Resources and Models for Moroccan Darija and Arabic
DarijaBERT · Baseline Classifier · Linguistic Corpora · AI for Under-Resourced Languages
📌 Overview
moroccan_nlp is a comprehensive project dedicated to developing linguistic resources and Natural Language Processing (NLP) models for Moroccan Darija and Arabic. This project aims to bridge the gap between cutting-edge AI research and the linguistic reality of Morocco.
"Building Moroccan AI, one word at a time."
🗂️ Table of Contents
- Overview
- Key Features
- Core Model: DarijaBERT
- Datasets
- Model Performance
- Project Structure
- Quick Start
- Installation
- Usage Examples
- Platforms & Mirrors
- Clone & Download
- Citation
- License
- Author
✨ Key Features
- DarijaBERT Integration: First BERT model for Moroccan Darija (0.2B parameters, ~100M tokens)
- Baseline Classifier: Keyword-based classification that labels all 8 test samples correctly (100% accuracy on that set)
- Linguistic Resources: Curated datasets for Darija and Arabic
- Open Source: MIT licensed, available on PyPI
- Reproducible Research: Paper, data, and preregistration archived on Zenodo, OSF, and the Internet Archive
🧠 Core Model: DarijaBERT
DarijaBERT is the first open-source BERT model for the Moroccan Arabic dialect, developed by AIOX Lab & SI2M Lab (INSEA).
| Property | Value |
|---|---|
| Architecture | BERT-base (without NSP) |
| Model Size | 0.2B parameters |
| Training Data | ~3M sequences, 691MB, ~100M tokens |
| Sources | Stories, YouTube comments, Tweets |
| Vocabulary Size | 80,000 |
| Monthly Downloads | 1,296 |
| License | Research use only (contact: dbert@aiox-labs.com) |
Loading the Model
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("SI2M-Lab/DarijaBERT")
model = AutoModel.from_pretrained("SI2M-Lab/DarijaBERT")
Fill-Mask Example
from transformers import pipeline
unmasker = pipeline('fill-mask', model='SI2M-Lab/DarijaBERT')
results = unmasker("اشنو [MASK] ليك")
print(results)
Citation
@article{gaanoun2023darijabert,
title={Darijabert: a Step Forward in Nlp for the Written Moroccan Dialect},
author={Gaanoun, Kamel and Naira, Abdou Mohamed and Allak, Anass and Benelallam, Imade},
year={2023}
}
📊 Datasets
Current Datasets
| Dataset | Samples | Domains | Format |
|---|---|---|---|
| Darija Corpus | 8 | 7 (technology, economy, linguistics, policy, law, education, health) | JSON |
Planned Datasets
- DODa (Darija Open Dataset): 100,000+ entries
- Atlaset: 1.13GB of Darija text
- GOUD.MA: 50,000+ news articles
📈 Model Performance
Baseline Classifier (v6)
| Metric | Value |
|---|---|
| Accuracy | 100% (8/8 samples) |
| Domains | 7 |
| Method | Keyword-based classification |
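The baseline is described here only as keyword-based, and `scripts/train_baseline_v6.py` is not reproduced on this page. As a rough illustration of the general technique, the sketch below scores a text against one keyword set per domain and picks the domain with the most hits. The keyword lists, the `classify` and `accuracy` function names, and the `"unknown"` fallback are all assumptions made for this example, not the project's actual lexicon or code.

```python
from collections import Counter

# Hypothetical keyword sets, one per domain listed for the Darija Corpus.
# These English placeholders are illustrative only, not the project's lexicon.
DOMAIN_KEYWORDS = {
    "technology": {"software", "internet", "computer"},
    "economy": {"market", "price", "bank"},
    "linguistics": {"dialect", "grammar", "vocabulary"},
    "policy": {"government", "reform", "ministry"},
    "law": {"court", "judge", "contract"},
    "education": {"school", "teacher", "student"},
    "health": {"hospital", "doctor", "vaccine"},
}


def classify(text, keywords=DOMAIN_KEYWORDS, default="unknown"):
    """Return the domain whose keywords occur most often in `text`."""
    tokens = text.lower().split()
    hits = Counter()
    for domain, words in keywords.items():
        hits[domain] = sum(1 for token in tokens if token in words)
    domain, count = hits.most_common(1)[0]
    # No keyword matched at all: fall back instead of guessing a domain.
    return domain if count > 0 else default


def accuracy(samples):
    """Fraction of (text, label) pairs that `classify` labels correctly."""
    correct = sum(1 for text, label in samples if classify(text) == label)
    return correct / len(samples)
```

In this sketch, ties go to whichever domain appears first in the dictionary, and whitespace tokenization is a simplification that real Darija text would likely need to replace.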
DarijaBERT Test Results
Tested on Fill-Mask task using Google Colab:
| Sentence | Top Predictions (Score) |
|---|---|
| "المغاربة سبوعة و [MASK]" | 1. رجالة (0.3140), 2. جوالة (0.1802), 3. نمورة (0.0361) |
| "الدارجة هي لهجة [MASK]" | 1. عربية (0.4521), 2. أمازيغية (0.1345), 3. ريفية (0.0234) |
| "المغرب بلد [MASK]" | 1. إفريقي (0.5200), 2. أوروبي (0.1800), 3. أمريكي (0.0500) |
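One way to produce rows in the format used for these results is sketched below. It relies on the `transformers` fill-mask pipeline returning a list of dictionaries with `token_str` and `score` keys, and on the pipeline's `top_k` call argument. The helper names `format_predictions` and `top_predictions` are invented for this example, and nothing here is confirmed to be the script used for the reported test.

```python
def format_predictions(results, k=3):
    """Render fill-mask results as '1. token (0.1234), 2. ...'."""
    top = sorted(results, key=lambda r: r["score"], reverse=True)[:k]
    return ", ".join(
        f"{rank}. {r['token_str']} ({r['score']:.4f})"
        for rank, r in enumerate(top, start=1)
    )


def top_predictions(sentence, k=3, model_name="SI2M-Lab/DarijaBERT"):
    """Query the model; needs `transformers` installed and network access."""
    from transformers import pipeline  # imported lazily: heavy dependency

    unmasker = pipeline("fill-mask", model=model_name)
    return format_predictions(unmasker(sentence, top_k=k), k)
```

Calling `top_predictions("المغرب بلد [MASK]")` downloads the model on first use. Scores may differ slightly from the reported ones depending on library versions.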
📁 Project Structure
moroccan_nlp/
│
├── DATA/ # Raw and processed datasets
│ ├── raw/ # Original data
│ └── processed/ # Cleaned data
│
├── MODELS/ # NLP models
│ └── DarijaBERT/ # DarijaBERT integration
│ ├── load_model.py # Model loading script
│ └── results.txt # Test results
│
├── scripts/ # Utility scripts
│ ├── train_baseline_v6.py # Baseline classifier
│ ├── preprocess_light.py # Data preprocessing
│ └── load_data.py # Data loading
│
├── ANALYSIS/ # Data analysis notebooks
├── PUBLICATION/ # Research papers
├── REPORTS/ # Progress reports
├── VALIDATION/ # Model validation
├── docs/ # Technical documentation
├── README.md # This file
└── requirements.txt # Python dependencies
🚀 Quick Start
Installation
# Install from PyPI
pip install moroccan-nlp
# Install from source
git clone https://github.com/gitdeeper13/moroccan_nlp.git
cd moroccan_nlp
pip install -e .
Minimal Example
from transformers import AutoTokenizer, AutoModel
# Load DarijaBERT
tokenizer = AutoTokenizer.from_pretrained("SI2M-Lab/DarijaBERT")
model = AutoModel.from_pretrained("SI2M-Lab/DarijaBERT")
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Model parameters: {model.num_parameters():,}")
Run Baseline Classifier
python scripts/train_baseline_v6.py
📦 Installation
# Install the package
pip install moroccan-nlp
# Clone the repository
git clone https://github.com/gitdeeper13/moroccan_nlp.git
cd moroccan_nlp
# Install dependencies
pip install -r requirements.txt
Requirements: Python 3.11+, PyTorch 2.4+, transformers, numpy, pandas
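Before running the examples, it can help to confirm that the interpreter and packages listed above are present. The following is a minimal sketch using only the standard library. The `check_environment` name is made up for this example, and it only checks that each package can be found, not that PyTorch meets the 2.4+ version requirement.

```python
import sys
from importlib.util import find_spec

# Import names for the packages listed above (PyTorch is imported as `torch`).
REQUIRED_MODULES = ("torch", "transformers", "numpy", "pandas")


def check_environment(min_python=(3, 11), modules=REQUIRED_MODULES):
    """Report whether the interpreter and the listed packages look usable."""
    report = {"python_ok": sys.version_info >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'ok' if ok else 'missing'}")
```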
🧩 Usage Examples
Example 1: Load DarijaBERT
from transformers import AutoTokenizer, AutoModel, pipeline
# Load model
tokenizer = AutoTokenizer.from_pretrained("SI2M-Lab/DarijaBERT")
model = AutoModel.from_pretrained("SI2M-Lab/DarijaBERT")
# Fill-Mask example
unmasker = pipeline('fill-mask', model='SI2M-Lab/DarijaBERT')
results = unmasker("اشنو [MASK] ليك")
for r in results:
    print(f"{r['sequence']} (score: {r['score']:.4f})")
Example 2: Load Dataset
import json
with open('DATA/raw/darija_corpus.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
samples = data['samples']
print(f"Loaded {len(samples)} samples")
# Display first sample
print(samples[0])
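Once the samples are loaded, a quick per-domain count is a useful sanity check. The sketch below assumes each sample is a dictionary with a `domain` field. That field name is a guess, since the corpus schema is not documented here, so adjust `key` to match the actual file. The stand-in records are invented for the example.

```python
from collections import Counter


def count_by_domain(samples, key="domain"):
    """Count samples per domain; entries lacking `key` go under 'unlabeled'."""
    return Counter(sample.get(key, "unlabeled") for sample in samples)


# Stand-in records for illustration; the real file's schema may differ.
example_samples = [
    {"text": "sample text", "domain": "health"},
    {"text": "sample text", "domain": "law"},
    {"text": "sample text", "domain": "health"},
    {"text": "sample text"},
]

for domain, n in count_by_domain(example_samples).most_common():
    print(f"{domain}: {n}")
```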
Example 3: Run Baseline Classifier
python scripts/train_baseline_v6.py
🌐 Platforms & Mirrors
| Platform | URL | Role |
|---|---|---|
| 🐙 GitHub (Primary) | github.com/gitdeeper13/moroccan_nlp | Source code, issues, PRs |
| 🦊 GitLab (Mirror) | gitlab.com/gitdeeper/moroccan-nlp | CI/CD mirror |
| 🪣 Bitbucket (Mirror) | bitbucket.org/gitdeeper-13/moroccan_nlp | Enterprise mirror |
| 🏔️ Codeberg (Mirror) | codeberg.org/gitdeeper13/moroccan_nlp | Open-source community |
| 📦 PyPI | pypi.org/project/moroccan-nlp/ | Python package distribution |
| 🔬 Zenodo | doi.org/10.5281/zenodo.21154423 | Citable DOI, paper & data |
| 📋 OSF Project | osf.io/7szak | Research project registry |
| 📝 OSF Preregistration | doi.org/10.17605/OSF.IO/SXGC6 | Pre-registered study protocol |
| 🌐 Website | moroccan-nlp.netlify.app | Live documentation & dashboard |
| 🧑‍🔬 ORCID | orcid.org/0009-0003-8903-0029 | Researcher identity |
| 🗄️ Internet Archive | archive.org/details/osf-registrations-moroccan-nlp | Permanent archival copy |
🌐 Official Website Pages
| Page | URL |
|---|---|
| Homepage | moroccan-nlp.netlify.app |
| Documentation | moroccan-nlp.netlify.app/documentation |
| Dashboard | moroccan-nlp.netlify.app/dashboard |
| Reports | moroccan-nlp.netlify.app/reports |
🔄 Clone & Download
Git Clone
# GitHub (Primary)
git clone https://github.com/gitdeeper13/moroccan_nlp.git
# GitLab (Mirror)
git clone https://gitlab.com/gitdeeper/moroccan-nlp.git
# Bitbucket (Mirror)
git clone https://bitbucket.org/gitdeeper-13/moroccan_nlp.git
# Codeberg (Mirror)
git clone https://codeberg.org/gitdeeper13/moroccan_nlp.git
Direct ZIP Download
| Source | Link |
|---|---|
| GitHub | moroccan_nlp-main.zip |
| GitLab | moroccan-nlp-main.zip |
| Bitbucket | moroccan_nlp-main.zip |
| Codeberg | moroccan_nlp-main.zip |
| PyPI files | pypi.org/project/moroccan-nlp/#files |
| Zenodo record | doi.org/10.5281/zenodo.21154423 |
📖 Citation
If moroccan_nlp contributes to your research, please cite using one of the following formats.
📦 PyPI Package
@software{baladi2026moroccan_nlp_pypi,
author = {Baladi, Samir},
title = {{moroccan_nlp}: Linguistic Resources and Models for Moroccan Darija and Arabic},
year = {2026},
version = {1.0.0},
publisher = {Python Package Index},
url = {https://pypi.org/project/moroccan-nlp/},
note = {Python package, MIT License, Series GITDEEPER LAB ZERO V6}
}
🔬 Zenodo Archive (Paper & Data)
@dataset{baladi2026moroccan_nlp_zenodo,
author = {Baladi, Samir},
title = {{moroccan_nlp}: Linguistic Resources and Models for Moroccan Darija and Arabic — Research Paper and Data},
year = {2026},
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.21154423},
url = {https://doi.org/10.5281/zenodo.21154423},
note = {Natural Language Processing · GITDEEPER LAB ZERO V6}
}
📝 OSF Preregistration
@misc{baladi2026moroccan_nlp_osf,
author = {Baladi, Samir},
title = {{moroccan_nlp}: Pre-registered Study Protocol for Linguistic Resources and Models for Moroccan Darija and Arabic},
year = {2026},
publisher = {Open Science Framework},
doi = {10.17605/OSF.IO/SXGC6},
url = {https://doi.org/10.17605/OSF.IO/SXGC6},
note = {OSF Preregistration}
}
📄 Research Paper
@article{baladi2026moroccan_nlp,
author = {Baladi, Samir},
title = {{moroccan_nlp}: Linguistic Resources and Models for Moroccan Darija and Arabic},
year = {2026},
month = {July},
version = {1.0.0},
doi = {10.5281/zenodo.21154423},
url = {https://doi.org/10.5281/zenodo.21154423},
note = {Ronin Institute / Rite of Renaissance, Series GITDEEPER LAB ZERO V6}
}
DarijaBERT Paper
@article{gaanoun2023darijabert,
title={Darijabert: a Step Forward in Nlp for the Written Moroccan Dialect},
author={Gaanoun, Kamel and Naira, Abdou Mohamed and Allak, Anass and Benelallam, Imade},
year={2023}
}
APA (inline)
Baladi, S. (2026). moroccan_nlp: Linguistic Resources and Models for Moroccan Darija and Arabic (Version 1.0.0, Series GITDEEPER LAB ZERO V6). Zenodo. https://doi.org/10.5281/zenodo.21154423
📜 License
This project is licensed under the MIT License — see the LICENSE file for details.
MIT License
Copyright (c) 2026 Samir Baladi
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
👤 Author
Samir Baladi
Independent Researcher: Natural Language Processing, Computational Linguistics & AI for Under-Resourced Languages
Ronin Institute / Rite of Renaissance
| Contact | Link |
|---|---|
| 📧 Email | gitdeeper@gmail.com |
| 🧑‍🔬 ORCID | 0009-0003-8903-0029 |
| 🐙 GitHub | github.com/gitdeeper13 |
| 🔬 Zenodo | doi.org/10.5281/zenodo.21154423 |
GITDEEPER LAB ZERO V6 · Version 1.0.0 · July 2026
File details
Details for the file moroccan_nlp-1.0.0.tar.gz.
File metadata
- Download URL: moroccan_nlp-1.0.0.tar.gz
- Upload date:
- Size: 7.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: moroccan-nlp-Uploader/1.0.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 90d023c1a686e51ec656debb5b9883c1007be624643c8a5471284dbdac77b621 |
| MD5 | 3ea96e2303f2add7d3f508d67f99c196 |
| BLAKE2b-256 | 303c696ab6a627d15712e760596b4f6475c68de6246e70790faf7612df3dac3a |
File details
Details for the file moroccan_nlp-1.0.0-py3-none-any.whl.
File metadata
- Download URL: moroccan_nlp-1.0.0-py3-none-any.whl
- Upload date:
- Size: 7.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: moroccan-nlp-Uploader/1.0.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 70c0c5fc9b3119eb24724a173a04e7292001cfcb77409978b7991e1ede71dd17 |
| MD5 | eaa8018e1ffb6816c2f872534ddbd5d7 |
| BLAKE2b-256 | 142657041b773d8fb7772ce6c69df2231613536863b57a7df71a15f6f1f9f843 |
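The published digests can be used to check a downloaded file before installing it. The sketch below uses only the standard library `hashlib` module, with the SHA256 values copied from the file details above. The `sha256_of` and `verify` helper names are invented for this example.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file details above.
EXPECTED_SHA256 = {
    "moroccan_nlp-1.0.0.tar.gz":
        "90d023c1a686e51ec656debb5b9883c1007be624643c8a5471284dbdac77b621",
    "moroccan_nlp-1.0.0-py3-none-any.whl":
        "70c0c5fc9b3119eb24724a173a04e7292001cfcb77409978b7991e1ede71dd17",
}


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 matches the digest listed for its name."""
    path = Path(path)
    # Unknown file names have no expected digest, so they never verify.
    return sha256_of(path) == expected.get(path.name)
```

For example, `verify("moroccan_nlp-1.0.0.tar.gz")` returns `True` only when the local file's contents hash to the listed value.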