This project is a collection of Natural Language Processing tools for the Kurdish language.

Aamraz - Kurdish NLP collection

Overview

Aamraz, written "ئامراز" in Kurdish script, means "instrument". This project is a collection of Natural Language Processing (NLP) tools for the Kurdish language. Although Kurdish is spoken by millions, it remains an under-resourced language in NLP. Recognizing the rich cultural heritage and historical significance of the Kurdish people, we, regardless of ethnicity, are committed to advancing tools and pre-trained models that support the Kurdish language in modern research and technology. Our work aims to encourage further development and to provide a foundation for future NLP research and applications.

Base Features

  • Normalization
  • Tokenization
  • Word Embedding: Creates vector representations of words.
  • Sentence Embedding: Creates vector representations of sentences.
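To make the first two features concrete, here is a minimal sketch of what normalization and tokenization typically involve for Kurdish text. It is not Aamraz's implementation: the names `CHAR_MAP`, `simple_normalize`, and `simple_tokenize` are hypothetical, and the rules (mapping Arabic yeh and kaf to their Kurdish forms, dropping zero-width non-joiners, collapsing whitespace) are assumptions chosen only for illustration.

```python
import re

# Hypothetical mapping, for illustration only: Arabic code points that are
# commonly replaced by the forms used in Kurdish (Sorani) text.
CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}
ZWNJ = "\u200C"  # zero-width non-joiner


def simple_normalize(text: str) -> str:
    """Unify character variants, drop ZWNJ, and collapse whitespace."""
    for src, dst in CHAR_MAP.items():
        text = text.replace(src, dst)
    # Assumption: every ZWNJ is treated as noise and removed.
    text = text.replace(ZWNJ, "")
    return re.sub(r"\s+", " ", text).strip()


def simple_tokenize(text: str) -> list[str]:
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


if __name__ == "__main__":
    cleaned = simple_normalize("زوانی   له دربره.")
    print(simple_tokenize(cleaned))
```

A real normalizer usually applies many more rules than this; the sketch only shows the general shape of the two steps.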

Tools

Installation

pip install aamraz

Pre-trained Models

Some useful pre-trained models:

Model | Description | Size
FastText WordEmbedding | Model trained with the FastText method on our own corpus. This is the FastText skip-gram model itself. | ~2.3 GB
FastText WordEmbedding - Lite | Model trained with the FastText method on our own corpus. This is the FastText skip-gram model itself. | ~800 MB
Word2vec Model | Includes the required .bin and .npy files. | ~92 MB
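Because the model files are large, it can help to confirm that a download is present and roughly the expected size before loading it. The following is a minimal sketch using only the standard library; `model_file_info` is a hypothetical helper, not part of Aamraz, and the file name is the one used in the usage example of this README.

```python
from pathlib import Path


def model_file_info(path: str) -> tuple[bool, float]:
    """Return (exists, size in megabytes) for a model file path."""
    p = Path(path)
    if not p.is_file():
        return False, 0.0
    # Size in decimal megabytes, to compare against the table above.
    return True, p.stat().st_size / 1_000_000


if __name__ == "__main__":
    found, size_mb = model_file_info("kurdish_fasttext_skipgram_dim300_v1.bin")
    print(f"found={found}, size={size_mb:.1f} MB")
```

A size far below the one listed in the table usually points to an interrupted download.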

Usage

import aamraz

# Normalization
normalizer = aamraz.Normalizer()
sample_sentence="قڵبە‌کە‌م‌ بە‌  کوردی‌  قسە‌ دە‌کات‌."
normalized_sentence = normalizer.normalize(sample_sentence)
print(normalized_sentence)

# Tokenization
tokenizer = aamraz.WordTokenizer()
sample_sentence = "زوانی له دربره"
tokens = tokenizer.tokenize(sample_sentence)
print(tokens)

# Embedding by fasttext
model_path = 'kurdish_fasttext_skipgram_dim300_v1.bin'
embedding_model = aamraz.EmbeddingModel(model_path, dim=50)

sample_word = "ئامراز"
sample_sentence = "زوانی له دربره"

word_vector = embedding_model.word_embedding(sample_word)
sentence_vector = embedding_model.sentence_embedding(sample_sentence)

print(word_vector)
print(sentence_vector)

# Embedding by word2vec
model_path = 'kurdish_word2vec_model_dim100_v1.bin'
embedding_model = aamraz.EmbeddingModel(model_path, type='word2vec')

sample_word = "ئامراز"
sample_sentence = "زوانی له دربره"

word_vector = embedding_model.word_embedding(sample_word)
sentence_vector = embedding_model.sentence_embedding(sample_sentence)

print(word_vector)
print(sentence_vector)
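A common next step with embedding vectors is to compare them. The sketch below computes cosine similarity with the standard library only. It assumes that the vectors returned by `word_embedding` and `sentence_embedding` are sequences of floats of equal length; `cosine_similarity` is a hypothetical helper and not part of Aamraz, and the toy vectors stand in for real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy vectors standing in for two embeddings of the same dimension.
    vec_a = [0.2, 0.1, 0.7]
    vec_b = [0.1, 0.0, 0.9]
    print(cosine_similarity(vec_a, vec_b))
```

Values close to 1 indicate vectors pointing in similar directions. Vectors from models of different dimensions cannot be compared directly.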

License

This project is licensed under the MIT License. You are free to use, distribute, modify, and build upon this work, even for commercial purposes, as long as you include a copy of the original MIT License and provide proper attribution.

To view a copy of this license, visit: https://opensource.org/licenses/MIT
