A library for processing code-mixed text. Still in development!
Project description
CMTT is a wrapper library that streamlines code-mixed text processing, bundling dataset loaders, tokenizers, and search utilities behind a single interface. More documentation is on the way!
Installation
pip install cmtt
Getting Started
How to use this library:
from cmtt.data import *
from cmtt.preprocessing import *
# Loading json files
result_json = load_url('https://world.openfoodfacts.org/api/v0/product/5060292302201.json')
# Loading csv files
result_csv = load_url('https://gist.githubusercontent.com/rnirmal/e01acfdaf54a6f9b24e91ba4cae63518/raw/b589a5c5a851711e20c5eb28f9d54742d1fe2dc/datasets.csv')
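As illustrated above, load_url accepts both JSON and CSV URLs. The sketch below shows one plausible way such a loader could dispatch on the URL's file extension; the helper name load_url_sketch and its behavior are assumptions for illustration, not cmtt's actual implementation (parsing is done on an already-fetched text body to keep the example self-contained and offline).

```python
import csv
import io
import json
from urllib.parse import urlparse

def load_url_sketch(url, text):
    # Hypothetical helper: pick a parser based on the URL's file extension.
    # `text` is the already-downloaded response body.
    path = urlparse(url).path
    if path.endswith(".json"):
        return json.loads(text)
    if path.endswith(".csv"):
        # Return the CSV as a list of rows (each row a list of strings).
        return list(csv.reader(io.StringIO(text)))
    raise ValueError("unsupported format: " + path)
```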
# List the key properties available for the datasets provided by the cmtt library
keys = list_dataset_keys()
# List all datasets provided by cmtt
# Specifying the 'key' property returns the dataset names with the respective 'key' value
# Specifying the 'key' as 'all' returns all the information pertaining to all the datasets
data = list_cmtt_datasets()
# Download multiple datasets provided by cmtt, returning a list of paths where the datasets get downloaded
# The datasets are downloaded into a new 'cmtt' directory inside the user profile directory of the operating system
lst = download_cmtt_datasets(["linc_ner_hineng", "L3Cube_HingLID_all", "linc_lid_spaeng"])
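The comment above states that downloads land in a 'cmtt' folder under the user profile directory. The snippet below mirrors only that path resolution; cmtt_data_dir is a hypothetical helper, not part of cmtt's API.

```python
from pathlib import Path

def cmtt_data_dir():
    # Per the docs above, downloaded datasets live under <user profile>/cmtt.
    # This only resolves the path; it does not create the directory
    # or perform any download.
    return Path.home() / "cmtt"
```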
# Download a dataset from a url, returning the path where the dataset gets downloaded
# The dataset is downloaded into a new directory 'datasets' inside the current working directory
path = download_dataset_url('https://world.openfoodfacts.org/api/v0/product/5060292302201.json')
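For single-URL downloads, the docs above say files land in a 'datasets' folder under the current working directory. The helper below sketches just the target-path construction under that assumption; dataset_target_path is illustrative, not cmtt's actual function, and no network request is made.

```python
from pathlib import Path
from urllib.parse import urlparse

def dataset_target_path(url):
    # Derive the local filename from the last component of the URL path,
    # and place it under <cwd>/datasets (the layout described above).
    name = Path(urlparse(url).path).name
    return Path.cwd() / "datasets" / name
```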
# CMTT currently provides three tokenizers: basic, word, and wordpiece
# Basic Tokenizer
text = "This Python interpreter is in a conda environment, but the environment has not been activated. Libraries may fail to load. To activate this environment"
tokenized_text_basic = basic_tokenize(text)
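As a rough idea of what basic tokenization typically does, the sketch below splits text into word and punctuation tokens with a regular expression. This is an assumption about typical behavior, not cmtt's actual basic_tokenize implementation, whose rules may differ.

```python
import re

def basic_tokenize_sketch(text):
    # Keep runs of word characters as tokens and emit each punctuation
    # mark as its own token; whitespace is discarded.
    return re.findall(r"\w+|[^\w\s]", text)
```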
# Word Tokenizer
WordT = WordTokenizer()
tokenized_text_word = WordT.tokenize(text)
# Wordpiece Tokenizer
WordpieceT = Wordpiece_tokenizer()
tokenized_text_wordpiece = WordpieceT.tokenize(text)
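The core idea behind wordpiece tokenization is greedy longest-match-first segmentation against a learned subword vocabulary, with a continuation prefix (conventionally "##") on non-initial pieces. The sketch below demonstrates that idea on a toy vocabulary; cmtt's Wordpiece_tokenizer ships its own learned vocabulary and may differ in details.

```python
def wordpiece_sketch(word, vocab):
    # Greedily take the longest prefix of the remaining word that is in
    # the vocabulary; non-initial pieces carry a "##" prefix.
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            # No piece matches: the whole word is unknown.
            return ["[UNK]"]
        tokens.append(cur)
        start = end
    return tokens
```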
# Search functionality
instances, list_instances = search_word(text, 'this', tokenize=True, width=3)
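The width parameter above suggests a keyword-in-context style search: each match is returned with a window of surrounding tokens. The helper below sketches that idea; its name and exact matching rules are assumptions, not search_word's actual signature or behavior.

```python
def search_word_sketch(text, word, width=3):
    # Return, for each case-insensitive match, the window of `width`
    # tokens on either side of the matched token (hypothetical helper).
    tokens = text.split()
    windows = []
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,!?") == word.lower():
            windows.append(tokens[max(0, i - width): i + width + 1])
    return windows
```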
# Sentence piece based tokenizers for Hindi and Hinglish
# Download the models for the tokenizers. If a model is already downloaded, cmtt does not download it again.
download_models('hi')
download_models('hi-en')
# Sentence piece based Tokenizer for Hindi
_hi = "मैं इनदोनों श्रेणियों के बीच कुछ भी० सामान्य नहीं देखता।"
lst = tokenize(_hi, 'hi')
# Output of the tokenizer is written to a txt file, as the terminal may not render Devanagari text accurately
with open(r"test_hi.txt", 'w', encoding="utf-8") as f:
    for i in lst:
        f.write(i + "\n")
# Sentence piece based Tokenizer for Hinglish
_hien = "hi kya haal chaal? hum cmtt naam ki python library develop kar rahe hain"
lst = tokenize(_hien, 'hi-en')
with open(r"test_hien.txt", 'w', encoding="utf-8") as f:
    for i in lst:
        f.write(i + "\n")
Contributors
Project details
Download files
Source Distribution
cmtt-0.5.0.tar.gz (441.2 kB)
Built Distribution
cmtt-0.5.0-py3-none-any.whl (870.4 kB)
File details
Details for the file cmtt-0.5.0.tar.gz
File metadata
- Download URL: cmtt-0.5.0.tar.gz
- Upload date:
- Size: 441.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | 71b8e4a07e2d9e033ca69226a0eecb8afea4995ea5d6cc29122b5abbc9174385
MD5 | ee88305f44b74ac7dfe4c211926ea3e5
BLAKE2b-256 | 44ffec96ab5466232fb81b23478ee64fd0876cf56e6d9b43c80bf5b67554ce25
File details
Details for the file cmtt-0.5.0-py3-none-any.whl
File metadata
- Download URL: cmtt-0.5.0-py3-none-any.whl
- Upload date:
- Size: 870.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | 608259c0c862cd0fad2bcb3f07fe033fa502fa085775c938047a696721a33cd0
MD5 | 3ee57a1a9eebcf57b54883602d0885db
BLAKE2b-256 | dd2cfc6e90bb851de9d2e7fa77877514a17cc53bbda12991b3326ed1e3663dca