
A library for processing code-mixed text. Still in development!

Project description


CMTT is a wrapper library that streamlines code-mixed text processing. More documentation is on the way!

Installation

pip install cmtt

Getting Started

How to use this library:

from cmtt.data import *
from cmtt.preprocessing import *

# Loading json files
result_json = load_url('https://world.openfoodfacts.org/api/v0/product/5060292302201.json')

# Loading csv files
result_csv = load_url('https://gist.githubusercontent.com/rnirmal/e01acfdaf54a6f9b24e91ba4cae63518/raw/b589a5c5a851711e20c5eb28f9d54742d1fe2dc/datasets.csv')
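Conceptually, a loader like `load_url` fetches the resource and picks a parser based on the file extension. The following is a minimal stdlib-only sketch of that dispatch step (illustrative only, not cmtt's actual implementation; the helper name `parse_payload` is made up):

```python
import csv
import io
import json

def parse_payload(url, raw_text):
    """Parse raw_text as JSON or CSV, chosen by the URL's file extension."""
    if url.endswith(".json"):
        return json.loads(raw_text)
    if url.endswith(".csv"):
        # DictReader maps each row to a dict keyed by the header line
        return list(csv.DictReader(io.StringIO(raw_text)))
    raise ValueError("unsupported format: " + url)

rows = parse_payload("datasets.csv", "name,rows\niris,150\n")
# rows == [{'name': 'iris', 'rows': '150'}]
```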

# List the key properties available for the datasets provided by the cmtt library
keys = list_dataset_keys()

# List all datasets provided by cmtt
# Specifying the 'key' parameter returns the dataset names with the respective 'key' value
# Specifying the 'key' as 'all' returns all the information pertaining to all the datasets
data = list_cmtt_datasets()

# Download multiple datasets provided by cmtt, returning a list of paths where the datasets are downloaded
# The datasets are downloaded into a new 'cmtt' directory inside the operating system's user profile directory
lst = download_cmtt_datasets(["linc_ner_hineng", "L3Cube_HingLID_all", "linc_lid_spaeng"])

# Download a dataset from a url, returning the path where the dataset gets downloaded
# The Dataset is downloaded into a new directory 'datasets' inside the current working directory
path = download_dataset_url('https://world.openfoodfacts.org/api/v0/product/5060292302201.json')
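A downloader like `download_dataset_url` has to derive a local file path from the URL before fetching. A hypothetical stdlib helper showing just that path-derivation step (the function `dataset_target_path` is my own illustration, not part of cmtt):

```python
from pathlib import Path
from urllib.parse import urlparse

def dataset_target_path(url, base_dir="datasets"):
    # Local path a downloader might write to: <base_dir>/<url filename>
    name = Path(urlparse(url).path).name
    return Path(base_dir) / name

p = dataset_target_path("https://example.org/data/5060292302201.json")
# p.as_posix() == 'datasets/5060292302201.json'
```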

# CMTT currently provides three tokenizers: basic, word and wordpiece
# Basic Tokenizer
text = "This Python interpreter is in a conda environment, but the environment has not been activated.  Libraries may fail to load.  To activate this environment"
tokenized_text_basic = basic_tokenize(text)

# Word Tokenizer
WordT = WordTokenizer()
tokenized_text_word = WordT.tokenize(text)

# Wordpiece Tokenizer
WordpieceT = Wordpiece_tokenizer()
tokenized_text_wordpiece = WordpieceT.tokenize(text)
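WordPiece-style tokenizers typically split out-of-vocabulary words into subword pieces by greedy longest-match against a vocabulary. A toy sketch of that idea with a made-up three-entry vocabulary (this is the general technique, not cmtt's internals):

```python
def wordpiece(word, vocab):
    """Greedy longest-match subword split; '##' marks word-internal pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1  # shrink the candidate piece and retry
        else:
            # No vocabulary piece matched: the whole word is unknown
            return ["[UNK]"]
        start = end
    return pieces

vocab = {"en", "##viron", "##ment"}
print(wordpiece("environment", vocab))  # → ['en', '##viron', '##ment']
```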

# Search functionality
instances, list_instances = search_word(text, 'this', tokenize=True, width=3)
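The `width` parameter suggests a keyword-in-context style search that returns a window of tokens around each match. A rough stdlib sketch of that kind of windowed search (hypothetical helper `kwic`, not cmtt's implementation):

```python
def kwic(text, word, width=3):
    """Return the window of +/- width tokens around each occurrence of word."""
    tokens = text.lower().split()
    return [tokens[max(0, i - width): i + width + 1]
            for i, tok in enumerate(tokens)
            if tok.strip(".,?!") == word]

windows = kwic("hum cmtt naam ki python library develop kar rahe hain",
               "python", width=2)
# windows == [['naam', 'ki', 'python', 'library', 'develop']]
```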

# Sentence piece based tokenizers for Hindi and Hinglish
# Download the models for the tokenizers. If a model is already downloaded, cmtt does not download it again.
download_models('hi')
download_models('hi-en')

# Sentence piece based Tokenizer for Hindi
_hi = "मैं इनदोनों श्रेणियों के बीच कुछ भी० सामान्य नहीं देखता।"
lst = tokenize(_hi, 'hi')
# The tokenizer output is written to a txt file, as the terminal may not render Devanagari text accurately.
with open(r"test_hi.txt", 'w', encoding = "utf-8") as f:
  for i in lst:
    f.write(i + "\n")

# Sentence piece based Tokenizer for Hinglish
_hien = "hi kya haal chaal? hum cmtt naam ki python library develop kar rahe hain"
lst = tokenize(_hien, 'hi-en')
with open(r"test_hien.txt", 'w', encoding = "utf-8") as f:
  for i in lst:
    f.write(i + "\n")
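The two write-loops above follow the same pattern; a small helper (illustrative, not part of cmtt) that both snippets could share:

```python
def write_tokens(tokens, path):
    # One token per line, UTF-8 so Devanagari text round-trips correctly
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(tokens) + "\n")

write_tokens(["hi", "kya", "haal"], "test_hien.txt")
```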

