A library for processing code-mixed text. Still in development!

Project description

CMTT is a wrapper library that streamlines code-mixed text processing. More documentation is on the way!

Installation

pip install cmtt

Getting Started

How to use this library:

from cmtt.data import *
from cmtt.preprocessing import *

# Load a JSON file from a URL
result_json = load_url('https://world.openfoodfacts.org/api/v0/product/5060292302201.json')

# Load a CSV file from a URL
result_csv = load_url('https://gist.githubusercontent.com/rnirmal/e01acfdaf54a6f9b24e91ba4cae63518/raw/b589a5c5a851711e20c5eb28f9d54742d1fe2dc/datasets.csv')

# List the key properties available for the datasets provided by the cmtt library
keys = list_dataset_keys()

# List all datasets provided by cmtt
# Specifying the 'key' property returns the dataset names with the corresponding 'key' value
# Specifying 'key' as 'all' returns all the information for all the datasets
data = list_cmtt_datasets()

# Download multiple datasets provided by cmtt, returning a list of paths where the datasets are saved
# The datasets are downloaded into a new 'cmtt' directory inside the root directory of the operating system
lst = download_cmtt_datasets(["linc_ner_hineng", "L3Cube_HingLID_all", "linc_lid_spaeng"])

# Download a dataset from a URL, returning the path where the dataset is saved
# The dataset is downloaded into a new 'datasets' directory inside the current working directory
path = download_dataset_url('https://world.openfoodfacts.org/api/v0/product/5060292302201.json')

# CMTT currently provides three tokenizers: basic, word, and wordpiece
# Basic Tokenizer
text = "This Python interpreter is in a conda environment, but the environment has not been activated.  Libraries may fail to load.  To activate this environment"
tokenized_text_basic = basic_tokenize(text)

# Word Tokenizer
WordT = WordTokenizer()
tokenized_text_word = WordT.tokenize(text)

# Wordpiece Tokenizer
WordpieceT = Wordpiece_tokenizer()
tokenized_text_wordpiece = WordpieceT.tokenize(text)

# Search functionality
instances, list_instances = search_word(text, 'this', tokenize=True, width=3)
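The wordpiece tokenizer above splits words into subword units drawn from a fixed vocabulary. As a rough illustration of the general idea only (a greedy longest-match sketch over a toy vocabulary, not cmtt's actual algorithm or vocabulary):

```python
# Greedy longest-match subword splitting: repeatedly take the longest
# vocabulary entry that prefixes the remaining word. Continuation
# pieces are marked with "##", as in BERT-style wordpiece vocabularies.
def toy_wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary piece matched
    return pieces

vocab = {"environ", "##ment", "act", "##iv", "##ated"}
print(toy_wordpiece("environment", vocab))  # ['environ', '##ment']
print(toy_wordpiece("activated", vocab))    # ['act', '##iv', '##ated']
```

For code-mixed text this matters because subword units let a single vocabulary cover words from multiple languages and scripts without exploding in size.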

Contributors

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cmtt-0.3.0.tar.gz (438.8 kB)

Uploaded Source

Built Distribution

cmtt-0.3.0-py3-none-any.whl (866.9 kB)

Uploaded Python 3

File details

Details for the file cmtt-0.3.0.tar.gz.

File metadata

  • Download URL: cmtt-0.3.0.tar.gz
  • Upload date:
  • Size: 438.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.14

File hashes

Hashes for cmtt-0.3.0.tar.gz
Algorithm Hash digest
SHA256 9256c0efd3a081df62878db8910da0e994b4533aa32fa158e86f22e590c74628
MD5 f8d05b30b2790ae6ca1652907dc841d6
BLAKE2b-256 ac007addcc4f6b34dfdc7530471e84988b6bc7f9cb14ffe091ec6fa2070600ed

See more details on using hashes here.

File details

Details for the file cmtt-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: cmtt-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 866.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.14

File hashes

Hashes for cmtt-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8ce4e878899bc3536320f3f2d7b854427fbaa882fdc66c2cc78af000cf1a7f64
MD5 909e652849eb4098e03e296f3cfd3b63
BLAKE2b-256 e433b80dd6639d114e02ad166f8676ea21a2fe1693b5e6554e7ea4a44ea3d7f5

See more details on using hashes here.
