A library for processing Code Mixed Text. Still in development!
Project description
CMTT (Code Mixed Text Toolkit) is a wrapper library for loading, downloading, and preprocessing code-mixed text datasets. More documentation is on the way.
Installation
pip install code-mixed-text-toolkit
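To confirm that the install step worked, the standard library can report the installed version of the distribution. This is only a minimal sketch: the helper name `installed_version` is hypothetical and is not part of CMTT; only the distribution name comes from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="code-mixed-text-toolkit"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version if pip installed the package, otherwise a short notice.
print(installed_version() or "code-mixed-text-toolkit is not installed")
```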
Get started
How to use this library:
import code_mixed_text_toolkit.data as cmtt_data
import code_mixed_text_toolkit.preprocessing as cmtt_pp
# Loading json files
result_json = cmtt_data.load('https://world.openfoodfacts.org/api/v0/product/5060292302201.json')
# Loading csv files
result_csv = cmtt_data.load('https://gist.githubusercontent.com/rnirmal/e01acfdaf54a6f9b24e91ba4cae63518/raw/b589a5c5a851711e20c5eb28f9d54742d1fe2dc/datasets.csv')
# List the key properties available for the datasets provided by the cmtt library
keys = cmtt_data.list_dataset_keys()
# List all datasets provided by cmtt
# Specifying the 'key' property returns the dataset names along with their value for that key
# Setting 'key' to 'all' returns all available information for every dataset
data = cmtt_data.list_cmtt_datasets()
print(data)
# Download multiple datasets provided by cmtt; returns a list of the paths they were downloaded to
# The datasets are saved in a new 'cmtt' directory inside the root directory of the operating system
lst = cmtt_data.download_cmtt_datasets(["linc_ner_hineng", "L3Cube_HingLID_all", "linc_lid_spaeng"])
# Download a dataset from a URL; returns the path it was downloaded to
# The dataset is saved in a new 'datasets' directory inside the current working directory
path = cmtt_data.download_dataset_url('https://world.openfoodfacts.org/api/v0/product/5060292302201.json')
# Load and preprocess txt dataset
result_txt = cmtt_data.load('https://www.w3.org/TR/PNG/iso_8859-1.txt')
result_txt_tokenized = cmtt_pp.tokenizer.word_tokenize(result_txt)
# Search target word in txt corpus
cmtt_pp.search.search_word(result_txt, 'with', tokenize=True, width=3)
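To give a feel for what a word search with a context window does, here is a rough standard-library sketch of a keyword-in-context search. It is not CMTT's implementation. The function names `simple_tokenize` and `search_in_context` are hypothetical, and reading `width` as "number of tokens shown on each side of the match" is an assumption, not something the library documentation confirms.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def search_in_context(text, target, width=3):
    """Return each match of `target` with up to `width` tokens on both sides.

    Assumption: `width` counts context tokens per side; matching ignores case.
    """
    tokens = simple_tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, token, right))
    return hits


# A made-up Hindi-English code-mixed sentence, used purely for illustration.
sample = "Main market gaya tha with my friends, aur phir we came back with snacks."
for left, word, right in search_in_context(sample, "with", width=3):
    print(" ".join(left), f"[{word}]", " ".join(right))
```

Lowercasing both sides keeps the comparison case-insensitive, which is usually what a corpus search wants; drop the `.lower()` calls if exact-case matching is needed.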
Download files
Source Distribution
code_mixed_text_toolkit-0.5.0.tar.gz (439.1 kB)
Hashes for code_mixed_text_toolkit-0.5.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 3c5cb891dc2e5701c82461f461a0b41e538f8969af18b46ecc9d0d61aeb2d484
MD5 | d20a6b0f5b21ca6d83d19a4db6ee3bbe
BLAKE2b-256 | 4d5be8897186dd53c3740403269ea5abc1bff96d21cc55991538c0275ef420a5
Built Distribution
code_mixed_text_toolkit-0.5.0-py3-none-any.whl
Hashes for code_mixed_text_toolkit-0.5.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 65cbdddc50f976bf21f070fec136985b77c6aad8dbf24ad8b71c84fdb4c1465c
MD5 | 9e2455e72d21905fcc54282c576461fa
BLAKE2b-256 | b384ee6d7691ee69a77531395d8448453312bfef8fcb7ed36fa701b4cf611075
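A downloaded file can be checked against the published digests with Python's `hashlib`. The sketch below is one way to do it, not an official verification tool: the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the local file path in the commented usage line is an assumption about where the archive was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published value, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Published SHA256 for the 0.5.0 source distribution, copied from the table above.
EXPECTED_SDIST_SHA256 = "3c5cb891dc2e5701c82461f461a0b41e538f8969af18b46ecc9d0d61aeb2d484"

# Example (assumes the archive was saved to the current working directory):
# matches_published_hash("code_mixed_text_toolkit-0.5.0.tar.gz", EXPECTED_SDIST_SHA256)
```

Reading in chunks keeps memory use flat regardless of file size, which matters little for a 439.1 kB archive but makes the helper reusable for larger downloads.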