A library for processing code-mixed text. Still in development!
Project description
CMTT (Code-Mixed Text Toolkit) is a wrapper library for loading and preprocessing code-mixed text, that is, text that mixes two or more languages. More documentation is on the way.
Installation
pip install code-mixed-text-toolkit
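
To confirm that the install worked, one option is to check that the package can be located by Python. This is a minimal sketch using only the standard library. `is_installed` is a hypothetical helper name and is not part of CMTT; the import name `code_mixed_text_toolkit` is the one used in this README's usage example (the pip name uses hyphens instead of underscores).

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Hypothetical check; prints True only if the package is installed
    # in the current environment.
    print(is_installed("code_mixed_text_toolkit"))
```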
Get started
How to use this library:
import code_mixed_text_toolkit.data as cmtt_data
import code_mixed_text_toolkit.preprocessing as cmtt_pp
# Loading json files
result_json = cmtt_data.load('https://world.openfoodfacts.org/api/v0/product/5060292302201.json')
# Loading csv files
result_csv = cmtt_data.load('https://gist.githubusercontent.com/rnirmal/e01acfdaf54a6f9b24e91ba4cae63518/raw/b589a5c5a851711e20c5eb28f9d54742d1fe2dc/datasets.csv')
# List all datasets available
cmtt_data.list_datasets(show_key="url")
# Download specific datasets
cmtt_data.download("openfoodfacts")
cmtt_data.download("rnirmal")
# Load and preprocess txt dataset
result_txt = cmtt_data.load('https://www.w3.org/TR/PNG/iso_8859-1.txt')
result_txt_tokenized = cmtt_pp.tokenizer.word_tokenize(result_txt)
# Search target word in txt corpus
cmtt_pp.search.search_word(result_txt, 'with', tokenize=True, width=3)
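
The search call above takes a target word and a `width`. How CMTT interprets `width` is not documented here; a common reading is a keyword-in-context search, where each match is shown with a few tokens on either side. The sketch below illustrates that idea in plain Python. It is not CMTT's implementation: `simple_tokenize` and `keyword_in_context` are hypothetical names, and treating `width` as a token count is an assumption.

```python
import re
from typing import List


def simple_tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation marks (illustrative only)."""
    return re.findall(r"\w+|[^\w\s]", text)


def keyword_in_context(text: str, target: str, width: int = 3) -> List[str]:
    """Return each occurrence of `target` with up to `width` tokens on either side.

    Assumption: `width` counts tokens, not characters.
    """
    tokens = simple_tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [token] + right))
    return hits


if __name__ == "__main__":
    sample = "Kal meeting hai with the whole team, aur phir lunch with friends."
    for line in keyword_in_context(sample, "with", width=3):
        print(line)
```

Matching is case-insensitive in this sketch, and matches near the start or end of the text simply get a shorter context window.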
Hashes for code_mixed_text_toolkit-0.3.5.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 9949688ab306c7d146778a33b49be2663f9c5670bc60af60513c96fd43adf81f
MD5 | c6aa7aa3ead56e22ae4578e35dda28d3
BLAKE2b-256 | 56abf503c734b411e3728621a21a8772f7cb6990aa47dcd6b598430a7e76e64b
Hashes for code_mixed_text_toolkit-0.3.5-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b52bb39e4a64592869008801d76f567812738a6905dd8c27db0562add78da7c3
MD5 | 3635b5eb91ba489b4be876e0f83f549e
BLAKE2b-256 | 31e0d48959fbddfc04f102e5ecc99f2f7d94b25412d079b9b76fe8b9e489412d
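
The digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard `hashlib`. `file_digest` and `matches` are hypothetical helper names, the local file path is an assumption, and the sketch assumes that "BLAKE2b-256" means BLAKE2b with a 32-byte digest.

```python
import hashlib
from pathlib import Path


def file_digest(path: str, algorithm: str = "sha256") -> str:
    """Hash a file in chunks and return the hex digest."""
    if algorithm == "blake2b-256":
        # Assumption: BLAKE2b-256 is BLAKE2b with a 32-byte (256-bit) digest.
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(64 * 1024), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches(path: str, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Compare a file's digest with an expected hex string."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed local path to the downloaded wheel; adjust as needed.
    wheel = "code_mixed_text_toolkit-0.3.5-py3-none-any.whl"
    expected = "b52bb39e4a64592869008801d76f567812738a6905dd8c27db0562add78da7c3"
    if Path(wheel).exists():
        print(matches(wheel, expected, "sha256"))
```

SHA256 is the digest to prefer for verification; MD5 is listed for completeness but is not collision-resistant.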