A library for processing code-mixed text. Still in development!
Project description
This repository has been archived!
You can find the latest version of the source code inside the CMTT repository, where it will continue to be developed.
CMTT is a wrapper library that aims to make code-mixed text processing more efficient. More documentation is on the way!
Installation
pip install code-mixed-text-toolkit
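To confirm that the install succeeded, the standard library can report the installed version of a distribution. The sketch below is only an illustration: the helper name `installed_version` is hypothetical and not part of this library, and it assumes the distribution name matches the one used in the `pip install` command above.

```python
from importlib import metadata


def installed_version(dist_name="code-mixed-text-toolkit"):
    """Return the installed version string of `dist_name`, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not present in the current environment.
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("code-mixed-text-toolkit is not installed in this environment")
    else:
        print(f"code-mixed-text-toolkit {version} is installed")
```

Running the check from the same interpreter that will import the library avoids confusion between multiple Python environments.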
Get started
How to use this library:
import code_mixed_text_toolkit.data as cmtt_data
import code_mixed_text_toolkit.preprocessing as cmtt_pp
# Loading json files
result_json = cmtt_data.load('https://world.openfoodfacts.org/api/v0/product/5060292302201.json')
# Loading csv files
result_csv = cmtt_data.load('https://gist.githubusercontent.com/rnirmal/e01acfdaf54a6f9b24e91ba4cae63518/raw/b589a5c5a851711e20c5eb28f9d54742d1fe2dc/datasets.csv')
# List the key properties available for the datasets provided by the cmtt library
keys = cmtt_data.list_dataset_keys()
# List all datasets provided by cmtt
# Specifying the 'key' property returns the dataset names together with their respective 'key' values
# Specifying the 'key' as 'all' returns all the information available for every dataset
data = cmtt_data.list_cmtt_datasets()
print(data)
# Download multiple datasets provided by cmtt; returns a list of the paths where the datasets were downloaded
# The datasets are downloaded into a new 'cmtt' directory inside the root directory of the operating system
lst = cmtt_data.download_cmtt_datasets(["linc_ner_hineng", "L3Cube_HingLID_all", "linc_lid_spaeng"])
# Download a dataset from a URL; returns the path where the dataset was downloaded
# The dataset is downloaded into a new 'datasets' directory inside the current working directory
path = cmtt_data.download_dataset_url('https://world.openfoodfacts.org/api/v0/product/5060292302201.json')
# Load and preprocess txt dataset
result_txt = cmtt_data.load('https://www.w3.org/TR/PNG/iso_8859-1.txt')
result_txt_tokenized = cmtt_pp.tokenizer.word_tokenize(result_txt)
# Search for a target word in the txt corpus
cmtt_pp.search.search_word(result_txt, 'with', tokenize=True, width=3)
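The word search at the end of the example can be pictured as a keyword-in-context lookup. The sketch below is a standalone illustration in plain Python, not the library's implementation: `simple_tokenize` and `keyword_in_context` are hypothetical names, and it assumes that `width` means the number of tokens shown on each side of a match, which the library's documentation does not confirm.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks (illustrative only)."""
    return re.findall(r"\w+|[^\w\s]", text)


def keyword_in_context(text, target, width=3):
    """Return (left_context, match, right_context) for each occurrence of `target`.

    Assumption: `width` is the number of tokens kept on each side of the match.
    Matching is case-insensitive.
    """
    tokens = simple_tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    sample = "Main kal office jaunga with my friends and then with you"
    for left, match, right in keyword_in_context(sample, "with", width=2):
        print(" ".join(left), f"[{match}]", " ".join(right))
```

A regex tokenizer like this treats every script the same way, which is convenient for a sketch but is likely cruder than a tokenizer built for code-mixed text.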
Download files
Source Distribution
code_mixed_text_toolkit-0.5.5.tar.gz (439.3 kB)
Built Distribution
code_mixed_text_toolkit-0.5.5-py3-none-any.whl (868.3 kB)
File details
Details for the file code_mixed_text_toolkit-0.5.5.tar.gz.
File metadata
- Download URL: code_mixed_text_toolkit-0.5.5.tar.gz
- Size: 439.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | aa9515addb28731e0b4262972eac53781580226e7591a46667890e9eef21a53b
MD5 | 93858cb9b9a66af27e28dfbdb63dbccf
BLAKE2b-256 | 5fb02af8a80f852d0f161ca5dda763dd9345d37d593b4b96d9ba4b54da178a74
File details
Details for the file code_mixed_text_toolkit-0.5.5-py3-none-any.whl.
File metadata
- Download URL: code_mixed_text_toolkit-0.5.5-py3-none-any.whl
- Size: 868.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | d7ebc7caf399ba6b8b0a3f24829d92f355869ceafe71740b5efd663ab11d94dd
MD5 | c9b7ae84bf5a55866ef81a411e02c08b
BLAKE2b-256 | d372024df1f537dc5d15bc8a97dbf49a5d1b464cedd984fcab5d6b65fcfb41e0
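The published digests can be compared against a downloaded file using Python's standard `hashlib` module. The following is a minimal sketch: the helper names `file_digests` and `matches_published` are hypothetical, and it assumes that BLAKE2b-256 refers to BLAKE2b with a 32-byte digest.

```python
import hashlib
from pathlib import Path


def file_digests(path, chunk_size=1 << 20):
    """Return the SHA256, MD5 and BLAKE2b-256 hex digests of the file at `path`."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "MD5": hashlib.md5(),
        # Assumption: BLAKE2b-256 is BLAKE2b truncated to a 32-byte digest.
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with Path(path).open("rb") as handle:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches_published(path, published):
    """Check every digest in `published` (algorithm name -> hex string) against the file."""
    actual = file_digests(path)
    return all(actual[name] == digest.lower() for name, digest in published.items())
```

For example, after downloading the source distribution, `matches_published("code_mixed_text_toolkit-0.5.5.tar.gz", {"SHA256": "aa9515addb28731e0b4262972eac53781580226e7591a46667890e9eef21a53b"})` should return `True` if the file is intact.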