my-nlp-wrangler
Description
This is a framework for cleaning and tokenizing text for NLP. It is a simple pipeline: the Cleaner removes punctuation and URLs, and the Tokenizer segments the text with jieba and then removes user-defined stop words (you must supply the path to your own stop-word file).
Flow
Cleaner
- remove punctuation
- remove URLs
Tokenizer
- segment with jieba (default tokenizer)
- remove stop words
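The Cleaner steps above can be sketched with two regular expressions: strip URLs first, then punctuation. This is a minimal illustration, not the library's actual implementation; `ArticleCleaner` may use different patterns. Note that `\w` in Python's `re` matches CJK characters, so Chinese text survives the punctuation pass.

```python
import re

# Hypothetical re-implementation of the Cleaner flow described above.
URL_PATTERN = re.compile(r"https?://\S+")   # remove URLs
PUNCT_PATTERN = re.compile(r"[^\w\s]")      # remove punctuation

def clean_text(text: str) -> str:
    text = URL_PATTERN.sub("", text)
    text = PUNCT_PATTERN.sub("", text)
    return text.strip()

print(clean_text("Hello, https://www.google.com/"))  # -> Hello
```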
Quick Start
Install with pip:

pip install my-nlp-wrangler

```python
import os

import pandas as pd

from mynlpwrangler.cleaner import ArticleCleaner
from mynlpwrangler.tokenizer import Tokenizer

df = pd.DataFrame(
    {
        "id": ["10001", "11375", "23423"],
        "text": ["Hello, https://www.google.com/", "Hello,world", "How do you do? http://www.google.com"],
    }
)

# Clean the sentences by removing punctuation and URLs
ac = ArticleCleaner(col='text', cleaned_col='clean_sentence')
clean_data = ac.clean_data(df=df)

# Tokenize the cleaned sentences and generate the segmented words
tokenized_column = 'tokenize_word'
tk = Tokenizer(stop_word_path=f'{os.getcwd()}/stop_word.txt')
tk.tokenize_dataframe(clean_data, sentences_column='clean_sentence', new_generate_column=tokenized_column)
```
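The Tokenizer's stop-word step amounts to filtering segmented tokens against the word list loaded from `stop_word_path`. The sketch below is only an illustration of that filter: it splits on whitespace instead of calling jieba so it stays self-contained, and the stop-word set is made up for the example.

```python
# Minimal sketch of the stop-word removal step (not the library's code).
def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

stop_words = {"the", "a", "is"}            # illustrative stop-word list
tokens = "this is a simple framework".split()
print(remove_stop_words(tokens, stop_words))  # -> ['this', 'simple', 'framework']
```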
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
my-nlp-wrangler-0.0.3.tar.gz (3.8 kB)
Built Distribution
my_nlp_wrangler-0.0.3-py3-none-any.whl
Hashes for my_nlp_wrangler-0.0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 18256410a55aab50e4c6b4ac88ce4bde75ff38e87da780a317f2dc9ac21f9e93
MD5 | a3dd47b2b10afa7836045310cb5140aa
BLAKE2b-256 | 3e875aed25fbd0ef564d990862e23a7becc45f36cd4d98e2bf71abb000e6388e