Data list cleaning
Project description
Data cleaning: cleancc
cleancc
- Quickly cleans text data held in a list
- Project repository (stars welcome): https://github.com/Amiee-well/clean
Usage
-
pip install cleancc
-
import cleancc
-
The package exposes five functions:
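1. The first function is punct:
[
Removes punctuation and converts all letters to lowercase
:param pop_list: the data to process, as a list
:param lower: whether to convert to lowercase; enabled by default
:return all_comment: the processed result, as a string
]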
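The behaviour described for punct can be illustrated with a small standalone sketch. This is not cleancc's source code: the function name `strip_punct`, the choice to join list items with a space, and the use of `string.punctuation` (ASCII punctuation only) are all assumptions made for illustration.

```python
import string


def strip_punct(pop_list, lower=True):
    """Hypothetical stand-in for punct: list of strings in, one string out."""
    # Assumption: items are joined with a single space
    text = " ".join(pop_list)
    # A translation table with a deletion set removes every ASCII punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.lower() if lower else text


print(strip_punct(["Hello, World!", "It's GREAT."]))  # hello world its great
```

Full-width punctuation, such as the marks used in Chinese text, is not covered by `string.punctuation` and would need to be added to the deletion set.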
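2. The second function is statistics:
[
Counts word frequencies
:param pop_list: the data to process, as a list
:param symbol: whether to remove punctuation; enabled by default
:param lower: whether to convert to lowercase; enabled by default
:return wordCount_dict: the word-frequency result, as a dictionary
]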
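A word-frequency count of the kind statistics describes might look like the sketch below. The name `count_words` and whitespace-based tokenisation are assumptions; the package itself may split words differently.

```python
import string


def count_words(pop_list, symbol=True, lower=True):
    """Hypothetical stand-in for statistics: list of strings in, {word: count} out."""
    table = str.maketrans("", "", string.punctuation)
    word_count = {}
    for item in pop_list:
        if symbol:
            item = item.translate(table)  # drop ASCII punctuation
        if lower:
            item = item.lower()
        # Assumption: words are separated by whitespace
        for word in item.split():
            word_count[word] = word_count.get(word, 0) + 1
    return word_count


print(count_words(["The cat sat.", "The dog sat!"]))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```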
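3. The third function is stop_words:
[
Removes stop words from the word-frequency result
:param statis: whether to clean a word-frequency result
:param pop_list: the data to process, as a list
:param symbol: whether to remove punctuation; enabled by default
:param lower: whether to convert to lowercase; enabled by default
:param wordCount_dict: the word-frequency result, as a dictionary
:return wordCount_dict: the result with stop words removed, as a dictionary
]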
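Filtering stop words out of a frequency dictionary can be sketched as follows. The stop-word set here is a tiny hypothetical sample, and `drop_stop_words` is an illustrative name; the list that cleancc actually ships with is not documented on this page.

```python
# Hypothetical, deliberately tiny stop-word set used only for illustration
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is"}


def drop_stop_words(word_count, stop_words=STOP_WORDS):
    """Return a new {word: count} dict without any word listed in stop_words."""
    return {word: count for word, count in word_count.items() if word not in stop_words}


print(drop_stop_words({"the": 2, "cat": 1, "sat": 2}))  # {'cat': 1, 'sat': 2}
```

Building a new dictionary rather than deleting keys in place leaves the caller's original counts untouched.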
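4. The fourth function is Count_Sort:
[
Sorts the dictionary by count and returns the top-ranked entries
:param wordCount_dict: the word-frequency result, as a dictionary
:param choices_number: how many of the top entries to return
:return keyword_list: the words that occur, as a list
:return value_list: the frequency of each word, as a list
]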
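The ranking step described for Count_Sort could be sketched like this. The name `top_words` is hypothetical, and the tie-breaking rule (words with equal counts keep their original dictionary order) is an assumption rather than documented behaviour.

```python
def top_words(word_count, choices_number):
    """Hypothetical stand-in for Count_Sort: returns two parallel lists."""
    # sorted() is stable, so equal counts keep their insertion order
    ranked = sorted(word_count.items(), key=lambda kv: kv[1], reverse=True)
    ranked = ranked[:choices_number]
    keyword_list = [word for word, _ in ranked]
    value_list = [count for _, count in ranked]
    return keyword_list, value_list


print(top_words({"the": 2, "cat": 1, "sat": 2, "dog": 1}, 2))
# (['the', 'sat'], [2, 2])
```

Returning two parallel lists matches the documented return values and is convenient for plotting libraries that take separate label and value sequences.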
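5. The fifth function is word_all:
[
Calls all of the functions above
:param pop_list: the data to process, as a list
:param choices_number: how many of the top entries to return
:param symbol: whether to remove punctuation; enabled by default
:param lower: whether to convert to lowercase; enabled by default
:return keyword_list: the words that occur, as a list
:return value_list: the frequency of each word, as a list
]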
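For comparison, the whole pipeline that word_all describes (clean, count, drop stop words, rank) can be approximated in a few lines with `collections.Counter` from the standard library. This is a sketch under assumptions, not the package's implementation: `top_keywords` and the small `STOP_WORDS` set are hypothetical, and it is assumed that word_all includes the stop-word step.

```python
import string
from collections import Counter

# Hypothetical stop-word sample; lowercase only, so it matches best with lower=True
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is"}


def top_keywords(pop_list, choices_number, symbol=True, lower=True):
    """Hypothetical stand-in for word_all: list of strings in, two parallel lists out."""
    table = str.maketrans("", "", string.punctuation)
    counter = Counter()
    for item in pop_list:
        if symbol:
            item = item.translate(table)  # drop ASCII punctuation
        if lower:
            item = item.lower()
        # Count whitespace-separated words, skipping stop words
        counter.update(w for w in item.split() if w not in STOP_WORDS)
    ranked = counter.most_common(choices_number)
    return [word for word, _ in ranked], [count for _, count in ranked]


print(top_keywords(["The cat sat.", "A cat ran!"], 1))  # (['cat'], [2])
```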
Notes
-
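Note: the data argument must be a Python list. Convert pandas data (for example, a DataFrame column) to a list before calling these functions.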
-
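Usage example:
import pandas as pd
from bs4 import BeautifulSoup
from cleancc import clean

# Load the tab-separated reviews file
df = pd.read_csv("label.csv", sep='\t', escapechar='\\')
# cleancc expects a plain Python list, so convert the column first
review_list = df['review'].tolist()
# Strip HTML markup from each review (the 'lxml' parser requires the lxml package)
comment_list = [BeautifulSoup(k, 'lxml').text for k in review_list]
print(comment_list)
# Run the full pipeline and keep the 150 most frequent words
keyword_list, value_list = clean.word_all(comment_list, 150)
print(keyword_list, value_list)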