Data list cleaning

Project description

Data cleaning: cleancc


cleancc

Usage

  • pip install cleancc

  • import cleancc

  • The package provides five functions:

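    1. The first function is punct:

    [
      Removes punctuation and converts all letters to lowercase
      :param pop_list: the data to process, as a list
      :param lower: whether to convert to lowercase; enabled by default
      :return all_comment: the processed result, as a string
    ]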

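    2. The second function is statistics:

    [
      Counts word frequencies
      :param pop_list: the data to process, as a list
      :param symbol: whether to remove punctuation; enabled by default
      :param lower: whether to convert to lowercase; enabled by default
      :return wordCount_dict: the frequency counts, as a dict
    ]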

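    3. The third function is stop_words:

    [
      Removes stop words from the word-frequency counts
      :param statis: whether to perform word-frequency cleaning
      :param pop_list: the data to process, as a list
      :param symbol: whether to remove punctuation; enabled by default
      :param lower: whether to convert to lowercase; enabled by default
      :param wordCount_dict: the word-frequency result, as a dict
      :return wordCount_dict: the result after stop-word removal, as a dict
    ]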

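    4. The fourth function is Count_Sort:

    [
      Sorts the dictionary by count and returns the top-ranked entries
      :param wordCount_dict: the word-frequency result, as a dict
      :param choices_number: how many of the top entries to return
      :return keyword_list: the words that appear, as a list
      :return value_list: the frequency of each corresponding word, as a list
    ]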

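    5. The fifth function is word_all:

    [
      Calls all of the functions above
      :param pop_list: the data to process, as a list
      :param choices_number: how many of the top entries to return
      :param symbol: whether to remove punctuation; enabled by default
      :param lower: whether to convert to lowercase; enabled by default
      :return keyword_list: the words that appear, as a list
      :return value_list: the frequency of each corresponding word, as a list
    ]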
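To make the five steps concrete, here is a minimal standard-library sketch of the pipeline described above. It is not cleancc's source code: the helper names (strip_punct, count_words, drop_stop_words, top_n) and the tiny STOP_WORDS set are hypothetical stand-ins chosen for illustration, and the real functions may differ in signature and behavior.

```python
import string
from collections import Counter

# Hypothetical stop-word set for illustration only; not cleancc's actual list.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "to"}


def strip_punct(pop_list, lower=True):
    """Join the list into one string and remove ASCII punctuation (like punct)."""
    text = " ".join(pop_list)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.lower() if lower else text


def count_words(pop_list, symbol=True, lower=True):
    """Count whitespace-separated words and return a plain dict (like statistics)."""
    if symbol:
        text = strip_punct(pop_list, lower)
    else:
        text = " ".join(pop_list)
        text = text.lower() if lower else text
    return dict(Counter(text.split()))


def drop_stop_words(word_count, stop=STOP_WORDS):
    """Return a copy of the counts without stop words (like stop_words)."""
    return {w: c for w, c in word_count.items() if w not in stop}


def top_n(word_count, choices_number):
    """Sort by count, descending, and split into two parallel lists (like Count_Sort)."""
    ranked = sorted(word_count.items(), key=lambda kv: kv[1], reverse=True)
    ranked = ranked[:choices_number]
    keyword_list = [w for w, _ in ranked]
    value_list = [c for _, c in ranked]
    return keyword_list, value_list


comments = ["The cat sat.", "The cat ran!", "A dog sat"]
counts = drop_stop_words(count_words(comments))
print(top_n(counts, 2))  # (['cat', 'sat'], [2, 2])
```

Chaining the four helpers in this order mirrors what word_all is described as doing in a single call.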

Notes

  • Note: the data to process must be a list; pandas data needs to be converted to a list before calling.

  • Usage example:

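import pandas as pd
from bs4 import BeautifulSoup
from cleancc import clean

# Read the tab-separated file; backslash is the escape character
df = pd.read_csv("label.csv", sep='\t', escapechar='\\')
# cleancc expects a plain list, so convert the column first
review_list = df['review'].tolist()
# Strip HTML markup from each review (the 'lxml' parser must be installed)
comment_list = [BeautifulSoup(k, 'lxml').text for k in review_list]
print(comment_list)

# Get the 150 most frequent words and their counts
keyword_list, value_list = clean.word_all(comment_list, 150)
print(keyword_list, value_list)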
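Since word_all is documented as returning two lists, one of words and one of their corresponding frequencies, it can be convenient to pair them up before printing or plotting. The sketch below uses made-up sample values in place of a real call, and pair_counts is a hypothetical helper, not part of cleancc.

```python
# Hypothetical sample values standing in for the result of clean.word_all(...)
keyword_list = ["movie", "good", "plot"]
value_list = [12, 7, 3]


def pair_counts(keywords, values):
    """Zip two parallel lists into a list of (word, count) tuples."""
    if len(keywords) != len(values):
        raise ValueError("keyword and value lists must have the same length")
    return list(zip(keywords, values))


# Print one word per line, with counts aligned in a second column
for word, count in pair_counts(keyword_list, value_list):
    print(f"{word:<10}{count}")
```

The length check is only a guard for hand-built inputs; lists that come from the same call should already line up.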

Download files

Source Distribution

cleancc-0.1.0.tar.gz (6.0 kB view details)

Uploaded Source

Built Distribution

cleancc-0.1.0-py3-none-any.whl (7.1 kB view details)

Uploaded Python 3

File details

Details for the file cleancc-0.1.0.tar.gz.

File metadata

  • Download URL: cleancc-0.1.0.tar.gz
  • Upload date:
  • Size: 6.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.5

File hashes

Hashes for cleancc-0.1.0.tar.gz
Algorithm Hash digest
SHA256 0a3e1e7aa5af9a67b12a4f7cc7b8a71aa3c1023f0686a9ec2e8c869a62bdbd65
MD5 ad4e0fda53123e2b28cd64864516db3f
BLAKE2b-256 b6a4c0e8f8d01185b354c3a6eaf3a2f3f0e06f740c42101cb49f69f7c114e33b

File details

Details for the file cleancc-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: cleancc-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 7.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.5

File hashes

Hashes for cleancc-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3933343e007a11722eb094cce86f0831cbec7da051868ed4b47759180158f875
MD5 5a00f22cc448f3396713f24507b7a2d4
BLAKE2b-256 a804a189295e6b6c811b7a2f6d0ce845d7ed3e6eaf152a8c5057b8dcd076ca6b
