

Project description


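tidytext is an R package for text analysis. Data is usually organized as a dataframe in which every row is a docid-word-freq record. The R text mining book "Text Mining with R" covers the subject fairly completely, and its main analysis tool is the tidytext package.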

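An earlier project provided a first implementation of R's unnest_tokens and bind_tf_idf, but did not implement get_sentiments or get_stopwords. This project builds mainly on that work and completes it.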

Images may not display on this page. You can click the link (password: wucj) to download the code and data used in this article.


pip install tidytextpy



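  • chapterid: chapter number
  • title: chapter (section) title
  • text: the text of each chapter (already segmented, with tokens separated by spaces so that it resembles English)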
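import re

import jieba
import pandas as pd

pd.set_option('display.max_rows', 6)

# Read the whole novel into a single string
with open('三体.txt', encoding='utf-8') as f:
    raw_texts = f.read()

# Split on chapter markers such as "第12章" and drop empty pieces
texts = [text for text in re.split(r'第\d+章', raw_texts) if text]
# Segment each chapter with jieba and join the tokens with spaces
texts = [' '.join(jieba.lcut(text)) for text in texts]
# The chapter title follows the marker on the same line
titles = re.findall(r'第\d+章 (.*?)\n', raw_texts)

data = {'chapterid': list(range(1, len(titles)+1)),
        'title': titles,
        'text': texts}
df = pd.DataFrame(data)
df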
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/sc/3mnt5tgs419_hk7s16gq61p80000gn/T/jieba.cache
Loading model cost 0.592 seconds.
Prefix dict has been built successfully.
|     | chapterid | title | text |
| --- | --- | --- | --- |
| 0 | 1 | 科学边界(1) | 科学 边界 ( 1 ) \n \n 恋上你 看书 网 630book... |
| 1 | 2 | 科学边界(2) | 科学 边界 ( 2 ) \n \n 恋上你 看书 网 630book... |
| 2 | 3 | 台球 | 台球 \n \n 恋上你 看书 网 630bookla , 最快... |
| ... | ... | ... | ... |
| 210 | 211 | 【时间之外,我们的宇宙】(2) | 【 时间 之外 , 我们 的 宇宙 】 ( 2 ) \n \n 恋上你 ... |
| 211 | 212 | 【时间之外,我们的宇宙】(3) | 【 时间 之外 , 我们 的 宇宙 】 ( 3 ) \n \n 恋上你 ... |
| 212 | 213 | 注释 | 注释 \n \n 恋上你 看书 网 630bookla , 最快... |

213 rows × 3 columns
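
To see how the chapter splitting behaves without the novel file or jieba, here is a minimal sketch on a made-up two-chapter string. The sample text and the `chapters` structure are assumptions for illustration; only the two regular expressions come from the listing above.

```python
import re

# Hypothetical miniature of the novel file: marker, space, title, newline, body.
sample = "第1章 开端\n正文一。\n第2章 结局\n正文二。\n"

# Nothing precedes the first marker here, so dropping empty pieces
# leaves exactly one piece per chapter.
pieces = [p for p in re.split(r"第\d+章", sample) if p]
titles = re.findall(r"第\d+章 (.*?)\n", sample)

chapters = [
    {"chapterid": i, "title": title, "text": piece.strip()}
    for i, (title, piece) in enumerate(zip(titles, pieces), start=1)
]
```

Each piece still begins with its own title, which matches the text column shown above. If a file has front matter before the first chapter marker, `pieces` would contain one extra leading element and would no longer line up with `titles`, so it is worth comparing the two lengths before building the dataframe.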


  • get_stopwords: stopword lists
  • get_sentiments: sentiment lexicons
  • unnest_tokens: tokenization function
  • bind_tf_idf: computes tf-idf


get_stopwords(language) returns the stopword list for the given language. Only chinese and english are supported at present.

from tidytextpy import get_stopwords

cn_stps = get_stopwords('chinese')
en_stps = get_stopwords()
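
A stopword list is typically used to filter tokens before counting. The sketch below uses a small hand-written set in place of the real list; the words in it and the variable names are assumptions, and it does not depend on the exact type that get_stopwords returns.

```python
# Hypothetical stand-in for a stopword list; the real one comes from get_stopwords().
stopwords = {"the", "of", "is", "than"}

tokens = ["beautiful", "is", "better", "than", "ugly"]

# Keep only the tokens that are not stopwords, preserving their order.
content_tokens = [t for t in tokens if t not in stopwords]
```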


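get_sentiments('lexicon_name') loads the named lexicon and returns it as a dataframe.

  • afinn: sentiment ranges from -5 to 5
  • bing: sentiment is positive or negative
  • nrc: sentiment is positive or negative, plus fine-grained emotion categories
  • dutir: sentiment is one of seven Chinese emotion categories (fine-grained emotion classification)
  • hownet: sentiment is positive or negative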


from tidytextpy import get_sentiments

|     | sentiment | word |
| --- | --- | --- |
| 0 |  | 冷不防 |
| 1 |  | 惊动 |
| 2 |  | 珍闻 |
| ... | ... | ... |
| 27411 |  | 匆猝 |
| 27412 |  | 忧心仲忡 |
| 27413 |  | 面面厮觑 |

27414 rows × 2 columns

|     | word | sentiment |
| --- | --- | --- |
| 0 | abacus | trust |
| 1 | abandon | fear |
| 2 | abandon | negative |
| ... | ... | ... |
| 13898 | zest | positive |
| 13899 | zest | trust |
| 13900 | zip | negative |

13901 rows × 2 columns
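
In the word/sentiment output above, one word can appear on several rows (abandon is listed with both fear and negative), so joining tokens against such a lexicon can yield more than one row per token. A minimal sketch of that one-to-many shape, using only three rows from that output and a plain dictionary rather than a dataframe:

```python
from collections import defaultdict

# Three rows taken from the word/sentiment output above.
rows = [("abacus", "trust"), ("abandon", "fear"), ("abandon", "negative")]

# Collect every label attached to each word.
labels = defaultdict(set)
for word, sentiment in rows:
    labels[word].add(sentiment)
```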


unnest_tokens(__data, output, input)

  • __data: the dataframe to process
  • output: name of the column in the new dataframe that stores the tokens
  • input: name of the column in __data that holds the text to tokenize
from tidytextpy import unnest_tokens

tokens = unnest_tokens(df, output='word', input='text')
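|     | chapterid | title | word |
| --- | --- | --- | --- |
| 0 | 1 | 科学边界(1) | 科学 |
| 0 | 1 | 科学边界(1) | 边界 |
| 0 | 1 | 科学边界(1) | 1 |
| ... | ... | ... | ... |
| 212 | 213 | 注释 | 想到 |
| 212 | 213 | 注释 | 暗物质 |
| 212 | 213 | 注释 |  |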

556595 rows × 3 columns
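
The one-token-per-row reshaping can be sketched in plain Python. This is not tidytextpy's implementation: the `unnest` helper, the sample rows, and the tokenization rule (lowercase, then runs of word characters) are all assumptions chosen to illustrate the shape of the result. The parameter names mirror the signature above, which is why `input` shadows the builtin inside the helper.

```python
import re

rows = [
    {"docid": 1, "text": "Beautiful is better than ugly."},
    {"docid": 2, "text": "Readability counts."},
]

def unnest(rows, output, input):
    """Yield one row per token, copying the other columns along."""
    for row in rows:
        rest = {k: v for k, v in row.items() if k != input}
        # Assumed tokenizer: lowercase, then take runs of word characters.
        for token in re.findall(r"\w+", row[input].lower()):
            yield {**rest, output: token}

tokens = list(unnest(rows, output="word", input="text"))
```

The input column is replaced by the output column, and every other column is repeated on each token row, which is the same shape as the chapterid/title/word output above.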


From here on we use plydata's pipe operator >> together with its common functions. If something is unclear, consult the plydata documentation.

from plydata import count, group_by, ungroup

wordfreq = (df
            >> unnest_tokens(output='word', input='text')  # tokenize
            >> group_by('chapterid')  # group by chapter
            >> count()  # count the tokens in each chapter
            >> ungroup()  # drop the grouping
            )
wordfreq

|     | chapterid | n |
| --- | --- | --- |
| 0 | 1 | 2549 |
| 1 | 2 | 2666 |
| 2 | 3 | 1726 |
| ... | ... | ... |
| 210 | 211 | 2505 |
| 211 | 212 | 2646 |
| 212 | 213 | 2477 |

213 rows × 2 columns
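
Counting tokens per chapter amounts to counting rows per chapterid once the data is in one-token-per-row form. A minimal plain-Python sketch of that grouping, with made-up token rows (it illustrates the idea and is not how plydata implements group_by and count):

```python
from collections import Counter

# Hypothetical one-token-per-row data: (chapterid, word) pairs.
tokens = [(1, "科学"), (1, "边界"), (2, "科学"), (2, "边界"), (2, "台球")]

# Counting rows per chapterid gives the number of tokens in each chapter.
n_by_chapter = Counter(chapterid for chapterid, _ in tokens)
wordfreq = [{"chapterid": c, "n": n} for c, n in sorted(n_by_chapter.items())]
```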



from plotnine import ggplot, aes, theme, geom_line, labs, element_text

(ggplot(wordfreq, aes(x='chapterid', y='n'))
 + geom_line()  # one value per chapter, joined into a line
 + theme(figure_size=(12, 8),
         title=element_text(family='Kai', size=15))  # 'Kai' must be a font installed on your system
)


<ggplot: (338899281)>


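Important things are worth repeating o( ̄︶ ̄)o

get_sentiments('lexicon_name') loads the named lexicon and returns it as a dataframe.

  • afinn: sentiment ranges from -5 to 5
  • bing: sentiment is positive or negative
  • nrc: sentiment is positive or negative, plus fine-grained emotion categories
  • dutir: sentiment is one of seven Chinese emotion categories (fine-grained emotion classification)
  • hownet: sentiment is positive or negative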



This part relies on quite a few plydata features; see the documentation of the relevant functions.

from plydata import inner_join, count, define, call
from plydata.tidy import spread

chapter_sentiment_score = (
    df
    >> unnest_tokens(output='word', input='text')  # tokenize
    >> inner_join(get_sentiments('hownet'))  # keep tokens found in the hownet lexicon, attaching a sentiment to each
    >> count('chapterid', 'sentiment')  # count each sentiment class per chapter
    >> spread('sentiment', 'n')  # turn positive and negative into two columns
    >> call('.fillna', 0)  # replace missing values with 0
    >> define(score='(positive-negative)/(positive+negative)')  # sentiment score of each chapter
)
chapter_sentiment_score

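|     | chapterid | negative | positive | score |
| --- | --- | --- | --- | --- |
| 0 | 1 | 93.0 | 56.0 | -0.248322 |
| 1 | 2 | 98.0 | 83.0 | -0.082873 |
| 2 | 3 | 54.0 | 37.0 | -0.186813 |
| ... | ... | ... | ... | ... |
| 210 | 211 | 56.0 | 73.0 | 0.131783 |
| 211 | 212 | 71.0 | 67.0 | -0.028986 |
| 212 | 213 | 75.0 | 74.0 | -0.006711 |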

213 rows × 4 columns
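
The score is (positive - negative) / (positive + negative), so it runs from -1 (only negative words matched) to 1 (only positive words matched). The sketch below reproduces the same steps in plain Python on made-up data; the lexicon entries and token rows are assumptions for illustration, not hownet content.

```python
from collections import Counter

# Hypothetical lexicon and (chapterid, word) token rows.
lexicon = {"good": "positive", "great": "positive", "bad": "negative"}
tokens = [(1, "good"), (1, "bad"), (1, "idea"), (1, "great"), (2, "bad")]

# Inner join: keep only tokens found in the lexicon, then count per (chapter, sentiment).
counts = Counter((ch, lexicon[w]) for ch, w in tokens if w in lexicon)

scores = {}
for ch in sorted({ch for ch, _ in counts}):
    pos = counts[(ch, "positive")]  # a Counter returns 0 for missing keys, like fillna(0)
    neg = counts[(ch, "negative")]
    scores[ch] = (pos - neg) / (pos + neg)
```

As a check against the output above, chapter 1 has 56 positive and 93 negative matches, and (56 - 93) / (56 + 93) rounds to -0.248322.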



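from plotnine import ggplot, aes, geom_line, element_text, labs, theme

(ggplot(chapter_sentiment_score, aes('chapterid', 'score'))
 + geom_line()  # draw the score of each chapter as a line
 + labs(x='Chapter', y='Sentiment score', title='Sentiment trend across chapters of "The Three-Body Problem"')
)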


<ggplot: (364328989)>




import pandas as pd
pd.set_option('display.max_rows', 6)

zen = """
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
"""

zen_split = zen.splitlines()

df = pd.DataFrame({'docid': list(range(len(zen_split))),
                  'text': zen_split})

|     | docid | text |
| --- | --- | --- |
| 0 | 0 |  |
| 1 | 1 | The Zen of Python, by Tim Peters |
| 2 | 2 |  |
| ... | ... | ... |
| 19 | 19 | If the implementation is hard to explain, it's... |
| 20 | 20 | If the implementation is easy to explain, it m... |
| 21 | 21 | Namespaces are one honking great idea -- let's... |

22 rows × 2 columns



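bind_tf_idf(_data, term, document, n)

  • _data: the input dataframe
  • term: name of the column holding the terms
  • document: name of the column holding the document ids
  • n: name of the column holding the term counts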
from tidytextpy import bind_tf_idf, unnest_tokens
from plydata import count

tfidfs = (df
          >> unnest_tokens(output='word', input='text')
          >> count('docid', 'word')  # n = occurrences of each word in each document
          >> bind_tf_idf(term='word', document='docid', n='n')
          )
tfidfs

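|     | docid | word | n | tf | idf | tf_idf |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | the | 1 | 0.142857 | 1.386294 | 0.198042 |
| 1 | 1 | zen | 1 | 0.142857 | 2.995732 | 0.427962 |
| 2 | 1 | of | 1 | 0.142857 | 1.897120 | 0.271017 |
| ... | ... | ... | ... | ... | ... | ... |
| 137 | 21 | more | 1 | 0.090909 | 2.995732 | 0.272339 |
| 138 | 21 | of | 1 | 0.090909 | 1.897120 | 0.172465 |
| 139 | 21 | those | 1 | 0.090909 | 2.995732 | 0.272339 |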

140 rows × 6 columns
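
The numbers above are consistent with tf = n / (tokens in the document) and idf = ln(number of documents / number of documents containing the term). For example, the title line has 7 tokens (tf = 1/7 ≈ 0.142857) and "zen" occurs in 1 of the 20 non-empty lines (idf = ln 20 ≈ 2.995732). This is inferred from the output rather than from the library source. A minimal plain-Python sketch of that calculation on made-up token rows:

```python
import math
from collections import Counter

# Hypothetical (docid, word) token rows.
tokens = [(1, "the"), (1, "zen"), (2, "the"), (2, "rules"), (2, "rules"), (3, "now")]

counts = Counter(tokens)                        # n: occurrences of each word in each doc
doc_totals = Counter(doc for doc, _ in tokens)  # total tokens per doc
doc_freq = Counter(word for _, word in counts)  # number of docs containing each word
n_docs = len(doc_totals)

tfidf = {}
for (doc, word), n in counts.items():
    tf = n / doc_totals[doc]
    idf = math.log(n_docs / doc_freq[word])     # natural logarithm
    tfidf[(doc, word)] = (tf, idf, tf * idf)
```

A word that appears in every document gets idf = ln 1 = 0, so its tf_idf is 0 regardless of how often it occurs.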


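If you come from an economics, management, humanities, or social science background, are new to programming, and face the daunting task of collecting and analyzing large volumes of text, you may want to look at the video course "Python Web Scraping and Text Data Analysis" (《python网络爬虫与文本数据分析》). As a liberal arts student I also started out knowing nothing, and this course condenses five years of experience. I believe it is explained in an accessible way o( ̄︶ ̄)o. It covers:

  • Getting started with Python
  • Web scraping
  • Reading data
  • Introduction to text analysis
  • Machine learning and text analysis
  • Applications of text analysis in economics and management research

If you are interested, follow the link to "Python Web Scraping and Text Data Analysis" and take a look.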


