A Chinese co-word analysis with topic discovery package
catd
Overview
The catd co-word analysis with topic discovery package is intended for analyzing Chinese corpora.
Use case
For a better experience, you can run this script with your own corpus: a text file containing a list of documents separated by '\n' (one document per line).
Corpus ('$ProjectRoot/data/original_data/tianya_posts_test_set_10.txt'):
documents1
documents2
...
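Program:
import os

import catd

# Read the corpus: one document per line, skipping blank lines.
corpus = []
with open(os.path.join('data', 'original_data', 'tianya_posts_test_set_10.txt'), encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:
            corpus.append(line)

# Collect stop words from the files under data/stop_words, then segment the documents.
stop_words_set = catd.util.collect_all_words_to_set_from_dir(os.path.join('data', 'stop_words'))
cut_corpus = catd.util.word_cut(corpus, stop_words_set)

# Build the co-word network from the segmented corpus.
word_net = catd.WordNet()
coded_corpus = word_net.generate_nodes_hash_and_edge(cut_corpus)
word_net.add_cut_corpus(coded_corpus)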
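For intuition about what a co-word network stores, here is a minimal sketch in plain Python. It does not use catd and is not its actual implementation. The function name `build_cooccurrence` and the choice to count each word pair once per document are assumptions made for illustration. The result mirrors the `dict[word][neighbors] -> weight` shape described under Data Structure.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(cut_corpus):
    """Count, for each pair of distinct words, how many documents contain both."""
    edges = defaultdict(lambda: defaultdict(int))
    for doc in cut_corpus:
        # Use the set of words so a pair is counted once per document (assumption).
        for a, b in combinations(sorted(set(doc)), 2):
            edges[a][b] += 1
            edges[b][a] += 1
    return edges


# A tiny, already segmented corpus (hypothetical data).
cut_corpus = [
    ["天气", "北京", "旅游"],
    ["北京", "旅游", "美食"],
    ["天气", "美食"],
]
edges = build_cooccurrence(cut_corpus)
print(edges["北京"]["旅游"])
```

Storing each edge in both directions doubles memory use but makes neighbor lookups symmetric, which is convenient for small corpora.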
Note
I am currently working on efficient visualization for big graphs (hundreds of millions of edges).
If you have any questions or suggestions, feel free to contact the author in English or Chinese. For the benefit of all users, please communicate in English when the discussion is public.
Data Structure
* WordNet
  * nodes: list[WordNode1, WordNode2, ...]
  * edges: dict[word][neighbors] -> weight
  * docs: list[Doc1, Doc2, ...]
  * get_node_by_str: dict[word] -> WordNode
* WordNode
  * id
  * name
  * doc_count
  * word_count
  * inverse_document_frequency
* Doc
  * id
  * word_count_in_doc
  * word_tf_in_doc
  * word_tf_idf
  * num_of_words
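The fields above can be sketched as plain dataclasses. This is an illustrative reconstruction, not catd's own classes. The helper `build_stats`, the field types, and the formulas (tf as count divided by document length, idf as log(N / doc_count)) are assumptions, since the formulas catd uses are not stated here.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WordNode:
    id: int
    name: str
    doc_count: int = 0    # number of documents containing the word
    word_count: int = 0   # total occurrences across the corpus
    inverse_document_frequency: float = 0.0


@dataclass
class Doc:
    id: int
    word_count_in_doc: dict = field(default_factory=dict)
    word_tf_in_doc: dict = field(default_factory=dict)
    word_tf_idf: dict = field(default_factory=dict)
    num_of_words: int = 0


def build_stats(cut_corpus):
    """Fill WordNode and Doc statistics from a segmented corpus (hypothetical helper)."""
    nodes = {}  # word -> WordNode, playing the role of get_node_by_str
    docs = []
    for doc_id, words in enumerate(cut_corpus):
        counts = Counter(words)
        total = len(words)
        doc = Doc(id=doc_id, word_count_in_doc=dict(counts), num_of_words=total)
        doc.word_tf_in_doc = {w: c / total for w, c in counts.items()}
        for w, c in counts.items():
            node = nodes.setdefault(w, WordNode(id=len(nodes), name=w))
            node.doc_count += 1
            node.word_count += c
        docs.append(doc)
    for node in nodes.values():
        # Assumed idf formula: log(N / doc_count), with no smoothing.
        node.inverse_document_frequency = math.log(len(docs) / node.doc_count)
    for doc in docs:
        doc.word_tf_idf = {
            w: tf * nodes[w].inverse_document_frequency
            for w, tf in doc.word_tf_in_doc.items()
        }
    return nodes, docs


nodes, docs = build_stats([["北京", "旅游", "北京"], ["旅游", "美食"]])
print(nodes["北京"].word_count, docs[0].num_of_words)
```

With this unsmoothed idf, a word that appears in every document gets a tf-idf of zero, which is one reason some implementations add smoothing.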
Changelog
0.3.0
Added support for the LDA model and for aggregating topic information from words.
License
MIT License