A Chinese co-word analysis with topic discovery package
catd
Overview
The catd co-word analysis with topic discovery package is intended for analyzing Chinese corpora.
Use case
For a better experience, run the script below on your own corpus: a text file containing one document per line (documents separated by '\n').
Corpus ('$ProjectRoot/data/original_data/tianya_posts_test_set_10.txt'):
document1
document2
...
Program:
import os

import catd

# Read the corpus: one document per line.
corpus = []
with open(os.path.join('data', 'original_data', 'tianya_posts_test_set_10.txt'), encoding='utf-8') as f:
    for line in f:
        corpus.append(line)

# Collect stop words from every file in the stop-word directory, then cut each document into words.
stop_words_set = catd.util.collect_all_words_to_set_from_dir(os.path.join('data', 'stop_words'))
cut_corpus = catd.util.word_cut(corpus, stop_words_set)

# Build the co-word network from the cut corpus.
word_net = catd.WordNet()
coded_corpus = word_net.generate_nodes_hash_and_edge(cut_corpus)
word_net.add_cut_corpus(coded_corpus)
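To make the idea of a co-word network more concrete, here is a minimal sketch of co-occurrence counting in plain Python. It does not use catd and is not catd's implementation. The function name `count_cooccurrences` is hypothetical, and the sketch assumes that two words co-occur when they appear in the same document and that each document has already been cut into a list of words.

```python
from collections import defaultdict
from itertools import combinations


def count_cooccurrences(cut_corpus):
    """Return edges[word][neighbor] = number of documents containing both words.

    Hypothetical helper for illustration only; not part of catd.
    """
    edges = defaultdict(lambda: defaultdict(int))
    for doc in cut_corpus:
        # Use the distinct words so each pair is counted once per document.
        for a, b in combinations(sorted(set(doc)), 2):
            edges[a][b] += 1
            edges[b][a] += 1  # store both directions: the graph is undirected
    return edges


# A tiny, made-up corpus that is already cut into words.
cut_corpus = [
    ["天气", "北京", "旅游"],
    ["北京", "旅游", "美食"],
    ["天气", "美食"],
]
edges = count_cooccurrences(cut_corpus)
print(edges["北京"]["旅游"])  # 2
print(edges["天气"]["美食"])  # 1
```

The nested dictionary mirrors the `dict[word][neighbors] -> weight` layout listed under Data Structure, with the co-occurrence count used as the weight in this sketch.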
Note
I am currently working on efficient visualization for large graphs (hundreds of millions of edges).
If you have any questions or suggestions, feel free to contact the author in English or Chinese. For the benefit of all users, please communicate in English in public channels.
Data Structure
* WordNet
    * nodes: list[WordNode1, WordNode2, ...]
    * edges: dict[word][neighbors] -> weight
    * docs: list[Doc1, Doc2, ...]
    * get_node_by_str: dict[word] -> WordNode
* WordNode
    * id
    * name
    * doc_count
    * word_count
    * inverse_document_frequency
* Doc
    * id
    * word_count_in_doc
    * word_tf_in_doc
    * word_tf_idf
    * num_of_words
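The field names above can be read as a small set of record types. The following is a rough sketch using dataclasses, not the package's actual class definitions: the field types, the `build` helper, and the formulas (term frequency as count divided by document length, inverse document frequency as `log(N / doc_count)`) are all assumptions made for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WordNode:
    id: int
    name: str
    doc_count: int = 0    # assumed: number of documents containing the word
    word_count: int = 0   # assumed: total occurrences across the corpus
    inverse_document_frequency: float = 0.0


@dataclass
class Doc:
    id: int
    word_count_in_doc: dict = field(default_factory=dict)
    word_tf_in_doc: dict = field(default_factory=dict)
    word_tf_idf: dict = field(default_factory=dict)
    num_of_words: int = 0


def build(cut_corpus):
    """Hypothetical helper: fill WordNode and Doc records from a cut corpus."""
    nodes, docs = {}, []
    for doc_id, words in enumerate(cut_corpus):
        counts = Counter(words)
        doc = Doc(id=doc_id, word_count_in_doc=dict(counts), num_of_words=len(words))
        # Term frequency: occurrences divided by the document length.
        doc.word_tf_in_doc = {w: c / len(words) for w, c in counts.items()}
        for w, c in counts.items():
            node = nodes.setdefault(w, WordNode(id=len(nodes), name=w))
            node.doc_count += 1
            node.word_count += c
        docs.append(doc)
    # Inverse document frequency needs the final document count.
    for node in nodes.values():
        node.inverse_document_frequency = math.log(len(docs) / node.doc_count)
    for doc in docs:
        doc.word_tf_idf = {
            w: tf * nodes[w].inverse_document_frequency
            for w, tf in doc.word_tf_in_doc.items()
        }
    return nodes, docs


nodes, docs = build([["北京", "旅游", "北京"], ["旅游", "美食"]])
print(nodes["北京"].word_count)                   # 2
print(nodes["旅游"].inverse_document_frequency)   # 0.0
```

Here `nodes` plays the role of `get_node_by_str` (a mapping from word to WordNode); a word that appears in every document gets an inverse document frequency of zero under the assumed formula.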
Log
0.3.0
Add support for the LDA model and for aggregating topic information from words.
License
MIT License
Download files
File details
Details for the file catd-0.5.0.tar.gz.
File metadata
- Download URL: catd-0.5.0.tar.gz
- Upload date:
- Size: 13.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ee6ad25b55697f542260d34c52aaa659409d0ed6da8e9b541796ae395ce2a8aa |
| MD5 | 66f89ad6259f0dc3a1fff6b57ca14e95 |
| BLAKE2b-256 | 0bcdf767d99931792ae2fda3aeddc881a9ead70c0d9cc4a7d6130c8659c5ed2a |
File details
Details for the file catd-0.5.0-py3-none-any.whl.
File metadata
- Download URL: catd-0.5.0-py3-none-any.whl
- Upload date:
- Size: 13.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f3a030223907e1201744dc05d350a5d19d3d03e04ca2139c303f6b409746de70 |
| MD5 | 4c44d6a01f75caed23c097c1e35663b6 |
| BLAKE2b-256 | 2317a226920320e1b5131dd0791a2498d1d0ba45471151174fe6eeac325f4425 |