Collocation extraction from segmented texts
Project description
Collocation
Install
pip install collocation
Usage
from collocation import Collocation
# Prepare corpus data
# https://yongfu.name/collocation/sampled_PTTposts.txt
# Each non-empty line is one sentence; tokens are separated by
# full-width spaces (U+3000).
corpus = []
with open("sampled_PTTposts.txt", encoding="utf-8") as f:
    for sent in f.read().split("\n"):
        if sent.strip() == "":
            continue  # skip blank lines
        sentence = []
        for tk in sent.split("\u3000"):
            if tk == "":
                continue  # skip empty tokens from repeated separators
            sentence.append(tk)
        corpus.append(sentence)
>>> corpus[:7]
[['物品', '名稱', ':', '學生證'],
['拾獲', '地點', ':', '大一女', '前'],
['拾獲', '時間', ':', '6', '/', '21'],
['18', ':', '20', '左右'],
['物品', '描述', ':', '就', '一', '張', '學生證'],
['聯絡', '方式', ':', '站', '內', '信'],
['其他', '說明', ':', '請', '失主', '或', '朋友', '速速', '聯絡', '喔']]
# Initialize
c = Collocation(corpus, left_window=3, right_window=3)
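The `left_window` and `right_window` arguments suggest that collocates are collected from a fixed span of tokens on either side of the node word. The sketch below illustrates that general idea only; it is not the package's implementation. The function name `window_cooccurrences`, the toy sentences, and the use of `re.fullmatch` to match a node pattern against each token are all assumptions made for illustration.

```python
import re
from collections import Counter


def window_cooccurrences(corpus, node_pattern, left_window=3, right_window=3):
    """Count tokens occurring within a window around each node match.

    Hypothetical helper: `corpus` is a list of token lists, and a token
    counts as a node when the whole token matches `node_pattern`
    (an assumption; the package may match differently).
    """
    pattern = re.compile(node_pattern)
    counts = Counter()
    for sentence in corpus:
        for i, tk in enumerate(sentence):
            if pattern.fullmatch(tk) is None:
                continue
            # Clip the window at the sentence boundaries.
            start = max(0, i - left_window)
            end = min(len(sentence), i + right_window + 1)
            for j in range(start, end):
                if j != i:  # the node itself is not its own collocate
                    counts[sentence[j]] += 1
    return counts


# Toy data, invented for this example (not taken from the sample corpus).
toy_corpus = [
    ["國立", "臺灣", "大學"],
    ["台灣", "師範", "大學"],
]
toy_counts = window_cooccurrences(toy_corpus, "[臺台]灣", left_window=1, right_window=1)
```

Windows are clipped at sentence boundaries here, so a node at the start of a sentence simply has no left context; whether the package treats boundaries the same way is not stated on this page.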
# Query
>>> c.get_topn_collocates("[臺台]灣", cutoff=3, n=3)
[('臺灣', '主體性',
{'MI': 10.260437705682913,
'Xsq': 4899.751374916442,
'Gsq': 50.46859378066087,
'Dice': 0.02666666666666667,
'DeltaP21': 0.013881338057636753,
'DeltaP12': 0.33306534863488213,
'RawCount': 4}),
('臺灣', '師範',
{'MI': 9.260437705682913,
'Xsq': 2445.9071435603837,
'Gsq': 44.12442832538992,
'Dice': 0.02564102564102564,
'DeltaP21': 0.013870011810758549,
'DeltaP12': 0.16639867893371077,
'RawCount': 4}),
('臺灣', '國立',
{'MI': 8.801006087045614,
'Xsq': 3553.4674791397297,
'Gsq': 82.8726217054829,
'Dice': 0.04519774011299435,
'DeltaP21': 0.027723034251199794,
'DeltaP12': 0.12094789748256553,
'RawCount': 8})]
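The keys in the result above correspond to common association measures computed from a 2×2 contingency table of node and collocate counts. As a rough guide to reading them, the sketch below computes the textbook versions of these measures. The formulas, the function name `association_measures`, and its parameters are assumptions for illustration; this page does not confirm that the package uses exactly these definitions.

```python
import math


def association_measures(o11, f1, f2, n):
    """Textbook association measures from a 2x2 contingency table.

    Hypothetical helper, not the package's API.
    o11: co-occurrence count of node and collocate
    f1:  total count of the node, f2: total count of the collocate
    n:   total number of observations
    Requires o11 > 0, 0 < f1 < n and 0 < f2 < n.
    """
    # Observed cell counts.
    o12 = f1 - o11            # node without collocate
    o21 = f2 - o11            # collocate without node
    o22 = n - f1 - f2 + o11   # neither
    observed = [o11, o12, o21, o22]
    # Expected cell counts under independence.
    expected = [
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    return {
        # Pointwise mutual information, base 2.
        "MI": math.log2(o11 / expected[0]),
        # Pearson chi-squared statistic.
        "Xsq": sum((o - e) ** 2 / e for o, e in zip(observed, expected)),
        # Log-likelihood ratio (G squared); empty cells contribute 0.
        "Gsq": 2 * sum(o * math.log(o / e)
                       for o, e in zip(observed, expected) if o > 0),
        "Dice": 2 * o11 / (f1 + f2),
        # Assumed reading: P(collocate | node) - P(collocate | not node).
        "DeltaP21": o11 / f1 - o21 / (n - f1),
        # Assumed reading: P(node | collocate) - P(node | not collocate).
        "DeltaP12": o11 / f2 - o12 / (n - f2),
        "RawCount": o11,
    }


# Invented counts: the pair co-occurs 8 times, far more than the 0.5 expected.
example = association_measures(o11=8, f1=16, f2=32, n=1024)
```

Under these definitions, MI and Dice reward exclusive pairings even at low counts, while Gsq also takes the amount of evidence into account, which is one reason to inspect several measures together with RawCount.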
Download files
Source Distribution
collocation-0.0.2.tar.gz (5.5 kB)
Built Distribution
collocation-0.0.2-py3-none-any.whl (6.0 kB)
File details
Details for the file collocation-0.0.2.tar.gz.
File metadata
- Download URL: collocation-0.0.2.tar.gz
- Upload date:
- Size: 5.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9989ecf903b5aca97779c8907b3cbc9b923e96c2b9f636f1029bcf7f1180a61d
MD5 | 177ec16995c5d852b529a823af00ddbb
BLAKE2b-256 | 909a4cd759a825b07691b80a5eec4032af00af6be16f95b4def7b631bf237912
File details
Details for the file collocation-0.0.2-py3-none-any.whl.
File metadata
- Download URL: collocation-0.0.2-py3-none-any.whl
- Upload date:
- Size: 6.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 454b6c60a8731e37975ab54af4a14c2dde514bc800235ecc1d248137459e2e7c
MD5 | 1fe095664a94ab8da7bef5085e61405b
BLAKE2b-256 | 18414c92a85dd1a9c2a5631029915111e6b7743becd9751da0f6997fdcca623b