Collocation extraction from segmented texts

Project description

Collocation

Install

pip install collocation

Usage

from collocation import Collocation

# Prepare corpus data: one sentence per line, tokens separated by
# an ideographic space (U+3000).
# Sample file: https://yongfu.name/collocation/sampled_PTTposts.txt
corpus = []
with open("sampled_PTTposts.txt", encoding="utf-8") as f:
    for sent in f.read().split("\n"):
        if sent.strip() == "":
            continue  # skip blank lines
        # Drop empty tokens produced by consecutive separators
        sentence = [tk for tk in sent.split("\u3000") if tk != ""]
        corpus.append(sentence)

>>> corpus[:7]
[['物品', '名稱', ':', '學生證'],
 ['拾獲', '地點', ':', '大一女', '前'],
 ['拾獲', '時間', ':', '6', '/', '21'],
 ['18', ':', '20', '左右'],
 ['物品', '描述', ':', '就', '一', '張', '學生證'],
 ['聯絡', '方式', ':', '站', '內', '信'],
 ['其他', '說明', ':', '請', '失主', '或', '朋友', '速速', '聯絡', '喔']]
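
With the corpus in this list-of-token-lists shape, the idea behind window-based collocation counting can be sketched as follows. This is only an illustration under stated assumptions: `count_window_pairs` is a hypothetical helper, not part of the `collocation` package, and the package's own counting may differ. The parameter names `left_window` and `right_window` simply mirror the constructor arguments used in the next step.

```python
from collections import Counter


def count_window_pairs(corpus, node, left_window=3, right_window=3):
    """Count tokens found within a window around each occurrence of `node`.

    Hypothetical helper for illustration; not the package's implementation.
    """
    counts = Counter()
    for sentence in corpus:
        for i, token in enumerate(sentence):
            if token != node:
                continue
            # Clamp the window to the sentence boundaries
            start = max(0, i - left_window)
            end = min(len(sentence), i + right_window + 1)
            for j in range(start, end):
                if j != i:  # skip the node itself
                    counts[sentence[j]] += 1
    return counts


# Two sentences taken from the corpus sample shown above
sample = [
    ['物品', '名稱', ':', '學生證'],
    ['物品', '描述', ':', '就', '一', '張', '學生證'],
]
pairs = count_window_pairs(sample, '物品')
```

In the second sentence, '學生證' sits six tokens to the right of '物品', outside a window of 3, so it is counted only once across the two sentences.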


# Initialize
c = Collocation(corpus, left_window=3, right_window=3)
# Query
>>> c.get_topn_collocates("[臺台]灣", cutoff=3, n=3)
[('臺灣', '主體性',
  {'MI': 10.260437705682913,
   'Xsq': 4899.751374916442,
   'Gsq': 50.46859378066087,
   'Dice': 0.02666666666666667,
   'DeltaP21': 0.013881338057636753,
   'DeltaP12': 0.33306534863488213,
   'RawCount': 4}),
 ('臺灣', '師範',
  {'MI': 9.260437705682913,
   'Xsq': 2445.9071435603837,
   'Gsq': 44.12442832538992,
   'Dice': 0.02564102564102564,
   'DeltaP21': 0.013870011810758549,
   'DeltaP12': 0.16639867893371077,
   'RawCount': 4}),
 ('臺灣', '國立',
  {'MI': 8.801006087045614,
   'Xsq': 3553.4674791397297,
   'Gsq': 82.8726217054829,
   'Dice': 0.04519774011299435,
   'DeltaP21': 0.027723034251199794,
   'DeltaP12': 0.12094789748256553,
   'RawCount': 8})]
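
Each result above is a `(node, collocate, measures)` tuple, so the list can be re-ranked by any of the reported measures with plain Python. The sketch below copies a subset of the values from the output above into a literal list; `rank_by` is a hypothetical helper, not a function provided by the package.

```python
# Values copied from the query output above (only some measures kept)
results = [
    ('臺灣', '主體性', {'MI': 10.260437705682913, 'Gsq': 50.46859378066087, 'RawCount': 4}),
    ('臺灣', '師範', {'MI': 9.260437705682913, 'Gsq': 44.12442832538992, 'RawCount': 4}),
    ('臺灣', '國立', {'MI': 8.801006087045614, 'Gsq': 82.8726217054829, 'RawCount': 8}),
]


def rank_by(results, measure):
    """Return results sorted by the given measure, highest first.

    Hypothetical helper for illustration; not part of the package.
    """
    return sorted(results, key=lambda r: r[2][measure], reverse=True)


by_gsq = [collocate for _, collocate, _ in rank_by(results, 'Gsq')]
by_mi = [collocate for _, collocate, _ in rank_by(results, 'MI')]
```

Ranking by `Gsq` puts '國立' first, whereas ranking by `MI` keeps the order shown in the output above, which illustrates how the choice of measure can change which collocates come out on top.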

Download files

Source Distribution

collocation-0.0.2.tar.gz (5.5 kB)

Uploaded Source

Built Distribution

collocation-0.0.2-py3-none-any.whl (6.0 kB)

Uploaded Python 3

File details

Details for the file collocation-0.0.2.tar.gz.

File metadata

  • Download URL: collocation-0.0.2.tar.gz
  • Upload date:
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.0

File hashes

Hashes for collocation-0.0.2.tar.gz
Algorithm    Hash digest
SHA256       9989ecf903b5aca97779c8907b3cbc9b923e96c2b9f636f1029bcf7f1180a61d
MD5          177ec16995c5d852b529a823af00ddbb
BLAKE2b-256  909a4cd759a825b07691b80a5eec4032af00af6be16f95b4def7b631bf237912

File details

Details for the file collocation-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: collocation-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 6.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.0

File hashes

Hashes for collocation-0.0.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       454b6c60a8731e37975ab54af4a14c2dde514bc800235ecc1d248137459e2e7c
MD5          1fe095664a94ab8da7bef5085e61405b
BLAKE2b-256  18414c92a85dd1a9c2a5631029915111e6b7743becd9751da0f6997fdcca623b
