Extract concordance lines from corpus with CQL
Project description
Concordancer
This module loads and indexes a corpus in RAM and provides concordance search to retrieve data from the corpus using (a subset of) Corpus Query Language (CQL).
Installation
pip install concordaner
Usage
Loading a corpus from file
import json
from concordancer.concordancer import Concordancer
from concordancer.kwic_print import KWIC
# Read corpus from file
corpus = []
with open("test-data/tokenDict.jsonl") as f:
for l in f:
corpus.append(json.loads(l))
# Index and initiate the corpus as a concordance object
C = Concordancer(corpus)
C.set_cql_parameters(default_attr="word", max_quant=3)
CQL Concordance search
cql = '''
[pos="V.*"]+
'''
concord_list = C.cql_search(cql, left=2, right=2)
The result of the concordance search is a list of dictionaries, which can easily be converted to JSON or other data structures for further uses:
>>> concord_list[:2]
[
{
'left': [{'word': '來', 'pos': 'VA'}, {'word': '一', 'pos': 'Neu'}],
'keyword': [{'word': '小', 'pos': 'VH'}, {'word': '段', 'pos': 'Nf'}],
'right': [{'word': '故事', 'pos': 'Na'},
{'word': '。', 'pos': 'PERIODCATEGORY'}],
'position': {'doc_idx': 16, 'sent_idx': 3, 'tk_idx': 21},
'captureGroups': {'verb': [{'word': '小', 'pos': 'VH'}],
'noun': [{'word': '段', 'pos': 'Nf'}]}
},
{
'left': [{'word': '長度', 'pos': 'Na'}, {'word': '剛好', 'pos': 'Da'}],
'keyword': [{'word': '蓋住', 'pos': 'VC'}, {'word': '這', 'pos': 'Nep'}],
'right': [{'word': '件', 'pos': 'Nf'}, {'word': '裙子', 'pos': 'Na'}],
'position': {'doc_idx': 78, 'sent_idx': 59, 'tk_idx': 4},
'captureGroups': {'verb': [{'word': '蓋住', 'pos': 'VC'}],
'noun': [{'word': '這', 'pos': 'Nep'}]}
}
]
To better read the concordance lines, you can pass concord_list
into concordancer.kwic_print.KWIC()
to read them as keyword-in-context format in the console:
>>> KWIC(concord_list[:5])
left keyword right LABEL: verb LABEL: noun
--------------- ------------------------------ ------------------------- ------------- ----------------------
來/VA 一/Neu 小/VH 段/Nf 故事/Na 。/PERIODCATEGORY 小/VH 段/Nf
長度/Na 剛好/Da 蓋住/VC 這/Nep 件/Nf 裙子/Na 蓋住/VC 這/Nep
復古/VH 帥氣/VH 穿搭/VA !/EXCLAMATIONCATEGORY <URL>/FW 身高/Na 穿搭/VA !/EXCLAMATIONCATEGORY
長/VH 的話/Cba 有/V_2 什麼/Nep 解決/VC 的/DE 有/V_2 什麼/Nep
買/VC 了/Di 覺得/VK 材質/Na 很/Dfa 對/VH 覺得/VK 材質/Na
Supported CQL features
CQL search is supported through cqls
. A subset of useful CQL is supported:
- token:
[]
,"我"
,[word="我"]
,[word!="我" & pos="N.*"]
- token-level quantifier:
+
,*
,?
,{n,m}
- grouping:
("a" "b"? "c"){1,2}
- label:
lab1:[word="我" & pos="N.*"] lab2:("a" "b")
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
concordancer-0.1.0.tar.gz
(9.0 kB
view hashes)
Built Distribution
Close
Hashes for concordancer-0.1.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2b3440c943f0aaa3b835edff32fb0cc509de379fe8e0084d3228714d6a4a7543 |
|
MD5 | 4104430debf28d0f1c3ef5b6bb57e77d |
|
BLAKE2b-256 | 56de99351a49acd978be8b9edc9c0d07524bfc5ca7d1f3b75283c76c7305980c |