Extract concordance lines from corpus with CQL
Concordancer
This module loads and indexes a corpus in RAM and provides concordance search to retrieve data from the corpus using (a subset of) Corpus Query Language (CQL).
Installation
pip install concordancer
Usage
Loading a corpus from file
import json
from concordancer.demo import download_demo_corpus
from concordancer.concordancer import Concordancer
# Load demo corpus
fp = download_demo_corpus(to="~/Desktop")
with open(fp, encoding="utf-8") as f:
    corpus = [json.loads(line) for line in f]
# Index and initiate the corpus as a concordancer object
C = Concordancer(corpus)
C.set_cql_parameters(default_attr="word", max_quant=3)
CQL Concordance search
cql = '''
verb:[pos="V.*"] noun:[pos="N[abch]"]
'''
concord_list = C.cql_search(cql, left=2, right=2)
The concordance search returns a generator, which can be converted to a list of dictionaries (and then to JSON or other data structures for further use):
>>> concord_list = list(concord_list)
>>> concord_list[:2]
[
    {
        'left': [{'word': '買', 'pos': 'VC'}, {'word': '了', 'pos': 'Di'}],
        'keyword': [{'word': '覺得', 'pos': 'VK'}, {'word': '材質', 'pos': 'Na'}],
        'right': [{'word': '很', 'pos': 'Dfa'}, {'word': '對', 'pos': 'VH'}],
        'position': {'doc_idx': 78, 'sent_idx': 13, 'tk_idx': 9},
        'captureGroups': {'verb': [{'word': '覺得', 'pos': 'VK'}],
                          'noun': [{'word': '材質', 'pos': 'Na'}]}
    },
    {
        'left': [{'word': '“', 'pos': 'PARENTHESISCATEGORY'},
                 {'word': '不', 'pos': 'D'}],
        'keyword': [{'word': '戴', 'pos': 'VC'}, {'word': '錶', 'pos': 'Na'}],
        'right': [{'word': '世代', 'pos': 'Na'}, {'word': '”', 'pos': 'VC'}],
        'position': {'doc_idx': 52, 'sent_idx': 7, 'tk_idx': 36},
        'captureGroups': {'verb': [{'word': '戴', 'pos': 'VC'}],
                          'noun': [{'word': '錶', 'pos': 'Na'}]}
    }
]
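As a minimal sketch of the "further use" mentioned above, the snippet below serializes results to JSON Lines and counts verb-noun pairs from the capture groups. It does not call the library: `sample_results` is copied from the output shown above, and the `results()` generator is a hypothetical stand-in for what `C.cql_search()` returns.

```python
import json
from collections import Counter

# Two results copied from the output above (context fields trimmed for brevity).
sample_results = [
    {
        "keyword": [{"word": "覺得", "pos": "VK"}, {"word": "材質", "pos": "Na"}],
        "position": {"doc_idx": 78, "sent_idx": 13, "tk_idx": 9},
        "captureGroups": {"verb": [{"word": "覺得", "pos": "VK"}],
                          "noun": [{"word": "材質", "pos": "Na"}]},
    },
    {
        "keyword": [{"word": "戴", "pos": "VC"}, {"word": "錶", "pos": "Na"}],
        "position": {"doc_idx": 52, "sent_idx": 7, "tk_idx": 36},
        "captureGroups": {"verb": [{"word": "戴", "pos": "VC"}],
                          "noun": [{"word": "錶", "pos": "Na"}]},
    },
]


def results():
    """Hypothetical stand-in for the generator returned by C.cql_search()."""
    yield from sample_results


# A generator can only be consumed once, so materialize it first.
concord_list = list(results())

# One JSON object per line; ensure_ascii=False keeps Chinese text readable.
json_lines = [json.dumps(r, ensure_ascii=False) for r in concord_list]

# Count (verb, noun) pairs using the labelled capture groups.
pair_counts = Counter(
    (
        "".join(t["word"] for t in r["captureGroups"]["verb"]),
        "".join(t["word"] for t in r["captureGroups"]["noun"]),
    )
    for r in concord_list
)

print("\n".join(json_lines))
print(pair_counts.most_common())
```

Because the results are plain dictionaries and lists, they should round-trip through `json` without custom encoders.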
Keyword in Context
To make the concordance lines easier to read, pass concord_list into concordancer.kwic_print.KWIC() to print them in keyword-in-context format in the console:
>>> from concordancer.kwic_print import KWIC
>>> KWIC(concord_list[:5])
left                        keyword          right             LABEL: verb    LABEL: noun
--------------------------  ---------------  ----------------  -------------  -------------
買/VC 了/Di                 覺得/VK 材質/Na  很/Dfa 對/VH      覺得/VK        材質/Na
“/PARENTHESISCATEGORY 不/D  戴/VC 錶/Na      世代/Na ”/VC      戴/VC          錶/Na
聯名鞋/Na 趁著/P            過年/VA 期間/Na  穿出去/VB 四處/D  過年/VA        期間/Na
走/VA /WHITESPACE           燒/VC 錢/Na      啊/T ~/FW         燒/VC          錢/Na
正/VH 韓/Nc                 賣/VD 家/Nc      裡面/Ncd 很/Dfa   賣/VD          家/Nc
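If you need the same word/pos rendering outside the console (for example, to write it to a file), a result can be flattened by hand. The following is only a sketch, not the implementation of KWIC(); `format_tokens` and `kwic_row` are hypothetical helper names, and the sample result is copied from the search output shown earlier.

```python
def format_tokens(tokens):
    """Render a list of token dicts as space-separated 'word/pos' strings."""
    return " ".join(f"{t['word']}/{t['pos']}" for t in tokens)


def kwic_row(result):
    """Return the (left, keyword, right) columns of one concordance result."""
    return (
        format_tokens(result["left"]),
        format_tokens(result["keyword"]),
        format_tokens(result["right"]),
    )


# First result from the concordance search output shown earlier.
result = {
    "left": [{"word": "買", "pos": "VC"}, {"word": "了", "pos": "Di"}],
    "keyword": [{"word": "覺得", "pos": "VK"}, {"word": "材質", "pos": "Na"}],
    "right": [{"word": "很", "pos": "Dfa"}, {"word": "對", "pos": "VH"}],
}

row = kwic_row(result)
print(" | ".join(row))
```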
Interactive Search Interface
Alternatively, you can start an interactive server to query and read results through your browser:
>>> from concordancer import server
>>> server.run(C)
Initializing server...
Start serving at http://localhost:1420
This will open a query interface where you can interact with the corpus.
Currently, because some CQL metacharacters conflict with URI special characters, some queries may break in the browser interface. To avoid this, do NOT use characters such as { and } (other metacharacters have not been tested yet).
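One way to keep { and } out of a query is to spell out a bounded quantifier with the ? quantifier instead. The helper below is a hypothetical sketch, not part of concordancer: it assumes that ? behaves as the usual "optional" quantifier, under which token{n,m} matches the same token sequences as n required copies followed by m - n optional ones. Whether the rewritten query also behaves correctly in the browser interface has not been verified here.

```python
def expand_range(token, n, m):
    """Rewrite token{n,m} as n required copies plus (m - n) optional copies.

    `token` must be a single token expression such as [pos="N.*"] or a
    parenthesized group such as ("a" "b"), so that appending ? applies to
    the whole unit.
    """
    if not 0 <= n <= m:
        raise ValueError("expected 0 <= n <= m")
    return " ".join([token] * n + [token + "?"] * (m - n))


# Equivalent of: verb:[pos="V.*"] [pos="N[abch]"]{1,2}
query = 'verb:[pos="V.*"] ' + expand_range('[pos="N[abch]"]', 1, 2)
print(query)
# verb:[pos="V.*"] [pos="N[abch]"] [pos="N[abch]"]?
```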
Supported CQL features
CQL search is supported through cqls, which implements a (quite useful) subset of CQL:
- token: [], "我", [word="我"], [word!="我" & pos="N.*"]
- token-level quantifier: +, *, ?, {n,m}
- grouping: ("a" "b"? "c"){1,2}
- label: lab1:[word="我" & pos="N.*"] lab2:("a" "b")
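To illustrate what a token expression such as [word!="我" & pos="N.*"] selects, here is a small stand-alone matcher. It is a sketch of the usual CQL reading, not the code used by cqls: it assumes that each value is a regular expression that must match the whole attribute value, and that & requires all conditions to hold. The function name `token_matches` and the tuple encoding of the constraints are hypothetical, and the tokens are made-up examples.

```python
import re


def token_matches(token, constraints):
    """Return True if `token` satisfies every (attr, op, pattern) constraint.

    op is "=" (the pattern must match) or "!=" (the pattern must not match).
    """
    for attr, op, pattern in constraints:
        hit = re.fullmatch(pattern, token.get(attr, "")) is not None
        if (op == "=") != hit:
            return False
    return True


# Hand-written encoding of [word!="我" & pos="N.*"]
constraints = [("word", "!=", "我"), ("pos", "=", "N.*")]

tokens = [
    {"word": "我", "pos": "Nh"},    # rejected: word is 我
    {"word": "材質", "pos": "Na"},  # accepted: not 我, and pos starts with N
    {"word": "買", "pos": "VC"},    # rejected: pos does not match N.*
]

matched = [t["word"] for t in tokens if token_matches(t, constraints)]
print(matched)
```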