Extract concordance lines from corpus with CQL
Concordancer
This module loads and indexes a corpus in RAM and provides concordance search to retrieve data from the corpus using (a subset of) Corpus Query Language (CQL).
Installation
pip install -U concordancer
Usage
Loading a corpus from file
import json
from concordancer.demo import download_demo_corpus
from concordancer.concordancer import Concordancer
from concordancer import server
# Load demo corpus
fp = download_demo_corpus(to="~/Desktop")
with open(fp, encoding="utf-8") as f:
    corpus = [json.loads(line) for line in f]
# Index and initiate the corpus as a concordancer object
C = Concordancer(corpus)
C.set_cql_parameters(default_attr="word", max_quant=3)
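The loading step above reads a JSON Lines file, with one JSON document per line. A minimal sketch of that step as a reusable function, using only the standard library, might look like the following. `load_jsonl` is a hypothetical helper name, not part of concordancer, and skipping blank lines is an added assumption:

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list, one parsed object per non-blank line."""
    # expanduser() resolves a leading "~", as in "~/Desktop/..."
    p = Path(path).expanduser()
    with p.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

With such a helper, `corpus = load_jsonl(fp)` could stand in for the `with` block, where `fp` is the path returned by `download_demo_corpus`.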
Interactive Search Interface
You can start an interactive server to query and read results through your browser:
server.run(C)
CQL Concordance search
cql = '''
verb:[pos="V.*"] noun:[pos="N[abch]"]
'''
concord_list = C.cql_search(cql, left=2, right=2)
The concordance search returns a generator, which can be converted to a list of dictionaries (and then to JSON or other data structures for further use):
>>> concord_list = list(concord_list)
>>> concord_list[:2]
[
  {
    'left': [{'word': '買', 'pos': 'VC'}, {'word': '了', 'pos': 'Di'}],
    'keyword': [{'word': '覺得', 'pos': 'VK'}, {'word': '材質', 'pos': 'Na'}],
    'right': [{'word': '很', 'pos': 'Dfa'}, {'word': '對', 'pos': 'VH'}],
    'position': {'doc_idx': 78, 'sent_idx': 13, 'tk_idx': 9},
    'captureGroups': {'verb': [{'word': '覺得', 'pos': 'VK'}],
                      'noun': [{'word': '材質', 'pos': 'Na'}]}
  },
  {
    'left': [{'word': '“', 'pos': 'PARENTHESISCATEGORY'},
             {'word': '不', 'pos': 'D'}],
    'keyword': [{'word': '戴', 'pos': 'VC'}, {'word': '錶', 'pos': 'Na'}],
    'right': [{'word': '世代', 'pos': 'Na'}, {'word': '”', 'pos': 'VC'}],
    'position': {'doc_idx': 52, 'sent_idx': 7, 'tk_idx': 36},
    'captureGroups': {'verb': [{'word': '戴', 'pos': 'VC'}],
                      'noun': [{'word': '錶', 'pos': 'Na'}]}
  }
]
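Because each result is a plain dictionary, the standard library is enough for further processing. The sketch below writes results as JSON Lines and tallies the tokens captured under two labels. `to_jsonl` and `count_label_pairs` are hypothetical helper names, not part of concordancer, and the sample entry is copied from the output above, trimmed to the fields the helpers use:

```python
import json
from collections import Counter


def to_jsonl(concord_list):
    """Serialize each concordance line as one JSON object per line."""
    # ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes
    return "\n".join(json.dumps(c, ensure_ascii=False) for c in concord_list)


def count_label_pairs(concord_list, first="verb", second="noun", attr="word"):
    """Count co-occurring tokens captured under two labels."""
    pairs = Counter()
    for c in concord_list:
        groups = c["captureGroups"]
        a = " ".join(tk[attr] for tk in groups[first])
        b = " ".join(tk[attr] for tk in groups[second])
        pairs[(a, b)] += 1
    return pairs


sample = [
    {
        "keyword": [{"word": "覺得", "pos": "VK"}, {"word": "材質", "pos": "Na"}],
        "captureGroups": {
            "verb": [{"word": "覺得", "pos": "VK"}],
            "noun": [{"word": "材質", "pos": "Na"}],
        },
    }
]

print(to_jsonl(sample))
print(count_label_pairs(sample))
```

Since the search result is a generator, convert it with `list()` first if it needs to be traversed by both helpers.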
Keyword in Context
To make the concordance lines easier to read, pass concord_list to concordancer.kwic_print.KWIC() to print them in keyword-in-context format in the console:
>>> from concordancer.kwic_print import KWIC
>>> KWIC(concord_list[:5])
left                        keyword          right             LABEL: verb    LABEL: noun
--------------------------  ---------------  ----------------  -------------  -------------
買/VC 了/Di                 覺得/VK 材質/Na  很/Dfa 對/VH      覺得/VK        材質/Na
“/PARENTHESISCATEGORY 不/D  戴/VC 錶/Na      世代/Na ”/VC      戴/VC          錶/Na
聯名鞋/Na 趁著/P            過年/VA 期間/Na  穿出去/VB 四處/D  過年/VA        期間/Na
走/VA /WHITESPACE           燒/VC 錢/Na      啊/T ~/FW         燒/VC          錢/Na
正/VH 韓/Nc                 賣/VD 家/Nc      裡面/Ncd 很/Dfa   賣/VD          家/Nc
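Each cell in the table above joins tokens as word/pos. A rough sketch of that formatting follows. It is not the actual KWIC implementation, and `format_tokens` and `kwic_row` are hypothetical names:

```python
def format_tokens(tokens, sep="/"):
    """Render a list of token dicts as space-separated word/pos strings."""
    return " ".join(f"{tk['word']}{sep}{tk['pos']}" for tk in tokens)


def kwic_row(concord):
    """Return the (left, keyword, right) cells of one concordance line."""
    return tuple(format_tokens(concord[k]) for k in ("left", "keyword", "right"))


line = {
    "left": [{"word": "買", "pos": "VC"}, {"word": "了", "pos": "Di"}],
    "keyword": [{"word": "覺得", "pos": "VK"}, {"word": "材質", "pos": "Na"}],
    "right": [{"word": "很", "pos": "Dfa"}, {"word": "對", "pos": "VH"}],
}
print(" | ".join(kwic_row(line)))
```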
Calling server.run(C), as described under Interactive Search Interface, opens a query interface in the browser where you can interact with the corpus.
Supported CQL features
CQL search is supported through cqls, which implements a (quite useful) subset of CQL:
- token: [], "我", [word="我"], [word!="我" & pos="N.*"]
- token-level quantifier: +, *, ?, {n,m}
- grouping: ("a" "b"? "c"){1,2}
- label: lab1:[word="我" & pos="N.*"] lab2:("a" "b")
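To make the token syntax concrete, the sketch below evaluates a token constraint such as [word!="我" & pos="N.*"] against a single token dictionary. It illustrates the semantics only and is not how cqls is implemented. `token_matches` is a hypothetical helper, and it assumes that values are regular expressions matched against the whole attribute value, as is conventional in CQL:

```python
import re


def token_matches(token, constraints):
    """Check a token dict against (attribute, operator, pattern) constraints.

    All constraints must hold, mirroring the `&` in [word!="我" & pos="N.*"].
    An empty constraint list matches any token, like the bare [] token.
    """
    for attr, op, pattern in constraints:
        # Assumption: the pattern must match the entire attribute value
        hit = re.fullmatch(pattern, token.get(attr, "")) is not None
        if op == "=" and not hit:
            return False
        if op == "!=" and hit:
            return False
    return True


# Equivalent of [word!="我" & pos="N.*"]
query = [("word", "!=", "我"), ("pos", "=", "N.*")]
print(token_matches({"word": "材質", "pos": "Na"}, query))
print(token_matches({"word": "我", "pos": "Nh"}, query))
print(token_matches({"word": "買", "pos": "VC"}, query))
```

Quantifiers, grouping, and labels operate on sequences of such token tests and are left out of this sketch.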