Extract concordance lines from corpus with CQL

Concordancer

This module loads a corpus into RAM, indexes it, and provides concordance search over it using a subset of Corpus Query Language (CQL).

Installation

pip install -U concordancer

Usage

Loading a corpus from file

import json
from concordancer.demo import download_demo_corpus
from concordancer.concordancer import Concordancer

# Download the demo corpus and parse it line by line (one JSON value per line)
fp = download_demo_corpus(to="~/Desktop")
with open(fp, encoding="utf-8") as f:
    corpus = [json.loads(line) for line in f]

# Index the corpus and initialize a Concordancer object
C = Concordancer(corpus)
C.set_cql_parameters(default_attr="word", max_quant=3)

CQL Concordance search

cql = '''
verb:[pos="V.*"] noun:[pos="N[abch]"]
'''
concord_list = C.cql_search(cql, left=2, right=2)

The concordance search returns a generator, which can be converted to a list of dictionaries (and from there to JSON or other data structures for further use):

>>> concord_list = list(concord_list)
>>> concord_list[:2]
[
    {
        'left': [{'word': '買', 'pos': 'VC'}, {'word': '了', 'pos': 'Di'}],
        'keyword': [{'word': '覺得', 'pos': 'VK'}, {'word': '材質', 'pos': 'Na'}],
        'right': [{'word': '很', 'pos': 'Dfa'}, {'word': '對', 'pos': 'VH'}],
        'position': {'doc_idx': 78, 'sent_idx': 13, 'tk_idx': 9},
        'captureGroups': {'verb': [{'word': '覺得', 'pos': 'VK'}],
                          'noun': [{'word': '材質', 'pos': 'Na'}]}
    },
    {
        'left': [{'word': '“', 'pos': 'PARENTHESISCATEGORY'},
                 {'word': '不', 'pos': 'D'}],
        'keyword': [{'word': '戴', 'pos': 'VC'}, {'word': '錶', 'pos': 'Na'}],
        'right': [{'word': '世代', 'pos': 'Na'}, {'word': '”', 'pos': 'VC'}],
        'position': {'doc_idx': 52, 'sent_idx': 7, 'tk_idx': 36},
        'captureGroups': {'verb': [{'word': '戴', 'pos': 'VC'}],
                          'noun': [{'word': '錶', 'pos': 'Na'}]}
    }
]
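
Because each result is a plain dictionary, the list can be serialized with the standard library alone. The following is a minimal sketch that uses the first result shown above as sample data; the helper name `to_json_lines` is hypothetical and not part of concordancer.

```python
import json

# Sample data: the first concordance line from the output above
sample = [
    {
        'left': [{'word': '買', 'pos': 'VC'}, {'word': '了', 'pos': 'Di'}],
        'keyword': [{'word': '覺得', 'pos': 'VK'}, {'word': '材質', 'pos': 'Na'}],
        'right': [{'word': '很', 'pos': 'Dfa'}, {'word': '對', 'pos': 'VH'}],
        'position': {'doc_idx': 78, 'sent_idx': 13, 'tk_idx': 9},
        'captureGroups': {'verb': [{'word': '覺得', 'pos': 'VK'}],
                          'noun': [{'word': '材質', 'pos': 'Na'}]}
    }
]


def to_json_lines(results):
    """Serialize each concordance line as one JSON object per line.

    Hypothetical helper, not part of concordancer.
    """
    # ensure_ascii=False keeps CJK characters readable instead of \uXXXX escapes
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in results)


text = to_json_lines(sample)

# Round trip: parsing the serialized lines gives back the same dictionaries
restored = [json.loads(line) for line in text.split("\n")]
```

Writing one object per line keeps the output easy to stream or append to, which may be convenient when the generator yields many results.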

Keyword in Context

To make the concordance lines easier to read, pass concord_list to concordancer.kwic_print.KWIC(), which prints them in keyword-in-context format in the console:

>>> from concordancer.kwic_print import KWIC
>>> KWIC(concord_list[:5])
left                        keyword          right             LABEL: verb    LABEL: noun
--------------------------  ---------------  ----------------  -------------  -------------
買/VC 了/Di                 覺得/VK 材質/Na  很/Dfa 對/VH      覺得/VK        材質/Na
“/PARENTHESISCATEGORY 不/D  戴/VC 錶/Na      世代/Na ”/VC      戴/VC          錶/Na
聯名鞋/Na 趁著/P            過年/VA 期間/Na  穿出去/VB 四處/D  過年/VA        期間/Na
/VA  /WHITESPACE          /VC /Na      /T /FW        /VC          /Na
/VH /Nc                 /VD /Nc      裡面/Ncd /Dfa   /VD          /Nc
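
If the lines are needed as strings rather than console output, the same word/pos layout can be built directly from the result dictionaries. This is only a sketch of the idea, not how KWIC() is implemented; `format_tokens` and `format_kwic_line` are hypothetical names, and the sample is the first result from the search above.

```python
# Sample data: the first concordance line from the search result above
result = {
    'left': [{'word': '買', 'pos': 'VC'}, {'word': '了', 'pos': 'Di'}],
    'keyword': [{'word': '覺得', 'pos': 'VK'}, {'word': '材質', 'pos': 'Na'}],
    'right': [{'word': '很', 'pos': 'Dfa'}, {'word': '對', 'pos': 'VH'}],
}


def format_tokens(tokens):
    """Join tokens as word/pos pairs separated by spaces (hypothetical helper)."""
    return " ".join(f"{t['word']}/{t['pos']}" for t in tokens)


def format_kwic_line(res, sep="\t"):
    """Render left context, keyword, and right context as one delimited line."""
    return sep.join(format_tokens(res[key]) for key in ("left", "keyword", "right"))


line = format_kwic_line(result)
print(line)
```

A tab separator is used here so the three fields can be split again later; aligned columns for CJK text would additionally need display-width handling.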

Interactive Search Interface

Alternatively, you can start an interactive server to query and read results through your browser:

>>> from concordancer import server
>>> server.run(C)
Initializing server...
Start serving at http://localhost:1420

This will open a query interface where you can interact with the corpus.

Supported CQL features

CQL search is supported through cqls, which implements a (quite useful) subset of CQL:

  • token: [], "我", [word="我"], [word!="我" & pos="N.*"]
  • token-level quantifier: +, *, ?, {n,m}
  • grouping: ("a" "b"? "c"){1,2}
  • label: lab1:[word="我" & pos="N.*"] lab2:("a" "b")

Download files

Source Distribution

concordancer-0.1.5.tar.gz (11.5 kB)

Built Distribution

concordancer-0.1.5-py3-none-any.whl (11.6 kB)
