Extract concordance lines from corpus with CQL

Concordancer

This library loads and indexes a corpus in RAM and provides concordance search to retrieve data from the corpus with (a subset of) Corpus Query Language (CQL).

Installation

pip install -U concordancer

Usage

Concordancer is designed with this workflow in mind:

The user is expected to preprocess the text data to match the corpus data structure required by concordancer. Once this is done, concordancer handles the subsequent tasks: indexing the corpus, providing query functions to search the corpus, and displaying results in an aligned keyword-in-context format. The user can then further process the search results (exported as JSON by concordancer) for other uses.

Input corpus data structure

concordancer requires the corpus to be structured (minimally) as:

[  # a corpus
    {       # a text
        'text': [
            [<tk>, <tk>, <tk>, ...],   # a sentence in a text
            [<tk>, <tk>, <tk>, ...],   # another sentence in a text
            ...
            [<tk>, <tk>, <tk>, ...]    # the last sentence in a text
        ]
    },
    {...},  # another text                      
    ...
]

where <tk> is a dictionary representing a token, which may look something like:

{
    'word': 'hits',
    'lemma': 'hit',
    'pos': 'V'
}
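
As a rough illustration of the preprocessing step, the sketch below converts sentences already tagged in a `word/POS` format into the nested structure described above. The helper names (`parse_tagged_sentence`, `build_text`) and the `word/POS` input format are assumptions made for this example; they are not part of concordancer.

```python
def parse_tagged_sentence(tagged, sep="/"):
    """Turn 'word/POS word/POS ...' into a list of token dictionaries."""
    tokens = []
    for item in tagged.split():
        if sep in item:
            # Split on the last separator so words containing '/' survive
            word, _, pos = item.rpartition(sep)
        else:
            # No tag available: keep the word and leave the tag empty
            word, pos = item, ""
        tokens.append({"word": word, "pos": pos})
    return tokens


def build_text(tagged_sentences):
    """Wrap parsed sentences in the {'text': [...]} shape of a single text."""
    return {"text": [parse_tagged_sentence(s) for s in tagged_sentences]}


# A corpus is a list of texts; each text holds a list of sentences
corpus = [
    build_text(["She/PRON hits/V the/DET ball/N", "It/PRON flies/V"]),
    build_text(["Nobody/PRON saw/V it/PRON"]),
]
print(corpus[0]["text"][1])
```

Any tokenizer or tagger can be substituted here, as long as the result is a list of texts whose `'text'` value is a list of sentences made of token dictionaries.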

This structure allows the corpus to be saved conveniently as a newline-delimited JSON file (.jsonl), where each line of the file corresponds to a single text in the corpus, represented as a JSON object (i.e., a dictionary in Python). You can see an example of the corpus file saved in .jsonl here. The code below uses a corpus saved in .jsonl format for demonstration.
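
A minimal sketch of the .jsonl round trip, using only the standard library. The function names `write_jsonl` and `read_jsonl` and the two-text corpus are made up for this example; the point is only that each text occupies exactly one line of the file.

```python
import json
import os
import tempfile

# A tiny hypothetical corpus with two texts, one sentence each
corpus = [
    {"text": [[{"word": "hits", "lemma": "hit", "pos": "V"}]]},
    {"text": [[{"word": "balls", "lemma": "ball", "pos": "N"}]]},
]


def write_jsonl(texts, path):
    """Write one JSON object (one text) per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # ensure_ascii=False keeps non-ASCII words readable in the file
            f.write(json.dumps(text, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back into a list of texts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "corpus.jsonl")
    write_jsonl(corpus, path)
    reloaded = read_jsonl(path)

print(reloaded == corpus)
```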

Loading a corpus from file

The code below uses an example corpus, which is saved as a newline-delimited JSON file (described in the previous section).

import json
from concordancer.demo import download_demo_corpus
from concordancer.concordancer import Concordancer
from concordancer import server

# Load demo corpus
fp = download_demo_corpus(to="~/Desktop")
with open(fp, encoding="utf-8") as f:
    corpus = [json.loads(line) for line in f]

# Index and initiate the corpus as a concordancer object
C = Concordancer(corpus)
C.set_cql_parameters(default_attr="word", max_quant=3)

Interactive Search Interface

You can start an interactive server to query and read results through your browser:

>>> server.run(C)
Initializing server...
Start serving at http://localhost:1420

CQL Concordance search

Alternatively, you can work with the Concordancer object, which allows you to send CQL queries to the corpus:

cql = '''
verb:[pos="V.*"] noun:[pos="N[abch]"]
'''
concord_list = C.cql_search(cql, left=2, right=2)

The results of a query are returned as a generator, which can be converted to a list of dictionaries (and then to JSON or other data structures for further use):

>>> concord_list = list(concord_list)
>>> concord_list[:2]
[
    {
        'left': [{'word': '買', 'pos': 'VC'}, {'word': '了', 'pos': 'Di'}],
        'keyword': [{'word': '覺得', 'pos': 'VK'}, {'word': '材質', 'pos': 'Na'}],
        'right': [{'word': '很', 'pos': 'Dfa'}, {'word': '對', 'pos': 'VH'}],
        'position': {'doc_idx': 78, 'sent_idx': 13, 'tk_idx': 9},
        'captureGroups': {'verb': [{'word': '覺得', 'pos': 'VK'}],
                          'noun': [{'word': '材質', 'pos': 'Na'}]}
    },
    {
        'left': [{'word': '“', 'pos': 'PARENTHESISCATEGORY'},
                 {'word': '不', 'pos': 'D'}],
        'keyword': [{'word': '戴', 'pos': 'VC'}, {'word': '錶', 'pos': 'Na'}],
        'right': [{'word': '世代', 'pos': 'Na'}, {'word': '”', 'pos': 'VC'}],
        'position': {'doc_idx': 52, 'sent_idx': 7, 'tk_idx': 36},
        'captureGroups': {'verb': [{'word': '戴', 'pos': 'VC'}],
                          'noun': [{'word': '錶', 'pos': 'Na'}]}
    }
]
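
Because each result is a plain dictionary, further processing needs nothing beyond the standard library. The sketch below hard-codes the two results shown above (so it runs without a corpus), serializes them to JSON, and counts verb-noun pairs from the `captureGroups` field. The helper `group_to_text` is a hypothetical name introduced for this example.

```python
import json
from collections import Counter

# The two results shown above, copied as literal data
concord_list = [
    {
        "left": [{"word": "買", "pos": "VC"}, {"word": "了", "pos": "Di"}],
        "keyword": [{"word": "覺得", "pos": "VK"}, {"word": "材質", "pos": "Na"}],
        "right": [{"word": "很", "pos": "Dfa"}, {"word": "對", "pos": "VH"}],
        "position": {"doc_idx": 78, "sent_idx": 13, "tk_idx": 9},
        "captureGroups": {"verb": [{"word": "覺得", "pos": "VK"}],
                          "noun": [{"word": "材質", "pos": "Na"}]},
    },
    {
        "left": [{"word": "“", "pos": "PARENTHESISCATEGORY"},
                 {"word": "不", "pos": "D"}],
        "keyword": [{"word": "戴", "pos": "VC"}, {"word": "錶", "pos": "Na"}],
        "right": [{"word": "世代", "pos": "Na"}, {"word": "”", "pos": "VC"}],
        "position": {"doc_idx": 52, "sent_idx": 7, "tk_idx": 36},
        "captureGroups": {"verb": [{"word": "戴", "pos": "VC"}],
                          "noun": [{"word": "錶", "pos": "Na"}]},
    },
]


def group_to_text(tokens, attr="word"):
    """Join one attribute of a token list into a single string."""
    return " ".join(tk[attr] for tk in tokens)


# Serialize for storage or for another tool
as_json = json.dumps(concord_list, ensure_ascii=False, indent=2)

# Count (verb, noun) pairs across all results
pair_counts = Counter(
    (group_to_text(r["captureGroups"]["verb"]),
     group_to_text(r["captureGroups"]["noun"]))
    for r in concord_list
)
print(pair_counts.most_common())
```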

Keyword in Context

To make the concordance lines easier to read, pass concord_list to concordancer.kwic_print.KWIC() to print them in a keyword-in-context format in the console:

>>> from concordancer.kwic_print import KWIC
>>> KWIC(concord_list[:5])
left                        keyword          right             LABEL: verb    LABEL: noun
--------------------------  ---------------  ----------------  -------------  -------------
買/VC 了/Di                 覺得/VK 材質/Na  很/Dfa 對/VH      覺得/VK        材質/Na
“/PARENTHESISCATEGORY 不/D  戴/VC 錶/Na      世代/Na ”/VC      戴/VC          錶/Na
聯名鞋/Na 趁著/P            過年/VA 期間/Na  穿出去/VB 四處/D  過年/VA        期間/Na
/VA  /WHITESPACE          /VC /Na      /T /FW        /VC          /Na
/VH /Nc                 /VD /Nc      裡面/Ncd /Dfa   /VD          /Nc
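
If a different layout is needed, the result dictionaries can also be formatted by hand. The following is a minimal sketch of such a formatter and is not how `KWIC()` is implemented: `tokens_to_str` and `format_kwic` are hypothetical helpers, the sample results are invented English data, and the padding counts characters, so double-width CJK characters would need extra handling to line up in a terminal.

```python
def tokens_to_str(tokens, attrs=("word", "pos")):
    """Render tokens as 'word/pos word/pos ...'."""
    return " ".join("/".join(tk[a] for a in attrs) for tk in tokens)


def format_kwic(results):
    """Return one aligned line per result: left context, keyword, right context."""
    rows = [
        (tokens_to_str(r["left"]), tokens_to_str(r["keyword"]), tokens_to_str(r["right"]))
        for r in results
    ]
    if not rows:
        return []
    left_w = max(len(left) for left, _, _ in rows)
    key_w = max(len(key) for _, key, _ in rows)
    # Right-justify the left context so every keyword starts in the same column
    return [f"{left:>{left_w}}  {key:<{key_w}}  {right}" for left, key, right in rows]


# Invented results in the same shape as the search output
sample_results = [
    {"left": [{"word": "she", "pos": "PRON"}],
     "keyword": [{"word": "hits", "pos": "V"}, {"word": "balls", "pos": "N"}],
     "right": [{"word": "hard", "pos": "ADV"}]},
    {"left": [{"word": "the", "pos": "DET"}, {"word": "dog", "pos": "N"}],
     "keyword": [{"word": "eats", "pos": "V"}, {"word": "food", "pos": "N"}],
     "right": [{"word": "now", "pos": "ADV"}]},
]

lines = format_kwic(sample_results)
for line in lines:
    print(line)
```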

Supported CQL features

CQL search is supported through cqls, which implements a (quite useful) subset of CQL:

  • token: [], "我", [word="我"], [word!="我" & pos="N.*"]
  • token-level quantifier: +, *, ?, {n,m}
  • grouping: ("a" "b"? "c"){1,2}
  • label: lab1:[word="我" & pos="N.*"] lab2:("a" "b")
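
To make the token syntax concrete, the sketch below mimics how a single token expression such as `[word!="我" & pos="N.*"]` could be checked against one token dictionary. It illustrates the semantics only and is not the cqls implementation: the function `token_matches` is hypothetical, and it assumes that attribute values are regular expressions that must match the whole attribute value, which may differ from what cqls actually does.

```python
import re


def token_matches(token, must=None, must_not=None):
    """Check a token dict against regex constraints.

    must:     {attr: pattern} pairs that have to match (attr="pattern")
    must_not: {attr: pattern} pairs that must not match (attr!="pattern")
    Assumption: a pattern has to match the entire attribute value.
    """
    must = must or {}
    must_not = must_not or {}
    positive = all(
        re.fullmatch(pattern, token.get(attr, "")) is not None
        for attr, pattern in must.items()
    )
    negative = all(
        re.fullmatch(pattern, token.get(attr, "")) is None
        for attr, pattern in must_not.items()
    )
    return positive and negative


# Equivalent of [word!="我" & pos="N.*"]
constraint = {"must": {"pos": "N.*"}, "must_not": {"word": "我"}}

print(token_matches({"word": "你", "pos": "Nh"}, **constraint))
print(token_matches({"word": "我", "pos": "Nh"}, **constraint))
print(token_matches({"word": "戴", "pos": "VC"}, **constraint))
```

An empty token `[]` corresponds to calling `token_matches` with no constraints, which accepts any token.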
