Extract concordance lines from a corpus with CQL

Concordancer

This module loads a corpus into RAM, indexes it, and provides concordance search over it using a subset of the Corpus Query Language (CQL).

Installation

pip install -U concordancer

Usage

Loading a corpus from file

import json
from concordancer.demo import download_demo_corpus
from concordancer.concordancer import Concordancer

# Load demo corpus
fp = download_demo_corpus(to="~/Desktop")
# The demo corpus is read line by line, one JSON value per line
with open(fp, encoding="utf-8") as f:
    corpus = [json.loads(line) for line in f]

# Index and initiate the corpus as a concordancer object
C = Concordancer(corpus)
C.set_cql_parameters(default_attr="word", max_quant=3)
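The loading step above reads a file with one JSON value per line (JSON Lines). A minimal, self-contained sketch of that pattern is given below. The record layout (`id`, `text`), the file name `corpus.jsonl`, and the helper `load_jsonl` are all hypothetical, made up for illustration; they are not the demo corpus's actual schema or part of concordancer.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records: the real demo corpus may use a different schema.
records = [
    {"id": "doc1", "text": "買 了"},
    {"id": "doc2", "text": "戴 錶"},
]


def load_jsonl(path):
    """Read a JSON Lines file into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    fp = Path(tmp) / "corpus.jsonl"
    # Write one JSON object per line; ensure_ascii=False keeps CJK text readable.
    fp.write_text(
        "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n",
        encoding="utf-8",
    )
    corpus = load_jsonl(fp)
```

Skipping blank lines makes the loader tolerant of a trailing newline at the end of the file, which would otherwise make `json.loads` raise an error.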

CQL Concordance search

cql = '''
verb:[pos="V.*"] noun:[pos="N[abch]"]
'''
concord_list = C.cql_search(cql, left=2, right=2)

The concordance search returns a generator, which can be converted to a list of dictionaries (and then to JSON or another data structure for further use):

>>> concord_list = list(concord_list)
>>> concord_list[:2]
[
    {
        'left': [{'word': '買', 'pos': 'VC'}, {'word': '了', 'pos': 'Di'}],
        'keyword': [{'word': '覺得', 'pos': 'VK'}, {'word': '材質', 'pos': 'Na'}],
        'right': [{'word': '很', 'pos': 'Dfa'}, {'word': '對', 'pos': 'VH'}],
        'position': {'doc_idx': 78, 'sent_idx': 13, 'tk_idx': 9},
        'captureGroups': {'verb': [{'word': '覺得', 'pos': 'VK'}],
                          'noun': [{'word': '材質', 'pos': 'Na'}]}
    },
    {
        'left': [{'word': '“', 'pos': 'PARENTHESISCATEGORY'},
                 {'word': '不', 'pos': 'D'}],
        'keyword': [{'word': '戴', 'pos': 'VC'}, {'word': '錶', 'pos': 'Na'}],
        'right': [{'word': '世代', 'pos': 'Na'}, {'word': '”', 'pos': 'VC'}],
        'position': {'doc_idx': 52, 'sent_idx': 7, 'tk_idx': 36},
        'captureGroups': {'verb': [{'word': '戴', 'pos': 'VC'}],
                          'noun': [{'word': '錶', 'pos': 'Na'}]}
    }
]
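Since each concordance line is a plain dictionary, the standard `json` module is enough to serialize the results or to pull out the labelled groups. The sketch below works on one result copied from the output above, so it runs without the library installed; the variable names `as_json` and `pairs` are only illustrative.

```python
import json

# One concordance line, copied from the example output above.
concord_list = [
    {
        "left": [{"word": "買", "pos": "VC"}, {"word": "了", "pos": "Di"}],
        "keyword": [{"word": "覺得", "pos": "VK"}, {"word": "材質", "pos": "Na"}],
        "right": [{"word": "很", "pos": "Dfa"}, {"word": "對", "pos": "VH"}],
        "position": {"doc_idx": 78, "sent_idx": 13, "tk_idx": 9},
        "captureGroups": {
            "verb": [{"word": "覺得", "pos": "VK"}],
            "noun": [{"word": "材質", "pos": "Na"}],
        },
    }
]

# ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes.
as_json = json.dumps(concord_list, ensure_ascii=False, indent=2)

# Collect the labelled groups of each hit as (verb, noun) word pairs.
pairs = [
    (
        " ".join(tk["word"] for tk in hit["captureGroups"]["verb"]),
        " ".join(tk["word"] for tk in hit["captureGroups"]["noun"]),
    )
    for hit in concord_list
]
```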

Keyword in Context

To make the concordance lines easier to read, pass concord_list to concordancer.kwic_print.KWIC() to print them in keyword-in-context format in the console:

>>> from concordancer.kwic_print import KWIC
>>> KWIC(concord_list[:5])
left                        keyword          right             LABEL: verb    LABEL: noun
--------------------------  ---------------  ----------------  -------------  -------------
買/VC 了/Di                 覺得/VK 材質/Na  很/Dfa 對/VH      覺得/VK        材質/Na
“/PARENTHESISCATEGORY 不/D  戴/VC 錶/Na      世代/Na ”/VC      戴/VC          錶/Na
聯名鞋/Na 趁著/P            過年/VA 期間/Na  穿出去/VB 四處/D  過年/VA        期間/Na
走/VA  /WHITESPACE          燒/VC 錢/Na      啊/T ～/FW        燒/VC          錢/Na
正/VH 韓/Nc                 賣/VD 家/Nc      裡面/Ncd 很/Dfa   賣/VD          家/Nc
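If the results need to go somewhere other than the console (a CSV file, a data frame), the same word/pos layout can be rebuilt by hand. The following is a rough sketch of that idea, not the implementation of `KWIC()`; the helpers `format_tokens` and `kwic_rows` are hypothetical names introduced here.

```python
def format_tokens(tokens):
    """Join tokens as word/pos pairs separated by single spaces."""
    return " ".join(f"{tk['word']}/{tk['pos']}" for tk in tokens)


def kwic_rows(concord_list):
    """Turn concordance dicts into (left, keyword, right) string triples."""
    return [
        (
            format_tokens(hit["left"]),
            format_tokens(hit["keyword"]),
            format_tokens(hit["right"]),
        )
        for hit in concord_list
    ]


# One concordance line in the shape returned by the search above.
hits = [
    {
        "left": [{"word": "買", "pos": "VC"}, {"word": "了", "pos": "Di"}],
        "keyword": [{"word": "覺得", "pos": "VK"}, {"word": "材質", "pos": "Na"}],
        "right": [{"word": "很", "pos": "Dfa"}, {"word": "對", "pos": "VH"}],
    }
]

rows = kwic_rows(hits)
for left, keyword, right in rows:
    print(f"{left}  |  {keyword}  |  {right}")
```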

Interactive Search Interface

Alternatively, you can start an interactive server to query and read results through your browser:

>>> from concordancer import server
>>> server.run(C)
Initializing server...
Start serving at http://localhost:1420

This will open a query interface where you can interact with the corpus.

Supported CQL features

CQL search is provided by cqls, which implements a useful subset of CQL:

token: [], "我", [word="我"], [word!="我" & pos="N.*"]
token-level quantifier: +, *, ?, {n,m}
grouping: ("a" "b"? "c"){1,2}
label: lab1:[word="我" & pos="N.*"] lab2:("a" "b")
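To make the token syntax concrete, the sketch below evaluates a single token expression such as `[word!="我" & pos="N.*"]` against a few tokens. It assumes that attribute values are regular expressions that must match the whole attribute value, as is conventional in CQL; this is an assumption, and the code is an independent illustration built on `re.fullmatch`, not how cqls or concordancer evaluate queries internally. The function `token_matches` and the condition tuples are hypothetical.

```python
import re


def token_matches(token, conditions):
    """Check a token against (attr, pattern, negated) conditions joined by '&'."""
    for attr, pattern, negated in conditions:
        hit = re.fullmatch(pattern, token.get(attr, "")) is not None
        # A plain condition must match; a negated one (!=) must not.
        if hit == negated:
            return False
    return True


# Hand-written equivalent of [word!="我" & pos="N.*"]
conds = [("word", "我", True), ("pos", "N.*", False)]

tokens = [
    {"word": "我", "pos": "Nh"},    # noun, but excluded by word!="我"
    {"word": "材質", "pos": "Na"},  # noun and not "我"
    {"word": "買", "pos": "VC"},    # not a noun
]

matched = [tk["word"] for tk in tokens if token_matches(tk, conds)]
```

Under the same full-match assumption, a pattern such as `N[abch]` would accept the tag `Nc` but not the longer tag `Ncd`.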

Release history

0.1.14 (Oct 11, 2021)
0.1.13 (Jan 13, 2021)
0.1.12 (Jan 12, 2021)
0.1.11 (Jan 8, 2021)
0.1.10 (Jan 7, 2021)
0.1.9 (Jan 7, 2021)
0.1.8 (Jan 7, 2021)
0.1.7 (Jan 7, 2021)
0.1.6 (Jan 7, 2021)
0.1.5 (Jan 7, 2021), this version
0.1.4 (Jan 6, 2021)
0.1.3 (Jan 6, 2021)
0.1.2 (Dec 21, 2020)
0.1.1 (Dec 20, 2020)
0.1.0 (Dec 20, 2020)

Download files

Source Distribution

concordancer-0.1.5.tar.gz (11.5 kB)

Uploaded Jan 7, 2021 Source

Built Distribution

concordancer-0.1.5-py3-none-any.whl (11.6 kB)

Uploaded Jan 7, 2021 Python 3

Hashes for concordancer-0.1.5.tar.gz
Algorithm	Hash digest
SHA256	`b296703c3ccde9da5936058e779c0de2215f3cfe5d93e8ee3130edc1c69513dd`
MD5	`9dd5e4497aa4cc761d149a40ef564a1b`
BLAKE2b-256	`a049df09228e85000055a6d1f75097227e12c54315d0274308a8cb4698ce6d30`

Hashes for concordancer-0.1.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7dcaf1a1befa9cdc2dc2277185d12bb7237ccb008a5c7b0a9f64615f3716b197`
MD5	`d698e532fc18b590f5f345c9c80f71fb`
BLAKE2b-256	`8e01dcd95292d9f28ebfca4e9d72808d4ef19df6e20e1e13ec713ec92e03d7a9`