Sketch Grammar Explorer (Sketch Engine API wrapper)

Introduction

Sketch Grammar Explorer (SGEX) is a Python package for using the Sketch Engine API. The goal is to develop a flexible scaffold for any kind of programmatic work with Sketch Engine and NoSketch Engine.

Setup

Built with Python 3.10 and tested on 3.7.

Installation

Install with pip:

  • pip install sgex

Or install manually:

  • clone this repo
  • install dependencies:
    • current versions: pip install -r requirements.txt
    • required: pip install numpy pandas requests pyyaml
    • optional: pip install keyring openpyxl lxml

API credentials

Run sgex.config.credentials() to automate the creation of a config.yml file in the project directory. Follow the prompts to store an API key in plaintext or with the keyring package. If a server doesn't require credentials, use any non-empty string, e.g., 'null' for both username and api_key. If necessary, a keyring entry can be modified directly as shown below.

import keyring

# to add credentials
keyring.set_password("<server>","<username>", "<api_key>")

# to delete credentials later
keyring.delete_password("<server>", "<username>")

Making API calls

To get started with example calls, run sgex.config.examples() to generate basic input files in the current working directory. Then run sgex.Call() with a path to an input file. Retrieved API data is stored in a folder with the same name as the input file.

import sgex

sgex.config.examples()
job = sgex.Call("calls/freqs.yml")

Options

input: a dictionary or a path to a YAML/JSON file containing API calls

  • if a dict, requires dest="<destination folder>"

dry_run: make a Call object that can be inspected prior to executing requests (False)

  • with job as an instance of Call:
    • job prints a summary
    • job.print_calls() prints 10 call details at a time
    • job.calls accesses all call details

skip: skip calls when identical calls already exist in the destination folder (True)

  • only compares files of the same format
  • note: close data files to ensure read access

clear: remove existing data in the destination folder before running current calls (False)

timestamp: include a timestamp (False)

format: specify output format ("json")

  • "csv", "txt", "json", "xlsx", or "xml" (see compatibilities table)
  • "json" offers more detailed metadata and API error messages

any_format: allow any combination of call types and formats (False)

asyn: retrieve rough calculations, "0" (default) or "1"

server: specify what server to call ("https://api.sketchengine.eu/bonito/run.cgi")

  • be sure to omit trailing forward slashes

wait: enable waiting between calls (True)
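
The server option expects a URL without trailing forward slashes. A minimal sketch of cleaning a URL before using it; normalize_server is a hypothetical helper for illustration and is not part of SGEX.

```python
def normalize_server(url: str) -> str:
    """Return the URL with surrounding whitespace and trailing slashes removed."""
    return url.strip().rstrip("/")


# The default server URL from the options above, with a stray trailing slash.
server = normalize_server("https://api.sketchengine.eu/bonito/run.cgi/")
print(server)
```

The cleaned string can then be supplied as the server option.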

Output formats

SGEX can save data in all formats provided by Sketch Engine, although only JSON is compatible with all call types. Known incompatibilities are blocked unless any_format=True.

Compatible call types and file formats

call type csv txt json xlsx xml
collx +
freqs + + + + +
wordlist + + + +
wsketch + +
view +

Input files

SGEX call structure

One or more calls can be executed by creating an input file readable by SGEX that contains API calls in the form of dictionaries of parameters.

  • input files require a "type" key indicating what kind of call it is ("freqs")
  • the key of each call serves as a call-id ("call0")
  • each call has a dictionary of API parameters in "call"
  • calls can optionally contain metadata in other key:value pairs

The call below queries the lemma "rock" in the EcoLexicon English Corpus and retrieves frequencies by several text types.

YAML

Queries can be copied directly from YAML files into Sketch Engine's browser application without adding/removing escape characters.

type: freqs
call0:
  metadata:
    category1: tag1
  call:
    q:
    - alemma,"rock"
    corpname: preloaded/ecolexicon_en
    freq_sort: freq
    fcrit:
    - doc.domains 0
    - doc.genre 0
    - doc.editor 0

JSON

JSON requires consistent usage of double quotes and escape characters:

  • interior double quotes are escaped: "alemma,\"rock\"", "aword,\"it's\""
  • special characters are double-escaped: "atag,1:\"N.*\" [word=\",|\\(\"]"

{ "type": "freqs",
  "call0": {
    "metadata": {
      "category1": "tag1"
    },
    "call":{
      "q": [
        "alemma,\"rock\""
      ],
      "corpname": "preloaded/ecolexicon_en",
      "freq_sort": "freq",
      "fcrit": [
        "doc.domains 0",
        "doc.genre 0",
        "doc.editor 0"]}}}

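The escaping rules above don't have to be applied by hand. As a small sketch using only Python's standard json module (nothing here is SGEX-specific), a call can be written as a plain dictionary and serialized, which adds the escape characters automatically.

```python
import json

# The same freqs call as above, written as a plain Python dictionary.
calls = {
    "type": "freqs",
    "call0": {
        "metadata": {"category1": "tag1"},
        "call": {
            "q": ['alemma,"rock"'],
            "corpname": "preloaded/ecolexicon_en",
            "freq_sort": "freq",
            "fcrit": ["doc.domains 0", "doc.genre 0", "doc.editor 0"],
        },
    },
}

# json.dumps escapes interior double quotes as JSON requires.
text = json.dumps(calls, indent=2)
print(text)

# Loading the text back recovers the original query strings.
restored = json.loads(text)

# A query with both double quotes and a backslash: the backslash is doubled.
query = r'atag,1:"N.*" [word=",|\("]'
escaped = json.dumps(query)
print(escaped)
```

Writing text to a .json file gives an input file whose path can be passed to sgex.Call().
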
Features

Recycling parameters

Parameters are reused unless defined explicitly in every call. For example, the job below contains three similar calls. Instead of writing out every parameter for each, only the first call is complete. The subsequent calls contain only the parameters that differ (their queries). Other parameters (corpname, etc.) are passed from the first call successively to the rest.

type: freqs

call0:
  call:
    q:
    - alemma,"rock"
    corpname: preloaded/ecolexicon_en
    freq_sort: freq
    fcrit:
    - doc.domains 0

call1:
  call:
    q:
    - alemma,"stone"

call2:
  call:
    q:
    - alemma,"pebble"
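
One way to picture this behavior, as a rough sketch and not SGEX's actual implementation: each call starts from the parameters of the call before it and overrides only what it defines itself. The function name recycle_parameters is hypothetical.

```python
def recycle_parameters(job: dict) -> dict:
    """Fill in each call's missing parameters from the preceding call."""
    filled = {}
    previous = {}
    for call_id, content in job.items():
        if call_id == "type":  # "type" names the call type, it is not a call
            continue
        # Later keys win, so parameters defined in this call override inherited ones.
        previous = {**previous, **content["call"]}
        filled[call_id] = dict(previous)
    return filled


# The job above, written as a Python dictionary.
job = {
    "type": "freqs",
    "call0": {"call": {
        "q": ['alemma,"rock"'],
        "corpname": "preloaded/ecolexicon_en",
        "freq_sort": "freq",
        "fcrit": ["doc.domains 0"],
    }},
    "call1": {"call": {"q": ['alemma,"stone"']}},
    "call2": {"call": {"q": ['alemma,"pebble"']}},
}

filled = recycle_parameters(job)
for call_id, params in filled.items():
    print(call_id, params["q"], params["corpname"])
```

In this sketch every call ends up with a complete set of parameters, and only q differs between them.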

Skipping repeats

If skip=True, calls won't be repeated when identical data of the same file type already exists. Repeats are identified using hashes of call dictionaries. If the contents of "call" change at all, the call is considered unique.

Repeats are not detected across input files. Queries from calls1.yml and calls2.yml are stored in their respective data folders and are treated as independent samples.
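
The idea of identifying repeats by hashing can be sketched as follows. SGEX's actual hashing scheme isn't described here, so the use of SHA-256 over a key-sorted JSON dump and the name call_hash are assumptions made for illustration.

```python
import hashlib
import json


def call_hash(call: dict) -> str:
    """Hash a call dictionary so that identical parameters give identical digests."""
    # sort_keys makes the digest independent of key order.
    canonical = json.dumps(call, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


rock = {"q": ['alemma,"rock"'], "corpname": "preloaded/ecolexicon_en"}
same = {"corpname": "preloaded/ecolexicon_en", "q": ['alemma,"rock"']}
stone = {"q": ['alemma,"stone"'], "corpname": "preloaded/ecolexicon_en"}

print(call_hash(rock) == call_hash(same))   # True: same contents, different key order
print(call_hash(rock) == call_hash(stone))  # False: the query differs
```

Any change to a parameter value produces a different digest, which matches the rule that a changed "call" counts as a unique call.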

Notes

Modifying saved data

SGEX doesn't track changes to downloaded data and will overwrite files if skip=False or clear=True. Be sure to separate/backup data sets to prevent data loss.

Working with different call types

Each call type, freqs (frequencies), view (concordance), wsketch (word sketch), etc., has its own parameters and requirements: parameters are not interchangeable between call types but do share similarities.

Too many requests

Sketch Engine monitors API activity and will block excessive calls or other activity outside of their Fair Use Policy. While learning the API, test calls selectively and slowly, and avoid repeating identical calls. Keep wait=True unless using a local server.
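
As a generic illustration of pacing requests (not how SGEX implements wait, which isn't documented here), a loop can pause between consecutive calls. The names run_with_wait, send, and fake_send, and the delay values, are hypothetical.

```python
import time
from typing import Callable, Iterable, List


def run_with_wait(calls: Iterable[dict], send: Callable[[dict], object],
                  wait_seconds: float = 1.0) -> List[object]:
    """Send calls one at a time, pausing between consecutive requests."""
    results = []
    for i, call in enumerate(calls):
        if i > 0:
            time.sleep(wait_seconds)  # pause before every call except the first
        results.append(send(call))
    return results


# Stand-in for a real request function; it only echoes the query.
def fake_send(call: dict) -> str:
    return call["q"][0]


queries = [{"q": ['alemma,"rock"']}, {"q": ['alemma,"stone"']}]
print(run_with_wait(queries, fake_send, wait_seconds=0.1))
```

An appropriate delay depends on the server's usage policy, so the values here are placeholders only.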

API usage

To learn more about the API, it's helpful to inspect network activity while making queries in Sketch Engine with a web browser (using Developer Tools). Importantly, Sketch Engine has internal API methods that only function in web browsers, so merely copy-pasting certain methods into SGEX won't necessarily work. Sketch Engine's API is also actively developed, so syntax and functionality may change.

Double-checking accuracy

Before relying heavily on the API, it's a good idea to run the same queries both in a web browser and via the API to make sure the results are identical.

Tools

SGEX will offer more features to automate repetitive tasks and procedures for certain methodologies. Feel free to suggest features.

convert_grammar() converts a sketch grammar into SGEX-formatted queries (requires modifications depending on input)

Parse() parses and returns a dict of API calls or saves to a JSON/YAML file

  • dest="<filepath>" saves an object to file (can be used to convert between file formats)

About

SGEX has been developed to meet research needs at the University of Granada (Spain) Translation and Interpreting Department, in part to support the computational linguistics techniques that feed the EcoLexicon terminological knowledge base (see the articles here and here for an introduction).

The name refers to sketch grammars, which are series of generalized corpus queries in Sketch Engine that are useful for studying terminology and other lexical items (see their bibliography).

Questions, suggestions, and support are welcome.

Citation

If you use SGEX, please cite it.
