
A natural language toolkit designed to facilitate the process of generating collocates, ngrams, KWIC lines, and more.


TextElixir

TextElixir is a Python module that facilitates the collection and analysis of data from textual corpora. Although programs such as AntConc and WordCruncher are faster, TextElixir lets you run any number of queries in code without navigating a graphical interface.

Installation

Install textelixir with pip

pip install textelixir

Importing TextElixir

from textelixir import TextElixir

Tagging a Corpus

TextElixir can tag a TXT file, a glob of TXT files, or a TSV file. Running the line below creates a subfolder within the directory you provide and stores the corpus there with a POS tag and a lemma for each word. Because the tagged corpus is saved in that subfolder, tagging only needs to be run once.

elixir = TextElixir('text.txt', lang='en', tagger_option='spacy:efficient:pos')
Parameter | Type | Description
filename | string or glob | Required. Accepts a path to a TXT file, a TSV file, or a glob of multiple TXT files.
lang | string | Optional. Accepts a language code. Defaults to en for English. For available languages, see the tagging models listed by spaCy or Stanza.
tagger_option | string | Required. Accepts one of the following tagging options:
    spacy:efficient:pos: Uses a fast spaCy tagger with tags like VERB, NOUN
    spacy:accurate:xpos: Uses a slow spaCy tagger with tags like VBZ, NN1
    stanza:pos: Uses a Stanza tagger with tags like VERB, NOUN
    stanza:xpos: Uses a Stanza tagger with tags like VBZ, NN1
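
The four tagger_option values follow a colon-separated pattern. As a rough illustration only, the sketch below splits them into their parts. The helper describe_tagger_option and the VALID_OPTIONS set are hypothetical names written for this example and are not part of TextElixir.

```python
# Hypothetical helper, not part of the TextElixir API: splits the documented
# tagger_option strings into tagger, model variant, and tagset.
VALID_OPTIONS = {
    "spacy:efficient:pos",
    "spacy:accurate:xpos",
    "stanza:pos",
    "stanza:xpos",
}


def describe_tagger_option(option):
    """Return the parts of one of the documented tagger_option strings."""
    if option not in VALID_OPTIONS:
        raise ValueError(f"Unsupported tagger_option: {option!r}")
    parts = option.split(":")
    return {
        "tagger": parts[0],
        # Only the spaCy options name a model variant (efficient/accurate).
        "model": parts[1] if len(parts) == 3 else None,
        # 'pos' selects coarse tags (VERB, NOUN); 'xpos' selects finer tags.
        "tagset": parts[-1],
    }


print(describe_tagger_option("spacy:efficient:pos"))
print(describe_tagger_option("stanza:xpos"))
```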

Search

Searching is one of the main operations you can perform on a TextElixir object. The search method returns a SearchResults object, which contains the frequency of the search query and a list of the indices where the query occurs in the corpus. SearchResults also provides methods for calculating collocates, frequency distribution charts, sentences, and concordance lines.

Searching for a Word

results = elixir.search('engage')

Searching for a Lemma

results = elixir.search('ENGAGE')

Searching for a Part of Speech

results = elixir.search('/VERB/')

Searching for a Word with its Part of Speech

results = elixir.search('leaves_NOUN')

Searching for a Lemma with its Part of Speech

results = elixir.search('ADVOCATE_NOUN')

Searching for a Phrase

results = elixir.search('/ADJ/ cat')
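
The examples above suggest a notation: lowercase for a word, uppercase for a lemma, slashes around a part of speech, and an underscore joining a form to its part of speech. The sketch below is one possible reading of that notation, inferred only from these examples. It is not TextElixir's parser, and classify_token and classify_query are hypothetical names.

```python
import re


# Illustrative only: a guess at how the query notation could be interpreted.
def classify_token(token):
    """Classify one query token by the conventions shown in the examples."""
    # '/VERB/' style tokens name a part of speech only.
    if re.fullmatch(r"/[^/]+/", token):
        return {"type": "pos", "pos": token.strip("/")}
    # 'leaves_NOUN' style tokens join a form and a POS tag with an underscore.
    form, sep, pos = token.rpartition("_")
    if not sep:
        form, pos = token, None
    # Uppercase forms are treated as lemmas, everything else as word forms.
    kind = "lemma" if form.isupper() else "word"
    result = {"type": kind, "text": form.lower()}
    if pos:
        result["pos"] = pos
    return result


def classify_query(query):
    """Split a phrase query on whitespace and classify each token."""
    return [classify_token(tok) for tok in query.split()]


for q in ["engage", "ENGAGE", "/VERB/", "leaves_NOUN", "ADVOCATE_NOUN", "/ADJ/ cat"]:
    print(q, "->", classify_query(q))
```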

Searching with Wildcards

# Find inform, informs, information, etc.
results = elixir.search('inform*') 
# Find bat, cat, etc.
results = elixir.search('?at') 
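
To see what these wildcards select, the sketch below applies the same two patterns to a small word list using Python's standard fnmatch module, where * matches any run of characters and ? matches exactly one. This assumes TextElixir's wildcards behave like shell-style globs, which the examples suggest but do not confirm. match_wildcard is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def match_wildcard(pattern, words):
    """Return the words that match a shell-style wildcard pattern."""
    # fnmatchcase is case-sensitive on every platform.
    return [w for w in words if fnmatchcase(w, pattern)]


words = ["inform", "informs", "information", "reform", "bat", "cat", "chat", "at"]
print(match_wildcard("inform*", words))  # ['inform', 'informs', 'information']
print(match_wildcard("?at", words))      # ['bat', 'cat']
```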

Searching with Regular Expressions

# Find 4-digit numbers
results = elixir.search(r'\d{4}', regex=True)
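
As a rough picture of a regex search over a tokenized corpus, the sketch below tests each token against the pattern with Python's re module. It assumes the pattern must match a whole token. Whether TextElixir matches whole tokens or substrings is not stated here, and match_regex is a hypothetical helper.

```python
import re


def match_regex(pattern, tokens):
    """Return the tokens that match the pattern in full."""
    compiled = re.compile(pattern)
    # fullmatch rejects '12345' for r'\d{4}'; search() would accept it.
    return [t for t in tokens if compiled.fullmatch(t)]


tokens = ["In", "1984", "about", "12345", "people", "and", "2001"]
print(match_regex(r"\d{4}", tokens))  # ['1984', '2001']
```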

Search for Words Separated by Distance

# Finds the word 'supporter' 1-5 words away from the lemma 'cat'
results = elixir.search('supporter ~5~ CAT')
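
The idea behind a distance query can be sketched over a plain list of tokens. The version below makes two simplifying assumptions that may not match TextElixir: it compares exact word forms rather than lemmas, and it accepts the second term on either side of the first. find_within_distance is a hypothetical helper.

```python
def find_within_distance(tokens, first, second, max_distance):
    """Return (i, j) index pairs where `second` lies 1..max_distance tokens from `first`."""
    hits = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        # Clamp the search window to the bounds of the token list.
        start = max(0, i - max_distance)
        end = min(len(tokens), i + max_distance + 1)
        for j in range(start, end):
            if j != i and tokens[j] == second:
                hits.append((i, j))
    return hits


tokens = "the supporter of the stray cat fed another cat".split()
# The first 'cat' is 4 tokens from 'supporter'; the second is 7 tokens away.
print(find_within_distance(tokens, "supporter", "cat", 5))  # [(1, 5)]
```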

Filter the Corpus Prior to Search

Filters can be applied to a corpus prior to searching. This allows you to get data on specific sections of your corpus rather than the entire corpus.

Positive Filter

# Searches for the word 'advocate' as long as it's within the category of 'Philosophy'
results = elixir.search('advocate', text_filter={'category': 'Philosophy'})

Negative Filter

# Searches for the word 'advocate' as long as it's NOT within the category of 'Philosophy'
results = elixir.search('advocate', text_filter={'category': '!Philosophy'})
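
The positive and negative filters can be pictured as a check on each text's metadata, with a leading ! inverting the comparison. The sketch below illustrates that idea on plain dictionaries. The metadata layout and the passes_filter helper are assumptions made for this example, not TextElixir internals.

```python
def passes_filter(metadata, text_filter):
    """Return True if a text's metadata satisfies every filter entry."""
    for key, wanted in text_filter.items():
        value = metadata.get(key)
        if wanted.startswith("!"):
            # Negative filter: reject texts whose value equals the excluded one.
            if value == wanted[1:]:
                return False
        elif value != wanted:
            # Positive filter: reject texts whose value differs.
            return False
    return True


texts = [
    {"title": "A", "category": "Philosophy"},
    {"title": "B", "category": "History"},
]
print([t["title"] for t in texts if passes_filter(t, {"category": "Philosophy"})])   # ['A']
print([t["title"] for t in texts if passes_filter(t, {"category": "!Philosophy"})])  # ['B']
```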

Calculate KWIC/Concordance Lines

kwic = results.kwic_lines(before=8, after=8, group_by='lower')
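
A KWIC (keyword in context) line shows each hit together with a fixed number of words on either side. The sketch below builds such lines from a token list to show what the before and after arguments control. It is a simplified stand-in rather than TextElixir's implementation: it matches case-insensitively, does not model group_by, and the function name here is only illustrative.

```python
def kwic_lines(tokens, query, before=8, after=8):
    """Return (left context, hit, right context) for each occurrence of query."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != query.lower():
            continue
        # Slices are clamped so hits near the edges get shorter context.
        left = tokens[max(0, i - before):i]
        right = tokens[i + 1:i + 1 + after]
        lines.append((" ".join(left), token, " ".join(right)))
    return lines


tokens = "We engage with readers so that readers engage with us".split()
for left, hit, right in kwic_lines(tokens, "engage", before=2, after=2):
    print(f"{left:>15} | {hit} | {right}")
```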

Export KWIC Lines to HTML

This generates a webpage with an interface for sorting and filtering KWIC lines. You can also switch the displayed columns to show the POS tag, lemma, or lowercase text.

results.export_as_html('my_kwic_lines.html')

Export KWIC Lines to TXT

This will generate a text file with just the KWIC lines. There is a tab character before and after the search hit, making it easy to paste into Google Sheets.

results.export_as_txt('my_kwic_lines.txt')
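
The tab-delimited layout described above means each line splits cleanly into three columns when pasted into a spreadsheet. The sketch below shows that layout with two hypothetical helpers. It reflects only the description of the file format and does not call TextElixir.

```python
def format_kwic_line(left, hit, right):
    """Join the contexts and the hit with a tab on each side of the hit."""
    return f"{left}\t{hit}\t{right}"


def parse_kwic_line(line):
    """Split a tab-delimited KWIC line back into its three columns."""
    left, hit, right = line.rstrip("\n").split("\t")
    return left, hit, right


line = format_kwic_line("that readers", "engage", "with us")
print(repr(line))
print(parse_kwic_line(line))
```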

Calculate Sentences

Retrieve the full sentence containing each search hit for more context.

sentences = results.sentences()
