A natural language toolkit for generating collocates, ngrams, KWIC lines, and more.
TextElixir
TextElixir is a Python module that facilitates the collection and analysis of data from textual corpora. While programs like AntConc and WordCruncher are faster, TextElixir offers the flexibility to gather data on limitless queries without navigating a graphical interface.
Installation
Install textelixir with pip
pip install textelixir
Importing TextElixir
from textelixir import TextElixir
Tagging a Corpus
TextElixir will tag a TXT file, a glob of TXT files, or a TSV file. Running the line below creates a subfolder within the directory you provide and adds a POS tag and lemma to each word. Because the tagged corpus is saved into that subfolder, this line only needs to be run once.
elixir = TextElixir('text.txt', lang='en', tagger_option='spacy:efficient:pos')
Parameter | Type | Description
---|---|---
filename | string or glob | Required. Accepts a path to a TXT file, a TSV file, or a glob of multiple TXT files.
lang | string | Optional. Accepts a language code; defaults to en for English. See SpaCy or Stanza for available tagging models.
tagger_option | string | Required. Accepts one of four tagging options: spacy:efficient:pos (fast SpaCy tagger with tags like VERB, NOUN); spacy:accurate:xpos (slower SpaCy tagger with tags like VBZ, NN1); stanza:pos (Stanza tagger with tags like VERB, NOUN); stanza:xpos (Stanza tagger with tags like VBZ, NN1).
Search
Search is the main entry point for working with a TextElixir. The search method returns a SearchResults object, which holds the frequency of the search query and a list of indices marking where the query occurs in the corpus. SearchResults also provides methods for calculating collocates, frequency distribution charts, sentences, and concordance lines.
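Conceptually, a search result pairs the query's total frequency with the corpus positions of each hit. The attribute names in this sketch are hypothetical stand-ins, not TextElixir's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class SearchResultsSketch:
    # Hypothetical stand-in for TextElixir's SearchResults:
    # the query plus the token indices where it matched.
    query: str
    indices: list = field(default_factory=list)

    @property
    def frequency(self):
        # Frequency is just the number of hit positions.
        return len(self.indices)

results = SearchResultsSketch('engage', indices=[14, 208, 977])
print(results.frequency)  # 3
```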
Searching for a Word
results = elixir.search('engage')
Searching for a Lemma
results = elixir.search('ENGAGE')
Searching for a Part of Speech
results = elixir.search('/VERB/')
Searching for a Word with its Part of Speech
results = elixir.search('leaves_NOUN')
Searching for a Lemma with its Part of Speech
results = elixir.search('ADVOCATE_NOUN')
Searching for a Phrase
results = elixir.search('/ADJ/ cat')
Searching with Wildcards
# Find inform, informs, information, etc.
results = elixir.search('inform*')
# Find bat, cat, etc.
results = elixir.search('?at')
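Shell-style wildcards like these map directly onto regular expressions: * matches any run of characters and ? matches exactly one. Python's standard fnmatch module shows the equivalence (a sketch of the idea, not TextElixir's actual implementation):

```python
import fnmatch
import re

def wildcard_match(pattern, word):
    # fnmatch.translate turns '*' into '.*' and '?' into '.',
    # anchored to the whole string.
    return re.match(fnmatch.translate(pattern), word) is not None

print(wildcard_match('inform*', 'information'))  # True
print(wildcard_match('?at', 'cat'))              # True
print(wildcard_match('?at', 'chat'))             # False
```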
Searching with Regular Expressions
# Find 4-digit numbers
results = elixir.search(r'\d{4}', regex=True)
Search for Words Separated by Distance
# Finds the word 'supporter' 1-5 words away from the lemma 'cat'
results = elixir.search('supporter ~5~ CAT')
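The ~N~ syntax finds two terms within a token window. A minimal sketch of that idea over a pre-tagged token list of (word, lemma) pairs (illustrative only, not TextElixir's internals):

```python
def within_distance(tokens, word, lemma, max_gap):
    # tokens: list of (word, lemma) pairs.
    # Report index pairs where `word` is followed by a token
    # with lemma `lemma` within 1..max_gap positions.
    hits = []
    for i, (w, _) in enumerate(tokens):
        if w != word:
            continue
        for j in range(i + 1, min(i + 1 + max_gap, len(tokens))):
            if tokens[j][1] == lemma:
                hits.append((i, j))
    return hits

tokens = [('a', 'a'), ('supporter', 'supporter'), ('of', 'of'),
          ('the', 'the'), ('cats', 'cat')]
print(within_distance(tokens, 'supporter', 'cat', 5))  # [(1, 4)]
```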
Filter the Corpus Prior to Search
Filters can be applied to a corpus prior to searching. This allows you to get data on specific sections of your corpus rather than the entire corpus.
Positive Filter
# Searches for the word 'advocate' as long as it's within the category of 'Philosophy'
results = elixir.search('advocate', text_filter={'category': 'Philosophy'})
Negative Filter
# Searches for the word 'advocate' as long as it's NOT within the category of 'Philosophy'
results = elixir.search('advocate', text_filter={'category': '!Philosophy'})
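A text_filter value with a leading ! inverts the match. The include/exclude logic can be sketched like this, assuming a hypothetical per-text metadata dict (not TextElixir's internals):

```python
def passes_filter(metadata, text_filter):
    # metadata: attributes of one text, e.g. {'category': 'Philosophy'}.
    # A filter value beginning with '!' means "must NOT equal the rest".
    for key, wanted in text_filter.items():
        if wanted.startswith('!'):
            if metadata.get(key) == wanted[1:]:
                return False
        elif metadata.get(key) != wanted:
            return False
    return True

print(passes_filter({'category': 'Philosophy'}, {'category': 'Philosophy'}))   # True
print(passes_filter({'category': 'Philosophy'}, {'category': '!Philosophy'}))  # False
print(passes_filter({'category': 'History'}, {'category': '!Philosophy'}))     # True
```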
Calculate KWIC/Concordance Lines
kwic = results.kwic_lines(before=8, after=8, group_by='lower')
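A KWIC (keyword-in-context) line is a fixed window of tokens around each hit, with the hit set off from its context. A bare-bones sketch of the idea (illustrative only):

```python
def kwic_line(tokens, hit_index, before=8, after=8):
    # Build one concordance line: left context, tab, hit, tab, right context.
    left = tokens[max(0, hit_index - before):hit_index]
    right = tokens[hit_index + 1:hit_index + 1 + after]
    return ' '.join(left) + '\t' + tokens[hit_index] + '\t' + ' '.join(right)

tokens = 'the quick brown fox jumps over the lazy dog'.split()
print(kwic_line(tokens, 4, before=2, after=2))
# brown fox	jumps	over the
```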
Export KWIC Lines to HTML
This will generate a webpage with an interface to sort and filter KWIC lines. You can also switch the display of columns to show pos, lemma, and lower text.
results.export_as_html('my_kwic_lines.html')
Export KWIC Lines to TXT
This will generate a text file with just the KWIC lines. There is a tab character before and after the search hit, making it easy to paste into Google Sheets.
results.export_as_txt('my_kwic_lines.txt')
Calculate Sentences
Get the full sentence containing each search hit for more context.
sentences = results.sentences()
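Recovering the containing sentence amounts to scanning outward from a hit to the nearest sentence boundaries. A rough sketch over a plain token list with sentence-final punctuation as separate tokens (illustrative only):

```python
def sentence_for_hit(tokens, hit_index, enders=('.', '!', '?')):
    # Walk left and right from the hit to the nearest sentence-final tokens.
    start = hit_index
    while start > 0 and tokens[start - 1] not in enders:
        start -= 1
    end = hit_index
    while end < len(tokens) - 1 and tokens[end] not in enders:
        end += 1
    return ' '.join(tokens[start:end + 1])

tokens = 'Cats sleep . Dogs bark loudly . Birds sing'.split()
print(sentence_for_hit(tokens, 4))  # Dogs bark loudly .
```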