
Sophisticated corpus linguistics

buzz: python corpus linguistics

Version 2.0.3

buzz is a linguistics tool for parsing and then exploring plain or metadata-rich text. This README provides an overview of functionality. Visit the full documentation for a more complete user guide.

Install


pip install buzz
# or
git clone http://github.com/interrogator/buzz
cd buzz
python setup.py install

Creating a corpus

buzz models plain text or CONLL-U formatted files. This guide assumes that you have plain text data and want to process and analyse it.

So, first, you need to make sure that your corpus is in a format and structure that buzz can work with. This simply means putting all your text files into a folder, optionally organised into subfolders (representing subcorpora).

Text files should be plain text, with a .txt extension. Importantly though, they can be augmented with metadata, which can be stored in two ways. First, speaker names can be added by using capital letters and a colon, much like in a script. Second, you can use XML style metadata markup. Here is an example file, sopranos/s1/e01.txt:

<metadata aired="10.01.1999">
MELFI: My understanding from Dr. Cusamano, your family physician, is you collapsed? Possibly a panic attack? <metadata exposition=true interrogative-type="intonation" move="info-request">
TONY: They said it was a panic attack <metadata emph-token=0 move="refute">
MELFI: You don't agree that you had a panic attack? <metadata move="info-request" question-type="in">
...

If you add a metadata element at the start of the text file, it will be understood as file-level metadata. For sentence-specific metadata, the element should follow the sentence, ideally at the end of a line. All metadata will be searchable later, so the more you can add, the more you can do with your corpus.
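
If you want to try the format out quickly, the folder and file above can be generated from Python. This is just an illustrative sketch; the paths and metadata simply mirror the example file shown above:

import os

# one corpus folder, with a subcorpus for season one
os.makedirs('sopranos/s1', exist_ok=True)

# a small script-style file with file-level and sentence-level metadata
text = '''<metadata aired="10.01.1999">
MELFI: My understanding from Dr. Cusamano, your family physician, is you collapsed? <metadata move="info-request">
TONY: They said it was a panic attack <metadata move="refute">
'''
with open('sopranos/s1/e01.txt', 'w') as f:
    f.write(text)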

Parsing

buzz uses spaCy to parse your text, saving the results as CONLL-U files to your hard drive. Parsing a corpus is very simple:

from buzz import Corpus

corpus = Corpus('sopranos')
parsed = corpus.parse()
# if you don't need constituency parses, you can speed things up with:
parsed = corpus.parse(cons_parser=None)

The main advantages of parsing with buzz are that:

  • Parse results are stored as valid CONLL-U 2.0
  • Metadata is respected, and transferred into the output files
  • You can do constituency and dependency parsing at the same time (with parse trees being stored as CONLL-U metadata)

The parse() method returns another Corpus object, representing the newly created files. We can explore this corpus via commands like:

parsed.subcorpora.s1.files.e01
parsed.files[0]
parsed.subcorpora.s1[:5]
parsed.subcorpora['s1']

Loading corpora into memory

You can use the load() method to load a whole or partial corpus into memory, as a Dataset object, which extends the pandas DataFrame.

loaded = parsed.load()

You don't need to load corpora into memory to work on them, but it's great for small corpora. As a rule of thumb, datasets under a million words should be easily loadable on a personal computer.
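
If only part of the corpus is needed, you can load just that part. A minimal sketch, assuming that subcorpus and file objects expose the same load() method as the corpus itself:

# load a single subcorpus, rather than everything
s1 = parsed.subcorpora.s1.load()
# or a single file
e01 = parsed.subcorpora.s1.files.e01.load()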

The loaded corpus is a Dataset object, which is based on the pandas DataFrame. So, you can use pandas methods on it:

loaded.head()
file s i w l x p g f e aired emph_token exposition interrogative_type move parse question sent_id sent_len speaker text
e01 1 1 My -PRON- DET PRP$ 2 poss _ 10.01.1999 _ True intonation info-request (S (NP (NP (PRP$ My) (NN understanding)) (PP (... _ 1 14 MELFI My understanding from Dr. Cusamano, your famil...
2 understanding understanding NOUN NN 13 nsubjpass _ 10.01.1999 _ True intonation info-request (S (NP (NP (PRP$ My) (NN understanding)) (PP (... _ 1 14 MELFI My understanding from Dr. Cusamano, your famil...
3 from from ADP IN 2 prep _ 10.01.1999 _ True intonation info-request (S (NP (NP (PRP$ My) (NN understanding)) (PP (... _ 1 14 MELFI My understanding from Dr. Cusamano, your famil...
4 Dr. Dr. PROPN NNP 5 compound _ 10.01.1999 _ True intonation info-request (S (NP (NP (PRP$ My) (NN understanding)) (PP (... _ 1 14 MELFI My understanding from Dr. Cusamano, your famil...
5 Cusamano Cusamano PROPN NNP 3 pobj _ 10.01.1999 _ True intonation info-request (S (NP (NP (PRP$ My) (NN understanding)) (PP (... _ 1 14 MELFI My understanding from Dr. Cusamano, your famil...
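
Because Dataset extends the DataFrame, ordinary pandas operations work too, using the column names shown above. For example:

# token counts per speaker, in plain pandas
loaded['speaker'].value_counts()
# the ten most frequent lemmata
loaded['l'].value_counts().head(10)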

You can also interactively explore the corpus with tabview using the view() method:

loaded.view()

The interactive view has a number of cool features, such as the ability to sort by row or column. Also, pressing enter on a given line will generate a concordance based on that line's contents. Neat!

Exploring parsed and loaded corpora

A loaded corpus (Dataset) is a pandas DataFrame object. The index is a multiindex, comprising filename, sent_id and token. Each token in the corpus is therefore uniquely identifiable through this index. The columns for the loaded corpus are all the CONLL columns, plus anything included as metadata.

# get the first sentence using the sent() method
first = loaded.sent(0)
# using pandas syntax to get first 5 words
first.iloc[:5]['w']
# join the wordclasses and words
print(' '.join(first.x.str.cat(first.w, sep='/')))
"DET/My NOUN/understanding ADP/from PROPN/Dr. PROPN/Cusamano PUNCT/, DET/your NOUN/family NOUN/physician PUNCT/, VERB/is PRON/you VERB/collapsed PUNCT/?

You don't need to know pandas in order to use buzz, however, because buzz provides some more intuitive methods with linguistics in mind. For example, if you want to slice the corpus in some way, you can easily do this using the just and skip properties:

tony = loaded.just.speaker.TONY
# skip is the inverse; for example, removing all punctuation tokens:
no_punct = loaded.skip.wordclass.PUNCT
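
just and skip can also be chained, with each other and with other buzz methods, to build up a filter step by step. A quick sketch, combining the two examples above:

# Tony's speech, with punctuation tokens removed
tony_no_punct = loaded.just.speaker.TONY.skip.wordclass.PUNCT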

Any object created by buzz has a .view() method, which launches a tabview interactive space where you can explore corpora, frequencies or concordances.

spaCy

spaCy is used under the hood for dependency parsing, and a couple of other things. spaCy brings with it a lot of state-of-the-art methods in NLP. You can access the spaCy representation of your data with:

corpus.to_spacy()
# or
loaded.to_spacy()

Searching dependencies

To search the dependency graph generated by spaCy during parsing, you can use the depgrep method.

# search dependencies for nominal subjects with definite articles
nsubj = loaded.depgrep('f/nsubj.*/ -> (w"the" & x"DET")')

The search language works by modelling nodes and the links between them. A node such as f/nsubj/ is specified by giving the feature you want to match (f for function), followed by a query inside slashes (a regular expression) or inside quotation marks (a literal match).

The arrow-like link specifies that the nsubj must govern the determiner. The & relation specifies that the two nodes are actually the same node. Brackets may be necessary to contain the query.

This language is based on tgrep syntax, customised for dependencies. It is still a work in progress, but documentation should emerge here.
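
To make the node and link syntax a little more concrete, here is one more small query, annotated. The query itself is only illustrative, and assumes l is the lemma feature (as in the CONLL columns above):

# l"be"     : a token whose lemma is literally "be" (quotation marks = literal match)
# ->        : which governs...
# f/nsubj/  : ...a token whose function matches the regular expression nsubj
copula_subjects = loaded.depgrep('l"be" -> f/nsubj/')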

When you search a Corpus or a Dataset, the result is simply another Dataset, representing a subset of the Corpus. Therefore, rather than trying to construct one query string that gets everything you want, it is often easier to perform multiple small searches:

tony_subjects = loaded.just.speaker.TONY.depgrep('f/nsubj/ <- f/ROOT/')

rather than the more error-prone:

tony_subjects = loaded.depgrep('f/nsubj/ <- f/ROOT/ & speaker/TONY/')

Note that for any searches that do not require traversal of the grammatical structure, you should use the skip and just methods. tgrep and depgrep only need to be used when your search involves the grammar, and not just token features.
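
For example, finding all nominal subjects needs no grammar traversal at all, whereas finding nominal subjects of the sentence root does:

# token features only: just is the simpler, faster choice
nsubjs = loaded.just.function.nsubj
# grammatical structure involved: use depgrep
root_subjects = loaded.depgrep('f/nsubj/ <- f/ROOT/')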

Searching constituency trees

Constituency tree searching can be done with the tgrep method, which provides a Python implementation of the tgrep2 query syntax:

nps_with_adjectives = loaded.tgrep('NP < JJ')

It also works with nodes and links, though there are numerous differences. In particular, note that arrows appear reversed --- NP < JJ is an NP that dominates a JJ, while something similar in depgrep would be f/nsubj/ -> f/amod/, a nominal subject governing an adjective.
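
As another small sketch, using standard tgrep2 operators (the query is purely illustrative):

# NPs dominated by a VP at any depth (>> is deep dominance in tgrep2)
nps_inside_vps = loaded.tgrep('NP >> VP')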

Viewing search results

An important principle in buzz is the separation of searching and viewing results. Unlike many other tools, you do not search for a concordance---instead, you search the corpus, and then visualise the output of the data as a concordance.

Concordancing

Concordancing is a nice way of looking at results. The main thing you have to do is tell buzz how you want the match column to look---it can be just the matching words, but also any combination of token features. To show words and their parts of speech, you can do:

nsubj = loaded.just.function.nsubj
nsubj.concordance(show=['w', 'p'])
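
As with other buzz objects, the resulting concordance has a .view() method, so you can explore the lines interactively:

conc = nsubj.concordance(show=['w', 'p'])
conc.view()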

Frequency tables

You can turn your dataset into frequency tables, either before or after searching or filtering. Tabling takes a show argument, similar to the show argument for concordancing, as well as an additional subcorpora argument. show determines how the columns will be formatted, and subcorpora is used as the index. Below, we create a frequency table of nsubj tokens, in lemma form, organised by speaker.

tab = nsubj.table(show='l', subcorpora=['speaker'])

Possible keyword arguments for the .table() method are as follows:

  • subcorpora (default ['file']): feature(s) to use as the index of the table. Passing a list of multiple features creates a multiindex.
  • show (default ['w']): feature(s) to use as the columns of the table. Passing a list joins the features with a slash, so ['w', 'p'] results in column names like 'friend/NN'.
  • sort (default 'total'): how to sort the results. Options are 'total'/'infreq', 'increase'/'decrease', 'static'/'turbulent' and 'name'/'inverse'.
  • relative (default False): use relative rather than absolute frequencies with True. You can also pass in a Series, DataFrame or buzz object to calculate relative frequencies against the passed-in data.
  • remove_above_p (default False): sorting by increase/decrease/static/turbulent calculates the slope of the frequencies across each subcorpus, and p-values where the null hypothesis is no slope. If you pass in a float, entries with p-values above this float are dropped from the results. Passing in True will use 0.05.
  • keep_stats (default False): if True, keep the statistics generated for the trajectory calculation.
  • preserve_case (default False): keep the original case for show (column) values.
  • multiindex_columns (default False): when show is a list with multiple features, build a multiindex rather than joining show with slashes.

This creates a Table object, which is also based on DataFrame. You can use its .view() method to quickly explore results. Pressing enter on a given frequency will bring up a concordance of instances of this entry.
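
For example, a few of these arguments combined (a sketch built from the options listed above):

# relative frequencies of lemmata, by speaker, sorted by increasing frequency
tab = nsubj.table(show='l', subcorpora=['speaker'], relative=True, sort='increase')
tab.view()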

Plotting

You can also use buzz to create high-quality visualisations of frequency data. This relies completely on pandas' plotting method. A plot method more tailored to language datasets is still in development.

tab.plot(...)
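
Since plotting is delegated to pandas, the usual pandas plotting keyword arguments should work here; the arguments below are illustrative rather than buzz-specific:

# a line plot of the frequency table, via pandas
ax = tab.plot(kind='line', figsize=(10, 4))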
