Extract Biber features from a document parsed and annotated by spaCy.

The pybiber package provides tools for extracting 67 lexicogrammatical and functional features described by Biber (1988) and widely used for text-type, register, and genre classification tasks in corpus linguistics.

Key Features:

  • 67 Linguistic Features: Automated extraction of tense markers, pronouns, subordination patterns, modal verbs, and more

  • Multi-Dimensional Analysis: Complete implementation of Biber’s MDA methodology for register analysis

  • Principal Component Analysis: Alternative dimensionality reduction approaches with visualization tools

  • High Performance: Built on spaCy and Polars for efficient text processing

  • End-to-End Pipeline: From raw text files to statistical analysis in just a few lines of code

  • Comprehensive Visualization: Built-in plotting functions for exploratory data analysis

Applications:

  • Register and genre analysis in corpus linguistics

  • Text classification and machine learning preprocessing

  • Diachronic language change studies

  • Cross-linguistic variation research

  • Academic writing analysis and pedagogical applications

  • Stylometric analysis and authorship attribution

The package uses spaCy part-of-speech tagging and dependency parsing with Polars DataFrames for high-performance analytics.

Accuracy Note: Feature extraction relies on probabilistic taggers, so its accuracy depends on the quality of the spaCy model. Texts with irregular spellings or non-standard punctuation may produce unreliable output unless the tagger has been tuned for that domain.

See the documentation for comprehensive guides and API reference.

See pseudobibeR for the R implementation.

Quick Start

Process a folder of text files in a few lines:

import pybiber as pb

# Process all .txt files in a directory
pipeline = pb.PybiberPipeline(model="en_core_web_sm")
features = pipeline.run_from_folder("path/to/texts")

Multi-Dimensional Analysis with visualization:

# Create analyzer for statistical analysis
analyzer = pb.BiberAnalyzer(features)

# Perform MDA and generate scree plot
mda_results = analyzer.mda()
analyzer.mdaviz_screeplot()

# Plot group means by dimension
analyzer.mdaviz_groupmeans(grouping_var="register")

Installation

You can install the released version of pybiber from PyPI:

pip install pybiber

Install a spaCy model:

python -m spacy download en_core_web_sm

Usage

Data Requirements

The pybiber package works with corpora structured as DataFrames with two columns:

  • doc_id column: Unique document identifiers

  • text column: Raw text content

This follows conventions from readtext and quanteda.
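
As a minimal sketch of that layout, the snippet below builds doc_id/text records from a folder of .txt files using only the standard library. The helper name read_corpus and the use of the file stem as doc_id are assumptions made for illustration; neither is part of pybiber. The resulting list of dicts could then be passed to polars.DataFrame(...).

```python
from pathlib import Path


def read_corpus(folder):
    """Return one {"doc_id", "text"} record per .txt file in `folder`.

    Hypothetical helper for illustration; not part of pybiber.
    """
    records = []
    # Sort for a stable document order across runs.
    for path in sorted(Path(folder).glob("*.txt")):
        records.append({
            "doc_id": path.stem,  # assumption: file name without .txt is the ID
            "text": path.read_text(encoding="utf-8"),
        })
    return records
```

Because every file in one folder has a distinct name, the doc_id values produced this way are unique, which is what the doc_id column requires.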

Step-by-Step Workflow

  1. Import libraries and load spaCy model:

import spacy
import pybiber as pb
from pybiber.data import micusp_mini  # Sample corpus

nlp = spacy.load("en_core_web_sm")

  2. Parse corpus with spaCy:

# Parse texts to extract linguistic annotations (modern approach)
processor = pb.CorpusProcessor()
tokens_df = processor.process_corpus(micusp_mini, nlp)

  3. Extract Biber features:

# Aggregate 67 linguistic features per document
features_df = pb.biber(tokens_df)

  4. Advanced Analysis (optional):

# Statistical analysis and visualization
analyzer = pb.BiberAnalyzer(features_df)

# Multi-Dimensional Analysis
mda_results = analyzer.mda()

# Principal Component Analysis
pca_results = analyzer.pca()

# Visualization options
analyzer.mdaviz_screeplot()           # Eigenvalue plot
analyzer.pcaviz_contrib()             # Feature contributions
analyzer.mdaviz_groupmeans(group_var="genre")  # Group comparisons
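
To make the idea behind dimension scores more concrete, here is a simplified, standard-library sketch of how a Biber-style score can be computed for one dimension: features are standardized to z-scores, those with salient positive loadings are added, and those with salient negative loadings are subtracted. The function name dimension_scores, the input shapes, and the 0.35 cutoff are assumptions for illustration; this is not pybiber's implementation, and it does not perform the factor analysis that produces the loadings.

```python
from statistics import mean, stdev


def dimension_scores(rows, loadings, threshold=0.35):
    """Score each document on one dimension from z-scored features.

    rows:     {doc_id: {feature: normalized frequency}}
    loadings: {feature: factor loading on this dimension}
    Needs at least two documents so a standard deviation exists.
    Hypothetical helper for illustration; not part of pybiber.
    """
    # Keep only features whose loading is salient in either direction.
    salient = {f: w for f, w in loadings.items() if abs(w) >= threshold}

    # Mean and sample standard deviation of each salient feature.
    stats = {}
    for f in salient:
        values = [doc[f] for doc in rows.values()]
        stats[f] = (mean(values), stdev(values))

    scores = {}
    for doc_id, doc in rows.items():
        total = 0.0
        for f, w in salient.items():
            mu, sd = stats[f]
            z = (doc[f] - mu) / sd if sd else 0.0
            # Positive loadings add the z-score; negative ones subtract it.
            total += z if w > 0 else -z
        scores[doc_id] = total
    return scores
```

Group means like those plotted above are then just the average of these per-document scores within each group.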

Pipeline Convenience Functions

For streamlined processing, use the high-level pipeline:

from pybiber import PybiberPipeline

pipeline = PybiberPipeline(model="en_core_web_sm", disable_ner=True)

# From folder of .txt files
features_df = pipeline.run_from_folder("/path/to/texts")

# From in-memory corpus (corpus_df: DataFrame with doc_id and text columns)
features_df, tokens_df = pipeline.run(corpus_df, return_tokens=True)

# One-liner convenience functions
features_df = pb.run_biber_from_folder("/path/to/texts")
features_df = pb.run_biber(corpus_df)

Feature Categories

The package extracts 67 linguistic features across 16 categories:

  • Tense & Aspect: Past tense, perfect aspect, present tense

  • Adverbials: Place and time adverbials

  • Pronouns: 1st/2nd/3rd person, demonstrative, indefinite pronouns

  • Questions: Direct wh-questions

  • Nominal Forms: Nominalizations, gerunds, nouns

  • Passives: Agentless and by-passives

  • Stative Forms: be as main verb, existential there

  • Subordination: 18 different clause types (that-clauses, wh-clauses, infinitives, relatives, etc.)

  • Modification: Prepositional phrases, attributive/predicative adjectives, adverbs

  • Lexical Specificity: Type-token ratio, word length

  • Lexical Classes: Conjuncts, hedges, amplifiers, emphatics, discourse particles

  • Modals: Possibility, necessity, and predictive modals

  • Specialized Verbs: Public, private, suasive verbs

  • Reduced Forms: Contractions, deletions, split constructions

  • Coordination: Phrasal and clausal coordination

  • Negation: Synthetic and analytic negation

See the full feature list for detailed descriptions.
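
For a sense of what the simplest of these features measure, the sketch below computes the two Lexical Specificity features from a plain list of word tokens. The function names and the 400-token window (Biber's 1988 convention for type-token ratio) are assumptions for illustration; pybiber's own calculation works from spaCy annotations and may differ.

```python
def type_token_ratio(tokens, window=400):
    """Share of distinct word forms among the first `window` tokens."""
    # Lowercase so "The" and "the" count as one type.
    sample = [t.lower() for t in tokens[:window]]
    return len(set(sample)) / len(sample) if sample else 0.0


def mean_word_length(tokens):
    """Average number of characters per token."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
```

Most of the other features need part-of-speech tags or dependency relations, which is why the package requires a parsed corpus rather than raw token lists.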

Performance & Requirements

System Requirements:

  • Python 3.10+

  • spaCy model with POS tagging and dependency parsing (e.g., en_core_web_sm)

Performance Notes:

  • Built on Polars for fast DataFrame operations

  • Supports multiprocessing for large corpora

  • Memory-efficient processing with configurable batch sizes

  • Processing time: ~20-30 seconds for small corpora (e.g., 500 documents)
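
The batch-size idea can be illustrated independently of pybiber: rather than holding every document in memory at once, texts are handed to the parser in fixed-size chunks. The generator below is a generic sketch with a hypothetical name, not pybiber's internal code; spaCy itself offers the same kind of control through the batch_size and n_process arguments of nlp.pipe.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items.

    Hypothetical helper for illustration; not part of pybiber.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next chunk lazily; an empty chunk ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

Smaller batches lower peak memory use at the cost of more per-batch overhead, so the best value depends on document length and available RAM.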

License

Code licensed under the MIT License. See the LICENSE file.
