pybiber

Extract Biber features from a document parsed and annotated by spaCy.

Project description

The pybiber package provides tools for extracting 67 lexicogrammatical and functional features described by Biber (1988) and widely used for text-type, register, and genre classification tasks in corpus linguistics.

Key Features:

67 Linguistic Features: Automated extraction of tense markers, pronouns, subordination patterns, modal verbs, and more
Multi-Dimensional Analysis: Complete implementation of Biber’s MDA methodology for register analysis
Principal Component Analysis: Alternative dimensionality reduction approaches with visualization tools
High Performance: Built on spaCy and Polars for efficient text processing
End-to-End Pipeline: From raw text files to statistical analysis in just a few lines of code
Comprehensive Visualization: Built-in plotting functions for exploratory data analysis

Applications:

Register and genre analysis in corpus linguistics
Text classification and machine learning preprocessing
Diachronic language change studies
Cross-linguistic variation research
Academic writing analysis and pedagogical applications
Stylometric analysis and authorship attribution

The package uses spaCy part-of-speech tagging and dependency parsing with Polars DataFrames for high-performance analytics.

Accuracy Note: Feature extraction relies on probabilistic taggers, so its accuracy depends on the quality of the spaCy model. Texts with irregular spellings or non-standard punctuation may produce unreliable output unless the tagger has been tuned for that domain.

See the documentation for comprehensive guides and API reference.

See pseudobibeR for the R implementation.

Quick Start

One-line processing from a folder of text files:

import pybiber as pb

# Process all .txt files in a directory
pipeline = pb.PybiberPipeline(model="en_core_web_sm")
features = pipeline.run_from_folder("path/to/texts")

Multi-Dimensional Analysis with visualization:

# Create analyzer for statistical analysis
analyzer = pb.BiberAnalyzer(features)

# Perform MDA and generate scree plot
mda_results = analyzer.mda()
analyzer.mdaviz_screeplot()

# Plot group means by dimension
analyzer.mdaviz_groupmeans(grouping_var="register")

Installation

You can install the released version of pybiber from PyPI:

pip install pybiber

Install a spaCy model:

python -m spacy download en_core_web_sm

Usage

Data Requirements

The pybiber package works with corpora structured as DataFrames with two columns:

doc_id column: Unique document identifiers
text column: Raw text content

This follows conventions from readtext and quanteda.
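As a rough illustration of that layout, the sketch below gathers a folder of .txt files into the two expected columns using only the standard library. The helper name `read_corpus` is hypothetical and not part of pybiber, and using the file name as `doc_id` is an assumption made for this example.

```python
from pathlib import Path


def read_corpus(folder: str) -> dict[str, list[str]]:
    """Collect .txt files into doc_id/text columns (hypothetical helper)."""
    columns: dict[str, list[str]] = {"doc_id": [], "text": []}
    # Sort the paths so the row order is reproducible across runs.
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumption: the file name serves as the document identifier.
        # Names are unique within one folder, so doc_id is unique too.
        columns["doc_id"].append(path.name)
        columns["text"].append(path.read_text(encoding="utf-8"))
    return columns
```

Passing the returned dictionary to `polars.DataFrame(...)` should give a DataFrame with `doc_id` and `text` columns in the shape described above.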

Step-by-Step Workflow

Import libraries and load spaCy model:

import spacy
import pybiber as pb
from pybiber.data import micusp_mini  # Sample corpus

nlp = spacy.load("en_core_web_sm")

Parse corpus with spaCy:

# Parse texts to extract linguistic annotations (modern approach)
processor = pb.CorpusProcessor()
tokens_df = processor.process_corpus(micusp_mini, nlp)

Extract Biber features:

# Aggregate 67 linguistic features per document
features_df = pb.biber(tokens_df)

Advanced Analysis (optional):

# Statistical analysis and visualization
analyzer = pb.BiberAnalyzer(features_df)

# Multi-Dimensional Analysis
mda_results = analyzer.mda()

# Principal Component Analysis
pca_results = analyzer.pca()

# Visualization options
analyzer.mdaviz_screeplot()           # Eigenvalue plot
analyzer.pcaviz_contrib()             # Feature contributions
analyzer.mdaviz_groupmeans(group_var="genre")  # Group comparisons

Pipeline Convenience Functions

For streamlined processing, use the high-level pipeline:

from pybiber import PybiberPipeline

pipeline = PybiberPipeline(model="en_core_web_sm", disable_ner=True)

# From folder of .txt files
features_df = pipeline.run_from_folder("/path/to/texts")

# From in-memory corpus
features_df, tokens_df = pipeline.run(corpus_df, return_tokens=True)

# One-liner convenience functions
features_df = pb.run_biber_from_folder("/path/to/texts")
features_df = pb.run_biber(corpus_df)

Feature Categories

The package extracts 67 linguistic features across 16 categories:

Tense & Aspect: Past tense, perfect aspect, present tense
Adverbials: Place and time adverbials
Pronouns: 1st/2nd/3rd person, demonstrative, indefinite pronouns
Questions: Direct wh-questions
Nominal Forms: Nominalizations, gerunds, nouns
Passives: Agentless and by-passives
Stative Forms: be as main verb, existential there
Subordination: 18 different clause types (that-clauses, wh-clauses, infinitives, relatives, etc.)
Modification: Prepositional phrases, attributive/predicative adjectives, adverbs
Lexical Specificity: Type-token ratio, word length
Lexical Classes: Conjuncts, hedges, amplifiers, emphatics, discourse particles
Modals: Possibility, necessity, and predictive modals
Specialized Verbs: Public, private, suasive verbs
Reduced Forms: Contractions, deletions, split constructions
Coordination: Phrasal and clausal coordination
Negation: Synthetic and analytic negation

See the full feature list for detailed descriptions.
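To make the Lexical Specificity category more concrete, here is a plain-Python sketch of what a type-token ratio and a mean word length measure. It is an illustration of the two concepts only: the function names are hypothetical, tokenization is a simple whitespace split, and pybiber's own computation (which works from spaCy tokens) may differ in its details.

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Distinct word forms divided by total tokens (case-insensitive)."""
    if not tokens:
        return 0.0
    return len({t.lower() for t in tokens}) / len(tokens)


def mean_word_length(tokens: list[str]) -> float:
    """Average number of characters per token."""
    if not tokens:
        return 0.0
    return sum(len(t) for t in tokens) / len(tokens)


tokens = "the cat sat on the mat".split()
print(type_token_ratio(tokens))  # 5 distinct forms / 6 tokens
print(mean_word_length(tokens))  # 17 characters / 6 tokens
```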

Performance & Requirements

System Requirements:

Python 3.10+
spaCy model with POS tagging and dependency parsing (e.g., en_core_web_sm)
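One way to confirm these requirements before running anything is a small standard-library check like the one sketched here. The function name `check_environment` is hypothetical; the check relies on the fact that spaCy models installed with `spacy download` are importable Python packages.

```python
import importlib.util
import sys


def check_environment(model: str = "en_core_web_sm") -> dict[str, bool]:
    """Report whether the interpreter, spaCy, and the model are available."""
    return {
        # pybiber lists Python 3.10+ as a requirement.
        "python_ok": sys.version_info >= (3, 10),
        "spacy_installed": importlib.util.find_spec("spacy") is not None,
        # Installed spaCy models can be located like any other package.
        "model_installed": importlib.util.find_spec(model) is not None,
    }


print(check_environment())
```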

Performance Notes:

Built on Polars for fast DataFrame operations
Supports multiprocessing for large corpora
Memory-efficient processing with configurable batch sizes
Processing time: ~20-30 seconds for small corpora (e.g., 500 documents)
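The idea behind batch processing can be sketched in a few lines: texts are consumed in fixed-size groups so that only one group is held in memory at a time. This is a generic illustration with a hypothetical `batched` helper, not pybiber's internal code. In spaCy itself, `nlp.pipe` accepts `batch_size` and `n_process` arguments for the same purpose; how pybiber exposes those settings is not shown here.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next batch lazily; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


for batch in batched(["doc one", "doc two", "doc three"], batch_size=2):
    print(len(batch), batch)
```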

License

Code licensed under the MIT License. See the LICENSE file.

Project details

Maintainers

browndw

Release history

0.3.1: Mar 6, 2026
0.3.0: Jan 10, 2026 (this version)
0.2.0: Sep 8, 2025
0.1.1: Feb 21, 2025
0.1.0: Jan 20, 2025

Download files

Source Distribution

pybiber-0.3.0.tar.gz (1.8 MB)

Uploaded Jan 10, 2026 Source

Built Distribution

pybiber-0.3.0-py2.py3-none-any.whl (1.8 MB)

Uploaded Jan 10, 2026 Python 2, Python 3

File details

Details for the file pybiber-0.3.0.tar.gz.

File metadata

Download URL: pybiber-0.3.0.tar.gz
Upload date: Jan 10, 2026
Size: 1.8 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pybiber-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`787c9943ac83050b6d00dc86f0340cfe2ea7a35953ce69c193716ce15484ce2b`
MD5	`4089bb02ea2887301b08a0114a46e1e7`
BLAKE2b-256	`b7174946f6f4bdf81b9fbb17535b511880993c761a6612adfc50c0c8ad374880`
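A downloaded file can be checked against the SHA256 digest listed above with the standard library's `hashlib`. The sketch below is one way to do it; the helper name `sha256_of_file` is hypothetical, and the commented-out comparison assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large files are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest published for pybiber-0.3.0.tar.gz in the table above.
EXPECTED = "787c9943ac83050b6d00dc86f0340cfe2ea7a35953ce69c193716ce15484ce2b"
# An intact download should satisfy:
# sha256_of_file("pybiber-0.3.0.tar.gz") == EXPECTED
```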

Provenance

The following attestation bundles were made for pybiber-0.3.0.tar.gz:

Publisher: ci.yml on browndw/pybiber

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pybiber-0.3.0.tar.gz
- Subject digest: 787c9943ac83050b6d00dc86f0340cfe2ea7a35953ce69c193716ce15484ce2b
- Sigstore transparency entry: 813346887
- Sigstore integration time: Jan 10, 2026
Source repository:
- Permalink: browndw/pybiber@d400af102f1168960850ec360debaf1b40cda761
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/browndw
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@d400af102f1168960850ec360debaf1b40cda761
- Trigger Event: push

File details

Details for the file pybiber-0.3.0-py2.py3-none-any.whl.

File metadata

Download URL: pybiber-0.3.0-py2.py3-none-any.whl
Upload date: Jan 10, 2026
Size: 1.8 MB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pybiber-0.3.0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`85e97748db39098f74f3022931fa4ca728acfdc313d883f450fb4816e6ad75a0`
MD5	`979269547d89ef3bc123056193ac442a`
BLAKE2b-256	`899d2c4ad08b518ee12c8ea2201275f45c0a874f40f28b7a09ebfc69b227e3f5`

Provenance

The following attestation bundles were made for pybiber-0.3.0-py2.py3-none-any.whl:

Publisher: ci.yml on browndw/pybiber

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pybiber-0.3.0-py2.py3-none-any.whl
- Subject digest: 85e97748db39098f74f3022931fa4ca728acfdc313d883f450fb4816e6ad75a0
- Sigstore transparency entry: 813346891
- Sigstore integration time: Jan 10, 2026
Source repository:
- Permalink: browndw/pybiber@d400af102f1168960850ec360debaf1b40cda761
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/browndw
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@d400af102f1168960850ec360debaf1b40cda761
- Trigger Event: push
