Extract Biber features from a document parsed and annotated by spaCy.
Project description
The pybiber package provides tools for extracting 67 lexicogrammatical and functional features described by Biber (1988) and widely used for text-type, register, and genre classification tasks in corpus linguistics.
Key Features:
67 Linguistic Features: Automated extraction of tense markers, pronouns, subordination patterns, modal verbs, and more
Multi-Dimensional Analysis: Complete implementation of Biber’s MDA methodology for register analysis
Principal Component Analysis: Alternative dimensionality reduction approaches with visualization tools
High Performance: Built on spaCy and Polars for efficient text processing
End-to-End Pipeline: From raw text files to statistical analysis in just a few lines of code
Comprehensive Visualization: Built-in plotting functions for exploratory data analysis
Applications:
Register and genre analysis in corpus linguistics
Text classification and machine learning preprocessing
Diachronic language change studies
Cross-linguistic variation research
Academic writing analysis and pedagogical applications
Stylometric analysis and authorship attribution
The package relies on spaCy for part-of-speech tagging and dependency parsing, and on Polars DataFrames for high-performance analytics.
Accuracy Note: Feature extraction builds on probabilistic taggers, so accuracy depends on model quality. Texts with irregular spellings or non-standard punctuation may produce unreliable output unless the taggers are specifically tuned for those domains.
See the documentation for comprehensive guides and API reference.
See pseudobibeR for the R implementation.
Quick Start
One-line processing from a folder of text files:
import pybiber as pb
# Process all .txt files in a directory
pipeline = pb.PybiberPipeline(model="en_core_web_sm")
features = pipeline.run_from_folder("path/to/texts")
Multi-Dimensional Analysis with visualization:
# Create analyzer for statistical analysis
analyzer = pb.BiberAnalyzer(features)
# Perform MDA and generate scree plot
mda_results = analyzer.mda()
analyzer.mdaviz_screeplot()
# Plot group means by dimension
analyzer.mdaviz_groupmeans(grouping_var="register")
Installation
You can install the released version of pybiber from PyPI:
pip install pybiber
Install a spaCy model:
python -m spacy download en_core_web_sm
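Because spaCy models installed this way are ordinary Python packages, a script can check for the model before starting a long run. The following is a minimal sketch using only the standard library; the helper name `spacy_model_installed` is hypothetical and not part of pybiber or spaCy.

```python
import importlib.util


def spacy_model_installed(name: str = "en_core_web_sm") -> bool:
    """Return True if a package with this name can be imported.

    Hypothetical helper: it only checks that the package is importable,
    not that the model is compatible with the installed spaCy version.
    """
    return importlib.util.find_spec(name) is not None


if not spacy_model_installed():
    print("Model missing; run: python -m spacy download en_core_web_sm")
```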
Usage
Data Requirements
The pybiber package works with corpora structured as DataFrames with:
- doc_id column: Unique document identifiers
- text column: Raw text content
This follows conventions from readtext and quanteda.
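As an illustration of that layout, the sketch below builds a tiny in-memory corpus as a list of rows and checks the two requirements stated above: both columns are present and every doc_id is unique. The sample texts and the `validate_corpus` helper are assumptions made for this example and are not part of pybiber.

```python
REQUIRED_COLUMNS = ("doc_id", "text")

# Made-up sample rows in the doc_id/text layout.
corpus_rows = [
    {"doc_id": "essay_001.txt", "text": "This paper argues that registers differ."},
    {"doc_id": "essay_002.txt", "text": "We collected the samples last spring."},
]


def validate_corpus(rows: list[dict]) -> None:
    """Raise ValueError if a row lacks a required column or a doc_id repeats."""
    seen: set[str] = set()
    for i, row in enumerate(rows):
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise ValueError(f"row {i} is missing columns: {missing}")
        if row["doc_id"] in seen:
            raise ValueError(f"duplicate doc_id: {row['doc_id']!r}")
        seen.add(row["doc_id"])


validate_corpus(corpus_rows)  # passes silently for the sample rows
```

Rows in this shape can then be turned into a DataFrame, for example with `polars.DataFrame(corpus_rows)`.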
Step-by-Step Workflow
Import libraries and load spaCy model:
import spacy
import pybiber as pb
from pybiber.data import micusp_mini # Sample corpus
nlp = spacy.load("en_core_web_sm")
Parse corpus with spaCy:
# Parse texts to extract linguistic annotations (modern approach)
processor = pb.CorpusProcessor()
tokens_df = processor.process_corpus(micusp_mini, nlp)
Extract Biber features:
# Aggregate 67 linguistic features per document
features_df = pb.biber(tokens_df)
Advanced Analysis (optional):
# Statistical analysis and visualization
analyzer = pb.BiberAnalyzer(features_df)
# Multi-Dimensional Analysis
mda_results = analyzer.mda()
# Principal Component Analysis
pca_results = analyzer.pca()
# Visualization options
analyzer.mdaviz_screeplot() # Eigenvalue plot
analyzer.pcaviz_contrib() # Feature contributions
analyzer.mdaviz_groupmeans(group_var="genre") # Group comparisons
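For readers new to Multi-Dimensional Analysis, it may help to see the arithmetic behind a dimension score. In Biber-style MDA, a document's score on a dimension is commonly computed by standardizing each feature across the corpus, then summing the z-scores of the features that load positively and subtracting those that load negatively. The sketch below shows only that step with the standard library. It is not pybiber's implementation, and the feature names, rates, and the assignment of features to the positive and negative poles are invented for illustration.

```python
from statistics import mean, stdev


def zscores(values: list[float]) -> list[float]:
    """Standardize values to mean 0 and sample standard deviation 1."""
    m, s = mean(values), stdev(values)
    if s == 0:
        raise ValueError("feature has zero variance and cannot be standardized")
    return [(v - m) / s for v in values]


def dimension_scores(
    features: dict[str, list[float]],
    positive: list[str],
    negative: list[str],
) -> list[float]:
    """Sum z-scores of positive features minus z-scores of negative ones."""
    z = {name: zscores(vals) for name, vals in features.items()}
    n_docs = len(next(iter(features.values())))
    return [
        sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        for i in range(n_docs)
    ]


# Hypothetical per-document rates for three documents.
rates = {
    "private_verbs": [10, 20, 30],
    "nouns": [300, 200, 100],
}
scores = dimension_scores(rates, positive=["private_verbs"], negative=["nouns"])
```

Which features belong to each pole is decided by the factor loadings from the analysis itself; here that assignment is simply given as input.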
Pipeline Convenience Functions
For streamlined processing, use the high-level pipeline:
import pybiber as pb
from pybiber import PybiberPipeline
pipeline = PybiberPipeline(model="en_core_web_sm", disable_ner=True)
# From folder of .txt files
features_df = pipeline.run_from_folder("/path/to/texts")
# From an in-memory corpus (corpus_df: a DataFrame with doc_id and text columns)
features_df, tokens_df = pipeline.run(corpus_df, return_tokens=True)
# One-liner convenience functions
features_df = pb.run_biber_from_folder("/path/to/texts")
features_df = pb.run_biber(corpus_df)
Feature Categories
The package extracts 67 linguistic features across 16 categories:
Tense & Aspect: Past tense, perfect aspect, present tense
Adverbials: Place and time adverbials
Pronouns: 1st/2nd/3rd person, demonstrative, indefinite pronouns
Questions: Direct wh-questions
Nominal Forms: Nominalizations, gerunds, nouns
Passives: Agentless and by-passives
Stative Forms: be as main verb, existential there
Subordination: 18 different clause types (that-clauses, wh-clauses, infinitives, relatives, etc.)
Modification: Prepositional phrases, attributive/predicative adjectives, adverbs
Lexical Specificity: Type-token ratio, word length
Lexical Classes: Conjuncts, hedges, amplifiers, emphatics, discourse particles
Modals: Possibility, necessity, and predictive modals
Specialized Verbs: Public, private, suasive verbs
Reduced Forms: Contractions, deletions, split constructions
Coordination: Phrasal and clausal coordination
Negation: Synthetic and analytic negation
See the full feature list for detailed descriptions.
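To make one category concrete, the sketch below computes the two Lexical Specificity measures named above, type-token ratio and mean word length, for a single string. It uses a naive regular-expression tokenizer rather than spaCy, so it illustrates the measures and is not how pybiber computes them. The function name `lexical_specificity` is hypothetical.

```python
import re


def lexical_specificity(text: str) -> dict[str, float]:
    """Compute type-token ratio and mean word length with a naive tokenizer."""
    # Lowercase, then keep alphabetic words with an optional apostrophe part.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        raise ValueError("text contains no words")
    return {
        "type_token_ratio": len(set(words)) / len(words),
        "mean_word_length": sum(len(w) for w in words) / len(words),
    }


stats = lexical_specificity("The cat saw the dog")
```

Type-token ratio falls as texts get longer, so values are only comparable across texts of similar length.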
Performance & Requirements
System Requirements:
- Python 3.10+
- spaCy model with POS tagging and dependency parsing (e.g., en_core_web_sm)
Performance Notes:
- Built on Polars for fast DataFrame operations
- Supports multiprocessing for large corpora
- Memory-efficient processing with configurable batch sizes
- Processing time: ~20-30 seconds for small corpora (e.g., 500 documents)
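The idea behind batch processing can be sketched without any dependencies: texts are consumed in fixed-size chunks so that only one chunk needs to be held in memory at a time. The `batched` helper below is a generic illustration and an assumption of this example, not pybiber's internal code.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    # islice pulls the next `size` items; an empty list ends the loop.
    while batch := list(islice(it, size)):
        yield batch


texts = ["doc one", "doc two", "doc three", "doc four", "doc five"]
batches = list(batched(texts, size=2))
```

spaCy exposes the same idea through the `batch_size` and `n_process` arguments of `nlp.pipe`.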
License
Code licensed under the MIT License. See the LICENSE file.