Lyrics word frequency analyzer using Genius API
BarScan
A Python CLI tool that analyzes word frequency in song lyrics using the Genius API.
Features
- Fetch lyrics for any artist from the Genius API
- Analyze word frequency across multiple songs
- Natural language processing with NLTK for accurate tokenization
- Customizable stop word filtering and exclusions
- Multiple output formats: table, JSON, CSV, and WordGrain
- Local caching to reduce API calls and improve performance
- Retry logic with exponential backoff for robust API communication
Installation
Prerequisites
- Python 3.11 or higher
- pip (latest version recommended)
From PyPI
pip install barscan
With Japanese Support
To analyze Japanese lyrics, install with the japanese extra:
pip install barscan[japanese]
This includes Janome for Japanese tokenization and additional stop words.
From Source
git clone https://github.com/shimpeiws/barscan.git
cd barscan
pip install -e ".[dev]"
Setup
Getting a Genius API Token
- Go to Genius API Clients
- Sign in with your Genius account (or create one)
- Click "Create an API Client"
- Fill in the app details:
  - App Name: Any name (e.g., "BarScan CLI")
  - App Website URL: Any URL (e.g., your GitHub profile)
  - Redirect URI: Leave default or use http://localhost
- Click "Save"
- Copy the "Client Access Token" (not the Client ID or Secret)
Configuring the Token
Set the token as an environment variable:
export BARSCAN_GENIUS_ACCESS_TOKEN=your_token_here
Or create a .env file in your project directory:
BARSCAN_GENIUS_ACCESS_TOKEN=your_token_here
Usage
Basic Analysis
Analyze the most common words in an artist's lyrics:
barscan analyze "Kendrick Lamar"
Command Options
# Analyze more songs
barscan analyze "Drake" --max-songs 20
# Show more words in results
barscan analyze "J. Cole" --top 100
# Combine options
barscan analyze "Tyler, The Creator" -n 15 -t 50
Output Formats
# Default table format (console)
barscan analyze "Beyonce"
# JSON format
barscan analyze "Beyonce" --format json
# CSV format
barscan analyze "Beyonce" --format csv
# WordGrain format (structured JSON schema)
barscan analyze "Beyonce" --format wordgrain
# Save to file
barscan analyze "Beyonce" --format json --output results.json
Filtering Options
# Disable stop word filtering (include "the", "a", "is", etc.)
barscan analyze "Eminem" --no-stop-words
# Exclude specific words
barscan analyze "Eminem" --exclude "yeah" --exclude "oh"
# Combine exclusions
barscan analyze "Eminem" -e "uh" -e "like" -e "yo"
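The filtering options above can be sketched in a few lines of Python. This is only an illustration, not BarScan's implementation: the `word_frequencies` function, the tiny `STOP_WORDS` set, and the regex tokenizer are all hypothetical stand-ins (the real tool uses NLTK), and computing percentages against the filtered word total is an assumption.

```python
import re
from collections import Counter

# Hypothetical, tiny stop word list for illustration only;
# the real tool relies on NLTK for tokenization and stop words.
STOP_WORDS = {"the", "a", "is", "and", "i", "you", "it", "to"}


def word_frequencies(lyrics, exclude=(), use_stop_words=True, top=50):
    """Return (word, count, percentage) rows for the most common words."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    blocked = {word.lower() for word in exclude}
    if use_stop_words:
        blocked |= STOP_WORDS
    kept = [token for token in tokens if token not in blocked]
    total = len(kept)
    counts = Counter(kept)
    # Assumption: percentages are relative to the words left after filtering.
    return [
        (word, count, round(100 * count / total, 2))
        for word, count in counts.most_common(top)
    ]


rows = word_frequencies("Yeah, I love the way you love it, yeah", exclude=["yeah"])
print(rows)
```

Passing `use_stop_words=False` mirrors the idea behind --no-stop-words, and each item in `exclude` plays the role of one --exclude flag.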
Cache Management
BarScan caches lyrics locally to reduce API calls:
# Clear all cached lyrics
barscan clear-cache --force
# Clear only expired cache entries
barscan clear-cache --expired-only --force
# Interactive confirmation (without --force)
barscan clear-cache
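To make the difference between clearing everything and clearing only expired entries concrete, here is a minimal sketch of a file-based cache with a time-to-live. The `LyricsCache` class, its method names, and the one-JSON-file-per-key layout are assumptions made for illustration; BarScan's actual cache format may differ.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


class LyricsCache:
    """Hypothetical file-based cache: one JSON file per key, expiring after a TTL."""

    def __init__(self, directory, ttl_hours=168):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.ttl_seconds = ttl_hours * 3600

    def _path(self, key):
        # Hash the key so any artist or song title maps to a safe file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def _expired(self, entry, now):
        return now - entry["stored_at"] > self.ttl_seconds

    def set(self, key, lyrics, now=None):
        now = time.time() if now is None else now
        entry = {"stored_at": now, "lyrics": lyrics}
        self._path(key).write_text(json.dumps(entry), encoding="utf-8")

    def get(self, key, now=None):
        now = time.time() if now is None else now
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        return None if self._expired(entry, now) else entry["lyrics"]

    def clear(self, expired_only=False, now=None):
        """Delete cache files and return how many were removed."""
        now = time.time() if now is None else now
        removed = 0
        for path in self.directory.glob("*.json"):
            entry = json.loads(path.read_text(encoding="utf-8"))
            if expired_only and not self._expired(entry, now):
                continue
            path.unlink()
            removed += 1
        return removed


with tempfile.TemporaryDirectory() as tmp:
    cache = LyricsCache(tmp, ttl_hours=1)
    cache.set("Artist/Song", "some lyrics", now=0)
    fresh = cache.get("Artist/Song", now=10)        # within the TTL
    stale = cache.get("Artist/Song", now=2 * 3600)  # past the TTL
    removed = cache.clear(expired_only=True, now=2 * 3600)
```

The `now` parameter exists only so the expiry logic can be exercised without waiting; real callers would leave it unset.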
View Configuration
# Show current configuration and cache statistics
barscan config
Configuration Options
All settings can be configured via environment variables with the BARSCAN_ prefix:
| Variable | Description | Default |
|---|---|---|
| BARSCAN_GENIUS_ACCESS_TOKEN | Genius API access token | (required) |
| BARSCAN_CACHE_DIR | Directory for caching lyrics | ~/.cache/barscan |
| BARSCAN_CACHE_TTL_HOURS | Cache time-to-live in hours | 168 (7 days) |
| BARSCAN_DEFAULT_MAX_SONGS | Default number of songs to analyze | 10 |
| BARSCAN_DEFAULT_TOP_WORDS | Default number of top words to show | 50 |
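As a rough illustration of how these variables and their defaults fit together, the sketch below reads them with only the standard library. BarScan itself uses Pydantic Settings, so the `Settings` dataclass and `load_settings` function here are hypothetical names, not the project's API.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    genius_access_token: str
    cache_dir: Path
    cache_ttl_hours: int
    default_max_songs: int
    default_top_words: int


def load_settings(env=None):
    """Build Settings from BARSCAN_-prefixed variables, applying the documented defaults."""
    env = os.environ if env is None else env
    token = env.get("BARSCAN_GENIUS_ACCESS_TOKEN", "")
    if not token:
        # The token has no default, so a missing value is an error.
        raise ValueError("Genius API token not configured")
    return Settings(
        genius_access_token=token,
        cache_dir=Path(env.get("BARSCAN_CACHE_DIR", "~/.cache/barscan")).expanduser(),
        cache_ttl_hours=int(env.get("BARSCAN_CACHE_TTL_HOURS", "168")),
        default_max_songs=int(env.get("BARSCAN_DEFAULT_MAX_SONGS", "10")),
        default_top_words=int(env.get("BARSCAN_DEFAULT_TOP_WORDS", "50")),
    )


# A plain dict stands in for the process environment in this example.
settings = load_settings({
    "BARSCAN_GENIUS_ACCESS_TOKEN": "your_token_here",
    "BARSCAN_DEFAULT_MAX_SONGS": "20",
})
print(settings)
```

Only the overridden value changes; every other field falls back to the default listed in the table.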
Output Formats
Table Format (default)
Human-readable table with word rankings:
Artist: Kendrick Lamar
Songs analyzed: 10
Total words: 5,432
Unique words: 1,203
Word Frequencies
┌──────┬─────────┬───────┬────────────┐
│ Rank │ Word │ Count │ Percentage │
├──────┼─────────┼───────┼────────────┤
│ 1 │ love │ 87 │ 1.60% │
│ 2 │ know │ 65 │ 1.20% │
│ ... │ ... │ ... │ ... │
└──────┴─────────┴───────┴────────────┘
JSON Format
Structured JSON for programmatic use:
{
"artist": "Kendrick Lamar",
"songs_analyzed": 10,
"total_words": 5432,
"unique_words": 1203,
"frequencies": [
{"word": "love", "count": 87, "percentage": 1.60},
{"word": "know", "count": 65, "percentage": 1.20}
]
}
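A script consuming this output might look like the following sketch. It embeds the sample above as a string (in practice you would read a file saved with --output) and recomputes each percentage from `count` and `total_words`, which agrees with the sample values to two decimal places.

```python
import json

# The sample JSON output from above, embedded so the example is self-contained.
SAMPLE = """
{
  "artist": "Kendrick Lamar",
  "songs_analyzed": 10,
  "total_words": 5432,
  "unique_words": 1203,
  "frequencies": [
    {"word": "love", "count": 87, "percentage": 1.60},
    {"word": "know", "count": 65, "percentage": 1.20}
  ]
}
"""

result = json.loads(SAMPLE)

# Recompute each percentage as count / total_words * 100, rounded to 2 places.
recomputed = [
    round(100 * row["count"] / result["total_words"], 2)
    for row in result["frequencies"]
]
reported = [row["percentage"] for row in result["frequencies"]]

top_word = max(result["frequencies"], key=lambda row: row["count"])["word"]
print(top_word, recomputed, reported)
```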
CSV Format
Comma-separated values for spreadsheet import:
word,count,percentage
love,87,1.60
know,65,1.20
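Outside a spreadsheet, the CSV output can also be read with Python's `csv` module. This sketch parses the sample above from a string and converts the numeric columns, since `csv.DictReader` returns every field as text.

```python
import csv
import io

# The sample CSV output from above, embedded so the example is self-contained.
SAMPLE = """word,count,percentage
love,87,1.60
know,65,1.20
"""

rows = [
    {
        "word": record["word"],
        "count": int(record["count"]),
        "percentage": float(record["percentage"]),
    }
    for record in csv.DictReader(io.StringIO(SAMPLE))
]
total_count = sum(row["count"] for row in rows)
print(rows, total_count)
```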
WordGrain Format
WordGrain is a standardized JSON schema for vocabulary analysis data. It enables interoperability between different word frequency analysis tools. See the documentation for details.
Output example:
{
"$schema": "https://raw.githubusercontent.com/shimpeiws/word-grain/main/schema/v0.1.0/wordgrain.schema.json",
"meta": {
"source": "genius",
"artist": "Kendrick Lamar",
"generated_at": "2024-01-15T10:30:00Z",
"corpus_size": 10,
"total_words": 5432,
"generator": "barscan/0.1.0",
"language": "en"
},
"grains": [
{"word": "love", "frequency": 87, "frequency_normalized": 160.18}
]
}
Development
Setup
# Clone repository
git clone https://github.com/shimpeiws/barscan.git
cd barscan
# Install with development dependencies
pip install -e ".[dev]"
Running Tests
# Run all tests with coverage
pytest
# Run specific test file
pytest tests/test_genius/test_client.py -v
# Run specific test
pytest tests/test_genius/test_client.py::TestSearchArtist::test_search_artist_success -v
Code Quality
# Lint code
ruff check src/
# Format code
ruff format src/
# Type check
mypy src/barscan/ --ignore-missing-imports
Architecture
src/barscan/
├── cli.py # Typer CLI entry point (barscan command)
├── config.py # Pydantic Settings configuration
├── exceptions.py # Exception hierarchy (BarScanError base)
├── genius/ # Genius API integration
│ ├── models.py # Pydantic models (Artist, Song, Lyrics)
│ ├── client.py # GeniusClient with retry logic
│ └── cache.py # File-based lyrics cache with TTL
├── analyzer/ # Word frequency analysis
│ ├── models.py # Analysis result models
│ ├── processor.py # Text preprocessing with NLTK
│ ├── filters.py # Stop word and length filtering
│ └── frequency.py # Word counting and aggregation
└── output/ # Result formatting
└── wordgrain.py # WordGrain schema export
Troubleshooting
"Genius API token not configured"
Make sure you've set the BARSCAN_GENIUS_ACCESS_TOKEN environment variable or created a .env file with the token.
"Artist not found"
- Check the spelling of the artist name
- Try using the artist's name exactly as it appears on Genius
- Some artists may have limited or no presence on Genius
Rate Limiting
BarScan includes automatic retry logic with exponential backoff. If you encounter rate limiting:
- The tool will automatically retry failed requests
- Consider reducing --max-songs for large analyses
- Cached lyrics won't trigger new API calls
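For readers unfamiliar with the pattern, retrying with exponential backoff can be sketched as below. The `retry_with_backoff` helper, its parameters, and the choice of retryable exceptions are assumptions for illustration; BarScan's own retry settings are not documented here.

```python
import time


def retry_with_backoff(
    func,
    max_attempts=4,
    base_delay=1.0,
    factor=2.0,
    retry_on=(ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Call func(), waiting base_delay * factor**attempt between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * factor ** attempt)


# Demo: a fake request that fails twice, then succeeds.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "lyrics"


delays = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky_request, sleep=delays.append)
print(result, delays)
```

Each wait doubles the previous one, so a brief outage is retried quickly while a longer one is not hammered with requests. Production code often adds random jitter to the delay, which is omitted here to keep the example deterministic.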
Empty Results
If no words appear in results after filtering:
- Try --no-stop-words to include common words
- Check if the artist has lyrics available on Genius
- Some songs may be instrumental or have no lyrics
License
MIT