Bluesky Search
A Python package for retrieving, searching, and exporting posts from Bluesky (AT Protocol) social network.
Features
- Secure authentication with official AT Protocol API
- Post retrieval from multiple users
- Customizable number of posts per user
- Advanced post search with multiple criteria:
- Keyword/phrase search
- Filter by author, mentions, or language
- Date range filtering
- Domain filtering
- Automatic pagination to retrieve beyond 100-post API limit
- Multiple export formats:
- JSON (structured by user)
- CSV (flattened for analysis)
- Parquet (optimized for big data)
Requirements
- Python 3.8+
- atproto library
- polars library (for CSV/Parquet export)
Installation
Development Installation
# Clone the repository
git clone https://github.com/your-username/bluesky-search.git
cd bluesky-search
# Using uv (recommended)
uv venv
source .venv/bin/activate # Linux/macOS
# or
.venv\Scripts\activate # Windows
# Install in development mode
uv pip install -e .
# Alternative with pip
python -m venv venv
source venv/bin/activate # Linux/macOS
# or
venv\Scripts\activate # Windows
pip install -e .
Regular Installation (from PyPI)
# Using uv
uv pip install bluesky-search
# Using pip
pip install bluesky-search
Development
Building and Publishing
To build and publish this package to PyPI using uv:
# Build the package - creates distributions in the dist/ directory
uv build
# Publish to PyPI using a token (recommended)
uv publish --token "your-token-here"
For security, you can avoid putting tokens in your command history by:
- Using environment variables:
export UV_PUBLISH_TOKEN=your-token-here
uv publish
- Creating a .pypirc file in your home directory:
[pypi]
username = __token__
password = your-token-here
Usage
As a Command Line Tool
The package includes a command-line interface for easy access to all functionality:
# Run directly after installation
bluesky-search --help
# Or from the source directory
python -m src.bluesky_search.cli --help
Programmatic Usage
from src.bluesky_search import BlueskyPostsFetcher
# Initialize with authentication
fetcher = BlueskyPostsFetcher(username="your_username", password="your_password")
# Get posts from a user
posts = fetcher.get_user_posts("username.bsky.social", limit=20)
# Search for posts
search_results = fetcher.search_posts("keyword", limit=50)
# Export results
fetcher.export_to_json(posts, "output.json")
fetcher.export_to_csv(posts, "output.csv")
fetcher.export_to_parquet(posts, "output.parquet")
Command Line Parameters
- -u, --username: Username/email for authentication
- -p, --password: Password for authentication
- -f, --file: File containing user list (one per line)
- -l, --list: Space-separated list of users
- -b, --bsky-list: Bluesky list URL
- -n, --limit: Max posts per user/search (default: 20, no upper limit for searches)
- -o, --output: Output filename or path
- --output-dir: Specific directory to save output files (highest priority)
- -d, --data-dir: Save output to the project's data directory
- -e, --export: Export format (json, csv, or parquet; default: csv)
- -x, --format: Legacy alias for -e
Search Parameters
- -s, --search: Search posts (use quotes for exact phrases)
- --from: Search posts from a specific user
- --mention: Search posts mentioning a specific user
- --lang: Search posts in a specific language (e.g., es, en, fr)
- --since: Search posts from date (YYYY-MM-DD)
- --until: Search posts until date (YYYY-MM-DD)
- --domain: Search posts containing links to a specific domain
Examples
# Get posts from specific users
bluesky-search -u your_username -p your_password -a user1.bsky.social
# Get posts from users in a file
bluesky-search -u your_username -p your_password -f users.txt
# Specify post limit per user and output file
bluesky-search -u your_username -p your_password -a user1.bsky.social -n 50 -o results.json
# Using the CLI with export to CSV
bluesky-search -a user.bsky.social -e csv -o user_posts.csv
# Export to Parquet format
bluesky-search -a user.bsky.social -e parquet -o my_posts.parquet
# Save output to a specific directory
bluesky-search -a user.bsky.social --output-dir /path/to/custom/directory
# Save output to the project's data directory
bluesky-search -a user.bsky.social -d
# Run directly with uv during development
uv run -m src.bluesky_search.cli -s "keyword" -d
Complete Search Guide
The script offers multiple ways to search and retrieve Bluesky posts. Here are all available options:
1. User Post Retrieval
# Get posts from specific user
bluesky-search -a user.bsky.social
# Load users from file
bluesky-search -f users.txt
# Limit posts per user
bluesky-search -a user.bsky.social -n 50
Using the Python API:
from src.bluesky_search import BlueskyPostsFetcher
# Initialize with your credentials
fetcher = BlueskyPostsFetcher(username="your_username", password="your_password")
# Get posts from a single user
posts = fetcher.get_user_posts("user.bsky.social", limit=50)
# Process the posts
for post in posts:
print(f"Post from {post['author']['handle']}: {post['text'][:50]}...")
# Export the results
fetcher.export_to_json(posts, "user_posts.json")
2. Bluesky List Retrieval
# Get posts from all users in a Bluesky list
bluesky-search -l https://bsky.app/profile/user.bsky.social/lists/123abc
# Limit posts per list user
bluesky-search -l https://bsky.app/profile/user.bsky.social/lists/123abc -n 30
Using the Python API:
from src.bluesky_search import BlueskyPostsFetcher
# Initialize with your credentials
fetcher = BlueskyPostsFetcher(username="your_username", password="your_password")
# Get posts from a Bluesky list
list_url = "https://bsky.app/profile/user.bsky.social/lists/123abc"
list_posts = fetcher.get_list_posts(list_url, limit=30)
# Export the results
fetcher.export_to_csv(list_posts, "list_posts.csv")
3. Keyword Search
# Simple keyword/phrase search
bluesky-search -s "artificial intelligence"
# Limit search results
bluesky-search -s "artificial intelligence" -n 50
Using the Python API:
from src.bluesky_search import BlueskyPostsFetcher
# Initialize with your credentials
fetcher = BlueskyPostsFetcher(username="your_username", password="your_password")
# Search for posts with a keyword
search_results = fetcher.search_posts("artificial intelligence", limit=50)
# Print number of results
print(f"Found {len(search_results)} posts about AI")
# Export the results
fetcher.export_to_parquet(search_results, "ai_posts.parquet")
4. Filtered Search
# Filter by language
bluesky-search -s "inteligencia artificial" --language es
bluesky-search -s "artificial intelligence" --language en
# Filter by author (posts from specific user)
bluesky-search -s "economics" --from economist.bsky.social
# Filter by mentions (posts mentioning user)
bluesky-search -s "event" --mention organizer.bsky.social
# Date range filter
bluesky-search -s "news" --since 2025-01-01 --until 2025-01-31
# Domain filter
bluesky-search -s "analysis" --domain example.com
Using the Python API:
from src.bluesky_search import BlueskyPostsFetcher
# Initialize with your credentials
fetcher = BlueskyPostsFetcher(username="your_username", password="your_password")
# Advanced search with filters
results = fetcher.search_posts(
"economics",
limit=100,
from_user="economist.bsky.social",
since="2025-01-01",
until="2025-01-31",
language="en"
)
# Export the results
fetcher.export_to_csv(results, "economics_articles.csv")
5. Combined Filters
# Combine multiple filters in one search
bluesky-search -s "politics" --from journalist.bsky.social --language es --since 2025-02-01
# Advanced multi-criteria search with specific export
bluesky-search -s "elections" --language es --since 2025-01-01 --until 2025-02-28 --domain news.com -n 200 -e csv -o elections_2025.csv
Using the Python API:
from src.bluesky_search import BlueskyPostsFetcher
# Initialize with your credentials
fetcher = BlueskyPostsFetcher(username="your_username", password="your_password")
# Complex search with multiple filters
results = fetcher.search_posts(
query="elections",
limit=200,
language="es",
since="2025-01-01",
until="2025-02-28",
domain="news.com"
)
# Export directly to CSV
fetcher.export_to_csv(results, "elections_2025.csv")
6. Pagination for Large Datasets
# Get large number of posts (500+) with auto-pagination
bluesky-search -s "Granada" -n 500 -e csv -o granada_posts.csv
# Build extensive dataset on a topic
bluesky-search -s "climate" --since 2024-01-01 -n 1000 -e parquet -o climate_dataset.parquet
Using the Python API:
from src.bluesky_search import BlueskyPostsFetcher
# Initialize with your credentials
fetcher = BlueskyPostsFetcher(username="your_username", password="your_password")
# Large-scale search with automatic pagination
big_dataset = fetcher.search_posts("climate", limit=1000, since="2024-01-01")
print(f"Collected {len(big_dataset)} posts about climate")
# Export as Parquet for efficient storage and analysis
fetcher.export_to_parquet(big_dataset, "climate_dataset.parquet")
7. Export Formats
# Export to JSON (default format)
bluesky-search -s "sports" -o sports.json
# Export to CSV for spreadsheet analysis
bluesky-search -s "sports" -e csv -o sports.csv
# Export to Parquet for big data analysis
bluesky-search -s "sports" -e parquet -o sports.parquet
Using the Python API:
from src.bluesky_search import BlueskyPostsFetcher
# Initialize with your credentials
fetcher = BlueskyPostsFetcher(username="your_username", password="your_password")
# Get posts to export
sports_posts = fetcher.search_posts("sports", limit=100)
# Export in multiple formats
fetcher.export_to_json(sports_posts, "sports.json")
fetcher.export_to_csv(sports_posts, "sports.csv")
fetcher.export_to_parquet(sports_posts, "sports.parquet")
Running Manual Queries During Development
During development or for ad-hoc analysis, you can run manual queries directly from the command line without installing the package. This is useful for quick data exploration, testing new search parameters, or during the development process.
Using the uv run Command
uv is a fast Python package installer and resolver that can also run Python modules directly. This is ideal for development usage:
# Basic search query
uv run -m src.bluesky_search.cli -u your_username -p your_password -s "search term" -n 100
# Search with export to parquet
uv run -m src.bluesky_search.cli -u your_username -p your_password -s "search term" -n 350 -e parquet -o results.parquet
# Search with legacy -x parameter for export format
uv run -m src.bluesky_search.cli -u your_username -p your_password -s "search term" -n 350 -x parquet -o results.parquet
Using python -m Command
Alternatively, you can use Python's module execution capability:
# Using python -m
python -m src.bluesky_search.cli -u your_username -p your_password -s "search term" -n 100 -e json -o results.json
Tips for Manual Queries
- Add the -o parameter to specify the output file name; otherwise a timestamped file name is generated automatically
- Include the -n parameter to control the number of results (especially useful for large searches)
- Use quotes around search terms containing spaces or special characters
- For regular usage of the tool, consider installing it in development mode with uv pip install -e . or pip install -e .
- When searching for a large number of posts, use the progress indicators in the terminal output to monitor the collection process
Package Structure
The package is organized into logical modules:
bluesky_search/
├── src/
│ └── bluesky_search/
│ ├── __init__.py # Package exports
│ ├── client.py # Base client functionality
│ ├── fetcher.py # Post fetching functionality
│ ├── search.py # Search functionality
│ ├── list.py # List handling functionality
│ ├── cli.py # Command-line interface
│ ├── export/ # Export utilities
│ │ ├── __init__.py
│ │ ├── json.py # JSON export
│ │ ├── csv.py # CSV export
│ │ └── parquet.py # Parquet export
│ └── utils/ # Utility functions
│ ├── __init__.py
│ ├── url.py # URL handling
│ └── text.py # Text processing
├── test/ # Test suite
└── pyproject.toml # Package configuration
Advanced Features
Automatic Pagination
The package supports retrieving more than 100 posts per search (Bluesky API limit) through automatic pagination:
- Makes multiple API calls automatically
- Shows progress for each call and total collected posts
- Combines results into single dataset
- Includes brief pauses between calls to avoid API overload
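The loop behind this can be sketched as a generic cursor-pagination routine. Note this is an illustration of the technique, not the package's internal code: `fetch_page` is a hypothetical stand-in for the actual API call, which on the AT Protocol returns each page of results together with an opaque cursor.

```python
import time

def paginate(fetch_page, total_limit, page_size=100):
    """Collect up to total_limit items by following pagination cursors.

    fetch_page(cursor, limit) must return (items, next_cursor); next_cursor
    is None when there are no more results.
    """
    collected, cursor = [], None
    while len(collected) < total_limit:
        limit = min(page_size, total_limit - len(collected))
        items, cursor = fetch_page(cursor, limit)
        collected.extend(items)
        print(f"Fetched {len(items)} posts ({len(collected)}/{total_limit} total)")
        if cursor is None or not items:
            break  # no more pages available
        time.sleep(0.5)  # brief pause between calls to avoid API overload
    return collected
```

The cursor stays opaque to the caller: the loop only checks it for None, which is what makes the same routine work for any paginated endpoint.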
Web URLs for Posts
All retrieved posts include web URLs for direct browser access:
- Format: https://bsky.app/profile/user.bsky.social/post/identifier
- Included in all export formats (JSON, CSV, Parquet)
- Enables direct verification and access to original posts
Example web_url in exported data:
https://bsky.app/profile/user.bsky.social/post/3abc123xyz
This allows easy verification of any post from exported data.
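The web URL is derived from a post's AT URI and its author's handle; the final segment of the URI (the record key) becomes the post identifier in the web address. A minimal sketch (the helper name `at_uri_to_web_url` is illustrative, not part of the package's API):

```python
def at_uri_to_web_url(at_uri, handle):
    """Build a bsky.app web URL from an AT Protocol post URI.

    An AT URI looks like at://<did>/app.bsky.feed.post/<rkey>; the final
    path segment (the record key) identifies the post on the web interface.
    """
    if not at_uri.startswith("at://"):
        raise ValueError(f"not an AT URI: {at_uri}")
    rkey = at_uri.rstrip("/").split("/")[-1]
    return f"https://bsky.app/profile/{handle}/post/{rkey}"

print(at_uri_to_web_url(
    "at://did:plc:abc123/app.bsky.feed.post/3abc123xyz",
    "user.bsky.social",
))
# https://bsky.app/profile/user.bsky.social/post/3abc123xyz
```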
User File Format
Create a text file with one username per line:
user1
user2
user3
Console Input
If no parameters are provided, the script will prompt for:
- Comma-separated list of users
- Bluesky list URL
- Search query with search: prefix (e.g., search:artificial intelligence)
Output Formats
JSON
Structured output format:
{
  "user1": [
    {
      "uri": "at://...",
      "cid": "...",
      "web_url": "https://bsky.app/profile/user.bsky.social/post/abc123",
      "author": {
        "did": "did:plc:...",
        "handle": "user1",
        "display_name": "Display Name"
      },
      "text": "Post content",
      "created_at": "2025-...",
      "likes": 5,
      "reposts": 2,
      "replies": 3
    },
    ...
  ]
}
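The grouping of flat post records under each author's handle can be sketched in a few lines; this is an illustration of the layout, not the package's internal code:

```python
import json

def group_by_handle(posts):
    """Group flat post dicts into the {handle: [post, ...]} JSON layout."""
    grouped = {}
    for post in posts:
        grouped.setdefault(post["author"]["handle"], []).append(post)
    return grouped

posts = [
    {"author": {"handle": "user1"}, "text": "first"},
    {"author": {"handle": "user2"}, "text": "second"},
    {"author": {"handle": "user1"}, "text": "third"},
]
print(json.dumps(group_by_handle(posts), indent=2))
```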
Export Formats Specification
All export formats (CSV, JSON, and Parquet) maintain the exact same column order and structure for consistency across different output formats. This allows for easy switching between formats based on your specific needs.
Exact Column Order
All exports follow this precise column order:
- user_handle: The handle under which the post was found (useful for search results)
- author_handle: The post author's handle
- author_display_name: The post author's display name
- created_at: Timestamp when the post was created
- post_type: Type of post (original, reply, repost, etc.)
- text: The main text content of the post
- web_url: URL to view the post on Bluesky's web interface
- likes: Number of likes the post has received
- reposts: Number of reposts the post has received
- replies: Number of replies the post has received
- urls: Array of URLs mentioned in the post
- images: Array of image URLs in the post
- mentions: Array of user mentions in the post
- lang: Language of the post (e.g., 'en' for English)
- replied_to_handle: Handle of the user being replied to (only for reply posts)
- replied_to_id: Decentralized identifier (DID) of the user being replied to (only for reply posts)
- cid: Content identifier for the post
- author_did: Decentralized identifier for the post author
- uri: AT Protocol URI for the post
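If you build rows yourself, keeping one canonical column list makes the order easy to enforce; a sketch (the list mirrors the order above, while the `order_row` helper is hypothetical, not part of the package):

```python
COLUMN_ORDER = [
    "user_handle", "author_handle", "author_display_name", "created_at",
    "post_type", "text", "web_url", "likes", "reposts", "replies",
    "urls", "images", "mentions", "lang", "replied_to_handle",
    "replied_to_id", "cid", "author_did", "uri",
]

def order_row(record):
    """Return a dict with keys in the canonical export order.

    Missing fields default to None so every row has the same shape.
    """
    return {col: record.get(col) for col in COLUMN_ORDER}

row = order_row({"text": "hello", "likes": 5, "author_handle": "user1"})
print(list(row)[:3])  # first three columns in canonical order
```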
Format-Specific Details
CSV Export
- Array fields (urls, images, mentions) are preserved as JSON-formatted arrays in string form
- Example: ["https://example.com", "https://another-example.com"]
- Requires the polars package
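This JSON-in-string encoding of array fields can be reproduced with the standard library; a minimal sketch assuming the fields are plain Python lists (the helper name is illustrative):

```python
import json

def encode_array_fields(row, fields=("urls", "images", "mentions")):
    """Serialize list-valued fields to JSON strings for flat CSV cells."""
    out = dict(row)
    for field in fields:
        if isinstance(out.get(field), list):
            out[field] = json.dumps(out[field])
    return out

row = encode_array_fields({
    "text": "see these",
    "urls": ["https://example.com", "https://another-example.com"],
    "mentions": [],
})
print(row["urls"])  # ["https://example.com", "https://another-example.com"]
```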
JSON Export
- Maintains native array formats for urls, images, and mentions
- Preserves the nested dictionary structure where each key is an author handle
- Empty arrays are preserved as [] rather than being omitted
Parquet Export
- Array fields (urls, images, mentions) are stored as JSON-formatted array strings
- Example: ["https://example.com", "https://another-example.com"]
- Consistent format across CSV and Parquet exports for easier data integration
- Most efficient for analytical workloads and data science pipelines
- Requires the polars package
Download files
File details
Details for the file bluesky_search-0.1.6.tar.gz.
File metadata
- Download URL: bluesky_search-0.1.6.tar.gz
- Upload date:
- Size: 40.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f7a53abc3590a050bec2358e1baad6854b9ea9b3b5fa3fac9d56c94f9e95ff81 |
| MD5 | aa65a1cc246645dbc31ed5a5db05f495 |
| BLAKE2b-256 | 09992dc9bfc0dc8313d5852f4e484b1e7e76c46484893bc94d514a389ad8b0aa |
File details
Details for the file bluesky_search-0.1.6-py3-none-any.whl.
File metadata
- Download URL: bluesky_search-0.1.6-py3-none-any.whl
- Upload date:
- Size: 35.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d727215a46bd30d0f33d12ca3c20e292f55e7f6fcad0a8b6f8a55dbbbd59cb7d |
| MD5 | 425bb0fa3f057f6d1180448383035f03 |
| BLAKE2b-256 | 718030449e3898d186601275b214f51e4e4594853f15ecbba8e88f7effbcc4fa |