Information Composer
A comprehensive toolkit for collecting, composing, and filtering information from various web resources with AI-powered markdown processing.
Features
Core Modules
- PDF Validation: Validate PDF file formats and integrity
- Markdown Processing: Advanced markdown processing with LLM filtering
- DOI Management: Download and manage academic papers by DOI
- PubMed Integration: Query and process PubMed data with CLI tool
AI-Powered Features
- LLM Filtering: Support for DashScope, Ollama, and OpenAI
- PubMed Analyzer: AI-powered literature analysis
- Markdown Filter: Intelligent content extraction and filtering
Web Scraping & Data Collection
- Crossref Integration: Query Crossref API for bibliographic data
- Google Scholar Integration: Crawl and process Google Scholar citations
- RSS Feed Processing: Parse and manage scientific RSS feeds
- RiceDataCN Parser: Extract gene data from RiceDataCN database
Developer Tools
- Code Quality: Ruff linter and formatter (primary tool)
- Testing: Pytest with 51%+ coverage (570 passing tests)
- Multi-format Support: PDF, Markdown, JSON, XML, TXT
Installation
Prerequisites
- Python 3.12 or 3.13 (3.12 is the minimum supported version)
- Virtual environment (recommended)
Setup
- Clone the repository:
git clone https://github.com/yourusername/information-composer.git
cd information-composer
- Create and activate virtual environment:
# Linux/macOS
python -m venv .venv
source .venv/bin/activate
# Windows
python -m venv .venv
.venv\Scripts\activate
- Install dependencies:
pip install -e .
Quick Start
Activate Environment
# Linux/macOS
source activate.sh
# Windows
activate.bat
Available CLI Commands
| Command | Description |
|---|---|
| pdf-validator | Validate PDF files |
| md-llm-filter | Filter markdown with LLM |
| pubmed-cli | Search and fetch PubMed data |
| google-scholar-crawler | Crawl Google Scholar citations |
| rss-fetcher | Fetch and process RSS feeds |
| crossref-cli | Query Crossref API |
Examples
# Validate PDF files
pdf-validator document.pdf
# Validate directory of PDFs
pdf-validator -d /path/to/directory -r
# Filter markdown with LLM
md-llm-filter -i input.md -o output.md
# Search PubMed
pubmed-cli search "cancer research" -e user@example.com
# Get details for specific PMIDs
pubmed-cli details 12345678 23456789 -e user@example.com
# Crawl Google Scholar
google-scholar-crawler -q "machine learning" -n 20
# Fetch RSS feeds
rss-fetcher -u "https://example.com/feed.xml" -o output.json
# Query Crossref
crossref-cli query --doi "10.1038/nature12373"
Python API Usage
PubMed Integration
from information_composer.pubmed import query_pmid, fetch_pubmed_details_batch_sync
# Search for articles
pmids = query_pmid("cancer immunotherapy", "your-email@example.com", 50)
# Fetch detailed information
details = fetch_pubmed_details_batch_sync(pmids, "your-email@example.com")
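NCBI's E-utilities limit how many IDs can be fetched per request, so batch fetching is typically done in fixed-size chunks. A minimal stdlib sketch of that chunking idea (the `chunked` helper is illustrative, not part of this package's API):

```python
def chunked(items, size):
    """Yield successive fixed-size batches from a list of PMIDs."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# e.g. split three PMIDs into batches of two
pmid_batches = list(chunked(["12345678", "23456789", "34567890"], 2))
```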
Crossref Integration
from information_composer import CrossrefClient, query_crossref
# Query Crossref API
client = CrossrefClient()
results = client.query_works(query="machine learning", limit=10)
# Or use the convenience function
works = query_crossref("machine learning")
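Under the hood, a works query resolves to a request against the public Crossref REST API (`api.crossref.org/works`, with its standard `query` and `rows` parameters). A self-contained sketch of building such a URL with the stdlib (the `build_crossref_url` helper is illustrative, not this package's API):

```python
from urllib.parse import urlencode

def build_crossref_url(query: str, rows: int = 10) -> str:
    """Build a Crossref /works query URL using the REST API's
    'query' (search terms) and 'rows' (result count) parameters."""
    return "https://api.crossref.org/works?" + urlencode({"query": query, "rows": rows})

url = build_crossref_url("machine learning", rows=10)
```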
DOI Downloader
from information_composer import DOIDownloader
# Download paper by DOI
downloader = DOIDownloader()
result = downloader.download_doi("10.1038/nature12373")
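Every DOI has the same syntactic shape: a `10.`-prefixed registrant code, a slash, and a suffix. A quick stdlib sanity check along those lines (a sketch only; the package's own validation may be stricter):

```python
import re

# DOI structure: "10." + 4-9 digit registrant code + "/" + suffix
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(doi: str) -> bool:
    """Cheap syntactic check before attempting a download."""
    return bool(DOI_PATTERN.match(doi))
```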
Markdown Processing
from information_composer import jsonify, markdownify
# Convert markdown to JSON
json_data = jsonify(markdown_content)
# Convert JSON to markdown
markdown_content = markdownify(json_data)
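The actual JSON schema produced by `jsonify` is defined by the package; to illustrate the general idea of mapping markdown structure to a dict, here is a hypothetical stdlib-only sketch that groups lines under their nearest `# ` heading:

```python
def headings_to_dict(markdown: str) -> dict:
    """Group body lines under their most recent top-level heading
    (illustrative only; not the package's actual schema)."""
    sections, current = {}, None
    for line in markdown.splitlines():
        if line.startswith("# "):
            current = line[2:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections
```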
PDF Validation
from information_composer import PDFValidator
# Validate PDF
validator = PDFValidator(verbose=True)
is_valid, error = validator.validate_single_pdf("document.pdf")
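One part of format validation is cheap: a well-formed PDF begins with the `%PDF-` magic bytes. A minimal stdlib sketch of that check (full integrity validation, as `PDFValidator` performs, involves much more than the header):

```python
def has_pdf_header(path: str) -> bool:
    """Return True if the file starts with the '%PDF-' magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(5) == b"%PDF-"
```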
Google Scholar Crawling
from information_composer.sites.google_scholar import SearchConfig, google_scholar_search
# Search Google Scholar
config = SearchConfig(query="deep learning", num_results=20)
results = google_scholar_search(config)
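The crawler ultimately issues requests against Google Scholar's search endpoint. A stdlib sketch of building such a URL (`q` is Scholar's query parameter; using `num` for results per page is an assumption here and may be capped or ignored by Scholar):

```python
from urllib.parse import urlencode

def scholar_search_url(query: str, num_results: int = 20) -> str:
    """Build a Google Scholar search URL (illustrative; 'num' is assumed)."""
    return "https://scholar.google.com/scholar?" + urlencode(
        {"q": query, "num": num_results}
    )
```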
Development
Code Quality
This project uses Ruff as the primary code quality tool:
# Run all checks
python scripts/check_code.py
# Auto-fix issues
python scripts/check_code.py --fix
# With tests
python scripts/check_code.py --with-tests
# Verbose output
python scripts/check_code.py --verbose
Testing
# Run tests
pytest tests/ -v
# Run tests with coverage
python scripts/check_code.py --with-tests
Project Structure
information-composer/
├── src/information_composer/
│ ├── core/ # Core functionality (DOI downloader)
│ ├── crossref/ # Crossref API integration
│ ├── llm_filter/ # LLM-based markdown filtering
│ ├── markdown/ # Markdown processing utilities
│ ├── pdf/ # PDF validation
│ ├── pubmed/ # PubMed integration
│ ├── rss/ # RSS feed processing
│ └── sites/ # Web scraping (Google Scholar, RiceDataCN)
├── examples/ # Usage examples
├── scripts/ # Utility scripts
├── docs/ # Documentation
└── tests/ # Test files
Documentation
- 📚 Complete Documentation - Full project documentation
- 🚀 Quick Start - Get started in 5 minutes
- ⚙️ Configuration - Configuration options
- 📖 Feature Guides - Detailed feature documentation
- 🔧 Development - Development and contributing guide
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Run code quality checks: python scripts/check_code.py --fix
- Run tests: python scripts/check_code.py --with-tests
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
For questions and support, please open an issue on GitHub.
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file information_composer-0.3.0.tar.gz.
File metadata
- Download URL: information_composer-0.3.0.tar.gz
- Upload date:
- Size: 341.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2480777b6c3c1c6247e94782c3f66bd3808ac4bb87995afb3d8e959cfd309865 |
| MD5 | 8544dbbce84bf4b55c496e826706bb26 |
| BLAKE2b-256 | 0fce3104241c2e2ef4f83b6d5ce9446c147f7b95416d7c09a4134a34907c90f4 |
File details
Details for the file information_composer-0.3.0-py3-none-any.whl.
File metadata
- Download URL: information_composer-0.3.0-py3-none-any.whl
- Upload date:
- Size: 211.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0784d8d490198b46b3cad624187172fca7f17a19eee7fb2498e7d520f685b31f |
| MD5 | 8f271be0874c510692ff8f4b94aa50cb |
| BLAKE2b-256 | 39b5c911a643099c6cc43760328402b635e5b76d2d9fdd6bdbadebea83ba7404 |