
Information Composer


A comprehensive toolkit for collecting, composing, and filtering information from various web resources with AI-powered markdown processing.

Features

Core Modules

  • PDF Validation: Validate PDF file formats and integrity
  • Markdown Processing: Advanced markdown processing with LLM filtering
  • DOI Management: Download and manage academic papers by DOI
  • PubMed Integration: Query and process PubMed data with CLI tool

AI-Powered Features

  • LLM Filtering: Support for DashScope, Ollama, and OpenAI
  • PubMed Analyzer: AI-powered literature analysis
  • Markdown Filter: Intelligent content extraction and filtering
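As a conceptual illustration of how prompt-based markdown filtering works, the sketch below splits a document into sections and asks a model whether to keep each one. This is not the package's API; the `ask_llm` callable stands in for any backend (DashScope, Ollama, OpenAI), and the prompt and "keep"/"drop" protocol are placeholders.

```python
# Conceptual sketch of LLM-based markdown filtering (not the package's API).
# `ask_llm` stands in for any model backend.
def filter_markdown(markdown: str, instruction: str, ask_llm) -> str:
    """Send each '## ' section to a model with a filtering instruction."""
    sections = markdown.split("\n## ")
    kept = []
    for i, section in enumerate(sections):
        text = section if i == 0 else "## " + section
        prompt = f"{instruction}\n\n---\n{text}"
        if ask_llm(prompt).strip().lower() == "keep":
            kept.append(text)
    return "\n".join(kept)

# A trivial stand-in "model" that keeps sections mentioning "results"
fake_llm = lambda prompt: "keep" if "results" in prompt.lower() else "drop"
doc = "# Title\n## Methods\ntext\n## Results\nnumbers"
filtered = filter_markdown(doc, "Reply 'keep' or 'drop'.", fake_llm)
```

Swapping `fake_llm` for a real API call gives the same control flow with an actual model deciding which sections survive.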

Web Scraping & Data Collection

  • Crossref Integration: Query Crossref API for bibliographic data
  • Google Scholar Integration: Crawl and process Google Scholar citations
  • RSS Feed Processing: Parse and manage scientific RSS feeds
  • RiceDataCN Parser: Extract gene data from RiceDataCN database
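RSS processing is, at its core, XML traversal. Purely as an illustration of what a feed fetcher extracts per item, here is a minimal standard-library sketch (the feed content is made up, and the package's rss module is more featureful):

```python
import xml.etree.ElementTree as ET

# Minimal RSS 2.0 item extraction using only the standard library
# (illustrative sketch, not the package's implementation).
FEED = """<rss version="2.0"><channel>
  <title>Example Journal</title>
  <item><title>Paper A</title><link>https://example.com/a</link></item>
  <item><title>Paper B</title><link>https://example.com/b</link></item>
</channel></rss>"""

def parse_items(feed_xml: str) -> list[dict]:
    root = ET.fromstring(feed_xml)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]

items = parse_items(FEED)
```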

Developer Tools

  • Code Quality: Ruff linter and formatter (primary tool)
  • Testing: Pytest with 51%+ coverage (570 passing tests)
  • Multi-format Support: PDF, Markdown, JSON, XML, TXT

Installation

Prerequisites

  • Python 3.12 or 3.13 (minimum 3.12)
  • Virtual environment (recommended)

Setup

  1. Clone the repository:
git clone https://github.com/yourusername/information-composer.git
cd information-composer
  2. Create and activate a virtual environment:
# Linux/macOS
python -m venv .venv
source .venv/bin/activate

# Windows
python -m venv .venv
.venv\Scripts\activate
  3. Install dependencies:
pip install -e .

Quick Start

Activate Environment

# Linux/macOS
source activate.sh

# Windows
activate.bat

Available CLI Commands

Command                  Description
pdf-validator            Validate PDF files
md-llm-filter            Filter markdown with an LLM
pubmed-cli               Search and fetch PubMed data
google-scholar-crawler   Crawl Google Scholar citations
rss-fetcher              Fetch and process RSS feeds
crossref-cli             Query the Crossref API

Examples

# Validate PDF files
pdf-validator document.pdf

# Validate directory of PDFs
pdf-validator -d /path/to/directory -r

# Filter markdown with LLM
md-llm-filter -i input.md -o output.md

# Search PubMed
pubmed-cli search "cancer research" -e user@example.com

# Get details for specific PMIDs
pubmed-cli details 12345678 23456789 -e user@example.com

# Crawl Google Scholar
google-scholar-crawler -q "machine learning" -n 20

# Fetch RSS feeds
rss-fetcher -u "https://example.com/feed.xml" -o output.json

# Query Crossref
crossref-cli query --doi "10.1038/nature12373"

Python API Usage

PubMed Integration

from information_composer.pubmed import query_pmid, fetch_pubmed_details_batch_sync

# Search for articles
pmids = query_pmid("cancer immunotherapy", "your-email@example.com", 50)

# Fetch detailed information
details = fetch_pubmed_details_batch_sync(pmids, "your-email@example.com")
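Under the hood, PubMed searches go through NCBI's E-utilities. As a reference point, a sketch of the esearch request URL that such helpers ultimately rely on (URL construction only, no network call; the parameter set shown is standard E-utilities fields, not this package's internals):

```python
from urllib.parse import urlencode

# Sketch of an NCBI E-utilities esearch URL. `db`, `term`, `retmax`,
# and `email` are standard E-utilities query parameters.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term: str, email: str, retmax: int = 50) -> str:
    params = {"db": "pubmed", "term": term, "retmax": retmax, "email": email}
    return f"{BASE}?{urlencode(params)}"

url = esearch_url("cancer immunotherapy", "you@example.com")
```

NCBI asks for an email address on E-utilities requests so they can contact you about usage, which is why the package's functions require one.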

Crossref Integration

from information_composer import CrossrefClient, query_crossref

# Query Crossref API
client = CrossrefClient()
results = client.query_works(query="machine learning", limit=10)

# Or use the convenience function
works = query_crossref("machine learning")
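The client wraps the public Crossref REST API. For reference, the equivalent raw request URL can be sketched like this (URL construction only; `rows` is Crossref's standard result-count parameter, and a single work is served at https://api.crossref.org/works/&lt;DOI&gt;):

```python
from urllib.parse import urlencode

# Sketch of the Crossref REST API /works query URL (no network call).
def works_query_url(query: str, rows: int = 10) -> str:
    return f"https://api.crossref.org/works?{urlencode({'query': query, 'rows': rows})}"

url = works_query_url("machine learning")
```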

DOI Downloader

from information_composer import DOIDownloader

# Download paper by DOI
downloader = DOIDownloader()
result = downloader.download_doi("10.1038/nature12373")
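Before attempting a download, it can help to sanity-check the DOI's syntax. A sketch using Crossref's recommended pattern for modern DOIs (the `looks_like_doi` helper is hypothetical, not part of the package):

```python
import re

# Crossref's recommended regex for modern DOIs: the "10." prefix,
# 4-9 registrant digits, then a suffix. Hypothetical helper, not
# the package's API.
DOI_RE = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Za-z0-9]+$")

def looks_like_doi(s: str) -> bool:
    return bool(DOI_RE.match(s))

ok = looks_like_doi("10.1038/nature12373")
```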

Markdown Processing

from information_composer import jsonify, markdownify

# Convert markdown to JSON
json_data = jsonify(markdown_content)

# Convert JSON to markdown
markdown_content = markdownify(json_data)
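The JSON schema used by `jsonify` is defined by the package. Purely as an illustration of the idea, a minimal heading-to-dict conversion might look like this (not the package's implementation or schema):

```python
# Illustrative sketch of markdown -> JSON-able structure: map each
# "## " heading to its body text. Not the package's actual schema.
def sections_to_dict(markdown: str) -> dict[str, str]:
    result, current = {}, None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            result[current] = ""
        elif current is not None:
            result[current] += line + "\n"
    return {k: v.strip() for k, v in result.items()}

data = sections_to_dict("## Intro\nHello\n## Methods\nPCR")
```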

PDF Validation

from information_composer import PDFValidator

# Validate PDF
validator = PDFValidator(verbose=True)
is_valid, error = validator.validate_single_pdf("document.pdf")
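PDF validity starts with the file's magic bytes: a well-formed PDF begins with `%PDF-` (e.g. `%PDF-1.7`). The standalone sketch below checks only the header; real validation, as a full validator performs, also involves the cross-reference table and trailer:

```python
import os
import tempfile

# Minimal sketch: check a file's PDF magic bytes. This is only the
# first step of validation, not a full integrity check.
def has_pdf_header(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

# Demo with a throwaway file
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.7\n% fake minimal content")
path = tmp.name
ok = has_pdf_header(path)
os.unlink(path)
```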

Google Scholar Crawling

from information_composer.sites.google_scholar import SearchConfig, google_scholar_search

# Search Google Scholar
config = SearchConfig(query="deep learning", num_results=20)
results = google_scholar_search(config)

Development

Code Quality

This project uses Ruff as the primary code quality tool:

# Run all checks
python scripts/check_code.py

# Auto-fix issues
python scripts/check_code.py --fix

# With tests
python scripts/check_code.py --with-tests

# Verbose output
python scripts/check_code.py --verbose

Testing

# Run tests
pytest tests/ -v

# Run tests with coverage
python scripts/check_code.py --with-tests

Project Structure

information-composer/
├── src/information_composer/
│   ├── core/              # Core functionality (DOI downloader)
│   ├── crossref/          # Crossref API integration
│   ├── llm_filter/        # LLM-based markdown filtering
│   ├── markdown/          # Markdown processing utilities
│   ├── pdf/               # PDF validation
│   ├── pubmed/            # PubMed integration
│   ├── rss/               # RSS feed processing
│   └── sites/             # Web scraping (Google Scholar, RiceDataCN)
├── examples/              # Usage examples
├── scripts/               # Utility scripts
├── docs/                  # Documentation
└── tests/                 # Test files


Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run code quality checks: python scripts/check_code.py --fix
  5. Run tests: python scripts/check_code.py --with-tests
  6. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For questions and support, please open an issue on GitHub.
