Information Composer

A comprehensive toolkit for collecting, composing, and filtering information from various web resources with AI-powered markdown processing.

Features

Core Modules

  • PDF Validation: Validate PDF file formats and integrity
  • Markdown Processing: Advanced markdown processing with LLM filtering
  • DOI Management: Download and manage academic papers by DOI
  • PubMed Integration: Query and process PubMed data with CLI tool

AI-Powered Features

  • LLM Filtering: Support for DashScope, Ollama, and OpenAI
  • PubMed Analyzer: AI-powered literature analysis using Pydantic AI
  • Markdown Filter: Intelligent content extraction and filtering

Migration Note: This project uses Pydantic AI for type-safe LLM integration. See Migration Guide for details.

Web Scraping & Data Collection

  • Crossref Integration: Query Crossref API for bibliographic data
  • Google Scholar Integration: Crawl and process Google Scholar citations
  • RSS Feed Processing: Parse and manage scientific RSS feeds
  • RiceDataCN Parser: Extract gene data from RiceDataCN database

Developer Tools

  • Code Quality: Ruff linter and formatter (primary tool)
  • Testing: Pytest with 51%+ coverage (570 tests passed)
  • Multi-format Support: PDF, Markdown, JSON, XML, TXT

Installation

Prerequisites

  • Python 3.12 or 3.13 (Python 3.12 is the minimum required version)
  • Virtual environment (recommended)
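The version requirement above can be checked before installing. The snippet below is a minimal sketch using only the standard library; the helper name `meets_minimum` is hypothetical and not part of this package.

```python
import sys

# Minimum version stated in the prerequisites above.
MIN_VERSION = (3, 12)


def meets_minimum(version_info=None, minimum=MIN_VERSION) -> bool:
    """Return True if the (major, minor) pair is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only major and minor, ignoring patch level.
    return tuple(version_info[:2]) >= minimum


print("Python version OK" if meets_minimum() else "Python 3.12+ required")
```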

Setup

  1. Clone the repository:
git clone https://github.com/yourusername/information-composer.git
cd information-composer
  2. Create and activate virtual environment:
# Linux/macOS
python -m venv .venv
source .venv/bin/activate

# Windows
python -m venv .venv
.venv\Scripts\activate
  3. Install dependencies:
pip install -e .

Quick Start

Activate Environment

# Linux/macOS
source activate.sh

# Windows
activate.bat

Available CLI Commands

Command                 Description
pdf-validator           Validate PDF files
md-llm-filter           Filter markdown with LLM
pubmed-cli              Search and fetch PubMed data
google-scholar-crawler  Crawl Google Scholar citations
rss-fetcher             Fetch and process RSS feeds
crossref-cli            Query Crossref API

Examples

# Validate PDF files
pdf-validator document.pdf

# Validate directory of PDFs
pdf-validator -d /path/to/directory -r

# Filter markdown with LLM
md-llm-filter -i input.md -o output.md

# Search PubMed
pubmed-cli search "cancer research" -e user@example.com

# Get details for specific PMIDs
pubmed-cli details 12345678 23456789 -e user@example.com

# Crawl Google Scholar
google-scholar-crawler -q "machine learning" -n 20

# Fetch RSS feeds
rss-fetcher -u "https://example.com/feed.xml" -o output.json

# Query Crossref
crossref-cli query --doi "10.1038/nature12373"
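For readers curious what turning an RSS feed into JSON involves, as in the rss-fetcher example above, here is a rough standard-library sketch. It is not the tool's implementation: the function name `rss_items` and the output keys (`title`, `link`, `published`) are assumptions made for illustration, and only plain RSS 2.0 `<item>` elements are handled (no Atom, no namespaces).

```python
import json
import xml.etree.ElementTree as ET


def rss_items(xml_text: str) -> list[dict]:
    """Extract title, link, and pubDate from each RSS 2.0 <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append(
            {
                # findtext returns the default when the child is missing.
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                "published": item.findtext("pubDate", default=""),
            }
        )
    return items


SAMPLE_FEED = """<rss version="2.0"><channel><title>Demo</title>
<item><title>First post</title><link>https://example.com/1</link></item>
</channel></rss>"""

# Serialize the parsed items the way a JSON output file might hold them.
print(json.dumps(rss_items(SAMPLE_FEED), indent=2))
```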

Python API Usage

PubMed Integration

from information_composer.pubmed import query_pmid, fetch_pubmed_details_batch_sync

# Search for articles
pmids = query_pmid("cancer immunotherapy", "your-email@example.com", 50)

# Fetch detailed information
details = fetch_pubmed_details_batch_sync(pmids, "your-email@example.com")
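When a search returns many PMIDs, it is common to process them in fixed-size groups. How `fetch_pubmed_details_batch_sync` batches internally is not documented here, so the following is only a sketch of the general idea; `chunk_pmids` and the default batch size of 200 are hypothetical.

```python
from collections.abc import Iterator, Sequence


def chunk_pmids(pmids: Sequence[str], batch_size: int = 200) -> Iterator[list[str]]:
    """Yield consecutive slices of `pmids`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pmids), batch_size):
        yield list(pmids[start:start + batch_size])


# Each batch could then be passed to a fetch call in turn.
for batch in chunk_pmids(["12345678", "23456789", "34567890"], batch_size=2):
    print(len(batch), batch)
```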

Crossref Integration

from information_composer import CrossrefClient, query_crossref

# Query Crossref API
client = CrossrefClient()
results = client.query_works(query="machine learning", limit=10)

# Or use the convenience function
works = query_crossref("machine learning")

DOI Downloader

from information_composer import DOIDownloader

# Download paper by DOI
downloader = DOIDownloader()
result = downloader.download_doi("10.1038/nature12373")
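It can help to check that a string looks like a DOI before attempting a download. The sketch below uses the pattern Crossref recommends for matching modern DOIs; it is a syntax check only and not part of `DOIDownloader`. The helper names `normalize_doi` and `looks_like_doi` are hypothetical.

```python
import re

# Pattern recommended by Crossref for modern DOIs (case-insensitive).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)


def normalize_doi(raw: str) -> str:
    """Strip whitespace and a leading resolver URL or 'doi:' prefix."""
    doi = raw.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def looks_like_doi(raw: str) -> bool:
    """Return True if the normalized string matches the DOI pattern."""
    return DOI_PATTERN.match(normalize_doi(raw)) is not None


print(looks_like_doi("10.1038/nature12373"))
```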

Markdown Processing

from information_composer import jsonify, markdownify

# Convert markdown to JSON
markdown_content = "# Title\n\nSome text."
json_data = jsonify(markdown_content)

# Convert JSON to markdown
markdown_content = markdownify(json_data)
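The JSON structure produced by `jsonify` is not described in this README. To illustrate the general idea of a markdown/JSON round trip, here is a simplified sketch that splits a document into sections at ATX headings. The function names and the `level`/`title`/`body` schema are assumptions, blank lines are dropped, and `#` lines inside code blocks would be misread as headings.

```python
import json
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def markdown_to_sections(text: str) -> list[dict]:
    """Group non-blank lines under the most recent heading."""
    sections = []
    current = {"level": 0, "title": "", "body": []}
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section if it holds anything.
            if current["title"] or current["body"]:
                sections.append(current)
            current = {
                "level": len(match.group(1)),
                "title": match.group(2).strip(),
                "body": [],
            }
        elif line.strip():
            current["body"].append(line)
    if current["title"] or current["body"]:
        sections.append(current)
    return sections


def sections_to_markdown(sections: list[dict]) -> str:
    """Rebuild markdown text from the section list."""
    lines = []
    for section in sections:
        if section["title"]:
            lines.append("#" * section["level"] + " " + section["title"])
        lines.extend(section["body"])
    return "\n".join(lines)


sample = "# Title\nIntro\n## Sub\nMore"
print(json.dumps(markdown_to_sections(sample), indent=2))
```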

PDF Validation

from information_composer import PDFValidator

# Validate PDF
validator = PDFValidator(verbose=True)
is_valid, error = validator.validate_single_pdf("document.pdf")

Google Scholar Crawling

from information_composer.sites.google_scholar import SearchConfig, google_scholar_search

# Search Google Scholar
config = SearchConfig(query="deep learning", num_results=20)
results = google_scholar_search(config)

Development

Code Quality

This project uses Ruff as the primary code quality tool:

# Run all checks
python scripts/check_code.py

# Auto-fix issues
python scripts/check_code.py --fix

# With tests
python scripts/check_code.py --with-tests

# Verbose output
python scripts/check_code.py --verbose

Testing

# Run tests
pytest tests/ -v

# Run tests with coverage
python scripts/check_code.py --with-tests

Project Structure

information-composer/
├── src/information_composer/
│   ├── core/              # Core functionality (DOI downloader)
│   ├── crossref/          # Crossref API integration
│   ├── llm_filter/        # LLM-based markdown filtering
│   ├── markdown/          # Markdown processing utilities
│   ├── pdf/               # PDF validation
│   ├── pubmed/            # PubMed integration
│   ├── rss/               # RSS feed processing
│   └── sites/             # Web scraping (Google Scholar, RiceDataCN)
├── examples/              # Usage examples
├── scripts/               # Utility scripts
├── docs/                  # Documentation
└── tests/                 # Test files

Documentation

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run code quality checks: python scripts/check_code.py --fix
  5. Run tests: python scripts/check_code.py --with-tests
  6. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For questions and support, please open an issue on GitHub.
