Information Composer

A comprehensive toolkit for collecting, composing, and filtering information from various web resources with AI-powered markdown processing.

Features

Core Modules

PDF Validation: Validate PDF file formats and integrity
Markdown Processing: Advanced markdown processing with LLM filtering
DOI Management: Download and manage academic papers by DOI
PubMed Integration: Query and process PubMed data with CLI tool

AI-Powered Features

LLM Filtering: Support for DashScope, Ollama, and OpenAI
PubMed Analyzer: AI-powered literature analysis using Pydantic AI
Markdown Filter: Intelligent content extraction and filtering

Migration Note: This project uses Pydantic AI for type-safe LLM integration. See Migration Guide for details.

Web Scraping & Data Collection

Crossref Integration: Query Crossref API for bibliographic data
Google Scholar Integration: Crawl and process Google Scholar citations
RSS Feed Processing: Parse and manage scientific RSS feeds
RiceDataCN Parser: Extract gene data from RiceDataCN database

Developer Tools

Code Quality: Ruff linter and formatter (primary tool)
Testing: Pytest with 51%+ coverage (570 tests passed)
Multi-format Support: PDF, Markdown, JSON, XML, TXT

Installation

Prerequisites

Python 3.12 or 3.13 (Python 3.12 is the minimum required version)
Virtual environment (recommended)
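
Before installing, it can help to confirm that the interpreter meets the stated minimum. The snippet below is a minimal sketch that only checks the lower bound (3.12) given above. The helper name `is_supported` is hypothetical and is not part of this package.

```python
import sys

# Minimum version stated in the prerequisites above.
MIN_VERSION = (3, 12)


def is_supported(version: tuple[int, int]) -> bool:
    """Return True when a (major, minor) pair meets the stated minimum."""
    return version >= MIN_VERSION


if __name__ == "__main__":
    current = sys.version_info[:2]
    status = "ok" if is_supported(current) else "too old"
    print(f"Python {current[0]}.{current[1]}: {status}")
```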

Setup

Clone the repository:

git clone https://github.com/yourusername/information-composer.git
cd information-composer

Create and activate virtual environment:

# Linux/macOS
python -m venv .venv
source .venv/bin/activate

# Windows
python -m venv .venv
.venv\Scripts\activate

Install dependencies:

pip install -e .

Quick Start

Activate Environment

# Linux/macOS
source activate.sh

# Windows
activate.bat

Available CLI Commands

Command	Description
`pdf-validator`	Validate PDF files
`md-llm-filter`	Filter markdown with LLM
`pubmed-cli`	Search and fetch PubMed data
`google-scholar-crawler`	Crawl Google Scholar citations
`rss-fetcher`	Fetch and process RSS feeds
`crossref-cli`	Query Crossref API

Examples

# Validate PDF files
pdf-validator document.pdf

# Validate directory of PDFs
pdf-validator -d /path/to/directory -r

# Filter markdown with LLM
md-llm-filter -i input.md -o output.md

# Search PubMed
pubmed-cli search "cancer research" -e user@example.com

# Get details for specific PMIDs
pubmed-cli details 12345678 23456789 -e user@example.com

# Crawl Google Scholar
google-scholar-crawler -q "machine learning" -n 20

# Fetch RSS feeds
rss-fetcher -u "https://example.com/feed.xml" -o output.json

# Query Crossref
crossref-cli query --doi "10.1038/nature12373"
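
The `rss-fetcher` example above writes feed entries to a JSON file. As a rough illustration of that kind of conversion, the sketch below parses an RSS 2.0 document with the standard library and emits JSON. It is not the implementation behind `rss-fetcher`. The sample feed, the function name `rss_items`, and the output keys are all assumptions made for this example.

```python
import json
import xml.etree.ElementTree as ET

# A tiny made-up RSS 2.0 feed used only for this illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Journal</title>
    <item>
      <title>First article</title>
      <link>https://example.com/a1</link>
    </item>
    <item>
      <title>Second article</title>
      <link>https://example.com/a2</link>
    </item>
  </channel>
</rss>"""


def rss_items(xml_text: str) -> list[dict[str, str]]:
    """Collect the title and link of every <item> element in the feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append(
            {
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
            }
        )
    return items


print(json.dumps(rss_items(SAMPLE_FEED), indent=2))
```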

Python API Usage

PubMed Integration

from information_composer.pubmed import query_pmid, fetch_pubmed_details_batch_sync

# Search for articles
pmids = query_pmid("cancer immunotherapy", "your-email@example.com", 50)

# Fetch detailed information
details = fetch_pubmed_details_batch_sync(pmids, "your-email@example.com")
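
The name `fetch_pubmed_details_batch_sync` suggests that details are fetched in batches, but the README does not say how the library splits the work. The sketch below shows only the general idea of cutting a PMID list into fixed-size groups. The helper `chunked` and the batch size are hypothetical and are not part of this package.

```python
from collections.abc import Iterator, Sequence


def chunked(pmids: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` PMIDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(pmids), size):
        yield list(pmids[start:start + size])


# Made-up PMIDs, for illustration only.
pmids = ["12345678", "23456789", "34567890", "45678901", "56789012"]
batches = list(chunked(pmids, 2))
print(batches)
```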

Crossref Integration

from information_composer import CrossrefClient, query_crossref

# Query Crossref API
client = CrossrefClient()
results = client.query_works(query="machine learning", limit=10)

# Or use the convenience function
works = query_crossref("machine learning")

DOI Downloader

from information_composer import DOIDownloader

# Download paper by DOI
downloader = DOIDownloader()
result = downloader.download_doi("10.1038/nature12373")
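
DOIs often arrive with a `doi:` prefix or as full `https://doi.org/` links, so it can be useful to normalize them before passing them on. The sketch below is one way to do that with the standard library. The helpers `normalize_doi` and `doi_url` are hypothetical, the pattern is a deliberately loose approximation of DOI syntax, and nothing here reflects how `DOIDownloader` handles its input.

```python
import re

# Loose approximation: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip whitespace and common prefixes, then check the basic shape."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a DOI: {raw!r}")
    return doi


def doi_url(raw: str) -> str:
    """Return the doi.org resolver URL for a DOI in any accepted form."""
    return f"https://doi.org/{normalize_doi(raw)}"


print(doi_url("doi:10.1038/nature12373"))
# -> https://doi.org/10.1038/nature12373
```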

Markdown Processing

from information_composer import jsonify, markdownify

# Sample input; any markdown string works here
markdown_content = "# Title\n\nSome text."

# Convert markdown to JSON
json_data = jsonify(markdown_content)

# Convert JSON back to markdown
markdown_content = markdownify(json_data)
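
The README does not document the JSON structure that `jsonify` produces. To give a feel for what a markdown-to-JSON round trip can look like, the sketch below splits a document into heading-delimited sections and joins them back. The functions `to_sections` and `to_markdown` and the section layout are assumptions made for this example and are not the library's format. The sketch also ignores fenced code blocks, so a `#` line inside one would be treated as a heading.

```python
import json
import re

# ATX headings: one to six "#" characters followed by whitespace and text.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def to_sections(markdown: str) -> list[dict]:
    """Split markdown into sections, one per heading."""
    sections = []
    current = {"level": 0, "heading": "", "lines": []}
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append(current)
            current = {
                "level": len(match.group(1)),
                "heading": match.group(2).strip(),
                "lines": [],
            }
        else:
            current["lines"].append(line)
    sections.append(current)
    # Drop a preamble that has neither a heading nor any visible text.
    return [
        s for s in sections
        if s["heading"] or any(text.strip() for text in s["lines"])
    ]


def to_markdown(sections: list[dict]) -> str:
    """Rebuild markdown text from the section list."""
    out = []
    for section in sections:
        if section["heading"]:
            out.append("#" * section["level"] + " " + section["heading"])
        out.extend(section["lines"])
    return "\n".join(out)


doc = "# Title\n\nIntro text.\n\n## Usage\n\nRun it."
sections = to_sections(doc)
print(json.dumps(sections, indent=2))
```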

PDF Validation

from information_composer import PDFValidator

# Validate PDF
validator = PDFValidator(verbose=True)
is_valid, error = validator.validate_single_pdf("document.pdf")

Google Scholar Crawling

from information_composer.sites.google_scholar import SearchConfig, google_scholar_search

# Search Google Scholar
config = SearchConfig(query="deep learning", num_results=20)
results = google_scholar_search(config)

Development

Code Quality

This project uses Ruff as the primary code quality tool:

# Run all checks
python scripts/check_code.py

# Auto-fix issues
python scripts/check_code.py --fix

# With tests
python scripts/check_code.py --with-tests

# Verbose output
python scripts/check_code.py --verbose

Testing

# Run tests
pytest tests/ -v

# Run tests with coverage
python scripts/check_code.py --with-tests

Project Structure

information-composer/
├── src/information_composer/
│   ├── core/              # Core functionality (DOI downloader)
│   ├── crossref/          # Crossref API integration
│   ├── llm_filter/        # LLM-based markdown filtering
│   ├── markdown/          # Markdown processing utilities
│   ├── pdf/               # PDF validation
│   ├── pubmed/            # PubMed integration
│   ├── rss/               # RSS feed processing
│   └── sites/             # Web scraping (Google Scholar, RiceDataCN)
├── examples/              # Usage examples
├── scripts/               # Utility scripts
├── docs/                  # Documentation
└── tests/                 # Test files

Documentation

📚 Complete Documentation - Full project documentation
🚀 Quick Start - Get started in 5 minutes
⚙️ Configuration - Configuration options
📖 Feature Guides - Detailed feature documentation
🔧 Development - Development and contributing guide

Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run code quality checks: python scripts/check_code.py --fix
5. Run tests: python scripts/check_code.py --with-tests
6. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For questions and support, please open an issue on GitHub.


Release history

Version	Release date
0.4.0 (this version)	Mar 31, 2026
0.3.0	Feb 6, 2026
0.2.1	Jan 21, 2026
0.2.0	Nov 5, 2025
0.1.3	Sep 10, 2025
0.1.2.1	Nov 14, 2024
0.1.0	Nov 14, 2024

Download files

Source Distribution

information_composer-0.4.0.tar.gz (354.1 kB), uploaded Mar 31, 2026, Source

Built Distribution

information_composer-0.4.0-py3-none-any.whl (216.1 kB), uploaded Mar 31, 2026, Python 3

File details

Details for the file information_composer-0.4.0.tar.gz.

File metadata

Download URL: information_composer-0.4.0.tar.gz
Upload date: Mar 31, 2026
Size: 354.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for information_composer-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`2dc3ce62a96b076255e2016a1b97329b4be21208829f388414705881a707cb8b`
MD5	`1341235b6e3ec8a4d3f2a54cfa8b3c68`
BLAKE2b-256	`57b00c4766674b9a37492baebc5da7dc129846609709b08173c37ca9eaccc17a`


File details

Details for the file information_composer-0.4.0-py3-none-any.whl.

File metadata

Download URL: information_composer-0.4.0-py3-none-any.whl
Upload date: Mar 31, 2026
Size: 216.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for information_composer-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`77fef052637fda9353d6692c2883c9f54dc42f25e5603fd4a3e35b5f4596f5f9`
MD5	`7c537eda09542862f4ab080969891ca5`
BLAKE2b-256	`3aba69211fd6d97f6cd2aebe8b430cd282c8ada9a1171d983c8534d05a4e5cdf`
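
The SHA256 digests listed above can be used to confirm that a downloaded file is intact. The sketch below computes a file's SHA256 with the standard library and compares it to an expected hex string. The helper names `sha256_of` and `matches` are hypothetical, and the demonstration uses a throwaway file rather than the real distribution.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex: str) -> bool:
    """Compare a file's digest to an expected value, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demonstration on a temporary file with known content.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    expected = hashlib.sha256(b"abc").hexdigest()
    print(matches(sample, expected))  # True
```

To check the wheel, call `matches("information_composer-0.4.0-py3-none-any.whl", "<SHA256 from the table above>")` in the directory that holds the download.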

