Extract what matters from any media source
Project description
Content Core
Content Core is a versatile Python library designed to extract and process content from various sources, providing a unified interface for handling text, web pages, and local files.
Overview
Note: As of v0.8, the default extraction engine is
'auto'. Content Core will automatically select the best extraction method based on your environment and available API keys, with a smart fallback order for both URLs and files. For files/documents,'auto'now tries Docling first, then falls back to simple extraction. You can override the engine if needed, but'auto'is recommended for most users.
The primary goal of Content Core is to simplify the process of ingesting content from diverse origins. Whether you have raw text, a URL pointing to an article, or a local file like a video or markdown document, Content Core aims to extract the meaningful content for further use.
Key Features
- Multi-Source Extraction: Handles content from:
- Direct text strings.
- Web URLs (using robust extraction methods).
- Local files (including automatic transcription for video/audio files and parsing for text-based formats).
- Intelligent Processing: Applies appropriate extraction techniques based on the source type. See the Processors Documentation for detailed information on how different content types are handled.
- Smart Engine Selection: By default, Content Core uses the
'auto'engine, which:- For URLs: Uses Firecrawl if
FIRECRAWL_API_KEYis set, else tries Jina. Jina might fail because of rate limits, which can be fixed by addingJINA_API_KEY. If Jina failes, BeautifulSoup is used as a fallback. - For files: Tries Docling extraction first (for robust document parsing), then falls back to simple extraction if needed.
- You can override this by specifying an engine, but
'auto'is recommended for most users.
- For URLs: Uses Firecrawl if
- Content Cleaning (Optional): Likely integrates with LLMs (via
prompter.pyand Jinja templates) to refine and clean the extracted content. - Asynchronous: Built with
asynciofor efficient I/O operations.
Getting Started
Installation
Install Content Core using pip:
# Install the package (without Docling)
pip install content-core
Alternatively, if you’re developing locally:
# Clone the repository
git clone https://github.com/lfnovo/content-core
cd content-core
# Install with uv
uv sync
Command-Line Interface
Content Core provides three CLI commands for extracting, cleaning, and summarizing content: ccore, cclean, and csum. These commands support input from text, URLs, files, or piped data (e.g., via cat file | command).
ccore - Extract Content
Extracts content from text, URLs, or files, with optional formatting. Usage:
ccore [-f|--format xml|json|text] [-d|--debug] [content]
Options:
-f,--format: Output format (xml, json, or text). Default: text.-d,--debug: Enable debug logging.content: Input content (text, URL, or file path). If omitted, reads from stdin.
Examples:
# Extract from a URL as text
ccore https://example.com
# Extract from a file as JSON
ccore -f json document.pdf
# Extract from piped text as XML
echo "Sample text" | ccore --format xml
cclean - Clean Content
Cleans content by removing unnecessary formatting, spaces, or artifacts. Accepts text, JSON, XML input, URLs, or file paths. Usage:
cclean [-d|--debug] [content]
Options:
-d,--debug: Enable debug logging.content: Input content to clean (text, URL, file path, JSON, or XML). If omitted, reads from stdin.
Examples:
# Clean a text string
cclean " messy text "
# Clean piped JSON
echo '{"content": " messy text "}' | cclean
# Clean content from a URL
cclean https://example.com
# Clean a file’s content
cclean document.txt
csum - Summarize Content
Summarizes content with an optional context to guide the summary style. Accepts text, JSON, XML input, URLs, or file paths.
Usage:
csum [--context "context text"] [-d|--debug] [content]
Options:
--context: Context for summarization (e.g., "explain to a child"). Default: none.-d,--debug: Enable debug logging.content: Input content to summarize (text, URL, file path, JSON, or XML). If omitted, reads from stdin.
Examples:
# Summarize text
csum "AI is transforming industries."
# Summarize with context
csum --context "in bullet points" "AI is transforming industries."
# Summarize piped content
cat article.txt | csum --context "one sentence"
# Summarize content from URL
csum https://example.com
# Summarize a file's content
csum document.txt
Quick Start
You can quickly integrate content-core into your Python projects to extract, clean, and summarize content from various sources.
import content_core as cc
# Extract content from a URL, file, or text
result = await cc.extract("https://example.com/article")
# Clean messy content
cleaned_text = await cc.clean("...messy text with [brackets] and extra spaces...")
# Summarize content with optional context
summary = await cc.summarize_content("long article text", context="explain to a child")
Documentation
For more information on how to use the Content Core library, including details on AI model configuration and customization, refer to our Usage Documentation.
Using with Langchain
For users integrating with the Langchain framework, content-core exposes a set of compatible tools. These tools, located in the src/content_core/tools directory, allow you to leverage content-core extraction, cleaning, and summarization capabilities directly within your Langchain agents and chains.
You can import and use these tools like any other Langchain tool. For example:
from content_core.tools import extract_content_tool, cleanup_content_tool, summarize_content_tool
from langchain.agents import initialize_agent, AgentType
tools = [extract_content_tool, cleanup_content_tool, summarize_content_tool]
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent.run("Extract the content from https://example.com and then summarize it.")
Refer to the source code in src/content_core/tools for specific tool implementations and usage details.
Basic Usage
The core functionality revolves around the extract_content function.
import asyncio
from content_core.extraction import extract_content
async def main():
# Extract from raw text
text_data = await extract_content({"content": "This is my sample text content."})
print(text_data)
# Extract from a URL (uses 'auto' engine by default)
url_data = await extract_content({"url": "https://www.example.com"})
print(url_data)
# Extract from a local video file (gets transcript, engine='auto' by default)
video_data = await extract_content({"file_path": "path/to/your/video.mp4"})
print(video_data)
# Extract from a local markdown file (engine='auto' by default)
md_data = await extract_content({"file_path": "path/to/your/document.md"})
print(md_data)
# Per-execution override with Docling for documents
doc_data = await extract_content({
"file_path": "path/to/your/document.pdf",
"document_engine": "docling",
"output_format": "html"
})
# Per-execution override with Firecrawl for URLs
url_data = await extract_content({
"url": "https://www.example.com",
"url_engine": "firecrawl"
})
print(doc_data)
if __name__ == "__main__":
asyncio.run(main())
(See src/content_core/notebooks/run.ipynb for more detailed examples.)
Docling Integration
Content Core supports an optional Docling-based extraction engine for rich document formats (PDF, DOCX, PPTX, XLSX, Markdown, AsciiDoc, HTML, CSV, Images).
Enabling Docling
Docling is not the default engine when parsing documents. If you don't want to use it, you need to set engine to "simple".
Via configuration file
In your cc_config.yaml or custom config, set:
extraction:
document_engine: docling # 'auto' (default), 'simple', or 'docling'
url_engine: auto # 'auto' (default), 'simple', 'firecrawl', or 'jina'
docling:
output_format: markdown # markdown | html | json
Programmatically in Python
from content_core.config import set_document_engine, set_url_engine, set_docling_output_format
# switch document engine to Docling
set_document_engine("docling")
# switch URL engine to Firecrawl
set_url_engine("firecrawl")
# choose output format: 'markdown', 'html', or 'json'
set_docling_output_format("html")
# now use ccore.extract or ccore.ccore
result = await cc.extract("document.pdf")
Configuration
Configuration settings (like API keys for external services, logging levels) can be managed through environment variables or .env files, loaded automatically via python-dotenv.
Example .env:
OPENAI_API_KEY=your-key-here
GOOGLE_API_KEY=your-key-here
Custom Prompt Templates
Content Core allows you to define custom prompt templates for content processing. By default, the library uses built-in prompts located in the prompts directory. However, you can create your own prompt templates and store them in a dedicated directory. To specify the location of your custom prompts, set the PROMPT_PATH environment variable in your .env file or system environment.
Example .env with custom prompt path:
OPENAI_API_KEY=your-key-here
GOOGLE_API_KEY=your-key-here
PROMPT_PATH=/path/to/your/custom/prompts
When a prompt template is requested, Content Core will first look in the custom directory specified by PROMPT_PATH (if set and exists). If the template is not found there, it will fall back to the default built-in prompts. This allows you to override specific prompts while still using the default ones for others.
Development
To set up a development environment:
# Clone the repository
git clone <repository-url>
cd content-core
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv sync --group dev
# Run tests
make test
# Lint code
make lint
# See all commands
make help
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contributing
Contributions are welcome! Please see our Contributing Guide for more details on how to get started.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file content_core-1.0.0.tar.gz.
File metadata
- Download URL: content_core-1.0.0.tar.gz
- Upload date:
- Size: 20.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7aefe6c4f0d4ea5638d77f5b208d3583763852513dd17399374c7eaa766c2ac7
|
|
| MD5 |
a20cd255b623e04513836f042d75d397
|
|
| BLAKE2b-256 |
792258dafe43676ab36eea18edbeb4b54050f49d9b261740d37cd93761fe6cb7
|
File details
Details for the file content_core-1.0.0-py3-none-any.whl.
File metadata
- Download URL: content_core-1.0.0-py3-none-any.whl
- Upload date:
- Size: 154.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8d7885e51a43976c59ff43acb32a5d4fc4e593d38ddda39cdafd4dd0c80a0818
|
|
| MD5 |
aa23e4d4863f10605ad9ca3ba2a0cfaa
|
|
| BLAKE2b-256 |
1f8a13c36e808f4feb93ec8778f442d88a0accdb8357ddeaac815353a9ca0594
|