A simple library to scrape articles from the web using newspaper3k and search with DuckDuckGo
Article Scrapper
A simple Python library to scrape articles from the web.
Installation
pip install article_scrapper
Features
- Scrape articles: Extract the title, text, authors, and publication date from an article URL
- Search articles: Find articles using DuckDuckGo search
- Pipeline: Combine search and scrape to get article content for any query
Usage
Scrape a single article
from article_scrapper import scrape_article
article = scrape_article("https://example.com/article")
print(article["title"])
print(article["text"])
print(article["authors"])
print(article["publish_date"])
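Some pages may not expose every field, so it can help to format the result defensively. The sketch below assumes only the dictionary keys listed in the API Reference; `format_article` and its `preview_chars` parameter are hypothetical names used for illustration, not part of the library.

```python
def format_article(article: dict, preview_chars: int = 200) -> str:
    """Build a short, human-readable summary of a scraped article dict."""
    # Fall back to placeholders when a field is missing or empty.
    title = article.get("title") or "(untitled)"
    authors = ", ".join(article.get("authors") or []) or "unknown"
    published = article.get("publish_date") or "unknown"
    text = (article.get("text") or "").strip()
    # Only add an ellipsis when the text was actually truncated.
    preview = text[:preview_chars] + ("..." if len(text) > preview_chars else "")
    return f"{title}\nBy: {authors}\nPublished: {published}\n\n{preview}"


# Hand-written sample in the documented shape, so no network access is needed.
sample = {
    "url": "https://example.com/article",
    "title": "Example title",
    "text": "Example body text.",
    "authors": ["A. Writer"],
    "publish_date": "2024-01-01",
}
print(format_article(sample))
```

In real use, the `sample` dictionary would be replaced by the value returned from `scrape_article`.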
Search for articles
from article_scrapper import search_articles
urls = search_articles("python programming", max_results=5)
for url in urls:
    print(url)
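Search results can point at the same site more than once. As a minimal sketch, the list of URLs can be post-processed with the standard library before scraping; `one_url_per_host` is a hypothetical helper, not a function exported by `article_scrapper`.

```python
from urllib.parse import urlsplit


def one_url_per_host(urls: list[str]) -> list[str]:
    """Keep the first URL seen for each host, preserving the original order."""
    seen_hosts: set[str] = set()
    kept: list[str] = []
    for url in urls:
        host = urlsplit(url).netloc.lower()
        # Treat "www.example.com" and "example.com" as the same host.
        if host.startswith("www."):
            host = host[4:]
        # URLs without a host (for example relative paths) are dropped.
        if host and host not in seen_hosts:
            seen_hosts.add(host)
            kept.append(url)
    return kept


urls = [
    "https://www.example.com/a",
    "https://example.com/b",
    "https://example.org/c",
]
for url in one_url_per_host(urls):
    print(url)
```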
Search and scrape articles
from article_scrapper import get_articles_for_query
articles = get_articles_for_query("artificial intelligence news", max_results=5)
for article in articles:
    print(f"Title: {article['title']}")
    print(f"URL: {article['url']}")
    print(f"Text: {article['text'][:200]}...")
    print("---")
API Reference
scrape_article(url: str) -> dict
Scrape an article from a given URL.
Parameters:
- url: The URL of the article to scrape
Returns: A dictionary containing:
- url: The original URL
- title: The article title
- text: The full article text
- authors: List of article authors
- publish_date: The publication date as a string
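For type checking, the documented return shape can be written down as a `TypedDict`. This is a sketch based only on the field list above; `ArticleDict` and `missing_keys` are hypothetical names, and the library itself is documented as returning a plain `dict`.

```python
from typing import TypedDict


class ArticleDict(TypedDict):
    """Shape of the dictionary described in the API reference."""

    url: str
    title: str
    text: str
    authors: list[str]
    publish_date: str


# The documented keys, derived from the TypedDict so the two cannot drift apart.
REQUIRED_KEYS = frozenset(ArticleDict.__annotations__)


def missing_keys(article: dict) -> set[str]:
    """Return the documented keys that are absent from `article`."""
    return set(REQUIRED_KEYS) - set(article)


print(sorted(missing_keys({"url": "https://example.com", "title": "Example"})))
```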
search_articles(query: str, max_results: int = 5) -> list[str]
Search for articles using DuckDuckGo.
Parameters:
- query: The search query string
- max_results: Maximum number of URLs to return (default: 5)
Returns: A list of URLs matching the search query.
get_articles_for_query(query: str, max_results: int = 5) -> list[dict]
Search for articles and scrape their content.
Parameters:
- query: The search query string
- max_results: Maximum number of articles to retrieve (default: 5)
Returns:
A list of article dictionaries (same format as scrape_article).
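To illustrate how a search step and a scrape step can be combined, here is a minimal sketch of such a pipeline. It is not the library's actual implementation: `run_pipeline` is a hypothetical function that takes the two steps as arguments, and skipping URLs that fail to scrape is an assumption about desirable behavior. The stand-in functions below make the example runnable without network access.

```python
from typing import Callable


def run_pipeline(
    query: str,
    search: Callable[..., list[str]],
    scrape: Callable[[str], dict],
    max_results: int = 5,
) -> list[dict]:
    """Search for URLs, then scrape each one, skipping URLs that fail."""
    articles: list[dict] = []
    for url in search(query, max_results=max_results):
        try:
            articles.append(scrape(url))
        except Exception as exc:  # broad on purpose: one bad page should not stop the run
            print(f"Skipping {url}: {exc}")
    return articles


# Stand-ins for the real search and scrape functions (illustration only).
def fake_search(query: str, max_results: int = 5) -> list[str]:
    return ["https://example.com/1", "https://example.com/bad", "https://example.com/2"][:max_results]


def fake_scrape(url: str) -> dict:
    if url.endswith("/bad"):
        raise ValueError("could not parse page")
    return {"url": url, "title": f"Title for {url}"}


for article in run_pipeline("example query", fake_search, fake_scrape):
    print(article["url"])
```

Given the signatures documented above, `search_articles` and `scrape_article` should be usable in place of the stand-ins.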
Dependencies
- newspaper3k - Article scraping
- ddgs - DuckDuckGo search
- lxml-html-clean - HTML cleaning
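For orientation, the sketch below shows how the first two dependencies are typically called directly. It is an assumption about how a wrapper like this one could use them, not a copy of the library's code; `scrape_with_newspaper` and `search_with_ddgs` are hypothetical names. The imports sit inside the functions so the module loads even when the packages are not installed. The third dependency, lxml-html-clean, provides the `lxml.html.clean` module that newer lxml releases ship separately.

```python
def scrape_with_newspaper(url: str) -> dict:
    """Download and parse one article with newspaper3k (requires network access)."""
    from newspaper import Article  # provided by the newspaper3k package

    article = Article(url)
    article.download()
    article.parse()
    return {
        "url": url,
        "title": article.title,
        "text": article.text,
        "authors": article.authors,
        # publish_date is a datetime or None; convert it to a string here.
        "publish_date": str(article.publish_date) if article.publish_date else "",
    }


def search_with_ddgs(query: str, max_results: int = 5) -> list[str]:
    """Return result URLs from a DuckDuckGo text search (requires network access)."""
    from ddgs import DDGS

    results = DDGS().text(query, max_results=max_results)
    # Each result is a dict; the link is stored under the "href" key.
    return [r["href"] for r in results if r.get("href")]
```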
License
MIT License
File details
Details for the file article_scrapper-0.1.0.tar.gz.
File metadata
- Download URL: article_scrapper-0.1.0.tar.gz
- Upload date:
- Size: 3.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 70cb7558dcd02a1bc2639621010f4a51c5f7ed1e3f7463aeb2b2f303e02c0bed |
| MD5 | 81f81265ac8b374538cd95d8c587cfc7 |
| BLAKE2b-256 | ada09b050fe885792bd8a459ce2b444306268ae5abf412fb16f9efb6efe34837 |
File details
Details for the file article_scrapper-0.1.0-py3-none-any.whl.
File metadata
- Download URL: article_scrapper-0.1.0-py3-none-any.whl
- Upload date:
- Size: 5.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6c00c166be9faaf0b84ceb6664e649b2696d4e0899d072694068924672460228 |
| MD5 | e69622a5eb38dbc1e603333851a24bc1 |
| BLAKE2b-256 | 0b2d1b93ba2e3c533fde64814045449ae8d1c11f1023f1de562549d78c0b94b2 |