A very simple news crawler
A very simple news crawler in Python. Developed at Humboldt University of Berlin.
Fundus is:
- A static news crawler. Fundus lets you crawl online news articles with only a few lines of Python code, be it from live websites or the CC-NEWS dataset.
- An open-source Python package. Fundus is built on the idea of building something together. We welcome your contribution to help Fundus grow!
Quick Start
To install from pip, simply do:
pip install fundus
Fundus requires Python 3.8+.
Example 1: Crawl a bunch of English-language news articles
Let's use Fundus to crawl 2 articles from publishers based in the US.
from fundus import PublisherCollection, Crawler
# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)
# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
That's already it!
If you run this code, it should print out something like this:
Fundus-Article:
- Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
- Text: "Democrats jammed three of President Joe Biden's controversial court nominees
through committee votes on Thursday thanks to a last-minute [...]"
- URL: https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
- From: FreeBeacon (2023-05-11 18:41)
Fundus-Article:
- Title: "Northwestern student government freezes College Republicans funding over [...]"
- Text: "Student government at Northwestern University in Illinois "indefinitely" froze
the funds of the university's chapter of College Republicans [...]"
- URL: https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
- From: FoxNews (2023-05-09 14:37)
This printout tells you that you successfully crawled two articles!
For each article, the printout details:
- the "Title" of the article, i.e. its headline
- the "Text", i.e. the main article body text
- the "URL" from which it was crawled
- the news source it is "From"
Example 2: Crawl a specific news source
Maybe you want to crawl a specific news source instead. Let's crawl news articles from The New Yorker only:
from fundus import PublisherCollection, Crawler
# initialize the crawler for The New Yorker
crawler = Crawler(PublisherCollection.us.TheNewYorker)
# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
Example 3: Crawl 1 Million articles
To crawl such a vast amount of data, Fundus relies on the CommonCrawl web archive, in particular the news crawl CC-NEWS.
If you're not familiar with CommonCrawl or CC-NEWS, check out their websites.
Simply import our CCNewsCrawler and make sure to check out our tutorial beforehand.
from fundus import PublisherCollection, CCNewsCrawler
# initialize the crawler using all publishers supported by fundus
crawler = CCNewsCrawler(*PublisherCollection)
# crawl 1 million articles and print
for article in crawler.crawl(max_articles=1000000):
    print(article)
Note: By default, the crawler utilizes all available CPU cores on your system.
For optimal performance, we recommend manually setting the number of processes using the processes parameter.
A good rule of thumb is to allocate one process per 200 Mbps of bandwidth.
This can vary depending on core speed.
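As a rough illustration of the rule of thumb above, the following sketch turns a bandwidth figure into a suggested process count. The helper name `suggest_processes` is hypothetical and not part of Fundus, and capping the result at the number of CPU cores is an added assumption rather than something the note states.

```python
import os

# Rule of thumb from the note above: one process per 200 Mbps of bandwidth.
MBPS_PER_PROCESS = 200


def suggest_processes(bandwidth_mbps, cpu_cores=None):
    """Suggest a process count for a given bandwidth (hypothetical helper).

    The cap at the CPU core count is an assumption made for this sketch.
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    if cpu_cores is None:
        cpu_cores = os.cpu_count() or 1
    # Always suggest at least one process, even below 200 Mbps.
    by_bandwidth = max(1, int(bandwidth_mbps // MBPS_PER_PROCESS))
    return min(by_bandwidth, cpu_cores)


if __name__ == "__main__":
    print(suggest_processes(1000, cpu_cores=16))
```

The resulting number can then be supplied through the processes parameter. Since the note says the right value also depends on core speed, treat the suggestion as a starting point for tuning.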
Note: The crawl above took ~7 hours using the entire PublisherCollection on a machine with a 1000 Mbps connection, a Core i9-13905H, 64 GB RAM, and Windows 11, without printing the articles.
The estimated time can vary substantially depending on the publishers used and the available bandwidth.
Additionally, not all publishers are included in the CC-NEWS crawl (especially US-based publishers).
For large corpus creation, one can also use the regular crawler by utilizing only sitemaps, which requires significantly less bandwidth.
from fundus import PublisherCollection, Crawler, Sitemap
# initialize a crawler for us/uk based publishers and restrict to Sitemaps only
crawler = Crawler(PublisherCollection.us, PublisherCollection.uk, restrict_sources_to=[Sitemap])
# crawl 1 million articles and print
for article in crawler.crawl(max_articles=1000000):
    print(article)
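When building a large corpus, you will probably want to store articles rather than print them. The sketch below streams any iterable of articles into a JSON Lines file, one record per line. The helper `write_corpus` and its `to_record` argument are hypothetical and not part of Fundus; by default it stores only the `str()` printout of each article, which is what `print(article)` shows in the examples above.

```python
import json
import os
import tempfile


def write_corpus(articles, path, to_record=None):
    """Write one JSON object per article to `path` and return the count.

    Hypothetical helper: `to_record` maps an article to a JSON-serializable
    dict. The default keeps only the article's printout.
    """
    if to_record is None:
        def to_record(article):
            return {"printout": str(article)}
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        # Consume the iterable lazily so articles are not all held in memory.
        for article in articles:
            handle.write(json.dumps(to_record(article), ensure_ascii=False))
            handle.write("\n")
            count += 1
    return count


if __name__ == "__main__":
    # Plain strings stand in for crawled articles in this demo.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "corpus.jsonl")
        print(write_corpus(["first article", "second article"], target))
```

To use it with a crawler, pass the generator returned by `crawler.crawl(max_articles=...)` as `articles`.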
Tutorials
We provide quick tutorials to get you started with the library:
- Tutorial 1: How to crawl news with Fundus
- Tutorial 2: How to crawl articles from CC-NEWS
- Tutorial 3: The Article Class
- Tutorial 4: How to filter articles
- Tutorial 5: Advanced topics
- Tutorial 6: Logging
If you wish to contribute check out these tutorials:
Currently Supported News Sources
You can find the publishers currently supported here.
Also: Adding a new publisher is easy - consider contributing to the project!
Evaluation Benchmark
Check out our evaluation benchmark.
The following table summarizes the overall performance of Fundus and the other evaluated scrapers in terms of averaged ROUGE-LSum precision, recall, and F1-score, each with its standard deviation. The table is sorted in descending order by F1-score:
Scraper | Precision | Recall | F1-Score | Version |
---|---|---|---|---|
Fundus | 99.89±0.57 | 96.75±12.75 | 97.69±9.75 | 0.4.1 |
Trafilatura | 93.91±12.89 | 96.85±15.69 | 93.62±16.73 | 1.12.0 |
news-please | 97.95±10.08 | 91.89±16.15 | 93.39±14.52 | 1.6.13 |
BTE | 81.09±19.41 | 98.23±8.61 | 87.14±15.48 | / |
jusText | 86.51±18.92 | 90.23±20.61 | 86.96±19.76 | 3.0.1 |
BoilerNet | 85.96±18.55 | 91.21±19.15 | 86.52±18.03 | / |
Boilerpipe | 82.89±20.65 | 82.11±29.99 | 79.90±25.86 | 1.3.0 |
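To make the three columns easier to interpret, here is a minimal sketch of how precision, recall, and F1 follow from the longest common subsequence (LCS) between a reference text and an extracted text. It is a simplification and not the benchmark's implementation: it computes plain ROUGE-L on whitespace tokens, whereas ROUGE-LSum additionally works on sentence splits. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Return (precision, recall, f1) of `candidate` against `reference`."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    lcs = lcs_length(ref_tokens, cand_tokens) if ref_tokens and cand_tokens else 0
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand_tokens)  # share of extracted tokens that match
    recall = lcs / len(ref_tokens)      # share of reference tokens recovered
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    # An extraction that misses text: precision 1.0, recall 0.5.
    print(rouge_l(reference, "the cat sat"))
    # An extraction that keeps extra boilerplate: precision 0.6, recall 1.0.
    print(rouge_l(reference, "the cat sat on the mat plus some boilerplate text"))
```

In this reading, a scraper with high recall but lower precision tends to include text beyond the article body, while high precision with lower recall suggests it drops parts of the body.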
Cite
Please cite the following paper when using Fundus or building upon our work:
@inproceedings{dallabetta-etal-2024-fundus,
    title = "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions",
    author = "Dallabetta, Max and
      Dobberstein, Conrad and
      Breiding, Adrian and
      Akbik, Alan",
    editor = "Cao, Yixin and
      Feng, Yang and
      Xiong, Deyi",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-demos.29",
    pages = "305--314",
}
Contact
Please email your questions or comments to Max Dallabetta.
Contributing
Thanks for your interest in contributing! There are many ways to get involved; start with our contributor guidelines and then check these open issues for specific tasks.
License