Scrapes the main text of web pages while preserving some structure.
Project description
- Code:
- Documentation:
see README file
- Issue tracker:
Trafilatura scrapes the main text of web pages while preserving some structure. The extraction focuses on the main text content, which is usually the part displayed centrally, without the left or right bars, the header or the footer, but including potential titles and comments. All the operations needed from web page download to HTML parsing are handled seamlessly, including scraping and textual analysis.
In a nutshell, with Python:
>>> import trafilatura
>>> downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
>>> trafilatura.extract(downloaded)
# outputs main content as plain text ...
On the command-line:
$ trafilatura -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# outputs main content as plain text ...
Description
This library performs a robust extraction of main text content and boilerplate removal based on a combination of DOM-based examination, XPath expressions and rules. Trafilatura can seamlessly download, parse and convert web documents. It scrapes the main body text while preserving part of the text formatting and page structure, a task also known as web scraping, boilerplate removal or boilerplate detection, DOM-based content extraction, main content identification, web page template detection, web page cleaning, web content extraction, or HTML text cleaning.
Distinguishing between the whole page and its essential parts can help to alleviate many quality problems related to web texts, as it reduces the noise consisting of recurring elements (headers and footers, ads, links/blogroll, etc.). The extraction has to be precise enough not to miss texts or discard valid documents; it also has to be reasonably fast, as it is expected to run in production on millions of documents.
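To make the idea of DOM-based examination combined with path expressions and rules more concrete, here is a minimal sketch using only the standard library. It is not trafilatura's implementation: the function name main_text, the tag list BOILERPLATE_TAGS and the two path expressions are assumptions chosen for illustration, and the input has to be well-formed XHTML, whereas real-world HTML calls for a lenient parser such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: containers treated as boilerplate in this sketch.
BOILERPLATE_TAGS = {"header", "footer", "nav", "aside"}


def main_text(xhtml):
    """Return the paragraph text of a well-formed XHTML string, minus boilerplate containers."""
    root = ET.fromstring(xhtml)
    # Rule 1: prune boilerplate containers from the tree.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in BOILERPLATE_TAGS:
                parent.remove(child)
    # Rule 2: prefer paragraphs inside <article>, otherwise fall back to all paragraphs.
    paragraphs = root.findall(".//article//p") or root.findall(".//p")
    # itertext() also collects the text of inline children such as <b>.
    return "\n".join("".join(p.itertext()).strip() for p in paragraphs)


page = (
    "<html><body>"
    "<header><p>Site menu</p></header>"
    "<article><h1>Title</h1><p>First <b>main</b> paragraph.</p><p>Second.</p></article>"
    "<footer><p>Imprint</p></footer>"
    "</body></html>"
)
print(main_text(page))
```

The header and footer paragraphs are pruned before the path expression runs, so only the two article paragraphs are printed.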
Features
URLs, HTML files or parsed HTML trees given as input
Formatting elements and the structure of the page are preserved (paragraphs, titles, lists, quotes, code, line breaks), which can be used for further processing
Main text and comments can be targeted separately
Result in plain text (newlines and lists preserved) or XML format (with source and structure)
Because it relies on lxml, trafilatura is comparatively fast. It is also robust, as the additional generic jusText algorithm is used as a backup solution.
Work in progress, currently experimental features:
[x] Duplicate detection at sentence, paragraph and document level using a least recently used (LRU) cache
[x] Language detection on the extracted content
[-] XML output compatible with the recommendations of the Text Encoding Initiative
[ ] Preservation of in-line text formatting (bold, italic, etc.)
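The duplicate detection listed above relies on a least recently used (LRU) cache. The following sketch shows one way such a cache could work at paragraph level; it is an illustration under assumptions, not the code used in trafilatura. The names LRUSeen, is_duplicate and drop_repeated are hypothetical.

```python
from collections import OrderedDict


class LRUSeen:
    """Remember the most recently seen text segments, up to `maxsize` entries."""

    def __init__(self, maxsize=1000):
        self.maxsize = maxsize
        self._seen = OrderedDict()

    def is_duplicate(self, segment):
        key = segment.strip()
        if key in self._seen:
            # Mark the segment as recently used so that it is evicted last.
            self._seen.move_to_end(key)
            return True
        self._seen[key] = True
        if len(self._seen) > self.maxsize:
            # Evict the least recently used segment.
            self._seen.popitem(last=False)
        return False


def drop_repeated(paragraphs, cache):
    """Keep only the paragraphs that the cache has not seen recently."""
    return [p for p in paragraphs if not cache.is_duplicate(p)]


cache = LRUSeen(maxsize=2)
print(drop_repeated(["a", "b", "a", "c", "a"], cache))
```

Bounding the cache keeps memory usage constant over millions of documents, at the price of forgetting segments that have not recurred for a while.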
Installation
trafilatura is a Python package (compatible with Python 3.5 upwards) which is currently tested on Linux and macOS and to some extent on Windows. It is available on PyPI and can be installed using pip. (Use pip3 install trafilatura on systems where both Python 2 and 3 are globally installed.)
Install from package repository: pip install trafilatura
Direct installation of the latest version (see build status): pip install git+https://github.com/adbar/trafilatura.git
For all experimental functionality, please use pip install trafilatura[all]. This most notably covers language detection and faster processing of downloads. The cchardet package is currently not working on some macOS versions.
(For information on dependency management of Python packages see this discussion thread)
Usage
With Python
Using trafilatura in a straightforward way:
>>> import trafilatura
>>> downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
>>> downloaded is None # assuming the download was successful
False
>>> result = trafilatura.extract(downloaded)
>>> print(result)
# newlines preserved, TXT output ...
>>> result = trafilatura.extract(downloaded, xml_output=True)
>>> print(result)
# some formatting preserved in basic XML structure ...
The only required argument is the input document (here a downloaded HTML file); the rest is optional.
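The two steps shown above can be wrapped in a small helper that checks the download before extracting. This is only a sketch: the name extract_from_url is hypothetical, and the helper uses nothing beyond the fetch_url and extract calls from the examples, forwarding any optional keyword arguments unchanged.

```python
def extract_from_url(url, **options):
    """Download `url` and return its main text, or None if the download failed."""
    # Imported lazily so that the helper can be defined without the package installed.
    import trafilatura

    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:  # the download was not successful
        return None
    # Optional keyword arguments are passed through to extract() as given.
    return trafilatura.extract(downloaded, **options)
```

A call such as extract_from_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/', xml_output=True) would then return the XML output or None.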
The inclusion of tables and comments can be deactivated at a function call. The use of a fallback algorithm (currently jusText) can also be bypassed in fast mode:
>>> result = trafilatura.extract(downloaded, include_comments=False) # no comments in output
>>> result = trafilatura.extract(downloaded, include_tables=False) # skip tables examination
>>> result = trafilatura.extract(downloaded, no_fallback=True) # skip justext algorithm used as fallback
>>> result = trafilatura.extract(downloaded, include_comments=False, include_tables=False, no_fallback=True) # probably the fastest execution
The input can consist of a previously parsed tree (i.e. an lxml.html object), which is then handled seamlessly:
>>> from lxml import html
>>> mytree = html.fromstring('<html><body><article><p>Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p></article></body></html>')
>>> trafilatura.extract(mytree)
'Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\n'
Experimental feature: the target language can also be set using 2-letter codes (ISO 639-1). There will be no output if the detected language of the result does not match. No such filtering takes place if the identification component has not been installed (see above for installation instructions).
>>> result = trafilatura.extract(downloaded, target_language='de')
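The filtering rule described above can be summarized in a few lines. The sketch below is an assumption about the logic, not the library's code: filter_by_language is a hypothetical name, and detect stands for any callable that returns an ISO 639-1 code, with None meaning that no identification component is available.

```python
def filter_by_language(text, target_language, detect=None):
    """Return `text` if it passes the language filter, otherwise None."""
    # No target language set, or no detector installed: no filtering at all.
    if target_language is None or detect is None:
        return text
    # Discard the result if the detected language does not match the target.
    return text if detect(text) == target_language else None


print(filter_by_language("Hallo Welt", "de", detect=lambda text: "de"))
```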
For further configuration see the variables in settings.py.
On the command-line
A command-line interface is included. For general instructions see Command Prompt (tutorial for Windows systems), How to use the Terminal command line in macOS, or An introduction to the Linux Terminal.
URLs can be used directly (-u/--URL):
$ trafilatura -u https://de.creativecommons.org/index.php/was-ist-cc/
$ # outputs main content in plain text format ...
$ trafilatura --xml --URL "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
$ # outputs main text with basic XML structure ...
You can also pipe an HTML document (and response body) to trafilatura:
$ cat myfile.html | trafilatura # use the contents of an already existing file
$ wget -qO- "https://de.creativecommons.org/index.php/was-ist-cc/" | trafilatura # use a custom download
The -i/--inputfile option allows for bulk download and processing of a list of URLs from a file listing one link per line. Beware that there should be a tacit scraping etiquette and that a server may block you after the download of a certain number of pages from the same website/domain in a short period of time. In addition, some websites may block the requests user-agent. Thus, trafilatura waits a few seconds by default between requests.
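The same etiquette applies when writing a custom batch loop in Python. The following sketch reads one link per line and pauses between requests. The names read_url_list and process_batch, as well as the two-second default delay, are assumptions made for illustration; handler can be any function that takes a URL, for instance one that downloads and extracts the page.

```python
import time


def read_url_list(path):
    """Read a file listing one link per line, skipping empty lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def process_batch(urls, handler, delay=2.0, sleep=time.sleep):
    """Call `handler` on each URL, waiting `delay` seconds between two requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Be polite: pause before every request except the first one.
            sleep(delay)
        results[url] = handler(url)
    return results
```

The sleep parameter only exists so that the pause can be replaced in tests; in normal use the default time.sleep is kept.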
For all usage instructions see trafilatura -h:
usage: trafilatura [-h] [-f] [-i INPUTFILE] [--nocomments] [--notables] [--xml] [--xmltei] [-u URL] [-v]
- optional arguments:
- -h, --help
show this help message and exit
- -f, --fast
fast (without fallback detection)
- -i INPUTFILE, --inputfile INPUTFILE
name of input file for batch processing
- --nocomments
don’t output any comments
- --notables
don’t output any table elements
- --xml
XML output
- --xmltei
XML TEI output
- -u URL, --URL URL
custom URL download
- -v, --verbose
increase output verbosity
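For readers who want to script around these options, the listing above can be mirrored with argparse. This is a reconstruction from the help text only, not the actual source of the command-line tool. The function names build_parser and extract_options are hypothetical, and the mapping of flags to the keyword arguments of extract() is an assumption based on the Python examples in this document.

```python
import argparse


def build_parser():
    """Build a parser that accepts the options listed in the help text."""
    parser = argparse.ArgumentParser(prog="trafilatura")
    parser.add_argument("-f", "--fast", action="store_true",
                        help="fast (without fallback detection)")
    parser.add_argument("-i", "--inputfile",
                        help="name of input file for batch processing")
    parser.add_argument("--nocomments", action="store_true",
                        help="don't output any comments")
    parser.add_argument("--notables", action="store_true",
                        help="don't output any table elements")
    parser.add_argument("--xml", action="store_true", help="XML output")
    parser.add_argument("--xmltei", action="store_true", help="XML TEI output")
    parser.add_argument("-u", "--URL", help="custom URL download")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="increase output verbosity")
    return parser


def extract_options(args):
    """Assumed mapping from command-line flags to keyword arguments of extract()."""
    return {
        "no_fallback": args.fast,
        "include_comments": not args.nocomments,
        "include_tables": not args.notables,
        "xml_output": args.xml,
    }


args = build_parser().parse_args(["-f", "--nocomments", "-u", "https://example.org"])
print(extract_options(args))
```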
Further documentation
To be released soon.
Tutorial video in German by Simon Meier-Vieracker: Content von Webseiten laden mit Trafilatura.
Additional information
Scientific context
This module is part of a set of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). A significant challenge resides in the ability to extract and pre-process web texts to meet scientific expectations: web corpus construction involves numerous design decisions, and this software package can help facilitate collection and enhance corpus quality.
Barbaresi, A. “Generic Web Content Extraction with Open-Source Software”, Proceedings of KONVENS 2019, Kaleidoscope Abstracts, University of Erlangen, 2019.
Barbaresi, A. “The Vast and the Focused: On the need for domain-focused web corpora”, Proceedings of the 7th Workshop on Challenges in the Management of Large Corpora (CMLC-7), IDS Mannheim, 2019.
Barbaresi, A. “Efficient construction of metadata-enhanced web corpora”, Proceedings of the 10th Web as Corpus Workshop (WAC-X), ACL, 2016.
Name
Trafilatura: Italian word for wire drawing.
Kudos to…
Alternatives
Most corresponding Python packages are not actively maintained. The following alternatives exist:
dragnet features combined and machine-learning approaches, but requires many dependencies as well as extensive tuning
python-readability cleans the page and preserves some markup but is mostly geared towards news texts
goose can extract information for embedded content but doesn’t preserve markup and is not maintained
html2text converts HTML pages to Markdown and thus keeps the structure, though it doesn’t focus on main text extraction
Contact
Pull requests are welcome.
See my contact page for additional details.