Scrapes the main text of web pages while preserving some structure.
Project description
- License: GNU GPL v3; see LICENSE file
Robust extraction of main text content and boilerplate removal based on a combination of DOM-based examination, XPath expressions and rules. Given an HTML document, this library parses it, retrieves the main body text and converts it to XML or plain text, while preserving part of the text formatting and page structure.
In a nutshell, with Python:
>>> import requests, trafilatura
>>> response = requests.get('https://www.iana.org/about')
>>> trafilatura.process_record(response.text)
>>> # outputs main content in plain text format ...
On the command-line:
$ trafilatura -u https://www.sueddeutsche.de/politik/usa-pompeo-maas-merkel-iran-nordstream-1.4434358
$ # outputs main content in plain text format ...
Features
Scrapes the main text of web pages while preserving some structure. This task is also known as boilerplate removal, DOM-based content extraction, main content identification, or HTML text cleaning. The purpose is to find the relevant and original text sections of a web page and to remove the noise consisting of recurring elements (headers and footers, ads, links/blogroll, etc.).
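To make the idea of boilerplate removal concrete, here is a toy sketch that drops recurring elements and keeps paragraph-like text. It is not how trafilatura works internally: it uses the standard-library html.parser instead of lxml and XPath, and the tag sets as well as the names MainTextParser and extract_main_text are assumptions made for this illustration.

```python
from html.parser import HTMLParser

# Assumed for this sketch: tags treated as noise and tags treated as content.
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}
TEXT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li"}


class MainTextParser(HTMLParser):
    """Collect text from content tags, skipping anything nested in noise tags."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # > 0 while inside a noise element
        self.text_depth = 0   # > 0 while inside a content element
        self.buffer = []      # text pieces of the block being read
        self.blocks = []      # finished text blocks, in document order

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in TEXT_TAGS and self.noise_depth == 0:
            self.text_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1
        elif tag in TEXT_TAGS and self.text_depth > 0:
            self.text_depth -= 1
            if self.text_depth == 0:
                # Normalize whitespace and store the finished block.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.blocks.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.noise_depth == 0 and self.text_depth > 0:
            self.buffer.append(data)


def extract_main_text(html_string):
    """Return the kept text blocks joined by newlines."""
    parser = MainTextParser()
    parser.feed(html_string)
    parser.close()
    return "\n".join(parser.blocks)
```

Real pages need far more rules than a fixed tag list, which is why the library combines several strategies.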
Because it relies on lxml, trafilatura is comparatively fast. It is also robust, as the additional generic jusText algorithm is used as a backup solution.
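The backup strategy can be pictured as a simple fallback chain. The following is a minimal sketch under stated assumptions: primary and backup are hypothetical callables that take an HTML string and return text or None, and the min_length threshold is invented for the example. The criteria trafilatura actually uses to decide when to fall back on jusText are not described here.

```python
def extract_with_fallback(html_string, primary, backup, min_length=100):
    """Return the primary result, or consult the backup if it looks too weak.

    `primary` and `backup` are hypothetical extractor callables;
    `min_length` is an assumed threshold, not a library setting.
    """
    result = primary(html_string)
    if result is not None and len(result) >= min_length:
        return result
    fallback = backup(html_string)
    # Keep whichever candidate carries more text; None or "" counts as empty.
    candidates = [text for text in (result, fallback) if text]
    return max(candidates, key=len) if candidates else None
```

The backup is only called when the first result is missing or short, so the common case stays fast.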
The result of processing can be in plain text or XML format. In the latter case, basic formatting elements such as text formatting (bold, italic, etc.) and page structure (paragraphs, titles, lists) are preserved and can be used for further processing.
Work in progress, currently experimental features:
- Separate extraction of main text and comments
- Duplicate detection at paragraph level using a least recently used (LRU) cache
- Language detection on the extracted content
- XML output compatible with the recommendations of the Text Encoding Initiative (XML TEI)
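As an illustration of paragraph-level duplicate detection with an LRU cache, here is a minimal sketch. The class LRUDuplicateFilter, the helper drop_duplicates and the default cache size are assumptions made for this example and are not taken from the trafilatura code base.

```python
from collections import OrderedDict


class LRUDuplicateFilter:
    """Remember the most recently seen paragraphs, up to `maxsize` entries."""

    def __init__(self, maxsize=1000):
        self.maxsize = maxsize
        self.seen = OrderedDict()  # oldest entry first, newest last

    def is_duplicate(self, paragraph):
        # Normalize whitespace so trivially different copies still match.
        key = " ".join(paragraph.split())
        if key in self.seen:
            self.seen.move_to_end(key)  # mark as recently used
            return True
        self.seen[key] = True
        if len(self.seen) > self.maxsize:
            self.seen.popitem(last=False)  # evict the least recently used
        return False


def drop_duplicates(paragraphs, cache):
    """Keep only paragraphs the cache has not seen recently."""
    return [p for p in paragraphs if not cache.is_duplicate(p)]
```

Because the cache is bounded, memory use stays constant over a long crawl, at the cost of forgetting paragraphs that have not been seen for a while.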
Installation
trafilatura is a Python 3 package that is available on PyPI and can be installed using pip:
pip install trafilatura
(Or use pip3 install trafilatura on systems where Python 2 and 3 are both globally installed and pip refers to Python 2.)
For all experimental functionality, use pip install trafilatura[all] (installation issues may occur on some platforms).
Direct installation of the latest version (see build status):
pip install git+https://github.com/adbar/trafilatura.git
With Python
Basic use
The simplest way to use trafilatura is as follows:
>>> import requests, trafilatura
>>> response = requests.get('https://www.iana.org/about')
>>> result = trafilatura.process_record(response.text)
>>> print(result) # newlines preserved, TXT output
>>> result = trafilatura.process_record(response.text, xml_output=True)
>>> print(result) # some formatting preserved in basic XML structure
The only required argument is the downloaded document (here response.text); the rest are optional. It is also possible to pass a previously parsed tree (i.e. an lxml.html object) as input, which is then handled seamlessly.
>>> from lxml import html
>>> mytree = html.fromstring('<html><body><article><p>Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p></article></body></html>')
>>> trafilatura.process_record(mytree)
'Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\n'
Experimental feature: the target language can also be set using two-letter codes (ISO 639-1). There is no output if the detected language of the result does not match, and no such filtering takes place if the identification component has not been installed (see above for installation instructions).
>>> result = trafilatura.process_record(response.text, response.url, target_language='de')
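The filtering behaviour described above can be modelled in a few lines. This is a sketch of the described logic only, not the library's implementation: filter_by_language is a hypothetical helper, and detect stands in for whatever language identifier is installed (a callable mapping text to an ISO 639-1 code, or None if the component is missing).

```python
def filter_by_language(text, target_language, detect=None):
    """Return `text` if it passes the language check, else None.

    `detect` is a hypothetical callable returning an ISO 639-1 code;
    passing None models the case where no identifier is installed.
    """
    if text is None or target_language is None or detect is None:
        return text  # nothing to filter on, pass the text through
    return text if detect(text) == target_language else None
```

Callers should therefore be prepared for an empty result whenever a target language is set.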
For further configuration see the variables in settings.py.
On the command-line
A command-line interface is included; URLs can be passed directly (-u/--URL):
$ trafilatura -u https://www.sueddeutsche.de/politik/usa-pompeo-maas-merkel-iran-nordstream-1.4434358
$ # outputs main content in plain text format ...
$ trafilatura --xml --URL "https://de.creativecommons.org/index.php/was-ist-cc/"
$ # outputs main text with basic XML structure ...
You can also pipe an HTML document (such as a response body) to trafilatura:
$ wget -qO- "https://de.creativecommons.org/index.php/was-ist-cc/" | trafilatura
For usage instructions see trafilatura -h:
usage: trafilatura [-h] [-f] [--nocomments] [--notables] [--xml] [--xmltei] [-u URL] [-v]
optional arguments:
  -h, --help         show this help message and exit
  -f, --fast         Fast (without fallback detection)
  --nocomments       Don’t output any comments
  --notables         Don’t output any table elements
  --xml              XML output
  --xmltei           XML TEI output
  -u URL, --URL URL  custom URL download
  -v, --verbose      increase output verbosity
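For readers who want to script around these options, the documented interface can be mirrored with argparse. This is a sketch that reproduces the flags listed in the help text; it is not the project's actual source, and the choice of store_true for the switches (including -v) is an assumption.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options listed in the help text."""
    parser = argparse.ArgumentParser(prog="trafilatura")
    parser.add_argument("-f", "--fast", action="store_true",
                        help="Fast (without fallback detection)")
    parser.add_argument("--nocomments", action="store_true",
                        help="Don't output any comments")
    parser.add_argument("--notables", action="store_true",
                        help="Don't output any table elements")
    parser.add_argument("--xml", action="store_true", help="XML output")
    parser.add_argument("--xmltei", action="store_true", help="XML TEI output")
    parser.add_argument("-u", "--URL", help="custom URL download")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="increase output verbosity")
    return parser
```

For example, build_parser().parse_args(["--xml", "-u", "https://example.org"]) yields a namespace with xml set to True and URL set to the given address.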
Additional information
Context
This module is part of a set of methods to derive metadata from web documents in order to build text corpora for computational linguistics and NLP analysis. For more information:
Barbaresi, Adrien. “Efficient construction of metadata-enhanced web corpora”, Proceedings of the 10th Web as Corpus Workshop (WAC-X), 2016.
Name
Trafilatura: Italian word for wire drawing.
Kudos to…
Alternatives
Most corresponding Python modules are not actively maintained; the following alternatives exist:
- dragnet features combined and machine-learning approaches, but requires many dependencies as well as extensive tuning
- python-readability cleans the page and preserves some markup, but is mostly geared towards news texts
- html2text converts HTML pages to Markdown and thus keeps the structure, though it doesn’t focus on main text extraction
Contact
Pull requests are welcome.
See my contact page for additional details.