
Scrapes the main text of web pages while preserving some structure.

Code:

https://github.com/adbar/trafilatura

Issue tracker:

https://github.com/adbar/trafilatura/issues

License:

GNU GPL v3; see LICENSE file

Robust extraction of main text content and boilerplate removal based on a combination of DOM-based examination, XPath expressions and rules. Given an HTML document, this library parses it, retrieves the main body text and converts it to XML or plain text, while preserving part of the text formatting and page structure.

>>> import requests, trafilatura
>>> response = requests.get('https://www.iana.org/about')
>>> trafilatura.process_record(response.text)
>>> # outputs main content in plain text format ...
$ trafilatura -u https://www.sueddeutsche.de/politik/usa-pompeo-maas-merkel-iran-nordstream-1.4434358
$ # outputs main content in plain text format ...

Features

Scrapes the main text of web pages while preserving some structure. This task is also known as boilerplate removal, DOM-based content extraction, main content identification or HTML text cleaning. The purpose is to find the relevant and original text sections of a web page and to remove the noise made up of recurring elements (headers and footers, ads, links/blogroll, etc.).
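To make the idea of boilerplate removal concrete, here is a deliberately naive sketch that uses only the standard library. It is not how trafilatura works internally (the library relies on lxml, XPath expressions and rules); the tag list and the names ``SKIPPED_TAGS``, ``MainTextParser`` and ``toy_extract`` are assumptions made up for this illustration.

```python
from html.parser import HTMLParser

# Tags whose content is treated as boilerplate in this toy example (an assumption).
SKIPPED_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextParser(HTMLParser):
    """Collect paragraph text that is not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # > 0 while inside a skipped element
        self.in_paragraph = False  # True between <p> and </p> in kept content
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Join the collected pieces and normalize whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self._buffer.append(data)


def toy_extract(html_text):
    """Return the kept paragraphs, one per line."""
    parser = MainTextParser()
    parser.feed(html_text)
    parser.close()
    return "\n".join(parser.paragraphs)
```

Real pages rarely mark their boilerplate this cleanly, which is why a combination of rules and a fallback algorithm is needed in practice.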

Because it relies on lxml, trafilatura is comparatively fast. It is also robust, as the additional generic jusText algorithm is used as a backup solution.
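The backup strategy can be pictured as a simple fallback pattern. The sketch below is only an illustration under stated assumptions: the function name ``extract_with_fallback``, the ``MIN_LENGTH`` threshold and the conditions that trigger the backup are hypothetical and do not describe the actual criteria used by the library.

```python
MIN_LENGTH = 100  # hypothetical threshold, in characters


def extract_with_fallback(html_text, primary, backup, min_length=MIN_LENGTH):
    """Run `primary`; fall back to `backup` if the result looks too short.

    `primary` and `backup` are any callables that take an HTML string
    and return the extracted text (or None).
    """
    try:
        result = primary(html_text)
    except Exception:
        # In this sketch, any failure of the primary extractor triggers the backup.
        result = None
    if result and len(result.strip()) >= min_length:
        return result
    return backup(html_text)
```

Passing the extractors as arguments keeps the sketch independent of any particular library; a rule-based extractor could be supplied as ``primary`` and a generic algorithm such as jusText as ``backup``.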

The result of processing can be in plain text or XML format. In the latter case, basic formatting elements are preserved such as text formatting (bold, italic, etc.) and page structure (paragraphs, titles, lists), which can be used for further processing.

Work in progress, currently experimental features:

  • Separate extraction of main text and comments

  • Duplicate detection at paragraph level using a least recently used (LRU) cache

  • Language detection on the extracted content

  • XML output compatible with the recommendations of the Text Encoding Initiative (XML TEI)
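As an illustration of the duplicate detection item above, paragraph-level deduplication with a least recently used (LRU) cache could look roughly like the following. This is a minimal sketch and not the library's implementation; the names ``ParagraphCache`` and ``drop_duplicates`` and the default cache size are assumptions.

```python
from collections import OrderedDict


class ParagraphCache:
    """Remember the most recently seen paragraphs, up to `maxsize` entries."""

    def __init__(self, maxsize=1000):
        self.maxsize = maxsize
        self._seen = OrderedDict()

    def is_duplicate(self, paragraph):
        key = " ".join(paragraph.split())   # normalize whitespace
        if key in self._seen:
            self._seen.move_to_end(key)     # mark as most recently used
            return True
        self._seen[key] = True
        if len(self._seen) > self.maxsize:
            self._seen.popitem(last=False)  # evict the least recently used entry
        return False


def drop_duplicates(paragraphs, cache):
    """Keep only the paragraphs that the cache has not seen recently."""
    return [p for p in paragraphs if not cache.is_duplicate(p)]
```

Sharing one cache across many documents lets recurring paragraphs (for example cookie notices) be filtered out, while the size bound keeps memory use constant.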

Installation

trafilatura is a Python 3 package that is available on PyPI and can be installed using pip:

pip install trafilatura

(Or use ``pip3 install trafilatura`` on systems where Python 2 and 3 are both globally installed and pip refers to Python 2.)

Direct installation of the latest version over pip is possible (see build status):

pip install git+https://github.com/adbar/trafilatura.git

With Python

Basic use

The simplest way to use trafilatura is as follows:

>>> import requests, trafilatura
>>> response = requests.get('https://www.iana.org/about')
>>> result = trafilatura.process_record(response.text)
>>> print(result) # newlines preserved, TXT output
>>> result = trafilatura.process_record(response.text, xml_output=True)
>>> print(result) # some formatting preserved in basic XML structure

The only required argument is the response element; the rest is optional. It is also possible to use a previously parsed tree (i.e. an lxml.html object) as input, which is then handled seamlessly.

>>> from lxml import html
>>> mytree = html.fromstring('<html><body><article><p>Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p></article></body></html>')
>>> trafilatura.process_record(mytree)
'Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\n'

Experimental feature: the target language can also be set using 2-letter codes (ISO 639-1). There will be no output if the detected language of the result does not match.

>>> # url is assumed to hold the address of the page as a string
>>> result = trafilatura.process_record(response.text, url, target_language='de')

For further configuration see the variables in settings.py.

On the command-line

A command-line interface is included. URLs can be used directly (-u/--URL):

$ trafilatura -u https://www.sueddeutsche.de/politik/usa-pompeo-maas-merkel-iran-nordstream-1.4434358
$ # outputs main content in plain text format ...
$ trafilatura --xml --URL "https://de.creativecommons.org/index.php/was-ist-cc/"
$ # outputs main text with basic XML structure ...

You can also pipe an HTML document (for example a response body) to trafilatura:

$ wget -qO- "https://de.creativecommons.org/index.php/was-ist-cc/" | trafilatura
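For readers who want to build a similar pipe-friendly script of their own, reading a piped document in Python can be sketched as follows. This is a generic illustration and not taken from the trafilatura source; the function name ``read_piped_document`` is hypothetical.

```python
import sys


def read_piped_document(stream=None):
    """Return the full document piped to the script, or None if nothing is piped."""
    stream = sys.stdin if stream is None else stream
    if stream.isatty():
        return None  # interactive terminal: no document was piped in
    return stream.read()
```

The returned string can then be handed to whatever extraction function the script uses.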

For usage instructions, see trafilatura -h.

Additional information

Context

This module is part of a set of methods for deriving metadata from web documents in order to build text corpora for computational linguistics and NLP analysis. For more information:

Name

Trafilatura: Italian word for wire drawing.

Kudos to…

Contact

Pull requests are welcome.

See my contact page for additional details.
