trafilatura

Scrapes the main text of web pages while preserving some structure.

Project description

Code: https://github.com/adbar/trafilatura
Documentation: see README file
Issue tracker: https://github.com/adbar/trafilatura/issues

Robust extraction of main text content and boilerplate removal based on a combination of DOM-based examination, XPath expressions and rules. Given an HTML document, this library parses it, retrieves the main body text and converts it to XML or plain text, while preserving part of the text formatting and page structure.

In a nutshell, with Python:

>>> import requests, trafilatura
>>> response = requests.get('https://www.iana.org/about')
>>> trafilatura.process_record(response.text)
>>> # outputs main content in plain text format ...

On the command-line:

$ trafilatura -u https://www.sueddeutsche.de/politik/usa-pompeo-maas-merkel-iran-nordstream-1.4434358
$ # outputs main content in plain text format ...

Features

Scrapes the main text of web pages while preserving some structure. This task is also known as web scraping, boilerplate removal or boilerplate detection, DOM-based content extraction, main content identification, web page template detection, web page cleaning, web content extraction, or HTML text cleaning. The purpose is to find the relevant sections of a web page, usually the part displayed centrally, without the left or right bars, the header or the footer, but including potential titles and comments. In addition, the extraction focuses on original text and can help with the noise consisting of recurring elements (headers and footers, ads, links/blogroll, etc.). Distinguishing between the whole page and the main text content can help alleviate many quality problems related to web texts.
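To make the idea of boilerplate removal concrete, here is a minimal sketch that uses only the Python standard library. It is not how trafilatura works internally: the list of tags treated as boilerplate and the names MainTextParser and extract_main_text are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose content counts as boilerplate in this sketch (an assumption,
# not trafilatura's actual rule set).
BOILERPLATE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextParser(HTMLParser):
    """Collect paragraph text that is not nested inside boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # > 0 while inside a boilerplate element
        self.in_paragraph = False  # True while inside a kept <p> element
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.paragraphs[-1] += data


def extract_main_text(html_string):
    """Return the kept paragraphs joined by newlines."""
    parser = MainTextParser()
    parser.feed(html_string)
    parser.close()
    return "\n".join(p.strip() for p in parser.paragraphs if p.strip())


sample = (
    "<html><body><nav><p>Home | About</p></nav>"
    "<article><p>First paragraph of the article.</p>"
    "<p>Second paragraph.</p></article>"
    "<footer><p>Copyright notice</p></footer></body></html>"
)
main_text = extract_main_text(sample)
print(main_text)
```

Real pages rarely mark their boilerplate this cleanly, which is why a combination of rules, XPath expressions and a fallback algorithm is needed in practice.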

Because it relies on lxml, trafilatura is comparatively fast. It is also robust, as the additional generic jusText algorithm is used as a backup solution.

The output can be plain text or XML. In the latter case, basic formatting elements are preserved, such as text formatting (bold, italic, etc.) and page structure (paragraphs, titles, lists), which can be used for further processing.

Work in progress, currently experimental features:

Separate extraction of main text and comments
Duplicate detection at paragraph level using a least recently used (LRU) cache
Language detection on the extracted content
XML output compatible with the recommendations of the Text Encoding Initiative (XML TEI)
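The duplicate detection mentioned in the list above can be pictured with a small bounded cache. The following is only a sketch under stated assumptions: the class name ParagraphDeduplicator, the normalization step and the cache size are hypothetical and do not reflect the package's internal code.

```python
from collections import OrderedDict


class ParagraphDeduplicator:
    """Remember recently seen paragraphs in a bounded LRU cache."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self.seen = OrderedDict()

    def is_duplicate(self, paragraph):
        # Normalize whitespace and case so trivial variants match.
        key = " ".join(paragraph.split()).lower()
        if key in self.seen:
            self.seen.move_to_end(key)  # mark as most recently used
            return True
        self.seen[key] = True
        if len(self.seen) > self.max_size:
            self.seen.popitem(last=False)  # evict the least recently used entry
        return False


dedup = ParagraphDeduplicator(max_size=2)
paragraphs = [
    "Subscribe to our newsletter!",
    "Actual article text.",
    "Subscribe to  our newsletter!",  # same text, different spacing
]
unique = [p for p in paragraphs if not dedup.is_duplicate(p)]
print(unique)
```

Bounding the cache keeps memory use constant over long crawls, at the cost of forgetting paragraphs that have not recurred for a while.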

Installation

trafilatura is a Python package (compatible with Python 3.5 upwards) that is tested on Linux and macOS, is available on PyPI and can be installed using pip:

Install from package repository: pip install trafilatura

(Or use pip3 install trafilatura on systems where Python 2 and 3 are both globally installed and pip refers to Python 2.)

For all experimental functionality, please use pip install trafilatura[all]. Most notably, this adds language detection and faster processing of downloads. The cchardet package is currently not working on some macOS versions.

Direct installation of the latest version (see build status):

pip install git+https://github.com/adbar/trafilatura.git

(For dependency management see this thread)

With Python

Basic use

The simplest way to use trafilatura is as follows:

>>> import requests, trafilatura
>>> response = requests.get('https://www.iana.org/about')
>>> result = trafilatura.process_record(response.text)
>>> print(result) # newlines preserved, TXT output
>>> result = trafilatura.process_record(response.text, xml_output=True)
>>> print(result) # some formatting preserved in basic XML structure

The only required argument is the response element; the rest is optional. It is also possible to use a previously parsed tree (i.e. an lxml.html object) as input, which is then handled seamlessly.

>>> from lxml import html
>>> mytree = html.fromstring('<html><body><article><p>Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p></article></body></html>')
>>> trafilatura.process_record(mytree)
'Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\n'

Experimental feature: the target language can also be set using 2-letter codes (ISO 639-1). There will be no output if the detected language of the result does not match the target. No such filtering takes place if the identification component has not been installed (see above for installation instructions).

>>> url = 'https://www.iana.org/about'  # URL of the page fetched above
>>> result = trafilatura.process_record(response.text, url, target_language='de')
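The filtering behaviour described above can be summarized in a few lines. This is a sketch of the logic only, not the package's code: filter_by_language, the detect parameter and toy_detector are hypothetical names, and the toy detector merely stands in for a real language identifier.

```python
def filter_by_language(text, target_language=None, detect=None):
    """Return text, or None if its detected language differs from the target.

    detect is a hypothetical callable mapping a string to an ISO 639-1 code;
    pass None to simulate a missing identification component.
    """
    if target_language is None or detect is None:
        return text  # no filtering requested, or no detector available
    if detect(text) != target_language:
        return None  # language mismatch: suppress the output
    return text


def toy_detector(text):
    # Stand-in for a real language identifier, for illustration only.
    return "de" if " und " in f" {text.lower()} " else "en"


kept = filter_by_language("Katzen und Hunde", "de", toy_detector)
dropped = filter_by_language("Cats and dogs", "de", toy_detector)
unfiltered = filter_by_language("Cats and dogs", "de", None)
print(kept, dropped, unfiltered)
```

Returning the text unchanged when no detector is available mirrors the statement that no filtering happens if the identification component is not installed.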

For further configuration see the variables in settings.py.

On the command-line

A command-line interface is included. URLs can be used directly (-u/--URL):

$ trafilatura -u https://www.sueddeutsche.de/politik/usa-pompeo-maas-merkel-iran-nordstream-1.4434358
$ # outputs main content in plain text format ...
$ trafilatura --xml --URL "https://de.creativecommons.org/index.php/was-ist-cc/"
$ # outputs main text with basic XML structure ...

You can also pipe an HTML document (or a response body) to trafilatura:

$ wget -qO- "https://de.creativecommons.org/index.php/was-ist-cc/" | trafilatura

For usage instructions see trafilatura -h:

usage: trafilatura [-h] [-f] [--nocomments] [--notables] [--xml] [--xmltei] [-u URL] [-v]

optional arguments:

-h, --help: show this help message and exit
-f, --fast: Fast (without fallback detection)
--nocomments: Don’t output any comments
--notables: Don’t output any table elements
--xml: XML output
--xmltei: XML TEI output
-u URL, --URL URL: custom URL download
-v, --verbose: increase output verbosity
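For readers who want to see how such an interface can be declared, the options listed above map naturally onto argparse. The sketch below reproduces the flags and help texts from the usage message; how the package actually defines them is not shown in this document, so the action types (for example store_true for -v) are assumptions.

```python
import argparse


def build_parser():
    """Declare the options from the usage message (sketch, not the real CLI)."""
    parser = argparse.ArgumentParser(prog="trafilatura")
    parser.add_argument("-f", "--fast", action="store_true",
                        help="Fast (without fallback detection)")
    parser.add_argument("--nocomments", action="store_true",
                        help="Don't output any comments")
    parser.add_argument("--notables", action="store_true",
                        help="Don't output any table elements")
    parser.add_argument("--xml", action="store_true", help="XML output")
    parser.add_argument("--xmltei", action="store_true", help="XML TEI output")
    parser.add_argument("-u", "--URL", help="custom URL download")
    # Assumption: verbosity is modelled as a simple on/off switch here.
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="increase output verbosity")
    return parser


args = build_parser().parse_args(["--xml", "-u", "https://example.org/"])
print(args.xml, args.URL, args.fast)
```

Note that argparse keeps the capitalization of the long option, so the value is available as args.URL.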

Additional information

Context

This module is part of a set of methods for deriving information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). A significant challenge lies in extracting and pre-processing web texts so that they meet scientific expectations. For more information:

Barbaresi, Adrien. “The Vast and the Focused: On the need for domain-focused web corpora”, Proceedings of the 7th Workshop on Challenges in the Management of Large Corpora (CMLC-7), 2019.
Barbaresi, Adrien. “Efficient construction of metadata-enhanced web corpora”, Proceedings of the 10th Web as Corpus Workshop (WAC-X), 2016.

Name

Trafilatura: Italian word for wire drawing.

Kudos to…

Alternatives

Most comparable Python modules are not actively maintained. The following alternatives exist:

dragnet relies on combined features and machine-learning approaches, but requires many dependencies as well as extensive tuning
python-readability cleans the page and preserves some markup but is mostly geared towards news texts
goose can extract information for embedded content but doesn’t preserve markup and is not maintained
html2text converts HTML pages to Markdown and thus keeps the structure, though it doesn’t focus on main text extraction

Contact

Pull requests are welcome.

See my contact page for additional details.

Download files

Source Distribution

trafilatura-0.1.0.tar.gz (1.5 MB)

Uploaded Sep 25, 2019 Source

Built Distribution

trafilatura-0.1.0-py3-none-any.whl (24.4 kB)

Uploaded Sep 25, 2019 Python 3

File details

Details for the file trafilatura-0.1.0.tar.gz.

File metadata

Download URL: trafilatura-0.1.0.tar.gz
Upload date: Sep 25, 2019
Size: 1.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.8

File hashes

Hashes for trafilatura-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e61586a0fe84262977444b8a0c0f95d73b0ac4f4e94d0d8b01b2f187a55e9944`
MD5	`ebcdd6bf82ee1b1b00d3d8f86c5c51d3`
BLAKE2b-256	`f41d59442e22c0091f0ded7825b59a482756a8abad37612c3556bac133861780`
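
A published digest such as the SHA256 value above can be checked against a downloaded file with the standard library. The helper below is a minimal sketch: the name sha256_matches and the chunk size are arbitrary choices, and the demonstration uses a temporary file rather than the real archive.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_matches(path, expected_hex, chunk_size=65536):
    """Return True if the SHA256 digest of the file equals expected_hex."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return hmac.compare_digest(digest.hexdigest(), expected_hex.lower())


# Demonstration on a temporary file standing in for a downloaded archive.
payload = b"example payload"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
demo_path = tmp.name

expected = hashlib.sha256(payload).hexdigest()
ok = sha256_matches(demo_path, expected)
bad = sha256_matches(demo_path, "0" * 64)
os.remove(demo_path)
print(ok, bad)
```

For the source distribution, the call would take the path to trafilatura-0.1.0.tar.gz and the SHA256 digest listed in the table.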

File details

Details for the file trafilatura-0.1.0-py3-none-any.whl.

File metadata

Download URL: trafilatura-0.1.0-py3-none-any.whl
Upload date: Sep 25, 2019
Size: 24.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.8

File hashes

Hashes for trafilatura-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9acb4686892da163bf25a36ee9a8e5ad18cd6b1be1462bde998e74b55ce639f9`
MD5	`4bea3e008eff5ca16f924bc22abd6b75`
BLAKE2b-256	`7d633f85e938df523bcaddcf55d509510da98f464513aa00a15f80eaa507a94a`

trafilatura 0.1.0
