trafilatura

Downloads web pages, scrapes main text and comments while preserving some structure, and converts to TXT, CSV, XML & TEI-XML

Project description

Trafilatura downloads web pages, scrapes main text and comments while preserving some structure, and converts to TXT, CSV, XML & TEI-XML. All the operations needed are handled seamlessly.

In a nutshell, with Python:

>>> import trafilatura
>>> downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
>>> trafilatura.extract(downloaded)
# outputs main content and comments as plain text ...

On the command-line:

$ trafilatura -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# outputs main content and comments as plain text ...

Description

This library performs a robust extraction which focuses on the main content, which is usually the part displayed centrally, without the left or right bars, the header or the footer, but including potential titles and comments. Trafilatura can seamlessly download, parse and convert web documents. It scrapes the main body text while preserving part of the text formatting and page structure, a task also known as web scraping, boilerplate removal, DOM-based content extraction, main content identification, or web page cleaning.

Distinguishing between the whole page and its essential parts can alleviate many quality problems related to web texts, since it reduces the noise caused by recurring elements (headers and footers, ads, links/blogroll, etc.). The extraction has to be precise enough not to miss texts or discard valid documents. It also has to be reasonably fast, as it is expected to run in production on millions of documents.
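To illustrate what "noise consisting of recurring elements" means, here is a minimal sketch that drops lines repeated across most pages of a site. It is not the method trafilatura uses (which works on the HTML tree); the function name `strip_recurring_lines` and the `min_share` threshold are assumptions made for this example only.

```python
from collections import Counter


def strip_recurring_lines(pages, min_share=0.8):
    """Remove lines that occur on at least `min_share` of the given pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_share * len(pages)
    recurring = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [
            line.strip()
            for line in page.splitlines()
            if line.strip() and line.strip() not in recurring
        ]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "Home | About\nFirst article text.\n(c) Example Site",
    "Home | About\nSecond article text.\n(c) Example Site",
    "Home | About\nThird article text.\n(c) Example Site",
]
cleaned = strip_recurring_lines(pages)
print(cleaned)
```

The navigation bar and the footer appear on every page and are removed, while the article lines, which occur only once, are kept.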

Features

Seamless download and extraction: URLs, HTML files or parsed HTML trees as input
Focus on main text and/or comments
Formatting and structural elements preserved: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting (experimental)
Output in plain text (minimal formatting), CSV (with metadata, tab-separated values) or XML format (for metadata and structure)
Extraction of metadata (currently title and date, more to come)
Computationally efficient (relies on lxml)
Robust extraction and generic jusText algorithm used as fallback
Optional language detection on the extracted content

Evaluation and alternatives

For first experimental results see the evaluation page and evaluation script.

2020-01-24 – 50 documents, 123 positive and 142 negative segments
Python Package	Precision	Recall	F-Score	Time
everything with markup	0.481	0.902	0.627	0
inscriptis 1.0 (html to txt)	0.494	0.992	0.659	0.44
newspaper3k 0.2.8	0.893	0.545	0.677	2.53
justext 2.2.0	0.880	0.593	0.709	1.21
goose3 3.1.6	0.915	0.610	0.732	3.88
readability-lxml 0.7.1	0.873	0.724	0.791	1.15
trafilatura 0.3.1 (rule-based)	0.853	0.894	0.873	0.86
trafilatura 0.3.1 (+ fallback)	0.873	0.951	0.911	1.09
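The F-Score column appears to be the harmonic mean of precision and recall (F1); that reading is an assumption, since the table does not state the formula. Under that assumption, a short check reproduces the published values to within the rounding of the inputs:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the table above.
rows = {
    "goose3 3.1.6": (0.915, 0.610),
    "trafilatura 0.3.1 (rule-based)": (0.853, 0.894),
}
scores = {name: round(f_score(p, r), 3) for name, (p, r) in rows.items()}
print(scores)
```

Because the table lists precision and recall already rounded to three decimals, recomputed scores for other rows can differ from the published ones by 0.001.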

Installation

This Python package is tested on Linux, macOS and Windows systems and is compatible with Python 3.5 upwards (see install Python guide). It is available on the package repository PyPI and can be installed with pip or pipenv:

$ pip install trafilatura # pip3 install on systems where both Python 2 and 3 are installed
$ pip install -U trafilatura # to make sure you have the latest version
$ pip install git+https://github.com/adbar/trafilatura.git # latest available code (see build status above)

A few additional libraries can be installed for extended functionality and faster processing: extraction of publication date (htmldate), language detection (langid), and faster processing of downloads (cchardet, currently not working on some macOS versions).

$ pip install trafilatura[metadata] # metadata extraction
$ pip install trafilatura[all] # all additional functionality

You can also install or update the packages separately; trafilatura will detect which ones are present on your system and opt for the best available combination.
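To see which of the optional libraries named above are importable in the current environment, a check along the following lines can help. This is a sketch using only the standard library, not the detection code inside trafilatura; the helper name `available_extras` is made up for the example.

```python
from importlib.util import find_spec

# Optional libraries mentioned in the installation notes.
OPTIONAL_MODULES = ("htmldate", "langid", "cchardet")


def available_extras(modules=OPTIONAL_MODULES):
    """Map each module name to True if it can be imported, else False."""
    return {name: find_spec(name) is not None for name in modules}


extras = available_extras()
for name, present in extras.items():
    print(name, "installed" if present else "missing")
```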

For information on dependency management of Python packages, see this discussion thread.

Usage with Python

>>> import trafilatura
>>> downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
>>> downloaded is None # assuming the download was successful
False
>>> result = trafilatura.extract(downloaded) # trafilatura.process_record is deprecated but works
>>> print(result)
# newlines preserved, TXT output ...
>>> result = trafilatura.extract(downloaded, xml_output=True)
>>> print(result)
# some formatting preserved in basic XML structure ...

The only required argument is the input document (here a downloaded HTML file), the rest is optional.

The inclusion of tables and comments can be deactivated at a function call. The use of a fallback algorithm (currently jusText) can also be bypassed in fast mode:

>>> result = trafilatura.extract(downloaded, include_comments=False) # no comments in output
>>> result = trafilatura.extract(downloaded, include_tables=False) # skip tables examination
>>> result = trafilatura.extract(downloaded, no_fallback=True) # skip justext algorithm used as fallback

These values combined probably provide the fastest execution times:

>>> result = trafilatura.extract(downloaded, include_comments=False, include_tables=False, no_fallback=True)
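The calls above assume that the download succeeded. Since the earlier example checks `downloaded is None`, a failed download presumably yields `None`, and a small wrapper can guard against it. The sketch below takes the fetch and extract functions as parameters so that it runs without network access; `fetch_and_extract` and `FAST_OPTIONS` are names invented for this example.

```python
# Keyword arguments taken from the "fastest execution" call above.
FAST_OPTIONS = {
    "include_comments": False,
    "include_tables": False,
    "no_fallback": True,
}


def fetch_and_extract(url, fetch, extract, **options):
    """Download `url` with `fetch` and pass the result to `extract`.

    Returns None when the download fails (assumed to be signalled by None).
    """
    downloaded = fetch(url)
    if downloaded is None:
        return None
    return extract(downloaded, **options)


# With the library installed, the call would look like this:
#   import trafilatura
#   text = fetch_and_extract(url, trafilatura.fetch_url,
#                            trafilatura.extract, **FAST_OPTIONS)
```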

The input can consist of a previously parsed tree (i.e. an lxml.html object), which is then handled seamlessly:

>>> from lxml import html
>>> mytree = html.fromstring('<html><body><article><p>Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p></article></body></html>')
>>> trafilatura.extract(mytree)
'Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\n'

Experimental feature: the target language can also be set using 2-letter codes (ISO 639-1). There will be no output if the detected language of the result does not match. No such filtering takes place if the identification component has not been installed (see above for installation instructions).

>>> result = trafilatura.extract(downloaded, target_language='de')
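The filtering rule described above can be summarized in a few lines. This is a sketch of the documented behaviour, not the package's internal code; `filter_by_language` and its `detect` parameter are hypothetical, and `detect` stands for any callable returning a 2-letter code.

```python
def filter_by_language(text, target_language=None, detect=None):
    """Return `text` unless a detector is available and disagrees.

    - No target language set, or no detector installed: no filtering.
    - Detector available: keep the text only if the codes match.
    """
    if target_language is None or detect is None:
        return text
    return text if detect(text) == target_language else None


# With langid installed, a detector could be built like this
# (langid.classify returns a (language, score) pair):
#   import langid
#   detect = lambda text: langid.classify(text)[0]
```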

All currently available options, along with their default values:

>>> trafilatura.extract(downloaded, url=None, record_id='0001', no_fallback=False, include_comments=True, csv_output=False, xml_output=False, tei_output=False, tei_validation=False, target_language=None, include_tables=True, include_formatting=False)

For further configuration see the variables in settings.py and re-compile the package locally.

On the command-line

A command-line interface is included. For general instructions see Command Prompt (tutorial for Windows systems), How to use the Terminal command line in macOS, or An introduction to the Linux Terminal.

URLs can be used directly (-u/--URL):

$ trafilatura -u https://de.creativecommons.org/index.php/was-ist-cc/
# outputs main content in plain text format ...
$ trafilatura --xml --URL "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# outputs main text with basic XML structure ...

You can also pipe an HTML document (or a response body) to trafilatura:

$ cat myfile.html | trafilatura # use the contents of an already existing file
$ wget -qO- "https://de.creativecommons.org/index.php/was-ist-cc/" | trafilatura # use a custom download

The -i/--inputfile option allows for bulk download and processing of a list of URLs from a file listing one link per line. Keep in mind that there is a tacit scraping etiquette: a server may block you after you download a certain number of pages from the same website/domain in a short period of time. In addition, some websites may block the user agent of the requests library. For these reasons, trafilatura waits a few seconds between requests by default.
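The same etiquette can be followed when driving the library from Python. The sketch below reads one URL per line and pauses between requests; the helper names, the 2-second default and the injectable `sleep` parameter are assumptions for this example, not trafilatura's own batch code.

```python
import time


def read_url_list(lines):
    """Return the non-empty, stripped lines of a URL list."""
    return [line.strip() for line in lines if line.strip()]


def process_politely(urls, handler, delay=2.0, sleep=time.sleep):
    """Call `handler` on each URL, pausing `delay` seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # no pause before the first request
        results.append(handler(url))
    return results


# Demonstration without real waiting: pauses are recorded instead of slept.
waits = []
urls = read_url_list(["https://example.org/a\n", "\n", "https://example.org/b\n"])
texts = process_politely(urls, handler=str.upper, delay=2.0, sleep=waits.append)
print(texts, waits)
```

In real use, `handler` would download and extract one page, and `lines` would be an open file such as `open("urls.txt")`.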

For all usage instructions see trafilatura -h:

usage: trafilatura [-h] [-f] [--formatting] [-i INPUTFILE] [--nocomments] [--notables] [--xml] [--xmltei] [-u URL] [-v]

optional arguments:

-h, --help: show this help message and exit
-f, --fast: fast (without fallback detection)
--formatting: include text formatting (bold, italic, etc.)
-i INPUTFILE, --inputfile INPUTFILE: name of input file for batch processing
--nocomments: don’t output any comments
--notables: don’t output any table elements
--csv: CSV output
--xml: XML output
--xmltei: XML TEI output
--validate: validate TEI output
-u URL, --URL URL: custom URL download
-v, --verbose: increase output verbosity
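For readers who want to script around these options, the interface can be approximated with argparse. This is a reconstruction from the help text above, not the package's actual source; in particular, the mapping to the keyword arguments listed under "Usage with Python" and the counting behaviour of -v are assumptions.

```python
import argparse


def build_parser():
    """Approximate the command-line options listed above."""
    parser = argparse.ArgumentParser(prog="trafilatura")
    parser.add_argument("-f", "--fast", action="store_true",
                        help="fast (without fallback detection)")
    parser.add_argument("--formatting", action="store_true",
                        help="include text formatting (bold, italic, etc.)")
    parser.add_argument("-i", "--inputfile",
                        help="name of input file for batch processing")
    parser.add_argument("--nocomments", action="store_true",
                        help="don't output any comments")
    parser.add_argument("--notables", action="store_true",
                        help="don't output any table elements")
    parser.add_argument("--csv", action="store_true", help="CSV output")
    parser.add_argument("--xml", action="store_true", help="XML output")
    parser.add_argument("--xmltei", action="store_true", help="XML TEI output")
    parser.add_argument("--validate", action="store_true",
                        help="validate TEI output")
    parser.add_argument("-u", "--URL", help="custom URL download")
    parser.add_argument("-v", "--verbose", action="count", default=0,
                        help="increase output verbosity")
    return parser


def to_extract_options(args):
    """Translate parsed flags into extract() keyword arguments (assumed mapping)."""
    return {
        "no_fallback": args.fast,
        "include_formatting": args.formatting,
        "include_comments": not args.nocomments,
        "include_tables": not args.notables,
        "csv_output": args.csv,
        "xml_output": args.xml,
        "tei_output": args.xmltei,
        "tei_validation": args.validate,
    }


args = build_parser().parse_args(["-f", "--xml", "-u", "https://example.org"])
options = to_extract_options(args)
print(options)
```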

License

trafilatura is distributed under the GNU General Public License v3.0

GPL and free software licensing: What’s in it for business?

Going further

Online documentation: trafilatura.readthedocs.io

Trafilatura: Italian word for wire drawing.

In order to gather web documents, it can be useful to download portions of a website programmatically. Here is how to use sitemaps to crawl websites.

Tutorial video in German by Simon Meier-Vieracker: Content von Webseiten laden mit Trafilatura.

Tutorials in German by Noah Bubenhofer: Download von Web-Daten & Daten aufbereiten und verwalten.

Roadmap

[-] Duplicate detection at sentence, paragraph and document level using a least recently used (LRU) cache
[-] XML output compatible with the recommendations of the Text Encoding Initiative
[-] Metadata integration
[-] Language detection on the extracted content
[-] Preservation of in-line text formatting (bold, italic, etc.)
[ ] Configuration and extraction parameters
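As an illustration of the first roadmap item, duplicate detection with a least recently used (LRU) cache could look roughly like the following. This is a generic sketch built on `collections.OrderedDict`, not the implementation planned or shipped in trafilatura; the class and function names are made up.

```python
from collections import OrderedDict


class LRUSeen:
    """Remember up to `maxsize` text segments, evicting the least recently used."""

    def __init__(self, maxsize=1000):
        self.maxsize = maxsize
        self._items = OrderedDict()

    def seen_before(self, text):
        """Return True if `text` is in the cache; otherwise store it."""
        key = text.strip()
        if key in self._items:
            self._items.move_to_end(key)  # mark as recently used
            return True
        self._items[key] = True
        if len(self._items) > self.maxsize:
            self._items.popitem(last=False)  # drop the oldest entry
        return False


def drop_duplicates(segments, cache):
    """Keep only the segments the cache has not seen yet."""
    return [segment for segment in segments if not cache.seen_before(segment)]


cache = LRUSeen(maxsize=2)
unique = drop_duplicates(["a", "b", "a", "c", "a"], cache)
print(unique)
```

The same cache can be applied to sentences, paragraphs or whole documents; the bounded size keeps memory use constant, at the cost of forgetting segments that have not recurred for a while.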

Contributing

Contributions are welcome!

Feel free to file bug reports on the issues page.

Thanks to these contributors who submitted features and bugfixes:

Kudos to the following software libraries:

lxml, jusText, cchardet

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). A significant challenge resides in the ability to extract and pre-process web texts to meet scientific expectations: Web corpus construction involves numerous design decisions, and this software package can help facilitate collection and enhance corpus quality.

DOI: 10.5281/zenodo.3460969

Barbaresi, A. “Generic Web Content Extraction with Open-Source Software”, Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
Barbaresi, A. “Efficient construction of metadata-enhanced web corpora”, Proceedings of the 10th Web as Corpus Workshop (WAC-X), 2016.

You can contact me via my contact page or GitHub.

Release history

2.0.0

Dec 3, 2024

1.12.2

Sep 10, 2024

1.12.1

Aug 20, 2024

1.12.0

Jul 30, 2024

1.11.0

Jun 27, 2024

1.10.0

May 30, 2024

1.9.0

May 2, 2024

1.8.1

Apr 3, 2024

1.8.0

Mar 20, 2024

1.7.0

Jan 25, 2024

1.6.4

Jan 8, 2024

1.6.3

Nov 29, 2023

1.6.2

Sep 6, 2023

1.6.1

Jun 15, 2023

1.6.0

May 11, 2023

1.5.0

Mar 30, 2023

1.4.1

Jan 19, 2023

1.4.0

Oct 18, 2022

1.3.0

Jul 29, 2022

1.2.2

May 18, 2022

1.2.1

May 2, 2022

1.2.0

Mar 7, 2022

1.1.0

Feb 21, 2022

1.0.0

Nov 30, 2021

0.9.3

Oct 21, 2021

0.9.2

Oct 6, 2021

0.9.1

Aug 2, 2021

0.9.0

Jun 15, 2021

0.8.2

Apr 21, 2021

0.8.1

Mar 11, 2021

0.8.0

Feb 19, 2021

0.7.0

Jan 4, 2021

0.6.1

Dec 2, 2020

0.6.0

Nov 6, 2020

0.5.2

Sep 22, 2020

0.5.1

Jul 15, 2020

0.5.0

Jun 2, 2020

0.4.1

Apr 23, 2020

0.4

Mar 19, 2020

This version

0.3.1

Jan 24, 2020

0.3.0

Jan 13, 2020

0.2.1

Dec 3, 2019

0.2.0

Nov 27, 2019

0.1.1

Oct 8, 2019

0.1.0

Sep 25, 2019

0.0.5

Sep 16, 2019

0.0.4

Aug 23, 2019

0.0.3

Aug 9, 2019

0.0.2

Aug 2, 2019

0.0.1

Jul 17, 2019

Download files

Source Distribution

trafilatura-0.3.1.tar.gz (1.9 MB view details)

Uploaded Jan 24, 2020 Source

Built Distribution

trafilatura-0.3.1-py3-none-any.whl (138.0 kB view details)

Uploaded Jan 24, 2020 Python 3

File details

Details for the file trafilatura-0.3.1.tar.gz.

File metadata

Download URL: trafilatura-0.3.1.tar.gz
Upload date: Jan 24, 2020
Size: 1.9 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.1.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.9

File hashes

Hashes for trafilatura-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`94827b42777c09e2af095f24b13413f4fef9cc7cb748129ee25c04841bf0268c`
MD5	`1900c08e8db879d7996c8ae78ddcb0c7`
BLAKE2b-256	`1913643dda98e70f98908e890128c2ca4b0b83cfffc690b5b2cd8db8b3aea9de`

File details

Details for the file trafilatura-0.3.1-py3-none-any.whl.

File metadata

Download URL: trafilatura-0.3.1-py3-none-any.whl
Upload date: Jan 24, 2020
Size: 138.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.1.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.9

File hashes

Hashes for trafilatura-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4a400b06497655a3db802b6dede86cad810bdbf0f2d31adaddd961265652922e`
MD5	`da05926a609eb8c8302d28d9a8d6e72c`
BLAKE2b-256	`947330aeac864bbbc04c4c967490504019f638d708a9d8f50fa5252153a1e763`
