Articulo

A tiny library for extracting articles from HTML.
It can extract an article's content, as both plain text and HTML, along with its title.

Usage

Basic usage

This library is designed to be as simple as possible.
To start using it, import it and instantiate it with the link you want to parse as a parameter.

The library is also designed to work lazily.
It does not send any requests until you access one of its properties.

from articulo import Articulo

# Step 1: initializing Articulo instance
article = Articulo('https://info.cern.ch/')

# Step 2: requesting article properties. All properties resolve lazily.
print(article.title) # article title as a string
print(article.text) # article content as a string
print(article.markup) # article content as an html markup string
print(article.icon) # link to article icon
print(article.description) # article meta description
print(article.preview) # link to article meta preview image
print(article.keywords) # article meta keywords list
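
The lazy behaviour described above can be pictured with a minimal sketch. The `LazyPage` class, its `fetch_count` counter and the hard-coded HTML are hypothetical and only illustrate the pattern; they are not Articulo's internals, and no network request is made here.

```python
from functools import cached_property


class LazyPage:
    """Hypothetical stand-in showing how lazily resolved properties can work."""

    def __init__(self, url):
        self.url = url
        self.fetch_count = 0  # how many times the "download" has run

    @cached_property
    def _html(self):
        # A real implementation would download self.url here.
        # The sketch returns a fixed document so it runs offline.
        self.fetch_count += 1
        return "<html><head><title>Example</title></head><body>Hello</body></html>"

    @cached_property
    def title(self):
        # Accessing the title triggers the fetch on first use only.
        start = self._html.index("<title>") + len("<title>")
        end = self._html.index("</title>")
        return self._html[start:end]


page = LazyPage("https://info.cern.ch/")
print(page.fetch_count)  # nothing has been fetched yet
print(page.title)        # first access triggers the fetch
print(page.title)        # second access reuses the cached result
print(page.fetch_count)
```

Creating the object is cheap in this pattern; the cost is paid once, on the first property access.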

Verbose mode

If you want to see the whole process, pass verbose=True when creating the instance. This can be helpful for debugging.

from articulo import Articulo

# Initializing Articulo instance with verbose mode
article = Articulo('https://info.cern.ch/', verbose=True)

Controlling information loss coefficient

The whole idea of parsing article content is to find the part of the document that has the highest information density. To find that part, Articulo uses a so-called information loss coefficient. This coefficient determines the decrease in the text density of the document during parsing.

The default value is 0.7, which stands for a 70% decrease in information density. In most cases, this works fine.
If the parsing results are unsatisfactory, you can change it by passing the threshold parameter to the Articulo instance.

from articulo import Articulo

# Initializing Articulo instance with information loss coefficient of 30%
article = Articulo('https://info.cern.ch/', threshold=0.3)
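
The README does not spell out how the coefficient is applied, so the following is only one hypothetical reading of it, not Articulo's actual algorithm. The sketch walks down a toy document tree, always moving into the child that holds the most text, and stops as soon as that step would lose a fraction of text equal to or greater than the threshold. The `Node` class and the `narrow` function are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy document node: a tag, its own text, and child nodes."""
    tag: str
    own_text: str = ""
    children: list["Node"] = field(default_factory=list)

    def text_length(self) -> int:
        # Total text held by this node and everything below it.
        return len(self.own_text) + sum(c.text_length() for c in self.children)


def narrow(node: Node, threshold: float = 0.7) -> Node:
    """Descend into the text-richest child while the relative loss stays below threshold."""
    while node.children:
        total = node.text_length()
        if total == 0:
            break
        best = max(node.children, key=Node.text_length)
        loss = 1 - best.text_length() / total  # share of text dropped by this step
        if loss >= threshold:
            break
        node = best
    return node


article = Node("article", children=[Node("h1", "Title")]
               + [Node("p", "x" * 100) for _ in range(4)])
aside = Node("aside", "c" * 300)
page = Node("body", children=[
    Node("nav", "Home About"),
    Node("main", children=[article, aside]),
    Node("footer", "Footer"),
])

print(narrow(page, threshold=0.7).tag)  # article
print(narrow(page, threshold=0.3).tag)  # main
```

In this toy model a lower threshold stops the descent earlier and keeps more of the page, while a higher one narrows the selection further. Whether Articulo behaves the same way is not confirmed by the source.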

Providing headers

In some cases you need to send additional headers to fetch an article's HTML from a URL.
In that case, pass them with the http_headers parameter when you create a new Articulo instance.

from articulo import Articulo

# Initializing Articulo instance with custom user agent
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36' }
article = Articulo('https://info.cern.ch/', http_headers=headers)

Providing custom charset

Articulo uses the requests library to get HTML from a URL. requests tries to guess the encoding of the response from the HTTP headers. Although this works fine most of the time, in some cases it fails and you get garbled text instead of readable content. In that case, you can provide a custom charset with the def_charset parameter when you create a new Articulo instance.

from articulo import Articulo

# Initializing Articulo instance with cp1251 charset
article = Articulo('https://info.cern.ch/', def_charset='cp1251')
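
To see why a wrong charset produces garbled text, here is a small standard-library sketch that does not use Articulo or requests. It assumes a page whose bytes are encoded in cp1251 but decoded as ISO-8859-1, which is the fallback requests applies to text responses that declare no charset. The sample string is made up for the demonstration.

```python
# Bytes as a server might send them: Cyrillic text encoded in cp1251.
original = "Привет, мир"
raw = original.encode("cp1251")

# Decoding with the wrong charset does not raise an error,
# it silently yields unreadable characters.
garbled = raw.decode("iso-8859-1")

# Decoding with the right charset restores the text.
fixed = raw.decode("cp1251")

print(garbled)
print(fixed)
print(fixed == original)
```

Because the wrong decoding raises no error, the problem only shows up in the output, which is why an explicit charset override is useful.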
