A lib to work with html and web data

Project description

datahtml


datahtml is a library for crawling and extracting data from HTML and XML content.

Datahtml lets you:

  • Extract ld+json data from HTML
  • Extract frequently used meta tags from HTML (those used for SEO and social media, among others)
  • Extract article data from an HTML page, usually from newspaper sites
  • Parse RSS feeds from sites
  • Crawl some specific social media sites like Google and YouTube

Under the hood, datahtml uses libraries like BeautifulSoup, Newspaper2k and feedparser, among others, but it takes an opinionated approach to crawling based on our experience doing so.
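
To make the first two items in the feature list concrete, here is a minimal sketch of what extracting ld+json data and meta tags from a page involves. It uses only the Python standard library. It does not use datahtml's API and is not a description of how datahtml implements these features; MetadataParser, extract_metadata and the SAMPLE page are hypothetical names and data chosen for illustration.

```python
import json
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect ld+json blocks and <meta> tags from an HTML document.

    Hypothetical helper for illustration; not part of datahtml.
    """

    def __init__(self):
        super().__init__()
        self.ld_json = []          # parsed JSON-LD objects, in document order
        self.meta = {}             # meta "property" or "name" -> "content"
        self._in_ld_json = False   # True while inside an ld+json <script>
        self._buffer = []          # raw text chunks of the current script

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_ld_json = True
            self._buffer = []
        elif tag == "meta":
            # Open Graph tags use "property"; classic SEO tags use "name".
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]

    def handle_data(self, data):
        if self._in_ld_json:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld_json:
            self._in_ld_json = False
            try:
                self.ld_json.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the page


def extract_metadata(html):
    """Return (ld_json_blocks, meta_tags) found in the given HTML string."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.ld_json, parser.meta


SAMPLE = """<html><head>
<meta property="og:title" content="Example headline">
<meta name="description" content="A short summary">
<script type="application/ld+json">{"@type": "NewsArticle", "headline": "Example headline"}</script>
</head><body></body></html>"""

ld_json, meta = extract_metadata(SAMPLE)
print(ld_json)
print(meta)
```

Real pages are messier than this sample (several ld+json blocks, invalid JSON, duplicated meta keys), which is the kind of case a dedicated library is meant to handle.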

Quickstart

pip install datahtml

from datahtml import web, crawler

c = crawler.LocalCrawler()
w = web.download("https://www.infobae.com", crawler=c)
w.links()

License

datahtml is distributed under the terms of the MPL-2.0 license.

Download files

Source Distribution

datahtml-0.6.0.tar.gz (665.5 kB)

Uploaded Source

Built Distribution

datahtml-0.6.0-py3-none-any.whl (79.5 kB)

Uploaded Python 3

File details

Details for the file datahtml-0.6.0.tar.gz.

File metadata

  • Download URL: datahtml-0.6.0.tar.gz
  • Size: 665.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: python-httpx/0.23.0

File hashes

Hashes for datahtml-0.6.0.tar.gz
Algorithm Hash digest
SHA256 bba0f54648b8775da325cf8b8b6c0572983a79b3d4a049a1efec4b0f49fc0fbe
MD5 1741c0b6cdbce3e22a54dce9f43722b2
BLAKE2b-256 39b2a9e8810db6368ddd66ff352e8e2949fb7e67c145afb35b4c08a5c7a39a90

File details

Details for the file datahtml-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: datahtml-0.6.0-py3-none-any.whl
  • Size: 79.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: python-httpx/0.23.0

File hashes

Hashes for datahtml-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2de61dde0fc745e3a06afd6889f9ac53365a615f5073298c1095516bf74f8eb4
MD5 9ba691d9e2746f56eb97b95d280699da
BLAKE2b-256 4c2c056ec3bf258f682d9fa362d3598a2ed4b2f9308db9c60a81b780ade35712
