
Data Extractor

Combine XPath, CSS Selectors, and JSONPath for web data extraction.

Installation

pip install data-extractor

Usage

Download the RSS sample file

wget http://www.rssboard.org/files/sample-rss-2.xml

Simple Extractor

from pathlib import Path

from lxml.etree import fromstring


root = fromstring(Path("sample-rss-2.xml").read_text())

Using XPathExtractor to extract the RSS channel title

from data_extractor.lxml import XPathExtractor


XPathExtractor("//channel/title/text()").extract_first(root)
# 'Liftoff News'
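The same lookup can be checked without installing anything extra: the standard library's `xml.etree.ElementTree` supports a small subset of XPath (note that `text()` is not in that subset, so `findtext` is used instead). The inline snippet below is a trimmed stand-in for the downloaded sample file, not the file itself.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for sample-rss-2.xml.
RSS = """<rss version="2.0">
  <channel>
    <title>Liftoff News</title>
    <link>http://liftoff.msfc.nasa.gov/</link>
  </channel>
</rss>"""

root = ET.fromstring(RSS)

# Rough equivalent of XPathExtractor("//channel/title/text()").extract_first(root);
# ElementTree's limited XPath has no text(), so findtext returns the text node.
title = root.findtext("./channel/title")
print(title)  # Liftoff News
```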

Using TextCSSExtractor to extract all RSS item links

from data_extractor.lxml import TextCSSExtractor


TextCSSExtractor("item>link").extract(root)
# ['http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp',
#  'http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp',
#  'http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp']
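The CSS selector `item>link` matches every `<link>` element that is a direct child of an `<item>`. The same result can be approximated with a standard-library path expression; again, the inline snippet is a trimmed stand-in for the sample file.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for sample-rss-2.xml.
RSS = """<rss version="2.0">
  <channel>
    <item><link>http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp</link></item>
    <item><link>http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp</link></item>
  </channel>
</rss>"""

root = ET.fromstring(RSS)

# "item>link" (CSS) corresponds to ".//item/link" here:
# a link that is a direct child of an item, anywhere in the tree.
links = [el.text for el in root.findall(".//item/link")]
print(links)
```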

Using AttrCSSExtractor to extract the rss version attribute

from data_extractor.lxml import AttrCSSExtractor


AttrCSSExtractor("rss", attr="version").extract_first(root)
# '2.0'
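Attribute extraction has a simple standard-library parallel as well: once the document is parsed, `AttrCSSExtractor("rss", attr="version")` amounts to reading an attribute off the matched element.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<rss version="2.0"><channel/></rss>')

# The parsed root element is <rss>, so the version attribute
# is a plain dictionary-style lookup on it.
version = root.get("version")
print(version)  # 2.0
```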

Complex Extractor

Define the ChannelItem and Channel classes, then extract the data

import json

from pathlib import Path

from data_extractor.item import Field, Item
from data_extractor.lxml import XPathExtractor

from lxml.etree import fromstring


class ChannelItem(Item):
    title = Field(XPathExtractor("./title/text()"), default="")
    link = Field(XPathExtractor("./link/text()"), default="")
    description = Field(XPathExtractor("./description/text()"))
    publish_date = Field(XPathExtractor("./pubDate/text()"))
    guid = Field(XPathExtractor("./guid/text()"))

class Channel(Item):
    title = Field(XPathExtractor("./title/text()"))
    link = Field(XPathExtractor("./link/text()"))
    description = Field(XPathExtractor("./description/text()"))
    language = Field(XPathExtractor("./language/text()"))
    publish_date = Field(XPathExtractor("./pubDate/text()"))
    last_build_date = Field(XPathExtractor("./lastBuildDate/text()"))
    docs = Field(XPathExtractor("./docs/text()"))
    generator = Field(XPathExtractor("./generator/text()"))
    managing_editor = Field(XPathExtractor("./managingEditor/text()"))
    web_master = Field(XPathExtractor("./webMaster/text()"))

    items = ChannelItem(XPathExtractor("./item"), is_many=True)
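The core of the declarative pattern above is: each Field pairs an expression with an optional default, and an Item collects its Field attributes into a dict. A minimal stdlib-only sketch of that idea follows; the names `SimpleField`, `SimpleItem`, and `MiniChannelItem` are illustrative and are not the library's actual implementation.

```python
import xml.etree.ElementTree as ET

_SENTINEL = object()

class SimpleField:
    """Hypothetical stand-in for Field: a path plus an optional default."""
    def __init__(self, path, default=_SENTINEL):
        self.path = path
        self.default = default

    def extract(self, node):
        value = node.findtext(self.path)
        if value is None:
            if self.default is _SENTINEL:
                raise ValueError(f"nothing matched {self.path!r} and no default set")
            return self.default
        return value

class SimpleItem:
    """Hypothetical stand-in for Item: extract every SimpleField attribute as a dict."""
    @classmethod
    def extract(cls, node):
        return {
            name: field.extract(node)
            for name, field in vars(cls).items()
            if isinstance(field, SimpleField)
        }

class MiniChannelItem(SimpleItem):
    title = SimpleField("./title", default="")
    link = SimpleField("./link", default="")

node = ET.fromstring("<item><link>http://example.com/a</link></item>")
print(MiniChannelItem.extract(node))  # {'title': '', 'link': 'http://example.com/a'}
```

This also shows why the sample output below contains `"title": ""` for one item: when the expression matches nothing, the declared default is used instead of raising.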

Extract the RSS data from the file

root = fromstring(Path("sample-rss-2.xml").read_text())
rv = Channel(XPathExtractor("//channel")).extract(root)
print(json.dumps(rv, indent=2))

Output:

{
  "title": "Liftoff News",
  "link": "http://liftoff.msfc.nasa.gov/",
  "description": "Liftoff to Space Exploration.",
  "language": "en-us",
  "publish_date": "Tue, 10 Jun 2003 04:00:00 GMT",
  "last_build_date": "Tue, 10 Jun 2003 09:41:01 GMT",
  "docs": "http://blogs.law.harvard.edu/tech/rss",
  "generator": "Weblog Editor 2.0",
  "managing_editor": "editor@example.com",
  "web_master": "webmaster@example.com",
  "items": [
    {
      "title": "Star City",
      "link": "http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp",
      "description": "How do Americans get ready to work with Russians aboard the International Space Station? They take a crash course in culture, language and protocol at Russia's <a href=\"http://howe.iki.rssi.ru/GCTC/gctc_e.htm\">Star City</a>.",
      "publish_date": "Tue, 03 Jun 2003 09:39:21 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/06/03.html#item573"
    },
    {
      "title": "",
      "link": "",
      "description": "Sky watchers in Europe, Asia, and parts of Alaska and Canada will experience a <a href=\"http://science.nasa.gov/headlines/y2003/30may_solareclipse.htm\">partial eclipse of the Sun</a> on Saturday, May 31st.",
      "publish_date": "Fri, 30 May 2003 11:06:42 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/05/30.html#item572"
    },
    {
      "title": "The Engine That Does More",
      "link": "http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp",
      "description": "Before man travels to Mars, NASA hopes to design new engines that will let us fly through the Solar System more quickly.  The proposed VASIMR engine would do that.",
      "publish_date": "Tue, 27 May 2003 08:37:32 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/05/27.html#item571"
    },
    {
      "title": "Astronauts' Dirty Laundry",
      "link": "http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp",
      "description": "Compared to earlier spacecraft, the International Space Station has many luxuries, but laundry facilities are not one of them.  Instead, astronauts have other options.",
      "publish_date": "Tue, 20 May 2003 08:56:02 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/05/20.html#item570"
    }
  ]
}

Or extract just the channel items from the file

root = fromstring(Path("sample-rss-2.xml").read_text())
rv = ChannelItem(XPathExtractor("//channel/item"), is_many=True).extract(root)
print(json.dumps(rv, indent=2))

Output:

[
  {
    "title": "Star City",
    "link": "http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp",
    "description": "How do Americans get ready to work with Russians aboard the International Space Station? They take a crash course in culture, language and protocol at Russia's <a href=\"http://howe.iki.rssi.ru/GCTC/gctc_e.htm\">Star City</a>.",
    "publish_date": "Tue, 03 Jun 2003 09:39:21 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/06/03.html#item573"
  },
  {
    "title": "",
    "link": "",
    "description": "Sky watchers in Europe, Asia, and parts of Alaska and Canada will experience a <a href=\"http://science.nasa.gov/headlines/y2003/30may_solareclipse.htm\">partial eclipse of the Sun</a> on Saturday, May 31st.",
    "publish_date": "Fri, 30 May 2003 11:06:42 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/05/30.html#item572"
  },
  {
    "title": "The Engine That Does More",
    "link": "http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp",
    "description": "Before man travels to Mars, NASA hopes to design new engines that will let us fly through the Solar System more quickly.  The proposed VASIMR engine would do that.",
    "publish_date": "Tue, 27 May 2003 08:37:32 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/05/27.html#item571"
  },
  {
    "title": "Astronauts' Dirty Laundry",
    "link": "http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp",
    "description": "Compared to earlier spacecraft, the International Space Station has many luxuries, but laundry facilities are not one of them.  Instead, astronauts have other options.",
    "publish_date": "Tue, 20 May 2003 08:56:02 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/05/20.html#item570"
  }
]
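Conceptually, `is_many=True` finds every node matching the expression and runs the per-item fields against each node, collecting the results in a list. The same loop written by hand with the standard library (inline snippet standing in for the file):

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for sample-rss-2.xml.
RSS = """<rss version="2.0"><channel>
  <item><title>Star City</title>
    <guid>http://liftoff.msfc.nasa.gov/2003/06/03.html#item573</guid></item>
  <item><title>The Engine That Does More</title>
    <guid>http://liftoff.msfc.nasa.gov/2003/05/27.html#item571</guid></item>
</channel></rss>"""

root = ET.fromstring(RSS)

# is_many=True in spirit: match every item node, then extract each
# field from each node and gather the dicts into a list.
rv = [
    {"title": item.findtext("./title"), "guid": item.findtext("./guid")}
    for item in root.findall(".//channel/item")
]
print(rv)
```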

Changelog

v0.1.3

  • 5f4b0e0 Update README.md
  • 1b8bfb9 Add UserWarning when the extractor can't extract the first item from the result
  • dd2cd25 Remove the useless _extract call
  • 655ec9d Add UserWarning when expr conflicts with the parameter is_many=True
  • bcade2c Disallow setting is_many=True and default != sentinel at the same time
  • 761bd30 Add more unit tests
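The last entries describe validation around `is_many`, `default`, and empty results. A hypothetical sketch of that kind of check follows; the function names and messages are illustrative, not the library's actual code.

```python
import warnings

_SENTINEL = object()

def validate_field(is_many=False, default=_SENTINEL):
    # Setting is_many=True together with a default is contradictory:
    # a "many" extraction returns a list, so a scalar fallback has no slot.
    if is_many and default is not _SENTINEL:
        raise ValueError("can't set is_many=True and a default at the same time")

def extract_first(values, default=_SENTINEL):
    # Mirror the changelog's UserWarning: nothing matched and no default given.
    if values:
        return values[0]
    if default is _SENTINEL:
        warnings.warn("extractor can't extract first item from result", UserWarning)
        return None
    return default

print(extract_first(["Liftoff News"]))        # Liftoff News
print(extract_first([], default="fallback"))  # fallback
```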

