
Data Extractor

Combine XPath, CSS Selectors, and JSONPath for web data extraction.

Installation

pip install data-extractor

Usage

Download the RSS sample file

wget http://www.rssboard.org/files/sample-rss-2.xml

Simple Extractor

from pathlib import Path

from lxml.etree import fromstring


root = fromstring(Path("sample-rss-2.xml").read_text())

Using XPathExtractor to extract the RSS channel title

from data_extractor.lxml import XPathExtractor


XPathExtractor("//channel/title/text()").extract_first(root)
# 'Liftoff News'
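The same lookup can be checked without installing anything extra: the standard library's `xml.etree.ElementTree` supports a small subset of XPath (note that `text()` is not in that subset, so `findtext` is used instead). The inline snippet below is a trimmed stand-in for the downloaded sample file, not the file itself.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for sample-rss-2.xml.
RSS = """<rss version="2.0">
  <channel>
    <title>Liftoff News</title>
    <link>http://liftoff.msfc.nasa.gov/</link>
  </channel>
</rss>"""

root = ET.fromstring(RSS)

# Rough equivalent of XPathExtractor("//channel/title/text()").extract_first(root);
# ElementTree's limited XPath has no text(), so findtext returns the text node.
title = root.findtext("./channel/title")
print(title)  # Liftoff News
```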

Using TextCSSExtractor to extract all RSS item links

from data_extractor.lxml import TextCSSExtractor


TextCSSExtractor("item>link").extract(root)
# ['http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp',
#  'http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp',
#  'http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp']
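The CSS selector `item>link` matches every `<link>` element that is a direct child of an `<item>`. The same result can be approximated with a standard-library path expression; again, the inline snippet is a trimmed stand-in for the sample file.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for sample-rss-2.xml.
RSS = """<rss version="2.0">
  <channel>
    <item><link>http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp</link></item>
    <item><link>http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp</link></item>
  </channel>
</rss>"""

root = ET.fromstring(RSS)

# "item>link" (CSS) corresponds to ".//item/link" here:
# a link that is a direct child of an item, anywhere in the tree.
links = [el.text for el in root.findall(".//item/link")]
print(links)
```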

Using AttrCSSExtractor to extract the rss version attribute

from data_extractor.lxml import AttrCSSExtractor


AttrCSSExtractor("rss", attr="version").extract_first(root)
# '2.0'
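Attribute extraction has a simple standard-library parallel as well: once the document is parsed, `AttrCSSExtractor("rss", attr="version")` amounts to reading an attribute off the matched element.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<rss version="2.0"><channel/></rss>')

# The parsed root element is <rss>, so the version attribute
# is a plain dictionary-style lookup on it.
version = root.get("version")
print(version)  # 2.0
```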

Complex Extractor

Define the ChannelItem and Channel classes, then extract the data

import json

from pathlib import Path

from data_extractor.item import Field, Item
from data_extractor.lxml import XPathExtractor

from lxml.etree import fromstring


class ChannelItem(Item):
    title = Field(XPathExtractor("./title/text()"), default="")
    link = Field(XPathExtractor("./link/text()"), default="")
    description = Field(XPathExtractor("./description/text()"))
    publish_date = Field(XPathExtractor("./pubDate/text()"))
    guid = Field(XPathExtractor("./guid/text()"))

class Channel(Item):
    title = Field(XPathExtractor("./title/text()"))
    link = Field(XPathExtractor("./link/text()"))
    description = Field(XPathExtractor("./description/text()"))
    language = Field(XPathExtractor("./language/text()"))
    publish_date = Field(XPathExtractor("./pubDate/text()"))
    last_build_date = Field(XPathExtractor("./lastBuildDate/text()"))
    docs = Field(XPathExtractor("./docs/text()"))
    generator = Field(XPathExtractor("./generator/text()"))
    managing_editor = Field(XPathExtractor("./managingEditor/text()"))
    web_master = Field(XPathExtractor("./webMaster/text()"))

    items = ChannelItem(XPathExtractor("./item"), is_many=True)
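The core of the declarative pattern above is: each Field pairs an expression with an optional default, and an Item collects its Field attributes into a dict. A minimal stdlib-only sketch of that idea follows; the names `SimpleField`, `SimpleItem`, and `MiniChannelItem` are illustrative and are not the library's actual implementation.

```python
import xml.etree.ElementTree as ET

_SENTINEL = object()

class SimpleField:
    """Hypothetical stand-in for Field: a path plus an optional default."""
    def __init__(self, path, default=_SENTINEL):
        self.path = path
        self.default = default

    def extract(self, node):
        value = node.findtext(self.path)
        if value is None:
            if self.default is _SENTINEL:
                raise ValueError(f"nothing matched {self.path!r} and no default set")
            return self.default
        return value

class SimpleItem:
    """Hypothetical stand-in for Item: extract every SimpleField attribute as a dict."""
    @classmethod
    def extract(cls, node):
        return {
            name: field.extract(node)
            for name, field in vars(cls).items()
            if isinstance(field, SimpleField)
        }

class MiniChannelItem(SimpleItem):
    title = SimpleField("./title", default="")
    link = SimpleField("./link", default="")

node = ET.fromstring("<item><link>http://example.com/a</link></item>")
print(MiniChannelItem.extract(node))  # {'title': '', 'link': 'http://example.com/a'}
```

This also shows why the sample output below contains `"title": ""` for one item: when the expression matches nothing, the declared default is used instead of raising.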

Extract the RSS data from the file

root = fromstring(Path("sample-rss-2.xml").read_text())
rv = Channel(XPathExtractor("//channel")).extract(root)
print(json.dumps(rv, indent=2))

Output:

{
  "title": "Liftoff News",
  "link": "http://liftoff.msfc.nasa.gov/",
  "description": "Liftoff to Space Exploration.",
  "language": "en-us",
  "publish_date": "Tue, 10 Jun 2003 04:00:00 GMT",
  "last_build_date": "Tue, 10 Jun 2003 09:41:01 GMT",
  "docs": "http://blogs.law.harvard.edu/tech/rss",
  "generator": "Weblog Editor 2.0",
  "managing_editor": "editor@example.com",
  "web_master": "webmaster@example.com",
  "items": [
    {
      "title": "Star City",
      "link": "http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp",
      "description": "How do Americans get ready to work with Russians aboard the International Space Station? They take a crash course in culture, language and protocol at Russia's <a href=\"http://howe.iki.rssi.ru/GCTC/gctc_e.htm\">Star City</a>.",
      "publish_date": "Tue, 03 Jun 2003 09:39:21 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/06/03.html#item573"
    },
    {
      "title": "",
      "link": "",
      "description": "Sky watchers in Europe, Asia, and parts of Alaska and Canada will experience a <a href=\"http://science.nasa.gov/headlines/y2003/30may_solareclipse.htm\">partial eclipse of the Sun</a> on Saturday, May 31st.",
      "publish_date": "Fri, 30 May 2003 11:06:42 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/05/30.html#item572"
    },
    {
      "title": "The Engine That Does More",
      "link": "http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp",
      "description": "Before man travels to Mars, NASA hopes to design new engines that will let us fly through the Solar System more quickly.  The proposed VASIMR engine would do that.",
      "publish_date": "Tue, 27 May 2003 08:37:32 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/05/27.html#item571"
    },
    {
      "title": "Astronauts' Dirty Laundry",
      "link": "http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp",
      "description": "Compared to earlier spacecraft, the International Space Station has many luxuries, but laundry facilities are not one of them.  Instead, astronauts have other options.",
      "publish_date": "Tue, 20 May 2003 08:56:02 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/05/20.html#item570"
    }
  ]
}

Or extract just the channel items from the file

root = fromstring(Path("sample-rss-2.xml").read_text())
rv = ChannelItem(XPathExtractor("//channel/item"), is_many=True).extract(root)
print(json.dumps(rv, indent=2))

Output:

[
  {
    "title": "Star City",
    "link": "http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp",
    "description": "How do Americans get ready to work with Russians aboard the International Space Station? They take a crash course in culture, language and protocol at Russia's <a href=\"http://howe.iki.rssi.ru/GCTC/gctc_e.htm\">Star City</a>.",
    "publish_date": "Tue, 03 Jun 2003 09:39:21 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/06/03.html#item573"
  },
  {
    "title": "",
    "link": "",
    "description": "Sky watchers in Europe, Asia, and parts of Alaska and Canada will experience a <a href=\"http://science.nasa.gov/headlines/y2003/30may_solareclipse.htm\">partial eclipse of the Sun</a> on Saturday, May 31st.",
    "publish_date": "Fri, 30 May 2003 11:06:42 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/05/30.html#item572"
  },
  {
    "title": "The Engine That Does More",
    "link": "http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp",
    "description": "Before man travels to Mars, NASA hopes to design new engines that will let us fly through the Solar System more quickly.  The proposed VASIMR engine would do that.",
    "publish_date": "Tue, 27 May 2003 08:37:32 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/05/27.html#item571"
  },
  {
    "title": "Astronauts' Dirty Laundry",
    "link": "http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp",
    "description": "Compared to earlier spacecraft, the International Space Station has many luxuries, but laundry facilities are not one of them.  Instead, astronauts have other options.",
    "publish_date": "Tue, 20 May 2003 08:56:02 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/05/20.html#item570"
  }
]
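Conceptually, `is_many=True` finds every node matching the expression and runs the per-item fields against each node, collecting the results in a list. The same loop written by hand with the standard library (inline snippet standing in for the file):

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for sample-rss-2.xml.
RSS = """<rss version="2.0"><channel>
  <item><title>Star City</title>
    <guid>http://liftoff.msfc.nasa.gov/2003/06/03.html#item573</guid></item>
  <item><title>The Engine That Does More</title>
    <guid>http://liftoff.msfc.nasa.gov/2003/05/27.html#item571</guid></item>
</channel></rss>"""

root = ET.fromstring(RSS)

# is_many=True in spirit: match every item node, then extract each
# field from each node and gather the dicts into a list.
rv = [
    {"title": item.findtext("./title"), "guid": item.findtext("./guid")}
    for item in root.findall(".//channel/item")
]
print(rv)
```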

Changelog

v0.1.3

  • 5f4b0e0 Update README.md
  • 1b8bfb9 Add UserWarning when the extractor can't extract the first item from the result
  • dd2cd25 Remove the useless _extract call
  • 655ec9d Add UserWarning when expr conflicts with the parameter is_many=True
  • bcade2c Disallow setting is_many=True and default != sentinel at the same time
  • 761bd30 Add more unit tests
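The last entries describe validation around `is_many`, `default`, and empty results. A hypothetical sketch of that kind of check follows; the function names and messages are illustrative, not the library's actual code.

```python
import warnings

_SENTINEL = object()

def validate_field(is_many=False, default=_SENTINEL):
    # Setting is_many=True together with a default is contradictory:
    # a "many" extraction returns a list, so a scalar fallback has no slot.
    if is_many and default is not _SENTINEL:
        raise ValueError("can't set is_many=True and a default at the same time")

def extract_first(values, default=_SENTINEL):
    # Mirror the changelog's UserWarning: nothing matched and no default given.
    if values:
        return values[0]
    if default is _SENTINEL:
        warnings.warn("extractor can't extract first item from result", UserWarning)
        return None
    return default

print(extract_first(["Liftoff News"]))        # Liftoff News
print(extract_first([], default="fallback"))  # fallback
```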

