Data Extractor
Combines XPath, CSS selectors, and JSONPath for web data extraction.
Installation
pip install data-extractor
Usage
Download the sample RSS file
wget http://www.rssboard.org/files/sample-rss-2.xml
Simple Extractor
import json
from pathlib import Path
from data_extractor.item import Field, Item
from lxml.etree import fromstring
root = fromstring(Path("sample-rss-2.xml").read_text())
Using XPathExtractor to extract the RSS channel title
from data_extractor.lxml import XPathExtractor
XPathExtractor("//channel/title/text()").extract_first(root)
# 'Liftoff News'
Using TextCSSExtractor to extract all RSS item links
from data_extractor.lxml import TextCSSExtractor
TextCSSExtractor("item>link").extract(root)
# ['http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp',
# 'http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp',
# 'http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp']
Using AttrCSSExtractor to extract the RSS version
from data_extractor.lxml import AttrCSSExtractor
AttrCSSExtractor("rss", attr="version").extract_first(root)
# '2.0'
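To make clear what the three expressions above select, here is a rough standard-library equivalent. It is not how data_extractor works internally: it uses `xml.etree.ElementTree`, and `SAMPLE` is a hypothetical, trimmed stand-in for sample-rss-2.xml so that the sketch runs without the download.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for sample-rss-2.xml (not the full file).
SAMPLE = """\
<rss version="2.0">
  <channel>
    <title>Liftoff News</title>
    <item>
      <title>Star City</title>
      <link>http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp</link>
    </item>
    <item>
      <title>The Engine That Does More</title>
      <link>http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp</link>
    </item>
  </channel>
</rss>
"""

root = ET.fromstring(SAMPLE)  # root is the <rss> element

# Roughly what XPath "//channel/title/text()" with extract_first selects.
channel_title = root.find("./channel/title").text

# Roughly what the CSS selector "item>link" selects, as text.
item_links = [link.text for link in root.findall(".//item/link")]

# Roughly what the CSS selector "rss" with attr="version" selects.
rss_version = root.get("version")

print(channel_title)
print(item_links)
print(rss_version)
```

The extractor classes wrap this kind of lookup behind one interface, so the same `extract` and `extract_first` calls work whichever selector language the expression uses.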
Complex Extractor
Defining the ChannelItem and Channel classes, then extracting the data
import json
from pathlib import Path
from data_extractor.item import Field, Item
from data_extractor.lxml import XPathExtractor
from lxml.etree import fromstring

class ChannelItem(Item):
    title = Field(XPathExtractor("./title/text()"), default="")
    link = Field(XPathExtractor("./link/text()"), default="")
    description = Field(XPathExtractor("./description/text()"))
    publish_date = Field(XPathExtractor("./pubDate/text()"))
    guid = Field(XPathExtractor("./guid/text()"))


class Channel(Item):
    title = Field(XPathExtractor("./title/text()"))
    link = Field(XPathExtractor("./link/text()"))
    description = Field(XPathExtractor("./description/text()"))
    language = Field(XPathExtractor("./language/text()"))
    publish_date = Field(XPathExtractor("./pubDate/text()"))
    last_build_date = Field(XPathExtractor("./lastBuildDate/text()"))
    docs = Field(XPathExtractor("./docs/text()"))
    generator = Field(XPathExtractor("./generator/text()"))
    managing_editor = Field(XPathExtractor("./managingEditor/text()"))
    web_master = Field(XPathExtractor("./webMaster/text()"))

    items = ChannelItem(XPathExtractor("./item"), is_many=True)

Extracting the RSS data from the file
root = fromstring(Path("sample-rss-2.xml").read_text())
rv = Channel(XPathExtractor("//channel")).extract(root)
print(json.dumps(rv, indent=2))
Output:
{
  "title": "Liftoff News",
  "link": "http://liftoff.msfc.nasa.gov/",
  "description": "Liftoff to Space Exploration.",
  "language": "en-us",
  "publish_date": "Tue, 10 Jun 2003 04:00:00 GMT",
  "last_build_date": "Tue, 10 Jun 2003 09:41:01 GMT",
  "docs": "http://blogs.law.harvard.edu/tech/rss",
  "generator": "Weblog Editor 2.0",
  "managing_editor": "editor@example.com",
  "web_master": "webmaster@example.com",
  "items": [
    {
      "title": "Star City",
      "link": "http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp",
      "description": "How do Americans get ready to work with Russians aboard the International Space Station? They take a crash course in culture, language and protocol at Russia's <a href=\"http://howe.iki.rssi.ru/GCTC/gctc_e.htm\">Star City</a>.",
      "publish_date": "Tue, 03 Jun 2003 09:39:21 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/06/03.html#item573"
    },
    {
      "title": "",
      "link": "",
      "description": "Sky watchers in Europe, Asia, and parts of Alaska and Canada will experience a <a href=\"http://science.nasa.gov/headlines/y2003/30may_solareclipse.htm\">partial eclipse of the Sun</a> on Saturday, May 31st.",
      "publish_date": "Fri, 30 May 2003 11:06:42 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/05/30.html#item572"
    },
    {
      "title": "The Engine That Does More",
      "link": "http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp",
      "description": "Before man travels to Mars, NASA hopes to design new engines that will let us fly through the Solar System more quickly. The proposed VASIMR engine would do that.",
      "publish_date": "Tue, 27 May 2003 08:37:32 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/05/27.html#item571"
    },
    {
      "title": "Astronauts' Dirty Laundry",
      "link": "http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp",
      "description": "Compared to earlier spacecraft, the International Space Station has many luxuries, but laundry facilities are not one of them. Instead, astronauts have other options.",
      "publish_date": "Tue, 20 May 2003 08:56:02 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/05/20.html#item570"
    }
  ]
}
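The second item in the output above has an empty title and link, which matches the `default=""` set on those two fields in ChannelItem. The sketch below illustrates that fallback idea using only the standard library. It is not data_extractor's implementation: `SAMPLE`, `ITEM_FIELDS`, `extract_item`, and `_MISSING` are hypothetical names, and raising `LookupError` for a missing value without a default is an assumption of this sketch, not documented library behavior.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item stand-in for the feed; the second item
# has no <title> and no <link>.
SAMPLE = """\
<channel>
  <item>
    <title>Star City</title>
    <link>http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp</link>
    <guid>http://liftoff.msfc.nasa.gov/2003/06/03.html#item573</guid>
  </item>
  <item>
    <guid>http://liftoff.msfc.nasa.gov/2003/05/30.html#item572</guid>
  </item>
</channel>
"""

_MISSING = object()  # sentinel meaning "no default was given"

# field name -> (child tag, default); mirrors part of ChannelItem.
ITEM_FIELDS = {
    "title": ("title", ""),
    "link": ("link", ""),
    "guid": ("guid", _MISSING),
}


def extract_item(element, fields):
    """Build a dict for one element, falling back to each field's default."""
    result = {}
    for name, (tag, default) in fields.items():
        text = element.findtext(tag)  # None when the child tag is absent
        if text is None:
            if default is _MISSING:
                # Assumption of this sketch: fail loudly without a default.
                raise LookupError(f"no value for field {name!r}")
            text = default
        result[name] = text
    return result


channel = ET.fromstring(SAMPLE)
items = [extract_item(el, ITEM_FIELDS) for el in channel.findall("./item")]
print(items)
```

Using a dedicated sentinel rather than `None` lets a field declare `None` itself as a legitimate default.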
Or extracting only the channel items from the file
root = fromstring(Path("sample-rss-2.xml").read_text())
rv = ChannelItem(XPathExtractor("//channel/item"), is_many=True).extract(root)
print(json.dumps(rv, indent=2))
Output:
[
  {
    "title": "Star City",
    "link": "http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp",
    "description": "How do Americans get ready to work with Russians aboard the International Space Station? They take a crash course in culture, language and protocol at Russia's <a href=\"http://howe.iki.rssi.ru/GCTC/gctc_e.htm\">Star City</a>.",
    "publish_date": "Tue, 03 Jun 2003 09:39:21 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/06/03.html#item573"
  },
  {
    "title": "",
    "link": "",
    "description": "Sky watchers in Europe, Asia, and parts of Alaska and Canada will experience a <a href=\"http://science.nasa.gov/headlines/y2003/30may_solareclipse.htm\">partial eclipse of the Sun</a> on Saturday, May 31st.",
    "publish_date": "Fri, 30 May 2003 11:06:42 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/05/30.html#item572"
  },
  {
    "title": "The Engine That Does More",
    "link": "http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp",
    "description": "Before man travels to Mars, NASA hopes to design new engines that will let us fly through the Solar System more quickly. The proposed VASIMR engine would do that.",
    "publish_date": "Tue, 27 May 2003 08:37:32 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/05/27.html#item571"
  },
  {
    "title": "Astronauts' Dirty Laundry",
    "link": "http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp",
    "description": "Compared to earlier spacecraft, the International Space Station has many luxuries, but laundry facilities are not one of them. Instead, astronauts have other options.",
    "publish_date": "Tue, 20 May 2003 08:56:02 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/05/20.html#item570"
  }
]
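The extracted `publish_date` values are plain strings in the RFC 822 style used by RSS. If they need to be compared or sorted, one option is to parse them afterwards with the standard library, as in the sketch below. The two records are copied from the output above and trimmed to the fields used here; the `published_at` key is a hypothetical name and not something data_extractor produces.

```python
from email.utils import parsedate_to_datetime

# Two records copied from the output above, trimmed to the fields used here.
rv = [
    {
        "title": "The Engine That Does More",
        "publish_date": "Tue, 27 May 2003 08:37:32 GMT",
    },
    {
        "title": "Star City",
        "publish_date": "Tue, 03 Jun 2003 09:39:21 GMT",
    },
]

# Attach a parsed datetime next to each original string.
for record in rv:
    record["published_at"] = parsedate_to_datetime(record["publish_date"])

# Sort newest first using the parsed values.
newest_first = sorted(rv, key=lambda r: r["published_at"], reverse=True)
titles = [r["title"] for r in newest_first]
print(titles)
```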
Changelog
v0.1.3
- 5f4b0e0 Update README.md
- 1b8bfb9 Add UserWarning when extractor can't extract first item from result
- dd2cd25 Remove the useless _extract call
- 655ec9d Add UserWarning when expr conflicts with parameter is_many=True
- bcade2c Do not allow user to set is_many=True and default!=sentinel at the same time
- 761bd30 Add more unit tests