Data Extractor
Combine XPath, CSS Selector and JSONPath for web data extraction.
Installation
pip install data-extractor
Usage
Download the RSS sample file
wget http://www.rssboard.org/files/sample-rss-2.xml
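Where wget is not available, the same download can be sketched in Python with the standard library alone. The helper name download is made up for this example; only the URL comes from the command above.

```python
import urllib.request
from pathlib import Path

SAMPLE_URL = "http://www.rssboard.org/files/sample-rss-2.xml"


def download(url: str, dest: str) -> Path:
    """Fetch url and write the response body to dest, returning the path."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    path = Path(dest)
    path.write_bytes(data)
    return path
```

Calling download(SAMPLE_URL, "sample-rss-2.xml") should leave the file in the current directory, where the examples below expect it.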
Simple Extractor
import json
from pathlib import Path
from data_extractor.item import Field, Item
from lxml.etree import fromstring
root = fromstring(Path("sample-rss-2.xml").read_text())
Use XPathExtractor to extract the RSS channel title
from data_extractor.lxml import XPathExtractor
XPathExtractor("//channel/title/text()").extract_first(root)
# 'Liftoff News'
Use TextCSSExtractor to extract all RSS item links
from data_extractor.lxml import TextCSSExtractor
TextCSSExtractor("item>link").extract(root)
# ['http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp',
# 'http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp',
# 'http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp']
Use AttrCSSExtractor to extract the RSS version
from data_extractor.lxml import AttrCSSExtractor
AttrCSSExtractor("rss", attr="version").extract_first(root)
# '2.0'
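To check what these three expressions select without downloading the sample, roughly the same lookups can be sketched with the standard library's xml.etree.ElementTree. This is only a cross-check, not how data_extractor works internally. The inline feed is a trimmed stand-in for the sample file, the example.com links are placeholders, and ElementTree's path syntax is a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for sample-rss-2.xml (assumption: not the real file).
RSS = """<rss version="2.0">
  <channel>
    <title>Liftoff News</title>
    <item><title>Star City</title><link>http://example.com/star-city</link></item>
    <item><description>No title or link here.</description></item>
    <item><title>The Engine That Does More</title><link>http://example.com/vasimr</link></item>
  </channel>
</rss>"""

root = ET.fromstring(RSS)  # root is the <rss> element

# Counterpart of XPathExtractor("//channel/title/text()").extract_first(root)
channel_title = root.findtext("./channel/title")

# Counterpart of TextCSSExtractor("item>link").extract(root):
# every <link> that is a direct child of an <item>
item_links = [link.text for link in root.findall(".//item/link")]

# Counterpart of AttrCSSExtractor("rss", attr="version").extract_first(root)
rss_version = root.get("version")

print(channel_title)  # Liftoff News
print(item_links)     # two links; the second item has none
print(rss_version)    # 2.0
```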
Complex Extractor
Define the ChannelItem and Channel classes, then extract the data
import json
from pathlib import Path
from data_extractor.item import Field, Item
from data_extractor.lxml import XPathExtractor
from lxml.etree import fromstring
class ChannelItem(Item):
    title = Field(XPathExtractor("./title/text()"), default="")
    link = Field(XPathExtractor("./link/text()"), default="")
    description = Field(XPathExtractor("./description/text()"))
    publish_date = Field(XPathExtractor("./pubDate/text()"))
    guid = Field(XPathExtractor("./guid/text()"))


class Channel(Item):
    title = Field(XPathExtractor("./title/text()"))
    link = Field(XPathExtractor("./link/text()"))
    description = Field(XPathExtractor("./description/text()"))
    language = Field(XPathExtractor("./language/text()"))
    publish_date = Field(XPathExtractor("./pubDate/text()"))
    last_build_date = Field(XPathExtractor("./lastBuildDate/text()"))
    docs = Field(XPathExtractor("./docs/text()"))
    generator = Field(XPathExtractor("./generator/text()"))
    managing_editor = Field(XPathExtractor("./managingEditor/text()"))
    web_master = Field(XPathExtractor("./webMaster/text()"))
    # Nested item: one ChannelItem per <item> element under the channel
    items = ChannelItem(XPathExtractor("./item"), is_many=True)
Extract the RSS data from the file
root = fromstring(Path("sample-rss-2.xml").read_text())
rv = Channel(XPathExtractor("//channel")).extract(root)
print(json.dumps(rv, indent=2))
Output:
{
  "title": "Liftoff News",
  "link": "http://liftoff.msfc.nasa.gov/",
  "description": "Liftoff to Space Exploration.",
  "language": "en-us",
  "publish_date": "Tue, 10 Jun 2003 04:00:00 GMT",
  "last_build_date": "Tue, 10 Jun 2003 09:41:01 GMT",
  "docs": "http://blogs.law.harvard.edu/tech/rss",
  "generator": "Weblog Editor 2.0",
  "managing_editor": "editor@example.com",
  "web_master": "webmaster@example.com",
  "items": [
    {
      "title": "Star City",
      "link": "http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp",
      "description": "How do Americans get ready to work with Russians aboard the International Space Station? They take a crash course in culture, language and protocol at Russia's <a href=\"http://howe.iki.rssi.ru/GCTC/gctc_e.htm\">Star City</a>.",
      "publish_date": "Tue, 03 Jun 2003 09:39:21 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/06/03.html#item573"
    },
    {
      "title": "",
      "link": "",
      "description": "Sky watchers in Europe, Asia, and parts of Alaska and Canada will experience a <a href=\"http://science.nasa.gov/headlines/y2003/30may_solareclipse.htm\">partial eclipse of the Sun</a> on Saturday, May 31st.",
      "publish_date": "Fri, 30 May 2003 11:06:42 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/05/30.html#item572"
    },
    {
      "title": "The Engine That Does More",
      "link": "http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp",
      "description": "Before man travels to Mars, NASA hopes to design new engines that will let us fly through the Solar System more quickly. The proposed VASIMR engine would do that.",
      "publish_date": "Tue, 27 May 2003 08:37:32 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/05/27.html#item571"
    },
    {
      "title": "Astronauts' Dirty Laundry",
      "link": "http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp",
      "description": "Compared to earlier spacecraft, the International Space Station has many luxuries, but laundry facilities are not one of them. Instead, astronauts have other options.",
      "publish_date": "Tue, 20 May 2003 08:56:02 GMT",
      "guid": "http://liftoff.msfc.nasa.gov/2003/05/20.html#item570"
    }
  ]
}
Or extract only the channel items from the file
root = fromstring(Path("sample-rss-2.xml").read_text())
rv = ChannelItem(XPathExtractor("//channel/item"), is_many=True).extract(root)
print(json.dumps(rv, indent=2))
Output:
[
  {
    "title": "Star City",
    "link": "http://liftoff.msfc.nasa.gov/news/2003/news-starcity.asp",
    "description": "How do Americans get ready to work with Russians aboard the International Space Station? They take a crash course in culture, language and protocol at Russia's <a href=\"http://howe.iki.rssi.ru/GCTC/gctc_e.htm\">Star City</a>.",
    "publish_date": "Tue, 03 Jun 2003 09:39:21 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/06/03.html#item573"
  },
  {
    "title": "",
    "link": "",
    "description": "Sky watchers in Europe, Asia, and parts of Alaska and Canada will experience a <a href=\"http://science.nasa.gov/headlines/y2003/30may_solareclipse.htm\">partial eclipse of the Sun</a> on Saturday, May 31st.",
    "publish_date": "Fri, 30 May 2003 11:06:42 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/05/30.html#item572"
  },
  {
    "title": "The Engine That Does More",
    "link": "http://liftoff.msfc.nasa.gov/news/2003/news-VASIMR.asp",
    "description": "Before man travels to Mars, NASA hopes to design new engines that will let us fly through the Solar System more quickly. The proposed VASIMR engine would do that.",
    "publish_date": "Tue, 27 May 2003 08:37:32 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/05/27.html#item571"
  },
  {
    "title": "Astronauts' Dirty Laundry",
    "link": "http://liftoff.msfc.nasa.gov/news/2003/news-laundry.asp",
    "description": "Compared to earlier spacecraft, the International Space Station has many luxuries, but laundry facilities are not one of them. Instead, astronauts have other options.",
    "publish_date": "Tue, 20 May 2003 08:56:02 GMT",
    "guid": "http://liftoff.msfc.nasa.gov/2003/05/20.html#item570"
  }
]
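The empty title and link in the second item are consistent with default="" on those two fields of ChannelItem, since that item carries neither element. The effect of a per-field default can be sketched with the standard library alone. The extract_item helper and the two-item feed below are invented for illustration and are not part of data_extractor.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed; the second item has no <title> or <link>.
ITEMS = """<channel>
  <item><title>Star City</title><link>http://example.com/star-city</link><guid>g1</guid></item>
  <item><guid>g2</guid></item>
</channel>"""


def extract_item(node):
    """Map one <item> element to a dict, falling back to "" where a default is set."""
    return {
        "title": node.findtext("title", default=""),  # mirrors default=""
        "link": node.findtext("link", default=""),    # mirrors default=""
        "guid": node.findtext("guid"),                # no fallback value given
    }


rows = [extract_item(node) for node in ET.fromstring(ITEMS).findall("item")]
print(rows)
```

Here rows[1] comes back with empty strings for title and link, while guid is still filled in from the element that is present.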
Changelog
v0.1.3
- 5f4b0e0 Update README.md
- 1b8bfb9 Add UserWarning when extractor can't extract first item from result
- dd2cd25 Remove the useless _extract call
- 655ec9d Add UserWarning when expr conflicts with parameter is_many=True
- bcade2c Disallow setting is_many=True and default!=sentinel at the same time
- 761bd30 Add more unit tests
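The is_many and default rule from the changelog above can be pictured with a sentinel object that marks "no default supplied". The sketch below is a hypothetical stand-in, not the library's code: the SketchField name, the picker callable, and the choice of ValueError and LookupError are all assumptions made for illustration.

```python
_SENTINEL = object()  # marks "no default supplied"; distinct from default=None


class SketchField:
    """Hypothetical stand-in for a field; not data_extractor's real class."""

    def __init__(self, picker, default=_SENTINEL, is_many=False):
        # A default only makes sense when a single value is expected.
        if is_many and default is not _SENTINEL:
            raise ValueError("is_many=True cannot be combined with a default")
        self.picker = picker
        self.default = default
        self.is_many = is_many

    def extract(self, data):
        results = list(self.picker(data))
        if self.is_many:
            return results  # all matches, possibly an empty list
        if results:
            return results[0]  # first match only
        if self.default is _SENTINEL:
            raise LookupError("nothing extracted and no default given")
        return self.default


def evens(data):
    """Toy picker: yield the even numbers in data."""
    return (x for x in data if x % 2 == 0)


print(SketchField(evens, is_many=True).extract([1, 2, 4]))  # [2, 4]
print(SketchField(evens).extract([1, 2, 4]))                # 2
print(SketchField(evens, default=None).extract([1, 3]))     # None
```

Using a dedicated sentinel rather than None lets None itself be a legitimate default value.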