A Python implementation of DEPTA

Project description

PyDepta

PyDepta is a library for extracting structured data from HTML pages. It can work in both supervised and unsupervised modes.

Under the hood, PyDepta implements Yanhong Zhai and Bing Liu's work on Web Data Extraction Based on Partial Tree Alignment to extract data without example data (so-called unsupervised learning). The basic idea of this algorithm is to find the data region with a tree matching algorithm (see Bing Liu's previous work on MDR) and then build a seed tree on top of the records to extract aligned data fields.
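To make the tree matching idea concrete, here is a minimal sketch of the classic Simple Tree Matching recurrence, which scores how many nodes two trees share in a top-down, order-preserving match. It is an illustration only and not PyDepta's actual code: the Node class, the function names and the normalization are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tiny stand-in for a DOM element: a tag name plus ordered children."""
    tag: str
    children: list = field(default_factory=list)


def simple_tree_match(a, b):
    """Return the number of matched node pairs in a top-down matching of a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j] = best score using the first i children of a and the first j of b
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1]
                + simple_tree_match(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1  # +1 for the matched roots


def size(node):
    """Count the nodes in a tree."""
    return 1 + sum(size(child) for child in node.children)


def similarity(a, b):
    """Normalize the match score to [0, 1] (assumed normalization: average tree size)."""
    return 2 * simple_tree_match(a, b) / (size(a) + size(b))


# Two hypothetical records that differ by one optional field
rec1 = Node("li", [Node("a"), Node("span"), Node("p")])
rec2 = Node("li", [Node("a"), Node("p")])
print(similarity(rec1, rec2))
```

Adjacent subtrees whose similarity exceeds some threshold would be grouped into one data region; the threshold is a tuning choice not specified here.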

PyDepta can also extract data with example data (so-called supervised learning). It relies on Scrapely to extract the structured data after you tell it which data you'd like to extract.

Usage

1. (Unsupervised) Extract from url

In this mode PyDepta extracts the data blindly, based on the similarity of subtrees.

>>> from pydepta import Depta
>>> d = Depta()
>>> url1 = 'http://www.iens.nl/restaurant/12229/nijmegen-pasta-e-fagioli'
>>> seed = d.extract(url=url1)[8]
>>> seed.as_plain_texts()[0]

['MartenHH', 'Meesterproever', '5 maanden geleden', '7', '10', '1', 'Eten', '6', 'Service', '9', 'Decor', 'Afgelopen zaterdag avond hebben we hier met z\'n zessen heerlijk kunnen dineren. De entourage was erg prettig en de bediening verliep soepel, op een paar vreemde uitschieters na (zie hieronder). Het voorgerecht op basis van aubergine, tomaat en mozarella was lekker. Ook het hoofdgerecht - de kalfsoester met serano ham was goed maar niet perse bijzonder. Er werden ook bijgerechten geserveerd op losse schaaltjes, maar heaas werd er werd niet gevraagd of alles voldoende was. De salade was bv snel op. De porties voldeden overigens prima en zeker na het nagerecht gingen wij zeer voldaan naar huis. \nTot zover de sterke punten. Wat bij een restaurant van dit prijsniveau gewoon niet mag voorkomen zijn de volgende twee zaken. Ten eerste werd ons bij het opdienen van het hoofdgerecht gevraagd wie wat had besteld. Dat hoort echt niet bij een restaurant van deze klasse, en voor mij is dit een echte afkapper. Ten tweede vroegen wij om advies over de wijnkaart. Dat ging helemaal mis. Wij kregen advies van degene die de wijnkaart zou hebben samen gesteld. Echter, toen ik vroeg of de "cannonau di sardegna" bij het menu zou passen werd deze mij zonder verdere motivatie ontraden. Deze zou een zeer vreemde smaak hebben en eigenlijk nergens bij passen. Ook andere adviezen kwamen niet echt uit de verf omdat degene die ons hielp niet echt met ons erover in gesprek leek te willen. Graag wat meer enthousiasme over de eigen wijnkaart - en ook kennis. Dat kan veel beter. Ze had bijvoorbeeld kunnen vragen waarom ik nu juist die ene wijn eruit pikte - het is nl een wijn die ik heel veel drink omdat ik hem erg lekker vind en overal bij vind passen - als het tenminste een goede fles is!', 'Gegeten op 17 augustus 2013', '', '', '', '', '\n                Deel            ', '\n                0 Reacties            ']

The result is a Region, which can be converted into plain text (with region.as_plain_texts), an HTML table (with region.as_html_table), or a Python dict (with region_to_dict).
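Once a region is available as plain text, it can be exported with the standard library. The sketch below assumes that as_plain_texts() returns a list of records, each a list of strings, as the output above suggests. The rows_to_csv helper and the sample rows are hypothetical and are not part of PyDepta.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of records (each a list of strings) as CSV text."""
    width = max((len(row) for row in rows), default=0)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        # Collapse whitespace runs such as '\n      Deel      ' into single spaces
        cleaned = [" ".join(cell.split()) for cell in row]
        # Pad short records so every line has the same number of columns
        writer.writerow(cleaned + [""] * (width - len(cleaned)))
    return buf.getvalue()


# Hypothetical sample shaped like the result of seed.as_plain_texts()
rows = [
    ["MartenHH", "Meesterproever", "\n      Deel      "],
    ["CamielIens", "Proever"],
]
print(rows_to_csv(rows))
```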

2. (Supervised) Extract with seed region and data

In this mode you tell PyDepta which data you expect to scrape from the seed region. For example, suppose that on the seed region you'd like to scrape MartenHH as name and Afgelopen zaterdag avond hebben we... as text:

>>> data = {'name': 'MartenHH', 'text': 'Afgelopen zaterdag avond hebben we'}

Then train PyDepta by adding the seed region and the data.

>>> d.train(seed, data)

Finally, tell PyDepta to scrape other similar pages on that site and it will return the results.

>>> url2 = 'http://www.iens.nl/restaurant/22513/zwolle-hotel-fidder'
>>> for item in d.infer(url=url2):
...     print(item)
...
{u'text': [u'Heerlijke ontvangst van gastvrije en persoonlijke bediening. Eten is prima. Dit weekend gekozen voor gastronomisch arrangement en is echt goed. Goede keuzes met bijpassende wijnen. Lekker op loopafstand van Zwolle centrum.  Kortom een echte aanrader voor mensen die gastvrijheid en goed eten waarderen! En heb je kritiek of vragen: meldt het gewoon want hier wordt goed op ingespeeld.'], u'name': [u'CamielIens']}
{u'text': [u"Van de week waren we neer gestreken in een heuse stadstuin, niet ver van onze geliefde Peperbus gelegen namen we plaats op het terras van Fidder's. Het was heerlijk vertoeven in de schaduwrijk tuin, een terras kan je het haast niet noemen. We zaten tussen een moestuin en kruidentuin in en spotte regelmatig de chef die wat kruiden nodig had. De gerechten waren erg lekker en goed verzorgt. Binnenkort kom ik zeker terug om te genieten van hun dineractie."], u'name': [u'Hendrikdeboer']}
{u'text': [u'We hebben hier echt genoten van heerlijke vers bereide gerechten met een mooi wijnarrangement. Alles was goed op smaak. Mooie stadsreiniging en vriendelijke bediening. \nHier komen we graag terug'], u'name': [u'Vic1980']}
{u'text': [u'Heerlijk eten, niveau sterrenrestaurant. Rare omgeving: in een nauwe straat ver van het centrum. Veel te langzame bediening, maar wel vriendelijk. We hebben hier een ANWB menu gegeten. Heel mals rundvlees en als voorgerecht forelmousse en als nagerecht broodpudding.'], u'name': [u'Mathilde30']}
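Each inferred item maps a field name to a list of matched strings. If one value per field is more convenient, the items can be flattened afterwards. This is a small sketch that assumes the item shape shown in the output above; flatten_item is a hypothetical helper and not part of PyDepta.

```python
def flatten_item(item):
    """Turn {'name': ['CamielIens']} into {'name': 'CamielIens'}."""
    flat = {}
    for key, values in item.items():
        # Join multiple matches with a space; an empty list becomes an empty string
        flat[key] = " ".join(value.strip() for value in values)
    return flat


# Sample shaped like one item from the output above (text shortened)
item = {"text": ["Heerlijk eten, niveau sterrenrestaurant. "], "name": ["Mathilde30"]}
print(flatten_item(item))
```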

Author

pengtaoo AT gmail.com

Deployment

http://pydepta-heroku.herokuapp.com/


Download files

Source Distribution

sd_pydepta-0.3.1.tar.gz (307.7 kB)
Uploaded: Source

Built Distributions

sd_pydepta-0.3.1-cp314-cp314-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl (291.7 kB)
Uploaded: CPython 3.14, manylinux: glibc 2.28+ x86-64, manylinux: glibc 2.5+ x86-64

sd_pydepta-0.3.1-cp314-cp314-macosx_11_0_arm64.whl (146.4 kB)
Uploaded: CPython 3.14, macOS 11.0+ ARM64

sd_pydepta-0.3.1-cp314-cp314-macosx_10_15_x86_64.whl (147.0 kB)
Uploaded: CPython 3.14, macOS 10.15+ x86-64

sd_pydepta-0.3.1-cp313-cp313-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl (293.2 kB)
Uploaded: CPython 3.13, manylinux: glibc 2.28+ x86-64, manylinux: glibc 2.5+ x86-64

sd_pydepta-0.3.1-cp313-cp313-macosx_11_0_arm64.whl (146.4 kB)
Uploaded: CPython 3.13, macOS 11.0+ ARM64

sd_pydepta-0.3.1-cp313-cp313-macosx_10_13_x86_64.whl (147.0 kB)
Uploaded: CPython 3.13, macOS 10.13+ x86-64

sd_pydepta-0.3.1-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl (296.3 kB)
Uploaded: CPython 3.12, manylinux: glibc 2.28+ x86-64, manylinux: glibc 2.5+ x86-64

sd_pydepta-0.3.1-cp312-cp312-macosx_11_0_arm64.whl (146.6 kB)
Uploaded: CPython 3.12, macOS 11.0+ ARM64

sd_pydepta-0.3.1-cp312-cp312-macosx_10_13_x86_64.whl (147.2 kB)
Uploaded: CPython 3.12, macOS 10.13+ x86-64

sd_pydepta-0.3.1-cp311-cp311-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl (288.6 kB)
Uploaded: CPython 3.11, manylinux: glibc 2.28+ x86-64, manylinux: glibc 2.5+ x86-64

sd_pydepta-0.3.1-cp311-cp311-macosx_11_0_arm64.whl (146.8 kB)
Uploaded: CPython 3.11, macOS 11.0+ ARM64

sd_pydepta-0.3.1-cp311-cp311-macosx_10_9_x86_64.whl (146.9 kB)
Uploaded: CPython 3.11, macOS 10.9+ x86-64

sd_pydepta-0.3.1-cp310-cp310-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl (280.4 kB)
Uploaded: CPython 3.10, manylinux: glibc 2.28+ x86-64, manylinux: glibc 2.5+ x86-64

sd_pydepta-0.3.1-cp310-cp310-macosx_11_0_arm64.whl (147.0 kB)
Uploaded: CPython 3.10, macOS 11.0+ ARM64

sd_pydepta-0.3.1-cp310-cp310-macosx_10_9_x86_64.whl (147.1 kB)
Uploaded: CPython 3.10, macOS 10.9+ x86-64
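The wheel names listed here follow the standard layout distribution-version-python tag-abi tag-platform tag, where a dot inside a tag separates alternatives. The sketch below splits such a name using only string operations. parse_wheel_filename and WheelName are names chosen for this example; for production use, the packaging library's packaging.utils.parse_wheel_filename is likely the more robust choice.

```python
from typing import NamedTuple


class WheelName(NamedTuple):
    distribution: str
    version: str
    python_tags: list
    abi_tags: list
    platform_tags: list


def parse_wheel_filename(filename):
    """Split a wheel file name into its components (an optional build tag is ignored)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel name: {filename}")
    distribution, version = parts[0], parts[1]
    python_tag, abi_tag, platform_tag = parts[-3:]
    # A dot inside a tag denotes a compressed set of alternatives
    return WheelName(
        distribution,
        version,
        python_tag.split("."),
        abi_tag.split("."),
        platform_tag.split("."),
    )


wheel = parse_wheel_filename(
    "sd_pydepta-0.3.1-cp310-cp310-"
    "manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl"
)
print(wheel.python_tags, wheel.platform_tags)
```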

File details

Details for the file sd_pydepta-0.3.1.tar.gz.

File metadata

  • Download URL: sd_pydepta-0.3.1.tar.gz
  • Upload date:
  • Size: 307.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sd_pydepta-0.3.1.tar.gz
Algorithm Hash digest
SHA256 69c01448e919c6515df4150f557e29b5583afd54f9134df5d470daa484db1996
MD5 5ee7e86ab0ea58086e5977bcc69a5ee1
BLAKE2b-256 085ff0971767f9050162adca09f8d78c082e169d76e983e6ae48382a6d238fd2

See more details on using hashes here.
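A downloaded file can be checked against the SHA256 digest listed above with Python's hashlib. This is a minimal sketch; sha256_of and matches are hypothetical helper names, and the local file path in the usage comment is an assumption.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare the file's digest with the published one (case-insensitive)."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())


# Usage, assuming the sdist was saved to the current directory:
# matches("sd_pydepta-0.3.1.tar.gz",
#         "69c01448e919c6515df4150f557e29b5583afd54f9134df5d470daa484db1996")
```

pip can enforce the same check at install time through its hash-checking mode (--require-hashes with a requirements file).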

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1.tar.gz:

Publisher: test.yml on SpazioDati/pydepta

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sd_pydepta-0.3.1-cp314-cp314-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl.

File hashes

Hashes for sd_pydepta-0.3.1-cp314-cp314-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl
Algorithm Hash digest
SHA256 c871844b722c8f52bd4713319c896325e1ee4d61bbe9f55d4bb702e33fd144ed
MD5 32b49032b10fbff2bdf8aa4fe7bc2f23
BLAKE2b-256 02f9e129bd0347708a8fdcd6fbb6554df83bb840e260044ae93f8abe11268b30

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1-cp314-cp314-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl:

Publisher: test.yml on SpazioDati/pydepta

File details

Details for the file sd_pydepta-0.3.1-cp314-cp314-macosx_11_0_arm64.whl.

File hashes

Hashes for sd_pydepta-0.3.1-cp314-cp314-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 6c4d8e3b0d2ebd96436018053a8ba4c7bbeeee9804cbed7b16ea2f6814d618d5
MD5 d449afb9c4385aa83867f5e11f9df432
BLAKE2b-256 5f924b941ec8a629254c81f7994873a25198b937e684749716d14544cc994363

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1-cp314-cp314-macosx_11_0_arm64.whl:

Publisher: test.yml on SpazioDati/pydepta

File details

Details for the file sd_pydepta-0.3.1-cp314-cp314-macosx_10_15_x86_64.whl.

File hashes

Hashes for sd_pydepta-0.3.1-cp314-cp314-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 da06294d2589b415d7499d6818f5881de2679275b65497d2996542175649970f
MD5 c461ec52945c77d23b9885f1fb0b980e
BLAKE2b-256 d6c4da20b01cbd0b0b1c4441858e9bc49ef2fc8b9afc85ced95876492083b66f

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1-cp314-cp314-macosx_10_15_x86_64.whl:

Publisher: test.yml on SpazioDati/pydepta

File details

Details for the file sd_pydepta-0.3.1-cp313-cp313-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl.

File hashes

Hashes for sd_pydepta-0.3.1-cp313-cp313-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl
Algorithm Hash digest
SHA256 75a0534b80664df78b0fdb53da873b024428fb0a3192cb268963a17a37401938
MD5 cab9560ffcd4368071f13cd9a510b1d5
BLAKE2b-256 58a49e3628fe3c049d831fcb6690b487c7676cb969f9035b736d0295e0e507e6

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1-cp313-cp313-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl:

Publisher: test.yml on SpazioDati/pydepta

File details

Details for the file sd_pydepta-0.3.1-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for sd_pydepta-0.3.1-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 80f3dd76bdd44c8efd7e8118f89e3db29c76d9e2de679a0b9ff06e3b06fa7525
MD5 c5b7d0bbf123e48c3826d3dc8544318b
BLAKE2b-256 7069613ac23bf23a9499e05d1509a3610fcc69f258d149cc19e5097b9ee8c062

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1-cp313-cp313-macosx_11_0_arm64.whl:

Publisher: test.yml on SpazioDati/pydepta

File details

Details for the file sd_pydepta-0.3.1-cp313-cp313-macosx_10_13_x86_64.whl.

File hashes

Hashes for sd_pydepta-0.3.1-cp313-cp313-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 b39cf330221b5a01b4c4920a911bab823d0e1cc0eee47d170c9559ce18528bd8
MD5 ca1ca3d8c7f91c2374e3ce3771c4d347
BLAKE2b-256 14b5561e31d77bf3c6d7179e6e1cc6f0eab29c1c83ce4c0277756e77ed076b3b

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1-cp313-cp313-macosx_10_13_x86_64.whl:

Publisher: test.yml on SpazioDati/pydepta

File details

Details for the file sd_pydepta-0.3.1-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl.

File hashes

Hashes for sd_pydepta-0.3.1-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl
Algorithm Hash digest
SHA256 6eec47dd7254b6a50e52075fc3d79e1b7a7efe75bfd63a0d12105a20f49db054
MD5 a6674daa8c30373ed83dc344a65073da
BLAKE2b-256 d4142b95237ed24a8b7ffd3071d8ed0fc535ccc15dd13969c7e447842aa18d53

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl:

Publisher: test.yml on SpazioDati/pydepta

File details

Details for the file sd_pydepta-0.3.1-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for sd_pydepta-0.3.1-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d49cedce30427a1cb51147f17b3a71bedce05dd4c4a5163ffeb2f831671994b5
MD5 4dd1f850ad255efbce8a042253801dee
BLAKE2b-256 b19d4abaa27cf068c5f9ab97eece57823f711a6abf712fcbdef62bcd05fe21ba

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1-cp312-cp312-macosx_11_0_arm64.whl:

Publisher: test.yml on SpazioDati/pydepta

File details

Details for the file sd_pydepta-0.3.1-cp312-cp312-macosx_10_13_x86_64.whl.

File hashes

Hashes for sd_pydepta-0.3.1-cp312-cp312-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 3728a304d6a01dc861cdd52a3eec40f73a0a729a3a331d87c7075332271e3316
MD5 7c6013b5e2e1500af01b10e46768053e
BLAKE2b-256 39b18e9fa33595065a0c0789fd2e621facfc0819d220626feef66bd5d325e9b5

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1-cp312-cp312-macosx_10_13_x86_64.whl:

Publisher: test.yml on SpazioDati/pydepta

File details

Details for the file sd_pydepta-0.3.1-cp311-cp311-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl.

File hashes

Hashes for sd_pydepta-0.3.1-cp311-cp311-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl
Algorithm Hash digest
SHA256 8ce6603e1515479d4212625e00bfa4c5df24ad36cf71b16de48913256c09f795
MD5 7333261e0162da5c67623611dbd01ba4
BLAKE2b-256 947f12ad12f4d892337abb16b602e96b194799b635cea33d1e0524ccaaf42208

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1-cp311-cp311-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl:

Publisher: test.yml on SpazioDati/pydepta

File details

Details for the file sd_pydepta-0.3.1-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for sd_pydepta-0.3.1-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a2c1cb658ca6bf62bfb9d6e82ae930e36bfcec70afced2b2145608001687ae03
MD5 b406b3a3134df5267d2c5c2e5f2ebcd4
BLAKE2b-256 d7eb35960e3b07f365cad0c47791c5c45eefa6a99141d63aa54fc6414b523e04

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1-cp311-cp311-macosx_11_0_arm64.whl:

Publisher: test.yml on SpazioDati/pydepta

File details

Details for the file sd_pydepta-0.3.1-cp311-cp311-macosx_10_9_x86_64.whl.

File hashes

Hashes for sd_pydepta-0.3.1-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 f557efae2b75ed9d1afb870a203464b164a196549a6e27d630b6c2ccd04e600a
MD5 59a4f722b3923598646dfa8c7ebab6e0
BLAKE2b-256 4c40c02e9e5798f91eff00b203e6506e3aecf38a02c6442e5d731ce5711f7969

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1-cp311-cp311-macosx_10_9_x86_64.whl:

Publisher: test.yml on SpazioDati/pydepta

File details

Details for the file sd_pydepta-0.3.1-cp310-cp310-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl.

File hashes

Hashes for sd_pydepta-0.3.1-cp310-cp310-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl
Algorithm Hash digest
SHA256 6544d72ecc9bc2d99755d240649a334c9c8d0ab3adff6aa8d3bbf3591a6fbeb5
MD5 fe9553bc942ecf1e0d61712fbd55d88d
BLAKE2b-256 aef45bcefb21d5bcc44c030c06a8bbd74c8b69cf5b167796885f666520d29d40

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1-cp310-cp310-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl:

Publisher: test.yml on SpazioDati/pydepta

File details

Details for the file sd_pydepta-0.3.1-cp310-cp310-macosx_11_0_arm64.whl.

File hashes

Hashes for sd_pydepta-0.3.1-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 f88dccd2d16c113a9636e22470cbb8889c06f6726c6d876dc79eeb6ef39d8018
MD5 303941c39c5e22865bff6406d015a358
BLAKE2b-256 8a22e8a1000e2103f8f4398e929fbb5dfe94184ba5b7b862270b217796e933f6

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1-cp310-cp310-macosx_11_0_arm64.whl:

Publisher: test.yml on SpazioDati/pydepta

File details

Details for the file sd_pydepta-0.3.1-cp310-cp310-macosx_10_9_x86_64.whl.

File hashes

Hashes for sd_pydepta-0.3.1-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 2826261688c5cfde9c38c7406e8695db85722d5d499193d43566b72b8f127f06
MD5 0bc751e7a8822e5b9b9a62280b5c7dbe
BLAKE2b-256 851f153f12c6c21f21edbca9e3e195419c7d35cff8294a8a1a57ef2f290bf7d0

Provenance

The following attestation bundles were made for sd_pydepta-0.3.1-cp310-cp310-macosx_10_9_x86_64.whl:

Publisher: test.yml on SpazioDati/pydepta
