Extracts cooking recipe from HTML structured data in the https://schema.org/Recipe format.
scrape-schema-recipe
Scrapes recipes from HTML https://schema.org/Recipe (Microdata/JSON-LD) into Python dictionaries.
Python version 3.4+
This library relies heavily upon extruct.
Online Example
>>> import scrape_schema_recipe
>>> url = 'https://www.foodnetwork.com/recipes/alton-brown/honey-mustard-dressing-recipe-1939031'
>>> recipe_list = scrape_schema_recipe.scrape_url(url, python_objects=True)
>>> len(recipe_list)
1
>>> recipe = recipe_list[0]
# Name of the recipe
>>> recipe['name']
'Honey Mustard Dressing'
# List of the Ingredients
>>> recipe['recipeIngredient']
['5 tablespoons medium body honey (sourwood is nice)',
'3 tablespoons smooth Dijon mustard',
'2 tablespoons rice wine vinegar']
# List of the Instructions
>>> recipe['recipeInstructions']
['Combine all ingredients in a bowl and whisk until smooth. Serve as a dressing or a dip.']
# Author
>>> recipe['author']
[{'@type': 'Person',
'name': 'Alton Brown',
'url': 'https://www.foodnetwork.com/profiles/talent/alton-brown'}]
The '@type': 'Person' entry means this is a https://schema.org/Person object
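In schema.org data the 'author' field can appear either as a single Person dict or as a list of them, as above. A minimal sketch (the helper name `author_names` is an illustration, not part of this library) that normalizes both shapes into a list of names:

```python
# Sketch: normalize the 'author' field, which may be a single
# schema.org/Person dict or a list of them, into a list of names.
def author_names(recipe):
    authors = recipe.get('author', [])
    if isinstance(authors, dict):  # a single Person object
        authors = [authors]
    return [a.get('name', '') for a in authors if isinstance(a, dict)]

recipe = {'author': [{'@type': 'Person', 'name': 'Alton Brown',
                      'url': 'https://www.foodnetwork.com/profiles/talent/alton-brown'}]}
print(author_names(recipe))  # ['Alton Brown']
```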
# Preparation Time
>>> recipe['prepTime']
datetime.timedelta(0, 300)
# The library pendulum can give you something a little easier to read.
>>> import pendulum
# for pendulum version 1.0
>>> pendulum.Interval.instanceof(recipe['prepTime'])
<Interval [5 minutes]>
# for version 2.0 of pendulum
>>> pendulum.Duration(seconds=recipe['prepTime'].total_seconds())
<Duration [5 minutes]>
If python_objects wasn't set to True, this would return the ISO 8601 string representation of the duration, 'PT5M'
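If you keep python_objects=False and want to convert such a duration string yourself, a minimal sketch (handling only the common hour/minute/second components, not date components like 'P1D') could look like:

```python
import re
from datetime import timedelta

# Sketch: convert a simple ISO 8601 time duration like 'PT5M'
# (what you get when python_objects=False) into a timedelta.
# Only hour/minute/second components are handled here.
def parse_iso8601_duration(s):
    m = re.fullmatch(r'PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?', s)
    if not m:
        raise ValueError('unsupported duration: %r' % s)
    hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

print(parse_iso8601_duration('PT5M'))  # 0:05:00
```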
# Publication date
>>> recipe['datePublished']
datetime.datetime(2016, 11, 13, 21, 5, 50, 518000, tzinfo=<FixedOffset '-05:00'>)
>>> str(recipe['datePublished'])
'2016-11-13 21:05:50.518000-05:00'
# Identifying this as http://schema.org/Recipe data (in JSON-LD format)
>>> recipe['@context'], recipe['@type']
('http://schema.org', 'Recipe')
# Content's URL
>>> recipe['url']
'https://www.foodnetwork.com/recipes/alton-brown/honey-mustard-dressing-recipe-1939031'
# all the keys in this dictionary
>>> recipe.keys()
dict_keys(['recipeYield', 'totalTime', 'dateModified', 'url', '@context', 'name', 'publisher', 'prepTime', 'datePublished', 'recipeIngredient', '@type', 'recipeInstructions', 'author', 'mainEntityOfPage', 'aggregateRating', 'recipeCategory', 'image', 'headline', 'review'])
Example from a File (alternative representations)
This also works with a locally saved HTML file.
>>> filelocation = 'test_data/google-recipe-example.html'
>>> recipe_list = scrape_schema_recipe.scrape(filelocation, python_objects=True)
>>> recipe = recipe_list[0]
>>> recipe['name']
'Party Coffee Cake'
>>> recipe['datePublished']
datetime.date(2018, 3, 10)
# Recipe Instructions using the HowToStep
>>> recipe['recipeInstructions']
[{'@type': 'HowToStep',
'text': 'Preheat the oven to 350 degrees F. Grease and flour a 9x9 inch pan.'},
{'@type': 'HowToStep',
'text': 'In a large bowl, combine flour, sugar, baking powder, and salt.'},
{'@type': 'HowToStep', 'text': 'Mix in the butter, eggs, and milk.'},
{'@type': 'HowToStep', 'text': 'Spread into the prepared pan.'},
{'@type': 'HowToStep', 'text': 'Bake for 30 to 35 minutes, or until firm.'},
{'@type': 'HowToStep', 'text': 'Allow to cool.'}]
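Because recipeInstructions can come back either as plain strings (as in the first example) or as HowToStep dicts (as here), code consuming the results may want to normalize both forms. A sketch (the helper name `instruction_texts` is illustrative, not part of this library):

```python
# Sketch: recipeInstructions may be a list of plain strings or a list
# of HowToStep dicts; this normalizes both into a list of text steps.
def instruction_texts(recipe):
    steps = recipe.get('recipeInstructions', [])
    if isinstance(steps, str):  # a single instruction string
        steps = [steps]
    return [s['text'] if isinstance(s, dict) else s for s in steps]

recipe = {'recipeInstructions': [
    {'@type': 'HowToStep', 'text': 'Preheat the oven to 350 degrees F.'},
    {'@type': 'HowToStep', 'text': 'Allow to cool.'}]}
print(instruction_texts(recipe))
```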
What Happens when Things Go Wrong
If there aren't any http://schema.org/Recipe formatted recipes on the site, an empty list is returned.
>>> url = 'https://www.google.com'
>>> recipe_list = scrape_schema_recipe.scrape(url, python_objects=True)
>>> len(recipe_list)
0
Some websites will cause an HTTPError.
You may get around a 403 Forbidden error by passing an alternative user-agent
via the parameter user_agent_str.
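What passing a custom user-agent amounts to, in plain urllib terms, is sending a browser-like User-Agent header instead of the default 'Python-urllib/3.x', which some sites reject with 403 Forbidden. A sketch of the idea (the UA string here is just an example):

```python
import urllib.request

# Sketch: attach a browser-like User-Agent header to a request,
# instead of the default 'Python-urllib/3.x' that some sites block.
ua = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
req = urllib.request.Request(
    'https://www.foodnetwork.com/recipes/alton-brown/honey-mustard-dressing-recipe-1939031',
    headers={'User-Agent': ua})
print(req.get_header('User-agent'))  # the custom UA string
```

With this library you would simply pass the same string as `user_agent_str` to `scrape_url()` or `scrape()`.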
Functions
load()
- loads HTML schema.org/Recipe structured data from a file or file-like object
loads()
- loads HTML schema.org/Recipe structured data from a string
scrape_url()
- scrapes a URL for HTML schema.org/Recipe structured data
scrape()
- loads HTML schema.org/Recipe structured data from a file, file-like object, string, or URL
Parameters
----------
location : string or file-like object
A url, filename, or text_string of HTML, or a file-like object.
python_objects : bool, optional
    when True, translates some data types into Python objects:
    dates into datetime.date, datetimes into datetime.datetime,
    and durations into datetime.timedelta. (defaults to False)
nonstandard_attrs : bool, optional
when True it adds nonstandard (for schema.org/Recipe) attributes to the
resulting dictionaries, that are outside the specification such as:
'_format' is either 'json-ld' or 'microdata' (how schema.org/Recipe was encoded into HTML)
'_source_url' is the source url, when 'url' has already been defined as another value
(defaults to False)
user_agent_str : string, optional ***only for scrape_url() and scrape()***
    override the default user-agent string with this value.
    (defaults to None)
Returns
-------
list
a list of dictionaries in the style of schema.org/Recipe JSON-LD
    if there are no results, an empty list is returned
These are also available with help()
in the python console.
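Since every function returns a list (empty when nothing is found), callers should guard before indexing. A trivial sketch (the helper name `first_recipe` is illustrative):

```python
# Sketch: the scrape functions return an empty list when no
# schema.org/Recipe data is found, so guard before indexing.
def first_recipe(recipe_list):
    return recipe_list[0] if recipe_list else None

print(first_recipe([]))  # None
print(first_recipe([{'name': 'Party Coffee Cake'}]))
```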
Files
License: Apache 2.0 see LICENSE
Test data attribution and licensing: ATTRIBUTION.md
Development
Unit testing can be run by:
schema-recipe-scraper$ python3 test_scrape.py
mypy is used for static type checking
from the project directory:
schema-recipe-scraper$ mypy schema_recipe_scraper/scrape.py
If you run mypy from another directory, the --ignore-missing-imports
flag will need to be added,
thus $ mypy --ignore-missing-imports scrape.py
The --ignore-missing-imports
flag is needed because most libraries don't have static typing information contained
in their own code or in typeshed.
Reference Documentation
Here are some references for how schema.org/Recipe should be structured:
- https://schema.org/Recipe - official specification
- Recipe Google Search Guide - material teaching developers how to use the schema (with some emphasis on how structured data impacts search results)
Other Similar Python Libraries
- recipe_scrapers - a library that scrapes recipes from HTML tags using BeautifulSoup. It has a driver for every supported website. It is a great fallback for when scrape-schema-recipe cannot scrape a site.