An in-depth ikea scraper

HEMNES

About the Project

Hemnes is a pip package for scraping product data from Ikea. Good software avoids code repetition, yet developers rewrite scrapers for specific websites all the time. Hemnes grew out of some backend code written for another project. At the time I was extremely annoyed to find that there was no available code for scraping Ikea. I expanded this into a proper pip package because rewriting scrapers is a waste of time: time that could go toward developing new features or core functionality is instead spent digging through HTML and CSS.

Hemnes gets you to the point of being able to query Ikea's product catalog in less than 30 seconds (I didn't time this). Just install from PyPI, specify a query, and start pulling data. Scraping should always be that easy.

The following product data is collected by Hemnes:

  • name (str) - name of the product
  • id (str) - unique product id
  • price (float) - product price
  • url (str) - url to product page
  • rating (float) - average customer rating
  • img_urls (list[str]) - urls to product images
  • colors (list[str]) - product colors (see COLORS in hemnes/helpers/find_element for a full list of colors being searched for)

Hemnes comes equipped with a bit of helpful functionality for doing things like:

  • finding products with specific keywords in their descriptions
  • logging
  • specifying a desired number of results.

Read more about extended functionality under Additional Options.

Built With

Why Selenium or Why not Python Requests

If you browse Stack Overflow posts about scraping Ikea, or if you are considering writing a scraper yourself, you will probably come across people using Selenium WebDriver for pages that don't need it. Selenium is heavier than Python Requests; however, WebDriver can load Angular-generated content, whereas Requests cannot.

Ikea's current website uses Angular for its search results. This is a change from its older search, which was accessible via Python Requests. The new search provides more accurate results and less garbage. Speaking from experience, the old search was really terrible: for any given query it returned a number of trash results that I had trouble connecting back to my original search. For example, the old search returned 58 pages of results for 'table', which included things like placemats and dog toys; the new search returns 15, and all of the results are actually tables.

Getting Started

Hemnes is pretty straightforward to install and use. It functions exactly as you would expect a web-scraping package to function - just enter a query and get results back.

Runtime

  • Python 3+

Installation

Hemnes is installed using standard pip installation. To install from PyPI run

pip3 install hemnes

Alternatively, you can clone the repo and then run pip install inside the directory

# clone the repository
git clone https://github.com/sayeefrmoyen/hemnes.git
cd hemnes
pip3 install . # install the current directory as a package

ChromeDriver

Hemnes uses ChromeDriver to load webpages. If ChromeDriver is already installed and on your system PATH, you can skip the rest of this section.

To use ChromeDriver you will need to install Google Chrome. If you already have Google Chrome installed, I believe you should be able to run Hemnes without a problem. In full disclosure, I am not familiar with the Google Chrome or ChromeDriver code and do not know how they interact, but I have had no issues running Hemnes without ever explicitly installing the ChromeDriver binary.

With that said, Selenium's documentation for WebDriver suggests that you install both a version of Google Chrome and ChromeDriver. If you find that after installing Google Chrome you are still receiving errors regarding ChromeDriver, then you should proceed to install ChromeDriver.
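If you are unsure whether ChromeDriver is on your PATH, a quick check with the standard library can tell you. This is only a sketch: find_chromedriver is a hypothetical helper name, not part of Hemnes, and it assumes the binary is named chromedriver.

```python
import shutil


def find_chromedriver():
    """Return the full path to a chromedriver binary on PATH, or None."""
    # shutil.which searches the directories listed in the PATH variable
    return shutil.which("chromedriver")


path = find_chromedriver()
if path is None:
    print("chromedriver not found on PATH; install it or set cdriver_path")
else:
    print(f"chromedriver found at {path}")
```

If the check finds nothing and Hemnes raises ChromeDriver errors, installing the binary or passing its location through the cdriver_path option (described under Additional Options) are the two things to try.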

At this point, you should be ready to start using Hemnes.

Tests

Tests are broken into test_isolated and test_integrated. test_isolated are, as they sound, true unit-tests for individual functionality. test_integrated are tests for higher-level functionality that depends on some of the lower level functionality provided by functions tested in test_isolated.

To run the tests clone the repository and run using pytest

# install pytest if you do not have pytest installed
pip3 install pytest
# clone the repository
git clone https://github.com/sayeefrmoyen/hemnes.git
cd hemnes
# --verbose flag is optional - provides helpful console output
pytest --verbose # run all of the tests
pytest test/test_isolated.py --verbose # only run isolation tests
pytest test/test_integrated.py --verbose # only run integration tests

Usage

Hemnes returns query results in the form of a list[Product]. Product is a helper class containing the following fields:

  • name (str) - name of the product
  • id (str) - unique product id
  • price (float) - product price
  • url (str) - url to product page
  • rating (float) - average customer rating
  • img_urls (list[str]) - urls to product images
  • colors (list[str]) - product colors
  • tag (str) - flexible usage field

Most of these are rather self-explanatory; I'm only going to talk about the tag field in depth. The tag attribute is set to the same value for all products returned by a single call to process_query, and is included for flexible use. One example of why such a field would be useful is if you were storing this data in a database and needed a column to filter on - you could use tag to indicate the type of product (chair, table, etc.). By default tag is set to None. For more details on using tag, see Additional Options.

One more thing to note is that all of these fields, excluding url, can fail to be found, in which case they are set to None. However, based on testing and on examining the structure of Ikea's product webpages, it is unlikely that any of them will be missing. The one exception is the rating field, which is set to None any time the product has no reviews.
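Because rating (and, rarely, other fields) can be None, code that aggregates over results should guard against it. The sketch below uses a hypothetical stand-in Product dataclass with only a few of the fields listed above, so that it runs without Hemnes installed; average_rating is an illustrative helper, not part of the package.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Product:
    """Hypothetical stand-in with a subset of the fields listed above."""
    name: Optional[str]
    price: Optional[float]
    rating: Optional[float]


def average_rating(products: List[Product]) -> Optional[float]:
    """Average the ratings that exist, skipping products with no reviews."""
    rated = [p.rating for p in products if p.rating is not None]
    if not rated:
        return None  # nothing to average
    return sum(rated) / len(rated)


sample = [
    Product("chair A", 49.99, 4.5),
    Product("chair B", 79.0, None),  # no reviews, so rating is None
    Product("chair C", 25.0, 3.5),
]
print(average_rating(sample))  # 4.0
```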

Basic Usage

At its simplest, Hemnes only requires a query. Use the process_query function to retrieve product data for a given query. process_query returns a list[Product] containing the products matching the given query.

import hemnes

# query ikea's product catalog for products tagged as chairs
# results is now a list[Product] containing all of the products
# in ikea's catalog of chairs
results = hemnes.process_query('chair')

That's it. All of the fields of Product are available for you to do whatever you want with the results.
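As one illustration of working with the returned list, the sketch below picks out the products under a budget and sorts them by price. The SimpleNamespace objects and example.com URLs are hypothetical stand-ins for real Product results, and cheapest_first is an illustrative helper, not something Hemnes provides.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the Product objects process_query returns
results = [
    SimpleNamespace(name="chair A", price=49.99, url="https://example.com/a"),
    SimpleNamespace(name="chair B", price=None, url="https://example.com/b"),
    SimpleNamespace(name="chair C", price=25.0, url="https://example.com/c"),
]


def cheapest_first(products, budget):
    """Return products priced at or under budget, cheapest first."""
    # skip products whose price could not be found (price is None)
    priced = [p for p in products if p.price is not None and p.price <= budget]
    return sorted(priced, key=lambda p: p.price)


for product in cheapest_first(results, budget=50):
    print(f"{product.name}: {product.price} ({product.url})")
```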

Additional Options

Hemnes accepts an optional Options object for specifying additional settings. Options is a helper class that bundles the many parameters process_query can take. If you don't provide an Options object to process_query, default settings are used. The rest of this section discusses how to modify those settings.

Specifying the path to the ChromeDriver binary

If you have installed the ChromeDriver binary and Google Chrome browser, but are still encountering errors when running Hemnes, you may need to explicitly pass the path to the binary to selenium Webdriver. To do so, use the Options class

import hemnes
...
# explicitly passing the chromedriver binary to webdriver
options = hemnes.Options()
options.cdriver_path = 'path/to/chromedriver/binary'
results = hemnes.process_query('your query', options)

Required Keywords & Strict Searching

Sometimes it's necessary to refine your query beyond usual high-level query terms. Hemnes allows you to specify a number of keywords to search for on product pages, and only return products which contain all or some of those words. Options accepts setting the keywords field to a set[str] containing the desired keywords.

You can also specify whether or not all of the keywords should be required by setting the strict field of Options to a bool. By default, strict is set to False, meaning that if keywords are passed, any product with at least one keyword will be returned. To require that all keywords are found for returned products, set strict to True.

import hemnes
...
# setting required keywords
options = hemnes.Options()
options.keywords = {'large', 'comfortable'}
# enable strict-searching, requiring all products to contain all of the
# keywords in order to be returned. If disabled or untouched, any
# product with at least one of the keywords will be returned
options.strict = True
results = hemnes.process_query('chair', options)

For those who are curious about where keywords are being looked for, Hemnes searches through 3 different product description sections on each product's page for keywords.
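The difference between strict and non-strict searching comes down to "all keywords" versus "any keyword". The sketch below shows that logic on a single description string. It is not Hemnes's actual implementation: matches_keywords is a hypothetical name, and the case-insensitive substring matching is an assumption.

```python
def matches_keywords(description, keywords, strict=False):
    """Check a description against keywords.

    strict=True  -> every keyword must appear
    strict=False -> at least one keyword must appear
    """
    if not keywords:
        return True  # no keywords means nothing to filter on
    text = description.lower()
    # assumption: case-insensitive substring matching
    found = [keyword.lower() in text for keyword in keywords]
    return all(found) if strict else any(found)


description = "A large armchair with a washable cover"
keywords = {"large", "comfortable"}
print(matches_keywords(description, keywords))               # True
print(matches_keywords(description, keywords, strict=True))  # False
```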

Enable Logging

Queries that return a large number of results may take a while to complete. Even for shorter jobs, it can be helpful to see where Hemnes is in processing a query. To log progress while a query is processed, set the log field in Options to True. By default log is set to False to avoid overwhelming unsuspecting users.

import hemnes
...
# enable logging - this will log to both stdout and to a
# logfile found at 'hemnes-logs/hemnes-MONTH-DAY-HOUR-MINUTE-SECOND.log'
options = hemnes.Options()
options.log = True
results = hemnes.process_query('chair', options)

By default logging will log to both stdout and to a log file that will look something like hemnes-logs/hemnes-04-23-02:41:16.log - the file is named hemnes-MONTH-DAY-HOUR-MINUTE-SECOND.log. If there is no hemnes-logs directory it will be created prior to trying to write files to it.
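If you want to locate or generate log file names of this shape yourself, the naming scheme can be reproduced with strftime. This is a sketch of the pattern described above, not Hemnes's own code; log_file_path is a hypothetical helper and the year in the example timestamp is arbitrary since it does not appear in the name.

```python
import os
import tempfile
from datetime import datetime


def log_file_path(log_dir="hemnes-logs", now=None):
    """Build a path shaped like hemnes-logs/hemnes-MONTH-DAY-HOUR:MINUTE:SECOND.log."""
    now = now or datetime.now()
    # create the log directory if it does not exist yet
    os.makedirs(log_dir, exist_ok=True)
    return os.path.join(log_dir, now.strftime("hemnes-%m-%d-%H:%M:%S.log"))


# use a temporary directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    example = log_file_path(
        os.path.join(tmp, "hemnes-logs"), datetime(2020, 4, 23, 2, 41, 16)
    )
    print(os.path.basename(example))  # hemnes-04-23-02:41:16.log
```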

Hemnes will log things like:

  • when a valid product is found
  • when an invalid product is found (e.g. fails keyword requirements)
  • total number of valid products to be returned
  • any potential errors

Retrieving a Specific Number of Results

If you only need a specific number of results for a given query, set the num_results field of Options.

import hemnes
...
# setting a target number of results
options = hemnes.Options()
options.num_results = 10
# hemnes will only return up to 10 products
results = hemnes.process_query('chair', options)

Hemnes will return the number of products requested, or fewer if the query did not return enough results.

Speeding Up

Loading Angular pages can be slow, mostly because it takes time to retrieve the full DOM from such pages. Reducing the sleep time Hemnes imposes while waiting for the DOM to load can drastically increase its speed.

By default, Hemnes sleeps for 3 seconds after each page request in order to ensure that the DOM is fully loaded. For users with fast download speeds, this may be longer than necessary. To reduce the sleep time after network requests, set the sleep_time attribute of Options to something more appropriate for your internet connection.

import hemnes
...
# reducing the sleep time after each page request
options = hemnes.Options()
options.sleep_time = 1
# hemnes will now wait only 1 second for DOM to be loaded for 
# newly retrieved pages. Users should set sleep_time to an 
# appropriate amount of time based on their internet connection.
# The default setting of 3-seconds should be fine for almost all users
results = hemnes.process_query('chair', options)

Using Product's Tag Field

The tag field can be set for every Product returned by a given call to process_query by setting it in Options. tag should be of type str.

import hemnes
...
# setting the tag attribute
options = hemnes.Options()
options.tag = 'chair'
# all of the products returned in results will have their tag field 
# set to the string "chair"
results = hemnes.process_query('chair', options)

What's Next

Some ideas floating around include:

  • Enabling logging only to console or to file
  • Writing results to json/csv
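Until something like that is built in, results can be serialized by hand. The sketch below is one way to do it under a few assumptions: the SimpleNamespace object and example.com URLs stand in for a real Product, to_dicts, to_json and to_csv are hypothetical helper names, and joining list fields with ';' in the CSV is just one possible convention.

```python
import csv
import io
import json
from types import SimpleNamespace

# field names taken from the Product description in the Usage section
FIELDS = ("name", "id", "price", "url", "rating", "img_urls", "colors", "tag")


def to_dicts(products):
    """Turn Product-like objects into plain dicts, one per product."""
    return [{field: getattr(p, field, None) for field in FIELDS} for p in products]


def to_json(products):
    """Serialize products to a JSON string."""
    return json.dumps(to_dicts(products), indent=2)


def to_csv(products):
    """Serialize products to a CSV string with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for row in to_dicts(products):
        # flatten list fields so each product fits on one CSV row
        for key in ("img_urls", "colors"):
            row[key] = ";".join(row[key] or [])
        writer.writerow(row)
    return buffer.getvalue()


# hypothetical stand-in for one Product returned by process_query
sample = [
    SimpleNamespace(
        name="chair A", id="123", price=49.99, url="https://example.com/a",
        rating=None, img_urls=["https://example.com/a.jpg"],
        colors=["black", "white"], tag="chair",
    )
]
print(to_json(sample))
print(to_csv(sample))
```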

Release History

  • 1.0 - first stable, tested release
  • 0.*.* - pre-test releases

License

Distributed under the MIT License. See LICENSE for more information.
