
A dynamic and scalable data pipeline from Airbnb's commercial site to your local system or cloud storage.

Project description

Commercial Scraper

A fully dynamic and scalable data pipeline, written in Python, dedicated to scraping commercial websites that don't offer APIs. It can yield both structured and unstructured data, and can save that data locally and/or to the cloud via the data processing module.

Currently the scraper only supports Airbnb's website, but support for more websites is in the works to generalise the package.

Installation

Use the package manager pip to install CommercialScraper.

pip install CommercialScraper

Usage

from CommercialScraper.pipeline import AirbnbScraper
from CommercialScraper import data_processing

scraper = AirbnbScraper()

# Returns a dictionary of structured data and a list of image sources for a single product page
product_dict, imgs = scraper.scrape_product_data('https://any/airbnb/product/page', any_ID_you_wish, 'Any Category Label you wish')

# Returns a dataframe of product entries as well as a dictionary of image sources pertaining to each product entry
df, imgs = scraper.scrape_all()

# Saves the dataframe to a csv in your local directory inside a created 'data/' folder
data_processing.df_to_csv(df, 'any_filename')

# Saves images locally
data_processing.images_to_local(imgs)

# Saves structured data to a SQL database
data_processing.df_to_sql(df, table_name, username, password, hostname, port, database)

# Saves structured data to an AWS S3 bucket
data_processing.df_to_s3(df, aws_access_key_id, region_name, aws_secret_access_key, bucket_name, upload_name)

# Saves images to an AWS S3 bucket
data_processing.images_to_s3(source_links, aws_access_key_id, region_name, aws_secret_access_key, bucket_name, upload_name)
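
For example, a minimal end-to-end run combining the calls above, scraping every product and saving the results locally (the CSV filename here is illustrative):

from CommercialScraper.pipeline import AirbnbScraper
from CommercialScraper import data_processing

scraper = AirbnbScraper()

# Scrape every product entry, then persist the dataframe and images locally
df, imgs = scraper.scrape_all()
data_processing.df_to_csv(df, 'airbnb_products')
data_processing.images_to_local(imgs)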

Docker Image

This package has been containerised in a Docker image that can be run as a standalone application. Please note that with this method data can only be stored on the cloud, not locally.

docker pull docker4ldrich/airbnb-scraper

docker run -it docker4ldrich/airbnb-scraper

Follow the prompts and enter credentials carefully; there won't be a chance to correct any typing errors! It's recommended that you paste credentials in where applicable.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

CommercialScraper-1.0.0.tar.gz (12.8 kB)


Built Distribution

CommercialScraper-1.0.0-py3-none-any.whl (14.4 kB)


File details

Details for the file CommercialScraper-1.0.0.tar.gz.

File metadata

  • Download URL: CommercialScraper-1.0.0.tar.gz
  • Upload date:
  • Size: 12.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.9.1

File hashes

Hashes for CommercialScraper-1.0.0.tar.gz

  • SHA256: 20f5f9d9e07655e75348c11eae5b737f686b433b88eec67f35fff63175205e61
  • MD5: 33d940bc5de6e873e98789796460deef
  • BLAKE2b-256: 3ce3f5c8b3098e1d50fd6d8c203e0512b550021099018da74c4ca5005263fb42

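To check a downloaded file against the digests above, one option is a short sketch using Python's standard hashlib module (PyPI's BLAKE2b-256 corresponds to blake2b with a 32-byte digest; the filename assumes the archive sits in the current directory):

import hashlib

# Read the downloaded archive and compute the same digests PyPI publishes
with open('CommercialScraper-1.0.0.tar.gz', 'rb') as f:
    data = f.read()

print('SHA256:     ', hashlib.sha256(data).hexdigest())
print('MD5:        ', hashlib.md5(data).hexdigest())
print('BLAKE2b-256:', hashlib.blake2b(data, digest_size=32).hexdigest())

Each printed value should match the corresponding digest listed above.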

File details

Details for the file CommercialScraper-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: CommercialScraper-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 14.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.9.1

File hashes

Hashes for CommercialScraper-1.0.0-py3-none-any.whl

  • SHA256: 771f7468eea5752817b42fcd7f7e16fd2072589ee85838d22f000d7cce910390
  • MD5: de718143e8ad75422f64cfec2a841d9b
  • BLAKE2b-256: 298f29e11bfc9d0d2b1ea6472ff3401e5b24419231c41150f956a7f9fc7c51ee

