# YCombinator-Scraper

A Python command-line tool and package for scraping company, job, and founder data from Workatastartup.com.

YCombinator-Scraper provides a web scraping tool for extracting data from the Workatastartup website. The package uses Selenium and BeautifulSoup to navigate pages and extract information.
- Documentation: https://nneji123.github.io/ycombinator_scraper
- Source Code: https://github.com/nneji123/ycombinator_scraper
## Features

- Web Scraping Capabilities:
  - Extract detailed information about companies, including name, description, tags, images, job links, and social media links.
  - Scrape job-specific details such as title, salary range, tags, and description.
- Founder and Company Data Extraction:
  - Obtain information about company founders, including name, image, description, LinkedIn profile, and optional email addresses.
- Headless Mode:
  - Run the scraper in headless mode to perform web scraping without displaying a browser window.
- Configurability:
  - Easily configure scraper settings such as login credentials and the logs directory, using environment variables or a configuration file; the correct webdriver for your browser is installed automatically via the webdriver-manager package.
- Command-Line Interface (CLI):
  - Command-line tools to perform various scraping tasks interactively or in batch mode.
- Data Output Formats:
  - Save scraped data in JSON or CSV format, providing flexibility for further analysis or integration with other tools.
- Caching Mechanism:
  - A caching feature stores function results for a specified duration, reducing redundant web requests and improving performance.
- Docker Support:
  - The scraper is packaged as a Docker image for easy deployment and execution in containerized environments; alternatively, run the prebuilt image with `docker pull nneji123/ycombinator_scraper`.
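The caching feature described above stores results for a fixed duration. The package handles this internally; the `timed_cache` decorator below is a generic illustration of the idea (not the package's actual API), showing how repeated calls within the window skip the underlying request:

```python
import time
from functools import wraps


def timed_cache(seconds: float):
    """Cache a function's results for `seconds`, then recompute on the next call."""
    def decorator(func):
        store = {}  # maps args -> (timestamp, result)

        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[0] < seconds:
                return hit[1]          # fresh enough: serve cached result
            result = func(*args)
            store[args] = (now, result)
            return result

        return wrapper
    return decorator


calls = []  # records how many "requests" actually happened


@timed_cache(seconds=60.0)
def fetch_page(url: str) -> str:
    calls.append(url)  # stands in for a real HTTP request
    return f"<html for {url}>"


first = fetch_page("https://www.workatastartup.com/company/example-inc")
second = fetch_page("https://www.workatastartup.com/company/example-inc")
# The second call is served from the cache; only one "request" was made.
```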
## Requirements

- Python 3.9+
- Chrome or Chromium browser installed.
## Installation

```shell
$ pip install ycombinator-scraper
$ ycombinator_scraper --help
```

Output:

```
YCombinator-Scraper Version 0.6.0
Usage: python -m ycombinator_scraper [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  login
  scrape-company
  scrape-founders
  scrape-job
  version
```
### With Docker

```shell
$ git clone https://github.com/Nneji12/ycombinator-scraper
$ cd ycombinator-scraper
$ docker build -t your_name/scraper_name .  # e.g. docker build -t nneji123/ycombinator_scraper .
$ docker run nneji123/ycombinator_scraper python -m ycombinator_scraper --help
```
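Since the scraper can read settings such as login credentials from environment variables (see Configurability above), those can be passed into the container at run time. The variable names below are illustrative placeholders, not the package's documented names; check the configuration docs for the exact keys:

```shell
# Hypothetical variable names -- consult the configuration docs for the real ones.
$ docker run \
    -e SCRAPER_USERNAME="you@example.com" \
    -e SCRAPER_PASSWORD="your-password" \
    nneji123/ycombinator_scraper \
    python -m ycombinator_scraper scrape-company \
    --company-url https://www.workatastartup.com/company/example-inc
```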
## Dependencies
- click: Enables the creation of a command-line interface for interacting with the scraper tool.
- beautifulsoup4: Facilitates the parsing and extraction of data from HTML and XML in the web scraping process.
- loguru: Provides a robust logging framework to track and manage log messages generated during the scraping process.
- pandas: Utilized for the manipulation and organization of data, particularly in generating CSV files from scraped information.
- pathlib: Offers an object-oriented approach to handle file system paths, contributing to better file management within the project.
- pydantic: Used for data validation and structuring the models that represent various aspects of scraped data.
- pydantic-settings: Extends Pydantic to enhance the management of settings in the project.
- selenium: Employs browser automation for web scraping, allowing interaction with dynamic web pages and extraction of information.
## Usage

Example 1: Scrape Company Data using CLI

```shell
ycscraper scrape-company --company-url https://www.workatastartup.com/company/example-inc
```

This command will scrape data for the specified company and save it in the default output format (JSON).
Example 2: Scrape Job Data using CLI
```shell
ycscraper scrape-job --job-url https://www.workatastartup.com/job/example-job
```
This command will scrape data for the specified job and save it in the default output format (JSON).
Example 3: Scrape Founder Data using CLI
```shell
ycscraper scrape-founders --company-url https://www.workatastartup.com/company/example-inc
```
This command will scrape founder data for the specified company and save it in the default output format (JSON).
Example 4: Scrape Company Data using Python Package
```python
from ycombinator_scraper import Scraper

scraper = Scraper()
company_data = scraper.scrape_company_data("https://www.workatastartup.com/company/example-inc")
print(company_data.model_dump_json(indent=2))
```

Pydantic is used under the hood, so methods like `model_dump_json` are available for all the scraped data.
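Because the returned objects are Pydantic models, all of Pydantic's serialization helpers apply uniformly. A minimal sketch with a stand-in model (the field names here are illustrative, not the package's real schema) shows the two most useful ones:

```python
import json

from pydantic import BaseModel


class CompanyData(BaseModel):
    """Illustrative stand-in for the package's company model."""
    company_name: str
    tags: list[str]


data = CompanyData(company_name="Example Inc", tags=["b2b", "saas"])

as_json = data.model_dump_json(indent=2)  # pretty-printed JSON string
as_dict = data.model_dump()               # plain dict, handy for pandas/CSV export
```

`model_dump()` is what makes the CSV output format straightforward: a list of such dicts can be fed directly to `pandas.DataFrame`.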
## Documentation

The documentation is built with Material for MkDocs and hosted on GitHub Pages.
## License

YCombinator-Scraper is distributed under the terms of the MIT license.