Hackernews Scraper
Hackernews-Scraping
Business Requirements:
- Scrape TheHackernews.com and store the result (Description, Image, Title, Url) in MongoDB
- Maintain two relations: one with the URL and title of each blog post, and another with the URL and its metadata (Description, Image, Title, Author)
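The two relations can be pictured as two MongoDB collections that share the post URL as a key. The sketch below is a minimal illustration under stated assumptions, not the notebook's actual code: the function names `split_post` and `store_posts` and the lowercase field names are hypothetical, while the database name `hackernews` and the collection names `url-title` and `url-others` come from the shell commands in the run instructions.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical field names for one scraped post; the notebook may use others.
META_FIELDS = ("description", "image", "title", "author")


def split_post(post: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split one scraped post into a url-title and a url-metadata document."""
    url_title = {"url": post["url"], "title": post["title"]}
    url_others = {"url": post["url"]}
    for field in META_FIELDS:
        # Missing metadata is stored as an empty string rather than dropped.
        url_others[field] = post.get(field, "")
    return url_title, url_others


def store_posts(posts: Iterable[Dict[str, str]], mongo_uri: str) -> int:
    """Insert the split documents into MongoDB and return the post count."""
    from pymongo import MongoClient  # imported lazily; requires pymongo

    pairs = [split_post(post) for post in posts]
    if not pairs:
        return 0  # insert_many() rejects an empty list
    db = MongoClient(mongo_uri)["hackernews"]
    db["url-title"].insert_many([title_doc for title_doc, _ in pairs])
    db["url-others"].insert_many([meta_doc for _, meta_doc in pairs])
    return len(pairs)
```

Keeping the split in its own function means the document shapes can be checked without a running database; only `store_posts` needs a reachable MongoDB instance.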
Requirements:
- python3
- pip
- python libraries:
  - requests
  - BeautifulSoup4
  - pymongo
  - jupyterlab
  - notebook
- MongoDB
- git
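To confirm that the Python libraries listed above are available in the active environment, a small check such as the following may help. It is only a sketch: the helper name `missing_modules` is hypothetical, and the import names are assumptions based on how these packages are normally imported (BeautifulSoup4 is imported as `bs4`).

```python
from importlib.util import find_spec
from typing import Iterable, List

# Assumed import names for the libraries in the requirements list.
REQUIRED_MODULES = ("requests", "bs4", "pymongo", "jupyterlab", "notebook")


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the module names that cannot be found in this environment."""
    # find_spec() locates a top-level module without importing it.
    return [name for name in names if find_spec(name) is None]
```

Calling `missing_modules()` with no arguments checks the default list; an empty result suggests the requirements are installed.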
To run the application on your local machine:
Clone the repository:
- Type the following in your terminal:
  git clone https://github.com/pushp1997/Hackernews-Scraping.git
- Change the directory into the repository:
  cd ./Hackernews-Scraping
- Create a Python virtual environment:
  python3 -m venv ./scrapeVenv
- Activate the virtual environment created:
  - On Linux / macOS:
    source ./scrapeVenv/bin/activate
  - On Windows (cmd):
    scrapeVenv\Scripts\activate.bat
  - On Windows (PowerShell):
    .\scrapeVenv\Scripts\Activate.ps1
- Install the Python requirements:
  pip install -r requirements.txt
- Open the notebook using Jupyter Notebook:
  jupyter notebook "Hackernews Scraper.ipynb"
- Run the notebook. You will be asked for the number of pages to scrape for posts and for the MongoDB database URI where the post data will be stored.
- Open a MongoDB shell connected to the same URI you provided to the notebook.
- Switch to the database:
  use hackernews
- Print the documents in the 'url-title' collection:
  db["url-title"].find().pretty()
- Print the documents in the 'url-others' collection:
  db["url-others"].find().pretty()
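The same inspection can also be done from Python instead of the MongoDB shell. The following is a rough sketch rather than part of the project: the helper names `pretty` and `dump_collection` are hypothetical, and it assumes pymongo is installed and that the URI points at the instance the notebook wrote to.

```python
import json
from typing import Any, Dict, Iterable, List


def pretty(docs: Iterable[Dict[str, Any]]) -> List[str]:
    """Render documents as indented JSON, roughly like .pretty() in the shell."""
    # default=str turns values json cannot encode (such as ObjectId) into text.
    return [json.dumps(doc, indent=2, default=str) for doc in docs]


def dump_collection(mongo_uri: str, name: str) -> None:
    """Print every document in one collection of the hackernews database."""
    from pymongo import MongoClient  # imported lazily; requires pymongo

    collection = MongoClient(mongo_uri)["hackernews"][name]
    for text in pretty(collection.find()):
        print(text)
```

For example, `dump_collection("mongodb://localhost:27017", "url-title")` would print the url-title documents, where the localhost URI is only a placeholder for whatever URI was given to the notebook.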
Hashes for hnscraper-0.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 4914d31b9f98a284b0066004c0bcb11a6c230ce534c9b4d4400ef5a514b1ab73
MD5 | 9a231e3e359a59039d4eba4940819e4a
BLAKE2b-256 | af6ec3be257fa5fe699344b49e53dc71c2867bb85f4945ec0386b652090257a3