DataScraper: Effortless Dataset Extraction

Dataset Scraper (Scrapset)

Scrapset is a Python module for scraping dataset information from websites such as Kaggle and Data.gov. It extracts details such as titles, upvotes (for Kaggle), and recent views (for Data.gov).

With Scrapset you can automate the retrieval of dataset details from these platforms, which is useful for data analysis, research, or building machine learning models. The module uses the Selenium library to interact with the websites and extract the data.

Scrapset also includes scrapers for Indeed job listings, IMDb comments, VesselFinder vessel data, and Google Maps search results, all described below.

KaggleDataSet Class

The KaggleDataSet class enables scraping of dataset information from Kaggle.

Methods

web_driver_chrome(): Initializes and returns a Selenium Chrome WebDriver with customized options for scraping Kaggle datasets.
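The source does not list which options web_driver_chrome() sets. As a rough sketch of what "customized options" could look like with Selenium, the snippet below collects a few commonly used Chrome flags and applies them through ChromeOptions. The function names chrome_arguments and build_chrome_driver, and the choice of flags, are assumptions for illustration and not Scrapset's actual configuration.

```python
from typing import List


def chrome_arguments(headless: bool = True) -> List[str]:
    """Return Chrome command-line flags often used for scraping (illustrative choice)."""
    args = [
        "--no-sandbox",             # often needed inside containers such as Colab
        "--disable-dev-shm-usage",  # avoids crashes when /dev/shm is small
        "--window-size=1920,1080",  # fixed viewport so the page layout is predictable
    ]
    if headless:
        args.append("--headless=new")  # run without a visible browser window
    return args


def build_chrome_driver(headless: bool = True):
    """Create a Chrome WebDriver with the flags above.

    Selenium is imported here, inside the function, so chrome_arguments()
    can be used and inspected even where Selenium is not installed.
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Calling build_chrome_driver() requires a local Chrome or Chromium installation with a matching driver; call driver.quit() on the returned object when scraping is finished.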

data_set_page(url, last_page, initial_page): Scrapes the titles, upvotes, and additional details of datasets from Kaggle. The method takes the url of the Kaggle datasets page, the last_page number to scrape up to, and the initial_page number to start scraping from. It returns a dictionary containing the scraped dataset information.
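To make the roles of initial_page and last_page concrete, here is a minimal sketch of how a page range could be turned into one URL per page. The helper name page_urls, the "page" query parameter, and the inclusive upper bound are all assumptions; the source does not say how Scrapset moves between pages.

```python
from typing import Iterator
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(url: str, last_page: int, initial_page: int = 1) -> Iterator[str]:
    """Yield one URL per page from initial_page up to and including last_page."""
    if initial_page < 1 or last_page < initial_page:
        raise ValueError("expected 1 <= initial_page <= last_page")
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))  # keep any filters already in the URL
    for page in range(initial_page, last_page + 1):
        query["page"] = str(page)  # hypothetical pagination parameter
        yield urlunsplit(parts._replace(query=urlencode(query)))


for page_url in page_urls("https://catalog.data.gov/dataset", last_page=7, initial_page=5):
    print(page_url)
```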

DataDotGov Class

The DataDotGov class facilitates scraping of dataset information from Data.gov.

Methods

web_driver_chrome(): Initializes and returns a Selenium Chrome WebDriver with customized options for scraping Data.gov datasets.

data_set_page(url, last_page, initial_page): Scrapes the titles, recent views, and authors of datasets from Data.gov. The method takes the url of the Data.gov datasets page, the last_page number to scrape up to, and the initial_page number to start scraping from. It returns a dictionary containing the scraped dataset information.

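Example code to extract titles of datasets from Data.gov

import scrapset as m
import pandas as pd

# Scrape the Data.gov catalog from page 5 up to page 10
scraper = m.DataDotGov()
data = scraper.data_set_page('https://catalog.data.gov', last_page=10, initial_page=5)

# Convert the returned dictionary to a DataFrame and save it as CSV
datf = pd.DataFrame(data)
datf.to_csv('datagov.csv', index=False)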
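The examples pass the returned dictionary straight to pd.DataFrame, which suggests a column-oriented mapping from column names to equal-length lists. Under that assumption, the sketch below writes such a dictionary to CSV with only the standard library, which can be handy when pandas is not installed. The keys and values in sample are made up for illustration; the real keys depend on the scraper.

```python
import csv
import io
from typing import Dict, List


def columns_to_csv(data: Dict[str, List[str]]) -> str:
    """Serialize a column-oriented dictionary (name -> list of values) to CSV text."""
    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(data.keys())            # header row from the dictionary keys
    writer.writerows(zip(*data.values()))   # one row per dataset
    return buffer.getvalue()


# Hypothetical sample shaped like the assumed scraper output.
sample = {"title": ["Dataset A", "Dataset B"], "recent_views": ["120", "87"]}
csv_text = columns_to_csv(sample)
print(csv_text)
```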

Example code to extract titles, upvotes, and usability index of datasets from Kaggle

import scrapset as m
import pandas as pd

# Scrape Kaggle datasets from page 5 up to page 10
scraper = m.KaggleDataSet()
data = scraper.data_set_page('https://kaggle.com', last_page=10, initial_page=5)

# Convert the returned dictionary to a DataFrame and save it as CSV
datf = pd.DataFrame(data)
datf.to_csv('kaggle.csv', index=False)

Example code to extract job details from Indeed

The indeed_jobs method returns the job details as a dictionary. It takes three arguments:
- url: the URL of the Indeed site
- last page: the last page you want to scrape
- query: the job you want to search for

import scrapset as m

scraper = m.indeed()
data = scraper.indeed_jobs('https://ie.indeed.com', 40, 'data scientist')
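As an illustration of how the three indeed_jobs arguments could combine, the sketch below builds one search URL per result page. The helper name indeed_search_urls, the /jobs path, the q and start parameters, and the page size of 10 are assumptions about how Indeed search URLs are commonly structured, not a description of what Scrapset does internally.

```python
from typing import List
from urllib.parse import urlencode, urljoin


def indeed_search_urls(base_url: str, last_page: int, query: str,
                       results_per_page: int = 10) -> List[str]:
    """Build one search URL per page, from the first page to last_page."""
    urls = []
    for page in range(last_page):
        params = {"q": query, "start": page * results_per_page}  # assumed parameters
        urls.append(urljoin(base_url, "/jobs") + "?" + urlencode(params))
    return urls


for search_url in indeed_search_urls("https://ie.indeed.com", 2, "data scientist"):
    print(search_url)
```

urlencode takes care of escaping the space in the query, so job titles with several words need no manual quoting.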

IMDb Class

The IMDb class enables scraping of comments from IMDb movie pages.

Methods

web_driver_chrome()

def web_driver_chrome(self) -> webdriver.Chrome:
    """
    Initializes and returns a Selenium Chrome WebDriver with customized options for scraping IMDb comments.

    Returns:
        webdriver.Chrome: The Chrome WebDriver object.
    """

comments(url: str) -> List[str]

def comments(self, url: str) -> List[str]:
    """
    Scrapes comments from an IMDb movie page.

    Args:
        url (str): The URL of the IMDb movie page.

    Returns:
        List[str]: A list containing the scraped comments.
    """

Example Code

The following example shows how to use the IMDb class to scrape comments from an IMDb movie page:

import scrapset as m

df = m.imdb()
data = df.comments('https://www.imdb.com/title/tt0111161/reviews')

Please note that you should replace the URL 'https://www.imdb.com/title/tt0111161/reviews' with the IMDb movie page URL you want to scrape comments from.
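When swapping in another movie, it is easy to paste a URL that points at the title page rather than its reviews page. The sketch below derives a reviews URL of the same form as the example from any IMDb link containing a title ID. The helper name imdb_reviews_url is hypothetical, and whether comments() also accepts plain title URLs is not stated in the source.

```python
import re

# IMDb title IDs look like "tt" followed by digits, as in the example URL.
_TITLE_ID = re.compile(r"/title/(tt\d+)")


def imdb_reviews_url(url: str) -> str:
    """Return the reviews-page URL for the title referenced in url."""
    match = _TITLE_ID.search(url)
    if match is None:
        raise ValueError(f"no IMDb title ID found in {url!r}")
    return f"https://www.imdb.com/title/{match.group(1)}/reviews"


print(imdb_reviews_url("https://www.imdb.com/title/tt0111161/"))
```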

VesselFinder Class

The VesselFinder class facilitates scraping vessel details and locations.

Methods

vessel_details(url: str) -> List[str]

def vessel_details(self, url: str) -> List[str]:
    """
    Scrapes vessel details from VesselFinder.

    Args:
        url (str): The URL of the VesselFinder vessel page.

    Returns:
        List[str]: A list containing the scraped vessel details.
    """

vessel_location(url: str) -> List[str]

def vessel_location(self, url: str) -> List[str]:
    """
    Scrapes vessel locations from VesselFinder.

    Args:
        url (str): The URL of the VesselFinder port page.

    Returns:
        List[str]: A list containing the scraped vessel locations.
    """

Example Code

import scrapset as m

df = m.VesselFinder()

# Scrape vessel details
vessel_details = df.vessel_details('https://www.vesselfinder.com/vessels')

# Scrape vessel locations
vessel_location = df.vessel_location('https://www.vesselfinder.com/ports')

Replace the URLs 'https://www.vesselfinder.com/vessels' and 'https://www.vesselfinder.com/ports' with the specific VesselFinder pages you want to scrape vessel details and locations from.

The following script uses each of the scrapers described above in turn:

import scrapset as m
import pandas as pd

# Scrape Kaggle dataset information
kaggle_df = m.KaggleDataSet()
kaggle_data = kaggle_df.data_set_page('https://kaggle.com', last_page=10, initial_page=5)
kaggle_datf = pd.DataFrame(kaggle_data)
kaggle_datf.to_csv('kaggle.csv', index=False)

# Scrape Data.gov dataset information
datagov_df = m.DataDotGov()
datagov_data = datagov_df.data_set_page('https://catalog.data.gov', last_page=10, initial_page=5)
datagov_datf = pd.DataFrame(datagov_data)
datagov_datf.to_csv('datagov.csv', index=False)

# Scrape job details from Indeed
indeed_df = m.indeed()
indeed_data = indeed_df.indeed_jobs('https://ie.indeed.com', 40, 'data scientist')
indeed_datf = pd.DataFrame(indeed_data)
indeed_datf.to_csv('indeed_jobs.csv', index=False)

# Scrape comments from IMDb movie page
imdb_df = m.imdb()
imdb_data = imdb_df.comments('https://www.imdb.com/title/tt0111161/reviews')

# Scrape vessel details and locations from VesselFinder
vesselfinder_df = m.VesselFinder()
vessel_details = vesselfinder_df.vessel_details('https://www.vesselfinder.com/vessels')
vessel_location = vesselfinder_df.vessel_location('https://www.vesselfinder.com/ports')
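A script that calls several scrapers in sequence stops at the first exception, and browser-driven scraping can fail for many reasons (layout changes, timeouts, blocked requests). One possible way to keep the remaining scrapers running is sketched below. The helper run_scrapers is hypothetical and not part of Scrapset; it only wraps zero-argument callables.

```python
from typing import Any, Callable, Dict, Tuple


def run_scrapers(tasks: Dict[str, Callable[[], Any]]) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run each task, collecting results and error messages separately."""
    results: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for name, task in tasks.items():
        try:
            results[name] = task()
        except Exception as exc:  # record the failure and continue with the next task
            errors[name] = f"{type(exc).__name__}: {exc}"
    return results, errors


# Stand-in tasks; in practice each lambda would wrap a scraper call such as
# lambda: m.imdb().comments('https://www.imdb.com/title/tt0111161/reviews')
results, errors = run_scrapers({
    "ok": lambda: ["comment 1", "comment 2"],
    "broken": lambda: 1 / 0,
})
print(results)
print(errors)
```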

Angel Scraper

Introduction

The Angel class in the scrapset module is designed to scrape data from Google Maps. It provides methods for scrolling down the search results and extracting information about companies and their phone numbers.

Methods

1. `scroll_using_mouse(duration=10, scroll_amount=1)`

This method simulates scrolling down on the webpage using the mouse wheel. It continues scrolling for the specified duration with a specified scroll amount.

Parameters:
- duration (int): The duration (in seconds) for which the scrolling action will continue.
- scroll_amount (int): The number of "clicks" of the scroll wheel to simulate. A positive value scrolls down, and a negative value scrolls up.

2. `Map(query)`

This method initiates a search on Google Maps based on the provided query. It then utilizes the scroll_using_mouse method to scroll down the map and extracts information about companies and their phone numbers.

Parameters:
- query (str): The search query for Google Maps.

Note on Scrolling:
- The scrolling action performed by scroll_using_mouse only works correctly when the mouse cursor is positioned over the map cards on the webpage.
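The duration and scroll_amount parameters describe a time-bounded loop: keep issuing scroll events until the time is up. A minimal sketch of that pattern follows. The function scroll_for and its pause argument are hypothetical, and the actual scroll action is passed in as a callback because the source does not say which mouse-automation library Scrapset uses.

```python
import time
from typing import Callable


def scroll_for(duration: float, scroll_amount: int,
               scroll: Callable[[int], None], pause: float = 0.1) -> int:
    """Call scroll(scroll_amount) repeatedly for duration seconds; return the call count."""
    deadline = time.monotonic() + duration  # monotonic clock is immune to system clock changes
    calls = 0
    while time.monotonic() < deadline:
        scroll(scroll_amount)
        calls += 1
        time.sleep(pause)  # short pause so the page has time to load more cards
    return calls


# Stand-in callback that records the scroll events instead of moving the mouse.
events = []
count = scroll_for(duration=0.05, scroll_amount=1, scroll=events.append, pause=0.01)
print(count, "scroll events issued")
```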

Example Usage

from scrapset import Angel

# Create an instance of the Angel class
angel_instance = Angel()

# Perform a Google Maps search for "example query"
result = angel_instance.Map("example query")

# Print the result
print(result)



Note: Running Scrapset in Google Colab

Run these commands in a Colab cell:

!apt-get update
!apt-get install chromium chromium-driver
!pip install selenium

%%shell

# Add debian buster
cat > /etc/apt/sources.list.d/debian.list <<'EOF'
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
EOF

# Add keys
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A

apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg

# Prefer debian repo for chromium* packages only
# Note the double-blank lines between entries
cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
Package: *
Pin: release a=eoan
Pin-Priority: 500


Package: *
Pin: origin "deb.debian.org"
Pin-Priority: 300


Package: chromium*
Pin: origin "deb.debian.org"
Pin-Priority: 700
EOF
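After running the installation commands, it can help to confirm that the browser and driver are actually on the PATH before starting Selenium. A minimal check using only the standard library is sketched below. The executable names chromium and chromedriver are assumptions based on the Debian package names and may differ on other systems; locate_browser_tools is a hypothetical helper.

```python
import shutil
from typing import Dict, Optional, Sequence


def locate_browser_tools(
    names: Sequence[str] = ("chromium", "chromedriver"),
) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None when it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


found = locate_browser_tools()
missing = [name for name, path in found.items() if path is None]
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Browser and driver found:", found)
```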

Project details

Release history

9.4.9 (Dec 18, 2023)
9.3.9 (Dec 18, 2023)
9.2.9 (Dec 18, 2023)
9.1.9 (Dec 18, 2023)
9.1.8 (Nov 20, 2023)
9.1.6 (Nov 17, 2023), this version
9.1.5 (Nov 17, 2023)
9.1.1 (Nov 17, 2023)
9.0.0 (Sep 21, 2023)
8.0.0 (Aug 8, 2023)
7.5.9 (Aug 8, 2023)
7.5.5 (Aug 8, 2023)
7.0.5 (Aug 7, 2023)
6.9.5 (Aug 7, 2023)
6.3.1 (Aug 7, 2023)
5.3.1 (Jul 30, 2023)
4.3.5 (Jul 13, 2023)
4.3.4 (Jul 13, 2023)
4.3.3 (Jul 13, 2023)
4.3.2 (Jul 13, 2023)
4.3.1 (Jul 13, 2023)
4.2.1 (Jul 12, 2023)
4.1.1 (Jul 12, 2023)
4.0.0 (Jul 12, 2023)
3.4.0 (Jun 25, 2023)
3.1.0 (Jun 24, 2023)
2.4.0 (Jun 23, 2023)
2.2.0 (Jun 23, 2023)
2.1.0 (Jun 23, 2023)
1.4.0 (Jun 22, 2023)
1.3.0 (Jun 6, 2023)
1.2.0 (Jun 6, 2023)
1.1.0 (Jun 5, 2023)
0.1.0 (Jun 5, 2023)

Download files

Source Distribution

Scrapset-9.1.6.tar.gz (9.7 kB), uploaded Nov 17, 2023

Hashes for Scrapset-9.1.6.tar.gz

Algorithm	Hash digest
SHA256	`7a22060aaee74bcaed64307acaf80c3f2e9cc89d10f812026cd84d8e5302e07c`
MD5	`04acfe97c506c132745427aebca3533f`
BLAKE2b-256	`20bdc86e653bbfc930689c0c924f00e75ebb2e79b432226d27972d56a294d49e`
