DataScraper: Effortless Dataset Extraction

Project description

Dataset Scraper (Scrapset)

Scrapset is a Python module for scraping dataset information from websites such as Kaggle and Data.gov. It extracts details such as titles, upvotes (on Kaggle), and recent views (on Data.gov).

With Scrapset you can automate the retrieval of dataset details from these platforms, for example to support data analysis, research, or machine learning work. The module uses the Selenium library to drive a Chrome browser, load the pages, and extract the data.

With Scrapset, you can quickly and easily scrape dataset information, empowering you to work with valuable data from Kaggle, Data.gov, and similar websites.

KaggleDataSet Class

The KaggleDataSet class enables scraping of dataset information from Kaggle.

Methods:

web_driver_chrome(): Initializes and returns a Selenium Chrome WebDriver with customized options for scraping Kaggle datasets.

data_set_page(url, last_page, initial_page): Scrapes the titles, upvotes, and additional details (such as the usability index) of datasets from Kaggle. The method takes the url of the Kaggle datasets page, the initial_page number to start scraping from, and the last_page number to scrape up to. It returns a dictionary containing the scraped dataset information.
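The relationship between initial_page and last_page can be pictured with a small sketch. This is not Scrapset's implementation: the helper name page_urls, the `page` query parameter, and the inclusive upper bound are all assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(url, last_page, initial_page=1):
    """Build one URL per page from initial_page through last_page (inclusive).

    Hypothetical helper: the 'page' query parameter is an assumption,
    not something Scrapset or Kaggle is known to use.
    """
    if initial_page < 1 or last_page < initial_page:
        raise ValueError("expected 1 <= initial_page <= last_page")
    parts = urlsplit(url)
    # Keep query parameters that are already present, except an old 'page' value.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "page"]
    urls = []
    for page in range(initial_page, last_page + 1):
        new_query = urlencode(query + [("page", str(page))])
        urls.append(urlunsplit(parts._replace(query=new_query)))
    return urls
```

Under these assumptions, page_urls('https://example.com/datasets', last_page=10, initial_page=5) yields six URLs, one for each of pages 5 through 10.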

DataDotGov Class

The DataDotGov class facilitates scraping of dataset information from Data.gov.

Methods:

web_driver_chrome(): Initializes and returns a Selenium Chrome WebDriver with customized options for scraping Data.gov datasets.

data_set_page(url, last_page, initial_page): Scrapes the titles, recent views, and authors of datasets from Data.gov. The method takes the url of the Data.gov datasets page, the last_page number to scrape up to, and the initial_page number to start scraping from. It returns a dictionary containing the scraped dataset information.
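The description of web_driver_chrome() does not say which options it sets. As a rough sketch of what a Chrome WebDriver with customized options can look like in Selenium 4, the following uses two hypothetical helpers, chrome_arguments and build_driver; the specific flags are illustrative choices, not the ones Scrapset uses.

```python
def chrome_arguments(headless=True):
    """Return command-line flags for Chrome (illustrative choices only)."""
    args = ["--window-size=1920,1080", "--disable-gpu"]
    if headless:
        # Run Chrome without opening a visible window.
        args.append("--headless=new")
    return args


def build_driver(headless=True):
    """Create a Chrome WebDriver; requires Selenium 4 and a local Chrome install."""
    # Imported here so the flag helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

A driver created this way should be closed with driver.quit() once scraping is finished, so that the browser process does not linger.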

Example code to extract titles of datasets from Data.gov


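import kaggle_datasets as m
import pandas as pd

# Scrape the Data.gov catalog, starting at page 5 and going up to page 10
scraper = m.DataDotGov()
data = scraper.data_set_page('https://catalog.data.gov', last_page=10, initial_page=5)

# Convert the returned dictionary to a DataFrame and save it as CSV
datf = pd.DataFrame(data)
datf.to_csv('datagov.csv', index=False)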
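Scraped counters often arrive as text rather than numbers. The sketch below assumes, without confirmation from Scrapset, that a recent-views value looks like '1,234 recent views'; the helper name parse_count and the sample strings are hypothetical. If the module already returns integers, this step is unnecessary.

```python
import re


def parse_count(text):
    """Extract the first integer from text such as '1,234 recent views'.

    Returns None when the text contains no digits.
    """
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return int(match.group().replace(",", ""))


# Hypothetical sample values, not real Data.gov output.
sample = ["1,234 recent views", "87 recent views", "no data"]
counts = [parse_count(s) for s in sample]
```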

Example code to extract the titles, upvotes, and usability index of datasets from Kaggle


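import kaggle_datasets as m
import pandas as pd

# Scrape Kaggle datasets, starting at page 5 and going up to page 10
scraper = m.KaggleDataSet()
data = scraper.data_set_page('https://kaggle.com', last_page=10, initial_page=5)

# Convert the returned dictionary to a DataFrame and save it as CSV
datf = pd.DataFrame(data)
datf.to_csv('kaggle.csv', index=False)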
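Both examples pass the returned dictionary straight to pd.DataFrame. If that dictionary maps column names to lists (an assumption; the exact structure is not documented here), pandas raises a ValueError when the lists differ in length, which can happen when a field is missing for some datasets. A minimal sketch of a guard, using a hypothetical helper named pad_columns:

```python
def pad_columns(data, fill=None):
    """Return a copy of a dict of lists with every list padded to the longest length."""
    longest = max((len(values) for values in data.values()), default=0)
    return {
        key: list(values) + [fill] * (longest - len(values))
        for key, values in data.items()
    }


# Hypothetical scraped result in which one upvote value is missing.
raw = {"title": ["a", "b", "c"], "upvotes": [10, 4]}
padded = pad_columns(raw)
```

Padding only makes the lengths match; it does not restore which row a missing value belonged to, so a mismatch is better treated as a sign that the scrape needs checking.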

Project details

Release history

Version	Release date
9.4.9	Dec 18, 2023
9.3.9	Dec 18, 2023
9.2.9	Dec 18, 2023
9.1.9	Dec 18, 2023
9.1.8	Nov 20, 2023
9.1.6	Nov 17, 2023
9.1.5	Nov 17, 2023
9.1.1	Nov 17, 2023
9.0.0	Sep 21, 2023
8.0.0	Aug 8, 2023
7.5.9	Aug 8, 2023
7.5.5	Aug 8, 2023
7.0.5	Aug 7, 2023
6.9.5	Aug 7, 2023
6.3.1	Aug 7, 2023
5.3.1	Jul 30, 2023
4.3.5	Jul 13, 2023
4.3.4	Jul 13, 2023
4.3.3	Jul 13, 2023
4.3.2	Jul 13, 2023
4.3.1	Jul 13, 2023
4.2.1	Jul 12, 2023
4.1.1	Jul 12, 2023
4.0.0	Jul 12, 2023
3.4.0	Jun 25, 2023
3.1.0	Jun 24, 2023
2.4.0	Jun 23, 2023
2.2.0	Jun 23, 2023
2.1.0	Jun 23, 2023
1.4.0	Jun 22, 2023
1.3.0	Jun 6, 2023
1.2.0	Jun 6, 2023
1.1.0 (this version)	Jun 5, 2023
0.1.0	Jun 5, 2023

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

Scrapset-1.1.0-py3-none-any.whl (3.7 kB)

Uploaded Jun 5, 2023 Python 3

Hashes for Scrapset-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d4f340efcb2da242e417e59d7fa02db3525ed35f4e77d104ecdb3750f33179fb`
MD5	`0196af3768ee7aa0a3bf00bc4917b7a2`
BLAKE2b-256	`18605dd7c7e5391fac52bd90eb437ce551dfe97fb15dd0e2d6037189cef514fa`