Simple test package

Project description

Scrapeddit

Overview

Scrapeddit is a Python class designed for scraping images from Reddit subreddits and creating PyTorch datasets. It facilitates the collection of image data from various subreddits, allowing for easy integration into machine learning pipelines or data analysis projects.

Key Features

Reddit Scraping: Automatically retrieves image URLs from specified subreddits using the PRAW library.
Flexible Configuration: Users can customize parameters such as subreddit names, post limits, sorting methods, and content safety filters.
Data Transformation: Supports image transformation and resizing to fit specific requirements.
Error Handling: Handles invalid subreddits, restricted subreddits, and failed image fetching gracefully, ensuring smooth data collection.
Data Visualization: Provides visualization tools to understand the distribution of data sources across different subreddits.

Usage

Installation: Install the package with pip. Later steps instantiate the ScrapeditDataset class with a list of subreddit names and optional parameters for customization. Install by:

pip install scrapeddit
pip3 install scrapeddit

Authentication: Register on Reddit and create an app to obtain API credentials for PRAW, then pass those credentials to auth_reddit:

from scrapeddit import authentication

# Fill in the credentials of your Reddit app
authentication.auth_reddit(
    client_id = "",
    client_secret = "",
    username = "",
    password = "",
    redirect_uri = "",
    user_agent = "",
    check_for_async = False
)
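
To keep credentials out of source files, one option is to read them from environment variables before calling auth_reddit. The following is a minimal sketch, not part of the package: the helper load_reddit_credentials and the REDDIT_* variable names are assumptions made for illustration.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
ENV_KEYS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "username": "REDDIT_USERNAME",
    "password": "REDDIT_PASSWORD",
    "redirect_uri": "REDDIT_REDIRECT_URI",
    "user_agent": "REDDIT_USER_AGENT",
}


def load_reddit_credentials(environ=None):
    """Collect the auth_reddit keyword arguments from environment variables."""
    environ = os.environ if environ is None else environ
    # Report every missing variable at once instead of failing on the first.
    missing = [var for var in ENV_KEYS.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {arg: environ[var] for arg, var in ENV_KEYS.items()}


# Usage (assumes scrapeddit is installed and the variables are set):
# from scrapeddit import authentication
# authentication.auth_reddit(**load_reddit_credentials(), check_for_async=False)
```

The returned dictionary uses the same keyword names as the auth_reddit call shown above, so it can be unpacked directly into it.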

Getting Data: Collects information from a single subreddit. Parameters such as limit and show_safe can be set.

from scrapeddit import scrapeonce
scrape_df = scrapeonce.scrape_reddit('spotted', limit = 50)
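
How the package decides which posts are images is not documented here. As a rough illustration of the general idea only, the sketch below filters a list of post URLs down to direct image links by file extension. The function names and the extension list are assumptions, not the package's actual logic.

```python
from urllib.parse import urlparse

# Assumed set of extensions; the package may accept a different set.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def is_image_url(url):
    """Return True if the URL path ends with a known image extension."""
    # Parse first so query strings such as ?width=640 do not hide the extension.
    path = urlparse(url).path.lower()
    return path.endswith(IMAGE_EXTENSIONS)


def keep_image_urls(urls, limit=None):
    """Filter URLs down to direct image links, optionally capped at limit."""
    images = [u for u in urls if is_image_url(u)]
    return images if limit is None else images[:limit]
```

Extension checks miss images served from URLs without a file suffix, so a real scraper may also need to inspect the response content type.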

Data Loading: Access the dataset like any other PyTorch dataset, allowing for seamless integration into machine learning workflows. Highly recommended: use the provided ResizeWithPadding transform.

from scrapeddit import redditdl
from scrapeddit.redditdl import ScrapeditDataset
from scrapeddit.transforms import ResizeWithPadding
import torchvision.transforms as transforms

size = 300

# Pad first, then resize to a fixed square and convert to a tensor
transform_resize = transforms.Compose([
    ResizeWithPadding(size=size),
    transforms.Resize((size, size)),
    transforms.ToTensor()
])

subreddits = ['Pizza', 'burgers']
dataset = ScrapeditDataset(subreddit=subreddits, limit = 200, transform = transform_resize, max_size = 100, show_safe = True)
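
The implementation of ResizeWithPadding is not shown here. The general idea behind padding before resizing is to make the image square first, so the later resize does not distort the aspect ratio. The sketch below shows only the padding arithmetic, under that assumption; square_padding is a hypothetical helper, not a package function.

```python
def square_padding(width, height):
    """Return (left, top, right, bottom) padding that makes the image square."""
    side = max(width, height)
    pad_w = side - width   # total horizontal padding needed
    pad_h = side - height  # total vertical padding needed
    # Split each total as evenly as possible; any odd pixel goes right/bottom.
    left, top = pad_w // 2, pad_h // 2
    return left, top, pad_w - left, pad_h - top


print(square_padding(400, 300))  # (0, 50, 0, 50)
print(square_padding(300, 405))  # (52, 0, 53, 0)
```

The four-value order matches what torchvision.transforms.functional.pad accepts for its padding argument (left, top, right, bottom).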

Calling dataset() displays a bar graph, which is useful for visualizing data imbalance caused by data unavailability. Use torch.utils.data.random_split() to split the dataset into train and test sets.

Data Analysis: Use the provided visualization functions to gain insights into the distribution of data sources and explore the collected dataset.
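
For the train/test split mentioned above, random_split accepts a list of subset lengths that must sum to the dataset size. A minimal sketch of computing those lengths follows. The helper split_lengths and the 80/20 default are assumptions for illustration, and the commented usage assumes the dataset supports len().

```python
def split_lengths(n_items, train_fraction=0.8):
    """Return [n_train, n_test] lengths that always sum to n_items."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    n_train = round(n_items * train_fraction)
    # Derive the test size by subtraction so rounding never loses an item.
    return [n_train, n_items - n_train]


print(split_lengths(200))  # [160, 40]

# Usage with PyTorch (assumes torch is installed and dataset is defined):
# from torch.utils.data import random_split
# train_set, test_set = random_split(dataset, split_lengths(len(dataset)))
```

Because the scraped count can be lower than the requested limit, computing the lengths from len(dataset) is safer than hard-coding them.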

Model Training: Wrap the ScrapeditDataset in PyTorch's DataLoader for efficient batch processing and model training.
Getting models: The package can also fetch well-known models. By default, the non-classifier layers are frozen.

from scrapeddit import models
model1 = models.get_efficient(device = True) # EfficientNet model
model2 = models.get_vision_model('vgg16', device = True) # any model available in torchvision.models

Visualization: Two types of visualization are available.

show_images: Takes a list of image links, fetches each image from its source, and displays them.

from scrapeddit import showit
# image_links is a list of image URLs
showit.show_images(image_links, figsize = (10,10), max_images = 24)

show_batch: Shows a batch of image data from a DataLoader.

from scrapeddit.showit import show_batch

# train_dataloader is a torch.utils.data.DataLoader built from the dataset
sample_batch = next(iter(train_dataloader)) # Getting a batch of data
show_batch(sample_batch = sample_batch, max = 100, figsize = (15,15))

Requirements

Python 3.x
PRAW
pandas
requests
matplotlib
Pillow
torch
torchvision
tqdm

Project details

Release history

0.3.7: May 13, 2024
0.3.6: May 13, 2024
0.3.5: May 13, 2024
0.3.4: May 13, 2024 (this version)
0.3.3: May 11, 2024
0.3.2: May 11, 2024
0.3.1: May 11, 2024
0.3.0: May 11, 2024
0.2.7: May 11, 2024
0.2.6: May 11, 2024
0.2.5: May 11, 2024
0.2.4: May 11, 2024
0.2.3: May 11, 2024
0.2.2: May 11, 2024
0.2.1: May 11, 2024
0.2.0: May 11, 2024
0.1.9: May 11, 2024
0.1.8: May 11, 2024
0.1.7: May 11, 2024
0.1.6: May 11, 2024
0.1.5: May 11, 2024
0.1.4: May 11, 2024
0.1.3: May 11, 2024
0.1.2: May 11, 2024
0.1.1: May 11, 2024
0.1.0: May 11, 2024
0.0.5: May 11, 2024
0.0.4: May 11, 2024
0.0.3: May 11, 2024
0.0.2: May 11, 2024
0.0.1: May 11, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapeddit-0.3.4.tar.gz (8.9 kB)

Uploaded May 13, 2024 Source

Built Distribution

scrapeddit-0.3.4-py3-none-any.whl (9.2 kB)

Uploaded May 13, 2024 Python 3

File details

Details for the file scrapeddit-0.3.4.tar.gz.

File metadata

Download URL: scrapeddit-0.3.4.tar.gz
Upload date: May 13, 2024
Size: 8.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.11.5

File hashes

Hashes for scrapeddit-0.3.4.tar.gz
Algorithm	Hash digest
SHA256	`e1f8a6a050b022dacd2943808ee248b65c523db9cdbf1505b44f49f4b6f867cb`
MD5	`bb59ebbb57f9bf6250534f8cc2200611`
BLAKE2b-256	`5dceafa9bb03726f15c2326aee960d9b8bec4b1886f8492144fe917a863a3dd9`

File details

Details for the file scrapeddit-0.3.4-py3-none-any.whl.

File metadata

Download URL: scrapeddit-0.3.4-py3-none-any.whl
Upload date: May 13, 2024
Size: 9.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.11.5

File hashes

Hashes for scrapeddit-0.3.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e0b2695f2cae23477a42e3e854e732e94ab56afb062267d72aed928678b31eb5`
MD5	`4d5ec5530a1904d99ab4f8f0c31db286`
BLAKE2b-256	`9849a6bc1322e40716b9886eef5a5707e3fa381e3bdb3bd16178b92875a40fa9`
