
researches

Researches is a vanilla¹ Google scraper with minimal requirements.

search("Who invented papers?")

¹ In context, "vanilla" refers to raw, unformatted data and content: researches does not clean it up for you, and it's not guaranteed to be 100% human-readable. However, feeding it to LLMs may help, as most of them use subword tokenizers.

Requirements

  • A decent computer
  • Python ≥ 3.9
  • httpx – HTTP connections.
  • selectolax – The HTML parser.

Usage

Just start searching right away. Don't worry, Gemini won't hurt you (also gemini).

search(
    "US to Japan",  # query
    hl="en",        # language
    ua=None,        # custom user agent or ours
    **kwargs        # kwargs to pass to httpx (optional)
) -> Result

For people who love async, we've also got you covered:

await asearch(
    "US to Japan",  # query
    hl="en",        # language
    ua=None,        # custom user agent or ours
    **kwargs        # kwargs to pass to httpx (optional)
) -> Result
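
The extra `**kwargs` are handed to httpx. As a rough sketch of how that kind of forwarding typically works (illustrative only — this is not researches' actual internals, and the `search_options` helper is hypothetical; `timeout` and `follow_redirects` are standard httpx client options):

```python
from typing import Optional

# Illustrative sketch of **kwargs forwarding (NOT the library's internals):
# options like timeout= or follow_redirects= would be passed straight
# through to httpx by the real search().
def search_options(q: str, hl: str = "en", ua: Optional[str] = None, **kwargs) -> dict:
    headers = {"User-Agent": ua or "researches (default UA)"}
    return {"params": {"q": q, "hl": hl}, "headers": headers, **kwargs}

opts = search_options("US to Japan", timeout=10.0, follow_redirects=True)
```

Anything you'd normally hand to an httpx request (timeouts, proxies, redirect policy) should travel through unchanged.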

So, what does the Result class have to offer? At a glance:

result.snippet?
        .text: str
        .name: str?

result.rich_block?
        .image: str?
        .forecast: PartialWeather[]
                     .weekday: str
                     .temp: str

result.aside?
       .text: str

result.weather?
       .c: str
       .f: str
       .precipitation: str
       .humidity: str
       .wind_metric: str
       .wind_imperial: str
       .description: str
       .forecast: PartialWeatherForReport[]
                   .weekday: str
                   .high_c: str
                   .low_c: str
                   .high_f: str
                   .low_f: str

result.web: Web[]
             .title: str
             .url: str
             .text: str

result.flights: Flight[]
                 .title: str
                 .description: str
                 .duration: str
                 .price: str
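
Rendered as Python, the tree above maps naturally onto dataclasses. A sketch of a subset (field names copied from the tree, `?` fields as Optional; not the library's actual class definitions):

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Sketch of part of the Result tree above. "?" in the docs means the
# field may be absent, rendered here as Optional with a None default.
@dataclass
class Web:
    title: str
    url: str
    text: str

@dataclass
class PartialWeather:
    weekday: str
    temp: str

@dataclass
class RichBlock:
    image: Optional[str] = None
    forecast: List[PartialWeather] = field(default_factory=list)

@dataclass
class Snippet:
    text: str
    name: Optional[str] = None

@dataclass
class Result:
    web: List[Web] = field(default_factory=list)
    snippet: Optional[Snippet] = None
    rich_block: Optional[RichBlock] = None

r = Result(web=[Web("Example", "https://example.com", "An example result.")])
```

A page with only organic links would populate `web` and leave the optional blocks as `None`.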

Background

Data comes in different shapes and sizes, and Google played it extremely well. That includes randomizing CSS class names, which makes the markup nearly impossible to scrape with class selectors.

Google sucks, but it's actually the knowledge base we all need. Say, there are these types of result pages:

  • Links – What made Google, "Google." Or, &udm=14.
  • Rich blocks – Cards that introduce people, places, and more.
  • Weather – Weather forecast.
  • Wikipedia (aside) – Wikipedia text.
  • Flights – Flights.

...and many more. (Contribute!)

Scraper APIs out there are hella expensive, and ain't no way I'm paying or entering their free tier. So, I made my own that's perfect for extracting data with LLMs.
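
researches parses the HTML with selectolax; the trick when class names are randomized is to anchor on structure (tags and attributes) instead. A stdlib-only stand-in, purely for illustration (not the library's parser):

```python
from html.parser import HTMLParser

# Structure-based extraction sketch: collect every link's href and text
# without touching the (randomized) class attributes at all.
class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

p = LinkParser()
p.feed('<div class="xK9z"><a href="https://example.com"><h3>Example</h3></a></div>')
```

The randomized `class="xK9z"` is simply ignored; only the `<a href>` structure matters.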

Project details

Source distribution: researches-0.1.tar.gz (5.8 kB), uploaded via twine/5.1.1 (CPython/3.12.4). Trusted Publishing: no.

Hashes for researches-0.1.tar.gz:

Algorithm   Hash digest
SHA256      c4e4fc09240e7bfda703438cc574dcd443cf503b953dfe8d69cc8f1192a84747
MD5         b450cf947e3b8df3089c7955e6e32b63
BLAKE2b-256 04f23cff177bbc1af395a695bb0c2bd5a535f44c5ebc12fd692ba75bf7ec7f0a
