API for the internet

Project description

1. Installation & Setup

Installing for the First Time

pip install desync_search

Updating to the Latest Version

pip install --upgrade desync_search

Make sure you have a valid Desync Search API key. You can set it as an environment variable (for example, DESYNC_API_KEY) or store it securely in your application’s configuration.


2. Initializing the DesyncClient

Once you have your API key, you can start using the client. By default, the client points to the production environment. If you want to activate “developer mode” (for example, to use a test endpoint), you can set developer_mode=True in the constructor.

Example:

from desync_search import DesyncClient
import os

# Fetch your Desync Search API key from an environment variable
my_api_key = os.getenv("DESYNC_API_KEY")

# Initialize the client (production environment by default)
client = DesyncClient(user_api_key=my_api_key)

# Or, to use the developer mode (test environment)
# client = DesyncClient(user_api_key=my_api_key, developer_mode=True)

3. Understanding the PageData Object

Almost all Desync Search actions return or work with an instance of PageData. This data structure represents a single search result (i.e., a “page record”). It includes fields such as:

  • id: The internal identifier of the record.
  • url: The page’s URL.
  • domain: The domain portion of the URL.
  • timestamp: When the page was collected.
  • bulk_search_id: The ID of the bulk or crawl operation this record was part of, if any.
  • search_type: e.g., "stealth_search" or "test_search".
  • text_content: Extracted text from the page.
  • html_content: Full HTML (if requested).
  • internal_links: List of links on this page pointing to the same domain.
  • external_links: List of links pointing to external domains.
  • latency_ms: How many milliseconds the scraping took (if available).
  • complete: A boolean indicating whether the page’s scraping is fully complete.
  • created_at: The timestamp of record creation in the system.

When you call methods like search(), bulk_search(), or crawl(), the returned values are typically PageData instances. You can inspect those instances to analyze URLs, retrieve text content, check completion status, and more.


Note: In code, working with a PageData object looks like this:

result_page = client.search("https://example.com")
print(result_page.url)
print(result_page.text_content)

Each PageData object can be handled or stored as needed in your application.
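
If you want to persist results, a minimal sketch along these lines (reusing the client from the example above; field names are taken from the list of attributes documented here, and the exact set may differ between versions) collects the documented fields into a plain dictionary for JSON storage:

import json

# Field names assumed from the documented PageData attributes above;
# getattr with a default keeps this safe if a field is missing.
FIELDS = [
    "id", "url", "domain", "timestamp", "bulk_search_id", "search_type",
    "text_content", "internal_links", "external_links", "latency_ms",
    "complete", "created_at",
]

def page_to_dict(page):
    return {name: getattr(page, name, None) for name in FIELDS}

result_page = client.search("https://example.com")
with open("page.json", "w") as f:
    json.dump(page_to_dict(result_page), f, default=str)  # default=str covers timestamps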


4. Performing a Single Search

To perform a simple search on a single URL, you can use the search() method. This executes a stealth search by default (10 credits per URL). If you want a cheaper test operation (1 credit), set search_type="test_search".

Example:

from desync_search import DesyncClient
import os

my_api_key = os.getenv("DESYNC_API_KEY")
client = DesyncClient(my_api_key)

target_url = "https://example.com"
result = client.search(target_url)

# Inspect PageData fields
print("URL:", result.url)
print("Number of internal links:", len(result.internal_links))
print("Number of external links:", len(result.external_links))
print("Text content length:", len(result.text_content))

Available Parameters:

  • url (required): The target URL to scrape.
  • search_type (optional): "stealth_search" (default) or "test_search".
  • scrape_full_html (optional): If True, returns full HTML in html_content. Default is False.
  • remove_link_duplicates (optional): If True, deduplicates discovered links. Default is True.

The method returns a PageData object with fields like url, html_content, text_content, internal_links, etc.
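
For example, a 1-credit test operation, or a stealth search that also returns the full HTML, would look like this (using the parameters documented above and the client from the previous example):

# Cheaper test operation (1 credit)
test_result = client.search("https://example.com", search_type="test_search")

# Stealth search that also returns the page's HTML
full_result = client.search("https://example.com", scrape_full_html=True)
print("HTML length:", len(full_result.html_content or ""))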


5. Crawling an Entire Domain

For a more powerful operation that follows links recursively within a single domain, use the crawl() method. By default, it follows links up to max_depth=2 levels; you can go deeper, but be mindful of the increased credit usage.

Under the hood, crawl():

  1. Performs a single stealth search on the starting URL.
  2. Collects all same-domain links.
  3. Performs a bulk search on those links, waits for them to complete, and gathers new links at each depth level.
  4. Returns a list of PageData objects for all discovered pages.

Example:

from desync_search import DesyncClient
import os

my_api_key = os.getenv("DESYNC_API_KEY")
client = DesyncClient(user_api_key=my_api_key)

# Crawl up to 3 levels deep on the same domain
all_pages = client.crawl(
    start_url="https://www.example.com",
    max_depth=3,
    scrape_full_html=False,     # Set True if you need HTML content
    remove_link_duplicates=True # Avoid repeated links
)

print(f"Discovered {len(all_pages)} unique pages")

for page in all_pages:
    print("URL:", page.url, "| Depth:", getattr(page, "depth", None))

Parameters Explained:

  • start_url (required): The initial URL to begin crawling from.
  • max_depth (optional): Maximum link depth to follow (default 2).
  • scrape_full_html (optional): If True, includes the full HTML for each page (html_content). Default is False.
  • remove_link_duplicates (optional): If True, duplicates are filtered out during the search. Default is True.
  • poll_interval (optional): How many seconds to wait between checks for bulk-search completion at each depth. Default is 2.0.
  • wait_time_per_depth (optional): How many seconds to wait for each depth’s bulk operation to complete. Default is 30.0.
  • completion_fraction (optional): Fraction of links that need to be completed before moving on. Default is 0.975 (97.5%).

crawl() returns a list of PageData objects, each representing a crawled page. You can inspect their .depth, .url, .text_content, etc., or store them for further analysis.
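
As a small follow-up sketch, you could tally how many pages were found at each depth (the .depth attribute is read defensively with getattr, since it may be absent on some records):

from collections import Counter

# Count crawled pages per depth level
depth_counts = Counter(getattr(page, "depth", None) for page in all_pages)
for depth in sorted(d for d in depth_counts if d is not None):
    print(f"Depth {depth}: {depth_counts[depth]} page(s)")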


6. Bulk Search

If you have a list of URLs to process at once, use bulk_search(). Under the hood, this triggers an asynchronous workflow in the Desync system that processes all URLs in parallel. Each URL consumes 10 credits (for stealth search).

6.1 Initiating a Bulk Search

from desync_search import DesyncClient
import os

my_api_key = os.getenv("DESYNC_API_KEY")
client = DesyncClient(my_api_key)

# A list of URLs to search
target_list = [
    "https://example.com",
    "https://another-example.net",
    # ... up to 1000
]

# Initiate the bulk search
bulk_info = client.bulk_search(target_list=target_list, extract_html=False)

print("Message:", bulk_info.get("message"))
print("Bulk Search ID:", bulk_info.get("bulk_search_id"))
print("Total links scheduled:", bulk_info.get("total_links"))
print("Cost charged:", bulk_info.get("cost_charged"))

Returned Fields in bulk_info

  • message: Typically "Bulk search triggered successfully.".
  • bulk_search_id: A unique ID that identifies this bulk operation.
  • total_links: Number of URLs submitted.
  • cost_charged: Total credits consumed (10 credits per URL).
  • execution_arn: (If applicable) ARN from the underlying AWS Step Functions workflow.

Note: By default, extract_html=False excludes the full HTML content from each page to save bandwidth. Set extract_html=True if you need HTML in the results.

6.2 Retrieving Bulk Search Results

After starting a bulk search, you typically want to wait until it completes before pulling the data. There are two main approaches:

(A) Manual Polling and Retrieval

  1. Poll for minimal info using list_available:

    partial_records = client.list_available(bulk_search_id=bulk_info["bulk_search_id"])
    

    Each item in partial_records is a PageData with limited fields (e.g., IDs, domains, timestamps, complete status).

  2. Check how many are complete:

    num_complete = sum(1 for rec in partial_records if rec.complete)
    print(f"{num_complete}/{bulk_info['total_links']} are done.")
    
  3. Pull full data once enough have completed using pull_data:

    full_records = client.pull_data(bulk_search_id=bulk_info["bulk_search_id"])
    

    The resulting list contains PageData objects, including fields like text_content and html_content (if requested).

You can repeat steps 1 & 2 until the desired fraction of pages is complete or until you reach a certain timeout, then proceed with step 3.
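
Put together, a minimal manual polling loop might look like the sketch below. It only uses list_available and pull_data as documented above; the 30-second timeout, 2-second interval, and 97.5% threshold are illustrative values:

import time

bulk_search_id = bulk_info["bulk_search_id"]
total_links = bulk_info["total_links"]

deadline = time.time() + 30.0  # illustrative overall timeout
while time.time() < deadline:
    partial_records = client.list_available(bulk_search_id=bulk_search_id)
    num_complete = sum(1 for rec in partial_records if rec.complete)
    if num_complete >= 0.975 * total_links:  # illustrative completion fraction
        break
    time.sleep(2.0)  # illustrative poll interval

full_records = client.pull_data(bulk_search_id=bulk_search_id)
print(f"Pulled {len(full_records)} records.")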

(B) Automatic Polling with collect_results

For convenience, collect_results automates the polling loop and retrieves full data once a certain fraction of pages is done or a timeout has passed.

bulk_search_id = bulk_info["bulk_search_id"]
all_results = client.collect_results(
    bulk_search_id=bulk_search_id,
    target_links=target_list,
    wait_time=30.0,        # Max wait time in seconds (default 30s)
    poll_interval=2.0,     # Frequency to check completion (default 2s)
    completion_fraction=0.975  # Fraction of links that must be complete (default 0.975)
)

print(f"Retrieved {len(all_results)} pages.")
for record in all_results:
    print("URL:", record.url, "| Complete:", record.complete)

Parameters:

  • bulk_search_id: The ID returned from bulk_search().
  • target_links: The same list passed to bulk_search, used for completion calculations.
  • wait_time: Maximum wait time before pulling data regardless.
  • poll_interval: How often to re-check completion status.
  • completion_fraction: Automatically stops polling once this fraction of pages is completed.

Either method ultimately yields a list of PageData objects with their full content (html_content is included only if you set extract_html=True; otherwise you get the text-based fields).


When to Use Bulk Search

  • Pros: Processes multiple URLs in parallel, making it more efficient than calling search() repeatedly.
  • Cons: Requires you to handle asynchronous polling, either manually or via collect_results.

7. Sample Scripts

7.1 Using an Explicit API Key

If you have your API key in a variable (e.g., my_api_key), you can directly pass it to the DesyncClient constructor:

from desync_search import DesyncClient
import os

my_api_key = os.getenv("DESYNC_API_KEY")  # or any other secure way to retrieve it
client = DesyncClient(user_api_key=my_api_key)

crawl_results = client.crawl(
    start_url="https://www.137ventures.com/",
    max_depth=3
)

7.2 Relying on an Environment Variable Only

DesyncClient can also read the DESYNC_API_KEY environment variable automatically if you omit the user_api_key argument:

from desync_search import DesyncClient

client = DesyncClient()  # Reads DESYNC_API_KEY from your environment

crawl_results = client.crawl(
    start_url="https://www.137ventures.com/",
    max_depth=3
)

Make sure you set the environment variable before running your script. For example, on a Unix-based shell:

export DESYNC_API_KEY="YOUR_API_KEY_HERE"
python your_script.py

On Windows (Command Prompt):

set DESYNC_API_KEY=YOUR_API_KEY_HERE
python your_script.py

Either approach will create a DesyncClient that you can immediately use to perform crawls, searches, or bulk operations.

