Start crawling one or more websites to collect resources.
POST https://api.spider.cloud/crawl
Request body
- **url** (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit": 25, "return_format": "markdown", "url": "https://spider.cloud"}

response = requests.post('https://api.spider.cloud/crawl', headers=headers, json=json_data)
print(response.json())
```
Response
[ { "content": "<html>...", "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
Perform a search and gather a list of websites to start crawling and collect resources.
POST https://api.spider.cloud/search
Request body
- **search** (required, string): The search query you want to search for.
- **search_limit** (number): The maximum number of URLs to fetch or crawl from the search results. Remove the value or set it to `0` to crawl all URLs from the search results.
- **fetch_page_content** (boolean): Fetch all the content of the websites by performing crawls. The default is `true`; if disabled, only the search results are returned instead.
- **country** (string): The country code to use for the search. It's a two-letter country code (e.g. `us` for the United States).
- **location** (string): The location from where you want the search to originate.
- **language** (string): The language to use for the search. It's a two-letter language code (e.g. `en` for English).
- **num** (number): The maximum number of results to return for the search.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **page** (number): The page number for the search results.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {
    "search": "a sports website",
    "search_limit": 3,
    "limit": 25,
    "return_format": "markdown",
}

response = requests.post('https://api.spider.cloud/search', headers=headers, json=json_data)
print(response.json())
```
Response
[ { "content": "<html>...", "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
Start crawling one or more websites to collect the links found.
POST https://api.spider.cloud/links
Request body
- **url** (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit": 25, "return_format": "markdown", "url": "https://spider.cloud"}

response = requests.post('https://api.spider.cloud/links', headers=headers, json=json_data)
print(response.json())
```
Response
[ { "url": "https://spider.cloud", "status": 200, "error": null }, // more content...]
Start taking screenshots of one or more websites to collect images as base64 or binary.
POST https://api.spider.cloud/screenshot
Request body
- **url** (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit": 25, "return_format": "markdown", "url": "https://spider.cloud"}

response = requests.post('https://api.spider.cloud/screenshot', headers=headers, json=json_data)
print(response.json())
```
Response
[ { "content": "base64...", "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
Transform HTML to Markdown or text fast. Each HTML transformation costs 1 credit. You can send up to 10MB of data at once.
POST https://api.spider.cloud/transform
Request body
- **data** (object): A list of HTML data to transform. Each object in the list takes the keys `html` and `url`; the `url` key is optional and only used when readability is enabled.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {
    "return_format": "markdown",
    "data": [
        {
            "html": "<html>\n<head>\n  <title>Example Transform</title>\n  <meta charset=\"utf-8\">\n  <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\">\n  <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n  <style type=\"text/css\">\n  html {\n    background-color: #f0f0f2;\n    margin: 0;\n    padding: 0;\n    font-size: 16px;\n  }\n  </style>\n</head>\n<body>\n<div>\n  <h1>Example Website</h1>\n  <p>This is some example markup to use to test the transform function.</p>\n  <p><a href=\"https://spider.cloud/guides\">Guides</a></p>\n</div>\n</body></html>",
            "url": "https://example.com"
        }
    ]
}

response = requests.post('https://api.spider.cloud/transform', headers=headers, json=json_data)
print(response.json())
```
Response
{ "content": [ "Example DomainExample Domain==========This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.[More information...](https://www.iana.org/domains/example)" ], "error": "", "status": 200 }
Proxy-Mode
Alpha
Spider also offers a proxy front-end to the service. The Spider proxy handles requests just like any standard request, with the option to use high-performance residential proxies at 1TB/s.
**HTTP address**: proxy.spider.cloud:8888
**HTTPS address**: proxy.spider.cloud:8889
**Username**: YOUR-API-KEY
**Password**: PARAMETERS
Request parameters (e.g. `request=Raw&premium_proxy=False`) are passed as the proxy password, as shown in the example below.
Example proxy request
```python
import requests, os

# Proxy configuration
proxies = {
    'http': f"http://{os.getenv('SPIDER_API_KEY')}:request=Raw&premium_proxy=False@proxy.spider.cloud:8888",
    'https': f"https://{os.getenv('SPIDER_API_KEY')}:request=Raw&premium_proxy=False@proxy.spider.cloud:8889"
}

# Function to make a request through the proxy
def get_via_proxy(url):
    try:
        response = requests.get(url, proxies=proxies)
        response.raise_for_status()
        print('Response HTTP Status Code: ', response.status_code)
        print('Response HTTP Response Body: ', response.content)
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return None

# Example usage
if __name__ == "__main__":
    get_via_proxy("https://www.choosealicense.com")
    get_via_proxy("https://www.choosealicense.com/community")
```
Pipelines
Create powerful workflows with our pipeline API endpoints. Use AI to extract contacts from any website, or filter links with prompts, with ease.
Start crawling one or more websites to collect all contacts using AI. A minimum of $25 in credits is required for extraction.
POST https://api.spider.cloud/pipeline/extract-contacts
Request body
- **url** (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit": 25, "return_format": "markdown", "url": "https://spider.cloud"}

response = requests.post('https://api.spider.cloud/pipeline/extract-contacts', headers=headers, json=json_data)
print(response.json())
```
Response
[ { "content": [{ "full_name": "John Doe", "email": "johndoe@gmail.com", "phone": "555-555-555", "title": "Baker" }], "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
Crawl a website and accurately categorize it using AI.
POST https://api.spider.cloud/pipeline/label
Request body
- **url** (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit": 25, "return_format": "markdown", "url": "https://spider.cloud"}

response = requests.post('https://api.spider.cloud/pipeline/label', headers=headers, json=json_data)
print(response.json())
```
Response
[ { "content": ["Government"], "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
Crawl websites found in raw text or markdown.
POST https://api.spider.cloud/pipeline/crawl-text
Request body
- **text** (required, string): The text string to extract URLs from. The maximum size for the text is 10MB.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **url** (string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {
    "text": "Check this link: https://example.com and email to example@email.com",
    "limit": 25,
    "return_format": "markdown",
}

response = requests.post('https://api.spider.cloud/pipeline/crawl-text', headers=headers, json=json_data)
print(response.json())
```
Response
[ { "content": "<html>...", "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
Queries
Query the data that you collect. Add dynamic filters for extracting exactly what is needed.
Get the websites stored.
GET https://api.spider.cloud/data/websites
Request params
- **limit** (string): The maximum number of records to return.
- **page** (number): The page of records to return.
- **domain** (string): Filter the records by a single domain.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/websites?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
{ "data": [ { "id": "2a503c02-f161-444b-b1fa-03a3914667b6", "user_id": "6bd06efa-bb0b-4f1f-a23f-05db0c4b1bfd", "url": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd/example.com/index.html", "domain": "spider.cloud", "created_at": "2024-04-18T15:40:25.667063+00:00", "updated_at": "2024-04-18T15:40:25.667063+00:00", "pathname": "/", "fts": "", "scheme": "https:", "last_checked_at": "2024-05-10T13:39:32.293017+00:00", "screenshot": null } ]}
Get the pages/resources stored.
GET https://api.spider.cloud/data/pages
Request params
- **limit** (string): The maximum number of records to return.
- **page** (number): The page of records to return.
- **domain** (string): Filter the records by a single domain.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/pages?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
{ "data": [ { "id": "733b0d0f-e406-4229-949d-8068ade54752", "user_id": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd", "url": "https://spider.cloud", "domain": "spider.cloud", "created_at": "2024-04-17T01:28:15.016975+00:00", "updated_at": "2024-04-17T01:28:15.016975+00:00", "proxy": true, "headless": true, "crawl_budget": null, "scheme": "https:", "last_checked_at": "2024-04-17T01:28:15.016975+00:00", "full_resources": false, "metadata": true, "gpt_config": null, "smart_mode": false, "fts": "'spider.cloud':1" } ]}
Get the stored metadata for pages/resources.
GET https://api.spider.cloud/data/pages_metadata
Request params
- **limit** (string): The maximum number of records to return.
- **page** (number): The page of records to return.
- **domain** (string): Filter the records by a single domain.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/pages_metadata?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
{ "data": [ { "id": "e27a1995-2abe-4319-acd1-3dd8258f0f49", "user_id": "253524cd-3f94-4ed1-83b3-f7fab134c3ff", "url": "253524cd-3f94-4ed1-83b3-f7fab134c3ff/www.google.com/search?query=spider.cloud.html", "domain": "www.google.com", "resource_type": "html", "title": "spider.cloud - Google Search", "description": "", "file_size": 1253960, "embedding": null, "pathname": "/search", "created_at": "2024-05-18T17:40:16.4808+00:00", "updated_at": "2024-05-18T17:40:16.4808+00:00", "keywords": [ "Fastest Web Crawler spider", "Web scraping" ], "labels": "Search Engine", "extracted_data": null, "fts": "'/search':1" } ]}
Get the pages contacts stored.
GET https://api.spider.cloud/data/contacts
Request params
- **limit** (string): The maximum number of records to return.
- **page** (number): The page of records to return.
- **domain** (string): Filter the records by a single domain.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/contacts?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
{ "data": [ { "full_name": "John Doe", "email": "johndoe@gmail.com", "phone": "555-555-555", "title": "Baker" } ]}
Get the state of the crawl for the domain.
GET https://api.spider.cloud/data/crawl_state
Request params
- **limit** (string): The maximum number of records to return.
- **page** (number): The page of records to return.
- **domain** (string): Filter the records by a single domain.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/crawl_state?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
{ "data": { "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh", "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg", "domain": "spider.cloud", "url": "https://spider.cloud/", "links": 1, "credits_used": 3, "mode": 2, "crawl_duration": 340, "message": null, "request_user_agent": "Spider", "level": "info", "status_code": 0, "created_at": "2024-04-21T01:21:32.886863+00:00", "updated_at": "2024-04-21T01:21:32.886863+00:00" }, "error": ""}
Get the last 24 hours of logs.
GET https://api.spider.cloud/data/crawl_logs
Request params
- **limit** (string): The maximum number of records to return.
- **page** (number): The page of records to return.
- **domain** (string): Filter the records by a single domain.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/crawl_logs?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
{ "data": { "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh", "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg", "domain": "spider.cloud", "url": "https://spider.cloud/", "links": 1, "credits_used": 3, "mode": 2, "crawl_duration": 340, "message": null, "request_user_agent": "Spider", "level": "info", "status_code": 0, "created_at": "2024-04-21T01:21:32.886863+00:00", "updated_at": "2024-04-21T01:21:32.886863+00:00" }, "error": ""}
Get the remaining credits available.
GET https://api.spider.cloud/data/credits
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/credits', headers=headers)
print(response.json())
```
Response
{ "data": { "id": "8d662167-5a5f-41aa-9cb8-0cbb7d536891", "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg", "credits": 53334, "created_at": "2024-04-21T01:21:32.886863+00:00", "updated_at": "2024-04-21T01:21:32.886863+00:00" }}
Get the cron jobs that are set to keep data fresh.
GET https://api.spider.cloud/data/crons
Request params
- **limit** (string): The maximum number of records to return.
- **page** (number): The page of records to return.
- **domain** (string): Filter the records by a single domain.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/crons?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
Get the profile of the user. This returns data such as approved limits and usage for the month.
GET https://api.spider.cloud/data/profiles
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/profiles', headers=headers)
print(response.json())
```
Response
{ "data": [ { "id": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd", "email": "user@gmail.com", "stripe_id": "cus_OYO2rAhSQaYqHT", "is_deleted": null, "proxy": null, "headless": false, "billing_limit": 50, "billing_limit_soft": 120, "approved_usage": 0, "crawl_budget": { "*": 200 }, "usage": null, "has_subscription": false, "depth": null, "full_resources": false, "meta_data": true, "billing_allowed": false, "initial_promo": false } ]}
Get a real user agent to use for crawling.
GET https://api.spider.cloud/data/user_agents
Request params
- **limit** (string): The maximum number of records to return.
- **os** (string): Filter by operating system, e.g. `Android`, `Mac OS`, `Windows`, `Linux`, and more.
- **page** (number): The page of records to return.
- **platform** (string): Filter by browser platform, e.g. `Chrome`, `Edge`, `Safari`, `Firefox`, and more.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/user_agents?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
{ "data": { "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36", "platform": "Chrome", "platform_version": "123.0.0.0", "device": "Macintosh", "os": "Mac OS", "os_version": "10.15.7", "cpu_architecture": "", "mobile": false, "device_type": "desktop" }}
Manage
Configure data to enhance crawl efficiency: create, update, and delete records.
Create or update a website configuration.
POST https://api.spider.cloud/data/websites
Request body
- **url** (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **cron** (string): Set a cron period to run the website crawls automatically. Possible values are `daily`, `weekly`, and `monthly`.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit": 25, "return_format": "markdown", "url": "https://spider.cloud"}

response = requests.post('https://api.spider.cloud/data/websites', headers=headers, json=json_data)
print(response.json())
```
Response
{ "data": null}
Delete a website from your collection. Omit the `url` field in the body to delete all websites.
DELETE https://api.spider.cloud/data/websites
Request body
- **url** (string): The URI resource to delete. This can be a comma-separated list for multiple URLs. Omit to delete all websites.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"url": "https://spider.cloud"}

response = requests.delete('https://api.spider.cloud/data/websites', headers=headers, json=json_data)
print(response.json())
```
Response
{ "data": null}