Spider API Reference (2024)

Table of Contents
  • Crawl
  • Search
  • Links
  • Screenshot
  • Transform
  • Proxy-Mode
  • Pipelines
  • Queries
  • Manage

Start crawling one or more websites to collect resources.

POST https://api.spider.cloud/crawl

Request body

  • url (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
  • request (string): The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
  • limit (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.
  • return_format (string): The format to return the data in. Possible values are markdown, raw, text, and bytes. Use raw to return the default format of the page, such as HTML.
  • proxy_enabled (boolean): Enable premium proxies to prevent detection. Default is false.
  • anti_bot (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is false.
  • tld (boolean): Allow TLDs to be included. Default is false.
  • depth (number): The maximum crawl depth. If 0, no limit is applied.
  • cache (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is true.
  • budget (object): An object mapping paths to counters that limit the number of pages, e.g. {"*":1} to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, { "/docs/colors": 10, "/docs/": 100 } allows at most 100 pages for routes matching /docs/:pathname and only 10 pages for routes matching /docs/colors/:pathname.
  • locale (string): The locale to use for the request, e.g. en-US.
  • cookies (string): Add HTTP cookies to use for the request.
  • stealth (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is false on chrome.
  • headers (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
  • metadata (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to false unless the website is already stored with this configuration enabled.
  • viewport (object): Configure the viewport for Chrome. Defaults to 800x600.
  • encoding (string): The type of encoding to use, such as UTF-8 or SHIFT_JIS.
  • blacklist (array): A set of paths that you do not want to crawl. You can use regex patterns in the list.
  • whitelist (array): A set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
  • subdomains (boolean): Allow subdomains to be included. Default is false.
  • user_agent (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
  • store_data (boolean): Whether storage should be used. If set, this takes precedence over storageless. Defaults to false.
  • gpt_config (object): Use AI to generate actions to perform during the crawl. You can pass an array for the "prompt" to chain steps.
  • fingerprint (boolean): Use an advanced fingerprint for Chrome.
  • storageless (boolean): Prevent storing any type of data for the request, including storage and AI vector embeddings. Defaults to false unless the website is already stored.
  • readability (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
  • chunking_alg (object): Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. For example, "chunking_alg": { "type": "bysentence", "value": 2 } splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
  • respect_robots (boolean): Respect the robots.txt file when crawling. Default is true.
  • query_selector (string): The CSS query selector to use when extracting content from the markup.
  • full_resources (boolean): Crawl and download all the resources for a website.
  • request_timeout (number): The timeout to use for requests, from 5 to 60 seconds. The default is 30 seconds.
  • run_in_background (boolean): Run the request in the background. Useful when storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set.
  • skip_config_checks (boolean): Skip checking the database for website configuration. This may increase performance for requests that use limit=1. The default is false.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit":25,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/crawl', headers=headers, json=json_data)
print(response.json())

Response

[ { "content": "<html>...", "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
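If you need finer control, the same endpoint accepts the options described above. The following is a minimal, illustrative sketch (the values are placeholders, not recommendations) that combines budget, chunking_alg, and readability:

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

# Illustrative values: crawl at most 10 pages under /docs/ and 1 page elsewhere,
# return readability-processed markdown chunked every 2 sentences.
json_data = {
    "url": "https://spider.cloud",
    "return_format": "markdown",
    "readability": True,
    "budget": {"*": 1, "/docs/": 10},
    "chunking_alg": {"type": "bysentence", "value": 2},
}

response = requests.post('https://api.spider.cloud/crawl', headers=headers, json=json_data)
print(response.json())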

Perform a search and gather a list of websites to start crawling and collect resources.

POST https://api.spider.cloud/search

Request body

  • search (required, string): The search query you want to search for.
  • search_limit (number): The maximum number of URLs to fetch or crawl from the search results. Remove the value or set it to 0 to crawl all URLs from the search results.
  • fetch_page_content (boolean): Fetch the content of the websites by performing crawls. The default is true; if disabled, only the search results are returned instead.
  • country (string): The country code to use for the search. It's a two-letter country code (e.g., us for the United States).
  • location (string): The location from where you want the search to originate.
  • language (string): The language to use for the search. It's a two-letter language code (e.g., en for English).
  • num (number): The maximum number of results to return for the search.
  • tld (boolean): Allow TLDs to be included. Default is false.
  • page (number): The page number for the search results.
  • cache (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is true.
  • request (string): The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
  • limit (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.
  • return_format (string): The format to return the data in. Possible values are markdown, raw, text, and bytes. Use raw to return the default format of the page, such as HTML.
  • proxy_enabled (boolean): Enable premium proxies to prevent detection. Default is false.
  • anti_bot (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is false.
  • depth (number): The maximum crawl depth. If 0, no limit is applied.
  • budget (object): An object mapping paths to counters that limit the number of pages, e.g. {"*":1} to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, { "/docs/colors": 10, "/docs/": 100 } allows at most 100 pages for routes matching /docs/:pathname and only 10 pages for routes matching /docs/colors/:pathname.
  • locale (string): The locale to use for the request, e.g. en-US.
  • cookies (string): Add HTTP cookies to use for the request.
  • stealth (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is false on chrome.
  • headers (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
  • viewport (object): Configure the viewport for Chrome. Defaults to 800x600.
  • encoding (string): The type of encoding to use, such as UTF-8 or SHIFT_JIS.
  • blacklist (array): A set of paths that you do not want to crawl. You can use regex patterns in the list.
  • whitelist (array): A set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
  • subdomains (boolean): Allow subdomains to be included. Default is false.
  • user_agent (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
  • store_data (boolean): Whether storage should be used. If set, this takes precedence over storageless. Defaults to false.
  • metadata (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to false unless the website is already stored with this configuration enabled.
  • gpt_config (object): Use AI to generate actions to perform during the crawl. You can pass an array for the "prompt" to chain steps.
  • fingerprint (boolean): Use an advanced fingerprint for Chrome.
  • storageless (boolean): Prevent storing any type of data for the request, including storage and AI vector embeddings. Defaults to false unless the website is already stored.
  • readability (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
  • chunking_alg (object): Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. For example, "chunking_alg": { "type": "bysentence", "value": 2 } splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
  • respect_robots (boolean): Respect the robots.txt file when crawling. Default is true.
  • query_selector (string): The CSS query selector to use when extracting content from the markup.
  • full_resources (boolean): Crawl and download all the resources for a website.
  • request_timeout (number): The timeout to use for requests, from 5 to 60 seconds. The default is 30 seconds.
  • run_in_background (boolean): Run the request in the background. Useful when storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set.
  • skip_config_checks (boolean): Skip checking the database for website configuration. This may increase performance for requests that use limit=1. The default is false.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"search":"a sports website","search_limit":3,"limit":25,"return_format":"markdown"}

response = requests.post('https://api.spider.cloud/search', headers=headers, json=json_data)
print(response.json())

Response

[ { "content": "<html>...", "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
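To localize a search or skip crawling entirely, the localization parameters can be combined with fetch_page_content. A minimal sketch with illustrative values:

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

# Return only the search results (no crawling), localized to Germany in German.
json_data = {
    "search": "a sports website",
    "num": 10,
    "country": "de",
    "language": "de",
    "fetch_page_content": False,
}

response = requests.post('https://api.spider.cloud/search', headers=headers, json=json_data)
print(response.json())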

Start crawling one or more websites to collect the links found.

POST https://api.spider.cloud/links

Request body

  • url (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
  • request (string): The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
  • limit (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.
  • return_format (string): The format to return the data in. Possible values are markdown, raw, text, and bytes. Use raw to return the default format of the page, such as HTML.
  • proxy_enabled (boolean): Enable premium proxies to prevent detection. Default is false.
  • anti_bot (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is false.
  • tld (boolean): Allow TLDs to be included. Default is false.
  • depth (number): The maximum crawl depth. If 0, no limit is applied.
  • cache (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is true.
  • budget (object): An object mapping paths to counters that limit the number of pages, e.g. {"*":1} to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, { "/docs/colors": 10, "/docs/": 100 } allows at most 100 pages for routes matching /docs/:pathname and only 10 pages for routes matching /docs/colors/:pathname.
  • locale (string): The locale to use for the request, e.g. en-US.
  • cookies (string): Add HTTP cookies to use for the request.
  • stealth (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is false on chrome.
  • headers (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
  • metadata (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to false unless the website is already stored with this configuration enabled.
  • viewport (object): Configure the viewport for Chrome. Defaults to 800x600.
  • encoding (string): The type of encoding to use, such as UTF-8 or SHIFT_JIS.
  • blacklist (array): A set of paths that you do not want to crawl. You can use regex patterns in the list.
  • whitelist (array): A set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
  • subdomains (boolean): Allow subdomains to be included. Default is false.
  • user_agent (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
  • store_data (boolean): Whether storage should be used. If set, this takes precedence over storageless. Defaults to false.
  • gpt_config (object): Use AI to generate actions to perform during the crawl. You can pass an array for the "prompt" to chain steps.
  • fingerprint (boolean): Use an advanced fingerprint for Chrome.
  • storageless (boolean): Prevent storing any type of data for the request, including storage and AI vector embeddings. Defaults to false unless the website is already stored.
  • readability (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
  • chunking_alg (object): Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. For example, "chunking_alg": { "type": "bysentence", "value": 2 } splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
  • respect_robots (boolean): Respect the robots.txt file when crawling. Default is true.
  • query_selector (string): The CSS query selector to use when extracting content from the markup.
  • full_resources (boolean): Crawl and download all the resources for a website.
  • request_timeout (number): The timeout to use for requests, from 5 to 60 seconds. The default is 30 seconds.
  • run_in_background (boolean): Run the request in the background. Useful when storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set.
  • skip_config_checks (boolean): Skip checking the database for website configuration. This may increase performance for requests that use limit=1. The default is false.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit":25,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/links', headers=headers, json=json_data)
print(response.json())

Response

[ { "url": "https://spider.cloud", "status": 200, "error": null }, // more content...]
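The whitelist and blacklist parameters accept regex patterns, which is useful when you only care about part of a site. A hedged sketch (the patterns below are examples, not a required format):

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

# Collect links only under /docs/ and skip anything that looks like a login page.
json_data = {
    "url": "https://spider.cloud",
    "limit": 50,
    "whitelist": ["^/docs/"],
    "blacklist": ["/login", "/signin"],
}

response = requests.post('https://api.spider.cloud/links', headers=headers, json=json_data)
print(response.json())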

Start taking screenshots of one or more websites and collect the images as base64 or binary.

POST https://api.spider.cloud/screenshot

Request body

  • url (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
  • request (string): The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
  • limit (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.
  • return_format (string): The format to return the data in. Possible values are markdown, raw, text, and bytes. Use raw to return the default format of the page, such as HTML.
  • proxy_enabled (boolean): Enable premium proxies to prevent detection. Default is false.
  • anti_bot (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is false.
  • tld (boolean): Allow TLDs to be included. Default is false.
  • depth (number): The maximum crawl depth. If 0, no limit is applied.
  • cache (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is true.
  • budget (object): An object mapping paths to counters that limit the number of pages, e.g. {"*":1} to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, { "/docs/colors": 10, "/docs/": 100 } allows at most 100 pages for routes matching /docs/:pathname and only 10 pages for routes matching /docs/colors/:pathname.
  • locale (string): The locale to use for the request, e.g. en-US.
  • cookies (string): Add HTTP cookies to use for the request.
  • stealth (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is false on chrome.
  • headers (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
  • metadata (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to false unless the website is already stored with this configuration enabled.
  • viewport (object): Configure the viewport for Chrome. Defaults to 800x600.
  • encoding (string): The type of encoding to use, such as UTF-8 or SHIFT_JIS.
  • blacklist (array): A set of paths that you do not want to crawl. You can use regex patterns in the list.
  • whitelist (array): A set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
  • subdomains (boolean): Allow subdomains to be included. Default is false.
  • user_agent (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
  • store_data (boolean): Whether storage should be used. If set, this takes precedence over storageless. Defaults to false.
  • gpt_config (object): Use AI to generate actions to perform during the crawl. You can pass an array for the "prompt" to chain steps.
  • fingerprint (boolean): Use an advanced fingerprint for Chrome.
  • storageless (boolean): Prevent storing any type of data for the request, including storage and AI vector embeddings. Defaults to false unless the website is already stored.
  • readability (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
  • chunking_alg (object): Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. For example, "chunking_alg": { "type": "bysentence", "value": 2 } splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
  • respect_robots (boolean): Respect the robots.txt file when crawling. Default is true.
  • query_selector (string): The CSS query selector to use when extracting content from the markup.
  • full_resources (boolean): Crawl and download all the resources for a website.
  • request_timeout (number): The timeout to use for requests, from 5 to 60 seconds. The default is 30 seconds.
  • run_in_background (boolean): Run the request in the background. Useful when storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set.
  • skip_config_checks (boolean): Skip checking the database for website configuration. This may increase performance for requests that use limit=1. The default is false.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit":25,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/screenshot', headers=headers, json=json_data)
print(response.json())

Response

[ { "content": "base64...", "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
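Assuming the content field holds a base64-encoded image, as in the response above, a minimal sketch for saving each screenshot to disk:

import base64
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit": 1, "url": "https://spider.cloud"}
response = requests.post('https://api.spider.cloud/screenshot', headers=headers, json=json_data)

for i, page in enumerate(response.json()):
    if page.get("content"):
        # Decode the base64 payload and write it out; the .png extension is an assumption.
        with open(f"screenshot_{i}.png", "wb") as f:
            f.write(base64.b64decode(page["content"]))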

Transform HTML to Markdown or text fast. Each HTML transformation costs 1 credit. You can send up to 10MB of data at once.

POST https://api.spider.cloud/transform

Request body

  • data (object): A list of HTML data to transform. Each object in the list takes the keys html and url. The url key is optional and only used when readability is enabled.
  • return_format (string): The format to return the data in. Possible values are markdown, raw, text, and bytes. Use raw to return the default format of the page, such as HTML.
  • readability (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"return_format":"markdown","data":[{"html":"<html>\n<head>\n <title>Example Transform</title>\n <meta charset=\"utf-8\">\n <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\">\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n <style type=\"text/css\">\n html {\n background-color: #f0f0f2;\n margin: 0;\n padding: 0;\n font-size: 16px;\n }\n </style> \n</head>\n<body>\n<div>\n <h1>Example Website</h1>\n <p>This is some example markup to use to test the transform function.</p>\n <p><a href=\"https://spider.cloud/guides\">Guides</a></p>\n</div>\n</body></html>","url":"https://example.com"}]}

response = requests.post('https://api.spider.cloud/transform', headers=headers, json=json_data)
print(response.json())

Response

{ "content": [ "Example DomainExample Domain==========This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.[More information...](https://www.iana.org/domains/example)" ], "error": "", "status": 200 }

Proxy-Mode

Alpha

Spider also offers a proxy front-end to the service. The Spider proxy handles requests just like any standard proxy, with the option to use high-performance residential proxies at up to 1TB/s.

  • **HTTP address**: proxy.spider.cloud:8888
  • **HTTPS address**: proxy.spider.cloud:8889
  • **Username**: YOUR-API-KEY
  • **Password**: PARAMETERS

Example proxy request

import requests, os

# Proxy configuration
proxies = {
    'http': f"http://{os.getenv('SPIDER_API_KEY')}:request=Raw&premium_proxy=False@proxy.spider.cloud:8888",
    'https': f"https://{os.getenv('SPIDER_API_KEY')}:request=Raw&premium_proxy=False@proxy.spider.cloud:8889"
}

# Function to make a request through the proxy
def get_via_proxy(url):
    try:
        response = requests.get(url, proxies=proxies)
        response.raise_for_status()
        print('Response HTTP Status Code: ', response.status_code)
        print('Response HTTP Response Body: ', response.content)
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return None

# Example usage
if __name__ == "__main__":
    get_via_proxy("https://www.choosealicense.com")
    get_via_proxy("https://www.choosealicense.com/community")

Pipelines

Create powerful workflows with our pipeline API endpoints. Use AI to extract contacts from any website or to filter links with prompts.

Start crawling one or more websites to collect all contacts using AI. A minimum of $25 in credits is necessary for extraction.

POST https://api.spider.cloud/pipeline/extract-contacts

Request body

  • url (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
  • request (string): The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
  • limit (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.
  • return_format (string): The format to return the data in. Possible values are markdown, raw, text, and bytes. Use raw to return the default format of the page, such as HTML.
  • proxy_enabled (boolean): Enable premium proxies to prevent detection. Default is false.
  • anti_bot (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is false.
  • tld (boolean): Allow TLDs to be included. Default is false.
  • depth (number): The maximum crawl depth. If 0, no limit is applied.
  • cache (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is true.
  • budget (object): An object mapping paths to counters that limit the number of pages, e.g. {"*":1} to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, { "/docs/colors": 10, "/docs/": 100 } allows at most 100 pages for routes matching /docs/:pathname and only 10 pages for routes matching /docs/colors/:pathname.
  • locale (string): The locale to use for the request, e.g. en-US.
  • cookies (string): Add HTTP cookies to use for the request.
  • stealth (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is false on chrome.
  • headers (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
  • metadata (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to false unless the website is already stored with this configuration enabled.
  • viewport (object): Configure the viewport for Chrome. Defaults to 800x600.
  • encoding (string): The type of encoding to use, such as UTF-8 or SHIFT_JIS.
  • blacklist (array): A set of paths that you do not want to crawl. You can use regex patterns in the list.
  • whitelist (array): A set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
  • subdomains (boolean): Allow subdomains to be included. Default is false.
  • user_agent (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
  • store_data (boolean): Whether storage should be used. If set, this takes precedence over storageless. Defaults to false.
  • gpt_config (object): Use AI to generate actions to perform during the crawl. You can pass an array for the "prompt" to chain steps.
  • fingerprint (boolean): Use an advanced fingerprint for Chrome.
  • storageless (boolean): Prevent storing any type of data for the request, including storage and AI vector embeddings. Defaults to false unless the website is already stored.
  • readability (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
  • chunking_alg (object): Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. For example, "chunking_alg": { "type": "bysentence", "value": 2 } splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
  • respect_robots (boolean): Respect the robots.txt file when crawling. Default is true.
  • query_selector (string): The CSS query selector to use when extracting content from the markup.
  • full_resources (boolean): Crawl and download all the resources for a website.
  • request_timeout (number): The timeout to use for requests, from 5 to 60 seconds. The default is 30 seconds.
  • run_in_background (boolean): Run the request in the background. Useful when storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set.
  • skip_config_checks (boolean): Skip checking the database for website configuration. This may increase performance for requests that use limit=1. The default is false.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":25,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/pipeline/extract-contacts', headers=headers, json=json_data)
print(response.json())

Response

[ { "content": [{ "full_name": "John Doe", "email": "johndoe@gmail.com", "phone": "555-555-555", "title": "Baker" }], "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
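Since each result carries a list of contacts, a small post-processing step can flatten them into rows, for example a CSV. A sketch assuming the response shape shown above:

import csv
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

json_data = {"limit": 25, "url": "https://spider.cloud"}
response = requests.post('https://api.spider.cloud/pipeline/extract-contacts', headers=headers, json=json_data)

with open("contacts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "full_name", "email", "phone", "title"])
    writer.writeheader()
    for page in response.json():
        for contact in page.get("content") or []:
            # Only keep the documented fields; missing keys default to empty strings.
            writer.writerow({
                "url": page.get("url", ""),
                "full_name": contact.get("full_name", ""),
                "email": contact.get("email", ""),
                "phone": contact.get("phone", ""),
                "title": contact.get("title", ""),
            })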

Crawl a website and accurately categorize it using AI.

POST https://api.spider.cloud/pipeline/label

Request body

  • url (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
  • request (string): The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
  • limit (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.
  • return_format (string): The format to return the data in. Possible values are markdown, raw, text, and bytes. Use raw to return the default format of the page, such as HTML.
  • proxy_enabled (boolean): Enable premium proxies to prevent detection. Default is false.
  • anti_bot (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is false.
  • tld (boolean): Allow TLDs to be included. Default is false.
  • depth (number): The maximum crawl depth. If 0, no limit is applied.
  • cache (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is true.
  • budget (object): An object mapping paths to counters that limit the number of pages, e.g. {"*":1} to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, { "/docs/colors": 10, "/docs/": 100 } allows at most 100 pages for routes matching /docs/:pathname and only 10 pages for routes matching /docs/colors/:pathname.
  • locale (string): The locale to use for the request, e.g. en-US.
  • cookies (string): Add HTTP cookies to use for the request.
  • stealth (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is false on chrome.
  • headers (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
  • metadata (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to false unless the website is already stored with this configuration enabled.
  • viewport (object): Configure the viewport for Chrome. Defaults to 800x600.
  • encoding (string): The type of encoding to use, such as UTF-8 or SHIFT_JIS.
  • blacklist (array): A set of paths that you do not want to crawl. You can use regex patterns in the list.
  • whitelist (array): A set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
  • subdomains (boolean): Allow subdomains to be included. Default is false.
  • user_agent (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
  • store_data (boolean): Whether storage should be used. If set, this takes precedence over storageless. Defaults to false.
  • gpt_config (object): Use AI to generate actions to perform during the crawl. You can pass an array for the "prompt" to chain steps.
  • fingerprint (boolean): Use an advanced fingerprint for Chrome.
  • storageless (boolean): Prevent storing any type of data for the request, including storage and AI vector embeddings. Defaults to false unless the website is already stored.
  • readability (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
  • chunking_alg (object): Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. For example, "chunking_alg": { "type": "bysentence", "value": 2 } splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
  • respect_robots (boolean): Respect the robots.txt file when crawling. Default is true.
  • query_selector (string): The CSS query selector to use when extracting content from the markup.
  • full_resources (boolean): Crawl and download all the resources for a website.
  • request_timeout (number): The timeout to use for requests, from 5 to 60 seconds. The default is 30 seconds.
  • run_in_background (boolean): Run the request in the background. Useful when storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set.
  • skip_config_checks (boolean): Skip checking the database for website configuration. This may increase performance for requests that use limit=1. The default is false.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":25,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/pipeline/label', headers=headers, json=json_data)
print(response.json())

Response

[ { "content": ["Government"], "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]

Crawl the websites found in raw text or markdown.

POST https://api.spider.cloud/pipeline/crawl-text

Request body

  • text (required, string): The text string to extract URLs from. The maximum size for the text is 10MB.
  • request (string): The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
  • limit (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.
  • return_format (string): The format to return the data in. Possible values are markdown, raw, text, and bytes. Use raw to return the default format of the page, such as HTML.
  • proxy_enabled (boolean): Enable premium proxies to prevent detection. Default is false.
  • anti_bot (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is false.
  • tld (boolean): Allow TLDs to be included. Default is false.
  • url (string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
  • depth (number): The maximum crawl depth. If 0, no limit is applied.
  • cache (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is true.
  • budget (object): An object mapping paths to counters that limit the number of pages, e.g. {"*":1} to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, { "/docs/colors": 10, "/docs/": 100 } allows at most 100 pages for routes matching /docs/:pathname and only 10 pages for routes matching /docs/colors/:pathname.
  • locale (string): The locale to use for the request, e.g. en-US.
  • cookies (string): Add HTTP cookies to use for the request.
  • stealth (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is false on chrome.
  • headers (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
  • metadata (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to false unless the website is already stored with this configuration enabled.
  • viewport (object): Configure the viewport for Chrome. Defaults to 800x600.
  • encoding (string): The type of encoding to use, such as UTF-8 or SHIFT_JIS.
  • blacklist (array): A set of paths that you do not want to crawl. You can use regex patterns in the list.
  • whitelist (array): A set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
  • subdomains (boolean): Allow subdomains to be included. Default is false.
  • user_agent (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
  • store_data (boolean): Whether storage should be used. If set, this takes precedence over storageless. Defaults to false.
  • gpt_config (object): Use AI to generate actions to perform during the crawl. You can pass an array for the "prompt" to chain steps.
  • fingerprint (boolean): Use an advanced fingerprint for Chrome.
  • storageless (boolean): Prevent storing any type of data for the request, including storage and AI vector embeddings. Defaults to false unless the website is already stored.
  • readability (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
  • chunking_alg (object): Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. For example, "chunking_alg": { "type": "bysentence", "value": 2 } splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
  • respect_robots (boolean): Respect the robots.txt file when crawling. Default is true.
  • query_selector (string): The CSS query selector to use when extracting content from the markup.
  • full_resources (boolean): Crawl and download all the resources for a website.
  • request_timeout (number): The timeout to use for requests, from 5 to 60 seconds. The default is 30 seconds.
  • run_in_background (boolean): Run the request in the background. Useful when storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set.
  • skip_config_checks (boolean): Skip checking the database for website configuration. This may increase performance for requests that use limit=1. The default is false.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

json_data = {"text":"Check this link: https://example.com and email to example@email.com","limit":25,"return_format":"markdown"}

response = requests.post('https://api.spider.cloud/pipeline/crawl-text', headers=headers, json=json_data)
print(response.json())

Response

[ { "content": "<html>...", "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]

Queries

Query the data that you collect. Add dynamic filters for extracting exactly what is needed.

Get the websites stored.

GET https://api.spider.cloud/data/websites

Request params

  • limit (string): The limit of records to get.
  • page (number): The current page to get.
  • domain (string): Filter a single domain record.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/websites?limit=25&return_format=markdown', headers=headers)
print(response.json())

Response

{
  "data": [
    {
      "id": "2a503c02-f161-444b-b1fa-03a3914667b6",
      "user_id": "6bd06efa-bb0b-4f1f-a23f-05db0c4b1bfd",
      "url": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd/example.com/index.html",
      "domain": "spider.cloud",
      "created_at": "2024-04-18T15:40:25.667063+00:00",
      "updated_at": "2024-04-18T15:40:25.667063+00:00",
      "pathname": "/",
      "fts": "",
      "scheme": "https:",
      "last_checked_at": "2024-05-10T13:39:32.293017+00:00",
      "screenshot": null
    }
  ]
}
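The limit and page parameters can be combined to page through larger collections. A minimal sketch that stops when a page comes back short (assuming the { "data": [...] } shape shown above):

import requests, os

headers = {'Authorization': os.environ["SPIDER_API_KEY"]}

page = 0
websites = []
while True:
    params = {"limit": 100, "page": page}
    resp = requests.get('https://api.spider.cloud/data/websites', headers=headers, params=params)
    data = resp.json().get("data") or []
    websites.extend(data)
    # Fewer than a full page means we have reached the end.
    if len(data) < 100:
        break
    page += 1

print(f"fetched {len(websites)} websites")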

Get the pages/resources stored.

GET https://api.spider.cloud/data/pages

Request params

  • limit (string): The limit of records to get.
  • page (number): The current page to get.
  • domain (string): Filter a single domain record.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/pages?limit=25&return_format=markdown', headers=headers)
print(response.json())

Response

{
  "data": [
    {
      "id": "733b0d0f-e406-4229-949d-8068ade54752",
      "user_id": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd",
      "url": "https://spider.cloud",
      "domain": "spider.cloud",
      "created_at": "2024-04-17T01:28:15.016975+00:00",
      "updated_at": "2024-04-17T01:28:15.016975+00:00",
      "proxy": true,
      "headless": true,
      "crawl_budget": null,
      "scheme": "https:",
      "last_checked_at": "2024-04-17T01:28:15.016975+00:00",
      "full_resources": false,
      "metadata": true,
      "gpt_config": null,
      "smart_mode": false,
      "fts": "'spider.cloud':1"
    }
  ]
}

Get the page metadata/resources stored.

GET https://api.spider.cloud/data/pages_metadata

Request params

  • limit (string): The limit of records to get.
  • page (number): The current page to get.
  • domain (string): Filter a single domain record.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/pages_metadata?limit=25&return_format=markdown', headers=headers)
print(response.json())

Response

{
  "data": [
    {
      "id": "e27a1995-2abe-4319-acd1-3dd8258f0f49",
      "user_id": "253524cd-3f94-4ed1-83b3-f7fab134c3ff",
      "url": "253524cd-3f94-4ed1-83b3-f7fab134c3ff/www.google.com/search?query=spider.cloud.html",
      "domain": "www.google.com",
      "resource_type": "html",
      "title": "spider.cloud - Google Search",
      "description": "",
      "file_size": 1253960,
      "embedding": null,
      "pathname": "/search",
      "created_at": "2024-05-18T17:40:16.4808+00:00",
      "updated_at": "2024-05-18T17:40:16.4808+00:00",
      "keywords": [
        "Fastest Web Crawler spider",
        "Web scraping"
      ],
      "labels": "Search Engine",
      "extracted_data": null,
      "fts": "'/search':1"
    }
  ]
}

Get the page contacts stored.

GET https://api.spider.cloud/data/contacts

Request params

  • limit (string): The limit of records to get.
  • page (number): The current page to get.
  • domain (string): Filter a single domain record.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/contacts?limit=25&return_format=markdown', headers=headers)
print(response.json())

Response

{ "data": [ { "full_name": "John Doe", "email": "johndoe@gmail.com", "phone": "555-555-555", "title": "Baker" } ]}

Get the state of the crawl for the domain.

GET https://api.spider.cloud/data/crawl_state

Request params

  • limit (string): The limit of records to get.
  • page (number): The current page to get.
  • domain (string): Filter a single domain record.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/crawl_state?limit=25&return_format=markdown', headers=headers)
print(response.json())

Response

{
  "data": {
    "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh",
    "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg",
    "domain": "spider.cloud",
    "url": "https://spider.cloud/",
    "links": 1,
    "credits_used": 3,
    "mode": 2,
    "crawl_duration": 340,
    "message": null,
    "request_user_agent": "Spider",
    "level": "info",
    "status_code": 0,
    "created_at": "2024-04-21T01:21:32.886863+00:00",
    "updated_at": "2024-04-21T01:21:32.886863+00:00"
  },
  "error": ""
}

Get the last 24 hours of logs.

GET https://api.spider.cloud/data/crawl_logs

Request params

  • limit (string): The limit of records to get.
  • page (number): The current page to get.
  • domain (string): Filter a single domain record.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/crawl_logs?limit=25&return_format=markdown', headers=headers)
print(response.json())

Response

{
  "data": {
    "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh",
    "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg",
    "domain": "spider.cloud",
    "url": "https://spider.cloud/",
    "links": 1,
    "credits_used": 3,
    "mode": 2,
    "crawl_duration": 340,
    "message": null,
    "request_user_agent": "Spider",
    "level": "info",
    "status_code": 0,
    "created_at": "2024-04-21T01:21:32.886863+00:00",
    "updated_at": "2024-04-21T01:21:32.886863+00:00"
  },
  "error": ""
}

Get the remaining credits available.

GET https://api.spider.cloud/data/credits

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/credits?limit=25&return_format=markdown', headers=headers)
print(response.json())

Response

{ "data": { "id": "8d662167-5a5f-41aa-9cb8-0cbb7d536891", "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg", "credits": 53334, "created_at": "2024-04-21T01:21:32.886863+00:00", "updated_at": "2024-04-21T01:21:32.886863+00:00" }}
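A common pattern is to check the remaining credits before launching a large crawl. A sketch assuming the { "data": { "credits": ... } } shape shown above; the 1,000-credit threshold is arbitrary:

import requests, os

headers = {'Authorization': os.environ["SPIDER_API_KEY"]}

resp = requests.get('https://api.spider.cloud/data/credits', headers=headers)
credits = (resp.json().get("data") or {}).get("credits", 0)

# Pick whatever safety margin your crawl actually needs.
if credits < 1000:
    raise SystemExit(f"Not enough credits ({credits}); top up before crawling.")
print(f"{credits} credits available")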

Get the cron jobs that are set to keep data fresh.

GET https://api.spider.cloud/data/crons

Request params

  • limit (string): The limit of records to get.
  • page (number): The current page to get.
  • domain (string): Filter a single domain record.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/crons?limit=25&return_format=markdown', headers=headers)
print(response.json())

Response

Get the profile of the user. This returns data such as approved limits and usage for the month.

GET https://api.spider.cloud/data/profiles

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/profiles', headers=headers)
print(response.json())

Response

{
  "data": [
    {
      "id": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd",
      "email": "user@gmail.com",
      "stripe_id": "cus_OYO2rAhSQaYqHT",
      "is_deleted": null,
      "proxy": null,
      "headless": false,
      "billing_limit": 50,
      "billing_limit_soft": 120,
      "approved_usage": 0,
      "crawl_budget": { "*": 200 },
      "usage": null,
      "has_subscription": false,
      "depth": null,
      "full_resources": false,
      "meta_data": true,
      "billing_allowed": false,
      "initial_promo": false
    }
  ]
}

Get a real user agent to use for crawling.

GET https://api.spider.cloud/data/user_agents

Request params

  • limit (string): The limit of records to get.
  • os (string): Filter by device OS, e.g. Android, Mac OS, Windows, Linux, and more.
  • page (number): The current page to get.
  • platform (string): Filter by browser platform, e.g. Chrome, Edge, Safari, Firefox, and more.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/user_agents?limit=25&return_format=markdown', headers=headers)
print(response.json())

Response

{
  "data": {
    "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
    "platform": "Chrome",
    "platform_version": "123.0.0.0",
    "device": "Macintosh",
    "os": "Mac OS",
    "os_version": "10.15.7",
    "cpu_architecture": "",
    "mobile": false,
    "device_type": "desktop"
  }
}
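The returned agent string can be fed straight back into a crawl via the user_agent parameter. A minimal sketch assuming the response shape shown above:

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

# Grab a real desktop Chrome agent, then use it for the crawl.
ua = requests.get('https://api.spider.cloud/data/user_agents?platform=Chrome', headers=headers)
agent = ua.json()["data"]["agent"]

json_data = {"url": "https://spider.cloud", "limit": 5, "user_agent": agent}
response = requests.post('https://api.spider.cloud/crawl', headers=headers, json=json_data)
print(response.json())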

Manage

Configure data to enhance crawl efficiency: create, update, and delete records.

Create or update a website configuration.

POST https://api.spider.cloud/data/websites

Request body

  • url (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
  • request (string): The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
  • limit (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.
  • return_format (string): The format to return the data in. Possible values are markdown, raw, text, and bytes. Use raw to return the default format of the page, such as HTML.
  • proxy_enabled (boolean): Enable premium proxies to prevent detection. Default is false.
  • anti_bot (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is false.
  • tld (boolean): Allow TLDs to be included. Default is false.
  • cron (string): Set a cron period to run the website crawls automatically. Possible values are daily, weekly, and monthly.
  • depth (number): The maximum crawl depth. If 0, no limit is applied.
  • cache (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is true.
  • budget (object): An object mapping paths to counters that limit the number of pages, e.g. {"*":1} to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, { "/docs/colors": 10, "/docs/": 100 } allows at most 100 pages for routes matching /docs/:pathname and only 10 pages for routes matching /docs/colors/:pathname.
  • locale (string): The locale to use for the request, e.g. en-US.
  • cookies (string): Add HTTP cookies to use for the request.
  • stealth (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is false on chrome.
  • headers (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
  • metadata (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to false unless the website is already stored with this configuration enabled.
  • viewport (object): Configure the viewport for Chrome. Defaults to 800x600.
  • encoding (string): The type of encoding to use, such as UTF-8 or SHIFT_JIS.
  • blacklist (array): A set of paths that you do not want to crawl. You can use regex patterns in the list.
  • whitelist (array): A set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
  • subdomains (boolean): Allow subdomains to be included. Default is false.
  • user_agent (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
  • store_data (boolean): Whether storage should be used. If set, this takes precedence over storageless. Defaults to false.
  • gpt_config (object): Use AI to generate actions to perform during the crawl. You can pass an array for the "prompt" to chain steps.
  • fingerprint (boolean): Use an advanced fingerprint for Chrome.
  • storageless (boolean): Prevent storing any type of data for the request, including storage and AI vector embeddings. Defaults to false unless the website is already stored.
  • readability (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
  • chunking_alg (object): Use a chunking algorithm to segment your content output. Possible values are ByWords, ByLines, ByCharacterLength, and BySentence. For example, "chunking_alg": { "type": "bysentence", "value": 2 } splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
  • respect_robots (boolean): Respect the robots.txt file when crawling. Default is true.
  • query_selector (string): The CSS query selector to use when extracting content from the markup.
  • full_resources (boolean): Crawl and download all the resources for a website.
  • request_timeout (number): The timeout to use for requests, from 5 to 60 seconds. The default is 30 seconds.
  • run_in_background (boolean): Run the request in the background. Useful when storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set.
  • skip_config_checks (boolean): Skip checking the database for website configuration. This may increase performance for requests that use limit=1. The default is false.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

json_data = {"limit":25,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/data/websites', headers=headers, json=json_data)
print(response.json())

Response

{ "data": null}
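For example, to keep a site's data fresh automatically, the stored configuration can include the cron field described above. A sketch with illustrative values:

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

# Store the website with a weekly re-crawl and a page budget (illustrative values).
json_data = {
    "url": "https://spider.cloud",
    "cron": "weekly",
    "budget": {"*": 200},
    "store_data": True,
}

response = requests.post('https://api.spider.cloud/data/websites', headers=headers, json=json_data)
print(response.json())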

Delete a website from your collection. Remove the url body to delete all websites.

DELETE https://api.spider.cloud/data/websites

Request body

  • url (required, string): The URI resource to delete. This can be a comma-separated list for multiple URLs.

Example request

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/jsonl',
}

json_data = {"url":"https://spider.cloud"}

response = requests.delete('https://api.spider.cloud/data/websites', headers=headers, json=json_data)
print(response.json())

Response

{ "data": null}