Start crawling one or more websites to collect resources.
POST https://api.spider.cloud/crawl
Request body
- **url** (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit": 25, "return_format": "markdown", "url": "https://spider.cloud"}

response = requests.post('https://api.spider.cloud/crawl', headers=headers, json=json_data)
print(response.json())
```
Response
[ { "content": "<html>...", "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
Perform a search and gather a list of websites to start crawling and collect resources.
POST https://api.spider.cloud/search
Request body
- **search** (required, string): The search query you want to search for.
- **search_limit** (number): The maximum number of URLs to fetch or crawl from the search results. Remove the value or set it to `0` to crawl all URLs from the search results.
- **fetch_page_content** (boolean): Fetch all the content of the websites by performing crawls. The default is `true`; if disabled, only the search results are returned instead.
- **country** (string): The country code to use for the search. It's a two-letter country code (e.g. `us` for the United States).
- **location** (string): The location from where you want the search to originate.
- **language** (string): The language to use for the search. It's a two-letter language code (e.g. `en` for English).
- **num** (number): The maximum number of results to return for the search.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **page** (number): The page number for the search results.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {
    "search": "a sports website",
    "search_limit": 3,
    "limit": 25,
    "return_format": "markdown",
}

response = requests.post('https://api.spider.cloud/search', headers=headers, json=json_data)
print(response.json())
```
Response
[ { "content": "<html>...", "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
Start crawling one or more websites to collect the links found.
POST https://api.spider.cloud/links
Request body
- **url** (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit": 25, "return_format": "markdown", "url": "https://spider.cloud"}

response = requests.post('https://api.spider.cloud/links', headers=headers, json=json_data)
print(response.json())
```
Response
[ { "url": "https://spider.cloud", "status": 200, "error": null }, // more content...]
Start taking screenshots of one or more websites to collect images as base64 or binary.
POST https://api.spider.cloud/screenshot
Request body
- **url** (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit": 25, "return_format": "markdown", "url": "https://spider.cloud"}

response = requests.post('https://api.spider.cloud/screenshot', headers=headers, json=json_data)
print(response.json())
```
Response
[ { "content": "base64...", "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
Transform HTML to Markdown or text fast. Each HTML transformation costs 1 credit. You can send up to 10MB of data at once.
POST https://api.spider.cloud/transform
Request body
- **data** (object): A list of HTML data to transform. Each object in the list takes the keys `html` and `url`; the `url` key is optional and only used when readability is enabled.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {
    "return_format": "markdown",
    "data": [
        {
            "html": "<html>\n<head>\n  <title>Example Transform</title>\n  <meta charset=\"utf-8\">\n  <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\">\n  <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n  <style type=\"text/css\">\n  html {\n    background-color: #f0f0f2;\n    margin: 0;\n    padding: 0;\n    font-size: 16px;\n  }\n  </style>\n</head>\n<body>\n<div>\n  <h1>Example Website</h1>\n  <p>This is some example markup to use to test the transform function.</p>\n  <p><a href=\"https://spider.cloud/guides\">Guides</a></p>\n</div>\n</body></html>",
            "url": "https://example.com"
        }
    ]
}

response = requests.post('https://api.spider.cloud/transform', headers=headers, json=json_data)
print(response.json())
```
Response
{ "content": [ "Example DomainExample Domain==========This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.[More information...](https://www.iana.org/domains/example)" ], "error": "", "status": 200 }
Proxy-Mode
Alpha
Spider also offers a proxy front-end to the service. The Spider proxy handles requests just like any standard request, with the option to use high-performance residential proxies at 1TB/s.
**HTTP address**: proxy.spider.cloud:8888
**HTTPS address**: proxy.spider.cloud:8889
**Username**: YOUR-API-KEY
**Password**: PARAMETERS
Request parameters (e.g. `request=Raw&premium_proxy=False`) are passed as the proxy password, as shown in the example below.
Example proxy request
```python
import requests, os

# Proxy configuration
proxies = {
    'http': f"http://{os.getenv('SPIDER_API_KEY')}:request=Raw&premium_proxy=False@proxy.spider.cloud:8888",
    'https': f"https://{os.getenv('SPIDER_API_KEY')}:request=Raw&premium_proxy=False@proxy.spider.cloud:8889"
}

# Function to make a request through the proxy
def get_via_proxy(url):
    try:
        response = requests.get(url, proxies=proxies)
        response.raise_for_status()
        print('Response HTTP Status Code: ', response.status_code)
        print('Response HTTP Response Body: ', response.content)
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return None

# Example usage
if __name__ == "__main__":
    get_via_proxy("https://www.choosealicense.com")
    get_via_proxy("https://www.choosealicense.com/community")
```
Pipelines
Create powerful workflows with our pipeline API endpoints. Use AI to extract contacts from any website, or filter links with prompts, with ease.
Start crawling one or more websites to collect all contacts using AI. A minimum of $25 in credits is required for extraction.
POST https://api.spider.cloud/pipeline/extract-contacts
Request body
- **url** (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit": 25, "return_format": "markdown", "url": "https://spider.cloud"}

response = requests.post('https://api.spider.cloud/pipeline/extract-contacts', headers=headers, json=json_data)
print(response.json())
```
Response
[ { "content": [{ "full_name": "John Doe", "email": "johndoe@gmail.com", "phone": "555-555-555", "title": "Baker" }], "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
Crawl a website and accurately categorize it using AI.
POST https://api.spider.cloud/pipeline/label
Request body
- **url** (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit": 25, "return_format": "markdown", "url": "https://spider.cloud"}

response = requests.post('https://api.spider.cloud/pipeline/label', headers=headers, json=json_data)
print(response.json())
```
Response
[ { "content": ["Government"], "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
Crawl websites found in raw text or markdown.
POST https://api.spider.cloud/pipeline/crawl-text
Request body
- **text** (required, string): The text string to extract URLs from. The maximum size for the text is 10MB.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **url** (string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {
    "text": "Check this link: https://example.com and email to example@email.com",
    "limit": 25,
    "return_format": "markdown",
}

response = requests.post('https://api.spider.cloud/pipeline/crawl-text', headers=headers, json=json_data)
print(response.json())
```
Response
[ { "content": "<html>...", "error": null, "status": 200, "url": "https://spider.cloud" }, // more content...]
Queries
Query the data that you collect. Add dynamic filters for extracting exactly what is needed.
Get the websites stored.
GET https://api.spider.cloud/data/websites
Request params
- **limit** (string): The maximum number of records to return.
- **page** (number): The page of records to return.
- **domain** (string): Filter the records by a single domain.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/websites?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
{ "data": [ { "id": "2a503c02-f161-444b-b1fa-03a3914667b6", "user_id": "6bd06efa-bb0b-4f1f-a23f-05db0c4b1bfd", "url": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd/example.com/index.html", "domain": "spider.cloud", "created_at": "2024-04-18T15:40:25.667063+00:00", "updated_at": "2024-04-18T15:40:25.667063+00:00", "pathname": "/", "fts": "", "scheme": "https:", "last_checked_at": "2024-05-10T13:39:32.293017+00:00", "screenshot": null } ]}
Get the pages/resources stored.
GET https://api.spider.cloud/data/pages
Request params
- **limit** (string): The maximum number of records to return.
- **page** (number): The page of records to return.
- **domain** (string): Filter the records by a single domain.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/pages?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
{ "data": [ { "id": "733b0d0f-e406-4229-949d-8068ade54752", "user_id": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd", "url": "https://spider.cloud", "domain": "spider.cloud", "created_at": "2024-04-17T01:28:15.016975+00:00", "updated_at": "2024-04-17T01:28:15.016975+00:00", "proxy": true, "headless": true, "crawl_budget": null, "scheme": "https:", "last_checked_at": "2024-04-17T01:28:15.016975+00:00", "full_resources": false, "metadata": true, "gpt_config": null, "smart_mode": false, "fts": "'spider.cloud':1" } ]}
Get the stored metadata for pages/resources.
GET https://api.spider.cloud/data/pages_metadata
Request params
- **limit** (string): The maximum number of records to return.
- **page** (number): The page of records to return.
- **domain** (string): Filter the records by a single domain.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/pages_metadata?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
{ "data": [ { "id": "e27a1995-2abe-4319-acd1-3dd8258f0f49", "user_id": "253524cd-3f94-4ed1-83b3-f7fab134c3ff", "url": "253524cd-3f94-4ed1-83b3-f7fab134c3ff/www.google.com/search?query=spider.cloud.html", "domain": "www.google.com", "resource_type": "html", "title": "spider.cloud - Google Search", "description": "", "file_size": 1253960, "embedding": null, "pathname": "/search", "created_at": "2024-05-18T17:40:16.4808+00:00", "updated_at": "2024-05-18T17:40:16.4808+00:00", "keywords": [ "Fastest Web Crawler spider", "Web scraping" ], "labels": "Search Engine", "extracted_data": null, "fts": "'/search':1" } ]}
Get the pages contacts stored.
GET https://api.spider.cloud/data/contacts
Request params
- **limit** (string): The maximum number of records to return.
- **page** (number): The page of records to return.
- **domain** (string): Filter the records by a single domain.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/contacts?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
{ "data": [ { "full_name": "John Doe", "email": "johndoe@gmail.com", "phone": "555-555-555", "title": "Baker" } ]}
Get the state of the crawl for the domain.
GET https://api.spider.cloud/data/crawl_state
Request params
- **limit** (string): The maximum number of records to return.
- **page** (number): The page of records to return.
- **domain** (string): Filter the records by a single domain.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/crawl_state?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
{ "data": { "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh", "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg", "domain": "spider.cloud", "url": "https://spider.cloud/", "links": 1, "credits_used": 3, "mode": 2, "crawl_duration": 340, "message": null, "request_user_agent": "Spider", "level": "info", "status_code": 0, "created_at": "2024-04-21T01:21:32.886863+00:00", "updated_at": "2024-04-21T01:21:32.886863+00:00" }, "error": ""}
Get the last 24 hours of logs.
GET https://api.spider.cloud/data/crawl_logs
Request params
- **limit** (string): The maximum number of records to return.
- **page** (number): The page of records to return.
- **domain** (string): Filter the records by a single domain.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/crawl_logs?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
{ "data": { "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh", "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg", "domain": "spider.cloud", "url": "https://spider.cloud/", "links": 1, "credits_used": 3, "mode": 2, "crawl_duration": 340, "message": null, "request_user_agent": "Spider", "level": "info", "status_code": 0, "created_at": "2024-04-21T01:21:32.886863+00:00", "updated_at": "2024-04-21T01:21:32.886863+00:00" }, "error": ""}
Get the remaining credits available.
GET https://api.spider.cloud/data/credits
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/credits', headers=headers)
print(response.json())
```
Response
{ "data": { "id": "8d662167-5a5f-41aa-9cb8-0cbb7d536891", "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg", "credits": 53334, "created_at": "2024-04-21T01:21:32.886863+00:00", "updated_at": "2024-04-21T01:21:32.886863+00:00" }}
Get the cron jobs that are set to keep data fresh.
GET https://api.spider.cloud/data/crons
Request params
- **limit** (string): The maximum number of records to return.
- **page** (number): The page of records to return.
- **domain** (string): Filter the records by a single domain.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/crons?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
Get the profile of the user. This returns data such as approved limits and usage for the month.
GET https://api.spider.cloud/data/profiles
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/profiles', headers=headers)
print(response.json())
```
Response
{ "data": [ { "id": "6bd06efa-bb0b-4f1f-a29f-05db0c4b1bfd", "email": "user@gmail.com", "stripe_id": "cus_OYO2rAhSQaYqHT", "is_deleted": null, "proxy": null, "headless": false, "billing_limit": 50, "billing_limit_soft": 120, "approved_usage": 0, "crawl_budget": { "*": 200 }, "usage": null, "has_subscription": false, "depth": null, "full_resources": false, "meta_data": true, "billing_allowed": false, "initial_promo": false } ]}
Get a real user agent to use for crawling.
GET https://api.spider.cloud/data/user_agents
Request params
- **limit** (string): The maximum number of records to return.
- **os** (string): Filter by operating system, e.g. `Android`, `Mac OS`, `Windows`, `Linux`, and more.
- **page** (number): The page of records to return.
- **platform** (string): Filter by browser platform, e.g. `Chrome`, `Edge`, `Safari`, `Firefox`, and more.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/user_agents?limit=25&return_format=markdown', headers=headers)
print(response.json())
```
Response
{ "data": { "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36", "platform": "Chrome", "platform_version": "123.0.0.0", "device": "Macintosh", "os": "Mac OS", "os_version": "10.15.7", "cpu_architecture": "", "mobile": false, "device_type": "desktop" }}
Manage
Configure data to enhance crawl efficiency: create, update, and delete records.
Create or update a website configuration.
POST https://api.spider.cloud/data/websites
Request body
- **url** (required, string): The URI resource to crawl. This can be a comma-separated list for multiple URLs.
- **request** (string): The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform an HTTP request by default, switching to JavaScript rendering when the HTML requires it.
- **limit** (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl all pages.
- **return_format** (string): The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `bytes`. Use `raw` to return the default format of the page, e.g. HTML.
- **proxy_enabled** (boolean): Enable premium proxies to prevent detection. Default is `false`.
- **anti_bot** (boolean): Enable anti-bot mode, which uses various techniques to increase the chance of success. Default is `false`.
- **tld** (boolean): Allow TLDs to be included. Default is `false`.
- **cron** (string): Set a cron period to run the website crawls automatically. Possible values are `daily`, `weekly`, and `monthly`.
- **depth** (number): The crawl limit for maximum depth. If `0`, no limit is applied.
- **cache** (boolean): Use HTTP caching for the crawl to speed up repeated runs. Default is `true`.
- **budget** (object): An object mapping paths to counters that limit how many pages are crawled, e.g. `{"*":1}` to crawl only the root page. The wildcard matches all routes, and child paths can cap a depth level; for example, `{ "/docs/colors": 10, "/docs/": 100 }` allows at most 100 pages matching `/docs/:pathname` and only 10 pages matching `/docs/colors/:pathname`.
- **locale** (string): The locale to use for the request, e.g. `en-US`.
- **cookies** (string): Add HTTP cookies to use for the request.
- **stealth** (boolean): Use stealth mode for headless Chrome requests to help prevent being blocked. The default is `false` on Chrome.
- **headers** (object): Forward HTTP headers to use for all requests. The object is expected to be a map of key-value pairs.
- **metadata** (boolean): Store metadata about the pages and content found. This can help improve AI interop. Defaults to `false` unless you already have the website stored with this configuration enabled.
- **viewport** (object): Configure the viewport for Chrome. Defaults to `800x600`.
- **encoding** (string): The type of encoding to use, e.g. `UTF-8` or `SHIFT_JIS`.
- **blacklist** (array): Blacklist a set of paths that you do not want to crawl. You can use regex patterns in the list.
- **whitelist** (array): Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns in the list.
- **subdomains** (boolean): Allow subdomains to be included. Default is `false`.
- **user_agent** (string): Add a custom HTTP user agent to the request. By default this is set to a random agent.
- **store_data** (boolean): Determine whether storage should be used. If set, this takes precedence over `storageless`. Defaults to `false`.
- **gpt_config** (object): Use AI to generate actions to perform during the crawl. You can pass an array for the `"prompt"` to chain steps.
- **fingerprint** (boolean): Use an advanced fingerprint for Chrome.
- **storageless** (boolean): Prevent storing any data for the request, including storage and AI vector embeddings. Defaults to `false` unless you already have the website stored.
- **readability** (boolean): Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage.
- **chunking_alg** (object): Use a chunking algorithm to segment your content output. Possible values are `ByWords`, `ByLines`, `ByCharacterLength`, and `BySentence`. For example, `"chunking_alg": { "type": "bysentence", "value": 2 }` splits the text into an array at every 2 sentences found. This works well with the markdown or text formats.
- **respect_robots** (boolean): Respect the robots.txt file when crawling. Default is `true`.
- **query_selector** (string): The CSS query selector to use when extracting content from the markup.
- **full_resources** (boolean): Crawl and download all the resources for a website.
- **request_timeout** (number): The timeout to use for requests, between `5` and `60` seconds. The default is `30` seconds.
- **run_in_background** (boolean): Run the request in the background. Useful if you are storing data and want to trigger crawls to the dashboard. This has no effect if `storageless` is set.
- **skip_config_checks** (boolean): Skip checking the database for the website configuration. This may increase performance for requests that use `limit=1`. The default is `false`.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit": 25, "return_format": "markdown", "url": "https://spider.cloud"}

response = requests.post('https://api.spider.cloud/data/websites', headers=headers, json=json_data)
print(response.json())
```
Response
{ "data": null}
Delete a website from your collection. Omit the `url` field in the body to delete all websites.
DELETE https://api.spider.cloud/data/websites
Request body
- **url** (string): The URI resource to delete. This can be a comma-separated list for multiple URLs. Omit to delete all websites.
Example request
```python
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"url": "https://spider.cloud"}

response = requests.delete('https://api.spider.cloud/data/websites', headers=headers, json=json_data)
print(response.json())
```
Response
{ "data": null}