Running crawls with cURL
For this guide, we'll assume you want to scrape keywords from a specific list of websites.
1. Open your terminal
To use cURL with the 80legs API, you'll need access to a standard, Linux-based terminal. Open a terminal session to get started.
2. Get your API token
The next thing you'll need is your API token. The API token lets you authenticate with the 80legs API and tells it who you are, what you have access to, and so on. Without it, you can't use the API.
To get your API token, go to the 80legs Web Portal (https://portal.80legs.com), log in, and click on your account name at the top-right. From there, you'll see a link to the "My Account" page, which will take you to a page showing your token. Your API token will be a long string of letters and numbers. Copy the API token and store it somewhere you can easily reference it.
For the rest of this document, we'll use AAAXXXXXXXXXXXX as a stand-in for your actual API token in example API calls.
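The examples below paste this token directly into each command. As an optional shell convenience (our suggestion, not something the API requires), you can store it in an environment variable instead:

# Store the token once per terminal session
export EIGHTYLEGS_TOKEN=AAAXXXXXXXXXXXX

You can then write $EIGHTYLEGS_TOKEN in place of the literal token in any command below, as long as you quote the URL with double quotes so the shell expands the variable.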
3. Upload your URL list
Before we can create our web crawl, we need to create a URL list. A URL list is one or more URLs from which your crawl will start; without one, the crawl won't know where to begin.
Enter the following into your terminal (replace the dummy API token with your real API token):
curl --request PUT --url https://AAAXXXXXXXXXXXX:@api.80legs.com/v2/urllists/urlList1 --header 'content-type: application/octet-stream' --data '["https://www.80legs.com/", "https://www.datafiniti.co/"]'
In this example, we're creating a URL list with just https://www.80legs.com and https://www.datafiniti.co. Any crawl using this URL list will start crawling from these two URLs.
You should get a response similar to this (although it may not look as pretty in your terminal):
{
"created": "urlList1"
}
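For a longer seed list, you may prefer to keep the URLs in a local file and let cURL read it rather than typing the JSON inline. This is standard cURL usage (--data-binary with an @-prefixed filename); urls.json is a hypothetical filename:

# Write the seed URLs to a file, then upload its contents
echo '["https://www.80legs.com/", "https://www.datafiniti.co/"]' > urls.json
curl --request PUT --url https://AAAXXXXXXXXXXXX:@api.80legs.com/v2/urllists/urlList1 --header 'content-type: application/octet-stream' --data-binary @urls.json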
4. Upload your 80app
The next thing we'll need to do is upload an 80app. An 80app is a small piece of code that runs every time your crawler requests a URL and does the work of generating links to crawl and scraping data from the web page.
You can read more about 80apps here. You can also view sample 80app code here. For now, we'll just use the code from the KeywordCollector 80app, since we're interested in scraping keywords for this example. Copy the code and save it to your local system as keywordCollector.js.
Enter the following into your terminal (replace the dummy API token with your real API token and /path/to/keywordCollector.js with the actual path to this file on your local system):
curl --request PUT --url https://AAAXXXXXXXXXXXX:@api.80legs.com/v2/apps/keywordCollector.js --header 'content-type: application/octet-stream' --data-binary @/path/to/keywordCollector.js
You should get a response similar to this (although it may not look as pretty in your terminal):
{
"created": "keywordCollector.js"
}
5. Configure and run your crawl
Now that we've created a URL list and an 80app, we're ready to run our web crawl!
Enter the following into your terminal (replace the dummy API token with your real API token):
curl --request PUT https://AAAXXXXXXXXXXXX:@api.80legs.com/v2/crawls/testCrawl1 --header 'content-type: application/json' --data '{"app": "keywordCollector.js", "urllist": "urlList1", "max_depth": 10, "max_urls": 1000}'
You should get a response similar to this (although it may not look as pretty in your terminal):
{
"id": 123456,
"name": "testCrawl1",
"user": "abcdefghijklmnopqrstuvwxyz012345",
"app": "keywordCollector.js",
"urllist": "urlList1",
"max_depth": 10,
"max_urls": 1000,
"status": "STARTED",
"depth": 0,
"urls_crawled": 0,
"date_created": "2018-7-25 19:0:54",
"date_completed": "",
"date_started": "2018-7-25 19:0:54"
}
Let's break down each of the parameters we sent in our request:
Request Body Parameter | Description
---|---
app | The name of the 80app we're going to use.
urllist | The name of the URL list we're going to use.
max_depth | The maximum depth level for this crawl. Learn more about crawl depth here.
max_urls | The maximum number of URLs this crawl will request.
Now let's walk through the response the API returned:
Response Field | Description
---|---
id | The ID of the crawl. This is a globally unique identifier.
name | The name you gave the crawl.
user | Your API token.
app | The name of the 80app this crawl is using.
urllist | The URL list this crawl is using.
max_depth | The maximum depth level for this crawl.
max_urls | The maximum number of URLs this crawl will request.
status | The current status of the crawl. Check the possible values here.
depth | The current depth level of the crawl.
urls_crawled | The number of URLs crawled so far.
date_created | The date you created this crawl.
date_completed | The date the crawl completed. This will be empty until the crawl completes or is canceled.
date_started | The date the crawl started running. This can be different from date_created when the crawl starts off as queued.
6. Check on crawl status
As mentioned, there is a status field in the response body above. This field shows us the crawl has started, which means it's running. Web crawls typically do not complete instantaneously, since they need to spend time requesting URLs and crawling links. To tell whether the crawl has finished, we can check its status periodically.
Enter the following into your terminal (replace the dummy API token with your real API token):
curl --request GET https://AAAXXXXXXXXXXXX:@api.80legs.com/v2/crawls/testCrawl1
You'll get another crawl object in response, like this:
{
"id": 123456,
"name": "testCrawl1",
"user": "abcdefghijklmnopqrstuvwxyz012345",
"app": "keywordCollector",
"urllist": "urlList1",
"max_depth": 10,
"max_urls": 1000,
"status": "STARTED",
"depth": 1,
"urls_crawled": 198,
"date_created": "2018-7-25 19:0:54",
"date_completed": "",
"date_started": "2018-7-25 19:0:54"
}
If you keep sending this request, you should notice depth and urls_crawled gradually increasing. At some point, status will change to COMPLETED. That's how you know the crawl has finished running.
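Rather than re-running that request by hand, you can poll it from a small shell loop. The sketch below assumes the jq JSON processor is installed; the 60-second interval is an arbitrary choice:

# Poll the crawl status every 60 seconds until it reports COMPLETED
while true; do
  STATUS=$(curl --silent --request GET https://AAAXXXXXXXXXXXX:@api.80legs.com/v2/crawls/testCrawl1 | jq -r '.status')
  echo "Crawl status: $STATUS"
  if [ "$STATUS" = "COMPLETED" ]; then break; fi
  sleep 60
done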
7. Download results
After the crawl finishes, you'll want to download the result files. Result files are logs of all the data scraped during the crawl.
Once you see a status of COMPLETED for your crawl, enter the following into your terminal (replace the dummy API token with your real API token):
curl --request GET https://AAAXXXXXXXXXXXX:@api.80legs.com/v2/results/testCrawl1
You should get a response similar to this (although it may not look as pretty in your terminal):
[
"http://datafiniti-voltron-results.s3.amazonaws.com/abcdefghijklmnopqrstuvwxyz012345/123456_1.txt?AWSAccessKeyId=AKIAIELL2XADVPVJZ4MA&Signature=P5aPspt%2B%2F0Kr8u1nxU%2FHVVrRgOw%3D&Expires=1530820626"
]
Depending on how many URLs you crawl, and how much data you scrape from each URL, you'll see one or more links to result files in your results response. 80legs creates a result file for every 100 MB of data you scrape, which means result files can become available while your crawl is still running.
For very large crawls that take more than 7 days to run, we recommend checking your available results on a weekly basis. Result files will expire 7 days after they are created.
To download the result files, copy each URL from the result response and run a command like:
curl 'http://datafiniti-voltron-results.s3.amazonaws.com/abcdefghijklmnopqrstuvwxyz012345/123456_1.txt?AWSAccessKeyId=AKIAIELL2XADVPVJZ4MA&Signature=P5aPspt%2B%2F0Kr8u1nxU%2FHVVrRgOw%3D&Expires=1530820626' > 123456_1.txt
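If your crawl produced several result files, you can fetch them all in one pass instead of copying each link by hand. A minimal sketch, again assuming jq is installed; the result_N.txt naming is our own convention:

# Save the list of result file URLs, then download each one
curl --silent --request GET https://AAAXXXXXXXXXXXX:@api.80legs.com/v2/results/testCrawl1 | jq -r '.[]' > result_urls.txt
N=1
while read -r URL; do
  curl --silent "$URL" > "result_$N.txt"
  N=$((N + 1))
done < result_urls.txt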
8. Process the results
After you've downloaded the result files, you'll want to process them so you can make use of the data. A result file will have a structure similar to this:
[
{
"url": "https://www.80legs.com",
"result": "...."
},
{
"url": "https://www.datafiniti.co",
"result": "...."
},
...
]
Note that the file is a large JSON object. Specifically, it's an array of objects, where each object consists of a url field and a result field. The result field will contain a string related to the data you've scraped, which, if you remember, is determined by your 80app.
To process these result files, we recommend using a programming language. We have examples of how to do this in our other guides.
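As a quick first look before reaching for a full programming language, jq can split the fields apart in the terminal. This assumes jq is installed and reuses the 123456_1.txt filename from the download example above; it prints each crawled URL followed by the first 80 characters of its scraped result:

# Print each URL and a preview of its scraped result
jq -r '.[] | "\(.url)\t\(.result[:80])"' 123456_1.txt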