Running crawls with cURL

For this guide, we'll assume you want to scrape keywords from a specific list of websites.

1. Open your terminal

To use cURL with the 80legs API, you'll need access to a standard, Linux-based terminal. Open a terminal session to get started.

2. Get your API token

The next thing you'll need is your API token. The API token lets you authenticate with the 80legs API and tells it who you are, what you have access to, and so on. Without it, you can't use the API.

To get your API token, go to the 80legs Web Portal (https://portal.80legs.com), log in, and click your account name at the top-right. From there, you'll see a link to the "My Account" page, which shows your token. Your API token will be a long string of letters and numbers. Copy the API token or store it somewhere you can easily reference.

For the rest of this document, we'll use AAAXXXXXXXXXXXX as a substitute example for your actual API token when showing example API calls.
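One way to avoid retyping the token in every command is to keep it in a shell variable. The following is a minimal sketch; the variable name EIGHTYLEGS_TOKEN and the helper api_url are illustrative names chosen for this example, not part of the 80legs API.

```shell
# Hypothetical variable name; replace the value with your real API token.
EIGHTYLEGS_TOKEN="AAAXXXXXXXXXXXX"

# Build an authenticated API URL for a path such as "crawls/testCrawl1".
# The token sits in the user position with an empty password, matching the
# https://TOKEN:@api.80legs.com/v2/... form used throughout this guide.
api_url() {
  printf 'https://%s:@api.80legs.com/v2/%s\n' "$EIGHTYLEGS_TOKEN" "$1"
}

# Print the URL that the URL list request in the next step would use.
api_url "urllists/urlList1"
```

The commands in the rest of this guide spell out the full URL, so this helper is optional.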

3. Upload your URL list

Before we can create our web crawl, we need to create a URL list. A URL list is one or more URLs from which your crawl will start. Without the URL list, a crawl won't know where to start.

Enter the following into your terminal (replace the dummy API token with your real API token):

curl --request PUT --url https://AAAXXXXXXXXXXXX:@api.80legs.com/v2/urllists/urlList1 --header 'content-type: application/octet-stream' --data '["https://www.80legs.com/", "https://www.datafiniti.co/"]'

In this example, we're creating a URL list with just https://www.80legs.com and https://www.datafiniti.co. Any crawl using this URL list will start crawling from these two URLs.

You should get a response similar to this (although it may not look as pretty in your terminal):

{
  "created": "urlList1"
}
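For longer lists, typing the JSON array by hand gets tedious. The sketch below builds the array from a text file with one URL per line. The function name urls_to_json and the file name urls.txt are hypothetical, and the sketch assumes the URLs contain no double quotes or backslashes, since it does no JSON escaping.

```shell
# Convert one-URL-per-line text on stdin into a JSON array of strings.
# Blank lines are skipped. No escaping is done, so this assumes the URLs
# contain no double quotes or backslashes.
urls_to_json() {
  awk '
    BEGIN { printf "[" }
    NF    { printf "%s\"%s\"", (n++ ? ", " : ""), $0 }
    END   { print "]" }
  '
}

# Example use with a hypothetical urls.txt (not executed here):
#   curl --request PUT \
#     --url https://AAAXXXXXXXXXXXX:@api.80legs.com/v2/urllists/urlList1 \
#     --header 'content-type: application/octet-stream' \
#     --data "$(urls_to_json < urls.txt)"
```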

4. Upload your 80app

The next thing we'll need to do is upload an 80app. An 80app is a small piece of code that runs every time your crawler requests a URL and does the work of generating links to crawl and scraping data from the web page.

You can read more about 80apps here. You can also view sample 80app code here. For now, we'll just use the code from the KeywordCollector 80app, since we're interested in scraping keywords for this example. Copy the code and save it to your local system as keywordCollector.js.

Enter the following into your terminal (replace the dummy API token with your real API token and /path/to/keywordCollector.js with the actual path to this file on your local system):

curl --request PUT --url https://AAAXXXXXXXXXXXX:@api.80legs.com/v2/apps/keywordCollector.js --header 'content-type: application/octet-stream' --data-binary @/path/to/keywordCollector.js

You should get a response similar to this (although it may not look as pretty in your terminal):

{
  "created": "keywordCollector.js"
}

5. Configure and run your crawl

Now that we've created a URL list and an 80app, we're ready to run our web crawl!

Enter the following into your terminal (replace the dummy API token with your real API token):

curl --request PUT https://AAAXXXXXXXXXXXX:@api.80legs.com/v2/crawls/testCrawl1 --header 'content-type: application/json' --data '{"app": "keywordCollector.js", "urllist": "urlList1", "max_depth": 10, "max_urls": 1000}'

You should get a response similar to this (although it may not look as pretty in your terminal):

{
    "id": 123456,
    "name": "testCrawl1",
    "user": "abcdefghijklmnopqrstuvwxyz012345",
    "app": "keywordCollector.js",
    "urllist": "urlList1",
    "max_depth": 10,
    "max_urls": 1000,
    "status": "STARTED",
    "depth": 0,
    "urls_crawled": 0,
    "date_created": "2018-7-25 19:0:54",
    "date_completed": "",
    "date_started": "2018-7-25 19:0:54"
}

Let's break down each of the parameters we sent in our request:

| Request Body Parameter | Description |
| --- | --- |
| app | The name of the 80app we're going to use. |
| urllist | The name of the URL list we're going to use. |
| max_depth | The maximum depth level for this crawl. Learn more about crawl depth here. |
| max_urls | The maximum number of URLs this crawl will request. |

Now let's walk through the response the API returned:

| Response Field | Description |
| --- | --- |
| id | The ID of the crawl. This is a globally unique identifier. |
| name | The name you gave the crawl. |
| user | Your API token. |
| app | The name of the 80app this crawl is using. |
| urllist | The URL list this crawl is using. |
| max_depth | The maximum depth level for this crawl. |
| max_urls | The maximum number of URLs this crawl will request. |
| status | The current status of the crawl. Check the possible values here. |
| depth | The current depth level of the crawl. |
| urls_crawled | The number of URLs crawled so far. |
| date_created | The date you created this crawl. |
| date_completed | The date the crawl completed. This will be empty until the crawl completes or is canceled. |
| date_started | The date the crawl started running. This can be different from date_created when the crawl starts off as queued. |

6. Check on crawl status

As mentioned, there is a status field in the response body above. This field shows that the crawl has started, which means it's running. Web crawls typically do not complete instantly, since they need to spend time requesting URLs and crawling links. To tell whether the crawl has finished, check its status periodically.

Enter the following into your terminal (replace the dummy API token with your real API token):

curl --request GET https://AAAXXXXXXXXXXXX:@api.80legs.com/v2/crawls/testCrawl1

You'll get another crawl object as your response like this:

{
    "id": 123456,
    "name": "testCrawl1",
    "user": "abcdefghijklmnopqrstuvwxyz012345",
    "app": "keywordCollector.js",
    "urllist": "urlList1",
    "max_depth": 10,
    "max_urls": 1000,
    "status": "STARTED",
    "depth": 1,
    "urls_crawled": 198,
    "date_created": "2018-7-25 19:0:54",
    "date_completed": "",
    "date_started": "2018-7-25 19:0:54"
}

If you keep sending this request, you should notice depth and urls_crawled gradually increasing. At some point, status will change to COMPLETED. That's how you know the crawl has finished running.
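Rather than re-running the request by hand, you could script the check. The sketch below is one possible approach: crawl_status and wait_for_crawl are hypothetical helper names, the 60-second interval is an arbitrary choice, and the sed pattern assumes the status field appears once with a plain string value, as in the sample responses above.

```shell
# Extract the value of the "status" field from a crawl object on stdin.
# Assumes the field appears once with a simple string value.
crawl_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll a crawl until it reports COMPLETED.
# Arguments: API token, crawl name.
# The loop only stops on COMPLETED; interrupt it with Ctrl-C if the crawl
# ends in some other state, such as being canceled.
wait_for_crawl() {
  token=$1
  crawl=$2
  while :; do
    status=$(curl --silent --request GET \
      "https://${token}:@api.80legs.com/v2/crawls/${crawl}" | crawl_status)
    echo "status: ${status}"
    [ "$status" = "COMPLETED" ] && break
    sleep 60
  done
}

# Example use (not executed here):
#   wait_for_crawl AAAXXXXXXXXXXXX testCrawl1
```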

7. Download results

After the crawl finishes, you'll want to download the result files. Result files are logs of all the data scraped during the crawl.

Once you see a status of COMPLETED for your crawl, enter the following into your terminal (replace the dummy API token with your real API token):

curl --request GET https://AAAXXXXXXXXXXXX:@api.80legs.com/v2/results/testCrawl1

You should get a response similar to this (although it may not look as pretty in your terminal):

[
    "http://datafiniti-voltron-results.s3.amazonaws.com/abcdefghijklmnopqrstuvwxyz012345/123456_1.txt?AWSAccessKeyId=AKIAIELL2XADVPVJZ4MA&Signature=P5aPspt%2B%2F0Kr8u1nxU%2FHVVrRgOw%3D&Expires=1530820626"
]

Depending on how many URLs you crawl, and how much data you scrape from each URL, you'll see one or more links to result files in your results response. 80legs will create a result file for every 100 MB of data you scrape, which means result files can become available while your crawl is still running.

For very large crawls that take more than 7 days to run, we recommend checking your available results on a weekly basis. Result files will expire 7 days after they are created.

To download the result files, copy each URL from the result response and run a command like:

curl 'http://datafiniti-voltron-results.s3.amazonaws.com/abcdefghijklmnopqrstuvwxyz012345/123456_1.txt?AWSAccessKeyId=AKIAIELL2XADVPVJZ4MA&Signature=P5aPspt%2B%2F0Kr8u1nxU%2FHVVrRgOw%3D&Expires=1530820626' > 123456_1.txt
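When a crawl produces many result files, copying each URL by hand is error-prone. The following sketch reads a results response and downloads every file it lists. The helper names result_urls, result_filename, and download_results are hypothetical, and the sketch relies on grep -o, which is widely available but not guaranteed on every system.

```shell
# Pull each quoted http(s) URL out of a results response on stdin.
result_urls() {
  grep -o '"http[^"]*"' | tr -d '"'
}

# Derive a local file name from a result URL by dropping the query string
# and everything up to the last slash.
result_filename() {
  url=${1%%[?]*}
  printf '%s\n' "${url##*/}"
}

# Download every result file listed in the response on stdin into the
# current directory.
download_results() {
  result_urls | while IFS= read -r url; do
    curl --silent --output "$(result_filename "$url")" "$url"
  done
}

# Example use (not executed here):
#   curl --silent https://AAAXXXXXXXXXXXX:@api.80legs.com/v2/results/testCrawl1 | download_results
```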

8. Process the results

After you've downloaded the result files, you'll want to process them so you can make use of the data. A result file will have a structure similar to this:

[
  {
    "url": "https://www.80legs.com",
    "result": "...."
  },
  {
    "url": "https://www.datafiniti.co",
    "result": "...."
  },
  ...
]

Note that the file is one large JSON document. Specifically, it's an array of objects, where each object consists of a url field and a result field. The result field will contain a string related to the data you've scraped, which, if you remember, is determined by your 80app.
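For a quick look at a result file from the terminal, a command-line JSON tool can help. The sketch below assumes jq is installed, which is a separate tool and not part of cURL, and that the file follows the array-of-objects structure shown above. The function names are hypothetical.

```shell
# List every crawled URL in a result file, one per line.
# Requires jq to be installed separately.
list_result_urls() {
  jq -r '.[].url' "$1"
}

# Count how many entries a result file contains.
count_results() {
  jq 'length' "$1"
}

# Example use (not executed here):
#   count_results 123456_1.txt
#   list_result_urls 123456_1.txt
```

This is only useful for inspection; what you do with each result string depends on what your 80app returns.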

In order to process these results files, we recommend using a programming language. We have examples for how to do this in our other guides.