Running crawls with Postman

For this guide, we're going to assume you want to scrape keywords from a specific list of websites.

1. Download your Postman collection

You can download a Postman collection directly from our web portal. Log in at https://portal.80legs.com/login and then go to your profile page. Click "Get Postman Collection" to download your collection.

The collection will be pre-populated with several API calls that use your API token.

Import this collection into Postman. You can do this by opening Postman and clicking on "Import" at the top-left of your screen.

2. Upload your URL list

Before we can create our web crawl, we need to create a URL list. A URL list is one or more URLs from which your crawl will start. Without the URL list, a crawl won't know where to start.

With Postman open, click on the 80legs API collection. Then navigate to Secured Endpoints > Upload a URL List. Click on "Body" in the main screen.

First, give your URL list a name by replacing :name in the API call with a unique name, like urlList1.

Then, enter the following in the text area:

[
    "https://www.80legs.com",
    "https://www.datafiniti.co"
]

In this example, we're creating a URL list with just https://www.80legs.com and https://www.datafiniti.co. Any crawl using this URL list will start crawling from these two URLs.

Click the blue "Send" button. You should get this response:

{
  "created": "urlList1"
}
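If you later want to script this step instead of clicking through Postman, the request can be assembled in code. The sketch below only builds the request and does not send it, so it is safe to run and inspect. The base URL, the /urllists/:name path, the PUT method, the Content-Type header, and the token-as-username Basic auth are all assumptions made for illustration; copy the real values from the "Upload a URL List" request in your Postman collection.

```python
import base64
import json
import urllib.request

# Assumption: placeholder base URL. Use the one shown in your Postman collection.
API_BASE = "https://api.example.com/v2"


def build_urllist_upload(token, name, urls, base=API_BASE):
    """Build (but do not send) a request that uploads a URL list as a JSON array."""
    body = json.dumps(list(urls)).encode("utf-8")
    # Assumption: the API token is the Basic auth username, with an empty password.
    auth = base64.b64encode(f"{token}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base}/urllists/{name}",  # assumed path
        data=body,
        method="PUT",  # assumed HTTP method
        headers={
            "Authorization": f"Basic {auth}",
            "Content-Type": "application/json",  # assumed content type
        },
    )


request = build_urllist_upload(
    token="YOUR_API_TOKEN",
    name="urlList1",
    urls=["https://www.80legs.com", "https://www.datafiniti.co"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Once the URL, method, and headers match your collection, passing the request to urllib.request.urlopen would send it.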

3. Upload your 80app

The next thing we'll need to do is upload an 80app. An 80app is a small piece of code that runs every time your crawler requests a URL and does the work of generating links to crawl and scraping data from the web page.

You can read more about 80apps here. You can also view sample 80app code here. For now, we'll just use the code from the KeywordCollector 80app, since we're interested in scraping keywords for this example.

In Postman, navigate to Secured Endpoints > Apps > Upload an 80app. Click on "Body" in the main screen.

First, give your 80app a name by replacing :name in the API call with a unique name, like keywordCollector.js.

Then, copy the code from KeywordCollector and paste it into the request body field.

Click the blue "Send" button. You should get this response:

{
  "created": "keywordCollector.js"
}

4. Configure and run your crawl

Now that we've created a URL list and an 80app, we're ready to run our web crawl!

In Postman, navigate to Secured Endpoints > Crawls > Start a Crawl. Click on "Body" in the main screen.

First, give your crawl a name by replacing :name with something unique, like testCrawl1.

Next, configure your request body so that the crawl will use the URL list and 80app you just created. You can also change the max_depth and max_urls fields if you like. Read more about what these fields mean here. Your request body should look something like this:

{
    "app": "keywordCollector.js",
    "urllist": "urlList1",
    "max_depth": 10,
    "max_urls": 1000
}

Once you've configured the crawl using the request body, click the blue "Send" button. You should get a response like this:

{
    "id": 123456,
    "name": "testCrawl1",
    "user": "abcdefghijklmnopqrstuvwxyz012345",
    "app": "keywordCollector",
    "urllist": "urlList1",
    "max_depth": 10,
    "max_urls": 1000,
    "status": "STARTED",
    "depth": 0,
    "urls_crawled": 0,
    "date_created": "2018-7-25 19:0:54",
    "date_completed": "",
    "date_started": "2018-7-25 19:0:54"
}

Let's break down each of the parameters we sent in our request:

| Request Body Parameter | Description |
| --- | --- |
| app | The name of the 80app we're going to use. |
| urllist | The name of the URL list we're going to use. |
| max_depth | The maximum depth level for this crawl. Learn more about crawl depth here. |
| max_urls | The maximum number of URLs this crawl will request. |
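The request body parameters can also be assembled programmatically, which helps when you start many crawls with different limits. This is a minimal sketch: build_crawl_body is a hypothetical helper, and the range checks (max_depth of at least 0, max_urls of at least 1) are sanity checks chosen for this example rather than documented API rules.

```python
import json


def build_crawl_body(app, urllist, max_depth=10, max_urls=1000):
    """Return the JSON request body for starting a crawl."""
    if not app or not urllist:
        raise ValueError("app and urllist must name resources you already uploaded")
    # Assumed sanity checks, not documented API limits.
    for field, value, minimum in (("max_depth", max_depth, 0), ("max_urls", max_urls, 1)):
        if not isinstance(value, int) or value < minimum:
            raise ValueError(f"{field} must be an integer >= {minimum}")
    return json.dumps(
        {"app": app, "urllist": urllist, "max_depth": max_depth, "max_urls": max_urls},
        indent=4,
    )


body = build_crawl_body("keywordCollector.js", "urlList1")
print(body)
```

The printed text can be pasted into the "Body" tab of the Start a Crawl request.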

Now let's go through the response the API returned:

| Response Field | Description |
| --- | --- |
| id | The ID of the crawl. This is a globally unique identifier. |
| name | The name you gave the crawl. |
| user | Your API token. |
| app | The name of the 80app this crawl is using. |
| urllist | The URL list this crawl is using. |
| max_depth | The maximum depth level for this crawl. |
| max_urls | The maximum number of URLs this crawl will request. |
| status | The current status of the crawl. Check the possible values here. |
| depth | The current depth level of the crawl. |
| urls_crawled | The number of URLs crawled so far. |
| date_created | The date you created this crawl. |
| date_completed | The date the crawl completed. This will be empty until the crawl completes or is canceled. |
| date_started | The date the crawl started running. This can be different than date_created when the crawl starts off as queued. |
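The date fields in the response are not zero-padded (for example "2018-7-25 19:0:54"), so it can help to parse them before comparing them. The sketch below uses Python's datetime.strptime, which accepts single-digit values for these fields. It assumes every date follows the format shown in the sample response, and it treats the values as naive timestamps because the response does not state a time zone.

```python
from datetime import datetime

# Matches timestamps such as "2018-7-25 19:0:54" from the sample response.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_crawl_date(value):
    """Parse a crawl timestamp; return None for the empty string."""
    if not value:
        return None
    return datetime.strptime(value, DATE_FORMAT)


crawl = {
    "status": "STARTED",
    "date_created": "2018-7-25 19:0:54",
    "date_completed": "",
    "date_started": "2018-7-25 19:0:54",
}

created = parse_crawl_date(crawl["date_created"])
started = parse_crawl_date(crawl["date_started"])
completed = parse_crawl_date(crawl["date_completed"])

# Time the crawl spent queued before it started running.
queue_delay = started - created
print(created.isoformat(), completed, queue_delay)
```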

5. Check on crawl status

As mentioned, there is a status field in the response body above. A value of STARTED means the crawl is running. Web crawls typically do not complete instantly, since they need to spend time requesting URLs and following links. To tell whether the crawl has finished, we can check its status periodically.

To do this in Postman, navigate to Secured Endpoints > Crawls > Get a Crawl. Replace :name with the name of your crawl. In this example, the name would be testCrawl1. Click on the blue "Send" button. You'll get another crawl object as your response like this:

{
    "id": 123456,
    "name": "testCrawl1",
    "user": "abcdefghijklmnopqrstuvwxyz012345",
    "app": "keywordCollector",
    "urllist": "urlList1",
    "max_depth": 10,
    "max_urls": 1000,
    "status": "STARTED",
    "depth": 1,
    "urls_crawled": 198,
    "date_created": "2018-7-25 19:0:54",
    "date_completed": "",
    "date_started": "2018-7-25 19:0:54"
}

If you keep sending this request, you should notice depth and urls_crawled gradually increasing. At some point, status will change to COMPLETED. That's how you know the crawl has finished running.
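Checking the status by hand gets tedious for long crawls, so the same loop can be automated. In this minimal sketch, wait_for_crawl is a hypothetical helper that takes the "Get a Crawl" call as a function argument, and the example replays canned responses instead of calling the API. It only recognizes the COMPLETED status described above; other status values are not handled here, and max_checks is a guard added for this example so the loop cannot run forever.

```python
import time


def wait_for_crawl(fetch_crawl, name, interval_seconds=60, max_checks=1000, sleep=time.sleep):
    """Poll until the crawl's status is COMPLETED and return the final crawl object."""
    for _ in range(max_checks):
        crawl = fetch_crawl(name)
        print(crawl["status"], crawl["depth"], crawl["urls_crawled"])
        if crawl["status"] == "COMPLETED":
            return crawl
        sleep(interval_seconds)
    raise TimeoutError(f"crawl {name!r} did not complete after {max_checks} checks")


# Stand-in for the real "Get a Crawl" call: replays canned responses.
_responses = iter([
    {"status": "STARTED", "depth": 0, "urls_crawled": 0},
    {"status": "STARTED", "depth": 1, "urls_crawled": 198},
    {"status": "COMPLETED", "depth": 10, "urls_crawled": 1000},
])


def fake_fetch(name):
    return next(_responses)


# Skip the real delay so the example finishes immediately.
final = wait_for_crawl(fake_fetch, "testCrawl1", sleep=lambda seconds: None)
```

To use it against the API, replace fake_fetch with a function that sends the "Get a Crawl" request and returns the parsed JSON response.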

6. Download results

After the crawl finishes, you'll want to download the result files. Result files are logs of all the data scraped during the crawl.

In Postman, navigate to Secured Endpoints > Results > Get Results from a Crawl. Replace :name with your crawl's name (in this example, testCrawl1). Click the blue "Send" button. You'll get a response similar to:

[
    "http://datafiniti-voltron-results.s3.amazonaws.com/abcdefghijklmnopqrstuvwxyz012345/123456_1.txt?AWSAccessKeyId=AKIAIELL2XADVPVJZ4MA&Signature=P5aPspt%2B%2F0Kr8u1nxU%2FHVVrRgOw%3D&Expires=1530820626"
]

Depending on how many URLs you crawl, and how much data you scrape from each URL, you'll see one or more links to result files in your results response. 80legs will create a result file for every 100 MB of data you scrape, which means result files can become available while your crawl is still running.

For very large crawls that take more than 7 days to run, we recommend checking your available results on a weekly basis. Result files will expire 7 days after they are created.
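Each link in the results response carries its file name in the path and an Expires query parameter. The sketch below pulls both out so you can decide what to download first. It assumes Expires is a Unix timestamp in seconds, as in Amazon S3 query-string signed URLs, and that it marks when the link itself stops working. The sample URL is shortened from the response above, with a placeholder signature.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlsplit


def describe_result_url(url):
    """Return the file name and link expiry time encoded in a result URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # Assumption: Expires is a Unix timestamp in seconds.
    expires = int(query["Expires"][0])
    return {
        "filename": PurePosixPath(parts.path).name,
        "expires_at": datetime.fromtimestamp(expires, tz=timezone.utc),
    }


sample_url = (
    "http://datafiniti-voltron-results.s3.amazonaws.com/"
    "abcdefghijklmnopqrstuvwxyz012345/123456_1.txt"
    "?Signature=PLACEHOLDER&Expires=1530820626"
)
info = describe_result_url(sample_url)
print(info["filename"], info["expires_at"].isoformat())
```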

7. Process the results

After you've downloaded the result files, you'll want to process them so you can make use of the data. A result file will have a structure similar to this:

[
  {
    "url": "https://www.80legs.com",
    "result": "...."
  },
  {
    "url": "https://www.datafiniti.co",
    "result": "...."
  },
  ...
]

Note that the file is a single large JSON document. Specifically, it's an array of objects, where each object consists of a url field and a result field. The result field will contain a string related to the data you've scraped, which, if you remember, is determined by your 80app.

In order to process these results files, we recommend using a programming language. We have examples for how to do this in our other guides.
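As a starting point, here is a minimal sketch in Python that reads the structure described above. load_results is a hypothetical helper, and the sample text stands in for the contents of a downloaded file. It parses the whole file in memory, which is simple but may be slow or memory-hungry for files near 100 MB, and it leaves each result string untouched because its format depends on your 80app.

```python
import json


def load_results(text):
    """Parse a result file's contents into a list of (url, result) pairs."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of objects with url and result fields")
    return [(record["url"], record["result"]) for record in records]


# Stand-in for the text of a downloaded result file.
sample = """
[
  {"url": "https://www.80legs.com", "result": "...."},
  {"url": "https://www.datafiniti.co", "result": "...."}
]
"""

results = load_results(sample)
for url, result in results:
    print(url, len(result))
```

For a real file, read its contents with open(path, encoding="utf-8").read() and pass the text to load_results.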