Running crawls with Postman
For this guide, we'll assume you want to scrape keywords from a specific list of websites.
1. Download your Postman collection
You can download a Postman collection directly from our web portal. Log in at https://portal.80legs.com/login and then go to your profile page. Click on "Get Postman Collection" to download your collection.
The collection will be pre-populated with several API calls that use your API token.
Import this collection into Postman. You can do this by opening Postman and clicking on "Import" at the top-left of your screen.
2. Upload your URL list
Before we can create our web crawl, we need to create a URL list. A URL list is one or more URLs from which your crawl will start. Without the URL list, a crawl won't know where to start.
With Postman open, click on the 80legs API collection. Then navigate to Secured Endpoints > Upload a URL List. Click on "Body" in the main screen.
First, give your URL list a name by replacing :name in the API call with a unique name, like urlList1.
Then, enter the following in the text area:
[
"https://www.80legs.com",
"https://www.datafiniti.co"
]
In this example, we're creating a URL list with just https://www.80legs.com and https://www.datafiniti.co. Any crawl using this URL list will start crawling from these two URLs.
Click the blue "Send" button. You should get this response:
{
"created": "urlList1"
}
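If you would rather generate the request body than type it by hand, the following is a minimal Python sketch that builds the same JSON array. The helper name build_url_list_body is hypothetical (it is not part of any 80legs library), and the check that every entry is an absolute http(s) URL is an assumed sanity check, not a documented API rule.

```python
import json
from urllib.parse import urlparse


def build_url_list_body(urls):
    """Return the JSON text for a URL list request body.

    Hypothetical helper: rejects entries that are not absolute http(s) URLs
    (an assumed sanity check) and drops exact duplicates, keeping order.
    """
    seen = set()
    cleaned = []
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return json.dumps(cleaned, indent=2)


body = build_url_list_body(["https://www.80legs.com", "https://www.datafiniti.co"])
print(body)
```

The printed text can be pasted into the "Body" text area in Postman.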
3. Upload your 80app
The next thing we'll need to do is upload an 80app. An 80app is a small piece of code that runs every time your crawler requests a URL and does the work of generating links to crawl and scraping data from the web page.
You can read more about 80apps here. You can also view sample 80app code here. For now, we'll just use the code from the KeywordCollector 80app, since we're interested in scraping keywords for this example.
In Postman, navigate to Secured Endpoints > Apps > Upload an 80app. Click on "Body" in the main screen.
First, give your 80app a name by replacing :name in the API call with a unique name, like keywordCollector.js.
Then, copy the code from KeywordCollector and paste it into the request body field.
Click the blue "Send" button. You should get this response:
{
"created": "keywordCollector.js"
}
4. Configure and run your crawl
Now that we've created a URL list and an 80app, we're ready to run our web crawl!
In Postman, navigate to Secured Endpoints > Crawls > Start a Crawl. Click on "Body" in the main screen.
First, give your crawl a name by replacing :name with something unique, like testCrawl1.
Next, configure your request body so that the crawl will use the URL list and 80app you just created. You can also change the max_depth and max_urls fields if you like. Read more about what these fields mean here. Your request body should look something like this:
{
"app": "keywordCollector.js",
"urllist": "urlList1",
"max_depth": 10,
"max_urls": 1000
}
Once you've configured the crawl using the request body, click the blue "Send" button. You should get a response like this:
{
"id": 123456,
"name": "testCrawl1",
"user": "abcdefghijklmnopqrstuvwxyz012345",
"app": "keywordCollector",
"urllist": "urlList1",
"max_depth": 10,
"max_urls": 1000,
"status": "STARTED",
"depth": 0,
"urls_crawled": 0,
"date_created": "2018-7-25 19:0:54",
"date_completed": "",
"date_started": "2018-7-25 19:0:54"
}
Let's break down each of the parameters we sent in our request:
Request Body Parameter | Description |
---|---|
app | The name of the 80app we're going to use. |
urllist | The name of the URL list we're going to use. |
max_depth | The maximum depth level for this crawl. Learn more about crawl depth here. |
max_urls | The maximum number of URLs this crawl will request. |
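As an illustration of how these four parameters fit together, here is a small Python sketch that assembles the request body. The function build_crawl_body is a hypothetical helper, and its range checks (non-negative depth, at least one URL) are assumptions made for the example rather than documented API limits.

```python
import json


def build_crawl_body(app, urllist, max_depth=10, max_urls=1000):
    """Return the JSON text for a "Start a Crawl" request body.

    Hypothetical helper. Field names come from the example request body;
    the range checks below are assumptions, not documented API limits.
    """
    if not app or not urllist:
        raise ValueError("app and urllist must be non-empty names")
    if max_depth < 0:
        raise ValueError("max_depth must be zero or greater")
    if max_urls < 1:
        raise ValueError("max_urls must be at least 1")
    body = {
        "app": app,
        "urllist": urllist,
        "max_depth": max_depth,
        "max_urls": max_urls,
    }
    return json.dumps(body, indent=2)


crawl_body = build_crawl_body("keywordCollector.js", "urlList1")
print(crawl_body)
```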
Now let's walk through the response the API returned:
Response Field | Description |
---|---|
id | The ID of the crawl. This is a globally unique identifier. |
name | The name you gave the crawl. |
user | Your API token. |
app | The name of the 80app this crawl is using. |
urllist | The URL list this crawl is using. |
max_depth | The maximum depth level for this crawl. |
max_urls | The maximum number of URLs this crawl will request. |
status | The current status of the crawl. Check the possible values here. |
depth | The current depth level of the crawl. |
urls_crawled | The number of URLs crawled so far. |
date_created | The date you created this crawl. |
date_completed | The date the crawl completed. This will be empty until the crawl completes or is canceled. |
date_started | The date the crawl started running. This can be different than date_created when the crawl starts off as queued. |
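If you save a crawl response and want to work with it in code, the sketch below parses it into a small Python object. The Crawl class and parse_crawl function are hypothetical names. The date format string is inferred from the sample value "2018-7-25 19:0:54" and is an assumption; the time zone is not stated in the response, so the parsed dates are left without one.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Assumed from the sample value "2018-7-25 19:0:54" (fields are not zero-padded).
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_date(text):
    """Return a naive datetime, or None when the field is an empty string."""
    return datetime.strptime(text, DATE_FORMAT) if text else None


@dataclass
class Crawl:
    id: int
    name: str
    status: str
    depth: int
    urls_crawled: int
    date_created: Optional[datetime]
    date_started: Optional[datetime]
    date_completed: Optional[datetime]


def parse_crawl(text):
    """Build a Crawl from the JSON text of a crawl response."""
    data = json.loads(text)
    return Crawl(
        id=data["id"],
        name=data["name"],
        status=data["status"],
        depth=data["depth"],
        urls_crawled=data["urls_crawled"],
        date_created=parse_date(data["date_created"]),
        date_started=parse_date(data["date_started"]),
        date_completed=parse_date(data["date_completed"]),
    )


sample = """{
  "id": 123456, "name": "testCrawl1", "status": "STARTED",
  "depth": 0, "urls_crawled": 0,
  "date_created": "2018-7-25 19:0:54",
  "date_completed": "",
  "date_started": "2018-7-25 19:0:54"
}"""
crawl = parse_crawl(sample)
print(crawl.name, crawl.status, crawl.date_created, crawl.date_completed)
```

An empty date_completed becomes None here, which makes it easy to tell a running crawl from a finished one.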
5. Check on crawl status
As mentioned, there is a status field in the response body above. This field shows us the crawl has started, which means it's running. Web crawls typically do not complete instantaneously, since they need to spend time requesting URLs and crawling links. To tell whether the crawl has finished, we can check its status periodically.
To do this in Postman, navigate to Secured Endpoints > Crawls > Get a Crawl. Replace :name with the name of your crawl. In this example, the name would be testCrawl1. Click on the blue "Send" button. You'll get another crawl object as your response, like this:
{
"id": 123456,
"name": "testCrawl1",
"user": "abcdefghijklmnopqrstuvwxyz012345",
"app": "keywordCollector",
"urllist": "urlList1",
"max_depth": 10,
"max_urls": 1000,
"status": "STARTED",
"depth": 1,
"urls_crawled": 198,
"date_created": "2018-7-25 19:0:54",
"date_completed": "",
"date_started": "2018-7-25 19:0:54"
}
If you keep sending this request, you should notice depth and urls_crawled gradually increasing. At some point, status will change to COMPLETED. That's how you know the crawl has finished running.
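Repeatedly sending the request by hand can be replaced with a polling loop. The following Python sketch shows one way to structure it. The function wait_for_completion is hypothetical, and fetch_crawl stands in for whatever code you write to perform the "Get a Crawl" request and return the parsed JSON as a dict. The demo uses canned responses with illustrative values instead of real API calls.

```python
import time


def wait_for_completion(fetch_crawl, interval_seconds=60, max_checks=100,
                        sleep=time.sleep):
    """Poll until the crawl's status is COMPLETED, then return the crawl dict.

    fetch_crawl is a caller-supplied function (hypothetical) that performs the
    "Get a Crawl" request. Raises TimeoutError after max_checks attempts.
    """
    for attempt in range(max_checks):
        crawl = fetch_crawl()
        if crawl["status"] == "COMPLETED":
            return crawl
        if attempt < max_checks - 1:
            sleep(interval_seconds)  # wait before asking again
    raise TimeoutError(f"crawl not completed after {max_checks} checks")


# Demo with canned responses; the numbers are illustrative only.
responses = iter([
    {"status": "STARTED", "depth": 0, "urls_crawled": 0},
    {"status": "STARTED", "depth": 1, "urls_crawled": 198},
    {"status": "COMPLETED", "depth": 10, "urls_crawled": 1000},
])
final = wait_for_completion(lambda: next(responses), interval_seconds=0)
print(final["status"], final["urls_crawled"])
```

Passing the fetch function in as a parameter keeps the loop independent of any particular HTTP client, and the max_checks limit prevents the loop from running forever if the crawl never reaches COMPLETED.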
6. Download results
After the crawl finishes, you'll want to download the result files. Result files are logs of all the data scraped during the crawl.
In Postman, navigate to Secured Endpoints > Results > Get Results from a Crawl. Replace :name with your crawl's name (e.g., testCrawl1). Click the blue "Send" button. You'll get a response similar to:
[
"http://datafiniti-voltron-results.s3.amazonaws.com/abcdefghijklmnopqrstuvwxyz012345/123456_1.txt?AWSAccessKeyId=AKIAIELL2XADVPVJZ4MA&Signature=P5aPspt%2B%2F0Kr8u1nxU%2FHVVrRgOw%3D&Expires=1530820626"
]
Depending on how many URLs you crawl, and how much data you scrape from each URL, you'll see one or more links to result files in your results response. 80legs creates a new result file for every 100 MB of data you scrape, which means result files can become available while your crawl is still running.
For very large crawls that take more than 7 days to run, we recommend checking your available results on a weekly basis. Result files will expire 7 days after they are created.
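To keep track of result links, it can help to pull the file name and the Expires value out of each URL. The Python sketch below does this for the sample link shown above. The function describe_result_link is a hypothetical helper. Reading Expires as a Unix timestamp follows the usual convention for signed S3 links and is an assumption here; this guide does not say how it relates to the 7-day file expiry.

```python
import posixpath
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def describe_result_link(url):
    """Return the file name and (assumed) link expiry time for a result URL."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    expires = query.get("Expires", [None])[0]
    return {
        "file_name": posixpath.basename(parts.path),
        # Assumption: Expires is seconds since the Unix epoch (UTC).
        "link_expires": (
            datetime.fromtimestamp(int(expires), tz=timezone.utc)
            if expires else None
        ),
    }


sample_link = (
    "http://datafiniti-voltron-results.s3.amazonaws.com/"
    "abcdefghijklmnopqrstuvwxyz012345/123456_1.txt"
    "?AWSAccessKeyId=AKIAIELL2XADVPVJZ4MA"
    "&Signature=P5aPspt%2B%2F0Kr8u1nxU%2FHVVrRgOw%3D&Expires=1530820626"
)
info = describe_result_link(sample_link)
print(info["file_name"], info["link_expires"])
```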
7. Process the results
After you've downloaded the result files, you'll want to process them so you can make use of the data. A result file will have a structure similar to this:
[
{
"url": "https://www.80legs.com",
"result": "...."
},
{
"url": "https://www.datafiniti.co",
"result": "...."
},
...
]
Note that the file is one large JSON document. Specifically, it's an array of objects, where each object consists of a url field and a result field. The result field will contain a string related to the data you've scraped, which, if you remember, is determined by your 80app.
In order to process these results files, we recommend using a programming language. We have examples for how to do this in our other guides.
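As a starting point, here is a minimal Python sketch that reads a result file with the structure shown above and maps each URL to its result string. The function names are hypothetical, UTF-8 encoding is an assumption, and the content of each result string depends entirely on your 80app, so the sketch treats it as opaque text.

```python
import json


def parse_results(text):
    """Parse result-file JSON text into a dict of url -> result string."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array at the top level")
    by_url = {}
    for record in records:
        # Skip anything that does not look like {"url": ..., "result": ...}.
        if isinstance(record, dict) and "url" in record and "result" in record:
            by_url[record["url"]] = record["result"]
    return by_url


def load_results(path):
    """Read one downloaded result file (assumed UTF-8) and parse it."""
    with open(path, encoding="utf-8") as handle:
        return parse_results(handle.read())


sample_text = """[
  {"url": "https://www.80legs.com", "result": "...."},
  {"url": "https://www.datafiniti.co", "result": "...."}
]"""
for url, result in parse_results(sample_text).items():
    print(url, len(result))
```

This approach reads the whole file into memory at once, which is simple but may need rethinking if you process many large files on a small machine.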