Running crawls with Python

This guide assumes you want to scrape keywords from a specific list of websites.

Note that we are using Python 3 for the examples below.

1. Install the requests module for Python

In your terminal, run the following to install the requests module for Python:

pip3 install requests

2. Get your API token

The next thing you'll need is your API token. The API token lets you authenticate with the 80legs API and tells it who you are, what you have access to, and so on. Without it, you can't use the API.

To get your API token, go to the 80legs Web Portal (https://portal.80legs.com), log in, and click on your account name at the top-right. From there, you'll see a link to the "My Account" page, which shows your token. Your API token will be a long string of letters and numbers. Copy the API token or store it somewhere you can easily reference.

For the rest of this document, we'll use AAAXXXXXXXXXXXX as a substitute example for your actual API token when showing example API calls.
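Since the token grants access to your account, you may prefer not to paste it into every script. The following is a minimal sketch, assuming you have exported the token in an environment variable; the variable name EIGHTYLEGS_API_TOKEN and both helper functions are hypothetical and not part of the 80legs API. The URL form simply mirrors the example calls in this guide.

```python
import os

API_HOST = "api.80legs.com"  # host taken from the example calls in this guide


def load_api_token(env_var="EIGHTYLEGS_API_TOKEN"):
    """Read the API token from an environment variable (the name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError("Set the " + env_var + " environment variable to your API token")
    return token


def api_url(token, path):
    """Build a URL in the same token-as-username form used in this guide."""
    return "https://" + token + ":@" + API_HOST + "/v2/" + path.lstrip("/")
```

With these helpers, the later examples could call api_url(load_api_token(), 'crawls/crawl1') instead of concatenating the token by hand.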

3. Upload your URL list

Before we can create our web crawl, we need to create a URL list. A URL list is one or more URLs from which your crawl will start. Without the URL list, a crawl won't know where to start.

Write the following code in your code editor (replace the dummy API token with your real API token):

import requests
import json

# Set your API parameters here.
API_token = 'AAAXXXXXXXXXXXX'
urllist_name = 'urlList1'

request_headers = {
	'Content-Type': 'application/json',
}
request_data = [
	'https://www.80legs.com',
	'https://www.datafiniti.co'
]

# Make the API call.
r = requests.put('https://' + API_token + ':@api.80legs.com/v2/urllists/' + urllist_name, json=request_data, headers=request_headers)

# Do something with the response.
if r.status_code == 201:
	print(r.content)
else:
	print('Request failed')

In this example, we're creating a URL list with just https://www.80legs.com and https://www.datafiniti.co. Any crawl using this URL list will start crawling from these two URLs.

You should get a response similar to this (although it may not look as pretty in your terminal):

{
  location: 'urllists/AAAXXXXXXXXXXXX/urlList1',
  name: 'urlList1',
  user: 'AAAXXXXXXXXXXXX',
  date_created: '2018-07-24T00:30:43.991Z',
  date_updated: '2018-07-24T00:30:43.991Z',
  id: '5b5673331141d3e8f728dde6'
}
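The request body in this step is a plain JSON array of URLs, so a longer list can be prepared from a text file before uploading. This is a rough sketch, assuming one URL per line with optional blank lines and # comments; the function name and the file layout are illustrative choices, not something the API requires.

```python
from urllib.parse import urlparse


def clean_url_list(lines):
    """Return unique http(s) URLs in first-seen order, skipping blanks and # comments."""
    seen = set()
    urls = []
    for line in lines:
        url = line.strip()
        if not url or url.startswith("#"):
            continue
        parts = urlparse(url)
        # Reject anything that is not an absolute http(s) URL.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError("Not an absolute http(s) URL: " + url)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

The returned list can be passed as request_data in the upload code above, for example clean_url_list(open('urls.txt')) for a hypothetical urls.txt file.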

4. Upload your 80app

The next thing we'll need to do is upload an 80app. An 80app is a small piece of code that runs every time your crawler requests a URL and does the work of generating links to crawl and scraping data from the web page.

You can read more about 80apps here. You can also view sample 80app code here. For now, we'll just use the code from the KeywordCollector 80app, since we're interested in scraping keywords for this example. Copy the code and save it to your local system as keywordCollector.js.

Write the following code in your code editor (replace the dummy API token with your real API token and /path/to/keywordCollector.js with the actual path to this file on your local system):

import requests

# Set your API parameters here.
API_token = 'AAAXXXXXXXXXXXX'
app_name = 'keywordCollector.js'
with open('/path/to/keywordCollector.js', 'r') as myFile:
	app_content = myFile.read()

request_headers = {
	'Content-Type': 'application/octet-stream',
}

# Make the API call.
r = requests.put('https://' + API_token + ':@api.80legs.com/v2/apps/' + app_name, data=app_content, headers=request_headers)

# Do something with the response.
if r.status_code == 201:
	print(r.content)
else:
	print('Request failed')

You should get a response similar to this (although it may not look as pretty in your terminal):

{
  "location":"80apps/AAAXXXXXXXXXXXX/keywordCollector.js",
  "name":"keywordCollector.js",
  "user":"AAAXXXXXXXXXXXX",
  "date_created":"2018-07-24T00:41:29.598Z",
  "date_updated":"2018-07-24T00:41:29.598Z",
  "id":"5b5675b91141d3e8f76d4fc7"
}

5. Configure and run your crawl

Now that we've created a URL list and an 80app, we're ready to run our web crawl!

Write the following code in your code editor (replace the dummy API token with your real API token):

import requests
import json

# Set your API parameters here.
API_token = 'AAAXXXXXXXXXXXX'
crawl_name = 'crawl1'
urllist_name = 'urlList1'
app_name = 'keywordCollector.js'
max_depth = 10
max_urls = 1000

request_headers = {
	'Content-Type': 'application/json',
}
request_data = {
	'urllist': urllist_name,
	'app': app_name,
	'max_depth': max_depth,
	'max_urls': max_urls
}

# Make the API call.
r = requests.put('https://' + API_token + ':@api.80legs.com/v2/crawls/' + crawl_name, json=request_data, headers=request_headers)

# Do something with the response.
if r.status_code == 201:
	print(r.content)
else:
	print('Request failed')

You should get a response similar to this (although it may not look as pretty in your terminal):

{
  date_updated: '2018-07-24T00:57:47.445Z',
  date_created: '2018-07-24T00:57:47.245Z',
  user: 'AAAXXXXXXXXXXXX',
  name: 'crawl1',
  urllist: 'urlList1',
  max_urls: 1000,
  date_started: '2018-07-24T00:57:47.444Z',
  format: 'json',
  urls_crawled: 0,
  max_depth: 10,
  depth: 0,
  status: 'STARTED',
  app: 'keywordCollector.js',
  id: 1568124
}

Let's break down each of the parameters we sent in our request:

Request Body Parameter    Description
app                       The name of the 80app we're going to use.
urllist                   The name of the URL list we're going to use.
max_depth                 The maximum depth level for this crawl. Learn more about crawl depth here.
max_urls                  The maximum number of URLs this crawl will request.
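The four parameters above can be assembled and sanity-checked before the request is sent. This is a small sketch; the function name is hypothetical, and the range checks (a non-negative depth, at least one URL) are local assumptions for catching typos, not documented API limits.

```python
def build_crawl_request(urllist, app, max_depth, max_urls):
    """Return the JSON body for creating a crawl, after basic local checks."""
    for name, value in (("urllist", urllist), ("app", app)):
        if not isinstance(value, str) or not value:
            raise ValueError(name + " must be a non-empty string")
    # Assumed limits: these are local sanity checks, not rules from the API.
    if isinstance(max_depth, bool) or not isinstance(max_depth, int) or max_depth < 0:
        raise ValueError("max_depth must be a non-negative integer")
    if isinstance(max_urls, bool) or not isinstance(max_urls, int) or max_urls < 1:
        raise ValueError("max_urls must be a positive integer")
    return {
        "urllist": urllist,
        "app": app,
        "max_depth": max_depth,
        "max_urls": max_urls,
    }
```

The returned dictionary has the same shape as request_data in the crawl example and can be passed to requests.put through the json argument.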

Now let's walk through the response the API returned:

Response Field    Description
id                The ID of the crawl. This is a globally unique identifier.
name              The name you gave the crawl.
user              Your API token.
app               The name of the 80app this crawl is using.
urllist           The URL list this crawl is using.
max_depth         The maximum depth level for this crawl.
max_urls          The maximum number of URLs this crawl will request.
status            The current status of the crawl. Check the possible values here.
depth             The current depth level of the crawl.
urls_crawled      The number of URLs crawled so far.
date_created      The date you created this crawl.
date_completed    The date the crawl completed. This will be empty until the crawl completes or is canceled.
date_started      The date the crawl started running. This can be different from date_created when the crawl starts off as queued.
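To make these fields easier to read at a glance, the crawl object can be reduced to a few derived numbers. This is a minimal sketch, assuming the timestamps always use the format shown in the sample response (UTC with fractional seconds and a trailing Z); the function names and the summary keys are made up for illustration.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse a timestamp such as '2018-07-24T00:57:47.445Z'; return None if empty."""
    if not value:
        return None
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def summarize_crawl(crawl):
    """Derive a small progress summary from a crawl object (a dict)."""
    created = parse_timestamp(crawl.get("date_created"))
    started = parse_timestamp(crawl.get("date_started"))
    queued = (started - created).total_seconds() if created and started else None
    max_urls = crawl.get("max_urls")
    fraction = crawl.get("urls_crawled", 0) / max_urls if max_urls else None
    return {
        "status": crawl.get("status"),
        "fraction_of_max_urls": fraction,  # share of the max_urls budget used so far
        "queued_seconds": queued,          # gap between date_created and date_started
    }
```

Calling summarize_crawl(r.json()) on the response from the crawl request would give you the status together with how much of the max_urls budget has been used.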

6. Check on crawl status

As mentioned, there is a status field in the response body above. This field shows us the crawl has started, which means it's running. Web crawls typically do not complete instantaneously, since they need to spend time requesting URLs and crawling links. To tell whether the crawl has finished, we can check its status periodically.

Write the following code in your code editor (replace the dummy API token with your real API token):

import requests
import json

# Set your API parameters here.
API_token = 'AAAXXXXXXXXXXXX'
crawl_name = 'crawl1'

request_headers = {
	'Content-Type': 'application/json',
}

# Make the API call.
r = requests.get('https://' + API_token + ':@api.80legs.com/v2/crawls/' + crawl_name, headers=request_headers)

# Do something with the response.
if r.status_code == 200:
	print(r.content)
else:
	print('Request failed')

You'll get another crawl object as your response like this:

{
  date_updated: '2018-07-24T00:57:47.445Z',
  date_created: '2018-07-24T00:57:47.245Z',
  user: 'AAAXXXXXXXXXXXX',
  name: 'crawl1',
  urllist: 'urlList1',
  max_urls: 1000,
  date_started: '2018-07-24T00:57:47.444Z',
  format: 'json',
  urls_crawled: 1,
  max_depth: 10,
  depth: 0,
  status: 'STARTED',
  app: 'keywordCollector.js',
  id: 1568124
}

If you keep sending this request, you should notice depth and urls_crawled gradually increasing. At some point, status will change to COMPLETED. That's how you know the crawl has finished running.
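Rather than re-running the status script by hand, the check can be wrapped in a polling loop. The sketch below assumes you supply a fetch_crawl function that performs the GET request above and returns the parsed crawl object; the function name, the 60-second default interval and the max_polls limit are illustrative choices. Only COMPLETED is treated as final here because it is the only end state this guide names, so extend done_statuses if your crawls can end in other states.

```python
import time


def wait_for_crawl(fetch_crawl, poll_seconds=60, max_polls=None,
                   sleep=time.sleep, done_statuses=("COMPLETED",)):
    """Call fetch_crawl() repeatedly until the crawl reaches a final status."""
    polls = 0
    while True:
        crawl = fetch_crawl()
        polls += 1
        if crawl["status"] in done_statuses:
            return crawl
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("Crawl not finished after %d status checks" % polls)
        # Wait before asking again so we do not hammer the API.
        sleep(poll_seconds)
```

The sleep argument exists so the loop can be exercised without real delays, for example by passing a function that only records the requested pause.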

7. Download results

After the crawl finishes, you'll want to download the result files. Result files are logs of all the data scraped during the crawl.

Once you see a status of COMPLETED for your crawl, use the following code to get the results (replace the dummy API token with your real API token):

import requests
import json

# Set your API parameters here.
API_token = 'AAAXXXXXXXXXXXX'
crawl_name = 'crawl1'

request_headers = {
	'Content-Type': 'application/json',
}

# Make the API call.
r = requests.get('https://' + API_token + ':@api.80legs.com/v2/results/' + crawl_name, headers=request_headers)

# Do something with the response.
if r.status_code == 200:
	print(r.content)
else:
	print('Request failed')

You should get a response similar to this (although it may not look as pretty in your terminal):

[
    "http://datafiniti-voltron-results.s3.amazonaws.com/abcdefghijklmnopqrstuvwxyz012345/123456_1.txt?AWSAccessKeyId=AKIAIELL2XADVPVJZ4MA&Signature=P5aPspt%2B%2F0Kr8u1nxU%2FHVVrRgOw%3D&Expires=1530820626"
]

Depending on how many URLs you crawl, and how much data you scrape from each URL, you'll see one or more links to result files in your results response. 80legs creates a result file for every 100 MB of data you scrape, which means result files can become available while your crawl is still running.

For very large crawls that take more than 7 days to run, we recommend checking your available results on a weekly basis. Result files will expire 7 days after they are created.
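The sample result link above carries an Expires query parameter that looks like a Unix timestamp, as is usual for signed Amazon S3 links. Assuming that reading is correct, a script can check how long a link remains usable before downloading. This is only a sketch: the function names are hypothetical, and whether Expires matches the 7-day file lifetime described here is not confirmed by this guide.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs


def result_link_expiry(url):
    """Return the link's Expires value as a UTC datetime, or None if absent."""
    values = parse_qs(urlparse(url).query).get("Expires")
    if not values:
        return None
    # Assumption: Expires is seconds since the Unix epoch.
    return datetime.fromtimestamp(int(values[0]), tz=timezone.utc)


def is_link_expired(url, now=None):
    """True if the link's Expires time has passed; False if it has none."""
    expiry = result_link_expiry(url)
    if expiry is None:
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expiry
```

If a link has already expired, requesting the results list again should return the links the API currently offers.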

To download the result files, you can run code like this:

import requests
import urllib.request
import json

# Set your API parameters here.
API_token = 'AAAXXXXXXXXXXXX'
crawl_name = 'crawl1'

request_headers = {
	'Content-Type': 'application/json',
}

# Make the API call.
r = requests.get('https://' + API_token + ':@api.80legs.com/v2/results/' + crawl_name, headers=request_headers)

# Do something with the response.
if r.status_code == 200:
	result_list = r.json()
	i = 1
	for result in result_list:
		filename = crawl_name + '_' + str(i) + '.txt'
		urllib.request.urlretrieve(result,filename)
		i += 1
else:
	print('Request failed')

8. Process the results

After you've downloaded the result files, you'll want to process them so you can make use of the data. A result file will have a structure similar to this:

[
  {
    "url": "https://www.80legs.com",
    "result": "...."
  },
  {
    "url": "https://www.datafiniti.co",
    "result": "...."
  },
  ...
]

Note that the file is a large JSON object. Specifically, it's an array of objects, where each object consists of a url field and a result field. The result field will contain a string related to the data you've scraped, which, if you remember, is determined by your 80app.

In order to process these results files, you can use code similar to this:

import json
import re

# Set the location of your file here
filename = 'xxxx_x.txt'

result_regex = re.compile(r'\{\"url\":\".*?\",\"result\":\"\{.*?\}\"\}')

with open(filename) as myFile:
	for line in myFile:
		results = result_regex.findall(line)
		for result in results:
			print(json.loads(result))

You can edit the code in the inner-most loop above to do whatever you'd like with the data, such as store the data in a database, write it out to your console, etc.
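If a result file is small enough to fit in memory and is valid JSON from start to finish, it can also be parsed in one pass with the json module instead of a regular expression. The sketch below makes two assumptions that you should check against your own files: that the file is a single JSON array as described above, and that result is usually a JSON-encoded string (when it isn't, the raw value is kept). The function name is hypothetical.

```python
import json


def parse_result_file(text):
    """Return a list of (url, result) pairs from the text of one result file."""
    parsed = []
    for record in json.loads(text):
        result = record["result"]
        try:
            # The result field is a string; decode it if it holds JSON.
            result = json.loads(result)
        except (TypeError, ValueError):
            pass  # not JSON, so keep the raw value as-is
        parsed.append((record["url"], result))
    return parsed
```

For example, with open(filename) as myFile: pairs = parse_result_file(myFile.read()) gives you a list you can loop over and store wherever you like.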

For this guide, we have created separate code files or blocks for each step of the crawl creation process. We've done this so you can understand the process better. In practice, it's probably best to combine the code into a single application to improve maintainability and usability.
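One way to combine the steps is a small client class that wraps the endpoints used in this guide. This is a sketch rather than an official client: the class and method names are invented, and it assumes that passing the token as the HTTP Basic username with an empty password is equivalent to the token-in-URL form used in the examples above. The session argument can be a requests.Session() or any object with a compatible request method.

```python
class EightyLegsClient:
    """Thin wrapper over the endpoints used in this guide (illustrative only)."""

    BASE_URL = "https://api.80legs.com/v2"

    def __init__(self, token, session):
        self.token = token
        self.session = session  # e.g. requests.Session()

    def _call(self, method, path, **kwargs):
        # Assumption: Basic auth with the token as username and no password.
        response = self.session.request(
            method, self.BASE_URL + "/" + path, auth=(self.token, ""), **kwargs)
        response.raise_for_status()
        return response.json()

    def upload_urllist(self, name, urls):
        return self._call("PUT", "urllists/" + name, json=urls)

    def upload_app(self, name, code):
        headers = {"Content-Type": "application/octet-stream"}
        return self._call("PUT", "apps/" + name, data=code, headers=headers)

    def start_crawl(self, name, urllist, app, max_depth, max_urls):
        body = {"urllist": urllist, "app": app,
                "max_depth": max_depth, "max_urls": max_urls}
        return self._call("PUT", "crawls/" + name, json=body)

    def get_crawl(self, name):
        return self._call("GET", "crawls/" + name)

    def get_results(self, name):
        return self._call("GET", "results/" + name)
```

A script would then create one client, for example EightyLegsClient(API_token, requests.Session()), and call upload_urllist, upload_app, start_crawl, get_crawl and get_results in that order. Unlike the step-by-step examples, this sketch raises an exception on any error status instead of printing 'Request failed'.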