Introduction to 80apps
80apps are "micro-apps" that run in conjunction with your web crawl. Each 80app takes the following actions when your crawl hits an individual URL:
- Determine which links, if any, to crawl from the current URL your crawl is hitting (via the 80app's parseLinks method).
- Determine what data, if any, to scrape from the current URL's content (via the 80app's processDocument method).
80apps are implemented in Node.js, and you can code and implement your own 80apps to do whatever you like. In essence, 80apps give you complete control over how your crawl behaves and what data it scrapes.
Structure of an 80app
Here's what a generic 80app looks like:
```javascript
var EightyApp = function() {
    // processDocument returns scraped data
    this.processDocument = function(html, url, headers, status, cheerio) {
        var app = this;
        var $html = app.parseHtml(html, cheerio);
        var object = new Object();
        // populate the object with data you want to scrape
        return JSON.stringify(object);
    }

    // parseLinks returns the next set of links to crawl
    this.parseLinks = function(html, url, headers, status, cheerio) {
        var app = this;
        var $html = app.parseHtml(html, cheerio);
        var links = [];
        // generate the set of links you want to crawl next
        return links;
    }
}

try {
    // When run locally (e.g. for testing), export a constructor that wires the
    // 80app up to the EightyAppBase prototype supplied by the caller
    module.exports = function(EightyAppBase) {
        EightyApp.prototype = new EightyAppBase();
        return new EightyApp();
    }
} catch(e) {
    // Inside the 80legs crawl environment, module is not defined, so wire up
    // the prototype directly against the globally available EightyAppBase
    EightyApp.prototype = new EightyAppBase();
}
```
Every 80app consists of two main functions:
- processDocument: returns scraped data from the URL being crawled
- parseLinks: returns the next set of links to crawl
processDocument
```javascript
this.processDocument = function(html, url, headers, status, cheerio)
```
| Parameter | Description |
|---|---|
| html | A string representation of the content returned by a GET request to url |
| url | The current URL being crawled |
| headers | The headers returned by a GET request to url |
| status | The HTTP status code returned when requesting url |
| cheerio | A library for converting the html string to a DOM object |
You can customize processDocument to return scraped data from the current URL being crawled.
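As an illustration, here is a minimal sketch of a customized processDocument that scrapes the page title and meta description. It relies only on the parseHtml helper and cheerio object shown in the template above; the specific fields scraped are just an example, not a required schema.

```javascript
// Minimal sketch: scrape the page title and meta description (illustrative fields only)
this.processDocument = function(html, url, headers, status, cheerio) {
    var app = this;
    var $html = app.parseHtml(html, cheerio);

    var object = new Object();
    object.url = url;
    object.title = $html.find('title').text().trim();
    object.description = $html.find('meta[name="description"]').attr('content');

    // processDocument returns a string, so serialize the scraped object
    return JSON.stringify(object);
}
```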
parseLinks
```javascript
this.parseLinks = function(html, url, headers, status, cheerio)
```
| Parameter | Description |
|---|---|
| html | A string representation of the content returned by a GET request to url |
| url | The current URL being crawled |
| headers | The headers returned by a GET request to url |
| status | The HTTP status code returned when requesting url |
| cheerio | A library for converting the html string to a DOM object |
You can customize parseLinks to return a list of links to crawl after crawling the current URL.
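As an illustration, here is a minimal sketch of a customized parseLinks that collects anchor hrefs from the page. The relative-link handling below uses simple string logic and is purely illustrative; depending on your crawl, you may want stricter filtering (for example, staying on the same domain).

```javascript
// Minimal sketch: collect absolute and root-relative anchor links (illustrative logic only)
this.parseLinks = function(html, url, headers, status, cheerio) {
    var app = this;
    var $html = app.parseHtml(html, cheerio);
    var links = [];

    $html.find('a').each(function(i, obj) {
        var href = obj.attribs ? obj.attribs.href : null;
        if (!href) {
            return;
        }
        if (href.indexOf('http') === 0) {
            // Already an absolute URL
            links.push(href);
        } else if (href.indexOf('/') === 0) {
            // Root-relative: resolve against the current URL's scheme and host
            var origin = url.split('/').slice(0, 3).join('/');
            links.push(origin + href);
        }
    });

    return links;
}
```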