Introduction to 80apps

80apps are "micro-apps" that run in conjunction with your web crawl. Each 80app takes the following actions when your crawl hits an individual URL:

  1. Determine which links, if any, to crawl from the current URL your crawl is hitting (via the 80app's parseLinks method).
  2. Determine what data, if any, to scrape from the current URL's content (via the 80app's processDocument method).

80apps are implemented in Node.js, and you can code and implement your own 80apps to do whatever you like. In essence, 80apps give you complete control over how your crawl behaves and what data it scrapes.

Structure of an 80app

Here's what a generic 80app looks like:

var EightyApp = function() {
    // processDocument returns scraped data
    this.processDocument = function(html, url, headers, status, cheerio) {
        var app = this;
        var $html = app.parseHtml(html, cheerio);
        var object = new Object();

        // populate the object with data you want to scrape

        return JSON.stringify(object);
    }

    // parseLinks returns the next set of links to crawl
    this.parseLinks = function(html, url, headers, status, cheerio) {
        var app = this;
        var $html = app.parseHtml(html, cheerio);
        var links = [];

        // generate the set of links you want to crawl next

        return links;
    }
}

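// Node.js-style export: the caller passes in EightyAppBase and gets back an app instance.
try {
    module.exports = function(EightyAppBase) {
        EightyApp.prototype = new EightyAppBase();
        return new EightyApp();
    }
} catch(e) {
    // Reached if the assignment above throws (for example, when `module` is not defined);
    // EightyAppBase is then expected to already be in scope.
    EightyApp.prototype = new EightyAppBase();
}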

Every 80app consists of two main functions:

  • processDocument: returns scraped data from the URL being crawled
  • parseLinks: returns the next set of links to crawl
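To make the wiring concrete, here is a minimal sketch of exercising an 80app locally. It assumes a stand-in base class: FakeEightyAppBase, its string-returning parseHtml, and the buildApp helper are hypothetical names invented for this illustration, not part of the real crawler, which supplies its own EightyAppBase.

```javascript
// Hypothetical stand-in for the base class that the crawler normally supplies.
function FakeEightyAppBase() {}

// Stand-in parseHtml: returns the raw string instead of a cheerio DOM object.
FakeEightyAppBase.prototype.parseHtml = function (html, cheerio) {
    return html;
};

var EightyApp = function () {
    // processDocument returns scraped data as a JSON string
    this.processDocument = function (html, url, headers, status, cheerio) {
        var app = this;
        var $html = app.parseHtml(html, cheerio); // inherited from the base class
        var object = {};
        object.url = url;
        object.status = status;
        object.htmlLength = $html.length;
        return JSON.stringify(object);
    };

    // parseLinks returns the next set of links to crawl (none in this sketch)
    this.parseLinks = function (html, url, headers, status, cheerio) {
        return [];
    };
};

// Same wiring as the function assigned to module.exports in the template above.
function buildApp(EightyAppBase) {
    EightyApp.prototype = new EightyAppBase();
    return new EightyApp();
}

var app = buildApp(FakeEightyAppBase);
var scraped = app.processDocument('<html></html>', 'http://example.com/', {}, 200, null);
var links = app.parseLinks('<html></html>', 'http://example.com/', {}, 200, null);

console.log(scraped);
console.log(links);
```

Because the app instance inherits from whatever base class is passed in, swapping the stand-in for the real base class should not require changes to the 80app itself.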

processDocument

this.processDocument = function(html, url, headers, status, cheerio)

Parameter   Description
html        A string representation of the content returned by a GET request to url
url         The current URL being crawled
headers     The headers returned by a GET request to url
status      The HTTP status code returned when requesting url
cheerio     A library for converting the html string to a DOM object

You can customize processDocument to return scraped data from the current URL being crawled.
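As an illustration, a customized processDocument might record the page title alongside the URL and status. The sketch below is written as a standalone function so it runs without the crawler; inside a real 80app the body would go in this.processDocument. To stay dependency-free it extracts the title with a regular expression rather than the app.parseHtml and cheerio route used in the template, and the extractTitle helper and the output field names are assumptions made for this example.

```javascript
// Hypothetical helper: pull the <title> text out of raw HTML with a regular expression.
// A DOM parser is more robust; this is only a dependency-free stand-in.
function extractTitle(html) {
    var match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
    return match ? match[1].trim() : null;
}

// Same first four parameters as this.processDocument in the template.
var processDocument = function (html, url, headers, status) {
    var object = {};

    // populate the object with data you want to scrape
    object.url = url;
    object.status = status;
    object.title = extractTitle(html);

    return JSON.stringify(object);
};

var sample = '<html><head><title> Example Domain </title></head><body></body></html>';
var result = JSON.parse(processDocument(sample, 'http://example.com/', {}, 200));
console.log(result);
```

Returning a JSON string mirrors the template's `return JSON.stringify(object);` line.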

parseLinks

this.parseLinks = function(html, url, headers, status, cheerio)

Parameter   Description
html        A string representation of the content returned by a GET request to url
url         The current URL being crawled
headers     The headers returned by a GET request to url
status      The HTTP status code returned when requesting url
cheerio     A library for converting the html string to a DOM object

You can customize parseLinks to return a list of links to crawl after crawling the current URL.
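For example, a parseLinks that stays on the current site could look roughly like the sketch below. It is a standalone function so it can be run directly; in a real 80app the body would go in this.parseLinks. It assumes the runtime provides the global URL class (as recent Node.js versions do), and it finds href attributes with a regular expression instead of the cheerio DOM used in the template. The same-host rule and the fragment stripping are choices made for this illustration.

```javascript
// Same first four parameters as this.parseLinks in the template.
var parseLinks = function (html, url, headers, status) {
    var links = [];
    var base = new URL(url);
    // Matches href="..." or href='...' inside <a> tags; a DOM parser is more robust.
    var hrefPattern = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
    var match;

    while ((match = hrefPattern.exec(html)) !== null) {
        var resolved;
        try {
            // Resolve relative links against the URL being crawled.
            resolved = new URL(match[1], url);
        } catch (e) {
            continue; // skip hrefs that cannot be parsed as URLs
        }

        // Keep only http(s) links on the same host as the current page.
        if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') {
            continue;
        }
        if (resolved.host !== base.host) {
            continue;
        }

        // Drop the fragment so /about and /about#team count as one link.
        resolved.hash = '';
        if (links.indexOf(resolved.href) === -1) {
            links.push(resolved.href);
        }
    }

    return links;
};

var page = '<a href="/about">About</a> ' +
    '<a href="http://example.com/about#team">Team</a> ' +
    '<a href="http://other.example.org/">Other</a> ' +
    '<a href="mailto:me@example.com">Mail</a>';
var nextLinks = parseLinks(page, 'http://example.com/index.html', {}, 200);
console.log(nextLinks);
```

Returning an empty array means no further links are queued from that page under this scheme.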