Making your own 80app

Creating your own 80app is the best way to take full advantage of 80legs. By making your own 80app, you can fully control what data you scrape and how your crawl finds URLs.

The steps below will guide you through the process of making your own 80app.

📘 These instructions assume you are familiar with Git, GitHub, and JavaScript.

1. Fork the EightyApps repo

Open a web browser and go to https://github.com/datafiniti/EightyApps. This is our public repo for 80apps. You can see a variety of example 80apps in the /apps directory.

From this page, click on the "Fork" button at the top-right of the page. This will create a copy of the repo in your own Github account.

Next, in your local command line, run git clone https://github.com/your_user_name/EightyApps (replace "your_user_name" with your actual GitHub username) to create a local copy of the repo you can edit.

2. Create a blank 80app

In the /apps directory, create a new file called My80app.js and open it in your favorite code editor (we prefer Sublime). Copy the following into the file:

var EightyApp = function() {
    // processDocument returns scraped data
    this.processDocument = function(html, url, headers, status, cheerio) {
        var app = this;
        var $html = app.parseHtml(html, cheerio);
        var object = {};

        // populate the object with data you want to scrape

        return JSON.stringify(object);
    }

    // parseLinks returns the next set of links to crawl
    this.parseLinks = function(html, url, headers, status, cheerio) {
        var app = this;
        var $html = app.parseHtml(html, cheerio);
        var links = [];

        // generate the set of links you want to crawl next

        return links;
    }
}

try {
    module.exports = function(EightyAppBase) {
        EightyApp.prototype = new EightyAppBase();
        return new EightyApp();
    }
} catch(e) {
    EightyApp.prototype = new EightyAppBase();
}

The code above is a "blank" 80app. It doesn't do anything yet, but it serves as a skeleton you can fill in with the actual logic you want in your 80app. The try/catch at the bottom lets the same file work whether it's loaded as a Node module or run in an environment where EightyAppBase is already defined globally.
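
The parseHtml call in this skeleton (and the makeLink call you'll see in the next step) are helpers every 80app inherits from EightyAppBase. The sketch below is a rough, hypothetical approximation of what they do, not the actual source; the real implementations live in the EightyApps repo:

// Hypothetical approximations of two EightyAppBase helpers (not the actual source)
// parseHtml wraps the raw HTML in a cheerio object you can query with .find()
this.parseHtml = function(html, cheerio) {
    return cheerio(html);
}

// makeLink turns an href into a full, crawlable URL, returning a falsy
// value for hrefs that can't be crawled
this.makeLink = function(url, href) {
    if (!href || href.indexOf('mailto:') === 0) return null;
    if (href.indexOf('http') === 0) return href;
    // resolve relative hrefs against the current page URL
    return url.replace(/\/+$/, '') + '/' + href.replace(/^\/+/, '');
}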

3. Implement your parseLinks method

The parseLinks method will determine what links to crawl from each URL your crawl requests.

For the purposes of this guide, we'll assume you're trying to crawl every link on https://www.80legs.com. This means that your URL list will only contain:

[
  "https://www.80legs.com"
]

It also means that we only want to crawl links that belong to the 80legs.com domain. To make sure the parseLinks method only generates such links, we would modify the method like so:

this.parseLinks = function(html, url, headers, status, cheerio) {
    var app = this;
    var $html = app.parseHtml(html, cheerio);
    var links = []; // this is the set of links we'll crawl next

    // extracts the domain (host) from a URL, e.g. "www.80legs.com"
    const r = /:\/\/(.[^/]+)/;
    const urlDomain = url.match(r)[1];
    const normalizedUrlDomain = urlDomain.toLowerCase();

    // gets all links in the html document
    $html.find('a').each(function (i, obj) {
        // generate a URL from the href attribute of the link;
        // we wrap the element with cheerio(obj) since no global $ is defined in this scope
        const link = app.makeLink(url, cheerio(obj).attr('href'));
        if (link) {
            // guard against links without a "://" (match would return null)
            const linkMatch = link.match(r);
            // only add to our link set if the link's domain matches url's domain
            if (linkMatch && linkMatch[1].toLowerCase() === normalizedUrlDomain) {
                links.push(link);
            }
        }
    });

    return links;
}
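
As a concrete, hypothetical example: if https://www.80legs.com contained anchors with the hrefs /pricing, https://www.80legs.com/faq, and https://twitter.com/80legs, this parseLinks would return only the two 80legs links, because the Twitter link fails the domain check:

[
  "https://www.80legs.com/pricing",
  "https://www.80legs.com/faq"
]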

4. Implement your processDocument method

The processDocument method will determine what data you scrape from each URL you crawl.

For the purposes of this guide, we'll assume you want to scrape the <title> tag of each URL you crawl. To do this, we would modify processDocument like so:

this.processDocument = function(html, url, headers, status, cheerio) {
    var app = this;
    var $html = app.parseHtml(html, cheerio);
    var object = {};

    var title = $html.find('title').text();
    object.title = title;

    return JSON.stringify(object);
}
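
You can extend the same pattern to scrape any other element on the page. For example, here's a hypothetical addition that also captures the page's meta description, if one exists:

// hypothetical addition: also scrape the meta description, if present
var description = $html.find('meta[name="description"]').attr('content');
if (description) {
    object.description = description;
}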

When this processDocument runs on https://www.80legs.com, it will generate this entry in the result file for the crawl:

{
  "url": "https://www.80legs.com",
  "result": "{\"title\":\"80legs - Customizable Web Scraping\"}"
}
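
Note that the result field is itself a JSON-encoded string, because processDocument returns JSON.stringify(object). To use the scraped data downstream, you'll need to parse it, along these lines (a sketch; entry is a stand-in for one object from your result file, shaped like the example above):

// 'entry' is assumed to be one parsed object from the result file
var data = JSON.parse(entry.result); // result is a JSON-encoded string
console.log(data.title); // "80legs - Customizable Web Scraping"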

5. Test your 80app

Once you've finished coding your 80app, you can test it at http://80apptester.80legs.com. This page is a utility we provide to help you test and debug your 80apps.

Copy and paste your 80app into the main text area, enter a test URL underneath, and then click the "Crawl" button.

The 80app Tester will show you the results of your 80app in the "Result" section, with separate output for your processDocument and parseLinks methods.

You can also use console.log statements in your 80app code. Any output from these will show up in the "Output" section.
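
For example, a hypothetical debug statement like this inside parseLinks will report how many links were found:

// this output will appear in the "Output" section of the 80app Tester
console.log('found ' + links.length + ' links on ' + url);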

6. Upload your 80app

Once you're satisfied with your testing, upload your 80app to your account using the web portal or API. Give it a unique name when uploading so you can identify it later.

7. Use your 80app in a crawl

Now you're ready to use your 80app in a web crawl.

If you're using the web portal, just select your 80app by name in the Create a Crawl form.

If you're using the API, specify your 80app through the app field when submitting a crawl creation request.