Making your own 80app
Creating your own 80app is the best way to take full advantage of 80legs. By making your own 80app, you can fully control what data you scrape and how your crawl finds URLs.
The steps below will guide you through the process of making your own 80app.
These instructions assume you are familiar with Git, GitHub, and JavaScript.
1. Fork the EightyApps repo
Open a web browser and go to https://github.com/datafiniti/EightyApps. This is our public repo for 80apps. You can see a variety of example 80apps in the /apps directory.
From this page, click the "Fork" button at the top-right of the page. This will create a copy of the repo in your own GitHub account.
Next, in your local command line, run git clone https://github.com/your_user_name/EightyApps (replace "your_user_name" with your actual GitHub username) to create a local copy of the repo from which you can edit code.
2. Create a blank 80app
From the /apps directory, create a new file called My80app.js and open it in your favorite code editor (we prefer Sublime). Copy the following into the file:
var EightyApp = function() {
  // processDocument returns scraped data
  this.processDocument = function(html, url, headers, status, cheerio) {
    var app = this;
    var $html = app.parseHtml(html, cheerio);
    var object = new Object();
    // populate the object with data you want to scrape
    return JSON.stringify(object);
  }

  // parseLinks returns the next set of links to crawl
  this.parseLinks = function(html, url, headers, status, cheerio) {
    var app = this;
    var $html = app.parseHtml(html, cheerio);
    var links = [];
    // generate the set of links you want to crawl next
    return links;
  }
}

try {
  module.exports = function(EightyAppBase) {
    EightyApp.prototype = new EightyAppBase();
    return new EightyApp();
  }
} catch(e) {
  EightyApp.prototype = new EightyAppBase();
}
The code above is like a "blank" 80app. It doesn't really do anything, but it will serve as a skeleton that you can fill in with the actual logic you want in your 80app.
3. Implement your parseLinks method
The parseLinks method will determine what links to crawl from each URL your crawl requests.
For the purposes of this guide, we'll assume you're trying to crawl every link on https://www.80legs.com. This means that your URL list will only contain:
[
  "https://www.80legs.com"
]
It also means that we only want to crawl links that belong to the 80legs.com domain. To make sure the parseLinks method only generates such links, we would modify the method like so:
this.parseLinks = function(html, url, headers, status, cheerio) {
  var app = this;
  var $html = app.parseHtml(html, cheerio);
  var links = []; // this is the set of links we'll crawl next

  const r = /:\/\/(.[^/]+)/;
  const urlDomain = url.match(r)[1];
  const normalizedUrlDomain = urlDomain.toLowerCase();

  // gets all links in the html document
  $html.find('a').each(function (i, obj) {
    // generate a URL from the href attribute of the link
    const link = app.makeLink(url, obj.attribs.href);
    if (link) {
      const linkDomain = link.match(r)[1];
      // only add to our link set if the link's domain matches url's domain
      if (linkDomain.toLowerCase() === normalizedUrlDomain) {
        links.push(link);
      }
    }
  });

  return links;
}
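To see what the domain check keeps and drops, here is a small standalone sketch (not part of the 80app itself) that runs the same regex and comparison against a few illustrative URLs. You can paste it into Node to experiment:

// Standalone check of the domain-matching logic above (not part of the 80app).
const r = /:\/\/(.[^/]+)/;
const pageDomain = 'https://www.80legs.com'.match(r)[1].toLowerCase();

// Sample links for illustration only
const candidates = [
  'https://www.80legs.com/pricing',   // same domain: kept
  'https://WWW.80LEGS.COM/faq',       // same domain, different case: kept
  'https://twitter.com/80legs'        // different domain: dropped
];

candidates.forEach(function(link) {
  const keep = link.match(r)[1].toLowerCase() === pageDomain;
  console.log(link, keep ? 'kept' : 'dropped');
});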
4. Implement your processDocument method
The processDocument method will determine what data you scrape from each URL you crawl.
For the purposes of this guide, we'll assume you want to scrape the <title> tag of each URL you crawl. To do this, we would modify processDocument like so:
this.processDocument = function(html, url, headers, status, cheerio) {
  var app = this;
  var $html = app.parseHtml(html, cheerio);
  var object = new Object();

  var title = $html.find('title').text();
  object.title = title;

  return JSON.stringify(object);
}
When this processDocument processes https://www.80legs.com, it will generate this entry in the result file for the crawl:
{
  "url": "https://www.80legs.com",
  "result": "{\"title\":\"80legs - Customizable Web Scraping\"}"
}
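Scraping additional fields just means adding more properties to the object before returning it. As an illustrative sketch (the extra selectors below assume the page actually has a meta description and an h1 tag, which may not hold for every site), processDocument could be extended like this:

this.processDocument = function(html, url, headers, status, cheerio) {
  var app = this;
  var $html = app.parseHtml(html, cheerio);
  var object = new Object();

  // title of the page, as before
  object.title = $html.find('title').text();

  // illustrative extra fields -- these selectors assume the page has them
  object.metaDescription = $html.find('meta[name="description"]').attr('content');
  object.h1 = $html.find('h1').first().text();

  return JSON.stringify(object);
}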
5. Test your 80app
Once you've finished coding your 80app, you can test it at http://80apptester.80legs.com. This page is a utility we provide to help you test and debug your 80apps.
Copy and paste your 80app into the main text area, enter a test URL underneath, and then click the "Crawl" button.
The 80app Tester will show you the results of your 80app in the "Result" section. There are separate sections for the output of your processDocument and parseLinks methods.
You can also use console.log statements in your 80app code. Any output from these will show up in the "Output" section.
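For example, a quick temporary check is to log intermediate values from inside the processDocument you wrote earlier; in the 80app Tester these lines will appear in the "Output" section (remove them before running a full crawl):

this.processDocument = function(html, url, headers, status, cheerio) {
  var app = this;
  var $html = app.parseHtml(html, cheerio);
  var object = new Object();

  var title = $html.find('title').text();
  console.log('status: ' + status);  // HTTP status of the fetched page
  console.log('title: ' + title);    // check what the selector matched

  object.title = title;
  return JSON.stringify(object);
}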
6. Upload your 80app
Once you're satisfied with your testing, upload your 80app to your account using the web portal or API. Give it a unique name when uploading so you can identify it later.
7. Use your 80app in a crawl
Now you're ready to use your 80app in a web crawl.
If you're using the web portal, just select your 80app by name in the Create a Crawl form.
If you're using the API, specify your 80app through the app field when submitting a crawl creation request.
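As a rough sketch, the body of a crawl creation request might look something like the object below. Only the app field is covered in this guide; the other field names are illustrative assumptions, so check the API reference for the exact schema.

// Hypothetical crawl creation request body. Field names other than "app"
// are illustrative assumptions -- verify them against the 80legs API reference.
var crawlRequest = {
  app: 'My80app.js',      // the name you gave your 80app when uploading it
  urllist: 'my_url_list', // illustrative: the URL list that seeds the crawl
  max_depth: 10,          // illustrative
  max_urls: 1000          // illustrative
};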