Basic web scraping with Node.js and Cheerio.js

It makes perfect sense that there are security rules that limit the reach of client-side JavaScript. If you relax any of these rules, the user is open to malicious activity. On the server side, JavaScript is not subject to these kinds of limitations. With this freedom comes a great deal of power. Web scraping is one of the cool upsides to this freedom.

To get started, clone the following GitHub repository:

https://github.com/kevinchisholm/basic-web-scraping-with-node-and-cheeriojs

(Instructions on how to run the code are available on the GitHub page.)

The page we will scrape

Let's take a moment to look at the example web page that we will scrape: http://output.jsbin.com/xavuga. If you use your web developer tools to inspect the DOM, you'll see that there are three main sections to the page. There is a HEADER element, a SECTION element, and a FOOTER element. We will target those three sections of the page later in some of the code examples.

The request NPM module

A key tool we'll need is the request NPM module. This module allows you to make an HTTP request and use the return value as you wish.
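
To make the shape of such a call concrete without installing anything, here is a minimal sketch that uses Node's built-in http module in place of request. It is an illustration, not the code from the repository: fetchPage is a hypothetical helper that imitates request's (error, response, body) callback, and a throwaway local server stands in for the real page so the sketch runs offline.

```javascript
var http = require('http');

// Hypothetical helper mirroring request's callback shape: (error, response, body)
function fetchPage(url, callback) {
    http.get(url, function (response) {
        var chunks = [];
        // the body arrives in pieces; collect them until the response ends
        response.on('data', function (chunk) { chunks.push(chunk); });
        response.on('end', function () {
            callback(null, response, Buffer.concat(chunks).toString('utf8'));
        });
    }).on('error', function (error) {
        // connection-level failures are reported error-first
        callback(error);
    });
}

// assumption: a local demo server plays the role of the page to be scraped
var server = http.createServer(function (req, res) {
    res.writeHead(200, { 'Content-Type': 'text/html' });
    res.end('<header><h1>Demo</h1></header>');
});

// port 0 asks the operating system for any free port
server.listen(0, function () {
    var url = 'http://127.0.0.1:' + server.address().port + '/';
    fetchPage(url, function (error, response, body) {
        server.close();
        if (error) { return console.error(error.message); }
        console.log(response.statusCode, body);
    });
});
```

The request module wraps this kind of plumbing, which is why the examples in this article can get the page body with a single function call.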

The cheerio NPM module

The cheerio NPM module provides a server-side implementation of core jQuery. There is not a 1:1 method replication; that was not the project's goal. Instead, cheerio's functionality mirrors the most common tasks associated with jQuery. The key point is: you can parse HTML with JavaScript on the server side.

Caching an entire web page

Example # 1

var fs = require('fs'),
    request = require('request'),
    cheerio = require('cheerio'),
    pageURL = 'http://output.jsbin.com/xavuga';

function scrapePage () {
    //make an HTTP request for the page to be scraped
    request(pageURL, function(error, response, responseHtml){
        //stop here if the request itself failed
        if (error) { return console.error('request failed: ' + error.message); }

        //write the entire scraped page to the local file system
        fs.writeFile(__dirname + '/HTML/entire-page.html', responseHtml, function(err){
            //report a failed write instead of logging success
            if (err) { return console.error('could not write entire-page.html: ' + err.message); }
            console.log('entire-page.html successfully written to HTML folder');
        });
    });
}

//scrape the page
scrapePage();

In example # 1, we set some variables. The fs variable references the file system node module. This module provides access to the local file system. We'll need this to write files to disk. The request variable refers to the request node module, which we discussed earlier. The cheerio variable refers to the cheerio node module, which we also discussed earlier. The pageURL variable is the URL of the web page that we will scrape.

At the highest level, there are two things that happen in this code. We define a function named scrapePage, and then we execute that function. Let's look at what happens inside of this function.

First, we call the request function, passing it two arguments. The first argument is the URL of the request. The second argument is a callback function. This callback function takes three arguments. The first argument is an error object, which is null when the request succeeds. This "error first" pattern is common in Node.js. The second argument is the response object. The third argument is the body of the response. In this case, the body is HTML.
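
The error-first convention can be tried out in isolation. The sketch below is an illustration rather than part of the repository: checkResponse is a hypothetical helper, and the callback arguments are simulated so that it runs without a network connection. It also looks at response.statusCode, because a request can complete without an error and still return something other than the page you wanted.

```javascript
// Hypothetical helper: decides whether a request callback received usable HTML.
function checkResponse(error, response, responseHtml) {
    // error-first: a non-null first argument means the request itself failed
    if (error) {
        return { ok: false, reason: 'request failed: ' + error.message };
    }
    // the request completed, but the server may still have answered with an error status
    if (!response || response.statusCode !== 200) {
        return { ok: false, reason: 'unexpected status: ' + (response && response.statusCode) };
    }
    return { ok: true, html: responseHtml };
}

// simulated callback arguments (assumptions, not real responses)
console.log(checkResponse(new Error('connection refused'), null, null));
console.log(checkResponse(null, { statusCode: 404 }, 'Not found'));
console.log(checkResponse(null, { statusCode: 200 }, '<html></html>'));
```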

Inside of the request callback, we leverage the file-system module's writeFile method. The first argument we pass is the full path of the file. This tells the fs module which file to write. The second argument is the content that we want to write to the file. Notice that we pass the responseHtml variable. Again, this is the content returned by the request function. The third argument is a callback function, which runs once the write attempt has finished. In this case, we use it to log a message indicating that the file was written to disk.
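
One practical detail: writeFile does not create missing folders, so the HTML folder has to exist before the code runs. A minimal sketch of one way to handle that, assuming a hypothetical saveHtml helper and a temp folder in place of the article's HTML folder:

```javascript
var fs = require('fs'),
    os = require('os'),
    path = require('path');

// Hypothetical helper: creates the folder if needed, then writes the file.
function saveHtml(folder, fileName, html, done) {
    // recursive: true also means "no error if the folder already exists"
    fs.mkdir(folder, { recursive: true }, function (mkdirErr) {
        if (mkdirErr) { return done(mkdirErr); }

        var fullPath = path.join(folder, fileName);
        fs.writeFile(fullPath, html, function (writeErr) {
            // pass the error (or null) along, error-first style
            done(writeErr, fullPath);
        });
    });
}

// assumption: a temp folder stands in for the article's HTML folder
var demoFolder = path.join(os.tmpdir(), 'scrape-demo', 'HTML');
saveHtml(demoFolder, 'entire-page.html', '<html><body>demo</body></html>', function (err, fullPath) {
    if (err) { return console.error('write failed: ' + err.message); }
    console.log('written to ' + fullPath);
});
```

Using path.join instead of string concatenation keeps the path separators correct across operating systems.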

When you run example # 1, you should see a new file in the HTML folder: entire-page.html. This file contains the entire contents of the web page that we make a request to.

Caching only a part of a web page

Example # 2

function scrapePage () {
    //make an HTTP request for the page to be scraped
    request(pageURL, function(error, response, responseHtml){
        //stop here if the request itself failed
        if (error) { return console.error('request failed: ' + error.message); }

        //write the entire scraped page to the local file system
        fs.writeFile(__dirname + '/HTML/entire-page.html', responseHtml, function(err){
            if (err) { return console.error('could not write entire-page.html: ' + err.message); }
            console.log('entire-page.html successfully written to HTML folder');
        });

        //write isolated sections of the entire scraped page to the local file system

        //create the cheerio object
        var $ = cheerio.load(responseHtml),
            //get the inner HTML of the header element
            $header = $('header').html();

        //write the header to the local file system
        fs.writeFile(__dirname + '/HTML/header.html', $header, function(err){
            if (err) { return console.error('could not write header.html: ' + err.message); }
            console.log('header.html successfully written to HTML folder');
        });
    });
}

In example # 2, we have an updated version of the scrapePage function. For brevity's sake, I have omitted the parts of the code that have not changed (the variable declarations and the call to scrapePage).

The first change to the scrapePage function is the call to the cheerio.load method, whose return value we assign to the $ variable. Now we can use the $ variable much the same way we would jQuery. We create the $header variable, which contains the inner HTML of the page's header element. We then use the file-system module's writeFile method to write that HTML to the file: header.html.

When you run example # 2, you should see another new file in the HTML folder: header.html. This file contains only the HTML found inside the header element of the web page that we make a request to.

Example # 3

function scrapePage () {
    //make an HTTP request for the page to be scraped
    request(pageURL, function(error, response, responseHtml){
        //stop here if the request itself failed
        if (error) { return console.error('request failed: ' + error.message); }

        //write the entire scraped page to the local file system
        fs.writeFile(__dirname + '/HTML/entire-page.html', responseHtml, function(err){
            if (err) { return console.error('could not write entire-page.html: ' + err.message); }
            console.log('entire-page.html successfully written to HTML folder');
        });

        //write isolated sections of the entire scraped page to the local file system

        //create the cheerio object
        var $ = cheerio.load(responseHtml),
            //get the inner HTML of the header, content and footer sections
            $header = $('header').html(),
            $content = $('#mainContent').html(),
            $footer = $('footer').html();

        //write the header to the local file system
        fs.writeFile(__dirname + '/HTML/header.html', $header, function(err){
            if (err) { return console.error('could not write header.html: ' + err.message); }
            console.log('header.html successfully written to HTML folder');
        });

        //write the content to the local file system
        fs.writeFile(__dirname + '/HTML/content.html', $content, function(err){
            if (err) { return console.error('could not write content.html: ' + err.message); }
            console.log('content.html successfully written to HTML folder');
        });

        //write the footer to the local file system
        fs.writeFile(__dirname + '/HTML/footer.html', $footer, function(err){
            if (err) { return console.error('could not write footer.html: ' + err.message); }
            console.log('footer.html successfully written to HTML folder');
        });
    });
}

In example # 3, we have updated the scrapePage function again. The new code follows the same pattern as example # 2. The difference is that we have also scraped the content and footer sections. In both cases, we write the associated HTML file to disk.

When you run example # 3, you should see four files in the HTML folder. They are entire-page.html, header.html, content.html and footer.html.
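
The three section writes follow one pattern, so they could also be driven by a loop. The sketch below is one possible way to do that, not the approach used in the repository: writeSections is a hypothetical helper, the section strings are stand-ins for the values returned by cheerio, and a temp folder replaces the HTML folder.

```javascript
var fs = require('fs'),
    os = require('os'),
    path = require('path');

// Hypothetical helper: writes every entry of `sections` to <folder>/<name>.html
// and calls done(err, writtenNames) once, after the last write has finished.
function writeSections(folder, sections, done) {
    var names = Object.keys(sections),
        pending = names.length,
        failed = false;

    if (pending === 0) { return done(null, []); }

    names.forEach(function (name) {
        fs.writeFile(path.join(folder, name + '.html'), sections[name], function (err) {
            if (failed) { return; }                        // an earlier write already reported an error
            if (err) { failed = true; return done(err); }
            console.log(name + '.html successfully written');
            pending -= 1;
            if (pending === 0) { done(null, names); }      // every write has completed
        });
    });
}

// stand-in values; in the article these come from $('header').html() and so on
var demoFolder = fs.mkdtempSync(path.join(os.tmpdir(), 'sections-'));
writeSections(demoFolder, {
    header: '<h1>Header</h1>',
    content: '<p>Content</p>',
    footer: '<p>Footer</p>'
}, function (err, written) {
    if (err) { return console.error('write failed: ' + err.message); }
    console.log('all done: ' + written.join(', '));
});
```

The pending counter is needed because the writes run asynchronously and may finish in any order.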

Summary

In this article, we scratched the surface of what is possible when scraping web pages. The high-level areas we focused on are: making a request and then parsing the HTML of the response. We used the request module to make the HTTP request, and the cheerio module to parse the returned HTML. We also used the fs (file-system) module to write our scraped HTML to disk. Hopefully this article has pointed you in the right direction. Happy web page scraping!

Author:

Kevin Chisholm blog.kevinchisholm.com
