The idea for this article came from the Traversy Media channel on YouTube. Its web scraping tutorial is a really good introduction. I tested the approach on some popular websites and ran into a few challenges, which I describe in this article.

Many years ago I wrote a web scraper that crawled some news agencies and grabbed specific data; if I remember correctly, I coded it in PHP using the CodeIgniter framework. Recently I was thinking about a project that is related to web scraping, and then I saw this video, so I am going to build a web scraper with Node.js and the Cheerio library.

Definition of the project: scrape HuffingtonPost articles related to Italy and save them to a .csv file that Excel can open. To avoid repetitive code, I will grab only the title and thumbnail of each news item.

Let’s start!

I assume you already know what Node.js is and have installed it on your computer.

In an empty directory, run:

npm init -y

This makes npm create a package.json file; the “-y” flag accepts the default answer to every question, so we can skip them.

Then we should add two libraries. The first is Cheerio, which is:

Fast, flexible & lean implementation of core jQuery designed specifically for the server.

The other one is Request, which is:

Request is designed to be the simplest way possible to make http calls. It supports HTTPS and follows redirects by default.

Implementation

Type npm i cheerio request

This installs the cheerio and request libraries at the same time.
npm i
is short for
npm install

And yes, you can name two libraries on the same line to install them together.

 

The result should look like this:

[Screenshot: Nodejs and Cheerio]

In the directory, create a file named scrapper.js, or whatever you like; it is a web scraping project, right?

First I will show all the code, and then I will explain more:

 

// Including necessary libraries
const request = require('request');
const cheerio = require('cheerio');

// File system is inside the NodeJS
const fs = require('fs');
// Write our data in CSV file
const writeStream = fs.createWriteStream('post.csv');


// Write header for excel file
writeStream.write('Title,Description \n');

// Requesting the URL; the callback receives error, response and html
request('https://www.huffingtonpost.com/topic/italy-travel', (error, response, html) => {
    if (!error && response.statusCode === 200) {
        // Use cheerio to parse and create the jQuery-like DOM
        const $ = cheerio.load(html);

        // Finding targeted elements
        $('.card__content').each((i, el) => {
            const title = $(el)
                .find('.card__headline__text')
                .text()
                .replace(/["',]/g, "")
                .trim();

            const desc = $(el).find('.card__image__src').attr('src').split('?')[0];


            // Write Row to CSV
            writeStream.write(`${title}, ${desc} \n`);

        });
        // Making sure everything is alright
        console.log('its done');
    }
});

 

 

All the descriptions are placed in comments within the code, but I describe the tricky parts in a little more detail below:

If you are unsure why using $ with Cheerio is considered best practice, read this discussion on Stack Overflow.
const $ = cheerio.load(html);

To remove all junk characters from the title I used a regular expression (RegExp object), which lets me describe a set of characters inside brackets. Be careful: not every character can be placed in the brackets as-is. A dash “-”, for example, has a special meaning there; for more detail check this link.

I put “g” after the closing slash, which asks for a global replacement rather than stopping after the first match.
I use the trim() method to remove whitespace from both ends of the string.

.replace(/["',]/g, "")
.trim();
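
A small sketch of the same pattern in isolation may help. The cleanTitle helper and the sample strings are my own illustrations, not part of the scraper.

```javascript
// Hypothetical helper: removes double quotes, single quotes and commas, then trims
function cleanTitle(raw) {
    return raw.replace(/["',]/g, '').trim();
}

// Without the "g" flag only the first match is removed
const firstOnly = 'a,b,c'.replace(/["',]/, '');

// A dash is treated literally when it is the last character inside the brackets
const noDashes = 'well-known, "quoted"'.replace(/["',-]/g, '');

console.log(cleanTitle('  Rome\'s "best", pizza  ')); // Romes best pizza
console.log(firstOnly);                               // ab,c
console.log(noDashes);                                // wellknown quoted
```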

When I fetched the image URL, I got back a long URL in which everything after the question mark is junk related to caching and so on, so I remove it with the split method.
const desc = $(el).find('.card__image__src').attr('src').split('?')[0];
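
One thing to watch: attr('src') returns undefined when no matching image is found, and calling split on undefined throws an error. Here is a minimal sketch of a guarded version; the stripQuery helper and the example URL are assumptions for illustration.

```javascript
// Hypothetical helper: returns the part of a URL before the first "?".
// Falls back to an empty string when the attribute is missing (undefined).
function stripQuery(src) {
    return (src || '').split('?')[0];
}

console.log(stripQuery('https://example.com/img/rome.jpg?cache=123&ops=scalefit'));
// https://example.com/img/rome.jpg

console.log(stripQuery(undefined) === ''); // true
```

In the scraper this could be used as stripQuery($(el).find('.card__image__src').attr('src')).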

I used backticks (`) because a template literal can insert the content of variables. Feel free to test with plain single quotes (') and see the result.
writeStream.write(`${title}, ${desc} \n`);
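
If you want to compare the two quote styles without running the whole scraper, this short sketch shows the difference. The title and desc values are placeholders I made up.

```javascript
// Placeholder values (assumptions), standing in for scraped data
const title = 'Rome in spring';
const desc = 'https://example.com/img/rome.jpg';

// Backticks: ${...} is replaced with the value of the variable
const withBackticks = `${title}, ${desc} \n`;

// Single quotes: ${...} stays in the string as plain text
const withSingleQuotes = '${title}, ${desc} \n';

console.log(withBackticks);    // Rome in spring, https://example.com/img/rome.jpg
console.log(withSingleQuotes); // ${title}, ${desc}
```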

So that’s it! If you have any questions, feel free to ask.

I put the files and code here on my GitHub.

I also walk through it on YouTube, so you can check it there too. I recommend watching the video because I test the code line by line and you can see the live result at each step.

 
