The whole idea of writing this article came from the Traversy Media channel on YouTube. Its web scraping tutorial is a really good introduction; I tested the approach on some popular websites, ran into a few challenges, and I will describe them in this article.
Many years ago I wrote a web scraper application that crawled some news agency sites and grabbed specific data; if I remember correctly, I coded it in PHP using the CodeIgniter framework. Recently I was thinking about a project that is somehow related to web scraping, and then I saw this video, so I am going to build a web scraper application with Node.js and the Cheerio library.
Definition of the project: scraping HuffingtonPost articles related to Italy and saving them to an Excel-readable .csv file. To keep the example simple and avoid duplication, I will just grab the title and thumbnail of each article.
Let’s start!
I assume you already know what Node.js is and that you have it installed on your computer.
In an empty directory, run
npm init -y
Because we want npm to create a package.json file, and we pass “-y” so we can skip all the questions.
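For reference, the generated package.json looks roughly like this (the name is taken from your directory, so your fields may differ):
{
  "name": "scraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  }
}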
Then we need to add two libraries. The first is Cheerio, which is:
Fast, flexible & lean implementation of core jQuery designed specifically for the server.
And the other one is Request, which is:
Request is designed to be the simplest way possible to make http calls. It supports HTTPS and follows redirects by default.
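As a quick sketch of the API we will rely on, request takes a URL and a callback that receives three parameters: an error (if any), the response object, and the body of the page. The URL here is just a placeholder:
const request = require('request');

// Minimal example: fetch a page and log its raw HTML body
request('https://example.com', (error, response, body) => {
  if (!error && response.statusCode === 200) {
    console.log(body); // the HTML of the page as a string
  }
});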
Implementation
Type npm i cheerio request
This installs the cheerio and request libraries at the same time.
npm i
is equal to
npm install
and yes, you can list two libraries on the same line when installing.
In the directory create a file like scraper.js, or whatever you want; it is a web scraping project, right?
First I will put all the code here, and then I will explain the tricky parts in more detail:
// Including necessary libraries
const request = require('request');
const cheerio = require('cheerio');
// The file system module is built into Node.js
const fs = require('fs');

// Write our data into a CSV file
const writeStream = fs.createWriteStream('post.csv');

// Write the header row for the CSV file
writeStream.write('Title,Description \n');

// Requesting the URL; the callback receives 3 default parameters
request('https://www.huffingtonpost.com/topic/italy-travel', (error, response, html) => {
  if (!error && response.statusCode === 200) {
    // Use cheerio to parse the HTML and create the jQuery-like DOM
    const $ = cheerio.load(html);

    // Finding the targeted elements
    $('.card__content').each((i, el) => {
      const title = $(el)
        .find('.card__headline__text')
        .text()
        .replace(/["',]/g, '')
        .trim();

      const desc = $(el).find('.card__image__src').attr('src').split('?')[0];

      // Write a row to the CSV
      writeStream.write(`${title}, ${desc} \n`);
    });

    // Making sure everything is alright
    console.log('its done');
  }
});
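To run the scraper (assuming you named the file scraper.js), call it with Node from the project directory:
node scraper.js
After it finishes, you should see “its done” in the terminal and a post.csv file next to the script.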
All the descriptions are placed as comments between the lines of code, but I have described the tricky parts in a little more detail below:
If you are confused about why using $ with Cheerio is considered best practice, read this discussion on Stack Overflow.
const $ = cheerio.load(html);
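Once the HTML is loaded, $ works much like jQuery in the browser. A minimal sketch, with made-up markup for illustration:
const cheerio = require('cheerio');

// Load a tiny HTML snippet instead of a real page
const $ = cheerio.load('<h2 class="title">Hello world</h2>');

console.log($('.title').text()); // "Hello world"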
To replace all the junk characters in the title I have used a regular expression (a RegExp object), so I can describe the set of characters to remove inside the brackets. Be careful: not every character can be used as-is inside the brackets, for example the dash “-”; for more detail check this link.
I put “g” after the closing slash of the expression, which asks for global replacement rather than stopping after the first match.
I use the trim() method to remove whitespace from both ends of the string.
.replace(/["',]/g, "")
.trim();
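A quick demonstration of what this cleanup does to a messy title (the input string is invented for illustration):
const raw = '  "Rome, Florence and Venice" on a budget ';

// Remove double quotes, single quotes and commas, then trim the whitespace
const clean = raw.replace(/["',]/g, '').trim();

console.log(clean); // Rome Florence and Venice on a budget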
When I grab the URL of an image, it comes back as a long URL where everything after the question mark is junk related to caching and so on, so I remove those junk characters using the split method.
const desc = $(el).find('.card__image__src').attr('src').split('?')[0];
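For example, splitting on the question mark and keeping only the first part strips the cache parameters (the URL below is invented for illustration):
const src = 'https://img.example.com/photo.jpg?cache=true&ops=scalefit_720';

// Everything after the "?" is query-string noise we don't need
const clean = src.split('?')[0];

console.log(clean); // https://img.example.com/photo.jpg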
I have used backticks “ ` ” (template literals) because they can interpolate the contents of variables. Feel free to test with the single quote symbol ‘ and see the result.
writeStream.write(`${title}, ${desc} \n`);
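Comparing the two quoting styles side by side shows the difference; only the backtick version interpolates the variables:
const title = 'Venice in winter';
const desc = 'https://img.example.com/venice.jpg';

console.log(`${title}, ${desc}`); // Venice in winter, https://img.example.com/venice.jpg
console.log('${title}, ${desc}'); // ${title}, ${desc}  (printed literally)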
So that’s it! If you have any questions, feel free to ask.
I put the files and code here on my GitHub.
I also describe it on YouTube, so you can check it out over there too. I recommend watching the video, because I test the code line by line and you can see the live result at each step.