Web Scraping Using Node Js



Hi Guys,

Learn how to do basic web scraping using Node.js in this tutorial. The request-promise and cheerio libraries are used. GitHub: https://github.com/beaucarne

Getting started with web scraping is easy, and the process can be broken down into two main parts: acquiring the data using an HTTP request library or a headless browser, and parsing the data to get the exact information you want. This guide walks you through the process with tools including the popular Node.js request-promise module and CheerioJS. Thanks to Node.js, JavaScript is a great language to use for a web scraper: not only is Node fast, but you'll likely end up using a lot of the same methods you're used to from querying the…

Today, I will show you how to retrieve data from any website or web page using Node.js and cheerio. We will walk through an example of web scraping in Node.js.

What is web scraping?

Web scraping is a technique used to retrieve data from websites using a script. Web scraping is the way to automate the laborious work of copying data from various websites.

Web scraping is generally performed when the target websites don't expose an external API for public consumption. Some common web scraping scenarios are:

-Fetch trending posts on social media sites.

-Fetch email addresses from various websites for sales leads.

-Fetch news headlines from news websites.

For example, you may want to scrape Medium blog posts using the following URL: https://medium.com/search?q=node.js

After that, open the Inspector in Chrome DevTools and look at the page's DOM elements.

If you look carefully, the markup follows a pattern, so we can scrape it using the elements' class names.

Web Scraping with Node js and Cheerio


Follow the steps below to scrape blog post data from medium.com using Node.js and cheerio:

Step 1: Setup Node js Project

In this step, set up the project for scraping Medium blog posts. Create a project directory.

Then install the dependencies the project needs, such as the request-promise and cheerio libraries mentioned above.

Step 2: Making Http Request

In this step, make an HTTP request to fetch the HTML of the web page.
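A minimal sketch of this step, assuming Node.js 18 or later, where fetch is built in. It swaps the request-promise library named in the introduction for the built-in fetch; the fetchHtml name and the fetchImpl parameter are illustrative and not part of the original tutorial.

```javascript
// Download the raw HTML of a page.
// fetchImpl defaults to the global fetch (Node.js 18+); passing a stub makes the
// function easy to exercise without network access.
async function fetchHtml(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Request to ${url} failed with status ${response.status}`);
  }
  return response.text();
}

// Example call (performs a real network request, so it is left commented out):
// fetchHtml('https://medium.com/search?q=node.js').then((html) => console.log(html.length));
```

Rejecting on a non-2xx status keeps an error page from being parsed as if it were the search results.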

Step 3: Extract Data From Blog Posts

In this step, once you have retrieved the blog post page from medium.com, you can use cheerio to scrape the data that you need.

Calling cheerio.load on the downloaded HTML loads the data into the dollar ($) variable. If you have used jQuery before, you will recognize why we use $ here: it simply follows jQuery's long-standing naming convention.

Now you can traverse the DOM tree.

You need only the title and link of each scraped blog post, so select the elements in the HTML using either their own class name or the class name of their parent element.

First, select all the blog post elements that have js-block as a class name.

Most importantly, the each method loops through all the elements that have the class name js-block.

Inside the loop, you scrape the title and link of each blog post from medium.com.

This will scrape the blog posts for a given tag.

The full source code of the Node.js web scraper:

app.js

Step 4: Create Views

In this step, create a folder named layouts: go to your nodewebscrap app, find the views folder, and create a new folder named layouts inside it.

Inside the layouts folder, create a view file named main.handlebars and add the following code to your views/layouts/main.handlebars file:

After that, create a new view file named index.handlebars outside the layouts folder.

nodewebscraper/views/index.handlebars

Add the following code to your index.handlebars:

After that, create a new view file named list.handlebars outside the layouts folder.

nodewebscraper/views/list.handlebars

Add the following code to your list.handlebars:

Step 5: Run development server

I hope it helps you.

Posted by Soham Kamani on November 24, 2018


Almost all the information on the web exists in the form of HTML pages. The information in these pages is structured as paragraphs, headings, lists, or one of the many other HTML elements. These elements are organized in the browser as a hierarchical tree structure called the DOM (short for Document Object Model). Each element can have multiple child elements, which can also have their own children. This structure makes it convenient to extract specific information from the page.

The process of extracting this information is called 'scraping' the web, and it’s useful for a variety of applications. All search engines, for example, use web scraping to index web pages for their search results. We can also use web scraping in our own applications when we want to automate repetitive information gathering tasks.


Cheerio is a Node.js library that helps developers interpret and analyze web pages using a jQuery-like syntax. In this post, I will explain how to use Cheerio in your tech stack to scrape the web. We will use the headless CMS API documentation for ButterCMS as an example and use Cheerio to extract all the API endpoint URLs from the web page.

Why Cheerio

There are many other web scraping libraries, and they run on most popular programming languages and platforms. What makes Cheerio unique, however, is its jQuery-based API.

jQuery is one of the most widely used JavaScript libraries. It's used in browser-based JavaScript applications to traverse and manipulate the DOM. For example, suppose your document has a paragraph such as <p id="example">This is an <strong>example</strong> paragraph</p>.

With jQuery, calling $('#example').text() gets the text of the paragraph, here "This is an example paragraph".

The above code uses a CSS selector #example to get the element with the id of 'example'. The text method of jQuery extracts just the text inside the element (the <strong> tags disappeared in the output).

The jQuery API is useful because it uses standard CSS selectors to search for elements, and has a readable API to extract information from them. jQuery is, however, usable only inside the browser, and thus cannot be used for web scraping. Cheerio solves this problem by providing jQuery's functionality within the Node.js runtime, so that it can be used in server-side applications as well. Now, we can use the same familiar CSS selection syntax and jQuery methods without depending on the browser.

The Cheerio API

Unlike jQuery, Cheerio doesn't have access to the browser’s DOM. Instead, we need to load the source code of the webpage we want to crawl. Cheerio allows us to load HTML code as a string, and returns an instance that we can use just like jQuery.

Let's look at how we can implement the previous example using cheerio:

You can find more information on the Cheerio API in the official documentation.

Scraping the ButterCMS documentation page

The ButterCMS documentation page is filled with useful information on their APIs. For our application, we just want to extract the URLs of the API endpoints.

For example, the API to get a single page is documented below:

What we want is the URL:

https://api.buttercms.com/v2/pages/<page_type_slug>/<page_slug>/?auth_token=api_token_b60a008a

In order to use Cheerio to extract all the URLs documented on the page, we need to:

  1. Download the source code of the webpage, and load it into a Cheerio instance
  2. Use the Cheerio API to filter out the HTML elements containing the URLs

To get started, make sure you have Node.js installed on your system. Create an empty folder as your project directory:

mkdir cheerio-example


Next, go inside the directory and start a new node project:

npm init
## follow the instructions, which will create a package.json file in the directory



Finally, create a new index.js file inside the directory, which is where the code will go.

Obtaining the website source code

We can use the axios library to download the source code from the documentation page.

While in the project directory, install the axios library:

npm install axios

We can then use axios to download the website source code
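A minimal sketch of the download step, assuming axios is installed. The downloadSource name and the optional client parameter are illustrative additions, and the URL in the comment is an assumed location of the documentation page rather than one given in this post.

```javascript
// Download the HTML source of a page with axios.
// The client parameter is an illustrative addition: it defaults to axios (loaded
// only when no client is passed) and lets a stub be supplied for offline testing.
async function downloadSource(url, client = require('axios')) {
  const response = await client.get(url);
  return response.data; // for an HTML page, axios exposes the body here as a string
}

// Example call (needs network access, so it is left commented out):
// downloadSource('https://buttercms.com/docs/api/').then((html) => console.log(html));
```

Uncommenting the last line and running the file prints the page source to the console.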

Add the above code to index.js and run it with node index.js.

You should then see the HTML source code printed to your console. This can be quite large! Let’s explore the source code to find patterns we can use to extract the information we want. You can use your favorite browser to view the source code. Right-click on any page and click on the 'View Page Source' option in your browser.


Extracting information from the source code

After looking at the code for the ButterCMS documentation page, it looks like all the API URLs are contained in span elements within pre elements:

We can use this pattern to extract the URLs from the source code. To get started, let's install the Cheerio library into our project:

npm install cheerio

Now, we can use the response data from earlier to create a Cheerio instance and scrape the webpage we downloaded:

Run the above program with:

node index.js

And you should see the output:


'https://api.buttercms.com/v2/posts/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/pages/<page_type_slug>/<page_slug>/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/pages/<page_type>/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/content/?keys=homepage_headline,homepage_title&auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/posts/?page=1&page_size=10&auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/posts/<slug>/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/search/?query=my+favorite+post&auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/authors/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/authors/jennifer-smith/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/categories/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/categories/product-updates/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/tags/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/tags/product-updates/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/feeds/rss/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/feeds/atom/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'
'https://api.buttercms.com/v2/feeds/sitemap/?auth_token=e47fc1e1ee6cb9496247914f7da8be296a09d91b'

Pretty neat! We just got all the URLs of the APIs listed on the ButterCMS documentation page.

Conclusion

Cheerio makes it really easy for us to use the tried and tested jQuery API in a server-based environment. In fact, if you use the code we just wrote, barring the page download and loading, it would work perfectly in the browser as well. You can verify this by going to the ButterCMS documentation page and pasting the following jQuery code in the browser console:

You’ll see the same output as in the previous example.

You can even use the browser to play around with the DOM before finally writing your program with Node and Cheerio.



One important aspect to remember while web scraping is to find patterns in the elements you want to extract. For example, they could all be list items under a common ul element, or they could be rows in a table element. Inspecting the source code of a webpage is the best way to find such patterns, after which using Cheerio's API should be a piece of cake!

Soham is a full stack developer with experience in developing web applications at scale in a variety of technologies and frameworks.
