How to Scrape a Website using Puppeteer.js?

Web scraping is one of the best ways to automate data collection from the web. A web scraper (often called a "crawler") visits selected pages and extracts data from them. This is far faster than copying data from pages by hand, and it provides a solution when a website doesn't offer an API for data access.

In this tutorial, we will create a web scraper in Node.js using Puppeteer.js to extract book information from a sample website.

What is Puppeteer.js?

Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium browsers programmatically. It can generate screenshots, PDFs, crawl SPAs, and perform automated testing.

Project Setup

First, create a new project directory and initialize it:

mkdir book-scraper-app
cd book-scraper-app
npm init -y

Install Puppeteer as a project dependency (the scraper requires it at runtime, so it belongs in dependencies rather than devDependencies):

npm install puppeteer

This command installs Puppeteer and downloads a compatible Chromium browser version.

Project Structure

We'll organize our scraper into four files:

  • browser.js - Browser instance management
  • index.js - Application entry point
  • pageController.js - Scraping controller
  • pageScraper.js - Core scraping logic

Browser Instance Setup

Create browser.js to handle browser initialization:

const puppeteer = require('puppeteer');

async function browserInit() {
   let browser;
   try {
      console.log("Opening the browser......");
      browser = await puppeteer.launch({
         headless: false,
         ignoreDefaultArgs: ['--disable-extensions'],
         args: ["--disable-setuid-sandbox"],
         ignoreHTTPSErrors: true
      });
   } catch (err) {
      console.log("Could not create a browser instance => : ", err);
   }
   return browser;
}

module.exports = {
   browserInit
};

Key configuration options:

  • headless: false - Shows the browser window, which is useful while debugging
  • ignoreHTTPSErrors: true - Ignores invalid or self-signed TLS certificates
  • ignoreDefaultArgs: ['--disable-extensions'] - Removes Chromium's default --disable-extensions flag, preventing extension conflicts
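As a hedged sketch, here is an alternative launch configuration you might switch to once debugging is done. The sandbox flags are common in Docker/CI setups, but verify them against your own environment before relying on them:

```javascript
// Hypothetical production-leaning launch options (verify per environment)
const launchOptions = {
   headless: true,                 // run without a visible window
   args: [
      "--disable-setuid-sandbox",
      "--no-sandbox"               // frequently required inside containers
   ],
   ignoreHTTPSErrors: true         // tolerate invalid TLS certificates
};

// Used the same way as in browser.js:
// browser = await puppeteer.launch(launchOptions);
```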

Application Entry Point

Create index.js as the main application file:

const browserObject = require('./browser');
const scraperController = require('./pageController');

let browserInstance = browserObject.browserInit();
scraperController(browserInstance);
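Note that browserInit() is async, so index.js hands the controller a pending Promise rather than a browser object; the controller resolves it with await. A minimal sketch of that hand-off, with stubs standing in for Puppeteer (initStub and controller here are illustrative names, not part of the tutorial files):

```javascript
// Stub standing in for browserInit(): resolves to a fake "browser"
const initStub = () => Promise.resolve({ connected: true });

// Stub mirroring pageController.js: awaits the pending launch
const controller = async (browserPromise) => {
   const browser = await browserPromise;  // resolves once "launch" finishes
   return browser.connected;
};

controller(initStub()).then(ok => console.log(ok)); // prints: true
```

Passing the un-awaited Promise keeps index.js synchronous and pushes all async handling into the controller.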

Page Controller

Create pageController.js to manage the scraping process:

const pageScraper = require('./pageScraper');

async function scrapeAll(browserInstance){
   let browser;
   try{
      browser = await browserInstance;
      await pageScraper.scraper(browser);
   }
   catch(err){
      console.log("Could not resolve the browser instance => ", err);
   }
}

module.exports = (browserInstance) => scrapeAll(browserInstance);

Core Scraping Logic

Create pageScraper.js with the main scraping functionality:

const scraperObject = {
   url: 'http://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
   
   async scraper(browser) {
      let page = await browser.newPage();
      console.log(`Navigating to ${this.url}...`);
      await page.goto(this.url);
      await page.waitForSelector('.page_inner');

      // Extract links to in-stock books
      let urls = await page.$$eval('section ol > li', links => {
         // Keep only listings whose availability text reads "In stock"
         links = links.filter(link =>
            link.querySelector('.instock.availability').textContent.includes("In stock")
         );
         // Map each listing to the URL of its detail page
         return links.map(el => el.querySelector('h3 > a').href);
      });

      // Scrape individual book pages
      let pagePromise = (link) => new Promise(async (resolve, reject) => {
         let bookData = {};
         let newPage = await browser.newPage();
         await newPage.goto(link);

         bookData['title'] = await newPage.$eval('.product_main > h1', text => text.textContent);
         bookData['price'] = await newPage.$eval('.price_color', text => text.textContent);
         bookData['availability'] = await newPage.$eval('.instock.availability', text => {
            // Strip line breaks and tabs, then pull the count out of
            // a string like "In stock (19 available)"
            text = text.textContent.replace(/(\r\n|\n|\r|\t)/gm, "");
            let regexp = /^.*\((.*)\).*$/i;
            let stockAvailable = regexp.exec(text)[1].split(' ')[0];
            return stockAvailable;
         });
         bookData['imageUrl'] = await newPage.$eval('#product_gallery img', img => img.src);
         bookData['description'] = await newPage.$eval('#product_description', div =>
            div.nextSibling.nextSibling.textContent
         );
         await newPage.close();
         resolve(bookData);
      });

      // Visit each book page sequentially and log the scraped data
      for (let link of urls) {
         let bookInfo = await pagePromise(link);
         console.log(bookInfo);
      }
   }
};

module.exports = scraperObject;

Running the Scraper

Add a start script to your package.json:

{
   "scripts": {
      "start": "node index.js"
   }
}

Run the scraper:

npm run start

Sample Output

Opening the browser......
Navigating to http://books.toscrape.com/catalogue/category/books/childrens_11/index.html...
{
  title: 'Birdsong: A Story in Pictures',
  price: '£54.64',
  availability: '19',
  imageUrl: 'http://books.toscrape.com/media/cache/af/2f/af2fe2419ea136f2cd567aa92082c3ae.jpg',
  description: 'Bring the thrilling story of one red bird to life...'
}
{
  title: 'The Bear and the Piano',
  price: '£36.89',
  availability: '18',
  imageUrl: 'http://books.toscrape.com/media/cache/d0/87/d0876dcd1a6530a4cb54903aad7a3e28.jpg',
  description: 'One day, a young bear stumbles upon something...'
}

Key Features

  • Automated browsing - Opens pages programmatically
  • DOM manipulation - Uses CSS selectors to extract data
  • Promise-based - Handles asynchronous operations efficiently
  • Error handling - Includes try-catch blocks for robust operation

Best Practices

  • Always respect robots.txt and website terms of service
  • Add delays between requests to avoid overwhelming servers
  • Handle errors gracefully with proper exception handling
  • Close browser pages after scraping to free memory
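The "add delays" advice above can be sketched as a small helper that spaces out page visits (sleep and scrapeSequentially are hypothetical names, not part of the tutorial files):

```javascript
// Resolve after ms milliseconds — a simple politeness delay
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

// Visit URLs one at a time, pausing between requests
async function scrapeSequentially(urls, visit, delayMs = 1000) {
   const results = [];
   for (const url of urls) {
      results.push(await visit(url));
      await sleep(delayMs);   // wait before hitting the next page
   }
   return results;
}
```

In pageScraper.js this would wrap the for...of loop that calls pagePromise(link), keeping requests sequential while adding a fixed pause between them.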

Conclusion

Puppeteer.js provides a powerful and flexible way to scrape websites by controlling a real browser instance. This approach handles JavaScript-rendered content effectively and allows for complex interactions with web pages, making it ideal for modern web scraping tasks.

Updated on: 2026-03-15T23:19:01+05:30
