How to Scrape a Website using Puppeteer.js?
Web scraping is one of the best ways to automate the process of data collection from the web. A web scraper, often called a "crawler," surfs the web and extracts data from selected pages. This automation is much easier than manually extracting data from different web pages and provides a solution when websites don't offer APIs for data access.
In this tutorial, we will create a web scraper in Node.js using Puppeteer.js to extract book information from a sample website.
What is Puppeteer.js?
Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium browsers programmatically. It can generate screenshots and PDFs, crawl single-page applications (SPAs), and perform automated testing.
Project Setup
First, create a new project directory and initialize it:
mkdir book-scraper-app
cd book-scraper-app
npm init -y
Install Puppeteer as a development dependency:
npm install --save-dev puppeteer
This command installs Puppeteer and downloads a compatible Chromium browser version.
Project Structure
We'll organize our scraper into four files:
- browser.js - Browser instance management
- index.js - Application entry point
- pageController.js - Scraping controller
- pageScraper.js - Core scraping logic
Browser Instance Setup
Create browser.js to handle browser initialization:
const puppeteer = require('puppeteer');

async function browserInit() {
    let browser;
    try {
        console.log("Opening the browser......");
        browser = await puppeteer.launch({
            headless: false,
            ignoreDefaultArgs: ['--disable-extensions'],
            args: ["--disable-setuid-sandbox"],
            ignoreHTTPSErrors: true
        });
    } catch (err) {
        console.log("Could not create a browser instance => : ", err);
    }
    return browser;
}

module.exports = {
    browserInit
};
Key configuration options:
- headless: false - Shows the browser window so you can watch the scraper while debugging
- ignoreHTTPSErrors: true - Ignores HTTPS certificate errors, so pages with invalid or self-signed certificates still load
- ignoreDefaultArgs: ['--disable-extensions'] - Removes the --disable-extensions flag from Chromium's default arguments, leaving extensions enabled
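For unattended runs (a server or CI job) you would typically switch to headless mode. Below is a minimal sketch of an alternative options object to pass to puppeteer.launch(); the extra --no-sandbox flag is an assumption for containerized environments, not part of this tutorial's setup:

```javascript
// Alternative launch options for unattended runs -- a sketch, not the
// tutorial's configuration. Pass this object to puppeteer.launch().
const headlessOptions = {
    headless: true,                    // no visible browser window
    args: [
        "--disable-setuid-sandbox",
        "--no-sandbox"                 // often required inside containers
    ],
    ignoreHTTPSErrors: true
};
```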
Application Entry Point
Create index.js as the main application file:
const browserObject = require('./browser');
const scraperController = require('./pageController');

let browserInstance = browserObject.browserInit();
scraperController(browserInstance);
Page Controller
Create pageController.js to manage the scraping process:
const pageScraper = require('./pageScraper');

async function scrapeAll(browserInstance) {
    let browser;
    try {
        browser = await browserInstance;
        await pageScraper.scraper(browser);
    } catch (err) {
        console.log("Could not resolve the browser instance => ", err);
    }
}

module.exports = (browserInstance) => scrapeAll(browserInstance);
Core Scraping Logic
Create pageScraper.js with the main scraping functionality:
const scraperObject = {
    url: 'http://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
    async scraper(browser) {
        let page = await browser.newPage();
        console.log(`Navigating to ${this.url}...`);
        await page.goto(this.url);
        await page.waitForSelector('.page_inner');
        // Extract the URLs of all in-stock books on the category page
        let urls = await page.$$eval('section ol > li', links => {
            links = links.filter(link =>
                link.querySelector('.instock.availability').textContent.includes("In stock")
            );
            links = links.map(el => el.querySelector('h3 > a').href);
            return links;
        });
        // Scrape each individual book page
        let pagePromise = (link) => new Promise(async (resolve, reject) => {
            let bookData = {};
            let newPage = await browser.newPage();
            await newPage.goto(link);
            bookData['title'] = await newPage.$eval('.product_main > h1', text => text.textContent);
            bookData['price'] = await newPage.$eval('.price_color', text => text.textContent);
            bookData['availability'] = await newPage.$eval('.instock.availability', text => {
                // Strip newlines and tabs from the availability text
                text = text.textContent.replace(/(\r\n\t|\n|\r|\t)/gm, "");
                // Pull the stock count out of "(x available)"
                let regexp = /^.*\((.*)\).*$/i;
                let stockAvailable = regexp.exec(text)[1].split(' ')[0];
                return stockAvailable;
            });
            bookData['imageUrl'] = await newPage.$eval('#product_gallery img', img => img.src);
            bookData['description'] = await newPage.$eval('#product_description', div =>
                div.nextSibling.nextSibling.textContent
            );
            // Close the page before resolving so tabs don't accumulate
            await newPage.close();
            resolve(bookData);
        });
        for (let link of urls) {
            let bookInfo = await pagePromise(link);
            console.log(bookInfo);
        }
    }
};

module.exports = scraperObject;
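The availability parsing inside $eval is easy to sanity-check in isolation. The stand-alone helper below (a sketch, not one of the four project files) applies the same strip-and-capture steps to text shaped like what books.toscrape.com renders in .instock.availability:

```javascript
// Stand-alone check of the stock-count extraction used in pageScraper.js.
function parseStock(rawText) {
    // Strip newlines and tabs, as the scraper does
    const text = rawText.replace(/(\r\n\t|\n|\r|\t)/gm, "");
    // Capture whatever sits inside the parentheses, e.g. "(19 available)"
    const match = /^.*\((.*)\).*$/i.exec(text);
    return match ? match[1].split(' ')[0] : null;
}

console.log(parseStock("\n    In stock (19 available)\n"));  // "19"
```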
Running the Scraper
Add a start script to your package.json:
{
    "scripts": {
        "start": "node index.js"
    }
}
Run the scraper:
npm run start
Sample Output
Opening the browser......
Navigating to http://books.toscrape.com/catalogue/category/books/childrens_11/index.html...
{
    title: 'Birdsong: A Story in Pictures',
    price: '£54.64',
    availability: '19',
    imageUrl: 'http://books.toscrape.com/media/cache/af/2f/af2fe2419ea136f2cd567aa92082c3ae.jpg',
    description: 'Bring the thrilling story of one red bird to life...'
}
{
    title: 'The Bear and the Piano',
    price: '£36.89',
    availability: '18',
    imageUrl: 'http://books.toscrape.com/media/cache/d0/87/d0876dcd1a6530a4cb54903aad7a3e28.jpg',
    description: 'One day, a young bear stumbles upon something...'
}
Key Features
- Automated browsing - Opens pages programmatically
- DOM querying - Uses CSS selectors to extract data from rendered pages
- Promise-based - Handles asynchronous operations efficiently
- Error handling - Includes try-catch blocks for robust operation
Best Practices
- Always respect robots.txt and website terms of service
- Add delays between requests to avoid overwhelming servers
- Handle errors gracefully with proper exception handling
- Close browser pages after scraping to free memory
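The "add delays" practice is easy to retrofit onto the scraper's for...of loop. A sketch; the helper names and the one-second default are assumptions:

```javascript
// Politeness delay between page visits -- a sketch to bolt onto the
// loop in pageScraper.js, not part of the original tutorial.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeLoop(urls, visit, waitMs = 1000) {
    const results = [];
    for (const url of urls) {
        results.push(await visit(url)); // scrape one page...
        await delay(waitMs);            // ...then pause before the next
    }
    return results;
}
```

In pageScraper.js you would call politeLoop(urls, pagePromise) in place of the bare loop, keeping one request in flight at a time.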
Conclusion
Puppeteer.js provides a powerful and flexible way to scrape websites by controlling a real browser instance. This approach handles JavaScript-rendered content effectively and allows for complex interactions with web pages, making it ideal for modern web scraping tasks.
