How to use Boto3 to paginate through all crawlers present in AWS Glue

In this article, we will explore how to use Boto3 to paginate through all AWS Glue crawlers in your account efficiently.

Overview

AWS Glue crawlers can be numerous in large accounts. Using pagination allows you to retrieve crawler information in manageable chunks, preventing timeouts and memory issues.

Parameters

The pagination function accepts three key parameters −

  • max_items − Total number of records to return. If more records exist, a NextToken is provided for continuation.

  • page_size − Number of crawlers per page/batch.

  • starting_token − Token from previous response to continue pagination from a specific point.
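The three parameters above correspond one-to-one to the MaxItems, PageSize, and StartingToken keys of Boto3's PaginationConfig dictionary. As a small illustration (the helper name build_pagination_config is introduced here for clarity and is not part of Boto3), a function that drops unset values keeps the config explicit −

```python
def build_pagination_config(max_items=None, page_size=None, starting_token=None):
    """Map the function parameters onto Boto3's PaginationConfig keys,
    dropping any parameter that was not supplied."""
    config = {
        'MaxItems': max_items,
        'PageSize': page_size,
        'StartingToken': starting_token,
    }
    # Keep only the keys the caller actually set
    return {key: value for key, value in config.items() if value is not None}

# Only the supplied parameters appear in the resulting config:
print(build_pagination_config(max_items=3, page_size=5))
# {'MaxItems': 3, 'PageSize': 5}
```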

Implementation Steps

  • Step 1: Import required libraries and handle exceptions

  • Step 2: Create AWS session and Glue client

  • Step 3: Create paginator object using get_crawlers

  • Step 4: Configure pagination parameters and execute

  • Step 5: Handle responses and continue pagination if needed

Example

The following code demonstrates how to paginate through all crawlers in your AWS Glue Data Catalog −

import boto3
from botocore.exceptions import ClientError

def paginate_through_crawlers(max_items=None, page_size=None, starting_token=None):
    """Return a page iterator over the Glue crawlers in this account."""
    session = boto3.session.Session()
    glue_client = session.client('glue')

    try:
        # Paginator for the Glue GetCrawlers API
        paginator = glue_client.get_paginator('get_crawlers')
        response = paginator.paginate(
            PaginationConfig={
                'MaxItems': max_items,            # stop after this many crawlers in total
                'PageSize': page_size,            # crawlers returned per API call
                'StartingToken': starting_token   # resume from a previous NextToken
            }
        )
        return response
    except ClientError as e:
        raise Exception("boto3 client error in paginate_through_crawlers: " + str(e))
    except Exception as e:
        raise Exception("Unexpected error in paginate_through_crawlers: " + str(e))

# First run − get the first 3 crawlers
next_token = None
response_1 = paginate_through_crawlers(max_items=3, page_size=5)
print("First batch of crawlers:")
for page in response_1:
    print(f"Found {len(page['Crawlers'])} crawlers")
    if 'NextToken' in page:
        next_token = page['NextToken']
        print(f"NextToken: {next_token}")

# Second run − continue from where the first run stopped
if next_token:
    response_2 = paginate_through_crawlers(max_items=3, page_size=5, starting_token=next_token)
    print("Second batch of crawlers:")
    for page in response_2:
        print(f"Found {len(page['Crawlers'])} crawlers")

Sample Output

The output shows crawler details including name, role, targets, and metadata −

First batch of crawlers:
Found 3 crawlers
NextToken: crawlr-wells

Crawler details include:
- Name: DailyTest_v1.01
- Role: ds-dev
- State: READY
- DatabaseName: default
- S3 Targets and configurations

Second batch of crawlers:
Found 3 crawlers
NextToken: discovery_rep

Additional crawler details:
- Name: crwlr-cw-etf
- Role: dev-ds-glue-role
- State: READY
- Last crawl status and timing information

Best Practices

  • Set appropriate page_size values (typically 10-100) to balance performance and memory usage

  • Always handle NextToken to ensure complete data retrieval

  • Implement proper error handling for network timeouts and AWS service limits

  • Use max_items to limit total results when testing or for specific use cases
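The "always handle NextToken" practice above amounts to a simple loop: keep requesting batches until the response no longer carries a token. The control flow can be sketched independently of AWS using a stand-in page source (the fetch_page callable and the fake_fetch stub below are hypothetical stand-ins for a real paginate() call) −

```python
def collect_all_items(fetch_page, starting_token=None):
    """Accumulate crawlers across batches, following NextToken until exhausted."""
    items = []
    token = starting_token
    while True:
        page = fetch_page(token)   # in real use: a paginate() call with StartingToken=token
        items.extend(page['Crawlers'])
        token = page.get('NextToken')
        if not token:              # no token means everything has been retrieved
            return items

# Stub that simulates two batches of crawler names:
def fake_fetch(token):
    if token is None:
        return {'Crawlers': ['crawler-a', 'crawler-b'], 'NextToken': 'page-2'}
    return {'Crawlers': ['crawler-c']}

print(collect_all_items(fake_fetch))
# ['crawler-a', 'crawler-b', 'crawler-c']
```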

Conclusion

Boto3 pagination provides an efficient way to retrieve large numbers of AWS Glue crawlers without overwhelming your application or hitting service limits. Use the NextToken to continue retrieving data across multiple requests when needed.

Updated on: 2026-03-25T18:52:52+05:30
