How to use Boto3 library in Python to get a list of files from S3 based on the last modified date using AWS Resource?
Use the boto3 library in Python to retrieve a list of files from AWS S3 that were modified after a specific timestamp. Filtering objects on their LastModified attribute is useful when you only want to process recently uploaded or changed files, and it can be done through the AWS resource interface.
Prerequisites
Before running the code, ensure you have:
- AWS credentials configured (via the AWS CLI, environment variables, or IAM roles)
- The boto3 library installed: pip install boto3
- Sufficient S3 permissions (listing objects requires s3:ListBucket)
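Before touching S3, you can verify that boto3 is able to locate credentials at all. A minimal sanity check using the documented Session.get_credentials() method:

import boto3

# Resolve credentials from the usual chain: environment variables,
# the ~/.aws config files, or an attached IAM role
session = boto3.Session()
credentials = session.get_credentials()

if credentials is None:
    print("No AWS credentials found - run 'aws configure' or set env vars")
else:
    print("Credentials resolved; access key starts with:", credentials.access_key[:4])
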
Approach
The solution involves these key steps:
- Validate the S3 path format
- Create AWS session and S3 resource
- List all objects in the specified prefix
- Compare each file's LastModified timestamp
- Return files modified after the given date
Implementation
import boto3
from botocore.exceptions import ClientError
from datetime import datetime

def list_files_by_last_modified(s3_path, last_modified_timestamp):
    """
    Get the list of S3 files modified at or after a given timestamp.

    Args:
        s3_path: S3 path in the format 's3://bucket-name/prefix/'
        last_modified_timestamp: ISO 8601 timestamp string or a
            timezone-aware datetime object

    Returns:
        List of S3 file paths modified at or after the timestamp
    """
    # Validate S3 path format
    if not s3_path.startswith('s3://'):
        raise ValueError('Invalid S3 path. Expected format: s3://bucket-name/prefix/')

    # Parse the S3 path into bucket name and key prefix
    path_parts = s3_path.replace('s3://', '', 1).split('/')
    bucket_name = path_parts[0]
    prefix = '/'.join(path_parts[1:]) if len(path_parts) > 1 else ''

    # Add a trailing slash if a prefix exists
    if prefix and not prefix.endswith('/'):
        prefix += '/'

    # Create an AWS session and S3 resource
    session = boto3.Session()
    s3_resource = session.resource('s3')

    try:
        # List all objects with the given prefix
        response = s3_resource.meta.client.list_objects_v2(
            Bucket=bucket_name,
            Prefix=prefix
        )

        # Handle the case when no objects are found
        if 'Contents' not in response:
            return []

        # Convert the timestamp to a datetime if it is a string.
        # fromisoformat() accepts offsets such as '+00:00'; a trailing
        # 'Z' is only accepted from Python 3.11, so normalize it first.
        if isinstance(last_modified_timestamp, str):
            timestamp = datetime.fromisoformat(
                last_modified_timestamp.replace('Z', '+00:00'))
        else:
            timestamp = last_modified_timestamp

        # Filter files based on the last modified date. S3 returns
        # timezone-aware datetimes, so the cutoff must be aware too.
        filtered_files = []
        for obj in response['Contents']:
            if obj['LastModified'] >= timestamp:
                full_path = f"s3://{bucket_name}/{obj['Key']}"
                filtered_files.append(full_path)
        return filtered_files

    except ClientError as e:
        error_code = e.response['Error']['Code']
        if error_code == 'NoSuchBucket':
            raise Exception(f"Bucket '{bucket_name}' does not exist")
        elif error_code == 'AccessDenied':
            raise Exception(f"Access denied to bucket '{bucket_name}'")
        else:
            raise Exception(f"AWS error: {e}")

# Example usage
if __name__ == "__main__":
    # Example 1: Find files modified after a specific timestamp
    try:
        timestamp = "2021-01-21T13:19:56.986445+00:00"
        files = list_files_by_last_modified("s3://my-bucket/uploads/", timestamp)
        print(f"Files modified after {timestamp}:")
        for file in files:
            print(f"  {file}")
    except Exception as e:
        print(f"Error: {e}")

    # Example 2: Using a timezone-aware datetime object
    try:
        from datetime import timezone
        cutoff_time = datetime(2021, 1, 21, 13, 19, 56, tzinfo=timezone.utc)
        files = list_files_by_last_modified("s3://my-bucket/data/", cutoff_time)
        print(f"\nFiles found: {len(files)}")
    except Exception as e:
        print(f"Error: {e}")
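Note that list_objects_v2 returns at most 1,000 keys per call, so the function above only sees the first page of a large bucket. A minimal sketch of the same filter using a boto3 paginator (here the bucket name and prefix are passed directly rather than parsed from an s3:// path):

import boto3

def list_files_paginated(bucket_name, prefix, timestamp):
    """Yield s3:// paths of objects modified at or after `timestamp`,
    paginating so buckets with more than 1,000 keys are fully covered."""
    s3_client = boto3.client('s3')
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # Pages with no matching keys have no 'Contents' entry
        for obj in page.get('Contents', []):
            if obj['LastModified'] >= timestamp:
                yield f"s3://{bucket_name}/{obj['Key']}"

The generator form lets callers start processing matches before the full listing completes, which matters for very large prefixes.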
Key Features
| Feature | Description | Benefit |
|---|---|---|
| Timezone Support | Handles timezone-aware timestamps | Accurate date comparisons |
| Error Handling | Specific error messages for common issues | Better debugging experience |
| Flexible Input | Accepts string or datetime objects | Easy integration |
| list_objects_v2 | Uses newer S3 API version | Better performance |
Common Use Cases
- Data Processing: Process only newly uploaded files
- Backup Systems: Identify files that need backing up
- ETL Pipelines: Filter datasets by modification date
- Log Analysis: Analyze recent log files only
Best Practices
- Use list_objects_v2 instead of the older list_objects
- Handle pagination for buckets with many objects (list_objects_v2 returns at most 1,000 keys per call)
- Use timezone-aware datetime objects for accurate comparisons
- Implement proper error handling for network and permission issues
- Consider using S3 inventory for large-scale operations
Conclusion
Using boto3 with proper timestamp filtering allows efficient retrieval of recently modified S3 files. The approach combines S3's list_objects_v2 API with datetime comparison for robust file filtering based on modification dates.