
Object Storage & Blob Storage

Petabyte-scale unstructured data with durability, availability, and CDN integration

TL;DR

Object storage (S3, GCS, Azure Blob) stores unstructured data (files, images, backups) at petabyte scale with high durability and availability. It is cost-effective, serverless, and integrates with CDNs. Trade-offs: access patterns differ from databases (whole-object reads and writes rather than random access within files), every request pays network latency, and consistency guarantees vary by provider and operation.

Learning Objectives

  • Understand object storage design (buckets, keys, versioning)
  • Design key hierarchies for organization and performance
  • Recognize storage classes and lifecycle policies
  • Plan for durability, availability, and cost optimization

Motivating Scenario

Photo sharing app: 1M users with 1,000 photos each = 1B images. At a few MB per photo that is on the order of petabytes of binary data, far more than you would want to store as blobs in an RDBMS; S3 absorbs that scale natively with 11 nines (99.999999999%) durability. With users in the US, EU, and Asia, a CloudFront CDN serves ~95% of requests with <100 ms latency, and old photos auto-archive to Glacier for roughly 10x storage-cost savings.
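
Back-of-the-envelope, with an assumed average photo size, plus one possible key hierarchy for the scenario (the sizes, prefix layout, and helper name below are illustrative assumptions, not provider APIs):

# Rough storage estimate for the scenario above
users = 1_000_000
photos_per_user = 1_000
avg_photo_mb = 2                       # assumed average size per photo
total_tb = users * photos_per_user * avg_photo_mb / 1_000_000
print(f"~{total_tb:,.0f} TB of raw image data")   # ~2,000 TB, i.e. about 2 PB

# One possible key hierarchy: shard by user, then by upload date
def photo_key(user_id: str, upload_date: str, photo_id: str) -> str:
    # e.g. photos/4821/2025/02/14/a1b2c3d4.jpg
    year, month, day = upload_date.split('-')
    return f"photos/{user_id}/{year}/{month}/{day}/{photo_id}.jpg"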

Core Concepts

Practical Example

import boto3
import os

# Create S3 client and resource
s3_client = boto3.client('s3', region_name='us-west-2')
s3_resource = boto3.resource('s3')

# Upload file with metadata and server-side encryption
def upload_file(bucket, key, file_path):
    s3_client.upload_file(
        file_path,
        bucket,
        key,
        ExtraArgs={
            'ContentType': 'image/jpeg',
            'ServerSideEncryption': 'AES256',
            'Metadata': {'user-id': '123', 'upload-date': '2025-02-14'}
        }
    )

# Upload with progress callback
def upload_file_with_progress(bucket, key, file_path):
    s3_resource.meta.client.upload_file(
        file_path, bucket, key,
        Callback=ProgressPercentage(file_path)
    )
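
# ProgressPercentage is not part of boto3 itself. A minimal sketch of the
# callback class used above (adapted from the boto3 upload examples):
import sys
import threading

class ProgressPercentage:
    """Callable that prints cumulative transfer progress for a file."""
    def __init__(self, file_path):
        self._file_path = file_path
        self._size = float(os.path.getsize(file_path))
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        # boto3 invokes the callback with the byte count of each transferred chunk
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            sys.stdout.write(f"\r{self._file_path}  {percentage:.1f}%")
            sys.stdout.flush()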

# Download file
def download_file(bucket, key, local_path):
    s3_client.download_file(bucket, key, local_path)

# List objects under a prefix
def list_objects(bucket, prefix='', max_keys=100):
    response = s3_client.list_objects_v2(
        Bucket=bucket,
        Prefix=prefix,
        MaxKeys=max_keys
    )

    objects = []
    for obj in response.get('Contents', []):
        objects.append({
            'key': obj['Key'],
            'size': obj['Size'],
            'last_modified': obj['LastModified'],
            'storage_class': obj.get('StorageClass', 'STANDARD')
        })

    return objects

# Generate presigned URL (temporary access)
def get_presigned_url(bucket, key, expiration_seconds=3600):
    url = s3_client.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=expiration_seconds
    )
    return url

# Multipart upload for large files
def upload_large_file(bucket, key, file_path):
    file_size = os.path.getsize(file_path)
    part_size = 5 * 1024 * 1024  # 5 MB (the S3 minimum part size, except for the last part)

    multipart = s3_client.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = multipart['UploadId']

    parts = []
    with open(file_path, 'rb') as f:
        part_num = 1
        while True:
            data = f.read(part_size)
            if not data:
                break

            response = s3_client.upload_part(
                Bucket=bucket,
                Key=key,
                PartNumber=part_num,
                UploadId=upload_id,
                Body=data
            )

            parts.append({
                'ETag': response['ETag'],
                'PartNumber': part_num
            })
            part_num += 1

    s3_client.complete_multipart_upload(
        Bucket=bucket,
        Key=key,
        UploadId=upload_id,
        MultipartUpload={'Parts': parts}
    )

# Lifecycle policy (auto-archive and expire old photos)
def set_lifecycle_policy(bucket):
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={
            'Rules': [
                {
                    'Id': 'archive-old-photos',
                    'Filter': {'Prefix': 'photos/'},
                    'Status': 'Enabled',
                    'Transitions': [
                        {
                            'Days': 30,
                            'StorageClass': 'INTELLIGENT_TIERING'
                        },
                        {
                            'Days': 90,
                            'StorageClass': 'GLACIER'
                        },
                    ],
                    'Expiration': {'Days': 2555}  # 7 years
                }
            ]
        }
    )

# Versioning
def enable_versioning(bucket):
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={'Status': 'Enabled'}
    )

# Server-side replication (requires versioning enabled on both buckets)
def setup_replication(source_bucket, dest_bucket):
    s3_client.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration={
            'Role': 'arn:aws:iam::ACCOUNT:role/s3-replication',
            'Rules': [
                {
                    'ID': 'replicate-all',
                    'Status': 'Enabled',
                    'Priority': 1,
                    'Filter': {'Prefix': ''},
                    # When a Filter is specified, DeleteMarkerReplication is required
                    'DeleteMarkerReplication': {'Status': 'Disabled'},
                    'Destination': {
                        'Bucket': f'arn:aws:s3:::{dest_bucket}',
                        # Replication Time Control also requires a Metrics block
                        'ReplicationTime': {'Status': 'Enabled', 'Time': {'Minutes': 15}},
                        'Metrics': {'Status': 'Enabled', 'EventThreshold': {'Minutes': 15}}
                    }
                }
            ]
        }
    )
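
The same building blocks exist on the other providers mentioned in the TL;DR. As a rough sketch (assuming the google-cloud-storage client library and credentials that can sign URLs; function names are illustrative), the equivalent upload and signed-URL operations on GCS look like this:

from datetime import timedelta
from google.cloud import storage

def upload_to_gcs(bucket_name, key, file_path):
    # The client picks up Application Default Credentials from the environment
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(key)
    blob.upload_from_filename(file_path, content_type='image/jpeg')

def gcs_signed_url(bucket_name, key, expiration_seconds=3600):
    # V4 signed URLs are the GCS counterpart of S3 presigned URLs
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(key)
    return blob.generate_signed_url(
        version='v4',
        expiration=timedelta(seconds=expiration_seconds),
        method='GET'
    )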

When to Use Object Storage / When Not to Use

Use Object Storage When
  1. Large unstructured files (images, videos)
  2. Petabyte-scale storage needed
  3. Infrequent random access
  4. Archival and backup data
  5. Integration with CDN for distribution
Use Databases When
  1. Frequent random access within data
  2. Complex queries required
  3. Transactional consistency needed
  4. Small structured records
  5. ACID guarantees important
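
In practice the two lists above are usually combined: the binary payload lives in the bucket, while the queryable metadata, including the object key, lives in a database. A minimal sketch of that split, assuming a SQLite metadata table and illustrative bucket and key names:

import sqlite3
import boto3

s3 = boto3.client('s3')

def save_photo(conn, bucket, user_id, photo_id, file_path):
    key = f"photos/{user_id}/{photo_id}.jpg"
    # The blob itself goes to object storage...
    s3.upload_file(file_path, bucket, key)
    # ...while only the small, queryable record (with the object key) goes to the database
    conn.execute(
        "CREATE TABLE IF NOT EXISTS photos (photo_id TEXT PRIMARY KEY, user_id TEXT, s3_key TEXT)"
    )
    conn.execute(
        "INSERT INTO photos (photo_id, user_id, s3_key) VALUES (?, ?, ?)",
        (photo_id, user_id, key)
    )
    conn.commit()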

Patterns and Pitfalls

Design Review Checklist

  • Bucket naming strategy supports access patterns
  • Key hierarchy meaningful and consistently applied
  • Versioning enabled for critical data
  • Lifecycle policies configured for cost
  • Cross-region replication for important data
  • Encryption at rest and in transit configured (see sketch after this checklist)
  • Presigned URLs for temporary access
  • Access logging enabled for audit trail (see sketch after this checklist)
  • Multipart upload for large files
  • CDN distribution configured for frequently accessed objects
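
For the encryption and access-logging items, a minimal sketch reusing the s3_client from the Practical Example (bucket names and the log prefix are placeholders):

def harden_bucket(bucket, log_bucket):
    # Default server-side encryption for all new objects in the bucket
    s3_client.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            'Rules': [
                {'ApplyServerSideEncryptionByDefault': {'SSEAlgorithm': 'AES256'}}
            ]
        }
    )
    # Server access logging for the audit trail
    s3_client.put_bucket_logging(
        Bucket=bucket,
        BucketLoggingStatus={
            'LoggingEnabled': {
                'TargetBucket': log_bucket,
                'TargetPrefix': f'access-logs/{bucket}/'
            }
        }
    )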

Self-Check

  • How would you organize keys for a photo sharing app?
  • What's the difference between versioning and cross-region replication?
  • When would you use Intelligent-Tiering vs manual transitions?
  • How do lifecycle policies reduce storage costs?

Object storage provides cost-effective, durable storage for unstructured data at massive scale, but isn't suitable for random access within files or complex queries. Use for files, backups, and archives alongside databases for structured data.

Next Steps

  • Explore CDN Integration for global distribution
  • Learn Cost Optimization for cloud storage
  • Study Backup & Disaster Recovery strategies
  • Dive into Data Lifecycle policies

References

  • AWS S3 Documentation
  • Google Cloud Storage Documentation
  • Azure Blob Storage Guide
  • AWS S3 Best Practices