Architecture & System Design
Distributed systems, scalability, and design patterns
Monolithic vs Microservices trade-offs
What are the trade-offs between monolithic and microservices architectures?
Monolithic Architecture:
Single deployable unit containing all functionality.
┌────────────────────────────────┐
│            Monolith            │
│  ┌─────┐ ┌──────┐ ┌─────────┐  │
│  │Users│ │Orders│ │Inventory│  │
│  └─────┘ └──────┘ └─────────┘  │
│        Single Database         │
└────────────────────────────────┘
Microservices Architecture:
Multiple independent services communicating over a network.
┌───────┐ ┌────────┐ ┌───────────┐
│ Users │ │ Orders │ │ Inventory │
│ DB │ │ DB │ │ DB │
└───┬───┘ └───┬────┘ └─────┬─────┘
│ │ │
────┴───────────┴──────────────┴────
API Gateway
Trade-offs:
| Aspect | Monolith | Microservices |
|---|---|---|
| Complexity | Lower | Higher |
| Deployment | All-or-nothing | Independent |
| Scaling | Entire app | Per service |
| Data consistency | Easy (ACID) | Hard (distributed) |
| Development speed | Fast initially | Fast at scale |
| Testing | Simpler | More complex |
| Latency | In-process | Network calls |
| Team autonomy | Low | High |
When Monolith:
- Small team (<10)
- Simple domain
- Starting a new project
- Unclear boundaries
- Need quick MVP
When Microservices:
- Large organization
- Need independent scaling
- Different tech stacks needed
- Clear domain boundaries
- High availability critical
Migration Path:
1. Start monolith
2. Identify bounded contexts
3. Extract services incrementally
4. Strangler fig pattern (see the routing sketch below)
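A minimal sketch of strangler-fig routing in Python (illustrative names; real setups usually do this at a reverse proxy or API gateway): paths already migrated go to the new services, everything else still hits the monolith.
EXTRACTED = {
    "/users": "http://user-service",
    "/inventory": "http://inventory-service",
}
MONOLITH = "http://monolith"
def route(path):
    # Longest-prefix matching would be safer; startswith keeps the sketch short
    for prefix, target in EXTRACTED.items():
        if path.startswith(prefix):
            return target + path
    return MONOLITH + path
# route("/users/42") → "http://user-service/users/42"
# route("/orders/7") → "http://monolith/orders/7"
As each service is extracted, its prefix moves into the table; when the table covers everything, the monolith can be retired.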
Key Points to Look For:
- Knows trade-offs, not just hype
- Considers team size
- Understands operational complexity
Follow-up: How do you identify service boundaries?
MVC, MVP, MVVM - differences
What are the differences between MVC, MVP, and MVVM patterns?
MVC (Model-View-Controller):
User
│
┌────▼────┐
│Controller│────→ Model
└────┬────┘ │
│ │
┌────▼────┐ │
│ View │←───────┘
└─────────┘
Controller: Handles input, updates Model
Model: Business logic, data
View: Renders UI from Model
MVP (Model-View-Presenter):
User
│
┌────▼────┐
│ View │←──────┐
└────┬────┘ │
│ │
┌────▼─────┐ │
│ Presenter│──→ Model
└──────────┘
View: Passive, delegates to Presenter
Presenter: All logic, updates View
Model: Business logic, data
MVVM (Model-View-ViewModel):
User
│
┌────▼────┐
│ View │
└────┬────┘
│ Data Binding
┌────▼─────┐
│ViewModel │
└────┬─────┘
│
┌────▼────┐
│ Model │
└─────────┘
View: Binds to ViewModel
ViewModel: View state, commands
Model: Business logic
Comparison:
| Aspect | MVC | MVP | MVVM |
|---|---|---|---|
| View-Logic coupling | Medium | Low | Low |
| Testability | Medium | High | High |
| View updates | Controller | Presenter | Binding |
| Complexity | Low | Medium | Medium |
| Best for | Web apps | Desktop, mobile | Desktop, SPA |
Examples:
- MVC: Ruby on Rails, ASP.NET MVC, Spring MVC
- MVP: Android (traditional), WinForms
- MVVM: WPF, Angular, Vue.js, SwiftUI
Key Differences:
MVC vs MVP:
- MVC: View can query Model directly
- MVP: All communication through Presenter
MVP vs MVVM:
- MVP: Presenter explicitly updates View
- MVVM: Data binding handles updates
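For contrast, a minimal MVP sketch in Python (illustrative names): the View is passive, and the Presenter explicitly pushes updates into it, which is what makes the logic testable with a fake view.
class UserView:
    def show_name(self, name):
        print(f"Name: {name}")
    def show_error(self, message):
        print(f"Error: {message}")
class UserPresenter:
    def __init__(self, view, user_repository):
        self.view = view
        self.repo = user_repository
    def load_user(self, user_id):
        user = self.repo.find(user_id)      # Model access
        if user is None:
            self.view.show_error("User not found")
        else:
            self.view.show_name(user.name)  # Presenter updates View explicitly
In MVVM the last two lines would disappear: the ViewModel would expose a name property and the data-binding layer would refresh the View.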
Key Points to Look For:
- Knows data flow direction
- Understands testability implications
- Can match to technologies
Follow-up: Which pattern would you choose for a React application?
Layered architecture and its layers
Explain layered architecture and the purpose of each layer.
Layered Architecture:
Organizes code into horizontal layers with specific responsibilities.
┌─────────────────────────────────┐
│ Presentation Layer │ API/UI
├─────────────────────────────────┤
│ Application Layer │ Use cases
├─────────────────────────────────┤
│ Domain Layer │ Business logic
├─────────────────────────────────┤
│ Infrastructure Layer │ External systems
└─────────────────────────────────┘
Layers:
1. Presentation Layer
- Handles user interface / API endpoints
- Request/response formatting
- Input validation (format only)
- No business logic
@RestController
public class UserController {
@PostMapping("/users")
public Response createUser(@Valid UserDTO dto) {
User user = userService.create(dto);
return Response.created(user.getId());
}
}
2. Application Layer (Service Layer)
- Orchestrates use cases
- Transaction management
- Calls domain layer
- No business rules
@Service
public class UserService {
public User create(UserDTO dto) {
User user = userFactory.create(dto);
validateUnique(user.getEmail());
userRepository.save(user);
eventPublisher.publish(new UserCreated(user));
return user;
}
}
3. Domain Layer
- Business rules and logic
- Domain entities
- Value objects
- Domain services
public class User {
private Email email;
private Password password;
public void changePassword(Password newPassword) {
validatePasswordPolicy(newPassword);
this.password = newPassword;
}
}
4. Infrastructure Layer
- Database access
- External services
- File system
- Messaging
@Repository
public class JpaUserRepository implements UserRepository {
@Override
public void save(User user) {
entityManager.persist(user);
}
}
Dependency Rule:
Presentation → Application → Domain ← Infrastructure
↑
Domain is the core
Dependencies point toward the Domain: Presentation depends on Application, Application on Domain, and Infrastructure also points inward. The Domain at the core knows nothing about the layers around it.
Key Points to Look For:
- Knows each layer's responsibility
- Understands dependency direction
- Can identify layer violations
Follow-up: What's the difference between this and Clean Architecture?
Clean Architecture / Hexagonal Architecture
Explain Clean Architecture or Hexagonal Architecture. How do they differ from traditional layered architecture?
Core Principle: Business logic at center, frameworks/external concerns at edges.
Hexagonal Architecture (Ports & Adapters):
┌─────────────┐
HTTP ──────→ │ Port │
│ (Interface)│
└──────┬──────┘
│
┌─────────▼─────────┐
CLI ─────→│ Application │←───── Tests
│ Core │
└─────────┬─────────┘
│
┌──────▼──────┐
│ Port │
│ (Interface)│
└──────┬──────┘
│
┌───────────┼───────────┐
│ │ │
PostgreSQL Redis Email
Ports: Interfaces defining how core interacts with outside
Adapters: Implementations of ports (HTTP, DB, etc.)
Clean Architecture (Onion):
┌─────────────────────────────────┐
│ Frameworks & Drivers │
│ ┌─────────────────────────┐ │
│ │ Interface Adapters │ │
│ │ ┌──────────────────┐ │ │
│ │ │ Use Cases │ │ │
│ │ │ ┌───────────┐ │ │ │
│ │ │ │ Entities │ │ │ │
│ │ │ └───────────┘ │ │ │
│ │ └──────────────────┘ │ │
│ └─────────────────────────┘ │
└─────────────────────────────────┘
Dependency Rule:
Dependencies point INWARD only. Inner circles know nothing about outer.
Example Structure:
src/
├── domain/ # Entities, Value Objects
│ ├── User.java
│ └── UserRepository.java # Interface!
├── application/ # Use Cases
│ └── CreateUserUseCase.java
├── adapters/
│ ├── web/ # HTTP adapter
│ │ └── UserController.java
│ └── persistence/ # DB adapter
│ └── JpaUserRepository.java
└── config/ # Wiring
Key Difference from Layered:
Layered:
- Domain depends on infrastructure interfaces
- Change DB → Change domain
Clean/Hexagonal:
- Infrastructure depends on domain interfaces
- Change DB → Only change adapter
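A sketch of that inversion in Python (illustrative names): the port is an interface owned by the domain, the adapter implements it, so swapping PostgreSQL for anything else never touches domain code.
from abc import ABC, abstractmethod
class UserRepository(ABC):                     # Port: lives in the domain
    @abstractmethod
    def save(self, user): ...
class PostgresUserRepository(UserRepository):  # Adapter: infrastructure
    def __init__(self, connection):
        self.connection = connection
    def save(self, user):
        # SQL details stay here; the domain never sees them
        self.connection.execute(
            "INSERT INTO users (id, email) VALUES (%s, %s)",
            (user.id, user.email),
        )
class CreateUserUseCase:                       # Core depends on the port only
    def __init__(self, repository: UserRepository):
        self.repository = repository
    def execute(self, user):
        self.repository.save(user)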
Benefits:
1. Framework independence
2. Testability (mock ports)
3. UI independence
4. Database independence
Key Points to Look For:
- Understands dependency direction
- Knows ports and adapters
- Can explain benefits
Follow-up: How do you handle cross-cutting concerns like logging?
Event-Driven Architecture
What is Event-Driven Architecture? When would you use it?
Event-Driven Architecture (EDA):
Systems communicate by producing and consuming events.
┌─────────┐ Event ┌────────────┐
│ Service │────────────────│ Event Bus │
│ A │ │ (Kafka, │
└─────────┘ │ RabbitMQ) │
└─────┬──────┘
┌─────┴──────┐
┌──────┴────┐ ┌─────┴─────┐
│ Service B │ │ Service C │
└───────────┘ └───────────┘
Event Types:
1. Domain Events:
// Something that happened in the domain
public class OrderPlaced {
UUID orderId;
UUID customerId;
BigDecimal total;
Instant occurredAt;
}
2. Integration Events:
// Events for external systems
public class OrderPlacedIntegrationEvent {
String orderId; // Strings for compatibility
String timestamp;
}
Patterns:
Event Notification:
"Something happened" → Consumers query for details
Loose coupling, may need callbacks
Event-Carried State Transfer:
Event contains all needed data
No callbacks needed, eventual consistency
Event Sourcing:
Store events as source of truth
Rebuild state by replaying events
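The first two patterns differ mainly in what the event carries. A sketch (illustrative names):
from dataclasses import dataclass
from decimal import Decimal
@dataclass
class OrderPlacedNotification:   # Event Notification
    order_id: str                # consumers call back for the details
@dataclass
class OrderPlacedWithState:      # Event-Carried State Transfer
    order_id: str
    customer_id: str
    total: Decimal
    items: list                  # full payload, no callback needed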
Benefits:
1. Loose coupling - Producers don't know consumers
2. Scalability - Add consumers independently
3. Resilience - Events can be replayed
4. Audit trail - Event history
Challenges:
1. Eventual consistency - Not immediate
2. Debugging - Harder to trace flow
3. Event ordering - Need careful design
4. Idempotency - Handle duplicate events
When to Use:
- Decoupled services
- Async is acceptable
- Audit trail needed
- High scalability needed
When NOT to Use:
- Strong consistency required
- Simple CRUD operations
- Small systems
Key Points to Look For:
- Knows event types
- Understands trade-offs
- Can identify use cases
Follow-up: How do you ensure event ordering?
CQRS pattern explained
What is CQRS and when would you use it?
CQRS (Command Query Responsibility Segregation):
Separate read and write models.
Traditional (Single Model):
Client → API → Service → Repository → Database
↑ │
└──────────────────────────────────────┘
Same model for reads/writes
CQRS:
┌─────────────────────────────┐
Write ────→│ Command Handler → Write DB │
└─────────────────────────────┘
│
Sync (events)
│
┌────────────▼────────────────┐
Read ─────→│ Query Handler → Read DB │
└─────────────────────────────┘
Components:
Commands (Write):
public class PlaceOrderCommand {
UUID customerId;
List<LineItem> items;
}
public class PlaceOrderHandler {
void handle(PlaceOrderCommand cmd) {
Order order = Order.create(cmd);
orderRepository.save(order);
eventBus.publish(new OrderPlaced(order));
}
}
Queries (Read):
public class GetOrderSummaryQuery {
UUID orderId;
}
public class GetOrderSummaryHandler {
OrderSummaryDTO handle(GetOrderSummaryQuery query) {
return readDB.getOrderSummary(query.orderId);
}
}
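One way to keep the read side in sync (a sketch, assuming the OrderPlaced events from the write side are delivered by an event bus; field names illustrative): a projection handler denormalizes each event into the shape queries need, and upserting keeps it idempotent under redelivery.
class OrderSummaryProjection:
    def __init__(self, read_db):
        self.read_db = read_db
    def on_order_placed(self, event):
        # Idempotent upsert keyed by order_id
        self.read_db.upsert("order_summaries", {
            "order_id": event.order_id,
            "customer_id": event.customer_id,
            "total": event.total,
            "status": "PLACED",
        })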
Benefits:
1. Optimized models - Read model for queries, write for commands
2. Scalability - Scale reads independently
3. Simplicity - Each side is simpler
4. Performance - Denormalized read model
When to Use:
- Read/write patterns differ significantly
- Complex domain with simple queries
- Need separate scaling
- Event sourcing
When NOT to Use:
- Simple CRUD
- Small applications
- Team unfamiliar with pattern
CQRS + Event Sourcing:
Command → Event Store → Events → Projections → Read DB
Key Points to Look For:
- Understands separation concept
- Knows benefits and trade-offs
- Can identify appropriate use cases
Follow-up: How do you handle consistency between read and write models?
Domain-Driven Design basics
What are the key concepts of Domain-Driven Design?
DDD focuses on complex domain modeling and collaboration with domain experts.
Strategic Patterns:
1. Bounded Context:
┌───────────────┐ ┌───────────────┐
│ Sales │ │ Shipping │
│ Context │ │ Context │
│ │ │ │
│ Customer: │ │ Customer: │
│ - name │ │ - address │
│ - creditLimit │ │ - deliveryPref│
└───────────────┘ └───────────────┘
Same word, different meaning!
2. Ubiquitous Language:
Shared vocabulary between developers and domain experts.
// Code matches domain language
class Order {
void place() { } // Not "save" or "create"
void fulfill() { } // Domain term
void cancel() { }
}
3. Context Mapping:
Sales ←─(Customer/Supplier)─→ Billing
←─(Shared Kernel)─→ Inventory
←─(Anti-corruption Layer)─→ Legacy
Tactical Patterns:
1. Entities:
Objects with identity.
class Order {
private OrderId id; // Identity
// Two orders with same data but different IDs are different
}
2. Value Objects:
Objects without identity, defined by attributes.
class Money {
private BigDecimal amount;
private Currency currency;
// Two Money with same amount/currency are equal
}
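The same idea in Python (a sketch): a frozen dataclass gives value semantics, so equality compares attributes, not identity.
from dataclasses import dataclass
from decimal import Decimal
@dataclass(frozen=True)
class Money:
    amount: Decimal
    currency: str
Money(Decimal("10"), "USD") == Money(Decimal("10"), "USD")  # True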
3. Aggregates:
Cluster of entities with consistency boundary.
class Order { // Aggregate Root
private List<LineItem> items; // Part of aggregate
void addItem(Product p, int qty) {
// Order controls consistency of items
}
}
4. Repositories:
Collection-like interface for aggregates.
interface OrderRepository {
Order findById(OrderId id);
void save(Order order);
}
5. Domain Services:
Operations that don't belong to any entity.
class PricingService {
Money calculateTotal(Order order, Customer customer) {
// Complex pricing across multiple entities
}
}
6. Domain Events:
class OrderPlaced {
OrderId orderId;
CustomerId customerId;
Instant occurredAt;
}
Key Points to Look For:
- Knows strategic vs tactical
- Understands bounded contexts
- Can explain aggregates
Follow-up: How do you communicate between bounded contexts?
System Design Concepts
Horizontal vs Vertical scaling
What's the difference between horizontal and vertical scaling?
Vertical Scaling (Scale Up):
Add more resources to existing machine.
Before: After:
┌────────┐ ┌────────────┐
│ 4 CPU │ │ 16 CPU │
│ 8GB RAM│ → │ 64GB RAM │
│ 100GB │ │ 1TB SSD │
└────────┘ └────────────┘
Horizontal Scaling (Scale Out):
Add more machines.
Before: After:
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Server │ │Server 1│ │Server 2│ │Server 3│
└────────┘ └────────┘ └────────┘ └────────┘
└──────────┼──────────┘
Load Balancer
Comparison:
| Aspect | Vertical | Horizontal |
|---|---|---|
| Complexity | Simple | Complex |
| Limit | Hardware max | Virtually unlimited |
| Downtime | Often needed | Zero-downtime |
| Cost | Expensive | Cost-effective |
| Availability | Single point | High availability |
| Data consistency | Easy | Challenging |
When to Use:
Vertical:
- Database servers (initially)
- Simple applications
- Quick fix needed
- Stateful applications
Horizontal:
- Web servers
- Microservices
- High availability needed
- Unpredictable growth
Challenges with Horizontal:
1. State management - Sessions, cache
2. Data consistency - Distributed transactions
3. Load balancing - Request distribution
4. Service discovery - Finding instances
Best Practice:
Start vertical (simpler); scale horizontally when needed.
Design stateless from the beginning.
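A minimal sketch of statelessness, assuming Redis as the shared session store (names illustrative): the server keeps nothing in memory, so any instance can serve any request.
import json
import uuid
def create_session(redis, user_id):
    session_id = str(uuid.uuid4())
    redis.setex(f"session:{session_id}", 3600, json.dumps({"user_id": user_id}))
    return session_id   # handed to the client as a cookie
def load_session(redis, session_id):
    data = redis.get(f"session:{session_id}")
    return json.loads(data) if data else None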
Key Points to Look For:
- Knows trade-offs
- Understands complexity
- Can advise on when to use each
Follow-up: How do you handle session state with horizontal scaling?
Load balancing strategies
What are different load balancing strategies?
Load Balancer: Distributes incoming requests across multiple servers.
Clients
│
┌──────▼──────┐
│Load Balancer│
└──────┬──────┘
┌─────┼─────┐
│ │ │
┌────▼┐ ┌──▼──┐ ┌▼────┐
│ S1 │ │ S2 │ │ S3 │
└─────┘ └─────┘ └─────┘
Strategies:
1. Round Robin:
Request 1 → Server 1
Request 2 → Server 2
Request 3 → Server 3
Request 4 → Server 1 (cycle)
Simple but ignores server capacity.
2. Weighted Round Robin:
Server 1 (weight 3): Gets 3 of every 6 requests
Server 2 (weight 2): Gets 2 of every 6 requests
Server 3 (weight 1): Gets 1 of every 6 requests
3. Least Connections:
Server 1: 10 active connections
Server 2: 5 active connections
Server 3: 8 active connections
→ Send to Server 2
Good for varying request duration.
4. IP Hash:
hash(client_ip) % num_servers → Server
Same client always hits same server
Good for session affinity (sticky sessions).
5. Least Response Time:
Server 1: avg 50ms
Server 2: avg 30ms ← Send here
Server 3: avg 45ms
6. Random:
Simple, works well with many servers.
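Minimal in-memory sketches of two of the strategies above (illustrative, not how production load balancers are built):
import itertools
class RoundRobin:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)
    def pick(self):
        return next(self._cycle)
class LeastConnections:
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}
    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server
    def release(self, server):   # call when the request finishes
        self.active[server] -= 1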
Layer 4 vs Layer 7:
| Layer 4 (Transport) | Layer 7 (Application) |
|---|---|
| TCP/UDP level | HTTP level |
| Faster | More features |
| Can't inspect content | Content-based routing |
| Connection-based | Request-based |
Layer 7 Features:
/api/* → API servers
/static/* → CDN
/admin/* → Admin servers
Health Checks:
Active: LB pings servers
Passive: Monitor responses
Unhealthy → Remove from pool
Healthy → Add back
Key Points to Look For:
- Knows multiple strategies
- Understands Layer 4 vs 7
- Mentions health checks
Follow-up: How do you handle session stickiness?
Caching strategies: write-through, write-back, write-around
Explain different caching write strategies.
Caching reduces latency and database load.
1. Cache-Aside (Lazy Loading):
Application manages cache.
Read:
1. Check cache
2. If miss, read DB
3. Write to cache
4. Return
Write:
1. Write to DB
2. Invalidate/update cache
def get_user(user_id):
user = cache.get(f"user:{user_id}")
if user is None:
user = db.get_user(user_id)
cache.set(f"user:{user_id}", user, ttl=3600)
return user
Pros: Only cache what's needed
Cons: Cache miss penalty, stale data possible
2. Write-Through:
Write to cache and DB synchronously.
Write:
1. Write to cache
2. Cache writes to DB
3. Return
Read:
1. Read from cache (always fresh)
App → Cache → DB
↑
Synchronous
Pros: Cache always fresh
Cons: Write latency, cache may fill with unused data
3. Write-Back (Write-Behind):
Write to cache, async write to DB.
Write:
1. Write to cache
2. Return immediately
3. Cache writes to DB async (batched)
App → Cache ···→ DB (async)
│
Immediate return
Pros: Fast writes
Cons: Data loss risk if cache fails
4. Write-Around:
Write directly to DB, bypass cache.
Write:
1. Write to DB only
2. Cache gets populated on read
Read:
1. Check cache
2. If miss, read DB, populate cache
Pros: Cache not flooded with writes
Cons: A read right after a write misses the cache; a previously cached entry can serve stale data unless invalidated
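Sketches of the two cache-managed strategies above, assuming simple cache/db interfaces (illustrative):
class WriteThroughCache:
    def __init__(self, cache, db):
        self.cache, self.db = cache, db
    def put(self, key, value):
        self.cache.set(key, value)
        self.db.write(key, value)          # synchronous: caller waits for both
class WriteBackCache:
    def __init__(self, cache, db):
        self.cache, self.db = cache, db
        self.pending = []
    def put(self, key, value):
        self.cache.set(key, value)
        self.pending.append((key, value))  # acknowledged before durable
    def flush(self):                       # run periodically, in batches
        while self.pending:
            key, value = self.pending.pop(0)
            self.db.write(key, value)      # data-loss window closes here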
Comparison:
| Strategy | Read Perf | Write Perf | Consistency | Durability |
|---|---|---|---|---|
| Cache-Aside | Good | Medium | Medium | High |
| Write-Through | Best | Low | High | High |
| Write-Back | Best | Best | High | Low |
| Write-Around | Medium | High | Low | High |
Key Points to Look For:
- Knows multiple strategies
- Understands trade-offs
- Can choose based on requirements
Follow-up: How do you handle cache invalidation?
CDN and edge caching
How does a CDN work? When would you use one?
CDN (Content Delivery Network):
Distributed servers that cache content close to users.
User in Tokyo
│
┌──────────▼──────────┐
│ Tokyo Edge Server │ ← Cache hit!
│ (CDN PoP) │
└──────────┬──────────┘
│ Cache miss
┌──────────▼──────────┐
│ Origin Server │
│ (Your server in US) │
└─────────────────────┘
How It Works:
1. User requests content
2. CDN edge receives request
3. If cached → Return immediately
4. If not → Fetch from origin, cache, return
Content Types:
Static Content:
- Images, CSS, JS
- Videos, downloads
- Fonts, documents
Dynamic Content (with Edge Computing):
- Personalized pages
- API responses (short TTL)
- Server-side rendering
Benefits:
1. Latency - Geographically closer
2. Bandwidth - Offload origin server
3. Availability - Redundant edge locations
4. DDoS protection - Distributed defense
CDN Configuration:
# Cache rules
/static/* → Cache 1 year
/api/public/* → Cache 5 minutes
/api/private/* → No cache
/*.html → Cache 1 hour, stale-while-revalidate
Cache Headers:
Cache-Control: public, max-age=31536000, immutable
Cache-Control: public, max-age=300, stale-while-revalidate=60
Cache-Control: private, no-store
Cache Invalidation:
# Purge specific URL
cdn.purge("/static/app.js")
# Purge by tag
cdn.purge(tag="product-images")
# Version in URL (preferred)
/static/app.v123.js
When to Use:
- Global user base
- Static assets
- High traffic
- Video streaming
- API caching
Providers:
Cloudflare, AWS CloudFront, Akamai, Fastly
Key Points to Look For:
- Understands how CDN works
- Knows cache headers
- Mentions invalidation challenges
Follow-up: How do you handle cache invalidation for dynamic content?
Rate limiting algorithms: token bucket, leaky bucket
Explain token bucket and leaky bucket rate limiting algorithms.
Purpose: Prevent abuse, ensure fair usage, protect resources.
Token Bucket:
Bucket fills with tokens at fixed rate
Each request consumes a token
No token → Request rejected
┌─────────────┐
│ ●●●●○○○○○○ │ ← Tokens (5/10 available)
└─────────────┘
↑ Fill rate: 1/second
Implementation:
class TokenBucket:
def __init__(self, capacity, refill_rate):
self.capacity = capacity
self.tokens = capacity
self.refill_rate = refill_rate
self.last_refill = time.time()
def allow_request(self):
self._refill()
if self.tokens >= 1:
self.tokens -= 1
return True
return False
def _refill(self):
now = time.time()
elapsed = now - self.last_refill
self.tokens = min(
self.capacity,
self.tokens + elapsed * self.refill_rate
)
self.last_refill = now
Characteristics:
- Allows bursts (up to bucket capacity)
- Smooth average rate
- Simple to implement
Leaky Bucket:
Requests enter bucket
Bucket "leaks" at constant rate
Overflow → Request rejected
↓ Requests
┌─────────────┐
│ ●●●●●●●● │ ← Buffer
└─────┬───────┘
↓ Constant outflow rate
[Process]
Implementation:
class LeakyBucket:
def __init__(self, capacity, leak_rate):
self.capacity = capacity
self.water = 0
self.leak_rate = leak_rate
self.last_leak = time.time()
def allow_request(self):
self._leak()
if self.water < self.capacity:
self.water += 1
return True
return False
def _leak(self):
now = time.time()
elapsed = now - self.last_leak
self.water = max(0, self.water - elapsed * self.leak_rate)
self.last_leak = now
Characteristics:
- Constant output rate
- Smooths bursts
- May add latency (queue)
Comparison:
| Aspect | Token Bucket | Leaky Bucket |
|---|---|---|
| Bursts | Allows | Smooths |
| Output rate | Variable | Constant |
| Simplicity | Simple | Simple |
| Use case | API rate limiting | Traffic shaping |
Other Algorithms:
Fixed Window:
Window: 00:00-01:00 → 100 requests allowed
Problem: 200 requests possible at boundary
Sliding Window Log:
Track timestamp of each request
Count requests in last N seconds
Sliding Window Counter:
Combine fixed windows with weighting
Previous window count × overlap + current count
Key Points to Look For:
- Knows both algorithms
- Understands burst handling
- Can implement basic version
Follow-up: How would you implement distributed rate limiting?
Circuit breaker pattern
What is the circuit breaker pattern? How does it work?
Circuit Breaker: Prevents cascading failures by failing fast when a service is unhealthy.
States:
Success
┌─────────────────┐
│ │
▼ Failure │
┌────────┐ threshold ┌▼───────┐
│ CLOSED │──────────→│ OPEN │
└────────┘ └───┬────┘
▲ │
│ Timeout expires │
│ ▼
│ ┌───────────┐
│ Success │ HALF-OPEN │
└──────────────┴───────────┘
│
│ Failure
└────────→ Back to OPEN
States Explained:
CLOSED (Normal):
- Requests flow through
- Track failure count/rate
- If threshold exceeded → OPEN
OPEN (Failing Fast):
- Reject requests immediately
- Don't call downstream
- After timeout → HALF-OPEN
HALF-OPEN (Testing):
- Allow limited requests through
- If success → CLOSED
- If failure → OPEN
Implementation:
class CircuitBreaker:
def __init__(self, failure_threshold=5, timeout=30):
self.state = "CLOSED"
self.failures = 0
self.failure_threshold = failure_threshold
self.timeout = timeout
self.last_failure_time = None
def call(self, func):
if self.state == "OPEN":
if time.time() - self.last_failure_time > self.timeout:
self.state = "HALF-OPEN"
else:
raise CircuitOpenException()
try:
result = func()
self._on_success()
return result
except Exception as e:
self._on_failure()
raise
def _on_success(self):
self.failures = 0
self.state = "CLOSED"
def _on_failure(self):
self.failures += 1
self.last_failure_time = time.time()
if self.failures >= self.failure_threshold:
self.state = "OPEN"
Benefits:
1. Fail fast - Don't wait for timeouts
2. Protect downstream - Give service time to recover
3. Provide fallback - Graceful degradation
4. Resource conservation - Don't waste connections
With Fallback:
@CircuitBreaker(name = "inventory", fallbackMethod = "getDefaultInventory")
public Inventory getInventory(String productId) {
return inventoryService.get(productId);
}
public Inventory getDefaultInventory(String productId, Exception ex) {
return new Inventory(productId, 0, "UNKNOWN");
}
Libraries:
- Resilience4j (Java)
- Polly (.NET)
- Hystrix (deprecated)
Key Points to Look For:
- Knows all three states
- Understands failure detection
- Mentions fallback handling
Follow-up: How do you determine appropriate thresholds?
Bulkhead pattern for fault isolation
What is the Bulkhead pattern? How does it improve resilience?
Bulkhead: Isolate components to contain failures, like ship compartments.
Ship without bulkheads: Ship with bulkheads:
┌────────────────────┐ ┌──────┬──────┬──────┐
│ Flooding │ │ OK │Flood │ OK │
│ ~~~~~~~~~~~~~~~~ │ │ │~~~~~~│ │
└────────────────────┘ └──────┴──────┴──────┘
SINKS! STAYS AFLOAT
In Software:
Without Bulkhead:
┌─────────────────────────────────────┐
│ Shared Thread Pool │
│ Orders ─────→ ●●●●●●●●●● │
│ Users ──────→ (stuck) │
│ Products ───→ (stuck) │
└─────────────────────────────────────┘
One slow service blocks everything!
With Bulkhead:
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Orders │ │ Users │ │ Products │
│ ●●●●●●●●●●│ │ ●●● │ │ ●●●● │
│ (stuck) │ │ (working) │ │ (working) │
└───────────┘ └───────────┘ └───────────┘
Failure contained!
Implementation Types:
1. Thread Pool Bulkhead:
// Separate thread pools per service
ExecutorService ordersPool = Executors.newFixedThreadPool(10);
ExecutorService usersPool = Executors.newFixedThreadPool(5);
ExecutorService productsPool = Executors.newFixedThreadPool(5);
// Orders being slow doesn't affect Users
ordersPool.submit(() -> callOrderService());
usersPool.submit(() -> callUserService());
2. Semaphore Bulkhead:
Semaphore ordersSemaphore = new Semaphore(10);
void callOrderService() {
if (ordersSemaphore.tryAcquire()) {
try {
// Call service
} finally {
ordersSemaphore.release();
}
} else {
throw new BulkheadFullException();
}
}
3. Connection Pool Bulkhead:
# Separate pools per external service
datasource:
orders:
maximum-pool-size: 10
users:
maximum-pool-size: 5
With Resilience4j:
@Bulkhead(name = "orderService", type = Bulkhead.Type.SEMAPHORE)
public Order getOrder(String id) {
return orderClient.get(id);
}
// Configuration
resilience4j.bulkhead:
instances:
orderService:
maxConcurrentCalls: 10
maxWaitDuration: 100ms
Benefits:
1. Fault isolation - Failures don't cascade
2. Fair resource allocation - Critical services protected
3. Predictable behavior - Known limits
4. Graceful degradation - Partial failures
Key Points to Look For:
- Understands isolation concept
- Knows implementation approaches
- Can size bulkheads
Follow-up: How do you combine bulkhead with circuit breaker?
Distributed Systems
CAP theorem in practice
How do you apply CAP theorem when designing systems?
Recap: During partition, choose Consistency or Availability.
Practical Application:
1. Identify Partition Tolerance Requirement:
Single datacenter, reliable network?
→ Partitions rare, might accept CA behavior
Multi-region, microservices?
→ Partitions will happen, plan for CP or AP
2. Per-Feature Decision:
Same system, different requirements:
Shopping Cart: AP
- Show cart even if stale
- User can add items, reconcile later
Checkout/Payment: CP
- Block until consistent
- Can't afford duplicate charges
3. Tunable Consistency:
// Cassandra: Consistency level per query
// Quorum = majority must respond
session.execute(
QueryBuilder.select()
.from("orders")
.where(eq("id", orderId))
.setConsistencyLevel(ConsistencyLevel.QUORUM)
);
// Strong consistency: QUORUM write + QUORUM read
// Eventual consistency: ONE write + ONE read
4. Design for Failure:
# Handle partition gracefully
def get_user_profile(user_id):
try:
return user_service.get(user_id, timeout=1)
except (TimeoutError, ConnectionError):
# AP: Return cached/default data
return cache.get(f"user:{user_id}") or DEFAULT_PROFILE
5. Consider PACELC:
Normal operation: What's the latency vs consistency trade-off?
Example: DynamoDB
- Partition: Available (AP)
- Else: Choose latency vs consistency
- Eventual: Faster reads
- Strong: Wait for leader
Real System Examples:
| System | During Partition | Normal |
|---|---|---|
| DynamoDB | AP | Tunable |
| Cassandra | AP | Tunable |
| MongoDB | CP | Strong |
| Spanner | CP | Strong |
| CockroachDB | CP | Strong |
Key Points to Look For:
- Applies per-feature, not system-wide
- Knows tunable consistency
- Understands practical implications
Follow-up: How do you test partition handling?
Consistency models in distributed systems
What are different consistency models in distributed systems?
Consistency Models (Strongest to Weakest):
1. Linearizability (Strict):
Operations appear instantaneous at some point.
Write X=5 at T1
Read at T2 (T2 > T1) → Must see 5
Global ordering exists
Like a single server
2. Sequential Consistency:
Operations appear in SOME total order consistent with program order.
Thread 1: Write X=1, Write X=2
Thread 2: Read X, Read X
Valid: Read 1, Read 2
Valid: Read 2, Read 2
Invalid: Read 2, Read 1 (order violation)
3. Causal Consistency:
Causally related operations seen in order; concurrent operations may vary.
A writes X=1
A writes Y=2 (caused by A seeing X=1)
→ If B sees Y=2, B must also see X=1
But concurrent writes can be seen in different order.
4. Eventual Consistency:
Given enough time without updates, all replicas converge.
Write X=5 to Node A
Eventually, Node B sees X=5
No timing guarantee
5. Read-Your-Writes:
Client always sees their own writes.
Write X=5
Read X → 5 (guaranteed)
But other clients may not see it yet.
6. Monotonic Reads:
Once you see a value, you won't see older values.
Read X → 5
Read X → 5 or newer, never older
Implementation Patterns:
Strong Consistency:
# Synchronous replication
def write(key, value):
primary.write(key, value)
for replica in replicas:
replica.write(key, value) # Wait for all
return success
Eventual Consistency:
# Async replication
def write(key, value):
primary.write(key, value)
queue.enqueue(replicate, key, value) # Async
return success
Read-Your-Writes:
# Track write version
def write(key, value):
version = db.write(key, value)
session.last_write_version[key] = version
def read(key):
result = db.read(key)
if result.version < session.last_write_version.get(key, 0):
result = db.read_from_primary(key) # Force primary
return result
Key Points to Look For:
- Knows multiple models
- Understands trade-offs
- Can implement basic patterns
Follow-up: How do you choose consistency model for a given use case?
Distributed transactions: saga pattern
What is the Saga pattern? How does it handle distributed transactions?
Saga: Sequence of local transactions with compensating actions for rollback.
Problem: Can't use ACID transactions across services.
Order Service → Payment Service → Inventory Service
│ │ │
└───────────────┴──────────────────┘
No distributed transaction!
Saga Types:
1. Choreography:
Services communicate through events.
Order Payment Inventory
│ │ │
│ OrderCreated │ │
│───────────────→│ │
│ │ PaymentSucceeded
│ │───────────────→│
│ │ │ InventoryReserved
│←───────────────┼────────────────│
2. Orchestration:
Central coordinator manages saga.
┌───────────────┐
│ Orchestrator │
└───────┬───────┘
┌───────┼───────┐
│ │ │
▼ ▼ ▼
Order Payment Inventory
Implementation (Orchestration):
class CreateOrderSaga:
def __init__(self):
self.steps = [
Step(OrderService.create, OrderService.cancel),
Step(PaymentService.charge, PaymentService.refund),
Step(InventoryService.reserve, InventoryService.release),
]
def execute(self, order_data):
completed = []
try:
for step in self.steps:
step.forward(order_data)
completed.append(step)
except Exception:
# Compensate in reverse order
for step in reversed(completed):
step.compensate(order_data)
raise SagaFailed()
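The Step helper used above might look like this (a sketch):
class Step:
    def __init__(self, forward, compensate):
        self.forward = forward        # the local transaction
        self.compensate = compensate  # its compensating action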
Compensation:
Happy Path:
T1 → T2 → T3 → Success
Failure at T3:
T1 → T2 → T3 (fails)
↓
C2 ← C1 (compensate)
Considerations:
1. Compensations must be idempotent:
def refund(payment_id):
if not already_refunded(payment_id):
process_refund(payment_id)
2. Handle partial failures:
# What if compensation fails?
# Retry with backoff
# Dead letter queue for manual intervention
3. State tracking:
class SagaState:
id: str
current_step: int
status: Literal["RUNNING", "COMPLETED", "COMPENSATING", "FAILED"]
data: dict
Choreography vs Orchestration:
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coupling | Loose | Tight to orchestrator |
| Complexity | Distributed | Centralized |
| Debugging | Harder | Easier |
| Single failure | Resilient | SPOF risk |
Key Points to Look For:
- Knows both types
- Understands compensation
- Handles failure scenarios
Follow-up: How do you ensure sagas are idempotent?
Message queues: when and why
When should you use a message queue? What problems does it solve?
Message Queue: Async communication between services.
Producer → Queue → Consumer
│
Decoupled!
When to Use:
1. Async Processing:
# Synchronous (slow)
def create_order(order):
save_order(order)
send_email(order) # Wait
update_analytics(order) # Wait
return order
# Async with queue (fast)
def create_order(order):
save_order(order)
queue.publish("order.created", order) # Fire and forget
return order
# Consumers process later
@subscribe("order.created")
def handle_order(order):
send_email(order)
update_analytics(order)
2. Load Leveling:
Spike: ████████████████ (1000 req/s)
↓
Queue: [●●●●●●●●●●●●●●●●] (buffer)
↓
Consumer: ███ (100 req/s steady)
3. Decoupling:
Without queue:
Order Service → Inventory Service
↓
Payment Service
With queue:
Order Service → Queue ← Inventory Service
← Payment Service
Services don't know about each other
4. Reliability:
Message persisted → Consumer can fail/restart
At-least-once delivery
Dead letter queue for failures
Common Patterns:
Work Queue:
Producer → Queue → Consumer 1
→ Consumer 2
→ Consumer 3
Load distributed among workers
Pub/Sub:
Producer → Exchange → Queue A → Consumer A
→ Queue B → Consumer B
Multiple consumers get same message
When NOT to Use:
- Need immediate response
- Simple request-response
- Low latency critical
- Small scale / simple systems
Technologies:
- RabbitMQ: Traditional, AMQP
- Kafka: High throughput, log-based
- SQS: AWS managed
- Redis: Simple pub/sub
Key Points to Look For:
- Knows multiple use cases
- Understands trade-offs
- Mentions reliability patterns
Follow-up: What's the difference between Kafka and RabbitMQ?
Event sourcing explained
What is event sourcing? When would you use it?
Event Sourcing: Store all changes as a sequence of events, not just current state.
Traditional:
┌─────────────────┐
│ Account │
│ balance: $100 │ ← Only current state
└─────────────────┘
Event Sourcing:
┌────────────────────────────────────┐
│ Event Store │
│ 1. AccountCreated($0) │
│ 2. Deposited($100) │
│ 3. Withdrawn($30) │
│ 4. Deposited($50) │
│ Current: Replay → $120 │
└────────────────────────────────────┘
Implementation:
# Events
@dataclass
class AccountCreated:
account_id: str
timestamp: datetime
@dataclass
class MoneyDeposited:
account_id: str
amount: Decimal
timestamp: datetime
# Aggregate
class Account:
def __init__(self, events):
self.balance = Decimal(0)
for event in events:
self.apply(event)
    def apply(self, event):
        if isinstance(event, AccountCreated):
            self.account_id = event.account_id
        elif isinstance(event, MoneyDeposited):
            self.balance += event.amount
        elif isinstance(event, MoneyWithdrawn):
            self.balance -= event.amount
    def deposit(self, amount):
        if amount <= 0:
            raise InvalidAmount()
        return MoneyDeposited(self.account_id, amount, datetime.now())
# Usage
events = event_store.get_events(account_id)
account = Account(events)
new_event = account.deposit(100)
event_store.append(account_id, new_event)
Projections (CQRS Read Models):
# Build read-optimized views from events
class AccountBalanceProjection:
def __init__(self):
self.balances = {}
def handle(self, event):
if isinstance(event, MoneyDeposited):
self.balances[event.account_id] = \
self.balances.get(event.account_id, 0) + event.amount
Benefits:
1. Complete audit trail
2. Time travel - Rebuild state at any point
3. Debug - See exactly what happened
4. Derived views - Create any projection
5. Event replay - Fix bugs retroactively
Challenges:
1. Event schema evolution
2. Eventual consistency
3. Query complexity (need projections)
4. Storage growth (snapshots help; see the sketch below)
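A sketch of snapshotting (the snapshot_store API here is illustrative): persist the aggregate state every N events, then replay only the tail.
def load_account(event_store, snapshot_store, account_id):
    # latest() returns (account, last_event_version) or None
    snapshot = snapshot_store.latest(account_id)
    account, version = snapshot if snapshot else (Account([]), 0)
    for event in event_store.get_events(account_id, after_version=version):
        account.apply(event)
    return account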
When to Use:
- Audit trail required
- Complex domain
- CQRS fits well
- Need temporal queries
When NOT to Use:
- Simple CRUD
- No audit needs
- Team unfamiliar
Key Points to Look For:
- Understands event vs state storage
- Knows projections
- Can explain trade-offs
Follow-up: How do you handle event schema changes?
Idempotency in distributed systems
What is idempotency and why is it important in distributed systems?
Idempotent Operation: Same result no matter how many times executed.
f(x) = f(f(x)) = f(f(f(x))) = ...
Why Important:
Client → Server
↓ (request)
Server processes
↓ (response lost!)
Client retries
Server processes AGAIN ← Problem!
Examples:
Idempotent:
# GET - reading doesn't change state
GET /users/123
# PUT - setting specific value
PUT /users/123 {"name": "Alice"}
# DELETE - already deleted = same result
DELETE /orders/456
NOT Idempotent:
# POST - creates new resource each time
POST /users {"name": "Alice"} # Creates user 1
POST /users {"name": "Alice"} # Creates user 2!
# Increment without guard
POST /accounts/123/deposit {"amount": 100}
# Double charge on retry!
Making Operations Idempotent:
1. Idempotency Key:
# Client sends unique key
POST /payments
Idempotency-Key: abc123
{"amount": 100}
# Server checks key before processing
def process_payment(key, amount):
if redis.exists(f"idempotency:{key}"):
return get_cached_response(key)
result = charge_card(amount)
redis.setex(f"idempotency:{key}", 86400, result)
return result
2. Database Constraints:
-- Unique constraint prevents duplicates
CREATE TABLE payments (
id SERIAL PRIMARY KEY,
idempotency_key VARCHAR(255) UNIQUE,
amount DECIMAL
);
-- Insert fails on duplicate key
INSERT INTO payments (idempotency_key, amount)
VALUES ('abc123', 100);
3. Check-and-Set:
def transfer(from_id, to_id, amount, transfer_id):
# Check if already processed
if Transfer.exists(transfer_id):
return "Already processed"
# Process
with transaction():
Transfer.create(id=transfer_id, ...)
Account.debit(from_id, amount)
Account.credit(to_id, amount)
Best Practices:
1. Use idempotency keys for non-idempotent operations
2. Store key → result mapping
3. Set reasonable expiry
4. Generate keys client-side
Key Points to Look For:
- Understands retry problem
- Knows implementation patterns
- Mentions idempotency keys
Follow-up: How long should you keep idempotency keys?
Leader election algorithms
How do leader election algorithms work in distributed systems?
Leader Election: Designate one node as leader to coordinate actions.
Why Needed:
- Single writer for consistency
- Coordination tasks
- Distributed locks
- Consensus protocols
Algorithms:
1. Bully Algorithm:
Nodes: 1, 2, 3, 4, 5 (higher = higher priority)
Current leader: 5
Node 5 fails
Node 3 notices, starts election
Node 3 → Sends ELECTION to 4, 5
Node 4 → Responds OK (higher, will take over)
Node 4 → Sends ELECTION to 5
(No response from 5)
Node 4 → Broadcasts COORDINATOR
Node 4 is new leader
2. Ring Algorithm:
Nodes in logical ring: 1 → 2 → 3 → 4 → 5 → 1
Node 3 starts election
Sends [3] to node 4
Node 4 adds ID: [3, 4] → sends to 5
Node 5 adds: [3, 4, 5] → sends to 1
... around the ring
When message returns with all IDs
Highest ID is leader
3. Raft Election:
States: Follower → Candidate → Leader
1. Follower times out (no heartbeat from leader)
2. Becomes Candidate, votes for self
3. Requests votes from others
4. If majority votes: Becomes Leader
5. Leader sends heartbeats
Using ZooKeeper/etcd:
# ZooKeeper ephemeral sequential nodes
def elect_leader(zk, path):
# Create ephemeral sequential node
my_node = zk.create(
f"{path}/candidate-",
ephemeral=True,
sequence=True
)
while True:
children = sorted(zk.get_children(path))
if my_node == children[0]:
# I'm the leader!
return True
else:
# Watch predecessor
predecessor = children[children.index(my_node) - 1]
zk.watch(predecessor, on_change=check_leader)
wait()
Using Redis:
def try_become_leader(redis, key, node_id, ttl=30):
# SET if not exists with expiry
acquired = redis.set(key, node_id, nx=True, ex=ttl)
if acquired:
# Extend periodically
start_heartbeat(redis, key, node_id, ttl)
return acquired
Considerations:
1. Split brain - Multiple leaders
2. Network partitions - Need majority
3. Failover time - Election duration
4. Thundering herd - Staggered timeouts
Key Points to Look For:
- Knows multiple algorithms
- Understands consensus
- Considers failure scenarios
Follow-up: What happens during a network partition?
Scalability Scenarios
Design a URL shortener
Design a URL shortening service like bit.ly.
Requirements:
- Shorten URL: Long → Short
- Redirect: Short → Long
- Analytics (optional): Click tracking
API:
POST /shorten
Body: {"url": "https://example.com/very/long/path"}
Response: {"short_url": "https://short.ly/abc123"}
GET /abc123
Response: 301 Redirect to original URL
Short URL Generation:
Option 1: Counter + Base62:
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
def encode(num):
if num == 0:
return ALPHABET[0]
result = []
while num:
result.append(ALPHABET[num % 62])
num //= 62
return ''.join(reversed(result))
# Counter: 1000000 → "4c92"
# 7 chars = 62^7 = 3.5 trillion URLs
Option 2: Hash + Truncate:
import hashlib
def generate_short(url):
    code = hashlib.md5(url.encode()).hexdigest()[:7]
    # 7 hex chars give only 16^7 ≈ 268M codes; base62-encode the digest
    # for a denser space. On collision, re-hash with a random salt.
    while exists(code):
        code = hashlib.md5((url + random_string()).encode()).hexdigest()[:7]
    return code
Database:
CREATE TABLE urls (
id BIGSERIAL PRIMARY KEY,
short_code VARCHAR(10) UNIQUE,
original_url TEXT,
created_at TIMESTAMP,
clicks BIGINT DEFAULT 0
);
CREATE INDEX idx_short_code ON urls(short_code);
Architecture:
┌─────────┐ ┌───────────┐ ┌────────────┐
│ Client │────→│ API │────→│ Database │
└─────────┘ │ Servers │ └────────────┘
└─────┬─────┘
│
┌─────▼─────┐
│ Cache │
│ (Redis) │
└───────────┘
Scaling:
1. Cache popular URLs in Redis
2. Shard database by short_code
3. CDN for redirects
4. Counter service with distributed ID generation
Read Path:
def redirect(short_code):
# Check cache first
url = redis.get(f"url:{short_code}")
if not url:
url = db.query("SELECT original_url FROM urls WHERE short_code = ?", short_code)
redis.setex(f"url:{short_code}", 3600, url)
# Async click tracking
kafka.send("clicks", {"code": short_code, "time": now()})
return redirect_301(url)
Key Points to Look For:
- Clear API design
- Encoding strategy
- Caching approach
- Scaling considerations
Design a rate limiter
Design a distributed rate limiter for an API.
Requirements:
- Limit requests per user/IP
- Distributed (multiple servers)
- Low latency
- Configurable limits
Algorithm Choice: Token Bucket (allows bursts)
Redis Implementation (simplified: the whole allowance refills at once when the window key expires, closer to a fixed window than a continuously refilling bucket):
class RateLimiter:
def __init__(self, redis, limit=100, window=60):
self.redis = redis
self.limit = limit
self.window = window
def is_allowed(self, user_id):
key = f"rate:{user_id}"
# Lua script for atomicity
script = """
local tokens = redis.call('GET', KEYS[1])
if not tokens then
redis.call('SET', KEYS[1], ARGV[1] - 1, 'EX', ARGV[2])
return 1
end
if tonumber(tokens) > 0 then
redis.call('DECR', KEYS[1])
return 1
end
return 0
"""
allowed = self.redis.eval(script, 1, key, self.limit, self.window)
return bool(allowed)
Sliding Window Counter:
def is_allowed_sliding(redis, user_id, limit, window):
now = time.time()
minute = int(now / 60)
# Current and previous minute counts
curr_key = f"rate:{user_id}:{minute}"
prev_key = f"rate:{user_id}:{minute - 1}"
curr_count = int(redis.get(curr_key) or 0)
prev_count = int(redis.get(prev_key) or 0)
# Weight previous window
elapsed = now % 60
weighted = prev_count * (60 - elapsed) / 60 + curr_count
if weighted >= limit:
return False
redis.incr(curr_key)
redis.expire(curr_key, 120)
return True
Architecture:
┌─────────┐ ┌──────────────┐ ┌─────────────┐
│ Client │────→│ Rate Limiter │────→│ Redis │
└─────────┘ │ Middleware │ │ Cluster │
└──────┬───────┘ └─────────────┘
│
┌─────▼─────┐
│ API │
│ Service │
└───────────┘
Response Headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1609459200
Retry-After: 30
Considerations:
1. Race conditions - Use atomic operations
2. Clock sync - Use Redis time
3. Failure mode - Fail open or closed?
4. Per-endpoint limits - Different limits for different APIs
Key Points to Look For:
- Algorithm choice with reasoning
- Atomic operations
- Distributed considerations
- Response headers
Design a cache system
Design a distributed caching system like Memcached or Redis.
Requirements:
- Key-value storage
- Low latency (<1ms)
- High throughput
- Distributed across nodes
- LRU eviction
Single Node Design:
┌────────────────────────────────────────┐
│ Cache Node │
│ ┌─────────────────────────────────┐ │
│ │ Hash Table │ │
│ │ key → node pointer │ │
│ └─────────────────────────────────┘ │
│ ┌─────────────────────────────────┐ │
│ │ Doubly Linked List (LRU) │ │
│ │ HEAD ← → node ← → node ← → TAIL│ │
│ └─────────────────────────────────┘ │
└────────────────────────────────────────┘
LRU Cache:
class LRUCache:
def __init__(self, capacity):
self.capacity = capacity
self.cache = {} # key → node
self.head = Node(None, None) # Most recent
self.tail = Node(None, None) # Least recent
self.head.next = self.tail
self.tail.prev = self.head
def get(self, key):
if key in self.cache:
node = self.cache[key]
self._move_to_head(node)
return node.value
return None
def put(self, key, value):
if key in self.cache:
node = self.cache[key]
node.value = value
self._move_to_head(node)
else:
if len(self.cache) >= self.capacity:
self._evict()
node = Node(key, value)
self.cache[key] = node
self._add_to_head(node)
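The pieces elided above, as a minimal sketch: a plain doubly linked node, plus the list-manipulation methods that belong inside LRUCache.
class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.prev = self.next = None
# Inside LRUCache:
    def _add_to_head(self, node):
        node.prev, node.next = self.head, self.head.next
        self.head.next.prev = node
        self.head.next = node
    def _move_to_head(self, node):
        node.prev.next = node.next          # unlink
        node.next.prev = node.prev
        self._add_to_head(node)             # re-insert as most recent
    def _evict(self):
        lru = self.tail.prev                # least recently used
        lru.prev.next = self.tail
        self.tail.prev = lru.prev
        del self.cache[lru.key]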
Distributed Design:
Client → Consistent Hashing → Node
│
┌─────┼─────┐
│ │ │
Node1 Node2 Node3
Consistent Hashing:
from sortedcontainers import SortedDict  # third-party sorted map
class ConsistentHash:
def __init__(self, nodes, virtual_nodes=150):
self.ring = SortedDict()
for node in nodes:
for i in range(virtual_nodes):
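                # NOTE: Python's built-in hash() is randomized per process;
                # use a stable hash (e.g. hashlib.md5) so all nodes agree.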
key = hash(f"{node}:{i}")
self.ring[key] = node
def get_node(self, key):
if not self.ring:
return None
hash_key = hash(key)
# Find first node clockwise
idx = self.ring.bisect_right(hash_key)
if idx == len(self.ring):
idx = 0
return self.ring.values()[idx]
Architecture:
┌──────────────────────────────────────────────────┐
│ Cache Cluster │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │
│ │ Hash │ │ Hash │ │ Hash │ │
│ │ Ring │ │ Ring │ │ Ring │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ └────┴────────────┴────────────┘ │
│ Consistent Hashing Ring │
└──────────────────────────────────────────────────┘
Additional Features:
- TTL expiration
- Replication for HA
- Pub/sub
- Atomic operations
- Memory management
Key Points to Look For:
- LRU implementation
- Consistent hashing
- Eviction strategy
- Replication considerations
Design a notification system
Design a notification system that can send push notifications, emails, and SMS.
Requirements:
- Multiple channels (push, email, SMS)
- High throughput (millions/day)
- Template support
- Delivery tracking
- Retry failed deliveries
Architecture:
┌─────────────────────────────────────────────────────────────┐
│ API Gateway │
└────────────────────────────┬────────────────────────────────┘
│
┌────────────────────────────▼────────────────────────────────┐
│ Notification Service │
│ • Validate request │
│ • User preferences │
│ • Rate limiting │
└────────────────────────────┬────────────────────────────────┘
│
┌────────────────────────────▼────────────────────────────────┐
│ Message Queue │
│ ┌──────────┬──────────┬──────────┐ │
│ │ Push │ Email │ SMS │ │
│ │ Queue │ Queue │ Queue │ │
│ └────┬─────┴────┬─────┴────┬─────┘ │
└──────────────┼──────────┼──────────┼────────────────────────┘
│ │ │
┌──────────▼──┐ ┌─────▼─────┐ ┌──▼──────────┐
│Push Workers │ │Email │ │SMS Workers │
│ │ │Workers │ │ │
└──────┬──────┘ └─────┬─────┘ └──────┬──────┘
│ │ │
┌──────▼──────┐ ┌─────▼─────┐ ┌──────▼──────┐
│ FCM │ │SendGrid │ │ Twilio │
│ APNS │ │Mailgun │ │ Vonage │
└─────────────┘ └───────────┘ └─────────────┘
API:
POST /v1/notifications
{
"user_id": "user123",
"template_id": "order_shipped",
"channels": ["push", "email"],
"data": {
"order_id": "ORD456",
"tracking_url": "..."
}
}
Database Schema:
CREATE TABLE notifications (
id UUID PRIMARY KEY,
user_id VARCHAR(255),
template_id VARCHAR(255),
channel VARCHAR(20),
status VARCHAR(20), -- pending, sent, delivered, failed
created_at TIMESTAMP,
sent_at TIMESTAMP,
data JSONB
);
CREATE TABLE user_preferences (
user_id VARCHAR(255) PRIMARY KEY,
push_enabled BOOLEAN DEFAULT true,
email_enabled BOOLEAN DEFAULT true,
sms_enabled BOOLEAN DEFAULT true,
quiet_hours_start TIME,
quiet_hours_end TIME
);
Worker Logic:
class PushWorker:
def process(self, message):
notification = message.body
user = get_user(notification.user_id)
try:
# Get device tokens
tokens = get_device_tokens(user.id)
# Render template
content = render_template(
notification.template_id,
notification.data
)
# Send to FCM/APNS
for token in tokens:
send_push(token, content)
# Update status
update_status(notification.id, "sent")
except Exception as e:
# Retry with backoff
if message.retry_count < 3:
queue.publish_with_delay(
message,
delay=exponential_backoff(message.retry_count)
)
else:
update_status(notification.id, "failed")
send_to_dlq(message)
Key Considerations:
1. Deduplication - Idempotency keys
2. Rate limiting - Per user, per channel
3. Priority queues - Urgent vs batch
4. Tracking - Open rates, delivery status
5. Unsubscribe - User preferences
Key Points to Look For:
- Multiple channels handled
- Queue-based architecture
- Retry mechanism
- User preferences
Handling millions of concurrent users
How would you design a system to handle millions of concurrent users?
Principles:
1. Stateless Services:
User → Load Balancer → Any Server
↓
Session Store (Redis)
2. Horizontal Scaling:
┌─────────────┐
│ CDN │ ← Static content
└──────┬──────┘
│
┌──────▼──────┐
│ LB │ ← Distribute load
└──────┬──────┘
┌────┴────┐
│ │ │ │ │ │ ← Auto-scaling group
└─────────┘
│
┌──────▼──────┐
│ Cache │ ← Reduce DB load
└──────┬──────┘
│
┌──────▼──────┐
│ Database │ ← Sharded, replicated
└─────────────┘
3. Caching Everywhere:
Browser Cache → CDN → App Cache → DB Cache → Database
4. Database Scaling:
Write: Primary → Replicas (async)
Read: Load balance across replicas
Shard: Distribute by user_id/region
5. Async Processing:
User Request → API → Queue → Workers
↑ │
└──────────┘
Fast response
Architecture for 10M Concurrent:
┌─────────────┐
│ CDN │
│ (CloudFront)│
└──────┬──────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ Region │ │ Region │ │ Region │
│ US-East │ │ EU-West │ │ APAC │
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
┌─────▼──────────────────────────────────────────┐
│ Per Region: │
│ ┌────────────────────────────────────────┐ │
│ │ Load Balancers │ │
│ └────────────────────┬───────────────────┘ │
│ ┌─────────────┴─────────────┐ │
│ │ │ │
│ ┌──────▼──────┐ ┌───────▼───────┐ │
│ │ API Servers │ │ WebSocket │ │
│ │ (Auto-scale)│ │ Servers │ │
│ └──────┬──────┘ └───────┬───────┘ │
│ │ │ │
│ ┌──────▼──────────────────────────▼───────┐ │
│ │ Redis Cluster │ │
│ │ (Cache + Sessions) │ │
│ └─────────────────┬───────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────┐ │
│ │ Database Cluster │ │
│ │ Primary + Replicas, Sharded │ │
│ └─────────────────────────────────────────┘ │
└────────────────────────────────────────────────┘
Numbers:
10M concurrent users
~1M requests/second (10M users averaging ~0.1 requests/second each)
Servers needed:
- 1 server handles ~10K RPS
- Need ~100 servers + buffer
- Auto-scale 50-200 based on load
Cache hit rate target: 95%+
Database: Write-heavy → Sharding
Read-heavy → Replicas
Key Points to Look For:
- Multi-region deployment
- Caching strategy
- Database scaling
- Async processing
- Auto-scaling
Observability
Three pillars of observability: logs, metrics, traces
What are the three pillars of observability and how do they differ?
Observability: Ability to understand system state from external outputs.
The Three Pillars:
┌──────────────────────────────────────────────────────────────┐
│ Observability │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Logs │ │ Metrics │ │ Traces │ │
│ │ │ │ │ │ │ │
│ │ "What │ │ "How │ │ "Where │ │
│ │ happened"│ │ much" │ │ it went"│ │
│ └──────────┘ └──────────┘ └──────────┘ │
└──────────────────────────────────────────────────────────────┘
1. Logs - Event Records:
{
"timestamp": "2024-01-15T10:30:00Z",
"level": "ERROR",
"service": "payment-service",
"message": "Payment failed",
"user_id": "user_123",
"error": "Card declined",
"request_id": "req_abc"
}
Characteristics:
- Discrete events
- High cardinality (unique values)
- Human-readable
- Great for debugging specific issues
2. Metrics - Aggregated Measurements:
# Counter - cumulative value
http_requests_total{method="GET", status="200"} 1523
# Gauge - current value
active_connections 45
# Histogram - distribution
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.5"} 33444
Characteristics:
- Numeric, aggregatable
- Low cardinality
- Efficient storage
- Great for alerting and trends
3. Traces - Request Journey:
Trace ID: abc123
├── Span: API Gateway (50ms)
│ └── Span: Auth Service (10ms)
├── Span: Order Service (200ms)
│ ├── Span: Database Query (50ms)
│ └── Span: Payment Service (100ms)
│ └── Span: External API (80ms)
└── Total: 250ms
Characteristics:
- Shows request flow
- Spans across services
- Includes timing
- Great for debugging distributed systems
Comparison:
| Aspect | Logs | Metrics | Traces |
|---|---|---|---|
| Question answered | What happened? | How much/how many? | Where did time go? |
| Data type | Text/JSON | Numbers | Spans/Context |
| Cardinality | High | Low | Medium |
| Storage cost | High | Low | Medium |
| Best for | Debugging | Alerting/Trends | Latency analysis |
Using Together:
Alert fires: "Error rate > 5%" ← Metrics
Check logs: "Payment service errors" ← Logs
Trace request: "Where is the latency?" ← Traces
Tools by Pillar:
- Logs: ELK Stack, Splunk, Loki
- Metrics: Prometheus, Datadog, CloudWatch
- Traces: Jaeger, Zipkin, X-Ray
Key Points to Look For:
- Knows all three pillars
- Understands different purposes
- Can explain when to use each
Follow-up: How do you correlate logs, metrics, and traces?
Distributed tracing: How does it work?
How does distributed tracing work in microservices?
Distributed Tracing: Tracks requests as they flow through multiple services.
Core Concepts:
Trace: Complete journey of a request
Span: Single operation within a trace
Context: Metadata passed between services
Trace ID: abc-123
│
├── Span: api-gateway (start: 0ms, duration: 250ms)
│ │ service: api-gateway
│ │ operation: /orders
│ │
│ ├── Span: auth-service (start: 5ms, duration: 20ms)
│ │ service: auth-service
│ │ operation: validateToken
│ │
│ └── Span: order-service (start: 30ms, duration: 200ms)
│ service: order-service
│ operation: createOrder
│ │
│ ├── Span: postgres (start: 35ms, duration: 50ms)
│ │ operation: INSERT orders
│ │
│ └── Span: payment-service (start: 90ms, duration: 100ms)
│ service: payment-service
│ operation: chargeCard
How It Works:
1. Context Propagation:
# Service A creates trace context
import opentelemetry.trace as trace
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("handle_request") as span:
span.set_attribute("user_id", user_id)
# Context automatically injected into HTTP headers
response = requests.get(
"http://service-b/api",
headers=inject_context() # traceparent: 00-abc123-def456-01
)
2. Header Format (W3C Trace Context):
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
│ │ │ │
│ │ │ └─ flags (sampled)
│ │ └─ parent span id
│ └─ trace id
└─ version
3. Service B Receives and Continues:
# Service B extracts context
from opentelemetry.propagate import extract
context = extract(request.headers)
# Creates child span with same trace ID
with tracer.start_as_current_span("process_data", context=context) as span:
span.set_attribute("order_id", order_id)
# Continue processing...
4. Spans Collected and Assembled:
Service A ──span──→ Collector
Service B ──span──→ Collector ──→ Backend ──→ UI
Service C ──span──→ Collector
Span Attributes:
span.set_attribute("http.method", "POST")
span.set_attribute("http.url", "/api/orders")
span.set_attribute("http.status_code", 200)
span.set_attribute("db.system", "postgresql")
span.set_attribute("db.statement", "SELECT * FROM orders")
# Events within span
span.add_event("cache_miss", {"key": "user:123"})
# Errors
span.record_exception(exception)
span.set_status(Status(StatusCode.ERROR, "Payment failed"))
Sampling Strategies:
# Head-based sampling (decide at start)
sampler = TraceIdRatioBased(0.1) # 10% of traces
# Tail-based sampling (decide after complete)
# Keep all errors, sample successful requests
if span.status == ERROR or random() < 0.01:
export(span)
Architecture:
┌─────────────────────────────────────────────────────┐
│ Application │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Service A│ │Service B│ │Service C│ │
│ │ SDK │ │ SDK │ │ SDK │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ └────────────┼────────────┘ │
│ │ │
│ ┌─────▼─────┐ │
│ │ Agent/ │ │
│ │ Collector │ │
│ └─────┬─────┘ │
└────────────────────┼───────────────────────────────┘
│
┌──────▼──────┐
│ Backend │
│ (Jaeger, │
│ Zipkin) │
└──────┬──────┘
│
┌──────▼──────┐
│ UI │
└─────────────┘
Key Points to Look For:
- Understands trace context propagation
- Knows span structure
- Can explain sampling
Follow-up: How do you handle tracing with async message queues?
What makes a good log message?
What makes a good log message? What should you include?
Good Log Message Characteristics:
1. Structured Format:
// Good: Structured, parseable
{
"timestamp": "2024-01-15T10:30:00.123Z",
"level": "ERROR",
"service": "order-service",
"message": "Failed to process order",
"order_id": "ord_123",
"user_id": "usr_456",
"error": "Insufficient inventory",
"product_id": "prod_789",
"requested_quantity": 5,
"available_quantity": 2,
"trace_id": "abc123"
}
// Bad: Unstructured, hard to parse
"ERROR: Order ord_123 failed for user usr_456 - not enough inventory for prod_789 (wanted 5, have 2)"
2. Appropriate Log Level:
# DEBUG: Detailed diagnostic information
logger.debug(f"Cache lookup for key: {key}")
# INFO: Normal operations, milestones
logger.info(f"Order {order_id} created successfully")
# WARNING: Unexpected but recoverable
logger.warning(f"Retry attempt {attempt} for external API")
# ERROR: Failures that need attention
logger.error("Payment failed", extra={"order_id": order_id})
# CRITICAL: System-level failures
logger.critical("Database connection pool exhausted")
3. Include Context:
# Bad: No context
logger.error("Failed to process request")
# Good: Rich context
logger.error(
"Failed to process payment",
extra={
"order_id": order.id,
"user_id": user.id,
"amount": payment.amount,
"payment_method": payment.method,
"error_code": e.code,
"trace_id": get_trace_id()
}
)
4. Correlation IDs:
# Include a trace/request ID so all logs from one request can be correlated
import uuid
import structlog

class RequestMiddleware:
    def process_request(self, request):
        request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
        # Bind the ID into the logging context for the rest of this request
        structlog.contextvars.bind_contextvars(request_id=request_id)
        # Now all logs include request_id;
        # easy to find every log for one request
5. Don't Log Sensitive Data:
# Bad: Logging sensitive data
logger.info(f"User login: {email}, password: {password}")
logger.info(f"Payment with card: {card_number}")
# Good: Mask or omit sensitive data
logger.info(f"User login: {email}")
logger.info(f"Payment with card: ****{card_number[-4:]}")
# Sensitive fields to never log:
# - Passwords, tokens, API keys
# - Credit card numbers, SSNs
# - Personal health information
# - Full addresses, phone numbers
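One way to enforce this is a small redaction helper applied before fields reach the logger (an illustrative sketch; the field list is an assumption, not exhaustive):
SENSITIVE_KEYS = {"password", "token", "api_key", "card_number", "ssn"}

def redact(fields: dict) -> dict:
    # Replace sensitive values wholesale rather than trying to mask them
    return {k: "[REDACTED]" if k in SENSITIVE_KEYS else v
            for k, v in fields.items()}

logger.info("User login", extra=redact({"email": email, "password": password}))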
6. Actionable Messages:
# Bad: Vague
logger.error("Something went wrong")
# Good: Actionable
logger.error(
"Database connection timeout after 30s",
extra={
"host": db_host,
"action": "Check database health, consider increasing pool size"
}
)
Log Message Template:
WHEN: Timestamp (ISO 8601, UTC)
WHERE: Service, function, file:line
WHAT: Clear description of event
WHO: User ID, request ID
WHY: Error details, stack trace (for errors)
CONTEXT: Relevant business data
Example Implementation:
import structlog
logger = structlog.get_logger()
def process_order(order):
log = logger.bind(
order_id=order.id,
user_id=order.user_id
)
log.info("processing_order_started")
try:
result = payment_service.charge(order)
log.info("payment_successful", amount=order.total)
except PaymentError as e:
log.error(
"payment_failed",
error_code=e.code,
error_message=str(e),
retry_eligible=e.is_retryable
)
raise
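For the example above to emit structured JSON, structlog needs a processor chain; one possible minimal configuration:
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,           # pulls in request_id, etc.
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ]
)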
Key Points to Look For:
- Uses structured logging
- Includes correlation IDs
- Appropriate log levels
- No sensitive data
Follow-up: How do you manage log volume in high-traffic systems?
Metrics: RED vs USE method
What are the RED and USE methods for metrics? When do you use each?
RED Method: For request-driven services (APIs, microservices)
USE Method: For resources (CPU, memory, disk)
RED Method:
R - Rate: Requests per second
E - Errors: Failed requests per second
D - Duration: Time per request (latency)
Example Metrics:
# Rate: Requests per second
http_requests_total{service="api", endpoint="/orders"}
rate(http_requests_total[5m])
# Errors: Error rate
http_requests_total{service="api", status=~"5.."}
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# Duration: Latency percentiles
http_request_duration_seconds{service="api", quantile="0.99"}
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
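On the instrumentation side, emitting RED metrics from a Python service might look like this (a sketch using the prometheus_client library; the handler names are hypothetical):
from prometheus_client import Counter, Histogram

REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["service", "endpoint", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds",
                    ["service", "endpoint"])

def handle_create_order(request):
    # Duration: time the request; Rate/Errors: count it, labeled by status
    with LATENCY.labels(service="api", endpoint="/orders").time():
        status = create_order(request)  # hypothetical handler
    REQUESTS.labels(service="api", endpoint="/orders", status=str(status)).inc()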
RED Dashboard:
┌─────────────────────────────────────────────────────────────┐
│ Order Service Dashboard │
├─────────────────────┬─────────────────────┬─────────────────┤
│ Request Rate │ Error Rate │ Latency (p99) │
│ ████████████ │ ███ │ █████ │
│ 1,234 req/s │ 0.5% │ 125ms │
└─────────────────────┴─────────────────────┴─────────────────┘
USE Method:
U - Utilization: % of resource capacity used
S - Saturation: Queue depth, wait time
E - Errors: Error count
Example Metrics:
# Utilization: CPU usage
node_cpu_seconds_total
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Saturation: Load average (queue depth)
node_load1
# Or: disk I/O queue
node_disk_io_time_weighted_seconds_total
# Errors: Disk errors
node_disk_read_errors_total
node_disk_write_errors_total
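Outside of node_exporter, the same USE numbers can be sampled directly from the host; a rough sketch assuming the psutil library:
import psutil

utilization = psutil.cpu_percent(interval=1)              # U: % of CPU busy
saturation = psutil.getloadavg()[0] / psutil.cpu_count()  # S: runnable tasks per core
swap_used = psutil.swap_memory().used                     # S: memory pressure signal
# E: error counts typically come from the kernel or exporter (e.g. disk error counters)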
USE Dashboard:
┌─────────────────────────────────────────────────────────────┐
│ Server Resources │
├───────────────────┬───────────────────┬─────────────────────┤
│ CPU Utilization │ Memory │ Disk I/O │
│ ████████░░ 80% │ ██████░░░░ 60% │ ████░░░░░░ 40% │
│ │ │ │
│ Saturation: 2.5 │ Swap: 0 MB │ Queue: 3 │
│ Errors: 0 │ Errors: 0 │ Errors: 0 │
└───────────────────┴───────────────────┴─────────────────────┘
When to Use Each:
| Method | Use For | Examples |
|---|---|---|
| RED | Request-driven services | APIs, web servers, microservices |
| USE | Resources | CPU, memory, disk, network, database connections |
Combined Approach:
Application Layer (RED):
├── API Gateway: Rate, Errors, Duration
├── Order Service: Rate, Errors, Duration
└── Payment Service: Rate, Errors, Duration
Infrastructure Layer (USE):
├── Servers: CPU, Memory, Disk
├── Database: Connections, Query time
└── Message Queue: Queue depth, Throughput
Four Golden Signals (Google SRE):
Alternative to RED:
1. Latency (similar to Duration)
2. Traffic (similar to Rate)
3. Errors (same)
4. Saturation (from USE method)
Combines best of both for services!
Alerting Based on Methods:
# RED-based alerts
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
  for: 5m
  labels:
    severity: critical
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: warning
# USE-based alerts
- alert: HighCPU
  expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
  for: 10m
  labels:
    severity: warning
- alert: DiskFull
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
  for: 5m
  labels:
    severity: critical
Key Points to Look For:
- Knows both methods
- Can apply to appropriate contexts
- Understands practical metrics
Follow-up: How do you set SLOs based on these metrics?
DevOps & Infrastructure
Docker basics: containers vs VMs
What's the difference between containers and virtual machines? Why use containers?
Virtual Machines (VMs):
┌─────────────────────────────────────────────────────────────┐
│ Hardware │
├─────────────────────────────────────────────────────────────┤
│ Host OS (Hypervisor) │
├───────────────────┬───────────────────┬─────────────────────┤
│ Guest OS │ Guest OS │ Guest OS │
│ (Linux) │ (Windows) │ (Linux) │
├───────────────────┼───────────────────┼─────────────────────┤
│ Bins/Libs │ Bins/Libs │ Bins/Libs │
├───────────────────┼───────────────────┼─────────────────────┤
│ App A │ App B │ App C │
└───────────────────┴───────────────────┴─────────────────────┘
Each VM: full guest OS, gigabytes of memory, minutes to start
Containers:
┌─────────────────────────────────────────────────────────────┐
│ Hardware │
├─────────────────────────────────────────────────────────────┤
│ Host OS │
├─────────────────────────────────────────────────────────────┤
│ Container Runtime │
├───────────────────┬───────────────────┬─────────────────────┤
│ Bins/Libs │ Bins/Libs │ Bins/Libs │
├───────────────────┼───────────────────┼─────────────────────┤
│ App A │ App B │ App C │
└───────────────────┴───────────────────┴─────────────────────┘
Containers: share the host kernel, megabytes of memory, seconds to start
Key Differences:
| Aspect | VMs | Containers |
|---|---|---|
| Isolation | Hardware-level | Process-level |
| Size | GBs | MBs |
| Startup | Minutes | Seconds |
| OS | Full guest OS | Shares host kernel |
| Performance | ~5% overhead | Near-native |
| Portability | Hypervisor-dependent | Runs anywhere |
Docker Basics:
Dockerfile:
# Base image
FROM python:3.11-slim
# Set working directory
WORKDIR /app
# Copy dependencies first (caching)
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy application code
COPY . .
# Expose port
EXPOSE 8000
# Run command
CMD ["python", "app.py"]
Common Commands:
# Build image
docker build -t myapp:1.0 .
# Run container
docker run -d -p 8000:8000 --name myapp myapp:1.0
# List containers
docker ps
# View logs
docker logs myapp
# Execute command in container
docker exec -it myapp /bin/bash
# Stop and remove
docker stop myapp && docker rm myapp
Why Containers:
1. Consistency: "Works on my machine" → Works everywhere
2. Isolation: Dependencies don't conflict
3. Efficiency: Better resource utilization
4. Speed: Fast to build, start, scale
5. DevOps: Same artifact from dev to prod
When to Use VMs:
- Need different OS (Linux + Windows)
- Stronger isolation required
- Running legacy applications
- Compliance requirements
Key Points to Look For:
- Understands isolation difference
- Knows basic Docker commands
- Can explain benefits
Follow-up: What is a Docker image layer and why does it matter?
Kubernetes: pods, services, deployments
Explain the core Kubernetes concepts: pods, services, and deployments.
Kubernetes: Container orchestration platform for deploying, scaling, and managing containerized applications.
Core Concepts:
1. Pod:
Smallest deployable unit. One or more containers that share storage/network.
apiVersion: v1
kind: Pod
metadata:
name: my-app
spec:
containers:
- name: app
image: myapp:1.0
ports:
- containerPort: 8080
- name: sidecar
image: log-collector:1.0
┌─────────────────────────────────────┐
│ Pod │
│ ┌────────────┐ ┌────────────┐ │
│ │ Container │ │ Container │ │
│ │ (app) │ │ (sidecar) │ │
│ └────────────┘ └────────────┘ │
│ │
│ Shared: Network (localhost) │
│ Storage (volumes) │
│ IP Address │
└─────────────────────────────────────┘
2. Service:
Stable network endpoint for a set of pods. Pods come and go, Services provide consistent access.
apiVersion: v1
kind: Service
metadata:
name: my-app-service
spec:
selector:
app: my-app # Find pods with this label
ports:
- port: 80 # Service port
targetPort: 8080 # Container port
type: ClusterIP # Internal only
Service Types:
ClusterIP: Internal cluster access only (default)
NodePort: Exposes on each node's IP at static port
LoadBalancer: Exposes via cloud load balancer
┌─────────────────────────────────────────────────────────────┐
│ Cluster │
│ │
│ my-app-service (ClusterIP: 10.0.0.100) │
│ ↓ │
│ ┌───────┴───────┐ │
│ │ │ │
│ ┌──▼──┐ ┌──▼──┐ │
│ │ Pod │ │ Pod │ ← selector: app=my-app │
│ └─────┘ └─────┘ │
└─────────────────────────────────────────────────────────────┘
3. Deployment:
Manages ReplicaSets and provides declarative updates for Pods.
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 3
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
containers:
- name: app
image: myapp:1.0
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "500m"
Deployment Features:
# Scale up/down
kubectl scale deployment my-app --replicas=5
# Rolling update
kubectl set image deployment/my-app app=myapp:2.0
# Rollback
kubectl rollout undo deployment/my-app
# Check status
kubectl rollout status deployment/my-app
How They Work Together:
Deployment
│
│ manages
▼
ReplicaSet
│
│ creates
┌────────────┼────────────┐
▼ ▼ ▼
Pod Pod Pod
│ │ │
└────────────┼────────────┘
│
│ exposed by
▼
Service
│
│ accessed by
▼
Clients
Other Important Resources:
- ConfigMap: Configuration data
- Secret: Sensitive data (encrypted)
- Ingress: HTTP routing, TLS termination
- PersistentVolume: Storage
Key Points to Look For:
- Understands pod vs container
- Knows what services provide
- Can explain deployment benefits
Follow-up: What happens during a rolling deployment?
Blue-green vs canary deployments
What's the difference between blue-green and canary deployments?
Purpose: Both minimize risk when releasing new versions.
Blue-Green Deployment:
Two identical environments. Switch traffic instantly.
Before:
┌─────────────────┐ ┌─────────────────┐
│ Blue (v1) │ ← 100% │ Green (v2) │
│ PRODUCTION │ traffic │ STAGING │
└─────────────────┘ └─────────────────┘
After switch:
┌─────────────────┐ ┌─────────────────┐
│ Blue (v1) │ │ Green (v2) │ ← 100%
│ STANDBY │ │ PRODUCTION │ traffic
└─────────────────┘ └─────────────────┘
How It Works:
1. Deploy v2 to green environment
2. Test v2 thoroughly
3. Switch load balancer to green
4. Blue becomes standby (instant rollback)
# AWS ALB weighted target groups (simplified)
- weight: 0 # Blue (v1)
targetGroup: blue-tg
- weight: 100 # Green (v2)
targetGroup: green-tg
Canary Deployment:
Gradually shift traffic to new version.
Phase 1: 5% to v2
┌─────────────────────────────────────────────────────┐
│ v1 ████████████████████████████████████████ 95% │
│ v2 ██ 5% │
└─────────────────────────────────────────────────────┘
Phase 2: 25% to v2
┌─────────────────────────────────────────────────────┐
│ v1 ███████████████████████████████ 75% │
│ v2 █████████ 25% │
└─────────────────────────────────────────────────────┘
Phase 3: 100% to v2
┌─────────────────────────────────────────────────────┐
│ v2 ████████████████████████████████████████ 100% │
└─────────────────────────────────────────────────────┘
How It Works:
1. Deploy v2 alongside v1
2. Route small % to v2
3. Monitor metrics (errors, latency)
4. Gradually increase % if healthy
5. Rollback if problems detected
# Kubernetes canary with Istio traffic splitting (simplified)
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
  - my-app
  http:
  - route:
    - destination:
        host: my-app
        subset: v1
      weight: 90
    - destination:
        host: my-app
        subset: v2
      weight: 10
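The promotion loop itself is just metrics-gated logic; a toy sketch (thresholds and names are illustrative):
def next_canary_weight(error_rate: float, p99_latency_s: float, weight: int) -> int:
    # Roll back to 0% if the canary looks unhealthy; otherwise widen gradually
    if error_rate > 0.05 or p99_latency_s > 1.0:
        return 0
    return min(weight * 2, 100)  # e.g. 5 -> 10 -> 20 -> 40 -> 80 -> 100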
Comparison:
| Aspect | Blue-Green | Canary |
|---|---|---|
| Traffic shift | Instant (100%) | Gradual (%, over time) |
| Risk | Moderate | Low |
| Infrastructure | 2x resources | 1x + small % |
| Rollback | Instant | Instant |
| Testing | Full before switch | In production |
| Complexity | Simple | More complex |
| Best for | Database changes | Feature validation |
When to Use:
Blue-Green:
- Database schema changes
- Full system testing needed
- Quick rollback critical
- Simpler setup preferred
Canary:
- Validating with real traffic
- A/B testing new features
- Gradual risk mitigation
- Long-running releases
Rolling Deployment (Alternative):
Replace instances one at a time:
[v1] [v1] [v1] [v1]
[v2] [v1] [v1] [v1]
[v2] [v2] [v1] [v1]
[v2] [v2] [v2] [v1]
[v2] [v2] [v2] [v2]
- Kubernetes' default strategy
- Less fine-grained control than canary
Key Points to Look For:
- Knows difference
- Can recommend based on scenario
- Understands trade-offs
Follow-up: How do you handle database migrations with blue-green deployments?
Infrastructure as Code: benefits and tools
What is Infrastructure as Code and why is it important?
Infrastructure as Code (IaC): Managing infrastructure through code instead of manual processes.
Before IaC:
Click Console → Configure VM → Set up network → Manual
"I think I clicked these settings last time..."
With IaC:
resource "aws_instance" "web" {
ami = "ami-0c55b159cbfafe1f0"
instance_type = "t3.medium"
tags = {
Name = "web-server"
}
}
Benefits:
1. Version Control:
# Track changes over time
git log --oneline
abc123 Add load balancer
def456 Increase instance size
ghi789 Initial infrastructure
# Review changes
git diff HEAD~1
2. Repeatability:
# Same infrastructure every time
terraform apply # Dev
terraform apply # Staging
terraform apply # Production
# No "snowflake" servers
3. Self-Documentation:
# Code IS the documentation
resource "aws_security_group" "web" {
name = "web-sg"
description = "Allow inbound HTTPS"
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
}
4. Testing:
# Validate before applying
terraform validate
terraform plan
# Automated testing
kitchen test # Test Kitchen
pytest # Pulumi/CDK tests
5. Disaster Recovery:
# Rebuild entire infrastructure
terraform destroy
terraform apply
# Back to known state
Major Tools:
Terraform (HashiCorp):
# Declarative, cloud-agnostic
provider "aws" {
region = "us-east-1"
}
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
}
resource "aws_subnet" "web" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.1.0/24"
}
AWS CloudFormation:
# AWS-native, YAML/JSON
AWSTemplateFormatVersion: '2010-09-09'
Resources:
WebServer:
Type: AWS::EC2::Instance
Properties:
InstanceType: t3.medium
ImageId: ami-0c55b159cbfafe1f0
Pulumi (Code-based):
# Real programming languages
import pulumi_aws as aws
vpc = aws.ec2.Vpc("main", cidr_block="10.0.0.0/16")
subnet = aws.ec2.Subnet("web",
vpc_id=vpc.id,
cidr_block="10.0.1.0/24"
)
Ansible (Configuration Management):
# Imperative, agent-less
- hosts: webservers
tasks:
- name: Install nginx
apt:
name: nginx
state: present
- name: Start nginx
service:
name: nginx
state: started
Tool Comparison:
| Tool | Type | State | Language |
|---|---|---|---|
| Terraform | Declarative | Remote/Local | HCL |
| CloudFormation | Declarative | AWS-managed | YAML/JSON |
| Pulumi | Declarative | Remote | Python/TS/Go |
| Ansible | Imperative | Stateless | YAML |
| CDK | Declarative | CloudFormation | TS/Python |
Best Practices:
1. Store in version control
2. Use remote state (S3, Terraform Cloud)
3. Use modules for reusability
4. Implement CI/CD for infrastructure
5. Use environments (dev/staging/prod)
6. Peer review changes
Example Workflow:
Developer → PR → Review → Merge → CI/CD → terraform apply
│
└── terraform plan (preview)
Key Points to Look For:
- Understands benefits
- Knows major tools
- Mentions version control
Follow-up: How do you handle secrets in Infrastructure as Code?