Architecture & System Design
Distributed systems, scalability, and design patterns
Monolithic vs Microservices trade-offs
What are the trade-offs between monolithic and microservices architectures?
Monolithic Architecture:
Single deployable unit containing all functionality.
┌────────────────────────────────┐
│            Monolith            │
│  ┌─────┐ ┌──────┐ ┌─────────┐  │
│  │Users│ │Orders│ │Inventory│  │
│  └─────┘ └──────┘ └─────────┘  │
│        Single Database         │
└────────────────────────────────┘
Microservices Architecture:
Multiple independent services communicating over a network.
┌───────┐ ┌────────┐ ┌───────────┐
│ Users │ │ Orders │ │ Inventory │
│ DB │ │ DB │ │ DB │
└───┬───┘ └───┬────┘ └─────┬─────┘
│ │ │
────┴───────────┴──────────────┴────
API Gateway
Trade-offs:
| Aspect | Monolith | Microservices |
|---|---|---|
| Complexity | Lower | Higher |
| Deployment | All-or-nothing | Independent |
| Scaling | Entire app | Per service |
| Data consistency | Easy (ACID) | Hard (distributed) |
| Development speed | Fast initially | Fast at scale |
| Testing | Simpler | More complex |
| Latency | In-process | Network calls |
| Team autonomy | Low | High |
When Monolith:
- Small team (<10)
- Simple domain
- Starting a new project
- Unclear boundaries
- Need quick MVP
When Microservices:
- Large organization
- Need independent scaling
- Different tech stacks needed
- Clear domain boundaries
- High availability critical
Migration Path:
1. Start monolith
2. Identify bounded contexts
3. Extract services incrementally
4. Strangler fig pattern (see the routing sketch below)
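A minimal sketch of strangler-fig routing in Python (illustrative names; real setups usually do this at a reverse proxy or API gateway): paths already migrated go to the new services, everything else still hits the monolith.
EXTRACTED = {
    "/users": "http://user-service",
    "/inventory": "http://inventory-service",
}
MONOLITH = "http://monolith"
def route(path):
    # Longest-prefix matching would be safer; startswith keeps the sketch short
    for prefix, target in EXTRACTED.items():
        if path.startswith(prefix):
            return target + path
    return MONOLITH + path
# route("/users/42") → "http://user-service/users/42"
# route("/orders/7") → "http://monolith/orders/7"
As each service is extracted, its prefix moves into the table; when the table covers everything, the monolith can be retired.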
Key Points to Look For:
- Knows trade-offs, not just hype
- Considers team size
- Understands operational complexity
Follow-up: How do you identify service boundaries?
MVC, MVP, MVVM - differences
What are the differences between MVC, MVP, and MVVM patterns?
MVC (Model-View-Controller):
User
│
┌────▼────┐
│Controller│────→ Model
└────┬────┘ │
│ │
┌────▼────┐ │
│ View │←───────┘
└─────────┘
Controller: Handles input, updates Model
Model: Business logic, data
View: Renders UI from Model
MVP (Model-View-Presenter):
User
│
┌────▼────┐
│ View │←──────┐
└────┬────┘ │
│ │
┌────▼─────┐ │
│ Presenter│──→ Model
└──────────┘
View: Passive, delegates to Presenter
Presenter: All logic, updates View
Model: Business logic, data
MVVM (Model-View-ViewModel):
User
│
┌────▼────┐
│ View │
└────┬────┘
│ Data Binding
┌────▼─────┐
│ViewModel │
└────┬─────┘
│
┌────▼────┐
│ Model │
└─────────┘
View: Binds to ViewModel
ViewModel: View state, commands
Model: Business logic
Comparison:
| Aspect | MVC | MVP | MVVM |
|---|---|---|---|
| View-Logic coupling | Medium | Low | Low |
| Testability | Medium | High | High |
| View updates | Controller | Presenter | Binding |
| Complexity | Low | Medium | Medium |
| Best for | Web apps | Desktop, mobile | Desktop, SPA |
Examples:
- MVC: Ruby on Rails, ASP.NET MVC, Spring MVC
- MVP: Android (traditional), WinForms
- MVVM: WPF, Angular, Vue.js, SwiftUI
Key Differences:
MVC vs MVP:
- MVC: View can query Model directly
- MVP: All communication through Presenter
MVP vs MVVM:
- MVP: Presenter explicitly updates View
- MVVM: Data binding handles updates
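For contrast, a minimal MVP sketch in Python (illustrative names): the View is passive, and the Presenter explicitly pushes updates into it, which is what makes the logic testable with a fake view.
class UserView:
    def show_name(self, name):
        print(f"Name: {name}")
    def show_error(self, message):
        print(f"Error: {message}")
class UserPresenter:
    def __init__(self, view, user_repository):
        self.view = view
        self.repo = user_repository
    def load_user(self, user_id):
        user = self.repo.find(user_id)      # Model access
        if user is None:
            self.view.show_error("User not found")
        else:
            self.view.show_name(user.name)  # Presenter updates View explicitly
In MVVM the last two lines would disappear: the ViewModel would expose a name property and the data-binding layer would refresh the View.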
Key Points to Look For:
- Knows data flow direction
- Understands testability implications
- Can match to technologies
Follow-up: Which pattern would you choose for a React application?
Layered architecture and its layers
Explain layered architecture and the purpose of each layer.
Layered Architecture:
Organizes code into horizontal layers with specific responsibilities.
┌─────────────────────────────────┐
│ Presentation Layer │ API/UI
├─────────────────────────────────┤
│ Application Layer │ Use cases
├─────────────────────────────────┤
│ Domain Layer │ Business logic
├─────────────────────────────────┤
│ Infrastructure Layer │ External systems
└─────────────────────────────────┘
Layers:
1. Presentation Layer
- Handles user interface / API endpoints
- Request/response formatting
- Input validation (format only)
- No business logic
@RestController
public class UserController {
@PostMapping("/users")
public Response createUser(@Valid UserDTO dto) {
User user = userService.create(dto);
return Response.created(user.getId());
}
}
2. Application Layer (Service Layer)
- Orchestrates use cases
- Transaction management
- Calls domain layer
- No business rules
@Service
public class UserService {
public User create(UserDTO dto) {
User user = userFactory.create(dto);
validateUnique(user.getEmail());
userRepository.save(user);
eventPublisher.publish(new UserCreated(user));
return user;
}
}
3. Domain Layer
- Business rules and logic
- Domain entities
- Value objects
- Domain services
public class User {
private Email email;
private Password password;
public void changePassword(Password newPassword) {
validatePasswordPolicy(newPassword);
this.password = newPassword;
}
}
4. Infrastructure Layer
- Database access
- External services
- File system
- Messaging
@Repository
public class JpaUserRepository implements UserRepository {
@Override
public void save(User user) {
entityManager.persist(user);
}
}
Dependency Rule:
Presentation → Application → Domain ← Infrastructure
↑
Domain is the core
Dependencies point toward the Domain: Presentation depends on Application, Application on Domain, and Infrastructure also points inward. The Domain at the core knows nothing about the layers around it.
Key Points to Look For:
- Knows each layer's responsibility
- Understands dependency direction
- Can identify layer violations
Follow-up: What's the difference between this and Clean Architecture?
Clean Architecture / Hexagonal Architecture
Explain Clean Architecture or Hexagonal Architecture. How do they differ from traditional layered architecture?
Core Principle: Business logic at center, frameworks/external concerns at edges.
Hexagonal Architecture (Ports & Adapters):
┌─────────────┐
HTTP ──────→ │ Port │
│ (Interface)│
└──────┬──────┘
│
┌─────────▼─────────┐
CLI ─────→│ Application │←───── Tests
│ Core │
└─────────┬─────────┘
│
┌──────▼──────┐
│ Port │
│ (Interface)│
└──────┬──────┘
│
┌───────────┼───────────┐
│ │ │
PostgreSQL Redis Email
Ports: Interfaces defining how core interacts with outside
Adapters: Implementations of ports (HTTP, DB, etc.)
Clean Architecture (Onion):
┌─────────────────────────────────┐
│ Frameworks & Drivers │
│ ┌─────────────────────────┐ │
│ │ Interface Adapters │ │
│ │ ┌──────────────────┐ │ │
│ │ │ Use Cases │ │ │
│ │ │ ┌───────────┐ │ │ │
│ │ │ │ Entities │ │ │ │
│ │ │ └───────────┘ │ │ │
│ │ └──────────────────┘ │ │
│ └─────────────────────────┘ │
└─────────────────────────────────┘
Dependency Rule:
Dependencies point INWARD only. Inner circles know nothing about outer.
Example Structure:
src/
├── domain/ # Entities, Value Objects
│ ├── User.java
│ └── UserRepository.java # Interface!
├── application/ # Use Cases
│ └── CreateUserUseCase.java
├── adapters/
│ ├── web/ # HTTP adapter
│ │ └── UserController.java
│ └── persistence/ # DB adapter
│ └── JpaUserRepository.java
└── config/ # Wiring
Key Difference from Layered:
Layered:
- Domain depends on infrastructure interfaces
- Change DB → Change domain
Clean/Hexagonal:
- Infrastructure depends on domain interfaces
- Change DB → Only change adapter
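A sketch of that inversion in Python (illustrative names): the port is an interface owned by the domain, the adapter implements it, so swapping PostgreSQL for anything else never touches domain code.
from abc import ABC, abstractmethod
class UserRepository(ABC):                     # Port: lives in the domain
    @abstractmethod
    def save(self, user): ...
class PostgresUserRepository(UserRepository):  # Adapter: infrastructure
    def __init__(self, connection):
        self.connection = connection
    def save(self, user):
        # SQL details stay here; the domain never sees them
        self.connection.execute(
            "INSERT INTO users (id, email) VALUES (%s, %s)",
            (user.id, user.email),
        )
class CreateUserUseCase:                       # Core depends on the port only
    def __init__(self, repository: UserRepository):
        self.repository = repository
    def execute(self, user):
        self.repository.save(user)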
Benefits:
1. Framework independence
2. Testability (mock ports)
3. UI independence
4. Database independence
Key Points to Look For:
- Understands dependency direction
- Knows ports and adapters
- Can explain benefits
Follow-up: How do you handle cross-cutting concerns like logging?
Event-Driven Architecture
What is Event-Driven Architecture? When would you use it?
Event-Driven Architecture (EDA):
Systems communicate by producing and consuming events.
┌─────────┐ Event ┌────────────┐
│ Service │────────────────│ Event Bus │
│ A │ │ (Kafka, │
└─────────┘ │ RabbitMQ) │
└─────┬──────┘
┌─────┴──────┐
┌──────┴────┐ ┌─────┴─────┐
│ Service B │ │ Service C │
└───────────┘ └───────────┘
Event Types:
1. Domain Events:
// Something that happened in the domain
public class OrderPlaced {
UUID orderId;
UUID customerId;
BigDecimal total;
Instant occurredAt;
}
2. Integration Events:
// Events for external systems
public class OrderPlacedIntegrationEvent {
String orderId; // Strings for compatibility
String timestamp;
}
Patterns:
Event Notification:
"Something happened" → Consumers query for details
Loose coupling, may need callbacks
Event-Carried State Transfer:
Event contains all needed data
No callbacks needed, eventual consistency
Event Sourcing:
Store events as source of truth
Rebuild state by replaying events
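The first two patterns differ mainly in what the event carries. A sketch (illustrative names):
from dataclasses import dataclass
from decimal import Decimal
@dataclass
class OrderPlacedNotification:   # Event Notification
    order_id: str                # consumers call back for the details
@dataclass
class OrderPlacedWithState:      # Event-Carried State Transfer
    order_id: str
    customer_id: str
    total: Decimal
    items: list                  # full payload, no callback needed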
Benefits:
1. Loose coupling - Producers don't know consumers
2. Scalability - Add consumers independently
3. Resilience - Events can be replayed
4. Audit trail - Event history
Challenges:
1. Eventual consistency - Not immediate
2. Debugging - Harder to trace flow
3. Event ordering - Need careful design
4. Idempotency - Handle duplicate events
When to Use:
- Decoupled services
- Async is acceptable
- Audit trail needed
- High scalability needed
When NOT to Use:
- Strong consistency required
- Simple CRUD operations
- Small systems
Key Points to Look For:
- Knows event types
- Understands trade-offs
- Can identify use cases
Follow-up: How do you ensure event ordering?
CQRS pattern explained
What is CQRS and when would you use it?
CQRS (Command Query Responsibility Segregation):
Separate read and write models.
Traditional (Single Model):
Client → API → Service → Repository → Database
↑ │
└──────────────────────────────────────┘
Same model for reads/writes
CQRS:
┌─────────────────────────────┐
Write ────→│ Command Handler → Write DB │
└─────────────────────────────┘
│
Sync (events)
│
┌────────────▼────────────────┐
Read ─────→│ Query Handler → Read DB │
└─────────────────────────────┘
Components:
Commands (Write):
public class PlaceOrderCommand {
UUID customerId;
List<LineItem> items;
}
public class PlaceOrderHandler {
void handle(PlaceOrderCommand cmd) {
Order order = Order.create(cmd);
orderRepository.save(order);
eventBus.publish(new OrderPlaced(order));
}
}
Queries (Read):
public class GetOrderSummaryQuery {
UUID orderId;
}
public class GetOrderSummaryHandler {
OrderSummaryDTO handle(GetOrderSummaryQuery query) {
return readDB.getOrderSummary(query.orderId);
}
}
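One way to keep the read side in sync (a sketch, assuming the OrderPlaced events from the write side are delivered by an event bus; field names illustrative): a projection handler denormalizes each event into the shape queries need, and upserting keeps it idempotent under redelivery.
class OrderSummaryProjection:
    def __init__(self, read_db):
        self.read_db = read_db
    def on_order_placed(self, event):
        # Idempotent upsert keyed by order_id
        self.read_db.upsert("order_summaries", {
            "order_id": event.order_id,
            "customer_id": event.customer_id,
            "total": event.total,
            "status": "PLACED",
        })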
Benefits:
1. Optimized models - Read model for queries, write for commands
2. Scalability - Scale reads independently
3. Simplicity - Each side is simpler
4. Performance - Denormalized read model
When to Use:
- Read/write patterns differ significantly
- Complex domain with simple queries
- Need separate scaling
- Event sourcing
When NOT to Use:
- Simple CRUD
- Small applications
- Team unfamiliar with pattern
CQRS + Event Sourcing:
Command → Event Store → Events → Projections → Read DB
Key Points to Look For:
- Understands separation concept
- Knows benefits and trade-offs
- Can identify appropriate use cases
Follow-up: How do you handle consistency between read and write models?
Domain-Driven Design basics
What are the key concepts of Domain-Driven Design?
DDD focuses on complex domain modeling and collaboration with domain experts.
Strategic Patterns:
1. Bounded Context:
┌───────────────┐ ┌───────────────┐
│ Sales │ │ Shipping │
│ Context │ │ Context │
│ │ │ │
│ Customer: │ │ Customer: │
│ - name │ │ - address │
│ - creditLimit │ │ - deliveryPref│
└───────────────┘ └───────────────┘
Same word, different meaning!
2. Ubiquitous Language:
Shared vocabulary between developers and domain experts.
// Code matches domain language
class Order {
void place() { } // Not "save" or "create"
void fulfill() { } // Domain term
void cancel() { }
}
3. Context Mapping:
Sales ←─(Customer/Supplier)─→ Billing
←─(Shared Kernel)─→ Inventory
←─(Anti-corruption Layer)─→ Legacy
Tactical Patterns:
1. Entities:
Objects with identity.
class Order {
private OrderId id; // Identity
// Two orders with same data but different IDs are different
}
2. Value Objects:
Objects without identity, defined by attributes.
class Money {
private BigDecimal amount;
private Currency currency;
// Two Money with same amount/currency are equal
}
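The same idea in Python (a sketch): a frozen dataclass gives value semantics, so equality compares attributes, not identity.
from dataclasses import dataclass
from decimal import Decimal
@dataclass(frozen=True)
class Money:
    amount: Decimal
    currency: str
Money(Decimal("10"), "USD") == Money(Decimal("10"), "USD")  # True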
3. Aggregates:
Cluster of entities with consistency boundary.
class Order { // Aggregate Root
private List<LineItem> items; // Part of aggregate
void addItem(Product p, int qty) {
// Order controls consistency of items
}
}
4. Repositories:
Collection-like interface for aggregates.
interface OrderRepository {
Order findById(OrderId id);
void save(Order order);
}
5. Domain Services:
Operations that don't belong to any entity.
class PricingService {
Money calculateTotal(Order order, Customer customer) {
// Complex pricing across multiple entities
}
}
6. Domain Events:
class OrderPlaced {
OrderId orderId;
CustomerId customerId;
Instant occurredAt;
}
Key Points to Look For:
- Knows strategic vs tactical
- Understands bounded contexts
- Can explain aggregates
Follow-up: How do you communicate between bounded contexts?
System Design Concepts
Horizontal vs Vertical scaling
What's the difference between horizontal and vertical scaling?
Vertical Scaling (Scale Up):
Add more resources to existing machine.
Before: After:
┌────────┐ ┌────────────┐
│ 4 CPU │ │ 16 CPU │
│ 8GB RAM│ → │ 64GB RAM │
│ 100GB │ │ 1TB SSD │
└────────┘ └────────────┘
Horizontal Scaling (Scale Out):
Add more machines.
Before: After:
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Server │ │Server 1│ │Server 2│ │Server 3│
└────────┘ └────────┘ └────────┘ └────────┘
└──────────┼──────────┘
Load Balancer
Comparison:
| Aspect | Vertical | Horizontal |
|---|---|---|
| Complexity | Simple | Complex |
| Limit | Hardware max | Virtually unlimited |
| Downtime | Often needed | Zero-downtime |
| Cost | Expensive | Cost-effective |
| Availability | Single point | High availability |
| Data consistency | Easy | Challenging |
When to Use:
Vertical:
- Database servers (initially)
- Simple applications
- Quick fix needed
- Stateful applications
Horizontal:
- Web servers
- Microservices
- High availability needed
- Unpredictable growth
Challenges with Horizontal:
1. State management - Sessions, cache
2. Data consistency - Distributed transactions
3. Load balancing - Request distribution
4. Service discovery - Finding instances
Best Practice:
Start vertical (simpler); scale horizontally when needed.
Design stateless from the beginning.
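A minimal sketch of statelessness, assuming Redis as the shared session store (names illustrative): the server keeps nothing in memory, so any instance can serve any request.
import json
import uuid
def create_session(redis, user_id):
    session_id = str(uuid.uuid4())
    redis.setex(f"session:{session_id}", 3600, json.dumps({"user_id": user_id}))
    return session_id   # handed to the client as a cookie
def load_session(redis, session_id):
    data = redis.get(f"session:{session_id}")
    return json.loads(data) if data else None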
Key Points to Look For:
- Knows trade-offs
- Understands complexity
- Can advise on when to use each
Follow-up: How do you handle session state with horizontal scaling?
Load balancing strategies
What are different load balancing strategies?
Load Balancer: Distributes incoming requests across multiple servers.
Clients
│
┌──────▼──────┐
│Load Balancer│
└──────┬──────┘
┌─────┼─────┐
│ │ │
┌────▼┐ ┌──▼──┐ ┌▼────┐
│ S1 │ │ S2 │ │ S3 │
└─────┘ └─────┘ └─────┘
Strategies:
1. Round Robin:
Request 1 → Server 1
Request 2 → Server 2
Request 3 → Server 3
Request 4 → Server 1 (cycle)
Simple but ignores server capacity.
2. Weighted Round Robin:
Server 1 (weight 3): Gets 3 of every 6 requests
Server 2 (weight 2): Gets 2 of every 6 requests
Server 3 (weight 1): Gets 1 of every 6 requests
3. Least Connections:
Server 1: 10 active connections
Server 2: 5 active connections
Server 3: 8 active connections
→ Send to Server 2
Good for varying request duration.
4. IP Hash:
hash(client_ip) % num_servers → Server
Same client always hits same server
Good for session affinity (sticky sessions).
5. Least Response Time:
Server 1: avg 50ms
Server 2: avg 30ms ← Send here
Server 3: avg 45ms
6. Random:
Simple, works well with many servers.
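Minimal in-memory sketches of two of the strategies above (illustrative, not how production load balancers are built):
import itertools
class RoundRobin:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)
    def pick(self):
        return next(self._cycle)
class LeastConnections:
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}
    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server
    def release(self, server):   # call when the request finishes
        self.active[server] -= 1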
Layer 4 vs Layer 7:
| Layer 4 (Transport) | Layer 7 (Application) |
|---|---|
| TCP/UDP level | HTTP level |
| Faster | More features |
| Can't inspect content | Content-based routing |
| Connection-based | Request-based |
Layer 7 Features:
/api/* → API servers
/static/* → CDN
/admin/* → Admin servers
Health Checks:
Active: LB pings servers
Passive: Monitor responses
Unhealthy → Remove from pool
Healthy → Add back
Key Points to Look For:
- Knows multiple strategies
- Understands Layer 4 vs 7
- Mentions health checks
Follow-up: How do you handle session stickiness?
Caching strategies: write-through, write-back, write-around
Explain different caching write strategies.
Caching reduces latency and database load.
1. Cache-Aside (Lazy Loading):
Application manages cache.
Read:
1. Check cache
2. If miss, read DB
3. Write to cache
4. Return
Write:
1. Write to DB
2. Invalidate/update cache
def get_user(user_id):
user = cache.get(f"user:{user_id}")
if user is None:
user = db.get_user(user_id)
cache.set(f"user:{user_id}", user, ttl=3600)
return user
Pros: Only cache what's needed
Cons: Cache miss penalty, stale data possible
2. Write-Through:
Write to cache and DB synchronously.
Write:
1. Write to cache
2. Cache writes to DB
3. Return
Read:
1. Read from cache (always fresh)
App → Cache → DB
↑
Synchronous
Pros: Cache always fresh
Cons: Write latency, cache may fill with unused data
3. Write-Back (Write-Behind):
Write to cache, async write to DB.
Write:
1. Write to cache
2. Return immediately
3. Cache writes to DB async (batched)
App → Cache ···→ DB (async)
│
Immediate return
Pros: Fast writes
Cons: Data loss risk if cache fails
4. Write-Around:
Write directly to DB, bypass cache.
Write:
1. Write to DB only
2. Cache gets populated on read
Read:
1. Check cache
2. If miss, read DB, populate cache
Pros: Cache not flooded with writes
Cons: A read right after a write misses the cache; a previously cached entry can serve stale data unless invalidated
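Sketches of the two cache-managed strategies above, assuming simple cache/db interfaces (illustrative):
class WriteThroughCache:
    def __init__(self, cache, db):
        self.cache, self.db = cache, db
    def put(self, key, value):
        self.cache.set(key, value)
        self.db.write(key, value)          # synchronous: caller waits for both
class WriteBackCache:
    def __init__(self, cache, db):
        self.cache, self.db = cache, db
        self.pending = []
    def put(self, key, value):
        self.cache.set(key, value)
        self.pending.append((key, value))  # acknowledged before durable
    def flush(self):                       # run periodically, in batches
        while self.pending:
            key, value = self.pending.pop(0)
            self.db.write(key, value)      # data-loss window closes here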
Comparison:
| Strategy | Read Perf | Write Perf | Consistency | Durability |
|---|---|---|---|---|
| Cache-Aside | Good | Medium | Medium | High |
| Write-Through | Best | Low | High | High |
| Write-Back | Best | Best | High | Low |
| Write-Around | Medium | High | Low | High |
Key Points to Look For:
- Knows multiple strategies
- Understands trade-offs
- Can choose based on requirements
Follow-up: How do you handle cache invalidation?
CDN and edge caching
How does a CDN work? When would you use one?
CDN (Content Delivery Network):
Distributed servers that cache content close to users.
User in Tokyo
│
┌──────────▼──────────┐
│ Tokyo Edge Server │ ← Cache hit!
│ (CDN PoP) │
└──────────┬──────────┘
│ Cache miss
┌──────────▼──────────┐
│ Origin Server │
│ (Your server in US) │
└─────────────────────┘
How It Works:
1. User requests content
2. CDN edge receives request
3. If cached → Return immediately
4. If not → Fetch from origin, cache, return
Content Types:
Static Content:
- Images, CSS, JS
- Videos, downloads
- Fonts, documents
Dynamic Content (with Edge Computing):
- Personalized pages
- API responses (short TTL)
- Server-side rendering
Benefits:
1. Latency - Geographically closer
2. Bandwidth - Offload origin server
3. Availability - Redundant edge locations
4. DDoS protection - Distributed defense
CDN Configuration:
# Cache rules
/static/* → Cache 1 year
/api/public/* → Cache 5 minutes
/api/private/* → No cache
/*.html → Cache 1 hour, stale-while-revalidate
Cache Headers:
Cache-Control: public, max-age=31536000, immutable
Cache-Control: public, max-age=300, stale-while-revalidate=60
Cache-Control: private, no-store
Cache Invalidation:
# Purge specific URL
cdn.purge("/static/app.js")
# Purge by tag
cdn.purge(tag="product-images")
# Version in URL (preferred)
/static/app.v123.js
When to Use:
- Global user base
- Static assets
- High traffic
- Video streaming
- API caching
Providers:
Cloudflare, AWS CloudFront, Akamai, Fastly
Key Points to Look For:
- Understands how CDN works
- Knows cache headers
- Mentions invalidation challenges
Follow-up: How do you handle cache invalidation for dynamic content?
Rate limiting algorithms: token bucket, leaky bucket
Explain token bucket and leaky bucket rate limiting algorithms.
Purpose: Prevent abuse, ensure fair usage, protect resources.
Token Bucket:
Bucket fills with tokens at fixed rate
Each request consumes a token
No token → Request rejected
┌─────────────┐
│ ●●●●○○○○○○ │ ← Tokens (5/10 available)
└─────────────┘
↑ Fill rate: 1/second
Implementation:
class TokenBucket:
def __init__(self, capacity, refill_rate):
self.capacity = capacity
self.tokens = capacity
self.refill_rate = refill_rate
self.last_refill = time.time()
def allow_request(self):
self._refill()
if self.tokens >= 1:
self.tokens -= 1
return True
return False
def _refill(self):
now = time.time()
elapsed = now - self.last_refill
self.tokens = min(
self.capacity,
self.tokens + elapsed * self.refill_rate
)
self.last_refill = now
Characteristics:
- Allows bursts (up to bucket capacity)
- Smooth average rate
- Simple to implement
Leaky Bucket:
Requests enter bucket
Bucket "leaks" at constant rate
Overflow → Request rejected
↓ Requests
┌─────────────┐
│ ●●●●●●●● │ ← Buffer
└─────┬───────┘
↓ Constant outflow rate
[Process]
Implementation:
class LeakyBucket:
def __init__(self, capacity, leak_rate):
self.capacity = capacity
self.water = 0
self.leak_rate = leak_rate
self.last_leak = time.time()
def allow_request(self):
self._leak()
if self.water < self.capacity:
self.water += 1
return True
return False
def _leak(self):
now = time.time()
elapsed = now - self.last_leak
self.water = max(0, self.water - elapsed * self.leak_rate)
self.last_leak = now
Characteristics:
- Constant output rate
- Smooths bursts
- May add latency (queue)
Comparison:
| Aspect | Token Bucket | Leaky Bucket |
|---|---|---|
| Bursts | Allows | Smooths |
| Output rate | Variable | Constant |
| Simplicity | Simple | Simple |
| Use case | API rate limiting | Traffic shaping |
Other Algorithms:
Fixed Window:
Window: 00:00-01:00 → 100 requests allowed
Problem: 200 requests possible at boundary
Sliding Window Log:
Track timestamp of each request
Count requests in last N seconds
Sliding Window Counter:
Combine fixed windows with weighting
Previous window count × overlap + current count
Key Points to Look For:
- Knows both algorithms
- Understands burst handling
- Can implement basic version
Follow-up: How would you implement distributed rate limiting?
Circuit breaker pattern
What is the circuit breaker pattern? How does it work?
Circuit Breaker: Prevents cascading failures by failing fast when a service is unhealthy.
States:
Success
┌─────────────────┐
│ │
▼ Failure │
┌────────┐ threshold ┌▼───────┐
│ CLOSED │──────────→│ OPEN │
└────────┘ └───┬────┘
▲ │
│ Timeout expires │
│ ▼
│ ┌───────────┐
│ Success │ HALF-OPEN │
└──────────────┴───────────┘
│
│ Failure
└────────→ Back to OPEN
States Explained:
CLOSED (Normal):
- Requests flow through
- Track failure count/rate
- If threshold exceeded → OPEN
OPEN (Failing Fast):
- Reject requests immediately
- Don't call downstream
- After timeout → HALF-OPEN
HALF-OPEN (Testing):
- Allow limited requests through
- If success → CLOSED
- If failure → OPEN
Implementation:
class CircuitBreaker:
def __init__(self, failure_threshold=5, timeout=30):
self.state = "CLOSED"
self.failures = 0
self.failure_threshold = failure_threshold
self.timeout = timeout
self.last_failure_time = None
def call(self, func):
if self.state == "OPEN":
if time.time() - self.last_failure_time > self.timeout:
self.state = "HALF-OPEN"
else:
raise CircuitOpenException()
try:
result = func()
self._on_success()
return result
except Exception as e:
self._on_failure()
raise
def _on_success(self):
self.failures = 0
self.state = "CLOSED"
def _on_failure(self):
self.failures += 1
self.last_failure_time = time.time()
if self.failures >= self.failure_threshold:
self.state = "OPEN"
Benefits:
1. Fail fast - Don't wait for timeouts
2. Protect downstream - Give service time to recover
3. Provide fallback - Graceful degradation
4. Resource conservation - Don't waste connections
With Fallback:
@CircuitBreaker(name = "inventory", fallbackMethod = "getDefaultInventory")
public Inventory getInventory(String productId) {
return inventoryService.get(productId);
}
public Inventory getDefaultInventory(String productId, Exception ex) {
return new Inventory(productId, 0, "UNKNOWN");
}
Libraries:
- Resilience4j (Java)
- Polly (.NET)
- Hystrix (deprecated)
Key Points to Look For:
- Knows all three states
- Understands failure detection
- Mentions fallback handling
Follow-up: How do you determine appropriate thresholds?
Bulkhead pattern for fault isolation
What is the Bulkhead pattern? How does it improve resilience?
Bulkhead: Isolate components to contain failures, like ship compartments.
Ship without bulkheads: Ship with bulkheads:
┌────────────────────┐ ┌──────┬──────┬──────┐
│ Flooding │ │ OK │Flood │ OK │
│ ~~~~~~~~~~~~~~~~ │ │ │~~~~~~│ │
└────────────────────┘ └──────┴──────┴──────┘
SINKS! STAYS AFLOAT
In Software:
Without Bulkhead:
┌─────────────────────────────────────┐
│ Shared Thread Pool │
│ Orders ─────→ ●●●●●●●●●● │
│ Users ──────→ (stuck) │
│ Products ───→ (stuck) │
└─────────────────────────────────────┘
One slow service blocks everything!
With Bulkhead:
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Orders │ │ Users │ │ Products │
│ ●●●●●●●●●●│ │ ●●● │ │ ●●●● │
│ (stuck) │ │ (working) │ │ (working) │
└───────────┘ └───────────┘ └───────────┘
Failure contained!
Implementation Types:
1. Thread Pool Bulkhead:
// Separate thread pools per service
ExecutorService ordersPool = Executors.newFixedThreadPool(10);
ExecutorService usersPool = Executors.newFixedThreadPool(5);
ExecutorService productsPool = Executors.newFixedThreadPool(5);
// Orders being slow doesn't affect Users
ordersPool.submit(() -> callOrderService());
usersPool.submit(() -> callUserService());
2. Semaphore Bulkhead:
Semaphore ordersSemaphore = new Semaphore(10);
void callOrderService() {
if (ordersSemaphore.tryAcquire()) {
try {
// Call service
} finally {
ordersSemaphore.release();
}
} else {
throw new BulkheadFullException();
}
}
3. Connection Pool Bulkhead:
# Separate pools per external service
datasource:
orders:
maximum-pool-size: 10
users:
maximum-pool-size: 5
With Resilience4j:
@Bulkhead(name = "orderService", type = Bulkhead.Type.SEMAPHORE)
public Order getOrder(String id) {
return orderClient.get(id);
}
// Configuration
resilience4j.bulkhead:
instances:
orderService:
maxConcurrentCalls: 10
maxWaitDuration: 100ms
Benefits:
1. Fault isolation - Failures don't cascade
2. Fair resource allocation - Critical services protected
3. Predictable behavior - Known limits
4. Graceful degradation - Partial failures
Key Points to Look For:
- Understands isolation concept
- Knows implementation approaches
- Can size bulkheads
Follow-up: How do you combine bulkhead with circuit breaker?
Distributed Systems
CAP theorem in practice
How do you apply CAP theorem when designing systems?
Recap: During partition, choose Consistency or Availability.
Practical Application:
1. Identify Partition Tolerance Requirement:
Single datacenter, reliable network?
→ Partitions rare, might accept CA behavior
Multi-region, microservices?
→ Partitions will happen, plan for CP or AP
2. Per-Feature Decision:
Same system, different requirements:
Shopping Cart: AP
- Show cart even if stale
- User can add items, reconcile later
Checkout/Payment: CP
- Block until consistent
- Can't afford duplicate charges
3. Tunable Consistency:
// Cassandra: Consistency level per query
// Quorum = majority must respond
session.execute(
QueryBuilder.select()
.from("orders")
.where(eq("id", orderId))
.setConsistencyLevel(ConsistencyLevel.QUORUM)
);
// Strong consistency: QUORUM write + QUORUM read
// Eventual consistency: ONE write + ONE read
4. Design for Failure:
# Handle partition gracefully
def get_user_profile(user_id):
try:
return user_service.get(user_id, timeout=1)
except (TimeoutError, ConnectionError):
# AP: Return cached/default data
return cache.get(f"user:{user_id}") or DEFAULT_PROFILE
5. Consider PACELC:
Normal operation: What's the latency vs consistency trade-off?
Example: DynamoDB
- Partition: Available (AP)
- Else: Choose latency vs consistency
- Eventual: Faster reads
- Strong: Wait for leader
Real System Examples:
| System | During Partition | Normal |
|---|---|---|
| DynamoDB | AP | Tunable |
| Cassandra | AP | Tunable |
| MongoDB | CP | Strong |
| Spanner | CP | Strong |
| CockroachDB | CP | Strong |
Key Points to Look For:
- Applies per-feature, not system-wide
- Knows tunable consistency
- Understands practical implications
Follow-up: How do you test partition handling?
Consistency models in distributed systems
What are different consistency models in distributed systems?
Consistency Models (Strongest to Weakest):
1. Linearizability (Strict):
Operations appear instantaneous at some point.
Write X=5 at T1
Read at T2 (T2 > T1) → Must see 5
Global ordering exists
Like a single server
2. Sequential Consistency:
Operations appear in SOME total order consistent with program order.
Thread 1: Write X=1, Write X=2
Thread 2: Read X, Read X
Valid: Read 1, Read 2
Valid: Read 2, Read 2
Invalid: Read 2, Read 1 (order violation)
3. Causal Consistency:
Causally related operations seen in order; concurrent operations may vary.
A writes X=1
A writes Y=2 (caused by A seeing X=1)
→ If B sees Y=2, B must also see X=1
But concurrent writes can be seen in different order.
4. Eventual Consistency:
Given enough time without updates, all replicas converge.
Write X=5 to Node A
Eventually, Node B sees X=5
No timing guarantee
5. Read-Your-Writes:
Client always sees their own writes.
Write X=5
Read X → 5 (guaranteed)
But other clients may not see it yet.
6. Monotonic Reads:
Once you see a value, you won't see older values.
Read X → 5
Read X → 5 or newer, never older
Implementation Patterns:
Strong Consistency:
# Synchronous replication
def write(key, value):
primary.write(key, value)
for replica in replicas:
replica.write(key, value) # Wait for all
return success
Eventual Consistency:
# Async replication
def write(key, value):
primary.write(key, value)
queue.enqueue(replicate, key, value) # Async
return success
Read-Your-Writes:
# Track write version
def write(key, value):
version = db.write(key, value)
session.last_write_version[key] = version
def read(key):
result = db.read(key)
if result.version < session.last_write_version.get(key, 0):
result = db.read_from_primary(key) # Force primary
return result
Key Points to Look For:
- Knows multiple models
- Understands trade-offs
- Can implement basic patterns
Follow-up: How do you choose consistency model for a given use case?
Distributed transactions: saga pattern
What is the Saga pattern? How does it handle distributed transactions?
Saga: Sequence of local transactions with compensating actions for rollback.
Problem: Can't use ACID transactions across services.
Order Service → Payment Service → Inventory Service
│ │ │
└───────────────┴──────────────────┘
No distributed transaction!
Saga Types:
1. Choreography:
Services communicate through events.
Order Payment Inventory
│ │ │
│ OrderCreated │ │
│───────────────→│ │
│ │ PaymentSucceeded
│ │───────────────→│
│ │ │ InventoryReserved
│←───────────────┼────────────────│
2. Orchestration:
Central coordinator manages saga.
┌───────────────┐
│ Orchestrator │
└───────┬───────┘
┌───────┼───────┐
│ │ │
▼ ▼ ▼
Order Payment Inventory
Implementation (Orchestration):
class CreateOrderSaga:
def __init__(self):
self.steps = [
Step(OrderService.create, OrderService.cancel),
Step(PaymentService.charge, PaymentService.refund),
Step(InventoryService.reserve, InventoryService.release),
]
def execute(self, order_data):
completed = []
try:
for step in self.steps:
step.forward(order_data)
completed.append(step)
except Exception:
# Compensate in reverse order
for step in reversed(completed):
step.compensate(order_data)
raise SagaFailed()
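The Step helper used above might look like this (a sketch):
class Step:
    def __init__(self, forward, compensate):
        self.forward = forward        # the local transaction
        self.compensate = compensate  # its compensating action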
Compensation:
Happy Path:
T1 → T2 → T3 → Success
Failure at T3:
T1 → T2 → T3 (fails)
↓
C2 ← C1 (compensate)
Considerations:
1. Compensations must be idempotent:
def refund(payment_id):
if not already_refunded(payment_id):
process_refund(payment_id)
2. Handle partial failures:
# What if compensation fails?
# Retry with backoff
# Dead letter queue for manual intervention
3. State tracking:
class SagaState:
id: str
current_step: int
status: Literal["RUNNING", "COMPLETED", "COMPENSATING", "FAILED"]
data: dict
Choreography vs Orchestration:
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coupling | Loose | Tight to orchestrator |
| Complexity | Distributed | Centralized |
| Debugging | Harder | Easier |
| Single failure | Resilient | SPOF risk |
Key Points to Look For:
- Knows both types
- Understands compensation
- Handles failure scenarios
Follow-up: How do you ensure sagas are idempotent?
Message queues: when and why
When should you use a message queue? What problems does it solve?
Message Queue: Async communication between services.
Producer → Queue → Consumer
│
Decoupled!
When to Use:
1. Async Processing:
# Synchronous (slow)
def create_order(order):
save_order(order)
send_email(order) # Wait
update_analytics(order) # Wait
return order
# Async with queue (fast)
def create_order(order):
save_order(order)
queue.publish("order.created", order) # Fire and forget
return order
# Consumers process later
@subscribe("order.created")
def handle_order(order):
send_email(order)
update_analytics(order)
2. Load Leveling:
Spike: ████████████████ (1000 req/s)
↓
Queue: [●●●●●●●●●●●●●●●●] (buffer)
↓
Consumer: ███ (100 req/s steady)
3. Decoupling:
Without queue:
Order Service → Inventory Service
↓
Payment Service
With queue:
Order Service → Queue ← Inventory Service
← Payment Service
Services don't know about each other
4. Reliability:
Message persisted → Consumer can fail/restart
At-least-once delivery
Dead letter queue for failures
Common Patterns:
Work Queue:
Producer → Queue → Consumer 1
→ Consumer 2
→ Consumer 3
Load distributed among workers
Pub/Sub:
Producer → Exchange → Queue A → Consumer A
→ Queue B → Consumer B
Multiple consumers get same message
When NOT to Use:
- Need immediate response
- Simple request-response
- Low latency critical
- Small scale / simple systems
Technologies:
- RabbitMQ: Traditional, AMQP
- Kafka: High throughput, log-based
- SQS: AWS managed
- Redis: Simple pub/sub
Key Points to Look For:
- Knows multiple use cases
- Understands trade-offs
- Mentions reliability patterns
Follow-up: What's the difference between Kafka and RabbitMQ?
Event sourcing explained
What is event sourcing? When would you use it?
Event Sourcing: Store all changes as a sequence of events, not just current state.
Traditional:
┌─────────────────┐
│ Account │
│ balance: $100 │ ← Only current state
└─────────────────┘
Event Sourcing:
┌────────────────────────────────────┐
│ Event Store │
│ 1. AccountCreated($0) │
│ 2. Deposited($100) │
│ 3. Withdrawn($30) │
│ 4. Deposited($50) │
│ Current: Replay → $120 │
└────────────────────────────────────┘
Implementation:
# Events
@dataclass
class AccountCreated:
account_id: str
timestamp: datetime
@dataclass
class MoneyDeposited:
account_id: str
amount: Decimal
timestamp: datetime
# Aggregate
class Account:
def __init__(self, events):
self.balance = Decimal(0)
for event in events:
self.apply(event)
    def apply(self, event):
        if isinstance(event, AccountCreated):
            self.account_id = event.account_id
        elif isinstance(event, MoneyDeposited):
            self.balance += event.amount
        elif isinstance(event, MoneyWithdrawn):
            self.balance -= event.amount
    def deposit(self, amount):
        if amount <= 0:
            raise InvalidAmount()
        return MoneyDeposited(self.account_id, amount, datetime.now())
# Usage
events = event_store.get_events(account_id)
account = Account(events)
new_event = account.deposit(100)
event_store.append(account_id, new_event)
Projections (CQRS Read Models):
# Build read-optimized views from events
class AccountBalanceProjection:
def __init__(self):
self.balances = {}
def handle(self, event):
if isinstance(event, MoneyDeposited):
self.balances[event.account_id] = \
self.balances.get(event.account_id, 0) + event.amount
Benefits:
1. Complete audit trail
2. Time travel - Rebuild state at any point
3. Debug - See exactly what happened
4. Derived views - Create any projection
5. Event replay - Fix bugs retroactively
Challenges:
1. Event schema evolution
2. Eventual consistency
3. Query complexity (need projections)
4. Storage growth (snapshots help; see the sketch below)
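A sketch of snapshotting (the snapshot_store API here is illustrative): persist the aggregate state every N events, then replay only the tail.
def load_account(event_store, snapshot_store, account_id):
    # latest() returns (account, last_event_version) or None
    snapshot = snapshot_store.latest(account_id)
    account, version = snapshot if snapshot else (Account([]), 0)
    for event in event_store.get_events(account_id, after_version=version):
        account.apply(event)
    return account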
When to Use:
- Audit trail required
- Complex domain
- CQRS fits well
- Need temporal queries
When NOT to Use:
- Simple CRUD
- No audit needs
- Team unfamiliar
Key Points to Look For:
- Understands event vs state storage
- Knows projections
- Can explain trade-offs
Follow-up: How do you handle event schema changes?
Idempotency in distributed systems
What is idempotency and why is it important in distributed systems?
Idempotent Operation: Same result no matter how many times executed.
f(x) = f(f(x)) = f(f(f(x))) = ...
Why Important:
Client → Server
↓ (request)
Server processes
↓ (response lost!)
Client retries
Server processes AGAIN ← Problem!
Examples:
Idempotent:
# GET - reading doesn't change state
GET /users/123
# PUT - setting specific value
PUT /users/123 {"name": "Alice"}
# DELETE - already deleted = same result
DELETE /orders/456
NOT Idempotent:
# POST - creates new resource each time
POST /users {"name": "Alice"} # Creates user 1
POST /users {"name": "Alice"} # Creates user 2!
# Increment without guard
POST /accounts/123/deposit {"amount": 100}
# Double charge on retry!
Making Operations Idempotent:
1. Idempotency Key:
# Client sends unique key
POST /payments
Idempotency-Key: abc123
{"amount": 100}
# Server checks key before processing
def process_payment(key, amount):
if redis.exists(f"idempotency:{key}"):
return get_cached_response(key)
result = charge_card(amount)
redis.setex(f"idempotency:{key}", 86400, result)
return result
2. Database Constraints:
-- Unique constraint prevents duplicates
CREATE TABLE payments (
id SERIAL PRIMARY KEY,
idempotency_key VARCHAR(255) UNIQUE,
amount DECIMAL
);
-- Insert fails on duplicate key
INSERT INTO payments (idempotency_key, amount)
VALUES ('abc123', 100);
3. Check-and-Set:
def transfer(from_id, to_id, amount, transfer_id):
# Check if already processed
if Transfer.exists(transfer_id):
return "Already processed"
# Process
with transaction():
Transfer.create(id=transfer_id, ...)
Account.debit(from_id, amount)
Account.credit(to_id, amount)
Best Practices:
1. Use idempotency keys for non-idempotent operations
2. Store key → result mapping
3. Set reasonable expiry
4. Generate keys client-side
Key Points to Look For:
- Understands retry problem
- Knows implementation patterns
- Mentions idempotency keys
Follow-up: How long should you keep idempotency keys?
Leader election algorithms
How do leader election algorithms work in distributed systems?
Leader Election: Designate one node as leader to coordinate actions.
Why Needed:
- Single writer for consistency
- Coordination tasks
- Distributed locks
- Consensus protocols
Algorithms:
1. Bully Algorithm:
Nodes: 1, 2, 3, 4, 5 (higher = higher priority)
Current leader: 5
Node 5 fails
Node 3 notices, starts election
Node 3 → Sends ELECTION to 4, 5
Node 4 → Responds OK (higher, will take over)
Node 4 → Sends ELECTION to 5
(No response from 5)
Node 4 → Broadcasts COORDINATOR
Node 4 is new leader
2. Ring Algorithm:
Nodes in logical ring: 1 → 2 → 3 → 4 → 5 → 1
Node 3 starts election
Sends [3] to node 4
Node 4 adds ID: [3, 4] → sends to 5
Node 5 adds: [3, 4, 5] → sends to 1
... around the ring
When message returns with all IDs
Highest ID is leader
3. Raft Election:
States: Follower → Candidate → Leader
1. Follower times out (no heartbeat from leader)
2. Becomes Candidate, votes for self
3. Requests votes from others
4. If majority votes: Becomes Leader
5. Leader sends heartbeats
Using ZooKeeper/etcd:
# ZooKeeper ephemeral sequential nodes
def elect_leader(zk, path):
# Create ephemeral sequential node
my_node = zk.create(
f"{path}/candidate-",
ephemeral=True,
sequence=True
)
while True:
children = sorted(zk.get_children(path))
if my_node == children[0]:
# I'm the leader!
return True
else:
# Watch predecessor
predecessor = children[children.index(my_node) - 1]
zk.watch(predecessor, on_change=check_leader)
wait()
Using Redis:
def try_become_leader(redis, key, node_id, ttl=30):
# SET if not exists with expiry
acquired = redis.set(key, node_id, nx=True, ex=ttl)
if acquired:
# Extend periodically
start_heartbeat(redis, key, node_id, ttl)
return acquired
Considerations:
1. Split brain - Multiple leaders
2. Network partitions - Need majority
3. Failover time - Election duration
4. Thundering herd - Staggered timeouts
Key Points to Look For:
- Knows multiple algorithms
- Understands consensus
- Considers failure scenarios
Follow-up: What happens during a network partition?
Scalability Scenarios
Design a URL shortener
Design a URL shortening service like bit.ly.
Requirements:
- Shorten URL: Long → Short
- Redirect: Short → Long
- Analytics (optional): Click tracking
API:
POST /shorten
Body: {"url": "https://example.com/very/long/path"}
Response: {"short_url": "https://short.ly/abc123"}
GET /abc123
Response: 301 Redirect to original URL
Short URL Generation:
Option 1: Counter + Base62:
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
def encode(num):
if num == 0:
return ALPHABET[0]
result = []
while num:
result.append(ALPHABET[num % 62])
num //= 62
return ''.join(reversed(result))
# Counter: 1000000 → "4c92"
# 7 chars = 62^7 = 3.5 trillion URLs
Option 2: Hash + Truncate:
import hashlib
def generate_short(url):
    code = hashlib.md5(url.encode()).hexdigest()[:7]
    # 7 hex chars give only 16^7 ≈ 268M codes; base62-encode the digest
    # for a denser space. On collision, re-hash with a random salt.
    while exists(code):
        code = hashlib.md5((url + random_string()).encode()).hexdigest()[:7]
    return code
Database:
CREATE TABLE urls (
id BIGSERIAL PRIMARY KEY,
short_code VARCHAR(10) UNIQUE,
original_url TEXT,
created_at TIMESTAMP,
clicks BIGINT DEFAULT 0
);
CREATE INDEX idx_short_code ON urls(short_code);
Architecture:
┌─────────┐ ┌───────────┐ ┌────────────┐
│ Client │────→│ API │────→│ Database │
└─────────┘ │ Servers │ └────────────┘
└─────┬─────┘
│
┌─────▼─────┐
│ Cache │
│ (Redis) │
└───────────┘
Scaling:
1. Cache popular URLs in Redis
2. Shard database by short_code
3. CDN for redirects
4. Counter service with distributed ID generation
Read Path:
def redirect(short_code):
# Check cache first
url = redis.get(f"url:{short_code}")
if not url:
url = db.query("SELECT original_url FROM urls WHERE short_code = ?", short_code)
redis.setex(f"url:{short_code}", 3600, url)
# Async click tracking
kafka.send("clicks", {"code": short_code, "time": now()})
return redirect_301(url)
Key Points to Look For:
- Clear API design
- Encoding strategy
- Caching approach
- Scaling considerations
Design a rate limiter
Design a distributed rate limiter for an API.
Requirements:
- Limit requests per user/IP
- Distributed (multiple servers)
- Low latency
- Configurable limits
Algorithm Choice: Token Bucket (allows bursts)
Redis Implementation (simplified: the whole allowance refills at once when the window key expires, closer to a fixed window than a continuously refilling bucket):
class RateLimiter:
def __init__(self, redis, limit=100, window=60):
self.redis = redis
self.limit = limit
self.window = window
def is_allowed(self, user_id):
key = f"rate:{user_id}"
# Lua script for atomicity
script = """
local tokens = redis.call('GET', KEYS[1])
if not tokens then
redis.call('SET', KEYS[1], ARGV[1] - 1, 'EX', ARGV[2])
return 1
end
if tonumber(tokens) > 0 then
redis.call('DECR', KEYS[1])
return 1
end
return 0
"""
allowed = self.redis.eval(script, 1, key, self.limit, self.window)
return bool(allowed)
Sliding Window Counter:
def is_allowed_sliding(redis, user_id, limit, window):
now = time.time()
minute = int(now / 60)
# Current and previous minute counts
curr_key = f"rate:{user_id}:{minute}"
prev_key = f"rate:{user_id}:{minute - 1}"
curr_count = int(redis.get(curr_key) or 0)
prev_count = int(redis.get(prev_key) or 0)
# Weight previous window
elapsed = now % 60
weighted = prev_count * (60 - elapsed) / 60 + curr_count
if weighted >= limit:
return False
redis.incr(curr_key)
redis.expire(curr_key, 120)
return True
Architecture:
┌─────────┐ ┌──────────────┐ ┌─────────────┐
│ Client │────→│ Rate Limiter │────→│ Redis │
└─────────┘ │ Middleware │ │ Cluster │
└──────┬───────┘ └─────────────┘
│
┌─────▼─────┐
│ API │
│ Service │
└───────────┘
Response Headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1609459200
Retry-After: 30
Considerations:
1. Race conditions - Use atomic operations
2. Clock sync - Use Redis time
3. Failure mode - Fail open or closed?
4. Per-endpoint limits - Different limits for different APIs
Key Points to Look For:
- Algorithm choice with reasoning
- Atomic operations
- Distributed considerations
- Response headers
Design a cache system
Design a distributed caching system like Memcached or Redis.
Requirements:
- Key-value storage
- Low latency (<1ms)
- High throughput
- Distributed across nodes
- LRU eviction
Single Node Design:
┌────────────────────────────────────────┐
│ Cache Node │
│ ┌─────────────────────────────────┐ │
│ │ Hash Table │ │
│ │ key → node pointer │ │
│ └─────────────────────────────────┘ │
│ ┌─────────────────────────────────┐ │
│ │ Doubly Linked List (LRU) │ │
│ │ HEAD ← → node ← → node ← → TAIL│ │
│ └─────────────────────────────────┘ │
└────────────────────────────────────────┘
LRU Cache:
class LRUCache:
def __init__(self, capacity):
self.capacity = capacity
self.cache = {} # key → node
self.head = Node(None, None) # Most recent
self.tail = Node(None, None) # Least recent
self.head.next = self.tail
self.tail.prev = self.head
def get(self, key):
if key in self.cache:
node = self.cache[key]
self._move_to_head(node)
return node.value
return None
def put(self, key, value):
if key in self.cache:
node = self.cache[key]
node.value = value
self._move_to_head(node)
else:
if len(self.cache) >= self.capacity:
self._evict()
node = Node(key, value)
self.cache[key] = node
self._add_to_head(node)
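The pieces elided above, as a minimal sketch: a plain doubly linked node, plus the list-manipulation methods that belong inside LRUCache.
class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.prev = self.next = None
# Inside LRUCache:
    def _add_to_head(self, node):
        node.prev, node.next = self.head, self.head.next
        self.head.next.prev = node
        self.head.next = node
    def _move_to_head(self, node):
        node.prev.next = node.next          # unlink
        node.next.prev = node.prev
        self._add_to_head(node)             # re-insert as most recent
    def _evict(self):
        lru = self.tail.prev                # least recently used
        lru.prev.next = self.tail
        self.tail.prev = lru.prev
        del self.cache[lru.key]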
Distributed Design:
Client → Consistent Hashing → Node
│
┌─────┼─────┐
│ │ │
Node1 Node2 Node3
Consistent Hashing:
from sortedcontainers import SortedDict  # third-party sorted map
class ConsistentHash:
def __init__(self, nodes, virtual_nodes=150):
self.ring = SortedDict()
for node in nodes:
for i in range(virtual_nodes):
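                # NOTE: Python's built-in hash() is randomized per process;
                # use a stable hash (e.g. hashlib.md5) so all nodes agree.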
key = hash(f"{node}:{i}")
self.ring[key] = node
def get_node(self, key):
if not self.ring:
return None
hash_key = hash(key)
# Find first node clockwise
idx = self.ring.bisect_right(hash_key)
if idx == len(self.ring):
idx = 0
return self.ring.values()[idx]
Architecture:
┌──────────────────────────────────────────────────┐
│ Cache Cluster │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │
│ │ Hash │ │ Hash │ │ Hash │ │
│ │ Ring │ │ Ring │ │ Ring │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ └────┴────────────┴────────────┘ │
│ Consistent Hashing Ring │
└──────────────────────────────────────────────────┘
Additional Features:
- TTL expiration
- Replication for HA
- Pub/sub
- Atomic operations
- Memory management
Key Points to Look For:
- LRU implementation
- Consistent hashing
- Eviction strategy
- Replication considerations
Design a notification system
Design a notification system that can send push notifications, emails, and SMS.
Requirements:
- Multiple channels (push, email, SMS)
- High throughput (millions/day)
- Template support
- Delivery tracking
- Retry failed deliveries
Architecture:
┌─────────────────────────────────────────────────────────────┐
│ API Gateway │
└────────────────────────────┬────────────────────────────────┘
│
┌────────────────────────────▼────────────────────────────────┐
│ Notification Service │
│ • Validate request │
│ • User preferences │
│ • Rate limiting │
└────────────────────────────┬────────────────────────────────┘
│
┌────────────────────────────▼────────────────────────────────┐
│ Message Queue │
│ ┌──────────┬──────────┬──────────┐ │
│ │ Push │ Email │ SMS │ │
│ │ Queue │ Queue │ Queue │ │
│ └────┬─────┴────┬─────┴────┬─────┘ │
└──────────────┼──────────┼──────────┼────────────────────────┘
│ │ │
┌──────────▼──┐ ┌─────▼─────┐ ┌──▼──────────┐
│Push Workers │ │Email │ │SMS Workers │
│ │ │Workers │ │ │
└──────┬──────┘ └─────┬─────┘ └──────┬──────┘
│ │ │
┌──────▼──────┐ ┌─────▼─────┐ ┌──────▼──────┐
│ FCM │ │SendGrid │ │ Twilio │
│ APNS │ │Mailgun │ │ Vonage │
└─────────────┘ └───────────┘ └─────────────┘
API:
POST /v1/notifications
{
"user_id": "user123",
"template_id": "order_shipped",
"channels": ["push", "email"],
"data": {
"order_id": "ORD456",
"tracking_url": "..."
}
}
Database Schema:
CREATE TABLE notifications (
id UUID PRIMARY KEY,
user_id VARCHAR(255),
template_id VARCHAR(255),
channel VARCHAR(20),
status VARCHAR(20), -- pending, sent, delivered, failed
created_at TIMESTAMP,
sent_at TIMESTAMP,
data JSONB
);
CREATE TABLE user_preferences (
user_id VARCHAR(255) PRIMARY KEY,
push_enabled BOOLEAN DEFAULT true,
email_enabled BOOLEAN DEFAULT true,
sms_enabled BOOLEAN DEFAULT true,
quiet_hours_start TIME,
quiet_hours_end TIME
);
Worker Logic:
class PushWorker:
def process(self, message):
notification = message.body
user = get_user(notification.user_id)
try:
# Get device tokens
tokens = get_device_tokens(user.id)
# Render template
content = render_template(
notification.template_id,
notification.data
)
# Send to FCM/APNS
for token in tokens:
send_push(token, content)
# Update status
update_status(notification.id, "sent")
except Exception as e:
# Retry with backoff
if message.retry_count < 3:
queue.publish_with_delay(
message,
delay=exponential_backoff(message.retry_count)
)
else:
update_status(notification.id, "failed")
send_to_dlq(message)
Key Considerations:
1. Deduplication - Idempotency keys
2. Rate limiting - Per user, per channel
3. Priority queues - Urgent vs batch
4. Tracking - Open rates, delivery status
5. Unsubscribe - User preferences
Key Points to Look For:
- Multiple channels handled
- Queue-based architecture
- Retry mechanism
- User preferences
Handling millions of concurrent users
How would you design a system to handle millions of concurrent users?
Principles:
1. Stateless Services:
User → Load Balancer → Any Server
↓
Session Store (Redis)
2. Horizontal Scaling:
┌─────────────┐
│ CDN │ ← Static content
└──────┬──────┘
│
┌──────▼──────┐
│ LB │ ← Distribute load
└──────┬──────┘
┌────┴────┐
│ │ │ │ │ │ ← Auto-scaling group
└─────────┘
│
┌──────▼──────┐
│ Cache │ ← Reduce DB load
└──────┬──────┘
│
┌──────▼──────┐
│ Database │ ← Sharded, replicated
└─────────────┘
3. Caching Everywhere:
Browser Cache → CDN → App Cache → DB Cache → Database
4. Database Scaling:
Write: Primary → Replicas (async)
Read: Load balance across replicas
Shard: Distribute by user_id/region
5. Async Processing:
User Request → API → Queue → Workers
↑ │
└──────────┘
Fast response
Architecture for 10M Concurrent:
┌─────────────┐
│ CDN │
│ (CloudFront)│
└──────┬──────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ Region │ │ Region │ │ Region │
│ US-East │ │ EU-West │ │ APAC │
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
┌─────▼──────────────────────────────────────────┐
│ Per Region: │
│ ┌────────────────────────────────────────┐ │
│ │ Load Balancers │ │
│ └────────────────────┬───────────────────┘ │
│ ┌─────────────┴─────────────┐ │
│ │ │ │
│ ┌──────▼──────┐ ┌───────▼───────┐ │
│ │ API Servers │ │ WebSocket │ │
│ │ (Auto-scale)│ │ Servers │ │
│ └──────┬──────┘ └───────┬───────┘ │
│ │ │ │
│ ┌──────▼──────────────────────────▼───────┐ │
│ │ Redis Cluster │ │
│ │ (Cache + Sessions) │ │
│ └─────────────────┬───────────────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────────────┐ │
│ │ Database Cluster │ │
│ │ Primary + Replicas, Sharded │ │
│ └─────────────────────────────────────────┘ │
└────────────────────────────────────────────────┘
Numbers:
10M concurrent users
~1M requests/second (10M users averaging ~0.1 requests/second each)
Servers needed:
- 1 server handles ~10K RPS
- Need ~100 servers + buffer
- Auto-scale 50-200 based on load
Cache hit rate target: 95%+
Database: Write-heavy → Sharding
Read-heavy → Replicas
Key Points to Look For:
- Multi-region deployment
- Caching strategy
- Database scaling
- Async processing
- Auto-scaling
Observability
Three pillars of observability: logs, metrics, traces
What are the three pillars of observability and how do they differ?
Observability: Ability to understand system state from external outputs.
The Three Pillars:
┌──────────────────────────────────────────────────────────────┐
│ Observability │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Logs │ │ Metrics │ │ Traces │ │
│ │ │ │ │ │ │ │
│ │ "What │ │ "How │ │ "Where │ │
│ │ happened"│ │ much" │ │ it went"│ │
│ └──────────┘ └──────────┘ └──────────┘ │
└──────────────────────────────────────────────────────────────┘
1. Logs - Event Records:
{
"timestamp": "2024-01-15T10:30:00Z",
"level": "ERROR",
"service": "payment-service",
"message": "Payment failed",
"user_id": "user_123",
"error": "Card declined",
"request_id": "req_abc"
}
Characteristics:
- Discrete events
- High cardinality (unique values)
- Human-readable
- Great for debugging specific issues
2. Metrics - Aggregated Measurements:
# Counter - cumulative value
http_requests_total{method="GET", status="200"} 1523
# Gauge - current value
active_connections 45
# Histogram - distribution
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.5"} 33444
Characteristics:
- Numeric, aggregatable
- Low cardinality
- Efficient storage
- Great for alerting and trends
3. Traces - Request Journey:
Trace ID: abc123
├── Span: API Gateway (50ms)
│ └── Span: Auth Service (10ms)
├── Span: Order Service (200ms)
│ ├── Span: Database Query (50ms)
│ └── Span: Payment Service (100ms)
│ └── Span: External API (80ms)
└── Total: 250ms
Characteristics:
- Shows request flow
- Spans across services
- Includes timing
- Great for debugging distributed systems
Comparison:
| Aspect | Logs | Metrics | Traces |
|---|---|---|---|
| Question answered | What happened? | How much/how many? | Where did time go? |
| Data type | Text/JSON | Numbers | Spans/Context |
| Cardinality | High | Low | Medium |
| Storage cost | High | Low | Medium |
| Best for | Debugging | Alerting/Trends | Latency analysis |
Using Together:
Alert fires: "Error rate > 5%" ← Metrics
Check logs: "Payment service errors" ← Logs
Trace request: "Where is the latency?" ← Traces
Tools by Pillar:
- Logs: ELK Stack, Splunk, Loki
- Metrics: Prometheus, Datadog, CloudWatch
- Traces: Jaeger, Zipkin, X-Ray
Key Points to Look For:
- Knows all three pillars
- Understands different purposes
- Can explain when to use each
Follow-up: How do you correlate logs, metrics, and traces?
Distributed tracing: How does it work?
How does distributed tracing work in microservices?
Distributed Tracing: Tracks requests as they flow through multiple services.
Core Concepts:
Trace: Complete journey of a request
Span: Single operation within a trace
Context: Metadata passed between services
Trace ID: abc-123
│
├── Span: api-gateway (start: 0ms, duration: 250ms)
│ │ service: api-gateway
│ │ operation: /orders
│ │
│ ├── Span: auth-service (start: 5ms, duration: 20ms)
│ │ service: auth-service
│ │ operation: validateToken
│ │
│ └── Span: order-service (start: 30ms, duration: 200ms)
│ service: order-service
│ operation: createOrder
│ │
│ ├── Span: postgres (start: 35ms, duration: 50ms)
│ │ operation: INSERT orders
│ │
│ └── Span: payment-service (start: 90ms, duration: 100ms)
│ service: payment-service
│ operation: chargeCard
How It Works:
1. Context Propagation:
# Service A creates trace context
import opentelemetry.trace as trace
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("handle_request") as span:
span.set_attribute("user_id", user_id)
# Context automatically injected into HTTP headers
response = requests.get(
"http://service-b/api",
headers=inject_context() # traceparent: 00-abc123-def456-01
)
2. Header Format (W3C Trace Context):
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
│ │ │ │
│ │ │ └─ flags (sampled)
│ │ └─ parent span id
│ └─ trace id
└─ version
3. Service B Receives and Continues:
# Service B extracts context
from opentelemetry.propagate import extract
context = extract(request.headers)
# Creates child span with same trace ID
with tracer.start_as_current_span("process_data", context=context) as span:
span.set_attribute("order_id", order_id)
# Continue processing...
4. Spans Collected and Assembled:
Service A ──span──→ Collector
Service B ──span──→ Collector ──→ Backend ──→ UI
Service C ──span──→ Collector
Span Attributes:
span.set_attribute("http.method", "POST")
span.set_attribute("http.url", "/api/orders")
span.set_attribute("http.status_code", 200)
span.set_attribute("db.system", "postgresql")
span.set_attribute("db.statement", "SELECT * FROM orders")
# Events within span
span.add_event("cache_miss", {"key": "user:123"})
# Errors
span.record_exception(exception)
span.set_status(Status(StatusCode.ERROR, "Payment failed"))
Sampling Strategies:
# Head-based sampling (decide at start)
sampler = TraceIdRatioBased(0.1) # 10% of traces
# Tail-based sampling (decide after complete)
# Keep all errors, sample successful requests
if span.status == ERROR or random() < 0.01:
export(span)
Architecture:
┌─────────────────────────────────────────────────────┐
│ Application │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Service A│ │Service B│ │Service C│ │
│ │ SDK │ │ SDK │ │ SDK │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ └────────────┼────────────┘ │
│ │ │
│ ┌─────▼─────┐ │
│ │ Agent/ │ │
│ │ Collector │ │
│ └─────┬─────┘ │
└────────────────────┼───────────────────────────────┘
│
┌──────▼──────┐
│ Backend │
│ (Jaeger, │
│ Zipkin) │
└──────┬──────┘
│
┌──────▼──────┐
│ UI │
└─────────────┘
Key Points to Look For:
- Understands trace context propagation
- Knows span structure
- Can explain sampling
Follow-up: How do you handle tracing with async message queues?
What makes a good log message?
What makes a good log message? What should you include?
Good Log Message Characteristics:
1. Structured Format:
// Good: Structured, parseable
{
"timestamp": "2024-01-15T10:30:00.123Z",
"level": "ERROR",
"service": "order-service",
"message": "Failed to process order",
"order_id": "ord_123",
"user_id": "usr_456",
"error": "Insufficient inventory",
"product_id": "prod_789",
"requested_quantity": 5,
"available_quantity": 2,
"trace_id": "abc123"
}
// Bad: Unstructured, hard to parse
"ERROR: Order ord_123 failed for user usr_456 - not enough inventory for prod_789 (wanted 5, have 2)"
2. Appropriate Log Level:
# DEBUG: Detailed diagnostic information
logger.debug(f"Cache lookup for key: {key}")
# INFO: Normal operations, milestones
logger.info(f"Order {order_id} created successfully")
# WARNING: Unexpected but recoverable
logger.warning(f"Retry attempt {attempt} for external API")
# ERROR: Failures that need attention
logger.error("Payment failed", extra={"order_id": order_id})
# CRITICAL: System-level failures
logger.critical("Database connection pool exhausted")
3. Include Context:
# Bad: No context
logger.error("Failed to process request")
# Good: Rich context
logger.error(
"Failed to process payment",
extra={
"order_id": order.id,
"user_id": user.id,
"amount": payment.amount,
"payment_method": payment.method,
"error_code": e.code,
"trace_id": get_trace_id()
}
)
4. Correlation IDs:
# Include a trace/request ID so all logs from one request can be correlated
import uuid
import structlog

class RequestMiddleware:
    def process_request(self, request):
        request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
        # Bind the ID into the logging context for the rest of this request
        structlog.contextvars.bind_contextvars(request_id=request_id)
        # Now all logs include request_id;
        # easy to find every log for one request
5. Don't Log Sensitive Data:
# Bad: Logging sensitive data
logger.info(f"User login: {email}, password: {password}")
logger.info(f"Payment with card: {card_number}")
# Good: Mask or omit sensitive data
logger.info(f"User login: {email}")
logger.info(f"Payment with card: ****{card_number[-4:]}")
# Sensitive fields to never log:
# - Passwords, tokens, API keys
# - Credit card numbers, SSNs
# - Personal health information
# - Full addresses, phone numbers
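One way to enforce this is a small redaction helper applied before fields reach the logger (an illustrative sketch; the field list is an assumption, not exhaustive):
SENSITIVE_KEYS = {"password", "token", "api_key", "card_number", "ssn"}

def redact(fields: dict) -> dict:
    # Replace sensitive values wholesale rather than trying to mask them
    return {k: "[REDACTED]" if k in SENSITIVE_KEYS else v
            for k, v in fields.items()}

logger.info("User login", extra=redact({"email": email, "password": password}))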
6. Actionable Messages:
# Bad: Vague
logger.error("Something went wrong")
# Good: Actionable
logger.error(
"Database connection timeout after 30s",
extra={
"host": db_host,
"action": "Check database health, consider increasing pool size"
}
)
Log Message Template:
WHEN: Timestamp (ISO 8601, UTC)
WHERE: Service, function, file:line
WHAT: Clear description of event
WHO: User ID, request ID
WHY: Error details, stack trace (for errors)
CONTEXT: Relevant business data
Example Implementation:
import structlog
logger = structlog.get_logger()
def process_order(order):
log = logger.bind(
order_id=order.id,
user_id=order.user_id
)
log.info("processing_order_started")
try:
result = payment_service.charge(order)
log.info("payment_successful", amount=order.total)
except PaymentError as e:
log.error(
"payment_failed",
error_code=e.code,
error_message=str(e),
retry_eligible=e.is_retryable
)
raise
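For the example above to emit structured JSON, structlog needs a processor chain; one possible minimal configuration:
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,           # pulls in request_id, etc.
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ]
)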
Key Points to Look For:
- Uses structured logging
- Includes correlation IDs
- Appropriate log levels
- No sensitive data
Follow-up: How do you manage log volume in high-traffic systems?
Metrics: RED vs USE method
What are the RED and USE methods for metrics? When do you use each?
RED Method: For request-driven services (APIs, microservices)
USE Method: For resources (CPU, memory, disk)
RED Method:
R - Rate: Requests per second
E - Errors: Failed requests per second
D - Duration: Time per request (latency)
Example Metrics:
# Rate: Requests per second
http_requests_total{service="api", endpoint="/orders"}
rate(http_requests_total[5m])
# Errors: Error rate
http_requests_total{service="api", status=~"5.."}
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# Duration: Latency percentiles
http_request_duration_seconds{service="api", quantile="0.99"}
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
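On the instrumentation side, emitting RED metrics from a Python service might look like this (a sketch using the prometheus_client library; the handler names are hypothetical):
from prometheus_client import Counter, Histogram

REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["service", "endpoint", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds",
                    ["service", "endpoint"])

def handle_create_order(request):
    # Duration: time the request; Rate/Errors: count it, labeled by status
    with LATENCY.labels(service="api", endpoint="/orders").time():
        status = create_order(request)  # hypothetical handler
    REQUESTS.labels(service="api", endpoint="/orders", status=str(status)).inc()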
RED Dashboard:
┌─────────────────────────────────────────────────────────────┐
│ Order Service Dashboard │
├─────────────────────┬─────────────────────┬─────────────────┤
│ Request Rate │ Error Rate │ Latency (p99) │
│ ████████████ │ ███ │ █████ │
│ 1,234 req/s │ 0.5% │ 125ms │
└─────────────────────┴─────────────────────┴─────────────────┘
USE Method:
U - Utilization: % of resource capacity used
S - Saturation: Queue depth, wait time
E - Errors: Error count
Example Metrics:
# Utilization: CPU usage
node_cpu_seconds_total
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Saturation: Load average (queue depth)
node_load1
# Or: disk I/O queue
node_disk_io_time_weighted_seconds_total
# Errors: Disk errors
node_disk_read_errors_total
node_disk_write_errors_total
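Outside of node_exporter, the same USE numbers can be sampled directly from the host; a rough sketch assuming the psutil library:
import psutil

utilization = psutil.cpu_percent(interval=1)              # U: % of CPU busy
saturation = psutil.getloadavg()[0] / psutil.cpu_count()  # S: runnable tasks per core
swap_used = psutil.swap_memory().used                     # S: memory pressure signal
# E: error counts typically come from the kernel or exporter (e.g. disk error counters)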
USE Dashboard:
┌─────────────────────────────────────────────────────────────┐
│ Server Resources │
├───────────────────┬───────────────────┬─────────────────────┤
│ CPU Utilization │ Memory │ Disk I/O │
│ ████████░░ 80% │ ██████░░░░ 60% │ ████░░░░░░ 40% │
│ │ │ │
│ Saturation: 2.5 │ Swap: 0 MB │ Queue: 3 │
│ Errors: 0 │ Errors: 0 │ Errors: 0 │
└───────────────────┴───────────────────┴─────────────────────┘
When to Use Each:
| Method | Use For | Examples |
|---|---|---|
| RED | Request-driven services | APIs, web servers, microservices |
| USE | Resources | CPU, memory, disk, network, database connections |
Combined Approach:
Application Layer (RED):
├── API Gateway: Rate, Errors, Duration
├── Order Service: Rate, Errors, Duration
└── Payment Service: Rate, Errors, Duration
Infrastructure Layer (USE):
├── Servers: CPU, Memory, Disk
├── Database: Connections, Query time
└── Message Queue: Queue depth, Throughput
Four Golden Signals (Google SRE):
Alternative to RED:
1. Latency (similar to Duration)
2. Traffic (similar to Rate)
3. Errors (same)
4. Saturation (from USE method)
Combines best of both for services!
Alerting Based on Methods:
# RED-based alerts
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
  for: 5m
  labels:
    severity: critical
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: warning
# USE-based alerts
- alert: HighCPU
  expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
  for: 10m
  labels:
    severity: warning
- alert: DiskFull
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
  for: 5m
  labels:
    severity: critical
Key Points to Look For:
- Knows both methods
- Can apply to appropriate contexts
- Understands practical metrics
Follow-up: How do you set SLOs based on these metrics?
DevOps & Infrastructure
Docker basics: containers vs VMs
What's the difference between containers and virtual machines? Why use containers?
Virtual Machines (VMs):
┌─────────────────────────────────────────────────────────────┐
│ Hardware │
├─────────────────────────────────────────────────────────────┤
│ Host OS (Hypervisor) │
├───────────────────┬───────────────────┬─────────────────────┤
│ Guest OS │ Guest OS │ Guest OS │
│ (Linux) │ (Windows) │ (Linux) │
├───────────────────┼───────────────────┼─────────────────────┤
│ Bins/Libs │ Bins/Libs │ Bins/Libs │
├───────────────────┼───────────────────┼─────────────────────┤
│ App A │ App B │ App C │
└───────────────────┴───────────────────┴─────────────────────┘
Each VM: full guest OS, gigabytes of memory, minutes to start
Containers:
┌─────────────────────────────────────────────────────────────┐
│ Hardware │
├─────────────────────────────────────────────────────────────┤
│ Host OS │
├─────────────────────────────────────────────────────────────┤
│ Container Runtime │
├───────────────────┬───────────────────┬─────────────────────┤
│ Bins/Libs │ Bins/Libs │ Bins/Libs │
├───────────────────┼───────────────────┼─────────────────────┤
│ App A │ App B │ App C │
└───────────────────┴───────────────────┴─────────────────────┘
Containers: share the host kernel, megabytes of memory, seconds to start
Key Differences:
| Aspect | VMs | Containers |
|---|---|---|
| Isolation | Hardware-level | Process-level |
| Size | GBs | MBs |
| Startup | Minutes | Seconds |
| OS | Full guest OS | Shares host kernel |
| Performance | ~5% overhead | Near-native |
| Portability | Hypervisor-dependent | Runs anywhere |
Docker Basics:
Dockerfile:
# Base image
FROM python:3.11-slim
# Set working directory
WORKDIR /app
# Copy dependencies first (caching)
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy application code
COPY . .
# Expose port
EXPOSE 8000
# Run command
CMD ["python", "app.py"]
Common Commands:
# Build image
docker build -t myapp:1.0 .
# Run container
docker run -d -p 8000:8000 --name myapp myapp:1.0
# List containers
docker ps
# View logs
docker logs myapp
# Execute command in container
docker exec -it myapp /bin/bash
# Stop and remove
docker stop myapp && docker rm myapp
Why Containers:
1. Consistency: "Works on my machine" → Works everywhere
2. Isolation: Dependencies don't conflict
3. Efficiency: Better resource utilization
4. Speed: Fast to build, start, scale
5. DevOps: Same artifact from dev to prod
When to Use VMs:
- Need different OS (Linux + Windows)
- Stronger isolation required
- Running legacy applications
- Compliance requirements
Key Points to Look For:
- Understands isolation difference
- Knows basic Docker commands
- Can explain benefits
Follow-up: What is a Docker image layer and why does it matter?
Kubernetes: pods, services, deployments
Explain the core Kubernetes concepts: pods, services, and deployments.
Kubernetes: Container orchestration platform for deploying, scaling, and managing containerized applications.
Core Concepts:
1. Pod:
Smallest deployable unit. One or more containers that share storage/network.
apiVersion: v1
kind: Pod
metadata:
name: my-app
spec:
containers:
- name: app
image: myapp:1.0
ports:
- containerPort: 8080
- name: sidecar
image: log-collector:1.0
┌─────────────────────────────────────┐
│ Pod │
│ ┌────────────┐ ┌────────────┐ │
│ │ Container │ │ Container │ │
│ │ (app) │ │ (sidecar) │ │
│ └────────────┘ └────────────┘ │
│ │
│ Shared: Network (localhost) │
│ Storage (volumes) │
│ IP Address │
└─────────────────────────────────────┘
2. Service:
Stable network endpoint for a set of pods. Pods come and go, Services provide consistent access.
apiVersion: v1
kind: Service
metadata:
name: my-app-service
spec:
selector:
app: my-app # Find pods with this label
ports:
- port: 80 # Service port
targetPort: 8080 # Container port
type: ClusterIP # Internal only
Service Types:
ClusterIP: Internal cluster access only (default)
NodePort: Exposes on each node's IP at static port
LoadBalancer: Exposes via cloud load balancer
┌─────────────────────────────────────────────────────────────┐
│ Cluster │
│ │
│ my-app-service (ClusterIP: 10.0.0.100) │
│ ↓ │
│ ┌───────┴───────┐ │
│ │ │ │
│ ┌──▼──┐ ┌──▼──┐ │
│ │ Pod │ │ Pod │ ← selector: app=my-app │
│ └─────┘ └─────┘ │
└─────────────────────────────────────────────────────────────┘
3. Deployment:
Manages ReplicaSets and provides declarative updates for Pods.
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 3
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
containers:
- name: app
image: myapp:1.0
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "500m"
Deployment Features:
# Scale up/down
kubectl scale deployment my-app --replicas=5
# Rolling update
kubectl set image deployment/my-app app=myapp:2.0
# Rollback
kubectl rollout undo deployment/my-app
# Check status
kubectl rollout status deployment/my-app
How They Work Together:
Deployment
│
│ manages
▼
ReplicaSet
│
│ creates
┌────────────┼────────────┐
▼ ▼ ▼
Pod Pod Pod
│ │ │
└────────────┼────────────┘
│
│ exposed by
▼
Service
│
│ accessed by
▼
Clients
Other Important Resources:
- ConfigMap: Configuration data
- Secret: Sensitive data (encrypted)
- Ingress: HTTP routing, TLS termination
- PersistentVolume: Storage
Key Points to Look For:
- Understands pod vs container
- Knows what services provide
- Can explain deployment benefits
Follow-up: What happens during a rolling deployment?
Blue-green vs canary deployments
What's the difference between blue-green and canary deployments?
Purpose: Both minimize risk when releasing new versions.
Blue-Green Deployment:
Two identical environments. Switch traffic instantly.
Before:
┌─────────────────┐ ┌─────────────────┐
│ Blue (v1) │ ← 100% │ Green (v2) │
│ PRODUCTION │ traffic │ STAGING │
└─────────────────┘ └─────────────────┘
After switch:
┌─────────────────┐ ┌─────────────────┐
│ Blue (v1) │ │ Green (v2) │ ← 100%
│ STANDBY │ │ PRODUCTION │ traffic
└─────────────────┘ └─────────────────┘
How It Works:
1. Deploy v2 to green environment
2. Test v2 thoroughly
3. Switch load balancer to green
4. Blue becomes standby (instant rollback)
# AWS ALB weighted target groups (simplified)
- weight: 0 # Blue (v1)
targetGroup: blue-tg
- weight: 100 # Green (v2)
targetGroup: green-tg
Canary Deployment:
Gradually shift traffic to new version.
Phase 1: 5% to v2
┌─────────────────────────────────────────────────────┐
│ v1 ████████████████████████████████████████ 95% │
│ v2 ██ 5% │
└─────────────────────────────────────────────────────┘
Phase 2: 25% to v2
┌─────────────────────────────────────────────────────┐
│ v1 ███████████████████████████████ 75% │
│ v2 █████████ 25% │
└─────────────────────────────────────────────────────┘
Phase 3: 100% to v2
┌─────────────────────────────────────────────────────┐
│ v2 ████████████████████████████████████████ 100% │
└─────────────────────────────────────────────────────┘
How It Works:
1. Deploy v2 alongside v1
2. Route small % to v2
3. Monitor metrics (errors, latency)
4. Gradually increase % if healthy
5. Rollback if problems detected
# Kubernetes canary with Istio traffic splitting (simplified)
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
  - my-app
  http:
  - route:
    - destination:
        host: my-app
        subset: v1
      weight: 90
    - destination:
        host: my-app
        subset: v2
      weight: 10
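The promotion loop itself is just metrics-gated logic; a toy sketch (thresholds and names are illustrative):
def next_canary_weight(error_rate: float, p99_latency_s: float, weight: int) -> int:
    # Roll back to 0% if the canary looks unhealthy; otherwise widen gradually
    if error_rate > 0.05 or p99_latency_s > 1.0:
        return 0
    return min(weight * 2, 100)  # e.g. 5 -> 10 -> 20 -> 40 -> 80 -> 100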
Comparison:
| Aspect | Blue-Green | Canary |
|---|---|---|
| Traffic shift | Instant (100%) | Gradual (%, over time) |
| Risk | Moderate | Low |
| Infrastructure | 2x resources | 1x + small % |
| Rollback | Instant | Instant |
| Testing | Full before switch | In production |
| Complexity | Simple | More complex |
| Best for | Database changes | Feature validation |
When to Use:
Blue-Green:
- Database schema changes
- Full system testing needed
- Quick rollback critical
- Simpler setup preferred
Canary:
- Validating with real traffic
- A/B testing new features
- Gradual risk mitigation
- Long-running releases
Rolling Deployment (Alternative):
Replace instances one at a time:
[v1] [v1] [v1] [v1]
[v2] [v1] [v1] [v1]
[v2] [v2] [v1] [v1]
[v2] [v2] [v2] [v1]
[v2] [v2] [v2] [v2]
- Kubernetes' default strategy
- Less fine-grained control than canary
Key Points to Look For:
- Knows difference
- Can recommend based on scenario
- Understands trade-offs
Follow-up: How do you handle database migrations with blue-green deployments?
Infrastructure as Code: benefits and tools
What is Infrastructure as Code and why is it important?
Infrastructure as Code (IaC): Managing infrastructure through code instead of manual processes.
Before IaC:
Click Console → Configure VM → Set up network → Manual
"I think I clicked these settings last time..."
With IaC:
resource "aws_instance" "web" {
ami = "ami-0c55b159cbfafe1f0"
instance_type = "t3.medium"
tags = {
Name = "web-server"
}
}
Benefits:
1. Version Control:
# Track changes over time
git log --oneline
abc123 Add load balancer
def456 Increase instance size
ghi789 Initial infrastructure
# Review changes
git diff HEAD~1
2. Repeatability:
# Same infrastructure every time
terraform apply # Dev
terraform apply # Staging
terraform apply # Production
# No "snowflake" servers
3. Self-Documentation:
# Code IS the documentation
resource "aws_security_group" "web" {
name = "web-sg"
description = "Allow inbound HTTPS"
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
}
4. Testing:
# Validate before applying
terraform validate
terraform plan
# Automated testing
kitchen test # Test Kitchen
pytest # Pulumi/CDK tests
5. Disaster Recovery:
# Rebuild entire infrastructure
terraform destroy
terraform apply
# Back to known state
Major Tools:
Terraform (HashiCorp):
# Declarative, cloud-agnostic
provider "aws" {
region = "us-east-1"
}
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
}
resource "aws_subnet" "web" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.1.0/24"
}
AWS CloudFormation:
# AWS-native, YAML/JSON
AWSTemplateFormatVersion: '2010-09-09'
Resources:
WebServer:
Type: AWS::EC2::Instance
Properties:
InstanceType: t3.medium
ImageId: ami-0c55b159cbfafe1f0
Pulumi (Code-based):
# Real programming languages
import pulumi_aws as aws
vpc = aws.ec2.Vpc("main", cidr_block="10.0.0.0/16")
subnet = aws.ec2.Subnet("web",
vpc_id=vpc.id,
cidr_block="10.0.1.0/24"
)
Ansible (Configuration Management):
# Imperative, agent-less
- hosts: webservers
tasks:
- name: Install nginx
apt:
name: nginx
state: present
- name: Start nginx
service:
name: nginx
state: started
Tool Comparison:
| Tool | Type | State | Language |
|---|---|---|---|
| Terraform | Declarative | Remote/Local | HCL |
| CloudFormation | Declarative | AWS-managed | YAML/JSON |
| Pulumi | Declarative | Remote | Python/TS/Go |
| Ansible | Imperative | Stateless | YAML |
| CDK | Declarative | CloudFormation | TS/Python |
Best Practices:
1. Store in version control
2. Use remote state (S3, Terraform Cloud)
3. Use modules for reusability
4. Implement CI/CD for infrastructure
5. Use environments (dev/staging/prod)
6. Peer review changes
Example Workflow:
Developer → PR → Review → Merge → CI/CD → terraform apply
│
└── terraform plan (preview)
Key Points to Look For:
- Understands benefits
- Knows major tools
- Mentions version control
Follow-up: How do you handle secrets in Infrastructure as Code?