05

Databases

SQL, NoSQL, transactions, indexing, and optimization

1.

Types of JOINs with examples

Explain the different types of SQL JOINs with examples.

Junior

Sample Tables:

Users:                    Orders:
| id | name  |           | id | user_id | amount |
|----|-------|           |----|---------|--------|
| 1  | Alice |           | 1  | 1       | 100    |
| 2  | Bob   |           | 2  | 1       | 200    |
| 3  | Carol |           | 3  | 4       | 300    |

INNER JOIN:
Returns only matching rows from both tables.

SELECT u.name, o.amount
FROM Users u
INNER JOIN Orders o ON u.id = o.user_id;

-- Result:
-- | Alice | 100 |
-- | Alice | 200 |
-- (Bob and Carol have no orders; Order 3 has no matching user)

LEFT JOIN (LEFT OUTER JOIN):
All rows from left table, matched rows from right (NULL if no match).

SELECT u.name, o.amount
FROM Users u
LEFT JOIN Orders o ON u.id = o.user_id;

-- Result:
-- | Alice | 100  |
-- | Alice | 200  |
-- | Bob   | NULL |
-- | Carol | NULL |

RIGHT JOIN (RIGHT OUTER JOIN):
All rows from right table, matched from left.

SELECT u.name, o.amount
FROM Users u
RIGHT JOIN Orders o ON u.id = o.user_id;

-- Result:
-- | Alice | 100 |
-- | Alice | 200 |
-- | NULL  | 300 |  (user_id 4 doesn't exist)

FULL OUTER JOIN:
All rows from both tables.

SELECT u.name, o.amount
FROM Users u
FULL OUTER JOIN Orders o ON u.id = o.user_id;

-- Result: all matched rows plus unmatched rows from both sides
-- | Alice | 100  |
-- | Alice | 200  |
-- | Bob   | NULL |
-- | Carol | NULL |
-- | NULL  | 300  |

CROSS JOIN:
Cartesian product (every combination).

SELECT u.name, o.amount
FROM Users u
CROSS JOIN Orders o;
-- 3 users × 3 orders = 9 rows

Key Points to Look For:
- Knows all join types
- Understands NULL behavior
- Can write correct syntax

Follow-up: How do you find rows that DON'T have a match?

2.

INNER JOIN vs LEFT JOIN behavior

When would you use LEFT JOIN instead of INNER JOIN?

Junior

Key Difference:
- INNER JOIN: Only matching rows
- LEFT JOIN: All left rows + matches (NULL if no match)

Use LEFT JOIN when:

1. Want to include non-matching rows:

-- Find all users and their orders (including users with no orders)
SELECT u.name, COUNT(o.id) as order_count  -- COUNT(o.id) ignores NULLs, so users with no orders get 0
FROM Users u
LEFT JOIN Orders o ON u.id = o.user_id
GROUP BY u.id, u.name;

2. Checking for absence (anti-join):

-- Find users who have never ordered
SELECT u.name
FROM Users u
LEFT JOIN Orders o ON u.id = o.user_id
WHERE o.id IS NULL;

3. Optional relationships:

-- Get products with their optional discount
SELECT p.name, p.price, d.percentage
FROM Products p
LEFT JOIN Discounts d ON p.id = d.product_id;
-- Products without discount show NULL

Use INNER JOIN when:
- Only want records that exist in both tables
- Missing matches would be an error
- Both sides are required

Performance Note:

-- INNER JOIN gives the optimizer more freedom:
-- it can reorder the join and start from either side
-- LEFT JOIN must preserve every left-side row, which constrains the plan

Common Mistake:

-- LEFT JOIN filtered to INNER JOIN
SELECT *
FROM Users u
LEFT JOIN Orders o ON u.id = o.user_id
WHERE o.status = 'active';  -- Filters out NULLs!

-- Fix: Include NULL in filter
WHERE o.status = 'active' OR o.id IS NULL;
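
An alternative fix, as a sketch: move the filter into the ON clause so the join stays a true LEFT JOIN and unmatched users are still returned (same tables as above).

SELECT u.name, o.amount
FROM Users u
LEFT JOIN Orders o
    ON u.id = o.user_id
   AND o.status = 'active';   -- filter applied during the join; users with no active orders still appear with NULL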

Key Points to Look For:
- Knows when each is appropriate
- Understands NULL implications
- Recognizes filtering gotcha

Follow-up: What's the difference between filtering in ON vs WHERE?

3.

GROUP BY and HAVING clause

Explain GROUP BY and HAVING. What's the difference between HAVING and WHERE?

Junior

GROUP BY: Aggregates rows that share values.

SELECT department, COUNT(*) as employee_count
FROM Employees
GROUP BY department;

-- | department | employee_count |
-- |------------|----------------|
-- | Sales      | 10             |
-- | Engineering| 25             |

HAVING: Filters groups (after aggregation).
WHERE: Filters rows (before aggregation).

Execution Order:

1. FROM / JOIN
2. WHERE       ← Filter rows
3. GROUP BY    ← Create groups
4. HAVING      ← Filter groups
5. SELECT
6. ORDER BY

Example:

-- Departments with more than 5 employees earning > $50k
SELECT department, AVG(salary) as avg_salary
FROM Employees
WHERE salary > 50000        -- Filter individuals first
GROUP BY department
HAVING COUNT(*) > 5;        -- Then filter groups

Common Mistakes:

1. Aggregate in WHERE:

-- WRONG
SELECT department
FROM Employees
WHERE COUNT(*) > 5;  -- Error!

-- RIGHT
SELECT department
FROM Employees
GROUP BY department
HAVING COUNT(*) > 5;

2. Non-aggregated column without GROUP BY:

-- WRONG (in strict SQL mode)
SELECT department, name, COUNT(*)
FROM Employees
GROUP BY department;  -- 'name' not aggregated!

-- RIGHT
SELECT department, COUNT(*)
FROM Employees
GROUP BY department;

Performance Consideration:

-- Filter early with WHERE (reduces data to aggregate)
SELECT department, COUNT(*)
FROM Employees
WHERE status = 'active'  -- Better: Filter rows
GROUP BY department
HAVING COUNT(*) > 5;

-- vs filtering late
SELECT department, COUNT(*)
FROM Employees
GROUP BY department
HAVING COUNT(*) > 5 AND department IN (...);  -- Worse

Key Points to Look For:
- Knows execution order
- Distinguishes WHERE vs HAVING
- Understands when each is used

Follow-up: Can you use aliases in HAVING clause?

4.

Subqueries vs JOINs - performance

When should you use a subquery vs a JOIN? Which performs better?

Mid

Subquery Types:

1. Scalar Subquery (single value):

SELECT name,
       salary,
       (SELECT AVG(salary) FROM Employees) as avg_salary
FROM Employees;

2. Table Subquery:

SELECT *
FROM Employees
WHERE department_id IN (
    SELECT id FROM Departments WHERE location = 'NYC'
);

3. Correlated Subquery:

-- Executes once per outer row
SELECT e.name, e.salary
FROM Employees e
WHERE e.salary > (
    SELECT AVG(salary)
    FROM Employees
    WHERE department_id = e.department_id  -- References outer
);

JOIN Equivalent:

SELECT e.name
FROM Employees e
JOIN Departments d ON e.department_id = d.id
WHERE d.location = 'NYC';

Performance Comparison:

| Scenario                      | Better Choice           |
|-------------------------------|-------------------------|
| EXISTS check                  | Subquery with EXISTS    |
| Need columns from both tables | JOIN                    |
| Simple IN clause              | Similar performance     |
| Correlated subquery           | Usually rewrite as JOIN |
| Aggregates for filtering      | Subquery in WHERE       |

Correlated Subquery Problem:

-- O(n²) - runs subquery for each row
SELECT *
FROM Orders o
WHERE total > (
    SELECT AVG(total)
    FROM Orders
    WHERE customer_id = o.customer_id
);

-- Better with JOIN
SELECT o.*
FROM Orders o
JOIN (
    SELECT customer_id, AVG(total) as avg_total
    FROM Orders
    GROUP BY customer_id
) avgs ON o.customer_id = avgs.customer_id
WHERE o.total > avgs.avg_total;

EXISTS vs IN:

-- EXISTS: Stops at first match (efficient)
SELECT * FROM Customers c
WHERE EXISTS (SELECT 1 FROM Orders WHERE customer_id = c.id);

-- IN: May evaluate the entire subquery first (modern optimizers often rewrite both forms as semi-joins)
SELECT * FROM Customers
WHERE id IN (SELECT customer_id FROM Orders);

Key Points to Look For:
- Knows different subquery types
- Understands correlated subquery issue
- Can rewrite for performance

Follow-up: How does the query optimizer handle subqueries?

5.

Window functions: ROW_NUMBER, RANK, LAG

Explain window functions with examples of ROW_NUMBER, RANK, and LAG.

Mid

Window functions perform calculations across rows related to current row, without collapsing into groups.

Syntax:

function() OVER (
    PARTITION BY column
    ORDER BY column
    ROWS/RANGE BETWEEN ...
)

ROW_NUMBER - Sequential numbering:

-- Unique number per row within partition
SELECT
    name,
    department,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as row_num
FROM Employees;

-- | Alice | Sales | 80000 | 1 |
-- | Bob   | Sales | 70000 | 2 |
-- | Carol | Eng   | 90000 | 1 |
-- | Dan   | Eng   | 85000 | 2 |

RANK vs DENSE_RANK:

-- RANK: Gaps after ties
-- DENSE_RANK: No gaps
SELECT
    name,
    salary,
    RANK() OVER (ORDER BY salary DESC) as rank,
    DENSE_RANK() OVER (ORDER BY salary DESC) as dense_rank
FROM Employees;

-- | Alice | 100 | 1 | 1 |
-- | Bob   | 100 | 1 | 1 |  (tie)
-- | Carol |  90 | 3 | 2 |  (RANK skips 2)

LAG / LEAD - Access adjacent rows:

-- Compare to previous/next row
SELECT
    date,
    revenue,
    LAG(revenue, 1) OVER (ORDER BY date) as prev_revenue,
    revenue - LAG(revenue, 1) OVER (ORDER BY date) as change
FROM DailySales;

-- | 2024-01-01 | 100 | NULL |  NULL |
-- | 2024-01-02 | 120 |  100 |    20 |
-- | 2024-01-03 | 110 |  120 |   -10 |

Running Total:

SELECT
    date,
    amount,
    SUM(amount) OVER (ORDER BY date) as running_total
FROM Transactions;
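
The syntax block above mentions ROWS/RANGE frames without showing one; a sketch of an explicit frame, computing a 3-row moving average over the same hypothetical Transactions table:

SELECT
    date,
    amount,
    AVG(amount) OVER (
        ORDER BY date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW   -- current row plus the two before it
    ) as moving_avg_3
FROM Transactions;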

Top N per Group:

-- Top 3 earners per department
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as rn
    FROM Employees
)
SELECT * FROM ranked WHERE rn <= 3;

Key Points to Look For:
- Knows difference between RANK variants
- Understands PARTITION BY
- Can solve practical problems

Follow-up: How do you find gaps in sequential data using window functions?

6.

Common Table Expressions (CTEs)

What are CTEs and when would you use them?

Mid

CTE (Common Table Expression): Named temporary result set defined within a query.

Syntax:

WITH cte_name AS (
    SELECT ...
)
SELECT * FROM cte_name;

Use Case 1: Readability

-- Without CTE (hard to read)
SELECT *
FROM Orders
WHERE customer_id IN (
    SELECT customer_id
    FROM (
        SELECT customer_id, SUM(total) as total
        FROM Orders
        GROUP BY customer_id
    ) sub
    WHERE total > 10000
);

-- With CTE (clearer)
WITH high_value_customers AS (
    SELECT customer_id, SUM(total) as total
    FROM Orders
    GROUP BY customer_id
    HAVING SUM(total) > 10000
)
SELECT *
FROM Orders
WHERE customer_id IN (SELECT customer_id FROM high_value_customers);

Use Case 2: Reuse within query

WITH monthly_sales AS (
    SELECT
        DATE_TRUNC('month', date) as month,
        SUM(amount) as total
    FROM Sales
    GROUP BY DATE_TRUNC('month', date)
)
SELECT
    m1.month,
    m1.total,
    m2.total as prev_month,
    m1.total - m2.total as change
FROM monthly_sales m1
LEFT JOIN monthly_sales m2
    ON m1.month = m2.month + INTERVAL '1 month';

Use Case 3: Recursive CTE

-- Generate series
WITH RECURSIVE numbers AS (
    SELECT 1 as n
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < 10
)
SELECT * FROM numbers;

-- Organizational hierarchy
WITH RECURSIVE org_tree AS (
    -- Base case: top-level managers
    SELECT id, name, manager_id, 0 as level
    FROM Employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive case: subordinates
    SELECT e.id, e.name, e.manager_id, t.level + 1
    FROM Employees e
    JOIN org_tree t ON e.manager_id = t.id
)
SELECT * FROM org_tree ORDER BY level, name;

CTE vs Subquery vs Temp Table:

| Feature      | CTE            | Subquery   | Temp Table |
|--------------|----------------|------------|------------|
| Scope        | Single query   | Single use | Session    |
| Recursive    | Yes            | No         | No         |
| Reuse        | Yes (in query) | No         | Yes        |
| Materialized | Usually no     | No         | Yes        |

Key Points to Look For:
- Knows syntax and benefits
- Understands recursive CTEs
- Can compare to alternatives

Follow-up: When would you choose a temp table over a CTE?

7.

UNION vs UNION ALL

What's the difference between UNION and UNION ALL?

Junior

UNION: Combines results and removes duplicates.
UNION ALL: Combines results and keeps all rows (including duplicates).

-- Table A: 1, 2, 3
-- Table B: 2, 3, 4

SELECT * FROM A
UNION
SELECT * FROM B;
-- Result: 1, 2, 3, 4 (duplicates removed)

SELECT * FROM A
UNION ALL
SELECT * FROM B;
-- Result: 1, 2, 3, 2, 3, 4 (all rows)

Performance:

-- UNION is slower:
-- 1. Combines rows
-- 2. Sorts/hashes to find duplicates
-- 3. Removes duplicates

-- UNION ALL is faster:
-- 1. Just combines rows
-- 2. No deduplication

When to Use:

UNION ALL (default choice):
- Tables don't have overlapping data
- You want/need duplicates
- Performance is critical

-- Different sources, no overlap
SELECT id, name FROM Current_Employees
UNION ALL
SELECT id, name FROM Former_Employees;

UNION:
- Need to deduplicate
- Combining overlapping sets

-- Find all distinct product IDs from multiple tables
SELECT product_id FROM Orders
UNION
SELECT product_id FROM Returns
UNION
SELECT product_id FROM Wishlist;

Requirements:
- Same number of columns
- Compatible data types
- Column names from first SELECT
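
A small sketch of these requirements in practice; the literal source column keeps the column counts equal, and the result's column names come from the first SELECT:

SELECT id, name, 'current' AS source FROM Current_Employees
UNION ALL
SELECT id, name, 'former' FROM Former_Employees;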

Key Points to Look For:
- Understands duplicate handling
- Knows performance difference
- Chooses correctly by default (UNION ALL)

Follow-up: How do you combine with different column counts?


Database Design

8.

Primary Key vs Unique Key vs Foreign Key

Explain the differences between Primary Key, Unique Key, and Foreign Key.

Junior

Primary Key:
- Uniquely identifies each row
- Cannot be NULL
- Only ONE per table
- Automatically indexed

CREATE TABLE Users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL
);

Unique Key:
- Ensures column(s) have unique values
- Allows NULLs (PostgreSQL/MySQL allow multiple NULLs; SQL Server allows only one)
- Multiple unique keys per table

CREATE TABLE Users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE,        -- Unique key
    username VARCHAR(50) UNIQUE       -- Another unique key
);

Foreign Key:
- References another table's primary/unique key
- Enforces referential integrity
- Can be NULL (optional relationship)
- Multiple foreign keys per table

CREATE TABLE Orders (
    id SERIAL PRIMARY KEY,
    user_id INT REFERENCES Users(id),  -- Foreign key
    product_id INT REFERENCES Products(id)
);

Comparison:

| Feature         | Primary Key | Unique Key         | Foreign Key  |
|-----------------|-------------|--------------------|--------------|
| Uniqueness      | Yes         | Yes                | No           |
| NULL allowed    | No          | Yes (DB-dependent) | Yes          |
| Count per table | One         | Multiple           | Multiple     |
| Auto-indexed    | Yes         | Yes                | Often        |
| Purpose         | Identity    | Constraint         | Relationship |

Foreign Key Actions:

CREATE TABLE Orders (
    user_id INT REFERENCES Users(id)
        ON DELETE CASCADE        -- Delete orders when user deleted
        ON UPDATE SET NULL       -- Set NULL when user ID changes
);

-- Options: CASCADE, SET NULL, SET DEFAULT, RESTRICT, NO ACTION

Key Points to Look For:
- Knows NULL differences
- Understands referential integrity
- Knows FK actions

Follow-up: What's the difference between RESTRICT and NO ACTION?

9.

Normalization forms: 1NF, 2NF, 3NF, BCNF

Explain database normalization forms. Why normalize?

Mid

Why Normalize:
- Reduce data redundancy
- Prevent update anomalies
- Ensure data integrity

1NF (First Normal Form):
- Atomic values (no lists/arrays)
- Each row unique

-- Violates 1NF
| id | name  | phones              |
|----|-------|---------------------|
| 1  | Alice | 123-456, 789-012    |  -- Multiple values!

-- 1NF compliant
| id | name  | phone   |
|----|-------|---------|
| 1  | Alice | 123-456 |
| 1  | Alice | 789-012 |

2NF (Second Normal Form):
- Is in 1NF
- No partial dependencies (all non-key columns depend on ENTIRE primary key)

-- Violates 2NF (composite key)
| student_id | course_id | student_name | course_name |
|------------|-----------|--------------|-------------|
-- student_name depends only on student_id (partial dependency)

-- 2NF compliant: Split tables
Students(student_id, student_name)
Courses(course_id, course_name)
Enrollments(student_id, course_id)

3NF (Third Normal Form):
- Is in 2NF
- No transitive dependencies (non-key columns depend only on the key, not on other non-key columns)

-- Violates 3NF
| employee_id | department_id | department_name |
|-------------|---------------|-----------------|
-- department_name depends on department_id, not employee_id

-- 3NF compliant
Employees(employee_id, department_id)
Departments(department_id, department_name)

BCNF (Boyce-Codd Normal Form):
- Is in 3NF
- Every determinant is a candidate key

-- Can violate BCNF even if 3NF
| student | subject | teacher |
|---------|---------|---------|
-- If teacher determines subject (each teacher teaches one subject)
-- But student+subject is key, teacher is non-key determinant
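
One possible BCNF decomposition of this example, splitting the teacher → subject dependency into its own relation:

-- BCNF compliant
Teachers(teacher, subject)           -- teacher determines subject
StudentTeachers(student, teacher)    -- which teacher each student has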

Quick Reference:

1NF: Atomic values, unique rows
2NF: 1NF + No partial dependencies
3NF: 2NF + No transitive dependencies
BCNF: 3NF + Every determinant is candidate key

Key Points to Look For:
- Knows each form's requirement
- Can identify violations
- Understands trade-offs

Follow-up: When is denormalization appropriate?

10.

When to denormalize?

When and why would you denormalize a database?

Senior

Denormalization: Intentionally adding redundancy for performance.

When to Denormalize:

1. Read-Heavy Workloads

-- Normalized: 3 JOINs for every read
SELECT o.id, c.name, p.name, p.price
FROM Orders o
JOIN Customers c ON o.customer_id = c.id
JOIN OrderItems oi ON o.id = oi.order_id
JOIN Products p ON oi.product_id = p.id;

-- Denormalized: Single table read
SELECT id, customer_name, product_name, product_price
FROM OrdersFlat;

2. Reporting / Analytics

-- Star schema for data warehouse
CREATE TABLE fact_sales (
    date_id INT,
    product_id INT,
    customer_id INT,
    quantity INT,
    total_amount DECIMAL,
    customer_city VARCHAR(100),    -- Denormalized
    product_category VARCHAR(50)   -- Denormalized
);

3. Frequently Accessed Aggregates

-- Instead of counting every time
ALTER TABLE Posts ADD COLUMN comment_count INT DEFAULT 0;

-- Update on insert/delete
UPDATE Posts SET comment_count = comment_count + 1
WHERE id = :post_id;

4. Caching Hot Data

-- User table with denormalized profile data
CREATE TABLE Users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255),
    -- Denormalized from Settings
    notification_pref VARCHAR(50),
    timezone VARCHAR(50),
    -- Denormalized from Billing
    subscription_type VARCHAR(20)
);

Trade-offs:

| Benefit         | Cost                    |
|-----------------|-------------------------|
| Faster reads    | Slower writes           |
| Simpler queries | Data inconsistency risk |
| Fewer JOINs     | Storage space           |
| Better caching  | Update complexity       |

How to Manage:
1. Triggers - Auto-update denormalized data
2. Materialized views - Database maintains consistency
3. Application logic - Update both places
4. Eventual consistency - Async sync jobs
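
For option 1, a minimal trigger sketch (PostgreSQL 11+ syntax, assuming a Comments table and the Posts.comment_count column from above):

CREATE FUNCTION bump_comment_count() RETURNS trigger AS $$
BEGIN
    -- Keep the denormalized counter in sync on every new comment
    UPDATE Posts SET comment_count = comment_count + 1 WHERE id = NEW.post_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER comments_after_insert
AFTER INSERT ON Comments
FOR EACH ROW EXECUTE FUNCTION bump_comment_count();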

Example: Materialized View

CREATE MATERIALIZED VIEW OrderSummary AS
SELECT
    customer_id,
    COUNT(*) as order_count,
    SUM(total) as total_spent
FROM Orders
GROUP BY customer_id;

-- Refresh periodically
REFRESH MATERIALIZED VIEW OrderSummary;

Key Points to Look For:
- Knows when it's appropriate
- Understands trade-offs
- Has strategies for consistency

Follow-up: How do you keep denormalized data in sync?

11.

One-to-Many vs Many-to-Many relationships

How do you model one-to-many and many-to-many relationships?

Junior

One-to-Many (1:N):
One record in table A can relate to many in table B.

-- One department has many employees
CREATE TABLE Departments (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE Employees (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    department_id INT REFERENCES Departments(id)  -- FK on "many" side
);

Many-to-Many (M:N):
Records in A can relate to many in B, and vice versa.

-- Students can enroll in many courses
-- Courses can have many students
CREATE TABLE Students (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE Courses (
    id SERIAL PRIMARY KEY,
    title VARCHAR(200)
);

-- Junction/Bridge table
CREATE TABLE Enrollments (
    student_id INT REFERENCES Students(id),
    course_id INT REFERENCES Courses(id),
    enrolled_at TIMESTAMP DEFAULT NOW(),
    grade VARCHAR(2),
    PRIMARY KEY (student_id, course_id)
);

Junction Table Benefits:
- Stores relationship attributes (enrolled_at, grade)
- Prevents duplicate relationships (composite PK)
- Allows efficient queries both directions

Querying:

-- One-to-Many: Get employees in department
SELECT e.name
FROM Employees e
WHERE e.department_id = 1;

-- Many-to-Many: Get courses for student
SELECT c.title
FROM Courses c
JOIN Enrollments e ON c.id = e.course_id
WHERE e.student_id = 1;

-- Get students in course
SELECT s.name
FROM Students s
JOIN Enrollments e ON s.id = e.student_id
WHERE e.course_id = 1;

Common Patterns:

| Relationship | Example        | Implementation                |
|--------------|----------------|-------------------------------|
| 1:1          | User ↔ Profile | FK with UNIQUE, or same table |
| 1:N          | Author → Books | FK on "many" side             |
| M:N          | Tags ↔ Posts   | Junction table                |
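
A sketch of the 1:1 pattern from the table above, using a hypothetical Profiles table; the UNIQUE constraint on the foreign key is what limits each user to at most one profile:

CREATE TABLE Profiles (
    id SERIAL PRIMARY KEY,
    user_id INT NOT NULL UNIQUE REFERENCES Users(id),  -- UNIQUE turns 1:N into 1:1
    bio TEXT
);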

Key Points to Look For:
- Knows where to place FK
- Understands junction tables
- Can add relationship attributes

Follow-up: How do you model self-referential relationships?

12.

Entity-Relationship diagrams

How do you read and create an Entity-Relationship diagram?

Junior

ER Diagram Components:

1. Entities (Tables)
Rectangles representing objects.

┌─────────┐
│ Student │
└─────────┘

2. Attributes (Columns)
Ovals connected to entities.

      ○ name
      │
┌─────────┐
│ Student │───○ email
└─────────┘
      │
      ◉ id (PK: filled/underlined)

3. Relationships
Diamonds connecting entities.

┌─────────┐         ┌────────┐
│ Student │───◇───│ Course │
└─────────┘ enrolls └────────┘

4. Cardinality Notation

1   ─────│         One
M/N ─────<         Many
0   ─────○         Zero (optional)

Chen Notation:
1:1    ─────│─────│─────
1:N    ─────│─────<─────
M:N    ─────<─────>─────

Crow's Foot:
─────||──────    One and only one
─────○|──────    Zero or one
─────|<──────    One or many
─────○<──────    Zero or many

Example ER Diagram:

┌──────────┐       ┌─────────────┐       ┌─────────┐
│ Customer │──||──○<──│ Orders    │>○──||──│ Product │
├──────────┤ places   ├─────────────┤ contains├─────────┤
│ PK: id   │          │ PK: id      │         │ PK: id  │
│ name     │          │ FK: cust_id │         │ name    │
│ email    │          │ order_date  │         │ price   │
└──────────┘          │ total       │         └─────────┘
                      └─────────────┘

Customer places 0-or-more Orders
Order contains 1-or-more Products (via OrderItems)

Reading Cardinality:

"A Customer can place many Orders"
"An Order belongs to exactly one Customer"

Customer ||──○< Order
         1      0..*
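
As a sketch, the diagram above might translate to tables like these (OrderItems is the implied junction table; column names are assumptions):

CREATE TABLE Customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(255)
);

CREATE TABLE Orders (
    id SERIAL PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES Customers(id),  -- ||: exactly one customer per order
    order_date DATE,
    total DECIMAL(10,2)
);

CREATE TABLE Products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    price DECIMAL(10,2)
);

CREATE TABLE OrderItems (
    order_id INT REFERENCES Orders(id),
    product_id INT REFERENCES Products(id),
    quantity INT,
    PRIMARY KEY (order_id, product_id)
);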

Key Points to Look For:
- Knows notation styles
- Can read cardinality
- Can convert to schema

Follow-up: How do you represent inheritance in an ERD?

13.

Database schema design best practices

What are best practices for database schema design?

Mid

1. Naming Conventions

-- Consistent, descriptive names
-- Tables: plural, snake_case
CREATE TABLE order_items (...);

-- Columns: snake_case, avoid abbreviations
customer_id (not cust_id or custId)

-- Foreign keys: table_id
user_id, product_id

-- Indexes: table_column_idx
CREATE INDEX orders_customer_id_idx ON orders(customer_id);

2. Use Appropriate Data Types

-- Be specific
age SMALLINT,           -- Not INT
status VARCHAR(20),     -- Not VARCHAR(255)
price DECIMAL(10,2),    -- Not FLOAT for money
created_at TIMESTAMP,   -- Not VARCHAR for dates
is_active BOOLEAN       -- Not INT/CHAR

-- Use ENUM or lookup tables for fixed values
CREATE TYPE status AS ENUM ('pending', 'active', 'inactive');

3. Always Have Primary Keys

-- Prefer surrogate keys
id SERIAL PRIMARY KEY

-- Use UUID for distributed systems
id UUID DEFAULT gen_random_uuid() PRIMARY KEY

4. Define Foreign Keys

-- Enforce referential integrity
user_id INT NOT NULL REFERENCES users(id)
    ON DELETE CASCADE
    ON UPDATE CASCADE

5. Index Strategically

-- Index foreign keys
CREATE INDEX ON orders(customer_id);

-- Index frequently filtered columns
CREATE INDEX ON products(category_id, status);

-- Composite indexes for multi-column queries
CREATE INDEX ON orders(status, created_at);

6. Use NOT NULL Default

-- Explicit about nullability
name VARCHAR(100) NOT NULL,
middle_name VARCHAR(100),  -- Intentionally nullable

7. Audit Columns

-- Track record lifecycle
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
created_by INT REFERENCES users(id),

8. Soft Delete Pattern

deleted_at TIMESTAMP,  -- NULL = active

-- Query active records
SELECT * FROM users WHERE deleted_at IS NULL;

9. Version/Optimistic Locking

version INT DEFAULT 0,

-- Update with version check
UPDATE products
SET name = 'New', version = version + 1
WHERE id = 1 AND version = 5;

Key Points to Look For:
- Consistent naming
- Appropriate data types
- Thinks about indexes
- Considers audit needs

Follow-up: How do you handle schema migrations in production?


Transactions & ACID

14.

ACID properties explained

What are the ACID properties? Why are they important?

Mid

ACID is the set of guarantees a relational database gives for transactions:

Atomicity:
All-or-nothing. Either every statement in the transaction takes effect or none do; a failure part-way through rolls the whole transaction back.

Consistency:
A transaction moves the database from one valid state to another - constraints, foreign keys, and triggers are never left violated by a committed transaction.

Isolation:
Concurrent transactions don't see each other's intermediate states. How strictly this holds is configurable via isolation levels (next question).

Durability:
Once COMMIT is acknowledged, the changes survive crashes and power loss - typically via a write-ahead log flushed to disk before acknowledging.

Why They Matter:
- Without atomicity, a failed transfer could debit one account without crediting the other
- Without isolation, concurrent transactions could act on half-finished data
- Without durability, "committed" data could silently disappear after a crash
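
A minimal sketch of atomicity and durability, assuming a simple accounts table: either both updates take effect and survive a crash, or neither does.

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;   -- durable once acknowledged; any error before this point rolls both updates back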

Key Points to Look For:
- Can explain all four properties with examples
- Connects isolation to isolation levels
- Knows durability typically relies on the write-ahead log

Follow-up: Which ACID property do isolation levels deliberately relax, and why?

15.

Transaction isolation levels

What are the different transaction isolation levels?

Mid

Isolation Levels (least to most strict):

1. READ UNCOMMITTED:
Can see uncommitted changes from other transactions.

-- Transaction A              -- Transaction B
BEGIN;                        BEGIN;
UPDATE x SET val = 10;
                              SELECT val FROM x; -- Sees 10!
ROLLBACK;                     -- "Dirty read"

2. READ COMMITTED (PostgreSQL default):
Only sees committed data.

-- Transaction A              -- Transaction B
BEGIN;                        BEGIN;
SELECT val FROM x; -- 5       UPDATE x SET val = 10;
                              COMMIT;
SELECT val FROM x; -- 10!     -- "Non-repeatable read"

3. REPEATABLE READ (MySQL default):
Same query returns same results within transaction.

-- Transaction A              -- Transaction B
BEGIN;                        BEGIN;
SELECT * FROM x WHERE id < 5;
                              INSERT INTO x (id) VALUES (3);
                              COMMIT;
SELECT * FROM x WHERE id < 5; -- Phantom row possible

4. SERIALIZABLE:
Transactions execute as if sequential.

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Full isolation, may cause serialization failures

Phenomena:

| Isolation Level  | Dirty Read | Non-Repeatable Read | Phantom Read |
|------------------|------------|---------------------|--------------|
| Read Uncommitted | Yes        | Yes                 | Yes          |
| Read Committed   | No         | Yes                 | Yes          |
| Repeatable Read  | No         | No                  | Yes*         |
| Serializable     | No         | No                  | No           |

*MySQL's RR prevents phantoms via gap locking

Setting Isolation:

-- PostgreSQL
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
-- ...
COMMIT;

-- MySQL
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

Trade-offs:
- Higher isolation = More locking = Lower concurrency
- Lower isolation = Better performance = More anomalies

Key Points to Look For:
- Knows all levels
- Understands phenomena
- Knows database defaults

Follow-up: How does MVCC implement isolation?

16.

Dirty reads, phantom reads, non-repeatable reads

Explain the different read anomalies in database transactions.

Mid

1. Dirty Read:
Reading uncommitted data that may be rolled back.

-- Transaction A                  -- Transaction B
BEGIN;                            BEGIN;
UPDATE accounts
SET balance = 500 WHERE id = 1;
                                  SELECT balance FROM accounts
                                  WHERE id = 1;   -- Sees 500 (uncommitted!)
ROLLBACK;                         -- B acted on data that never existed

2. Non-Repeatable Read:
Same query returns different values within one transaction.

-- Transaction A                  -- Transaction B
BEGIN;
SELECT balance FROM accounts
WHERE id = 1;   -- 100
                                  UPDATE accounts SET balance = 200
                                  WHERE id = 1;
                                  COMMIT;
SELECT balance FROM accounts
WHERE id = 1;   -- 200 (same query, different value)

3. Phantom Read:
New rows appear/disappear between queries.

-- Transaction A                  -- Transaction B
BEGIN;
SELECT COUNT(*) FROM orders
WHERE total > 100;   -- 5 rows
                                  INSERT INTO orders (total) VALUES (150);
                                  COMMIT;
SELECT COUNT(*) FROM orders
WHERE total > 100;   -- 6 rows (a "phantom" appeared)

Summary:

| Anomaly             | First prevented at                                               |
|---------------------|------------------------------------------------------------------|
| Dirty read          | READ COMMITTED                                                   |
| Non-repeatable read | REPEATABLE READ                                                  |
| Phantom read        | SERIALIZABLE (MySQL's REPEATABLE READ also prevents it via gap locks) |

Real-World Impact:

Dirty Read:

A report sums account balances while a large transfer is in flight and later rolled back - the report shows money that never existed.

Non-Repeatable:

A checkout flow reads a price, the price is updated and committed by another session, then the same transaction re-reads it for the invoice and charges a different amount.

Phantom:

A batch job counts pending orders and processes "that many", but new pending orders inserted mid-transaction are silently included or missed.

Key Points to Look For:
- Can explain all three
- Knows real-world implications
- Understands which isolation level prevents each

Follow-up: How do you handle these in application code?

17.

Optimistic vs Pessimistic locking

What's the difference between optimistic and pessimistic locking?

Senior

Pessimistic Locking:
Lock resource before reading, preventing others from accessing.

-- Lock row for update
BEGIN;
SELECT * FROM products WHERE id = 1 FOR UPDATE;
-- Other transactions wait here
-- ... update logic ...
UPDATE products SET stock = stock - 1 WHERE id = 1;
COMMIT;

Optimistic Locking:
Don't lock, but check for conflicts at update time.

-- Read with version
SELECT id, stock, version FROM products WHERE id = 1;
-- (id=1, stock=10, version=5)

-- Update only if version unchanged
UPDATE products
SET stock = 9, version = 6
WHERE id = 1 AND version = 5;

-- If 0 rows updated, someone else modified it → retry

Comparison:

| Aspect       | Pessimistic   | Optimistic   |
|--------------|---------------|--------------|
| Lock timing  | Before read   | At update    |
| Conflicts    | Prevented     | Detected     |
| Throughput   | Lower         | Higher       |
| Best for     | High conflict | Low conflict |
| Failure mode | Wait/timeout  | Retry        |

When to Use:

Pessimistic:
- High contention resources
- Conflicts are expensive
- Short transactions
- Banking, inventory

-- Explicit lock
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;

-- Lock types
FOR UPDATE      -- Exclusive lock
FOR SHARE       -- Shared lock (multiple readers)
FOR UPDATE NOWAIT   -- Fail immediately if locked
FOR UPDATE SKIP LOCKED -- Skip locked rows

Optimistic:
- Low contention
- Conflicts rare
- Long read time
- User edits (forms)

// JPA optimistic locking
@Entity
public class Product {
    @Version
    private Long version;
}

// Throws OptimisticLockException on conflict

Hybrid Approach:

-- Optimistic with timeout fallback
BEGIN;
SELECT * FROM products WHERE id = 1;
-- Process...

-- Try optimistic
UPDATE products SET stock = 9 WHERE id = 1 AND version = 5;
IF @@ROWCOUNT = 0 THEN   -- pseudocode: row-count check syntax varies by database
    -- Fall back to pessimistic retry
    SELECT * FROM products WHERE id = 1 FOR UPDATE;
    -- ...
END IF;
COMMIT;

Key Points to Look For:
- Knows both approaches
- Understands trade-offs
- Can implement version column

Follow-up: How do you handle optimistic lock failures?

18.

Deadlock detection and prevention

How do databases detect and prevent deadlocks?

Senior

Deadlock: Two or more transactions waiting for each other's locks.

Transaction A          Transaction B
───────────────       ───────────────
Lock Row 1
                      Lock Row 2
Lock Row 2 (WAIT)
                      Lock Row 1 (WAIT)
        ↓ DEADLOCK! ↓

Detection:

1. Wait-For Graph:

Database builds: A → B → C → A (cycle = deadlock)

A waits for B
B waits for C
C waits for A  → Cycle detected!

2. Timeout:

SET lock_timeout = 5000;  -- 5 seconds
-- Transaction aborted if waiting too long

Resolution:
Database picks a "victim" transaction to abort.

Victim selection criteria:
- Transaction age (younger = victim)
- Amount of work done
- Number of locks held
- Priority hints

Prevention Strategies:

1. Lock Ordering:

-- Always lock in same order
-- BAD:
-- T1: Lock A, Lock B
-- T2: Lock B, Lock A

-- GOOD:
-- T1: Lock A, Lock B
-- T2: Lock A, Lock B

2. Lock All at Once:

BEGIN;
SELECT * FROM accounts WHERE id IN (1, 2) FOR UPDATE;
-- Lock both before modifying either
COMMIT;

3. Keep Transactions Short:

-- BAD: Long transaction
BEGIN;
SELECT ...
-- User thinks for 5 minutes
UPDATE ...
COMMIT;

-- GOOD: Quick transaction
-- Do thinking outside transaction
BEGIN;
UPDATE ...
COMMIT;

4. Use NOWAIT / SKIP LOCKED:

SELECT * FROM items WHERE id = 1 FOR UPDATE NOWAIT;
-- Fails immediately if locked, no waiting

SELECT * FROM jobs FOR UPDATE SKIP LOCKED LIMIT 1;
-- Gets next unlocked row (queue pattern)

Application Handling:

try {
    runTransaction();  // attempt the transaction
} catch (DeadlockException e) {
    // Chosen as the deadlock victim: back off for a random interval, then retry
    Thread.sleep(ThreadLocalRandom.current().nextLong(100, 1000));
    retry();
}

Key Points to Look For:
- Understands deadlock conditions
- Knows detection mechanisms
- Has prevention strategies

Follow-up: How do you investigate deadlocks in production?

19.

Two-phase commit protocol

What is two-phase commit (2PC)? What are its limitations?

Senior

Two-Phase Commit (2PC): A protocol for committing a transaction atomically across multiple nodes (databases, services), driven by a coordinator.

Phase 1 - Prepare:

Coordinator asks every participant: "Can you commit?"
Each participant writes its changes durably, holds its locks, and votes YES or NO.

Phase 2 - Commit / Abort:

If every participant voted YES → coordinator sends COMMIT to all.
If any voted NO (or timed out) → coordinator sends ABORT to all.

Coordinator             Participant A        Participant B
───────────             ─────────────        ─────────────
PREPARE ──────────────────→ vote YES
PREPARE ────────────────────────────────────────→ vote YES
COMMIT  ──────────────────→ commit
COMMIT  ────────────────────────────────────────→ commit

Limitations:

1. Blocking: Participants hold locks between PREPARE and the coordinator's decision. If the coordinator crashes after PREPARE, participants are "in doubt" and keep their locks until it recovers.

2. Coordinator is a single point of failure.

3. Latency: Two network round trips plus synchronous disk writes on every participant.

4. Poor behavior under partitions: A network partition can leave participants blocked indefinitely.

Alternatives: Three-phase commit (rarely used in practice), and sagas - a sequence of local transactions with compensating actions - which trade atomicity for availability.
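
As a concrete sketch, MySQL exposes the two phases through XA statements (the transaction id 'trx1' is an arbitrary example):

XA START 'trx1';
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
XA END 'trx1';
XA PREPARE 'trx1';   -- phase 1: changes made durable, participant votes YES
XA COMMIT 'trx1';    -- phase 2: coordinator's decision applied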

Key Points to Look For:
- Knows both phases and the coordinator's role
- Understands the blocking problem and coordinator failure
- Can name alternatives (sagas, idempotent retries)

Follow-up: How does the saga pattern avoid 2PC's blocking problem?

20.

How do B-tree indexes work?

How do B-tree indexes work and why are they efficient?

Mid

B-tree (Balanced tree) is the default index structure in most databases.

Structure:

                    [50]
                   /    \
           [20, 35]      [70, 85]
          /   |   \     /   |   \
       [10] [25,30] [40] [60] [75,80] [90,95]
              ↓
           Leaf nodes point to actual data

Properties:
- Balanced: All leaf nodes at same depth
- Sorted: Keys in order
- Each node fits in one disk page (4-16KB)
- O(log n) search, insert, delete

Why Efficient:

1. Minimizes Disk I/O:

1 million rows:
- Binary tree: ~20 levels = 20 disk reads
- B-tree (order 100): ~3 levels = 3 disk reads

2. Range Queries:

SELECT * FROM users WHERE age BETWEEN 25 AND 35;

-- B-tree: Find 25, scan leaves until 35
-- No need to traverse tree for each value

3. Sorted Output:

SELECT * FROM users ORDER BY age;
-- Already sorted in index!

Creating Index:

CREATE INDEX idx_users_age ON users(age);

-- Composite index
CREATE INDEX idx_orders_cust_date ON orders(customer_id, order_date);

Index Types:

-- B-tree (default)
CREATE INDEX ... USING btree;

-- Hash (equality only)
CREATE INDEX ... USING hash;

-- GiST (geometric, full-text)
CREATE INDEX ... USING gist;

-- GIN (arrays, full-text)
CREATE INDEX ... USING gin;

When B-tree Helps:
- Equality: WHERE id = 5
- Range: WHERE age > 25
- Sorting: ORDER BY name
- Prefix: WHERE name LIKE 'Jo%'

When B-tree Doesn't Help:
- Suffix: WHERE name LIKE '%son'
- Not equal: WHERE status != 'active'
- Functions: WHERE UPPER(name) = 'JOHN'
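
A caveat on the last point: most databases support expression (function-based) indexes, so the function form can be made indexable; a sketch in PostgreSQL syntax:

CREATE INDEX idx_users_upper_name ON users (UPPER(name));

SELECT * FROM users WHERE UPPER(name) = 'JOHN';  -- can now use idx_users_upper_name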

Key Points to Look For:
- Knows tree structure
- Understands disk I/O benefit
- Knows when index helps

Follow-up: How does a composite index work?

21.

Clustered vs Non-clustered indexes

What's the difference between clustered and non-clustered indexes?

Mid

Clustered Index:
- Defines the physical order of rows; the leaf level of the index IS the table data
- Only ONE per table (rows can only be stored in one order)
- MySQL/InnoDB: the primary key is always the clustered index
- SQL Server: you choose it (by default the primary key)
- PostgreSQL: tables are heaps; there is no maintained clustered index (CLUSTER is a one-time reorder)

Non-Clustered (Secondary) Index:
- Separate structure from the table
- Leaf entries hold the indexed columns plus a row locator (heap pointer or the clustered key)
- Many per table

Lookup Path:

Clustered index on id:
    search tree on id → leaf page already contains the full row

Non-clustered index on email:
    search tree on email → leaf entry (email, id) → second lookup by id to fetch the row

Comparison:

| Feature       | Clustered                  | Non-clustered           |
|---------------|----------------------------|-------------------------|
| Per table     | One                        | Many                    |
| Leaf contents | Full row data              | Key + row locator       |
| Extra lookup  | No                         | Yes (unless covering)   |
| Best for      | Range scans on the key     | Secondary access paths  |
| Write cost    | Page splits on random keys | Separate, smaller tree  |

Practical Implications:
- Prefer a narrow, monotonically increasing clustered key (e.g., an integer id) to avoid page splits
- In InnoDB/SQL Server every secondary index stores the clustered key, so a wide clustered key bloats all other indexes
- A query through a non-clustered index pays an extra lookup per row unless the index covers the query
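
A sketch in SQL Server syntax, where the distinction is explicit in the CREATE INDEX statement:

-- The clustered index defines the physical order of the table (one per table)
CREATE CLUSTERED INDEX ix_orders_id ON orders(id);

-- A non-clustered index is a separate structure whose leaf entries point back to the clustered key
CREATE NONCLUSTERED INDEX ix_orders_customer ON orders(customer_id);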

Key Points to Look For:
- Knows there is only one clustered index per table
- Understands the extra lookup through a non-clustered index
- Knows how their database implements it (InnoDB vs SQL Server vs PostgreSQL heaps)

Follow-up: What happens to the non-clustered indexes when you rebuild the clustered index on a different key?

22.

When indexes hurt performance

When can indexes hurt performance instead of helping?

Mid

Indexes Have Costs:

1. Write Overhead:

INSERT INTO users (name, email, phone, city, status)
VALUES (...);

-- Without indexes: 1 write
-- With 5 indexes: 6 writes (data + 5 index updates)

2. Storage Space:

-- Each index is a copy of indexed columns + pointers
-- Large tables with many indexes = significant storage

3. Query Optimizer Confusion:
Too many indexes can lead to suboptimal plans.

When Indexes Hurt:

1. Low Selectivity:

-- Index on gender (2 values)
SELECT * FROM users WHERE gender = 'M';
-- Returns 50% of rows
-- Full scan is faster than index lookup!

-- Rule of thumb: if more than ~15-20% of rows match, a full scan is often faster

2. Small Tables:

-- 100 row table
-- Full scan: 1-2 pages
-- Index scan: 2+ pages (index + data)

3. Write-Heavy Workloads:

-- Logging table with 1000 inserts/second
-- Each insert updates all indexes
-- Minimal reads → indexes mostly overhead

4. Unused Indexes:

-- Index exists but never used
-- Still updated on every write
-- Find unused indexes:
SELECT * FROM pg_stat_user_indexes
WHERE idx_scan = 0;

5. Duplicate Indexes:

-- These overlap!
CREATE INDEX idx1 ON orders(customer_id);
CREATE INDEX idx2 ON orders(customer_id, order_date);
-- idx2 can serve queries that use idx1

6. Functions on Indexed Columns:

-- Index on created_at
-- This query CAN'T use it:
WHERE YEAR(created_at) = 2024

-- This CAN:
WHERE created_at >= '2024-01-01'
  AND created_at < '2025-01-01'

Best Practices:

-- Regularly check index usage
SELECT
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan;

-- Remove unused indexes
DROP INDEX IF EXISTS unused_idx;

Key Points to Look For:
- Understands write cost
- Knows selectivity impact
- Can identify problematic indexes

Follow-up: How do you identify which indexes to remove?

23.

Query execution plans - how to read them

How do you read a query execution plan to optimize performance?

Senior

Getting Execution Plan:

-- PostgreSQL
EXPLAIN ANALYZE
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'pending';

-- MySQL
EXPLAIN FORMAT=JSON
SELECT ...;

Reading the Plan:

                                    QUERY PLAN
--------------------------------------------------------------------------------
Nested Loop  (cost=0.29..16.34 rows=1 width=100) (actual time=0.015..0.017 rows=1)
  -> Index Scan using idx_status on orders o  (cost=0.14..8.16 rows=1 width=50)
       Index Cond: (status = 'pending')
  -> Index Scan using customers_pkey on customers c  (cost=0.15..8.17 rows=1)
       Index Cond: (id = o.customer_id)
Planning Time: 0.123 ms
Execution Time: 0.045 ms

Key Metrics:

Cost: Estimated I/O + CPU

cost=0.29..16.34
     ^      ^
  startup  total

Rows: Estimated vs actual

rows=100 (estimated) vs actual rows=150
-- Big difference = stale statistics!

Width: Bytes per row

Common Operations:

| Operation         | Description                     | Performance           |
|-------------------|---------------------------------|-----------------------|
| Seq Scan          | Full table scan                 | Slow for large tables |
| Index Scan        | B-tree traversal                | Fast                  |
| Index Only Scan   | Data in index                   | Fastest               |
| Bitmap Index Scan | Multiple index results combined | Medium                |
| Nested Loop       | For each outer, scan inner      | Good for small sets   |
| Hash Join         | Build hash, probe               | Good for large sets   |
| Merge Join        | Both sides sorted               | Good if pre-sorted    |
| Sort              | External sort                   | Can be expensive      |

Red Flags:

1. Seq Scan on large table
   → Add index?

2. Nested Loop with large tables
   → Force hash join? Add index?

3. Sort with high cost
   → Add index for ordering?

4. Estimated vs actual rows differ significantly
   → ANALYZE table

5. Filter after scan (not in index condition)
   → Index doesn't fully cover query

Optimization Example:

-- Before: Seq Scan (slow)
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
-- Seq Scan on orders, Filter: status = 'pending'

-- Add index
CREATE INDEX idx_orders_status ON orders(status);

-- After: Index Scan (fast)
EXPLAIN SELECT * FROM orders WHERE status = 'pending';
-- Index Scan using idx_orders_status

Key Points to Look For:
- Can read basic plan elements
- Knows what to look for
- Can suggest optimizations

Follow-up: How do you force a specific execution plan?

24.

Index covering and include columns

What is a covering index? When do you use INCLUDE columns?

Senior

Covering Index: An index that contains every column a query needs, so the database can answer the query from the index alone without visiting the table (PostgreSQL calls this an index-only scan).

Example query:

SELECT order_date, total
FROM orders
WHERE customer_id = 42;

-- Index on (customer_id) alone: finds matching entries, then fetches each row for order_date/total
-- Index on (customer_id, order_date, total): everything the query needs is already in the index

INCLUDE Columns:
Non-key columns stored only in the index's leaf pages (see the sketch below).

- Not part of the key: can't be used for searching or ordering, only for returning data
- Keep the upper B-tree levels small, since only key columns appear there
- Preserve uniqueness semantics: a UNIQUE index stays unique on the key columns alone
- Help stay within key-size limits when covering wide columns

When to Use:
- Hot, read-heavy queries that filter on a few columns but also return a few more
- Removing repeated table lookups ("bookmark lookups") that show up in execution plans

Costs:
- Larger index → more storage and slower INSERT/UPDATE/DELETE
- Only worth maintaining for frequently run queries
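
A sketch using INCLUDE (PostgreSQL 11+ and SQL Server support this syntax):

CREATE INDEX idx_orders_customer_covering
ON orders (customer_id)
INCLUDE (order_date, total);

-- Answered from the index alone (index-only scan), no table access needed
SELECT order_date, total FROM orders WHERE customer_id = 42;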


Key Points to Look For:
- Understands index-only scans
- Knows the difference between key columns and INCLUDE columns
- Weighs the read benefit against write/storage cost

Follow-up: How do you confirm in an execution plan that a query used an index-only scan?


25.

Query optimization techniques

What are common query optimization techniques?

Senior

1. Use Indexes Effectively:

-- Add missing indexes
CREATE INDEX idx_orders_status ON orders(status);

-- Use index-friendly conditions
WHERE created_at >= '2024-01-01'  -- Good
WHERE YEAR(created_at) = 2024    -- Bad (can't use index)

2. Avoid SELECT *:

-- Bad: Fetches all columns
SELECT * FROM orders WHERE customer_id = 1;

-- Good: Only needed columns
SELECT id, total, status FROM orders WHERE customer_id = 1;

3. Limit Data Early:

-- The optimizer usually pushes WHERE filters below the join,
-- so this is fine; the point is to reduce rows as early as possible
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'pending';

-- For complex cases, pre-filter or pre-aggregate in a subquery/CTE
-- so the join works on fewer rows

4. Use EXISTS Instead of IN:

-- IN: May materialize the whole subquery first (many optimizers rewrite both forms as semi-joins)
SELECT * FROM customers
WHERE id IN (SELECT customer_id FROM orders);

-- EXISTS: Stops at first match
SELECT * FROM customers c
WHERE EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id);

5. Pagination Properly:

-- Bad: Large offset
SELECT * FROM products ORDER BY id LIMIT 10 OFFSET 100000;
-- Scans 100,010 rows!

-- Good: Keyset pagination
SELECT * FROM products
WHERE id > 100000
ORDER BY id LIMIT 10;

6. Batch Operations:

-- Bad: Many single inserts
INSERT INTO logs VALUES (1, ...);
INSERT INTO logs VALUES (2, ...);

-- Good: Batch insert
INSERT INTO logs VALUES
    (1, ...),
    (2, ...),
    (3, ...);

7. Avoid N+1 Queries:

-- Bad: N+1
SELECT * FROM orders;  -- 1 query
-- For each order:
SELECT * FROM items WHERE order_id = ?;  -- N queries

-- Good: JOIN or batch
SELECT o.*, i.*
FROM orders o
JOIN items i ON o.id = i.order_id;

8. Update Statistics:

-- PostgreSQL
ANALYZE orders;

-- MySQL
ANALYZE TABLE orders;

9. Use UNION ALL:

-- UNION: Sorts and deduplicates
SELECT id FROM table1 UNION SELECT id FROM table2;

-- UNION ALL: Just concatenates
SELECT id FROM table1 UNION ALL SELECT id FROM table2;

10. Denormalize for Read Performance:

-- Add redundant columns for frequent queries
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(100);

Key Points to Look For:
- Knows multiple techniques
- Understands why each helps
- Can apply to real queries

Follow-up: How do you identify slow queries in production?


NoSQL & Distributed

26.

SQL vs NoSQL - decision criteria

When would you choose NoSQL over SQL?

Mid

Choose Based On:

1. Data Structure:

SQL: Structured, relational data with a stable schema (orders, customers, line items).
NoSQL: Flexible, nested, or rapidly evolving data.

// Document model: an order with embedded items (MongoDB-style)
{
  "customer": "Alice",
  "items": [
    {"sku": "BOOK-1", "qty": 2},
    {"sku": "PEN-9",  "qty": 1}
  ]
}

2. Scale Requirements:

SQL: Scales vertically by default; horizontal scaling (sharding) is possible but largely manual.
NoSQL: Built for horizontal scale - add nodes, and the store partitions and replicates data itself.

3. Consistency Requirements:

Need ACID transactions spanning multiple entities (payments, inventory)? → SQL
Eventual consistency acceptable (feeds, counters, sessions)? → NoSQL is often fine

4. Query Patterns:

Ad-hoc queries, JOINs, aggregations, reporting → SQL
Known access patterns, key-based lookups at high volume → NoSQL

Decision Matrix:

| Requirement           | SQL      | NoSQL    |
|-----------------------|----------|----------|
| ACID transactions     | Yes      | Varies   |
| Flexible schema       | No       | Yes      |
| Horizontal scale      | Limited  | Built-in |
| Complex queries       | Yes      | Limited  |
| Relational data       | Yes      | No       |
| High write throughput | Moderate | High     |

Use Cases:

| SQL                  | NoSQL               |
|----------------------|---------------------|
| Banking, finance     | User sessions       |
| ERP systems          | Real-time analytics |
| E-commerce orders    | Content management  |
| Inventory management | IoT data            |
| Reporting            | Caching             |

Hybrid Approach:

PostgreSQL     → orders, payments, inventory (ACID, relational)
Redis          → sessions, caching, rate limiting
Elasticsearch  → full-text search

Key Points to Look For:
- Knows trade-offs
- Considers specific requirements
- Not dogmatic (uses both)

Follow-up: Can you use both SQL and NoSQL in the same system?

27.

Document, Key-Value, Column, Graph databases

Explain the different types of NoSQL databases.

Mid

1. Key-Value Stores:
Simple key → value mapping.

Key: "user:123"
Value: {name: "Alice", email: "alice@example.com"}

Examples: Redis, DynamoDB, Memcached

Use cases: Caching, sessions, shopping carts

2. Document Databases:
Store semi-structured documents (JSON, BSON).

// MongoDB document
{
  _id: ObjectId("..."),
  name: "Alice",
  orders: [
    {item: "Book", price: 15},
    {item: "Pen", price: 2}
  ]
}

Examples: MongoDB, CouchDB, Firestore
Use cases: Content management, user profiles, catalogs

3. Column-Family (Wide-Column):
Store data in columns, not rows.

Row Key: user:123
Column Family: profile
  name: "Alice"
  email: "alice@example.com"
Column Family: activity
  last_login: "2024-01-15"
  page_views: 1500

Examples: Cassandra, HBase, ScyllaDB
Use cases: Time-series, IoT, write-heavy workloads

4. Graph Databases:
Nodes and relationships as first-class citizens.

(Alice)-[FRIENDS_WITH]->(Bob)
(Alice)-[PURCHASED]->(Product123)

Examples: Neo4j, Amazon Neptune, ArangoDB
Use cases: Social networks, recommendations, fraud detection

Comparison:

| Type      | Strength                | Weakness                         |
|-----------|-------------------------|----------------------------------|
| Key-Value | Speed, simplicity       | No complex queries               |
| Document  | Flexibility, nesting    | No JOINs                         |
| Column    | Write throughput, scale | Query limitations                |
| Graph     | Relationships           | Less suitable for non-graph data |

Selection Guide:

Need simple caching? → Key-Value
Need flexible documents? → Document
Need high write throughput? → Column
Need relationship queries? → Graph
Need transactions + queries? → SQL

Key Points to Look For:
- Knows all four types
- Understands strengths/weaknesses
- Can match to use cases

Follow-up: How would you model a social network in each type?

28.

CAP theorem explained

Explain the CAP theorem and its implications.

Mid

CAP Theorem: In a distributed system, you can only guarantee 2 of 3:

  • Consistency: Every read receives the most recent write
  • Availability: Every request receives a response
  • Partition tolerance: System works despite network failures

Why Only 2:

When a network partition splits the cluster, a node that cannot reach the others has two options:

1. Answer anyway with whatever data it has → stays Available, but may return stale data (gives up C)
2. Refuse or block the request until the partition heals → stays Consistent, but is not Available (gives up A)

Partitions will happen in any real network, so "P" is not optional - the real choice is C vs A during a partition.

Trade-off Categories:

CP (Consistency + Partition Tolerance):

Examples: ZooKeeper, etcd, HBase, MongoDB (default configuration)

During a partition, the minority side rejects reads/writes rather than risk returning stale or conflicting data.
Use when correctness matters more than uptime (configuration, coordination, financial state).

AP (Availability + Partition Tolerance):

Examples: Cassandra, DynamoDB, CouchDB, DNS

Every node keeps answering during a partition; replicas reconcile afterwards (eventual consistency).
Use when uptime matters more than momentary staleness (carts, feeds, telemetry).

CA (Consistency + Availability):

Only achievable when partitions cannot happen - effectively a single-node system or a traditional RDBMS on one machine. For a distributed system, CA is not a realistic option.

Real-World Nuance:

CAP describes behavior only during partitions; in normal operation a system can offer both consistency and availability. Many systems are also tunable per request - for example, Cassandra lets each query choose its consistency level (ONE vs QUORUM vs ALL), trading consistency against latency and availability.

PACELC Extension:

PACELC: if there is a Partition, trade Availability vs Consistency; Else (normal operation), trade Latency vs Consistency. It captures the everyday latency cost of consistency that CAP ignores.

Key Points to Look For:
- Understands trade-offs
- Knows it applies during partitions
- Can give examples of each

Follow-up: How do you design for eventual consistency?

29.

Eventual consistency vs Strong consistency

When would you choose eventual consistency over strong consistency?

Senior

Strong Consistency:
After write completes, all reads return new value.

Write X=5
─────────────────────
Read X → 5 (guaranteed)
Read X → 5 (guaranteed)

Eventual Consistency:
After write, reads MAY return old value temporarily.

Write X=5
─────────────────────
Read X → 3 (old value)
Read X → 5 (propagated)
Read X → 5 (converged)

When to Use:

Strong Consistency:
- Financial transactions
- Inventory (prevent overselling)
- User authentication
- Anything where stale = incorrect

-- Bank transfer: Must be consistent
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

Eventual Consistency:
- Social media feeds
- Like counts
- Search indexes
- DNS
- Caching

# Like count: Eventually consistent is fine
async def like_post(post_id):
    cache.increment(f"likes:{post_id}")  # Fast, eventually consistent
    queue.enqueue(update_database, post_id)  # Async DB update

Trade-offs:

| Aspect       | Strong        | Eventual         |
|--------------|---------------|------------------|
| Latency      | Higher        | Lower            |
| Availability | Lower         | Higher           |
| Scalability  | Harder        | Easier           |
| Complexity   | Simpler logic | Handle conflicts |

Handling Eventual Consistency:

1. Conflict Resolution:

# Last-write-wins
def resolve(v1, v2):
    return v1 if v1.timestamp > v2.timestamp else v2

# Vector clocks for causality

2. Read-Your-Writes:

# After writing, read from same node
write_to_node_a(value)
read_from_node_a()  # See your write

3. Application-Level:

// Show optimistic update in UI
onClick = () => {
    setLikeCount(prev => prev + 1);  // Optimistic
    api.like(postId);  // Async
}

Key Points to Look For:
- Knows appropriate use cases
- Understands trade-offs
- Has strategies for eventual consistency

Follow-up: How do you handle conflicts in eventual consistency?

30.

Sharding strategies: range, hash, directory

What are different sharding strategies and when to use each?

Senior

Sharding: Distributing data across multiple databases.

1. Range-Based Sharding:
Partition by value ranges.

Shard 1: user_id         1 – 1,000,000
Shard 2: user_id 1,000,001 – 2,000,000
Shard 3: user_id 2,000,001 – 3,000,000

(or by date: one shard per month for time-series data)

Pros:
- Range queries efficient
- Easy to understand
- Natural for time-series

Cons:
- Hot spots (recent data)
- Uneven distribution
- Rebalancing difficult

2. Hash-Based Sharding:
Hash key to determine shard.

shard = hash(user_id) % number_of_shards

Pros:
- Even distribution
- No hot spots
- Simple routing

Cons:
- Range queries across all shards
- Rebalancing expensive (rehash all)
- Consistent hashing helps

Consistent Hashing:

Nodes and keys are placed on a hash ring; each key belongs to the next node clockwise.
Adding or removing a node only relocates roughly 1/N of the keys instead of rehashing everything.

3. Directory-Based Sharding:
Lookup table maps key → shard.

Lookup service:  user_id 123   → shard 7
                 user_id 456   → shard 2
                 tenant "acme" → shard 1 (pinned)

Pros:
- Flexible placement
- Easy to move data
- Custom logic possible

Cons:
- Lookup overhead
- Directory is single point of failure
- Must keep directory updated

Comparison:

| Strategy        | Distribution | Range Query | Rebalance |
|-----------------|--------------|-------------|-----------|
| Range           | Uneven       | Good        | Hard      |
| Hash            | Even         | Bad         | Hard      |
| Consistent Hash | Even         | Bad         | Easy      |
| Directory       | Flexible     | Depends     | Easy      |

Choosing Shard Key:

- High cardinality (many distinct values)
- Even distribution (avoids hot spots)
- Matches the most common query patterns (avoids cross-shard queries)
- Rarely changes (resharding is expensive)

Key Points to Look For:
- Knows multiple strategies
- Understands trade-offs
- Can choose based on requirements

Follow-up: How do you handle cross-shard queries?

31.

Replication: master-slave, master-master

Compare master-slave and master-master replication.

Senior

Master-Slave (Primary-Replica):
One master accepts writes, slaves replicate for reads.

     ┌─────────┐
     │ Master  │ ← All writes
     └────┬────┘
          │ Replicate
    ┌─────┼─────┐
    ↓     ↓     ↓
┌──────┐ ┌──────┐ ┌──────┐
│Slave1│ │Slave2│ │Slave3│  ← Reads
└──────┘ └──────┘ └──────┘

Pros:
- Simple to implement
- No write conflicts
- Read scalability

Cons:
- Master is SPOF
- Replication lag
- Failover complexity

Master-Master (Multi-Master):
Multiple nodes accept writes, sync between them.

┌────────┐      ┌────────┐
│Master 1│ ←──→ │Master 2│
└────────┘      └────────┘
    ↑               ↑
  Writes          Writes

Pros:
- No single point of failure
- Geographic distribution
- Write scalability

Cons:
- Conflict resolution needed
- Complex setup
- Eventually consistent

Conflict Resolution:

Master 1: SET x = 5
Master 2: SET x = 10
Conflict!

Strategies:
1. Last-write-wins (timestamp)
2. Custom merge logic
3. Application-level resolution
4. CRDTs (Conflict-free Replicated Data Types)

Replication Types:

Synchronous:

Write → Wait for all replicas → Ack
+ Strong consistency
- Higher latency

Asynchronous:

Write → Ack → Replicate later
+ Lower latency
- Potential data loss

Semi-Synchronous:

Write → Wait for 1 replica → Ack
Balance of consistency and latency

Use Cases:

| Pattern           | Use Case                    |
|-------------------|-----------------------------|
| Master-Slave      | Read-heavy apps, reporting  |
| Master-Master     | Geographic distribution, HA |
| Sync replication  | Critical data               |
| Async replication | Performance priority        |

Key Points to Look For:
- Knows both patterns
- Understands conflict issues
- Knows sync vs async

Follow-up: How do you handle master failover?