OpenWhisper: Building a Privacy-First Voice Dictation App for macOS

🎙️ I use this daily to dictate commit messages, Slack replies, and code comments without ever leaving my editor.

Here’s a scenario I lived through too many times: I’m deep in a flow state, cursor blinking in the middle of a function, and I need to write a commit message. I ⌘-tab to Terminal, type the message badly because my fingers are still in code mode, make three typos, fix them, tab back, and spend thirty seconds remembering where I was.

It sounds minor. It isn’t. That context switch is a small death each time it happens. Multiply it by the forty or fifty text-entry moments scattered across a developer’s day — commit messages, PR descriptions, Slack threads, inline code comments, GitHub issue bodies — and you’re hemorrhaging focus.

Voice dictation should be the answer. It is, in theory. In practice, macOS Dictation sends audio to Apple’s servers. Dragon NaturallySpeaking is a $200 Windows relic. The built-in shortcuts require you to navigate to the field before activating them. And none of them are programmable: you can’t say “transcribe this through Gemini first and fix the grammar.”

So I built OpenWhisper — a macOS menu bar app that runs OpenAI Whisper entirely locally. Hold FN, speak, release. The transcription appears in whatever app you were in, even if you’ve switched windows since. Audio never leaves your machine. The whole thing is open source, MIT-licensed, and built in Swift.

This post is a deep dive into how it works, what I learned building it, and an interactive tour of the features that matter most.



The Architecture

Before diving into the code, here’s a bird’s-eye view of the data flow: hold FN to capture audio, run Whisper locally on the recording, apply any per-app post-processing rule, then paste the result into the window you started from.

[Interactive diagram: data flow — click a stage to explore]

The key architectural decision is that every single stage happens on your machine. There’s no SaaS intermediary, no background upload, no analytics beacon. The worst that can happen is Gemini post-processing touches the text of your transcription — and even that is opt-in per app.

Why local-first matters for developers: Code snippets, internal project names, credentials you accidentally dictate, unreleased feature names — all of this ends up in dictation buffers on cloud services. With OpenWhisper, the audio never leaves the machine. Whisper runs entirely in-process on your Neural Engine or CPU.


Whisper Model Selection

🚀 The turbo model is a hidden gem — almost as good as large-v3 at a fraction of the latency. I use it daily.

Choosing the right Whisper model is one of the most impactful decisions you’ll make. The difference between tiny and large-v3 is the difference between a fuzzy autocomplete and a verbatim transcript.

[Table: model vs. speed, latency (30 s audio), accuracy, and best use case]

[Chart: ⚡ transcription latency comparison — 30 s audio clip, Apple M2 Pro]

My daily driver is turbo. It was added relatively late to the Whisper family and isn’t as well-known as the others, but it punches massively above its weight class. For commit messages, Slack replies, and short prose, it’s indistinguishable from large-v3 in practice.

Use base if you’re on an older machine or want the absolute fastest cold start. Use large-v3 only if you’re dictating long-form documents or you notice consistent errors with turbo.
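That guidance can be captured in a small helper. This is a sketch only — WhisperModel and recommended are hypothetical names for illustration, not types from the OpenWhisper codebase:

```swift
import Foundation

/// Hypothetical helper mirroring the model advice above —
/// not a type from the app itself.
enum WhisperModel: String, CaseIterable {
    case tiny, base, small, medium, turbo
    case largeV3 = "large-v3"

    /// The value passed to whisper's --model flag.
    var cliName: String { rawValue }

    /// A rough default following the recommendations in this post.
    static func recommended(longForm: Bool, olderMachine: Bool) -> WhisperModel {
        if olderMachine { return .base }   // fastest cold start
        if longForm { return .largeV3 }    // long-form documents
        return .turbo                      // the daily driver
    }
}
```

The raw values double as CLI arguments, so the settings UI and the Process invocation can share one source of truth.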


Push-to-Talk: The Interaction Model

🤫 Push-to-talk means no accidental transcriptions. Perfect for open offices and video calls.

The signature interaction is dead simple: hold FN, speak, release. But making that feel right required solving a few non-obvious problems.

[Interactive demo: hold the mic button to “record”, then release to transcribe.]

The FN key problem

FN is special on macOS. It’s not a standard NSEvent key — it doesn’t go through the normal Cocoa responder chain. To intercept it globally, you need a CGEventTap at the kCGHIDEventTap level with kCGEventFlagsChanged monitoring.

Here’s the core setup:

TranscriptionManager.swift — CGEventTap for FN key
func setupHotkey() {
    let eventMask = (1 << CGEventType.flagsChanged.rawValue)
    guard let tap = CGEvent.tapCreate(
        tap: .cghidEventTap,
        place: .headInsertEventTap,
        options: .defaultTap,
        eventsOfInterest: CGEventMask(eventMask),
        callback: { proxy, type, event, refcon in
            guard let refcon else { return Unmanaged.passRetained(event) }
            let mng = Unmanaged<TranscriptionManager>.fromOpaque(refcon).takeUnretainedValue()
            mng.handleFlagsChanged(event: event)
            return Unmanaged.passRetained(event)
        },
        userInfo: Unmanaged.passUnretained(self).toOpaque()
    ) else {
        print("Failed to create event tap — accessibility permission needed")
        return
    }

    let runLoopSource = CFMachPortCreateRunLoopSource(kCFAllocatorDefault, tap, 0)
    CFRunLoopAddSource(CFRunLoopGetCurrent(), runLoopSource, .commonModes)
    CGEvent.tapEnable(tap: tap, enable: true)
}

private func handleFlagsChanged(event: CGEvent) {
    let fnKeyCode: Int64 = 63  // kVK_Function
    let keyCode = event.getIntegerValueField(.keyboardEventKeycode)
    guard keyCode == fnKeyCode else { return }

    let isDown = event.flags.contains(.maskSecondaryFn)

    if isDown && !isFnKeyCurrentlyPressed {
        isFnKeyCurrentlyPressed = true
        handleFnDown()
    } else if !isDown && isFnKeyCurrentlyPressed {
        isFnKeyCurrentlyPressed = false
        handleFnUp()
    }
}

Launching Whisper: Process Management in Swift

Once recording stops, the WAV file gets handed to Whisper via Foundation.Process. This is more nuanced than it sounds — Whisper is a Python CLI, and it needs ffmpeg on PATH to decode audio formats. A GUI app doesn’t inherit your shell’s $PATH, so the environment it launches with typically won’t include Homebrew’s bin directory.

WhisperService.swift — launching transcription
func transcribe(
    audioURL: URL,
    whisperPath: String,
    ffmpegPath: String,
    model: String = "base",
    completion: @escaping (String?) -> Void
) {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: whisperPath)

    // Inject ffmpeg's directory into PATH so whisper can find it
    let ffmpegDir = URL(fileURLWithPath: ffmpegPath)
        .deletingLastPathComponent().path
    var env = ProcessInfo.processInfo.environment
    env["PATH"] = ffmpegDir + ":" + (env["PATH"] ?? "")
    process.environment = env

    let outputDir = FileManager.default.temporaryDirectory
        .appendingPathComponent("whisper_output")
    try? FileManager.default.createDirectory(
        at: outputDir, withIntermediateDirectories: true)

    process.arguments = [
        audioURL.path,
        "--output_dir", outputDir.path,
        "--output_format", "txt",
        "--model", model.isEmpty ? "base" : model
    ]

    process.terminationHandler = { _ in
        let txtFile = outputDir.appendingPathComponent(
            audioURL.deletingPathExtension().lastPathComponent + ".txt")
        if let text = try? String(contentsOf: txtFile, encoding: .utf8) {
            completion(text.trimmingCharacters(in: .whitespacesAndNewlines))
        } else {
            completion(nil)
        }
        try? FileManager.default.removeItem(at: txtFile)
        try? FileManager.default.removeItem(at: audioURL)
    }
    try? process.run()
}
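The caller needs absolute whisperPath and ffmpegPath values before invoking transcribe. One way to resolve them — a sketch with an illustrative candidate list (typical Homebrew and pip install locations), not an exhaustive search:

```swift
import Foundation

/// Returns the first candidate path that exists on disk, or nil.
func firstExistingPath(in candidates: [String]) -> String? {
    candidates.first { FileManager.default.fileExists(atPath: $0) }
}

// Illustrative lookup; real install locations vary by machine.
let ffmpegPath = firstExistingPath(in: [
    "/opt/homebrew/bin/ffmpeg",  // Apple Silicon Homebrew
    "/usr/local/bin/ffmpeg"      // Intel Homebrew
])
```

Letting users override the resolved paths in settings is a good escape hatch for non-standard installs.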

The Transcription Queue

One of the most satisfying features to build was the queue. The naive implementation processes one recording at a time and blocks new recordings until it’s done. That’s terrible UX — you end up holding your breath waiting for the transcription to finish before you can start speaking again.

The queue decouples recording from transcription. You can hold FN and start a new recording the moment you release from the previous one, even if Whisper is still processing the first clip.

TranscriptionJob.swift — job state machine
enum JobState: Equatable {
    case recording      // FN is held; microphone is active
    case queued         // recording done; waiting for previous job to finish
    case transcribing   // whisper process is running
    case postProcessing(String) // Gemini / Shortcut running; payload = rule name
}

class TranscriptionJob: ObservableObject, Identifiable {
    let id = UUID()
    let targetApp: NSRunningApplication?
    let appIcon: NSImage?

    @Published var state: JobState = .recording

    /// Temp WAV path — whisper deletes this after transcription
    var audioURL: URL?
    /// Persistent copy saved to ~/Library/Application Support/OpenWhisper/Recordings/
    var savedAudioURL: URL?

    /// The AXUIElement of the focused window at recording start.
    /// Used to target the exact browser tab / terminal pane during paste.
    var targetWindow: AXUIElement?
    var targetPID: pid_t = 0

    init(targetApp: NSRunningApplication?) {
        self.targetApp = targetApp
        self.appIcon = targetApp?.icon
    }
}

Each job carries its own targetApp, targetWindow (an AXUIElement), and targetPID. When the transcription is ready to paste, it activates that specific window rather than just the foreground app. This is what makes the “switch away freely while transcribing” feature work correctly.
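The drain behavior described above — queue a finished recording immediately so a new one can start, while Whisper processes jobs one at a time in FIFO order — can be sketched independently of the app’s real types. Job and the transcribe closure here are simplified stand-ins:

```swift
import Foundation

/// Minimal sketch of the FIFO transcription queue — simplified
/// stand-ins, not the app's actual TranscriptionJob machinery.
final class TranscriptionQueue {
    struct Job { let id: Int; let audioPath: String }

    private var pending: [Job] = []
    private var isTranscribing = false
    private let transcribe: (Job, @escaping (String) -> Void) -> Void
    private(set) var results: [Int: String] = [:]

    init(transcribe: @escaping (Job, @escaping (String) -> Void) -> Void) {
        self.transcribe = transcribe
    }

    /// Called the moment FN is released: the job is queued right away,
    /// so a new recording can begin while this one is still processing.
    func enqueue(_ job: Job) {
        pending.append(job)
        processNextIfIdle()
    }

    private func processNextIfIdle() {
        guard !isTranscribing, !pending.isEmpty else { return }
        isTranscribing = true
        let job = pending.removeFirst()
        transcribe(job) { [weak self] text in
            guard let self else { return }
            self.results[job.id] = text
            self.isTranscribing = false
            self.processNextIfIdle()  // drain remaining jobs in FIFO order
        }
    }
}
```

Because completion re-enters processNextIfIdle, jobs queued during a long transcription are picked up automatically with no polling.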


Pasting with Precision: AXUIElement Targeting

📋 I’ve been using this to dictate GitHub issue descriptions while reading code in a different window.

This is the part of the codebase I’m most proud of. The naive paste implementation is NSPasteboard.general.setString(text) followed by ⌘V via CGEvent. That works — until you have multiple browser windows open, or multiple terminal tabs, or you’ve switched windows since starting the recording.

The precision targeting flow:

  1. At recording start, capture AXFocusedUIElement from the system-wide accessibility element
  2. Walk up the AX tree to find the parent AXWindow
  3. Store both the AXUIElement and the app’s pid_t
  4. At paste time, call AXUIElementPerformAction(window, kAXRaiseAction) to bring that specific window forward, then post ⌘V via CGEventPostToPid
TranscriptionManager.swift — AX window capture and paste
static func captureAXTarget() -> (AXUIElement?, pid_t) {
    let systemWide = AXUIElementCreateSystemWide()
    var focusedElement: AnyObject?
    AXUIElementCopyAttributeValue(
        systemWide,
        kAXFocusedUIElementAttribute as CFString,
        &focusedElement
    )
    guard let element = focusedElement as! AXUIElement? else {
        return (nil, 0)
    }

    // Walk up the AX hierarchy to find the containing AXWindow
    func findWindow(from el: AXUIElement) -> AXUIElement? {
        var role: AnyObject?
        AXUIElementCopyAttributeValue(el, kAXRoleAttribute as CFString, &role)
        if (role as? String) == kAXWindowRole { return el }
        var parent: AnyObject?
        AXUIElementCopyAttributeValue(el, kAXParentAttribute as CFString, &parent)
        guard let p = parent as! AXUIElement? else { return nil }
        return findWindow(from: p)
    }

    var pid: pid_t = 0
    AXUIElementGetPid(element, &pid)
    return (findWindow(from: element), pid)
}

private func pasteToTarget(job: TranscriptionJob, text: String) {
    NSPasteboard.general.clearContents()
    NSPasteboard.general.setString(text, forType: .string)

    // Raise the specific window before pasting
    if let window = job.targetWindow {
        AXUIElementPerformAction(window, kAXRaiseAction as CFString)
    }
    job.targetApp?.activate(options: .activateIgnoringOtherApps)

    // Post Cmd+V to the target PID directly
    let src = CGEventSource(stateID: .hidSystemState)
    let keyDown = CGEvent(keyboardEventSource: src, virtualKey: 0x09, keyDown: true)!
    let keyUp   = CGEvent(keyboardEventSource: src, virtualKey: 0x09, keyDown: false)!
    keyDown.flags = .maskCommand
    keyUp.flags   = .maskCommand
    keyDown.postToPid(job.targetPID)
    keyUp.postToPid(job.targetPID)
}
⚠️ Accessibility permission is non-negotiable. Without it, the CGEventTap for FN key monitoring won't work, and AXUIElement calls will return kAXErrorAPIDisabled. macOS will prompt you on first launch — if you accidentally decline, go to System Settings → Privacy & Security → Accessibility.

Post-Processing Rules

🧠 I use a Gemini rule in Slack to automatically clean up my dictated messages — it removes filler words and fixes grammar before pasting.

This is the feature that transforms OpenWhisper from a transcription tool into a writing assistant. For every app, you can set a rule that transforms the raw transcription before it gets pasted.

The three rule types:

  • Pass-through — raw Whisper output, no changes
  • macOS Shortcut — run any Shortcut you’ve built (grammar fix, translation, code formatting, anything)
  • Gemini AI — a custom prompt template; {text} is replaced with the transcription

[Interactive rule builder: choose a rule type and compare the raw Whisper output with the post-rule result.]

The rules are stored as a simple array in UserDefaults (encoded as JSON). The matching logic checks if the frontmost app’s bundle identifier or display name matches the rule’s app name — first match wins.
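A minimal sketch of that storage and matching scheme. The Rule type, its fields, and the UserDefaults key are hypothetical names chosen for illustration, and the Slack bundle identifier is just an example:

```swift
import Foundation

/// Hypothetical stand-in for the app's rule model.
struct Rule: Codable {
    enum Kind: String, Codable { case passThrough, shortcut, gemini }
    let appName: String   // matched against bundle id or display name
    let kind: Kind
    let template: String  // Gemini prompt; {text} is the transcription
}

/// First match wins, mirroring the lookup described above.
func rule(for bundleID: String, displayName: String, in rules: [Rule]) -> Rule? {
    rules.first { $0.appName == bundleID || $0.appName == displayName }
}

/// Expands the {text} placeholder in a prompt template.
func expand(_ template: String, with transcription: String) -> String {
    template.replacingOccurrences(of: "{text}", with: transcription)
}

// Rules round-trip through UserDefaults as JSON:
let data = try! JSONEncoder().encode([
    Rule(appName: "com.tinyspeck.slackmacgap", kind: .gemini,
         template: "Fix grammar and remove filler words: {text}")
])
UserDefaults.standard.set(data, forKey: "postProcessingRules")
```

Keeping the rules as one JSON blob makes them trivial to export, diff, and sync via dotfiles.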


Privacy: How It Actually Works

I want to be precise here because “privacy-first” is overused. Here’s the concrete truth about what data moves where:

[Diagram: 🔐 data flow — where does your audio and text go?]

Four specific guarantees:

  1. Audio is never uploaded. The whisper binary runs as a local subprocess. Your WAV file lives on disk for the duration of transcription, then is deleted. If you enable audio history, a copy is saved to ~/Library/Application Support/OpenWhisper/Recordings/.

  2. Clipboard is restored. If you have “Copy to clipboard” disabled, OpenWhisper saves your existing clipboard contents before pasting, restores them afterwards, and clears the new content.

  3. Gemini is text-only and opt-in. Even if you configure Gemini post-processing, only the text transcription is sent — never the audio. It’s also per-app and requires you to provide your own API key.

  4. No telemetry, no analytics, no network traffic beyond what you explicitly configure.
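Guarantee 2 can be sketched with a pasteboard abstraction. The Pasteboard protocol here is hypothetical — in the app this role is played by NSPasteboard directly — but it shows the save/paste/restore ordering:

```swift
import Foundation

/// Hypothetical abstraction over the system pasteboard;
/// the real app would talk to NSPasteboard.
protocol Pasteboard: AnyObject {
    var string: String? { get set }
}

final class InMemoryPasteboard: Pasteboard {
    var string: String?
}

/// Snapshots the user's clipboard, stages the transcription,
/// pastes, then restores the original contents — so dictation
/// leaves no trace on the clipboard.
func pasteAndRestore(
    _ text: String,
    on board: Pasteboard,
    performPaste: () -> Void
) {
    let saved = board.string  // snapshot existing contents
    board.string = text       // stage the transcription
    performPaste()            // e.g. post ⌘V to the target PID
    board.string = saved      // restore the user's clipboard
}
```

One real-world caveat: the paste keystroke is asynchronous, so the production version needs a short delay (or a pasteboard changeCount check) before restoring.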


Getting Started

Everything below is a one-time setup. Once it’s done, the app requires zero maintenance — Whisper models are cached locally, and there’s nothing to subscribe to or keep updated.

🚀 Setup Checklist

[Interactive checklist — 7 steps]

Apple Silicon tip: Whisper runs natively on the Neural Engine on M-series chips. The small and even medium models are fast enough for real-time use. Intel Macs should stick to tiny, base, or turbo for comfortable latency.

What’s Next

The app is in active development. The CHANGELOG already tracks two significant releases, and there are a few things I want to tackle next:

  • Swift-native Whisper via whisper.cpp or the Metal-accelerated bindings — eliminating the Python CLI dependency entirely would dramatically improve cold-start time and enable sandboxing
  • On-device post-processing using Apple Intelligence APIs when they mature
  • Better hands-free mode — the current double-tap detection has edge cases on fast FN keypresses
  • Waveform in the Dynamic Island overlay — there’s already an animated waveform during recording in the mini overlay; I want to make it show actual audio levels in real time

If any of this sounds interesting, the repo is at github.com/malukenho/open-whisper. PRs welcome — especially for the Swift-native Whisper integration, which I know a few people have been working on independently.