parley

Your meetings, transcribed entirely on your Mac.

Captures your mic and the far side of the call (Zoom, Teams, Meet), and turns them into speaker-attributed transcripts that never leave your machine. 100% open-source, open-model, airgapped.

macOS 15+ Apple Silicon Swift cloud: none ~62× end-to-end AGPL-3.0

View on GitHub → Frédéric Masi · LinkedIn

Parley menu bar panel, recording in progress

Why it exists

Cloud meeting AI means uploading the raw audio of every conversation to someone else's servers. For anything confidential, that's a non-starter.

So I built the opposite: transcription that runs entirely on-device, on open models, with nothing phoning home. And the transcripts aren't just notes — they're private context my own AI agents can draw on for total recall, without renting my memory to anyone. parley covers calls and meetings; mailrag covers email. Independent tools; my agents know about both and use what fits.

Why it's built this way

Meetings are sensitive. Most of what's said in them shouldn't be uploaded to anyone's servers, and for confidential work that rules out every cloud notetaker.

On-device and airgap-capable. The one that matters most. Nothing leaves your Mac, ever.
Open-source and free. Read the code, trust the code. Donations welcome. :)
Crash-resilient. It shouldn't fall over in the meeting you most needed it for.
Tamper-proof. Signed transcripts and summaries you can actually trust (on the roadmap).
Agent-queryable. Your own AI agents can search the record over MCP, on your machine (also coming).

What it does

Dual-stream capture

Records your mic and system audio as separate streams, so local and remote voices stay distinguishable. No virtual drivers.

On-device diarization

Automatic who-said-what (pyannote + WeSpeaker + VBx), with a quality score per segment.

Two engines

FluidAudio / Parakeet (fastest, 25 EU languages) or Apple SpeechAnalyzer. Swap in Settings.

Echo / mic-bleed removal

Strips the far-end voice that bleeds into your mic so it isn't mistaken for a phantom speaker.

Crash-safe recording

Survives UI and XPC crashes with auto-relaunch, silent re-attach, and multi-segment stitching.

Open formats

JSON, SRT, and TXT with timestamps, speaker labels, confidence scores, and local/remote tags.

Local LLM summaries

Optional meeting summaries via any OpenAI-compatible or LM Studio endpoint — point it at a local model and even the summary never leaves your Mac.

How it works — the hard parts

Dual-stream capture (the core constraint). macOS exposes no API for a pre-mixed mic + system stream — verified through the macOS 26 SDK. So the app runs two independent capture streams — your mic and the system output — and treats "local" and "remote" as first-class. That's what makes reliable speaker separation possible.
Capturing the calls a screen recorder can't see. ScreenCaptureKit is the default for system audio, but it misses Continuity (iPhone) calls and some VoIP audio. An optional Core Audio process tap grabs the system output directly, so phone and app calls land in the transcript too — switchable in Settings, off by default.
Echo / mic-bleed removal. A triple-confirmed gate removes the far-end voice bleeding into your mic: >50% temporal overlap and >70% word overlap and >0.8 speaker-embedding cosine. Across 7 real recordings it caught 22% more far-end bleed than the heuristic it replaced (158 vs 129 segments), with zero false positives.
Cross-chunk speaker reconciliation. Audio is chunked and transcribed in parallel; per-chunk speaker IDs are merged into one global identity via greedy cosine matching on embeddings.
Crash-safe by design. Sentinel file + LaunchAgent restart + multi-segment stitching mean a crash mid-meeting costs ~300–800 ms, not your recording.
Chunked over streaming — a deliberate accuracy call. Streaming ASR and diarization trade accuracy for low latency. Parley processes complete chunks in the background instead, so transcript and speaker labels come out as good as a full offline run, and only the final chunk waits for you to stop.
On-device ML across the Neural Engine and GPU. Parakeet ASR and Silero VAD run on the Neural Engine; pyannote/WeSpeaker/VBx diarization runs through CoreML across the Neural Engine and GPU.

Speed

Measured on an M5 Pro (release build) with the bundled harness (tools/engine-benchmark), on a 4-minute AMI clip.

Stage	Engine	Real-time
Transcription	FluidAudio (Parakeet, ANE)	~142×
Transcription	WhisperKit (large-v3-turbo), for contrast	~2.5×
Speaker diarization	pyannote + WeSpeaker + VBx	~111×
Full pipeline	transcription + diarization	~62×

Chunks process in the background during the meeting, so the number you actually feel is the last row: at ~62× end-to-end, the final chunk (≤30 min by default, configurable) finishes in well under a minute after you stop, whatever the meeting's length. Apple SpeechAnalyzer isn't benchmarked here — it needs a per-language model and errored on this clip.

Open models, on-device

Component	Model	License
Speech recognition	NVIDIA Parakeet TDT 0.6B (CoreML)	CC-BY-4.0
Speaker diarization	pyannote segmentation + WeSpeaker	CC-BY-4.0
Voice activity	Silero VAD	MIT
Engine SDK	FluidAudio	Apache-2.0

Get started: git clone https://github.com/fmasi/parley.git && cd parley && bash package_app.sh --install