June 2, 2026

Sing for me!

How I made my Mac sing: porting Khala Music AI to Apple Silicon.

Music AI Khala Porting

My absolutely favorite musical

TL;DR: I ported Khala Music AI to the Mac. You can get it here, along with some examples.

Model Card


If you read my last entry - two weeks of maddening descent into the internals of Pixal3D - you might think I'd had enough of pulling things over from the Nvidia side of the AI world, at least for a while.

Well I did not.

So my Mac was now able to generate some cool 3D models. That's great and all, but what about music?

Wellll... there is ACE-Step-1.5. But despite pretty good press, I found it severely lacking in quality. Oh, and it's ready to be used straight on the Mac - where's the fun in that?!

So there I was, watching YouTube sometime back in May, and I think I heard about Khala on the AISearch channel. It looked (sounded, hehe) great! Like something Suno or Udio were doing just a while ago.

I had to get it....aaaaaannd it's Nvidia only. Go figure.

At the time I was already deep into Pixal3D, so I left it alone. But now that I wasn't anymore, I decided to pick it up as a FUN WEEKEND PROJECT. And you know what? It was a fun weekend project.

Spoiler alert: it did not take me weeks to get things working.

Transformers?

What does THIS guy have anything to do with this?

Okay, so this time it was not the custom kernels that were going to annoy me. At first it seemed there wasn't much to port at all. It was really just two GPT-style language models and an audio codec:

flowchart LR
	A[text + lyrics] --> B["backbone<br/>(Megatron GPT)"]
	B --> C[coarse q0/q1 audio tokens]
	C --> D["super-res<br/>(Megatron GPT)"]
	D --> E[fine q0…q63 tokens]
	E --> F[DAC RVQ decoder]
	F --> G[waveform]

Ah, but you see, there it is - MEGATRON. What is Megatron, you might ask? Well, I didn't know, so let me tell you, straight from their readme: "GPU-optimized library for training transformer models at scale". The problem this time was that the CUDA lock-in wasn't kernels. It was the entire NVIDIA training stack - Megatron-Core, TransformerEngine, apex, flash-attn, cuda-graphs - wrapped around otherwise-ordinary transformers. So the port wasn't "rewrite the math for Metal." It was "de-Megatron-ify into vanilla PyTorch and let MPS run it."

The plan, refined by what I learned last time, was:

- Read the Megatron checkpoint into plain PyTorch tensors.
- Rebuild the two transformers as boring nn.Modules.
- Swap cuda for mps.
- Generate a banger and brag on the internet.

Seems simple enough eh?

Getting the weights of the mothership

Since I cannot run Megatron on a Mac, I had to crawl back to a rental Nvidia box, run Megatron over there, and get the weights converted and ready to be used on my Mac.

Literally me

The good news: although the torch_dist shards are pretty chonky (22.6 and 26.3 GB), about 85% is optimizer state nobody needs for inference. The real bf16 weights are a tidy ~3 GB each. And you don't need the NGC stack just to read them - I wrote a small loader that installs a fake megatron module tree into sys.modules, so when the pickled metadata reaches for Megatron classes it gets harmless stubs instead of an import error. Torch-only, filter to model.*, write safetensors, done. And since I was already there, I also took a golden capture of the model's run on the reference stack, so I'd have something to compare against.
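For the curious, the stub trick can be sketched in stdlib-only Python. This is an illustration of the idea, not the port's actual loader: the names `StubFinder`, `_Stub`, `_StubModule` and `install_stubs` are made up here, and a real checkpoint may need smarter stubs than "a class that accepts anything".

```python
import importlib.abc
import importlib.machinery
import sys
import types


class _Stub:
    """Base for throwaway classes: accepts any constructor arguments."""

    def __init__(self, *args, **kwargs):
        pass


class _StubModule(types.ModuleType):
    """Module whose missing attributes resolve to throwaway classes."""

    def __getattr__(self, name):
        # Let importlib probe dunder attributes (__path__, __spec__, ...) normally.
        if name.startswith("__"):
            raise AttributeError(name)
        stub = type(name, (_Stub,), {"__module__": self.__name__})
        setattr(self, name, stub)  # cache so repeated lookups return the same class
        return stub


class StubFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Answers every import under `root` with an empty stub package."""

    def __init__(self, root):
        self.root = root

    def find_spec(self, fullname, path=None, target=None):
        if fullname == self.root or fullname.startswith(self.root + "."):
            return importlib.machinery.ModuleSpec(fullname, self, is_package=True)
        return None  # not ours: let the normal import machinery handle it

    def create_module(self, spec):
        return _StubModule(spec.name)

    def exec_module(self, module):
        pass  # nothing to execute, the module stays empty


def install_stubs(root):
    """Put the finder first so it wins even if a broken real package exists."""
    sys.meta_path.insert(0, StubFinder(root))
```

With `install_stubs("megatron")` in place, an unpickler that asks for some class under `megatron.*` gets a harmless placeholder back instead of an `ImportError`, which is all the checkpoint metadata needs in order to load.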

ℹ️ How the weights got converted (the short version)

Megatron saves checkpoints as distributed-checkpoint (DCP) shards that normally need the whole Megatron stack to read. We skipped that:
1. Read torch-only. Trained at TP=1, so every tensor is whole — DCP just chunks it physically. A fake-Megatron import hook (_MegatronStubFinder) feeds the unpickler harmless stub classes, so dcp.load works with no Megatron, no GPU.
2. Diet. Drop optim / extrastate / rng keys → ~26 GB shards become ~3 GB of bf16 weights, plus per-tensor fingerprints for later.
3. Un-fuse. Megatron packs QKV (GQA-interleaved) and SwiGLU (gate+up) into single linears; we de-interleave them back into q/k/v_proj and gate/up/down_proj, and lift the fused norms out.
4. Trust, but verify. Keys must match the target model exactly, param counts must conserve, fingerprints must match within 1e-3, and the QKV split must round-trip bit-exact — then a stage-by-stage bisect against CUDA goldens, ending in the greedy 64/64 check.
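The un-fuse step and its bit-exact round-trip check look roughly like the sketch below. It assumes the fused QKV weight stores, for each KV group, that group's query heads followed by one key head and one value head; that layout, the function names and the toy sizes are assumptions for illustration, so check them against the real checkpoint before trusting them.

```python
import torch


def split_fused_qkv(w, num_heads, num_groups, head_dim):
    """Split a fused [G * (q_per_group + 2) * D, hidden] weight into q, k, v.

    Assumed layout per group: [q heads of this group..., k head, v head].
    """
    q_per_group = num_heads // num_groups
    hidden = w.shape[1]
    grouped = w.view(num_groups, q_per_group + 2, head_dim, hidden)
    q = grouped[:, :q_per_group].reshape(num_heads * head_dim, hidden)
    k = grouped[:, q_per_group].reshape(num_groups * head_dim, hidden)
    v = grouped[:, q_per_group + 1].reshape(num_groups * head_dim, hidden)
    return q, k, v


def fuse_qkv(q, k, v, num_heads, num_groups, head_dim):
    """Inverse of split_fused_qkv, used only for the round-trip check."""
    q_per_group = num_heads // num_groups
    hidden = q.shape[1]
    qg = q.view(num_groups, q_per_group, head_dim, hidden)
    kg = k.view(num_groups, 1, head_dim, hidden)
    vg = v.view(num_groups, 1, head_dim, hidden)
    return torch.cat([qg, kg, vg], dim=1).reshape(-1, hidden)


def roundtrips_bit_exact(w, num_heads, num_groups, head_dim):
    """True if split followed by fuse reproduces the fused weight exactly."""
    q, k, v = split_fused_qkv(w, num_heads, num_groups, head_dim)
    return torch.equal(fuse_qkv(q, k, v, num_heads, num_groups, head_dim), w)
```

The split is pure indexing with no arithmetic, so `torch.equal` (not `allclose`) is the right bar: any mismatch means the layout assumption is wrong, not that rounding crept in.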

I rsync'd all of this over to my Mac, and we were off to the races.

The bug that was supposed to be there

Weights in hand, I wired up the vanilla model, ran greedy decode against a golden capture from the real CUDA model, and got: 1 out of 64 tokens correct. Great success.

So - the one habit that actually pays off - I stopped squinting and bisected. Find the first stage that diverges; everything before it is correct, everything after is consequence. Embedding: perfect. Attention: a false alarm I'll spare you (I briefly blamed my RoPE; the RoPE was fine, my measurement wasn't). The real culprit was one layer down, in the feed-forward network (MLP), and it was genuinely strange.
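The "find the first stage that diverges" habit fits in a few lines. The sketch below assumes each stage's activations were dumped as flat lists of floats on both sides, and that once a stage diverges every later stage does too (which is what makes a binary search legal). The stage names and the 0.9999 threshold are placeholders, not the values used in the port.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def first_divergence(stages, golden, candidate, threshold=0.9999):
    """Binary-search the ordered stage list for the first mismatching stage.

    Returns the stage name, or None if every stage matches the golden capture.
    """
    lo, hi = 0, len(stages)  # everything before lo matches; hi is a known-bad bound
    while lo < hi:
        mid = (lo + hi) // 2
        name = stages[mid]
        if cosine(golden[name], candidate[name]) >= threshold:
            lo = mid + 1  # still correct here, the bug is further down
        else:
            hi = mid  # already wrong here, the bug is here or earlier
    return stages[lo] if lo < len(stages) else None
```

With only a handful of stages a plain linear scan works just as well; the point is to stop reading logits and start comparing intermediate tensors in pipeline order.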

The fused first linear returns output that already includes its bias - and then the fused SwiGLU activation adds the same bias again before the gate. The model, as trained, computes:

$$\operatorname{silu}(W_w x + 2 b_w)\,(W_v x + 2 b_v)$$

The bias, twice. On purpose. Baked into the weights. I proved it cold: the double-bias reconstruction matched the real activations at cosine 0.999998, while the mathematically "correct" single-bias version sat at 0.9938.
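In plain PyTorch the quirk is a one-character change. The sketch below is not the port's actual module: it assumes the fused first linear stores the gate half (W_w, b_w) before the linear half (W_v, b_v), and `swiglu_act` and `cosine` are illustrative names. `bias_count=2` reproduces the formula above, `bias_count=1` is the textbook version.

```python
import torch
import torch.nn.functional as F


def swiglu_act(x, w_fc1, b_fc1, bias_count=2):
    """SwiGLU activation with the fc1 bias applied `bias_count` times.

    w_fc1: [2 * ffn, hidden], b_fc1: [2 * ffn]; gate half first (assumed order).
    """
    w_gate, w_up = w_fc1.chunk(2, dim=0)
    b_gate, b_up = b_fc1.chunk(2, dim=0)
    gate = x @ w_gate.T + bias_count * b_gate  # silu(W_w x + n * b_w)
    up = x @ w_up.T + bias_count * b_up        # (W_v x + n * b_v)
    return F.silu(gate) * up


def cosine(a, b):
    """Cosine similarity of two tensors, flattened to vectors."""
    return F.cosine_similarity(a.flatten(), b.flatten(), dim=0).item()
```

Comparing `swiglu_act(..., bias_count=1)` and `swiglu_act(..., bias_count=2)` against a captured reference activation with `cosine` is the kind of test that separates the two hypotheses.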

greedy decode:  1/64  →  64/64   (bit-exact vs CUDA)

Here's the cool part: this isn't a Mac bug or even my bug. It's a quirk in the original trained model. To faithfully reproduce CUDA output on a Mac, I had to faithfully reproduce that wart.

It runs - and it makes music

Backbone parity is necessary, not sufficient. I still had to port the super-res path and write a KV-cache sampler to replace Megatron's inference engine, all behind a backend switch so the CUDA path stayed byte-for-byte untouched.

Then I made the classic mistake of celebrating a smoke test. It fed the pipeline synthetic tokens, the whole chain ran end to end, finite audio came out - and for one shining moment I thought "it works!" It did not work. It was wired. A pipeline that emits finite floats without crashing has proven exactly one thing: that it doesn't crash.

The real test is a real prompt. Pop/Instrumental, seed 42: the backbone emitted ~900 tokens and stopped at a natural EOS, super-res lifted them to fine tokens, and the decoder produced 20.8 seconds of coherent stereo music - a percussive intro opening into a piano melody.

It made music - on a Mac - from a text prompt. Through two transformers and a codec I'd rebuilt. Holy smokes.

(The first real run did crash once, in super-res, because the sampler was appending the terminating EOS token into the audio stream - and EOS is a control token, not audio, so the odd count broke the q0/q1 pairing. The fix was to break the loop before appending. The bit-exact gate never caught it because it runs with no EOS to leak. Hold that thought.)
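The shape of that fix, as a toy sketch. The `next_token` callback, the token ids and the assumption that q0/q1 tokens alternate in one flat stream are all illustrative, not lifted from the real sampler.

```python
def sample_audio_tokens(next_token, eos_id, max_tokens):
    """Collect audio tokens until EOS, without letting EOS into the stream."""
    tokens = []
    while len(tokens) < max_tokens:
        tok = next_token(tokens)
        if tok == eos_id:
            break  # control token: stop BEFORE appending it
        tokens.append(tok)
    return tokens


def pair_q0_q1(tokens):
    """Group a flat stream into (q0, q1) pairs; an odd count is a bug upstream."""
    if len(tokens) % 2:
        raise ValueError(f"odd token count {len(tokens)}: q0/q1 pairing is broken")
    return list(zip(tokens[0::2], tokens[1::2]))
```

Appending first and checking for EOS afterwards is exactly the off-by-one that leaves a stray control token at the end of the stream.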

The Mac ran out of memory, and then I ran out of theories

Generation worked. Then someone tried a longer track and the wheels came off: a 2-minute clip wanted ~90 GB of RAM plus 54 GB of swap and 15 minutes, fans screaming.

The single most important thing to know about porting a transformer to Apple Silicon: MPS has no FlashAttention - everyone's favorite attention lib over in CUDA land. PyTorch's attention falls back to a math path that materializes the full [B, H, S, S] score matrix - O(S²) memory. CUDA's flash-attn needs only O(S) memory, which is exactly why upstream's "48 GB is enough" does not survive the trip to a Mac.
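A back-of-the-envelope version of that O(S²) claim. The head count (16) and the 4-byte element size used in the example are made-up illustration values, not Khala's real config, and a real peak also includes softmax temporaries, weights and activations, so treat the output as a lower bound on one buffer rather than a prediction.

```python
def score_matrix_bytes(batch, heads, seq_len, bytes_per_elem, q_block=None):
    """Size of one attention score buffer.

    Full attention holds [B, H, S, S]; with query chunking only
    [B, H, q_block, S] is alive at any time.
    """
    rows = seq_len if q_block is None else min(q_block, seq_len)
    return batch * heads * rows * seq_len * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical config: 1 sequence, 16 heads, S = 8192, fp32 scores.
full = score_matrix_bytes(1, 16, 8192, 4)
chunked = score_matrix_bytes(1, 16, 8192, 4, q_block=512)
print(f"full: {full / GIB:.2f} GiB, chunked: {chunked / GIB:.2f} GiB")
```

Doubling S quadruples the full buffer but only doubles the chunked one, which is the whole argument in one line of arithmetic.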

A couple of fixes helped a lot - sizing super-res to the actual sequence instead of the 8192 training floor (provably identical output, and it cut peak memory by ~8×), and calling empty_cache between passes so the allocator stops hoarding the high-water mark. Good. But then I declared victory on a deeper fix based on a small 3-pass probe that peaked at 30 GB. The real 62-pass pipeline hit 73.6 GB. My convenient little probe had been too small to provoke the bug - just like the smoke test, just like the greedy gate.

I'd been operating on a tidy story: the attention mask forces MPS onto the slow path, so the memory blows up. Plausible. The kind of thing that's true on some backends. So I finally sat down to measure it - honestly, in a fresh process per config, because a warm MPS allocator is a liar (same measurement read 5.4 GB cold and 1.1 GB later in the same process, purely from reuse).
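A fresh process per config is easy to script. In the sketch below, `measure_cold` is a generic stdlib harness, while `MPS_PROBE` is a hypothetical probe: the head count, head dim and dtype are invented, and it reports `torch.mps.driver_allocated_memory()` after the forward pass, which is only a stand-in for a true peak.

```python
import json
import subprocess
import sys

# Hypothetical probe, run in its own interpreter so no warm allocator is reused.
MPS_PROBE = """
import json, sys, torch
import torch.nn.functional as F
seq, use_mask = int(sys.argv[1]), sys.argv[2] == "mask"
q = torch.randn(1, 16, seq, 64, device="mps", dtype=torch.float16)
mask = None
if use_mask:
    idx = torch.arange(seq, device="mps")
    mask = idx[None, :] <= idx[:, None]  # causal boolean mask
F.scaled_dot_product_attention(q, q, q, attn_mask=mask)
torch.mps.synchronize()
print(json.dumps({"driver_bytes": torch.mps.driver_allocated_memory()}))
"""


def measure_cold(probe_source, *args):
    """Run `probe_source` in a brand-new Python process and parse its last line as JSON."""
    result = subprocess.run(
        [sys.executable, "-c", probe_source, *map(str, args)],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout.strip().splitlines()[-1])
```

On an Apple Silicon machine with PyTorch installed, `measure_cold(MPS_PROBE, 8192, "mask")` and `measure_cold(MPS_PROBE, 8192, "nomask")` would give one cold number per config; the harness itself does not care what the probe measures.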

The numbers, cold:

| Config @ S=8192 | Peak driver memory |
|---|---|
| masked SDPA | 22.57 GB |
| maskless SDPA | 22.56 GB |

Identical. The mask wasn't the trigger. Maskless attention is also O(S²). My "just drop the mask" theory would have saved exactly zero bytes.

The actual fix is to never hand attention the full S×S in the first place. I wrote query-chunked attention - an explicit matmul → softmax → matmul over blocks of queries, peaking at [B, H, block, S] and freeing each tile as it goes - the same mechanism that I used in Pixal3D. Wired into the super-res branch only, so the backbone stays byte-identical.
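A minimal sketch of query-chunked attention in pure PyTorch, under stated assumptions: q, k and v are already shaped [B, H, S, D] with the same number of heads, the attention is causal (it is a GPT, after all, but the flag is there in case that guess is wrong), and the block size of 512 is arbitrary. It illustrates the mechanism and is not the code that ships in the port.

```python
import math

import torch


def chunked_attention(q, k, v, block=512, causal=True):
    """Attention computed one block of queries at a time.

    Peak score buffer is [B, H, block, S] instead of [B, H, S, S].
    """
    seq_len, head_dim = q.shape[2], q.shape[3]
    scale = 1.0 / math.sqrt(head_dim)
    out = torch.empty_like(q)
    k_t = k.transpose(-1, -2)  # [B, H, D, S]
    key_pos = torch.arange(k.shape[2], device=q.device)
    for start in range(0, seq_len, block):
        end = min(start + block, seq_len)
        scores = (q[:, :, start:end] @ k_t) * scale  # [B, H, block, S]
        if causal:
            q_pos = torch.arange(start, end, device=q.device)
            # A query may not look at keys that come after it.
            future = key_pos[None, :] > q_pos[:, None]
            scores = scores.masked_fill(future, float("-inf"))
        out[:, :, start:end] = torch.softmax(scores, dim=-1) @ v
        del scores  # drop the tile before building the next one
    return out
```

Each query row's softmax only depends on that row, so chunking over queries changes nothing numerically beyond ordinary floating-point noise; it trades one big matmul for several smaller ones.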

| Metric | before | after |
|---|---|---|
| super-res forward @ S=8192 | 22.57 GB | 1.09 GB |
| total memory usage | ~90 GB + 54 GB swap | ~39.7 GB |
| generation time | ~15 minutes | ~3.2 minutes |

There we have it - generating songs of 'normal' length now takes 'normal' amounts of memory. Phew.

The old enemy from the last project - the MPS fused-attention kernel that silently returns garbage past ~18k tokens - simply doesn't live here: super-res caps at 8192, well below the cliff, and my explicit matmul path sidesteps the fused kernel anyway.

I also evaluated adopting metal-flash-attention-improved and rejected it: it's a Swift/Metal, JIT-compiled, single-headed, training-oriented throughput tool, and it would need an FFI/MTLBuffer bridge into a Python inference path to solve a throughput problem I don't have. My problem was memory, and pure-PyTorch chunking won on effort, risk, and correctness.

The Epilogue

Final tally: two real bugs (the double-bias the model wanted, the EOS leak the sampler didn't), a stack of MPS memory work, one ground-up chunked-attention rewrite aaand boom chakalaka we're done. The Mac now reproduces the CUDA backbone bit-for-bit, makes coherent stereo music end to end, and does a 2-minute track in ~3.2 minutes at 39.7 GB, no swap - with zero custom Metal kernels written. I cleaned it up a little bit, updated the README and pushed the converted models to Hugging Face. You can find all the files needed at the top of the post.

For good measure, I also threw in a port of the frontend the team had made, so it works with the Mac backend. That said, it has a nasty habit of repeatedly polling for status, which eats a sizable chunk of GPU time and slows down song generation overall...

Next stop: AniGen!
