Training a 67M-Parameter Transformer on an M4 Mac Mini with Apple Silicon MPS

The Engineer

29 Jan 2026

This experiment tests how far consumer hardware can go: a 67-million-parameter transformer, trained from scratch on an M4 Mac Mini, reached 93.94 percent exact-match accuracy on CLI command generation despite tight memory and compute limits.

In a recent experiment, I trained a 67-million-parameter transformer model from scratch on an M4 Mac Mini using Apple Silicon's Metal Performance Shaders (MPS) backend. The goal was to see how far a carefully designed small model could go within the limits of consumer hardware. Despite those limits (24GB of unified memory and no discrete GPU), the model achieved 93.94 percent exact-match accuracy on CLI command generation, a task where even a single missing character counts as a failure.

The Hardware Constraint

The defining constraint was the machine itself: 24GB of unified memory, shared between the CPU and the integrated GPU, and no discrete GPU. Training ran on the MPS backend, which is optimized for Apple's hardware but lacks the parallel processing power of dedicated GPUs. Every design decision had to balance memory pressure against computational efficiency.

Hardware: M4 Mac Mini with 24GB unified memory
Backend: Apple Silicon MPS
No discrete GPU
No CUDA
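
The article does not name the training framework. As a minimal sketch, assuming PyTorch (an assumption, not something the source confirms), selecting the MPS backend with a CPU fallback might look like this. The helper name pick_device is hypothetical.

```python
def pick_device() -> str:
    """Return "mps", "cuda", or "cpu", whichever is available first."""
    try:
        import torch  # assumption: PyTorch is the training framework
    except ImportError:
        # Without PyTorch installed there is nothing to accelerate.
        return "cpu"
    # torch.backends.mps is missing in very old PyTorch builds, hence getattr.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(f"training on: {device}")
```

On an Apple Silicon Mac with a recent PyTorch build this should report mps; on other machines it falls back to cuda or cpu, so the same script runs everywhere.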

The Task: CLI Command Generation

CLI command generation is an unforgiving task. Commands are short, compositional, and highly sensitive to errors: a missing flag or an incomplete pipe can render the entire command invalid. That made exact-match accuracy the only relevant metric, since a partially correct command is still a broken command.

Task: Generate syntactically correct shell commands
Metric: Exact-match accuracy
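
To make the metric concrete, here is a minimal sketch of exact-match scoring. The function name and the sample commands are illustrative, and trimming surrounding whitespace before comparing is an assumption; the article does not say how the evaluation normalized its strings.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference command."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # Assumption: leading/trailing whitespace is ignored; everything else must match.
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Illustrative data only: the second prediction is missing a single flag character.
preds = ["ls -la", "grep -r foo .", "ps aux | grep ssh"]
refs = ["ls -la", "grep -rn foo .", "ps aux | grep ssh"]
score = exact_match_accuracy(preds, refs)
print(f"exact match: {score:.2%}")
```

The second pair differs by one character and scores zero, which is exactly the all-or-nothing behavior the task demands.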

Model Architecture and Training

The model used three modern architectural components: Rotary Position Embedding (RoPE), Root Mean Square Normalization (RMSNorm), and the SwiGLU activation function. These choices were driven by the need for efficiency and performance on limited hardware.
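
For readers unfamiliar with these components, the following is a dependency-free sketch of what each one computes on a single vector. It follows the standard textbook definitions, not the model's actual code. The function names, the eps value, and the RoPE base of 10000 are assumptions, and the SwiGLU function shows only the gating step, without the learned projections around it.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: scale x by the inverse of its root mean square, then by a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_gate(gate, up):
    """SwiGLU gating: SiLU(gate) multiplied elementwise with up.

    In a full feed-forward block, gate and up are two linear projections of
    the input, and the result goes through a third projection (omitted here).
    """
    return [silu(g) * u for g, u in zip(gate, up)]


def rope(x, position, base=10000.0):
    """RoPE: rotate each consecutive pair of x by a position-dependent angle."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = list(x)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
rotated = rope([1.0, 0.0, 1.0, 0.0], position=3)
print(normed, rotated)
```

Two properties are worth noticing: RMSNorm skips the mean subtraction and bias of LayerNorm, and RoPE is a pure rotation, so it changes a vector's direction but not its length.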

Model Size: 66.73 million parameters
Training Data: 204.8 million tokens
Pretraining Time: Roughly 13 hours wall time
Supervised Fine-Tuning Time: Approximately 4 minutes
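
A quick back-of-the-envelope calculation puts these figures in context. The parameter count, token count, and pretraining time come from the list above. The memory estimate is not from the source: it assumes fp32 values and an Adam-style optimizer that keeps two extra values per parameter.

```python
PARAMS = 66.73e6         # from the article
TOKENS = 204.8e6         # from the article
PRETRAIN_HOURS = 13      # "roughly 13 hours", so derived figures are approximate

tokens_per_param = TOKENS / PARAMS
tokens_per_second = TOKENS / (PRETRAIN_HOURS * 3600)

# Assumptions (not stated in the article): fp32 storage and an Adam-style
# optimizer, i.e. weights + gradients + two moment estimates per parameter.
BYTES_PER_VALUE = 4
COPIES_PER_PARAM = 4
training_state_gb = PARAMS * BYTES_PER_VALUE * COPIES_PER_PARAM / 1e9

print(f"tokens per parameter: {tokens_per_param:.2f}")
print(f"average throughput:   {tokens_per_second:,.0f} tokens/s")
print(f"training state:       {training_state_gb:.2f} GB (activations excluded)")
```

Under those assumptions the weights, gradients, and optimizer state take about 1 GB, so most of the memory pressure would come from activations, batch size, and sequence length, not from the parameters themselves.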

Key Results

The final results, given the constraints:

Exact-Match Accuracy: 93.94 percent on a held-out CLI evaluation set
Electricity Usage: Roughly 1 kilowatt-hour, costing under $0.50 at typical US electricity rates
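
The cost claim is simple arithmetic: energy multiplied by the price per kilowatt-hour. The sketch below checks it against a few illustrative residential rates. The rates are assumptions chosen for the example, not figures from the article.

```python
ENERGY_KWH = 1.0   # "roughly 1 kilowatt-hour", from the article
CLAIMED_CEILING_USD = 0.50

# Assumption: illustrative USD-per-kWh rates, not measured or sourced values.
ILLUSTRATIVE_RATES = (0.12, 0.17, 0.30)


def cost_usd(energy_kwh, rate_usd_per_kwh):
    """Electricity cost for a given energy use and price per kWh."""
    return energy_kwh * rate_usd_per_kwh


for rate in ILLUSTRATIVE_RATES:
    print(f"${rate:.2f}/kWh -> ${cost_usd(ENERGY_KWH, rate):.2f}")

# The rate at which the run would cost exactly the claimed ceiling.
break_even_rate = CLAIMED_CEILING_USD / ENERGY_KWH
```

At 1 kWh the claim holds for any rate below $0.50 per kWh.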

Lessons Learned

This project highlighted several key insights:

Efficiency is Crucial: Every inefficiency in the training pipeline was immediately apparent due to the hardware constraints. This forced a focus on data efficiency and optimized model architecture.
Modern Architectures Matter: Components like RoPE, RMSNorm, and SwiGLU significantly improved performance and allowed the model to achieve high accuracy despite its size.
Data Streaming: Downloading all training data upfront was not feasible. Instead, streaming data as needed helped manage memory usage effectively.
Fine-Tuning is Powerful: The supervised fine-tuning step, which took only about 4 minutes, had a significant impact on the model’s performance.
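
The data streaming lesson can be illustrated with a plain Python generator that builds fixed-length training batches from any token iterator, holding only one batch in memory at a time. This is a sketch of the general technique, not the pipeline used in the experiment. The function name, its parameters, and the choice to drop the trailing partial batch are assumptions.

```python
from typing import Iterable, Iterator, List


def stream_batches(
    token_source: Iterable[int], seq_len: int, batch_size: int
) -> Iterator[List[List[int]]]:
    """Yield batches of batch_size sequences, each seq_len tokens long.

    Tokens are consumed lazily, so the full dataset never has to be
    downloaded or held in memory at once.
    """
    batch: List[List[int]] = []
    seq: List[int] = []
    for tok in token_source:
        seq.append(tok)
        if len(seq) == seq_len:
            batch.append(seq)
            seq = []
            if len(batch) == batch_size:
                yield batch
                batch = []
    # Assumption: any trailing partial sequence or batch is dropped.


# Stand-in for a real tokenized stream (for example, shards read over the network).
fake_tokens = iter(range(10))
batches = list(stream_batches(fake_tokens, seq_len=2, batch_size=2))
print(batches)
```

Because token_source can be any iterable, the same loop works whether the tokens come from a local file, a network stream, or a tokenizer running on the fly.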

Conclusion

This experiment demonstrates that with careful design and modern architectural components, it is possible to train effective small language models on consumer hardware. While the accuracy of 93.94 percent is impressive, the real takeaway is the feasibility of training from scratch on limited resources. This opens up new possibilities for researchers and practitioners who may not have access to high-end GPUs or large datasets.