Running Local LLMs on Apple Silicon with MLX and Qwen3

Apple Silicon has become a surprisingly capable platform for running Large Language Models (LLMs) locally. With Apple’s MLX framework, developers can run optimized models directly on their Macs. MLX executes on the GPU (through Metal) and the CPU and takes advantage of the unified memory architecture, so model weights do not have to be copied between CPU and GPU memory. It does not target the Neural Engine.

In this tutorial, we’ll walk through installing MLX, downloading a quantized Qwen3 model, and chatting with it locally from the terminal.


What is MLX?

MLX is Apple’s machine learning framework designed specifically for Apple Silicon devices.

The MLX ecosystem includes:

  • mlx-lm – Run and fine-tune Large Language Models
  • mlx-vlm – Vision Language Models
  • mlx-audio – Speech-to-Text, Text-to-Speech, Speech-to-Speech
  • mlx-swift-examples – Swift integrations for MLX models

The MLX Community organization on Hugging Face (mlx-community) provides pre-converted and quantized model weights that are ready to run on Macs without additional conversion steps.


Prerequisites

Before getting started, ensure you have:

  • macOS running on Apple Silicon (M1, M2, M3, or M4)
  • Homebrew installed
  • Python 3.10+

Verify your processor:

uname -m

Expected output:

arm64
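If the command prints x86_64 on an Apple Silicon Mac, the terminal is probably running under Rosetta. The sketch below wraps that check in two small functions. The function names are made up for this example; sysctl.proc_translated is the macOS key that reports Rosetta translation.

```shell
# Returns success when the given machine string denotes Apple Silicon.
is_arm64() {
  [ "$1" = "arm64" ]
}

# Prints a one-line diagnosis for the current shell.
check_platform() {
  arch="$(uname -m)"
  if is_arm64 "$arch"; then
    echo "ok: native arm64 shell"
  elif [ "$(sysctl -n sysctl.proc_translated 2>/dev/null)" = "1" ]; then
    echo "warning: shell is running under Rosetta (uname -m reported $arch)"
  else
    echo "error: $arch is not Apple Silicon"
  fi
}
```

Calling check_platform in a new terminal window shows whether that window is running natively.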

Step 1: Install MLX-LM

Install MLX-LM using Homebrew:

brew install mlx-lm

Homebrew will install the required MLX runtime and supporting dependencies.

Verify the installation:

mlx_lm --help

Step 2: Generate Text with Qwen3

Let’s run our first prompt using a quantized Qwen3 model from the MLX Community.

mlx_lm.generate \
  --model mlx-community/Qwen3-4B-Instruct-2507-4bit \
  --prompt "hello"

What happens behind the scenes?

  1. MLX downloads the model from Hugging Face.
  2. The model is cached locally.
  3. The prompt is tokenized.
  4. Inference runs directly on Apple Silicon.
  5. The generated response is streamed back to the terminal.

Example output:

Hello! How can I assist you today?

The first run may take several minutes because the model needs to be downloaded.

Subsequent runs are significantly faster since the model is loaded from the local cache.
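For repeated one-off prompts, it can help to wrap the command in a small shell function. This is a minimal sketch: ask, MLX_GENERATE, MLX_MODEL, and MAX_TOKENS are names chosen for this example, while --max-tokens is the mlx_lm.generate option that caps the length of the response.

```shell
# Hypothetical helper around mlx_lm.generate; all three variables can be overridden.
MLX_GENERATE="${MLX_GENERATE:-mlx_lm.generate}"
MLX_MODEL="${MLX_MODEL:-mlx-community/Qwen3-4B-Instruct-2507-4bit}"
MAX_TOKENS="${MAX_TOKENS:-256}"

ask() {
  # "$*" joins all arguments into a single prompt string.
  "$MLX_GENERATE" --model "$MLX_MODEL" --prompt "$*" --max-tokens "$MAX_TOKENS"
}
```

After sourcing this in your shell profile, ask "Summarize what a Kubernetes Deployment does" runs one generation against the cached model.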


Step 3: Start an Interactive Chat Session

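Once the model is downloaded, you can start a conversational session. Pass the same --model value so that the chat uses the Qwen3 weights already in your cache; without it, mlx_lm.chat falls back to its own default model, which may trigger another download.

mlx_lm.chat --model mlx-community/Qwen3-4B-Instruct-2507-4bit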

Example:

>>> Explain Kubernetes in simple terms

Kubernetes is a container orchestration platform...

Unlike individual generation commands, the chat interface maintains conversation history throughout the session.

This makes it useful for:

  • Learning new technologies
  • Code reviews
  • Architecture discussions
  • Documentation generation
  • Debugging assistance

Understanding Model Caching

Downloaded models are stored in the Hugging Face cache (by default under ~/.cache/huggingface/hub) and reused on later runs.

Benefits include:

  • Faster startup times
  • No repeated downloads
  • Offline usage after initial download

This makes MLX ideal for developers who want a local AI assistant without sending code to external services.
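To see where a model ended up on disk, you can rebuild the cache path by hand. The sketch below assumes the standard Hugging Face Hub layout: the HF_HUB_CACHE and HF_HOME environment variables override the default location, and each repository is stored in a folder named models--<org>--<name>. The function names are illustrative.

```shell
# Prints the Hugging Face Hub cache directory, honoring the usual overrides.
hf_cache_dir() {
  hf_home="${HF_HOME:-${XDG_CACHE_HOME:-$HOME/.cache}/huggingface}"
  printf '%s\n' "${HF_HUB_CACHE:-$hf_home/hub}"
}

# Prints the folder a repository such as "org/name" is cached in.
model_cache_path() {
  repo_dir="$(printf '%s' "$1" | sed 's|/|--|g')"
  printf '%s/models--%s\n' "$(hf_cache_dir)" "$repo_dir"
}
```

For example, du -sh "$(model_cache_path mlx-community/Qwen3-4B-Instruct-2507-4bit)" reports how much disk space the model uses. Setting HF_HUB_OFFLINE=1 before a command should make the Hugging Face client use only this cache instead of contacting the network.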


Optional: Convert and Quantize Your Own Models

MLX can convert Hugging Face models into Apple-optimized formats.

Example:

mlx_lm.convert \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  -q

The -q flag quantizes the weights (4-bit by default), which cuts memory use to roughly a quarter of the 16-bit original at a modest cost in output quality.

Benefits:

  • Smaller model size
  • Lower RAM usage
  • Faster inference
  • Better performance on laptops
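The size reduction can be estimated with simple arithmetic: weight storage is roughly the parameter count times bits per weight, divided by 8. The helper below is a back-of-the-envelope sketch, not a measurement. It covers weights only, ignores the KV cache and activations, and real quantized files are somewhat larger because they also store per-group scales.

```shell
# Estimates weight storage in GB (10^9 bytes).
# $1: parameters in billions, $2: bits per weight
weight_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.2f\n", p * b / 8 }'
}
```

Under these assumptions, weight_gb 4 16 gives 8.00 for a 4B model in 16-bit precision, and weight_gb 4 4 gives 2.00 for the same model at 4 bits.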

Upload Quantized Models to Hugging Face

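You can convert and publish in one step by adding --upload-repo. This requires being logged in to Hugging Face with a token that has write access to the target repository.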

mlx_lm.convert \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  -q \
  --upload-repo mlx-community/Qwen3-4B-Instruct-2507-4bit

This workflow is useful for platform teams creating optimized internal model repositories.


Common Use Cases for Developers

Local Coding Assistant

Run AI without exposing source code to external APIs.

mlx_lm.chat --model mlx-community/Qwen3-4B-Instruct-2507-4bit

Ask:

Review this Spring Boot service.
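For a non-interactive review, a file can be piped into the model instead of pasted into the chat. This sketch relies on mlx_lm.generate reading the prompt from standard input when --prompt is set to a single dash. The function names, the review instruction text, and the 512-token limit are assumptions made for this example.

```shell
# Hypothetical helpers; MLX_GENERATE and MLX_MODEL can be overridden.
MLX_GENERATE="${MLX_GENERATE:-mlx_lm.generate}"
MLX_MODEL="${MLX_MODEL:-mlx-community/Qwen3-4B-Instruct-2507-4bit}"

# Prints a review instruction followed by the contents of the file in $1.
build_review_prompt() {
  printf 'Review the following code. List bugs, risky patterns, and suggested fixes.\n\n'
  cat "$1"
}

# Sends the prompt to the model on stdin ("--prompt -").
review_file() {
  build_review_prompt "$1" | "$MLX_GENERATE" --model "$MLX_MODEL" --prompt - --max-tokens 512
}
```

A call such as review_file src/main/java/OrderService.java (a placeholder path) keeps the source on your machine from start to finish. Large files may exceed what a 4B model handles well, so reviewing one class at a time is a reasonable starting point.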

Kubernetes Troubleshooting

Why is my pod stuck in CrashLoopBackOff?

Gradle Build Assistance

Explain api vs implementation dependencies.

Architecture Reviews

Design a scalable notification system using Kafka.

Performance Expectations

Model Size    Recommended Memory
1B–3B         8 GB+
4B–8B         16 GB+
14B           32 GB+
32B+          64 GB+

For most developers, the 4B quantized Qwen3 model provides an excellent balance between quality and speed.
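The table can be turned into a quick lookup against the memory installed in your Mac. In this sketch, recommend_tier encodes the rows above and installed_memory_gb reads physical memory through the macOS hw.memsize sysctl key. Both function names are illustrative, and the thresholds are the article's guidance rather than hard limits.

```shell
# Maps installed memory (whole GB) to the largest tier from the table.
recommend_tier() {
  gb="$1"
  if [ "$gb" -ge 64 ]; then echo "32B+"
  elif [ "$gb" -ge 32 ]; then echo "14B"
  elif [ "$gb" -ge 16 ]; then echo "4B-8B"
  elif [ "$gb" -ge 8 ]; then echo "1B-3B"
  else echo "below the smallest tier in the table"
  fi
}

# Prints physical memory in GB (macOS only).
installed_memory_gb() {
  echo $(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
}
```

On a Mac, recommend_tier "$(installed_memory_gb)" prints the tier for the current machine.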


Troubleshooting

Command Not Found

Verify installation:

which mlx_lm
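The individual tools used in this tutorial are installed as separate commands, so it can be worth checking each one. The following is a small sketch with made-up function names; it relies only on the shell's built-in command -v.

```shell
# Succeeds when the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Reports each MLX-LM command used in this tutorial; returns non-zero if any is missing.
check_mlx_tools() {
  missing=0
  for tool in mlx_lm.generate mlx_lm.chat mlx_lm.convert; do
    if have_cmd "$tool"; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

If check_mlx_tools reports missing commands right after installation, confirm that Homebrew's bin directory (/opt/homebrew/bin on Apple Silicon) is on your PATH.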

Slow First Run

This is expected because model weights are being downloaded.


Out of Memory

Try a smaller model or a more aggressively quantized version.


Final Thoughts

MLX is one of the most exciting developments for Apple Silicon developers. It allows you to run powerful LLMs locally with minimal setup while benefiting from Apple’s optimized hardware stack.

In just a few commands, you can transform your Mac into a private AI workstation capable of code generation, architecture reviews, documentation assistance, and learning support.

Commands recap:

brew install mlx-lm

mlx_lm.generate \
  --model mlx-community/Qwen3-4B-Instruct-2507-4bit \
  --prompt "hello"

mlx_lm.chat --model mlx-community/Qwen3-4B-Instruct-2507-4bit

If you’re a software engineer working with Spring Boot, Kubernetes, Dapr, Gradle, Java 21, or cloud-native platforms, MLX is worth adding to your developer toolkit.
