Run Xiaomi MiMo-V2-Flash Locally: Step-by-Step Installation & Setup Guide
Xiaomi’s MiMo-V2-Flash is an open-source, high-efficiency Mixture-of-Experts (MoE) language model that delivers powerful AI inference on local hardware, giving developers full control over data, latency, and tuning without per-request API costs.
Below is a step-by-step guide to install and run MiMo-V2-Flash on your own machine, with multiple methods to suit different environments and tools.
What Is MiMo-V2-Flash?
MiMo-V2-Flash is an open-source AI model developed by Xiaomi. It uses a Mixture-of-Experts design with 309 billion total parameters, of which only ~15 billion are activated per token during inference, making it efficient for large-scale tasks.
Key strengths include:
- High-speed inference optimized for intelligent agents.
- Open-source weights & code under MIT-like licensing.
- Large context window and strong capabilities for reasoning and code generation.
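To see why the sparse activation matters, here is a rough back-of-the-envelope comparison of per-token compute for the full parameter count versus the ~15 billion active parameters quoted above. This is only a sketch using the common ~2 FLOPs-per-parameter-per-token approximation, not an official benchmark:

```python
# Rough compute comparison based on the parameter counts quoted above.
total_params = 309e9    # total MoE parameters (all experts)
active_params = 15e9    # parameters actually activated per token

# Common approximation: ~2 FLOPs per parameter per generated token.
flops_per_token_dense = 2 * total_params
flops_per_token_moe = 2 * active_params

print(f"Active fraction: {active_params / total_params:.1%}")
print(f"Per-token compute vs. an equally sized dense model: "
      f"{flops_per_token_moe / flops_per_token_dense:.1%}")
```

In other words, each generated token costs roughly as much compute as a ~15 B dense model, while the full 309 B parameters provide the model's capacity.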
Prerequisites
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU | 12 GB VRAM (e.g., RTX 3080) | 24 GB+ (e.g., RTX 4090 / A6000) |
| System RAM | 32 GB | 64 GB+ |
| Storage | 100 GB free | 200 GB+ NVMe SSD |
| CPU | Modern multi-core | High-clock multi-core |
👉 Local deployment demands serious GPU and system resources. Consider quantization if memory is limited.
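Before going further, it is worth confirming how much VRAM is actually available. Below is a minimal sketch that parses nvidia-smi output; it assumes the NVIDIA driver is installed, and the queried fields are standard nvidia-smi options:

```python
import subprocess

# Query each GPU's name plus total and free memory (values reported in MiB).
query = ["nvidia-smi", "--query-gpu=name,memory.total,memory.free",
         "--format=csv,noheader,nounits"]
out = subprocess.run(query, capture_output=True, text=True, check=True).stdout

for i, line in enumerate(out.strip().splitlines()):
    name, total, free = [x.strip() for x in line.split(",")]
    print(f"GPU {i}: {name} - {int(total)/1024:.1f} GB total, {int(free)/1024:.1f} GB free")
```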
Software Setup
Before installing the model, make sure:
- Python 3.10+ is installed.
- NVIDIA Drivers + CUDA Toolkit 11.8/12.4 are configured.
- Git is available in your PATH.
Check with:
nvidia-smi
nvcc --version

Method 1 — Using SGLang (Recommended)
SGLang provides optimized MoE support and is the most performance-tuned path for MiMo-V2-Flash.
Step 1 — Prepare Environment
python -m venv mimo-env
source mimo-env/bin/activate  # Windows: mimo-env\Scripts\activate

Install PyTorch (CUDA 12.4 example):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install sglang

Step 2 — Download Model
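Before pulling the (very large) weights, you may want to confirm that the Step 1 environment can actually see your GPU. A minimal check using PyTorch:

```python
import torch

# Confirm the freshly installed PyTorch build can reach CUDA before downloading weights.
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
```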
huggingface-cli login
huggingface-cli download XiaomiMiMo/MiMo-V2-Flash --local-dir ./models/MiMo-V2-Flash

Step 3 — Launch Server
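Before launching the server, you can optionally sanity-check that the download completed. This is only a sketch: the config.json check is generic to Hugging Face model repositories, and the exact shard file names vary:

```python
from pathlib import Path

# Quick sanity check that the model directory looks complete before launching.
model_dir = Path("./models/MiMo-V2-Flash")
has_config = (model_dir / "config.json").exists()
weight_shards = list(model_dir.glob("*.safetensors")) + list(model_dir.glob("*.bin"))

print("Model directory:", model_dir.resolve())
print("config.json present:", has_config)
print("Weight shards found:", len(weight_shards))
if not has_config or not weight_shards:
    print("Download looks incomplete - re-run the Step 2 download command.")
```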
python -m sglang.launch_server \
  --model-path ./models/MiMo-V2-Flash \
  --host 0.0.0.0 \
  --port 30000 \
  --trust-remote-code \
  --dtype float16 \
  --context-length 262144 \
  --mem-fraction-static 0.9

Step 4 — Quick Test
Use Python to send a request:
import requests, json

url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "MiMo-V2-Flash",
    "messages": [{"role": "user", "content": "Write a Python function to compute Fibonacci"}],
    "max_tokens": 200
}
res = requests.post(url, headers={"Content-Type": "application/json"}, data=json.dumps(data))
print(res.json())

Method 2 — Hugging Face Transformers
This approach uses the Transformers library to load MiMo directly.
Step 1 — Dependencies
pip install transformers==4.51.0 accelerate bitsandbytes
pip install torch --index-url https://download.pytorch.org/whl/cu124

Step 2 — Create Script (run_mimo.py)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "XiaomiMiMo/MiMo-V2-Flash"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # spread layers across available GPUs
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit to reduce VRAM
)

prompt = "Explain how neural networks work"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Step 3 — Run
python run_mimo.py

Method 3 — Docker Deployment
If you prefer containerization:
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip git
WORKDIR /app
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
RUN pip3 install sglang transformers accelerate
COPY models/MiMo-V2-Flash /app/models/MiMo-V2-Flash
EXPOSE 30000
CMD ["python3", "-m", "sglang.launch_server", "--model-path", "/app/models/MiMo-V2-Flash", "--host", "0.0.0.0", "--port", "30000", "--trust-remote-code"]

Build & Run
docker build -t mimo-v2-flash .
docker run --gpus all -p 30000:30000 mimo-v2-flash

Performance Tips
- Enable Flash Attention: pip install flash-attn
- Use Quantization: load weights in 8-bit or 4-bit to reduce GPU memory (see the sketch after this list).
- Multi-GPU: set device_map="auto" to split layers across GPUs.
- Monitor: use nvidia-smi to watch memory usage and temperature.
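For the quantization tip, here is a minimal 4-bit loading sketch using bitsandbytes via Transformers. It illustrates the general technique; the NF4 settings shown are common defaults, and actual memory savings depend on the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "XiaomiMiMo/MiMo-V2-Flash"

# 4-bit NF4 quantization to cut weight memory roughly 4x compared with fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # also splits layers across multiple GPUs if present
    trust_remote_code=True,
)
```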
Testing & Validation
Create a simple test script:
import requests, json

def test_prompt(prompt):
    url = "http://localhost:30000/v1/chat/completions"
    data = {"model": "MiMo-V2-Flash", "messages": [{"role": "user", "content": prompt}], "max_tokens": 100}
    res = requests.post(url, headers={"Content-Type": "application/json"}, data=json.dumps(data))
    print(res.json())

test_prompt("What is the capital of France?")
test_prompt("Generate a Hello World in JavaScript")

FAQ
Do I need an internet connection to run locally?
After downloading the model files once, you can run completely offline.
What if GPU memory is insufficient?
Try quantization (8-bit or 4-bit, as shown under Performance Tips), or switch to a smaller local model. A high-VRAM GPU is ideal.
Can this run on CPU only?
Technically yes, but inference will be extremely slow, and a model this large may not fit in system memory at all.
Is there a Windows version?
Yes — all the methods above work on Windows with adjusted paths and environment activation.
Where are the official model files hosted?
Xiaomi hosts MiMo-V2-Flash on Hugging Face and GitHub as part of their open-source offerings.