Run Xiaomi MiMo-V2-Flash Locally: Step-by-Step Installation & Setup Guide
Xiaomi’s MiMo-V2-Flash is an open-source, high-efficiency Mixture-of-Experts (MoE) language model that delivers powerful AI inference on local hardware, giving developers full control over data, latency, and tuning without per-request API costs.
Below is a step-by-step guide to install and run MiMo-V2-Flash on your own machine, with multiple methods to suit different environments and tools.
What Is MiMo-V2-Flash?
MiMo-V2-Flash is an open-source AI model developed by Xiaomi. It uses a Mixture-of-Experts design with 309 billion total parameters, of which only ~15 billion are activated per token during inference, making it efficient for large-scale tasks.
Key strengths include:
- High-speed inference optimized for intelligent agents.
- Open-source weights & code under MIT-like licensing.
- Large context window and strong capabilities for reasoning and code generation.
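To see why the sparse activation matters, here is a rough back-of-the-envelope comparison of per-token compute for the full parameter count versus the ~15 billion active parameters quoted above. This is only a sketch using the common ~2 FLOPs-per-parameter-per-token approximation, not an official benchmark:

```python
# Rough compute comparison based on the parameter counts quoted above.
total_params = 309e9    # total MoE parameters (all experts)
active_params = 15e9    # parameters actually activated per token

# Common approximation: ~2 FLOPs per parameter per generated token.
flops_per_token_dense = 2 * total_params
flops_per_token_moe = 2 * active_params

print(f"Active fraction: {active_params / total_params:.1%}")
print(f"Per-token compute vs. an equally sized dense model: "
      f"{flops_per_token_moe / flops_per_token_dense:.1%}")
```

In other words, each generated token costs roughly as much compute as a ~15 B dense model, while the full 309 B parameters provide the model's capacity.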
Prerequisites
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU | 12 GB VRAM (e.g., RTX 3080) | 24 GB+ (e.g., RTX 4090 / A6000) |
| System RAM | 32 GB | 64 GB+ |
| Storage | 100 GB free | 200 GB+ NVMe SSD |
| CPU | Modern multi-core | High-clock multi-core |
👉 Local deployment demands serious GPU and system resources. Consider quantization if memory is limited.
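Before going further, it is worth confirming how much VRAM is actually available. Below is a minimal sketch that parses nvidia-smi output; it assumes the NVIDIA driver is installed, and the queried fields are standard nvidia-smi options:

```python
import subprocess

# Query each GPU's name plus total and free memory (values reported in MiB).
query = ["nvidia-smi", "--query-gpu=name,memory.total,memory.free",
         "--format=csv,noheader,nounits"]
out = subprocess.run(query, capture_output=True, text=True, check=True).stdout

for i, line in enumerate(out.strip().splitlines()):
    name, total, free = [x.strip() for x in line.split(",")]
    print(f"GPU {i}: {name} - {int(total)/1024:.1f} GB total, {int(free)/1024:.1f} GB free")
```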
Software Setup
Before installing the model, make sure:
- Python 3.10+ is installed.
- NVIDIA Drivers + CUDA Toolkit 11.8/12.4 are configured.
- Git is available in your PATH.
Check with:
nvidia-smi
nvcc --version

Method 1 — Using SGLang (Recommended)
SGLang provides optimized MoE support and is the most performance-tuned path for MiMo-V2-Flash.
Step 1 — Prepare Environment
python -m venv mimo-env
source mimo-env/bin/activate  # Windows: mimo-env\Scripts\activate

Install PyTorch (CUDA 12.4 example):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install sglang

Step 2 — Download Model
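Before pulling the (very large) weights, you may want to confirm that the Step 1 environment can actually see your GPU. A minimal check using PyTorch:

```python
import torch

# Confirm the freshly installed PyTorch build can reach CUDA before downloading weights.
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
```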
huggingface-cli login
huggingface-cli download XiaomiMiMo/MiMo-V2-Flash --local-dir ./models/MiMo-V2-Flash

Step 3 — Launch Server
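Before launching the server, you can optionally sanity-check that the download completed. This is only a sketch: the config.json check is generic to Hugging Face model repositories, and the exact shard file names vary:

```python
from pathlib import Path

# Quick sanity check that the model directory looks complete before launching.
model_dir = Path("./models/MiMo-V2-Flash")
has_config = (model_dir / "config.json").exists()
weight_shards = list(model_dir.glob("*.safetensors")) + list(model_dir.glob("*.bin"))

print("Model directory:", model_dir.resolve())
print("config.json present:", has_config)
print("Weight shards found:", len(weight_shards))
if not has_config or not weight_shards:
    print("Download looks incomplete - re-run the Step 2 download command.")
```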
python -m sglang.launch_server \
  --model-path ./models/MiMo-V2-Flash \
  --host 0.0.0.0 \
  --port 30000 \
  --trust-remote-code \
  --dtype float16 \
  --context-length 262144 \
  --mem-fraction-static 0.9

Step 4 — Quick Test
Use Python to send a request:
import requests, json

url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "MiMo-V2-Flash",
    "messages": [{"role": "user", "content": "Write a Python function to compute Fibonacci"}],
    "max_tokens": 200
}
res = requests.post(url, headers={"Content-Type": "application/json"}, data=json.dumps(data))
print(res.json())

Method 2 — Hugging Face Transformers
This approach uses the Transformers library to load MiMo directly.
Step 1 — Dependencies
pip install transformers==4.51.0 accelerate bitsandbytes
pip install torch --index-url https://download.pytorch.org/whl/cu124

Step 2 — Create Script (run_mimo.py)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "XiaomiMiMo/MiMo-V2-Flash"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # spread layers across available GPUs
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit to reduce VRAM
)

prompt = "Explain how neural networks work"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Step 3 — Run
python run_mimo.py

Method 3 — Docker Deployment
If you prefer containerization:
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip git
WORKDIR /app
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
RUN pip3 install sglang transformers accelerate
COPY models/MiMo-V2-Flash /app/models/MiMo-V2-Flash
EXPOSE 30000
CMD ["python3", "-m", "sglang.launch_server", "--model-path", "/app/models/MiMo-V2-Flash", "--host", "0.0.0.0", "--port", "30000", "--trust-remote-code"]

Build & Run
docker build -t mimo-v2-flash .
docker run --gpus all -p 30000:30000 mimo-v2-flash

Performance Tips
- Enable Flash Attention: pip install flash-attn
- Use Quantization: load weights in 8-bit or 4-bit to reduce GPU memory (see the sketch after this list).
- Multi-GPU: set device_map="auto" to split layers across GPUs.
- Monitor: use nvidia-smi to watch memory usage and temperature.
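For the quantization tip, here is a minimal 4-bit loading sketch using bitsandbytes via Transformers. It illustrates the general technique; the NF4 settings shown are common defaults, and actual memory savings depend on the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "XiaomiMiMo/MiMo-V2-Flash"

# 4-bit NF4 quantization to cut weight memory roughly 4x compared with fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # also splits layers across multiple GPUs if present
    trust_remote_code=True,
)
```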
Testing & Validation
Create a simple test script:
import requests, json

def test_prompt(prompt):
    url = "http://localhost:30000/v1/chat/completions"
    data = {"model": "MiMo-V2-Flash", "messages": [{"role": "user", "content": prompt}], "max_tokens": 100}
    res = requests.post(url, headers={"Content-Type": "application/json"}, data=json.dumps(data))
    print(res.json())

test_prompt("What is the capital of France?")
test_prompt("Generate a Hello World in JavaScript")

FAQ
Do I need an internet connection to run locally?
After downloading the model files once, you can run completely offline.
What if GPU memory is insufficient?
Try quantization (8-bit or 4-bit, as shown under Performance Tips), or switch to a smaller local model. A high-VRAM GPU is ideal.
Can this run on CPU only?
Technically yes, but inference will be extremely slow, and a model this large may not fit in system memory at all.
Is there a Windows version?
Yes — all the methods above work on Windows with adjusted paths and environment activation.
Where are the official model files hosted?
Xiaomi hosts MiMo-V2-Flash on Hugging Face and GitHub as part of their open-source offerings.