
TechAiNewb
Journeyman III

Mistral-Nemo-Instruct-2407 in ONNX?!?

After much fiddling with knobs like the luddite that I am, I've finally succeeded in converting Mistral AI's Mistral Nemo Instruct 2407 to F32 ONNX format. Tests are currently underway; however, I was able to use Kaggle to get fairly basic benchmarks of the base model. As soon as the ONNX model finishes uploading to Kaggle, I should be able to benchmark this monstrosity. If it shows well, I'll be releasing F16, Q8, Q4, and yes, a Q2, simply for dev and research purposes.
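
(For the curious: the sketch below is the standard Optimum route for getting a Hugging Face checkpoint into ONNX. It isn't necessarily the exact set of knobs I ended up fiddling with, and the output directory is just a placeholder.)

```python
# Sketch of the usual Hugging Face -> ONNX path via Optimum.
# Not necessarily my exact steps; the output directory is a placeholder.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-Nemo-Instruct-2407"

# export=True converts the PyTorch checkpoint to an ONNX graph (FP32 by default)
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Writes model.onnx plus external weight shards (far too big for a single protobuf)
model.save_pretrained("./mistral-nemo-onnx-f32")
tokenizer.save_pretrained("./mistral-nemo-onnx-f32")
```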

The F32 version is (should be) live on HF right now for fiddling with: https://huggingface.co/techAInewb/Mistral-Nemo-Instruct-2407
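
If you want to poke at it, something like the sketch below should work. It assumes the repo layout matches what Optimum's ONNX Runtime wrapper expects and that tokenizer files are included; otherwise grab the tokenizer from the original mistralai/Mistral-Nemo-Instruct-2407 repo. Fair warning: an F32 12B model needs a LOT of RAM.

```python
# Quick sanity-check generation against the F32 ONNX export (sketch).
# Assumes the repo layout is what Optimum expects; expect heavy RAM/swap use.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

repo = "techAInewb/Mistral-Nemo-Instruct-2407"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = ORTModelForCausalLM.from_pretrained(repo)

inputs = tokenizer("Write a haiku about ONNX.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```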

Assuming success with this, I'll be setting my sights on making an ONNX version of Mamba 2, to see whether I can retain the Mamba 2 architecture's strengths, including its quantization resistance, and use it for my overall project goals.

 

I wanted to thank everyone here and online for their help, encouragement, and assistance while I've worked on this. To my family and friends: I'm not dead (lol). To everyone out there who shares my passion for exploration and has no fear of "failing forward": thank you for putting up with my absolute newb questions. Hopefully this F32 conversion, even if it's somewhat problematic, shows that with grit and desire you can teach yourself how to do anything you want.

 

To those who have helped by donating to this project, a HUGE thank you. Without you, I couldn't have taken the time to get this ball rolling!

 

And… I know y'all are WAY better at all of this than I am, but I'm running on a Dell G15 5535, which means I have a single CUDA GPU, an NPU (you can probably see where I'm going with ONNX; I'm looking at you, Vitis AI…), and a lot of patience. lol
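
The rough idea on the NPU side looks like the sketch below. I haven't actually run this yet; it assumes an ONNX Runtime build that ships the Vitis AI execution provider, and the model path is just a placeholder.

```python
# Sketch: ask ONNX Runtime for the Vitis AI execution provider and fall back to CPU.
# Assumes an ONNX Runtime build with Vitis AI support; the model path is a placeholder.
import onnxruntime as ort

session = ort.InferenceSession(
    "mistral-nemo-onnx-f32/model.onnx",  # placeholder path
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # confirms which providers actually loaded
```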

 

Wait until you see what's next…

Love to everyone out there.
-Ryan

ZavionTrevix
Adept I

I am glad to hear that.

TechAiNewb
Journeyman III

I ended up pulling it because I couldn't combine the shards into a small enough set of files to avoid getting picked on by other folks, LOL.
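
(For anyone wondering what "combining the shards" means here: ONNX stores the big weights as external data files alongside the graph, and consolidating them usually looks something like the sketch below. File names are placeholders, and even consolidated, an F32 model this size stays enormous.)

```python
# Sketch of consolidating ONNX external-data shards into a single weights file.
# File names are placeholders; loading pulls everything into RAM, so expect heavy swap use.
import onnx

model = onnx.load("mistral-nemo-onnx-f32/model.onnx")  # also reads the external data shards
onnx.save_model(
    model,
    "combined/model.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=True,   # one big .onnx_data file instead of many shards
    location="model.onnx_data",
    size_threshold=1024,            # tensors above ~1 KB go to the external file
)
```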

 

I was able to get it to run inference on my CPU with a 250 GB swap file, but it was obviously far too large to run with any real usability on my rig.

 

Quantizing to Q8 is just outside the realm of my ability at the moment. 😕
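
(For anyone further along than me, the usual starting point seems to be ONNX Runtime's dynamic quantization; a minimal sketch, with placeholder file names, is below. I haven't gotten it working on a model this size yet.)

```python
# Minimal sketch of Q8-style (INT8 weight) dynamic quantization with ONNX Runtime.
# File names are placeholders.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="mistral-nemo-onnx-f32/model.onnx",
    model_output="mistral-nemo-onnx-int8/model.onnx",
    weight_type=QuantType.QInt8,      # quantize weights to signed 8-bit
    use_external_data_format=True,    # needed when weights exceed protobuf's 2 GB limit
)
```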

 

That said, right now I'm working with a version of Zamba2 Instruct that I'm trying to bake tool use into using Colab and Kaggle, similar to Qwen3.

 

Forgive the over-exuberance of my post; I had been going for about six days straight trying to get it to work, and was very excited once it compiled and successfully ran inference.

 

The most important takeaway was that trying to run a model that big on CPU is dumb. Lol

... It's an amazingly capable model though.

 
