The Numbers Do Not Lie

We compare LEMoE against traditional MoE architectures to demonstrate the true efficiency of decentralizing experts.

RAM Consumption

Traditional MoE architectures such as Mixtral 8x7B (the open-weights benchmark) force all experts to be loaded into memory simultaneously, even though only two are used at a time. LEMoE breaks this limit by keeping only the router and the currently active expert in memory.

Mixtral 8x7B (4-bit quant) ~32.0 GB
LEMoE (Router + 1 Active Expert) 2.0 GB

*Based on a 1.5 GB E5 router plus a 500 MB active reference expert model (grape-malbec). The other experts can be delegated to external APIs with no impact on local RAM.
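
The arithmetic behind the comparison can be checked with a short sketch. It uses only the figures quoted above; the function name and the decimal conversion of 500 MB to 0.5 GB are assumptions made for illustration, not part of LEMoE itself.

```python
# Figures quoted in the comparison above, in gigabytes.
MIXTRAL_8X7B_4BIT_GB = 32.0   # all experts resident at once
ROUTER_GB = 1.5               # E5 router
ACTIVE_EXPERT_GB = 0.5        # grape-malbec reference expert (500 MB, decimal GB assumed)


def lemoe_footprint_gb(router_gb, expert_gb, active_experts=1):
    """Local RAM needed when only the router and the active experts are resident.

    Hypothetical helper: experts delegated to external APIs are assumed
    to contribute nothing to local RAM.
    """
    return router_gb + expert_gb * active_experts


lemoe_gb = lemoe_footprint_gb(ROUTER_GB, ACTIVE_EXPERT_GB)
reduction = MIXTRAL_8X7B_4BIT_GB / lemoe_gb

print(f"LEMoE footprint: {lemoe_gb:.1f} GB")
print(f"Reduction vs. Mixtral 8x7B: {reduction:.0f}x")
```

With the quoted figures this gives 2.0 GB and a 16x reduction. Raising `active_experts` shows how the footprint would grow if more than one local expert were kept resident.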

Latency Overhead

How much time does passing through the router cost before a request reaches the final model? Next to nothing. The E5 vectorization and the cosine distance calculation are both tightly optimized.

Text Cleaning 12 ms
Vectorization (E5) 180 ms
Softmax & Routing 8 ms
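
The "Softmax & Routing" step can be illustrated with a minimal sketch in pure Python. It is not the LEMoE implementation: the toy 3-dimensional vectors stand in for real E5 embeddings, the expert name `external-api-expert` is hypothetical, and the sketch scores experts by cosine similarity (higher means closer), which ranks them the same way as cosine distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (max-shifted for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(query_vec, expert_vecs):
    """Return the best-matching expert name and the weight of every expert."""
    names = list(expert_vecs)
    sims = [cosine_similarity(query_vec, expert_vecs[n]) for n in names]
    weights = softmax(sims)
    best = max(range(len(names)), key=weights.__getitem__)
    return names[best], dict(zip(names, weights))


# Toy reference vectors (assumed values, not real E5 output).
EXPERTS = {
    "grape-malbec": [0.9, 0.1, 0.0],
    "external-api-expert": [0.0, 0.2, 0.9],  # hypothetical delegated expert
}

query = [0.8, 0.2, 0.1]  # stand-in for the vectorized user request
chosen, weights = route(query, EXPERTS)
print(chosen, weights)
```

Only the chosen expert needs to be loaded or called, which is why this step adds so little on top of vectorization.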

In total, about 200 ms: a fraction of a second that saves gigabytes of memory and money on API calls.