The numbers do not lie
We compare LEMoE against traditional MoE architectures to demonstrate the true efficiency of decentralizing experts.
RAM Consumption
Commercial MoE architectures such as Mixtral 8x7B (the open-weights benchmark) require every expert to be loaded into memory at once, even though only two are active at a time. LEMoE removes this constraint by keeping only the router, plus whichever expert is currently active, in memory.
*Based on a 1.5 GB E5 router plus a 500 MB active reference expert model (grape-malbec). The remaining experts can be delegated to external APIs with no impact on RAM.
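The arithmetic behind the footnote can be sketched in a few lines of Python. The router and expert sizes come from the footnote above. The baseline figures are assumptions that do not come from this page: roughly 46.7 billion total parameters for Mixtral 8x7B, stored as 16-bit weights. Both function names are hypothetical, and the result is a back-of-the-envelope estimate of weight storage only, not a measurement.

```python
def lemoe_resident_mb(router_mb: int, active_expert_mb: int) -> int:
    """RAM held locally: the router plus the single expert currently loaded."""
    return router_mb + active_expert_mb


def all_experts_resident_mb(total_params: float, bytes_per_param: float) -> float:
    """RAM for the weights when every expert must be loaded at the same time."""
    return total_params * bytes_per_param / 1e6


# Figures from the footnote: 1.5 GB router and 500 MB active expert.
LEMOE_MB = lemoe_resident_mb(router_mb=1500, active_expert_mb=500)

# Assumption (not from this page): ~46.7e9 parameters, 2 bytes per parameter.
BASELINE_MB = all_experts_resident_mb(total_params=46.7e9, bytes_per_param=2)

if __name__ == "__main__":
    print(f"LEMoE resident weights:  {LEMOE_MB / 1000:.1f} GB")
    print(f"All experts loaded:      {BASELINE_MB / 1000:.1f} GB")
```

Quantized weights would shrink the baseline, so change `bytes_per_param` to match the deployment being compared.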
Latency Overhead
How much time does a request lose by passing through the router before it reaches the final model? Next to nothing. The E5 vectorizer and the cosine distance calculation are both finely tuned.
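As a rough illustration of why this step is cheap, here is a minimal sketch of routing by cosine similarity (one minus the cosine distance mentioned above). It is not LEMoE's actual implementation. The toy three-dimensional vectors stand in for real E5 embeddings, and `route`, `EXPERT_VECS`, and the expert name `external-api-expert` are hypothetical. Only `grape-malbec` appears in the original text.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def route(query_vec: Sequence[float], expert_vecs: Dict[str, List[float]]) -> str:
    """Return the name of the expert whose reference vector is closest to the query."""
    return max(
        expert_vecs,
        key=lambda name: cosine_similarity(query_vec, expert_vecs[name]),
    )


# Toy vectors standing in for E5 embeddings (assumption: real ones are much longer).
EXPERT_VECS: Dict[str, List[float]] = {
    "grape-malbec": [0.9, 0.1, 0.0],
    "external-api-expert": [0.0, 0.2, 0.9],  # hypothetical expert name
}

if __name__ == "__main__":
    query = [0.8, 0.3, 0.1]  # placeholder for an embedded user query
    print(route(query, EXPERT_VECS))
```

Routing costs one dot product per expert, so the work grows linearly with the number of experts and the embedding size. The embedding call itself is left out of this sketch.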
That fraction of a second saves gigabytes of memory and money on API calls.