Microsoft has announced the alpha release of DeepSpeed-FastGen, a system designed to improve the deployment and serving of large language models (LLMs). DeepSpeed-FastGen is the synergistic composition of DeepSpeed-MII and DeepSpeed-Inference, is based on the Dynamic SplitFuse technique, and currently supports several model architectures.

The Dynamic SplitFuse technique is a new token composition strategy for prompt processing and token generation. SplitFuse enables DeepSpeed-FastGen to offer up to 2.3 times higher effective throughput compared to systems like vLLM. It allows DeepSpeed-FastGen to run at a consistent forward size by splitting prompts into partial tokens and composing them with generation. This results in improved responsiveness, efficiency, and lower variance, providing lower-latency and higher-throughput streaming generation to all clients compared to other serving systems.

According to Samyam Rajbhandari on LinkedIn:

"FastGen effectively synthesizes novel batch scheduling techniques with efficient KV cache management, communication-optimized tensor parallelism, and ultra-fast CUDA kernels. Furthermore, it integrates a low-overhead load balancer that offers perfect linear scaling on dozens of replicas."

In terms of performance, DeepSpeed-FastGen outperforms vLLM in both throughput and latency: it provides either greater throughput at equivalent latency or more responsive latency at the same throughput. For example, on Llama-2 70B with 4 A100x80GB GPUs, DeepSpeed-FastGen demonstrates up to 2x higher throughput (1.36 rps vs. 0.67 rps) at identical latency (9 seconds), or up to 50% latency reduction (7 seconds vs. 14 seconds) while achieving the same throughput (1.2 rps).

DeepSpeed-FastGen also offers replica-level load balancing that evenly distributes requests across multiple servers, allowing for easy scalability. Throughput with 16 replicas reaches 23.7 queries/sec, a linear 16x increase compared to a single replica.

DeepSpeed-FastGen currently supports several model architectures, including LLaMA, LLaMA2, Mistral, and OPT. All current models leverage Hugging Face APIs in the backend to provide both the model weights and the model's corresponding tokenizer. Microsoft says it plans to add additional models in the future.

DeepSpeed-FastGen offers two deployment options: an interactive non-persistent pipeline or a persistent serving deployment.
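The Dynamic SplitFuse idea of keeping every forward pass at a consistent size can be sketched in a few lines. This is a toy illustration only, not DeepSpeed's implementation: the `TOKEN_BUDGET` constant, `Request` class, and `schedule_step` function are all inventions for this sketch. Each step serves one decode token per in-flight generation, then splits pending prompts into chunks to exactly fill the remaining fixed token budget:

```python
from dataclasses import dataclass

# Fixed number of tokens processed per forward pass (hypothetical value).
TOKEN_BUDGET = 8

@dataclass
class Request:
    rid: str
    prompt_remaining: int   # prompt tokens not yet processed
    decoding: bool = False  # True once the prompt is fully consumed

def schedule_step(requests):
    """Compose one forward pass: decode tokens first, then fill the
    rest of the fixed budget with chunks split from pending prompts."""
    batch, budget = [], TOKEN_BUDGET
    # 1) One token per in-flight generation (decode phase).
    for r in requests:
        if r.decoding and budget > 0:
            batch.append((r.rid, "decode", 1))
            budget -= 1
    # 2) Split prompts to fill whatever budget remains.
    for r in requests:
        if not r.decoding and r.prompt_remaining > 0 and budget > 0:
            take = min(r.prompt_remaining, budget)
            batch.append((r.rid, "prefill", take))
            r.prompt_remaining -= take
            budget -= take
            if r.prompt_remaining == 0:
                r.decoding = True  # next step this request decodes

    return batch

# A long new prompt is split across steps while a running generation
# keeps producing one token every forward pass, so latency stays even.
reqs = [Request("gen", 0, decoding=True), Request("new", 12)]
print(schedule_step(reqs))  # [('gen', 'decode', 1), ('new', 'prefill', 7)]
print(schedule_step(reqs))  # [('gen', 'decode', 1), ('new', 'prefill', 5)]
```

Because every batch sums to the same token count, the decode stream never stalls behind a long prompt, which is the source of the lower variance the article describes.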
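The replica-level load balancing described above can be approximated with a simple round-robin router; this is a minimal sketch under assumed names (the `RoundRobinBalancer` class and replica labels are hypothetical, not DeepSpeed-FastGen's API), showing why evenly distributed requests scale throughput near-linearly with replica count:

```python
import itertools

class RoundRobinBalancer:
    """Toy replica-level load balancer: each incoming request is routed
    to the next replica in a fixed cycle, keeping per-replica load even."""

    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)

    def route(self, request_id):
        # Return (request, chosen replica); real balancers would also
        # track replica health and queue depth.
        return (request_id, next(self._cycle))

balancer = RoundRobinBalancer(["replica-0", "replica-1", "replica-2", "replica-3"])
assignments = [balancer.route(i) for i in range(8)]
print(assignments)  # each replica receives exactly 2 of the 8 requests
```

With each replica serving an equal share, aggregate throughput grows with the replica count, consistent with the 16-replica, 23.7 queries/sec figure above.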
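The two deployment options correspond to DeepSpeed-MII's `pipeline` and `serve` entry points. The sketch below follows the usage shown in the DeepSpeed-FastGen release materials; the model name is only an example, it requires a GPU plus a Hugging Face model download to actually run, and exact signatures may shift across alpha versions:

```python
import mii

# Option 1: interactive, non-persistent pipeline — lives only for the
# duration of this script.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
print(pipe(["DeepSpeed is"], max_new_tokens=128))

# Option 2: persistent serving deployment — stays up across processes
# so multiple clients can connect and generate.
client = mii.serve("mistralai/Mistral-7B-v0.1")
print(client.generate(["DeepSpeed is"], max_new_tokens=128))
client.terminate_server()
```

The non-persistent pipeline suits experimentation; the persistent deployment is the mode that the replica-level load balancer scales across servers.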