FlexGen: Running large language models on a single GPU
FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. It achieves high throughput through IO-efficient offloading, compression, and large effective batch sizes (a conceptual sketch of the offloading idea appears below).

Throughput-Oriented Inference for Large Language Models

In recent years, large language models (LLMs) have shown strong performance across a wide range of tasks. Increasingly, LLMs have been…
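
As a rough illustration of the offloading idea, the sketch below is plain PyTorch, not FlexGen's actual implementation: layer weights live in pinned CPU memory and are streamed to the GPU on a side stream, so copying layer i+1 overlaps with computing layer i. The function name `run_offloaded`, the matrix-shaped "layers", and the ReLU stand-in for a transformer block are all invented for the example.

```python
# Conceptual sketch of IO-efficient weight offloading (not FlexGen's code):
# weights stay in pinned CPU memory and are prefetched to the GPU on a
# separate CUDA stream, overlapping host-to-device copies with compute.
import torch

def run_offloaded(layers_cpu, x, device="cuda"):
    copy_stream = torch.cuda.Stream()
    # Prefetch the first layer's weight on the copy stream.
    with torch.cuda.stream(copy_stream):
        w_next = layers_cpu[0].to(device, non_blocking=True)
    for i in range(len(layers_cpu)):
        # Make sure the prefetched weight has finished copying.
        torch.cuda.current_stream().wait_stream(copy_stream)
        w = w_next
        # Prevent the allocator from reusing w's memory while compute still reads it.
        w.record_stream(torch.cuda.current_stream())
        # Start copying the next layer while the current one computes.
        if i + 1 < len(layers_cpu):
            with torch.cuda.stream(copy_stream):
                w_next = layers_cpu[i + 1].to(device, non_blocking=True)
        x = torch.relu(x @ w)  # stand-in for a real transformer layer
    return x

if __name__ == "__main__":
    assert torch.cuda.is_available(), "this sketch needs a CUDA GPU"
    hidden = 1024
    # Pinned CPU memory enables asynchronous host-to-device copies.
    layers = [torch.randn(hidden, hidden).pin_memory() for _ in range(8)]
    x = torch.randn(32, hidden, device="cuda")  # a batch of hidden states
    print(run_offloaded(layers, x).shape)
```

FlexGen's actual scheduler goes further, combining this kind of compute/transfer overlap with compression and large effective batch sizes so that each expensive weight transfer is amortized over many generated tokens.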