LLMEasyQuant: Scalable Quantization for Parallel and Distributed LLM Inference

LLMEasyQuant introduces a modular quantization framework that leverages fused CUDA kernels and NCCL synchronization to optimize LLM inference across diverse hardware. This re...
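To make the quantization idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization, the basic building block such frameworks optimize; this is an illustrative example, not LLMEasyQuant's actual API or kernel implementation.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: one scale for the whole tensor."""
    scale = float(np.max(np.abs(x))) / 127.0
    if scale == 0.0:          # all-zero tensor: avoid division by zero
        scale = 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from the INT8 codes."""
    return q.astype(np.float32) * scale

# Round-trip a small weight tensor and inspect the reconstruction error.
w = np.array([0.5, -1.2, 3.3, -0.07], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
max_err = float(np.max(np.abs(w - w_hat)))   # bounded by scale / 2
```

A production framework would fuse this rounding and scaling into GPU kernels and synchronize scales across devices (e.g. via NCCL all-reduce) rather than running it in NumPy.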

Level: advanced

By Unknown

Category: research