## Usage

1. Register an account at https://huggingface.co/, request access to the model you need, and create an access token.
2. The machine must have an NVIDIA GPU.
3. Edit `/etc/docker/daemon.json` and add:
```
"runtimes": {
    "nvidia": {
        "path": "nvidia-container-runtime",
        "runtimeArgs": []
    }
}
```
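As a hedged sketch of step 3, the runtime entry above can also be merged into an existing `daemon.json` programmatically. The helper below is illustrative only (not part of Docker or NVIDIA tooling):

```python
import json

# Illustrative helper (not an official tool): merge the NVIDIA runtime
# entry shown above into the parsed contents of /etc/docker/daemon.json.
NVIDIA_RUNTIME = {
    "nvidia": {
        "path": "nvidia-container-runtime",
        "runtimeArgs": [],
    }
}

def merged_daemon_config(existing: dict) -> dict:
    """Return a copy of `existing` with the NVIDIA runtime registered."""
    cfg = dict(existing)  # shallow copy is enough for this sketch
    runtimes = dict(cfg.get("runtimes", {}))
    runtimes.update(NVIDIA_RUNTIME)
    cfg["runtimes"] = runtimes
    return cfg

# An empty daemon.json gains exactly the block shown above:
print(json.dumps(merged_daemon_config({}), indent=4))
```

Any existing keys in `daemon.json` (log drivers, storage options, and so on) are preserved; only the `"runtimes"` map is extended.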
4. Install the nvidia-container-runtime and nvidia-docker2 packages.
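Once the four steps above are done, a container can be launched with the NVIDIA runtime. The sketch below only composes the `docker run` command as a string; the image tag, model name, and port are illustrative assumptions, not prescribed by this repository:

```python
# Hedged sketch: build the `docker run` command for serving a model with
# the NVIDIA runtime configured above. Image, model, and port are
# illustrative assumptions.
def docker_run_cmd(model: str, hf_token: str, port: int = 8000) -> str:
    return " ".join([
        "docker run --runtime=nvidia --gpus all",
        f"--env HUGGING_FACE_HUB_TOKEN={hf_token}",  # token from step 1
        f"-p {port}:8000",
        "vllm/vllm-openai:latest",
        f"--model {model}",
    ])

print(docker_run_cmd("meta-llama/Llama-3.1-8B-Instruct", "<your-token>"))
```

The `--runtime=nvidia` flag selects the runtime registered in `daemon.json` in step 3.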
## About

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
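To make the **PagedAttention** bullet above concrete, here is a hedged toy sketch (the names `BlockTable` and `append_token` are illustrative, not vLLM's internals): the KV cache is carved into fixed-size blocks, and each sequence keeps a table of the physical blocks it owns, so memory grows one block at a time instead of being reserved for the maximum length up front.

```python
# Toy illustration of paged KV-cache allocation (not vLLM's actual code).
BLOCK_SIZE = 16  # tokens per KV-cache block; vLLM's block size is configurable

class BlockTable:
    """Maps a sequence's logical token positions to physical cache blocks."""
    def __init__(self):
        self.blocks = []      # physical block ids owned by this sequence
        self.num_tokens = 0

    def append_token(self, free_blocks):
        # Allocate a new physical block only when the current one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(free_blocks.pop())
        self.num_tokens += 1

free = list(range(64))   # pool of physical block ids
seq = BlockTable()
for _ in range(33):      # 33 tokens occupy ceil(33/16) = 3 blocks
    seq.append_token(free)
print(len(seq.blocks))   # -> 3
```

Because blocks are allocated on demand and returned to the pool when a sequence finishes, many more concurrent sequences fit in the same GPU memory than with contiguous per-sequence reservation.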
**Performance benchmark**: We include a [performance benchmark](https://buildkite.com/vllm/performance-benchmark/builds/4068) that compares the performance of vLLM against other LLM serving engines ([TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [text-generation-inference](https://github.com/huggingface/text-generation-inference), and [lmdeploy](https://github.com/InternLM/lmdeploy)).

vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron
- Prefix caching support
- Multi-LoRA support
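Because the server is OpenAI-compatible (bullet above), any OpenAI-style client can talk to it. As a hedged sketch (the endpoint path follows the OpenAI API shape, and the model name is an assumption), this is the kind of JSON body a client would POST to the server:

```python
import json

# Hedged sketch of an OpenAI-style completions request body; the model
# name is an illustrative assumption.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 16,
    "temperature": 0.0,
}
body = json.dumps(payload)
# A client would POST `body` to http://localhost:8000/v1/completions
# with the header "Content-Type: application/json".
print(body)
```

Existing OpenAI SDKs can be pointed at the vLLM server by changing only the base URL, which is what "OpenAI-compatible" buys you in practice.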
vLLM seamlessly supports most popular open-source models on HuggingFace, including:

- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Experts LLMs (e.g., Mixtral)
- Embedding Models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).