Usage Instructions
- Register an account at https://huggingface.co/, request access to the model you want to serve, and create an access token.
- Ensure the machine has an NVIDIA GPU.
- Modify the /etc/docker/daemon.json file and add:

```json
"runtimes": {
  "nvidia": {
    "path": "nvidia-container-runtime",
    "runtimeArgs": []
  }
}
```
- Install the nvidia-container-runtime and nvidia-docker2 components, then restart the Docker daemon so the configuration takes effect.
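
Once the deployed container is running, it can be queried through the OpenAI-compatible API that the vllm/vllm-openai image serves. Below is a minimal sketch using the official openai Python client; the host, port, and model name are assumptions for illustration (the server listens on port 8000 by default):

```python
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # adjust to where the container is reachable
    api_key="EMPTY",  # vLLM accepts any key unless the server was started with --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: use the model your server loaded
    messages=[{"role": "user", "content": "Briefly explain what PagedAttention does."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The model field must match the model the server was launched with; vLLM rejects requests for models it did not load.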
About
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: GPTQ, AWQ, INT4, INT8, and FP8 (see the sketch after this list).
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
- Speculative decoding
- Chunked prefill
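
The quantization support above is also exposed directly in the Python API. A minimal offline-inference sketch, assuming an AWQ-quantized checkpoint is available; the model name is a placeholder:

```python
from vllm import LLM

# Load an AWQ-quantized checkpoint (placeholder model name; assumes
# a quantized checkpoint is available on the Hugging Face Hub).
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")

outputs = llm.generate(["What is chunked prefill?"])
print(outputs[0].outputs[0].text)
```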
Performance benchmark: we include a benchmark comparing vLLM with other LLM serving engines (TensorRT-LLM, text-generation-inference, and lmdeploy).
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models (see the sketch after this list)
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPUs, and AWS Neuron.
- Prefix caching support
- Multi-LoRA support
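
To make the Hugging Face integration and sampling features above concrete, here is a minimal offline-inference sketch; the model name and parameter values are illustrative:

```python
from vllm import LLM, SamplingParams

# Any supported Hugging Face model can be loaded by name
# (placeholder model; gated models need a Hugging Face token).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,  # increase to shard the model across multiple GPUs
)

# Parallel sampling: n=2 returns two candidate completions per prompt.
params = SamplingParams(n=2, temperature=0.8, max_tokens=64)

for output in llm.generate(["Summarize what vLLM does."], params):
    for candidate in output.outputs:
        print(candidate.text)
```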
vLLM seamlessly supports most popular open-source models on HuggingFace, including:
- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Experts LLMs (e.g., Mixtral)
- Embedding Models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)
Find the full list of supported models in the vLLM documentation (https://docs.vllm.ai).