1Panel-Appstore/vllm

README_en.md

Usage Instructions

  1. Register an account at https://huggingface.co/, request access to the model you want to serve, and create an access token.
  2. Ensure the machine has an NVIDIA GPU.
  3. Edit the /etc/docker/daemon.json file and add:

     {
       "runtimes": {
         "nvidia": {
           "path": "nvidia-container-runtime",
           "runtimeArgs": []
         }
       }
     }

  4. Install the nvidia-container-runtime and nvidia-docker2 components, then restart Docker so the runtime configuration takes effect.
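The steps above can be sketched as shell commands. This is a minimal sketch assuming a Debian/Ubuntu host with NVIDIA's package repositories already configured (see NVIDIA's documentation for adding them); adjust package commands for your distribution:

```shell
# Install the NVIDIA container runtime components (step 4 above).
sudo apt-get update
sudo apt-get install -y nvidia-container-runtime nvidia-docker2

# Register the runtime in /etc/docker/daemon.json (step 3 above).
# NOTE: this overwrites any existing daemon.json; merge by hand if you
# already have one.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF

# Restart Docker so the new runtime takes effect, then verify it is visible.
sudo systemctl restart docker
docker info | grep -i nvidia
```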

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantizations: GPTQ, AWQ, INT4, INT8, and FP8
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill

Performance benchmark: We include a performance benchmark comparing vLLM against other LLM serving engines (TensorRT-LLM, text-generation-inference, and lmdeploy).

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron
  • Prefix caching support
  • Multi-LoRA support
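Because the server speaks the OpenAI API, a chat-completion request can be built with nothing but the standard library. The endpoint and model name below are illustrative assumptions, not values shipped with this app:

```python
import json
import urllib.request

# Assumed local endpoint for the vLLM OpenAI-compatible server (illustrative).
BASE_URL = "http://localhost:8000/v1"

def chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build a chat-completion request body in the OpenAI-compatible format."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = chat_request("meta-llama/Llama-3.1-8B-Instruct", "Say hello")
print(json.dumps(body, indent=2))

# Sending the request requires a running server:
# req = urllib.request.Request(
#     f"{BASE_URL}/chat/completions",
#     data=json.dumps(body).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```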

vLLM seamlessly supports most popular open-source models on Hugging Face, including:

  • Transformer-like LLMs (e.g., Llama)
  • Mixture-of-Expert LLMs (e.g., Mixtral)
  • Embedding Models (e.g., E5-Mistral)
  • Multi-modal LLMs (e.g., LLaVA)
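Loading one of these models through vLLM's Python API can be sketched as follows. This assumes the vllm package is installed and a CUDA-capable GPU is available; the model name is an illustrative assumption:

```python
# Offline batch inference sketch with vLLM's Python API.
# Assumes `pip install vllm`, a CUDA-capable GPU, and access to the model.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model name
params = SamplingParams(temperature=0.8, max_tokens=64)

# generate() batches the prompts internally (continuous batching,
# PagedAttention) and returns one result per prompt.
outputs = llm.generate(["Hello, my name is", "The capital of France is"], params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```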

Find the full list of supported models in the vLLM documentation.