Usage Instructions
- Register an account at https://huggingface.co/, request access to the model you want to serve, and create an access token.
- Ensure the machine has an NVIDIA GPU.
- Modify the /etc/docker/daemon.json file and add:

```json
"runtimes": {
  "nvidia": {
    "path": "nvidia-container-runtime",
    "runtimeArgs": []
  }
}
```
- Install the nvidia-container-runtime and nvidia-docker2 components, then restart the Docker daemon so the configuration takes effect.
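
Once the deployed container is running, it can be queried through the OpenAI-compatible API that the vllm/vllm-openai image serves. Below is a minimal sketch using the official openai Python client; the host, port, and model name are assumptions for illustration (the server listens on port 8000 by default):

```python
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # adjust to where the container is reachable
    api_key="EMPTY",  # vLLM accepts any key unless the server was started with --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: use the model your server loaded
    messages=[{"role": "user", "content": "Briefly explain what PagedAttention does."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The model field must match the model the server was launched with; vLLM rejects requests for models it did not load.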
About
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: GPTQ, AWQ, INT4, INT8, and FP8 (see the sketch after this list).
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
- Speculative decoding
- Chunked prefill
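
The quantization support above is also exposed directly in the Python API. A minimal offline-inference sketch, assuming an AWQ-quantized checkpoint is available; the model name is a placeholder:

```python
from vllm import LLM

# Load an AWQ-quantized checkpoint (placeholder model name; assumes
# a quantized checkpoint is available on the Hugging Face Hub).
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")

outputs = llm.generate(["What is chunked prefill?"])
print(outputs[0].outputs[0].text)
```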
Performance benchmark: we include a benchmark comparing vLLM with other LLM serving engines (TensorRT-LLM, text-generation-inference, and lmdeploy).
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models (see the sketch after this list)
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPUs, and AWS Neuron.
- Prefix caching support
- Multi-LoRA support
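
To make the Hugging Face integration and sampling features above concrete, here is a minimal offline-inference sketch; the model name and parameter values are illustrative:

```python
from vllm import LLM, SamplingParams

# Any supported Hugging Face model can be loaded by name
# (placeholder model; gated models need a Hugging Face token).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,  # increase to shard the model across multiple GPUs
)

# Parallel sampling: n=2 returns two candidate completions per prompt.
params = SamplingParams(n=2, temperature=0.8, max_tokens=64)

for output in llm.generate(["Summarize what vLLM does."], params):
    for candidate in output.outputs:
        print(candidate.text)
```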
vLLM seamlessly supports most popular open-source models on HuggingFace, including:
- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Experts LLMs (e.g., Mixtral)
- Embedding Models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)
Find the full list of supported models in the vLLM documentation (https://docs.vllm.ai).