How to Deploy an OpenAI API-Style Server with vLLM Using Docker

dun1211

2024. 12. 27. 21:10

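Overview

vLLM is a powerful LLM serving library that can run an OpenAI API-style server. This guide shows how to deploy vLLM as a Docker container. With Docker, the CUDA and Python dependencies ship inside the image, so setup and deployment get simpler, and you can choose exactly which GPUs the container is allowed to use 😎

Script for Running vLLM with Docker

Below is a Docker launch script (run_vllm_docker.sh) that brings up a vLLM server right away. It is based on the official vLLM documentation!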

#!/bin/bash
# Script to run vLLM Docker container with NVIDIA runtime and custom settings

# Load environment variables from .env file
source .env

# Set variables
GPU_ID=0 # or 0,1 for multi-GPU
PORT=8000
MAX_MODEL_LEN=131072
MAX_NUM_BATCHED_TOKENS=4096 # defined here, but not passed to the server in the docker run command below
MAX_NUM_SEQS=256
MODEL_PATH="meta-llama/Llama-3.2-1B-Instruct" # or a local path such as ~/local_model_path/meta-llama_Llama-3.2-1B-Instruct
SERVED_MODEL_NAME="llama-3.2-1b-instruct"
GPU_MEMORY_UTILIZATION=0.9 # default 0.9
LOG_FILE="/home/ec2-user/vllm/vllm_docker.log" # log file path

# Calculate tensor parallel size based on the number of GPUs
IFS=',' read -r -a GPU_ARRAY <<< "$GPU_ID"
TENSOR_PARALLEL_SIZE=${#GPU_ARRAY[@]}

# Check if MODEL_PATH is a directory
if [ -d "$MODEL_PATH" ]; then
    VOLUME_OPTION="-v $MODEL_PATH:$MODEL_PATH"
else
    VOLUME_OPTION=""
fi

# Run the Docker container
docker run --runtime nvidia --gpus '"device='$GPU_ID'"' \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    $VOLUME_OPTION \
    --env "HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN" \
    -p $PORT:8000 \
    --ipc=host \
    vllm/vllm-openai:v0.6.6 \
    --model $MODEL_PATH \
    --trust-remote-code \
    --served-model-name $SERVED_MODEL_NAME \
    --max-model-len $MAX_MODEL_LEN \
    --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
    --swap-space 0 \
    --gpu-memory-utilization $GPU_MEMORY_UTILIZATION \
    --max-num-seqs $MAX_NUM_SEQS \
    2>&1 | tee "$LOG_FILE"

run_vllm_docker.sh reads HUGGING_FACE_HUB_TOKEN (your Hugging Face personal access token) from the .env file.

So prepare a .env file as well.

# .env file for vLLM Docker script
HUGGING_FACE_HUB_TOKEN=your_hugging_face_personal_access_token

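How the Script Works

1. GPU and port settings

GPU_ID selects which GPUs to use. For multiple GPUs, set it like GPU_ID=0,1. PORT is the host port through which the API server is reached from outside; the script maps it to port 8000 inside the container.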
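The GPU handling can be pulled out into two small helpers. This is only a sketch of what the script already does inline: count_gpus and gpus_flag are hypothetical names introduced here for illustration, not part of vLLM or Docker.

```shell
#!/usr/bin/env bash
# Count the GPUs named in a comma-separated ID list such as "0" or "0,1".
# count_gpus is a hypothetical helper name used only for this sketch.
count_gpus() {
    local ids="$1"
    local -a gpu_array
    IFS=',' read -r -a gpu_array <<< "$ids"
    echo "${#gpu_array[@]}"
}

# Build the value handed to `docker run --gpus`. The literal double quotes
# are kept so that a list like 0,1 is treated as a single device value.
gpus_flag() {
    printf '"device=%s"' "$1"
}
```

With GPU_ID=0,1, count_gpus "$GPU_ID" prints 2, which is the number the script uses as the tensor parallel size.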

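2. Model path and settings

MODEL_PATH: a Hugging Face model name or a local path. To use a model stored locally, enter its directory path. If you enter a model name instead, the model is downloaded from Hugging Face automatically. One thing to watch out for when you use a name: it must be the exact repository name published on Hugging Face (for example, meta-llama/Llama-3.2-1B-Instruct)!! 🧨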
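The local-versus-Hub decision comes down to a single directory test. Here is a minimal sketch of that check as a function; volume_option_for is a hypothetical helper name, and it assumes the model should be mounted at the same path inside the container, as the script does.

```shell
#!/usr/bin/env bash
# Print the `docker run` volume flag for a model reference.
# A local directory is bind-mounted at the same path inside the container;
# anything else is treated as a Hugging Face model name and needs no mount.
# volume_option_for is a hypothetical helper name used only for this sketch.
volume_option_for() {
    local model_path="$1"
    if [ -d "$model_path" ]; then
        printf -- '-v %s:%s' "$model_path" "$model_path"
    fi
}
```

One detail worth keeping in mind: a leading ~ is not expanded inside double quotes, so a quoted local path should be written with $HOME (or as an absolute path) for the directory test to succeed.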

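3. Managing the Hugging Face access token

HUGGING_FACE_HUB_TOKEN is a personal access token issued by Hugging Face. Some models (the Llama 3.2 models, for example) are gated, so you have to request download access first. Because the token is sensitive, it is stored in a .env file and loaded with the source .env command. The token is needed to authenticate with Hugging Face and download such models.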
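If the token is missing, the container only fails later, when the download is rejected. A small guard can catch that before docker run. This is a sketch under my own assumptions: load_env_file and require_token are hypothetical helper names, and the file is assumed to contain plain KEY=VALUE lines.

```shell
#!/usr/bin/env bash
# Load KEY=VALUE pairs from an env file and export them to child processes.
# load_env_file and require_token are hypothetical helper names.
load_env_file() {
    local env_file="$1"
    [ -f "$env_file" ] || return 1
    set -a            # export every variable assigned while sourcing
    . "$env_file"
    set +a
}

# Succeed only when the Hugging Face token is set and non-empty.
require_token() {
    [ -n "${HUGGING_FACE_HUB_TOKEN:-}" ]
}
```

In the launch script this could be used as load_env_file .env && require_token || { echo "HUGGING_FACE_HUB_TOKEN is not set" >&2; exit 1; }. Keeping .env out of version control and restricting it with chmod 600 .env is also a good idea.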

4. Docker image and execution

As of December 27, 2024, the script uses vllm/vllm-openai:v0.6.6, the most recent vLLM image at that time. This may change later! The image is built for running the OpenAI API server: as the Dockerfile shows, the container runs python3 -m vllm.entrypoints.openai.api_server.
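Since the image tag is the part most likely to change, it may be convenient to keep it in one overridable place. A minimal sketch follows; VLLM_IMAGE_TAG and image_ref are hypothetical names I made up for this example and are not read by vLLM or Docker.

```shell
#!/usr/bin/env bash
# Compose the image reference from a tag that can be overridden at call time.
# VLLM_IMAGE_TAG is a hypothetical variable name; v0.6.6 is the tag used in
# this post and serves as the default.
image_ref() {
    local tag="${VLLM_IMAGE_TAG:-v0.6.6}"
    echo "vllm/vllm-openai:${tag}"
}
```

The docker run line could then reference "$(image_ref)" instead of a hard-coded image name, so moving to a newer release only means setting one variable.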

Key Parameters

The vLLM serving parameters are defined in https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py. The main ones are listed below.

--model
- A Hugging Face model name or a local path. If you give a model name, the model is downloaded automatically.
--served-model-name
- The model name used in the API. It defaults to the value of --model.
--max-model-len
- The model's maximum context length, counting prompt tokens and generated tokens together. A request that exceeds this length cannot be processed. If it is not specified, it is derived from the model's config file.
--tensor-parallel-size
- The tensor parallel size. In this script it is computed automatically from the number of GPUs: tp=1 for one GPU, tp=2 for two GPUs.
--gpu-memory-utilization
- The fraction of GPU memory that vLLM may use. The default is 0.9.
--max-num-seqs
- The maximum number of sequences processed per iteration. You can think of it as the maximum batch size. If max-num-seqs is 4 and 8 requests arrive, only 4 are processed at a time and the rest wait in a pending state.
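The flags listed above can also be collected in a bash array before being handed to docker run, which keeps each value a single argument even if it contains spaces. This is a sketch of one way to organize them, not how the original script is written; build_vllm_args and VLLM_ARGS are hypothetical names.

```shell
#!/usr/bin/env bash
# Collect the server flags in a bash array (two elements per flag).
# build_vllm_args and VLLM_ARGS are hypothetical names used for illustration.
build_vllm_args() {
    local model="$1" served_name="$2" max_len="$3"
    local tp_size="$4" mem_util="$5" max_seqs="$6"
    VLLM_ARGS=(
        --model "$model"
        --served-model-name "$served_name"
        --max-model-len "$max_len"
        --tensor-parallel-size "$tp_size"
        --gpu-memory-utilization "$mem_util"
        --max-num-seqs "$max_seqs"
    )
}
```

After calling build_vllm_args with the script's variables, the array would be expanded after the image name, as in docker run ... vllm/vllm-openai:v0.6.6 "${VLLM_ARGS[@]}".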

Running It on a Single AWS T4 (16GB)

1. Run the container with the script >> bring up the vLLM OpenAI API server

I ran the script on a single AWS T4 (16GB) 😀

# Make the script executable
chmod +x run_vllm_docker.sh

./run_vllm_docker.sh

The Llama-3.2-1B config file sets the dtype to bfloat16. The T4 does not support bfloat16, however,
so --dtype float16 has to be added to the script..!
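The dtype choice can be made from the GPU's CUDA compute capability: vLLM requires 8.0 or higher for bfloat16, and the T4 (Turing) reports 7.5. A minimal sketch of that rule follows; pick_dtype is a hypothetical helper name, and reading the capability with nvidia-smi --query-gpu=compute_cap --format=csv,noheader is my assumption about a reasonably recent NVIDIA driver.

```shell
#!/usr/bin/env bash
# Choose a --dtype value from a CUDA compute capability string such as "7.5".
# bfloat16 needs compute capability 8.0 or higher; older GPUs fall back to float16.
# pick_dtype is a hypothetical helper name used only for this sketch.
pick_dtype() {
    local major="${1%%.*}"   # "7.5" -> "7"
    if [ "$major" -ge 8 ]; then
        echo "bfloat16"
    else
        echo "float16"
    fi
}
```

For a T4, pick_dtype 7.5 prints float16, which is the value passed as --dtype in this post.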

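When you run it, a log like the one below appears in the terminal.

When max_model_len is larger than 32K, chunked prefill is enabled by default. In that case max_num_batched_tokens is 2048.

With chunked prefill, a large prefill is split into smaller chunks that can be batched together with decode requests. You can turn it off by setting --enable-chunked-prefill to False, but keeping it on is recommended when large prefills come in (https://docs.vllm.ai/en/v0.4.2/models/performance.html)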
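To get a feel for what a 2048-token budget means, the number of chunks for one prompt is a ceiling division. This is a simplified sketch: it ignores the decode tokens that share the same budget, so the result is a lower bound on the number of scheduler steps. prefill_chunks is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Lower bound on the scheduler steps needed to prefill one prompt when each
# step can take at most `budget` tokens (ceiling division).
# prefill_chunks is a hypothetical helper name used only for this sketch.
prefill_chunks() {
    local prompt_tokens="$1" budget="$2"
    echo $(( (prompt_tokens + budget - 1) / budget ))
}
```

A prompt that fills the whole 131072-token context would need at least prefill_chunks 131072 2048 = 64 steps, while a 5000-token prompt would need at least 3.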

INFO 12-27 03:31:13 api_server.py:712] vLLM API server version 0.6.6
INFO 12-27 03:31:13 api_server.py:713] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-3.2-1B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=131072, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=0.0, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, 
lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['llama-3.2-1b-instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 12-27 03:31:13 api_server.py:199] Started engine process with PID 16
WARNING 12-27 03:31:13 config.py:2276] Casting torch.bfloat16 to torch.float16.
WARNING 12-27 03:31:18 config.py:2276] Casting torch.bfloat16 to torch.float16.
INFO 12-27 03:31:20 config.py:510] This model supports multiple tasks: {'reward', 'generate', 'classify', 'embed', 'score'}. Defaulting to 'generate'.
WARNING 12-27 03:31:20 arg_utils.py:1103] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 12-27 03:31:20 config.py:1458] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 12-27 03:31:24 config.py:510] This model supports multiple tasks: {'classify', 'score', 'reward', 'embed', 'generate'}. Defaulting to 'generate'.
WARNING 12-27 03:31:24 arg_utils.py:1103] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 12-27 03:31:24 config.py:1458] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 12-27 03:31:24 llm_engine.py:234] Initializing an LLM engine (v0.6.6) with config: model='meta-llama/Llama-3.2-1B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-1B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=llama-3.2-1b-instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 12-27 03:31:26 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 12-27 03:31:26 selector.py:129] Using XFormers backend.
INFO 12-27 03:31:27 model_runner.py:1094] Starting to load model meta-llama/Llama-3.2-1B-Instruct...
INFO 12-27 03:31:27 weight_utils.py:251] Using model weights format ['*.safetensors']
INFO 12-27 03:31:27 weight_utils.py:296] No model.safetensors.index.json found in remote.

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.85s/it]

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.85s/it]

INFO 12-27 03:31:29 model_runner.py:1099] Loading model weights took 2.3185 GB
INFO 12-27 03:31:30 worker.py:241] Memory profiling takes 0.62 seconds
INFO 12-27 03:31:30 worker.py:241] the current vLLM instance can use total_gpu_memory (14.57GiB) x gpu_memory_utilization (0.90) = 13.11GiB
INFO 12-27 03:31:30 worker.py:241] model weights take 2.32GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.18GiB; the rest of the memory reserved for KV Cache is 9.54GiB.
INFO 12-27 03:31:30 gpu_executor.py:76] # GPU blocks: 19528, # CPU blocks: 0
INFO 12-27 03:31:30 gpu_executor.py:80] Maximum concurrency for 131072 tokens per request: 2.38x
INFO 12-27 03:31:30 model_runner.py:1415] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.

Capturing CUDA graph shapes:   0%|          | 0/35 [00:00<?, ?it/s]
Capturing CUDA graph shapes:   3%|▎         | 1/35 [00:00<00:17,  1.99it/s]
Capturing CUDA graph shapes:   6%|▌         | 2/35 [00:00<00:15,  2.12it/s]
Capturing CUDA graph shapes:   9%|▊         | 3/35 [00:01<00:14,  2.18it/s]
Capturing CUDA graph shapes:  11%|█▏        | 4/35 [00:01<00:14,  2.21it/s]
Capturing CUDA graph shapes:  14%|█▍        | 5/35 [00:02<00:13,  2.22it/s]
Capturing CUDA graph shapes:  17%|█▋        | 6/35 [00:02<00:13,  2.22it/s]
Capturing CUDA graph shapes:  20%|██        | 7/35 [00:03<00:12,  2.22it/s]
Capturing CUDA graph shapes:  23%|██▎       | 8/35 [00:03<00:12,  2.22it/s]
Capturing CUDA graph shapes:  26%|██▌       | 9/35 [00:04<00:11,  2.24it/s]
Capturing CUDA graph shapes:  29%|██▊       | 10/35 [00:04<00:11,  2.27it/s]
Capturing CUDA graph shapes:  31%|███▏      | 11/35 [00:04<00:10,  2.28it/s]
Capturing CUDA graph shapes:  34%|███▍      | 12/35 [00:05<00:10,  2.29it/s]
Capturing CUDA graph shapes:  37%|███▋      | 13/35 [00:05<00:09,  2.31it/s]
Capturing CUDA graph shapes:  40%|████      | 14/35 [00:06<00:09,  2.30it/s]
Capturing CUDA graph shapes:  43%|████▎     | 15/35 [00:06<00:08,  2.31it/s]
Capturing CUDA graph shapes:  46%|████▌     | 16/35 [00:07<00:08,  2.32it/s]
Capturing CUDA graph shapes:  49%|████▊     | 17/35 [00:07<00:07,  2.34it/s]
Capturing CUDA graph shapes:  51%|█████▏    | 18/35 [00:07<00:07,  2.36it/s]
Capturing CUDA graph shapes:  54%|█████▍    | 19/35 [00:08<00:06,  2.37it/s]
Capturing CUDA graph shapes:  57%|█████▋    | 20/35 [00:08<00:06,  2.38it/s]
Capturing CUDA graph shapes:  60%|██████    | 21/35 [00:09<00:05,  2.40it/s]
Capturing CUDA graph shapes:  63%|██████▎   | 22/35 [00:09<00:05,  2.41it/s]
Capturing CUDA graph shapes:  66%|██████▌   | 23/35 [00:09<00:05,  2.40it/s]
Capturing CUDA graph shapes:  69%|██████▊   | 24/35 [00:10<00:04,  2.40it/s]
Capturing CUDA graph shapes:  71%|███████▏  | 25/35 [00:10<00:04,  2.40it/s]
Capturing CUDA graph shapes:  74%|███████▍  | 26/35 [00:11<00:03,  2.40it/s]
Capturing CUDA graph shapes:  77%|███████▋  | 27/35 [00:11<00:03,  2.39it/s]
Capturing CUDA graph shapes:  80%|████████  | 28/35 [00:12<00:02,  2.39it/s]
Capturing CUDA graph shapes:  83%|████████▎ | 29/35 [00:12<00:02,  2.38it/s]
Capturing CUDA graph shapes:  86%|████████▌ | 30/35 [00:12<00:02,  2.39it/s]
Capturing CUDA graph shapes:  89%|████████▊ | 31/35 [00:13<00:01,  2.29it/s]
Capturing CUDA graph shapes:  91%|█████████▏| 32/35 [00:13<00:01,  2.32it/s]
Capturing CUDA graph shapes:  94%|█████████▍| 33/35 [00:14<00:00,  2.35it/s]
Capturing CUDA graph shapes:  97%|█████████▋| 34/35 [00:14<00:00,  2.39it/s]
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:15<00:00,  2.38it/s]
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:15<00:00,  2.32it/s]
INFO 12-27 03:31:45 model_runner.py:1535] Graph capturing finished in 15 secs, took 0.12 GiB
INFO 12-27 03:31:45 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 15.91 seconds
INFO 12-27 03:31:46 api_server.py:640] Using supplied chat template:
INFO 12-27 03:31:46 api_server.py:640] None
INFO 12-27 03:31:46 launcher.py:19] Available routes are:
INFO 12-27 03:31:46 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 12-27 03:31:46 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 12-27 03:31:46 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 12-27 03:31:46 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 12-27 03:31:46 launcher.py:27] Route: /health, Methods: GET
INFO 12-27 03:31:46 launcher.py:27] Route: /tokenize, Methods: POST
INFO 12-27 03:31:46 launcher.py:27] Route: /detokenize, Methods: POST
INFO 12-27 03:31:46 launcher.py:27] Route: /v1/models, Methods: GET
INFO 12-27 03:31:46 launcher.py:27] Route: /version, Methods: GET
INFO 12-27 03:31:46 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 12-27 03:31:46 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 12-27 03:31:46 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 12-27 03:31:46 launcher.py:27] Route: /pooling, Methods: POST
INFO 12-27 03:31:46 launcher.py:27] Route: /score, Methods: POST
INFO 12-27 03:31:46 launcher.py:27] Route: /v1/score, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
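The "Maximum concurrency for 131072 tokens per request: 2.38x" line in the log can be reproduced from the reported block count. The sketch below assumes 16 tokens per KV cache block; the log only shows block_size=None, so 16 is my assumption, chosen because it matches the reported figure. max_concurrency and kv_cache_tokens are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Total KV cache capacity in tokens: number of blocks times tokens per block.
# kv_cache_tokens and max_concurrency are hypothetical helper names.
kv_cache_tokens() {
    echo $(( $1 * $2 ))
}

# How many full-length requests fit in the KV cache at once.
# Arguments: GPU block count, tokens per block (assumed 16), max model length.
max_concurrency() {
    local gpu_blocks="$1" block_size="$2" max_model_len="$3"
    LC_ALL=C awk -v b="$gpu_blocks" -v s="$block_size" -v m="$max_model_len" \
        'BEGIN { printf "%.2f\n", (b * s) / m }'
}
```

With the values from the log, kv_cache_tokens 19528 16 gives 312448 tokens and max_concurrency 19528 16 131072 gives 2.38. Lowering --max-model-len raises this figure, because each request then reserves less of the cache in the worst case.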

2. Checking the server with an OpenAI Completions API call

Open another terminal window, then paste and run the command below.

curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-1b-instruct",
"prompt": "San Francisco is a",
"max_tokens": 20,
"temperature": 0
}'

You should then get a result like the following.

{
"id":"cmpl-2788fe257a514fd8893302c389ec8f62",
"object":"text_completion",
"created":1735299207,
"model":"llama-3.2-1b-instruct",
"choices":[{"index":0,"text":" city that is full of life,","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],
"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null}
}
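The "model" field of the request has to match the name given to --served-model-name, so it can help to build the request body from the same variable the launch script uses. This is a minimal sketch: completion_payload is a hypothetical helper name, and it inserts values verbatim, so the prompt must not contain double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Build the JSON body for a /v1/completions request.
# Values are inserted verbatim (no JSON escaping), so keep the prompt free of
# double quotes and backslashes. completion_payload is a hypothetical helper.
completion_payload() {
    local model="$1" prompt="$2" max_tokens="$3"
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %d, "temperature": 0}' \
        "$model" "$prompt" "$max_tokens"
}
```

The body can then be passed to curl with -d "$(completion_payload "$SERVED_MODEL_NAME" "San Francisco is a" 20)", using the same URL and Content-Type header as in the call above.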

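Conclusion

With this guide you can easily follow the process of running vLLM with Docker. Combine vLLM's strong performance with Docker's flexibility to deploy LLMs efficiently. The post has its shortcomings, so thank you for reading. If anything is unclear or you have questions, leave a comment and I will answer as best I can!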
