Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Model] DeepSeek-V3 Enhancements #11539

Open
2 of 10 tasks
simon-mo opened this issue Dec 27, 2024 · 28 comments
Open
2 of 10 tasks

[Model] DeepSeek-V3 Enhancements #11539

simon-mo opened this issue Dec 27, 2024 · 28 comments
Labels
new model Requests to new models performance Performance-related issues

Comments

@simon-mo
Copy link
Collaborator

simon-mo commented Dec 27, 2024

This issue tracks follow up enhancements after initial support for the Deepseek V3 model. Please feel free to chime in and contribute!

@simon-mo simon-mo added misc performance Performance-related issues new model Requests to new models and removed misc labels Dec 27, 2024
@simon-mo simon-mo changed the title [Model] Deepseek V3 Enhancements [Model] DeepSeek-V3 Enhancements Dec 27, 2024
@july8023
Copy link

If I want to deploy deepseek 600B use vllm and RTX4090, are there any restrictions? How many RTX 4090 do I need at least?

@fsaudm
Copy link

fsaudm commented Dec 31, 2024

Is inference with A100s supported? How about quantization??

@mphilippnv
Copy link

Deepseek v3 doesn't appear to support pipeline parallelism. I get this error when attempting to deploy to 2 8x H100 nodes:

NotImplementedError: Pipeline parallelism is only supported for the following  architectures: ['AquilaForCausalLM', 'AquilaModel', 'DeepseekV2ForCausalLM', 'GPT2LMHeadModel', 'InternLM2ForCausalLM', 'InternLMForCausalLM', 'InternVLChatModel', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'NemotronForCausalLM', 'Phi3ForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration'].

I'm using --tensor-parallel-size 8 --pipeline-parallel-size 2

@simon-mo
Copy link
Collaborator Author

@july8023 It should work on 4090, generally the models takes about 600GB memory, then you want about 100-300GB for KV cache so feel free to plan around that.
@fsaudm A100s are not supported because this models requires FP8 tensor cores.
@mphilippnv which version of vLLM are you using? You might need to update to v0.6.6 or higher.

@fsaudm
Copy link

fsaudm commented Dec 31, 2024

@simon-mo right, A100s don't support fp8. Would the arg --dtype bfloat16 suffice? If not, I found the bf16 version in Huggingface, any insights on whether that would work?

@simon-mo
Copy link
Collaborator Author

The model currently does not support --dtype bfloat16 because it is natively trained in fp8. Can you point me to the bf16 version?

@fsaudm
Copy link

fsaudm commented Dec 31, 2024

@simon-mo on HF: https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main

, on the official repo they provide a script to cast fp8 to bf16, but of course you can't do it on A100s... my guess is a good soul did it and uploaded it to HF. In the repo, see 6.

https://github.com/deepseek-ai/DeepSeek-V3

@simon-mo
Copy link
Collaborator Author

vLLM does support this bf16 model on A100. It looks like the config.json properly removed quantization_config so it would already.

@mphilippnv
Copy link

mphilippnv commented Dec 31, 2024

@july8023 It should work on 4090, generally the models takes about 600GB memory, then you want about 100-300GB for KV cache so feel free to plan around that. @fsaudm A100s are not supported because this models requires FP8 tensor cores. @mphilippnv which version of vLLM are you using? You might need to update to v0.6.6 or higher.

Using v0.6.6

EDIT: Apologies, I was using 0.6.2. Redeploying helm chart with 0.6.6.post1. Will see how it goes.

@fsaudm
Copy link

fsaudm commented Dec 31, 2024

Any knowledge of a working example of serving deepseekv3 on A100s with vLLM? I'll try later, but any hints or help is very much appreciated

@JamesBVMNetwork
Copy link

Hi everyone,
I’m encountering the following error when trying to run the image vllm/vllm-openai:v0.6.6.post1 on a node equipped with 8x H100 SMX GPUs:

ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250102-072212.pkl): functional_call got multiple values for keys ['mlp.experts.e_score_correction_bias', 'mlp.gate.e_score_correction_bias'], which are tied. Consider using tie_weights=False
2025-01-02T15:22:12.753719474Z 

Here’s the command I used:

--model deepseek-ai/DeepSeek-V3-Base \
--tensor-parallel-size 8 \
--disable_log_requests \
--uvicorn_log_level error \
--max-model-len 16384 \
--cpu-offload-gb 400 \
--max_num_seqs 1 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--enforce-eager

Does anyone have suggestions or solutions for resolving this issue?

Thanks in advance!

@glowwormX
Copy link

Hi everyone, I’m encountering the following error when trying to run the image vllm/vllm-openai:v0.6.6.post1 on a node equipped with 8x H100 SMX GPUs:

ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250102-072212.pkl): functional_call got multiple values for keys ['mlp.experts.e_score_correction_bias', 'mlp.gate.e_score_correction_bias'], which are tied. Consider using tie_weights=False
2025-01-02T15:22:12.753719474Z 

Here’s the command I used:

--model deepseek-ai/DeepSeek-V3-Base \
--tensor-parallel-size 8 \
--disable_log_requests \
--uvicorn_log_level error \
--max-model-len 16384 \
--cpu-offload-gb 400 \
--max_num_seqs 1 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--enforce-eager

Does anyone have suggestions or solutions for resolving this issue?

Thanks in advance!

I've had this problem, too. Is there a solution?

@ishaandatta
Copy link

I've had this problem, too. Is there a solution?

Was getting this error- got resolved by removing cpu offloading... hoping for an explanation.

Also, any suggestions to increase token throughput & context length.
We're stuck at 6 tokens/second, max 10k context length despite 1600GB VRAM.
I am currently running with tensor+pipeline parallelism on 5 Nodes (4x A100 80GB each). The vm are without Infiniband.

Would having Infiniband (i.e. higher inter-node bandwidth & lower latency) be the main solution to increase token throughput? And for context length > 40k, how much more VRAM would be required..?

@shaowei-su
Copy link

I've had this problem, too. Is there a solution?

Was getting this error- got resolved by removing cpu offloading... hoping for an explanation.

Also, any suggestions to increase token throughput & context length. We're stuck at 6 tokens/second, max 10k context length despite 1600GB VRAM. I am currently running with tensor+pipeline parallelism on 5 Nodes (4x A100 80GB each). The vm are without Infiniband.

Would having Infiniband (i.e. higher inter-node bandwidth & lower latency) be the main solution to increase token throughput? And for context length > 40k, how much more VRAM would be required..?

Hi @ishaandatta could you share which model version are you using? I'm getting errors complaining fp8e4nv data type is not supported on CUDA arch < 89 when loading the model on A100 GPUs. Or maybe you are on the bf16 version? https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main. Thanks

@merlintang
Copy link

we also run int over very slow token processing speed like 3 token/s, even if we use h100 and IB. any suggestions?

@lhl
Copy link

lhl commented Jan 9, 2025

we also run int over very slow token processing speed like 3 token/s, even if we use h100 and IB. any suggestions?

I found tp16 to be about 2X faster than pp=2 tp=8 w/ 2 x H100 nodes. Here's my testing: https://llm-tracker.info/DeepSeek-V3-Testing

Here's vLLM vs SGLang at concurrency=64 atm:

output(3)

Note, I found that vLLM has some stop token errors for output (that SGLang doesn't have) w/ some of my testing.

@fan-niu
Copy link

fan-niu commented Jan 9, 2025

Same issue. I used 16 H100 GPUs, set TP=16, deployed using ray in k8s, and opened the IB network. I made a simple curl request, input 10 tokens, and output 242 tokens. This curl test It took 44 seconds. Can anyone help me figure out why?

@merlintang
Copy link

does the perf issues related to the MOE opt ? it is not included in the current version.?

@ishaandatta
Copy link

ishaandatta commented Jan 9, 2025

@shaowei-su I'm using the bf16 version you linked.

@lhl thank you for sharing this! I'm currently using tp=4 pp=6 as we're aiming for context lengths > 64k.
Just to clarify, your benchmarks indicate ~5 output tokens/s on vLLM & around 10 for SGLang ?
If so- I am wondering as to how deepseek-chat is able to achieve their throughput, I measured it at over 60 output tokens/sec

@lhl
Copy link

lhl commented Jan 10, 2025

Just to clarify, your benchmarks indicate ~5 output tokens/s on vLLM & around 10 for SGLang ?

for bs=1 SGLang outputs around 26 tok/s:

(sglang) ubuntu@ip-10-1-1-135:~$ python3 -m sglang.bench_serving --backend sglang --num-prompts 50 --max-concurrency 1 --port 8000
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=8000, dataset_name='sharegpt', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=50, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.0, request_rate=inf, max_concurrency=1, seed=1, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)

#Input tokens: 10354
#Output tokens: 11509
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [07:20<00:00,  8.82s/it]

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max reqeuest concurrency:                1
Successful requests:                     50
Benchmark duration (s):                  440.98
Total input tokens:                      10354
Total generated tokens:                  11509
Total generated tokens (retokenized):    11467
Request throughput (req/s):              0.11
Input token throughput (tok/s):          23.48
Output token throughput (tok/s):         26.10
Total token throughput (tok/s):          49.58
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   8819.11
Median E2E Latency (ms):                 4817.32
---------------Time to First Token----------------
Mean TTFT (ms):                          318.37
Median TTFT (ms):                        259.02
P99 TTFT (ms):                           1658.59
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.41
Median TPOT (ms):                        36.97
P99 TPOT (ms):                           37.60
---------------Inter-token Latency----------------
Mean ITL (ms):                           37.18
Median ITL (ms):                         37.06
P99 ITL (ms):                            38.91
==================================================

You should read the DeepSeek Technical Report in the infrastructure, they deploy in 320 GPU blocks w/ specialized/separated functions.

That being said, there's certainly optimizations that can be made for "regular" inference. On vLLM, when doing throughput optimization, with some tuning I can generate >7000 tok/s on a single H100 node for a Llama 3 70B class model at c=512. DSv3 has about half the activations, and at c=512 sglang currently tops out at about 1100 tok/s on 2xH100 nodes (vLLM is about half of that). You could imagine that there might be a 5-10X in throughput optimization available based naively on activations/fwd pass. This is before spec decode like EAGLE or Medusa is factored in.

@fan-niu
Copy link

fan-niu commented Jan 10, 2025

@simon-mo Is there any way or plan to improve the speed of vllm on deepseek v3? Thanks a lot

@panpan0000
Copy link
Contributor

we also see 3 token/s on 16x H20 with TP=8,PP=2

@drikster80
Copy link
Contributor

When I tested TP=16 on GH200 nodes (FP8 version), I was getting ~7.1 t/s (single batch). Ironically, when I used TP=8 (max_model_len=2048 so it all fit), I was getting slightly faster, which seemed strange.

One of the issues that might be slowing VLLM down is that one of the MoE specific CUDA kernels is hard-coded for DSv3 to force the use of Global memory, which is significantly slower than shared memory. This is due to the limited amount of shared memory available (dependent on the GPU model... for example, the H100 has 227KB of shared memory per block).
https://github.com/vllm-project/vllm/blob/main/csrc/moe/moe_align_sum_kernels.cu#L232

I don't know how much effect this has for this specific kernel, but it likely has some consequence. Techniques like distributed shared memory (H100+ specific) might be able to be used, or only keeping the active experts in there... but unfortunately I don't know much about CUDA programming. Spent 2 days messing trying to implement the "active-expert only" approach, but only served to slow down to 4.5 t/s...

@xpmemeda
Copy link

您好。请问现在使用vllm部署,要支持tool call功能,应该使用哪个parser?

@WangxuP
Copy link

WangxuP commented Jan 16, 2025

vLLM does support this bf16 model on A100. It looks like the config.json properly removed quantization_config so it would already.

I use vllm==0.6.6.post1 can support this feature?

@teknium1
Copy link

Can anyone explain why we can only get like 7tok/s across 2hgxs in any configuration over verified 3.2tbps IB

@tot0
Copy link

tot0 commented Jan 28, 2025

Can anyone explain why we can only get like 7tok/s across 2hgxs in any configuration over verified 3.2tbps IB

I get 10.5 tok/s (1 sequence) on 8*MI300x using sglang, just for reference.

@pseudotensor
Copy link

Same, about 13 tokens/sec (1 sequence, long output) on 8*MI300x using sglang. It has an unexpected TTFT lag for me of about 2 seconds though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new model Requests to new models performance Performance-related issues
Projects
None yet
Development

No branches or pull requests