Insights: sgl-project/sglang
Overview
94 Pull requests merged by 24 people
-
[Fix] Address remaining issues of supporting MiniCPMV
#2977 merged
Jan 28, 2025 -
Fix typo in README
#3190 merged
Jan 28, 2025 -
[kernel] Use sgl_kernel rope
#3169 merged
Jan 28, 2025 -
clean up useless file
#3192 merged
Jan 28, 2025 -
[test] deduplicate test_session_control
#3183 merged
Jan 28, 2025 -
Docs fix about EAGLE and streaming output
#3166 merged
Jan 28, 2025 -
Sanity check to prevent performance regression
#3171 merged
Jan 27, 2025 -
fix: update Dockerfile for cu118
#3181 merged
Jan 27, 2025 -
chore: bump v0.4.2
#3180 merged
Jan 27, 2025 -
feat: use sgl-kernel 0.0.3 in sglang
#3179 merged
Jan 27, 2025 -
chore: bump 0.0.3 for sgl-kernel
#3178 merged
Jan 27, 2025 -
cleanup sgl-kernel kernels
#3175 merged
Jan 27, 2025 -
Update thresholds in test_nightly_gsm8k_eval.py
#3176 merged
Jan 27, 2025 -
Improve weight loading and code style
#3174 merged
Jan 27, 2025 -
add dsv3 mi300 triton config for block scale
#3146 merged
Jan 27, 2025 -
[kernel] Fix position ids in rope
#3173 merged
Jan 27, 2025 -
Add activation parameters to fused_moe
#3170 merged
Jan 27, 2025 -
Bump sgl kernel to 0.0.2.post19
#3167 merged
Jan 27, 2025 -
add unit test for block wise fp8
#3156 merged
Jan 27, 2025 -
[kernel] Integrate flashinfer's rope with higher precision and better perf
#3134 merged
Jan 27, 2025 -
Add more logprob tests
#3162 merged
Jan 27, 2025 -
Doc: Add Docs about EAGLE speculative decoding
#3144 merged
Jan 27, 2025 -
Add function calling in index.rst
#3155 merged
Jan 26, 2025 -
Feature/function calling update
#2700 merged
Jan 26, 2025 -
use self-hosted to build sgl-kernel
#3154 merged
Jan 26, 2025 -
fix link in README
#3153 merged
Jan 26, 2025 -
Return more infos for computing average acceptance length
#3152 merged
Jan 26, 2025 -
update sgl-kernel version for srt
#3150 merged
Jan 26, 2025 -
Temporarily skip the openai frontend tests
#3151 merged
Jan 26, 2025 -
chore: bump 0.0.2.post18 for sgl-kernel
#3149 merged
Jan 26, 2025 -
Do not load OPENAI_KEY from secrets
#3147 merged
Jan 26, 2025 -
Simplify the computation of cached_tokens
#3145 merged
Jan 26, 2025 -
Add CPU affinity setting to latency benchmark
#3085 merged
Jan 26, 2025 -
support w8a8 fp8 kernel with CUTLASS
#3047 merged
Jan 26, 2025 -
minor: cleanup sgl-kernel
#3143 merged
Jan 26, 2025 -
Fix repetition penalty
#3139 merged
Jan 26, 2025 -
[Fix] Not skip NVML Check on AMD Platform
#3135 merged
Jan 26, 2025 -
feat: cross python wheel for sgl-kernel
#3138 merged
Jan 26, 2025 -
enable kv_scale for Gemma2
#3113 merged
Jan 26, 2025 -
Use torch.compile for scaling penalty
#3133 merged
Jan 26, 2025 -
Fix CI tests
#3132 merged
Jan 26, 2025 -
feat: refactor sgl-kernel and use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops
#3130 merged
Jan 25, 2025 -
update installation doc for sgl-kernel
#3129 merged
Jan 25, 2025 -
Update whl index path
#3128 merged
Jan 25, 2025 -
Update tag name for whl release
#3127 merged
Jan 25, 2025 -
speedup pr test for sgl-kernel
#3126 merged
Jan 25, 2025 -
chore: bump v0.0.2.post17 for sgl-kernel
#3125 merged
Jan 25, 2025 -
minor fix for custom allreduce
#3124 merged
Jan 25, 2025 -
support fp32 in sampling_scaling_penalties kernel
#3121 merged
Jan 25, 2025 -
Add step to update sgl-kernel whl index
#3110 merged
Jan 24, 2025 -
Add workflow for sgl-kernel cu118 release
#3109 merged
Jan 24, 2025 -
minor: update sgl-kernel setup
#3107 merged
Jan 24, 2025 -
[Docs] minor update for phi-3 and phi-4
#3096 merged
Jan 24, 2025 -
Allow local cutlass directory to be used in sgl-kernel build
#3037 merged
Jan 24, 2025 -
minor: sync flashinfer and add turbomind as 3rdparty
#3105 merged
Jan 24, 2025 -
Fix cu118 group gemm compile issue
#3097 merged
Jan 24, 2025 -
[router] Fix twine uploading
#3095 merged
Jan 24, 2025 -
bump router to 0.1.4
#3094 merged
Jan 24, 2025 -
[router] Forward all request headers from router to workers
#3070 merged
Jan 24, 2025 -
Add shapes for int8 gemm benchmark
#3093 merged
Jan 24, 2025 -
Update doc for server arguments
#2742 merged
Jan 23, 2025 -
chore: bump sgl-kernel 0.0.2.post16
#3087 merged
Jan 23, 2025 -
feat: integrate sampling kernels into sgl-kernel
#3086 merged
Jan 23, 2025 -
[hotfix] fix test_sampling_scaling_penalties.py ci test
#3084 merged
Jan 23, 2025 -
Use flashinfer vec_dtypes in sgl_kernel
#3083 merged
Jan 23, 2025 -
sync flashinfer and update sgl-kernel tests
#3081 merged
Jan 23, 2025 -
use env variable to control the build conf on the CPU build node
#3080 merged
Jan 23, 2025 -
update version setup for sgl-kernel
#3079 merged
Jan 23, 2025 -
fix build error for sgl-kernel
#3078 merged
Jan 23, 2025 -
Remove torch dependency in sgl-kernel
#3074 merged
Jan 23, 2025 -
support lightning_attention_decode in sgl-kernel for MiniMax-Text-01
#3030 merged
Jan 23, 2025 -
use v0.6.4.post1 for sgl-kernel ci
#3071 merged
Jan 23, 2025 -
docs: update developer guide for sgl-kernel
#3069 merged
Jan 23, 2025 -
docs: add developer guide for sgl-kernel
#3068 merged
Jan 23, 2025 -
Revert "disable custom allreduce on HIP"
#3067 merged
Jan 23, 2025 -
Support loading of larger models with on-the-fly quantization
#3061 merged
Jan 23, 2025 -
Fix tp token sync for dp attention
#3062 merged
Jan 23, 2025 -
[router] make error actionable
#3063 merged
Jan 23, 2025 -
[devcontainer] add non-root user
#2989 merged
Jan 23, 2025 -
Add some flags to allow sync token ids across TP ranks
#3060 merged
Jan 22, 2025 -
Fix the FP8 E4M3 parsing offline scales failure bug
#3045 merged
Jan 22, 2025 -
[Doc] Update doc of profiling with PyTorch Profiler
#3038 merged
Jan 22, 2025 -
disable custom allreduce on HIP
#3058 merged
Jan 22, 2025 -
add notice about flashinfer in sgl-kernel
#3057 merged
Jan 22, 2025 -
fix rotary_embedding rope_scaling for phi
#3055 merged
Jan 22, 2025 -
feat: integrate bmm_fp8 kernel into sgl-kernel
#3056 merged
Jan 22, 2025 -
minor: update header and use pytest
#3054 merged
Jan 22, 2025 -
feat: integrate activation kernels into sgl-kernel
#3053 merged
Jan 22, 2025 -
feat: integrate norm kernels into sgl-kernel
#3052 merged
Jan 22, 2025 -
sync the upstream updates of flashinfer
#3051 merged
Jan 22, 2025 -
update norm cu
#3048 merged
Jan 22, 2025 -
Fix sgl-kernel compile for sm80
#3046 merged
Jan 22, 2025 -
Use int64 as indices for set_kv_buffer
#3039 merged
Jan 22, 2025 -
fix pr-test-sgl-kernel
#3036 merged
Jan 21, 2025
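Several of the merged PRs above concern sampling penalties ("Fix repetition penalty" #3139, "Use torch.compile for scaling penalty" #3133, "support fp32 in sampling_scaling_penalties kernel" #3121). As an illustrative sketch only (not SGLang's actual kernel, which runs on GPU tensors), the repetition penalty commonly applied at sampling time looks like this in plain Python:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Illustrative sketch: down-weight tokens that were already generated.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so a previously seen token becomes less likely
    regardless of the sign of its logit.
    """
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out
```

The sign-dependent scaling is the standard formulation; a naive division of all logits would make already-unlikely repeated tokens *more* likely.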
17 Pull requests opened by 13 people
-
Modify the kernel test path & add it to the CI process.
#3044 opened
Jan 22, 2025 -
[Feature] Beam Search
#3066 opened
Jan 23, 2025 -
Accuracy measurement
#3114 opened
Jan 24, 2025 -
Extract generation_manager from tokenizer_manager
#3115 opened
Jan 25, 2025 -
Rename TokenizerManager to StdOrchestrator
#3116 opened
Jan 25, 2025 -
Let DetokenizerManager use TypeBasedDispatcher
#3117 opened
Jan 25, 2025 -
Split communication logic from computation logic into orchestrator
#3118 opened
Jan 25, 2025 -
Add EngineFragment
#3120 opened
Jan 25, 2025 -
fix: Fix deprecated max_tokens param in openai ChatCompletionRequest
#3122 opened
Jan 25, 2025 -
[MOE] Try to optimize moe align block size multiblocks cuda kernel
#3137 opened
Jan 26, 2025 -
Apply sgl w8a8 fp8 kernel
#3148 opened
Jan 26, 2025 -
[Feature] Define backends and add Triton backend for Lora
#3161 opened
Jan 27, 2025 -
Initial Enablement of CI on MI300
#3168 opened
Jan 27, 2025 -
[Feature] Rewrite Sampling Parameter #3165
#3185 opened
Jan 27, 2025 -
Add logit bias into the SGLang interface.
#3187 opened
Jan 27, 2025 -
Add deepseek_v3 fused gate
#3191 opened
Jan 28, 2025 -
Fixing a typo in engine.py
#3193 opened
Jan 28, 2025
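One opened PR above proposes adding logit bias to the SGLang interface (#3187). A minimal sketch of the OpenAI-style `logit_bias` semantics this would presumably follow (illustrative; the actual PR may differ) is a per-token additive adjustment applied before sampling:

```python
def apply_logit_bias(logits, logit_bias):
    """Illustrative sketch: add per-token biases (OpenAI-style logit_bias).

    `logit_bias` maps token id -> additive bias; large negative values
    effectively ban a token, large positive values strongly favor it.
    """
    out = list(logits)
    for token_id, bias in logit_bias.items():
        out[token_id] += bias
    return out
```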
28 Issues closed by 12 people
-
[Bug] Qwen2-VL-7B with sglang has significant numerical calculation errors compared to HF Transformers
#3106 closed
Jan 28, 2025 -
Inference Speeds across 2x HGXs with infiniband 3.2tbps
#3172 closed
Jan 28, 2025 -
Offline batch inference for multi-modality with prefix caching feature
#3177 closed
Jan 28, 2025 -
[Bug] Slow throughput/s on H200 (llama 3.1 8b)
#3186 closed
Jan 28, 2025 -
[Feature] add unit test for block wise fp8
#2768 closed
Jan 27, 2025 -
[Bug] Frontend choices and `input_token_logprobs` mis-match
#2873 closed
Jan 27, 2025 -
[Feature] request smoothquant (int8, W8A8) quantization on 40G A100
#2474 closed
Jan 26, 2025 -
QVQ Prefill stage slow
#2961 closed
Jan 26, 2025 -
[Bug] Qwen2-VL-7B with sglang Performance Degradation
#3041 closed
Jan 26, 2025 -
[Bug] constrained decoding performance is worse when qps>2
#3104 closed
Jan 26, 2025 -
[Bug] Qwen-2.5-Math-7B-Instruct and Llama-3.1-8B-Instruct Produce Nonsensical Results
#2084 closed
Jan 26, 2025 -
[Bug] frequency penalty
#2177 closed
Jan 25, 2025 -
Question About Model Integration and Parameter Updates (update_weight) in Sglang
#3101 closed
Jan 24, 2025 -
[Bug] The batch decoding speed of DeepSeek V3 is too slow.
#3100 closed
Jan 24, 2025 -
[Bug] PyTorch profiler trace is not generated
#2874 closed
Jan 24, 2025 -
[Bug] libcudart.so.12: cannot open shared object file: No such file or directory
#2584 closed
Jan 24, 2025 -
Can router support --api-key parameter
#3031 closed
Jan 24, 2025 -
[BUG] Problems with jump forward decoding
#2045 closed
Jan 24, 2025 -
[Benchmarks] Can't run examples benchmark. Flashinfer error
#3089 closed
Jan 23, 2025 -
Some questions about layernorm in MLA code
#3072 closed
Jan 23, 2025 -
[Bug] DeepSeek-V3 load weights failed with --enable-ep-moe
#3075 closed
Jan 23, 2025 -
[Feature] Support LLaMA-3.2 finetuned with Sentence Transformers !
#2131 closed
Jan 23, 2025 -
[Bug] Eagle2 has an unstable sampling rate under multi-concurrency.
#2537 closed
Jan 22, 2025 -
[Bug] embedding model failed with `--enable-metrics`
#2800 closed
Jan 22, 2025 -
[Feature] When will function calls with deepseek support be available?
#2855 closed
Jan 21, 2025 -
Can multiple services be deployed simultaneously?
#2916 closed
Jan 21, 2025 -
[Feature] Add progress bar in `Engine.generate` method
#2994 closed
Jan 21, 2025
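Several items in this digest touch EAGLE speculative decoding (the Eagle2 sampling-rate issue above, plus the merged docs PR #3144 and "Return more infos for computing average acceptance length" #3152). Under the common definition (an assumption here, not necessarily SGLang's exact metric), the acceptance length of one draft/verify step is the number of accepted draft tokens plus the one token the verifier always emits:

```python
def average_acceptance_length(accepted_per_step):
    """Illustrative sketch: mean tokens emitted per speculative step.

    Each step yields the accepted draft tokens plus one bonus token
    from the target model's verification pass, so a value of 1.0 means
    no draft token was ever accepted.
    """
    if not accepted_per_step:
        return 0.0
    return sum(n + 1 for n in accepted_per_step) / len(accepted_per_step)
```

For example, steps accepting 2, 0, and 3 draft tokens give an average of 8/3 ≈ 2.67 tokens per verification, which is the speedup factor speculative decoding aims to maximize.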
28 Issues opened by 20 people
-
[Bug] ERROR: No matching distribution found for vllm==0.6.3.post2.dev1; extra == "srt-hip"
#3189 opened
Jan 28, 2025 -
Any benchmarks comparing with TGI?
#3188 opened
Jan 27, 2025 -
[Feature] Step-by-Step Guide to Use SGLang on NVIDIA Jetson Orin platform
#3182 opened
Jan 27, 2025 -
[Feature] Rewrite Sampling Parameter
#3165 opened
Jan 27, 2025 -
[Feature] fix docs in Streaming-Synchronous-Generation
#3164 opened
Jan 27, 2025 -
[Feature] Reduce docs CI time
#3163 opened
Jan 27, 2025 -
[Feature] Remove Redundant CI of Docs
#3160 opened
Jan 27, 2025 -
[Feature] Support new Qwen Models
#3159 opened
Jan 27, 2025 -
[Feature] Split Docs CI
#3158 opened
Jan 27, 2025 -
[Feature] Accuracy test of VLM
#3142 opened
Jan 26, 2025 -
[Feature] Vision LM accuracy test
#3141 opened
Jan 26, 2025 -
[Feature] GGUF Q4KM(4bit) format for deepseek R1 support
#3140 opened
Jan 26, 2025 -
[Feature] Star attention support
#3131 opened
Jan 25, 2025 -
[Bug] Service crashed with 4 H100s and QPS=25
#3112 opened
Jan 24, 2025 -
[Bug] Crash special token xgrammar
#3108 opened
Jan 24, 2025 -
Batch inference over multiple nodes
#3103 opened
Jan 24, 2025 -
[Bug] Multi-node BUG
#3099 opened
Jan 24, 2025 -
[Bug] Qwen2-VL Online Serving Issue
#3098 opened
Jan 24, 2025 -
[Feature] Support InternVL
#3092 opened
Jan 24, 2025 -
[Feature] Add support for Phi4
#3090 opened
Jan 23, 2025 -
[Feature] docs: Improve documentation on how to use EAGLE speculative decoding
#3077 opened
Jan 23, 2025 -
[Feature] Support service discovery on Kubernetes in router
#3073 opened
Jan 23, 2025 -
[Bug]ImportError: undefined symbol: cuModuleGetFunction when using lmsysorg/sglang:v0.4.1.post7-cu124
#3065 opened
Jan 23, 2025 -
[Bug] Problems with logit_bias.
#3059 opened
Jan 22, 2025 -
[Bug] Decode Throughput Inconsistency Between bench_serving and Engine Logs
#3050 opened
Jan 22, 2025 -
[Help wanted] Can't capture GPU activities using `nsight system`
#3049 opened
Jan 22, 2025 -
[Feature] Reasoning model API support
#3043 opened
Jan 22, 2025 -
[Feature] batch concurrent requests while streaming responses
#3040 opened
Jan 22, 2025
50 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
Support int8 kvcache
#3034 commented on
Jan 26, 2025 • 20 new comments -
Speculative decoding with lookahead
#2790 commented on
Jan 28, 2025 • 10 new comments -
Integrate turbomind into sgl-kernel
#2999 commented on
Jan 28, 2025 • 2 new comments -
[Feature] Support dynamic loading and unloading of Lora adapters
#2891 commented on
Jan 23, 2025 • 2 new comments -
Debug radixcache: refactor recursive helper methods
#3029 commented on
Jan 27, 2025 • 1 new comment -
[Bug] Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
#2943 commented on
Jan 25, 2025 • 0 new comments -
[Bug] NCCL Crash with SIGSEGV Frequently when deploying deepseek v3
#2803 commented on
Jan 26, 2025 • 0 new comments -
[Feature] DeepSeek V3 optimization
#2591 commented on
Jan 27, 2025 • 0 new comments -
[Feature] add support for deepseek v3 gptq / awq
#2706 commented on
Jan 27, 2025 • 0 new comments -
[Feature] Lora optimization
#2929 commented on
Jan 27, 2025 • 0 new comments -
[Bug] Regex isn't precluding parentheticals. And maybe more.
#2957 commented on
Jan 28, 2025 • 0 new comments -
[Bug] Issue with batch mode
#2762 commented on
Jan 28, 2025 • 0 new comments -
[Feature] remove vllm _custom_ops
#2965 commented on
Jan 28, 2025 • 0 new comments -
[Feature] (Willing to PR) Proposal: Drop-in fast replacement of `PreTrainedModel.generate`
#2569 commented on
Jan 28, 2025 • 0 new comments -
[Feature] support EAGLE 2 with Triton Backend
#2940 commented on
Jan 28, 2025 • 0 new comments -
prometheus query return no result
#2677 commented on
Jan 28, 2025 • 0 new comments -
[Bug] Launching Llama-3.2-11B-Vision-Instruct just hangs on generation
#2619 commented on
Jan 28, 2025 • 0 new comments -
[Experimental] Add a gRPC server for completion request
#2478 commented on
Jan 22, 2025 • 0 new comments -
Hierarchical Caching for SGLang
#2693 commented on
Jan 28, 2025 • 0 new comments -
Add endpoint for file support, purely to speed up processing of input_embeds.
#2797 commented on
Jan 28, 2025 • 0 new comments -
[WIP] [Feature] Support Deepseek-VL2
#2798 commented on
Jan 25, 2025 • 0 new comments -
[WIP] Integration of TurboMind AWQ
#2900 commented on
Jan 28, 2025 • 0 new comments -
[Core] Optimize the delay scheduling of in batch prefix caching
#2962 commented on
Jan 22, 2025 • 0 new comments -
support telechat2 model
#3000 commented on
Jan 23, 2025 • 0 new comments -
Minicpmo
#3023 commented on
Jan 25, 2025 • 0 new comments -
[Feature] FP8 weight only w8a16 quantization native support
#3007 commented on
Jan 21, 2025 • 0 new comments -
what is the most efficient way to serve a 72b model on 8 * A100?
#3002 commented on
Jan 21, 2025 • 0 new comments -
[Bug] JSONResponse fails if the probability distribution is very spiky.
#2955 commented on
Jan 21, 2025 • 0 new comments -
[Feature] Enhancement on Sparse Attention and KV-Cache Compression
#2946 commented on
Jan 21, 2025 • 0 new comments -
[Feature] Support for rerank models
#2109 commented on
Jan 21, 2025 • 0 new comments -
[Bug] tensor_model_parallel_all_reduce' is not defined
#2931 commented on
Jan 21, 2025 • 0 new comments -
Warning while running Deepseek-V3
#2921 commented on
Jan 21, 2025 • 0 new comments -
[Bug] ipv6 dist_init_addr doesn't connect when running multi-node inference
#2892 commented on
Jan 21, 2025 • 0 new comments -
[Bug] Why can't I use multi-lora adapter and radix attention together?
#2880 commented on
Jan 21, 2025 • 0 new comments -
[Bug] Bug of top_logprobs for the first chunk
#2825 commented on
Jan 21, 2025 • 0 new comments -
[Bug] Using MLA with Lk >= 576 report out of resource: shared memory ERROR
#2847 commented on
Jan 21, 2025 • 0 new comments -
Do not use tools param in stream request!
#2810 commented on
Jan 21, 2025 • 0 new comments -
[Bug] Huggingface model weight download failures do not cause the process to exit
#2801 commented on
Jan 21, 2025 • 0 new comments -
[Bug] Forking state before submitting any string causes backend crashing in sgl.function: "UnboundLocalError: local variable 'model_worker_batch' referenced before assignment"
#2755 commented on
Jan 21, 2025 • 0 new comments -
[Bug] def get_nvgpu_memory_capacity() causes crash on NVIDIA H100 MIG
#2933 commented on
Jan 21, 2025 • 0 new comments -
[Bug] compressed-tensors format not supported
#2871 commented on
Jan 22, 2025 • 0 new comments -
[Feature] Add docs for local accuracy tests
#2953 commented on
Jan 22, 2025 • 0 new comments -
[Bug] [OpenAI compatible API] Chunks of tokens aren't being split into separate indexes when specifying n > 1 generations
#2912 commented on
Jan 22, 2025 • 0 new comments -
[Feature] Dynamic Lora Support in SGLang (like VLLM)
#2686 commented on
Jan 22, 2025 • 0 new comments -
[Bug] finish_reason is not right when Qwen call a tool
#2877 commented on
Jan 22, 2025 • 0 new comments -
[Bug] KeyError: 'lm_head.weight' when loading quantized llama 3.2 3B and 1B models
#2935 commented on
Jan 22, 2025 • 0 new comments -
[Bug] Cannot capture kernel trace using nsys
#2776 commented on
Jan 22, 2025 • 0 new comments -
[Bug] How to load weight with torchao
#2721 commented on
Jan 23, 2025 • 0 new comments -
[Feature] Support General Reward Model
#2427 commented on
Jan 24, 2025 • 0 new comments -
[Bug] Gemma 2 GGUF
#2451 commented on
Jan 24, 2025 • 0 new comments