Insights: sgl-project/sglang
Overview
94 Pull requests merged by 24 people
-
[Fix] Address remaining issues of supporting MiniCPMV
#2977 merged
Jan 28, 2025 -
Fix typo in README
#3190 merged
Jan 28, 2025 -
[kernel] Use sgl_kernel rope
#3169 merged
Jan 28, 2025 -
clean up useless file
#3192 merged
Jan 28, 2025 -
[test] deduplicate test_session_control
#3183 merged
Jan 28, 2025 -
Docs fix about EAGLE and streaming output
#3166 merged
Jan 28, 2025 -
Sanity check to prevent performance regression
#3171 merged
Jan 27, 2025 -
fix: update Dockerfile for cu118
#3181 merged
Jan 27, 2025 -
chore: bump v0.4.2
#3180 merged
Jan 27, 2025 -
feat: use sgl-kernel 0.0.3 in sglang
#3179 merged
Jan 27, 2025 -
chore: bump 0.0.3 for sgl-kernel
#3178 merged
Jan 27, 2025 -
cleanup sgl-kernel kernels
#3175 merged
Jan 27, 2025 -
Update thresholds in test_nightly_gsm8k_eval.py
#3176 merged
Jan 27, 2025 -
Improve weight loading and code style
#3174 merged
Jan 27, 2025 -
add dsv3 mi300 triton config for block scale
#3146 merged
Jan 27, 2025 -
[kernel] Fix position ids in rope
#3173 merged
Jan 27, 2025 -
Add activation parameters to fused_moe
#3170 merged
Jan 27, 2025 -
Bump sgl kernel to 0.0.2.post19
#3167 merged
Jan 27, 2025 -
add unit test for block wise fp8
#3156 merged
Jan 27, 2025 -
[kernel] Integrate flashinfer's rope with higher precision and better perf
#3134 merged
Jan 27, 2025 -
Add more logprob tests
#3162 merged
Jan 27, 2025 -
Doc: Add Docs about EAGLE speculative decoding
#3144 merged
Jan 27, 2025 -
Add function calling in index.rst
#3155 merged
Jan 26, 2025 -
Feature/function calling update
#2700 merged
Jan 26, 2025 -
use self-hosted to build sgl-kernel
#3154 merged
Jan 26, 2025 -
fix link in README
#3153 merged
Jan 26, 2025 -
Return more infos for computing average acceptance length
#3152 merged
Jan 26, 2025 -
update sgl-kernel version for srt
#3150 merged
Jan 26, 2025 -
Temporarily skip the openai frontend tests
#3151 merged
Jan 26, 2025 -
chore: bump 0.0.2.post18 for sgl-kernel
#3149 merged
Jan 26, 2025 -
Do not load OPENAI_KEY from secrets
#3147 merged
Jan 26, 2025 -
Simplify the computation of cached_tokens
#3145 merged
Jan 26, 2025 -
Add CPU affinity setting to latency benchmark
#3085 merged
Jan 26, 2025 -
support w8a8 fp8 kernel with CUTLASS
#3047 merged
Jan 26, 2025 -
minor: cleanup sgl-kernel
#3143 merged
Jan 26, 2025 -
Fix repetition penalty
#3139 merged
Jan 26, 2025 -
[Fix] Not skip NVML Check on AMD Platform
#3135 merged
Jan 26, 2025 -
feat: cross python wheel for sgl-kernel
#3138 merged
Jan 26, 2025 -
enable kv_scale for Gemma2
#3113 merged
Jan 26, 2025 -
Use torch.compile for scaling penalty
#3133 merged
Jan 26, 2025 -
Fix CI tests
#3132 merged
Jan 26, 2025 -
feat: refactor sgl-kernel and use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops
#3130 merged
Jan 25, 2025 -
update installation doc for sgl-kernel
#3129 merged
Jan 25, 2025 -
Update whl index path
#3128 merged
Jan 25, 2025 -
Update tag name for whl release
#3127 merged
Jan 25, 2025 -
speedup pr test for sgl-kernel
#3126 merged
Jan 25, 2025 -
chore: bump v0.0.2.post17 for sgl-kernel
#3125 merged
Jan 25, 2025 -
minor fix for custom allreduce
#3124 merged
Jan 25, 2025 -
support fp32 in sampling_scaling_penalties kernel
#3121 merged
Jan 25, 2025 -
Add step to update sgl-kernel whl index
#3110 merged
Jan 24, 2025 -
Add workflow for sgl-kernel cu118 release
#3109 merged
Jan 24, 2025 -
minor: update sgl-kernel setup
#3107 merged
Jan 24, 2025 -
[Docs] minor update for phi-3 and phi-4
#3096 merged
Jan 24, 2025 -
Allow local cutlass directory to be used in sgl-kernel build
#3037 merged
Jan 24, 2025 -
minor: sync flashinfer and add turbomind as 3rdparty
#3105 merged
Jan 24, 2025 -
Fix cu118 group gemm compile issue
#3097 merged
Jan 24, 2025 -
[router] Fix twine uploading
#3095 merged
Jan 24, 2025 -
bump router to 0.1.4
#3094 merged
Jan 24, 2025 -
[router] Forward all request headers from router to workers
#3070 merged
Jan 24, 2025 -
Add shapes for int8 gemm benchmark
#3093 merged
Jan 24, 2025 -
Update doc for server arguments
#2742 merged
Jan 23, 2025 -
chore: bump sgl-kernel 0.0.2.post16
#3087 merged
Jan 23, 2025 -
feat: integrate sampling kernels into sgl-kernel
#3086 merged
Jan 23, 2025 -
[hotfix] fix test_sampling_scaling_penalties.py ci test
#3084 merged
Jan 23, 2025 -
Use flashinfer vec_dtypes in sgl_kernel
#3083 merged
Jan 23, 2025 -
sync flashinfer and update sgl-kernel tests
#3081 merged
Jan 23, 2025 -
use env variable to control the build conf on the CPU build node
#3080 merged
Jan 23, 2025 -
update version setup for sgl-kernel
#3079 merged
Jan 23, 2025 -
fix build error for sgl-kernel
#3078 merged
Jan 23, 2025 -
Remove torch dependency in sgl-kernel
#3074 merged
Jan 23, 2025 -
support lightning_attention_decode in sgl-kernel for MiniMax-Text-01
#3030 merged
Jan 23, 2025 -
use v0.6.4.post1 for sgl-kernel ci
#3071 merged
Jan 23, 2025 -
docs: update developer guide for sgl-kernel
#3069 merged
Jan 23, 2025 -
docs: add developer guide for sgl-kernel
#3068 merged
Jan 23, 2025 -
Revert "disable custom allreduce on HIP"
#3067 merged
Jan 23, 2025 -
Support loading of larger models with on-the-fly quantization
#3061 merged
Jan 23, 2025 -
Fix tp token sync for dp attention
#3062 merged
Jan 23, 2025 -
[router] make error actionable
#3063 merged
Jan 23, 2025 -
[devcontainer] add non-root user
#2989 merged
Jan 23, 2025 -
Add some flags to allow sync token ids across TP ranks
#3060 merged
Jan 22, 2025 -
Fix the FP8 E4M3 parsing offline scales failure bug
#3045 merged
Jan 22, 2025 -
[Doc] Update doc of profiling with PyTorch Profiler
#3038 merged
Jan 22, 2025 -
disable custom allreduce on HIP
#3058 merged
Jan 22, 2025 -
add notice about flashinfer in sgl-kernel
#3057 merged
Jan 22, 2025 -
fix rotary_embedding rope_scaling for phi
#3055 merged
Jan 22, 2025 -
feat: integrate bmm_fp8 kernel into sgl-kernel
#3056 merged
Jan 22, 2025 -
minor: update header and use pytest
#3054 merged
Jan 22, 2025 -
feat: integrate activation kernels into sgl-kernel
#3053 merged
Jan 22, 2025 -
feat: integrate norm kernels into sgl-kernel
#3052 merged
Jan 22, 2025 -
sync the upstream updates of flashinfer
#3051 merged
Jan 22, 2025 -
update norm cu
#3048 merged
Jan 22, 2025 -
Fix sgl-kernel compile for sm80
#3046 merged
Jan 22, 2025 -
Use int64 as indices for set_kv_buffer
#3039 merged
Jan 22, 2025 -
fix pr-test-sgl-kernel
#3036 merged
Jan 21, 2025
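Several of the merged PRs above concern sampling penalties ("Fix repetition penalty" #3139, "Use torch.compile for scaling penalty" #3133, "support fp32 in sampling_scaling_penalties kernel" #3121). As an illustrative sketch only (not SGLang's actual kernel, which runs on GPU tensors), the repetition penalty commonly applied at sampling time looks like this in plain Python:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Illustrative sketch: down-weight tokens that were already generated.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so a previously seen token becomes less likely
    regardless of the sign of its logit.
    """
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out
```

The sign-dependent scaling is the standard formulation; a naive division of all logits would make already-unlikely repeated tokens *more* likely.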
17 Pull requests opened by 13 people
-
Modify the kernel test path & add it to the CI process.
#3044 opened
Jan 22, 2025 -
[Feature] Beam Search
#3066 opened
Jan 23, 2025 -
Accuracy measurement
#3114 opened
Jan 24, 2025 -
Extract generation_manager from tokenizer_manager
#3115 opened
Jan 25, 2025 -
Rename TokenizerManager to StdOrchestrator
#3116 opened
Jan 25, 2025 -
Let DetokenizerManager use TypeBasedDispatcher
#3117 opened
Jan 25, 2025 -
Split communication logic from computation logic into orchestrator
#3118 opened
Jan 25, 2025 -
Add EngineFragment
#3120 opened
Jan 25, 2025 -
fix: Fix deprecated max_tokens param in openai ChatCompletionRequest
#3122 opened
Jan 25, 2025 -
[MOE] Try to optimize moe align block size multiblocks cuda kernel
#3137 opened
Jan 26, 2025 -
Apply sgl w8a8 fp8 kernel
#3148 opened
Jan 26, 2025 -
[Feature] Define backends and add Triton backend for Lora
#3161 opened
Jan 27, 2025 -
Initial Enablement of CI on MI300
#3168 opened
Jan 27, 2025 -
[Feature] Rewrite Sampling Parameter #3165
#3185 opened
Jan 27, 2025 -
Add logit bias into the SGLang interface.
#3187 opened
Jan 27, 2025 -
Add deepseek_v3 fused gate
#3191 opened
Jan 28, 2025 -
Fixing a typo in engine.py
#3193 opened
Jan 28, 2025
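One opened PR above proposes adding logit bias to the SGLang interface (#3187). A minimal sketch of the OpenAI-style `logit_bias` semantics this would presumably follow (illustrative; the actual PR may differ) is a per-token additive adjustment applied before sampling:

```python
def apply_logit_bias(logits, logit_bias):
    """Illustrative sketch: add per-token biases (OpenAI-style logit_bias).

    `logit_bias` maps token id -> additive bias; large negative values
    effectively ban a token, large positive values strongly favor it.
    """
    out = list(logits)
    for token_id, bias in logit_bias.items():
        out[token_id] += bias
    return out
```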
28 Issues closed by 12 people
-
[Bug] Qwen2-VL-7B with sglang has significant numerical calculation errors compared to HF Transformers
#3106 closed
Jan 28, 2025 -
Inference Speeds across 2x HGXs with infiniband 3.2tbps
#3172 closed
Jan 28, 2025 -
Offline batch inference for multi-modality with prefix caching feature
#3177 closed
Jan 28, 2025 -
[Bug] Slow throughput/s on H200 (llama 3.1 8b)
#3186 closed
Jan 28, 2025 -
[Feature] add unit test for block wise fp8
#2768 closed
Jan 27, 2025 -
[Bug] Frontend choices and `input_token_logprobs` mis-match
#2873 closed
Jan 27, 2025 -
[Feature] request smoothquant (int8, W8A8) quantization on 40G A100
#2474 closed
Jan 26, 2025 -
QVQ Prefill stage slow
#2961 closed
Jan 26, 2025 -
[Bug] Qwen2-VL-7B with sglang Performance Degradation
#3041 closed
Jan 26, 2025 -
[Bug] constrained decoding performance is worse when qps>2
#3104 closed
Jan 26, 2025 -
[Bug] Qwen-2.5-Math-7B-Instruct and Llama-3.1-8B-Instruct Produce Nonsensical Results
#2084 closed
Jan 26, 2025 -
[Bug] frequency penalty
#2177 closed
Jan 25, 2025 -
Question About Model Integration and Parameter Updates (update_weight) in Sglang
#3101 closed
Jan 24, 2025 -
[Bug] The batch decoding speed of DeepSeek V3 is too slow.
#3100 closed
Jan 24, 2025 -
[Bug] PyTorch profiler trace is not generated
#2874 closed
Jan 24, 2025 -
[Bug] libcudart.so.12: cannot open shared object file: No such file or directory
#2584 closed
Jan 24, 2025 -
Can router support --api-key parameter
#3031 closed
Jan 24, 2025 -
[BUG] Problems with jump forward decoding
#2045 closed
Jan 24, 2025 -
[Benchmarks] Can't run examples benchmark. Flashinfer error
#3089 closed
Jan 23, 2025 -
Some questions about layernorm in MLA code
#3072 closed
Jan 23, 2025 -
[Bug] DeepSeek-V3 load weights failed with --enable-ep-moe
#3075 closed
Jan 23, 2025 -
[Feature] Support LLaMA-3.2 finetuned with Sentence Transformers !
#2131 closed
Jan 23, 2025 -
[Bug] Eagle2 has an unstable sampling rate under multi-concurrency.
#2537 closed
Jan 22, 2025 -
[Bug] embedding model failed with `--enable-metrics`
#2800 closed
Jan 22, 2025 -
[Feature] When will function calls with deepseek support be available?
#2855 closed
Jan 21, 2025 -
Can multiple services be deployed simultaneously?
#2916 closed
Jan 21, 2025 -
[Feature] Add progress bar in `Engine.generate` method
#2994 closed
Jan 21, 2025
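Several items in this digest touch EAGLE speculative decoding (the Eagle2 sampling-rate issue above, plus the merged docs PR #3144 and "Return more infos for computing average acceptance length" #3152). Under the common definition (an assumption here, not necessarily SGLang's exact metric), the acceptance length of one draft/verify step is the number of accepted draft tokens plus the one token the verifier always emits:

```python
def average_acceptance_length(accepted_per_step):
    """Illustrative sketch: mean tokens emitted per speculative step.

    Each step yields the accepted draft tokens plus one bonus token
    from the target model's verification pass, so a value of 1.0 means
    no draft token was ever accepted.
    """
    if not accepted_per_step:
        return 0.0
    return sum(n + 1 for n in accepted_per_step) / len(accepted_per_step)
```

For example, steps accepting 2, 0, and 3 draft tokens give an average of 8/3 ≈ 2.67 tokens per verification, which is the speedup factor speculative decoding aims to maximize.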
28 Issues opened by 20 people
-
[Bug] ERROR: No matching distribution found for vllm==0.6.3.post2.dev1; extra == "srt-hip"
#3189 opened
Jan 28, 2025 -
Any benchmarks comparing with TGI?
#3188 opened
Jan 27, 2025 -
[Feature] Step-by-Step Guide to Use SGLang on NVIDIA Jetson Orin platform
#3182 opened
Jan 27, 2025 -
[Feature] Rewrite Sampling Parameter
#3165 opened
Jan 27, 2025 -
[Feature] fix docs in Streaming-Synchronous-Generation
#3164 opened
Jan 27, 2025 -
[Feature] Reduce docs CI time
#3163 opened
Jan 27, 2025 -
[Feature] Remove Redundant CI of Docs
#3160 opened
Jan 27, 2025 -
[Feature] Support new Qwen Models
#3159 opened
Jan 27, 2025 -
[Feature] Split Docs CI
#3158 opened
Jan 27, 2025 -
[Feature] Accuracy test of VLM
#3142 opened
Jan 26, 2025 -
[Feature] Vision LM accuracy test
#3141 opened
Jan 26, 2025 -
[Feature] GGUF Q4KM(4bit) format for deepseek R1 support
#3140 opened
Jan 26, 2025 -
[Feature] Star attention support
#3131 opened
Jan 25, 2025 -
[Bug] Service crashed with 4 H100s and QPS=25
#3112 opened
Jan 24, 2025 -
[Bug] Crash special token xgrammar
#3108 opened
Jan 24, 2025 -
Batch inference over multiple nodes
#3103 opened
Jan 24, 2025 -
[Bug] Multi-node BUG
#3099 opened
Jan 24, 2025 -
[Bug] Qwen2-VL Online Serving Issue
#3098 opened
Jan 24, 2025 -
[Feature] Support InternVL
#3092 opened
Jan 24, 2025 -
[Feature] Add support for Phi4
#3090 opened
Jan 23, 2025 -
[Feature] docs: Improve documentation on how to use EAGLE speculative decoding
#3077 opened
Jan 23, 2025 -
[Feature] Support service discovery on Kubernetes in router
#3073 opened
Jan 23, 2025 -
[Bug]ImportError: undefined symbol: cuModuleGetFunction when using lmsysorg/sglang:v0.4.1.post7-cu124
#3065 opened
Jan 23, 2025 -
[Bug] Problems with logit_bias.
#3059 opened
Jan 22, 2025 -
[Bug] Decode Throughput Inconsistency Between bench_serving and Engine Logs
#3050 opened
Jan 22, 2025 -
[Help wanted] Can't capture GPU activities using `nsight system`
#3049 opened
Jan 22, 2025 -
[Feature] Reasoning model API support
#3043 opened
Jan 22, 2025 -
[Feature] batch concurrent requests while streaming responses
#3040 opened
Jan 22, 2025
50 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
Support int8 kvcache
#3034 commented on
Jan 26, 2025 • 20 new comments -
Speculative decoding with lookahead
#2790 commented on
Jan 28, 2025 • 10 new comments -
Integrate turbomind into sgl-kernel
#2999 commented on
Jan 28, 2025 • 2 new comments -
[Feature] Support dynamic loading and unloading of Lora adapters
#2891 commented on
Jan 23, 2025 • 2 new comments -
Debug radixcache: refactor recursive helper methods
#3029 commented on
Jan 27, 2025 • 1 new comment -
[Bug] Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
#2943 commented on
Jan 25, 2025 • 0 new comments -
[Bug] NCCL Crash with SIGSEGV Frequently when deploying deepseek v3
#2803 commented on
Jan 26, 2025 • 0 new comments -
[Feature] DeepSeek V3 optimization
#2591 commented on
Jan 27, 2025 • 0 new comments -
[Feature] add support for deepseek v3 gptq / awq
#2706 commented on
Jan 27, 2025 • 0 new comments -
[Feature] Lora optimization
#2929 commented on
Jan 27, 2025 • 0 new comments -
[Bug] Regex isn't precluding parentheticals. And maybe more.
#2957 commented on
Jan 28, 2025 • 0 new comments -
[Bug] Issue with batch mode
#2762 commented on
Jan 28, 2025 • 0 new comments -
[Feature] remove vllm _custom_ops
#2965 commented on
Jan 28, 2025 • 0 new comments -
[Feature] (Willing to PR) Proposal: Drop-in fast replacement of `PreTrainedModel.generate`
#2569 commented on
Jan 28, 2025 • 0 new comments -
[Feature] support EAGLE 2 with Triton Backend
#2940 commented on
Jan 28, 2025 • 0 new comments -
prometheus query return no result
#2677 commented on
Jan 28, 2025 • 0 new comments -
[Bug] Launching Llama-3.2-11B-Vision-Instruct just hangs on generation
#2619 commented on
Jan 28, 2025 • 0 new comments -
[Experimental] Add a gRPC server for completion request
#2478 commented on
Jan 22, 2025 • 0 new comments -
Hierarchical Caching for SGLang
#2693 commented on
Jan 28, 2025 • 0 new comments -
Add endpoint for file support, purely to speed up processing of input_embeds.
#2797 commented on
Jan 28, 2025 • 0 new comments -
[WIP] [Feature] Support Deepseek-VL2
#2798 commented on
Jan 25, 2025 • 0 new comments -
[WIP] Integration of TurboMind AWQ
#2900 commented on
Jan 28, 2025 • 0 new comments -
[Core] Optimize the delay scheduling of in batch prefix caching
#2962 commented on
Jan 22, 2025 • 0 new comments -
support telechat2 model
#3000 commented on
Jan 23, 2025 • 0 new comments -
Minicpmo
#3023 commented on
Jan 25, 2025 • 0 new comments -
[Feature] FP8 weight only w8a16 quantization native support
#3007 commented on
Jan 21, 2025 • 0 new comments -
what is the most efficient way to serve a 72b model on 8 * A100?
#3002 commented on
Jan 21, 2025 • 0 new comments -
[Bug] JSONResponse fails if the probability distribution is very spiky.
#2955 commented on
Jan 21, 2025 • 0 new comments -
[Feature] Enhancement on Sparse Attention and KV-Cache Compression
#2946 commented on
Jan 21, 2025 • 0 new comments -
[Feature] Support for rerank models
#2109 commented on
Jan 21, 2025 • 0 new comments -
[Bug] tensor_model_parallel_all_reduce' is not defined
#2931 commented on
Jan 21, 2025 • 0 new comments -
Warning while running Deepseek-V3
#2921 commented on
Jan 21, 2025 • 0 new comments -
[Bug] ipv6 dist_init_addr doesn't connect when running multi-node inference
#2892 commented on
Jan 21, 2025 • 0 new comments -
[Bug] Why can't I use multi-lora adapter and radix attention together?
#2880 commented on
Jan 21, 2025 • 0 new comments -
[Bug] Bug of top_logprobs for the first chunk
#2825 commented on
Jan 21, 2025 • 0 new comments -
[Bug] Using MLA with Lk >= 576 report out of resource: shared memory ERROR
#2847 commented on
Jan 21, 2025 • 0 new comments -
Do not use tools param in stream request!
#2810 commented on
Jan 21, 2025 • 0 new comments -
[Bug] Huggingface model weight download failures do not cause the process to exit
#2801 commented on
Jan 21, 2025 • 0 new comments -
[Bug] Forking state before submitting any string causes backend crashing in sgl.function: "UnboundLocalError: local variable 'model_worker_batch' referenced before assignment"
#2755 commented on
Jan 21, 2025 • 0 new comments -
[Bug] def get_nvgpu_memory_capacity() causes crash on NVIDIA H100 MIG
#2933 commented on
Jan 21, 2025 • 0 new comments -
[Bug] compressed-tensors format not supported
#2871 commented on
Jan 22, 2025 • 0 new comments -
[Feature] Add docs for local accuracy tests
#2953 commented on
Jan 22, 2025 • 0 new comments -
[Bug] [OpenAI compatible API] Chunks of tokens aren't being split into separate indexes when specifying n > 1 generations
#2912 commented on
Jan 22, 2025 • 0 new comments -
[Feature] Dynamic Lora Support in SGLang (like VLLM)
#2686 commented on
Jan 22, 2025 • 0 new comments -
[Bug] finish_reason is not right when Qwen call a tool
#2877 commented on
Jan 22, 2025 • 0 new comments -
[Bug] KeyError: 'lm_head.weight' when loading quantized llama 3.2 3B and 1B models
#2935 commented on
Jan 22, 2025 • 0 new comments -
[Bug] Cannot capture kernel trace using nsys
#2776 commented on
Jan 22, 2025 • 0 new comments -
[Bug] How to load weight with torchao
#2721 commented on
Jan 23, 2025 • 0 new comments -
[Feature] Support General Reward Model
#2427 commented on
Jan 24, 2025 • 0 new comments -
[Bug] Gemma 2 GGUF
#2451 commented on
Jan 24, 2025 • 0 new comments