vllm.model_executor.layers.fused_moe.flashinfer_trtllm_moe ¶
_supports_activation ¶
_supports_parallel_config ¶
_supports_parallel_config(
moe_parallel_config: FusedMoEParallelConfig,
) -> bool
The TRTLLM kernel does not support EPLB.
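For illustration, a minimal sketch of the shape of this predicate, using a stand-in dataclass with a hypothetical enable_eplb flag rather than the real FusedMoEParallelConfig fields:

```python
# Illustrative sketch only: the stand-in dataclass and its `enable_eplb` flag are
# hypothetical and are not vLLM's actual FusedMoEParallelConfig.
from dataclasses import dataclass


@dataclass
class StubMoEParallelConfig:
    enable_eplb: bool = False  # hypothetical flag standing in for the real EPLB setting


def supports_parallel_config(cfg: StubMoEParallelConfig) -> bool:
    # The TRTLLM kernel cannot run with EPLB enabled, so the check simply rejects it.
    return not cfg.enable_eplb


assert supports_parallel_config(StubMoEParallelConfig())
assert not supports_parallel_config(StubMoEParallelConfig(enable_eplb=True))
```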
_supports_quant_scheme ¶
Supports FP8 per-tensor and FP8 block quantization.
Source code in vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py
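As a hedged illustration, a scheme check of this kind typically amounts to membership in a small allow-list; the string identifiers below are placeholders, not vLLM's QuantKey values:

```python
# Sketch only: the string scheme identifiers are placeholders, not real QuantKey objects.
SUPPORTED_QUANT_SCHEMES = {"fp8_per_tensor", "fp8_block"}


def supports_quant_scheme(weight_scheme: str | None, activation_scheme: str | None) -> bool:
    # Both the weight and activation quantization schemes must be in the allow-list.
    return (
        weight_scheme in SUPPORTED_QUANT_SCHEMES
        and activation_scheme in SUPPORTED_QUANT_SCHEMES
    )


assert supports_quant_scheme("fp8_per_tensor", "fp8_per_tensor")
assert not supports_quant_scheme("int4_awq", "fp8_per_tensor")
```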
_supports_routing_method ¶
_supports_routing_method(
weight_key: QuantKey | None,
activation_key: QuantKey | None,
routing_method: RoutingMethodType,
) -> bool
Monolithic kernels need to declare which routing methods they support.
Source code in vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py
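A hedged sketch of such a routing-method gate, using a stand-in IntEnum in place of FlashInfer's RoutingMethodType; the member names mirror ones referenced elsewhere on this page, and the supported set is illustrative:

```python
# Sketch only: StubRoutingMethodType stands in for flashinfer's RoutingMethodType;
# the supported set below is illustrative, not the kernel's actual support matrix.
import enum


class StubRoutingMethodType(enum.IntEnum):
    Default = 0
    Renormalize = 1
    DeepSeekV3 = 2


SUPPORTED_ROUTING_METHODS = {
    StubRoutingMethodType.Renormalize,
    StubRoutingMethodType.DeepSeekV3,
}


def supports_routing_method(routing_method: StubRoutingMethodType) -> bool:
    # A monolithic kernel fuses routing into the expert computation, so it must
    # state explicitly which routing methods it implements.
    return routing_method in SUPPORTED_ROUTING_METHODS


assert supports_routing_method(StubRoutingMethodType.DeepSeekV3)
assert not supports_routing_method(StubRoutingMethodType.Default)
```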
fi_trtllm_fp8_per_tensor_moe ¶
fi_trtllm_fp8_per_tensor_moe(
routing_logits: Tensor,
routing_bias: Tensor | None,
hidden_states: Tensor,
input_scale: Tensor,
gemm1_weights: Tensor,
gemm2_weights: Tensor,
output1_scales_scalar: Tensor,
output1_scales_gate_scalar: Tensor,
output2_scales_scalar: Tensor,
num_experts: int,
top_k: int,
num_expert_group: int | None,
topk_group: int | None,
intermediate_size: int,
local_expert_offset: int,
local_num_experts: int,
use_routing_scales_on_input: bool,
routing_method_type: int,
routed_scaling_factor: float = 1.0,
) -> Tensor
Source code in vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py
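A hedged call sketch follows. The tensor shapes and dtypes are assumptions inferred from the parameter names (e.g. gemm1_weights holding the fused gate/up projection), not a documented contract, and running it requires a CUDA GPU with FlashInfer's TRTLLM MoE kernels available:

```python
# Call sketch only: shapes and dtypes below are assumptions, and the op requires
# a CUDA device with FlashInfer's TRTLLM MoE support.
import torch
from vllm.model_executor.layers.fused_moe.flashinfer_trtllm_moe import (
    fi_trtllm_fp8_per_tensor_moe,
)

num_tokens, hidden_size, intermediate_size = 4, 1024, 2048
num_experts, top_k = 8, 2

out = fi_trtllm_fp8_per_tensor_moe(
    routing_logits=torch.randn(num_tokens, num_experts, device="cuda"),
    routing_bias=None,
    # Assumed: activations passed in bf16 and quantized internally via input_scale.
    hidden_states=torch.randn(num_tokens, hidden_size, device="cuda", dtype=torch.bfloat16),
    input_scale=torch.ones(1, device="cuda"),
    # Assumed layout: gate and up projections fused along the first weight dim.
    gemm1_weights=torch.randn(num_experts, 2 * intermediate_size, hidden_size, device="cuda").to(torch.float8_e4m3fn),
    gemm2_weights=torch.randn(num_experts, hidden_size, intermediate_size, device="cuda").to(torch.float8_e4m3fn),
    output1_scales_scalar=torch.ones(num_experts, device="cuda"),
    output1_scales_gate_scalar=torch.ones(num_experts, device="cuda"),
    output2_scales_scalar=torch.ones(num_experts, device="cuda"),
    num_experts=num_experts,
    top_k=top_k,
    num_expert_group=None,
    topk_group=None,
    intermediate_size=intermediate_size,
    local_expert_offset=0,
    local_num_experts=num_experts,
    use_routing_scales_on_input=False,
    routing_method_type=1,  # assumed: an integer value of FlashInfer's RoutingMethodType
)
```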
fi_trtllm_fp8_per_tensor_moe_fake ¶
fi_trtllm_fp8_per_tensor_moe_fake(
routing_logits: Tensor,
routing_bias: Tensor | None,
hidden_states: Tensor,
input_scale: Tensor,
gemm1_weights: Tensor,
gemm2_weights: Tensor,
output1_scales_scalar: Tensor,
output1_scales_gate_scalar: Tensor,
output2_scales_scalar: Tensor,
num_experts: int,
top_k: int,
num_expert_group: int | None,
topk_group: int | None,
intermediate_size: int,
local_expert_offset: int,
local_num_experts: int,
use_routing_scales_on_input: bool,
routing_method_type: int,
routed_scaling_factor: float = 1.0,
) -> Tensor
Source code in vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py
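The _fake variant exists so the op can be traced with meta/fake tensors (e.g. under torch.compile) without launching the kernel; a typical fake implementation only allocates an output with the right shape and dtype. A hedged sketch of that general pattern, not necessarily the exact body used here:

```python
# Sketch of the usual fake-implementation pattern: return an empty tensor so shape
# propagation works under torch.compile / FakeTensorMode. The output shape/dtype
# matching hidden_states is an assumption.
import torch


def trtllm_fp8_per_tensor_moe_fake_sketch(
    routing_logits: torch.Tensor,
    hidden_states: torch.Tensor,
    **_unused_kwargs,
) -> torch.Tensor:
    # Assumed: the MoE output has the same [num_tokens, hidden_size] shape as the input.
    return torch.empty_like(hidden_states)
```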
flashinfer_fused_moe_blockscale_fp8 ¶
flashinfer_fused_moe_blockscale_fp8(
routing_logits: Tensor,
routing_bias: Tensor,
x: Tensor,
w13_weight: Tensor,
w13_weight_scale_inv: Tensor,
w2_weight: Tensor,
w2_weight_scale_inv: Tensor,
global_num_experts: int,
top_k: int,
num_expert_group: int | None,
topk_group: int | None,
intermediate_size: int,
expert_offset: int,
local_num_experts: int,
block_shape: list[int],
routing_method_type: int = int(DeepSeekV3),
routed_scaling: float | None = 1.0,
) -> Tensor
Source code in vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py
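For the block-scale variant, the block_shape argument (commonly [128, 128]) sets the granularity of the weight scales. A hedged sketch of the assumed relationship between weight and scale-inverse shapes, following the common DeepSeek-style block layout rather than a documented contract of this op:

```python
# Sketch only: the shapes below reflect the common 128x128 block quantization layout
# and are assumptions, not a documented contract of this op.
import math

num_experts, hidden_size, intermediate_size = 8, 1024, 2048
block_shape = [128, 128]

# w13_weight fuses the gate and up projections: [E, 2*I, H] in FP8.
w13_weight_shape = (num_experts, 2 * intermediate_size, hidden_size)
# One scale per [128, 128] weight block.
w13_scale_shape = (
    num_experts,
    math.ceil(2 * intermediate_size / block_shape[0]),
    math.ceil(hidden_size / block_shape[1]),
)

# w2_weight: [E, H, I] in FP8, with matching block scales.
w2_weight_shape = (num_experts, hidden_size, intermediate_size)
w2_scale_shape = (
    num_experts,
    math.ceil(hidden_size / block_shape[0]),
    math.ceil(intermediate_size / block_shape[1]),
)

print(w13_weight_shape, w13_scale_shape)
print(w2_weight_shape, w2_scale_shape)
```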
flashinfer_fused_moe_blockscale_fp8_fake ¶
flashinfer_fused_moe_blockscale_fp8_fake(
routing_logits: Tensor,
routing_bias: Tensor,
x: Tensor,
w13_weight: Tensor,
w13_weight_scale_inv: Tensor,
w2_weight: Tensor,
w2_weight_scale_inv: Tensor,
global_num_experts: int,
top_k: int,
num_expert_group: int,
topk_group: int,
intermediate_size: int,
expert_offset: int,
local_num_experts: int,
block_shape: list[int],
routing_method_type: int,
routed_scaling: float = 1.0,
) -> Tensor
Source code in vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py
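Pairs like flashinfer_fused_moe_blockscale_fp8 and its _fake counterpart are typically bound together through custom-op registration so that the compilation path can trace them. A hedged sketch of that general pattern using torch.library; vLLM's own registration helper and op names may differ:

```python
# Sketch of the generic PyTorch custom-op pattern that pairs a real implementation
# with a fake (meta) implementation. Names and bodies here are illustrative only.
import torch


@torch.library.custom_op("demo::blockscale_fp8_moe", mutates_args=())
def blockscale_fp8_moe(x: torch.Tensor) -> torch.Tensor:
    # The real implementation would dispatch into the FlashInfer TRTLLM kernel.
    return x.clone()


@blockscale_fp8_moe.register_fake
def _(x: torch.Tensor) -> torch.Tensor:
    # Fake implementation: only shape and dtype matter for tracing.
    return torch.empty_like(x)
```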
is_supported_config_trtllm ¶
is_supported_config_trtllm(
moe_config: FusedMoEConfig,
weight_key: QuantKey | None,
activation_key: QuantKey | None,
activation_format: FusedMoEActivationFormat,
) -> tuple[bool, str | None]
This function mirrors mk.FusedMoEPermuteExpertsUnpermute.is_supported_config.
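A hedged sketch of how a check like this composes the individual _supports_* predicates into a (supported, reason) result; the boolean inputs stand in for the module's helpers, and the reason strings are illustrative:

```python
# Sketch only: the boolean arguments are placeholders for the module's _supports_*
# helpers, and the reason strings are illustrative.
def is_supported_config_sketch(
    parallel_ok: bool,
    quant_ok: bool,
    routing_ok: bool,
) -> tuple[bool, str | None]:
    # Return (True, None) when every check passes, otherwise (False, reason).
    if not parallel_ok:
        return False, "TRTLLM kernel does not support this parallel config (e.g. EPLB)"
    if not quant_ok:
        return False, "unsupported quantization scheme (FP8 per-tensor or FP8 block required)"
    if not routing_ok:
        return False, "unsupported routing method"
    return True, None


assert is_supported_config_sketch(True, True, True) == (True, None)
assert is_supported_config_sketch(True, False, True)[0] is False
```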