ApeCloud

Kwai: Containerizing Massive Redis Clusters at Scale

Speaker: Yuyu Liu, Kuaishou Cloud Native Team

Value 1: Reduce infrastructure operating costs
Value 2: Simplify stateful service management

Executive Summary

Kwai containerized its Redis infrastructure with KubeBlocks. The platform now manages individual clusters of more than 10,000 instances while reducing infrastructure costs and improving operational efficiency.

I. Business Context and Scale Challenge

Redis Infrastructure at Kwai

Kwai operates one of the largest Redis deployments in the industry:

  • Architecture: Master-slave topology with Server, Sentinel, and Proxy components
  • Scale: Individual clusters exceeding 10,000 instances
  • Complexity: Massive distributed infrastructure supporting hundreds of millions of users

Transformation Drivers

Despite stable operations at scale, Kwai identified critical optimization opportunities:

Resource Efficiency Gap
  • Low resource utilization across the Redis fleet
  • Significant cost-saving potential given infrastructure scale
  • Need for improved resource allocation strategies

Strategic Cloud-Native Alignment
  • Stateless services already migrated to container platforms
  • Infrastructure unification as long-term strategy
  • Enhanced business agility through infrastructure decoupling
  • Operational cost reduction through standardized platforms

II. Why KubeBlocks?

Stateful Service Management Challenge

Traditional container orchestration treats instances as interchangeable, but the instances of a stateful service are not equal and need role-aware handling:

Instance Inequality
  • Different roles and responsibilities (master/slave relationships)
  • Persistent state and data storage requirements
  • Dynamic role changes during runtime (failover scenarios)
  • Cannot arbitrarily terminate or replace instances

KubeBlocks Solution
  • Role-Based Management: Native support for instance hierarchies and relationships
  • Multi-Database Platform: Single operator supporting diverse database types
  • Process-Oriented APIs: OpsRequest framework for complex database operations
  • Simplified Operations: Lower complexity compared to database-specific operators
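
To make the idea of role-based management concrete, here is a minimal sketch of one role-aware decision: ordering a rolling update so that slaves are touched before the master. The `Instance` type and the `update_order` function are hypothetical names introduced for illustration and are not KubeBlocks APIs; the actual policy used at Kwai is not described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    role: str  # "master" or "slave" (roles as named in this article)


def update_order(instances):
    """Return instances ordered so that slaves come before the master."""
    # sorted() is stable: slaves (key 0) keep their input order,
    # and the master (key 1) is moved to the end.
    return sorted(instances, key=lambda inst: 1 if inst.role == "master" else 0)


shard = [
    Instance("redis-0", "master"),
    Instance("redis-1", "slave"),
    Instance("redis-2", "slave"),
]
ordered_names = [inst.name for inst in update_order(shard)]
print(ordered_names)
```

A role-unaware controller would simply walk the instances by ordinal and might restart the master first, which is exactly the kind of arbitrary termination the list above rules out.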

III. Redis Cluster Architecture in KubeBlocks

Hierarchical Component Model

Component Definitions

  • Cluster: Complete Redis deployment specification
  • ShardSpec: Horizontal scaling through data sharding
  • Component: Individual services (Server, Sentinel, Proxy)
  • InstanceTemplate: Configuration variants within components
  • InstanceSet: Generated workloads with role-aware management
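
The hierarchy above can be pictured as nested data. The following is a rough sketch with plain dataclasses, not the KubeBlocks CRD schema: all field names, the naming scheme for generated instances, and the assumption that only the "server" component is sharded are illustrative choices made here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InstanceTemplate:
    name: str      # a configuration variant inside a component
    replicas: int  # how many instances use this variant


@dataclass
class Component:
    name: str  # "server", "sentinel", or "proxy"
    templates: List[InstanceTemplate]


@dataclass
class Cluster:
    name: str
    shards: int  # shard count, applied to the "server" component only
    components: List[Component] = field(default_factory=list)


def instance_names(cluster):
    """Expand a cluster spec into the instance names a workload might generate."""
    names = []
    for comp in cluster.components:
        # Assumption of this sketch: only the server component is sharded.
        shard_ids = range(cluster.shards) if comp.name == "server" else [None]
        for shard in shard_ids:
            prefix = f"{cluster.name}-{comp.name}"
            if shard is not None:
                prefix += f"-shard{shard}"
            ordinal = 0  # ordinals run across all templates of one workload
            for tpl in comp.templates:
                for _ in range(tpl.replicas):
                    names.append(f"{prefix}-{tpl.name}-{ordinal}")
                    ordinal += 1
    return names


cache = Cluster(
    name="cache",
    shards=2,
    components=[
        Component("server", [InstanceTemplate("large", 1), InstanceTemplate("small", 1)]),
        Component("sentinel", [InstanceTemplate("default", 3)]),
    ],
)
names = instance_names(cache)
print(names)
```

Here the two templates in the server component stand in for "configuration variants within components", for example different resource settings for different instances of the same shard.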

Key Capabilities

  • Flexible Sharding: Support for massive data distribution
  • Configuration Variance: Different settings for master/slave within shards
  • Reusable Definitions: Component and version separation for efficiency

Role Management Strategy

Core Principles

  • Relationship Maintenance: Correct master-slave relationships
  • Fine-Grained Control: Role-specific management operations

Data/Control Plane Separation

  • Critical master node information managed separately for stability
  • Business continuity prioritized over complete automation
  • Reduced risk of control plane failures affecting data operations

IV. Multi-Cluster Federation Architecture

Scale Requirements

Kwai's Redis fleet exceeds the capacity of a single Kubernetes cluster, so it has to be deployed across multiple clusters.

Federation Solution

Distributed Control Architecture

  • Federation Cluster: Hosts Cluster and Component Operators
  • Member Clusters: Run InstanceSet Controllers locally
  • Federal InstanceSet Controller: Cross-cluster instance distribution

Key Features

  • Intelligent Scheduling: Automated instance distribution based on cluster capacity
  • InstanceSet Splitting: Seamless distribution of large workloads
  • Ordinal Preservation: Custom numbering ranges maintain consistency
  • Transparent Operation: Hidden complexity from Redis teams
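
InstanceSet splitting with ordinal preservation can be sketched as follows. This is an illustration only: the function `split_instance_set`, the greedy "most free capacity first" policy, and the member cluster names are assumptions made here, not the scheduling algorithm of the federated controller described in the talk.

```python
def split_instance_set(replicas, capacities):
    """Split `replicas` instances across member clusters.

    capacities maps a member cluster name to its free instance slots.
    Returns a dict of cluster name -> (start, end) ordinal range, with
    `end` exclusive, so every instance keeps a unique global ordinal.
    """
    if replicas > sum(capacities.values()):
        raise ValueError("not enough capacity across member clusters")
    ranges = {}
    next_ordinal = 0
    # Greedy policy (an assumption): fill the emptiest clusters first.
    for name, free in sorted(capacities.items(), key=lambda kv: -kv[1]):
        take = min(free, replicas - next_ordinal)
        if take == 0:
            break
        ranges[name] = (next_ordinal, next_ordinal + take)
        next_ordinal += take
    return ranges


plan = split_instance_set(10, {"member-a": 4, "member-b": 6, "member-c": 5})
print(plan)
```

Because each member cluster receives a contiguous, non-overlapping ordinal range, the instance numbering stays consistent across clusters and the split remains invisible to whoever consumes the instance names.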

Operational Benefits

  • Eliminated need for manual buffer pool management
  • Automated scaling across cluster boundaries
  • Reduced resource waste through optimal allocation

V. Production-Grade Reliability

Advanced Scheduling Capabilities

High Availability Features

  • Instance Distribution: Configurable maximum instances per node
  • Cluster Spreading: Limit on the maximum number of nodes per Redis cluster
  • Load Balancing: CPU, memory, and bandwidth-aware placement
  • Failure Domain Management: Single-node failure impact calculation
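
As a rough illustration of two of these features, the sketch below places instances under a per-node cap and then computes the impact of losing a single node. The names `place` and `worst_case_node_failure`, the "least-loaded node first" rule, and the definition of impact as a fraction of instances are assumptions of this example, not the production scheduler.

```python
from collections import Counter


def place(instance_names, nodes, max_per_node):
    """Assign each instance to the least-loaded node that still has room."""
    load = Counter()
    placement = {}
    for inst in instance_names:
        candidates = [n for n in nodes if load[n] < max_per_node]
        if not candidates:
            raise RuntimeError("no node has room for " + inst)
        # Tie-break on the node name so the result is deterministic.
        node = min(candidates, key=lambda n: (load[n], n))
        placement[inst] = node
        load[node] += 1
    return placement


def worst_case_node_failure(placement):
    """Fraction of instances lost if the most heavily loaded node fails."""
    if not placement:
        return 0.0
    per_node = Counter(placement.values())
    return max(per_node.values()) / len(placement)


placement = place(
    [f"redis-{i}" for i in range(5)],
    ["node-a", "node-b", "node-c"],
    max_per_node=2,
)
impact = worst_case_node_failure(placement)
print(placement, impact)
```

Lowering `max_per_node` shrinks the worst-case impact at the cost of needing more nodes, which is the trade-off a configurable limit exposes.
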
Runtime Safety Controls

Change Management

  • Concurrency Limits: Controlled simultaneous operations
  • In-Place Updates: Minimized disruption strategies
  • Gradual Rollouts: Safe deployment of configuration changes
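
A concurrency limit combined with a gradual rollout can be sketched as batch-by-batch application that halts on the first unhealthy batch. The function names and the `apply` / `healthy` callbacks are hypothetical stand-ins for whatever change and health check a real operator would use.

```python
def rollout_batches(instances, max_concurrent):
    """Split instances into batches of at most `max_concurrent`."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return [
        instances[i:i + max_concurrent]
        for i in range(0, len(instances), max_concurrent)
    ]


def gradual_rollout(instances, max_concurrent, apply, healthy):
    """Apply a change batch by batch; stop after the first unhealthy batch.

    Returns (touched_instances, completed).
    """
    touched = []
    for batch in rollout_batches(instances, max_concurrent):
        for inst in batch:
            apply(inst)
        touched.extend(batch)
        if not all(healthy(inst) for inst in batch):
            return touched, False  # halt: later batches are left untouched
    return touched, True


applied = []
touched, completed = gradual_rollout(
    ["r0", "r1", "r2", "r3", "r4"],
    max_concurrent=2,
    apply=applied.append,
    healthy=lambda inst: inst != "r2",  # simulate one instance failing its check
)
print(touched, completed)
```

In this run the rollout stops after the second batch, so `r4` never receives the change and the blast radius of a bad configuration is bounded by the batch size.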

Operational Safeguards

  • Extensive automation controls to prevent cascading failures
  • Multiple validation layers before executing changes
  • Comprehensive monitoring and alerting systems

VI. Results and Future Roadmap

Technical Achievements

Enhanced Stateful Service Management

  • Superior role-based capabilities compared to StatefulSet
  • Unified platform for diverse database types
  • Simplified operational workflows through standardized APIs

Collaborative Development

Kwai contributed key enhancements to KubeBlocks:

  • Direct Management: InstanceSet controlling Pods and PVCs
  • Instance Templates: Flexible configuration management
  • Federation Integration: Multi-cluster orchestration capabilities
  • Custom Scheduling: Advanced placement algorithms

Strategic Vision

Ecosystem Development

  • Operator Integration: Bridging existing database operators with KubeBlocks
  • API Standardization: Promoting unified stateful service standards
  • Community Collaboration: Continued partnership in cloud-native database management

Platform Evolution

  • Extended support for additional stateful workloads
  • Enhanced automation and operational efficiency
  • Improved resource utilization across the infrastructure

Conclusion

Kwai's implementation demonstrates that KubeBlocks can manage stateful services at very large scale, with individual Redis clusters of more than 10,000 instances, while maintaining production-grade reliability. The solution addresses the challenges of containerizing Redis and provides a foundation for continued cloud-native evolution and operational cost savings.

This collaboration showcases the potential for standardized stateful service management in cloud-native environments, contributing valuable patterns and capabilities to the broader Kubernetes ecosystem.
