Kwai: Containerizing Massive Redis Clusters at Scale
Speaker: Yuyu Liu, Kuaishou Cloud Native Team

Executive Summary
Kwai containerized its large-scale Redis infrastructure using KubeBlocks, managing clusters of more than 10,000 instances while cutting costs and improving operational efficiency.
I. Business Context and Scale Challenge
Redis Infrastructure at Kwai
Kwai operates one of the largest Redis deployments in the industry:
- Architecture: Master-slave topology with Server, Sentinel, and Proxy components
- Scale: Individual clusters exceeding 10,000 instances
- Complexity: Massive distributed infrastructure supporting hundreds of millions of users
Transformation Drivers
Despite stable operations at scale, Kwai identified critical optimization opportunities:
Resource Efficiency Gap
- Low resource utilization across the Redis fleet
- Significant cost-saving potential given infrastructure scale
- Need for improved resource allocation strategies
Strategic Cloud-Native Alignment
- Stateless services already migrated to container platforms
- Infrastructure unification as long-term strategy
- Enhanced business agility through infrastructure decoupling
- Operational cost reduction through standardized platforms
II. Why KubeBlocks?
Stateful Service Management Challenge
Traditional container orchestration treats instances as interchangeable, but stateful services require different approaches:
Instance Inequality
- Different roles and responsibilities (master/slave relationships)
- Persistent state and data storage requirements
- Dynamic role changes during runtime (failover scenarios)
- Instances cannot be terminated or replaced arbitrarily
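One consequence of instance inequality is that disruptive operations need an ordering rule based on the role each instance currently holds. The following is a minimal Python sketch of that idea, not KubeBlocks code: the `Instance` type, the `ROLE_PRIORITY` table, and `update_order` are hypothetical names introduced for illustration, and the rule "replicas first, master last" is an assumed policy.

```python
from dataclasses import dataclass

# Hypothetical priorities: a lower value means the instance is updated earlier.
ROLE_PRIORITY = {"slave": 0, "master": 1}


@dataclass(frozen=True)
class Instance:
    name: str
    role: str  # "master" or "slave", as observed at runtime


def update_order(instances):
    """Sort instances so that replicas are updated before the master."""
    return sorted(instances, key=lambda i: (ROLE_PRIORITY[i.role], i.name))


shard = [
    Instance("redis-0", "master"),
    Instance("redis-1", "slave"),
    Instance("redis-2", "slave"),
]
ordered = [i.name for i in update_order(shard)]

# After a failover the roles differ, so the order must be recomputed.
after_failover = [
    Instance("redis-0", "slave"),
    Instance("redis-1", "master"),
    Instance("redis-2", "slave"),
]
ordered_after_failover = [i.name for i in update_order(after_failover)]
```

Because roles change at runtime, the order is derived from the observed role rather than from the instance name or ordinal.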
KubeBlocks Solution
- Role-Based Management: Native support for instance hierarchies and relationships
- Multi-Database Platform: Single operator supporting diverse database types
- Process-Oriented APIs: OpsRequest framework for complex database operations
- Simplified Operations: Lower complexity compared to database-specific operators
III. Redis Cluster Architecture in KubeBlocks
Hierarchical Component Model
Component Definitions
- Cluster: Complete Redis deployment specification
- ShardSpec: Horizontal scaling through data sharding
- Component: Individual services (Server, Sentinel, Proxy)
- InstanceTemplate: Configuration variants within components
- InstanceSet: Generated workloads with role-aware management
Key Capabilities
- Flexible Sharding: Support for massive data distribution
- Configuration Variance: Different settings for master/slave within shards
- Reusable Definitions: Component and version separation for efficiency
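The hierarchy above can be pictured as nested data structures. The sketch below is a simplified Python model, not the KubeBlocks API: all class and field names (`defaults`, `overrides`, `shards`, and so on) are assumptions chosen for illustration, and the rule that template overrides take precedence over component defaults is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceTemplate:
    name: str
    replicas: int  # how many instances of the component use this variant
    overrides: dict = field(default_factory=dict)


@dataclass
class Component:
    name: str
    replicas: int
    defaults: dict = field(default_factory=dict)
    templates: list = field(default_factory=list)

    def instance_configs(self):
        """Expand into one config per instance; overrides win over defaults."""
        configs = []
        for t in self.templates:
            configs += [{**self.defaults, **t.overrides} for _ in range(t.replicas)]
        remaining = self.replicas - len(configs)
        if remaining < 0:
            raise ValueError("templates request more replicas than the component has")
        # Instances not covered by a template use the component defaults.
        configs += [dict(self.defaults) for _ in range(remaining)]
        return configs


@dataclass
class ShardSpec:
    name: str
    shards: int          # number of shards stamped out from the template
    template: Component  # component definition reused by every shard


@dataclass
class Cluster:
    name: str
    shard_specs: list
    components: list

    def total_instances(self):
        sharded = sum(s.shards * s.template.replicas for s in self.shard_specs)
        return sharded + sum(c.replicas for c in self.components)


server = Component(
    "server",
    replicas=2,
    defaults={"memory": "8Gi"},
    templates=[InstanceTemplate("large", 1, {"memory": "16Gi"})],
)
cluster = Cluster(
    "demo",
    shard_specs=[ShardSpec("shard", 3, server)],
    components=[Component("sentinel", 3), Component("proxy", 2)],
)
```

In this toy model, three shards of two Server instances plus three Sentinels and two Proxies give 11 instances in total, and each shard holds one instance of the "large" variant and one with the defaults.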
Role Management Strategy
Core Principles
- Relationship Maintenance: Correct master-slave relationships
- Fine-Grained Control: Role-specific management operations
Data/Control Plane Separation
- Critical master node information managed separately for stability
- Business continuity prioritized over complete automation
- Reduced risk of control plane failures affecting data operations
IV. Multi-Cluster Federation Architecture
Scale Requirements
Kwai's Redis scale exceeds single Kubernetes cluster capacity, necessitating multi-cluster deployment.
Federation Solution
Distributed Control Architecture
- Federation Cluster: Hosts Cluster and Component Operators
- Member Clusters: Run InstanceSet Controllers locally
- Federal InstanceSet Controller: Cross-cluster instance distribution
Key Features
- Intelligent Scheduling: Automated instance distribution based on cluster capacity
- InstanceSet Splitting: Seamless distribution of large workloads
- Ordinal Preservation: Custom numbering ranges maintain consistency
- Transparent Operation: Federation complexity is hidden from Redis teams
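InstanceSet splitting with ordinal preservation can be sketched as assigning each member cluster a contiguous, non-overlapping range of instance ordinals. The Python below is an illustration under stated assumptions, not the federation controller's algorithm: `split_instance_set` is a hypothetical function, the capacities are made-up numbers, and the greedy "fill clusters in order" policy stands in for whatever scheduling logic is actually used.

```python
def split_instance_set(total, capacities):
    """Assign contiguous ordinal ranges to member clusters.

    capacities maps a member cluster name to the number of instances it can
    host. Returns {cluster: (start, end)} with `end` exclusive, so the ranges
    never overlap and together cover ordinals 0..total-1.
    """
    if total > sum(capacities.values()):
        raise ValueError("not enough capacity across member clusters")
    ranges = {}
    next_ordinal = 0
    for cluster, capacity in capacities.items():
        if next_ordinal >= total:
            break
        count = min(capacity, total - next_ordinal)
        if count == 0:
            continue
        ranges[cluster] = (next_ordinal, next_ordinal + count)
        next_ordinal += count
    return ranges


plan = split_instance_set(10, {"member-a": 4, "member-b": 3, "member-c": 5})
```

With these inputs, member-a receives ordinals 0 to 3, member-b 4 to 6, and member-c 7 to 9, so instance names stay unique across clusters even though each member cluster only manages its own slice.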
Operational Benefits
- Eliminated need for manual buffer pool management
- Automated scaling across cluster boundaries
- Reduced resource waste through optimal allocation
V. Production-Grade Reliability
Advanced Scheduling Capabilities
High Availability Features
- Instance Distribution: Configurable maximum instances per node
- Cluster Spreading: Limits on the maximum number of nodes per Redis cluster
- Load Balancing: CPU, memory, and bandwidth-aware placement
- Failure Domain Management: Single-node failure impact calculation
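The per-node limit, the per-cluster node limit, and the single-node failure impact can be expressed as simple checks over the current placement. This Python sketch is an assumption-laden illustration rather than Kwai's scheduler: the function names, the `(cluster, node)` placement format, and the definition of impact as "largest share of a cluster's instances on one node" are all hypothetical.

```python
from collections import Counter


def feasible_nodes(placement, cluster, nodes, max_per_node, max_nodes):
    """Return the nodes that may host one more instance of `cluster`.

    placement is a list of (cluster, node) pairs for running instances.
    """
    per_node = Counter(node for c, node in placement if c == cluster)
    used_nodes = len(per_node)
    feasible = []
    for node in nodes:
        if per_node[node] >= max_per_node:
            continue  # this node already holds the maximum for the cluster
        if per_node[node] == 0 and used_nodes >= max_nodes:
            continue  # using a new node would exceed the node limit
        feasible.append(node)
    return feasible


def single_node_failure_impact(placement, cluster):
    """Worst-case fraction of the cluster's instances lost if one node fails."""
    per_node = Counter(node for c, node in placement if c == cluster)
    total = sum(per_node.values())
    return max(per_node.values()) / total if total else 0.0


placement = [("kv", "n1"), ("kv", "n1"), ("kv", "n2"), ("other", "n3")]
nodes = ["n1", "n2", "n3", "n4"]
strict = feasible_nodes(placement, "kv", nodes, max_per_node=2, max_nodes=2)
relaxed = feasible_nodes(placement, "kv", nodes, max_per_node=2, max_nodes=3)
impact = single_node_failure_impact(placement, "kv")
```

A real scheduler would combine these filters with CPU, memory, and bandwidth scoring; the sketch covers only the hard limits.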
Runtime Safety Controls
Change Management
- Concurrency Limits: Controlled simultaneous operations
- In-Place Updates: Minimized disruption strategies
- Gradual Rollouts: Safe deployment of configuration changes
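Concurrency limits and gradual rollouts can be combined into a batch loop that stops as soon as a batch looks unhealthy. The following Python is a minimal sketch under assumptions: `rollout_batches`, `gradual_rollout`, and the `apply` and `healthy` callbacks are hypothetical names, and the instances in a batch are processed one after another here, whereas a real controller would update them concurrently.

```python
def rollout_batches(instances, max_concurrent):
    """Split instances into batches of at most max_concurrent."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return [
        instances[i:i + max_concurrent]
        for i in range(0, len(instances), max_concurrent)
    ]


def gradual_rollout(instances, max_concurrent, apply, healthy):
    """Apply a change batch by batch, halting after the first unhealthy batch.

    Returns the instances that were changed before the rollout stopped.
    """
    changed = []
    for batch in rollout_batches(instances, max_concurrent):
        for instance in batch:
            apply(instance)
        changed.extend(batch)
        if not all(healthy(instance) for instance in batch):
            break  # leave the remaining instances untouched
    return changed


instances = ["r-0", "r-1", "r-2", "r-3", "r-4"]
applied = []
changed = gradual_rollout(
    instances,
    max_concurrent=2,
    apply=applied.append,
    healthy=lambda name: name != "r-3",  # simulate one failing instance
)
```

In this run the second batch contains the failing instance, so the rollout stops after four instances and "r-4" is never touched.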
Operational Safeguards
- Extensive automation controls to prevent cascading failures
- Multiple validation layers before executing changes
- Comprehensive monitoring and alerting systems
VI. Results and Future Roadmap
Technical Achievements
Enhanced Stateful Service Management
- Stronger role-based management than the native Kubernetes StatefulSet
- Unified platform for diverse database types
- Simplified operational workflows through standardized APIs
Collaborative Development
Kwai contributed key enhancements to KubeBlocks:
- Direct Management: InstanceSet controlling Pods and PVCs
- Instance Templates: Flexible configuration management
- Federation Integration: Multi-cluster orchestration capabilities
- Custom Scheduling: Advanced placement algorithms
Strategic Vision
Ecosystem Development
- Operator Integration: Bridging existing database operators with KubeBlocks
- API Standardization: Promoting unified stateful service standards
- Community Collaboration: Continued partnership in cloud-native database management
Platform Evolution
- Extended support for additional stateful workloads
- Enhanced automation and operational efficiency
- Improved resource utilization across the infrastructure
Conclusion
Kwai's implementation demonstrates that KubeBlocks can manage stateful services at the scale of more than 10,000 instances per cluster while maintaining production-grade reliability. The solution addresses the main challenges of containerizing Redis and provides a foundation for continued cloud-native evolution and operational cost savings.
This collaboration showcases the potential for standardized stateful service management in cloud-native environments, contributing valuable patterns and capabilities to the broader Kubernetes ecosystem.
