- Altinity ClickHouse Operator: Enterprise-grade ClickHouse management for Kubernetes
- ClickHouse Keeper: Distributed coordination service (replaces ZooKeeper)
- ClickHouse Cluster: High-availability database cluster for trace storage
- S3-Compatible Storage: Object storage for ClickHouse data persistence
For a detailed reference architecture, see W&B Self-Managed Reference Architecture.
Important Notes
Configuration Customization Required: The configurations provided in this guide (including security contexts, pod anti-affinity rules, resource allocations, naming conventions, and storage classes) are reference examples only.
- Security & Compliance: Adjust security contexts, runAsUser/fsGroup values, and other security settings according to your organization’s security policies and Kubernetes/OpenShift requirements.
- Resource Sizing: The resource allocations shown are starting points. Consult with your W&B Solutions Architect team for proper sizing based on your expected trace volume and performance requirements.
- Infrastructure Specifics: Update storage classes, node selectors, and other infrastructure-specific settings to match your environment.
Architecture
Prerequisites
Required Resources
- Kubernetes Cluster: Version 1.24+
- Kubernetes Nodes: Multi-node cluster (minimum 3 nodes recommended for high availability)
- Storage Class: A working StorageClass for persistent volumes (e.g., gp3, standard, nfs-csi)
- S3 Bucket: Pre-configured S3 or S3-compatible bucket with appropriate access permissions
- W&B Platform: Already installed and running (see W&B Self-Managed Deployment Guide)
- W&B License: Weave-enabled license from W&B Support
Resource Sizing: Do not make sizing decisions based on this prerequisites list alone. See the detailed Resource Requirements section below for proper cluster sizing guidance. Resource needs vary significantly based on trace volume and usage patterns.
Required Tools
- kubectl configured with cluster access
- helm v3.0+
- AWS credentials (if using S3) or access to S3-compatible storage
Network Requirements
- Pods in the clickhouse namespace must communicate with pods in the wandb namespace
- ClickHouse nodes must communicate with each other on ports 8123, 9000, 9009, and 2181
Deployment Steps
Step 1: Deploy Altinity ClickHouse Operator
The Altinity ClickHouse Operator manages ClickHouse installations in Kubernetes.
1.1 Add the Altinity Helm repository
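A typical invocation, assuming Altinity's published chart repository (verify the URL against current Altinity documentation):

```bash
# Add Altinity's chart repository and refresh the local chart index.
helm repo add clickhouse-operator https://docs.altinity.com/clickhouse-operator/
helm repo update
```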
1.2 Create the ClickHouse namespace
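For example:

```bash
kubectl create namespace clickhouse
```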
1.3 Install the ClickHouse Operator
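A minimal install, assuming the chart repository added above; the release name and namespace are examples:

```bash
helm install clickhouse-operator clickhouse-operator/altinity-clickhouse-operator \
  --namespace clickhouse
```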
1.4 Verify the operator is running
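For example:

```bash
# The operator pod should be Running, and the ClickHouse CRDs registered.
kubectl get pods -n clickhouse
kubectl get crd | grep clickhouse
```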
Step 2: Configure S3 Storage
ClickHouse uses S3 for persistent storage. Create a Kubernetes secret with your S3 credentials.
Option A: Using AWS S3
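A sketch using kubectl; the secret name clickhouse-s3-credentials is an example and must match whatever your ClickHouse configuration references:

```bash
kubectl create secret generic clickhouse-s3-credentials \
  --namespace clickhouse \
  --from-literal=AWS_ACCESS_KEY_ID=<your-access-key-id> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
```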
Option B: Using MinIO or S3-compatible storage
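The same pattern works for MinIO or other S3-compatible stores; the endpoint key and hostname below are illustrative:

```bash
kubectl create secret generic clickhouse-s3-credentials \
  --namespace clickhouse \
  --from-literal=AWS_ACCESS_KEY_ID=<minio-access-key> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<minio-secret-key> \
  --from-literal=S3_ENDPOINT=https://minio.example.internal:9000
```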
Step 3: Deploy ClickHouse Cluster
3.1 Create the ClickHouse configuration
Save the following as clickhouse-cluster.yaml (an abridged example appears after the notes below):
Important Configuration Updates Required: Before applying this configuration:
- S3 Endpoint: Replace YOUR_BUCKET_NAME with your actual S3 bucket name
- Storage Class: Update storageClassName to match your cluster's storage class
- Resource Allocations: Adjust CPU and memory based on your expected workload
- Security Context: Modify runAsUser, runAsGroup, and fsGroup values according to your security policies
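The abridged sketch below shows the overall shape of a ClickHouseInstallation with S3-backed storage. The names (weave, clickhouse-pod), the Keeper service hostname, the image tag, and all sizing values are examples; the complete example configuration referenced in the Appendix should take precedence.

```yaml
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: weave
  namespace: clickhouse
spec:
  configuration:
    zookeeper:
      nodes:
        # Assumes the Keeper service created in step 3.2.
        - host: clickhouse-keeper.clickhouse.svc.cluster.local
          port: 2181
    clusters:
      - name: weave
        layout:
          shardsCount: 1
          replicasCount: 3
    files:
      # S3-backed storage; replace YOUR_BUCKET_NAME and the endpoint region.
      config.d/storage.xml: |
        <clickhouse>
          <storage_configuration>
            <disks>
              <s3_disk>
                <type>s3</type>
                <endpoint>https://s3.us-east-1.amazonaws.com/YOUR_BUCKET_NAME/clickhouse/</endpoint>
                <!-- Reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the pod environment -->
                <use_environment_credentials>true</use_environment_credentials>
              </s3_disk>
            </disks>
            <policies>
              <s3_policy>
                <volumes>
                  <main><disk>s3_disk</disk></main>
                </volumes>
              </s3_policy>
            </policies>
          </storage_configuration>
          <!-- Make S3 the default policy for new MergeTree tables -->
          <merge_tree>
            <storage_policy>s3_policy</storage_policy>
          </merge_tree>
        </clickhouse>
  defaults:
    templates:
      podTemplate: clickhouse-pod
      dataVolumeClaimTemplate: data
  templates:
    podTemplates:
      - name: clickhouse-pod
        spec:
          securityContext:
            runAsUser: 101   # example values; align with your security policies
            runAsGroup: 101
            fsGroup: 101
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:24.8
              envFrom:
                - secretRef:
                    name: clickhouse-s3-credentials  # from step 2
              resources:
                requests:
                  cpu: "2"
                  memory: 8Gi
                limits:
                  cpu: "4"
                  memory: 16Gi
    volumeClaimTemplates:
      - name: data
        spec:
          storageClassName: gp3  # update to your cluster's storage class
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 100Gi
```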
3.2 Deploy ClickHouse Keeper
ClickHouse Keeper provides distributed coordination (replaces ZooKeeper). Save the following as clickhouse-keeper.yaml:
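A minimal sketch, assuming a recent operator release that provides the ClickHouseKeeperInstallation CRD; confirm the schema (including volume claim template naming) for your operator version:

```yaml
apiVersion: clickhouse-keeper.altinity.com/v1
kind: ClickHouseKeeperInstallation
metadata:
  name: clickhouse-keeper
  namespace: clickhouse
spec:
  configuration:
    clusters:
      - name: keeper
        layout:
          replicasCount: 3
  templates:
    volumeClaimTemplates:
      - name: default
        spec:
          storageClassName: gp3  # update to your cluster's storage class
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 10Gi
```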
3.3 Apply the configurations
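For example:

```bash
# Deploy Keeper first so coordination is available when ClickHouse starts.
kubectl apply -f clickhouse-keeper.yaml
kubectl apply -f clickhouse-cluster.yaml
# Wait for all pods to reach Running.
kubectl get pods -n clickhouse -w
```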
Step 4: Configure W&B Platform
Update your W&B Platform configuration to connect to ClickHouse.
4.1 Update the W&B Custom Resource
Edit your W&B Platform Custom Resource (CR):
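The sketch below shows the general shape of the change; the exact value keys and the ClickHouse service hostname depend on your deployment, so confirm them with your W&B Solutions Architect:

```yaml
apiVersion: apps.wandb.com/v1
kind: WeightsAndBiases
metadata:
  name: wandb
  namespace: wandb
spec:
  values:
    global:
      clickhouse:
        # Service name assumes the ClickHouseInstallation named "weave" from step 3.
        host: clickhouse-weave.clickhouse.svc.cluster.local
        port: 8123
        user: <clickhouse-user>
        password: <clickhouse-password>
        database: wandb_weave
    weave-trace:
      install: true
```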
4.2 Apply the configuration
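For example, if the CR is saved locally:

```bash
kubectl apply -f wandb-cr.yaml
```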
4.3 Verify Weave is running
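Pod names vary by deployment; a simple check is:

```bash
kubectl get pods -n wandb | grep -i weave
```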
Step 5: Initialize the Database
The Weave service automatically creates the required database and tables on first startup. You can verify this:
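For example, using the operator's pod naming convention for the installation above (chi-&lt;installation&gt;-&lt;cluster&gt;-&lt;shard&gt;-&lt;replica&gt;-0) and assuming the default database name wandb_weave:

```bash
kubectl exec -n clickhouse chi-weave-weave-0-0-0 -- \
  clickhouse-client --query "SHOW DATABASES"
kubectl exec -n clickhouse chi-weave-weave-0-0-0 -- \
  clickhouse-client --query "SHOW TABLES FROM wandb_weave"
```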
Resource Requirements
ClickHouse Cluster Sizing
Resource requirements vary based on trace volume and retention period. Here are recommended starting points:
Small (< 1M traces/day)
- Nodes: 3 replicas
- CPU: 2-4 cores per node
- Memory: 8-16 GB per node
- Storage: 100-200 GB per node
Medium (1-10M traces/day)
- Nodes: 3 replicas
- CPU: 4-8 cores per node
- Memory: 16-32 GB per node
- Storage: 500 GB - 1 TB per node
Large (> 10M traces/day)
- Nodes: 3+ replicas (consider sharding)
- CPU: 8-16 cores per node
- Memory: 32-64 GB per node
- Storage: 1-2 TB per node
Contact your W&B Solutions Architect team for assistance with sizing for your specific use case. Factors to consider include:
- Expected trace volume
- Trace complexity and size
- Query patterns and frequency
- Retention requirements
ClickHouse Keeper Sizing
Keeper has minimal resource requirements:
- CPU: 0.5-1 core per node
- Memory: 1-2 GB per node
- Storage: 10-20 GB per node
Monitoring and Maintenance
Health Checks
Check ClickHouse cluster status
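For example (the pod name follows the convention noted in step 5):

```bash
kubectl get clickhouseinstallations -n clickhouse
kubectl exec -n clickhouse chi-weave-weave-0-0-0 -- clickhouse-client --query \
  "SELECT host_name, errors_count FROM system.clusters WHERE cluster = 'weave'"
```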
Check Keeper status
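Keeper answers ZooKeeper-style four-letter-word commands on its client port; this sketch assumes nc is available in the Keeper image:

```bash
# Replace <keeper-pod> with one of your Keeper pod names; "imok" means healthy.
kubectl exec -n clickhouse <keeper-pod> -- bash -c 'echo ruok | nc localhost 2181'
```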
Backup and Recovery
S3-based backups
Since data is stored in S3, backups can be managed at the S3 level:
- Point-in-time recovery: Use S3 versioning
- Cross-region backups: Configure S3 replication
- Snapshot backups: Use S3 lifecycle policies
ClickHouse native backups
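Recent ClickHouse releases include a native BACKUP command that can write directly to S3; the bucket, prefix, and credentials below are placeholders:

```bash
kubectl exec -n clickhouse chi-weave-weave-0-0-0 -- clickhouse-client --query \
  "BACKUP DATABASE wandb_weave TO S3('https://s3.us-east-1.amazonaws.com/YOUR_BUCKET_NAME/backups/weave', '<key-id>', '<secret>')"
```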
Scaling
Vertical scaling (increasing resources)
- Update the resource requests/limits in your ClickHouse configuration
- Apply the changes: kubectl apply -f clickhouse-cluster.yaml
- The operator will perform a rolling update
Horizontal scaling (adding replicas)
- Increase replicasCount in your ClickHouse configuration
- Apply the changes: kubectl apply -f clickhouse-cluster.yaml
- The operator will add new replicas and rebalance data
Adding shards
For very high volume deployments:
- Increase shardsCount in your ClickHouse configuration
- Apply the changes and migrate data as needed
Troubleshooting
Common Issues
ClickHouse pods not starting
Check the pod's events for scheduling and startup failures (see the commands after this list). Common causes include:
- Insufficient resources
- Storage class not available
- S3 credentials incorrect
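For example:

```bash
kubectl get events -n clickhouse --sort-by=.lastTimestamp
kubectl describe pod <pod-name> -n clickhouse
```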
Connection refused from W&B Platform
Verify network connectivity from the wandb namespace to the ClickHouse service; a quick probe is shown below.
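One option, assuming the service name from step 3, is to hit ClickHouse's /ping endpoint from a throwaway pod in the wandb namespace:

```bash
# A healthy server responds with "Ok."
kubectl run -n wandb curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://clickhouse-weave.clickhouse.svc.cluster.local:8123/ping
```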
Slow query performance
Check ClickHouse query metrics (see the query below). Remedies include:
- Increasing memory allocation
- Optimizing table settings
- Adding more replicas
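To find the slowest recent queries via the query log (enabled by default):

```bash
kubectl exec -n clickhouse chi-weave-weave-0-0-0 -- clickhouse-client --query \
  "SELECT query_duration_ms, query FROM system.query_log WHERE type = 'QueryFinish' ORDER BY query_duration_ms DESC LIMIT 10"
```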
Logs
ClickHouse logs
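For example:

```bash
kubectl logs -n clickhouse chi-weave-weave-0-0-0 --tail=100
```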
Keeper logs
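For example:

```bash
kubectl logs -n clickhouse <keeper-pod> --tail=100
```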
Operator logs
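The label below assumes the Helm chart's standard labeling; if it differs, locate the operator pod with kubectl get pods:

```bash
kubectl logs -n clickhouse -l app.kubernetes.io/name=altinity-clickhouse-operator --tail=100
```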
Security Considerations
Network Security
- Network Policies: Implement Kubernetes NetworkPolicies to restrict traffic between namespaces (see the sketch after this list)
- TLS/SSL: Configure ClickHouse to use TLS for inter-node communication
- Service Mesh: Consider using Istio or Linkerd for additional security
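As an illustration of the namespace restriction, the sketch below allows intra-namespace ClickHouse traffic plus access from the wandb namespace on the client ports only (namespace labels assume Kubernetes' automatic kubernetes.io/metadata.name label):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-wandb-to-clickhouse
  namespace: clickhouse
spec:
  podSelector: {}
  policyTypes: [Ingress]
  ingress:
    # Intra-namespace traffic (replication, Keeper) on any port.
    - from:
        - podSelector: {}
    # W&B Platform access on the HTTP and native protocol ports only.
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: wandb
      ports:
        - protocol: TCP
          port: 8123
        - protocol: TCP
          port: 9000
```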
Access Control
- ClickHouse Users: Create dedicated users with minimal permissions
- RBAC: Implement Kubernetes RBAC for operator and pod access
- Secrets Management: Use external secret managers (Vault, AWS Secrets Manager)
Data Protection
- Encryption at Rest: Enable S3 bucket encryption
- Encryption in Transit: Use HTTPS endpoints for S3
- Audit Logging: Enable ClickHouse query logging and Kubernetes audit logs
Advanced Configuration
Custom ClickHouse Settings
Add custom settings to the ClickHouseInstallation's configuration block:
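In the Altinity operator's schema, server-level settings live under spec.configuration.settings; the values here are illustrative:

```yaml
spec:
  configuration:
    settings:
      max_concurrent_queries: "200"
      keep_alive_timeout: "30"
```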
Performance Tuning
Optimize for Weave workloads:
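Per-query limits can be set through profiles; the figures below are starting points, not recommendations:

```yaml
spec:
  configuration:
    profiles:
      default/max_memory_usage: "16000000000"  # ~16 GB per-query cap
      default/max_threads: "8"
```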
Multi-Region Deployment
For global deployments:
- Deploy ClickHouse clusters in multiple regions
- Configure cross-region replication
- Use geo-distributed S3 buckets
- Implement query routing based on region
Support
For assistance with your Weave self-managed deployment:
- Documentation: W&B Documentation
- Community: W&B Community Forum
- Enterprise Support: Contact your W&B Solutions Architect or email support@wandb.com
Appendix
Complete Example Configuration
A complete example configuration is available in the W&B Examples Repository.
Migration from Previous Versions
If migrating from an earlier Weave deployment:
- Back up your existing data
- Deploy the new ClickHouse cluster
- Migrate data using ClickHouse tools
- Update W&B Platform configuration
- Verify data integrity
Integration with Existing ClickHouse
If you have an existing ClickHouse deployment:
- Ensure ClickHouse version compatibility (24.8+)
- Create a dedicated database for Weave
- Configure appropriate user permissions
- Update W&B Platform to point to existing cluster