Self-hosting W&B Weave gives you more control over its environment and configuration, which can be helpful for use cases that require isolation or additional security compliance. This guide explains how to deploy all the components required to run W&B Weave in a self-managed environment using the Altinity ClickHouse Operator. Self-managed Weave deployments rely on ClickHouse for their backend. This deployment uses:
  • Altinity ClickHouse Operator: Enterprise-grade ClickHouse management for Kubernetes
  • ClickHouse Keeper: Distributed coordination service (replaces ZooKeeper)
  • ClickHouse Cluster: High-availability database cluster for trace storage
  • S3-Compatible Storage: Object storage for ClickHouse data persistence
For a detailed reference architecture, see W&B Self-Managed Reference Architecture.

Important Notes

Configuration Customization Required
The configurations provided in this guide (including security contexts, pod anti-affinity rules, resource allocations, naming conventions, and storage classes) are reference examples only.
  • Security & Compliance: Adjust security contexts, runAsUser/fsGroup values, and other security settings according to your organization’s security policies and Kubernetes/OpenShift requirements.
  • Resource Sizing: The resource allocations shown are starting points. Consult with your W&B Solutions Architect team for proper sizing based on your expected trace volume and performance requirements.
  • Infrastructure Specifics: Update storage classes, node selectors, and other infrastructure-specific settings to match your environment.
Each organization’s Kubernetes environment is unique - treat these configurations as templates to be adapted, not prescriptive solutions.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    W&B Platform (wandb)                     │
│  ┌────────────┐  ┌────────────┐  ┌────────────────────┐     │
│  │ weave-trace│──│  app/api   │──│  console/parquet   │     │
│  └──────┬─────┘  └────────────┘  └────────────────────┘     │
└─────────┼───────────────────────────────────────────────────┘
          │
          │
┌─────────────────────────────────────────────────────────────┐
│              ClickHouse Cluster (clickhouse)                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ ch-server-0  │  │ ch-server-1  │  │ ch-server-2  │       │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘       │
│         │                 │                 │               │
│         └─────────────────┼─────────────────┘               │
│                           │                                 │
│  ┌──────────────────────────────────────────────────────┐   │
│  │           ClickHouse Keeper Cluster                  │   │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐            │   │
│  │  │ keeper-0 │  │ keeper-1 │  │ keeper-2 │            │   │
│  │  └──────────┘  └──────────┘  └──────────┘            │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────┬───────────────────────────────────────────────────┘
          │
          │
      ┌───────────────┐
      │  S3 Storage   │
      │  (AWS/MinIO)  │
      └───────────────┘

Prerequisites

Required Resources

  • Kubernetes Cluster: Version 1.24+
  • Kubernetes Nodes: Multi-node cluster (minimum 3 nodes recommended for high availability)
  • Storage Class: A working StorageClass for persistent volumes (e.g., gp3, standard, nfs-csi)
  • S3 Bucket: Pre-configured S3 or S3-compatible bucket with appropriate access permissions
  • W&B Platform: Already installed and running (see W&B Self-Managed Deployment Guide)
  • W&B License: Weave-enabled license from W&B Support
Resource Sizing
Do not make sizing decisions based on this prerequisites list alone. See the detailed Resource Requirements section below for proper cluster sizing guidance. Resource needs vary significantly based on trace volume and usage patterns.

Required Tools

  • kubectl configured with cluster access
  • helm v3.0+
  • AWS credentials (if using S3) or access to S3-compatible storage

Network Requirements

  • Pods in the clickhouse namespace must communicate with pods in the wandb namespace
  • ClickHouse nodes must communicate with each other on ports 8123, 9000, 9009, and 2181

Deployment Steps

Step 1: Deploy Altinity ClickHouse Operator

The Altinity ClickHouse Operator manages ClickHouse installations in Kubernetes.

1.1 Add the Altinity Helm repository

helm repo add altinity https://helm.altinity.com
helm repo update

1.2 Create the ClickHouse namespace

kubectl create namespace clickhouse

1.3 Install the ClickHouse Operator

helm install clickhouse-operator altinity/altinity-clickhouse-operator \
  --namespace clickhouse \
  --version 0.24.0

1.4 Verify the operator is running

kubectl get pods -n clickhouse
Expected output:
NAME                                      READY   STATUS    RESTARTS   AGE
clickhouse-operator-857c69ffc6-2v4jh     2/2     Running   0          1m
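You can also confirm that the operator registered its CustomResourceDefinitions; the exact CRD list varies slightly by operator version, but CRDs for the ClickHouseInstallation and ClickHouseKeeperInstallation kinds used later in this guide should be present:
# List the Altinity CRDs registered by the operator
kubectl get crd | grep altinity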

Step 2: Configure S3 Storage

ClickHouse uses S3 for persistent storage. Create a Kubernetes secret with your S3 credentials.

Option A: Using AWS S3

kubectl create secret generic clickhouse-s3-credentials \
  --namespace clickhouse \
  --from-literal=access_key_id=YOUR_AWS_ACCESS_KEY_ID \
  --from-literal=secret_access_key=YOUR_AWS_SECRET_ACCESS_KEY

Option B: Using MinIO or S3-compatible storage

kubectl create secret generic clickhouse-s3-credentials \
  --namespace clickhouse \
  --from-literal=access_key_id=YOUR_MINIO_ACCESS_KEY \
  --from-literal=secret_access_key=YOUR_MINIO_SECRET_KEY
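Whichever option you choose, confirm the secret exists and contains both keys before continuing. kubectl describe prints the key names and sizes without revealing the values:
# Verify the secret contains access_key_id and secret_access_key
kubectl describe secret clickhouse-s3-credentials -n clickhouse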

Step 3: Deploy ClickHouse Cluster

3.1 Create the ClickHouse configuration

Save the following as clickhouse-cluster.yaml:
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "weave-clickhouse"
  namespace: "clickhouse"
spec:
  defaults:
    templates:
      podTemplate: clickhouse-pod-template
      dataVolumeClaimTemplate: data-volume-template
      serviceTemplate: service-template
      
  configuration:
    settings:
      # Logging configuration
      logger/level: "information"
      
    clusters:
      - name: "weave_cluster"
        templates:
          clusterServiceTemplate: cluster-service-template
        layout:
          shardsCount: 1
          replicasCount: 3
    
    zookeeper:
      nodes:
        - host: weave-clickhouse-keeper-0.clickhouse-keeper-headless
          port: 2181
        - host: weave-clickhouse-keeper-1.clickhouse-keeper-headless
          port: 2181
        - host: weave-clickhouse-keeper-2.clickhouse-keeper-headless
          port: 2181
    
    files:
      config.d/storage.xml: |
        <clickhouse>
          <storage_configuration>
            <disks>
              <default>
                <keep_free_space_bytes>10485760</keep_free_space_bytes>
              </default>
              <s3_disk>
                <type>s3</type>
                <endpoint>https://s3.us-east-1.amazonaws.com/YOUR_BUCKET_NAME/clickhouse/{replica}</endpoint>
                <access_key_id from_env="AWS_ACCESS_KEY_ID"/>
                <secret_access_key from_env="AWS_SECRET_ACCESS_KEY"/>
                <metadata_path>/var/lib/clickhouse/disks/s3_disk/</metadata_path>
                <cache_enabled>true</cache_enabled>
                <cache_path>/var/lib/clickhouse/disks/s3_cache/</cache_path>
                <max_cache_size>10737418240</max_cache_size>
              </s3_disk>
            </disks>
            <policies>
              <s3_main>
                <volumes>
                  <main>
                    <disk>s3_disk</disk>
                  </main>
                </volumes>
              </s3_main>
            </policies>
          </storage_configuration>
          
          <merge_tree>
            <storage_policy>s3_main</storage_policy>
          </merge_tree>
        </clickhouse>

  templates:
    serviceTemplates:
      - name: service-template
        spec:
          type: ClusterIP
          ports:
            - name: http
              port: 8123
            - name: tcp
              port: 9000
      
      - name: cluster-service-template
        spec:
          type: ClusterIP
          ports:
            - name: http
              port: 8123
            - name: tcp
              port: 9000
    
    podTemplates:
      - name: clickhouse-pod-template
        spec:
          securityContext:
            runAsUser: 101
            runAsGroup: 101
            fsGroup: 101
          
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:24.8.5.115.altinitystable
              
              env:
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: clickhouse-s3-credentials
                      key: access_key_id
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: clickhouse-s3-credentials
                      key: secret_access_key
              
              resources:
                requests:
                  memory: "8Gi"
                  cpu: "2"
                limits:
                  memory: "32Gi"
                  cpu: "8"
              
              volumeMounts:
                - name: data
                  mountPath: /var/lib/clickhouse
          
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  podAffinityTerm:
                    labelSelector:
                      matchExpressions:
                        - key: "clickhouse.altinity.com/chi"
                          operator: In
                          values: ["weave-clickhouse"]
                    topologyKey: kubernetes.io/hostname
    
    volumeClaimTemplates:
      - name: data-volume-template
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi
          # Update this to match your storage class
          storageClassName: gp3
Important Configuration Updates Required
Before applying this configuration:
  1. S3 Endpoint: Replace YOUR_BUCKET_NAME with your actual S3 bucket name
  2. Storage Class: Update storageClassName to match your cluster’s storage class
  3. Resource Allocations: Adjust CPU and memory based on your expected workload
  4. Security Context: Modify runAsUser, runAsGroup, and fsGroup values according to your security policies
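After making these updates, a server-side dry run is a low-risk way to catch schema or field-name mistakes before the real apply. This assumes the operator CRDs from Step 1 are already installed:
# Validate the manifest against the ClickHouseInstallation CRD without creating anything
kubectl apply --dry-run=server -f clickhouse-cluster.yaml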

3.2 Deploy ClickHouse Keeper

ClickHouse Keeper provides distributed coordination (replaces ZooKeeper). Save the following as clickhouse-keeper.yaml:
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseKeeperInstallation"
metadata:
  name: "weave-clickhouse-keeper"
  namespace: "clickhouse"
spec:
  replicas: 3
  
  configuration:
    settings:
      logger/level: "information"
      keeper_server/storage_path: /var/lib/clickhouse-keeper
      keeper_server/tcp_port: 2181
      keeper_server/four_letter_word_white_list: "*"
      keeper_server/coordination_settings/raft_logs_level: "information"
      keeper_server/raft_configuration/server:
        - id: 1
          hostname: weave-clickhouse-keeper-0.clickhouse-keeper-headless
          port: 9444
        - id: 2
          hostname: weave-clickhouse-keeper-1.clickhouse-keeper-headless
          port: 9444
        - id: 3
          hostname: weave-clickhouse-keeper-2.clickhouse-keeper-headless
          port: 9444
  
  templates:
    podTemplate:
      metadata:
        labels:
          app: clickhouse-keeper
      spec:
        securityContext:
          runAsUser: 101
          runAsGroup: 101
          fsGroup: 101
        
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    app: clickhouse-keeper
                topologyKey: kubernetes.io/hostname
        
        containers:
          - name: clickhouse-keeper
            image: altinity/clickhouse-keeper:24.8.5.115.altinitystable
            
            resources:
              requests:
                memory: "1Gi"
                cpu: "500m"
              limits:
                memory: "2Gi"
                cpu: "1"
            
            volumeMounts:
              - name: data
                mountPath: /var/lib/clickhouse-keeper
    
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        # Update this to match your storage class
        storageClassName: gp3

3.3 Apply the configurations

# Deploy ClickHouse Keeper first
kubectl apply -f clickhouse-keeper.yaml

# Wait for Keeper pods to be ready
kubectl wait --for=condition=ready pod -l app=clickhouse-keeper -n clickhouse --timeout=300s

# Deploy ClickHouse cluster
kubectl apply -f clickhouse-cluster.yaml

# Monitor the deployment
kubectl get pods -n clickhouse -w
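Beyond watching pods, the operator reports reconciliation progress on the ClickHouseInstallation resource itself; once all replicas are reconciled, its status should eventually read Completed:
# Check the installation status reported by the operator
kubectl get clickhouseinstallations -n clickhouse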

Step 4: Configure W&B Platform

Update your W&B Platform configuration to connect to ClickHouse.

4.1 Update the W&B Custom Resource

Edit your W&B Platform Custom Resource (CR):
apiVersion: apps.wandb.com/v1
kind: WeightsAndBiases
metadata:
  name: wandb
  namespace: wandb
spec:
  values:
    global:
      # ... existing configuration ...
      
      clickhouse:
        host: weave-clickhouse.clickhouse.svc.cluster.local
        port: 8123
        database: wandb_weave
        user: default
        password: ""  # Empty for default user, or set a password
        
      weave-trace:
        enabled: true
    
    weave-trace:
      install: true
      extraEnv:
        WF_CLICKHOUSE_REPLICATED: "true"
        WF_CLICKHOUSE_REPLICATED_CLUSTER: "weave_cluster"

4.2 Apply the configuration

kubectl apply -f wandb-cr.yaml

4.3 Verify Weave is running

kubectl get pods -n wandb | grep weave-trace
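If the pod is not becoming ready, its startup logs usually show whether it can reach ClickHouse. Substitute the pod name returned by the previous command:
# Tail the weave-trace pod logs to confirm the ClickHouse connection succeeded
kubectl logs <weave-trace-pod-name> -n wandb --tail=50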

Step 5: Initialize the Database

The Weave service will automatically create the required database and tables on first startup. You can verify this:
# Connect to ClickHouse
kubectl exec -it weave-clickhouse-0-0-0 -n clickhouse -- clickhouse-client

# In the ClickHouse client, verify the database exists
SHOW DATABASES;

# Check for Weave tables
USE wandb_weave;
SHOW TABLES;
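Because this is a replicated cluster, it is also worth confirming replica health once tables exist. A minimal check from the same clickhouse-client session, using ClickHouse's built-in system.replicas table:
-- Replicas should report is_readonly = 0 and an absolute_delay near 0
SELECT database, table, is_readonly, absolute_delay
FROM system.replicas
WHERE database = 'wandb_weave';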

Resource Requirements

ClickHouse Cluster Sizing

Resource requirements vary based on trace volume and retention period. Here are recommended starting points:

Small (< 1M traces/day)

  • Nodes: 3 replicas
  • CPU: 2-4 cores per node
  • Memory: 8-16 GB per node
  • Storage: 100-200 GB per node

Medium (1-10M traces/day)

  • Nodes: 3 replicas
  • CPU: 4-8 cores per node
  • Memory: 16-32 GB per node
  • Storage: 500 GB - 1 TB per node

Large (> 10M traces/day)

  • Nodes: 3+ replicas (consider sharding)
  • CPU: 8-16 cores per node
  • Memory: 32-64 GB per node
  • Storage: 1-2 TB per node
Contact your W&B Solutions Architect team for assistance with sizing for your specific use case. Factors to consider include:
  • Expected trace volume
  • Trace complexity and size
  • Query patterns and frequency
  • Retention requirements

ClickHouse Keeper Sizing

Keeper has minimal resource requirements:
  • CPU: 0.5-1 core per node
  • Memory: 1-2 GB per node
  • Storage: 10-20 GB per node

Monitoring and Maintenance

Health Checks

Check ClickHouse cluster status

kubectl exec -it weave-clickhouse-0-0-0 -n clickhouse -- clickhouse-client \
  --query "SELECT * FROM system.clusters WHERE cluster = 'weave_cluster'"

Check Keeper status

kubectl exec -it weave-clickhouse-keeper-0 -n clickhouse -- \
  sh -c 'echo ruok | nc localhost 2181'

Backup and Recovery

S3-based backups

Since data is stored in S3, backups can be managed at the S3 level:
  1. Point-in-time recovery: Use S3 versioning
  2. Cross-region backups: Configure S3 replication
  3. Snapshot backups: Use S3 lifecycle policies
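For example, point-in-time recovery via S3 versioning (item 1) can be enabled with a single AWS CLI call; the bucket name is a placeholder, and MinIO offers the same API through its own client:
# Enable object versioning on the bucket backing ClickHouse storage
aws s3api put-bucket-versioning \
  --bucket YOUR_BUCKET_NAME \
  --versioning-configuration Status=Enabled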

ClickHouse native backups

# Create a backup
# Create a backup (the S3 destination is an HTTPS endpoint URL, not an s3:// URI)
kubectl exec -it weave-clickhouse-0-0-0 -n clickhouse -- clickhouse-client \
  --query "BACKUP DATABASE wandb_weave TO S3('https://YOUR_BUCKET.s3.amazonaws.com/backups/backup_name', 'YOUR_ACCESS_KEY', 'YOUR_SECRET_KEY')"

# Restore from backup
kubectl exec -it weave-clickhouse-0-0-0 -n clickhouse -- clickhouse-client \
  --query "RESTORE DATABASE wandb_weave FROM S3('https://YOUR_BUCKET.s3.amazonaws.com/backups/backup_name', 'YOUR_ACCESS_KEY', 'YOUR_SECRET_KEY')"

Scaling

Vertical scaling (increasing resources)

  1. Update the resource requests/limits in your ClickHouse configuration
  2. Apply the changes: kubectl apply -f clickhouse-cluster.yaml
  3. The operator will perform a rolling update

Horizontal scaling (adding replicas)

  1. Increase replicasCount in your ClickHouse configuration
  2. Apply the changes: kubectl apply -f clickhouse-cluster.yaml
  3. The operator will add new replicas, and ClickHouse replicates existing data to them

Adding shards

For very high volume deployments:
  1. Increase shardsCount in your ClickHouse configuration
  2. Apply the changes and migrate data as needed

Troubleshooting

Common Issues

ClickHouse pods not starting

Check pod events:
kubectl describe pod weave-clickhouse-0-0-0 -n clickhouse
Common causes:
  • Insufficient resources
  • Storage class not available
  • S3 credentials incorrect

Connection refused from W&B Platform

Verify network connectivity:
# From a W&B pod
kubectl exec -it <wandb-pod> -n wandb -- \
  curl -v http://weave-clickhouse.clickhouse.svc.cluster.local:8123/ping

Slow query performance

Check ClickHouse metrics:
kubectl exec -it weave-clickhouse-0-0-0 -n clickhouse -- clickhouse-client \
  --query "SELECT * FROM system.metrics WHERE metric LIKE '%Query%'"
Consider:
  • Increasing memory allocation
  • Optimizing table settings
  • Adding more replicas
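To see which queries are actually slow, ClickHouse's built-in system.query_log is usually the quickest source (query logging is enabled by default in recent ClickHouse releases):
kubectl exec -it weave-clickhouse-0-0-0 -n clickhouse -- clickhouse-client \
  --query "SELECT query_duration_ms, read_rows, substring(query, 1, 120) AS query_head
           FROM system.query_log
           WHERE type = 'QueryFinish'
           ORDER BY query_duration_ms DESC
           LIMIT 10"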

Logs

ClickHouse logs

kubectl logs weave-clickhouse-0-0-0 -n clickhouse

Keeper logs

kubectl logs weave-clickhouse-keeper-0 -n clickhouse

Operator logs

kubectl logs deployment/clickhouse-operator -n clickhouse

Security Considerations

Network Security

  1. Network Policies: Implement Kubernetes NetworkPolicies to restrict traffic between namespaces
  2. TLS/SSL: Configure ClickHouse to use TLS for inter-node communication
  3. Service Mesh: Consider using Istio or Linkerd for additional security
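As a starting point for item 1, a NetworkPolicy of roughly this shape restricts ingress to the ClickHouse server pods to traffic from the wandb namespace, plus intra-namespace replication traffic. The pod selector label matches the one the Altinity operator applies to the installation defined above; verify it against your own pods before relying on it:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-wandb-to-clickhouse
  namespace: clickhouse
spec:
  # Selects the ClickHouse server pods created by the operator
  podSelector:
    matchLabels:
      clickhouse.altinity.com/chi: weave-clickhouse
  policyTypes:
    - Ingress
  ingress:
    # W&B Platform pods: HTTP (8123) and native TCP (9000) interfaces
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: wandb
      ports:
        - protocol: TCP
          port: 8123
        - protocol: TCP
          port: 9000
    # Other pods in the clickhouse namespace: replication and interserver traffic
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: clickhouse
      ports:
        - protocol: TCP
          port: 9000
        - protocol: TCP
          port: 9009
        - protocol: TCP
          port: 8123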

Access Control

  1. ClickHouse Users: Create dedicated users with minimal permissions
  2. RBAC: Implement Kubernetes RBAC for operator and pod access
  3. Secrets Management: Use external secret managers (Vault, AWS Secrets Manager)
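For item 1, a dedicated user can be created directly from clickhouse-client; the username, password, and grant list below are illustrative only, and the exact privileges needed depend on how Weave manages its schema. Remember to update the W&B CR from Step 4 to use the same credentials:
-- Illustrative only: a dedicated Weave user scoped to the wandb_weave database
CREATE USER weave_app IDENTIFIED WITH sha256_password BY 'CHANGE_ME';
GRANT SELECT, INSERT, CREATE TABLE, ALTER, DROP TABLE ON wandb_weave.* TO weave_app;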

Data Protection

  1. Encryption at Rest: Enable S3 bucket encryption
  2. Encryption in Transit: Use HTTPS endpoints for S3
  3. Audit Logging: Enable ClickHouse query logging and Kubernetes audit logs

Advanced Configuration

Custom ClickHouse Settings

Add custom settings to the ClickHouse configuration:
configuration:
  settings:
    max_concurrent_queries: 100
    max_memory_usage: 10737418240  # 10GB
    max_memory_usage_for_user: 10737418240
    max_execution_time: 300  # 5 minutes

Performance Tuning

Optimize for Weave workloads:
configuration:
  settings:
    # Merge tree settings
    merge_tree/max_bytes_to_merge_at_max_space_in_pool: 161061273600
    merge_tree/max_bytes_to_merge_at_min_space_in_pool: 1048576
    
    # Query cache
    query_cache_max_size_in_bytes: 1073741824  # 1GB
    query_cache_max_entries: 1024
    
    # Background operations
    background_pool_size: 16
    background_schedule_pool_size: 16

Multi-Region Deployment

For global deployments:
  1. Deploy ClickHouse clusters in multiple regions
  2. Configure cross-region replication
  3. Use geo-distributed S3 buckets
  4. Implement query routing based on region

Support

For assistance with your Weave self-managed deployment:
  1. Documentation: W&B Documentation
  2. Community: W&B Community Forum
  3. Enterprise Support: Contact your W&B Solutions Architect or email support@wandb.com

Appendix

Complete Example Configuration

A complete example configuration is available in the W&B Examples Repository.

Migration from Previous Versions

If migrating from an earlier Weave deployment:
  1. Back up your existing data
  2. Deploy the new ClickHouse cluster
  3. Migrate data using ClickHouse tools
  4. Update W&B Platform configuration
  5. Verify data integrity
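For step 3, ClickHouse's remote() table function is one common way to copy tables between clusters; the host, table name, and credentials below are placeholders:
-- Placeholder example: copy a single table from the old cluster into the new one
INSERT INTO wandb_weave.<table_name>
SELECT * FROM remote('old-clickhouse-host:9000', 'wandb_weave', '<table_name>', 'default', '<password>');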

Integration with Existing ClickHouse

If you have an existing ClickHouse deployment:
  1. Ensure ClickHouse version compatibility (24.8+)
  2. Create a dedicated database for Weave
  3. Configure appropriate user permissions
  4. Update W&B Platform to point to existing cluster
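For step 4, the change amounts to pointing the global.clickhouse block from Step 4.1 at the existing endpoint; the host and credentials below are placeholders:
spec:
  values:
    global:
      clickhouse:
        # Placeholder: your existing ClickHouse endpoint and the dedicated Weave user
        host: clickhouse.example.internal
        port: 8123
        database: wandb_weave
        user: weave_app
        password: "CHANGE_ME"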