Self-hosting W&B Weave gives you more control over its environment and configuration, which can be helpful for use cases that require isolation or additional security compliance. This guide explains how to deploy all the components required to run W&B Weave in a self-managed environment using the Altinity ClickHouse Operator. Self-managed Weave deployments rely on ClickHouse for their backend. This deployment uses:
  • Altinity ClickHouse Operator: Enterprise-grade ClickHouse management for Kubernetes
  • ClickHouse Keeper: Distributed coordination service (replaces ZooKeeper)
  • ClickHouse Cluster: High-availability database cluster for trace storage
  • S3-Compatible Storage: Object storage for ClickHouse data persistence
For a detailed reference architecture, see W&B Self-Managed Reference Architecture.

Important Notes

Configuration Customization Required
The configurations provided in this guide (including security contexts, pod anti-affinity rules, resource allocations, naming conventions, and storage classes) are reference examples only.
  • Security & Compliance: Adjust security contexts, runAsUser/fsGroup values, and other security settings according to your organization’s security policies and Kubernetes/OpenShift requirements.
  • Resource Sizing: The resource allocations shown are starting points. Consult with your W&B Solutions Architect team for proper sizing based on your expected trace volume and performance requirements.
  • Infrastructure Specifics: Update storage classes, node selectors, and other infrastructure-specific settings to match your environment.
Each organization’s Kubernetes environment is unique - treat these configurations as templates to be adapted, not prescriptive solutions.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    W&B Platform (wandb)                     │
│  ┌────────────┐  ┌────────────┐  ┌────────────────────┐     │
│  │ weave-trace│──│  app/api   │──│  console/parquet   │     │
│  └──────┬─────┘  └────────────┘  └────────────────────┘     │
└─────────┼───────────────────────────────────────────────────┘
          │
          │
┌─────────────────────────────────────────────────────────────┐
│              ClickHouse Cluster (clickhouse)                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ ch-server-0  │  │ ch-server-1  │  │ ch-server-2  │       │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘       │
│         │                 │                 │               │
│         └─────────────────┼─────────────────┘               │
│                           │                                 │
│  ┌──────────────────────────────────────────────────────┐   │
│  │           ClickHouse Keeper Cluster                  │   │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐            │   │
│  │  │ keeper-0 │  │ keeper-1 │  │ keeper-2 │            │   │
│  │  └──────────┘  └──────────┘  └──────────┘            │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────┬───────────────────────────────────────────────────┘
          │
          │
      ┌───────────────┐
      │  S3 Storage   │
      │  (AWS/MinIO)  │
      └───────────────┘

Prerequisites

Required Resources

  • Kubernetes Cluster: Version 1.24+
  • Kubernetes Nodes: Multi-node cluster (minimum 3 nodes recommended for high availability)
  • Storage Class: A working StorageClass for persistent volumes (e.g., gp3, standard, nfs-csi)
  • S3 Bucket: Pre-configured S3 or S3-compatible bucket with appropriate access permissions
  • W&B Platform: Already installed and running (see W&B Self-Managed Deployment Guide)
  • W&B License: Weave-enabled license from W&B Support
Resource Sizing
Do not make sizing decisions based on this prerequisites list alone. See the detailed Resource Requirements section below for proper cluster sizing guidance. Resource needs vary significantly based on trace volume and usage patterns.

Required Tools

  • kubectl configured with cluster access
  • helm v3.0+
  • AWS credentials (if using S3) or access to S3-compatible storage

Network Requirements

  • Pods in the clickhouse namespace must communicate with pods in the wandb namespace
  • ClickHouse nodes must communicate with each other on ports 8123, 9000, 9009, and 2181

Deployment Steps

Step 1: Deploy Altinity ClickHouse Operator

The Altinity ClickHouse Operator manages ClickHouse installations in Kubernetes.

1.1 Add the Altinity Helm repository

helm repo add altinity https://helm.altinity.com
helm repo update

1.2 Create the ClickHouse namespace

kubectl create namespace clickhouse

1.3 Install the ClickHouse Operator

helm install clickhouse-operator altinity/altinity-clickhouse-operator \
  --namespace clickhouse \
  --version 0.24.0

1.4 Verify the operator is running

kubectl get pods -n clickhouse
Expected output:
NAME                                      READY   STATUS    RESTARTS   AGE
clickhouse-operator-857c69ffc6-2v4jh     2/2     Running   0          1m
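You can also confirm that the operator registered its CustomResourceDefinitions; the exact CRD list varies slightly by operator version, but CRDs for the ClickHouseInstallation and ClickHouseKeeperInstallation kinds used later in this guide should be present:
# List the Altinity CRDs registered by the operator
kubectl get crd | grep altinity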

Step 2: Configure S3 Storage

ClickHouse uses S3 for persistent storage. Create a Kubernetes secret with your S3 credentials.

Option A: Using AWS S3

kubectl create secret generic clickhouse-s3-credentials \
  --namespace clickhouse \
  --from-literal=access_key_id=YOUR_AWS_ACCESS_KEY_ID \
  --from-literal=secret_access_key=YOUR_AWS_SECRET_ACCESS_KEY

Option B: Using MinIO or S3-compatible storage

kubectl create secret generic clickhouse-s3-credentials \
  --namespace clickhouse \
  --from-literal=access_key_id=YOUR_MINIO_ACCESS_KEY \
  --from-literal=secret_access_key=YOUR_MINIO_SECRET_KEY
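Whichever option you choose, confirm the secret exists and contains both keys before continuing. kubectl describe prints the key names and sizes without revealing the values:
# Verify the secret contains access_key_id and secret_access_key
kubectl describe secret clickhouse-s3-credentials -n clickhouse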

Step 3: Deploy ClickHouse Cluster

3.1 Create the ClickHouse configuration

Save the following as clickhouse-cluster.yaml:
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "weave-clickhouse"
  namespace: "clickhouse"
spec:
  defaults:
    templates:
      podTemplate: clickhouse-pod-template
      dataVolumeClaimTemplate: data-volume-template
      serviceTemplate: service-template
      
  configuration:
    settings:
      # Logging configuration
      logger/level: "information"
      
    clusters:
      - name: "weave_cluster"
        templates:
          clusterServiceTemplate: cluster-service-template
        layout:
          shardsCount: 1
          replicasCount: 3
    
    zookeeper:
      nodes:
        - host: weave-clickhouse-keeper-0.clickhouse-keeper-headless
          port: 2181
        - host: weave-clickhouse-keeper-1.clickhouse-keeper-headless
          port: 2181
        - host: weave-clickhouse-keeper-2.clickhouse-keeper-headless
          port: 2181
    
    files:
      config.d/storage.xml: |
        <clickhouse>
          <storage_configuration>
            <disks>
              <default>
                <keep_free_space_bytes>10485760</keep_free_space_bytes>
              </default>
              <s3_disk>
                <type>s3</type>
                <endpoint>https://s3.us-east-1.amazonaws.com/YOUR_BUCKET_NAME/clickhouse/{replica}</endpoint>
                <access_key_id from_env="AWS_ACCESS_KEY_ID"/>
                <secret_access_key from_env="AWS_SECRET_ACCESS_KEY"/>
                <metadata_path>/var/lib/clickhouse/disks/s3_disk/</metadata_path>
                <cache_enabled>true</cache_enabled>
                <cache_path>/var/lib/clickhouse/disks/s3_cache/</cache_path>
                <max_cache_size>10737418240</max_cache_size>
              </s3_disk>
            </disks>
            <policies>
              <s3_main>
                <volumes>
                  <main>
                    <disk>s3_disk</disk>
                  </main>
                </volumes>
              </s3_main>
            </policies>
          </storage_configuration>
          
          <merge_tree>
            <storage_policy>s3_main</storage_policy>
          </merge_tree>
        </clickhouse>

  templates:
    serviceTemplates:
      - name: service-template
        spec:
          type: ClusterIP
          ports:
            - name: http
              port: 8123
            - name: tcp
              port: 9000
      
      - name: cluster-service-template
        spec:
          type: ClusterIP
          ports:
            - name: http
              port: 8123
            - name: tcp
              port: 9000
    
    podTemplates:
      - name: clickhouse-pod-template
        spec:
          securityContext:
            runAsUser: 101
            runAsGroup: 101
            fsGroup: 101
          
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:24.8.5.115.altinitystable
              
              env:
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: clickhouse-s3-credentials
                      key: access_key_id
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: clickhouse-s3-credentials
                      key: secret_access_key
              
              resources:
                requests:
                  memory: "8Gi"
                  cpu: "2"
                limits:
                  memory: "32Gi"
                  cpu: "8"
              
              volumeMounts:
                - name: data
                  mountPath: /var/lib/clickhouse
          
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  podAffinityTerm:
                    labelSelector:
                      matchExpressions:
                        - key: "clickhouse.altinity.com/chi"
                          operator: In
                          values: ["weave-clickhouse"]
                    topologyKey: kubernetes.io/hostname
    
    volumeClaimTemplates:
      - name: data-volume-template
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi
          # Update this to match your storage class
          storageClassName: gp3
Important Configuration Updates Required
Before applying this configuration:
  1. S3 Endpoint: Replace YOUR_BUCKET_NAME with your actual S3 bucket name
  2. Storage Class: Update storageClassName to match your cluster’s storage class
  3. Resource Allocations: Adjust CPU and memory based on your expected workload
  4. Security Context: Modify runAsUser, runAsGroup, and fsGroup values according to your security policies
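After making these updates, a server-side dry run is a low-risk way to catch schema or field-name mistakes before the real apply. This assumes the operator CRDs from Step 1 are already installed:
# Validate the manifest against the ClickHouseInstallation CRD without creating anything
kubectl apply --dry-run=server -f clickhouse-cluster.yaml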

3.2 Deploy ClickHouse Keeper

ClickHouse Keeper provides distributed coordination (replaces ZooKeeper). Save the following as clickhouse-keeper.yaml:
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseKeeperInstallation"
metadata:
  name: "weave-clickhouse-keeper"
  namespace: "clickhouse"
spec:
  replicas: 3
  
  configuration:
    settings:
      logger/level: "information"
      keeper_server/storage_path: /var/lib/clickhouse-keeper
      keeper_server/tcp_port: 2181
      keeper_server/four_letter_word_white_list: "*"
      keeper_server/coordination_settings/raft_logs_level: "information"
      keeper_server/raft_configuration/server:
        - id: 1
          hostname: weave-clickhouse-keeper-0.clickhouse-keeper-headless
          port: 9444
        - id: 2
          hostname: weave-clickhouse-keeper-1.clickhouse-keeper-headless
          port: 9444
        - id: 3
          hostname: weave-clickhouse-keeper-2.clickhouse-keeper-headless
          port: 9444
  
  templates:
    podTemplate:
      metadata:
        labels:
          app: clickhouse-keeper
      spec:
        securityContext:
          runAsUser: 101
          runAsGroup: 101
          fsGroup: 101
        
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    app: clickhouse-keeper
                topologyKey: kubernetes.io/hostname
        
        containers:
          - name: clickhouse-keeper
            image: altinity/clickhouse-keeper:24.8.5.115.altinitystable
            
            resources:
              requests:
                memory: "1Gi"
                cpu: "500m"
              limits:
                memory: "2Gi"
                cpu: "1"
            
            volumeMounts:
              - name: data
                mountPath: /var/lib/clickhouse-keeper
    
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        # Update this to match your storage class
        storageClassName: gp3

3.3 Apply the configurations

# Deploy ClickHouse Keeper first
kubectl apply -f clickhouse-keeper.yaml

# Wait for Keeper pods to be ready
kubectl wait --for=condition=ready pod -l app=clickhouse-keeper -n clickhouse --timeout=300s

# Deploy ClickHouse cluster
kubectl apply -f clickhouse-cluster.yaml

# Monitor the deployment
kubectl get pods -n clickhouse -w
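Beyond watching pods, the operator reports reconciliation progress on the ClickHouseInstallation resource itself; once all replicas are reconciled, its status should eventually read Completed:
# Check the installation status reported by the operator
kubectl get clickhouseinstallations -n clickhouse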

Step 4: Configure W&B Platform

Update your W&B Platform configuration to connect to ClickHouse.

4.1 Update the W&B Custom Resource

Edit your W&B Platform Custom Resource (CR):
apiVersion: apps.wandb.com/v1
kind: WeightsAndBiases
metadata:
  name: wandb
  namespace: wandb
spec:
  values:
    global:
      # ... existing configuration ...
      
      clickhouse:
        host: weave-clickhouse.clickhouse.svc.cluster.local
        port: 8123
        database: wandb_weave
        user: default
        password: ""  # Empty for default user, or set a password
        
      weave-trace:
        enabled: true
    
    weave-trace:
      install: true
      extraEnv:
        WF_CLICKHOUSE_REPLICATED: "true"
        WF_CLICKHOUSE_REPLICATED_CLUSTER: "weave_cluster"

4.2 Apply the configuration

kubectl apply -f wandb-cr.yaml

4.3 Verify Weave is running

kubectl get pods -n wandb | grep weave-trace
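If the pod is not becoming ready, its startup logs usually show whether it can reach ClickHouse. Substitute the pod name returned by the previous command:
# Tail the weave-trace pod logs to confirm the ClickHouse connection succeeded
kubectl logs <weave-trace-pod-name> -n wandb --tail=50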

Step 5: Initialize the Database

The Weave service will automatically create the required database and tables on first startup. You can verify this:
# Connect to ClickHouse
kubectl exec -it weave-clickhouse-0-0-0 -n clickhouse -- clickhouse-client

# In the ClickHouse client, verify the database exists
SHOW DATABASES;

# Check for Weave tables
USE wandb_weave;
SHOW TABLES;
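Because this is a replicated cluster, it is also worth confirming replica health once tables exist. A minimal check from the same clickhouse-client session, using ClickHouse's built-in system.replicas table:
-- Replicas should report is_readonly = 0 and an absolute_delay near 0
SELECT database, table, is_readonly, absolute_delay
FROM system.replicas
WHERE database = 'wandb_weave';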

Resource Requirements

ClickHouse Cluster Sizing

Resource requirements vary based on trace volume and retention period. Here are recommended starting points:

Small (< 1M traces/day)

  • Nodes: 3 replicas
  • CPU: 2-4 cores per node
  • Memory: 8-16 GB per node
  • Storage: 100-200 GB per node

Medium (1-10M traces/day)

  • Nodes: 3 replicas
  • CPU: 4-8 cores per node
  • Memory: 16-32 GB per node
  • Storage: 500 GB - 1 TB per node

Large (> 10M traces/day)

  • Nodes: 3+ replicas (consider sharding)
  • CPU: 8-16 cores per node
  • Memory: 32-64 GB per node
  • Storage: 1-2 TB per node
Contact your W&B Solutions Architect team for assistance with sizing for your specific use case. Factors to consider include:
  • Expected trace volume
  • Trace complexity and size
  • Query patterns and frequency
  • Retention requirements

ClickHouse Keeper Sizing

Keeper has minimal resource requirements:
  • CPU: 0.5-1 core per node
  • Memory: 1-2 GB per node
  • Storage: 10-20 GB per node

Monitoring and Maintenance

Health Checks

Check ClickHouse cluster status

kubectl exec -it weave-clickhouse-0-0-0 -n clickhouse -- clickhouse-client \
  --query "SELECT * FROM system.clusters WHERE cluster = 'weave_cluster'"

Check Keeper status

kubectl exec -it weave-clickhouse-keeper-0 -n clickhouse -- \
  sh -c 'echo ruok | nc localhost 2181'

Backup and Recovery

S3-based backups

Since data is stored in S3, backups can be managed at the S3 level:
  1. Point-in-time recovery: Use S3 versioning
  2. Cross-region backups: Configure S3 replication
  3. Snapshot backups: Use S3 lifecycle policies
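For example, point-in-time recovery via S3 versioning (item 1) can be enabled with a single AWS CLI call; the bucket name is a placeholder, and MinIO offers the same API through its own client:
# Enable object versioning on the bucket backing ClickHouse storage
aws s3api put-bucket-versioning \
  --bucket YOUR_BUCKET_NAME \
  --versioning-configuration Status=Enabled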

ClickHouse native backups

# Create a backup
# Create a backup (the S3 destination is an HTTPS endpoint URL, not an s3:// URI)
kubectl exec -it weave-clickhouse-0-0-0 -n clickhouse -- clickhouse-client \
  --query "BACKUP DATABASE wandb_weave TO S3('https://YOUR_BUCKET.s3.amazonaws.com/backups/backup_name', 'YOUR_ACCESS_KEY', 'YOUR_SECRET_KEY')"

# Restore from backup
kubectl exec -it weave-clickhouse-0-0-0 -n clickhouse -- clickhouse-client \
  --query "RESTORE DATABASE wandb_weave FROM S3('https://YOUR_BUCKET.s3.amazonaws.com/backups/backup_name', 'YOUR_ACCESS_KEY', 'YOUR_SECRET_KEY')"

Scaling

Vertical scaling (increasing resources)

  1. Update the resource requests/limits in your ClickHouse configuration
  2. Apply the changes: kubectl apply -f clickhouse-cluster.yaml
  3. The operator will perform a rolling update

Horizontal scaling (adding replicas)

  1. Increase replicasCount in your ClickHouse configuration
  2. Apply the changes: kubectl apply -f clickhouse-cluster.yaml
  3. The operator will add new replicas, and ClickHouse replicates existing data to them

Adding shards

For very high volume deployments:
  1. Increase shardsCount in your ClickHouse configuration
  2. Apply the changes and migrate data as needed

Troubleshooting

Common Issues

ClickHouse pods not starting

Check pod events:
kubectl describe pod weave-clickhouse-0-0-0 -n clickhouse
Common causes:
  • Insufficient resources
  • Storage class not available
  • S3 credentials incorrect

Connection refused from W&B Platform

Verify network connectivity:
# From a W&B pod
kubectl exec -it <wandb-pod> -n wandb -- \
  curl -v http://weave-clickhouse.clickhouse.svc.cluster.local:8123/ping

Slow query performance

Check ClickHouse metrics:
kubectl exec -it weave-clickhouse-0-0-0 -n clickhouse -- clickhouse-client \
  --query "SELECT * FROM system.metrics WHERE metric LIKE '%Query%'"
Consider:
  • Increasing memory allocation
  • Optimizing table settings
  • Adding more replicas
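To see which queries are actually slow, ClickHouse's built-in system.query_log is usually the quickest source (query logging is enabled by default in recent ClickHouse releases):
kubectl exec -it weave-clickhouse-0-0-0 -n clickhouse -- clickhouse-client \
  --query "SELECT query_duration_ms, read_rows, substring(query, 1, 120) AS query_head
           FROM system.query_log
           WHERE type = 'QueryFinish'
           ORDER BY query_duration_ms DESC
           LIMIT 10"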

Logs

ClickHouse logs

kubectl logs weave-clickhouse-0-0-0 -n clickhouse

Keeper logs

kubectl logs weave-clickhouse-keeper-0 -n clickhouse

Operator logs

kubectl logs deployment/clickhouse-operator -n clickhouse

Security Considerations

Network Security

  1. Network Policies: Implement Kubernetes NetworkPolicies to restrict traffic between namespaces
  2. TLS/SSL: Configure ClickHouse to use TLS for inter-node communication
  3. Service Mesh: Consider using Istio or Linkerd for additional security
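As a starting point for item 1, a NetworkPolicy of roughly this shape restricts ingress to the ClickHouse server pods to traffic from the wandb namespace, plus intra-namespace replication traffic. The pod selector label matches the one the Altinity operator applies to the installation defined above; verify it against your own pods before relying on it:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-wandb-to-clickhouse
  namespace: clickhouse
spec:
  # Selects the ClickHouse server pods created by the operator
  podSelector:
    matchLabels:
      clickhouse.altinity.com/chi: weave-clickhouse
  policyTypes:
    - Ingress
  ingress:
    # W&B Platform pods: HTTP (8123) and native TCP (9000) interfaces
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: wandb
      ports:
        - protocol: TCP
          port: 8123
        - protocol: TCP
          port: 9000
    # Other pods in the clickhouse namespace: replication and interserver traffic
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: clickhouse
      ports:
        - protocol: TCP
          port: 9000
        - protocol: TCP
          port: 9009
        - protocol: TCP
          port: 8123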

Access Control

  1. ClickHouse Users: Create dedicated users with minimal permissions
  2. RBAC: Implement Kubernetes RBAC for operator and pod access
  3. Secrets Management: Use external secret managers (Vault, AWS Secrets Manager)
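For item 1, a dedicated user can be created directly from clickhouse-client; the username, password, and grant list below are illustrative only, and the exact privileges needed depend on how Weave manages its schema. Remember to update the W&B CR from Step 4 to use the same credentials:
-- Illustrative only: a dedicated Weave user scoped to the wandb_weave database
CREATE USER weave_app IDENTIFIED WITH sha256_password BY 'CHANGE_ME';
GRANT SELECT, INSERT, CREATE TABLE, ALTER, DROP TABLE ON wandb_weave.* TO weave_app;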

Data Protection

  1. Encryption at Rest: Enable S3 bucket encryption
  2. Encryption in Transit: Use HTTPS endpoints for S3
  3. Audit Logging: Enable ClickHouse query logging and Kubernetes audit logs

Advanced Configuration

Custom ClickHouse Settings

Add custom settings to the ClickHouse configuration:
configuration:
  settings:
    max_concurrent_queries: 100
    max_memory_usage: 10737418240  # 10GB
    max_memory_usage_for_user: 10737418240
    max_execution_time: 300  # 5 minutes

Performance Tuning

Optimize for Weave workloads:
configuration:
  settings:
    # Merge tree settings
    merge_tree/max_bytes_to_merge_at_max_space_in_pool: 161061273600
    merge_tree/max_bytes_to_merge_at_min_space_in_pool: 1048576
    
    # Query cache
    query_cache_max_size_in_bytes: 1073741824  # 1GB
    query_cache_max_entries: 1024
    
    # Background operations
    background_pool_size: 16
    background_schedule_pool_size: 16

Multi-Region Deployment

For global deployments:
  1. Deploy ClickHouse clusters in multiple regions
  2. Configure cross-region replication
  3. Use geo-distributed S3 buckets
  4. Implement query routing based on region

Support

For assistance with your Weave self-managed deployment:
  1. Documentation: W&B Documentation
  2. Community: W&B Community Forum
  3. Enterprise Support: Contact your W&B Solutions Architect or email support@wandb.com

Appendix

Complete Example Configuration

A complete example configuration is available in the W&B Examples Repository.

Migration from Previous Versions

If migrating from an earlier Weave deployment:
  1. Back up your existing data
  2. Deploy the new ClickHouse cluster
  3. Migrate data using ClickHouse tools
  4. Update W&B Platform configuration
  5. Verify data integrity
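For step 3, ClickHouse's remote() table function is one common way to copy tables between clusters; the host, table name, and credentials below are placeholders:
-- Placeholder example: copy a single table from the old cluster into the new one
INSERT INTO wandb_weave.<table_name>
SELECT * FROM remote('old-clickhouse-host:9000', 'wandb_weave', '<table_name>', 'default', '<password>');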

Integration with Existing ClickHouse

If you have an existing ClickHouse deployment:
  1. Ensure ClickHouse version compatibility (24.8+)
  2. Create a dedicated database for Weave
  3. Configure appropriate user permissions
  4. Update W&B Platform to point to existing cluster
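For step 4, the change amounts to pointing the global.clickhouse block from Step 4.1 at the existing endpoint; the host and credentials below are placeholders:
spec:
  values:
    global:
      clickhouse:
        # Placeholder: your existing ClickHouse endpoint and the dedicated Weave user
        host: clickhouse.example.internal
        port: 8123
        database: wandb_weave
        user: weave_app
        password: "CHANGE_ME"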