Performance and sizing

This guide provides sizing recommendations and performance characteristics to help you plan Virtual MCP Server (vMCP) deployments.

Resource requirements

Baseline resources

Minimal deployment (development/testing):

  • CPU: 100m (0.1 cores)
  • Memory: 128Mi

Production deployment (recommended):

  • CPU: 500m (0.5 cores)
  • Memory: 512Mi
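
As a rough sketch, these production values map onto the vMCP container's resource requests; the podTemplateSpec structure and the container name vmcp follow the high-volume example later in this guide:

spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: 500m # 0.5 cores, per the recommendation above
              memory: 512Mi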

Scaling factors

Resource needs increase based on:

  • Number of backends: Each backend adds minimal overhead (~10-20MB memory)
  • Request volume: Higher traffic requires more CPU for request processing
  • Composite tool complexity: Workflows with many parallel steps consume more memory
  • Token caching: Authentication token cache grows with unique client count

Backend scale recommendations

vMCP performs well across different scales:

| Backend count | Use case | Notes |
|---------------|----------|-------|
| 1-5 | Small teams, focused toolsets | Minimal resource overhead |
| 5-15 | Medium teams, diverse tools | Recommended range for most use cases |
| 15-30 | Large teams, comprehensive toolsets | Increase the health check interval |
| 30+ | Enterprise-scale deployments | Consider multiple vMCP instances |

Performance characteristics

Backend discovery

  • Timing: Happens once per client session
  • Duration: Typically completes in 1-3 seconds for 10 backends
  • Timeout: 15 seconds (returns HTTP 504 on timeout)
  • Parallelism: Backends queried concurrently for capabilities

Health checks

  • Interval: Every 30 seconds by default (configurable)
  • Impact: Minimal overhead on backend servers
  • Timeout: 10 seconds by default (configurable via healthCheckTimeout)
  • Configuration: See Configure health checks
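
Both timing knobs can be tuned together. In this sketch, healthCheckInterval follows the spec.config.operational.failureHandling path used in the scaling example later in this guide; placing healthCheckTimeout alongside it is an assumption:

spec:
  config:
    operational:
      failureHandling:
        healthCheckInterval: 30s # default; raise for large backend counts
        healthCheckTimeout: 10s # default; assumed to sit beside the interval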

Tool routing

  • Overhead: Single-digit millisecond latency for routing and conflict resolution
  • Caching: Routing table cached per session for consistent behavior
  • Lookup: O(1) hash table lookup for tool/resource/prompt routing

Composite workflows

  • Parallelism: Up to 10 parallel step executions by default (configurable)
  • Execution model: DAG-based with dependency resolution
  • Bottleneck: Each DAG level is limited by its slowest backend response
  • Memory: Step results cached in memory during workflow execution
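
To illustrate the DAG model, the sketch below shows two independent steps that can run in the same level, with a third step gated on both; the field names (steps, dependsOn) are hypothetical, not the actual vMCP workflow schema:

# Hypothetical workflow shape, for illustration only
steps:
  - name: fetch-issues # level 1: no dependencies
  - name: fetch-prs # level 1: runs in parallel with fetch-issues
  - name: summarize
    dependsOn: [fetch-issues, fetch-prs] # level 2: waits for the slower of the two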

Token caching

  • Reduction: 90%+ less authentication overhead for repeated requests
  • Duration: Tokens cached until expiration
  • Scope: Per-client, per-backend token cache
  • Impact: Significantly improves response times for authenticated backends

Horizontal scaling

vMCP is stateless and supports horizontal scaling:

Scaling characteristics

  • Independence: Each vMCP instance operates independently
  • Session affinity: Client sessions are sticky to a single instance (via session ID)
  • State: No shared state between instances
  • Method: Scale by increasing replicas in the Deployment

Example scaling configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmcp-my-vmcp
spec:
  replicas: 3 # Scale to 3 instances
  # ... rest of deployment spec

Load balancing

When using multiple replicas, ensure your load balancer supports session affinity:

  • Kubernetes Service: Use sessionAffinity: ClientIP
  • Ingress: Configure session affinity/sticky sessions at the Ingress level
  • Gateway API: Use appropriate session affinity configuration
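
For example, a Kubernetes Service with ClientIP affinity might look like this; the selector labels and ports are assumptions about your deployment:

apiVersion: v1
kind: Service
metadata:
  name: vmcp-my-vmcp
spec:
  selector:
    app: vmcp-my-vmcp # assumption: match your vMCP pod labels
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800 # Kubernetes default affinity window (3 hours)
  ports:
    - port: 80
      targetPort: 8080 # assumption: the port vMCP listens on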

When to scale

Scale up (increase resources)

Increase CPU and memory when you observe:

  • High CPU usage (>70% sustained) during normal operations
  • Memory pressure or OOM (out-of-memory) kills
  • Slow response times (>1 second) for simple tool calls
  • Health check timeouts or frequent backend unavailability

Scale out (increase replicas)

Add more vMCP instances when:

  • CPU usage remains high despite increasing resources
  • You need higher availability and fault tolerance
  • Request volume exceeds capacity of a single instance
  • You want to distribute load across multiple availability zones
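
Because vMCP is stateless, one option is a standard Kubernetes HorizontalPodAutoscaler; this is a sketch rather than a documented vMCP feature, with the 70% target mirroring the sustained-CPU threshold above:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vmcp-my-vmcp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vmcp-my-vmcp
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # scale out above ~70% sustained CPU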

Scale configuration

Adjust operational settings when scaling:

For large backend counts (15+):

spec:
  config:
    operational:
      failureHandling:
        # Reduce health check frequency to minimize overhead
        healthCheckInterval: 60s

        # Increase thresholds for better stability
        unhealthyThreshold: 5

For high request volumes:

spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: '1'
              memory: 1Gi
            limits:
              cpu: '2'
              memory: 2Gi

Performance optimization

Reduce backend discovery time

  1. Use inline mode for static backend configurations (eliminates Kubernetes API queries)
  2. Minimize backend count by grouping related tools in fewer servers
  3. Ensure fast backend responses to initialize requests

Reduce authentication overhead

  1. Enable token caching (enabled by default)
  2. Use unauthenticated mode for internal/trusted backends
  3. Configure appropriate token expiration in your OIDC provider

Optimize composite workflows

  1. Minimize dependencies between steps to maximize parallelism
  2. Use failureMode: continue when appropriate to avoid blocking entire workflows
  3. Set appropriate timeouts for slow backends
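
As a sketch of the second and third points: failureMode: continue is the setting named above, but its step-level placement here is an assumption, and the timeout field name is hypothetical:

steps:
  - name: optional-enrichment
    failureMode: continue # don't block the whole workflow if this step fails
    timeout: 30s # hypothetical field; size to the slow backend's typical latency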

Monitor performance

Use the vMCP telemetry integration to monitor:

  • Backend request latency and error rates
  • Workflow execution times and failure patterns
  • Health check success/failure rates

See Telemetry and metrics for configuration details.