Performance and sizing

This guide provides sizing recommendations and performance characteristics to help you plan Virtual MCP Server (vMCP) deployments.

Resource requirements

Baseline resources

Minimal deployment (development/testing):

  • CPU: 100m (0.1 cores)
  • Memory: 128Mi

Production deployment (recommended):

  • CPU: 500m (0.5 cores)
  • Memory: 512Mi
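
As a rough sketch, these production values map onto the vMCP container's resource requests; the podTemplateSpec structure and the container name vmcp follow the high-volume example later in this guide:

spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: 500m # 0.5 cores, per the recommendation above
              memory: 512Mi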

Scaling factors

Resource needs increase based on:

  • Number of backends: Each backend adds minimal overhead (~10-20MB memory)
  • Request volume: Higher traffic requires more CPU for request processing
  • Composite tool complexity: Workflows with many parallel steps consume more memory
  • Token caching: Authentication token cache grows with unique client count

Backend scale recommendations

vMCP performs well across different scales:

| Backend count | Use case | Notes |
|---------------|----------|-------|
| 1-5 | Small teams, focused toolsets | Minimal resource overhead |
| 5-15 | Medium teams, diverse tools | Recommended range for most use cases |
| 15-30 | Large teams, comprehensive toolsets | Increase the health check interval |
| 30+ | Enterprise-scale deployments | Consider multiple vMCP instances |

Performance characteristics

Backend discovery

  • Timing: Happens once per client session
  • Duration: Typically completes in 1-3 seconds for 10 backends
  • Timeout: 15 seconds (returns HTTP 504 on timeout)
  • Parallelism: Backends queried concurrently for capabilities

Health checks

  • Interval: Every 30 seconds by default (configurable)
  • Impact: Minimal overhead on backend servers
  • Timeout: 10 seconds by default (configurable via healthCheckTimeout)
  • Configuration: See Configure health checks
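
Both timing knobs can be tuned together. In this sketch, healthCheckInterval follows the spec.config.operational.failureHandling path used in the scaling example later in this guide; placing healthCheckTimeout alongside it is an assumption:

spec:
  config:
    operational:
      failureHandling:
        healthCheckInterval: 30s # default; raise for large backend counts
        healthCheckTimeout: 10s # default; assumed to sit beside the interval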

Tool routing

  • Overhead: Single-digit millisecond latency for routing and conflict resolution
  • Caching: Routing table cached per session for consistent behavior
  • Lookup: O(1) hash table lookup for tool/resource/prompt routing

Composite workflows

  • Parallelism: Up to 10 parallel step executions by default (configurable)
  • Execution model: DAG-based with dependency resolution
  • Bottleneck: Each DAG level is limited by its slowest backend response
  • Memory: Step results cached in memory during workflow execution
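
To illustrate the DAG model, the sketch below shows two independent steps that can run in the same level, with a third step gated on both; the field names (steps, dependsOn) are hypothetical, not the actual vMCP workflow schema:

# Hypothetical workflow shape, for illustration only
steps:
  - name: fetch-issues # level 1: no dependencies
  - name: fetch-prs # level 1: runs in parallel with fetch-issues
  - name: summarize
    dependsOn: [fetch-issues, fetch-prs] # level 2: waits for the slower of the two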

Token caching

  • Reduction: 90%+ less authentication overhead for repeated requests
  • Duration: Tokens cached until expiration
  • Scope: Per-client, per-backend token cache
  • Impact: Significantly improves response times for authenticated backends

Horizontal scaling

vMCP is stateless and supports horizontal scaling:

Scaling characteristics

  • Independence: Each vMCP instance operates independently
  • Session affinity: Client sessions are sticky to a single instance (via session ID)
  • State: No shared state between instances
  • Method: Scale by increasing replicas in the Deployment

Example scaling configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vmcp-my-vmcp
spec:
  replicas: 3 # Scale to 3 instances
  # ... rest of deployment spec

Load balancing

When using multiple replicas, ensure your load balancer supports session affinity:

  • Kubernetes Service: Use sessionAffinity: ClientIP
  • Ingress: Configure session affinity/sticky sessions at the Ingress level
  • Gateway API: Use appropriate session affinity configuration
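
For example, a Kubernetes Service with ClientIP affinity might look like this; the selector labels and ports are assumptions about your deployment:

apiVersion: v1
kind: Service
metadata:
  name: vmcp-my-vmcp
spec:
  selector:
    app: vmcp-my-vmcp # assumption: match your vMCP pod labels
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800 # Kubernetes default affinity window (3 hours)
  ports:
    - port: 80
      targetPort: 8080 # assumption: the port vMCP listens on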

When to scale

Scale up (increase resources)

Increase CPU and memory when you observe:

  • High CPU usage (>70% sustained) during normal operations
  • Memory pressure or OOM (out-of-memory) kills
  • Slow response times (>1 second) for simple tool calls
  • Health check timeouts or frequent backend unavailability

Scale out (increase replicas)

Add more vMCP instances when:

  • CPU usage remains high despite increasing resources
  • You need higher availability and fault tolerance
  • Request volume exceeds capacity of a single instance
  • You want to distribute load across multiple availability zones
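
Because vMCP is stateless, one option is a standard Kubernetes HorizontalPodAutoscaler; this is a sketch rather than a documented vMCP feature, with the 70% target mirroring the sustained-CPU threshold above:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vmcp-my-vmcp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vmcp-my-vmcp
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # scale out above ~70% sustained CPU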

Scale configuration

Adjust operational settings when scaling:

For large backend counts (15+):

spec:
  config:
    operational:
      failureHandling:
        # Reduce health check frequency to minimize overhead
        healthCheckInterval: 60s

        # Increase thresholds for better stability
        unhealthyThreshold: 5

For high request volumes:

spec:
  podTemplateSpec:
    spec:
      containers:
        - name: vmcp
          resources:
            requests:
              cpu: '1'
              memory: 1Gi
            limits:
              cpu: '2'
              memory: 2Gi

Performance optimization

Reduce backend discovery time

  1. Use inline mode for static backend configurations (eliminates Kubernetes API queries)
  2. Minimize backend count by grouping related tools in fewer servers
  3. Ensure fast backend responses to initialize requests

Reduce authentication overhead

  1. Enable token caching (enabled by default)
  2. Use unauthenticated mode for internal/trusted backends
  3. Configure appropriate token expiration in your OIDC provider

Optimize composite workflows

  1. Minimize dependencies between steps to maximize parallelism
  2. Use failureMode: continue when appropriate to avoid blocking entire workflows
  3. Set appropriate timeouts for slow backends
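
As a sketch of the second and third points: failureMode: continue is the setting named above, but its step-level placement here is an assumption, and the timeout field name is hypothetical:

steps:
  - name: optional-enrichment
    failureMode: continue # don't block the whole workflow if this step fails
    timeout: 30s # hypothetical field; size to the slow backend's typical latency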

Monitor performance

Use the vMCP telemetry integration to monitor:

  • Backend request latency and error rates
  • Workflow execution times and failure patterns
  • Health check success/failure rates

See Telemetry and metrics for configuration details.