
Overview

LogFleet is designed for high-throughput edge deployments. This guide covers hardware requirements, performance benchmarks, capacity planning, and tuning recommendations for different scales.
All benchmarks were conducted on standard hardware configurations. Your results may vary based on log complexity, network conditions, and workload patterns.

Hardware Requirements

Minimum Requirements (Development/Testing)

For local development and small-scale testing:
| Component | Specification |
| --- | --- |
| CPU | 2 cores |
| RAM | 4 GB |
| Storage | 20 GB SSD |
| Network | 10 Mbps |
# Docker resource limits for development
docker run -d \
  --cpus="2" \
  --memory="4g" \
  logfleet/edge-agent

Recommended Requirements (Production Single Location)

For production single-location deployments handling typical retail/IoT workloads:
| Component | Specification | Notes |
| --- | --- | --- |
| CPU | 4 cores (Intel i5/AMD Ryzen 5) | Vector benefits from multiple cores |
| RAM | 8 GB | 4 GB for Vector, 2 GB for Loki, 2 GB for the OS |
| Storage | 100 GB NVMe SSD | Scales with retention period |
| Network | 100 Mbps | For metric shipping and on-demand streaming |
Expected throughput: 10,000-50,000 logs/second
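For container-based installs, the same limits can be mirrored in Docker. The docker-compose sketch below is illustrative: it assumes the single logfleet/edge-agent image from the development example, so adapt the service layout to however you split Vector and Loki across containers.
# docker-compose.yml - resource limits for a single-location deployment (illustrative)
services:
  edge-agent:
    image: logfleet/edge-agent
    deploy:
      resources:
        limits:
          cpus: "4"      # matches the 4-core recommendation
          memory: 8g     # matches the 8 GB recommendation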

Production (High-Volume Location)

For high-volume locations (large retail stores, manufacturing floors):
| Component | Specification | Notes |
| --- | --- | --- |
| CPU | 8 cores (Intel i7/Xeon) | Enables parallel processing |
| RAM | 16 GB | Larger buffers, more concurrent queries |
| Storage | 500 GB NVMe SSD | 30-day retention at high volume |
| Network | 1 Gbps | Burst capacity for streaming |
Expected throughput: 50,000-200,000 logs/second

Enterprise (3-Node Cluster)

For mission-critical deployments requiring high availability:
| Component | Per Node | Total Cluster |
| --- | --- | --- |
| CPU | 8 cores | 24 cores |
| RAM | 32 GB | 96 GB |
| Storage | 1 TB NVMe | 3 TB (with replication) |
| Network | 10 Gbps | Dedicated management network |
Expected throughput: 500,000+ logs/second with HA

Performance Benchmarks

Log Ingestion Throughput

Measured on recommended single-location hardware (4 cores, 8 GB RAM):
| Log Size | Throughput | CPU Usage | Memory |
| --- | --- | --- | --- |
| 256 bytes | 85,000 logs/s | 65% | 2.1 GB |
| 512 bytes | 62,000 logs/s | 72% | 2.4 GB |
| 1 KB | 45,000 logs/s | 78% | 2.8 GB |
| 4 KB | 18,000 logs/s | 85% | 3.2 GB |

Log-to-Metric Extraction

Vector’s log_to_metric transform performance:
| Metrics per Log | Throughput Impact | CPU Overhead |
| --- | --- | --- |
| 1 metric | -5% | +8% |
| 3 metrics | -12% | +15% |
| 5 metrics | -18% | +22% |
| 10 metrics | -28% | +35% |
Keep metric extractions under 5 per log for optimal performance. Use aggregation for high-cardinality data.
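For reference, a minimal log_to_metric transform is sketched below. The input name http_logs matches the HTTP source used in the tuning section of this guide, while the status and service fields are assumptions about your log schema:
# Sketch: extract a counter from parsed logs (field names are assumptions)
transforms:
  http_status_metrics:
    type: log_to_metric
    inputs: ["http_logs"]
    metrics:
      - type: counter
        field: status                # one increment per log that contains this field
        name: http_responses_total
        tags:
          status: "{{ status }}"     # keep tag values low-cardinality
          service: "{{ service }}"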

Query Latency (Loki)

Query performance with 7-day retention and 50 GB of data:
| Query Type | Latency (p50) | Latency (p99) |
| --- | --- | --- |
| Simple filter (`{service="api"}`) | 45ms | 180ms |
| Regex match (`\|~ "error"`) | 120ms | 450ms |
| JSON parsing (`\| json`) | 200ms | 800ms |
| Aggregation (`count_over_time`) | 350ms | 1.2s |
| Full-text search | 500ms | 2.5s |

Network Bandwidth

Metric shipping bandwidth (compressed, to cloud):
| Locations | Metrics/min | Bandwidth |
| --- | --- | --- |
| 10 | 6,000 | 50 KB/s |
| 100 | 60,000 | 500 KB/s |
| 1,000 | 600,000 | 5 MB/s |
| 10,000 | 6,000,000 | 50 MB/s |
Log streaming bandwidth (when enabled):
  • Typical: 1-10 MB/s per location
  • Peak: 50-100 MB/s during incident investigation

Capacity Planning

Storage Calculator

Estimate storage requirements based on your workload:
Daily Storage = (logs_per_second × avg_log_size × 86400) ÷ compression_ratio

Where:
- compression_ratio ≈ 5-10x for Loki (typical logs)
- Add 20% overhead for indexes
Example calculations:
| Logs/sec | Avg Size | Retention | Raw Data | Compressed |
| --- | --- | --- | --- | --- |
| 1,000 | 512 B | 7 days | 302 GB | 45 GB |
| 5,000 | 512 B | 7 days | 1.5 TB | 225 GB |
| 10,000 | 256 B | 14 days | 2.4 TB | 360 GB |
| 50,000 | 256 B | 7 days | 8.6 TB | 1.3 TB |
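As a worked check against the first row, assuming a ~7x compression ratio:
Daily raw  = 1,000 logs/s × 512 B × 86,400 s ≈ 44 GB/day
7-day raw  ≈ 44 GB × 7 ≈ 300 GB
Compressed ≈ 300 GB ÷ 7 ≈ 43-45 GB, plus ~20% for indexes when provisioning disks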

Memory Sizing

Minimum RAM = Vector (1.5 GB) + Loki (1 GB) + OS (1 GB) + Buffer (20%)

Recommended RAM = Vector (3 GB) + Loki (3 GB) + OS (2 GB) + Query Cache (2 GB)
Memory scaling guidelines:
| Throughput | Vector | Loki | Total Recommended |
| --- | --- | --- | --- |
| 10K logs/s | 2 GB | 2 GB | 6 GB |
| 50K logs/s | 4 GB | 4 GB | 12 GB |
| 100K logs/s | 6 GB | 6 GB | 16 GB |
| 200K+ logs/s | 8 GB | 8 GB | 24 GB |

CPU Sizing

Base CPU = 2 cores (Vector) + 1 core (Loki) + 1 core (OS)

Scale factor:
- +1 core per 25K logs/s above baseline
- +1 core per 3 metric extractions
- +2 cores if using complex transforms (grok, VRL scripts)
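For example, assuming the 4-core base covers roughly the first 25K logs/s, a location ingesting ~100K logs/s with 3 metric extractions and no complex transforms works out to:
4 cores (base) + 3 cores (75K logs/s above baseline) + 1 core (3 metric extractions) = 8 cores
which lines up with the high-volume hardware recommendation above.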

Tuning Guidelines

Vector Configuration

Optimize Vector for your workload:
# vector.yaml - High throughput configuration
data_dir: /var/lib/vector

# HTTP source tuning
sources:
  http_logs:
    type: http_server
    address: "0.0.0.0:8080"
    # Increase for high concurrency
    keepalive:
      max_connection_age_secs: 300
    # Batch incoming requests
    framing:
      method: newline_delimited

# Batch sink writes
sinks:
  loki:
    type: loki
    endpoint: "http://loki:3100"
    encoding:
      codec: json
    # Buffer to disk so bursts survive restarts and sink slowdowns
    buffer:
      type: disk
      max_size: 5368709120  # 5 GB
    batch:
      max_bytes: 10485760  # 10 MB
      max_events: 100000
      timeout_secs: 5
    # Compression
    compression: snappy
    # Request tuning
    request:
      concurrency: 10
      rate_limit_num: 100
      retry_max_duration_secs: 300

Loki Configuration

Optimize Loki for edge deployments:
# loki-config.yaml - Edge optimized
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  # Increase for high query load
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  # Tune chunk sizing
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_chunk_age: 1h
  chunk_target_size: 1572864  # 1.5 MB
  # Memory optimization
  max_transfer_retries: 0
  wal:
    enabled: true
    dir: /loki/wal

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
  filesystem:
    directory: /loki/chunks

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

limits_config:
  # Ingestion limits
  ingestion_rate_mb: 50
  ingestion_burst_size_mb: 100
  per_stream_rate_limit: 10MB
  per_stream_rate_limit_burst: 30MB
  # Query limits
  max_query_parallelism: 32
  max_query_series: 10000
  max_entries_limit_per_query: 50000
  # Retention
  retention_period: 168h  # 7 days

query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 500

OS-Level Tuning

For high-throughput Linux deployments:
# /etc/sysctl.d/99-logfleet.conf

# Network tuning
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_tw_reuse = 1

# File descriptor limits
fs.file-max = 2097152
fs.nr_open = 2097152

# Memory tuning
vm.swappiness = 10
vm.dirty_ratio = 60
vm.dirty_background_ratio = 2

# Apply changes
sudo sysctl -p /etc/sysctl.d/99-logfleet.conf
# /etc/security/limits.d/99-logfleet.conf
* soft nofile 1048576
* hard nofile 1048576
* soft nproc 65535
* hard nproc 65535

Monitoring & Alerting

Key Metrics to Monitor

| Metric | Warning | Critical | Action |
| --- | --- | --- | --- |
| CPU usage | >70% | >90% | Scale up or reduce transforms |
| Memory usage | >75% | >90% | Increase RAM or reduce buffers |
| Disk usage | >70% | >85% | Reduce retention or add storage |
| Ingestion rate drop | >20% | >50% | Check sources and network |
| Query latency p99 | >2s | >5s | Optimize queries or add cache |
| Buffer backpressure | >50% | >80% | Scale sink capacity |

Vector Metrics Endpoint

# Enable internal metrics
sources:
  internal_metrics:
    type: internal_metrics
    scrape_interval_secs: 15

sinks:
  prometheus:
    type: prometheus_exporter
    inputs: ["internal_metrics"]
    address: "0.0.0.0:9598"
Key Vector metrics:
  • vector_component_received_events_total - Ingestion rate
  • vector_buffer_events - Buffer pressure
  • vector_component_sent_events_total - Output rate
  • vector_component_errors_total - Error rate

Loki Metrics

Loki exposes Prometheus metrics at /metrics. Key Loki metrics:
  • loki_ingester_chunks_stored_total - Storage growth
  • loki_request_duration_seconds - Query latency
  • loki_ingester_memory_chunks - Memory pressure
  • loki_distributor_bytes_received_total - Ingestion rate
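To collect both, point Prometheus at the two endpoints. The scrape config below is a minimal sketch assuming default ports on a single host; replace localhost with your node addresses:
# prometheus.yml - scrape Vector and Loki (illustrative)
scrape_configs:
  - job_name: "vector"
    static_configs:
      - targets: ["localhost:9598"]   # prometheus_exporter sink address
  - job_name: "loki"
    metrics_path: "/metrics"          # Prometheus default, shown explicitly
    static_configs:
      - targets: ["localhost:3100"]   # Loki HTTP port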

Sample Prometheus Alerts

groups:
  - name: logfleet
    rules:
      - alert: HighCPUUsage
        expr: avg by (instance) (rate(process_cpu_seconds_total[5m])) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 15% on {{ $labels.instance }}"

      - alert: IngestionDrop
        expr: rate(loki_distributor_bytes_received_total[5m]) < 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Log ingestion dropped significantly"

      - alert: HighQueryLatency
        expr: histogram_quantile(0.99, rate(loki_request_duration_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Query latency p99 > 5s"

Scaling Strategies

Vertical Scaling

When to scale up a single node:
| Symptom | Solution |
| --- | --- |
| CPU consistently >80% | Add cores or upgrade CPU |
| Memory pressure / OOM | Add RAM, reduce buffers |
| Disk I/O bottleneck | Upgrade to NVMe, add RAID |
| Query timeouts | Add RAM for cache, faster storage |

Horizontal Scaling (Multi-Node)

When to deploy a cluster:
  • High availability requirement - Deploy 3+ nodes with replication
  • Throughput >200K logs/s - Distribute ingestion load
  • Multi-tenant isolation - Separate workloads
  • Geographic distribution - Regional edge clusters
# Example 3-node cluster topology
Node 1 (Ingester):
  - Vector (primary)
  - Loki Ingester

Node 2 (Ingester):
  - Vector (replica)
  - Loki Ingester

Node 3 (Query):
  - Loki Querier
  - Grafana

Best Practices

  • Start with recommended specs and monitor for 2 weeks before scaling. Over-provisioning wastes resources; under-provisioning causes data loss.
  • Loki’s write patterns require fast random I/O. NVMe SSDs provide 10-100x better performance than spinning disks.
  • Always configure retention limits to prevent disk exhaustion. Ring-buffer semantics ensure the oldest logs are deleted first.
  • Configure Vector sinks to batch writes. Larger batches reduce network overhead and improve throughput.
  • High-cardinality labels (user IDs, request IDs) explode storage. Use log fields for high-cardinality data and labels for low-cardinality dimensions (see the sketch after this list).
  • Buffer backpressure indicates sinks can’t keep up. Investigate sink bottlenecks before increasing buffer sizes.
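To illustrate the label guidance above, a Vector Loki sink can restrict labels to a handful of low-cardinality fields while identifiers stay in the log body. The field names here (service, env) are assumptions about your log schema:
# Sketch: low-cardinality labels only (field names are assumptions)
sinks:
  loki:
    type: loki
    endpoint: "http://loki:3100"
    encoding:
      codec: json
    labels:
      service: "{{ service }}"        # low cardinality: becomes a Loki stream label
      environment: "{{ env }}"        # low cardinality
    # user_id, request_id, trace_id stay in the JSON body and are queried with `| json`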

Next Steps