Designed for failure. LogFleet is built by reliability engineers who understand that networks fail, disks fill up, and services crash. Here’s how we handle it.

Design Principles

LogFleet follows three core resilience principles:
  1. Fail locally, recover automatically - Edge components should continue working during cloud outages
  2. Never lose logs - Use buffering and disk persistence to survive restarts
  3. Degrade gracefully - When resources are constrained, shed load predictably

Failure Scenarios

Scenario 1: Internet Connection Lost

What happens:
  • Edge agent continues collecting logs normally
  • Logs are stored in local Loki instance
  • Metrics buffer in Vector’s disk queue
  • Heartbeats fail (expected)
When connection returns:
  • Buffered metrics are automatically forwarded
  • Cloud platform marks agent as “online” again
  • No manual intervention required
Diagram: During an internet outage, logs continue to local Loki while metrics buffer. On reconnection, buffered metrics flush automatically.
Buffer capacity: Vector buffers up to 1GB of metrics by default. At typical volumes, this covers 24-72 hours of outage.
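This buffering behavior comes from the buffer settings on the cloud-bound sink. A minimal sketch of the relevant Vector options (the sink name, type, input, and endpoint here are illustrative, not LogFleet's shipped configuration):

# Sketch: disk-backed buffer on the cloud metrics sink (names and endpoint illustrative)
sinks:
  cloud_metrics:
    type: prometheus_remote_write
    inputs: ["host_metrics"]
    endpoint: https://metrics.example-logfleet.cloud/api/v1/write
    buffer:
      type: disk             # persists events across restarts and outages
      max_size: 1073741824   # 1 GiB, matching the documented default
      when_full: block       # apply backpressure upstream rather than drop events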

Scenario 2: Loki Storage Full

What happens:
  • Loki’s ring buffer automatically deletes oldest logs
  • Compactor runs retention policy (default: 7 days)
  • New logs continue being ingested
  • Alert fired via Vector’s internal metrics
Prevention:
  • Configure appropriate retention_period in Loki
  • Monitor disk usage via Prometheus endpoint
  • Set up alerts for >80% disk utilization (example alert rule below)
# Loki config - ring buffer behavior
limits_config:
  retention_period: 168h  # 7 days - auto-deletes older logs
compactor:
  retention_enabled: true
  retention_delete_delay: 2h
Data loss is expected. Ring buffer semantics mean old logs are deleted to make room for new ones. This is by design: edge storage is finite.
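For the >80% disk alert mentioned under Prevention, a Prometheus alerting rule along these lines works. The group name, alert name, and labels are illustrative, and disk_used_percent is the metric named in the monitoring table below; substitute whatever your exporter actually reports:

# Sketch: alert when edge disk utilization exceeds 80% (names illustrative)
groups:
  - name: logfleet-edge-disk
    rules:
      - alert: EdgeDiskUtilizationHigh
        expr: disk_used_percent > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Edge disk above 80% - Loki retention headroom is shrinking"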

Scenario 3: Vector Crashes

What happens:
  • Vector process exits
  • Kubernetes/Docker restarts the container
  • On restart, Vector reads checkpoint from disk
  • Resumes from last known position (no duplicate logs)
Checkpoint mechanism:
  • File inputs: Uses file offset tracking
  • Network inputs: Acknowledges after successful write
  • Crash recovery: Replays unacknowledged data
# Vector data directory persists across restarts
data_dir: /var/lib/vector
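For checkpoints to survive container restarts, /var/lib/vector has to live on a persistent volume. A sketch with Docker Compose (service name, volume name, and image tag are illustrative):

# Sketch: persist Vector's data_dir across container restarts (names illustrative)
services:
  logfleet-vector:
    image: timberio/vector:latest-alpine
    restart: unless-stopped
    volumes:
      - vector-data:/var/lib/vector   # checkpoints and disk buffers live here
volumes:
  vector-data: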
Test your recovery. Run docker restart logfleet-vector and verify logs continue flowing without gaps or duplicates.

Scenario 4: Cloud Platform Unavailable

What happens at the edge:
  • Logs continue to local Loki (unaffected)
  • Metrics buffer to disk
  • Heartbeats fail
  • Streaming sessions cannot be started
What happens in the cloud:
  • Dashboard shows agents as “unknown” status
  • Historical data remains accessible
  • New streaming requests fail
Recovery:
  • Automatic reconnection with exponential backoff (see the retry sketch below)
  • Buffered data forwarded on recovery
  • Agent status updates within 60 seconds
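The backoff behavior maps onto the request/retry options of the cloud-bound sink. A sketch of the knobs involved (sink name, endpoint, and values are illustrative, not LogFleet defaults):

# Sketch: retry and backoff tuning on the cloud sink (names and values illustrative)
sinks:
  cloud_metrics:
    type: prometheus_remote_write
    inputs: ["host_metrics"]
    endpoint: https://metrics.example-logfleet.cloud/api/v1/write
    request:
      retry_initial_backoff_secs: 1    # first retry roughly one second after a failure
      retry_max_duration_secs: 30      # exponential backoff caps at 30 seconds between attempts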
Edge-first means edge-independent. Your locations keep working. The cloud is for visibility, not functionality.

Scenario 5: High Log Volume Spike

What happens:
  • Vector’s internal queue fills up
  • Backpressure applied to inputs
  • Oldest buffered events dropped if queue overflows
Mitigation:
  • Rate limiting at source when possible
  • Sampling for non-critical log sources (see the sampling sketch below)
  • Increase buffer size for expected spikes
# Vector rate limiting
transforms:
  rate_limit:
    type: throttle
    inputs: ["source_http"]
    threshold: 10000  # events per second
    window_secs: 1
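For the sampling mitigation, Vector's sample transform drops a fixed fraction of events from noisy, non-critical sources. A sketch (transform and input names are illustrative):

# Sketch: keep 1 in 10 events from a non-critical source (names illustrative)
transforms:
  sample_debug_logs:
    type: sample
    inputs: ["source_debug_http"]
    rate: 10    # keep one event out of every 10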

Monitoring LogFleet Health

Key Metrics to Watch

Metric                          Alert Threshold     Meaning
vector_buffer_byte_size         >800MB              Buffer approaching capacity
vector_events_out_total         0 for 5 min         No data flowing to cloud
loki_ingester_memory_chunks     >10000              High memory pressure
disk_used_percent               >85%                Storage filling up
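These thresholds translate directly into Prometheus alerting rules. A sketch for the two Vector metrics (group and alert names are illustrative):

# Sketch: alerts for the Vector metrics above (names illustrative)
groups:
  - name: logfleet-vector
    rules:
      - alert: VectorBufferNearCapacity
        expr: vector_buffer_byte_size > 800000000    # ~800MB of the 1GB default buffer
        for: 5m
      - alert: VectorNoOutboundEvents
        expr: increase(vector_events_out_total[5m]) == 0    # nothing flowing to the cloud
        for: 5m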

Prometheus Endpoint

Vector exposes metrics at http://localhost:9598/metrics:
curl http://localhost:9598/metrics | grep vector_buffer
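To scrape this endpoint with Prometheus, a minimal job would look like the following (job name and interval are illustrative):

# Sketch: Prometheus scrape job for Vector's internal metrics (names illustrative)
scrape_configs:
  - job_name: logfleet-vector
    scrape_interval: 30s
    static_configs:
      - targets: ["localhost:9598"]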

Recovery Procedures

Manual Buffer Flush

If metrics are stuck in the buffer:
# Check buffer status
curl http://localhost:9598/metrics | grep buffer

# Restart Vector to force flush attempt
docker restart logfleet-vector

Clear Corrupted Checkpoints

If Vector won’t start after a crash:
# Backup and remove checkpoint
mv /var/lib/vector/checkpoints /var/lib/vector/checkpoints.bak

# Restart - will re-read from current file positions
docker restart logfleet-vector
Clearing checkpoints may cause duplicate logs from file inputs. Network inputs are idempotent.

Loki Recovery

If Loki won’t start:
# Check for WAL corruption
docker logs logfleet-loki 2>&1 | grep -i "corrupt"

# If corrupted, clear WAL (loses recent unbatched data)
rm -rf /loki/wal/*
docker restart logfleet-loki

Testing Resilience

We recommend running these tests quarterly:
  1. Network Partition Test - Disconnect the edge from the internet for 1 hour. Verify logs continue locally and metrics flush on reconnect (one way to script this is sketched below).
  2. Process Crash Test - Kill Vector with kill -9. Verify automatic restart and no log gaps.
  3. Disk Pressure Test - Fill the disk to 90%. Verify ring buffer rotation and no service crashes.
  4. High Volume Test - Send 10x normal log volume. Verify rate limiting activates and the system remains stable.
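One way to script the network partition test with Docker (the network and container names are illustrative; adjust to your deployment):

# Sketch: simulate a one-hour internet outage for the Vector container (names illustrative)
docker network disconnect logfleet-edge logfleet-vector
sleep 3600
docker network connect logfleet-edge logfleet-vector

# Confirm the buffer drains after reconnecting
curl -s http://localhost:9598/metrics | grep vector_buffer_byte_size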

Summary

Failure             Impact              Recovery Time       Data Loss
Internet outage     Metrics delayed     Automatic           None
Loki disk full      Old logs deleted    Immediate           Oldest logs
Vector crash        Brief gap           Under 30 seconds    None*
Cloud outage        No dashboard        Automatic           None
High volume         Rate limited        Immediate           Excess events
*With proper checkpoint configuration. File inputs may replay briefly on restart.