Designed for failure. LogFleet is built by reliability engineers who understand that networks fail, disks fill up, and services crash. Here’s how we handle it.
Design Principles
LogFleet follows three core resilience principles:

- Fail locally, recover automatically - Edge components should continue working during cloud outages
- Never lose logs - Use buffering and disk persistence to survive restarts
- Degrade gracefully - When resources are constrained, shed load predictably
Failure Scenarios
Scenario 1: Internet Connection Lost
What happens:

- Edge agent continues collecting logs normally
- Logs are stored in the local Loki instance
- Metrics buffer in Vector’s disk queue
- Heartbeats fail (expected)

When the connection is restored:

- Buffered metrics are automatically forwarded
- Cloud platform marks the agent as “online” again
- No manual intervention required
Buffer capacity: Vector buffers up to 1GB of metrics by default. At typical volumes, this covers 24-72 hours of outage.
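The buffer is configured per sink via Vector’s disk buffer options. A minimal sketch, assuming a `prometheus_remote_write` sink named `cloud_metrics` pointed at a hypothetical cloud endpoint (the sink, input, and endpoint names are illustrative, not part of the stock config):

```yaml
sinks:
  cloud_metrics:
    type: prometheus_remote_write
    inputs: ["host_metrics"]                                 # illustrative source name
    endpoint: "https://cloud.logfleet.example/api/v1/push"   # hypothetical endpoint
    buffer:
      type: disk
      max_size: 1073741824   # 1 GiB, the default mentioned above
      when_full: block       # apply backpressure rather than dropping events
```

Raising `max_size` extends how long an outage can last before backpressure kicks in.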
Scenario 2: Loki Storage Full
What happens:

- Loki’s ring buffer automatically deletes the oldest logs
- Compactor runs the retention policy (default: 7 days)
- New logs continue being ingested
- An alert is fired via Vector’s internal metrics

Mitigation:

- Configure an appropriate `retention_period` in Loki (see the sketch below)
- Monitor disk usage via the Prometheus endpoint
- Set up alerts for >80% disk utilization
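A sketch of the retention-related Loki settings behind this scenario. This follows Loki’s compactor-based retention; the exact field set varies between Loki versions, and the working directory is an assumption:

```yaml
limits_config:
  retention_period: 168h        # 7 days, matching the default above

compactor:
  retention_enabled: true
  working_directory: /loki/compactor   # assumed path; match your storage layout
```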
Scenario 3: Vector Crashes
What happens:

- Vector process exits
- Kubernetes/Docker restarts the container
- On restart, Vector reads its checkpoint from disk
- Resumes from the last known position (no duplicate logs)

How checkpointing prevents data loss (see the config sketch below):

- File inputs: uses file offset tracking
- Network inputs: acknowledges only after a successful write
- Crash recovery: replays unacknowledged data
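The behavior above hinges on two Vector settings: `data_dir`, where checkpoints and disk buffers persist across restarts, and end-to-end `acknowledgements` on the sink. A sketch with illustrative source/sink names, paths, and labels:

```yaml
data_dir: /var/lib/vector            # checkpoints and disk buffers survive restarts here

sources:
  app_logs:
    type: file
    include: ["/var/log/app/*.log"]  # file offsets for these are checkpointed automatically

sinks:
  local_loki:
    type: loki
    inputs: ["app_logs"]
    endpoint: "http://localhost:3100"
    labels:
      job: "app"                     # label set is illustrative
    encoding:
      codec: json
    acknowledgements:
      enabled: true                  # advance checkpoints only after Loki confirms the write
```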
Scenario 4: Cloud Platform Unavailable
What happens at the edge:

- Logs continue flowing to local Loki (unaffected)
- Metrics buffer to disk
- Heartbeats fail

What happens in the cloud:

- Streaming sessions cannot be started
- Dashboard shows agents with “unknown” status
- Historical data remains accessible
- New streaming requests fail

When the cloud recovers:

- Automatic reconnection with exponential backoff
- Buffered data is forwarded on recovery
- Agent status updates within 60 seconds
Edge-first means edge-independent. Your locations keep working. The cloud is for visibility, not functionality.
Scenario 5: High Log Volume Spike
What happens:

- Vector’s internal queue fills up
- Backpressure is applied to inputs
- The oldest buffered events are dropped if the queue overflows

Mitigation:

- Rate limiting at the source when possible (see the transform sketch below)
- Sampling for non-critical log sources
- Increase buffer size for expected spikes
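For the first two mitigations, Vector’s `throttle` and `sample` transforms can sit between sources and sinks; the input names, threshold, and rate below are illustrative:

```yaml
transforms:
  limit_noisy_source:
    type: throttle
    inputs: ["app_logs"]     # illustrative source name
    threshold: 1000          # max events allowed per window
    window_secs: 1
  sample_debug_logs:
    type: sample
    inputs: ["debug_logs"]   # illustrative source name
    rate: 10                 # keep roughly 1 in every 10 events
```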
Monitoring LogFleet Health
Key Metrics to Watch
| Metric | Alert Threshold | Meaning |
|---|---|---|
| `vector_buffer_byte_size` | >800MB | Buffer approaching capacity |
| `vector_events_out_total` | 0 for 5 min | No data flowing to cloud |
| `loki_ingester_memory_chunks` | >10,000 | High memory pressure |
| `disk_used_percent` | >85% | Storage filling up |
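These thresholds map directly onto Prometheus alerting rules. A sketch, assuming Prometheus already scrapes the edge agents; the rule names and `for` durations are illustrative:

```yaml
groups:
  - name: logfleet-edge
    rules:
      - alert: VectorBufferNearCapacity
        expr: vector_buffer_byte_size > 8e8          # ~800MB
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Vector buffer approaching capacity on {{ $labels.instance }}"
      - alert: VectorNoOutboundEvents
        expr: rate(vector_events_out_total[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No data flowing to cloud from {{ $labels.instance }}"
      - alert: EdgeDiskFillingUp
        expr: disk_used_percent > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk above 85% on {{ $labels.instance }}"
```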
Prometheus Endpoint
Vector exposes metrics at http://localhost:9598/metrics.
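A quick spot check with curl shows the buffer and throughput metrics from the table above:

```sh
curl -s http://localhost:9598/metrics | grep -E 'vector_buffer_byte_size|vector_events_out_total'
```

The endpoint is served by Vector’s `internal_metrics` source feeding a `prometheus_exporter` sink (0.0.0.0:9598 is that sink’s default address); a minimal sketch with illustrative component names:

```yaml
sources:
  vector_internal:
    type: internal_metrics

sinks:
  vector_prometheus:
    type: prometheus_exporter
    inputs: ["vector_internal"]
    address: "0.0.0.0:9598"
```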
Recovery Procedures
Manual Buffer Flush
If metrics are stuck in the buffer:
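Vector has no single “flush” command; in practice the disk buffer drains on its own once the upstream is reachable, and a restart forces a fresh connection attempt. A sketch, assuming a systemd-managed install and the default metrics port from the previous section:

```sh
curl -s http://localhost:9598/metrics | grep vector_buffer_byte_size    # how much is queued?
sudo systemctl restart vector                                           # or restart the container
# Watch the backlog drain; the value should trend toward zero once the cloud is reachable
watch -n 10 "curl -s http://localhost:9598/metrics | grep vector_buffer_byte_size"
```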
Clear Corrupted Checkpoints
If Vector won’t start after a crash:
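A sketch, assuming the default `data_dir` of `/var/lib/vector`; the checkpoint file layout under it varies by Vector version, so locate the files before moving anything:

```sh
sudo systemctl stop vector
sudo find /var/lib/vector -name 'checkpoints*'                  # see what is actually there
sudo mkdir -p /var/backups/vector-checkpoints
sudo find /var/lib/vector -name 'checkpoints*' \
  -exec mv {} /var/backups/vector-checkpoints/ \;               # set aside rather than delete
sudo systemctl start vector
# File sources re-read from their configured starting position afterwards, so expect
# a brief replay of recent lines (see the footnote under the Summary table)
```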
Loki Recovery
If Loki won’t start:
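One common reason Loki refuses to start after a hard stop is a corrupted write-ahead log; moving the WAL aside sacrifices any unflushed chunks but usually lets the ingester come up. A sketch, assuming Loki runs under systemd with its data under `/loki` (both are assumptions; match your deployment):

```sh
sudo systemctl stop loki
sudo journalctl -u loki -n 50              # confirm the error actually points at the WAL
sudo mv /loki/wal /loki/wal.corrupt        # set the damaged WAL aside
sudo systemctl start loki
sudo journalctl -u loki --since "-5min"    # verify a clean start
```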
Testing Resilience
We recommend running these tests quarterly:

1. Network Partition Test - Disconnect the edge from the internet for 1 hour. Verify logs continue locally and metrics flush on reconnect.
2. Process Crash Test - Kill Vector with `kill -9`. Verify automatic restart and no log gaps.
3. Disk Pressure Test - Fill the disk to 90%. Verify ring buffer rotation and no service crashes.
4. High Volume Test - Send 10x normal log volume. Verify rate limiting activates and the system remains stable.
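As one example of scripting these, the process-crash test against a Docker-managed agent might look like the following; the container name and restart policy are assumptions about your deployment:

```sh
docker kill --signal=KILL logfleet-vector            # the moral equivalent of kill -9
sleep 30
docker ps --filter "name=logfleet-vector"            # restart policy should have revived it
docker logs --since 2m logfleet-vector | tail -n 20  # look for checkpoint-resume messages
```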
Summary
| Failure | Impact | Recovery Time | Data Loss |
|---|---|---|---|
| Internet outage | Metrics delayed | Automatic | None |
| Loki disk full | Old logs deleted | Immediate | Oldest logs |
| Vector crash | Brief gap | Under 30 seconds | None* |
| Cloud outage | No dashboard | Automatic | None |
| High volume | Rate limited | Immediate | Excess events |
*With proper checkpoint configuration. File inputs may replay briefly on restart.