Designed for failure. LogFleet is built by reliability engineers who understand that networks fail, disks fill up, and services crash. Here’s how we handle it.
Design Principles
LogFleet follows three core resilience principles:

- Fail locally, recover automatically - Edge components should continue working during cloud outages
- Never lose logs - Use buffering and disk persistence to survive restarts
- Degrade gracefully - When resources are constrained, shed load predictably
Failure Scenarios
Scenario 1: Internet Connection Lost
What happens:

- Edge agent continues collecting logs normally
- Logs are stored in the local Loki instance
- Metrics buffer in Vector’s disk queue
- Heartbeats fail (expected)

When the connection is restored:

- Buffered metrics are automatically forwarded
- Cloud platform marks the agent as “online” again
- No manual intervention required
Buffer capacity: Vector buffers up to 1GB of metrics by default. At typical volumes, this covers 24-72 hours of outage.
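The buffer is configured per sink via Vector’s disk buffer options. A minimal sketch, assuming a `prometheus_remote_write` sink named `cloud_metrics` pointed at a hypothetical cloud endpoint (the sink, input, and endpoint names are illustrative, not part of the stock config):

```yaml
sinks:
  cloud_metrics:
    type: prometheus_remote_write
    inputs: ["host_metrics"]                                 # illustrative source name
    endpoint: "https://cloud.logfleet.example/api/v1/push"   # hypothetical endpoint
    buffer:
      type: disk
      max_size: 1073741824   # 1 GiB, the default mentioned above
      when_full: block       # apply backpressure rather than dropping events
```

Raising `max_size` extends how long an outage can last before backpressure kicks in.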
Scenario 2: Loki Storage Full
What happens:

- Loki’s ring buffer automatically deletes the oldest logs
- Compactor runs the retention policy (default: 7 days)
- New logs continue being ingested
- An alert is fired via Vector’s internal metrics

Mitigation:

- Configure an appropriate `retention_period` in Loki (see the sketch below)
- Monitor disk usage via the Prometheus endpoint
- Set up alerts for >80% disk utilization
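A sketch of the retention-related Loki settings behind this scenario. This follows Loki’s compactor-based retention; the exact field set varies between Loki versions, and the working directory is an assumption:

```yaml
limits_config:
  retention_period: 168h        # 7 days, matching the default above

compactor:
  retention_enabled: true
  working_directory: /loki/compactor   # assumed path; match your storage layout
```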
Scenario 3: Vector Crashes
What happens:

- Vector process exits
- Kubernetes/Docker restarts the container
- On restart, Vector reads its checkpoint from disk
- Resumes from the last known position (no duplicate logs)

How checkpointing prevents data loss (see the config sketch below):

- File inputs: uses file offset tracking
- Network inputs: acknowledges only after a successful write
- Crash recovery: replays unacknowledged data
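The behavior above hinges on two Vector settings: `data_dir`, where checkpoints and disk buffers persist across restarts, and end-to-end `acknowledgements` on the sink. A sketch with illustrative source/sink names, paths, and labels:

```yaml
data_dir: /var/lib/vector            # checkpoints and disk buffers survive restarts here

sources:
  app_logs:
    type: file
    include: ["/var/log/app/*.log"]  # file offsets for these are checkpointed automatically

sinks:
  local_loki:
    type: loki
    inputs: ["app_logs"]
    endpoint: "http://localhost:3100"
    labels:
      job: "app"                     # label set is illustrative
    encoding:
      codec: json
    acknowledgements:
      enabled: true                  # advance checkpoints only after Loki confirms the write
```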
Scenario 4: Cloud Platform Unavailable
What happens at the edge:

- Logs continue flowing to local Loki (unaffected)
- Metrics buffer to disk
- Heartbeats fail

What happens in the cloud:

- Streaming sessions cannot be started
- Dashboard shows agents with “unknown” status
- Historical data remains accessible
- New streaming requests fail

When the cloud recovers:

- Automatic reconnection with exponential backoff
- Buffered data is forwarded on recovery
- Agent status updates within 60 seconds
Edge-first means edge-independent. Your locations keep working. The cloud is for visibility, not functionality.
Scenario 5: High Log Volume Spike
What happens:

- Vector’s internal queue fills up
- Backpressure is applied to inputs
- The oldest buffered events are dropped if the queue overflows

Mitigation:

- Rate limiting at the source when possible (see the transform sketch below)
- Sampling for non-critical log sources
- Increase buffer size for expected spikes
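For the first two mitigations, Vector’s `throttle` and `sample` transforms can sit between sources and sinks; the input names, threshold, and rate below are illustrative:

```yaml
transforms:
  limit_noisy_source:
    type: throttle
    inputs: ["app_logs"]     # illustrative source name
    threshold: 1000          # max events allowed per window
    window_secs: 1
  sample_debug_logs:
    type: sample
    inputs: ["debug_logs"]   # illustrative source name
    rate: 10                 # keep roughly 1 in every 10 events
```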
Monitoring LogFleet Health
Key Metrics to Watch
| Metric | Alert Threshold | Meaning |
|---|---|---|
| `vector_buffer_byte_size` | >800MB | Buffer approaching capacity |
| `vector_events_out_total` | 0 for 5 min | No data flowing to cloud |
| `loki_ingester_memory_chunks` | >10,000 | High memory pressure |
| `disk_used_percent` | >85% | Storage filling up |
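These thresholds map directly onto Prometheus alerting rules. A sketch, assuming Prometheus already scrapes the edge agents; the rule names and `for` durations are illustrative:

```yaml
groups:
  - name: logfleet-edge
    rules:
      - alert: VectorBufferNearCapacity
        expr: vector_buffer_byte_size > 8e8          # ~800MB
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Vector buffer approaching capacity on {{ $labels.instance }}"
      - alert: VectorNoOutboundEvents
        expr: rate(vector_events_out_total[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No data flowing to cloud from {{ $labels.instance }}"
      - alert: EdgeDiskFillingUp
        expr: disk_used_percent > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk above 85% on {{ $labels.instance }}"
```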
Prometheus Endpoint
Vector exposes metrics at http://localhost:9598/metrics.
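A quick spot check with curl shows the buffer and throughput metrics from the table above:

```sh
curl -s http://localhost:9598/metrics | grep -E 'vector_buffer_byte_size|vector_events_out_total'
```

The endpoint is served by Vector’s `internal_metrics` source feeding a `prometheus_exporter` sink (0.0.0.0:9598 is that sink’s default address); a minimal sketch with illustrative component names:

```yaml
sources:
  vector_internal:
    type: internal_metrics

sinks:
  vector_prometheus:
    type: prometheus_exporter
    inputs: ["vector_internal"]
    address: "0.0.0.0:9598"
```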
Recovery Procedures
Manual Buffer Flush
If metrics are stuck in the buffer:
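Vector has no single “flush” command; in practice the disk buffer drains on its own once the upstream is reachable, and a restart forces a fresh connection attempt. A sketch, assuming a systemd-managed install and the default metrics port from the previous section:

```sh
curl -s http://localhost:9598/metrics | grep vector_buffer_byte_size    # how much is queued?
sudo systemctl restart vector                                           # or restart the container
# Watch the backlog drain; the value should trend toward zero once the cloud is reachable
watch -n 10 "curl -s http://localhost:9598/metrics | grep vector_buffer_byte_size"
```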
Clear Corrupted Checkpoints
If Vector won’t start after a crash:
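A sketch, assuming the default `data_dir` of `/var/lib/vector`; the checkpoint file layout under it varies by Vector version, so locate the files before moving anything:

```sh
sudo systemctl stop vector
sudo find /var/lib/vector -name 'checkpoints*'                  # see what is actually there
sudo mkdir -p /var/backups/vector-checkpoints
sudo find /var/lib/vector -name 'checkpoints*' \
  -exec mv {} /var/backups/vector-checkpoints/ \;               # set aside rather than delete
sudo systemctl start vector
# File sources re-read from their configured starting position afterwards, so expect
# a brief replay of recent lines (see the footnote under the Summary table)
```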
Loki Recovery
If Loki won’t start:
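One common reason Loki refuses to start after a hard stop is a corrupted write-ahead log; moving the WAL aside sacrifices any unflushed chunks but usually lets the ingester come up. A sketch, assuming Loki runs under systemd with its data under `/loki` (both are assumptions; match your deployment):

```sh
sudo systemctl stop loki
sudo journalctl -u loki -n 50              # confirm the error actually points at the WAL
sudo mv /loki/wal /loki/wal.corrupt        # set the damaged WAL aside
sudo systemctl start loki
sudo journalctl -u loki --since "-5min"    # verify a clean start
```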
Testing Resilience
We recommend running these tests quarterly:

1. Network Partition Test - Disconnect the edge from the internet for 1 hour. Verify logs continue locally and metrics flush on reconnect.
2. Process Crash Test - Kill Vector with `kill -9`. Verify automatic restart and no log gaps.
3. Disk Pressure Test - Fill the disk to 90%. Verify ring buffer rotation and no service crashes.
4. High Volume Test - Send 10x normal log volume. Verify rate limiting activates and the system remains stable.
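As one example of scripting these, the process-crash test against a Docker-managed agent might look like the following; the container name and restart policy are assumptions about your deployment:

```sh
docker kill --signal=KILL logfleet-vector            # the moral equivalent of kill -9
sleep 30
docker ps --filter "name=logfleet-vector"            # restart policy should have revived it
docker logs --since 2m logfleet-vector | tail -n 20  # look for checkpoint-resume messages
```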
Summary
| Failure | Impact | Recovery Time | Data Loss |
|---|---|---|---|
| Internet outage | Metrics delayed | Automatic | None |
| Loki disk full | Old logs deleted | Immediate | Oldest logs |
| Vector crash | Brief gap | Under 30 seconds | None* |
| Cloud outage | No dashboard | Automatic | None |
| High volume | Rate limited | Immediate | Excess events |
*With proper checkpoint configuration. File inputs may replay briefly on restart.