Understand router availability states and how to manage your router's online presence. Learn about status monitoring, health checks, and how status changes affect request routing.
Routers can exist in four distinct states that reflect their operational health and availability to handle requests. Understanding these states helps you monitor and troubleshoot your router effectively.
Online: Your router is fully operational and performing optimally. All services are running and accepting new requests.
Degraded: Your router is still serving requests but is experiencing performance issues or partial service failures.
Offline: Your router is completely unavailable and cannot serve any requests. This state requires immediate attention.
Unknown: The router's status cannot be determined, typically during startup or because of network issues.
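In client code, the four states (online, degraded, offline, unknown) can be modeled as an enumeration. The sketch below is illustrative only: the names `RouterStatus` and `is_serving` are hypothetical and not part of the SyftBox API, and treating an unknown router as not serving is a conservative assumption.

```python
from enum import Enum


class RouterStatus(Enum):
    """The four availability states described above (hypothetical type)."""
    ONLINE = "online"
    DEGRADED = "degraded"
    OFFLINE = "offline"
    UNKNOWN = "unknown"


def is_serving(status: RouterStatus) -> bool:
    """Return True for states in which the router still serves requests.

    Online and degraded routers serve requests; offline routers do not.
    Unknown is treated as not serving, which is an assumption of this sketch.
    """
    return status in (RouterStatus.ONLINE, RouterStatus.DEGRADED)
```

With this in place, `is_serving(RouterStatus.DEGRADED)` is true while `is_serving(RouterStatus.OFFLINE)` is false.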
The SyftBox platform continuously monitors your router's health by making regular requests to its health endpoint. This automated monitoring ensures quick detection of any issues.
Your router must implement this endpoint to participate in the monitoring system:
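# Health check endpoint monitoring
# Assumes a FastAPI application, as the @app.get decorator suggests.
import time
from datetime import datetime, timezone

from fastapi import FastAPI

app = FastAPI()
START_TIME = time.monotonic()  # recorded once at process start

def get_uptime():
    """Illustrative helper: seconds elapsed since the process started."""
    return time.monotonic() - START_TIME

@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "services": {
            "chat": "running",
            "search": "running"
        },
        "uptime": get_uptime(),
        "version": "1.0.0"
    }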
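To see roughly what a monitor does on its side, the following sketch sends a single request to a health endpoint and times it. It is not the SyftBox monitoring code: `ProbeResult` and `probe_health` are hypothetical names, the URL is whatever address your router listens on, and the default timeout of 10 seconds is an assumption borrowed from the monitoring settings on this page.

```python
import http.client
import json
import time
import urllib.request
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ProbeResult:
    ok: bool               # True if the endpoint answered HTTP 200 with valid JSON
    elapsed_ms: float      # wall-clock time spent on the request
    payload: Any           # parsed response body, or None on failure
    error: Optional[str]   # description of the failure, or None on success


def probe_health(url: str, timeout: float = 10.0) -> ProbeResult:
    """Send one GET request to a health endpoint and time it."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
            ok = resp.status == 200
        error = None if ok else f"unexpected HTTP status {resp.status}"
    except (OSError, ValueError, http.client.HTTPException) as exc:
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # JSON and Unicode decoding errors are ValueError subclasses.
        ok, payload, error = False, None, str(exc)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return ProbeResult(ok, elapsed_ms, payload, error)
```

Running `probe_health("http://localhost:8000/health")` against a local router (the port here is only an example) lets you check both the response body and the response time before the platform does.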
The monitoring system uses configurable intervals and thresholds to determine when to change a router's status. These settings balance responsiveness with system stability.
# Configuration for status monitoring
MONITORING_CONFIG = {
    "health_check_interval": 30,  # seconds
    "timeout": 10,                # seconds for health check
    "max_failures": 3,            # consecutive failures before offline
    "recovery_checks": 2,         # successful checks to go online
    "degraded_threshold": 5000,   # ms response time for degraded
}
These configuration values determine how quickly the system detects issues and recovers from failures. Adjust them based on your router's expected performance characteristics.
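As a rough worked example, the values above translate into detection and recovery windows as follows. The arithmetic assumes that checks start on a fixed schedule every `health_check_interval` seconds regardless of how long each check takes, that a failing check either fails immediately (best case) or runs into the full timeout (worst case), and that recovery needs `recovery_checks` successes in a row. The platform's actual scheduling may differ, and the function names are illustrative.

```python
MONITORING_CONFIG = {
    "health_check_interval": 30,  # seconds
    "timeout": 10,                # seconds for health check
    "max_failures": 3,            # consecutive failures before offline
    "recovery_checks": 2,         # successful checks to go online
}


def offline_detection_window(cfg: dict) -> tuple[int, int]:
    """Return (best, worst) seconds between an outage and the offline mark."""
    interval = cfg["health_check_interval"]
    # Best case: the outage begins just before a check and failures are instant.
    best = (cfg["max_failures"] - 1) * interval
    # Worst case: the outage begins just after a check started, and every
    # later check waits for the full timeout before counting as failed.
    worst = cfg["max_failures"] * interval + cfg["timeout"]
    return best, worst


def recovery_window(cfg: dict) -> tuple[int, int]:
    """Return (best, worst) seconds between recovery and the online mark."""
    interval = cfg["health_check_interval"]
    best = (cfg["recovery_checks"] - 1) * interval
    worst = cfg["recovery_checks"] * interval
    return best, worst


print(offline_detection_window(MONITORING_CONFIG))  # (60, 100)
print(recovery_window(MONITORING_CONFIG))           # (30, 60)
```

Under these assumptions, lowering `health_check_interval` shortens both windows, while raising `max_failures` makes the router more tolerant of brief glitches at the cost of slower outage detection.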
Routers transition between states based on their health check responses and performance metrics. The system uses hysteresis to prevent rapid state changes that could cause instability.
[Unknown]  →  [Online]   →  [Degraded]   →  [Offline]
    ↑             ↑              ↑              ↑
 Startup       Healthy      Slow/Errors      Failed
  State        Response      Response        Health
                                             Checks
The transition logic ensures that temporary issues don't immediately mark a router as offline, while persistent problems are quickly detected and addressed.
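One way to picture this transition logic is a small tracker that counts consecutive failures and successes. This is a minimal sketch, not the platform's implementation: the class `StatusTracker` is hypothetical, the thresholds mirror the monitoring settings on this page, and the exact rules (for example, that a single failed check moves an online router to degraded, or that leaving the unknown state also needs several successes) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class StatusTracker:
    max_failures: int = 3            # consecutive failures before offline
    recovery_checks: int = 2         # consecutive successes to leave offline/unknown
    degraded_threshold_ms: float = 5000
    status: str = "unknown"
    failures: int = 0                # current run of failed checks
    successes: int = 0               # current run of successful checks

    def record(self, ok: bool, response_ms: float = 0.0) -> str:
        """Feed in one health check result and return the resulting status."""
        if not ok:
            self.failures += 1
            self.successes = 0
            if self.failures >= self.max_failures:
                self.status = "offline"
            elif self.status == "online":
                # A few errors degrade the router without taking it offline.
                self.status = "degraded"
            return self.status

        self.successes += 1
        self.failures = 0
        # Hysteresis: one good check is not enough to leave offline or unknown.
        if self.status in ("offline", "unknown") and self.successes < self.recovery_checks:
            return self.status
        slow = response_ms >= self.degraded_threshold_ms
        self.status = "degraded" if slow else "online"
        return self.status
```

Feeding the tracker two fast successes takes it from unknown to online; three consecutive failures then take it through degraded to offline, and it stays offline until two successful checks arrive in a row.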
When your router's status changes, it affects how requests are routed and how usage is tracked:
Monitoring these transitions helps you understand your router's reliability and identify patterns that might indicate underlying issues.
Design your health checks to be:
import time
from datetime import datetime, timezone


class HealthChecker:
    def __init__(self, config, chat_service, search_service):
        self.config = config
        # Service clients to probe, supplied by the caller
        self.chat_service = chat_service
        self.search_service = search_service
        self.started_at = time.monotonic()

    def get_uptime(self):
        """Seconds elapsed since this checker was created."""
        return time.monotonic() - self.started_at

    async def check_chat_service(self):
        """Check if chat service is responsive."""
        try:
            # Test actual chat functionality
            response = await self.chat_service.generate_response("health check")
            return {"status": "healthy", "response_time": response.time}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}

    async def check_search_service(self):
        """Check if search service is responsive."""
        try:
            # Test actual search functionality
            results = await self.search_service.search("test query", limit=1)
            return {"status": "healthy", "results_count": len(results)}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}

    async def get_comprehensive_health(self):
        """Get detailed health status for all services."""
        chat_health = await self.check_chat_service()
        search_health = await self.check_search_service()

        # The router is healthy only if every service is healthy
        overall_healthy = (
            chat_health["status"] == "healthy" and
            search_health["status"] == "healthy"
        )

        return {
            "status": "healthy" if overall_healthy else "degraded",
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "services": {
                "chat": chat_health,
                "search": search_health
            },
            "uptime": self.get_uptime(),
            "version": "1.0.0"
        }
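The per-service checks above run one after another, so a single slow service can push the whole health response past the monitor's timeout. A possible refinement, sketched here under assumptions, is to run the checks concurrently and bound each one with its own timeout. The helper `run_checks` and the default `per_check_timeout` of 3 seconds are hypothetical; each check is any coroutine function that returns a dictionary with a "status" key.

```python
import asyncio
from typing import Awaitable, Callable, Dict

Check = Callable[[], Awaitable[dict]]


async def run_checks(checks: Dict[str, Check], per_check_timeout: float = 3.0) -> dict:
    """Run all checks concurrently, bounding each one by a timeout."""

    async def guarded(check: Check) -> dict:
        try:
            return await asyncio.wait_for(check(), timeout=per_check_timeout)
        except asyncio.TimeoutError:
            # A check that hangs is reported as unhealthy instead of blocking.
            return {"status": "unhealthy", "error": "timed out"}
        except Exception as exc:
            return {"status": "unhealthy", "error": str(exc)}

    names = list(checks)
    results = await asyncio.gather(*(guarded(checks[name]) for name in names))
    services = dict(zip(names, results))
    healthy = all(result["status"] == "healthy" for result in results)
    return {"status": "healthy" if healthy else "degraded", "services": services}
```

Because every check is wrapped individually, the total time is bounded by roughly `per_check_timeout` rather than by the sum of all checks, which keeps the endpoint comfortably inside a 10-second monitoring timeout.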
Common status issues and their solutions: