Join our Team
About this opportunity:
Large data centers are organized into silos such as storage, compute, and networking. This promotes specialization, but it also creates barriers to effective root cause analysis (RCA). The silos operate with separate monitoring tools, which limits cross-team visibility and collaboration and leads to slow, inefficient RCA during incidents. The complexity of large-scale infrastructure amplifies these challenges, making it difficult to pinpoint the origin of problems and often resulting in extended downtime.
What you will do:
This thesis aims to develop a framework to enhance root cause analysis in large-scale data centers with siloed organizational structures. The research will focus on three main areas:
- Centralized Data Collection: Creating a system that integrates observability data (logs, metrics, traces) from all silos, offering a unified view of incidents across storage, compute, networking, and other domains.
- AI/ML-Powered RCA: Applying machine learning models to predict root causes by correlating data across silos and analyzing historical incidents. These models will automate parts of the RCA process, reducing reliance on manual investigation.
- Key User-Focused Metrics: Identifying and proposing a set of key metrics that directly indicate user-facing problems (e.g., latency spikes, error rates, request failures). These metrics will help detect user impact proactively during incidents, guiding teams to the root cause faster.
Additionally, this work will provide best practices for improving collaboration between siloed teams, including tools for capturing and diagnosing problems efficiently in real time. The goal is to reduce downtime, optimize resource usage, and improve user experience by enabling faster and more accurate RCA.
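To make the three focus areas more concrete, the following is a minimal Python sketch, not the framework itself. It assumes a hypothetical `Event` record shared by all silos, merges per-silo streams into one timeline, computes a user-facing error rate per time window, and lists alerts from any silo that occurred in or just before a degraded window. All names, the 60-second window, and the 0.5 threshold are illustrative assumptions; a simple time-based correlation stands in here for the machine learning models the thesis would investigate.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical unified observability record (an assumption, not a standard)."""
    timestamp: float   # seconds since some common epoch
    silo: str          # e.g. "storage", "compute", "networking"
    kind: str          # "request" (user-facing) or "alert" (infrastructure)
    ok: bool = True    # for requests: whether the request succeeded
    detail: str = ""


def unify(*streams):
    """Merge per-silo event streams into one timeline ordered by timestamp."""
    return sorted((e for s in streams for e in s), key=lambda e: e.timestamp)


def error_rate_by_window(events, window_s=60.0):
    """User-facing metric: fraction of failed requests per time window."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for e in events:
        if e.kind != "request":
            continue
        window = int(e.timestamp // window_s)
        totals[window] += 1
        if not e.ok:
            failures[window] += 1
    return {w: failures[w] / totals[w] for w in totals}


def candidate_causes(events, window_s=60.0, threshold=0.5):
    """For each degraded window, list alerts from the same or preceding window."""
    rates = error_rate_by_window(events, window_s)
    result = {}
    for window in sorted(w for w, r in rates.items() if r >= threshold):
        result[window] = [
            e for e in events
            if e.kind == "alert"
            and int(e.timestamp // window_s) in (window - 1, window)
        ]
    return result


if __name__ == "__main__":
    compute = [
        Event(10.0, "compute", "request", ok=True),
        Event(70.0, "compute", "request", ok=False),
        Event(80.0, "compute", "request", ok=False),
    ]
    storage = [Event(55.0, "storage", "alert", detail="disk latency high")]
    network = [Event(200.0, "networking", "alert", detail="link flap")]

    timeline = unify(compute, storage, network)
    for window, alerts in candidate_causes(timeline).items():
        print(window, [(a.silo, a.detail) for a in alerts])
```

In this toy data the failing requests fall in the second minute, and only the storage alert is close enough in time to be listed as a candidate; the later networking alert is ignored. A real system would replace the time-window heuristic with learned models and draw its data from the actual monitoring tools of each silo.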
The skills you bring:
You are a student in electrical engineering, computer science, computer engineering, or a similar field.