Walmart SRE Hub
Overview
After finishing Packet Patrol, I wanted to further develop my UX/UI design skills by seeking real-world applications. I reached out to one of Walmart's SREs for a case study and received valuable information on the pain points he and his team experience.
Problem Statement
Walmart's Site Reliability Engineering (SRE) team cannot quickly access relevant data or view MTTD, MTTE, and MTTR information in one location, because that information is currently scattered across different systems. This fragmentation hinders efficient identification and resolution of issues. To overcome this challenge, I designed a centralized hub that provides a single point of access to all relevant information.
Solution
My solution utilizes Figma to create a centralized hub that displays crucial information in one place. The hub is designed with SREs in mind, ensuring easy navigation and understanding of the information.
The hub features six panels that present the information in an organized and visually appealing manner. Each panel will contain charts, graphs, and tables that display key data points, such as the number of incidents, the average time to detect incidents, and the average time to resolve incidents. Additionally, the hub will be customizable, allowing SREs to configure the panels to fit their requirements.
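To make the key data points concrete, here is a minimal sketch of how a panel's summary numbers could be computed. The hub itself is a Figma design, so this is only an illustration: the `Incident` record, its field names, and the choice to measure both averages from the moment the incident started are my assumptions, not how Walmart's systems define or store these metrics.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions for illustration.
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime


def summarize(incidents):
    """Return incident count plus average minutes to detect and to resolve."""
    if not incidents:
        return {"count": 0, "avg_minutes_to_detect": None, "avg_minutes_to_resolve": None}
    # Both durations are measured from the start of the incident (an assumption).
    detect = [(i.detected_at - i.started_at).total_seconds() / 60 for i in incidents]
    resolve = [(i.resolved_at - i.started_at).total_seconds() / 60 for i in incidents]
    return {
        "count": len(incidents),
        "avg_minutes_to_detect": mean(detect),
        "avg_minutes_to_resolve": mean(resolve),
    }


# Made-up sample data, not real incident history.
sample = [
    Incident(datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 10, 10), datetime(2023, 1, 1, 11, 0)),
    Incident(datetime(2023, 1, 2, 12, 0), datetime(2023, 1, 2, 12, 20), datetime(2023, 1, 2, 12, 40)),
]
summary = summarize(sample)
```

A panel could show these three values as headline numbers above its charts, with the empty case rendered as "no incidents" rather than a zero average.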
Project Goals
The primary goal is to create a centralized hub that makes it convenient for the team to effectively identify the source of each incident. The hub will provide SREs with easy access to incidents with a visualized tree and detection graphs that lead to the respective applications.
The final design will provide a clear and concise overview of essential statistics for anomaly detection and incident response.
Process
In order to gain a thorough understanding of the needs of the SRE team, I met with Shreenidhi, a Site Reliability Engineer at Walmart.
Shreenidhi explained the technical aspects of their process when it came to locating and resolving incidents along with the various factors that unnecessarily complicated this process.
As he showed me some of the software they currently use, I began to understand how confusing and complicated it was to identify the source of an incident. In a large-scale system such as Walmart's, this process can seem like a never-ending maze because of the complexity of the codebase and the interdependence of its components. It is time-consuming, mentally taxing, and unclear, which are the problems this project aims to resolve.
If every SRE had a map that led them straight to the source of the incident, the process would be significantly faster. Instead of running into dead ends, they would walk straight to the center of the maze.
This design revolves around transforming each component into a node. Each node is color-coded based on its status and mapped onto a tree that depicts every node, allowing SREs to navigate it easily.
I chose to place each node onto a tree as each tree has several branches that are all connected, which I felt most accurately represented the task at hand. Since branching is commonly used in software management and development, this would be a very familiar concept to the SRE team.
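The node tree can be sketched as a small data structure. This is a rough illustration of the idea behind the design, not an implementation of the hub: the status names (`HEALTHY`, `DEGRADED`, `FAILING`), the component names, and the `paths_to_failing` helper are all hypothetical. Only the three colors come from the design itself.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    # Status names are assumptions; the colors match the design's palette.
    HEALTHY = "green"
    DEGRADED = "yellow"
    FAILING = "red"


@dataclass
class Node:
    name: str
    status: Status = Status.HEALTHY
    children: List["Node"] = field(default_factory=list)


def paths_to_failing(node, trail=()):
    """Return every root-to-node path that ends at a failing (red) node."""
    trail = trail + (node.name,)
    found = []
    if node.status is Status.FAILING:
        found.append(list(trail))
    for child in node.children:
        found.extend(paths_to_failing(child, trail))
    return found


# Hypothetical component tree, for illustration only.
root = Node("checkout", Status.DEGRADED, [
    Node("payments", Status.FAILING),
    Node("cart", Status.HEALTHY, [Node("inventory", Status.HEALTHY)]),
])
failing_paths = paths_to_failing(root)
```

Each returned path is the "map" through the maze: the branch an SRE would follow from the top of the tree down to the red node.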
Design Choices
To address the challenge of eye strain caused by long hours of working on a computer screen, a color palette consisting mainly of darker colors and brighter accent colors was selected. The accent colors help important information stand out over the dark background.
Color choice - 60/30/10 (accent colors):
Green, yellow, and red were selected to represent the status of each node on the trees. These colors were chosen because they’re commonly associated with traffic signals. Not only do they help users locate incidents, they also help users report incidents quickly by selecting the appropriate color rather than reading the status options in text form.
Easy-to-digest dense information:
The large graphs featured in each panel provide users with a quick overview of critical data. The data is also color-coded, which reduces cognitive load and lets users interpret the information faster and more easily. Additionally, the graphs’ size and placement provide a visual hierarchy, ensuring that the most critical data is prominent and easily accessible.
Conclusion
Overall, my solution aims to transform the incident resolution process from a never-ending maze to an effortless walk, providing a clear and concise overview of essential statistics for anomaly detection and incident response.