by John Allspaw
Web engineering and operations is a nascent, fast-growing and largely untouched discipline from a human factors and ergonomics (HF/E) standpoint. The domain is rife with complexity and ambiguity, due to the dynamic nature of the systems and networks involved, across both geographic and geopolitical boundaries. The traditional effects, challenges and trade-offs brought by the use of automation are amplified to a great extent, and there is an unmet need for HF/E support. The field is unique in many ways, but primarily in that the operating environment is quite opaque and that operators are also frequently the designers of their technical systems. There is no single overarching regulatory, standards, or policy-making body for these services. Instead, there is a myriad of overlapping local and regional regulations for various layers of the services, such as telecommunications, privacy, content, and commerce.
“As Internet-connected services continue to weave their way into the fabric of modern life and business, the design and operation of the supporting software systems has largely gone unnoticed, and to a large extent unstudied by the HF/E community.”
“…there is always a gap between software-as-imagined (by the author) and software-as-operated (by the user).”
“From an HF/E perspective, this preparation for failures can manifest in many ways: Alert design and anomaly response (Woods, 1995), operator overload/underload during outage scenarios and diagnosis as it happens in distributed teams.”
“Despite not having industry-wide standardization or formal agreement on procedures, the community has taken a rather progressive approach to learning from accidents and untoward events…in the shape of “blameless postmortems”.”
“HF/E practice can be confused with user experience (UX) practice. … Different organizations view the UX role differently, so HF/E practitioners should work closely with those groups to draw the parallels and contrast as sharp as they can.”
“The operation of software on such a massive scale, such as the Internet, provides challenges for organizations and individuals that lie outside the boundary of their own companies, so effort must be made to discover the foundational dynamics of how people work in this (relatively) nascent industry. The economic and geopolitical influence that this industry has on modern society is too great to ignore.”
Practitioner reflections
The chapter speaks directly to what I have experienced myself as a practitioner. Automation is at the core of so much of what we do, and the myriad ways it can go wrong, or produce results nobody imagined beforehand, are a daily reality. Very often a particular behaviour of a system is only uncovered after it has been running for months, when a confluence of factors finally brings it to light. I would very much welcome more HF/E research in our domain: we are comparatively young, and operating web systems at such a large scale has only become common in the last 10-15 years. Many of our operating habits, procedures and tools were created out of immediate need, without much reflection on the human factors of their design. Only recently have we started to look into this to improve the way we interact with and reason about the systems we are creating.
The fact that the designer and the user of a system are often the same person is something I identify with very strongly. Having to run and operate systems, and seeing them fail, has strongly influenced how I go about designing them. Before reading this chapter, I hadn’t really thought about how this contrasts with the reality in other disciplines. I had always taken for granted that I can immediately change a system to adapt to what I deem important in a situation. It’s something I very much enjoy about my work, but I also know it can be a pitfall that leads to solving very narrow use cases and to what we call “feature creep”.
I would love to enable more HF/E research into the way we design tooling and systems around our daily work. We also improvise a good number of tools, especially in dynamic fault management, where we have to find ways to diagnose a problem while at the same time relieving pressure on certain components to keep the system as a whole running. We most often surface this improvisation in blameless postmortems, where we ask questions about the tools at hand and what had to be created to help in a particular situation.
And then there is the whole matter of alerting systems, alert design and monitoring. We are now in the very comfortable situation of being able to add a new monitor and a new alert for a subcomponent of a system without much work. This has led to more dashboards than we can feasibly look at on any given day. On top of that, adding more and more checks and alerts for components has led to a high level of noise in our systems and, in many cases, to alert fatigue. I am convinced that a more HF/E-centric approach to monitoring and alerting, one that separates what is helpful from what just adds noise, is a huge undiscovered field in our profession. This chapter has definitely made me think about this again and want to invest more in it in the future.
I really like the chapter, and I very strongly agree with its call for more HF/E research in the field of web operations and infrastructure. I think, as John said, we default to open information sharing in our field and are very welcoming to new methods that help us reason about the complexity we have created. The chapter does a very good job of describing the current situation and the opportunities for researchers to improve something that impacts many people’s work (and life) every day.