I make developers, operators, and workers as a whole more productive and empowered.
Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with over two decades of experience. She is currently the Field CTO at Honeycomb, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.
She lives in both Vancouver, BC (with her wife Elly, partners, and a Samoyed/Golden Retriever mix) and Sydney, NSW. She plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.
Talks
Cultivating Production Excellence
Taming the complex distributed systems we're responsible for requires changing not just the tools and technical approaches we use; it also requires changing who is involved in production, how they collaborate, and how we measure success.
In this talk, you'll learn about several practices core to production excellence: giving everyone a stake in production, collaborating to ensure observability, measuring with Service Level Objectives, and prioritizing improvements using risk analysis.
Presented at QCon London 2019, Velocity San Jose (keynote), DevOpsDays Atlanta, and more. Example slides (from DevOpsDays Minneapolis)
Tradeoffs on the Road to Observability
SRE and infrastructure engineering are about allocating adequate time to do project work that improves the long-term sustainability of our services. But what do we reward SREs for doing? Does your company have a culture of "not invented here" or the converse of "ask the consultants to design it for us"?
We need to step back and ensure that we are doing the most efficient thing with our time as SREs. The main way that we can improve sustainability is by empowering other engineers, especially non-SREs, to understand our systems, rather than by building things for the sake of our resumes and/or glory.
Presented at Monitorama PDX 2019 (slides).
Refining Systems Data Without Losing Fidelity
It is not feasible to run an observability infrastructure that is the same size as your production infrastructure. Past a certain scale, the cost to collect, process, and save every log entry, every event, and every trace that your systems generate dramatically outweighs the benefits. If your SLO is 99.95%, then you'll be naively collecting roughly 2,000 times as much data about requests that satisfied your SLI as about those that burnt error budget. The question is how to scale back the flood of data without losing the crucial information your engineering team needs to troubleshoot and understand your system's production behaviors.
Statistics can come to our rescue, enabling us to gather accurate, specific, and error-bounded data on our services' top-level performance and inner workings. This talk advocates a three-R approach to data retention: Reducing junk data, statistically Reusing data points as samples, and Recycling data into counters. We can keep the context of the anomalous data flows and cases in our supported services while not allowing the volume of ordinary data to drown it out.
Presented at SREcon Europe 2019 (slides); upcoming at Monitorama Baltimore 2019.
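As a rough illustration of the arithmetic and the sampling idea in the abstract above, here is a minimal Python sketch. It is not taken from the talk: the function names (success_to_failure_ratio, sample_events, estimated_total), the "ok" and "sample_rate" fields, and the 1-in-100 rate are all assumptions made for this example.

```python
import random


def success_to_failure_ratio(slo: float) -> float:
    """Ratio of SLI-satisfying requests to budget-burning ones
    for a service that exactly meets its SLO (e.g. 0.9995)."""
    return slo / (1.0 - slo)


def sample_events(events, success_rate=100, rng=None):
    """Keep every failed event and roughly 1 in `success_rate` successes.

    Each kept event records the rate it was sampled at (hypothetical
    `sample_rate` field), so totals can later be re-estimated.
    """
    rng = rng or random.Random()
    kept = []
    for event in events:
        if not event["ok"]:
            # Anomalous data is rare and valuable: never drop it.
            kept.append({**event, "sample_rate": 1})
        elif rng.random() < 1.0 / success_rate:
            # One kept success stands in for `success_rate` similar ones.
            kept.append({**event, "sample_rate": success_rate})
    return kept


def estimated_total(kept):
    """Estimate the original event count from the sampled events."""
    return sum(event["sample_rate"] for event in kept)


# A 99.95% SLO works out to about 1,999 good requests per bad one.
ratio = success_to_failure_ratio(0.9995)
```

Under these assumptions, every failure survives sampling while ordinary successes are thinned out, and summing the recorded sample rates gives an estimate (not an exact count) of the original traffic.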
Organizing for Your Ethical Principles
Our job as engineers does not stop with eliminating technical defects and ensuring high reliability. Engineers of all kinds must ensure their work serves the public good. A service that reliably harms, exacerbates injustices, or excludes marginalized groups is not a service worth building and maintaining. Learn how to effectively accomplish change in your working conditions or your employer's products through grassroots employee advocacy.
Video: SREcon EMEA 2018 (slides) as a joint keynote with Emily Gorcenski
Also presented at Write/Speak/Code 2018 (slides), QCon NYC 2018 (video), and privately within Google.
Older talks are archived on a separate page.
Publications & Videos
Observability Engineering w/ Charity Majors and George Miranda, in fine bookstores near you as of May 2022!
SRE and DevOps video series w/ Seth Vargo on the GCP YouTube channel
"SRE vs. DevOps: competing standards or close friends?" w/ Seth Vargo on the GCP Blog
"How SRE relates to DevOps" w/ Betsy Beyer and Niall Murphy in the Site Reliability Workbook
Sustainable Operations in Complex Systems With Production Excellence in InfoQ
Framework for an Observability Maturity Model (vendor-neutral whitepaper sponsored by Honeycomb)
"Intersections between Operations and Social Activism" w/ Emily Gorcenski in Seeking SRE
“Jeff Bezos is wrong, tech workers are not bullies” w/ Laura Nolan et al. in the Financial Times
“Our Executives Engaged in Abuse. Don’t Let Kink and Polyamory Be Their Scapegoats.” in Medium Featured Stories
“Google Workers Lost a Leader, But the Fight Will Continue” in Medium Featured Stories
"Interrupt Reduction Projects" w/ John Tobin and Betsy Beyer in USENIX ;login:
"A Hierarchy of SRE Needs" (blog)
Interviews & Podcasts
GCP Podcast Episode 127 with Seth Vargo, Melanie Warrick, and Mark Mandel
GCP Podcast Episode 139 with Melanie Warrick and Mark Mandel
Screaming in the Cloud Episode 19 with Corey Quinn
Fireside Chat at FutureStack NYC with Matthew Flaming
DevOps/SRE AMA with Charity Majors and Adam Jacob, hosted by Andrew Smirnov of Catchpoint
o11ycast Episode 6 with Charity Majors and Rachel Chalmers
I also frequently sit on panels about management, SRE, and ethics.
Technical Press & Citations
Citations
"Site Reliability Engineering: Philosophies, Habits, and Tools for SRE Success" (blog by New Relic)
"Accelerate: State of DevOps Report: Strategies for a New Economy" (by DevOps Research and Assessment)
Press
SRE model requires technical, organizational optimization skills
Beth Pariseau, TechTarget, October 16, 2018
"Grumpy humans are really bad at running systems," said Liz Fong-Jones, developer advocate at Google and former leader of the Google SRE team responsible for Bigtable. Fong-Jones spoke from experience about how to optimize human labor at an SRE conference here last week. "Unfair distribution of work prevents system scale," she said.
Google Cloud Next '18: What datacentre operators can learn from how Google SRE teams operate
Caroline Donnelly, Computer Weekly, July 24, 2018
Google has used the statement “class SRE implements DevOps” to title a new (and growing) video playlist by Liz Fong-Jones and Seth Vargo of Google Cloud Platform, showing how and where these disciplines connect, while nudging DevOps practitioners to consider some key SRE insights.
How Facebook operations got 10 times faster while getting 10 times bigger
Stephen Shankland, CNET, July 19, 2018
At the conference, engineers from Facebook and other tech companies, like Amazon, Shopify, Lyft, Google and Yahoo gave talks and asked questions of their peers. The profusion of management tools shows how complex it is to run suites of services on hundreds or thousands of servers. Over and over, engineers spoke of completely overhauling their technology every few years as massive growth overwhelmed the earlier system.
Increasingly sophisticated tools spotlight problems and help people trace their origins, said Google site reliability engineer Liz Fong-Jones.
Debugging Microservices: Lessons from Google, Facebook, Lyft
Joab Jackson, The New Stack, Jul 3, 2018
As your system grows more complex, and your knowledge of what can go wrong increases, you may be tempted to expand a dashboard with more metrics representing outages. This is a bad idea, advised Google Site Reliability Engineer (SRE) Liz Fong-Jones. Too many dashboards leads to cognitive overload, and as the SRE just blindly looks through a set of visualized queries, looking for patterns. It’s wasted time, she warned.
Defining the role of a Site Reliability Engineer
Matt Santamaria, ITOpsTimes, March 27, 2018
“Site Reliability Engineering is a specialized job function that focuses on the reliability and maintainability of large systems,” said Liz Fong-Jones, staff Site Reliability Engineer at Google. “SREs couple operational responsibility with the competence and agency of software engineering to guide system architecture. They aim to strike the right balance between reliability and development speed by engineering solutions to operational problems.”
No Grumpy Humans and Other Site Reliability Engineering Lessons from Google
TC Currie, The New Stack, October 24, 2017
“It’s really about communication, humility and trust,” said Google engineer Liz Fong-Jones of the emerging practice of site reliability engineering, at New Relic’s FutureStack New York 2017 last month.
Press
Three Years of Misery Inside Google, the Happiest Company in Tech
Nitasha Tiku, Wired, August 13, 2019
To Liz Fong-Jones, a site reliability engineer at Google, the memo's arguments were especially familiar. Google's engineers are not unionized, but inside Google, Fong-Jones essentially performed the function of a union rep, translating employee concerns to managers on everything from product decisions to inclusion practices.
As part of this internal advocacy work, Fong-Jones had become attuned to the way discussions about diversity on internal forums were beset by men like Cernekee, Damore, and other coworkers who were “just asking questions.” To her mind, Google's management had allowed these dynamics to fester for too long, and now it was time for executives to take a stand.
Inside Google’s Civil War
Beth Kowitt, Fortune, May 17, 2019
Much of the organizing efforts have been led by site reliability engineers (SREs). Their remit is to operate the most critical services Google runs. When something breaks, they’re the ones who get paged to fix it. They troubleshoot and diagnose problems, and they are expected to have opinions and questions. “You have to go probe for weaknesses,” says Fong-Jones, who was an SRE, “and also challenge people when you think something that they’re trying to railroad through is not okay.” Within the SRE world, there’s a concept called blameless postmortem—it’s a way of looking back at mistakes made without throwing anyone under the bus. “It’s a fundamental part of the culture at Google,” says Tariq Yusuf, a privacy engineer who’s been with the company almost five years. “It’s an ability to say this is a thing that’s wrong.” Retaliation, he says, removes the core barrier of being able to safely raise issues. “The whole process breaks down.”
Community
2017-present: Global Steering Committee Member, SREcon (USENIX)
SREcon Asia/Australia 2022: Program Co-Chair
SREcon Americas {2016,2017,2019}: Program Co-Chair
SREcon Europe {2016,2017}: Program Committee Member
SREcon {Americas,EMEA,Asia/Australia} 2018: Program Committee Member
OpenTelemetry Governance Committee Emeritus
Grants & Investments
I engage in angel investing in social-benefit-focused, for-profit startups, and do targeted grant-making to enable non-profits to scale. My areas of competency and focus are on problems faced by transgender people (especially trans people of color), including policy work, impact litigation, poverty alleviation, violence prevention, suicide prevention, and addressing online/offline harassment.
Non-Profit Grantees
National Center for Transgender Equality (also a 2018-2020 board member)
Coworker Solidarity Fund (founder, board chair)
AIDS Action Committee of Massachusetts (Youth on Fire) - story
Paladin (fiscally sponsored by Black and Brown Founders)
For-Profit Seed Investments
Resume
Paid Experience
Field CTO
honeycomb.io
October 2022 to present
Principal Developer Advocate
honeycomb.io
February 2019 to September 2022
Staff Developer Advocate, SRE/DevOps/Infra&Ops
Google LLC
August 2018 to January 2019
Staff Site Reliability Engineer, Customer Reliability Engineering
Google LLC
July 2017 to July 2018
Site Reliability Engineering Manager, Bigtable
Google Inc
June 2015 to June 2017
Senior Site Reliability Engineer [Google Play Books, GFE, Google Flights]
Google Inc
June 2012 to May 2015
Site Reliability Engineer [HR Info Systems, Developer Infrastructure, Bigtable]
Google Inc
January 2008 to May 2012
Technical Operations Manager, Puzzle Pirates Support Tools & Anti-Cheating (contract)
Three Rings Design
March 2005 to December 2007
OS X Systems Administrator
College Preparatory Mathematics
June 2004 to August 2005
Education
SB Computer Science and Engineering (course 6-3)
Massachusetts Institute of Technology
February 2014
Volunteer Experience
Board Member
National Center for Transgender Equality
December 2017 to February 2020
UNIX System Administrator, Undergraduate Computer Science Lab
California Institute of Technology
February 2006 to December 2007
Skills & Languages
Go
C++
Java
Python
Distributed Systems
Incident response
English (native)
Spanish (intermediate/B2)
Technical Communication
Livetweeting/liveblogging
Patents
US8656465B1 - "Userspace permissions service"
US8694791B1 and US9015827B2 - "Transitioning between access states of a computing device" (w/ Florian Rohrweck)
Connect
Public Key: 1F77 14D7 EC34 41D2 CECC 2460 6A3F 8B00 FBDD D2A4