Liz Fong-Jones
@lizthegrey
 
 

I make developers, operators, and workers as a whole more productive and empowered.

Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with over two decades of experience. She is currently the Field CTO at Honeycomb, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.

She lives in Vancouver, BC, with her wife Elly, partners, and a Samoyed/Golden Retriever mix, and also in Sydney, NSW. She plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.

Connect

Public Key: 1F77 14D7 EC34 41D2 CECC  2460 6A3F 8B00 FBDD D2A4

Talks

 

Cultivating Production Excellence

Taming the complex distributed systems we're responsible for requires changing not just the tools and technical approaches we use; it also requires changing who is involved in production, how they collaborate, and how we measure success.

In this talk, you'll learn about several practices core to production excellence: giving everyone a stake in production, collaborating to ensure observability, measuring with Service Level Objectives, and prioritizing improvements using risk analysis.

Presented at QCon London 2019, Velocity San Jose (keynote), DevOpsDays Atlanta, and more. Example slides (from DevOpsDays Minneapolis)


Tradeoffs on the Road to Observability

SRE and infrastructure engineering are about allocating adequate time to do project work that improves the long-term sustainability of our services. But what do we reward SREs for doing? Does your company have a culture of "not invented here" or the converse of "ask the consultants to design it for us"?

We need to step back and ensure that we are doing the most efficient thing with our time as SREs. The main way that we can improve sustainability is by empowering other engineers, especially non-SREs, to understand our systems, rather than by building things for the sake of our resumes and/or glory.

Presented at Monitorama PDX 2019 (slides).


Refining Systems Data Without Losing Fidelity

It is not feasible to run an observability infrastructure that is the same size as your production infrastructure. Past a certain scale, the cost to collect, process, and save every log entry, every event, and every trace that your systems generate dramatically outweighs the benefits. If your SLO is 99.95%, then by naively collecting everything you'll gather roughly 2,000 times as much data about requests that satisfied your SLI as about those that burnt error budget. The question is how to scale back the flood of data without losing the crucial information your engineering team needs to troubleshoot and understand your system's production behaviors.

Statistics can come to our rescue, enabling us to gather accurate, specific, and error-bounded data on our services' top-level performance and inner workings. This talk advocates a three-R approach to data retention: Reducing junk data, statistically Reusing data points as samples, and Recycling data into counters. We can keep the context of the anomalous data flows and cases in our supported services while not allowing the volume of ordinary data to drown it out.
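As a rough illustration of the "Reusing data points as samples" idea, the sketch below keeps every failed request and about 1 in 100 successful ones, recording a sample rate on each kept event so the original totals can still be estimated. It is not code from the talk: the `Event` type, the function names, and the 1-in-100 rate are all hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class Event:
    ok: bool              # True if the request satisfied the SLI
    sample_rate: int = 1  # how many original events this one stands for


def sample(events, success_rate=100, rng=None):
    """Keep every failed event and roughly 1 in `success_rate` successes."""
    rng = rng or random.Random()
    kept = []
    for ev in events:
        if not ev.ok:
            # Anomalous events are rare and valuable, so none are dropped.
            kept.append(Event(ok=False, sample_rate=1))
        elif rng.randrange(success_rate) == 0:
            # Each kept success represents `success_rate` original events.
            kept.append(Event(ok=True, sample_rate=success_rate))
    return kept


def estimated_counts(kept):
    """Estimate original totals by weighting each kept event by its sample rate."""
    good = sum(ev.sample_rate for ev in kept if ev.ok)
    bad = sum(ev.sample_rate for ev in kept if not ev.ok)
    return good, bad


def demo(seed=0):
    # A 99.95% success ratio over 200,000 requests:
    # 199,900 good and 100 bad, i.e. 1,999 good requests per bad one.
    events = [Event(ok=True)] * 199_900 + [Event(ok=False)] * 100
    kept = sample(events, success_rate=100, rng=random.Random(seed))
    return len(kept), estimated_counts(kept)
```

Calling `demo()` stores only a small fraction of the 200,000 events, yet the count of failures is exact and the count of successes is an estimate whose error shrinks as traffic grows. The sample rate here is fixed for simplicity; a real system would likely vary it by traffic type.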

Presented at SREcon Europe 2019 (slides); upcoming at Monitorama Baltimore 2019.


Organizing for Your Ethical Principles

Our job as engineers does not stop with eliminating technical defects and ensuring high reliability. Engineers of all kinds must ensure their work serves the public good. A service that reliably harms, exacerbates injustices, or excludes marginalized groups is not a service worth building and maintaining. Learn how to effectively accomplish change in your working conditions or your employer's products through grassroots employee advocacy.

Video: SREcon EMEA 2018 (slides), a joint keynote with Emily Gorcenski

Also presented at Write/Speak/Code 2018 (slides), QCon NYC 2018 (video), and privately within Google.


 

Older talks are archived on a separate page.

Publications & Videos

Interviews & Podcasts

GCP Podcast Episode 127 with Seth Vargo, Melanie Warrick, and Mark Mandel

GCP Podcast Episode 139 with Melanie Warrick and Mark Mandel

Screaming in the Cloud Episode 19 with Corey Quinn

Fireside Chat at FutureStack NYC with Matthew Flaming

DevOps/SRE AMA with Charity Majors and Adam Jacob, hosted by Andrew Smirnov of Catchpoint

o11ycast Episode 6 with Charity Majors and Rachel Chalmers

I also frequently sit on panels about management, SRE, and ethics.

Technical Press & Citations

Citations

"Site Reliability Engineering: Philosophies, Habits, and Tools for SRE Success" (blog by New Relic)

"Accelerate: State of DevOps Report: Strategies for a New Economy" (by DevOps Research and Assessment)

Press

SRE model requires technical, organizational optimization skills

Beth Pariseau, TechTarget, October 16, 2018

"Grumpy humans are really bad at running systems," said Liz Fong-Jones, developer advocate at Google and former leader of the Google SRE team responsible for Bigtable. Fong-Jones spoke from experience about how to optimize human labor at an SRE conference here last week. "Unfair distribution of work prevents system scale," she said.

Google Cloud Next '18: What datacentre operators can learn from how Google SRE teams operate

Caroline Donnelly, Computer Weekly, July 24, 2018

Google has used the statement “class SRE implements DevOps” to title a new (and growing) video playlist by Liz Fong-Jones and Seth Vargo of Google Cloud Platform, showing how and where these disciplines connect, while nudging DevOps practitioners to consider some key SRE insights.

How Facebook operations got 10 times faster while getting 10 times bigger

Stephen Shankland, CNET, July 19, 2018

At the conference, engineers from Facebook and other tech companies, like Amazon, Shopify, Lyft, Google and Yahoo gave talks and asked questions of their peers. The profusion of management tools shows how complex it is to run suites of services on hundreds or thousands of servers. Over and over, engineers spoke of completely overhauling their technology every few years as massive growth overwhelmed the earlier system.

Increasingly sophisticated tools spotlight problems and help people trace their origins, said Google site reliability engineer Liz Fong-Jones.

Debugging Microservices: Lessons from Google, Facebook, Lyft

Joab Jackson, The New Stack, July 3, 2018

As your system grows more complex, and your knowledge of what can go wrong increases, you may be tempted to expand a dashboard with more metrics representing outages. This is a bad idea, advised Google Site Reliability Engineer (SRE) Liz Fong-Jones. Too many dashboards leads to cognitive overload, and as the SRE just blindly looks through a set of visualized queries, looking for patterns. It’s wasted time, she warned.

Defining the role of a Site Reliability Engineer

Matt Santamaria, ITOpsTimes, March 27, 2018

“Site Reliability Engineering is a specialized job function that focuses on the reliability and maintainability of large systems,” said Liz Fong-Jones, staff Site Reliability Engineer at Google. “SREs couple operational responsibility with the competence and agency of software engineering to guide system architecture. They aim to strike the right balance between reliability and development speed by engineering solutions to operational problems.”

No Grumpy Humans and Other Site Reliability Engineering Lessons from Google

TC Currie, The New Stack, October 24, 2017

“It’s really about communication, humility and trust,” said Google engineer Liz Fong-Jones of the emerging practice of site reliability engineering, at New Relic’s FutureStack New York 2017 last month.

Press

 

Three Years of Misery Inside Google, the Happiest Company in Tech

Nitasha Tiku, Wired, August 13, 2019

To Liz Fong-Jones, a site reliability engineer at Google, the memo's arguments were especially familiar. Google's engineers are not unionized, but inside Google, Fong-Jones essentially performed the function of a union rep, translating employee concerns to managers on everything from product decisions to inclusion practices.

As part of this internal advocacy work, Fong-Jones had become attuned to the way discussions about diversity on internal forums were beset by men like Cernekee, Damore, and other coworkers who were “just asking questions.” To her mind, Google's management had allowed these dynamics to fester for too long, and now it was time for executives to take a stand.


Inside Google’s Civil War

Beth Kowitt, Fortune, May 17, 2019

Much of the organizing efforts have been led by site reliability engineers (SREs). Their remit is to operate the most critical services Google runs. When something breaks, they’re the ones who get paged to fix it. They troubleshoot and diagnose problems, and they are expected to have opinions and questions. “You have to go probe for weaknesses,” says Fong-Jones, who was an SRE, “and also challenge people when you think something that they’re trying to railroad through is not okay.” Within the SRE world, there’s a concept called blameless postmortem—it’s a way of looking back at mistakes made without throwing anyone under the bus. “It’s a fundamental part of the culture at Google,” says Tariq Yusuf, a privacy engineer who’s been with the company almost five years. “It’s an ability to say this is a thing that’s wrong.” Retaliation, he says, removes the core barrier of being able to safely raise issues. “The whole process breaks down.”

Community

2017-present: Global Steering Committee Member, SREcon (USENIX)

SREcon Asia/Australia 2022: Program Co-Chair

SREcon Americas {2016,2017,2019}: Program Co-Chair

SREcon Europe {2016,2017}: Program Committee Member

SREcon {Americas,EMEA,Asia/Australia} 2018: Program Committee Member

OpenTelemetry Governance Committee Emeritus

Grants & Investments

I engage in angel investing in social-benefit-focused, for-profit startups, and do targeted grant-making to enable non-profits to scale. My areas of competency and focus are on problems faced by transgender people (especially trans people of color), including policy work, impact litigation, poverty alleviation, violence prevention, suicide prevention, and addressing online/offline harassment.

Non-Profit Grantees

For-Profit Seed Investments

Resume

Paid Experience

Field CTO
honeycomb.io
October 2022 to present

Principal Developer Advocate
honeycomb.io
February 2019 to September 2022

Staff Developer Advocate, SRE/DevOps/Infra&Ops
Google LLC
August 2018 to January 2019

Staff Site Reliability Engineer, Customer Reliability Engineering
Google LLC
July 2017 to July 2018

Site Reliability Engineering Manager, Bigtable
Google Inc
June 2015 to June 2017

Senior Site Reliability Engineer [Google Play Books, GFE, Google Flights]
Google Inc
June 2012 to May 2015

Site Reliability Engineer [HR Info Systems, Developer Infrastructure, Bigtable]
Google Inc
January 2008 to May 2012

Technical Operations Manager, Puzzle Pirates Support Tools & Anti-Cheating (contract)
Three Rings Design
March 2005 to December 2007

OS X Systems Administrator
College Preparatory Mathematics
June 2004 to August 2005

Education

SB Computer Science and Engineering (course 6-3)
Massachusetts Institute of Technology
February 2014

Volunteer Experience

Board Member
National Center for Transgender Equality
December 2017 to February 2020

UNIX System Administrator, Undergraduate Computer Science Lab
California Institute of Technology
February 2006 to December 2007

Skills & Languages

Go

C++

Java

Python

Distributed Systems

Incident response

Patents

US8656465B1 - "Userspace permissions service"

US8694791B1 and US9015827B2 - "Transitioning between access states of a computing device" (w/ Florian Rohrweck)

English (native)

Spanish (intermediate/B2)


Technical Communication

Livetweeting/liveblogging
