[Image: OpenAI logo with spiraling pastel colors. Credit: Bryce Durbin / TechCrunch]

OpenAI blames its massive ChatGPT outage on a ‘new telemetry service’

OpenAI is blaming one of the longest outages in its history on a “new telemetry service” gone awry.

On Wednesday, OpenAI’s AI-powered chatbot platform, ChatGPT; its video generator, Sora; and its developer-facing API experienced major disruptions starting at around 3 p.m. Pacific. OpenAI acknowledged the problem soon after and began working on a fix, but it took the company roughly three hours to restore all services.

In a postmortem published late Thursday, OpenAI wrote that the outage wasn’t caused by a security incident or recent product launch, but by a telemetry service it deployed Wednesday to collect Kubernetes metrics. Kubernetes is an open source program that helps manage containers, or packages of apps and related files that are used to run software in isolated environments.

“Telemetry services have a very wide footprint, so this new service’s configuration unintentionally caused … resource-intensive Kubernetes API operations,” OpenAI wrote in the postmortem. “[Our] Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large [Kubernetes] clusters.”

That’s a lot of jargon, but basically, the new telemetry service affected OpenAI’s Kubernetes operations, including a resource that many of the company’s services rely on for DNS resolution. DNS resolution converts domain names to IP addresses; it’s the reason you’re able to type “google.com” instead of “142.250.191.78.”
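To make the DNS step concrete, here’s a minimal sketch of a name lookup using Python’s standard library. It resolves “localhost” rather than a public domain so it works without external network access; it is an illustration of resolution in general, not of OpenAI’s internal setup.

```python
import socket

# DNS-style resolution: turn a human-readable name into an IP address.
# "localhost" is used so the lookup succeeds without reaching the internet.
ip = socket.gethostbyname("localhost")
print(ip)  # typically 127.0.0.1
```

When this lookup path breaks, as it did inside OpenAI’s clusters, services can no longer find each other by name, even if the machines themselves are healthy.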

OpenAI’s use of DNS caching, which holds info about previously looked-up domain names (like website addresses) and their corresponding IP addresses, complicated matters by “delay[ing] visibility,” OpenAI wrote, and “allowing the rollout [of the telemetry service] to continue before the full scope of the problem was understood.”
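The “delayed visibility” effect can be sketched with a toy cache that serves stored answers until a time-to-live (TTL) lapses. The class, names, and TTL value below are hypothetical, purely to show why a failing resolver can look healthy for a while: cached answers keep working until they expire.

```python
# A minimal sketch of DNS-style caching with a TTL (time-to-live).
# All names and values here are made up; real resolvers are far more complex.
class DnsCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # name -> (ip, expiry timestamp)

    def put(self, name, ip, now):
        self._store[name] = (ip, now + self.ttl)

    def get(self, name, now):
        entry = self._store.get(name)
        if entry is None:
            return None
        ip, expiry = entry
        if now > expiry:
            del self._store[name]
            return None  # expired: a fresh (possibly failing) lookup is needed
        return ip

cache = DnsCache(ttl_seconds=30)
cache.put("api.internal", "10.0.0.5", now=0)

# Even if the resolver behind the cache has already failed,
# cached answers keep being served until the TTL lapses...
print(cache.get("api.internal", now=10))  # still served from cache
# ...and only after expiry does the breakage become visible.
print(cache.get("api.internal", now=40))  # expired, lookup now fails
```

That gap between “the resolver broke” and “callers start seeing errors” is what let the telemetry rollout continue before the full scope of the problem was understood.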

OpenAI says that it was able to detect the issue “a few minutes” before customers ultimately started seeing an impact, but that it wasn’t able to quickly implement a fix because it had to work around the overwhelmed Kubernetes servers.


“This was a confluence of multiple systems and processes failing simultaneously and interacting in unexpected ways,” the company wrote. “Our tests didn’t catch the impact the change was having on the Kubernetes control plane [and] remediation was very slow because of the locked-out effect.”

OpenAI says that it’ll adopt several measures to prevent similar incidents in the future, including improved phased rollouts with better monitoring for infrastructure changes, and new mechanisms to ensure OpenAI engineers can access the company’s Kubernetes API servers under any circumstances.

“We apologize for the impact that this incident caused to all of our customers – from ChatGPT users to developers to businesses who rely on OpenAI products,” OpenAI wrote. “We’ve fallen short of our own expectations.”

