I think people underestimate how valuable these reports are, so I’m very glad that detailed investigation is done here. Every major grid operator around the world is going to study this and make improvements to make sure this doesn’t happen on their grid.
In a lot of ways it’s like investigations into airplane crashes.
Obligatory read: https://how.complexsystems.fail/
As someone who lived through the blackout, it was wild. It felt like being thrown back into the pre-internet, pre-smartphone era. It was pretty cool actually. The rumor mill spread so fast that within hours the official word on the street was that we were getting hacked by a foreign military, and people were joking that we had nothing of interest to be conquered xD
Might have been less fun if it had been in the depths of winter. The fact that it was a balmy sunny day in springtime made it a pleasantly novel experience, I agree. Of course, the "sunny day" seems to have been correlated.
We're talking about Spain. How bad could a winter really be?
and then people accuse social media of making people paranoid...
you are able to be paranoid on your own just fine
I didn’t even know about it until the next day - totally off grid, and starlink for internet access - and no mobile signal where we live to give it away either.
The hack thing spread wildly, indeed. Weird experience.
A few months prior, in Germany, the CCC published a method for destabilizing energy grids using radio waves and cheap hardware: https://media.ccc.de/v/38c3-blinkencity-radio-controlling-st... It presented an attack vector to which most infrastructure in Europe is exposed.
About 4 hours before the grid collapse on the 28th of April 2025, the largest purchase of Monero in the past 3 years was recorded (reminder: Monero is the coin of choice for special operations), making it surge +40% in 24 hours. The initial Spanish reports mentioned conflicting power information from dozens of locations at the same time, which is consistent with a sequential attack using the blinkencity method to force the grid itself to shut down.
The fact that there is not a single root cause but several ones makes me instinctively think this is a good report, because it's not what the "bosses" (and even less politicians) like to hear.
Yes, a lot of modern engineering is good enough that single-cause failures are very rare indeed. That means that failures themselves are rare, but when they do happen, they're most likely to have multiple causes.
How to explain that to non-engineers is another problem.
Frequently, when you see these massive failures, the root cause is an alignment of small weaknesses that all come together on a specific day. See, for instance, the space shuttle O-ring incident, Three-Mile Island, Fukushima, etc. These are complex systems with lots of moving parts and lots of (sometimes independent) people managing them. In a sense, the complexity is the common root cause.
It's like the Swiss Cheese model where every system has "holes" or vulnerabilities, several layers, and a major incident only occurs when a hole aligns through all the layers.
https://en.wikipedia.org/wiki/Swiss_cheese_model
I use this model all the time. It's very helpful for explaining the multifactorial genesis of catastrophes to ordinary people.
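To put rough numbers on the Swiss cheese picture, here is a minimal Python sketch. It assumes the layers fail independently, which real systems often violate, and the per-layer probabilities are made up for illustration.

```python
import math

def breach_probability(layer_hole_probs):
    """Chance that a hazard passes through every layer.

    Assumes each layer fails independently of the others; correlated
    weaknesses (shared staff, shared software) make the real figure higher.
    """
    return math.prod(layer_hole_probs)

# Hypothetical layers: each one alone fails fairly often...
layers = [0.1, 0.05, 0.02]
p_all = breach_probability(layers)

# ...but all three lining up is rare. Removing one layer makes the
# combined failure far more likely.
p_without_last = breach_probability(layers[:-1])
```

With these made-up figures, dropping the third layer makes a breach 50 times more likely, which is why a "recent change" that quietly disables one safeguard can expose weaknesses that were always there.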
Also perhaps worth a read:
https://devblogs.microsoft.com/oldnewthing/20080416-00/?p=22...
"You’ve all experienced the Fundamental Failure-Mode Theorem: You’re investigating a problem and along the way you find some function that never worked. A cache has a bug that results in cache misses when there should be hits. A request for an object that should be there somehow always fails. And yet the system still worked in spite of these errors. Eventually you trace the problem to a recent change that exposed all of the other bugs. Those bugs were always there, but the system kept on working because there was enough redundancy that one component was able to compensate for the failure of another component. Sometimes this chain of errors and compensation continues for several cycles, until finally the last protective layer fails and the underlying errors are exposed."
I've had that multiple times. As well as the closely related 'that can't possibly have ever worked' and sure enough it never did. Forensics in old codebases with modern tools is always fun.
> See, for instance, the space shuttle O-ring incident
That wasn't really a result of an alignment of small weaknesses though. One of the reasons that whole thing was of particular interest was Feynman's withering appendix to the report where he pointed out that the management team wasn't listening to the engineering assessments of the safety of the venture and were making judgement calls like claiming that a component that had failed in testing was safe.
If a situation is being managed by people who can't assess technical risk, the failures aren't the result of many small weaknesses aligning. It wasn't an alignment of small failures as much as that a component that was well understood to be a likely point of failure had probably failed. Driven by poor management.
> Fukushima
This one too. Wasn't the reactor hit by a wave that was outside design tolerance? My memory was that they were hit by an earthquake that was outside design spec, then a tsunami that was outside design spec. That isn't a number of small weaknesses coming together. If you hit something with forces outside design spec then it might break. Not much of a mystery there. From a similar perspective if you design something for a 1:500 year storm then 1/500th of them might easily fail every year to storms. No small alignment of circumstances needed.
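The return-period arithmetic above can be sketched in a few lines of Python. This assumes each year is an independent trial with a fixed 1/T chance of exceeding the design event, which is a simplification; the 40-year lifetime and the fleet of 1000 sites are hypothetical numbers chosen only for illustration.

```python
def prob_at_least_one_exceedance(return_period_years, lifetime_years):
    """Chance the design event is exceeded at least once over the lifetime.

    Assumes independent years with a constant annual probability 1/T.
    """
    annual_p = 1.0 / return_period_years
    return 1.0 - (1.0 - annual_p) ** lifetime_years

def expected_exceedances_per_year(n_sites, return_period_years):
    """Average number of sites hit beyond design spec in a given year."""
    return n_sites / return_period_years

# One structure designed for a 1:500 year storm, operated for 40 years
# (hypothetical lifetime): roughly a 7.7% chance of seeing such a storm.
p_lifetime = prob_at_least_one_exceedance(500, 40)

# 1000 such structures (hypothetical fleet): about 2 per year are
# expected to face a storm beyond their design spec.
per_year = expected_exceedances_per_year(1000, 500)
```

So no alignment of circumstances is required: across a large enough fleet, out-of-spec events are a routine expectation rather than a surprise.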
It usually starts with a broken coffee machine.
Yep, sounds like "This was bound to happen at some point"
Which on some level is exactly "what the bosses and politicians want to hear"
When it's everybody's fault it's nobody's fault.
In some ways, yes, but it's what reality is. There was probably some last factor kicking in that triggered the cascade, but there were probably many non-happy-paths not properly covered by working backup/fallback strategies. So a report could totally still say "it's X's fault", pointing the finger there. Government would blame the owner of X, some public statement about fixing X would be made and then the ones working in the field should internally push to improve/fix their own (reduced) scope.
I don't know what will come of this report in the next months/years, I will keep an eye on it though, since I live in Spain :)
Exactly.
There are ways to aggregate these into a single resilience score for policy makers with only moderate loss of detail but it's unpopular.
They need more battery storage for grid health, both colocated at solar PV generators (to buffer voltage and frequency anomalies) and spread throughout the grid. This replaces inertia and other grid services provided by spinning thermal generators. There was no market mechanism to encourage the deployment of this technology in concert with Spain’s rapid deployment of solar and wind.
There are non-battery buffers available too--I recently got rooftop residential solar installed, and learned that my area is covered by a grid profile requiring that the solar system stay online through something like 60 +/- 2Hz before shutting down completely, and ramping down production linearly beyond a 1Hz deviation or so. The point is to avoid cascading shutdowns by riding through over/undersupply situations, whereas an older standard for my area would have all the solar systems cut off the moment frequency exceeded 60.5Hz (which would indicate oversupply from power plant generators spinning faster via lower resistance).
In my system's case, switching to this grid profile was just a software toggle.
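A minimal sketch of what such a profile might look like, in Python. The thresholds come from the rough numbers above ("60 +/- 2Hz", "1Hz deviation or so", "60.5Hz") and are not taken from any real grid code; the function names are hypothetical. The sketch also assumes the linear ramp applies only to over-frequency, since a solar inverter already at full output cannot ramp up when frequency sags.

```python
def ride_through_output(freq_hz, nominal_hz=60.0, ramp_start_hz=1.0, trip_hz=2.0):
    """Fraction of available solar power exported at a given grid frequency.

    Illustrative thresholds only: full output near nominal, linear ramp-down
    between +1 Hz and +2 Hz, disconnect at or beyond +/- 2 Hz.
    """
    deviation = freq_hz - nominal_hz
    if abs(deviation) >= trip_hz:
        return 0.0  # outside the ride-through window: disconnect
    if deviation <= ramp_start_hz:
        return 1.0  # near nominal, or under-frequency inside the window
    # Over-frequency between ramp start and trip: reduce output linearly.
    return (trip_hz - deviation) / (trip_hz - ramp_start_hz)

def legacy_output(freq_hz, trip_above_hz=60.5):
    """Older behaviour as described: hard cut-off once frequency exceeds 60.5 Hz."""
    return 0.0 if freq_hz > trip_above_hz else 1.0

# At 60.6 Hz the legacy profile drops everything at once, while the
# ride-through profile keeps exporting at full output.
legacy_at_60_6 = legacy_output(60.6)
modern_at_60_6 = ride_through_output(60.6)
```

The design point is that many small systems tripping at the same threshold turns an oversupply event into a sudden undersupply; a gradual ramp spreads the response out instead.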
This is grid following, which is very effective for small scale generation. It does not work for large scale generation though, when the grid is relying on utility scale renewable generation for that voltage and frequency. When those large generators exceed their ride-through tolerance, batteries step in to hold voltage and frequency up until the transient event ends or the dispatchable generators that were called upon spin up. Thermal generators can take minutes to provide this support; batteries respond within 250-500ms.
Tesla’s Megapack system at the Hornsdale Power Reserve in Australia was the first example of this being proven out at scale in prod.
If someone wants a "quick and dirty" answer, there's a presentation linked: https://eepublicdownloads.blob.core.windows.net/public-cdn-c...
Page 11 contains the "Full root cause tree": one image with all the high level info.
I was supposed to fly home from Santiago de Compostela when the blackout happened. Me and my girlfriend had checked out of our hotel and headed to the bus stop to take the bus to the airport. The blackout had already started but we didn't realise (in hindsight, I do remember the pedestrian crossing not working, but I didn't think much of it). Anyways our flight was cancelled and it was clear we needed somewhere to stay the night.
I immediately rebooked the same hotel, but when we got back there the receptionist had left so you had to check in over the phone instead. Except WhatsApp wasn't working. Then mobile data went down. And before long we were walking through the old town going hostel to hostel looking for a place to sleep, as everything got darker and darker (due to the lack of powered street lighting). The old town in almost pitch black was pretty scary!
We ended up breaking back into the hotel, borrowing a bunch of towels from a laundry cart in the hallway and sleeping in this lockable room we found in the basement.
Besides that somewhat stressful part, it was a really strange but fun experience to see the city without power: no traffic lights, darkened shops with lots of phone lights, cafés still operating just with only outdoor seating and limited menus, the occasional loud generator, and most of all the people seemingly having a great time in spite of it.
I would've loved to have stayed out all night exploring the city, but finding somewhere to sleep that night was a bit more pressing!
472 pages. That's going to be a nice bit of reading this weekend. It is very nice to see such a comprehensive report as well as the fact that it was made public immediately.
Can’t read all of this since it’s 424 pages but I want to point out that Australia is beating Europe on grid connected storage. Not on a per capita basis. It’s beating all of Europe combined outright https://www.visualcapitalist.com/top-20-countries-by-battery...
We did have many many problems previously. The state of South Australia went out for a couple of weeks at one point in similar cascading failures. This doesn’t happen anymore. In fact the price of electricity is falling and the grid is more stable now https://www.theguardian.com/australia-news/2026/mar/19/power...
This price drop is in line with the lowered usage of gas turbine peaker plants (isn’t that helpful right now? No need for blockaded gas for electricity).
A lot of people say it can’t be done. That you can’t have free power during the day (power is free on certain plans during daylight due to solar power inputs dropping wholesale prices to negative) and that you can’t build enough storage (still not there but the dent in gas turbine usage is clear).
It’s one of these cases where you’ve been lied to. Australia elected a government that listened to reports that battery+solar is great for grid reliability and that nuclear was always going to be more expensive.
You need grid connected storage where you have (unpredictable) renewables. That doesn't negate the benefits of Nuclear baseload power. In an ideal mix, you need both, and also Gas for emergencies. One is not better than the other, they have different roles in a balanced grid.