I'm surprised to have read to the end and found that they're still not performing any hardware monitoring and alerting. SMART may not always show pre-failure warnings, but when it does, they can usually be trusted.
Hard Disk Sentinel is really good for this type of thing. The developer is great; some years ago, after I asked for some new features, he added code to better support my RAID adapter.
Wasn't it a conclusion of the Google hard drive reliability study that models based on SMART were not useful? I.e. drives with sector reallocations are much more likely to fail than those without, but their failure rate is still something like 15% per year, so what useful thing can you do with that signal?
Well, I don't see why you'd want to keep running a drive that is showing warning signs; that's just asking for trouble. But even if you don't replace drives based on this data, seeing SMART alerts at the same time your database starts suffering corruption is itself a useful signal.
Because taking drives out of service for SMART signals would cost a fortune and almost none of those drives were actually going to fail.
N=1, but I had a drive show catastrophic SMART failures once. I figured I'd take the opportunity to tinker with the exposed serial port on the drive's PCB and wiped the SMART values.
Funny thing was, I didn't actually observe any data loss. I stressed the drive for several days, no errors. It went back in my daily driver for the next 5 years with no failure. It's been 15 years since that happened and the drive still hasn't failed.
I don't trust SMART anymore.
> So how were we able to recover the database and the data inside it? Most of the data was probably still intact, only a few sectors were unreadable. Once those were either restored (rewritten with a strong signal) or remapped by the drive’s firmware, the filesystem and the database engine could read the file end-to-end again. SQL Server pages also have checksums, so if any page came back wrong rather than unreadable, we’d have known. We got lucky: the corruption was at the magnetic-signal level, not at the “platter is scratched” level.
This doesn't quite seem to follow. As described, neither of the "recovery" methods actually restore lost data. So why weren't any of the SQL pages left in a bad state?
As best as I can tell it was intermittent read failures on some sectors, not permanent failures.
So if you keep rereading that section of the disk, you eventually get all the data: save it somewhere, write a bunch of new patterns over it, then write the original data back and verify that it reads correctly many times.
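Conceptually, something like this (a rough sketch with a hypothetical device path and sector number; for real recovery you'd normally reach for ddrescue or similar instead):

```python
import os

SECTOR = 512  # logical sector size; many modern drives use 4096

def rescue_sector(path, lba, max_tries=100):
    """Reread a flaky sector until a read succeeds, then rewrite it in place.

    Rewriting lays down a fresh, strong magnetic signal; if the drive's
    firmware decides the sector is truly bad, the write triggers a remap.
    (On a real block device you'd want O_DIRECT and aligned buffers so the
    page cache can't mask read errors; omitted here for simplicity.)
    """
    fd = os.open(path, os.O_RDWR)
    try:
        data = None
        for _ in range(max_tries):
            try:
                os.lseek(fd, lba * SECTOR, os.SEEK_SET)
                data = os.read(fd, SECTOR)  # intermittent reads sometimes succeed
                break
            except OSError:
                continue  # I/O error: just try again
        if data is None:
            raise IOError(f"sector {lba} unreadable after {max_tries} tries")
        os.lseek(fd, lba * SECTOR, os.SEEK_SET)
        os.write(fd, data)  # write the recovered bytes back
        os.fsync(fd)
        return data
    finally:
        os.close(fd)
```

The verify-many-times step matters because a weak sector can read back correctly once and still fail on the next pass.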
I believe the article's analysis of RAID is wrong, though; most controllers will start resilvering, or simply fail a drive, once it experiences too many I/O errors.
So, you were not using a striped mirror ZFS for a prod database? What could go wrong, yep.
learned the hard way
Yet at the end it still has this:
> I did some research, and a RAID wouldn’t have saved it either, RAID protects against drive failure, not against silent page corruption that gets faithfully replicated to every mirror.
That being said, the article has some strong signal of AI writing in it. So it's possible the author isn't really learning well from the experience either. :(
ZFS and ECC do protect against silent page corruption that gets faithfully replicated to every mirror.
Yeah, that was my point. The author seems to have gotten things wrong in some fundamental way.
HDD failures don't normally have a software root cause. Treat HDD failures as a certainty. It's just a matter of time.
Confused as to the actual root cause. Don't all hard drives provide SMART diagnostics these days? Was it really bad sectors?
Yes, there were bad sectors in the SMART diagnostics.
Hi, I believe you are quite new to workstation/hardware administration. Lots to say here (not a native English speaker, so apologies for the basic style):
Disk errors logged in the System event log come from the I/O layer: the low-level class driver (msahci.sys) and filter drivers. See the Windows Storage Driver Architecture: https://learn.microsoft.com/en-us/windows-hardware/drivers/s...
A disk error of this type showing in the event log must immediately be treated as an actual disk issue: it is a low-level problem below the filesystem and the applications/services. It seems the .mdf/.ldf files of your SQL database landed on one or more bad sectors of the disk surface.
Your disk seems to be the only one in the system, so the first thing to do is check its SMART status, for example with CrystalDiskInfo (the most widely used and user-friendly free portable Windows tool).
It would very probably have shown a warning state for the internal disk, with (judging by the quantity of disk error entries in your log) a nonzero count for attribute C5 "Current Pending Sector Count", and probably some in attribute 05 "Reallocated Sector Count" and/or attribute C4 "Reallocation Event Count".
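If you prefer to check from a script rather than a GUI, something like this works (a sketch: the sample output is made up, but it follows smartmontools' usual `smartctl -A` table layout, where the raw value is the last column):

```python
# Parse `smartctl -A /dev/sda`-style output and flag the reallocation-related
# attributes mentioned above. Field positions follow the typical smartctl
# "-A" table, where RAW_VALUE is the last column of each attribute row.

WATCHED = {
    5:   "Reallocated_Sector_Ct",     # attribute 05
    196: "Reallocated_Event_Count",   # attribute C4
    197: "Current_Pending_Sector",    # attribute C5
}

def pending_and_reallocated(smartctl_output: str) -> dict:
    """Return {attribute_name: raw_value} for the watched attributes."""
    findings = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip banner and header lines
        attr_id = int(fields[0])
        if attr_id in WATCHED:
            findings[WATCHED[attr_id]] = int(fields[-1])
    return findings

# Abridged, made-up sample of a smartctl attribute table:
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   094   094   036    Pre-fail  Always       -       132
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       24
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

bad = pending_and_reallocated(sample)
if any(v > 0 for v in bad.values()):
    print("WARNING: drive shows reallocated/pending sectors:", bad)
```

Any nonzero value in those three attributes is worth treating as a replace-the-drive signal on a single-disk production machine.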
The second thing to do is to back up your data as fast as possible. In your case, an MS SQL database, trying to dump/back it up first was the right move. Sadly (data-recovery pro experience here), a weak surface or a failing head stack assembly on a traditional HDD from most vendors has more trouble reading a sector correctly than writing it.
If the dump/backup fails, the next choice would have been a sector-by-sector dump of the whole disk, either with online (in-OS) software capable of reading sectors from the boot disk (I haven't checked whether HDD Raw Copy Tool 2.6 supports that), or with an offline solution like Clonezilla, Acronis True Image, AOMEI Backupper, etc. But an offline solution means taking the computer and its service offline...
I didn't quite understand whether you had an actual backup of the data or an image of the whole disk. Considering this workstation's critical role, you should have both running: daily (or more frequent) data backups plus an up-to-date disk image, whatever the type of disk (HDD/SSD). And a spare, identical computer.
As for repairing HDD "weak sectors" (meaning current pending sectors), it is indeed possible, often with complete data recovery. If not, the sector will be left as-is, or may be remapped if overwritten with zeros (it then shifts from Current Pending Sector Count to Reallocated Sector Count).
Hard Disk Sentinel Pro has such features (Disk Repair, Quick Fix), and it works quite well. The results vary greatly from one type of failure to another, and from one disk maker to another.
Note that if SMART shows more than a dozen or so affected sectors, the head (amp/preamp) is probably failing, leaving magnetically weak sectors too difficult to read and/or write. In that case, the pending and reallocated counts increase with every repair/check pass the tools make; the drive is toast and must be replaced ASAP.
SSDs are a completely different case when it comes to repair.
An older standalone tool, SpinRite, specialized in exactly this usage (accurate recovery of data), but it was veeeeery slow.
On RAID's pertinence: fortunately, this is an expected failure mode, since most SATA disks suffer HSA failure before refusing to initialize at all. A RAID 1 mirror would therefore have protected you, as the defect would not be mirrored across the two disks.
A RAID controller (a true hardware controller like LSI/Avago or Microsemi, or even fake RAID like Intel RST/VROC) maintains data integrity across the array's disks. The defective disk will accumulate bad blocks (which get marked in the RAID volume's metadata), but the other disks are fine and the data can be read safely. If too many errors are reported on a disk (very few, in fact, on most controllers), it is flagged as failed and dropped from the array.
>Disk errors logged in the system event log are from the I/O layer, low-level class driver (msahci.sys) / filter drivers. See Windows Storage Driver Architecture : https://learn.microsoft.com/en-us/windows-hardware/drivers/s...
What filters in the event log would you apply to find such errors?
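E.g., would filtering for source `disk` and event IDs such as 7 (bad block), 51 (paging I/O error), and 153 (retried I/O) be the right approach? Something like this against a CSV export of the System log (the column names and event-ID list here are my assumptions, not from the parent comment):

```python
import csv, io

# Event IDs that typically indicate low-level disk trouble:
#   7   -> "The device ... has a bad block" (source: disk)
#   51  -> paging I/O error
#   153 -> "The IO operation ... was retried" (timeout/reset)
DISK_SOURCES = {"disk", "storahci", "stornvme"}
DISK_EVENT_IDS = {7, 51, 153}

def disk_errors(csv_text: str):
    """Filter an Event Viewer CSV export down to disk-error records."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [r for r in rows
            if r["Source"].lower() in DISK_SOURCES
            and int(r["EventID"]) in DISK_EVENT_IDS]

# Made-up sample export:
sample = """\
Level,Date,Source,EventID,Message
Error,2024-01-02,disk,7,"The device \\Device\\Harddisk0\\DR0 has a bad block."
Information,2024-01-02,Service Control Manager,7036,"Service entered running state."
Warning,2024-01-03,disk,153,"The IO operation at logical block address 0x1a2b was retried."
"""

for rec in disk_errors(sample):
    print(rec["Date"], rec["Source"], rec["EventID"], rec["Message"])
```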
If taking it offline is not a concern, I would try a low-level backup with ddrescue, booting from external media, as soon as possible.
Keeping the system running from a disk showing read issues risks losing more data, and one can always extract the SQL backup from the disk image later.
Thank you for all that, learned a lot!
I feel the pain OP.
Over the last decade, I've run hundreds of servers, if not thousands, and I entirely stopped using hard drives; now it's solely SSD/NVMe, where the failure rate in practice is incredibly lower. I've had my fair share of middle-of-the-night runs because websites were offline or whatever, only to end up in a hard-drive-diagnosis circus.
Imo, the peace of mind you get is worth the cost. It also lets you rethink development entirely: a typical example is that suddenly copying all your node_modules or Rust deps is a great idea when you have 10 Gbit/s bandwidth and fast drives (yes, I expect people to shit on me for saying this; please give me the counterarguments if you downvote me). Many things change when you have a higher base performance assumption, and storage is relatively cheap as well. I would never advise anyone who wants to run continuously in prod with low friction to get servers with HDDs.
I get that for some use cases it's not possible, but for the large majority of use cases, it's clearly not the HDD that is the cost burden: $50 servers get you TBs of SSD. Of course, don't go with a VPS or "Cloud" if you intend to change your development based on new performance assumptions. It blows my mind the number of people paying thousands of dollars just to handle what, 100K visitors a day? That fits on a $100 server and a bunch of Kimsufi boxes hosted across the world as a CDN.
People are overcomplicating infrastructure, big time (which leads to more problems, higher maintenance, security issues and so on).
> Over the last decade, I've run hundreds of servers, if not thousands, and I entirely stopped using hard drives; now it's solely SSD/NVMe, where the failure rate in practice is incredibly lower. I've had my fair share of middle-of-the-night runs because websites were offline or whatever, only to end up in a hard-drive-diagnosis circus.
My experience is that (most) spinners give off reliable pre-failure indicators (if you take the time to look, or script the looking), but SSDs fail by disappearing from the bus. SSDs do fail much less often, but they still fail from time to time, and recovery is harder.
Either way, if your data is important to you/your customers, you really need a backup/recovery plan.
I dunno about recent pricing, but not so long ago it felt like spinners had a pretty high price floor and SSDs didn't. If you don't need a lot of space, you could find a small SSD at around the same $/GB as a medium-sized SSD, whereas spinners have a floor in both dollars and capacity. So if you don't need much space, you save money with an SSD and get better performance for free; if you need a lot of space and not a lot of performance, big spinners are more attainable than big SSDs.
> My experience is that (most) spinners give off reliable pre-failure indicators (if you take the time to look, or script the looking), but SSDs fail by disappearing from the bus. SSDs do fail much less often, but they still fail from time to time, and recovery is harder.
I'm not a pro, just a smalltime dork with a homelab. I use cheap WD HDDs on my NAS system connected to an LSI hardware RAID controller. I'll boast that I have a 100% record so far of preventing downtime and data loss by simply listening for the controller's audible alarm and swapping drives right away (I keep brand new spares). I also have offline backups, but have so far never needed them. Not sure how this would change if I moved to SSDs.
Well, SAS disks tend to go into a failed state immediately or very quickly, most of the time without passing through a warning state first.
SATA disks are indeed generally more predictable failure-wise. Most issues are related to a failing head stack assembly; platter demagnetization is rare (some Toshiba laptop disks).
Other failures usually come down to a friggin' firmware issue shipped by Dell, HP, or Lenovo.
Agree with the diagnostic part.
> Either way, if your data is important to you/your customers, you really need a backup/recovery plan.
You'd be surprised how many devs/companies walk on eggshells all the time (praying the fatal moment never arrives) because they aren't "brave" enough to set up a proper backup system, which often takes only a few minutes or hours.
It is quite remarkable how quickly a modern SSD can scan over TBs of data, I'm less afraid of O(n) queries than I used to be.
> This disk was probably dying. I did some research, and a RAID wouldn’t have saved it either, RAID protects against drive failure, not against silent page corruption that gets faithfully replicated to every mirror.
I dispute that this was a 'silent' drive error, since multiple systems reported read errors. Silent data corruption on hard drives is extremely rare, thanks to the tons of checksums applied to all data. Maybe I'm wrong, but I bet there are read errors from this drive in the appropriate system logs.
I feel that people confuse regular 'bad blocks' with 'silent data corruption' and there is a huge difference[0].
[0]: https://louwrentius.com/what-home-nas-builders-should-unders...
Agreed that the error wasn't silent. It's also incorrect that RAID can't catch silent errors; it depends on the implementation. On Linux, lvmraid has an option to enforce integrity, and there's also ZFS, which, on top of everything else, provides RAID functionality with integrity enforcement.