Backups, TimeMachine, HFS and pain

Disk doesn’t mount

One recent Saturday evening, my external 4 TB SSD, which holds TimeMachine backups and some other data, failed to mount when I connected it. Strange, but it had happened once before, and an OS restart had helped then. This time it didn’t. While I was wondering whether the SSD had suddenly started dying (it happens with hardware) and what to do about it, about 10 minutes passed, and then a message popped up saying something like: there is a problem with the drive, but you can still copy your data. And the volume was mounted read-only!

I tried to repair the volume in Disk Utility, but the repair would fail after several minutes. It wasn’t clear to me whether this was really a disk failure or a filesystem error.

I could find almost nothing useful online about this situation. Answers to a few similar posts on Apple Developer Forums don’t give any new information, and a bunch of articles just paraphrase Apple’s official documentation. StackExchange questions are much better in comparison because they are more technical and the answers try to explore possible alternatives, such as:

- “Is there a faster way to copy time machine files from one disk to another?” is a good one because it has a few ideas, but ultimately no solution other than using Finder.
- “Migrate a Time Machine backup in terminal” has some information about using cp, rsync and tar, but no answer.
- “Copying Time Machine backup, destination takes more than original size, are hardlinks being expanded?” doesn’t have a solution.
- “Time Machine size explodes when copied to new drive” doesn’t have a solution either.

And a few disk failure questions:

I had an older external HDD with copies of a bunch of files from the current one. So I calculated the hashes of those files on both drives like this:

$ fd -0 --hidden -t f . /Volumes/source/data/{docs,podcasts,video} | xargs -0 -n 4000 -P8 md5sum >> ~/md5sums

and found that all of the hashes matched on the two drives. So far so good. (Calculating 41k hashes over 1.1 TB of files was about an order of magnitude faster on the SSD than on the HDD: approx. 20 minutes vs 5 hours.)
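A minimal sketch of the comparison step, assuming both drives keep the files under the same relative layout. The helper names hash_tree and compare_trees are hypothetical, and find is used here instead of fd so that nothing beyond md5sum is needed. Paths are hashed relative to each root so that the two listings can be compared directly.

```shell
# hash_tree ROOT: print "hash  ./relative/path" for every regular file under
# ROOT, sorted by path so that two listings are directly comparable.
hash_tree() {
    (
        cd "$1" || exit 1
        find . -type f -exec md5sum {} + | sort -k 2
    )
}

# compare_trees A B: succeed only if both trees contain the same relative
# paths with the same hashes.
compare_trees() {
    a=$(hash_tree "$1") || return 1
    b=$(hash_tree "$2") || return 1
    [ "$a" = "$b" ]
}
```

It could be called as compare_trees /Volumes/source/data/docs /Volumes/old/data/docs, where /Volumes/old is a made-up mount point for the older HDD. Unlike the original command, it hashes sequentially and expects both trees to contain exactly the same set of files.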

I got a new 2 TB SSD because it was the biggest I could find. How do I copy the backups now?

The official way

https://support.apple.com/en-us/HT202380 is Apple’s official article on transferring TimeMachine backups to another disk. They tell you to format the target disk with the GUID partition table and a journaled HFS+ partition, then use Finder to “drag” the entire Backups.backupdb directory to the new disk, “then wait for the copy to complete”. Obviously it’s for the case when the target disk is at least as big as the source one. What if it isn’t?

I could only estimate the size of the backups, from the free space and the total size of all the other files. It came to about 1.7 TB, so the entire backup history should fit on the new 2 TB disk. I started copying with Finder as recommended. It took hours “to prepare” the copy, and after a few hours of “preparing” the target SSD was noticeably warm.
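The estimate is simple arithmetic: the space used on the volume minus the space taken by everything that is not a backup. A rough sketch, assuming the non-backup data sits in a handful of known directories; estimate_backups_kb is a hypothetical helper, not something from the original setup.

```shell
# estimate_backups_kb VOLUME OTHER_DIR...: space used on the filesystem that
# contains VOLUME minus the space taken by the OTHER_DIRs, in 1 KiB units.
estimate_backups_kb() {
    volume=$1
    shift
    # df -P prints one data line per filesystem; column 3 is the used space.
    used_kb=$(df -Pk "$volume" | awk 'NR == 2 { print $3 }')
    # du -c appends a grand total as the last line of its output.
    other_kb=$(du -sck "$@" | awk 'END { print $1 }')
    echo $((used_kb - other_kb))
}
```

For example, estimate_backups_kb /Volumes/source /Volumes/source/data would print the remainder in KiB. The result is only an upper bound on the backups, since filesystem metadata and anything outside the listed directories is counted too.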

It copied a lot of files during the day. 24 hours later it was saying “Copying 0 objects to “target” 1,6 TB of 1,6 TB – Approximately 5 seconds left”; four hours later only the size had changed: “1,8 TB of 1,8 TB”, still “5 seconds left”. When I checked the progress in the morning, I saw an error complaining that the disk was full. Well, maybe the backups were slightly over 2 TB after all? But no: the target held only about a year of backups (2016–2017), so the copy was nowhere near finished.

I used ls -i on the same file in a few of the backups, and the inodes were indeed the same, so those were hard links rather than separate copies! I don’t know why it failed to copy all the data then. It seems that Finder is not able to copy TimeMachine backups, at least on the latest macOS 10.14.6.
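The inode check can be scripted so that it does not have to be done by eye. A small sketch built on ls -i; inode_of and same_inode are hypothetical helper names, and the comparison is only meaningful for paths on the same filesystem.

```shell
# inode_of PATH: print the inode number that ls -i reports for PATH.
inode_of() {
    ls -di "$1" 2>/dev/null | awk '{ print $1 }'
}

# same_inode A B: succeed if both paths exist and share one inode,
# i.e. they are hard links to the same file.
same_inode() {
    a=$(inode_of "$1")
    b=$(inode_of "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

Calling same_inode on the same relative path inside two backup snapshots tells whether the file is stored once or twice.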

Other options

Ideally I’d save all or at least most of my TimeMachine backups, so they have to be copied to a new disk, but how?

I could think of these ideas:

- Block-copying, as I described in my earlier post, would be preferable, but I couldn’t do it because the target disk was smaller. And I couldn’t shrink the volume because the filesystem was read-only and corrupt.
- Any other program (rsync, cp, etc.) won’t copy hard links to directories.
- Copying a subset of the backups? I started copying the latest three backups, and soon Finder estimated the total size to be at least 1 TB. That can’t be right: there should be about 600 GB plus some minor changes, so Finder doesn’t deal with hard links in that case.
  - When I just created a directory named Backups.backupdb in the root, I couldn’t copy anything to it; it seemed to be protected by Finder just based on the name. Thus I had to use another name.
  - The permissions of the backup directories are insane: sudo rm -rf doesn’t work, even after chflags nouchg. Finder can delete those directories after asking for the admin password, duh!
- Write a program that copies hard-linked directories? That would require a lot of investigation and time.
- Create one logical volume from a 3 TB HDD and the 2 TB SSD so that I could block-copy the old disk, but then what? The filesystem would be in exactly the same state, with the same options.
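For contrast, here is what the generic tools can do. A tar pipe keeps hard links between regular files, as the sketch below shows; copy_tree is a hypothetical helper name. It does nothing for the hard-linked directories that TimeMachine creates on HFS+, which is exactly the limitation noted in the list above.

```shell
# copy_tree SRC DST: copy the contents of SRC into the existing directory DST
# through a tar pipe. tar stores a file that has several links once and
# recreates the links when extracting.
copy_tree() {
    (cd "$1" && tar -cf - .) | (cd "$2" && tar -xpf -)
}
```

After copy_tree, two paths that shared an inode in the source also share one in the destination, which ls -i can confirm.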

Later I found an old blog post about a hack that disables the journal on HFS+ by editing the bytes on disk directly, and a script to copy TimeMachine backups in Linux, restoring the correct directory hierarchy. Turns out hard links on HFS+ are a very dirty hack.

Running `fsck` manually

Is it a dead end? Dropping to the terminal very often gives you many more options than the GUI.

diskutil (which is behind Disk Utility) actually runs fsck_hfs when it verifies and repairs an HFS+ volume:

$ diskutil repairVolume /Volumes/source
Started file system repair on disk3 source
Verifying storage system
Performing fsck_cs -n -x --lv --uuid AAAABBBB-9B32-46F1-99A0-8CF4CF3DFFFF
Checking volume
…
The volume AAAABBBB-9B32-46F1-99A0-8CF4CF3DFFFF appears to be OK
Storage system check exit code is 0
Repairing file system
Volume was successfully unmounted
Performing fsck_hfs -fy -x /dev/rdisk3
Checking Journaled HFS Plus volume
Checking extents overflow file
Checking catalog file
Invalid record count
The volume source could not be verified completely
File system check exit code is 8
Restoring the original state found as mounted
Problem -69842 occurred while restoring the original mount state
Error: -69845: File system verify or repair failed
Underlying error: 8

So I checked the man page and launched it manually with a few other switches, to rebuild the catalog btree and print more debug information:

$ sudo fsck_hfs -fyd -D 0x0033 -Er /dev/rdisk3
journal_replay(/dev/disk3) returned 0
** /dev/rdisk3
        Using cacheBlockSize=32K cacheTotalBlock=65536 cacheSize=2097152K.
   Executing fsck_hfs (version hfs-407.200.4).
** Checking Journaled HFS Plus volume.
   The volume name is source
** Checking extents overflow file.
** Checking catalog file.
** Rebuilding catalog B-tree.
hfs_swap_BTNode: invalid forward link (0x8C2DC787)
hfs_swap_BTNode: invalid backward link (0x7757FD52)
hfs_swap_BTNode: invalid node kind (-115)
hfs_swap_BTNode: invalid node height (242)
hfs_swap_BTNode: invalid record count (0xDCA8)
   Invalid record count
(4, 8)
hfs_UNswap_BTNode: invalid node height (1)
** Rechecking volume.
** Checking Journaled HFS Plus volume.
   The volume name is source
** Checking extents overflow file.
** Checking catalog file.
** Checking multi-linked files.
** Checking catalog hierarchy.
** Checking extended attributes file.
** Checking multi-linked directories.
        privdir_valence=424854, calc_dirlinks=1277225, calc_dirinode=424854
** Checking volume bitmap.
** Checking volume information.
   Invalid volume file count
   (It should be 21842267 instead of 21565665)
   Invalid volume directory count
   (It should be 3189524 instead of 3162696)
   Invalid volume free block count
   (It should be 98293245 instead of 104033015)
        invalid VHB nextCatalogID
   Volume header needs minor repair
(2, 0)
   Verify Status: VIStat = 0x8000, ABTStat = 0x0000 EBTStat = 0x0000
                  CBTStat = 0x0000 CatStat = 0x00000000
** Repairing volume.
** Rechecking volume.
** Checking Journaled HFS Plus volume.
   The volume name is source
** Checking extents overflow file.
** Checking catalog file.
** Checking multi-linked files.
** Checking catalog hierarchy.
** Checking extended attributes file.
** Checking multi-linked directories.
        privdir_valence=424854, calc_dirlinks=1277225, calc_dirinode=424854
** Checking volume bitmap.
** Checking volume information.
** Trimming unused blocks.
        Trimming: startBlock=         1, blockCount=       737
        Trimming: startBlock=       934, blockCount=        91
        Trimming: startBlock=      7400, blockCount=         3
        <skipped about 325k lines here>
        Trimming: startBlock= 907509761, blockCount=  69074889
        Trimmed 98293245 allocation blocks.
** The volume source was repaired successfully.
        CheckHFS returned 0, fsmodified = 1

What a miracle, it did fix the filesystem! I ran fsck again and it reported no errors. Then I calculated the same hashes as I’d done initially and all of them matched again, so no data loss there. And finally I ran fsck with a switch to scan every occupied block to look for I/O read errors:
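Re-checking can reuse a list saved earlier, because md5sum can verify files against its own output. A minimal sketch, assuming the list was written by md5sum and its paths are still valid (as with ~/md5sums above); verify_hashes is a hypothetical helper name.

```shell
# verify_hashes LIST: re-hash every file named in LIST (md5sum output format),
# print only the entries that do not verify, and fail if there are any.
verify_hashes() {
    md5sum -c "$1" | awk '!/: OK$/ { print; bad = 1 } END { exit bad }'
}
```

Running verify_hashes ~/md5sums prints nothing and returns success when every file still matches.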

$ noti time caffeinate sudo fsck_hfs -fydES /dev/rdisk3
Password:
journal_replay(/dev/disk3) returned 0
** /dev/rdisk3
        Using cacheBlockSize=32K cacheTotalBlock=65536 cacheSize=2097152K.
Scanning entire disk for bad blocks
   Executing fsck_hfs (version hfs-407.200.4).
** Checking Journaled HFS Plus volume.
   The volume name is source
** Checking extents overflow file.
** Checking catalog file.
** Checking extended attributes file.
** Checking volume bitmap.
** Checking volume information.
** Trimming unused blocks.
** The volume source appears to be OK.
        CheckHFS returned 0, fsmodified = 0
    10342.56 real       343.32 user       555.55 sys

A few command “combinators” are used here: noti displays a notification when the command finishes, time reports how long the command took to execute, and caffeinate keeps the Mac from going to sleep while the command is running.
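The same idea can be folded into one small wrapper. This is only a sketch: run_long is a hypothetical name, a plain echo stands in for noti (a third-party tool), and caffeinate is used only when it is installed, so the function also works outside macOS.

```shell
# run_long CMD [ARG...]: run a long command, keeping the machine awake when
# caffeinate is available, then report the exit status and the elapsed time.
run_long() {
    start=$(date +%s)
    if command -v caffeinate >/dev/null 2>&1; then
        caffeinate "$@"
    else
        "$@"
    fi
    # Status of the command, or of caffeinate when it wraps the command.
    status=$?
    elapsed=$(( $(date +%s) - start ))
    echo "finished: $* (exit $status, ${elapsed}s)"
    return "$status"
}
```

With it, the scan above would become run_long sudo fsck_hfs -fydES /dev/rdisk3.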

Overall I’m glad that it was “just” filesystem corruption and not an SSD failure, and that it recovered fine. I learned some gruesome details about HFS+ and confirmed that it’s often annoying to work with Apple’s proprietary data formats, because they provide only one way to do something and that’s not enough. Why can only Finder copy hard-linked directories, and only in some cases? And can it at all?

KISS 🇺🇦

Stop the war!
