Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random boot failues due to partition confusion because disko now relies on by-partlabel #551

Open
oddlama opened this issue Mar 2, 2024 · 15 comments
Labels
enhancement New feature or request

Comments

@oddlama
Copy link
Contributor

oddlama commented Mar 2, 2024

I just updated disko and switched to type = "gpt"; which a warning informed me to do. Now my device now randomly fails to boot. This is the case whenever I have another USB stick plugged into my system that coincidentally shares some partition labels (like disk-root-root) with the partition labels of my actual system.

The new gpt.nix implementation seems to rely solely on /dev/disk/by-partlabel instead of /dev/disk/by-id which was used previously. As a consequence of this change, my system now sometimes confuses a USB stick partition and the system confusion which share the same name, which seemingly dependent on udev device initialization order. So sometimes my system uses the wrong one and it doesn't boot.

Is there any way to switch back to /dev/disk/by-id ? I'd prefer to have stable and unique ids to refer to my partitions that don't cause random clashes. If there is absolutely no way to use by-id now because we don't know the partition order anymore, would it be possible to require manually setting a unique id on the parent gpt layout that is prepended to all labels? This mostly mitigate the issue, although one could still easily craft a USB stick to disrupt a system.

@Lassulus
Copy link
Collaborator

Lassulus commented Mar 3, 2024

hmm, you would need to override the device of each partition. you can set it to /dev/disk/by-id/$my_id-part$x. Another alternative would be to override the disk name. so something like:

disko.devices.disk.root.name = builtins.substring 0 10 (builtins.hashString "sha256" config.disko.devices.disk.root.device);

that should ensure a unique id according to the device? not sure if that's good enough for you. otherwise overriding each partition is exactly what you want, although a bit tedious if you have a lot of them

@oddlama
Copy link
Contributor Author

oddlama commented Mar 3, 2024

you can set it to /dev/disk/by-id/$my_id-part$x

That's what I did yesterday to work around the issue for the time being, but I was hoping there would be a better approach. The current partition numbers obviously only correspond to the order with legacy tables from before and if I were to resetup now, then the partitions would be in a different order because of alphabetical sorting and it would break again. Adding a partition would also break it if it isn't ordered last. Previously we could choose where to place the partitions so this would not happen.

I then tried to use the canonical device name via name = "nvme-Samsung_SSD_980_PRO_2TB_SXXENFXXWXXXXX" but noticed that this causes the partition labels to become very long because of the disk-<diskname>-<partname> format. And unfortunately those long labels seem to be unsupported somewhere (in the kernel ?).

disko.devices.disk.root.name = builtins.substring 0 10 (builtins.hashString "sha256" config.disko.devices.disk.root.device);

that should ensure a unique id according to the device?

This only solves part of the problem. I'd like to use the unique device identifier for the name anyway, but that results in huge partlabels. Hashing it here would make it shorter, but if you also have a longer name for the partition it will still exceed the limit. So that would need to be hashed too. It's a bit awkward having to set a hashed name for each device and partition from now on.

Please excuse my ignorance, but can you explain to me why the legacy tables are deprecated now? I searched the git history but failed to find any explanation or rationale for why they are deprecated.

From my (limited) point of view we previously had stable and predictable disk names, which were humanly readable at runtime, no issues with long names, no possibility for duplicates and thus confusion and we also had predictable partition order. Anyone who would intentionally want to create a name clash with a USB stick would have had to emulate a whole disk with one of my disk ids, now it just requires to trivially set a partition label to make my system boot something else.

What am I missing that makes the new approach preferable?

@Lassulus
Copy link
Collaborator

Lassulus commented Mar 4, 2024

you can set it to /dev/disk/by-id/$my_id-part$x

That's what I did yesterday to work around the issue for the time being, but I was hoping there would be a better approach. The current partition numbers obviously only correspond to the order with legacy tables from before and if I were to resetup now, then the partitions would be in a different order because of alphabetical sorting and it would break again. Adding a partition would also break it if it isn't ordered last. Previously we could choose where to place the partitions so this would not happen.

you can add a priority to each partition to ensure a stable partition order. just think of them as an index in that case.

I then tried to use the canonical device name via name = "nvme-Samsung_SSD_980_PRO_2TB_SXXENFXXWXXXXX" but noticed that this causes the partition labels to become very long because of the disk-<diskname>-<partname> format. And unfortunately those long labels seem to be unsupported somewhere (in the kernel ?).

This only solves part of the problem. I'd like to use the unique device identifier for the name anyway, but that results in huge partlabels. Hashing it here would make it shorter, but if you also have a longer name for the partition it will still exceed the limit. So that would need to be hashed too. It's a bit awkward having to set a hashed name for each device and partition from now on.

yeah I haven't figured out the limit yet, maybe its 64 characters? we could automatically hash them if they exceed the limit so you wouldn't have to override them on each device.

Please excuse my ignorance, but can you explain to me why the legacy tables are deprecated now? I searched the git history but failed to find any explanation or rationale for why they are deprecated.

From my (limited) point of view we previously had stable and predictable disk names, which were humanly readable at runtime, no issues with long names, no possibility for duplicates and thus confusion and we also had predictable partition order. Anyone who would intentionally want to create a name clash with a USB stick would have had to emulate a whole disk with one of my disk ids, now it just requires to trivially set a partition label to make my system boot something else.

What am I missing that makes the new approach preferable?

the old legacy tables used lists, which are hard to override or reference in nixos modules. also the device numbering was cumbersome and not portable (meaning, that if you changed your device from something like /dev/sda to /dev/disk/by-id/... the whole logic for device numbering changed) and this made testing the code harder. With the old tables the vm tests or image generation doesn't work because we can't really override all the device names in every partition. With the new attrset based tables we now have the possibility to set the device to an environment variable and have portable install scripts that way. the old tables also broke easily if people set it to devices like /dev/sda and /dev/sdb and they sometimes flipped.

Also the old device nubmering code https://github.com/nix-community/disko/blob/master/lib/default.nix#L76 was brittle and incomplete.

@Lassulus
Copy link
Collaborator

Lassulus commented Mar 4, 2024

Also another feature this enables is the non destructive updates I'm working on right now. Since you can add partitions later and they won't mess up the old partitions for mounting.

@oddlama
Copy link
Contributor Author

oddlama commented Mar 5, 2024

yeah I haven't figured out the limit yet, maybe its 64 characters? we could automatically hash them if they exceed the limit so you wouldn't have to override them on each device.

Yeah that'd definitely be a lot easier.

the old legacy tables used lists, which are hard to override or reference in nixos modules. also the device numbering was cumbersome and not portable (meaning, that if you changed your device from something like /dev/sda to /dev/disk/by-id/... the whole logic for device numbering changed) and this made testing the code harder. With the old tables the vm tests or image generation doesn't work because we can't really override all the device names in every partition. With the new attrset based tables we now have the possibility to set the device to an environment variable and have portable install scripts that way. the old tables also broke easily if people set it to devices like /dev/sda and /dev/sdb and they sometimes flipped.

Also the old device nubmering code https://github.com/nix-community/disko/blob/master/lib/default.nix#L76 was brittle and incomplete.

I see, thanks for the explanation. I'll test the partlabel hashing approach.

@jnsgruk
Copy link
Contributor

jnsgruk commented Mar 5, 2024

FWIW I ran into this too, the approach I've taken is to add a config.fileSystems block back into the machine configs, and just override the device attributes. It's a temporary workaround which I'll remove if/when the machines are reinstalled. You can see the commit here:

jnsgruk/nixos-config@186ba38

@SrEstegosaurio
Copy link

Well, I'm not getting random boot failures since in my case it's not random at all.
It fails a 100% of the times because it's looking for the wrong disk.

Errors before unlocking because it wants disk-vbd-cryptvol0 instead of cryptvol0 which is the name that appears when running ls /dev/disk/by-partlabel.

I have no clue on what to do.

@Lassulus
Copy link
Collaborator

Probably open a new issue with more insight into your config would help. Since disko is not run automatically to you probably have to either xhange the label in your config or change the partition labels with gdisk/parted or something like that

@cowboyai
Copy link

cowboyai commented May 2, 2024

it's not random. it won't load at all, ever, and we can't see into what it is telling initrd
image
drives are there...
no errors...
just won't boot...
image

@cowboyai
Copy link

cowboyai commented May 2, 2024

image

@cowboyai
Copy link

cowboyai commented May 2, 2024

complete flake configuration:
https://github.com/cowboyai/cim-start/tree/main/vhosts/x86

@cowboyai
Copy link

cowboyai commented May 3, 2024

RESOLVED...
in my case it appears to have been a missing kernel module... I think "vmd" or "sr_mod"...
with these two added, the boot sector is written correctly. That was the only change and the deploy worked.
Can this be discovered? boot changes required x module? or is it too system specific?

@nick-korsakov
Copy link

Hello @Lassulus!

My machine has several disks inside. The first one is for NixOS, the second is for testing NixOS (I use VMs to test it too), the third is for Windows and others are for RAID. I use disko-install to install NixOS on the first or second disk.

I wasn't aware of the new implementation relying on /dev/disk/by-partlabel instead of /dev/disk/by-id. disko-install surprisingly overwrote the Windows disk with important data. Fortunately, I was able to recover the data, but I got some new gray hairs. :)

Now I'm using a workaround with a device ID hash for the name. Thank you for the hint!

I think using /dev/disk/by-id for a device should work as expected without additional knowledge of internal implementation and hacks. Can I hope this issue will be resolved in the future?

@Lassulus
Copy link
Collaborator

Lassulus commented Aug 5, 2024

I'm confused in how that change could result in overriding your windows partition? does the windows partition share label names with the nixos ones? it seems like a different issue here that maybe merits it's own new issue :)

@iFreilicht
Copy link
Contributor

@nick-korsakov Could you run lsblk -o NAME,LABEL,PARTLABEL,MOUNTPOINT and share the results? The PARTLABEL for window partitions is usually something like "Basic data partition" or "Microsoft reserved partition", so I'm unsure how disko could've confused these with your new partitions.

In summary of this thread here, the workaround is definitely to use device = /dev/disk/by-id/...`.

Looking at the old code I am definitely in favor of the change to /dev/disk/by-partlabel, but we could consider a more dynamic approach if #789 gets off the ground, which would allow for individual confirmation per drive, making destructive errors less likely.

@iFreilicht iFreilicht added the enhancement New feature or request label Oct 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

7 participants