Just passing by

  • 2 Posts
  • 6 Comments
Joined 1 year ago
Cake day: June 9, 2023

  • Hi,

    I’ve done some research on this myself and the answer is the USB controller, specifically the way the USB controller “shares” bandwidth. That is not how a SATA controller or a PCIe lane handles it. ZFS expects direct control of the disk to operate correctly, and anything that gets between the file system and the disk is a problem.

    Thanks for sharing. I agree with you 100% and I think everybody commenting here does. The whole point of the thread, however, was to understand if/how you can identify the location of the problem without guessing. The reality is I came to the conclusion that people… don’t. Like you said, people know ZFS is fussy about how it talks to the disks, and at the smallest issue it throws a tantrum. So people just switch things around until they work (or buy expensive motherboards with many ports). I don’t like the idea of not knowing “why”, so I will just add to my notes that for my specific use case I cannot trust ZFS + OS (TrueNAS Scale) to use the USB disk for backups via ZFS send/receive.
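
    For context, the workflow I was testing was roughly this (dataset and pool names are placeholders here, not my exact setup):

    # take a snapshot of the source dataset
    zfs snapshot tank/data@nightly
    # send it to the pool that lives on the USB disk
    zfs send tank/data@nightly | zfs receive -F tank-02/data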

    If you want a stable system, give ZFS direct access to your disks, and accept that ZFS operations will degrade over time if you do not.

    I would like to add that I am not trying to mirror my main disk with a USB one. I just wanted to copy the ZFS snapshots to the USB drive once a day at midnight. ZFS is just (don’t throw stones at me for this, it is just my opinion) too brittle to be used even this way. I mean, when I try to clear/recover the pool it just refuses (and nothing is writing to it); the kind of attempts I mean are sketched below.
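
    These are roughly the recovery attempts I am talking about (the pool name is from my logs; the rest is generic and may need adjusting):

    # clear error counters and ask ZFS to resume I/O on the pool
    zpool clear tank-02
    # see what state the pool thinks it is in
    zpool status -v tank-02
    # last resort: export and re-import, rewinding to an earlier txg
    zpool export tank-02
    zpool import -F tank-02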

    A better but still bad solution would be something like a USB-to-SATA enclosure. In this situation, if you installed a couple of disks in a mirror in the enclosure… they would be using a single USB port, and the controller would at least keep the data on one lane instead of constantly switching.

    In my case there was no switching, however. It was a single NVMe drive on a single USB link in an enclosure, set up as a separate stripe just to receive data once a day.

    Regardless, if you want to dive deeper you will need to do some reading on USB controllers and bandwidth sharing.

    Not without good logs or debugging tools.

    I decided I cannot trust it, so unfortunately I will take the USB enclosure with the NVMe, format it with ext4 and use Kopia to back up the datasets there once a day. It is not what I wanted, but it is the best I can get for now. The rough plan is sketched below.
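
    Roughly what that plan looks like, with placeholder device names and paths (not my exact ones):

    # one-time setup: ext4 on the USB NVMe and a Kopia repository on it
    mkfs.ext4 /dev/sdX1
    mount /dev/sdX1 /mnt/usb-backup
    kopia repository create filesystem --path /mnt/usb-backup/kopia-repo
    # daily job: back up the dataset mountpoint into the Kopia repository
    kopia snapshot create /mnt/tank/data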

    About better solutions for my play-NAS in general: I am constrained by the ports I have. I (again, a personal choice - I understand people disagree with this) don’t want to go SATA. Unfortunately, since I could not find any PCIe switch built around the ASM2812 (https://www.asmedia.com.tw/product/866yq74SPBqRdtgC/7c5YQ79xz8urEGr1), I am unable to get more out of my M.2 NVMe PCIe 3.0 x4 slot (the speed loss is not an issue for me; my main bottleneck is the network). It is interesting how many more attempts at this you can find in the Pi ecosystem, but not for mini PCs.


  • Thank you! A new path to check :) I didn’t find this in my searches until now, so I added it to my documentation.


    Unfortunately it doesn’t tell me much, but I am really happy there is some new info here. I can see some FAILED steps, but maybe they are just connected to the fact that it is a striped volume?

    1717612906   spa.c:6623:spa_import(): spa_import: importing tank-02
    1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): LOADING
    1717612906   vdev.c:161:vdev_dbgmsg(): disk vdev '/dev/disk/by-partuuid/xxx-xxx-xxx-xxx-xxxx': best uberblock found for spa tank-02. txg 6462
    1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config untrusted): using uberblock with txg=6462
    1717612906   spa.c:8925:spa_async_request(): spa=tank-02 async request task=4
    1717612906   spa_misc.c:404:spa_load_failed(): spa_load(tank-02, config trusted): FAILED: cannot open vdev tree after invalidating some vdevs
    1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): UNLOADING
    1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): spa_load_retry: rewind, max txg: 6461
    1717612906   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): LOADING
    1717612907   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config untrusted): vdev tree has 1 missing top-level vdevs.
    1717612907   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config untrusted): current settings allow for maximum 0 missing top-level vdevs at this stage.
    1717612907   spa_misc.c:404:spa_load_failed(): spa_load(tank-02, config untrusted): FAILED: unable to open vdev tree [error=2]
    1717612907   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config untrusted): UNLOADING
    

    It goes on, and after a while:

    1717614235   spa_misc.c:2311:spa_import_progress_set_notes_impl(): 'tank-02' Finished importing
    1717614235   spa.c:8925:spa_async_request(): spa=tank-02 async request task=2048
    1717614235   spa_misc.c:418:spa_load_note(): spa_load(tank-02, config trusted): LOADED
    1717614235   metaslab.c:2445:metaslab_load_impl(): metaslab_load: txg 6464, spa tank-02, vdev_id 0, ms_id 95, smp_length 0, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 1362018 ms, loading_time 0 ms, ms_max_size 8589934592, max size error 8589934592, old_weight 840000000000001, new_weight 840000000000001
    

    But otherwise I see no other issue. The places I know of to query are the ones below; any other paths/logs/ways I can query the system?
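
    For reference, this is what I have been checking so far (generic OpenZFS on Linux; I am assuming the same paths exist on TrueNAS Scale):

    # ZFS internal debug ring buffer (the kind of lines quoted above)
    cat /proc/spl/kstat/zfs/dbgmsg
    # recent pool events (I/O errors, device removals, state changes)
    zpool events -v tank-02
    # kernel messages about the USB/NVMe device itself
    dmesg | grep -i -E 'usb|nvme'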


  • Thanks. I am ok with accepting the fact USB storage with ZFS is unreliable. I am ok with not using it in real case scenarios. My point stands however in understanding what broke so I know what to look for and, should I be crazy enough to try something similar again in some use-cases, know what to alert on. Call me curious. Everybody tells me it breaks, nobody tells me “look, it breaks here, and this is how you can see it”. I will try for another day or two and then will write it down on my notes as “unusable due to bad logging/debugging options”, not just because “it is USB” if that makes sense.