
Forums before death by AOL, social media and spammers... "We can't have nice things"

   linux.debian.kernel      Debian kernel discussions      2,884 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 1,795 of 2,884   
   Salvatore Bonaccorso to Yu Kuai   
   Bug#1121006: raid10 and component device   
   30 Nov 25 10:50:01   
   
   XPost: linux.debian.bugs.dist   
   From: carnil@debian.org   
      
   Hi Yu,   
      
   [apologies for the maybe overlong list of recipients, reason see below]   
      
   On Sun, Nov 23, 2025 at 10:54:55AM +0800, Yu Kuai wrote:   
   > Hi,   
   >   
   > 在 2025/11/21 19:01, Filippo Giunchedi 写道:   
   > > Hello linux-raid,   
   > > I'm seeking assistance with the following bug: recent versions of mpt3sas   
   > > started announcing the drives' optimal_io_size as 0xFFF000, and when such   
   > > drives are part of an mdraid raid10 the array's optimal_io_size is   
   > > derived from that 0xFFF000.   
   > >   
   > > When an LVM PV is created on the array, its metadata area is by default   
   > > aligned with the array's optimal_io_size, resulting in an abnormally   
   > > large size of ~4GB. During GRUB's LVM detection an allocation is made   
   > > based on the metadata area size, which results in an unbootable system.   
   > > This problem shows up only for newly-created PVs, so systems with   
   > > existing PVs are not affected in my testing.   
   > >   
   > > I was able to reproduce the problem on qemu using scsi-hd devices as   
   > > shown below and on https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1121006.   
   > > The bug is present both on Debian's stable kernel and on Linux 6.18,   
   > > though I haven't yet determined when the change was introduced in mpt3sas.   
   > >   
   > > I'm wondering where the problem lies in this case and what could be   
   > > done to fix it?   
   >   
   > You can take a look at the following thread.   
   >   
   > Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt - Yu Kuai   
      
      
   Thanks for pointing that out; I will also leave the context further down   
   intact.   
      
   mpt3sas folks and block-layer experts: the above thread seems to have   
   stalled recently. Do you have any input on the way forward, or on whether   
   the mpt3sas driver is in fact behaving as expected here?   
      
   Filippo recently reported this issue while setting up a system at   
   Wikimedia; see https://bugs.debian.org/1121006 and   
   https://phabricator.wikimedia.org/T407586 for full context.   
      
   Regards,   
   Salvatore   
      
   >   
   > > thank you,   
   > > Filippo   
   > >   
   > > On Thu, Nov 20, 2025 at 02:43:24PM +0000, Filippo Giunchedi wrote:   
   > >> Hello Salvatore,   
   > >> Thank you for the quick reply.   
   > >>   
   > >> On Wed, Nov 19, 2025 at 05:59:48PM +0100, Salvatore Bonaccorso wrote:   
   > >> [...]   
   > >>>>          Capabilities: [348] Vendor Specific Information: ID=0001 Rev=1   
   Len=038    
   > >>>>          Capabilities: [380] Data Link Feature    
   > >>>>          Kernel driver in use: mpt3sas   
   > >>> This sounds like quite an interesting finding, but probably hard to   
   > >>> reproduce without the hardware if it turns out to be specific to the   
   > >>> controller type and driver.   
   > >> That's a great point re: reproducibility, and it got me curious about   
   > >> something I hadn't thought of testing, namely whether there's another   
   > >> angle to this: does any block device with the same block I/O hints   
   > >> exhibit the same problem? The answer is actually "yes".   
   > >>   
   > >> I used qemu's 'scsi-hd' device to set the same values so I could test   
   > >> locally. On an already-installed VM I added the following to present   
   > >> four new devices:   
   > >>   
   > >> -device virtio-scsi-pci,id=scsi0   
   > >>   
   > >> -drive file=./workdir/disks/disk3.qcow2,format=qcow2,if=none,id=drive3   
   > >> -device scsi-hd,bus=scsi0.0,drive=drive3,physical_block_size=4096,logical_block_size=512,min_io_size=4096,opt_io_size=16773120   
   > >>   
   > >> -drive file=./workdir/disks/disk4.qcow2,format=qcow2,if=none,id=drive4   
   > >> -device scsi-hd,bus=scsi0.0,drive=drive4,physical_block_size=4096,logical_block_size=512,min_io_size=4096,opt_io_size=16773120   
   > >>   
   > >> -drive file=./workdir/disks/disk5.qcow2,format=qcow2,if=none,id=drive5   
   > >> -device scsi-hd,bus=scsi0.0,drive=drive5,physical_block_size=4096,logical_block_size=512,min_io_size=4096,opt_io_size=16773120   
   > >>   
   > >> -drive file=./workdir/disks/disk6.qcow2,format=qcow2,if=none,id=drive6   
   > >> -device scsi-hd,bus=scsi0.0,drive=drive6,physical_block_size=4096,logical_block_size=512,min_io_size=4096,opt_io_size=16773120   
   > >>   
   > >> I used 10G files created with 'qemu-img create -f qcow2  10G', though   
   > >> the size doesn't affect anything in my testing.   
   > >>   
   > >> Then in the VM:   
   > >>   
   > >> # cat /sys/block/sd[cdef]/queue/optimal_io_size   
   > >> 16773120   
   > >> 16773120   
   > >> 16773120   
   > >> 16773120   
   > >> # mdadm --create /dev/md1 --level 10 --bitmap none --raid-devices 4 /dev/sdc /dev/sdd /dev/sde /dev/sdf   
   > >> mdadm: Defaulting to version 1.2 metadata   
   > >> mdadm: array /dev/md1 started.   
   > >> # cat /sys/block/md1/queue/optimal_io_size   
   > >> 4293918720   
   > >>   
   > >> I was able to reproduce the problem with src:linux 6.18~rc6-1~exp1 as   
   > >> well as with 6.12.57-1.   
   > >>   
   > >> Since it is easy to test this way, I tried a few different opt_io_size   
   > >> values and was able to reproduce only with 16773120 (i.e. 0xFFF000).   
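A short sketch of the arithmetic behind these numbers (assumptions not stated in the thread: the raid10 array's own io_opt is 1 MiB, i.e. the default 512 KiB chunk times two data copies, and the block layer combines stacked io_opt values with an lcm, as blk_stack_limits() does):

```python
import math

drive_io_opt = 0xFFF000   # 16773120 = 4095 * 4096, as announced by mpt3sas
md_io_opt = 1 << 20       # assumed: 1 MiB (512 KiB chunk * 2 data copies)

# The odd factor 4095 survives the lcm: 2^20 * 4095 = 4293918720 (~4 GB),
# the value seen in /sys/block/md1/queue/optimal_io_size above.
print(math.lcm(md_io_opt, drive_io_opt))

# A power-of-two hint of comparable size would stack harmlessly:
print(math.lcm(md_io_opt, 0x1000000))   # 16 MiB
```

Under these assumptions the ~4GB figure matches the oversized metadata area the report describes, since pvcreate aligns it to the array's optimal_io_size.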
   > >>   
   > >>> I would like to ask: do you have the possibility to do an OS   
   > >>> installation such that you can freely experiment with various kernels   
   > >>> and then assemble the arrays under them? If so, it would be great if   
   > >>> you could start bisecting the changes to find where the behaviour   
   > >>> changed.   
   > >>>   
   > >>> I.e. install the OS independently of the controller, then manually   
   > >>> bisect the Debian kernels between bookworm and trixie   
   > >>> (6.1.y -> 6.12.y) to narrow down the upstream range.   
   > >> Yes, I'm able to perform testing on this host; in fact I have worked   
   > >> around the problem for now by disabling LVM's md alignment   
   > >> auto-detection, so we have an installed system.   
   > >> For reference, that's "devices { data_alignment_detection = 0 }" in   
   > >> lvm's config.   
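The workaround mentioned above, spelled out as an lvm.conf excerpt (a sketch; only this one setting is changed, everything else stays at its defaults):

```
# /etc/lvm/lvm.conf (excerpt)
devices {
    # Do not auto-align PV data/metadata to md/sysfs topology hints,
    # avoiding the bogus ~4GB alignment taken from optimal_io_size.
    data_alignment_detection = 0
}
```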
   > >>   
   > >>> Then bisect the upstream changes to find the offending commits. Let me   
      
   [continued in next message]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca