Bug #224

Panic when accessing named pipe in ZFS snapshot

Added by Bryan Leaman about 1 year ago. Updated about 1 year ago.

Status: Closed
Start: 08/04/2010
Priority: High
Due date:
Assigned to: -
% Done: 0%
Category: -
Spent time: -
Target version: -

Description

Using NCP3 RC3 under VMware.

I discovered this while trying to back up my root volume from a ZFS snapshot using tar. This did not happen with OpenSolaris b134, where I was able to tar the root file system successfully.

I can reproduce this easily by running ls on the named pipe:

root@nexenta:~# zfs snapshot syspool/rootfs-nmu-002@mysnap
root@nexenta:~# cd /.zfs/snapshot/mysnap/etc/saf
root@nexenta:/.zfs/snapshot/mysnap/etc/saf# ls -al _sactab
-rw-r--r-- 1 root sys 52 Jun 29 09:19 _sactab
root@nexenta:/.zfs/snapshot/mysnap/etc/saf# ls -al _sysconfig
-rw-r--r-- 1 root sys 45 Jun 29 09:19 _sysconfig
root@nexenta:/.zfs/snapshot/mysnap/etc/saf# ls -al _sacpipe

panic[cpu0]/thread=ffffff00d199d740:
BAD TRAP: type=e (#pf Page fault) rp=ffffff000497ca70 addr=96 occurred in module "zfs" due to a NULL pointer dereference
ls:
#pf Page fault
Bad kernel fault at addr=0x96
pid=1049, pc=0xfffffffff79c523e, sp=0xffffff000497cb60, eflags=0x10286
cr0: 8005003b cr4: 6b8
cr2: 96
cr3: 5182000
cr8: c
rdi:                0 rsi:                0 rdx:              30d
rcx:              158  r8: ffffff00d360ea00  r9: ffffff00d1625d00
rax:                0 rbx:               10 rbp: ffffff000497cba0
r10: fffffffffb851078 r11:                0 r12: ffffff00d3b33d48
r13: ffffff00d40f8120 r14:                0 r15:              30d
fsb:                0 gsb: fffffffffbc2faa0  ds:               4b
es:               4b  fs:                0  gs:              1c3
trp:                e err:                0 rip: fffffffff79c523e
cs:               30 rfl:            10286 rsp: ffffff000497cb60
ss:               38
ffffff000497c950 unix:die+dd ()
ffffff000497ca60 unix:trap+177b ()
ffffff000497ca70 unix:cmntrap+e6 ()
ffffff000497cba0 zfs:zil_commit+26 ()
ffffff000497cc10 zfs:zfs_fsync+e2 ()
ffffff000497cc60 genunix:fop_fsync+5a ()
ffffff000497cd40 fifofs:fifo_fsync+fa ()
ffffff000497cd90 fifofs:fifo_inactive+81 ()
ffffff000497cde0 genunix:fop_inactive+af ()
ffffff000497ce00 genunix:vn_rele+5f ()
ffffff000497cea0 genunix:cstatat64_32+b1 ()
ffffff000497cec0 genunix:lstat64_32+31 ()
ffffff000497cf10 unix:brand_sys_sysenter+1e0 ()

I can provide the core files from /var/crash if necessary.
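Reading the trace from the bottom up, the panic is triggered by the lstat64 system call that `ls -al` makes, not by opening or reading the pipe. When the last reference to the FIFO vnode is dropped (vn_rele), fifo_inactive() calls fifo_fsync(), which goes through zfs_fsync() into zil_commit(), where the NULL pointer dereference occurs. Based on that trace, a single lstat() on the FIFO should be enough to reproduce it. The sketch below is not part of the original report, and the function name probe_path is hypothetical. It issues only that lstat() call and reports the file type. On an affected kernel it is expected to panic the machine when pointed at a FIFO inside .zfs/snapshot, so run it only on a test system.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical reproducer helper: lstat() one path, as `ls -al` does.
 * Returns 1 if the path is a FIFO, 0 if it is some other file type,
 * and -1 if lstat() fails. lstat() does not open the FIFO, so it
 * never blocks waiting for a reader or writer.
 */
int probe_path(const char *path)
{
    struct stat st;

    if (lstat(path, &st) != 0) {
        fprintf(stderr, "lstat(%s): %s\n", path, strerror(errno));
        return -1;
    }
    if (S_ISFIFO(st.st_mode)) {
        printf("%s: FIFO\n", path);
        return 1;
    }
    printf("%s: not a FIFO\n", path);
    return 0;
}
```

For example, `probe_path("/.zfs/snapshot/mysnap/etc/saf/_sacpipe")` issues the same lstat() that `ls -al _sacpipe` made in the transcript above.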

mdb.thinkmate1 - output from mdb with nfs crash (14 KB) Jeff Strunk, 10/04/2010 06:29 am

mdb.thinkmate1-delete-after-send - dump from clear instructions (6 KB) Jeff Strunk, 10/04/2010 10:40 am


History

Updated by Bryan Leaman about 1 year ago

This appears to be OpenSolaris bug 6957113 - "accessing a fifo special file in .zfs snapshot dir panics kernel". It was fixed in snv_142 but perhaps not backported to NCP3?

http://bugs.opensolaris.org/bugdatabase/viewbug.do?bugid=6957113

Updated by Peter Radig about 1 year ago

The fix for this bug is a one-liner, so it should be easy to integrate.

Updated by Bryan Leaman about 1 year ago

Yes, that's correct. I rebuilt Nexenta from source after applying the one-line fix, and it eliminated the panic. It would be great if this could be integrated into NCP3 for the release, since it's such a simple fix.

Updated by Garrett D'Amore about 1 year ago

  • Status changed from New to Closed

I've copied this to our bug reporting tool for NexentaStor and will be tracking it there. It's bug ID 2611 on hg.nexenta.com.

Closing this bug, as we will track it there. That said, I think we can integrate this change relatively easily. The window of opportunity for 3.0 may have passed, but we can at least do it for 3.1.

Updated by Jeff Strunk about 1 year ago

Since I don't have access to hg.nexenta.com I am putting this information here.

I applied the fix for this bug to test it. It works for local access: I can run 'find .zfs -type p' as many times as I want when logged in to the file server where the fix has been applied. I verified that an identical system without the fix crashes immediately when I run that command.

However, running that command twice from a Linux NFS client crashes the patched NCP file server.

The system in question is running NCP 3.0.1 with only the patch for bug 6957113 applied from upstream.
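The local check in this comment walks the snapshot tree and needs the file type of every entry to pick out the FIFOs, which typically means a stat call per entry. Per the trace in the original report, that is the step that ends in fifo_inactive(). A minimal sketch of the same walk using POSIX nftw() with FTW_PHYS, so that entries are lstat()ed and symbolic links are not followed, is shown below. It is an illustration only: find_fifos and visit are hypothetical names, not code from Nexenta or from the fix, and whether a given find implementation stats every entry in exactly the same way is an assumption.

```c
#define _XOPEN_SOURCE 700
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

/* FIFOs seen by the current walk; nftw() callbacks have no user-data argument. */
static int fifo_count;

/* Called by nftw() once per entry. With FTW_PHYS the entry has been lstat()ed,
 * so st describes the entry itself rather than a symlink target. */
static int visit(const char *path, const struct stat *st, int type,
                 struct FTW *info)
{
    (void)info;
    /* st is undefined for FTW_NS (stat failed), so skip those entries. */
    if (type != FTW_NS && S_ISFIFO(st->st_mode)) {
        printf("%s\n", path);
        fifo_count++;
    }
    return 0; /* a non-zero return would stop the walk */
}

/* Roughly `find root -type p`: print each FIFO under root and return how many
 * were found, or -1 if nftw() reports an error. */
int find_fifos(const char *root)
{
    fifo_count = 0;
    if (nftw(root, visit, 16, FTW_PHYS) != 0)
        return -1;
    return fifo_count;
}
```

On a test server, `find_fifos("/thinkmate1/.zfs/snapshot")` corresponds to the local check. Run against the same tree through an NFS mount, it corresponds to the client-side case that, per this report, still crashes the patched server when repeated.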

Updated by Jeff Strunk about 1 year ago

After working with bslbb on IRC, here are the exact steps to reproduce:

root@thinkmate1:~# zpool destroy thinkmate1
root@thinkmate1:~# zpool create thinkmate1 raidz2 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 log c0t0d0 cache c0t1d0
root@thinkmate1:~# mkfifo /thinkmate1/testfifo
root@thinkmate1:~# zfs snapshot thinkmate1@foo0
root@thinkmate1:~# zfs destroy -r thinkmate2
root@thinkmate1:~# zfs send thinkmate1@foo0 | zfs receive -F thinkmate2
root@thinkmate1:~# zfs snapshot thinkmate1@foo1
root@thinkmate1:~# zfs send -I foo0 thinkmate1@foo1 | zfs receive -F thinkmate2
root@thinkmate1:~# zfs destroy thinkmate1@foo0
root@thinkmate1:~# find /thinkmate1/.zfs/ 
/thinkmate1/.zfs/
/thinkmate1/.zfs/snapshot
/thinkmate1/.zfs/snapshot/foo1
/thinkmate1/.zfs/shares
root@thinkmate1:~# ls /thinkmate1/
testfifo
root@thinkmate1:~# find /thinkmate1/.zfs/snapshot/foo1/
/thinkmate1/.zfs/snapshot/foo1/
/thinkmate1/.zfs/snapshot/foo1/testfifo
root@thinkmate1:~# find /thinkmate1/.zfs/snapshot/foo1/
/thinkmate1/.zfs/snapshot/foo1/
/thinkmate1/.zfs/snapshot/foo1/testfifo
root@thinkmate1:~# zfs snapshot thinkmate1@foo2
root@thinkmate1:~# zfs send -I foo1 thinkmate1@foo2 | zfs receive -F thinkmate2
root@thinkmate1:~# zfs destroy thinkmate1@foo1

At this point the system crashes. I verified that it also crashes when the pool has no cache or ZIL (log) device.

Given this, I think the fix for 6957113 only masks a deeper problem with FIFO special files.
