mounted, be it local storage, fiber attached from a SAN, network based
like NFS, etc. These mounts may be exist for any number of reasons,
such as application data, user home directories, additional log space,
and so on. On occasion, such mounts can obfuscate an otherwise easily
identifiable problem. So here's our host details:
HOSTS: tux, beastie PROMPT: host [0] OSes: CentOS 5.4, FreeBSD 8.1 (respectively)Of note, for point of discussion Linux (tux) is used in the examples.
The steps for FreeBSD (beastie) follow at the bottom, though without
the main discussion. Additionally, the setup detailed below should be
usable on previous versions of both Linux and FreeBSD, and possibly
other OSes. (Notably, the following will not work on Solaris, though
it was attempted.)
Our situation is that we've been paged on an FS (/mnt) near capacity on
host 'tux'. Logging in, we check things out (step 1):
tux [0] /bin/df -h Filesystem Size Used Avail Use% Mounted on /dev/sda1 7.1G 1.6G 5.2G 24% / /dev/sda2 2.0G 71M 1.8G 4% /var tmpfs 506M 0 506M 0% /dev/shm /dev/sdc1 496M 410M 61M 88% /mnt /dev/sdd1 496M 32M 440M 7% /mnt/logs tux [0] cd /mnt tux [0] /bin/ls -aF ./ ../ bin/ data/ lib/ logs/ lost+found/ tux [0] /usr/bin/du -sh ./* 782K ./bin 61M ./data 2.1M ./lib 29M ./logs 12K ./lost+foundIn the 'df' output, we can see that '/mnt' is at 88% capacity, though the
subsequent 'du' shows nothing even close to consuming that much space.
Perhaps a file has been removed that's still held open (step 2):
tux [0] /sbin/fuser -cu /mnt /mnt: 2094c(root) 2484e(root) tux [0] /bin/ps -fp `/sbin/fuser -c /mnt 2>/dev/null` UID PID PPID C STIME TTY STAT TIME CMD root 2094 2092 0 22:26 pts/0 Ss+ 0:00 /bin/ksh root 2484 2226 37 23:05 pts/1 S+ 1:46 myapp -f /mnt/logs/app.log tux [0] /usr/bin/tty /dev/pts/0Given the above, we see that PID 2484 is currently running from '/mnt'.
Having a look below at /proc, it appears that PID 2484 has an open file
descriptor, FD 3. We get the inode number of the file '/mnt/logs/app.log'
that FD 3 is linked against in the subsequent 'ls' (step 3):
tux [0] /bin/ls -l /proc/2484/fd total 0 lrwx------ 1 root root 64 Nov 21 23:08 0 -> /dev/pts/1 lrwx------ 1 root root 64 Nov 21 23:08 1 -> /dev/pts/1 lrwx------ 1 root root 64 Nov 21 23:08 2 -> /dev/pts/1 lr-x------ 1 root root 64 Nov 21 23:08 3 -> /mnt/logs/app.log tux [0] /bin/ls -liL /proc/2484/fd/3 83971 -rw-r--r-- 1 root root 354418688 Nov 21 23:02 /proc/2484/fd/3 tux [0] /usr/bin/find /mnt -inum 83971 -exec /bin/ls -li {} \; /usr/bin/find: /a/lost+found: Permission denied tux [1]The file size referenced by FD 3 is about 338 MB (354418688 bytes),
which on this FS is quite a bit. Alright, smoking gun, the file no
longer exists though there is an open FD to it. We get approval to kill
'myapp' so that the FD is released, thus reclaiming our space (step 4):
tux [1] /bin/kill 2484 tux [0] /bin/ps -fp 2484 UID PID PPID C STIME TTY TIME CMD tux [1] /bin/df -h /mnt Filesystem Size Used Avail Use% Mounted on /dev/sdc1 496M 410M 61M 88% /mntNow we definitely have an issue. We killed the offending process
and verified it is dead, yet we haven't freed up any space. It seems
we may have overlooked something, or maybe have a corrupt or really
fragmented FS. We do have a couple of clues, however. In our first
'df' output, we saw (step 5):
<snip...> /dev/sdc1 496M 410M 61M 88% /mnt /dev/sdd1 496M 32M 440M 7% /mnt/logsFrom the 'df', we have 'sdc1' attached at '/mnt'; we also have 'sdd1'
subsequently attached at '/mnt/logs'. It is possible that our issue lies
under the mount of '/mnt/logs'. Our second clue, at least in Linux,
come from our 'find' of the inode. Our 'find' can't seem to locate
the inode on '/mnt' though there isn't any message in our 'ls' under
'/proc' indicating the file was previously deleted like we'd expect.
(If the file linked to by an FD has been deleted, Linux tends to tag
"(deleted)" at the end of the line in an 'ls' long listing as was seen
in the writeup for "Removing / Recovering an Open File (Linux)").
Fortunately, we can have a look at any possible file(s) that '/mnt/logs'
may be mounted over and hiding by remounting '/mnt' via NFS. Assuming NFS
isn't running, we need to set it up with an entry in '/etc/exports' and
starting the necessary services. After which, we can mount '/mnt'
via NFS elsewhere, such as under '/a' (step 6):
tux [0] /bin/cat /etc/exports /mnt localhost(ro) 127.0.0.1(ro) tux [0] /etc/init.d/portmap start Starting portmap: [ OK ] tux [0] /etc/init.d/nfs start Starting NFS services: [ OK ] Starting NFS quotas: [ OK ] Starting NFS daemon: [ OK ] Starting NFS mountd: [ OK ] Starting RPC idmapd: [ OK ] tux [0] [ ! -d /a ] && /bin/mkdir /a tux [0] /bin/mount -t nfs -o ro,nolock localhost:/mnt /a tux [0] /bin/df -h /a Filesystem Size Used Avail Use% Mounted on localhost:/mnt 496M 410M 61M 88% /aAfter mounting '/mnt' to '/a' we can now have a look at anything that
may reside beneath the '/mnt/logs' mount point that 'sdd1' is mounted
to (step 7):
tux [0] /bin/ls -aF /a/logs ./ ../ app.log app.log.0.gz app.log.1.gz app.log.2.gz app.log.3.gz tux [0] /usr/bin/du -sh /a/logs/* 340M /a/logs/app.log 1.1M /a/logs/app.log.0.gz 1.2M /a/logs/app.log.1.gz 1.3M /a/logs/app.log.2.gz 1.4M /a/logs/app.log.3.gz tux [0] /usr/bin/find /a -inum 83971 -exec /bin/ls -li {} \; /usr/bin/find: /a/lost+found: Permission denied 83971 -rw-r--r-- 1 root root 354418688 Nov 21 23:02 /a/logs/app.logWell, there's your problem. It appears that with 'sdd1' mounted
to '/mnt/logs', files were hidden below it including one at 340 MB.
It is possible that at some point the FS on 'sdd1' wasn't available
and 'myapp' instead was writing files to the directory structure on
'sdc1'. In addition, from the above we've also verified that 'app.log'
has the same inode number as we saw earlier from the details in '/proc'
regarding FD 3 of PID 2484. A quick review of the contents on 'sdd1' at
'/mnt/logs' for comparison further helps to verify our current analysis
(step 8):
tux [1] /bin/ls -aF /mnt/logs ./ app.log app.log.1.gz app.log.3.gz ../ app.log.0.gz app.log.2.gz lost+found/ tux [0] /usr/bin/du -sh /mnt/logs/* 25M /mnt/logs/app.log 1.1M /mnt/logs/app.log.0.gz 1.2M /mnt/logs/app.log.1.gz 1.3M /mnt/logs/app.log.2.gz 1.4M /mnt/logs/app.log.3.gz 12K /mnt/logs/lost+foundSince we've identified our issue, it's time for resolution. We need
to unmount '/a', kill our NFS services, unmount '/mnt/logs', and remove
the problem files (step 9):
tux [0] /bin/umount /a tux [0] /etc/init.d/nfs stop Shutting down NFS mountd: [ OK ] Shutting down NFS daemon: [ OK ] Shutting down NFS quotas: [ OK ] Shutting down NFS services: [ OK ] tux [0] /etc/init.d/portmap stop Stopping portmap: [ OK ] tux [0] /bin/umount /mnt/logs tux [0] pwd /mnt tux [0] /bin/ls -li logs/app.log 83971 -rw-r--r-- 1 root root 354418688 Nov 21 23:02 logs/app.log tux [0] /bin/rm logs/app.log* tux [0] /bin/ls -aF logs ./ ../In the above, just prior to file removal, we've confirmed one last
time that app.log is indeed our problem file. Since there should be no
files under the mount point, we've removed all of them. Nothing left
but to remount 'sdd1' to '/mnt/logs', verify our FS are online and that
we freed up some space, and restarting our app (step 10):
tux [0] /bin/mount /dev/sdd1 /mnt/logs tux [0] /bin/df -h /mnt /mnt/logs Filesystem Size Used Avail Use% Mounted on /dev/sdc1 496M 66M 405M 14% /mnt /dev/sdd1 496M 32M 440M 7% /mnt/logs tux [0] myapp -f /mnt/logs/app.log tux [0] /bin/ps -ef | /bin/grep my[a]pp root 11942 2226 18 23:47 pts/1 00:00:07 myapp -f /mnt/logs/app.logFreeBSD Steps:
Step 1: beastie [0] /bin/df -h Filesystem Size Used Avail Capacity Mounted on /dev/da0s1a 7.0G 1.2G 5.2G 19% / devfs 1.0K 1.0K 0B 100% /dev /dev/da0s1d 1.9G 2.5M 1.8G 0% /var procfs 4.0K 4.0K 0B 100% /proc /dev/da2s1a 496M 408M 48M 89% /mnt /dev/da3s1a 496M 29M 427M 6% /mnt/logs beastie [0] cd /mnt beastie [0] /bin/ls -aF ./ ../ .snap/ bin/ data/ lib/ logs/ beastie [0] /usr/bin/du -sh ./* 778K ./bin 60M ./data 2.0M ./lib 29M ./logs Step 2: beastie [0] /usr/bin/fstat -fn /mnt USER CMD PID FD DEV INUM MODE SZ|DV R/W root myapp 2677 text 0,91 32769 100555 17720 r root myapp 2677 3 0,91 16394 100600 356515840 r root ksh 1879 wd 0,91 2 40755 512 r beastie [0] /bin/ps ajwwxp 2677 1879 USER PID PPID PGID SID JOBC STAT TT TIME COMMAND root 1879 1876 1879 1879 0 Is+ 1 0:01.04 /bin/ksh root 2677 2668 2677 2668 1 I+ 2 1:14.85 myapp -f /mnt/logs/app.log beastie [0] tty /dev/pts/1 Step 3:Of note, FreeBSD doesn't give access to the open FDs via '/proc'
as Linux does. 'fstat', however, does allow us to output the device
numbers of the disk that a file resides on as with it's size. The 'ls'
output below is to verify the disk in question:
beastie [0] /bin/ls -ld /dev/da2s1a crw-r----- 1 root operator 0, 91 Nov 21 20:21 /dev/da2s1a beastie [0] /usr/bin/find /mnt -inum 16394 -exec /bin/ls -li {} \; beastie [0] Step 4: beastie [0] /bin/kill 2677 beastie [0] /bin/ps ajwwxp 2677 USER PID PPID PGID SID JOBC STAT TT TIME COMMAND beastie [1] /bin/df -h /mnt Filesystem Size Used Avail Capacity Mounted on /dev/da2s1a 496M 408M 48M 89% /mnt Step 5: <snip...> /dev/da2s1a 496M 408M 48M 89% /mnt /dev/da3s1a 496M 29M 427M 6% /mnt/logs Step 6: beastie [0] cat /etc/exports /mnt -ro localhost 127.0.0.1 beastie [0] /usr/sbin/rpcbind -L beastie [0] /usr/sbin/nfsd -u -t -n 4 beastie [0] /usr/sbin/mountd -r beastie [0] [ ! -d /a ] && /bin/mkdir /a beastie [0] /sbin/mount_nfs -o ro,udp,nolockd localhost:/mnt /a beastie [0] /bin/df -h /a Filesystem Size Used Avail Capacity Mounted on localhost:/mnt 496M 408M 48M 89% /a Step 7: beastie [0] /bin/ls -aF /a/logs ./ app.log app.log.1.gz app.log.3.gz ../ app.log.0.gz app.log.2.gz beastie [0] /usr/bin/du -sh /a/logs/* 340M /a/logs/app.log 1.0M /a/logs/app.log.0.gz 1.1M /a/logs/app.log.1.gz 1.2M /a/logs/app.log.2.gz 1.3M /a/logs/app.log.3.gz beastie [0] /usr/bin/find /a \( -inum 16394 -or -inum 32769 \) -exec /bin/ls -li {} \; 32769 -r-xr-xr-x 1 root wheel 17720 Nov 21 20:36 /a/bin/myapp 16394 -rw------- 1 root wheel 356515840 Nov 21 20:45 /a/logs/app.log Step 8: beastie [0] /bin/ls -aF /mnt/logs ./ app.log app.log.1.gz app.log.3.gz ../ app.log.0.gz app.log.2.gz beastie [0] /usr/bin/du -sh /mnt/logs/* 24M /mnt/logs/app.log 1.1M /mnt/logs/app.log.0.gz 1.2M /mnt/logs/app.log.1.gz 1.3M /mnt/logs/app.log.2.gz 1.4M /mnt/logs/app.log.3.gz Step 9: beastie [0] /sbin/umount /a beastie [0] /bin/pkill -9 mountd nfsd rpcbind beastie [0] /sbin/umount /mnt/logs beastie [0] pwd /mnt beastie [0] /bin/ls -li logs/app.log 16394 -rw------- 1 root wheel 356515840 Nov 21 20:45 logs/app.log beastie [0] /bin/rm logs/app.log* beastie [0] /bin/ls -aF logs ./ ../ Step 10: beastie [0] /sbin/mount /dev/da3s1a /mnt/logs beastie [0] /bin/df -h /mnt /mnt/logs Filesystem Size Used Avail Capacity Mounted on /dev/da2s1a 496M 63M 393M 14% /mnt /dev/da3s1a 496M 29M 427M 6% /mnt/logs beastie [0] myapp -f /mnt/logs/app.log beastie [0] /bin/ps awx | /usr/bin/grep my[a]pp 2810 2 I+ 0:06.19 myapp -f /mnt/logs/app.log