I have been asked this question in two consecutive interviews, but after some research and checking with various systems administrators I haven't received a good answer. I am wondering if somebody can help me out here.
A server is out of disk space. You notice a very large log file and determine it is safe to remove. You delete the file but the disk still shows that it is full. What would cause this and how would you remedy it? And how would you find which process is writing this huge log file?
This is a common interview question and a situation that comes up in a variety of production environments.
The file's directory entry has been deleted, but the logging process is still running and still has the file open. The space won't be reclaimed by the operating system until all file handles have been closed (e.g., the process has been killed) and all directory entries removed. To find the process writing to the file, you'll need to use the lsof command. The other part of the question can sometimes be "how do you clear a file that's being written to without killing the process?" Ideally, you'd "zero" or "truncate" the log file with something like : > /var/log/logfile instead of deleting the file.
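A minimal sketch of both halves of that answer (the log path is just an example):
lsof +L1                        # list open files whose link count is zero, i.e. deleted but still held open
truncate -s 0 /var/log/logfile  # zero the live file without disturbing the writer's file descriptor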
There's still another link to the file (either a hard link or an open file handle). Deleting a file only removes the directory entry; the file's data and inode hang around until the last reference to them has been removed.
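You can watch this behaviour with a throwaway demonstration (file name, size, and mount point are arbitrary):
dd if=/dev/zero of=/tmp/bigfile bs=1M count=512   # create a 512 MiB file
exec 3< /tmp/bigfile                              # hold it open on file descriptor 3
rm /tmp/bigfile                                   # unlink the directory entry
df -h /tmp                                        # the space is still counted as used
lsof +L1 | grep bigfile                           # the unlinked-but-open file shows up here
exec 3<&-                                         # close the descriptor...
df -h /tmp                                        # ...and the space is finally reclaimed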
It's somewhat common practice for a service to create a temporary file and immediately delete it while keeping the file open. This creates a file on disk, but guarantees that the file will be deleted if the process terminates abnormally, and also keeps other processes from accidentally stomping on the file. MySQL does this, for example, for all its on-disk temporary tables. Malware often uses similar tactics to hide its files.
Under Linux, you can conveniently access these deleted files as /proc/<pid>/fd/<filenumber>.
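For instance (the PID and fd number below are hypothetical; lsof or ls will tell you the real ones):
ls -l /proc/1234/fd | grep deleted       # see which descriptor points at the deleted file
cp /proc/1234/fd/3 /tmp/recovered.log    # copy the still-open contents somewhere safe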
I'm not a sysadmin, but from what I've gathered on Unix.SE, a Linux system won't actually delete a file (mark its space as free and reusable) after it is unlinked until all file descriptors pointing to it have been closed. So to answer the first part: the space isn't free yet because a process still has the file open. To answer the second: you can see which process is using the file with lsof.
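In practice, a large gap between df and du on the same mount point is the usual hint that deleted-but-open files are involved (assuming /var is its own mount point here):
df -h /var          # counts blocks still held by open-but-deleted files
du -sh /var         # only sees files that still have directory entries
lsof -a +L1 /var    # -a ANDs the selections: unlinked files open on this filesystem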
One alternative answer besides the obvious hard link/open file handle answer: that file is a (very) sparse file, such as /var/log/lastlog on RHEL, that wasn't actually taking up all that much space. Deleting it had very little impact, so you need to look at the next-biggest file.
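You can compare the apparent size with the blocks actually allocated to confirm a file is sparse:
ls -lhs /var/log/lastlog                 # first column is blocks actually used; the size column is the apparent size
du -h --apparent-size /var/log/lastlog   # apparent size (what ls -l reports)
du -h /var/log/lastlog                   # space actually allocated on disk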
If the process writing the file runs as root, it can also write into the superuser's reserved file space. The filesystem keeps this space (commonly 5% by default on ext filesystems) to keep the system operational when a user task fills up the disk, and it is invisible to many tools. lsof can show you which process has the file open and is therefore writing to it.
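To inspect or shrink that reserve on an ext filesystem (the device name is an example, and this needs root):
tune2fs -l /dev/sda1 | grep -i 'reserved block count'   # how many blocks are held back for root
tune2fs -m 1 /dev/sda1                                  # drop the reserve to 1% if you really need the space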
Besides the file being open by a process, a second case is a filesystem that supports snapshots, such as btrfs or ZFS. For example, suppose a snapshot was taken while that huge log file existed. Deleting the file now frees only the blocks written since the snapshot; everything the snapshot still references stays allocated until the snapshot itself is removed (and, as above, until no process still holds the file open).
See also:
https://superuser.com/questions/863588/how-to-delete-a-file-in-all-snapshots-on-a-btrfs-system
ZFS: Removing files from snapshots?
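To see how much space snapshots are pinning, and to release it, something along these lines (pool, dataset, and subvolume names are examples):
zfs list -t snapshot -o name,used tank/var    # space each ZFS snapshot holds
zfs destroy tank/var@old-snapshot             # destroying it frees the blocks the deleted file used
btrfs subvolume list -s /                     # list snapshots on a btrfs filesystem
btrfs subvolume delete /.snapshots/old-snap   # same idea on btrfs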
A third case is a filesystem that supports block-level deduplication where most of the file is identical to another file. I wouldn't expect this for a log unless you have a container or VM sending its logs to a syslog container or VM that shares the same filesystem, so that the log contents are identical.
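If you suspect that case, check whether dedup is even enabled (pool and dataset names are examples):
zfs get dedup tank/var        # is deduplication turned on for the dataset?
zpool get dedupratio tank     # how much the pool is actually deduplicating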