We're running a service that takes screenshots of supplied URLs and uploads them to our S3 bucket. It's similar to manet, but it's our own custom-coded Node.js app. We don't keep the screenshots on the local hard drive: they are stored temporarily for resizing and then deleted, and the temp image folder is always empty.
The problem: free disk space keeps shrinking until the server is restarted. For example, right now df -h shows:
ubuntu@ip-10-0-1-94:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      118G   74G   40G  65% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.4G  8.0K  7.4G   1% /dev
tmpfs           1.5G  360K  1.5G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            7.4G     0  7.4G   0% /run/shm
none            100M     0  100M   0% /run/user
However, du -sh / shows:
root@ip-10-0-1-94:~# du -sh /
du: cannot access ‘/proc/14440’: No such file or directory
du: cannot access ‘/proc/14520/task/14520/fd/4’: No such file or directory
du: cannot access ‘/proc/14520/task/14520/fdinfo/4’: No such file or directory
du: cannot access ‘/proc/14520/fd/4’: No such file or directory
du: cannot access ‘/proc/14520/fdinfo/4’: No such file or directory
du: cannot access ‘/proc/14521’: No such file or directory
7.0G /
If I run du for all the folders in the root filesystem, they sum up to 7 GB, not 74. If I restart the server, usage is back to 7 GB as it should be once it's up again, but within 10-12 hours it's 70+ GB again and climbing.
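For reference, a quick first check for this kind of df vs du gap is to look for deleted files that some process still holds open; a generic sketch (needs root to see every process):
sudo lsof -nP +L1 | head -n 20      # open files with link count 0, i.e. deleted but still held open
sudo lsof -nP | grep '(deleted)'    # alternative: grep for the "(deleted)" marker in lsof output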
We're using mongodb as our storage, so I assumed it could be the cause; however, I removed the smallfiles config option that I had set earlier and the problem is still the same.
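To rule MongoDB in or out, one can compare its on-disk footprint with the missing space and check whether mongod itself holds any deleted files (the data path below is Ubuntu's default and an assumption):
sudo du -sh /var/lib/mongodb                 # size of MongoDB's data files on disk (default dbpath, assumed)
sudo lsof -nP -c mongod | grep '(deleted)'   # deleted files still held open by mongod, if any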
I'm attaching the lsof output here and the ps aux output here.
Here is the mount output:
ubuntu@ip-10-0-1-94:~$ mount
/dev/xvda1 on / type ext4 (rw,discard)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/cgroup type tmpfs (rw)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
none on /sys/fs/pstore type pstore (rw)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)
Restarting any of the running services, like mongod or supervisor, doesn't change anything. Here is an example:
root@ip-10-0-1-94:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      118G   74G   40G  65% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.4G  8.0K  7.4G   1% /dev
tmpfs           1.5G  360K  1.5G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            7.4G     0  7.4G   0% /run/shm
none            100M     0  100M   0% /run/user
root@ip-10-0-1-94:~# service mongod restart
mongod stop/waiting
mongod start/running, process 31590
root@ip-10-0-1-94:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      118G   74G   40G  65% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.4G  8.0K  7.4G   1% /dev
tmpfs           1.5G  360K  1.5G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            7.4G     0  7.4G   0% /run/shm
none            100M     0  100M   0% /run/user
The same goes for supervisor, which controls the node processes (workers and the application):
root@ip-10-0-1-94:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      118G   74G   40G  65% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.4G  8.0K  7.4G   1% /dev
tmpfs           1.5G  360K  1.5G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            7.4G     0  7.4G   0% /run/shm
none            100M     0  100M   0% /run/user
root@ip-10-0-1-94:~# service supervisor restart
Restarting supervisor: supervisord.
root@ip-10-0-1-94:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      118G   74G   40G  65% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            7.4G  8.0K  7.4G   1% /dev
tmpfs           1.5G  360K  1.5G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            7.4G     0  7.4G   0% /run/shm
none            100M     0  100M   0% /run/user
UPDATE: as it turns out, this happens because of a gearman log full of
accept(Too many open files) -> libgearman-server/gearmand.cc:851
messages. Even though the log file has been deleted, it is still held open by the gearman process, so the space is not released. Here is the proof:
root@ip-10-0-1-94:~# sudo lsof -s | awk '$5 == "REG"' | sort -n -r -k 7,7 | head -n 1
gearmand 4221 gearman 3w REG 202,1 31748949650 143608 /var/log/gearman-job-server/gearman.log.1 (deleted)
(thanks to @Andrew Henle)
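Since the space is held only by the open descriptor, it can be released without a reboot by truncating the deleted file through /proc. A sketch, using the PID 4221 and fd 3 from the lsof line above (verify both before running it):
ls -l /proc/4221/fd/3    # should point at .../gearman.log.1 (deleted)
: > /proc/4221/fd/3      # truncate the deleted file in place; df should drop right away
Restarting gearmand, or, if logrotate is what rotates this file, using its copytruncate option, would also close the descriptor and free the blocks.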
Now the next question: why does gearman write that to the log? As stated here, it's supposedly caused by too many connections to gearman in the TIME_WAIT state. However, the connections are not in TIME_WAIT, they are in ESTABLISHED. Here they are.
If I run strace -p 4221, I see only this:
write(22, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169649, 568914324}) = 0
gettimeofday({1446109467, 793708}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33010), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 874
write(17, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169659, 749954206}) = 0
gettimeofday({1446109477, 974726}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33060), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 875
write(32, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169659, 754505349}) = 0
gettimeofday({1446109477, 979307}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33062), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 876
write(27, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169664, 300399805}) = 0
gettimeofday({1446109482, 525209}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33134), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 877
write(22, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169666, 161035104}) = 0
gettimeofday({1446109484, 385826}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33165), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 878
write(17, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169668, 308112847}) = 0
gettimeofday({1446109486, 532900}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33186), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 879
write(32, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169671, 251265264}) = 0
gettimeofday({1446109489, 476077}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33218), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 880
write(27, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169672, 320483648}) = 0
gettimeofday({1446109490, 545274}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33232), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 881
write(22, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169676, 186686282}) = 0
gettimeofday({1446109494, 411486}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33303), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 882
write(17, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169684, 699748557}) = 0
gettimeofday({1446109502, 924549}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33320), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 883
write(32, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169687, 906830251}) = 0
gettimeofday({1446109506, 131601}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33348), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 884
write(27, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169701, 112588731}) = 0
gettimeofday({1446109519, 337387}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33386), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 885
write(22, "\3", 1) = 1
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169707, 686312787}) = 0
gettimeofday({1446109525, 911113}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33420), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 886
write(17, "\3", 1) = 1
Each portion of
epoll_wait(6, {{EPOLLIN, {u32=9, u64=9}}}, 32, -1) = 1
clock_gettime(CLOCK_MONOTONIC, {169707, 686312787}) = 0
gettimeofday({1446109525, 911113}, NULL) = 0
accept4(9, {sa_family=AF_INET, sin_port=htons(33420), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_NONBLOCK) = 886
write(17, "\3", 1)
is added every 3-5 seconds. Nothing else happens for a few minutes.
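Given the accept(Too many open files) errors, it is also worth comparing gearmand's open descriptor count with its limit; a sketch using PID 4221 from above:
sudo ls /proc/4221/fd | wc -l                 # descriptors currently open by gearmand
sudo grep 'Max open files' /proc/4221/limits  # the soft/hard limit it is running with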
Whatever process creates this file is your culprit: given that it's called gearman.log.1, I suspect that whatever is doing the log rollover isn't doing it properly.
When you see a gross mismatch between df and du like the one you're seeing, it's usually a deleted file that a process still has open. lsof | grep deleted works well on Linux to find them. Simply searching for deleted in your posted lsof output shows several other *.1 log files that seem to have the same improper rollover issue.
Just another piece of information about CentOS: in this case, when using systemctl to launch the process, you have to modify the unit file, /usr/lib/systemd/system/processName.service. I had this line in the file:
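Presumably the line raises the per-service open-files limit; something along these lines (the directive and value here are assumptions, not taken from the original setup):
# /usr/lib/systemd/system/processName.service (excerpt)
[Service]
LimitNOFILE=65536    # raise the open-files limit for the service; the value is an assumption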
And just reload your systemd configuration:
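Typically that means reloading the unit definitions and restarting the service (the service name here is the placeholder from above):
sudo systemctl daemon-reload                 # pick up the edited unit file
sudo systemctl restart processName.service   # restart the service with the new limit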