I want to know how many regular files have the extension .c
in a large complex directory structure, and also how many directories these files are spread across. The output I want is just those two numbers.
I've seen this question about how to get the number of files, but I need to know the number of directories the files are in too.
- My filenames (including directories) might have any characters; they may start with `.` or `-` and have spaces or newlines.
- I might have some symlinks whose names end with `.c`, and symlinks to directories. I don't want symlinks to be followed or counted, or I at least want to know if and when they are being counted.
- The directory structure has many levels and the top-level directory (the working directory) has at least one `.c` file in it.
I hastily wrote some commands in the (Bash) shell to count them myself, but I don't think the result is accurate...
```shell
shopt -s dotglob
shopt -s globstar
mkdir out
for d in **/; do
    find "$d" -maxdepth 1 -type f -name "*.c" >> out/$(basename "$d")
done
ls -1Aq out | wc -l
cat out/* | wc -l
```
This outputs complaints about ambiguous redirects, misses files in the current directory, trips up on special characters (for example, redirected `find` output prints newlines in filenames), and writes a whole bunch of empty files (oops).

How can I reliably enumerate my `.c` files and their containing directories?
In case it helps, here are some commands to create a test structure with bad names and symlinks:
```shell
mkdir -p cfiles/{1..3}/{a..b} && cd cfiles
mkdir space\ d
touch -- i.c -.c bad\ .c 'terrible
.c' not-c .hidden.c
for d in space\ d 1 2 2/{a..b} 3/b; do cp -t "$d" -- *.c; done
ln -s 2 dirlink
ln -s 3/b/i.c filelink.c
```
In the resulting structure, 7 directories contain `.c` files, and 29 regular files end with `.c` (if `dotglob` is off when the commands are run; if I've miscounted, please let me know). These are the numbers I want.
Please feel free not to use this particular test.
N.B.: Answers in any shell or other language will be tested & appreciated by me. If I have to install new packages, no problem. If you know a GUI solution, I encourage you to share (but I might not go so far as to install a whole DE to test it) :) I use Ubuntu MATE 17.10.
I haven't examined the output with symlinks but:
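A pipeline matching that description might look like this (my reconstruction, not necessarily the author's exact command; it assumes the GNU versions of `find`, `sort`, `uniq`, and `sed` for the NUL-delimited options, and the `awk` variable names `d` and `f` are my choice):

```shell
# Print the directory of each .c regular file, NUL-delimited, then
# count files per directory and reduce each record to "count 1".
find . -type f -name '*.c' -printf '%h\0' |
  sort -z | uniq -zc |
  sed -z 's/^ *\([0-9]*\) .*/\1 1/' |
  tr '\0' '\n' |
  awk '{ f += $1; d += $2 } END { print d, f }'
```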
- The `find` command prints the directory name of each `.c` file it finds.
- `sort | uniq -c` will give us how many files are in each directory (the `sort` might be unnecessary here, not sure).
- With `sed`, I replace the directory name with `1`, thus eliminating all possible weird characters, with just the count and `1` remaining.
- After `tr` converts the NULs to newlines, `d` here is essentially the same as `NR`. I could have omitted inserting `1` in the `sed` command, and just printed `NR` here, but I think this is slightly clearer.

Up until the `tr`, the data is NUL-delimited, safe against all valid filenames.

With zsh and bash, you can use
`printf %q` to get a quoted string, which would not have newlines in it, so you might be able to do something similar with a bash `**/*.c` glob. However, even though `**` is not supposed to expand for symlinks to directories, I could not get the desired output on bash 4.4.18(1) (Ubuntu 16.04). But zsh worked fine, and the command can be simplified:
`D` enables this glob to select dot files, `.` selects regular files (so, not symlinks), and `:h` prints only the directory path and not the filename (like `find`'s `%h`) (see the zsh manual's sections on Filename Generation and Modifiers). So with the `awk` command we just need to count the number of unique directories appearing, and the number of lines is the file count.

Python has
`os.walk`, which makes tasks like this easy, intuitive, and automatically robust even in the face of weird filenames such as those that contain newline characters. This Python 3 script, which I had originally posted in chat, is intended to be run in the current directory (but it doesn't have to be located in the current directory, and you can change what path it passes to `os.walk`). It prints the count of directories that directly contain at least one file whose name ends in `.c`, followed by a space, followed by the count of files whose names end in `.c`. "Hidden" files, that is, files whose names start with `.`, are included, and hidden directories are similarly traversed.

`os.walk` recursively traverses a directory hierarchy. It enumerates all the directories that are recursively accessible from the starting point you give it, yielding information about each of them as a tuple of three values, `root, dirs, files`. For each directory it traverses to (including the first one whose name you give it):

- `root` holds the pathname of that directory. Note that this is totally unrelated to the system's "root directory" `/` (and also unrelated to `/root`), though it would go to those if you started there. In this case, `root` starts at the path `.`, i.e., the current directory, and goes everywhere below it.
- `dirs` holds a list of the pathnames of all the subdirectories of the directory whose name is currently held in `root`.
- `files` holds a list of the pathnames of all the files that reside in the directory whose name is currently held in `root` but that are not themselves directories. Note that this includes other kinds of files than regular files, including symbolic links, but it sounds like you don't expect any such entries to end in `.c` and are interested in seeing any that do.

In this case, I only need to examine the third element of the tuple, `files` (which I call `fs` in the script). Like the `find` command, Python's `os.walk` traverses into subdirectories for me; the only thing I have to inspect myself is the names of the files each of them contains. Unlike the `find` command, though, `os.walk` automatically provides me a list of those filenames.

That script does not follow symbolic links. You very probably don't want symlinks followed for such an operation, because they could form cycles, and because even if there are no cycles, the same files and directories may be traversed and counted multiple times if they are accessible through different symlinks.
If you ever did want `os.walk` to follow symlinks, which you usually wouldn't, then you can pass `followlinks=True` to it. That is, instead of writing `os.walk('.')` you could write `os.walk('.', followlinks=True)`. I reiterate that you would rarely want that, especially for a task like this where you are recursively enumerating an entire directory structure, no matter how big it is, and counting all the files in it that meet some requirement.

Find + Perl:
Explanation
- The `find` command will find any regular files (so no symlinks or directories) and then print the name of the directory they are in (`%h`) followed by `\0`.
- `perl -0 -ne`: read the input line by line (`-n`) and apply the script given by `-e` to each line. The `-0` sets the input line separator to `\0` so we can read null-delimited input.
- `$k{$_}++`: `$_` is a special variable that takes the value of the current line. This is used as a key to the hash `%k`, whose values are the number of times each input line (directory name) was seen.
- `}{`: this is a shorthand way of writing `END{}`. Any commands after the `}{` will be executed once, after all input has been processed.
- `print scalar keys %k, " $.\n"`: `keys %k` returns an array of the keys in the hash `%k`, and `scalar keys %k` gives the number of elements in that array, i.e. the number of directories seen. This is printed along with the current value of `$.`, a special variable that holds the current input line number. Since this is run at the end, the current input line number will be the number of the last line, so the number of lines seen so far.

You could expand the perl command to this, for clarity:
Here's my suggestion:
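The script might look like this (my sketch of what is described below; the `sed` expression keeps a trailing slash so even the top-level directory `.` still contains a `/` to count):

```shell
#!/bin/bash
# Sketch: list all .c files into a tempfile, then count files and
# the directories containing them with grep -c /.
tmp=$(mktemp)
find . -type f -name '*.c' > "$tmp"
grep -c / "$tmp"                                  # file count
sed 's%/[^/]*$%/%' "$tmp" | sort -u | grep -c /   # directory count
rm -- "$tmp"
```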
This short script creates a tempfile, finds every file in and under the current directory ending in `.c`, and writes the list to the tempfile. `grep` is then used twice to count the files (following "How can I get a count of files in a directory using the command line?"): the second time, directories that are listed multiple times are removed using `sort -u` after stripping filenames from each line using `sed`.

This also works properly with newlines in filenames: `grep -c /` counts only lines with a slash and therefore considers only the first line of a multi-line filename in the list.

Output
Small shellscript
I suggest a small bash shellscript with two main command lines (and a variable `filetype` to make it easy to switch in order to look for other file types). It does not look for or in symlinks, only regular files.
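Such a script might look like this (my sketch, not necessarily the author's exact script; it assumes GNU `find` and `sort`, and counts NUL delimiters to count directories):

```shell
#!/bin/bash
filetype=c
# directories containing at least one matching regular file:
find . -type f -name "*.$filetype" -printf '%h\0' | sort -zu | tr -cd '\0' | wc -c
# matching regular files (one dot printed per file):
find . -type f -name "*.$filetype" -printf '.' | wc -c
```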
Verbose shellscript
This is a more verbose version that also considers symbolic links.
Test output
From short shellscript:
From verbose shellscript:
Simple Perl one liner:
Or simpler with the `find` command:

If you like golfing and have a recent (as in less than a decade old) Perl:

Consider using the `locate` command, which is much faster than the `find` command.

Running on test data
Thanks to Muru for his Unix & Linux answer that helped me through stripping symbolic links out of the file count. Thanks to Terdon for his Unix & Linux answer on `$PWD` (not directed at me).

Original answer below, referenced by comments
Short Form:
- `sudo updatedb`: update the database used by the `locate` command if `.c` files were created today or if you've deleted `.c` files today.
- `locate -cr "$PWD.*\.c$"`: locate all `.c` files in the current directory and its children (`$PWD`). Instead of printing file names, print the count with the `-c` argument. The `r` specifies a regex instead of the default `*pattern*` matching, which can yield too many results.
- `locate -r "$PWD.*\.c$" | sed 's%/[^/]*$%/%' | uniq -c | wc -l`: locate all `*.c` files in the current directory and below. Remove the file name with `sed`, leaving only the directory name. Count the number of files in each directory using `uniq -c`, and count the number of directories with `wc -l`.

Start at the current directory with a one-liner:
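Combined into a one-liner, that might look like this (a sketch; `/usr/src` is only an example starting directory, and the `locate` database must be up to date):

```shell
# File count, then directory count, for .c files under /usr/src.
cd /usr/src && locate -cr "$PWD.*\.c$" && \
  locate -r "$PWD.*\.c$" | sed 's%/[^/]*$%/%' | uniq -c | wc -l
```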
Notice how the file count and directory count have changed. I believe all users have the `/usr/src` directory and can run the above commands, with different counts depending on the number of installed kernels.

Long Form:
The long form includes the time, so you can see how much faster `locate` is than `find`. Even if you have to run `sudo updatedb`, it is many times faster than a single `find /`.

Note: this covers all files on ALL drives and partitions, i.e. we can search for Windows commands too:
I have three Windows 10 NTFS partitions automatically mounted in `/etc/fstab`. Be aware `locate` knows everything!

Interesting Count:
It takes 15 seconds to count 1,637,135 files in 286,705 directories. YMMV.
For a detailed breakdown of the `locate` command's regex handling (it appears not to be needed in this Q&A, but is used just in case), please read: Use "locate" under some specific directory?
command's regex handling (appears not to be needed in this Q&A but used just in case) please read this: Use "locate" under some specific directory?Additional reading from recent articles: