For example, I want to download PCL 3d_rec_framework.
This is the git repository of PCL: https://github.com/PointCloudLibrary/pcl.git
How can I download this directory?
https://github.com/PointCloudLibrary/pcl/tree/master/apps
I tried this, but it didn't work:
sam@sam:~/code/pcl_standalone$ git clone https://github.com/PointCloudLibrary/pcl/tree/master/apps/3d_rec_framework
Cloning into '3d_rec_framework'...
error: The requested URL returned error: 403 while accessing https://github.com/PointCloudLibrary/pcl/tree/master/apps/3d_rec_framework/info/refs
fatal: HTTP request failed
sam@sam:~/code/pcl_standalone$
I don't want to clone the whole PCL repository and then delete all the other directories I don't need.
How do I download just a single directory?
dobey's answer is no longer true as of git v1.7: you can now check out individual folders from a repository. The full instructions are found here.
You tell git which directories you want to check out via the sparse-checkout file, and then you can pull just those directories.
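A minimal sketch of that workflow, adapted to the question's repository (the pcl_sparse directory name is arbitrary, and it assumes the default branch is master):
mkdir pcl_sparse
cd pcl_sparse
git init
# Note: this still downloads the full history; only the working tree is limited.
git remote add -f origin https://github.com/PointCloudLibrary/pcl.git
git config core.sparseCheckout true
# The sparse-checkout file lists the directories you want checked out.
echo "apps/3d_rec_framework/" >> .git/info/sparse-checkout
git pull origin master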
git clone --filter from git 2.19
This now works on GitHub (tested 2021-01-14 on git 2.30.0). The option was added together with an update to the remote protocol, and it truly prevents objects from being downloaded from the server.
E.g., to clone only the objects required for the d1 directory of this minimal test repository, https://github.com/cirosantilli/test-git-partial-clone, I can do:
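(A sketch; it assumes the repository's default branch is master. The individual flags are explained in the Command breakdown section below.)
git clone --depth 1 --filter=blob:none --no-checkout https://github.com/cirosantilli/test-git-partial-clone
cd test-git-partial-clone
git sparse-checkout set d1
git checkout master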
Here's a less minimal and more realistic version at https://github.com/cirosantilli/test-git-partial-clone-big-small
That repository contains:
All contents are pseudo-random and therefore incompressible.
Clone times on my 36.4 Mbps internet connection:
The sparse-checkout part is also needed, unfortunately. You can also download only certain files with the much more understandable method sketched just below, but that method for some reason downloads files one by one very slowly, making it unusable unless you have very few files in the directory.
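That per-file approach looks roughly like this (a sketch; git checkout branch -- path fetches just the blobs it needs from a blob:none partial clone):
git clone --depth 1 --filter=blob:none --no-checkout https://github.com/cirosantilli/test-git-partial-clone
cd test-git-partial-clone
# Fetches the needed blobs on demand, one by one.
git checkout master -- d1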
Analysis of the objects in the minimal repository
The clone command obtains only the commit at the tip of the master branch, plus the tree objects for the directories d1, d2 and master.
Then, the git sparse-checkout set command fetches only the missing blobs (files) from the server: d1/a and d1/b.
Even better, later on GitHub will likely start supporting:
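Presumably a clone along these lines (the exact form is an assumption; both the combine: spelling and repeated --filter options are discussed in the command breakdown below):
git clone --depth 1 --filter=blob:none --filter=tree:0 --no-checkout https://github.com/cirosantilli/test-git-partial-clone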
where --filter=tree:0 from Git 2.20 will prevent the unnecessary fetch of all tree objects at clone time and allow it to be deferred to checkout. But on my 2020-09-18 test that fails, presumably because the --filter=combine: composite filter (added in Git 2.24, implied by passing multiple --filter options) is not yet implemented.
I observed which objects were fetched with:
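Presumably the object-listing command from the linked answer, something like:
# Lists every object in the local database with its type and size.
git cat-file --batch-check --batch-all-objects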
as mentioned at: https://stackoverflow.com/questions/7348698/git-how-to-list-all-objects-in-the-database/18793029#18793029. It does not give me a super clear indication of what each object is exactly, but it does say the type of each object (commit, tree, blob), and since there are so few objects in that minimal repo, I can unambiguously deduce what each object is.
git rev-list --objects --all did produce clearer output with paths for trees/blobs, but it unfortunately fetches some objects when I run it, which makes it hard to determine what was fetched when; let me know if anyone has a better command.
TODO: find the GitHub announcement saying when they started supporting it. https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/ from 2020-01-17 already mentions --filter blob:none.
git sparse-checkout
I think this command is meant to manage a settings file that says "I only care about these subtrees" so that future commands will only affect those subtrees. But it is a bit hard to be sure because the current documentation is a bit... sparse ;-)
It does not, by itself, prevent the fetching of blobs.
If this understanding is correct, then it would be a good complement to git clone --filter described above, as it would prevent unintentional fetching of more objects if you intend to do git operations in the partially cloned repo.
When I tried it on Git 2.25.1:
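The attempt looked roughly like this (a sketch; the exact invocation is an assumption):
git clone --filter=blob:none --no-checkout https://github.com/cirosantilli/test-git-partial-clone
cd test-git-partial-clone
git sparse-checkout init --cone
git sparse-checkout set d1
git checkout master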
it didn't work, because the init actually fetched all objects. However, in Git 2.28 it didn't fetch the objects, as desired. But then if I do:
d1 is not fetched and checked out, even though this says explicitly that it should: https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/#sparse-checkout-and-partial-clones (which does come with a disclaimer). So yeah, it's just too hard to be certain at the moment, thanks in part to the joys of GitHub being closed source. But let's keep an eye on it.
Command breakdown
The server should be configured with:
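Presumably the upload-pack options that permit partial clones (GitHub enables the equivalent on its side; for the local file:// test below you set them yourself):
git config --local uploadpack.allowFilter true
git config --local uploadpack.allowAnySHA1InWant true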
Command breakdown:
- --filter=blob:none skips all blobs, but still fetches all tree objects
- --filter=tree:0 skips the unneeded trees: https://www.spinics.net/lists/git/msg342006.html
- --depth 1 already implies --single-branch, see also: https://stackoverflow.com/questions/1778088/how-to-clone-a-single-branch-in-git
- file://$(path) is required to overcome git clone protocol shenanigans: https://stackoverflow.com/questions/47307578/how-to-shallow-clone-a-local-git-repository-with-a-relative-path
- --filter=combine:FILTER1+FILTER2 is the syntax to use multiple filters at once; trying to pass --filter more than once for some reason fails with: "multiple filter-specs cannot be combined". This was added in Git 2.24 at e987df5fe62b8b29be4cdcdeb3704681ada2b29e "list-objects-filter: implement composite filters". Edit: on Git 2.28, I experimentally see that --filter=FILTER1 --filter FILTER2 also has the same effect, since GitHub does not implement combine: yet as of 2020-09-18 and complains: fatal: invalid filter-spec 'combine:blob:none+tree:0'. TODO: introduced in which version?
The format of --filter is documented on man git-rev-list.
Docs on Git tree:
Test it out locally
The following script reproducibly generates the https://github.com/cirosantilli/test-git-partial-clone repository locally, does a local clone, and observes what was cloned:
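The full script is not reproduced here (see the GitHub upstream link just below); a condensed sketch of the same idea, with made-up file contents, is:
#!/usr/bin/env bash
set -eux

# Build a small server repository containing d1/ and d2/.
rm -rf server client
git init server
cd server
git checkout -b master
git config --local uploadpack.allowFilter true
git config --local uploadpack.allowAnySHA1InWant true
mkdir d1 d2
echo d1/a > d1/a
echo d1/b > d1/b
echo d2/a > d2/a
echo d2/b > d2/b
git add .
git commit -m 'add d1 and d2'
cd ..

# Partial clone over file:// so that --depth and --filter apply.
git clone --depth 1 --filter=blob:none --no-checkout "file://$(pwd)/server" client
cd client
git sparse-checkout set d1
git checkout master

# Observe which objects actually reached the local object database.
git cat-file --batch-check --batch-all-objects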
GitHub upstream.
Output in Git v2.19.0:
Conclusions: all blobs from outside of d1/ are missing. E.g. 0975df9b39e23c15f63db194df7f45c76528bccb, which is d2/b, is not there after checking out d1/a.
Note that root/root and mybranch/mybranch are also missing, but --depth 1 hides that from the list of missing files. If you remove --depth 1, then they show on the list of missing files.
I have a dream
This feature could revolutionize Git.
Imagine having all the code base of your enterprise in a single repo without ugly third-party tools like repo.
Imagine storing huge blobs directly in the repo without any ugly third-party extensions.
Imagine if GitHub would allow per-file / per-directory metadata like stars and permissions, so you can store all your personal stuff under a single repo.
Imagine if submodules were treated exactly like regular directories: just request a tree SHA, and a DNS-like mechanism resolves your request, first looking in your local ~/.git, then at closer servers (your enterprise's mirror / cache), and finally ending up on GitHub.
First, do:
Then:
...And lastly:
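The three steps themselves aren't spelled out above; based on the description that follows (shallow clone into a directory named after the sub-dir, then git filter-branch with the relative path to the sub-dir), a plausible reconstruction for the question's directory is:
# First: shallow-clone the repository into a directory named after the sub-dir.
git clone --depth 1 https://github.com/PointCloudLibrary/pcl.git 3d_rec_framework
# Then: enter it.
cd 3d_rec_framework
# And lastly: rewrite the repo so only the desired sub-directory remains.
git filter-branch --subdirectory-filter apps/3d_rec_framework HEAD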
It's that easy. Git will rewrite the repo so that only the desired sub-dir is included. This works even if the sub-dir is several layers deep. Just name the destination directory the name of the sub-dir. Then, in the git filter-branch command, put the relative path to the sub-dir. Oh, and the --depth 1 tells git to download only the tip of HEAD (essentially removing the history).
You cannot. With git, you clone the entire repository, and the full history of the repository.
There are some workarounds for getting a single file out of a git archive, listed in a Stack Exchange answer to the same question, but you will still have to download the entire repository to get the single file or directory you want.
For GitHub repos, you can clone any sub-directory (at any reference) using https://github.com/HR/github-clone
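For example, assuming the tool exposes the ghclone command its README describes:
pip install github-clone
ghclone https://github.com/PointCloudLibrary/pcl/tree/master/apps/3d_rec_framework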
If the URL of the repository is this
and from there you want the folder images, which you see through this URL,
then do
This works inside Google Colab.
Concise, modern (2020+) answer
Yes, it can be done with git 2.19+, which is already several years old.
Sparse clone:
Sparse checkout:
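A sketch of those two steps applied to the question's repository (assumes the default branch is master):
# Sparse (partial) clone: skip blobs and don't populate the working tree yet.
git clone --filter=blob:none --no-checkout https://github.com/PointCloudLibrary/pcl.git
cd pcl
# Sparse checkout: ask for just the one directory, then check it out.
git sparse-checkout set apps/3d_rec_framework
git checkout master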
CHANGE tree/master/ to trunk/
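Applied to the apps URL from the question, that substitution gives something like this (it relies on GitHub's Subversion bridge):
# https://github.com/PointCloudLibrary/pcl/tree/master/apps becomes:
svn checkout https://github.com/PointCloudLibrary/pcl/trunk/apps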
SIMPLE:
Like for your case:
To install svn on Ubuntu: https://linuxtechlab.com/simple-guide-to-install-svn-on-ubuntu/
Windows: https://tortoisesvn.net/downloads.html