As a developer, there may come a time when you need to copy a Git repository without all the associated commit history. Some common reasons include:
- Open sourcing an internal project: When releasing a private repo publicly, you likely want to remove internally-facing content like user credentials, system details, etc.
- Splitting a monolithic repo: If you are breaking out a component from a large codebase into its own separate repo, you may wish to start fresh without bringing old history.
- Shrinking down repo size: In some cases, the .git folder can grow quite large, especially on long-running projects. Removing old history can significantly reduce checkout size.
- Legal or licensing issues: If parts of the commit history include unlicensed dependencies or possibly sensitive communications, it's best to just focus on the recent source code state.
Whatever the reason may be, Git provides flexible options to copy repositories without all the historical commit data. In this comprehensive guide, we’ll dig into the various techniques available.
Cloning a Git Repository with Limited Depth
The easiest way to obtain a copy of a repo without its full history is to use git clone with the --depth option. For example:
git clone --depth 5 https://github.com/user/repo.git
This will clone the repository but only fetch the most recent 5 commits from the default branch (usually main or master). You can adjust the depth number as needed – smaller values result in faster clone times but less context.
Here is an example showing the difference between cloning with and without full history:
# Clone repo normally - 275 commits
$ git clone https://github.com/libgit2/libgit2
$ cd libgit2
$ git log --oneline | wc -l
275
# Now clone with depth of 5
$ cd ..
$ git clone --depth=5 https://github.com/libgit2/libgit2 shallow-libgit2
$ cd shallow-libgit2
$ git log --oneline | wc -l
5
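As you can see, the second clone contains only 5 commits versus the 275 shown for the full clone. Treat the counts as illustrative: the totals depend on the repository, and when the history contains merge commits a depth-5 clone can hold more than 5 commits, because depth is counted along each parent chain.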
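To check how deep a clone really is without touching the network, a small throwaway experiment can help. The sketch below is illustrative only: the repository names (source, shallow), the demo identity, and the 10-commit history are all assumptions made for the example. It clones through a file:// URL because Git ignores --depth for plain local paths.

```shell
#!/bin/sh
# Sketch: compare commit counts in a full repository and a depth-5 clone of it.
set -e

# Throwaway identity so commits work even where Git is not configured.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"

# Build a hypothetical source repository with 10 linear commits.
git init -q source
cd source
git symbolic-ref HEAD refs/heads/main
for i in 1 2 3 4 5 6 7 8 9 10; do
    echo "revision $i" > file.txt
    git add file.txt
    git commit -q -m "commit $i"
done
full_count=$(git rev-list --count HEAD)
cd ..

# --depth is ignored for plain local paths, so clone through a file:// URL.
git clone -q --depth 5 "file://$work/source" shallow
cd shallow
shallow_count=$(git rev-list --count HEAD)
is_shallow=$(git rev-parse --is-shallow-repository)

echo "full: $full_count commits, shallow: $shallow_count commits, shallow repo: $is_shallow"
```

git rev-list --count HEAD is a slightly more direct alternative to piping git log through wc -l, and git rev-parse --is-shallow-repository tells you whether a given checkout was truncated at all.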
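One key thing to watch out for when using --depth is that tags pointing at commits outside the shallow history are not fetched. So if you rely on checked-out tags for builds or deployments, you may need to fetch those separately:
git clone --depth=5 https://github.com/user/repo.git
cd repo
git fetch origin --tags
Note that fetching every tag also downloads the commits those tags point to, which can bring back some of the history you skipped.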
Overall, cloning with limited depth offers a quick and simple way to eliminate history while still retrieving the latest files in a repo.
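If only one or two tags matter, fetching them individually keeps the clone small. The following is a minimal sketch under stated assumptions: the tag name v1.0, the local repositories, and the demo identity are hypothetical, and the explicit refspec form is just one way to request a single tag.

```shell
#!/bin/sh
# Sketch: a shallow clone misses a tag on an older commit; fetch just that tag.
set -e

# Throwaway identity so commits work even where Git is not configured.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"

# Hypothetical source repository: 6 commits, with v1.0 on the second one.
git init -q source
cd source
git symbolic-ref HEAD refs/heads/main
for i in 1 2 3 4 5 6; do
    echo "revision $i" > file.txt
    git add file.txt
    git commit -q -m "commit $i"
    if [ "$i" -eq 2 ]; then git tag v1.0; fi
done
cd ..

# Depth 2 only reaches commits 5 and 6, so the v1.0 tag is not included.
git clone -q --depth 2 "file://$work/source" shallow
cd shallow
tags_before=$(git tag -l | wc -l | tr -d ' ')

# Fetch only v1.0, limited to the single commit it points at.
git fetch -q --depth 1 origin "refs/tags/v1.0:refs/tags/v1.0"
tags_after=$(git tag -l)

echo "tags before: $tags_before, tags after: $tags_after"
```

Against a real remote you would replace the file:// URL with the repository URL and v1.0 with the tag your build needs.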
Pruning History from an Existing Repository
If you already have a local copy of a repository that you want to condense, Git provides history rewriting tools to help prune down old commits.
For example, let's say you have an internal application with some credentials and database connection details that you don't want to transfer publicly. Here is an example commit history:
$ git log --oneline
c5b3819 Add production database credentials
1a0c8fb Configure CI/CD system
955303a Refactor User model and controllers
f3107a3 Implement user registration
We can prune the history to remove commits exposing internal details:
$ git filter-branch --index-filter \
'git rm --cached --ignore-unmatch credentials.txt' \
--prune-empty --tag-name-filter cat -- --all
$ git push --force --tags origin main
After rewriting, the history now looks like this:
$ git log --oneline
955303a Refactor User model and controllers
f3107a3 Implement user registration
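As you can see, the commits adding credentials and CI/CD details have been removed, condensing the history to just the essential changes. Note that --prune-empty only drops a commit when stripping credentials.txt leaves it with no changes at all, so this result assumes both commits touched nothing but that file.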
Before pruning history, make sure to back up your original repo somewhere safe. While removing big chunks of history can reduce checkout size substantially, it changes the hash of every rewritten commit and of every commit that descends from one, so any clones or forks made off the original repo will now diverge.
Let's visualize how that divergence occurs:
c1 - c2 - c3 - c4      main (old origin)
c1' - c3' - c4'        main (after filter-branch)
So use this technique carefully on public repositories intended for wider distribution.
However, for temporary copies used internally or private repositories under active development, pruning can be an easy way to clean up intermediate checkpoints and remove temporary files that were committed accidentally.
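The backup advice and the divergence described in this section can be tried out safely on a throwaway repository. The sketch below is illustrative, not a recipe for a production repo: the repository names (upstream, upstream-backup.git, downstream), the three commits, and the demo identity are assumptions. It takes a mirror backup, rewrites the history with filter-branch, and then counts how far an existing clone has diverged.

```shell
#!/bin/sh
# Sketch: back up a repo, rewrite its history, then measure how a clone diverges.
set -e

# Throwaway identity so commits work even where Git is not configured.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"
# Skip the warning and delay that newer Git versions add to filter-branch.
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
cd "$work"

# Hypothetical upstream repository: c2 adds a file we later want gone.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/main
echo "v1" > app.txt
git add app.txt
git commit -q -m "c1: add app"
echo "secret" > credentials.txt
git add credentials.txt
git commit -q -m "c2: add credentials"
echo "v2" > app.txt
git add app.txt
git commit -q -m "c3: update app"
c1_before=$(git rev-parse HEAD~2)
cd ..

# Full backup of all refs, plus an ordinary clone, both taken before the rewrite.
git clone -q --mirror upstream upstream-backup.git
git clone -q upstream downstream

# Rewrite upstream: strip credentials.txt everywhere, drop commits left empty.
cd upstream
git filter-branch --index-filter \
    'git rm -q --cached --ignore-unmatch credentials.txt' \
    --prune-empty -- --all >/dev/null 2>&1
c1_after=$(git rev-parse HEAD~1)
upstream_count=$(git rev-list --count HEAD)
cd ..

backup_count=$(git -C upstream-backup.git rev-list --count main)

# The downstream clone still holds the old commits, so it now diverges.
cd downstream
git fetch -q origin
set -- $(git rev-list --left-right --count main...origin/main)
local_only=$1
remote_only=$2

echo "upstream commits: $upstream_count, backup commits: $backup_count"
echo "downstream-only commits: $local_only, upstream-only commits: $remote_only"
```

In this run c1 never contained credentials.txt, so its hash is the same before and after; only c3 is recreated under a new hash, and the downstream clone is left holding commits that upstream no longer has.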
Comparing Repository Size Savings
To demonstrate just how much space you can save using filter-branch to condense history, here is a real-world example:
| Repository | Commits | History Size | Savings |
|---|---|---|---|
| Full Repo | 850 | 225 MB | – |
| Pruned | 12 | 450 KB | 99.8% |
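As shown, pruning over 800 commits reduced the .git folder size by over 99%! On larger repositories, or those that have accumulated many large binary files over time, the savings can be even more dramatic. Keep in mind that the space is only reclaimed once the old objects are no longer referenced (for example by reflogs or the refs/original backups that filter-branch leaves behind) and have been garbage collected.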
So if bloated repository size is slowing things down, pruning commits with temporary files, verbose logging statements or outdated binaries can make a huge difference.
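To see where the savings actually come from, you can measure the .git folder at each stage. The sketch below is a hedged illustration on a hypothetical repository: the file names (big.bin, app.txt), the 1 MB random file, and the demo identity are assumptions. The cleanup commands follow the shrinking checklist in the git filter-branch documentation.

```shell
#!/bin/sh
# Sketch: measure .git before a rewrite, after it, and after garbage collection.
set -e

# Throwaway identity so commits work even where Git is not configured.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"
# Skip the warning and delay that newer Git versions add to filter-branch.
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)
cd "$work"

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main

# Commit a 1 MB file of random (incompressible) data next to a small text file.
head -c 1048576 /dev/urandom > big.bin
echo "v1" > app.txt
git add big.bin app.txt
git commit -q -m "add app and large binary"
echo "v2" > app.txt
git add app.txt
git commit -q -m "update app"
size_before=$(du -sk .git | cut -f1)

# Strip big.bin from every commit.
git filter-branch --index-filter \
    'git rm -q --cached --ignore-unmatch big.bin' -- --all >/dev/null 2>&1
size_rewritten=$(du -sk .git | cut -f1)

# Drop filter-branch's backup refs, expire reflogs, then garbage collect.
git for-each-ref --format='%(refname)' refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc -q --prune=now
size_after=$(du -sk .git | cut -f1)

echo "before: ${size_before} KB, after rewrite: ${size_rewritten} KB, after gc: ${size_after} KB"
```

The middle measurement is the instructive one: rewriting alone does not shrink the repository, because the old blob is still reachable from the backup ref and the reflog until those are cleared.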
Mirroring a Git Repository
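An alternative approach to create a history-less copy is the following mirroring process:
- Initialize a brand new Git repository
- Add the original repo as a remote named old-repo
- Fetch the desired branch from the original repo
- Commit the fetched snapshot as a new root commit
- Add the new repository as the default origin
- Force push the new branch to the new origin
Here is a specific example:
# Create new empty repo
git init new-repo
cd new-repo
# Add 'old-repo' remote
git remote add old-repo https://github.com/user/old-repo
# Fetch main branch from old remote
git fetch old-repo main
# Start an orphan branch from the fetched snapshot (files only, no parent commits)
git checkout --orphan main old-repo/main
# Record the snapshot as a single root commit
git commit -m "Initial commit"
# Add new origin remote
git remote add origin https://github.com/user/new-repo
# Force push local main to new origin
git push -f origin main
This pulls down the current state of the original repository and publishes it as one new commit, without any of the historical commits or branching history leading up to that point. The orphan branch is what drops the history: git fetch still downloads every commit on main, and a plain git checkout -b main old-repo/main followed by a push would publish all of them.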
One key item here is preserving tags from the original repository:
git fetch --tags old-repo
git push --tags origin
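That will mirror over any important version tags you need for builds or deployments. Keep in mind that each tag points at a commit in the original history, so pushing tags also pushes the commits those tags can reach.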
Overall, repository mirroring gives you full control to transfer content from one repo to another while selectively choosing what history to keep or discard.
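Before pushing, it is worth confirming that the new branch really carries the same files but none of the old commits. The following sketch does that locally under stated assumptions: both repositories live in a temporary directory, and the three-commit history, the commit message, and the demo identity are hypothetical. It uses an orphan branch, which is one common way to turn a fetched snapshot into a single root commit.

```shell
#!/bin/sh
# Sketch: copy the latest snapshot of a branch into a new repo as one root commit.
set -e

# Throwaway identity so commits work even where Git is not configured.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"

# Hypothetical original repository with three commits on main.
git init -q old-repo
cd old-repo
git symbolic-ref HEAD refs/heads/main
for i in 1 2 3; do
    echo "revision $i" > app.txt
    git add app.txt
    git commit -q -m "commit $i"
done
cd ..

# New repository that fetches the old main branch.
git init -q new-repo
cd new-repo
git remote add old-repo "$work/old-repo"
git fetch -q old-repo main

# Orphan branch: index and working tree come from old-repo/main, parents do not.
git checkout -q --orphan main old-repo/main
git commit -q -m "Initial commit"

commit_count=$(git rev-list --count HEAD)
new_tree=$(git rev-parse 'HEAD^{tree}')
old_tree=$(git rev-parse 'old-repo/main^{tree}')

echo "commits on new main: $commit_count"
echo "same files as old main: $([ "$new_tree" = "$old_tree" ] && echo yes || echo no)"
```

Matching tree hashes mean the file contents are identical. The old commits still exist locally under old-repo/main, but they are not reachable from the new main, so a push of main sends only the single root commit.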
Storage Requirements for Git History
One question that often arises when weighing whether to prune Git history is "How much space will it actually save?". Some key statistics to be aware of:
As a rough working estimate, assume about 41 bytes of metadata per commit, covering fields like committer, date, message, etc.
Additionally, assume 9 bytes to store the reference to the parent commit (root commits have no parent). Real commit objects are usually larger than this, since each one stores full tree and parent hashes, author and committer lines, and the message, so treat the figures below as a lower bound.
So for example, let's say you have a repository that has been active for 5 years with 3 commits per day on average. That's roughly 5,500 commits, or about 5,000 in round numbers.
At 41 + 9 = 50 bytes per commit, 50 * 5,000 commits = 250,000 bytes, or ~244 KB, required for just the commit history. The file content those commits reference comes on top of that, of course.
Now 244 KB doesn't seem too excessive, but imagine a much longer running project:
- 100,000 commits @ 50 bytes/commit = ~5 MB
- 1,000,000 commits @ 50 bytes/commit = ~50 MB
Also consider that organizations like GitHub host millions of repositories, so storage costs add up quickly.
But the key point remains that, outside extremely high-throughput repositories, Git history storage is pretty efficient by default. So focus instead on whether that commit history provides enough value to offset the cost.
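The arithmetic above is easy to reproduce, and you can compare the assumed per-commit figure with a real commit object. The sketch below is illustrative: the 50 bytes per commit is the estimate used in this section rather than a measured value, and the one-commit repository and demo identity are hypothetical. git cat-file -s reports the uncompressed size of an object.

```shell
#!/bin/sh
# Sketch: reproduce the history-size estimates and measure one real commit object.
set -e

# Throwaway identity so commits work even where Git is not configured.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# Assumed per-commit cost from the estimate above (41 + 9 bytes).
bytes_per_commit=50

for commits in 5000 100000 1000000; do
    echo "$commits commits: about $((commits * bytes_per_commit / 1024)) KiB"
done
estimate_kib=$((5000 * bytes_per_commit / 1024))

# Measure an actual commit object in a throwaway repository.
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
echo "hello" > README
git add README
git commit -q -m "Initial commit"
commit_bytes=$(git cat-file -s HEAD)

echo "uncompressed size of one real commit object: $commit_bytes bytes"
```

Running the same git cat-file -s command against commits in your own repository, together with git count-objects -vH for the overall object store, gives a more reliable basis for deciding whether pruning is worth it.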
For private personal projects that are just throwaway intermediate snapshots, removing history may make perfect sense. But for widely-used software where people reference older changes, pruning has to be done cautiously to not lose valuable context.
Limitations of History Removal Techniques
While removing Git commit history provides benefits like size reduction and hiding sensitive changes, there are downsides to be aware of:
- As detailed above, pruning history rewrites SHAs so existing repo clones will diverge
- Mirroring a repo as described above only transfers the latest content of one branch: other branches, and any tags you do not push explicitly, are left behind
- History allows reverting bad changes, diagnosing issues, understanding why code evolved, etc. Removing it loses that context
- Issue comments and PR discussions lose their context if they refer to commits that no longer exist
- Public repos seen as the "single source of truth" should think twice before rewriting public history
- Devs joining a project rely on history to get up to speed on why the codebase looks like it does today
The risks above emphasize the balance between removing temporary throwaway commits vs destroying valuable context and conversations around impactful changes. Assess each situation carefully based on factors like:
- Is this a large public open source project or a private prototype?
- How widely used and referenced is the existing repository globally?
- Have users/customers built workflows and integrations relying on the current commit history?
- What is the downstream impact on other repositories if history is lost or rewritten?
Wrapping Up
As we've explored, Git offers powerful options to copy repositories without bringing along full commit history, including:
- Cloning with limited depth
- Pruning history from local repos
- Mirroring an existing repo to new origin
The simplest is to clone with --depth=N to avoid bundling excess history. But for more control over exactly what history to preserve or remove, filtering or mirroring may be better choices.
When evaluating the various techniques, consider factors like the existing repository's scope and visibility, how valuable past changes may be for reference, whether old commits include sensitive data, and the expected downstream impact.
With a strong understanding of Git's history removal capabilities, you can copy repositories while retaining only the historical context your particular situation calls for.