Have you ever wondered how Git handles files and folders in the magical .git folder?
Well, I did, and so I started to investigate.
In this article I present the results and explain how Git objects are interconnected.
Git is a source code management (SCM) system created in 2005 by Linus Torvalds, the creator of the Linux kernel. His intention was to build a distributed SCM system that supports common development workflows more easily than other SCMs such as SVN. Among other things, commits can be made locally without a remote repository or server. SVN identifies revisions by incrementing numbers; Git cannot do that, because revision numbers would collide when local and remote changes are merged. Instead, Git identifies every tracked resource by a SHA-1 hash. The reason for using SHA-1 is simple: it is a fast and reliable way to calculate a hash for a given input. It is not very secure (collision attacks are known, and it should no longer be used for purposes such as hashing passwords), but it is sufficient for tracking file contents the way Git does.
How does Git handle contents?
Git relies on the filesystem of the underlying operating system. A filesystem basically has two concepts: files and folders. Although Git tracks files rather than folders, its source code has a corresponding abstraction for each: Tree and Blob. A Blob holds the content of a file, whereas a Tree is the counterpart of a folder and lists the Blobs it contains.
Let’s assume that we have a file called
test.txt which contains a simple but well-known string:
Hello, World!
As mentioned above, Git tracks a file by its content using a SHA-1 hash.
When hashing the string above, one gets
60fde9c2310b0d4cad4dab8d126b04387efba289 as the value.
Let’s verify whether Git calculates the same hash value by creating a new Git repository, adding the file and inspecting the .git/objects/ folder.
    $ git init .
    $ echo -n Hello, World! > test.txt
    $ git add .
The .git/objects/ folder now contains one new folder holding one file:
    .git
    `-- objects
        +-- b4
        |   `-- 5ef6fec89518d314f546fd6c3025367b721684
        +-- info
        `-- pack
The file and folder names are obviously hashed values of something.
In order to investigate the contents of a file, one can use the command
git cat-file -p b45e which prints out:
    $ git cat-file -p b45e
    Hello, World!
Wait, what?! Why does the file have a hash value of
b45ef6fec89518d314f546fd6c3025367b721684?
We calculated a different hash value, namely
60fde9c2310b0d4cad4dab8d126b04387efba289, didn’t we?
Well, yes, we did, but Git hashes a little more than just the plain content of a file:
It also takes into account the file size and the type of Git object being tracked.
In this case the file has a size of 13 bytes and is of type Blob.
This information is put in front of the file content, and the result is then hashed:
blob 13\0Hello, World!
The format for Blobs is specified as follows:
blob<blank><file size in bytes><null byte><file content>.
The null byte separates the content of the file from the Git-specific header, which contains the metadata.
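The header layout can also be checked against the file on disk. The following is a minimal sketch, not Git's own implementation: it assumes the object is stored as a loose file compressed with zlib (which is how Git stores freshly added objects; packed objects are laid out differently), and the function name `read_loose_object` is made up for this illustration.

```python
import zlib
from pathlib import Path


def read_loose_object(git_dir, object_id):
    """Return (type, size, content) of a loose object inside git_dir.

    object_id must be the full 40-character hex ID.
    """
    # The first two hex digits name the folder, the remaining 38 the file.
    path = Path(git_dir) / "objects" / object_id[:2] / object_id[2:]
    raw = zlib.decompress(path.read_bytes())
    # Everything before the first null byte is the header, e.g. b"blob 13".
    header, _, content = raw.partition(b"\0")
    object_type, size = header.decode("ascii").split(" ")
    return object_type, int(size), content
```

Called as `read_loose_object(".git", "b45ef6fec89518d314f546fd6c3025367b721684")` inside the repository from this article, it returns the object type, the size stated in the header, and the raw file content.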
If we now hash the resulting string above, SHA-1 returns
b45ef6fec89518d314f546fd6c3025367b721684.
Looks familiar, right? Yes, it is the very same hash as generated by Git!
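The calculation can be reproduced outside of Git. The sketch below uses Python's standard `hashlib`; the helper name `git_object_id` is an invention for this example. It prints the plain SHA-1 of the content and the SHA-1 of the header plus content, so the second value is the one to compare with the file name under .git/objects/. Keep in mind that a digest depends on the exact bytes: a trailing newline, for example, changes it.

```python
import hashlib


def git_object_id(object_type, content):
    """Hash content as described above: header, null byte, content."""
    header = f"{object_type} {len(content)}".encode("ascii")
    return hashlib.sha1(header + b"\0" + content).hexdigest()


content = b"Hello, World!"
print(hashlib.sha1(content).hexdigest())  # plain SHA-1 of the bytes alone
print(git_object_id("blob", content))     # SHA-1 of "blob 13\0" + content
```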
We are done, aren’t we? But wait… What about folders? So far, Git has tracked the file and its content, so let’s have a look at folders now.
Since we are very happy to have a mature
test.txt file now, let’s first commit it and then investigate further.
    $ git commit -m "Initial commit."
Let’s check the
.git/objects/ folder once more.
Among the new files Git has created, there is one with a hash value of
341cf04522a24fcf326c5e46ff7ce4f66ff310dd.
In order to look inside that file, we can again use the
git cat-file command as shown below.
    .git
    `-- objects
        +-- b4
        |   |-- 5ef6fec89518d314f546fd6c3025367b721684
        |   `-- 587014507e76d7dcf5b5299949fae0b12b06ab
        +-- 34
        |   `-- 1cf04522a24fcf326c5e46ff7ce4f66ff310dd
        +-- info
        `-- pack
    $ git cat-file -p 341c
    100644 blob b45ef6fec89518d314f546fd6c3025367b721684	test.txt
This is the content of a Git tree object, which is the way Git represents folders and their contents. Some of you might know that Git won’t track empty folders. This is because Git only records files when staging changes: a folder shows up in a tree object only through the files it contains, so a folder without any files leaves nothing to record.
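The tree's own hash can be derived in the same way as the blob's. The listing printed by git cat-file -p is a human-readable rendering; the sketch below assumes Git's binary on-disk layout, which the article does not spell out: each entry is the mode, a blank, the name, a null byte, and then the 20 raw bytes of the referenced hash. The function name `git_tree_id` is hypothetical, and the sorting is simplified to plain file names.

```python
import hashlib


def git_tree_id(entries):
    """Compute a tree ID from (mode, name, object_id_hex) tuples."""
    body = b""
    # Simplification: sorting by name is enough for plain files; Git's real
    # ordering treats subfolder names slightly differently.
    for mode, name, object_id in sorted(entries, key=lambda e: e[1]):
        # One entry: "<mode> <name>", a null byte, then the 20 raw hash bytes.
        body += f"{mode} {name}".encode("utf-8") + b"\0" + bytes.fromhex(object_id)
    header = f"tree {len(body)}".encode("ascii")
    return hashlib.sha1(header + b"\0" + body).hexdigest()


print(git_tree_id([
    ("100644", "test.txt", "b45ef6fec89518d314f546fd6c3025367b721684"),
]))
```

The printed value can be compared with the tree ID that git cat-file was called with above.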
There is another new file which hasn’t been investigated yet, with a hash value of
b4587014507e76d7dcf5b5299949fae0b12b06ab.
Once again we can use the magic git command to look inside the file.
    $ git cat-file -p b458
    tree 341cf04522a24fcf326c5e46ff7ce4f66ff310dd
    author Stephan Köninger <email@example.com> 1455660483 +0100
    committer Stephan Köninger <firstname.lastname@example.org> 1455660483 +0100

    Initial commit.
This object obviously contains the message we entered when committing the file.
Hence we can assume that this is the object which contains all information necessary for a commit.
The first line of the commit object points to the tree object we have seen in the previous section.
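A commit is hashed like the other objects, with "commit" as the type in the header. The following sketch serializes the fields in the order git cat-file -p prints them; it assumes a plain commit without extra headers such as a signature, and both the function name `git_commit_id` and the identity "Jane Doe" are placeholders. Because the name, e-mail address and timestamp are part of the hashed text, the result only matches a real commit ID if every field is byte-for-byte identical.

```python
import hashlib


def git_commit_id(tree_id, author, committer, message, parents=()):
    """Serialize a simple commit and return its SHA-1 object ID."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parents]  # absent for the first commit
    lines += [f"author {author}", f"committer {committer}"]
    # A blank line separates the header lines from the commit message.
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")
    header = f"commit {len(body)}".encode("ascii")
    return hashlib.sha1(header + b"\0" + body).hexdigest()


who = "Jane Doe <jane@example.com> 1455660483 +0100"  # hypothetical identity
print(git_commit_id("341cf04522a24fcf326c5e46ff7ce4f66ff310dd", who, who, "Initial commit."))
```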
If one draws the different objects as a graph, the result looks as follows: the commit points to the tree, and the tree points to the blob.
A more complex graph can be seen in the next figure, which depicts two commits. The second commit points back to the previous commit using a parent-child association. As the picture also shows, the second commit introduces a new subfolder containing one file. This implies that a tree object can contain not just blobs but other tree objects as well.
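The two-commit graph can be modelled with a small in-memory object store to show how the links are followed. Everything in this sketch is hypothetical: the IDs are shortened placeholders rather than real hashes, and the names `sub` and `other.txt` merely stand in for the new subfolder and its file.

```python
# Hypothetical object store; keys are placeholder IDs, not real hashes.
objects = {
    "c1": {"type": "commit", "tree": "t1", "parents": []},
    "c2": {"type": "commit", "tree": "t2", "parents": ["c1"]},
    "t1": {"type": "tree", "entries": {"test.txt": "b1"}},
    "t2": {"type": "tree", "entries": {"test.txt": "b1", "sub": "t3"}},
    "t3": {"type": "tree", "entries": {"other.txt": "b2"}},
    "b1": {"type": "blob", "content": b"Hello, World!"},
    "b2": {"type": "blob", "content": b"more content"},
}


def list_files(tree_id, prefix=""):
    """Return all file paths reachable from a tree, descending into subtrees."""
    paths = []
    for name, object_id in sorted(objects[tree_id]["entries"].items()):
        if objects[object_id]["type"] == "tree":
            paths += list_files(object_id, prefix + name + "/")
        else:
            paths.append(prefix + name)
    return paths


def history(commit_id):
    """Follow the first parent of each commit back to the initial commit."""
    chain = []
    while commit_id is not None:
        chain.append(commit_id)
        parents = objects[commit_id]["parents"]
        commit_id = parents[0] if parents else None
    return chain


print(history("c2"))
print(list_files(objects["c2"]["tree"]))
```

Both trees refer to the same blob `b1` for test.txt: since an object ID is derived from the content, an unchanged file does not need a second object.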