A brief overview of git internals
October 19, 2021
I got curious about the inner workings of git. I basically want to know how git keeps track of changes to lines.
When I type git diff file-name, how does it know which lines changed?
For this, I must investigate the contents of the .git directory.
The .git directory
It contains the following folders:
- hooks
- info
- logs
- objects
- refs
and it also contains the following files:
- COMMIT_EDITMSG
- HEAD
- config
- description
- index
The Objects Folder
git saves information about paths, filenames, and contents in objects.
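git names each object by its SHA-1 hash. The hash is not taken over the bare file (so running sha1sum on the file gives a different value) but over a short header, made of the object type and the content size followed by a null byte, plus the content itself. The first two characters of the hash become the name of a directory inside the objects folder. The remaining 38 characters become the filename inside that directory.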
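To make the naming scheme concrete, here is a minimal sketch of how an object ID and its path could be computed. The function names git_object_id and loose_object_path are my own, not part of git, and the sketch assumes the default SHA-1 object format.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 ID git would give `content` stored as `obj_type`."""
    # Git hashes "<type> <size>\0" followed by the raw content,
    # not the raw content on its own.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Map an object ID to its location under .git/objects/."""
    # First two hex characters: directory. Remaining 38: filename.
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


if __name__ == "__main__":
    oid = git_object_id("blob", b"hello world\n")
    print(oid)
    print(loose_object_path(oid))
```

The result should agree with what git hash-object prints for the same bytes, which is an easy way to check the sketch against a real repository.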
There are three main types of objects in the objects folder (a fourth type, the annotated tag, does not come up in this walkthrough):
- Blob
- Tree
- Commit
Blobs
These objects store the contents of files.
Tree
These objects store directory listings. Each entry holds a mode (regular file, executable, symlink, or directory), a name, and the hash of the blob or subtree it points to.
Commit
These objects store the commit's metadata: the hash of the top-level tree, the parent commit(s) if any, the author and committer with their timestamps, and the message.
Brief Analysis Walkthrough
Init
Create a directory.
In it, make a file (call it test.md for example), and save the text "Test" in it.
Initialize the git repository:
git init
Check the contents of .git/objects/:
ls -la .git/objects/
It contains just info and pack, and both are empty.
Stage
Stage test.md:
git add test.md
Check the contents of .git/objects/. It now contains a directory called 34, which contains a file named 5e6aef713208c8d50cdea23b85e6ad831f0449.
Check its contents by running this command:
git cat-file -p 345e6aef713208c8d50cdea23b85e6ad831f0449
You'll see "Test"
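The file inside directory 34 is not readable with cat because loose objects are zlib-compressed. Below is a rough sketch of what reading one back involves. It writes an object into a temporary directory first so that it runs without a real repository; write_loose_object and read_loose_object are names I made up for illustration.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir: str, obj_type: str, content: bytes) -> str:
    """Store content the way a loose object is laid out and return its ID."""
    store = f"{obj_type} {len(content)}".encode() + b"\x00" + content
    oid = hashlib.sha1(store).hexdigest()
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))  # header + content, compressed together
    return oid


def read_loose_object(objects_dir: str, oid: str):
    """Decompress a loose object and split it into (type, content)."""
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())
    header, _, content = store.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    assert int(size) == len(content)  # the header records the content length
    return obj_type, content


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as objects_dir:
        oid = write_loose_object(objects_dir, "blob", b"Test\n")
        print(read_loose_object(objects_dir, oid))
```

Pointing read_loose_object at a real .git/objects directory should work for loose objects, though not for objects that git has already moved into a pack file.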
Commit
Commit the file:
git commit -m "Initial Commit"
Check the contents of .git/objects/.
There are two new directories: 16 and ad.
Directories so far
git cat-file -p 16***
// contains
//tree adab88f6680668d54be5227f8d7636521ba924e3
//author Michael Diez <mdiez@m10digital.com> 1634651715 -0400
//committer Michael Diez <mdiez@m10digital.com> 1634651715 -0400
//
//Initial Commit
git cat-file -p ad***
// contains
//100644 blob 345e6aef713208c8d50cdea23b85e6ad831f0449 test.md
Now we can categorize each directory as follows:
- Blob - directory 34
- Tree - directory ad
- Commit - directory 16
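The output of git cat-file -p for a tree is a pretty-printed view. On disk, as far as I understand the format, each entry is stored as the mode, a space, the name, a null byte, and then the 20 raw bytes of the hash. Here is a small sketch of that layout. The helper names encode_tree, decode_tree, and tree_id are mine, and the sort is simplified: it only handles files, while real git has an extra rule for ordering subdirectories.

```python
import hashlib


def encode_tree(entries):
    """Serialize (mode, name, hex_id) tuples into a tree object body."""
    body = b""
    # Simplification: sort by raw name bytes (enough for plain files).
    for mode, name, oid in sorted(entries, key=lambda e: e[1].encode()):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(oid)
    return body


def decode_tree(body: bytes):
    """Parse a tree object body back into (mode, name, hex_id) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        nul = body.index(b"\x00", pos)
        mode, name = body[pos:nul].decode().split(" ", 1)
        oid = body[nul + 1:nul + 21].hex()  # 20 raw bytes -> 40 hex chars
        entries.append((mode, name, oid))
        pos = nul + 21
    return entries


def tree_id(body: bytes) -> str:
    """Hash a tree body with the same header scheme used for blobs."""
    return hashlib.sha1(f"tree {len(body)}".encode() + b"\x00" + body).hexdigest()


if __name__ == "__main__":
    blob = "345e6aef713208c8d50cdea23b85e6ad831f0449"
    body = encode_tree([("100644", "test.md", blob)])
    print(decode_tree(body))
    print(tree_id(body))
```

Comparing the printed ID with the name of the ad directory from the walkthrough is a reasonable way to check whether this sketch matches what git actually wrote.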
What happens if we copy a file?
The copy has the same contents, so its hash is the same, and I don't expect a new blob object. Its file name is different, so I expect a new tree object that lists both names. After committing the new file, I also expect a new commit object.
cp test.md test-copy.md
git add test-copy.md
ls -la .git/objects/
Sure enough. No new directories.
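The reason nothing new shows up is that a blob's name depends only on its contents, never on the file name. Here is a toy illustration of that idea, using a dictionary as a stand-in for the objects folder. ObjectStore is a hypothetical class written for this post, not anything git exposes.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Hash content the way git hashes a blob (header + content)."""
    header = f"blob {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy content-addressed store: the key is derived from the value."""

    def __init__(self):
        self.objects = {}

    def add(self, content: bytes) -> str:
        oid = blob_id(content)
        # Identical content maps to the same key, so it is stored once.
        self.objects.setdefault(oid, content)
        return oid


if __name__ == "__main__":
    store = ObjectStore()
    original = store.add(b"Test\n")  # contents of test.md
    copy = store.add(b"Test\n")      # contents of test-copy.md
    print(original == copy, len(store.objects))
```

The file names never enter the hash, which is why they have to be recorded somewhere else, namely in the tree.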
git commit -m "Copy test.md to test-copy.md"
OK, I see two new directories as expected: 97 and 1c.
git cat-file -p 1c***
// contains
// 100644 blob 345e6aef713208c8d50cdea23b85e6ad831f0449 test-copy.md
// 100644 blob 345e6aef713208c8d50cdea23b85e6ad831f0449 test.md
git cat-file -p 97***
//contains
//tree 1cf2880bf7e7d962fa8ac133804f671cc6c8ee45
//parent 161968326147e399896e0cebdc93d01fbc302852
//author Michael Diez <mdiez@m10digital.com> 1634652715 -0400
//committer Michael Diez <mdiez@m10digital.com> 1634652715 -0400
//
//Copy test.md to test-copy.md
Interesting Observations
In the tree file, both test-copy.md and test.md point to the same blob.
The new commit file has a parent line. It points to the previous commit (the object in directory 16), not to a tree.
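A commit object is plain text once decompressed: header lines, a blank line, then the message. Here is a minimal sketch of pulling the tree and parent hashes out of the second commit shown above. parse_commit is my own helper, and it ignores multi-line headers such as signatures that real commits can contain.

```python
def parse_commit(text: str) -> dict:
    """Split a commit object's text into its headers and message."""
    headers, _, message = text.partition("\n\n")
    commit = {"tree": None, "parents": [], "message": message.strip()}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "tree":
            commit["tree"] = value
        elif key == "parent":
            # A merge commit has several parent lines, so keep a list.
            commit["parents"].append(value)
        else:
            commit[key] = value  # author, committer
    return commit


# Text of the second commit, copied from the cat-file output above.
SECOND_COMMIT = """\
tree 1cf2880bf7e7d962fa8ac133804f671cc6c8ee45
parent 161968326147e399896e0cebdc93d01fbc302852
author Michael Diez <mdiez@m10digital.com> 1634652715 -0400
committer Michael Diez <mdiez@m10digital.com> 1634652715 -0400

Copy test.md to test-copy.md
"""

if __name__ == "__main__":
    commit = parse_commit(SECOND_COMMIT)
    print(commit["tree"])
    print(commit["parents"])
```

Following the parents list from commit to commit is how history gets walked. The first commit has no parent line, so the list is empty there.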
Conclusion for now
This analysis has given me more insight into how git works internally.
At this point I can kind of answer my question about diffs.
If I had to use the information contained in the objects directory to conduct a diff of a file I would do the following:
- Find the latest commit file and get the tree hash
- Use the tree hash to find the contents of the tree file
- Use the tree file to find the blob file of the file in question
- Use the contents of the blob file to diff with the current file
Pretty cool!
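The four steps above can be sketched in a few lines. This is only an illustration under heavy assumptions: OBJECTS is a made-up in-memory stand-in for the objects folder with short fake IDs, and Python's difflib stands in for git's own diff algorithm. As far as I know, a plain git diff also compares against the staging area (the index file) rather than the last commit, so this is closer in spirit to git diff HEAD.

```python
import difflib

# Hypothetical, already-parsed objects keyed by short made-up IDs.
OBJECTS = {
    "c1": {"type": "commit", "tree": "t1"},
    "t1": {"type": "tree", "entries": {"test.md": "b1"}},
    "b1": {"type": "blob", "content": "Test\n"},
}


def committed_content(objects: dict, commit_id: str, filename: str) -> str:
    """Walk commit -> tree -> blob and return the stored file contents."""
    commit = objects[commit_id]                      # step 1: commit gives the tree
    tree = objects[commit["tree"]]                   # step 2: load the tree
    blob = objects[tree["entries"][filename]]        # step 3: tree gives the blob
    return blob["content"]


def diff_against_commit(objects, commit_id, filename, current_text):
    """Step 4: diff the committed contents against the current text."""
    old = committed_content(objects, commit_id, filename).splitlines(keepends=True)
    new = current_text.splitlines(keepends=True)
    return list(difflib.unified_diff(
        old, new, fromfile=f"a/{filename}", tofile=f"b/{filename}"))


if __name__ == "__main__":
    changed = "Test\nA second line\n"
    print("".join(diff_against_commit(OBJECTS, "c1", "test.md", changed)))
```

The point of the sketch is that git stores whole snapshots, not line changes. The changed lines are computed on demand by comparing two versions of the content.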
The key to git is the use of hashing to map content of any length to a fixed-size name, then using those hashes to relate blobs, trees, and commits to each other.