A brief overview of git internals

October 19, 2021

I got curious about the inner workings of git. I basically want to know how git keeps track of changes to lines.

When I type git diff file-name, how does it know which lines changed?

For this, I must investigate the contents of the .git directory.

The .git directory

It contains the following folders:

hooks
info
logs
objects
refs

and it also contains the following files:

COMMIT_EDITMSG
HEAD
config
description
index

The Objects Folder

git saves information about paths, filenames, and contents in objects.

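git uses SHA-1 to create the hashes. It does not hash the raw file alone: it hashes a short header (the object type and the size in bytes, followed by a null byte) plus the contents, so running sha1sum on the file gives a different value. The first two characters of the hash become a directory inside the objects folder, and the remaining 38 characters become the filename inside that directory. The file itself is zlib-compressed.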
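To make the hashing rule concrete, here is a minimal Python sketch of how an object id and its path under .git/objects can be derived. The function names are my own, not part of git. Whether it reproduces the 345e6a... id seen later in the walkthrough depends on the exact bytes saved in test.md (for example, whether the editor added a trailing newline).

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Return the SHA-1 id git would assign to an object with this type and body."""
    # git hashes "<type> <size in bytes>\0" followed by the raw content.
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Split an id into the directory/filename pair used under .git/objects."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


if __name__ == "__main__":
    # Assumption: the file holds "Test" plus a trailing newline.
    oid = git_object_id("blob", b"Test\n")
    print(oid)
    print(loose_object_path(oid))
```

The same id should come out of `git hash-object test.md`, which is a handy way to check the sketch against a real repository.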

There are three main types of objects in the objects folder (annotated tags are a fourth type, which this walkthrough does not create):

Blob
Tree
Commit

Blob

These objects store the contents of files.

Tree

These objects store one entry per file or subdirectory: its mode (regular file, executable, symlink, or directory), its name, and the hash of the blob or tree that holds its contents.
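As a rough illustration of what a tree holds, the sketch below serializes and parses tree entries in Python. On disk each entry is the mode, a space, the name, a null byte, and then the 20 raw bytes of the hash. The helper names are hypothetical, and sorting by name is a simplification of git's real ordering rule (which treats directories slightly differently).

```python
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (mode, name, hex object id)


def encode_tree(entries: List[Entry]) -> bytes:
    """Serialize entries into the body of a tree object (without the header)."""
    body = b""
    # Simplification: sort by name bytes; fine for plain files like these.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
    return body


def parse_tree(body: bytes) -> List[Entry]:
    """Parse a tree body back into (mode, name, hex id) entries."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)      # end of the mode
        nul = body.index(b"\x00", space)   # end of the name
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        hex_id = body[nul + 1:nul + 21].hex()  # 20 raw bytes of SHA-1
        entries.append((mode, name, hex_id))
        pos = nul + 21
    return entries


if __name__ == "__main__":
    blob = "345e6aef713208c8d50cdea23b85e6ad831f0449"
    for entry in parse_tree(encode_tree([("100644", "test.md", blob)])):
        print(entry)
```

Note that the file name lives only in the tree entry. The blob it points to knows nothing about what it is called.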

Commit

These objects store the hash of the top-level tree, the hash of each parent commit, and the commit's metadata (author, committer, timestamps, message).

Brief Analysis Walkthrough

Init

Create a directory.

In it, make a file (call it test.md, for example), and save the text "Test" in it.

Initialize the git repository:

git init

Check the contents of .git/objects/:

ls -la .git/objects/

It contains only info and pack, and both are empty.

Stage

Stage test.md:

git add test.md

Check the contents of .git/objects/. It now contains a directory called 34, which contains a file named 5e6aef713208c8d50cdea23b85e6ad831f0449.

Check out its contents by running this command:

git cat-file -p 345e6aef713208c8d50cdea23b85e6ad831f0449

You'll see "Test"
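For the curious, here is roughly what git cat-file has to do to show a loose object: read the file, zlib-decompress it, and split off the header. This is a self-contained sketch that writes an object into a temporary directory and reads it back. The function names are mine, and it ignores packed objects entirely.

```python
import hashlib
import os
import tempfile
import zlib
from typing import Tuple


def write_loose_object(objects_dir: str, obj_type: str, body: bytes) -> str:
    """Store an object the way git stores loose objects and return its id."""
    store = f"{obj_type} {len(body)}".encode() + b"\x00" + body
    oid = hashlib.sha1(store).hexdigest()
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))
    return oid


def read_loose_object(objects_dir: str, oid: str) -> Tuple[str, bytes]:
    """Return (type, body) for a loose object, checking the size in the header."""
    with open(os.path.join(objects_dir, oid[:2], oid[2:]), "rb") as f:
        store = zlib.decompress(f.read())
    header, _, body = store.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, body


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as objects_dir:
        oid = write_loose_object(objects_dir, "blob", b"Test\n")
        print(read_loose_object(objects_dir, oid))
```

Pointing read_loose_object at a real .git/objects directory should work for objects that have not been packed yet.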

Commit

Commit the file:

git commit -m "Initial Commit"

Check the contents of .git/objects/.

There are two new directories: 16 and ad.

Directories so far

git cat-file -p 16***
// contains
//tree adab88f6680668d54be5227f8d7636521ba924e3
//author Michael Diez <mdiez@m10digital.com> 1634651715 -0400
//committer Michael Diez <mdiez@m10digital.com> 1634651715 -0400
//
//Initial Commit

git cat-file -p ad***
// contains
//100644 blob 345e6aef713208c8d50cdea23b85e6ad831f0449  test.md

Now we can categorize each directory as follows:

Blob - directory 34
Tree - directory ad
Commit - directory 16

What happens if we copy a file?

The copy will have the same contents, so its hash will be the same, and I don't expect a new Blob object. Its file name will be different, so I expect a new Tree object. After committing the new file, I also expect a new Commit object.

cp test.md test-copy.md
git add test-copy.md
ls -la .git/objects/

Sure enough. No new directories.
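The reason nothing new shows up can be sketched with a tiny in-memory object store keyed by hash. This is only a model of the idea, with made-up names, not how git is implemented.

```python
import hashlib
from typing import Dict


def blob_id(content: bytes) -> str:
    """Compute the id git would give a blob with this content."""
    header = f"blob {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def add_blob(store: Dict[str, bytes], content: bytes) -> str:
    """Add content to the store; identical content maps to the same key."""
    oid = blob_id(content)
    store.setdefault(oid, content)  # no-op if the blob already exists
    return oid


if __name__ == "__main__":
    store: Dict[str, bytes] = {}
    original = add_blob(store, b"Test\n")  # stands in for test.md
    copy = add_blob(store, b"Test\n")      # stands in for test-copy.md
    print(original == copy, len(store))
```

Because the key is derived from the content alone, the file name never enters into it, and the second add finds the blob already present.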

git commit -m "Copy test.md to test-copy.md"

OK, I see two new directories as expected: 97 and 1c.

git cat-file -p 1c***
// contains
// 100644 blob 345e6aef713208c8d50cdea23b85e6ad831f0449 test-copy.md
// 100644 blob 345e6aef713208c8d50cdea23b85e6ad831f0449 test.md

git cat-file -p 97***
//contains
//tree 1cf2880bf7e7d962fa8ac133804f671cc6c8ee45
//parent 161968326147e399896e0cebdc93d01fbc302852
//author Michael Diez <mdiez@m10digital.com> 1634652715 -0400
//committer Michael Diez <mdiez@m10digital.com> 1634652715 -0400
//
//Copy test.md to test-copy.md

Interesting Observations

In the tree file, both test-copy.md and test.md point to the same blob.

The new commit file shows a parent. That hash (16...) is not a tree: it is the first commit, which is how git links commits into a history.
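Following those parent links is all it takes to walk the history. The sketch below parses the two commit texts shown above and follows the first parent back to the root. The dictionary key "second-commit" is a placeholder of mine, since only the first two characters of that commit's id appear in this post.

```python
from typing import Dict, Iterator, Tuple

FIRST_ID = "161968326147e399896e0cebdc93d01fbc302852"

# Commit texts copied from the cat-file output in this walkthrough.
COMMITS: Dict[str, str] = {
    FIRST_ID: (
        "tree adab88f6680668d54be5227f8d7636521ba924e3\n"
        "author Michael Diez <mdiez@m10digital.com> 1634651715 -0400\n"
        "committer Michael Diez <mdiez@m10digital.com> 1634651715 -0400\n"
        "\n"
        "Initial Commit\n"
    ),
    "second-commit": (  # placeholder key, not a real object id
        "tree 1cf2880bf7e7d962fa8ac133804f671cc6c8ee45\n"
        f"parent {FIRST_ID}\n"
        "author Michael Diez <mdiez@m10digital.com> 1634652715 -0400\n"
        "committer Michael Diez <mdiez@m10digital.com> 1634652715 -0400\n"
        "\n"
        "Copy test.md to test-copy.md\n"
    ),
}


def parse_commit(text: str) -> dict:
    """Split a commit into its header fields, parent list, and message."""
    headers, _, message = text.partition("\n\n")
    commit = {"parents": [], "message": message.strip()}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # merge commits have several
        else:
            commit[key] = value
    return commit


def first_parent_history(commits: Dict[str, str], start: str) -> Iterator[Tuple[str, str]]:
    """Yield (id, message) pairs from start back to the root commit."""
    current = start
    while current is not None:
        commit = parse_commit(commits[current])
        yield current, commit["message"]
        current = commit["parents"][0] if commit["parents"] else None


if __name__ == "__main__":
    for commit_id, message in first_parent_history(COMMITS, "second-commit"):
        print(commit_id, message)
```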

Conclusion for now

This analysis has given me more insight into how git works internally.

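At this point I can kind of answer my question about diffs. Nothing in these objects records which lines changed; each commit points to a full snapshot, so a diff has to be computed by comparing two versions of a file.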

If I had to use the information contained in the objects directory to conduct a diff of a file I would do the following:

Find the latest commit file and get the tree hash
Use the tree hash to find the contents of the tree file
Use the tree file to find the blob file of the file in question
Use the contents of the blob file to diff with the current file
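The four steps above can be sketched in Python against a toy in-memory object store. Everything here is an assumption made for illustration: the ids are made-up labels, the store is a plain dictionary, and difflib stands in for git's own diff machinery, which works differently.

```python
import difflib
from typing import Dict

# Toy object store. The keys are made-up labels, not real hashes.
OBJECTS: Dict[str, object] = {
    "commit-1": {"tree": "tree-1", "message": "Initial Commit"},
    "tree-1": {"test.md": "blob-1"},  # file name -> blob id
    "blob-1": "Test\n",
}


def committed_text(objects: Dict[str, object], commit_id: str, path: str) -> str:
    """Follow commit -> tree -> blob and return the stored file contents."""
    commit = objects[commit_id]      # step 1: latest commit gives the tree id
    tree = objects[commit["tree"]]   # step 2: load the tree
    blob_id = tree[path]             # step 3: tree entry gives the blob id
    return objects[blob_id]          # blob holds the committed contents


def diff_against_commit(objects: Dict[str, object], commit_id: str,
                        path: str, working_text: str) -> str:
    """Step 4: compare the committed contents with the current contents."""
    old = committed_text(objects, commit_id, path).splitlines(keepends=True)
    new = working_text.splitlines(keepends=True)
    return "".join(difflib.unified_diff(old, new,
                                        fromfile=f"a/{path}",
                                        tofile=f"b/{path}"))


if __name__ == "__main__":
    print(diff_against_commit(OBJECTS, "commit-1", "test.md", "Test changed\n"))
```

The line-level changes only come into existence in the last step, when the two full versions are compared.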

Pretty cool!

The key to git is the use of hashing to map content of any length to a fixed-size name, then using those hashes to relate blobs, trees, and commits to each other.

Mind blown