Git - Revision Control Perfected

In 2005, after just two weeks, Linus Torvalds completed the first version of Git, an open-source version control system. Unlike typical centralized systems, Git is based on a distributed model. It is extremely flexible and guarantees data integrity while being powerful, fast and efficient. With widespread and growing rates of adoption, and the increasing popularity of services like GitHub, many consider Git to be the best version control tool ever created.

Surprisingly, Linus had little interest in writing a version control tool before this endeavor. He created Git out of necessity and frustration. The Linux Kernel Project needed an open-source tool to manage its massively distributed development effectively, and no existing tools were up to the task.

Many aspects of Git's design are radical departures from the approach of tools like CVS and Subversion, and they even differ significantly from more modern tools like Mercurial. This is one of the reasons Git is intimidating to many prospective users. But, if you throw away your assumptions of how version control should work, you'll find that Git is actually simpler than most systems, but capable of more.

In this article, I cover some of the fundamentals of how Git works and stores data before moving on to discuss basic usage and workflow. I found that knowing what is going on behind the scenes makes it much easier to understand Git's many features and capabilities. Certain parts of Git that I previously had found complicated suddenly became easy and straightforward after I spent a little time learning how it worked.

I find Git's design to be fascinating in and of itself. I peered behind the curtain, expecting to find a massively complex machine, and instead saw only a little hamster running in a wheel. Then I realized a complicated design not only wasn't needed, but also wouldn't add any value.

Git Object Repository

Git, at its core, is a simple indexed name/value database. It stores pieces of data (values) in "objects" with unique names. But, it does this somewhat differently from most systems. Git operates on the principle of "content-addressed storage", which means the names are derived from the values. An object's name is simply the SHA1 checksum of its content: a 40-character string like this:


1da177e4c3f41524e886b7f1b8a0c1fc7321cac2

SHA1 is a cryptographically strong hash function, which for all practical purposes guarantees a different checksum for different data (the actual risk of two different pieces of data sharing the same SHA1 checksum is vanishingly small). The same chunk of data always will have the same SHA1 checksum, which always will identify only that chunk of data. Because object names are SHA1 checksums, they identify the object's content while being truly globally unique—not just to one repository, but to all repositories everywhere, forever.
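To make this concrete, here is a minimal sketch in Python (not Git's own implementation) of how a blob's name is derived: Git hashes a short header naming the object type and content length, a NUL byte, and then the raw content:

import hashlib

def blob_name(content):
    # Git names a blob by hashing "blob <size>\0" followed by the raw bytes.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# The same content always yields the same name; you can cross-check
# this value with: echo 'hello' | git hash-object --stdin
print(blob_name(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a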

To put this into perspective, the example SHA1 listed above happens to be the ID of the first commit of the Linux kernel into a Git repository by Linus Torvalds in 2005 (2.6.12-rc2). This is a lot more useful than some arbitrary revision number with no real meaning. Nothing except that commit ever will have the same ID, and you can use those 40 characters to verify the data in every file throughout that version of Linux. Pretty cool, huh?

Git stores all the data for a repository in four types of objects: blobs, trees, commits and tags. They are all just objects with an SHA1 name and some content. The only difference between them is the type of information they contain.

Blobs and Trees

A blob stores the raw data content of a file. This is the simplest of the four object types.

A tree stores the contents of a directory. This is a flat list of file/directory names, each with a corresponding SHA1 representing its content. These SHA1s are the names of other objects in the repository. This referencing technique is used throughout Git to link all kinds of information together. For file entries, the referenced object is a blob. For directory entries, the referenced object is a tree that can contain more directory entries, in turn referencing more trees to define a complete and potentially unlimited hierarchy.

It's important to recognize that blobs and trees are not themselves files and directories; they are just the contents of files and directories. They don't know about anything outside their own content, including the existence of any references in other objects that point to them. References are one-way only.


Figure 1. An example directory structure and how it might be stored in Git as tree and blob objects (I truncated the SHA1 names to six characters for readability).

In the example shown in Figure 1, I'm assuming that the files MyApp.pm and MyApp1.pm have the same contents, and so by definition, they must reference the same blob object. This behavior is implicit in Git because of its content-addressable design and works equally well for directories with the same content.
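The following toy content-addressed store, a sketch in Python using a simplified serialization (not Git's real tree format) and a layout loosely mirroring Figure 1, illustrates both points: identical content collapses into a single object, and trees reference their children by name while blobs know nothing about their referrers:

import hashlib

store = {}  # name -> content: our toy object database

def put(data):
    # Content-addressed: the name is derived from the value, so
    # storing identical data a second time is a harmless no-op.
    name = hashlib.sha1(data).hexdigest()
    store[name] = data
    return name

def put_tree(entries):
    # Serialize a directory as sorted "name object-id" lines.
    listing = "".join("%s %s\n" % (n, oid) for n, oid in sorted(entries.items()))
    return put(listing.encode())

# MyApp.pm and MyApp1.pm have the same contents, so they share one blob.
blob1 = put(b"package MyApp;\n")
blob2 = put(b"package MyApp;\n")
assert blob1 == blob2

# Trees reference blobs (and other trees) one-way, by name.
readme = put(b"some documentation\n")
foo = put_tree({"README": readme})
root = put_tree({"foo": foo, "MyApp.pm": blob1, "MyApp1.pm": blob2})
print(len(store), "objects")  # 4 objects, not 5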

As you can see, directory structures are defined by chains of references stored in trees. A tree is able to represent all of the data in the files and directories under it even though it contains only one level of names and references. Because the SHA1s of the referenced objects are part of its content, a tree's SHA1 exactly identifies and verifies the data throughout the structure; a checksum resulting from a series of checksums verifies all the underlying data, regardless of the number of levels.

Consider storing a change to the file README illustrated in Figure 1. When committed, this would create a new blob (with a new SHA1), which would require a new tree to represent "foo" (with a new SHA1), which would require a new tree for the top directory (with a new SHA1).

While creating three new objects to store one change might seem inefficient, keep in mind that aside from the critical path of tree objects from changed file to root, every other object in the hierarchy remains identical. If you have a gigantic hierarchy of 10,000 files and you change the text of one file ten directories deep, 11 new objects allow you to describe both the old and the new state of the tree.
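Here is a quick sketch of that arithmetic (again in Python, with a toy serialization rather than Git's real format): build the same ten-level hierarchy around two versions of one file and compare the object sets. Only the changed blob and the trees on its path to the root receive new names:

import hashlib

def build(content):
    # Build a hierarchy ten directories deep around one file and
    # return the names of every object in it.
    objects = set()

    def put(data):
        oid = hashlib.sha1(data).hexdigest()
        objects.add(oid)
        return oid

    tree = put(b"README " + put(content).encode())   # directory holding the file
    for level in range(9):                           # nine enclosing directories
        sibling = put(b"unchanged sibling content")  # untouched neighbors
        tree = put(("dir%d %s sibling %s" % (level, tree, sibling)).encode())
    return objects

old = build(b"version 1")
new = build(b"version 2")

# Only the new blob and the ten trees above it get new names;
# every sibling object is shared byte-for-byte between versions.
print(len(new - old))  # 11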

Note:

One potential problem of the content-addressed design is that two large files with minor differences must be stored as different objects. However, Git optimizes these cases by using deltas to eliminate duplicate data between objects wherever possible. The size-reduced data is stored in a highly efficient manner in "pack files", which also are further compressed. This operates transparently underneath the object repository layer.

______________________

Comments




About Git

Mike Wang

Worth keeping

Checking Out A Small Subset Of Files On A Small Device?

Anonymous

The limitation I immediately ran into when I considered migrating to git is checking out a (rather randomly selected) subset of files on a small/portable computing device.

Say I have a big repository and only need a very small subset of its files while on the go -- to refer to and to edit.

Originally this was on a small netbook computer: I could check out a few directories from a big repository and edit files on the netbook while on the bus.

Netbooks might have grown larger with regard to disk storage, but now I want to do the same on an Android phone.

git's sparse checkout feature still pulls the entire repository to the device. It only checks out a subset of files to give the appearance of a sparse checkout, but it doesn't resolve the storage issue.

I don't think git submodules help, as one can't easily move selected files across repositories with all history intact (i.e., every now and then moving some additional directories into a submodule to make them available to small devices, when it becomes necessary), the way one easily can with CVS.

The only solution I can think of is to remotely mount the .git/objects/ directory and deal with its limitations.

Could some creative brain power find a solution to lift this limitation?

Thanks.

Split-able git Tree?

Anonymous

Given that:
Tree object = blob file names + permissions + a collection of blob references.

Could splitting a git repository be implemented by splitting one of git's Tree objects into two (sub-)Tree objects on a personal workstation (perhaps with new Commit objects to keep track of the split), allowing a smaller tree to be checked out on a small device?

Remote changes (made by others) could then be merged to the personal workstation (as staging) before being merged into the split Tree branches for the small devices if necessary.

Changes on the small devices could be merged to the personal workstation (as staging) before being pulled by others.

Would that solve the disk space problem by limiting checkouts to a small (sub-)Tree?

If this idea works, would some able developer turn it into an implementation?

Thanks.

On the guarantees of SHA1

Johan Commelin

First of all I want to thank the author for this clear and concise article.

However, I want to point out an inaccuracy in the paragraph on SHA1. The author states that SHA1 guarantees different checksums for different data, and that the chance that two pieces of data have the same SHA1 is infinitesimally small. I disagree on this point.

The 40-character string that SHA1 outputs gives us 16^40 = 2^160 ~~ 10^16 different checksums. Although this is big enough to assume the above-described 'guarantee', the claim about the infinitesimal chance is just wrong.

Consider, for example, 2^160 + 1 pairwise distinct files (hypothetical data, to be sure). The chance that two different pieces of data in this set have the same checksum is then exactly 1. And 1 is very, very different from infinitesimal.

I agree that it is highly unlikely that two such files will occur in practice, let alone in one project. (For example, each person on earth would have to create about 100,000 distinct files to come close to the 2^160 files.) Still, I wanted to point this out about the cryptographic features of SHA1.

There is not enough matter in

Anonymous

There is not enough matter in the universe to store 2^64 bits, much less 2^160 bits, even if you stored 1 bit per atom.

Your math is *way* off. 2^160

Anonymous

Your math is *way* off. 2^160 ~~ 10^48.

Very nice introduction

Anonymous

Congrats for a very clear and concise introduction for something as difficult to teach as git.

I love git, and had to give git training to Subversion users -- hard work! It really amounts to unlearning SVN and learning something completely new.
