Git - Revision Control Perfected


A commit is meant to record a set of changes introduced to a project. What it really does is associate a tree object—representing a complete snapshot of a directory structure at a moment in time—with contextual information about it, such as who made the change and when, a description, and its parent commit(s).

A commit doesn't actually store a list of changes (a "diff") directly, but it doesn't need to. What changed can be calculated on-demand by comparing the current commit's tree to that of its parent. Comparing two trees is a lightweight operation, so there is no need to store this information. Because there actually is nothing special about the parent commit other than chronology, one commit can be compared to any other just as easily regardless of how many commits are in between.

All commits should have a parent except the first one. Commits usually have a single parent, but they will have more if they are the result of a merge (I explain branching and merging later in this article). A commit from a merge still is just a snapshot in time like any other, but its history has more than one lineage.

By following the chain of parent references backward from the current commit, the entire history of a project can be reconstructed and browsed all the way back to the first commit.

A commit is expanded recursively into a project history in exactly the same manner as a tree is expanded into a directory structure. More important, just as the SHA1 of a tree is a fingerprint of all the data in all the trees and blobs below it, the SHA1 of a commit is a fingerprint of all the data in its tree, as well as all of the data in all the commits that preceded it.

This happens automatically because references are part of an object's overall content. The SHA1 of each object is computed, in part, from the SHA1s of any objects it references, which in turn were computed from the SHA1s they referenced and so on.


A tag is just a named reference to an object—usually a commit. Tags typically are used to associate a particular version number with a commit. The 40-character SHA1 names are many things, but human-friendly isn't one of them. Tags solve this problem by letting you give an object an additional name.

There are two types of tags: object tags and lightweight tags. Lightweight tags are not objects in the repository, but instead are simple refs like branches, except that they don't change. (I explain branches in more detail in the Branching and Merging section below.)

Setting Up Git

If you don't already have Git on your system, install it with your package manager. Because Git is primarily a simple command-line tool, installing it is quick and easy under any modern distro.

You'll want to set the name and e-mail address that will be recorded in new commits:

git config --global "John Doe"
git config --global

This just sets these parameters in the config file ~/.gitconfig. The config has a simple syntax and could be edited by hand just as easily.

User Interface

Git's interface consists of the "working copy" (the files you directly interact with when working on the project), a local repository stored in a hidden .git subdirectory at the root of the working copy, and commands to move data back and forth between them, or between remote repositories.

The advantages of this design are many, but right away you'll notice that there aren't pesky version control files scattered throughout the working copy, and that you can work off-line without any loss of features. In fact, Git doesn't have any concept of a central authority, so you always are "working off-line" unless you specifically ask Git to exchange commits with your peers.

The repository is made up of files that are manipulated by invoking the git command from within the working copy. There is no special server process or extra overhead, and you can have as many repositories on your system as you like.

You can turn any directory into a working copy/repository just by running this command from within it:

git init

Next, add all the files within the working copy to be tracked and commit them:

git add .
git commit -m "My first commit"

You can commit additional changes as frequently or infrequently as you like by calling git add followed by git commit after each modification you want to record.

If you're new to Git, you may be wondering why you need to call git add each time. It has to do with the process of "staging" a set of changes before committing them, and it's one of the most common sources of confusion. When you call git add on one or more files, they are added to the Index. The files in the Index—not the working copy—are what get committed when you call git commit.

Think of the Index as what will become the next commit. It simply provides an extra layer of granularity and control in the commit process. It allows you to commit some of the differences in your working copy, but not others, which is useful in many situations.

You don't have to take advantage of the Index if you don't want to, and you're not doing anything "wrong" if you don't. If you want to pretend it doesn't exist, just remember to call git add . from the root of the working copy (which will update the Index to match) each time and immediately before git commit. You also can use the -a option with git commit to add changes automatically; however, it will not add new files, only changes to existing files. Running git add. always will add everything.

The exact work flow and specific style of commands largely are left up to you as long as you follow the basic rules.

The git status command shows you all the differences between your working copy and the Index, and the Index and the most recent commit (the current HEAD):

git status

This lets you see pending changes easily at any given time, and it even reminds you of relevant commands like git add to stage pending changes into the Index, or git reset HEAD <file> to remove (unstage) changes that were added previously.



Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.


Anonymous's picture

Good to read this informative article here on this website. It’s an interesting post.
Thank you! wintersport oostenrijk chalet / wintersport oostenrijk chalet

Ich selber habe einige

Anonymous's picture

Ich selber habe einige Webseiten und brauchte genau das, was ich hier lesen konnte. Könnte ja mal schauen, ob ich es richtig gemacht habe.

Liebe Grüße

About Git

Mike Wang's picture

Worth keeping

Checking Out A Small Subset Of Files On A Small Device?

Anonymous's picture

The limitation that I immediately ran into when I considered migrating to git is to check out some (rather randomly selected) subset of files on a small/portable computing device.

Say I have a big repository of files and I only needed a very small subset of files while on the go -- to refer to and to be edited.

It was originally a small netbook computer where I could check out a few directories from a big repository and be able to edit files on the netbook computer while on the bus.

Netbook might have grown larger with regard to its disk storage, but now, I want to do the same on an Android phone.

git's sparse checkout feature still pulls the entire repository to the device. It only checkout a subset of files to give the appearance of sparse checkout, but it doesn't resolve the storage issue.

I don't think git submodules help, as, I think, one can't easily move selected files across repositories with all history intact (i.e., every now and then, add some additional directories to the list available to small devices by moving them to a submodule, when it becomes necessary), as one can easily do with CVS.

The only solution that I can think of is to remotely mount .git/objects/ directory and deal with its limitation.

Is there any creative brain power would find a solution lift this limitation?


Split-able git Tree?

Anonymous's picture

Given that:
Tree object = Blobs file names + permissions + Blobs collection.

Can splitting git repository be implemented by splitting some git's Tree object into 2 (sub-) Tree objects on a personal workstation, (perhaps new Commit objects to keep track of the split,) allowing a smaller tree be checked out to a small device.

Remote changes (done by others) can, then, be merged to the personal workstation (as staging), before merging to the splitted Tree branches for the small devices if necessary.

Changes on the small devices can be merged to the personal workstation (as staging), before being pulled by others?

Would that solve the disk space problem by limiting checkout to a small (sub-) Tree?

If this idea works, would some able developer turns it into an implementation?


On the guarantees of SHA1

Johan Commelin's picture

First of all I want to thank the author for this clear and concise article.

However, I want to point out some inaccuracy regarding the paragraph on SHA1. The author states that SHA1 guarantees that the data in the blobs is different, and that the chance that two pieces of data have the same SHA1 is infinitesimally small. I disagree on this point.

The 40-character string that SHA1 outputs gives us 16^40 = 2^160 ~~ 10^16 different checksums. Although this is big enough to assume the above descripted 'guarantee', the claim about the infinitesimal chance is just wrong.

Consider for example 2^160 + 1 pairwise distinct files (this is data, be it hypothetical). The chance that there will be two different pieces of data in this set having the same checksum is 1. And 1 is very very different from infinitesimal.

I agree that it is highly unlikely that two such files will occur in practice, let alone in one project. (For example, each person on earth would have to create about 100.000 distinct files, to come close to the 2^160 files.) Still I wanted to point this out about the cryptographic features of SHA1.

There is not enough matter in

Anonymous's picture

There is not enough matter in the universe to store 2^64 bits, much less 2^160 bits, even if you stored 1 bit per atom.

Your math is *way* off. 2^160

Anonymous's picture

Your math is *way* off. 2^160 ~~ 10^48.

Very nice introduction

Anonymous's picture

Congrats for a very clear and concise introduction for something as difficult to teach as git.

I love git, and had to give git training to Subversion users -- hard work! It really amounts to unlearning SVN and learn something completely new.

White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState