Git - Revision Control Perfected
In 2005, after just two weeks, Linus Torvalds completed the first version of Git, an open-source version control system. Unlike typical centralized systems, Git is based on a distributed model. It is extremely flexible and guarantees data integrity while being powerful, fast and efficient. With widespread and growing rates of adoption, and the increasing popularity of services like GitHub, many consider Git to be the best version control tool ever created.
Surprisingly, Linus had little interest in writing a version control tool before this endeavor. He created Git out of necessity and frustration. The Linux Kernel Project needed an open-source tool to manage its massively distributed development effectively, and no existing tools were up to the task.
Many aspects of Git's design are radical departures from the approach of tools like CVS and Subversion, and they even differ significantly from more modern tools like Mercurial. This is one of the reasons Git is intimidating to many prospective users. But, if you throw away your assumptions of how version control should work, you'll find that Git is actually simpler than most systems, but capable of more.
In this article, I cover some of the fundamentals of how Git works and stores data before moving on to discuss basic usage and work flow. I found that knowing what is going on behind the scenes makes it much easier to understand Git's many features and capabilities. Certain parts of Git that I previously had found complicated suddenly were easy and straightforward after spending a little time learning how it worked.
I find Git's design to be fascinating in and of itself. I peered behind the curtain, expecting to find a massively complex machine, and instead saw only a little hamster running in a wheel. Then I realized a complicated design not only wasn't needed, but also wouldn't add any value.
Git Object Repository
Git, at its core, is a simple indexed name/value database. It stores pieces of data (values) in "objects" with unique names. But, it does this somewhat differently from most systems. Git operates on the principle of "content-addressed storage", which means the names are derived from the values. An object's name is chosen automatically by its content's SHA1 checksum—a 40-character string like this:
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
SHA1 is cryptographically strong, which guarantees a different checksum for different data (the actual risk of two different pieces of data sharing the same SHA1 checksum is infinitesimally small). The same chunk of data always will have the same SHA1 checksum, which always will identify only that chunk of data. Because object names are SHA1 checksums, they identify the object's content while being truly globally unique—not just to one repository, but to all repositories everywhere, forever.
To put this into perspective, the example SHA1 listed above happens to be the ID of the first commit of the Linux kernel into a Git repository by Linus Torvalds in 2005 (2.6.12-rc2). This is a lot more useful than some arbitrary revision number with no real meaning. Nothing except that commit ever will have the same ID, and you can use those 40 characters to verify the data in every file throughout that version of Linux. Pretty cool, huh?
Git stores all the data for a repository in four types of objects: blobs, trees, commits and tags. They are all just objects with an SHA1 name and some content. The only difference between them is the type of information they contain.
Blobs and Trees
A blob stores the raw data content of a file. This is the simplest of the four object types.
A tree stores the contents of a directory. This is a flat list of file/directory names, each with a corresponding SHA1 representing its content. These SHA1s are the names of other objects in the repository. This referencing technique is used throughout Git to link all kinds of information together. For file entries, the referenced object is a blob. For directory entries, the referenced object is a tree that can contain more directory entries, in turn referencing more trees to define a complete and potentially unlimited hierarchy.
It's important to recognize that blobs and trees are not themselves files and directories; they are just the contents of files and directories. They don't know about anything outside their own content, including the existence of any references in other objects that point to them. References are one-way only.
Figure 1. An example directory structure and how it might be stored in Git as tree and blob objects (I truncated the SHA1 names to six characters for readability).
In the example shown in Figure 1, I'm assuming that the files MyApp.pm and MyApp1.pm have the same contents, and so by definition, they must reference the same blob object. This behavior is implicit in Git because of its content-addressable design and works equally well for directories with the same content.
As you can see, directory structures are defined by chains of references stored in trees. A tree is able to represent all of the data in the files and directories under it even though it contains only one level of names and references. Because SHA1s of the referenced objects are within its content, a tree's SHA1 exactly identifies and verifies the data throughout the structure; a checksum resulting from a series of checksums verifies all the underlying data regardless of the number of levels.
Consider storing a change to the file README illustrated in Figure 1. When committed, this would create a new blob (with a new SHA1), which would require a new tree to represent "foo" (with a new SHA1), which would require a new tree for the top directory (with a new SHA1).
While creating three new objects to store one change might seem inefficient, keep in mind that aside from the critical path of tree objects from changed file to root, every other object in the hierarchy remains identical. If you have a gigantic hierarchy of 10,000 files and you change the text of one file ten directories deep, 11 new objects allow you to describe both the old and the new state of the tree.
Note:
One potential problem of the content-addressed design is that two large files with minor differences must be stored as different objects. However, Git optimizes these cases by using deltas to eliminate duplicate data between objects wherever possible. The size-reduced data is stored in a highly efficient manner in "pack files", which also are further compressed. This operates transparently underneath the object repository layer.
Today’s modular x86 servers are compute-centric, designed as a least common denominator to support a wide range of IT workloads. Those generic, virtualized IT workloads have much different resource optimization requirements than hyperscale and cloud applications. They have resulted in a “one size fits all” enterprise IT architecture that is not optimized for a specific set of IT workloads, and especially not emerging hyperscale workloads, such as web applications, big data, and object storage. In this report, you will learn how shifting the focus from traditional compute-centric IT architectures to an innovative disaggregated fabric-based architecture can optimize and scale your data center.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
| Trying to Tame the Tablet | May 08, 2013 |
| Dart: a New Web Programming Experience | May 07, 2013 |
- RSS Feeds
- New Products
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Drupal Is a Framework: Why Everyone Needs to Understand This
- Home, My Backup Data Center
- A Topic for Discussion - Open Source Feature-Richness?
- What's the tweeting protocol?
- Dart: a New Web Programming Experience
- Developer Poll
- Trying to Tame the Tablet
- Reply to comment | Linux Journal
1 hour 50 min ago - Reply to comment | Linux Journal
4 hours 23 min ago - Reply to comment | Linux Journal
5 hours 40 min ago - great post
6 hours 15 min ago - Google Docs
6 hours 37 min ago - Reply to comment | Linux Journal
11 hours 26 min ago - Reply to comment | Linux Journal
12 hours 13 min ago - Web Hosting IQ
13 hours 47 min ago - Thanks for taking the time to
15 hours 23 min ago - Linux is good
17 hours 21 min ago
Enter to Win an Adafruit Prototyping Pi Plate Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Prototyping Pi Plate Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- Next winner announced on 5-21-13!
Free Webinar: Linux Backup and Recovery
Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.
In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.



Comments
Interesting
Good to read this informative article here on this website. It’s an interesting post.
Thank you! wintersport oostenrijk chalet / wintersport oostenrijk chalet
Ich selber habe einige
Ich selber habe einige Webseiten und brauchte genau das, was ich hier lesen konnte. Könnte ja mal schauen, ob ich es richtig gemacht habe.
Liebe Grüße
http://www.flirtcenter24.de/
About Git
Worth keeping
Checking Out A Small Subset Of Files On A Small Device?
The limitation that I immediately ran into when I considered migrating to git is to check out some (rather randomly selected) subset of files on a small/portable computing device.
Say I have a big repository of files and I only needed a very small subset of files while on the go -- to refer to and to be edited.
It was originally a small netbook computer where I could check out a few directories from a big repository and be able to edit files on the netbook computer while on the bus.
Netbook might have grown larger with regard to its disk storage, but now, I want to do the same on an Android phone.
git's sparse checkout feature still pulls the entire repository to the device. It only checkout a subset of files to give the appearance of sparse checkout, but it doesn't resolve the storage issue.
I don't think git submodules help, as, I think, one can't easily move selected files across repositories with all history intact (i.e., every now and then, add some additional directories to the list available to small devices by moving them to a submodule, when it becomes necessary), as one can easily do with CVS.
The only solution that I can think of is to remotely mount .git/objects/ directory and deal with its limitation.
Is there any creative brain power would find a solution lift this limitation?
Thanks.
Split-able git Tree?
Given that:
Tree object = Blobs file names + permissions + Blobs collection.
Can splitting git repository be implemented by splitting some git's Tree object into 2 (sub-) Tree objects on a personal workstation, (perhaps new Commit objects to keep track of the split,) allowing a smaller tree be checked out to a small device.
Remote changes (done by others) can, then, be merged to the personal workstation (as staging), before merging to the splitted Tree branches for the small devices if necessary.
Changes on the small devices can be merged to the personal workstation (as staging), before being pulled by others?
Would that solve the disk space problem by limiting checkout to a small (sub-) Tree?
If this idea works, would some able developer turns it into an implementation?
Thanks.
On the guarantees of SHA1
First of all I want to thank the author for this clear and concise article.
However, I want to point out some inaccuracy regarding the paragraph on SHA1. The author states that SHA1 guarantees that the data in the blobs is different, and that the chance that two pieces of data have the same SHA1 is infinitesimally small. I disagree on this point.
The 40-character string that SHA1 outputs gives us 16^40 = 2^160 ~~ 10^16 different checksums. Although this is big enough to assume the above descripted 'guarantee', the claim about the infinitesimal chance is just wrong.
Consider for example 2^160 + 1 pairwise distinct files (this is data, be it hypothetical). The chance that there will be two different pieces of data in this set having the same checksum is 1. And 1 is very very different from infinitesimal.
I agree that it is highly unlikely that two such files will occur in practice, let alone in one project. (For example, each person on earth would have to create about 100.000 distinct files, to come close to the 2^160 files.) Still I wanted to point this out about the cryptographic features of SHA1.
There is not enough matter in
There is not enough matter in the universe to store 2^64 bits, much less 2^160 bits, even if you stored 1 bit per atom.
Your math is *way* off. 2^160
Your math is *way* off. 2^160 ~~ 10^48.
Very nice introduction
Congrats for a very clear and concise introduction for something as difficult to teach as git.
I love git, and had to give git training to Subversion users -- hard work! It really amounts to unlearning SVN and learn something completely new.