Version control
Let’s begin our exploration of Git by gaining an understanding of what it is and what it can do for you. Git is software that keeps track of changes that you make to files and directories, and it’s especially good at keeping track of text changes. Let’s imagine that you have a document. You start with version 1 of that document. You make some changes to it, and now you have version 2. You make some more changes, and now you have version 3. Git keeps track of those three different versions for you. It allows you to move back and forth between the versions and to compare the different versions to see what changed.
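To make the idea of comparing versions concrete, here is a minimal Python sketch. It is not how Git works internally. The `versions` dictionary, its contents, and the `compare` helper are all made up for illustration, and the comparison itself comes from Python’s standard `difflib` module.

```python
import difflib

# Three hypothetical versions of one document, each stored as a list of lines.
versions = {
    1: ["Hello world"],
    2: ["Hello world", "This is a second line"],
    3: ["Hello, world!", "This is a second line"],
}


def compare(old, new):
    """Return a unified diff (as a list of lines) between two stored versions."""
    return list(difflib.unified_diff(
        versions[old], versions[new],
        fromfile=f"v{old}", tofile=f"v{new}", lineterm="",
    ))


# Show what changed between version 1 and version 3.
for line in compare(1, 3):
    print(line)
```

Lines starting with a minus sign were removed and lines starting with a plus sign were added, which is the same convention you will see when Git shows you a diff.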
Git is referred to as a version control system, or VCS. Programmers wanted a way to be able to track the changes that they made to computer code over time, as they added features and as they fixed bugs. So they created version control. Because these tools were built to manage source code, they’re also called source code management tools, or SCM. The two terms are used pretty interchangeably.
If you’ve never worked with a source code management tool before, you have likely worked with other types of version control. Many applications offer some form of version control as one of their features.
For example,
1) Microsoft Word allows you to track the changes that you make to a document. You turn on the Track Changes feature and send the document to someone else. They make changes to the document, and when they send it back to you, you’re able to review their edits as well as see the original version.
2) Adobe Photoshop has a feature called the history. You can bring up the history palette, see the changes that you’ve made to an image, and move backwards and forwards, applying or removing the changes that you’ve made.
3) Wikipedia : If you’ve ever worked with a Wiki, like Wikipedia then you’ve used a form of version control. When someone contributes a change to a Wiki page the editors have the ability to undo that change. They can go back to a previous version if they need to. We call that process rolling back to a previous version.
And of course we’ve all done the simplest type of version control of all, undo. Ctrl+Z on Windows or Cmd+Z on a Mac. It’ll undo something that we typed or a change that we’ve made. It allows us to go backwards and even undo multiple changes. These are very primitive examples of version control, and there’s no substitute for a real version control system like Git. But they do provide useful metaphors for you to have in your head. They’re examples of how to track and view different versions or changes over time, and how to move backwards and forwards in the history to undo or redo those changes. That’s what we’ll be doing with Git.
The history behind Git
There are several version control systems that predate Git that I want us to look at. There have been others, but these are some of the most popular and the most influential, and I think that they can help us to better understand Git.
The first of these is called SCCS (Source Code Control System). It was released in 1972, was developed by AT&T, and was bundled free with the Unix operating system. Now Unix was also free, and as a consequence Unix spread quickly to places like universities, and SCCS went along with it. Universities taught their students how to do version control using SCCS, so when they left the university to go work in jobs, the version control system they were familiar with and that they took with them was SCCS.
What SCCS does is keep the original document, but then instead of saving the whole document a second time, it just saves a snapshot of what the changes were. So if you want v5 of a document, you just take v1 and apply four sets of changes to it to get to v5. That’s a more efficient way to store the changes over time.
So SCCS stayed dominant until the early 80s, when RCS, the Revision Control System, was developed. It made lots of improvements over SCCS. For one thing, it was cross-platform, whereas SCCS was Unix only. With the rise of the personal computer, it was important to have a version control system that would also work on PCs. It was also more intuitive, had a cleaner syntax with fewer commands, and had more features. Most importantly, it was faster, and a lot of the speed increase came from the fact that it used a smarter storage strategy than SCCS. Remember, SCCS stored the original file and then kept track of all the changes to that file that came after it. RCS flipped that around: it kept the most recent file in its whole form, and if you wanted to go backwards in time to previous versions, you applied the change snapshots in reverse. If you think about it, that’s a lot faster, because most of the time what we want to work with is the current document. With SCCS, if we wanted the current document and there were 20 sets of changes, we had to pull up the original and then wait while 20 sets of changes were applied. With RCS, we can just bring up the current file, and it’s already stored in its full state.
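The two storage strategies can be sketched in a few lines of Python. This is only an illustration of the idea as described here, not the actual on-disk format of SCCS or RCS. The `make_delta`, `apply_delta`, `checkout_forward`, and `checkout_reverse` helpers and the sample `versions` are all hypothetical names invented for this sketch.

```python
import difflib


def make_delta(old, new):
    """Describe how to turn `old` into `new` as (start, end, replacement) edits."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_delta(lines, delta):
    """Apply a delta from make_delta, right to left so earlier indices stay valid."""
    result = list(lines)
    for start, end, replacement in reversed(delta):
        result[start:end] = replacement
    return result


# Four hypothetical versions of one file.
versions = [
    ["alpha"],
    ["alpha", "beta"],
    ["alpha", "beta", "gamma"],
    ["alpha", "BETA", "gamma"],
]

# SCCS-style: keep the first version whole, plus deltas that go forward in time.
forward_base = versions[0]
forward_deltas = [make_delta(a, b) for a, b in zip(versions, versions[1:])]

# RCS-style: keep the latest version whole, plus deltas that go backward in time.
reverse_base = versions[-1]
reverse_deltas = [make_delta(b, a) for a, b in zip(versions, versions[1:])]


def checkout_forward(n):
    """Rebuild version n (1-based) by replaying n - 1 forward deltas."""
    lines = forward_base
    for delta in forward_deltas[:n - 1]:
        lines = apply_delta(lines, delta)
    return lines


def checkout_reverse(n):
    """Rebuild version n (1-based) by walking back from the latest version."""
    lines = reverse_base
    for delta in reversed(reverse_deltas[n - 1:]):
        lines = apply_delta(lines, delta)
    return lines
```

In this sketch, asking for the latest version with `checkout_forward(4)` replays three deltas, while `checkout_reverse(4)` applies none and simply returns the stored file. That is the speed difference described above.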
One of the problems with both SCCS and RCS was that they only allowed you to work with individual files, one at a time. So you could track changes in a single file but not in sets of files or in a whole project. CVS, or Concurrent Versions System, allowed you to do that. Now the real innovation in CVS is not just the fact that you can work with multiple files. It’s the concurrent part: the fact that we can have a place where we store our code, called the code repository, that we can put on a remote server, and that more than one user can work on the same file at the same time. They can work concurrently. With previous systems, only one person could work with a file at a time. So CVS adds a lot of features for users to be able to share their work and update their files with changes that other people have made and placed in the remote repository.
The idea of working with remote repositories was further improved upon with Apache Subversion, or SVN for short. SVN was faster than CVS and allowed saving of non-text files, like images, which CVS did not handle well. Most importantly, the big innovation of SVN was that it tracked not just changes to single files or to groups of files, but what happened in a directory as a whole. It watched files and directories collectively and took a snapshot of the directory. CVS would update files one at a time as it applied changes. SVN would instead do a transactional commit and apply either all of the changes that happened to the directory or none of them at all. The snapshot was bigger than just the individual files; it was an entire directory, or an entire set of changes that were happening to that directory at one time. It’s a subtle but important difference. Now SVN stayed the most popular version control system for a very long time.
In fact, until Git came out. But there’s one other version control system that I want us to look at that comes in between, and that’s BitKeeper SCM. It was a closed source, proprietary source code management tool. That means that a company owned it and sold it, the same way that Adobe sells Photoshop or Microsoft sells Word. One of the important features that BitKeeper had, and it wasn’t the first to have it, is distributed version control.
Before we get to that, let’s talk a little bit more about this idea of being closed source, where all the other ones that we’ve been looking at have been open source. There was the paid version of BitKeeper, but there was also a community version that they gave away for free, which had fewer features and some usage restrictions. That community version was used for source code management of the Linux kernel from 2002 to 2005. It was controversial to use a proprietary SCM for the Linux kernel, because the Linux kernel is an open source project. No one owns it, whereas the SCM is owned and controlled by a company. So many people objected, saying, well, what if they change the rules in the future? We’re going to be stuck using this company’s software. Well, guess what? In April 2005, the community version stopped being free and all those predictions came true. So BitKeeper was never as popular as CVS or SVN, but it’s important to the creation of Git because of its connection to Linux. In April 2005, when the community version stopped being free, that’s the same point at which Git was born.
Git was created by Linus Torvalds, and you may recognize that name as the person who created Linux and still drives the development of the Linux kernel. When BitKeeper stopped being free, the kernel developers needed an alternative for managing their source code. Linus looked around and he didn’t like the other VCSs that were out there, like CVS and SVN. He did like some of the concepts of BitKeeper, but he thought he could do even better. So he wrote a new version control system from scratch, and that was Git.
Git is distributed version control, like BitKeeper. It’s open source and free, which is great for us because it means that people like you and me can download it for free, use it for free, and there are no license fees or anything like that. It also means that, because it’s open source, the community can see the source code and contribute to it. They can submit bug fixes and add new features, all those benefits we get because it’s an open source project. It’s also compatible with most platforms, like Linux, macOS, and Windows. And it’s faster than most other source code management tools, a hundred times faster in some cases for some operations. It also has better safeguards built into it to guard against data corruption. Now these improvements all worked. Git became a big hit. As people discovered the power of distributed version control, and as they got used to all of Git’s nice features, Git experienced an explosion in popularity. Now there are no official statistics on this, but to give you an example:
GitHub launched in 2008 as a platform to host Git source code repositories.
In 2009, there were over 50,000 repositories with 100,000 users.
In 2011, just two years later, there were 2 million repositories with over a million users.
By 2018, GitHub was very popular, and it was purchased by Microsoft.
In 2019, there were over 57 million repositories and over 28 million users.
So, Git has definitely taken off.
About distributed version control
I want to explain what distributed version control means so we can understand why it’s such an important feature of Git. We talked about SCCS, RCS, CVS and SVN, four of the most popular version control systems of the past but all four of these use a central code repository model. That’s where one central place is used to store the master copy of your code. And when you’re working with the code, you check out a copy from that master repository. You work with it to make your changes, and then you submit those changes back to the central repository. Other users can also work with that repository, submitting their changes, and it’s up to us as users to keep up to date with whatever’s happening in that central code repository to make sure that we pull down and update any changes that other people have made.
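Here is a rough Python sketch of that central repository workflow: check out, change, submit, and keep up to date. It is a toy model under stated assumptions, not how CVS or SVN are implemented. The `CentralRepository` class, its methods, and the two users are hypothetical, and this toy `update` simply overwrites local edits where a real system would try to merge them.

```python
class CentralRepository:
    """Toy central server: one master copy plus a revision counter."""

    def __init__(self, content):
        self.content = content
        self.revision = 1

    def checkout(self):
        """Hand out a working copy tagged with the revision it came from."""
        return {"content": self.content, "base_revision": self.revision}

    def commit(self, working_copy):
        """Accept a change only from a working copy that is up to date."""
        if working_copy["base_revision"] != self.revision:
            raise RuntimeError("out of date: update before committing")
        self.content = working_copy["content"]
        self.revision += 1
        working_copy["base_revision"] = self.revision

    def update(self, working_copy):
        """Pull down the latest master copy (this toy version discards local edits)."""
        working_copy["content"] = self.content
        working_copy["base_revision"] = self.revision


server = CentralRepository("v1")
alice = server.checkout()
bob = server.checkout()

# Alice submits her change first, so the master copy moves ahead of Bob's copy.
alice["content"] = "v1 + Alice's fix"
server.commit(alice)

# Bob's working copy is now behind, so the server refuses his commit.
bob["content"] = "v1 + Bob's feature"
try:
    server.commit(bob)
except RuntimeError as err:
    print(err)
```

Notice that every commit and every update has to go through `server`. That single object is what the distributed model does away with.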
Git doesn’t work that way. Git is distributed version control. Different users each maintain their own repositories instead of working from a central repository, and the changes are stored as sets or patches, and we’re focused on tracking changes, not the versions of the documents.
Now that’s a subtle difference. You may think well, CVS and SVN, those track changes too. They track the changes that it takes to get from version to version of each different file, or the different states of a directory. Git doesn’t work that way. Git really focuses on these change sets, and encapsulating a change set as a discrete unit, and then those change sets can be exchanged between repositories. We’re not trying to keep up to date with the latest version of something. Instead the question is do we have a change set applied or not? So you might say that you merge in change sets or you apply patches between the different repositories. So there’s no single master repository. There’s just many working copies, each with their own combination of change sets.
Let me give an illustration to make this point clear. Imagine that we have changes to a single document as sets A, B, C, D, E, and F.
None of these repositories is right, and none of them is wrong. No one of them is the master repository with the others somehow out of date or out of sync with it. They’re all just different repositories that happen to have different change sets in them. We could just as easily add change set G to repository 3, and then we could share it with repository 4 without ever having to go to any kind of central server at all, whereas with CVS and SVN, for example, you would need to submit those changes to a central server, and then people would need to pull down those changes to update their versions of the file.
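The same picture can be sketched in Python, treating each repository as nothing more than a collection of change sets. This is a simplified model for illustration only and leaves out almost everything real Git does. The `Repository` class and its methods are hypothetical, and which change sets each of the four repositories starts with is an assumption made up for this sketch.

```python
class Repository:
    """Toy distributed repository: just a named collection of change sets."""

    def __init__(self, name, change_sets=()):
        self.name = name
        self.change_sets = set(change_sets)

    def commit(self, change_set):
        """Record a new change set locally; no server or network is involved."""
        self.change_sets.add(change_set)

    def pull_from(self, other, change_set):
        """Apply one change set that another repository already has."""
        if change_set not in other.change_sets:
            raise KeyError(f"{other.name} does not have {change_set}")
        self.change_sets.add(change_set)


# Assumed starting contents; none of these is "the" correct repository.
repo1 = Repository("repository 1", "ABCDEF")
repo2 = Repository("repository 2", "ABCD")
repo3 = Repository("repository 3", "ABCE")
repo4 = Repository("repository 4", "ABEF")

# Add change set G to repository 3, then share it directly with repository 4.
repo3.commit("G")
repo4.pull_from(repo3, "G")

for repo in (repo1, repo2, repo3, repo4):
    print(repo.name, sorted(repo.change_sets))
```

Repository 1 and repository 2 never hear about G, and nothing in the sketch treats that as an error. The only question for each repository is whether it has a given change set or not.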
Now by convention, we often do designate a repository as being the master repository, but that’s not built into Git. It’s not part of the Git architecture. It’s just a convention, where we say okay, this is going to be the master repository, everyone is going to submit their changes to this repository, and we’re all going to stay in sync from that one. But we don’t have to. We can actually have three or four different master repositories that have different versions in them, and we could all be contributing to those equally and just swapping changes between them.
Now because it’s distributed, that has a couple of advantages. It means that there’s no need to communicate with a central server, which makes things faster, and it means that it’s not necessary to have network access to submit our changes. We can work on an airplane, for example. And there’s no single point of failure. With CVS and SVN, if something goes wrong with that central repository, that can be a real show stopper for everyone else who’s working off of that central repository. With Git we don’t have that problem. Everyone can keep working. They’ve each got their own repository that they’re working from, not just a copy that they’re trying to keep in sync with some central repository.
It also encourages participation and forking of projects, and this is really important for the open source community because developers can work independently. They can make changes, bug fixes, and feature improvements, and then they can submit those back to the project for either inclusion or rejection. And if you’re working on an open source project and you don’t like the way that it’s going, you can fork that project, create your own version, and take it in a completely different direction. That becomes a really powerful and flexible feature that’s well suited to collaboration between teams, especially loose groups of distributed developers like you have in the open source world.
Distributed version control is an important part of the Git architecture, and it’s important to learn about it, especially if you have previous experience with other version control systems like CVS or SVN. We’ll talk a lot more about how Git tracks and merges these sets of changes as we go forward. For now, just make sure that you understand that there is no central repository that we all work from. All repositories are considered equal by Git.