Getting started with Git

Introduction

Git is a version control system. It combines the best of automated backups such as Dropbox or Box with the purpose-driven revisions of Word’s Track Changes.

The big picture is that you have a collection of files (called a repository or repo), and once you’ve made a set of changes (whether adding new material, changing old material, or something as trivial as fixing a typo), you save a copy with a note that explains the purpose of the change. All your previous versions are then saved, so you can revert to them or examine them at any time, along with a quick summary of your rationale to remind yourself.

When working with multiple users, Git lets you coordinate work by allowing several people to edit the same file without worrying about overlapping work.[1]

If you are not working in a team, using Git still provides the benefits listed above.

Github and others

Git is a piece of open source software. Github and others (Bitbucket, Gitlab) are online hosts for your git project. Think email versus Gmail. Your email exists in the absence of Gmail, but using Gmail for your email provides numerous benefits. You can use Git without ever using an online host like Github; you would still get the benefits of saved backups and a log of your changes. However, using Github (or the others) allows you to store your project on a server so that you can access it from anywhere or share it.

On Github, the default for repos is Public; that is, anyone can see them and anyone can ask you to accept their edits. (You can of course either give a set of users permission to make edits without your approval, or reject a specific edit if you don’t agree with it.) Git and other version control systems started with open source software, which thrives in public environments. If you need private repos, you can pay for an account on Github, or sign up with a .edu address, or use Bitbucket or Gitlab, which both support unlimited private repos.

Terminology

Before we dive too deeply, let’s define some terminology:

  • Repository: A collection of files which are kept together. Think a folder, but with special settings and features. We often differentiate between “remote repositories”, which live on Github, and “local repositories”, which are copies (or clones) of the remote repository that live on local machines. In most situations, you work with local repositories before contributing the changes to the remote repository.[2]
  • Clone: Creating a copy of the remote repository on your local machine. You do this only once (per local machine, and assuming you don’t erase the entire local repository and start over); think of it as an initialization step.
  • Pull: After you’ve cloned the remote repository, you pull from it to update your local repository to the most current version. If you have not made any changes to your local repository since your last pull (or the initial clone), this will simply update your code. If you have made changes to your local repository, pulling will initiate a merge:
    • Merge: When you try to combine your local repository with the changes on the remote repository and there are changes on both ends, you enter a merge. In the best case, there is no overlap between them (e.g. on the remote repository the changes were to file “one.txt” while the local repository had changes only in file “two.txt”), and the merge automatically accepts both sets of edits and merges them. If the edits overlap (e.g. they both edit the same file or maybe even the same line), you will have to manually decide which edits to keep. Try to avoid merge conflicts! We will discuss some best practices to avoid them.
  • Push: Take any changes you’ve made to your local repository, and push them up to the remote repository. This will not accept merges, so you must make sure there are no changes on the remote repository that do not exist locally. This is the opposite of a pull.
  • Staging: When performing a push, not everything you’ve changed gets pushed. Instead, only the changes you select and add to staging are committed and then pushed (this can be some of the changes or all of them). This allows you to make temporary changes, or to push in several pieces so that each push is self-contained and expresses a single change.
  • Commit: There is a verb and a noun version of commit:
    • (noun) A commit is a specific “save” of the changes you’ve made, along with a short description of the changes.
    • (verb) Making a commit takes the changes which are in staging and creates a (noun) commit. (For new files, this can be known as “checking in”.)

It might help to visualize the different statuses each change can be in and how to move between them. This visualization is a slight tweak from https://www.edureka.co/blog/git-tutorial/

We did not define add, but that is how git moves a change into staging. We also do not discuss checkout, which discards uncommitted local changes and reverts files to their last committed version.
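As a rough command-line sketch of staging, committing and checkout, assuming git is installed: the scratch directory, the file names (borrowed from the merge example above) and the "Example User" identity are all placeholders for illustration.

```shell
set -eu
work=$(mktemp -d)                 # scratch directory for the demo
cd "$work"
git init -q demo
cd demo
git config user.name "Example User"         # placeholder identity
git config user.email "user@example.com"

# First commit: check in two new files.
echo "first draft" > one.txt
echo "notes" > two.txt
git add one.txt two.txt
git commit -q -m "Add one.txt and two.txt"

# Change both files, but stage only one of them.
echo "second draft" > one.txt
echo "scratch work" >> two.txt
git add one.txt
git status --short    # one.txt shows "M " (staged), two.txt shows " M" (not staged)

# Only the staged change goes into this commit.
git commit -q -m "Revise one.txt"

# Discard the uncommitted change to two.txt.
git checkout -- two.txt
git status --short    # prints nothing: the files match the last commit
```

Staging one file at a time like this is what keeps each commit self-contained.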

The basic workflow

Here’s the basic idea behind the workflow. Some of these steps may not make sense yet, but I introduce it here so you have this in the back of your mind while reading further.

  1. Clone the remote repository locally
    … Some time passes
  2. Pull from the remote repository to make sure you have the latest changes.
  3. Make and test your changes locally.
  4. Add your changes to staging.
  5. Commit your changes from staging with a nice note explaining them.
  6. Pull from the remote repository again to make sure nothing else has changed.
    6a. If anything has changed, either enter a merge and adjust accordingly, or make sure that the new changes don’t affect what you’re working on. If they do, repeat steps 3-6. Once the changes on the remote repository don’t affect your local changes (or there are no changes), go to step 7.
  7. Push your changes to the remote repository.
    … Some time passes
  8. Repeat 2-7.
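The workflow above can be sketched on the command line roughly as follows. This is a sketch under assumptions: a local bare repository (remote.git) stands in for the remote on Github so that it runs without a network, and the directory names, file contents and identity are made up.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
# Placeholder identity for the demo commits.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# Stand-in for the remote repository, seeded with one commit.
git init -q --bare remote.git
git clone -q remote.git seed
cd seed
echo ".Rproj.user" > .gitignore
git add .gitignore
git commit -q -m "Start a .gitignore"
git push -q origin HEAD
cd ..

# 1. Clone the remote repository locally.
git clone -q remote.git project
cd project

# 2. Pull to make sure you have the latest changes.
git pull -q

# 3. Make and test your changes locally.
echo "x <- 1" > analysis.R

# 4. Add your changes to staging.
git add analysis.R

# 5. Commit from staging with a note explaining the change.
git commit -q -m "Add analysis script"

# 6. Pull again to make sure nothing else has changed.
git pull -q --no-rebase --no-edit

# 7. Push your changes to the remote repository.
git push -q
```

With a real remote, the clone step would use the repository's address from Github instead of remote.git.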

Working with a remote Git server

We’ll assume you have a repository on Github that you want to clone. (This can be either a new, blank repository or a work-in-progress repo. A new repository shouldn’t be entirely blank, though; typically you want to start with a .gitignore file.)
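As a small sketch of what a starting .gitignore does: the three entries below are assumptions about what an RStudio project might want to ignore, and the file names are invented for the demo.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q project
cd project

# Assumed entries for an RStudio project; adjust to taste.
cat > .gitignore <<'EOF'
.Rproj.user
.Rhistory
.RData
EOF

# Create some files, two of which match the ignore rules.
mkdir .Rproj.user
touch .Rproj.user/session .Rhistory analysis.R

git status --short              # lists only .gitignore and analysis.R as untracked
git check-ignore -v .Rhistory   # shows which .gitignore line matched
```

Ignored files never show up as changes, so they cannot be staged by accident.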

There are four major ways to interact with Github:

  1. Command line git. This is the original and most powerful version. If you have familiarity with a unix command line, this is the approach for you. However, it does have the steepest learning curve, especially if you don’t already know the command line.
  2. The Github client. This is pretty good software which will manage your local repositories.
  3. An alternative git client (e.g. GitKraken, SmartGit). Obviously the quality and usefulness of these will vary. I’m hesitant to suggest these to new users as they typically expose all of git up front which can be confusing. Most new users will only need the basics (what I’m teaching here). They are a good step-up from the basics.
  4. RStudio. I really like RStudio’s implementation as 1) we’re going to be working inside RStudio anyway, and 2) it only implements a subset of git, so it’s not overwhelming to new users. This is what we’ll be using.

Let’s use RStudio. There are two ways to clone a repo, using https or SSH. SSH is preferred because https asks for your username and password on each interaction with the remote repository unless you set up credential caching, while SSH identifies you with a key that is set up once.

To connect via SSH, we’ll need to share our computer’s public RSA key with Github, so it knows who we are when we connect to the remote repository.

To do this, do the following:

You only need to do this once per computer.
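As a rough command-line sketch of creating such a key, assuming OpenSSH’s ssh-keygen is installed: the key is written to a scratch directory here so the demo cannot overwrite a real key (normally it lives in ~/.ssh), and the email label and empty passphrase are placeholders.

```shell
set -eu
keydir=$(mktemp -d)    # scratch location; a real key normally lives in ~/.ssh

# Generate an RSA key pair. The -C label is a placeholder, and the empty
# passphrase (-N "") is for the demo only.
ssh-keygen -q -t rsa -b 4096 -C "user@example.com" -N "" -f "$keydir/id_rsa"

# The public half is the text you share with Github.
cat "$keydir/id_rsa.pub"

# Once a real key in ~/.ssh is registered with Github, the connection can be
# checked with (needs network access, so it is not run here):
#   ssh -T git@github.com
```

Only the .pub file is ever shared; the private key stays on your computer.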

Git in RStudio

When you open an RStudio project which is controlled by Git, or use New Project -> Version Control to clone a repository, you get a new tab in the Environment/History section called Git (similar to Build when working with R packages).

Let’s briefly cover the functionality here:

  • Diff: Here you can see in red which lines were removed and in green which lines were added. For each “chunk”, you can choose to ignore it, to stage it (add it to Staging) or to discard it (revert to the original version).
  • Commit: Add a note explaining the reason behind the changes and commit them.
  • Push/Pull: As expected.
  • History: Show the commit logs. You can examine the changes made at each commit.

Also, if your local repository and the remote repository are out of sync, you will see a message such as:

Your branch is ahead of 'origin/master' by 1 commit.

Here, “branch” refers to the local repository and “‘origin/master’” is the remote repository. This is not an error, and will appear after you make a commit. Just don’t forget to push before you finish!
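The same information is available on the command line. A minimal sketch, assuming a local bare repository (remote.git) as a stand-in for Github and a placeholder identity; the branch name you see depends on your git configuration.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
# Placeholder identity for the demo commits.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

git init -q --bare remote.git
git clone -q remote.git project
cd project

echo "x <- 1" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"
git push -q -u origin HEAD     # first push; -u records the remote branch to track

# Make a commit but do not push it yet.
echo "y <- 2" >> analysis.R
git add analysis.R
git commit -q -m "Add y"

git status -sb                 # first line ends with [ahead 1]
ahead=$(git rev-list --count '@{u}..HEAD')
echo "commits waiting to be pushed: $ahead"

git push -q                    # back in sync with the remote
```

Here '@{u}' is git shorthand for the remote branch that the current local branch tracks.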

Handling and avoiding Merge conflicts

A merge conflict arises when there are overlapping changes on both the remote and local repositories. If you are the only person working on a repo, this happens most commonly if you make changes on one computer, push those changes to the remote repo, then start working on another computer without first pulling the changes down.

If you have a merge conflict, you’ll be presented with a diff that displays something like this:

<<<<<<< HEAD
.Rproj.*
=======
.Rproj.user
>>>>>>> b9a2932ab61c781986ffda197972885d6528ac2f

The top half (between the <<< and ===) contains what your local copy has; the bottom half (between the === and >>>) contains what the remote repo has.

Your actual file will have this same syntax.

You need to open each file and clean it up, removing the <<<, === and >>> lines and making the code work correctly. Hopefully this is as easy as choosing which version to keep; sometimes, however, you’ll need to work through it much more carefully.
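To see a conflict like the one above from start to finish, here is a sketch that creates one on purpose and then resolves it. It assumes a local bare repository (remote.git) in place of Github, two clones named laptop and desktop to play the two computers, and a placeholder identity.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
# Placeholder identity for the demo commits.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

git init -q --bare remote.git

# First computer: create the file and push it.
git clone -q remote.git laptop
cd laptop
echo ".Rproj" > .gitignore
git add .gitignore
git commit -q -m "Start a .gitignore"
git push -q -u origin HEAD
cd ..

# Second computer: clone while both are still in sync.
git clone -q remote.git desktop

# The laptop changes the line and pushes.
cd laptop
echo ".Rproj.user" > .gitignore
git commit -q -am "Ignore .Rproj.user"
git push -q
cd ../desktop

# The desktop edits the same line without pulling first.
echo ".Rproj.*" > .gitignore
git commit -q -am "Ignore .Rproj.*"

# The pull now stops with a conflict (non-zero exit, hence "|| true").
git pull --no-rebase --no-edit || true
cat .gitignore       # shows the <<<<<<<, ======= and >>>>>>> lines

# Resolve: keep the remote version, then stage, commit and push.
echo ".Rproj.user" > .gitignore
git add .gitignore
git commit -q -m "Merge remote changes to .gitignore"
git push -q
```

Overwriting the file is the simplest possible resolution; in real code you would edit between the markers by hand.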

To avoid this (or at least minimize it), you should:

  1. Pull early and pull often. When you first start to work on changes, pull. If the changes are complicated and require several commits/a period of time, pull occasionally. Make sure your local repo is never behind the remote repo.
  2. If working with multiple authors, divide and conquer. Plan ahead and assign goals to avoid conflicts.
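One way to follow the first piece of advice on the command line is to check whether you are behind before you start working. A sketch, again assuming a local bare repository (remote.git) in place of Github, two clones standing in for two computers, and a placeholder identity:

```shell
set -eu
work=$(mktemp -d)
cd "$work"
# Placeholder identity for the demo commits.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

git init -q --bare remote.git
git clone -q remote.git laptop
cd laptop
echo "x <- 1" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"
git push -q -u origin HEAD
cd ..
git clone -q remote.git desktop

# New work is pushed from the laptop.
cd laptop
echo "y <- 2" >> analysis.R
git commit -q -am "Add y"
git push -q
cd ../desktop

# On the desktop, before starting to work:
git fetch -q                                  # download remote commits without merging
behind=$(git rev-list --count 'HEAD..@{u}')   # commits the remote has that we lack
echo "commits to pull: $behind"
git pull -q --ff-only                         # catch up; refuses if a real merge is needed
```

Because nothing has been changed locally yet, the pull is a simple fast-forward and no merge is needed.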

Conclusion

There’s a lot more about Git out there. It’s very hard to write up a detailed guideline on usage; a demonstration is often better. These notes are just a framework for what I plan on covering in class. Here are a few other resources:


  1. Within reason. See merge conflict below.

  2. You can work directly on Github (or the others) in the remote repository, but it’s of limited use, especially if working on R code.

Josh Errickson