The Git Workflow

ENV 872 - EDA | Spring 2024 | Instructors: Luana Lima, John Fay

Learning Git takes patience and persistence. At first, it will likely seem like a confusing way to just back up files to a cloud file server like Google Drive or Box. Rest assured, learning Git will pay off if you continue coding. For more detailed background on what Git is and why we use it, see Section 1 on Happy With Git. Meanwhile, we’ll introduce typical Git workflows, which may be confusing now but should become clear through repetition by the time this course is complete.

The first section outlines how we start the Git versioning process. The second covers how we continue working with Git. The last covers the special case of integrating an “upstream” repository into our workflow.

Getting started…

1. Create a repository on GitHub.com

A repository is a workspace that holds all your coding files and, if your datasets are small, those as well. A repository should not hold large datasets, however; in fact, any file over 100 MB will cause a headache! (More on that later.) These repositories live on GitHub’s cloud and can be accessed from any machine via a web browser or via the Git software.

You can create a repository in a couple of ways: from scratch or by forking an existing repository. We’ll do the latter in the steps below.

2. Clone your GitHub repository to your local machine

R/RStudio works on files stored on your local machine, so we’ll need to copy the files in our cloud-based GitHub repository to our local machine. This process is called cloning a repository (because the two versions “share the same DNA”), and it is carried out by the Git software. Git is integrated into RStudio, so we can issue the appropriate Git commands to clone our forked repository to our local machine from within RStudio.
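For reference, the clone step can also be run from a command line. The sketch below is only an illustration: so that it runs offline, it uses a local "bare" repository as a stand-in for GitHub, and the folder names (github-repo.git, local-repo) are hypothetical. With a real fork, you would pass the repository's URL from GitHub to git clone instead.

```shell
# Create a scratch folder and a stand-in for the GitHub repository
# (hypothetical names; a real remote would be a URL on GitHub.com).
demo=$(mktemp -d)
git init --quiet --bare "$demo/github-repo.git"

# Clone the "remote" repository into a local working folder.
# Git warns that the repository is empty, which is expected here.
git clone --quiet "$demo/github-repo.git" "$demo/local-repo"

# The clone remembers where it came from under the remote name "origin".
git -C "$demo/local-repo" remote -v
```

The remote name "origin" is Git's default label for the repository you cloned from; later push and pull commands use it to find the remote.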

3. Do your coding

With our repository now on our local machine, we can create, add, and edit files as we would in any coding session.

4. Track and commit changes to your local files

As you make changes to your local files, Git is passively watching you, taking note of any new files you’ve created in or added to your workspace. You can tell Git which of these files you want to track and ultimately back up to GitHub. For files that have been added to Git’s tracking system, Git will tell you what has changed in them since your last commit.

Note the two terms “add” and “commit”. When you “add” a file, you are telling Git to start tracking changes to that file and to include its current changes in your next commit. And when you want to save particular changes to an added file, you “commit” those changes to Git’s ledger. Each commit is tagged with a unique ID (generated by Git, called a “SHA”) and with a short message you write. This allows you to identify specific changes and undo them later.
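As a rough illustration of the add and commit cycle described above, here is a minimal command-line sketch. The file name analysis.R, the commit messages, and the user name and email are made up for the demo; in RStudio the same actions are available from the Git pane.

```shell
# Work in a scratch repository (hypothetical names throughout).
demo=$(mktemp -d)
cd "$demo"
git init --quiet
git config user.name "Demo Student"           # demo identity for commits
git config user.email "student@example.com"

# A new file is noticed by Git but not yet tracked.
echo 'x <- 1' > analysis.R
git status --short        # prints "?? analysis.R" (untracked)

# "add" tells Git to track the file and stage it for the next commit.
git add analysis.R
git status --short        # prints "A  analysis.R" (staged)

# "commit" records the staged changes in Git's ledger with a message.
git commit --quiet -m "Add analysis script"

# Edit the tracked file, review what changed, then add and commit again.
echo 'y <- 2' >> analysis.R
git diff                  # shows the lines changed since the last commit
git add analysis.R
git commit --quiet -m "Define y"

# Each commit appears with its short SHA and message.
git log --oneline
```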

5. Push local changes to your GitHub repository

So far, all the changed files and Git’s tracking of those changes remain on your local machine. Before you stop your coding session (or perhaps sooner), you’ll want to push those changes to your GitHub repository. This makes a secure cloud backup of all your work. It also enables you (or collaborators) working on different machines to easily pull those changes to their own machines.
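The push step might look like the following from a command line. This is a sketch under stated assumptions: a local bare repository stands in for GitHub so the example runs offline, and all names (github-repo.git, local-repo, analysis.R, the demo identity) are hypothetical.

```shell
# Stand-in for GitHub plus a local clone (hypothetical names).
demo=$(mktemp -d)
git init --quiet --bare "$demo/github-repo.git"
git clone --quiet "$demo/github-repo.git" "$demo/local-repo"
cd "$demo/local-repo"
git config user.name "Demo Student"
git config user.email "student@example.com"

# Make and commit a change; at this point it exists only locally.
echo 'x <- 1' > analysis.R
git add analysis.R
git commit --quiet -m "Add analysis script"

# Push the current branch to the remote named "origin".
# -u links the local branch to its remote counterpart for future pushes and pulls.
git push --quiet -u origin HEAD

# The branch line no longer reports any commits "ahead" of origin.
git status --short --branch
```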

Continuing work…

If you continue working on a machine that already has a cloned copy of your remote repository, you don’t need to clone it again. Instead, you can start here:

1. Pull any changes from the remote repository

Before making any changes to your local files, you should make sure your local copy has the latest files saved to the remote repository. This is important! Failure to do this may lead to something called a divergent branch, which will need to be resolved manually.

A divergent branch occurs when you edit a local file, commit those changes, and then try to push them to your remote repository, but the file on the remote repository contains changes that are missing from the local copy you are trying to replace it with.

Git will not assume which changes to keep, so it creates a hybrid file that you have to edit manually! It’s a pain, and worth avoiding simply by getting into the habit of pulling before making any local changes.
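To make the divergent-branch situation concrete, the sketch below reproduces it offline. It assumes two clones of the same repository, labeled "laptop" and "desktop" (hypothetical names, as are the file and identities), with a local bare repository standing in for GitHub. The same line is edited in both clones, so Git cannot merge automatically and leaves conflict markers in the file.

```shell
# Stand-in for GitHub and a first clone (hypothetical names).
demo=$(mktemp -d)
git init --quiet --bare "$demo/github-repo.git"
git clone --quiet "$demo/github-repo.git" "$demo/laptop"
cd "$demo/laptop"
git config user.name "Demo Student"
git config user.email "student@example.com"
echo 'x <- 1' > analysis.R
git add analysis.R
git commit --quiet -m "Add analysis script"
git push --quiet -u origin HEAD

# A second machine clones the repository, edits the file, and pushes.
git clone --quiet "$demo/github-repo.git" "$demo/desktop"
cd "$demo/desktop"
git config user.name "Demo Student"
git config user.email "student@example.com"
echo 'x <- 2' > analysis.R
git commit --quiet -am "Set x to 2 on desktop"   # -a stages tracked files that changed
git push --quiet

# Back on the laptop, the same line is edited WITHOUT pulling first.
cd "$demo/laptop"
echo 'x <- 3' > analysis.R
git commit --quiet -am "Set x to 3 on laptop"

# The push is rejected: the remote has a commit the laptop lacks.
git push --quiet || echo "push rejected: pull first"

# Pulling tries to merge; both sides changed the same line, so it stops.
git pull --quiet --no-rebase || echo "merge conflict: edit analysis.R by hand"
cat analysis.R    # the "hybrid" file, with <<<<<<<, =======, >>>>>>> markers

# Resolve by writing the content you want, then add, commit, and push.
echo 'x <- 3' > analysis.R
git add analysis.R
git commit --quiet -m "Merge desktop changes, keep x <- 3"
git push --quiet
```

Had the laptop pulled before editing, the desktop's change would have arrived first and none of the manual resolution would have been needed.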

2. Do your coding

3. Commit changes

4. Push your changes to the remote repository
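The four steps above can be sketched as a single command-line session. As before, this is an illustration rather than a prescribed procedure: a local bare repository stands in for GitHub, a second clone plays the role of a collaborator, and all file names, messages, and identities are hypothetical.

```shell
# --- Setup so the example runs offline (hypothetical names) ---
demo=$(mktemp -d)
git init --quiet --bare "$demo/github-repo.git"        # stand-in for GitHub
git clone --quiet "$demo/github-repo.git" "$demo/local-repo"
cd "$demo/local-repo"
git config user.name "Demo Student"
git config user.email "student@example.com"
echo 'x <- 1' > analysis.R
git add analysis.R
git commit --quiet -m "Add analysis script"
git push --quiet -u origin HEAD

# A collaborator adds a file to the remote while you are away.
git clone --quiet "$demo/github-repo.git" "$demo/collaborator"
cd "$demo/collaborator"
git config user.name "Demo Collaborator"
git config user.email "collaborator@example.com"
echo 'Notes for the project' > README.md
git add README.md
git commit --quiet -m "Add README"
git push --quiet

# --- A new coding session on the machine that already has a clone ---
cd "$demo/local-repo"

git pull --quiet                      # 1. Pull: README.md arrives from the remote
echo 'y <- 2' >> analysis.R           # 2. Do your coding
git add analysis.R                    # 3. Commit changes
git commit --quiet -m "Define y"
git push --quiet                      # 4. Push to the remote repository

git log --oneline                     # three commits, newest first
```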

Working with “Upstream Repositories”

When you fork a repository, as we do in this class, you’ll want to integrate changes made in the repository from which you forked yours. Before doing this, you’ll need to link the upstream repository to your local repository, which is done with the git remote add command. Then, at the start of any new coding session, you’ll want to first pull any changes from the upstream repository into your local repository, commit those changes locally, and then push them to your remote. Then you can pull