Git-101

Semanur Kapusızoğlu
Published in Analytics Vidhya
7 min read · Nov 19, 2020

Learn more about Git and get started!

Source: https://bit.ly/36PxMAC

“Git” is the best friend and lifesaver of developers. As data science enthusiasts, should it be our best friend too? Definitely yes! In this post, I’ll explain what the hell Git is, why we should use it, and how to get started. In the following posts, we will learn how to manage our projects effectively with this technology.

Recently I started a challenge in which I committed to learning something new about data science every day (the #66daysofdata challenge; you can check out this post for more information: https://bit.ly/3fdOKwd).

That’s how my path crossed with Git. It is a distributed version control system, so it keeps a record of your work every time you “commit” the changes you’ve made. With the help of Git, you can even go back to earlier versions of your work. Since every copy of a repository holds the full history, losing one device does not have to mean losing your data. It’s fast, and through a hosting service your project is accessible from anywhere in the world. You can work in your local repository or with remote repositories. Don’t worry, I’ll pop in some definitions in case you are not familiar with any of these.
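To make “commit” and “go back to earlier versions” a bit more concrete, here is a minimal sketch. It assumes Git is already installed (installation is covered further down), and the folder, file name, and identity are all placeholders invented for the demo.

```shell
# Work in a throwaway folder so nothing on your machine is touched.
demo=$(mktemp -d)
cd "$demo"
git init -q

# Placeholder identity, stored only in this demo repository.
git config user.name "Demo User"
git config user.email "demo@example.com"

# First version of the file, then commit it.
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add first draft"

# Second version, committed on top of the first.
echo "second draft" > notes.txt
git commit -q -am "Revise notes"

# List the recorded versions, newest first.
git log --oneline

# Read the file as it was one commit ago, without changing the working copy.
old=$(git show HEAD~1:notes.txt)
echo "$old"
```

The last command prints the older content of the file, while the file on disk still holds the newer version.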

Are Git and GitHub the same?

NO. GitHub is a code hosting platform for version control and collaboration [1]. Git comes into play when we say “version control”.

Version: Latest saved form of a document.

Version control: Management of changes to documents, programs, websites.

Let’s talk more about version control systems. There are two approaches: centralized and decentralized.

Centralized Version Control Systems

Everyone is connected to one online repository. They all commit changes to it and get updates from it while working on the project. If they lose their internet connection, they cannot commit or update, and any changes they have not yet committed are at risk of being lost.

Source: https://bit.ly/3fdVAld

Repository: A software repository, or “repo” for short, is a storage location for software packages. It can be local or remote. Local means the repository on your own device, and remote means a repository stored on a server, for example on GitHub.

Working copy: The version you are currently working on.

Commit: Saving changes to a repository.

Update: Getting the latest version of a repository.

Decentralized (or Distributed) Version Control Systems

Every user works separately on the repository they have on their own device. They commit changes and save the work to their “local repositories”. If they want to update the main repo, they “push” those changes.

Source: https://bit.ly/35IJ6yR

Push: Uploading the commits you have made locally to a remote repository, for example to your GitHub account.

Pull: Downloading changes from a remote repository. For example, you are working on a project with your friend, and he/she has “pushed” the latest changes to the GitHub repository you are using for the project. You will need to pull those changes so that your local repository stays up to date.
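The push/pull round trip with a friend can be tried out without a GitHub account. In the sketch below, a bare repository in a local folder stands in for the GitHub repository; the folder names (me, friend, hub.git), the file, and the identity are assumptions made up for the demo.

```shell
# Placeholder identity for the demo commits (assumed values, not a real account).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"

# Your local repository with one commit.
git init -q me
echo "step 1" > me/analysis.txt
git -C me add analysis.txt
git -C me commit -q -m "Add step 1"
branch=$(git -C me symbolic-ref --short HEAD)   # "master" or "main", depending on your setup

# A bare repository in a local folder stands in for the GitHub repository.
git init -q --bare hub.git
git -C me remote add origin "$work/hub.git"
git -C me push -q origin "$branch"              # push: upload your commit

# Your friend clones the shared repository, adds a commit, and pushes it.
git clone -q hub.git friend
echo "step 2" >> friend/analysis.txt
git -C friend commit -q -am "Add step 2"
git -C friend push -q origin "$branch"

# pull: download your friend's commit into your local repository.
git -C me pull -q --ff-only origin "$branch"
cat me/analysis.txt
```

After the pull, your copy of analysis.txt contains both lines, including the one your friend added.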

Why is a decentralized VCS a better option than a centralized one?

  • A decentralized VCS (version control system) is faster, because most operations run on your own device.
  • It does not require an internet connection for everyday work such as committing.
  • Changes can be accepted/rejected.
  • No need to contact the main server (remote repository) all the time.

Why should data enthusiasts use Git?

Now that we know more about version control systems and which category Git falls into, we can discuss the reasons why we should use it.

Advantages of Git:

  • More organized teamwork:
    In big teams and projects, sometimes many people work on the same thing and this can lead to confusion & problems. With Git, we can create “branches”, work on the branch, and “merge” after we think it’s ready. This makes it safer to work together.
  • Easier to follow changes throughout time:
    What changed in this version of the project?
    Who committed the changes?
  • Saves space and ensures backup.
  • Flexible, free, fast, allows us to go back to older versions, easy to use, completely distributed, and can support projects of big scale.
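The two questions from the list above (“What changed?” and “Who committed the changes?”) can be answered from the commit history. This is only a small sketch; the teammate names, the file, and the commit messages are hypothetical.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.email "demo@example.com"

# Two commits by two different (hypothetical) teammates.
echo "load data" > pipeline.txt
git add pipeline.txt
git -c user.name="Ayse" commit -q -m "Add data loading step"

echo "clean data" >> pipeline.txt
git -c user.name="Mehmet" commit -q -am "Add cleaning step"

# Who committed what: short hash, author name, commit message.
git log --pretty=format:'%h  %an  %s'

# What changed in the latest version: the files touched and lines added or removed.
git show --stat HEAD
```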

Branch*: There are different types of branches and related operations, and we will discuss these later. But to understand the basic concept, we can use the picture below.

Let’s go back to our previous example. The master branch is the main project that we work on with our friend. We keep working on the project and add changes over time. Then we suddenly realize that we forgot a step in an earlier section, or we want to add some other feature. We go back to that version and create a separate branch to work on those changes. Once we are satisfied with the work, we can combine the changes with master.

Merge*: Combining a branch with another branch or with the master branch.

*As mentioned, we will deal with branching and merging operations in detail in later posts, so do not worry if these concepts seem complicated for now.
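For the curious, the branch-then-merge idea can be sketched in a few commands. The branch name (new-feature) and the files are made up for illustration, and the details are left for the later posts.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"          # placeholder identity for the demo
git config user.email "demo@example.com"

echo "main work" > project.txt
git add project.txt
git commit -q -m "Start project"
main=$(git symbolic-ref --short HEAD)     # "master" or "main", depending on your setup

# Create a separate branch and switch to it.
git checkout -q -b new-feature
echo "extra feature" > feature.txt
git add feature.txt
git commit -q -m "Add extra feature"

# Switch back to the main branch: feature.txt is not there yet.
git checkout -q "$main"
ls

# Merge the branch once the work is ready: feature.txt now appears.
git merge -q new-feature
ls
```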

As data enthusiasts, we should familiarize ourselves with Git because many companies have big projects that are run by teams, and we are likely to take part in those teams. Everyone has specific tasks assigned to them, and we should be able to keep track of the resulting changes. With Git, we can comfortably work on our own copy and push the changes when we feel they are ready.

How to get started?

Now, we will install and initialize Git. GitHub’s webpage suggests we follow these installation steps:

  1. Navigate to the latest Git for Windows installer and download the latest version [2].
  2. Once the installer has started, follow the instructions as provided in the Git Setup wizard screen until the installation is complete. (Do not forget to select “Git Bash” and “Git GUI”.)
  3. Open the Windows command prompt (or Git Bash if you selected not to use the standard Git Windows Command Prompt during the Git installation).
  4. Type git version to verify Git was installed successfully.

Git Bash: A terminal-like command-line environment that lets us manage changes in our project by using a series of different commands (which will be discussed in the next post).

Git GUI: Graphical User Interface for Git.

$ git --version
Right-click on the desktop and you should see “Git Bash” in the menu. Click on it and run the command written above.

Initializing Git for your project

If we want to use Git, installing it will not be enough. We should also initialize it and do some configuration. We only need to set the configuration once; Git will store it and use it whenever necessary. To initialize Git in our folder, we will use the following command:

$ git init
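Before running it, we create a folder for the project and move into it. The first command, mkdir project (make directory), creates a directory called “project”. The second one, cd project (change directory), moves from the Desktop into another directory, which is “project” in our case. Then git init is run inside that folder.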
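The folder setup and the initialization can be sketched together as follows. A temporary folder stands in for the Desktop here, which is an assumption made only so that the sketch runs anywhere.

```shell
# A temporary folder stands in for the Desktop here.
cd "$(mktemp -d)"

mkdir project      # make directory: create a folder called "project"
cd project         # change directory: move into it
git init           # create an empty Git repository in this folder

# Git keeps its records in a hidden ".git" folder inside the project.
ls -a
```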

Yeah, it’s that easy! Now that we have initialized Git, let’s go ahead and do the configuration.

$ git config --global user.name "your_GitHub_username"
$ git config --global user.email "your_GitHub_email"
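These settings tell Git which name and email address to record on every commit you make. Using the email address registered with your GitHub account lets GitHub attribute your commits to you once they are pushed. But do not worry, as long as you do not “push” the changes, they will not appear in the GitHub repository. Your work will remain local.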

Notice the change: “(master)” has now appeared in the upper bar. This means we have successfully initialized Git.

By using the following command, we can make sure Git saved our configuration settings.

$ git config --list
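The --global flag applies a setting to every repository on the machine. If a single project needs a different identity, the same commands without --global store the values for that repository only. The sketch below shows this in a throwaway folder; “Jane Doe” and her email address are placeholders.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q

# Without --global, the values are stored in this repository's .git/config only.
git config user.name "Jane Doe"
git config user.email "jane@example.com"

# Read back a single setting instead of the whole list.
git config --get user.name

# Show only the repository-level settings.
git config --local --list
```

A repository-level value takes precedence over the global one, so commits made in this folder would be recorded under the placeholder name.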

By typing the following command, we can see our “status” which will be explained in the following post.

Status: The difference between the last saved version and the copy you are currently working on.

$ git status
As we haven’t added any files or committed any changes, it reports that there is nothing to commit. A typical data science project will include a dataset, code, etc. In the next post, we will see how to add files so that Git tracks them.
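As a small preview, here is what the status looks like once a file exists in the folder but has not been added yet. The file name data.csv is just a stand-in for a dataset.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q

# A new file that Git has not been told about yet.
echo "id,value" > data.csv

# Full report: data.csv is listed under "Untracked files".
git status

# Compact report: "??" marks an untracked file.
git status --short
```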

Now we will stop here and let these sink in. In the next post, we will learn about:

  • opening projects and adding files,
  • staging changes,
  • committing,
  • going back to older versions,
  • branching & merging,
  • pushing & pulling changes to and from our repository,
  • cloning another repository, and
  • some common conventions/good practices.

It might take time to get used to Git and work with it comfortably. But once you do, it will surely make your life a lot easier. Here are some resources to go through before going further. Stay tuned for the next post, I’ll see you there!

Some useful resources to check out before we dive deeper:


Hi! I’m an industrial engineer passionate about data science and machine learning. I’m here because the best way to learn something is to teach it. Hope you enjoy!