What Happens on GitLab When You do git push?
Ever curious about how Git and GitLab work behind the scenes? Now’s the time to pick up your beloved IDE and embark on a journey to explore the inner workings of these tools with me!
The Basics
Before we set off on our journey, let’s stock up on some background knowledge in less than five minutes. Ready, set, go!
Inside a Git Repository
Projects using Git have a hidden .git folder at their root, which holds all the information saved by Git. Here are the parts we’ll be focusing on this time:
.git
├── HEAD # Current branch (ref) the workspace is on
├── objects # Git objects, from which Git can reconstruct the entire commit history and files at any point in time
│   ├── 20 # Loose objects, organized into folders based on the first byte of their hash to avoid too many files in a single directory
│   │   └── 7151a78fb5e2d99f1185db7ebbd7d883ebde6c
│   ├── 43 # Another set of loose objects
│   │   └── 49b682aeaf8dc281c7a7c8d8460f443835c0c2
│   └── pack # Compressed objects
└── refs # Branches, with file contents being the hash of a commit
    ├── heads
    │   ├── feat
    │   │   └── hello-world # A feature branch
    │   └── main # Main branch
    ├── remotes
    │   └── origin
    │       └── HEAD # Local record of the remote branch
    └── tags # Tags, with file contents being the hash of a commit
On the server side, Git only stores the contents of the .git folder (referred to as a bare repository). git clone pulls this information from the remote and reconstructs the repository at the HEAD state on your local machine, while git push sends your local refs along with their related commit, tree, and file objects to the remote.
Git compresses objects for network transmission, resulting in a packfile.
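To make the object store a little more concrete, here is a minimal Python sketch of how a file’s object ID and its loose-object path are derived. It assumes the default SHA-1 object format; the function names are illustrative and not part of Git.

```python
import hashlib
import zlib


def blob_object_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a file (blob) with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw content.
    header = b"blob " + str(len(content)).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Map an object ID to its location under .git/objects."""
    # The first byte (two hex characters) of the hash becomes the folder name.
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def loose_object_bytes(content: bytes) -> bytes:
    """Return the zlib-compressed bytes stored on disk for a loose blob."""
    header = b"blob " + str(len(content)).encode() + b"\x00"
    return zlib.compress(header + content)


object_id = blob_object_id(b"test content\n")
print(object_id)
print(loose_object_path(object_id))
```

Decompressing the stored bytes with zlib gives back the header and the original content, which is how Git can rebuild any file from its hash alone.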
Git Transfer Protocol
Let’s go through what happens during a git push, step by step:
- The user runs git push on the client side.
- The client’s Git git-send-pack service, with the repository identifier, calls the server’s git-receive-pack service.
- The server returns the current commit hash for each ref in the repository, each as 40 characters of hex-encoded text, looking something like this:
001f# service=git-receive-pack
000000c229859bcc73cdab4db2b70ed681077a5885f80134 refs/heads/main\x00report-status report-status-v2 delete-refs side-band-64k quiet atomic ofs-delta push-options object-format=sha1 agent=git/2.37.1.gl1
0000
Here, we see that the server’s main branch is at 29859bcc73cdab4db2b70ed681077a5885f80134 (ignoring the protocol framing, such as the 00c2 length prefix directly in front of the hash).
- The client identifies commits it has that the server does not, informing the server about the refs to be updated:
009f0000000000000000000000000000000000000000 8fa91ae7af0341e6524d1bc2ea067c99dff65f1c refs/heads/feat/hello-world
In this example, we’re pushing a new branch feat/hello-world, currently pointing to 8fa91ae7af0341e6524d1bc2ea067c99dff65f1c. Since it’s a new branch, its previous point is represented as 0000000000000000000000000000000000000000.
- The client packs the relevant commits along with their tree and file objects into a compressed packfile and sends it to the server. A packfile is binary:
report-status side-band-64k agent=git/2.20.10000PACK\x00\x00\x00\x02\x00\x00\x00\x03\x98\x0cx\x9c\x8d\x8bI
\xc30\x0c\x00\xef~\x85\xee\x85"[^$(\xa5_\x91m\x85\xe6\xe0\xa4\x04\xe7\xff]^\xd0\xcb0\x87\x99y\x98A\x11\xa5\xd8\xab,\xbdSA]Z\x15\xcb(\x94|4\xdf\x88\x02&\x94\xa0\xec^z\xd86!\x08'\xa9\xad\x15j]\xeb\xe7\x0c\xb5\xa0\xf5\xcc\x1eK\xd1\xc4\x9c\x16FO\xd1\xe99\x9f\xfb\x01\x9bn\xe3\x8c\x01n\xeb\xe3\xa7\xd7aw\xf09\x07\xf4\\\x88\xe1\x82\x8c\xe8\xda>\xc6:\xa7\xfd\xdb\xbb\xf3\xd5u\x1a|\xe1\xde\xac\xe29o\xa9\x04x\x9c340031Q\x08rut\xf1u\xd5\xcbMap\xf6\xdc\xd6\xb4n}\xef\xa1\xc6\xe3\xcbO\xdcp\xe3w\xb10=p\xc8\x10\xa2(%\xb1$U\xaf\xa4\xa2\x84\xa1T\xe5\x8eO\xe9\xcf\xd3\x0c\\R\x7f\xcf\xed\xdb\xb9]n\xd1\xea3\xa2\x00\xd3\x86\x1db\xbb\x02x\x9c\x01+\x00\xd4\xff2022\xe5\xb9\xb4 09\xe6\x9c\x88 01\xe6\x97\xa5 \xe6\x98\x9f\xe6\x9c\x9f\xe5\x9b\x9b 15:52:13 CST
\xa4d\x11\xa1\xe8\x86\xdeQ\x90\xb1\xe0Z\xfd\x7f\x91\x90\xc3\xd6\x17\xe8\x02&K\xd0
- The server unpacks the packfile, updates the ref, and returns the result of the process:
003a\x01000eunpack ok
0023ok refs/heads/feat/hello-world
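The four hex digits in front of each line in the dumps above are a length prefix that counts itself. As a rough Python sketch (the function names are made up for illustration, and the capability strings are shortened samples), this is how a client could read the server’s ref list and frame its own update command:

```python
FLUSH_PKT = b"0000"  # a flush packet carries no payload


def encode_pkt_line(payload: bytes) -> bytes:
    """Prefix a payload with its 4-digit hex length, which includes the prefix."""
    return f"{len(payload) + 4:04x}".encode() + payload


def parse_pkt_lines(data: bytes) -> list:
    """Split a byte stream into payloads; a flush packet is returned as None."""
    packets, pos = [], 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:
            packets.append(None)
            pos += 4
        else:
            packets.append(data[pos + 4:pos + length])
            pos += length
    return packets


def parse_ref_advertisement(data: bytes) -> dict:
    """Return {ref name: commit hash} from the server's advertisement."""
    refs = {}
    for payload in parse_pkt_lines(data):
        if payload is None or payload.startswith(b"#"):
            continue  # skip flush packets and the "# service=..." banner
        # On the first ref line, capabilities follow a NUL byte.
        line = payload.split(b"\x00", 1)[0].rstrip(b"\n")
        commit, ref = line.split(b" ", 1)
        refs[ref.decode()] = commit.decode()
    return refs


def update_command(old_id: str, new_id: str, ref: str, capabilities: str) -> bytes:
    """Build the client's '<old> <new> <ref>' line, with capabilities after a NUL."""
    line = f"{old_id} {new_id} {ref}".encode()
    return encode_pkt_line(line + b"\x00" + capabilities.encode())


advertisement = (
    encode_pkt_line(b"# service=git-receive-pack\n")
    + FLUSH_PKT
    + encode_pkt_line(
        b"29859bcc73cdab4db2b70ed681077a5885f80134 refs/heads/main"
        b"\x00report-status delete-refs side-band-64k\n"
    )
    + FLUSH_PKT
)
print(parse_ref_advertisement(advertisement))

command = update_command(
    "0" * 40,  # all zeros: the branch does not exist on the server yet
    "8fa91ae7af0341e6524d1bc2ea067c99dff65f1c",
    "refs/heads/feat/hello-world",
    " report-status side-band-64k agent=git/2.20.1",
)
print(command[:4])
```

The banner line encodes to a 001f prefix and the update command to 009f, matching the prefixes visible in the captured traffic.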
The Git transfer protocol can be carried over SSH or HTTP(S).
Pretty straightforward, right?
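The packfile header and the server’s final report are simple enough to decode by hand. The Python sketch below is only an illustration with made-up function names: it reads the 12-byte packfile header (the PACK signature, a version, and an object count) and unwraps the result, which arrives as length-prefixed lines nested inside a side-band packet whose first byte is the band number.

```python
import struct


def parse_pack_header(data: bytes) -> tuple:
    """Return (version, object count) from the 12-byte packfile header."""
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a packfile")
    return version, count


def parse_pkt_lines(data: bytes) -> list:
    """Split a byte stream into payloads; a flush packet is returned as None."""
    packets, pos = [], 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:
            packets.append(None)
            pos += 4
        else:
            packets.append(data[pos + 4:pos + length])
            pos += length
    return packets


def parse_push_result(response: bytes) -> tuple:
    """Return (unpack succeeded, {ref: status}) from a side-band response."""
    report = b""
    for payload in parse_pkt_lines(response):
        if payload and payload[0] == 1:  # band 1 carries the status report
            report += payload[1:]
    lines = [p.rstrip(b"\n").decode() for p in parse_pkt_lines(report) if p]
    statuses = {}
    for line in lines[1:]:
        status, ref = line.split(" ", 2)[:2]  # "ok <ref>" or "ng <ref> <reason>"
        statuses[ref] = status
    return bool(lines) and lines[0] == "unpack ok", statuses


# The first 12 bytes of the packfile shown above: version 2, 3 objects.
print(parse_pack_header(b"PACK\x00\x00\x00\x02\x00\x00\x00\x03"))

response = (
    b"003a\x01000eunpack ok\n0023ok refs/heads/feat/hello-world\n0000"
    + b"0000"
)
print(parse_push_result(response))
```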
Components of GitLab
GitLab is a popular code hosting service that supports collaborative development, issue tracking, CI/CD, and more.
GitLab’s services aren’t monolithic. Let’s use version 15 as an example; components related to git push include:
- GitLab: Developed in Ruby, it consists of two parts: the web/API services (hereafter referred to as Rails) and the job queue/background jobs (referred to as Sidekiq).
- Gitaly: Developed in Go, it’s the Git storage backend for GitLab, responsible for storing and accessing Git repositories, exposing various Git operations as gRPC calls. Initially, Rails operated on Git repositories directly by running Git commands over NFS, but this became inefficient at scale, leading to the development of Gitaly.
- Workhorse: Developed in Go, acting as a reverse proxy for Rails, handling “slow” HTTP requests like Git push/pull, file uploads/downloads. It offloads processing from Rails to handle resource-intensive operations more efficiently.
- GitLab Shell: Developed in Go, it manages Git SSH connections, serving as a bridge between the user’s Git client and Gitaly.
- GitLab Runner: Also developed in Go, responsible for running CI/CD jobs.
- Data is stored in Postgres, with Redis used for caching. Rails and Sidekiq directly interact with the database and cache, while other components communicate through APIs implemented in Rails.
The Journey of git push
Now that you’re equipped with the basics, let’s dive into the adventure!
Do You Prefer SSH?
If your remote URL looks like git@gitlab.example.com:user/repo.git, you’re communicating with GitLab over SSH. When you execute git push, your Git client’s send-pack service essentially opens an SSH connection as the git user and asks the server to run git-receive-pack for user/repo.git.
There are interesting questions to ponder here:
- Everyone’s username is git. How does the server distinguish who is who?
- SSH? Can I run arbitrary commands on the server?
These questions are addressed by GitLab Shell’s gitlab-sshd, a customized SSH daemon speaking the same protocol as sshd. During the SSH handshake, the client provides its public key, and gitlab-sshd verifies it against the public keys registered in GitLab through an internal API call, thus authenticating the user.
Moreover, gitlab-sshd restricts the commands clients can execute, using the attempted command to decide which method to run. Any non-matching commands are rejected.
Sadly, it seems we can’t run bash or rm -rf / on GitLab’s servers via SSH.
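The command restriction can be pictured as a small allow-list check. The Python sketch below is an assumption-laden illustration, not gitlab-sshd’s actual code (which is written in Go and accepts more commands than the three listed here):

```python
import shlex

# Assumed allow-list for illustration only; the real daemon supports more commands.
ALLOWED_COMMANDS = {"git-receive-pack", "git-upload-pack", "git-upload-archive"}


def parse_ssh_command(original_command: str) -> tuple:
    """Return (command, repository path), or raise PermissionError."""
    try:
        args = shlex.split(original_command)
    except ValueError as exc:  # for example, unbalanced quotes
        raise PermissionError("malformed command") from exc
    if len(args) != 2 or args[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowed: {original_command!r}")
    return args[0], args[1]


print(parse_ssh_command("git-receive-pack 'user/repo.git'"))

for attempt in ("bash", "rm -rf /"):
    try:
        parse_ssh_command(attempt)
    except PermissionError as error:
        print(error)
```

Because the requested string is only ever parsed and matched, never handed to a shell, anything outside the list is rejected before it can run.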
In the past, GitLab indeed used sshd to handle Git requests. To solve the aforementioned problems, its authorized_keys file looked something like this:
# Managed by gitlab-rails
command="/bin/gitlab-shell key-1",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAIEAiPWx6WM4lhHNedGfBpPJNPpZ7yKu+dnn1SJejgt1016k6YjzGGphH2TUxwKzxcKDKKezwkpfnxPkSMkuEspGRt/aZZ9wa++Oi7Qkr8prgHc4soW6NUlfDzpvZK2H5E7eQaSeP3SAwGmQKUFHCddNaP0L+hM7zhFNzjFvpaMgJw0=
command="/bin/gitlab-shell key-2",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAIEAiPWx6WM4lhHNedGfBpPJNPpZ7yKu+dnn1SJejgt1026k6YjzGGphH2TUxwKzxcKDKKezwkpfnxPkSMkuEspGRt/aZZ9wa++Oi7Qkr8prgHc4soW6NUlfDzpvZK2H5E7eQaSeP3SAwGmQKUFHCddNaP0L+hM7zhFNzjFvpaMgJw0=
Yes, all users’ public keys were in this file, potentially reaching sizes of hundreds of MBs!
The command parameter overrides any command the SSH client attempts to run, instead launching gitlab-shell with the public key ID as the parameter. gitlab-shell then executes the intended operation based on the SSH_ORIGINAL_COMMAND environment variable.
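sshd scans authorized_keys from top to bottom, so users whose keys sat near the end of the file waited longer to authenticate. With the transition to gitlab-sshd, which relies on a Rails API backed by Postgres indexing, this performance disparity between users based on their registration order is no longer an issue.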
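The difference between the two lookup strategies can be sketched in a few lines of Python. The entry format is simplified (fake key bodies, no trailing comment) and a dictionary stands in for the indexed database column; both are assumptions made for illustration.

```python
import re

# Matches the key ID inside the forced command, e.g. command="/bin/gitlab-shell key-2".
KEY_ID = re.compile(r'command="\S+ (key-\d+)"')


def parse_entry(line: str) -> tuple:
    """Split a simplified entry into (key ID, key body); assumes no trailing comment."""
    options, _key_type, key_body = line.rsplit(" ", 2)
    return KEY_ID.search(options).group(1), key_body


def find_key_linear(lines: list, key_body: str) -> tuple:
    """Scan entries in file order; return (key ID, number of entries read)."""
    for entries_read, line in enumerate(lines, start=1):
        key_id, candidate = parse_entry(line)
        if candidate == key_body:
            return key_id, entries_read
    return None, len(lines)


def build_index(lines: list) -> dict:
    """Map key body -> key ID, standing in for an indexed database column."""
    return {body: key_id for key_id, body in map(parse_entry, lines)}


lines = [
    f'command="/bin/gitlab-shell key-{n}",no-pty ssh-rsa FAKEKEY{n}'
    for n in range(1, 1001)
]
print(find_key_linear(lines, "FAKEKEY1"))     # found after reading one entry
print(find_key_linear(lines, "FAKEKEY1000"))  # found only after reading every entry
print(build_index(lines)["FAKEKEY1000"])      # one lookup, regardless of position
```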
After authenticating the user, gitlab-sshd checks the user’s write access to the target repository and retrieves information about the repository’s Gitaly instance, user ID, and repository details.
Finally, gitlab-sshd calls the corresponding Gitaly instance’s SSHReceivePack method, acting as a relay and translator between the Git client (SSH) and Gitaly (gRPC).
From a broader perspective, a git push over SSH follows this flow:
- The user executes git push.
- The Git client establishes an SSH connection to gitlab-shell.
- gitlab-shell queries the Rails internal API for public key verification (GET /api/v4/internal/authorized_keys) and write permission (POST /api/v4/internal/allowed).
- The API responds with the Gitaly address, authentication token, repository object, and hook callback information.
- gitlab-shell relays this information to Gitaly’s SSHReceivePack method.
- Gitaly executes git-receive-pack in the appropriate working directory, setting environment variables (GITALY_HOOKS_PAYLOAD) for Git hooks.
- The server-side Git attempts to update refs and execute Git hooks.
- Completion.
The details about Gitaly and Git hooks will be covered next.
Do You Prefer HTTP(S)?
The remote URL for HTTP(S) looks like https://gitlab.example.com/user/repo.git. Unlike SSH, HTTP requests are stateless and always follow a request-response pattern. When you perform a git push, the Git client interacts with two endpoints in sequence:
- GET https://gitlab.example.com/user/repo.git/info/refs?service=git-receive-pack: The server returns the current commit hashes of all branches in the repository in the response body.
- POST https://gitlab.example.com/user/repo.git/git-receive-pack: The client submits the branches to be updated along with their old commit hashes and new commit hashes in the body, including the required packfile. The server returns the result of the operation in the body, along with the familiar prompt to “create a merge request”:
003a\x01000eunpack ok
0023ok refs/heads/feat/hello-world
00000085\x02
To create a merge request for feat/hello-world, visit:
https://gitlab.example.com/user/repo/-/merge_requests/new?merge_request%5Bs0029\x02ource_branch%5D=feat%2Fhello-world
0000
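For readers who want to see the shape of these two requests, here is a small Python sketch that builds them without sending anything. The content types are the ones Git’s smart HTTP protocol uses for receive-pack; the helper name and the placeholder body are made up for this example, and authentication headers are left out.

```python
import urllib.request


def build_push_requests(repo_url: str, body: bytes) -> tuple:
    """Build (but do not send) the two HTTP requests a push consists of."""
    # Step 1: ask the server which refs it has and which capabilities it supports.
    info_refs = urllib.request.Request(
        f"{repo_url}/info/refs?service=git-receive-pack",
        method="GET",
    )
    # Step 2: send the ref update commands followed by the packfile.
    receive_pack = urllib.request.Request(
        f"{repo_url}/git-receive-pack",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-git-receive-pack-request",
            "Accept": "application/x-git-receive-pack-result",
        },
    )
    return info_refs, receive_pack


info_refs, receive_pack = build_push_requests(
    "https://gitlab.example.com/user/repo.git",
    b"<commands and packfile go here>",  # placeholder, not a valid request body
)
print(info_refs.get_method(), info_refs.full_url)
print(receive_pack.get_method(), receive_pack.full_url)
```

Each request stands alone, which is why the server has to authenticate and locate the repository twice during a single push.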
Both requests are intercepted by Workhorse, which does two things each time:
- Forwards the request as-is to Rails, which returns the authentication result, user ID, and information about the repository’s corresponding Gitaly instance (interesting, right? The info/refs and git-receive-pack endpoints of Rails are actually used for authentication, which I guess has some historical reasons behind it).
- Based on the information returned by Rails, Workhorse establishes a connection with Gitaly, acting as a relay between the client and Gitaly.
In summary, a git push via HTTP(S) goes like this:
- The user executes git push.
- The Git client calls GET https://gitlab.example.com/user/repo.git/info/refs?service=git-receive-pack, including the corresponding authorization header.
- Workhorse intercepts the request, forwards it to Rails as-is, and obtains the authentication result, user ID, and information about the repository’s corresponding Gitaly instance.
- Based on Rails’ response, Workhorse calls Gitaly’s gRPC service InfoRefsReceivePack, acting as a relay between the client and Gitaly.
- Gitaly runs git-receive-pack in the appropriate working directory and returns ref information.
- The Git client calls POST https://gitlab.example.com/user/repo.git/git-receive-pack.
- Workhorse intercepts the request, forwards it to Rails as-is, and obtains the authentication result, user ID, and information about the repository’s corresponding Gitaly instance.
- Based on Rails’ response, Workhorse calls Gitaly’s gRPC service PostReceivePack, acting as a relay between the client and Gitaly.
- Gitaly runs git-receive-pack in the appropriate working directory, setting environment variables like GITALY_HOOKS_PAYLOAD, which includes GL_ID, GL_REPOSITORY, etc.
- The server-side Git attempts to update refs and execute Git hooks.
- Completion.
Gitaly and Git Hooks
After traversing through the connection layer and authorization checks, we’re now closer to GitLab’s heart: Gitaly.
Gitaly’s name humorously blends “Git” with “Aly”, a nearly uninhabited Russian village: just as Aly has a minimal population, Gitaly aims for minimal disk IO operations. This playful naming underscores the developers’ goal of efficiency and minimalism.
Gitaly is responsible for managing Git repositories in GitLab, executing Git binary operations through fork/exec, and using cgroups to ensure these processes don’t consume excessive CPU and memory resources. Repositories are stored locally (e.g. /var/opt/gitlab/git-data/repositories/@hashed/b1/7e/b17ef6d19c7a5b1ee83b907c595526dcb1eb06db8227d650d5dda0a9f4ce8cd9.git), and Gitaly handles the intricate operations of managing these repositories.
When you perform a git push, Gitaly’s SSHReceivePack (for SSH) and PostReceivePack (for HTTPS) methods come into play. These methods ultimately rely on Git’s git-receive-pack, which means the core updates to refs and objects are managed by the Git binary itself. Git’s git-receive-pack offers hooks that allow Gitaly to integrate into this process, interacting with Rails for certain operations.
graph LR
Gitaly[1,4. Gitaly<br>] --exec/fork--> git[2. Git<br>git-receive-pack]
git --execute hooks--> hooks[3. gitaly-hooks<br>Binary acting as hook execution target]
hooks --unix socket + gRPC + parameters provided by Git and Gitaly--> Gitaly
Gitaly --API queries for permissions and ledger updates--> Workhorse[5. Workhorse]
Workhorse --proxies request as-is to--> Rails[6. Rails]
When Gitaly starts git-receive-pack, it passes a Base64-encoded JSON via the GITALY_HOOKS_PAYLOAD environment variable, containing repository information, the Gitaly Unix socket address and connection token, user information, and which hooks to execute (for git push, it’s always the ones described below). It also sets Git’s core.hooksPath parameter to a temporary directory prepared by Gitaly at runtime, where all hook files are symlinked to gitaly-hooks.
After being started by git-receive-pack, gitaly-hooks reads GITALY_HOOKS_PAYLOAD from the environment, connects back to Gitaly through the Unix socket and gRPC, and informs Gitaly of the currently executed hook and the parameters provided by Git to the hook.
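Passing structured data through an environment variable as Base64-encoded JSON is easy to picture with a short Python sketch. Every field name below is a hypothetical placeholder chosen for illustration, not Gitaly’s real payload schema.

```python
import base64
import json


def encode_hooks_payload(payload: dict) -> str:
    """Serialize a payload to JSON, then Base64, so it fits in an environment variable."""
    return base64.b64encode(json.dumps(payload).encode()).decode()


def decode_hooks_payload(value: str) -> dict:
    """Reverse of encode_hooks_payload: what the hook binary would do on startup."""
    return json.loads(base64.b64decode(value))


# All field names and values are hypothetical placeholders.
payload = {
    "repository": "user/repo",
    "socket_path": "/path/to/gitaly.socket",
    "token": "secret-token",
    "user_id": "user-1",
    "requested_hooks": ["pre-receive", "update", "post-receive"],
}

# The parent process would place this in the child's environment.
child_env = {"GITALY_HOOKS_PAYLOAD": encode_hooks_payload(payload)}

# The hook process reads it back and knows where to connect and what to report.
restored = decode_hooks_payload(child_env["GITALY_HOOKS_PAYLOAD"])
print(restored["requested_hooks"])
```

Base64 keeps the JSON free of characters that are awkward in environment variables, at the cost of roughly a third more bytes.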
pre-receive hook
This hook is triggered once Git receives a git push request, before any updates are made. It receives change information via stdin in the format of <old commit ref hash> <new commit ref hash> <ref name>, one per line.
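A hook that consumes this input could parse it along the following lines. This is a minimal Python sketch with illustrative names; it assumes SHA-1 hashes, where 40 zeros mean “no commit”.

```python
from typing import NamedTuple

ZERO_ID = "0" * 40  # stands for "no commit" on either side of a change


class RefChange(NamedTuple):
    old_id: str
    new_id: str
    ref: str

    @property
    def action(self) -> str:
        """Classify the change by which side is the all-zero hash."""
        if self.old_id == ZERO_ID:
            return "create"
        if self.new_id == ZERO_ID:
            return "delete"
        return "update"


def parse_changes(stdin_text: str) -> list:
    """Parse '<old> <new> <ref>' lines as a pre-receive hook receives them."""
    changes = []
    for line in stdin_text.splitlines():
        if not line.strip():
            continue
        old_id, new_id, ref = line.split(" ", 2)
        changes.append(RefChange(old_id, new_id, ref))
    return changes


sample = (
    f"{ZERO_ID} 8fa91ae7af0341e6524d1bc2ea067c99dff65f1c refs/heads/feat/hello-world\n"
    f"29859bcc73cdab4db2b70ed681077a5885f80134 {ZERO_ID} refs/heads/old-branch\n"
)
for change in parse_changes(sample):
    print(change.action, change.ref)
```

Knowing whether a ref is being created, updated, or deleted is the kind of detail a server needs for decisions such as branch protection.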
Upon receiving this information, Gitaly makes two calls to Rails:
- POST /api/v4/internal/allowed: This API endpoint was already invoked during the connection layer’s authorization step. This time, it includes change information, allowing Rails to make more granular decisions like enforcing branch protection rules.
- POST /api/v4/internal/pre_receive: This notifies Rails that the repository is about to be updated, incrementing a reference counter for the repository to prevent disruptive changes during the push process.
If POST /api/v4/internal/allowed returns an error, Gitaly will relay this error back to gitaly-hooks, which will write the error message to standard error and exit with a non-zero exit code. The error message will be collected by git-receive-pack and written to its standard error. The non-zero exit code from gitaly-hooks will cause git-receive-pack to stop processing the current git push and exit with a non-zero exit code as well, returning control to Gitaly. Gitaly then collects the standard error output from git-receive-pack and responds to Workhorse/GitLab Shell with a gRPC response.
The observant reader might wonder: if the process is halted at this point, what happens to the objects that were already uploaded to the server by the time the hooks ran?
Actually, objects associated with an incomplete git push are first written to a quarantine environment, stored in a subfolder under objects, such as incoming-8G4u9v. This way, if the hooks determine there’s a problem with the push, the related resources can be easily cleaned up.
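The idea of quarantine can be imitated in a few lines of Python. This is a toy model under simplifying assumptions (flat file names, and a callback standing in for the hooks), not how Git implements it internally:

```python
import shutil
import tempfile
from pathlib import Path


def receive_with_quarantine(objects_dir: Path, incoming: dict, accept) -> bool:
    """Stage objects in objects/incoming-XXXXXXXX, then migrate or discard them."""
    quarantine = Path(tempfile.mkdtemp(prefix="incoming-", dir=objects_dir))
    try:
        for name, data in incoming.items():
            (quarantine / name).write_bytes(data)
        if not accept(incoming):  # stands in for the pre-receive checks
            return False
        for staged in quarantine.iterdir():
            staged.rename(objects_dir / staged.name)  # migrate into the main store
        return True
    finally:
        # Whether accepted or rejected, the quarantine folder itself goes away.
        shutil.rmtree(quarantine, ignore_errors=True)


with tempfile.TemporaryDirectory() as tmp:
    objects = Path(tmp)
    rejected = receive_with_quarantine(objects, {"abc123": b"data"}, lambda _: False)
    print(rejected, sorted(p.name for p in objects.iterdir()))
    accepted = receive_with_quarantine(objects, {"abc123": b"data"}, lambda _: True)
    print(accepted, sorted(p.name for p in objects.iterdir()))
```

A rejected push leaves the main object folder untouched, because cleaning up means deleting a single directory.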
update hook
Git invokes this hook once per ref, just before that ref is updated. It doesn’t currently interact with Rails, but it gives the server a last per-ref chance to reject an update.
GitLab also supports custom Git hooks, which can be triggered at this stage, allowing administrators to integrate additional checks or operations during the push process.
post-receive hook
After all refs have been successfully updated, the post-receive hook is triggered, sending the same change information as the pre-receive hook to Gitaly.
Gitaly then informs Rails through the POST /api/v4/internal/post_receive endpoint. Rails performs several actions in response:
- Suggests creating a Merge Request for the updated branch.
- Decrements the repository’s reference counter.
- Refreshes repository cache.
- Triggers CI pipelines.
- Sends notification emails if applicable.
Some of these actions are asynchronous, managed by Sidekiq to ensure efficient processing without blocking the post-receive hook’s execution.
Conclusion
You’ve now journeyed through the entire process of git push from the client side to the server side, uncovering the intricate details of GitLab’s handling of Git operations.
Here’s our end-of-journey treasure!
graph LR
push-ssh(git push via SSH)
push-http(git push via HTTP)
push-ssh --SSH--> gitlab-sshd
push-http --HTTP--> workhorse
gitlab-sshd --Via Workhorse Proxy:<br>Public Key Verification/Write Permission Check--> rails[Rails]
workhorse --gRPC--> gitaly[Gitaly]
workhorse --Write Permission Check--> rails
gitlab-sshd --gRPC--> gitaly
gitaly --fork/exec--> git
git --Read and Write--> disk[Physical Disk]
git --Hooks--> gitaly-hooks
gitaly-hooks --Unix socket + gRPC--> gitaly
gitaly --Write Permission Verification<br>pre-receive Hook<br>post-receive Hook--> rails
rails --Job Delegation--> sidekiq[Sidekiq]
rails --Cache Refresh--> redis[Redis]
sidekiq --Triggers CI--> runner[GitLab Runner]