What Happens on GitLab When You Do git push?

    Ever curious about how Git and GitLab work behind the scenes? Now’s the time to pick up your beloved IDE and embark on a journey to explore the inner workings of these tools with me!

    The Basics

    Before we set off on our journey, let’s stock up on some background knowledge in less than five minutes. Ready, set, go!

    Inside a Git Repository

    Projects using Git have a .git folder at their root (hidden), which holds all the information saved by Git. Here are the parts we’ll be focusing on this time:

    .git
    ├── HEAD # Current branch (ref) the workspace is on
    ├── objects # Git objects, from which Git can reconstruct the entire commit history and files at any point in time
    │   ├── 20 # Loose objects, organized into folders based on the first byte of their hash to avoid too many files in a single directory
    │   │   └── 7151a78fb5e2d99f1185db7ebbd7d883ebde6c
    │   ├── 43 # Another set of loose objects
    │   │   └── 49b682aeaf8dc281c7a7c8d8460f443835c0c2
    │   └── pack # Compressed objects
    └── refs # Branches, with file contents being the hash of a commit
        ├── heads
        │   ├── feat
        │   │   └── hello-world # A feature branch
        │   └── main # Main branch
        ├── remotes
        │   └── origin
        │       └── HEAD # Local record of the remote branch
        └── tags # Tags, with file contents being the hash of a commit
    
    Git data model
    The red parts are provided by refs, the rest by objects. Commit objects (yellow) point to tree objects (blue) that represent file structures, which in turn point to individual file objects (grey). Image/Pro Git on git-scm.com
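To make the link between an object's hash and its place under objects concrete, here is a minimal sketch of how a loose object's ID and path can be derived. It assumes a SHA-1 repository; the function names are made up for this illustration and are not part of Git.

```python
import hashlib
from pathlib import PurePosixPath


def git_object_id(obj_type: str, content: bytes) -> str:
    """Hash "<type> <size>\\0<content>", which is how Git names an object."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> PurePosixPath:
    """The first two hex digits (one byte) become the folder, the rest the file name."""
    return PurePosixPath(".git/objects") / object_id[:2] / object_id[2:]


blob_id = git_object_id("blob", b"hello world\n")
print(blob_id)
print(loose_object_path(blob_id))
```

The printed ID should match what git hash-object reports for a file containing the same bytes.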

    On the server side, Git stores only the contents of the .git folder (this is called a bare repository). git clone pulls this information from the remote and reconstructs the repository at the HEAD state on your local machine, while git push sends your local refs, along with their related commit, tree, and file objects, to the remote.

    Git compresses objects for network transmission, resulting in a packfile.

    Git Transfer Protocol

    Let’s go through what happens during a git push, step by step:

    1. The user runs git push on the client side.
    2. The client’s git-send-pack process calls the server’s git-receive-pack service, passing along the repository identifier.
    3. The server returns the current commit hash for each ref in the repository, each as a 40-character hex-coded text, looking something like this:
    001f# service=git-receive-pack
    000000c229859bcc73cdab4db2b70ed681077a5885f80134 refs/heads/main\x00report-status report-status-v2 delete-refs side-band-64k quiet atomic ofs-delta push-options object-format=sha1 agent=git/2.37.1.gl1
    0000
    

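    Here, we see that the server’s main branch is at 29859bcc73cdab4db2b70ed681077a5885f80134 (ignoring the protocol framing: the 0000 and 00c2 in front of the hash are an end-of-section marker and a length prefix, not part of the hash).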
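The framing in this dump is Git's pkt-line format: four hex digits giving the total length of the packet (prefix included), followed by the payload, with 0000 as a payload-free "flush" marker. The following is a minimal sketch of a parser, rebuilding the advertisement above from its parts. The helper names are mine, and the sketch ignores the other special packets used by protocol v2.

```python
from typing import List, Optional

FLUSH = b"0000"  # special packet with no payload; marks the end of a section


def pkt_line(payload: bytes) -> bytes:
    """Frame a payload: 4 hex digits of total length (prefix included), then the payload."""
    return f"{len(payload) + 4:04x}".encode() + payload


def parse_pkt_lines(data: bytes) -> List[Optional[bytes]]:
    """Split a byte stream into payloads; a flush packet is returned as None."""
    packets: List[Optional[bytes]] = []
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:  # flush packet
            packets.append(None)
            pos += 4
            continue
        if length < 4 or pos + length > len(data):
            raise ValueError("malformed pkt-line")
        packets.append(data[pos + 4:pos + length])
        pos += length
    return packets


caps = (b"report-status report-status-v2 delete-refs side-band-64k quiet atomic "
        b"ofs-delta push-options object-format=sha1 agent=git/2.37.1.gl1")
advertisement = (
    pkt_line(b"# service=git-receive-pack\n")
    + FLUSH
    + pkt_line(b"29859bcc73cdab4db2b70ed681077a5885f80134 refs/heads/main\x00" + caps + b"\n")
    + FLUSH
)

for packet in parse_pkt_lines(advertisement):
    if packet is None or packet.startswith(b"#"):
        continue
    # The first ref line carries the server capabilities after a NUL byte.
    ref_part, _, capabilities = packet.rstrip(b"\n").partition(b"\x00")
    commit, ref = ref_part.decode().split(" ", 1)
    print(ref, "->", commit)
```

Rebuilding the packets this way reproduces the 001f and 00c2 prefixes seen in the dump, which is a handy cross-check that the 40-character hash starts right after 00c2.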

    4. The client identifies commits it has that the server does not, informing the server about the refs to be updated:
    009f0000000000000000000000000000000000000000 8fa91ae7af0341e6524d1bc2ea067c99dff65f1c refs/heads/feat/hello-world
    

    In this example, we’re pushing a new branch feat/hello-world, currently pointing to 8fa91ae7af0341e6524d1bc2ea067c99dff65f1c. Since it’s a new branch, its previous point is represented as 0000000000000000000000000000000000000000.

    5. The client packs the relevant commits along with their tree and file objects into a compressed packfile and sends it to the server. A packfile is binary:
    report-status side-band-64k agent=git/2.20.10000PACK\x00\x00\x00\x02\x00\x00\x00\x03\x98\x0cx\x9c\x8d\x8bI
    \xc30\x0c\x00\xef~\x85\xee\x85"[^$(\xa5_\x91m\x85\xe6\xe0\xa4\x04\xe7\xff]^\xd0\xcb0\x87\x99y\x98A\x11\xa5\xd8\xab,\xbdSA]Z\x15\xcb(\x94|4\xdf\x88\x02&\x94\xa0\xec^z\xd86!\x08'\xa9\xad\x15j]\xeb\xe7\x0c\xb5\xa0\xf5\xcc\x1eK\xd1\xc4\x9c\x16FO\xd1\xe99\x9f\xfb\x01\x9bn\xe3\x8c\x01n\xeb\xe3\xa7\xd7aw\xf09\x07\xf4\\\x88\xe1\x82\x8c\xe8\xda>\xc6:\xa7\xfd\xdb\xbb\xf3\xd5u\x1a|\xe1\xde\xac\xe29o\xa9\x04x\x9c340031Q\x08rut\xf1u\xd5\xcbMap\xf6\xdc\xd6\xb4n}\xef\xa1\xc6\xe3\xcbO\xdcp\xe3w\xb10=p\xc8\x10\xa2(%\xb1$U\xaf\xa4\xa2\x84\xa1T\xe5\x8eO\xe9\xcf\xd3\x0c\\R\x7f\xcf\xed\xdb\xb9]n\xd1\xea3\xa2\x00\xd3\x86\x1db\xbb\x02x\x9c\x01+\x00\xd4\xff2022\xe5\xb9\xb4 09\xe6\x9c\x88 01\xe6\x97\xa5 \xe6\x98\x9f\xe6\x9c\x9f\xe5\x9b\x9b 15:52:13 CST
    \xa4d\x11\xa1\xe8\x86\xdeQ\x90\xb1\xe0Z\xfd\x7f\x91\x90\xc3\xd6\x17\xe8\x02&K\xd0
    
    6. The server unpacks the packfile, updates the ref, and returns the result of the process:
    003a\x01000eunpack ok
    0023ok refs/heads/feat/hello-world
    

    The Git transfer protocol can be carried over SSH or HTTP(S).

    Pretty straightforward, right?

    Components of GitLab

    GitLab is a popular code hosting service that supports collaborative development, issue tracking, CI/CD, and more.

    GitLab’s services aren’t monolithic. Let’s use version 15 as an example; components related to git push include:

    • GitLab: Developed in Ruby, it consists of two parts: the web/API services (hereafter referred to as Rails) and the job queue/background jobs (referred to as Sidekiq).
    • Gitaly: Developed in Go, it’s the Git storage backend for GitLab, responsible for storing and accessing Git repositories and exposing various Git operations as gRPC calls. Initially, Rails operated on Git repositories directly through the Git command line on NFS, but this became inefficient at scale, leading to the development of Gitaly.
    • Workhorse: Developed in Go, acting as a reverse proxy for Rails, handling “slow” HTTP requests like Git push/pull, file uploads/downloads. It offloads processing from Rails to handle resource-intensive operations more efficiently.
    • GitLab Shell: Developed in Go, it manages Git SSH connections, serving as a bridge between the user’s Git client and Gitaly.
    • GitLab Runner: Also developed in Go, responsible for running CI/CD jobs.
    • Data is stored in Postgres, with Redis used for caching. Rails and Sidekiq directly interact with the database and cache, while other components communicate through APIs implemented in Rails.
    Simplified GitLab components relationship
    A simplified view of GitLab components Diagram/GitLab architecture overview on docs.gitlab.com

    The Journey of git push

    Now that you’re equipped with the basics, let’s dive into the adventure!

    Do You Prefer SSH?

    If your remote URL looks like git@gitlab.example.com:user/repo.git, you’re communicating with GitLab over SSH. When executing a git push, your Git client’s send-pack process essentially runs a command like:

    ssh -x git@gitlab.example.com "git-receive-pack 'user/repo.git'"
    

    There are interesting questions to ponder here:

    • Everyone’s username is git. How does the server distinguish who is who?
    • SSH? Can I run arbitrary commands on the server?

    These questions are addressed by GitLab Shell’s gitlab-sshd, a customized SSH daemon speaking the same protocol as sshd. During the SSH handshake, the client provides its public key, and gitlab-sshd verifies it against the public keys registered in GitLab through an internal API call, thus authenticating the user.

    Moreover, gitlab-sshd restricts the commands clients can execute, using the attempted command to decide which method to run. Any non-matching commands are rejected.

    Sadly, it seems we can’t run bash or rm -rf / on GitLab’s servers via SSH. ┑(￣Д ￣)┍

    In the past, GitLab indeed used sshd to handle Git requests. To solve the aforementioned problems, their authorized_keys file looked something like this:

    # Managed by gitlab-rails
    command="/bin/gitlab-shell key-1",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAIEAiPWx6WM4lhHNedGfBpPJNPpZ7yKu+dnn1SJejgt1016k6YjzGGphH2TUxwKzxcKDKKezwkpfnxPkSMkuEspGRt/aZZ9wa++Oi7Qkr8prgHc4soW6NUlfDzpvZK2H5E7eQaSeP3SAwGmQKUFHCddNaP0L+hM7zhFNzjFvpaMgJw0=
    command="/bin/gitlab-shell key-2",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAIEAiPWx6WM4lhHNedGfBpPJNPpZ7yKu+dnn1SJejgt1026k6YjzGGphH2TUxwKzxcKDKKezwkpfnxPkSMkuEspGRt/aZZ9wa++Oi7Qkr8prgHc4soW6NUlfDzpvZK2H5E7eQaSeP3SAwGmQKUFHCddNaP0L+hM7zhFNzjFvpaMgJw0=
    

    Yes, all users’ public keys were in this file, potentially reaching sizes of hundreds of MBs!

    The command parameter overrides any command the SSH client attempts to run, instead launching gitlab-shell with the public key ID as the parameter. gitlab-shell then executes the intended operation based on the SSH_ORIGINAL_COMMAND environment variable.
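To illustrate the idea of dispatching on the original command while rejecting everything else, here is a small sketch. It is not gitlab-shell's actual code: the allowlist and function name are assumptions for illustration, and the real tool supports more commands than shown.

```python
import shlex
from typing import Tuple

# Assumed allowlist, for illustration only.
ALLOWED_COMMANDS = {"git-receive-pack", "git-upload-pack", "git-upload-archive"}


def parse_original_command(original: str) -> Tuple[str, str]:
    """Split an SSH_ORIGINAL_COMMAND-style string into (command, repository path)."""
    parts = shlex.split(original)
    if len(parts) != 2 or parts[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowed: {original!r}")
    return parts[0], parts[1]


print(parse_original_command("git-receive-pack 'user/repo.git'"))

try:
    parse_original_command("rm -rf /")
except PermissionError as err:
    print(err)
```

Because the decision is made on a parsed, allowlisted command name rather than by handing the string to a shell, arbitrary commands never get a chance to run.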

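    With sshd, finding a key meant scanning authorized_keys from the top, so users whose keys sat near the end of the file waited longer. With the transition to gitlab-sshd, which relies on a Rails API backed by Postgres indexing, this performance disparity between users based on their registration order is no longer an issue.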

    Two experienced developers having fun while waiting for git push to finish. Image/xkcd-excuse.com

    After authenticating the user, gitlab-sshd checks the user’s write access to the target repository and retrieves information about the repository’s Gitaly instance, user ID, and repository details.

    Finally, gitlab-sshd calls the corresponding Gitaly instance’s SSHReceivePack method, acting as a relay and translator between the Git client (SSH) and Gitaly (gRPC).

    From a broader perspective, a git push over SSH follows this flow:

    1. The user executes git push.
    2. The Git client establishes an SSH connection to gitlab-shell.
    3. gitlab-shell queries the Rails internal API for public key verification (GET /api/v4/internal/authorized_keys) and write permission (POST /api/v4/internal/allowed).
    4. The API responds with the Gitaly address, authentication token, repository object, and hook callback information.
    5. gitlab-shell relays this information to Gitaly’s SSHReceivePack method.
    6. Gitaly executes git-receive-pack in the appropriate working directory, setting environment variables (GITALY_HOOKS_PAYLOAD) for Git hooks.
    7. The server-side Git attempts to update refs and execute Git hooks.
    8. Completion.

    The details about Gitaly and Git hooks will be covered next.

    Do You Prefer HTTP(S)?

    The remote URL for HTTP(S) looks like https://gitlab.example.com/user/repo.git. Unlike SSH, HTTP requests are stateless and always follow a request-response pattern. When you perform a git push, the Git client interacts with two endpoints in sequence:

    • GET https://gitlab.example.com/user/repo.git/info/refs?service=git-receive-pack: The server returns the current commit hashes of all refs in the repository in the response body.
    • POST https://gitlab.example.com/user/repo.git/git-receive-pack: The client submits the branches to be updated along with their old commit hashes and new commit hashes in the body, including the required packfile. The server returns the result of the operation in the body, along with the familiar prompt to “create a merge request”:
    003a\x01000eunpack ok
    0023ok refs/heads/feat/hello-world
    00000085\x02
    To create a merge request for feat/hello-world, visit:
      https://gitlab.example.com/user/repo/-/merge_requests/new?merge_request%5Bs0029\x02ource_branch%5D=feat%2Fhello-world
    0000
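The \x01 and \x02 bytes in this response are side-band channel numbers: with side-band-64k negotiated, the first payload byte of each packet says whether the rest is protocol data (band 1), progress text shown to the user (band 2), or a fatal error (band 3). Here is a minimal sketch of splitting such a response by band. The function names are mine and the progress message is shortened.

```python
from collections import defaultdict
from typing import Dict


def pkt_line(payload: bytes) -> bytes:
    """Frame a payload with its 4-hex-digit total length."""
    return f"{len(payload) + 4:04x}".encode() + payload


def demux_sideband(data: bytes) -> Dict[int, bytes]:
    """Concatenate side-band payloads per band (1 = data, 2 = progress, 3 = error)."""
    bands: Dict[int, bytes] = defaultdict(bytes)
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:  # outer flush packet ends the stream
            break
        packet = data[pos + 4:pos + length]
        bands[packet[0]] += packet[1:]  # first payload byte is the band number
        pos += length
    return dict(bands)


# Band 1 carries the report-status lines, which are pkt-line framed themselves.
report = pkt_line(b"unpack ok\n") + pkt_line(b"ok refs/heads/feat/hello-world\n") + b"0000"
response = (
    pkt_line(b"\x01" + report)
    + pkt_line(b"\x02\nTo create a merge request for feat/hello-world, visit:\n")
    + b"0000"
)

bands = demux_sideband(response)
print(bands[1])
print(bands[2].decode())
```

This also explains why the merge request URL in the dump is cut in two by 0029\x02: the message was simply split across two band-2 packets.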
    

    Both requests are intercepted by Workhorse, which does two things each time:

    1. Forwards the request as-is to Rails, which returns the authentication result, user ID, and information about the repository’s corresponding Gitaly instance (interesting, right? The info/refs and git-receive-pack endpoints of Rails are actually used for authentication, which I guess has some historical reasons behind it).
    2. Based on the information returned by Rails, Workhorse establishes a connection with Gitaly, acting as a relay between the client and Gitaly.

    In summary, a git push via HTTP(S) goes like this:

    1. The user executes git push.
    2. The Git client calls GET https://gitlab.example.com/user/repo.git/info/refs?service=git-receive-pack, including the corresponding authorization header.
    3. Workhorse intercepts the request, forwards it to Rails as-is, and obtains the authentication result, user ID, and information about the repository’s corresponding Gitaly instance.
    4. Based on Rails’ response, Workhorse calls Gitaly’s gRPC method InfoRefsReceivePack, acting as a relay between the client and Gitaly.
    5. Gitaly runs git-receive-pack in the appropriate working directory and returns ref information.
    6. The Git client calls POST https://gitlab.example.com/user/repo.git/git-receive-pack.
    7. Workhorse intercepts the request, forwards it to Rails as-is, and obtains the authentication result, user ID, and information about the repository’s corresponding Gitaly instance.
    8. Based on Rails’ response, Workhorse calls Gitaly’s gRPC method PostReceivePack, acting as a relay between the client and Gitaly.
    9. Gitaly runs git-receive-pack in the appropriate working directory, setting environment variables like GITALY_HOOKS_PAYLOAD, which includes GL_ID, GL_REPOSITORY, etc.
    10. The server-side Git attempts to update refs and execute Git hooks.
    11. Completion.
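From the client's point of view, the flow above boils down to two requests derived from the remote URL. The sketch below only builds their descriptions and sends nothing. The function name is hypothetical, and the POST content type is the one defined by Git's smart HTTP protocol.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlencode


def push_requests(remote_url: str) -> List[Tuple[str, str, Optional[str]]]:
    """Return (method, url, content_type) for the two requests of a push over HTTP(S)."""
    base = remote_url.rstrip("/")
    discovery = f"{base}/info/refs?{urlencode({'service': 'git-receive-pack'})}"
    return [
        # Ref discovery: no request body, so no content type.
        ("GET", discovery, None),
        # Ref updates plus the packfile travel in the request body.
        ("POST", f"{base}/git-receive-pack", "application/x-git-receive-pack-request"),
    ]


for method, url, content_type in push_requests("https://gitlab.example.com/user/repo.git"):
    print(method, url, content_type or "")
```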

    Gitaly and Git Hooks

    After traversing through the connection layer and authorization checks, we’re now closer to GitLab’s heart: Gitaly.

    Gitaly’s name is a playful blend of “Git” and “Aly”, a sparsely populated Russian village: just as Aly has a minimal population, Gitaly aims for minimal disk I/O.

    Gitaly is responsible for managing Git repositories in GitLab, executing Git binary operations through fork/exec, and using cgroups to ensure these processes don’t consume excessive CPU and memory resources. Repositories are stored locally (e.g. /var/opt/gitlab/git-data/repositories/@hashed/b1/7e/b17ef6d19c7a5b1ee83b907c595526dcb1eb06db8227d650d5dda0a9f4ce8cd9.git), and Gitaly handles the intricate operations of managing these repositories.

    When you perform a git push, Gitaly’s SSHReceivePack (for SSH) and PostReceivePack (for HTTPS) methods come into play. These methods ultimately rely on Git’s git-receive-pack, which means the core updates to refs and objects are managed by the Git binary itself. Git’s git-receive-pack offers hooks that allow Gitaly to integrate into this process, interacting with Rails for certain operations.

    graph LR
        Gitaly[1,4. Gitaly<br>] --exec/fork--> git[2. Git<br>git-receive-pack]
        git --execute hooks--> hooks[3. gitaly-hooks<br>Binary acting as hook execution target]
        hooks --unix socket + gRPC + parameters provided by Git and Gitaly--> Gitaly
        Gitaly --API queries for permissions and ledger updates--> Workhorse[5. Workhorse]
        Workhorse --proxies request as-is to--> Rails[6. Rails]
    

    When Gitaly starts git-receive-pack, it passes a Base64-encoded JSON via the GITALY_HOOKS_PAYLOAD environment variable, containing repository information, the Gitaly Unix socket address and connection token, user information, and which hooks to execute (for git push, it’s always the following ones). It also sets Git’s core.hooksPath parameter to a temporary directory prepared by Gitaly at runtime, where all hook files are symlinked to gitaly-hooks.

    After being started by git-receive-pack, gitaly-hooks reads GITALY_HOOKS_PAYLOAD from its environment and connects back to Gitaly over the Unix socket using gRPC, informing Gitaly of the currently executed hook and the parameters provided by Git to the hook.
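The hand-off through the environment can be sketched as a Base64-encoded JSON round trip. Only the variable name GITALY_HOOKS_PAYLOAD comes from the text above. Every field name and value in this example is hypothetical and does not reflect Gitaly's real payload schema.

```python
import base64
import json
import os
from typing import Mapping


def encode_payload(payload: dict) -> str:
    """What the parent process would do before starting git-receive-pack."""
    return base64.b64encode(json.dumps(payload).encode()).decode()


def decode_payload(environ: Mapping[str, str]) -> dict:
    """What the hook binary would do on startup."""
    return json.loads(base64.b64decode(environ["GITALY_HOOKS_PAYLOAD"]))


# Hypothetical fields, for illustration only.
env = dict(os.environ)
env["GITALY_HOOKS_PAYLOAD"] = encode_payload({
    "repository": "user/repo",
    "socket_path": "/tmp/gitaly.socket",
    "token": "secret",
    "user_id": "user-1",
    "hooks": ["pre-receive", "update", "post-receive"],
})

payload = decode_payload(env)
print(payload["hooks"])
```

Passing one opaque variable through Git keeps Git itself unaware of the details: it only inherits the environment and hands it on to whatever hook it executes.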

    pre-receive hook

    This hook is triggered once Git receives a git push request, before any updates are made. It receives change information via stdin in the format of <old commit ref hash> <new commit ref hash> <ref name>, one per line.
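A hook reading that stdin format might look like the following sketch. The class and function names are made up for this example. The all-zero hash marks a ref that does not exist on one side of the change, as in the new-branch example earlier.

```python
import io
from typing import Iterator, NamedTuple, TextIO

ZERO_ID = "0" * 40


class RefChange(NamedTuple):
    old: str
    new: str
    ref: str

    @property
    def kind(self) -> str:
        """Classify the change by which side is the all-zero hash."""
        if self.old == ZERO_ID:
            return "create"
        if self.new == ZERO_ID:
            return "delete"
        return "update"


def read_changes(stream: TextIO) -> Iterator[RefChange]:
    """Parse "<old> <new> <ref>" lines as fed to pre-receive on stdin."""
    for line in stream:
        line = line.strip()
        if line:
            old, new, ref = line.split(" ", 2)
            yield RefChange(old, new, ref)


stdin = io.StringIO(
    f"{ZERO_ID} 8fa91ae7af0341e6524d1bc2ea067c99dff65f1c refs/heads/feat/hello-world\n"
)
for change in read_changes(stdin):
    print(change.kind, change.ref)
```

In a real hook the stream would be sys.stdin, and a non-zero exit code would reject the whole push.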

    Upon receiving this information, Gitaly makes two calls to Rails:

    • POST /api/v4/internal/allowed: This API endpoint was already invoked during the connection layer’s authorization step. This time, it includes change information, allowing Rails to make more granular decisions like enforcing branch protection rules.
    • POST /api/v4/internal/pre_receive: This notifies Rails that the repository is about to be updated, incrementing a reference counter for the repository to prevent disruptive changes during the push process.

    If POST /api/v4/internal/allowed returns an error, Gitaly relays this error back to gitaly-hooks, which writes the error message to standard error and exits with a non-zero exit code. The error message is collected by git-receive-pack and written to its standard error. The non-zero exit code from gitaly-hooks causes git-receive-pack to stop processing the current git push and exit with a non-zero exit code as well, returning control to Gitaly. Gitaly then collects the standard error output from git-receive-pack and responds to Workhorse or GitLab Shell with a gRPC response.

    An observant reader might wonder: the objects have already been uploaded to the server by the time the hooks run, so what happens to them if the push is halted at this point?

    Actually, objects associated with an incomplete git push are first written to a quarantine environment, stored in a subfolder under objects, such as incoming-8G4u9v. This way, if the hooks determine there’s a problem with the push, the related resources can be easily cleaned up.
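The quarantine idea can be modelled in a few lines: stage everything in a throwaway directory, and move it into the real object store only if the hook agrees. This is a toy model under simplifying assumptions (flat file names, a callback standing in for the hooks), not how Git implements it.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Dict


def receive_objects(objects_dir: Path, incoming: Dict[str, bytes],
                    hook_accepts: Callable[[], bool]) -> bool:
    """Stage objects in a quarantine directory; publish them only if the hook accepts."""
    quarantine = Path(tempfile.mkdtemp(prefix="incoming-", dir=str(objects_dir)))
    try:
        for name, content in incoming.items():
            (quarantine / name).write_bytes(content)
        if not hook_accepts():
            return False  # rejected: nothing ever reached the main object store
        for staged in list(quarantine.iterdir()):
            staged.rename(objects_dir / staged.name)
        return True
    finally:
        shutil.rmtree(quarantine)  # clean up after both accepted and rejected pushes


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp)
    rejected = receive_objects(objects_dir, {"aaaa": b"1"}, lambda: False)
    accepted = receive_objects(objects_dir, {"bbbb": b"2"}, lambda: True)
    print(rejected, accepted, sorted(p.name for p in objects_dir.iterdir()))
```

Cleaning up a rejected push is then just deleting one directory, with no need to work out which objects belonged to it.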

    update hook

    This hook is invoked once for each ref, just before Git updates it, and receives the ref name, the old hash, and the new hash as arguments. It doesn’t currently interact with Rails.

    GitLab also supports custom Git hooks, which can be triggered at this stage, allowing administrators to integrate additional checks or operations during the push process.

    post-receive hook

    After all refs have been successfully updated, the post-receive hook is triggered, sending the same change information as the pre-receive hook to Gitaly.

    Gitaly then informs Rails through the POST /api/v4/internal/post_receive endpoint. Rails performs several actions in response:

    • Suggests creating a Merge Request for the updated branch.
    • Decrements the repository’s reference counter.
    • Refreshes repository cache.
    • Triggers CI pipelines.
    • Sends notification emails if applicable.

    Some of these actions are asynchronous, managed by Sidekiq to ensure efficient processing without blocking the post-receive hook’s execution.

    Conclusion

    You’ve now journeyed through the entire process of git push from the client-side to the server-side, uncovering the intricate details of GitLab’s handling of Git operations.

    Here’s our end-of-journey treasure!

    graph LR
        push-ssh(git push via SSH)
        push-http(git push via HTTP)
        push-ssh --SSH--> gitlab-sshd
        push-http --HTTP--> workhorse
        gitlab-sshd --Via Workhorse Proxy:<br>Public Key Verification/Write Permission Check--> rails[Rails]
        workhorse --gRPC--> gitaly[Gitaly]
        workhorse --Write Permission Check--> rails
        gitlab-sshd --gRPC--> gitaly
        gitaly --fork/exec--> git
        git --Read and Write--> disk[Physical Disk]
        git --Hooks--> gitaly-hooks
        gitaly-hooks --Unix socket + gRPC--> gitaly
        gitaly --Write Permission Verification<br>pre-receive Hook<br>post-receive Hook--> rails
        rails --Job Delegation--> sidekiq[Sidekiq]
        rails --Cache Refresh--> redis[Redis]
        sidekiq --Triggers CI--> runner[GitLab Runner]
    

    Written by nanmu42
    To build beautiful things beautifully.
