
Fix issues when cloning repositories with large blobs (>4GB) #6069

Open
LordKiRon wants to merge 6 commits into git-for-windows:main from LordKiRon:minimalFix

Conversation


@LordKiRon LordKiRon commented Jan 27, 2026

This PR fixes a couple of problems when cloning large repositories (in my case, a 67GB database containing 4GB+ versioned files). It fixes the clone/fetch, but working with the repository still fails, only much later :)

@LordKiRon LordKiRon marked this pull request as ready for review January 27, 2026 14:04
@LordKiRon LordKiRon force-pushed the minimalFix branch 2 times, most recently from bf5c415 to a9cb9de Compare January 27, 2026 16:18

dscho commented Jan 27, 2026

@LordKiRon I would like to upstream this fix, but upstream Git requires a real name in the Signed-off-by line. Would you mind providing that?


LordKiRon commented Jan 27, 2026

> @LordKiRon I would like to upstream this fix, but upstream Git requires a real name in the Signed-off-by line. Would you mind providing that?

Sorry, but this creates ties between my real life and public activity that I would like to avoid. Not that I am doing anything illegal or inappropriate in either area :) and I guess after some digging on the net you might even connect my nickname to my real name, but that is different from creating a real connection.
I prefer to keep my "internet life" and personal life separate.


dscho commented Jan 27, 2026

> @LordKiRon I would like to upstream this fix, but upstream Git requires a real name in the Signed-off-by line. Would you mind providing that?

> Sorry, but this creates ties between my real life and public activity that I would like to avoid. Not that I am doing anything illegal or inappropriate in either area :) and I guess after some digging on the net you might even connect my nickname to my real name, but that is different from creating a real connection. I prefer to keep my "internet life" and personal life separate.

Understood. I'll take ownership, publicly, then, documenting that you are the actual author but prefer to stay pseudonymous.

dscho added 4 commits February 6, 2026 14:42
When unpacking objects from a packfile, the object size is decoded
from a variable-length encoding. On platforms where unsigned long is
32-bit (such as Windows, even in 64-bit builds), the shift operation
overflows when decoding sizes larger than 4GB. The result is a
truncated size value, causing the unpacked object to be corrupted or
rejected.

Fix this by changing the size variable to size_t, which is 64-bit on
64-bit platforms, and ensuring the shift arithmetic occurs in 64-bit
space.

This was originally authored by LordKiRon <https://github.com/LordKiRon>,
who preferred not to reveal their real name and therefore agreed that I
take over authorship.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
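
For illustration, here is a minimal sketch of the kind of variable-length size decoding the commit describes; the function name is hypothetical and this is not the exact code touched by the PR. With the size held in an `unsigned long`, the left shift overflows once the value needs more than 32 bits; keeping it in `size_t` makes the arithmetic 64-bit on 64-bit platforms.

```c
/*
 * Sketch only: decode the variable-length size from a pack entry
 * header.  The first byte carries 4 size bits, each continuation
 * byte carries 7 more.  Using size_t avoids the 32-bit shift
 * overflow on platforms where unsigned long is 32-bit.
 */
#include <stddef.h>

static size_t decode_pack_entry_size(const unsigned char *buf, size_t *pos)
{
	unsigned char c = buf[(*pos)++];
	size_t size = c & 0x0f;    /* low 4 bits of the first byte */
	unsigned shift = 4;

	while (c & 0x80) {         /* high bit: more size bytes follow */
		c = buf[(*pos)++];
		size += (size_t)(c & 0x7f) << shift;  /* 64-bit shift */
		shift += 7;
	}
	return size;
}
```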
On Windows, zlib's `uLong` type is 32-bit even on 64-bit systems. When
processing data streams larger than 4GB, the `total_in` and `total_out`
fields in zlib's `z_stream` structure wrap around, which caused the
sanity checks in `zlib_post_call()` to trigger `BUG()` assertions.

The git_zstream wrapper now tracks its own 64-bit totals rather than
copying them from zlib. The sanity checks compare only the low bits,
using `maximum_unsigned_value_of_type(uLong)` to mask appropriately for
the platform's `uLong` size.

This is based on work by LordKiRon in git-for-windows#6076.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
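
A sketch of the masked sanity check described above, with illustrative names rather than the actual `git_zstream` fields: zlib's `z_stream.total_out` is a `uLong` (32-bit on Windows) and wraps for streams larger than 4GB, so the wrapper keeps its own 64-bit counter and compares only the bits zlib can represent.

```c
/*
 * Sketch only: compare a 64-bit wrapper-tracked total against zlib's
 * possibly-wrapped uLong counter by masking to uLong's width.
 */
#include <stdint.h>
#include <stdlib.h>
#include <zlib.h>

/* stand-in for maximum_unsigned_value_of_type(uLong) */
#define ULONG_MASK ((uint64_t)(uLong)-1)

static void check_total_out(uint64_t tracked_total_out, const z_stream *zs)
{
	/* compare only the low bits that zlib actually stores */
	if ((tracked_total_out & ULONG_MASK) != zs->total_out)
		abort(); /* corresponds to the BUG() in zlib_post_call() */
}
```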
The odb_read_stream structure uses unsigned long for the size field,
which is 32-bit on Windows even in 64-bit builds. When streaming
objects larger than 4GB, the size would be truncated to zero or an
incorrect value, resulting in empty files being written to disk.

Change the size field in odb_read_stream to size_t and introduce
unpack_object_header_sz() to return sizes via size_t pointer. Since
object_info.sizep remains unsigned long for API compatibility, use
temporary variables where the types differ, with comments noting the
truncation limitation for code paths that still use unsigned long.

This was originally authored by LordKiRon <https://github.com/LordKiRon>,
who preferred not to reveal their real name and therefore agreed that I
take over authorship.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
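
The temporary-variable pattern mentioned above might look roughly like the following sketch (hypothetical helper name): the streaming code computes the size as `size_t` and only narrows it into the legacy `unsigned long` field when the value actually fits.

```c
/*
 * Sketch only: hand a size_t value back through a legacy
 * unsigned long out-parameter, refusing to truncate silently.
 */
#include <stddef.h>
#include <limits.h>

static int fill_legacy_sizep(unsigned long *sizep, size_t real_size)
{
	if (!sizep)
		return 0;
	if (real_size > ULONG_MAX)
		return -1;  /* truncation limitation of the old API */
	*sizep = (unsigned long)real_size;
	return 0;
}
```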
The delta header decoding functions return unsigned long, which
truncates on Windows for objects larger than 4GB. Introduce size_t
variants get_delta_hdr_size_sz() and get_size_from_delta_sz() that
preserve the full 64-bit size, and use them in packed_object_info()
where the size is needed for streaming decisions.

This was originally authored by LordKiRon <https://github.com/LordKiRon>,
who preferred not to reveal their real name and therefore agreed that I
take over authorship.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
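
To make the change concrete, here is a sketch of a `size_t`-returning delta header decoder mirroring the 7-bits-per-byte encoding that `get_delta_hdr_size()` parses; the name and exact signature here are illustrative, not the ones introduced by this PR.

```c
/*
 * Sketch only: decode the variable-length size stored at the start of
 * a delta, keeping the full value in size_t.
 */
#include <stddef.h>

static size_t decode_delta_hdr_size(const unsigned char **datap,
				    const unsigned char *top)
{
	const unsigned char *data = *datap;
	size_t size = 0;
	unsigned shift = 0;
	unsigned char cmd;

	do {
		cmd = *data++;
		size |= (size_t)(cmd & 0x7f) << shift;  /* 64-bit shift */
		shift += 7;
	} while ((cmd & 0x80) && data < top);

	*datap = data;
	return size;
}
```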
dscho added 2 commits February 6, 2026 18:24
To test Git's behavior with very large pack files, we need a way to
generate such files quickly.

A naive approach using only readily-available Git commands would take
over 10 hours for a 4GB pack file, which is prohibitive.

Side-stepping Git's machinery and actual zlib compression by writing
uncompressed content with the appropriate zlib header makes things
much faster. The fastest method using this approach generates many
small, unreachable blob objects and takes about 1.5 minutes for 4GB.
However, this cannot be used because we need to test git clone, which
requires a reachable commit history.

Generating many reachable commits with small, uncompressed blobs takes
about 4 minutes for 4GB. But this approach 1) does not reproduce the
issues we want to fix (which require individual objects larger than
4GB) and 2) is comparatively slow because of the many SHA-1
calculations.

The approach taken here generates a single large blob (filled with NUL
bytes), along with the trees and commits needed to make it reachable.
This takes about 2.5 minutes for 4.5GB, which is the fastest option
that produces a valid, clonable repository with an object large enough
to trigger the bugs we want to test.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
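
As an illustration of the speed trick described above (not the test-tool code from this PR): asking zlib for compression level 0 makes it emit "stored" blocks, skipping the expensive deflate work while still producing a valid zlib stream that Git can inflate. The helper name below is hypothetical, and the caller is assumed to feed chunks smaller than `UINT_MAX` because `z_stream` counts in `uInt`.

```c
/*
 * Sketch only: wrap a buffer in an uncompressed ("stored") zlib stream.
 * Returns the number of output bytes, or -1 on error.
 */
#include <zlib.h>
#include <stddef.h>

static long emit_stored(unsigned char *out, size_t outlen,
			const unsigned char *in, size_t inlen)
{
	z_stream zs = { 0 };
	long written = -1;

	if (deflateInit(&zs, Z_NO_COMPRESSION) != Z_OK)
		return -1;
	zs.next_in = (unsigned char *)in;
	zs.avail_in = (uInt)inlen;   /* caller keeps chunks < UINT_MAX */
	zs.next_out = out;
	zs.avail_out = (uInt)outlen;
	if (deflate(&zs, Z_FINISH) == Z_STREAM_END)
		written = (long)zs.total_out;
	deflateEnd(&zs);
	return written;
}
```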
The shift overflow bug in index-pack and unpack-objects caused incorrect
object size calculation when the encoded size required more than 32 bits
of shift. This would result in corrupted or failed unpacking of objects
larger than 4GB.

Add a test that creates a pack file containing a 4GB+ blob using the
new 'test-tool synthesize pack --reachable-large' command, then clones
the repository to verify the fix works correctly.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@dscho dscho changed the title Implemented minimal fix for shift > 32 issue of 4GB+ data Fix issues when cloning repositories with large blobs (>4GB) Feb 6, 2026
