mryab (Member, Author) commented on Feb 25, 2023
  CLIENT_BRANCH = "main"
- BLOCK_BRANCH_PREFIX = "block_"
+ BLOCK_BRANCH_PREFIX = "int8_block"
We'll roll that back before merging
mryab (Member, Author) commented on Feb 25, 2023

Comment on lines +51 to +57
if load_in_8bit:
    block = replace_8bit_linear(block)

block = block.to(device)
I moved replace_8bit_linear here because it's not possible to correctly load the quantized Linear8bitLt checkpoint into the model before it's converted and quantized
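For context, a minimal sketch of the ordering this comment describes. The function name load_quantized_block, its signature, and the explicit load_state_dict call are illustrative assumptions rather than the actual Petals helper; replace_8bit_linear is the helper reviewed below, assumed importable from petals.utils.convert_block as the file path in the next comment suggests.

import torch
from petals.utils.convert_block import replace_8bit_linear

def load_quantized_block(block: torch.nn.Module, state_dict: dict, load_in_8bit: bool, device: str) -> torch.nn.Module:
    # Convert nn.Linear layers to 8-bit ones and move the block to the target device
    # *before* restoring the checkpoint: a quantized Linear8bitLt state dict cannot be
    # loaded into plain nn.Linear modules, so conversion/quantization must come first.
    if load_in_8bit:
        block = replace_8bit_linear(block)
    block = block.to(device)

    # Only now load the (possibly pre-quantized int8) weights.
    block.load_state_dict(state_dict)
    return block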
mryab (Member, Author) commented on Feb 25, 2023

src/petals/utils/convert_block.py (Outdated)

Comment on lines 80 to 81
  from petals.utils.linear8bitlt_patch import CustomLinear8bitLt

  for n, module in model.named_children():
      if len(list(module.children())) > 0:
          replace_8bit_linear(module, threshold)

      if isinstance(module, torch.nn.Linear) and n not in ["lm_head", "score"]:
          assert module.weight.device.type == "cpu", f"expected linear layers on CPU, got {module.weight.device}"
-         model._modules[n] = CustomLinear8bitLt(
+         model._modules[n] = bnb.nn.Linear8bitLt(
Not strictly necessary, but it'd be good to get rid of all bitsandbytes-related code that got into upstream before merging this
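For reference, a minimal sketch of the recursive replacement pattern shown in the snippet above, using upstream bnb.nn.Linear8bitLt directly. The constructor arguments and the weight-copy step follow common bitsandbytes usage and are an approximation, not a verbatim excerpt of the Petals code.

import torch
import bitsandbytes as bnb

def replace_8bit_linear(model: torch.nn.Module, threshold: float = 6.0) -> torch.nn.Module:
    for n, module in model.named_children():
        # Recurse into container modules first.
        if len(list(module.children())) > 0:
            replace_8bit_linear(module, threshold)

        # Swap eligible nn.Linear layers (but not the output heads) for 8-bit ones.
        if isinstance(module, torch.nn.Linear) and n not in ["lm_head", "score"]:
            assert module.weight.device.type == "cpu", f"expected linear layers on CPU, got {module.weight.device}"
            model._modules[n] = bnb.nn.Linear8bitLt(
                module.in_features,
                module.out_features,
                module.bias is not None,
                has_fp16_weights=False,
                threshold=threshold,
            )
            # Wrap the original weight so that it is quantized lazily
            # when the module is moved to a CUDA device.
            model._modules[n].weight = bnb.nn.Int8Params(
                module.weight.data, requires_grad=False, has_fp16_weights=False
            )
            model._modules[n].bias = module.bias
    return model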
justheuristic approved these changes on Feb 26, 2023

The branch was later updated from 56a3bee to a610f4d.
borzunov (Collaborator) reviewed on Jun 6, 2023
    use_auth_token: Optional[str] = None,
    cache_dir: Optional[str] = None,
    max_disk_space: Optional[int] = None,
    load_in_8bit=False,
Suggested change:
- load_in_8bit=False,
+ load_in_8bit: bool = False,
Collaborator:

We discussed that we may revive this feature for loading NF4-pre-quantized weights for Llama 2 and Stable Beluga 2.
This PR relies on bitsandbytes-foundation/bitsandbytes#159 and makes it possible to call convert_model with the int8 data type and later download the 8-bit checkpoint instead of the 16-bit one when serving the model with load_in_8bit=True. This can save up to 2x bandwidth when starting a server, as shown by this comparison of model sizes for bloom-560m:

The command that was used for conversion is python -m petals.cli.convert_model --model bigscience/bloom-560m --output_path ./converted_model_int8 --torch_dtype int8 --resize_token_embeddings 50000 --block_branch_prefix int8_block. To test that the checkpoint loads correctly, you need to install bitsandbytes from the branch in the PR above and run python -m petals.cli.run_server bigscience/test-bloomd --new_swarm --skip_reachability_check --throughput 100 --device cuda (note that I had to change BLOCK_BRANCH_PREFIX in this branch for the sake of testing).
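For context, a sketch of how --block_branch_prefix relates to the branches a server later downloads, based on the BLOCK_BRANCH_PREFIX constant changed above. The helper name get_block_branch and the exact concatenation are illustrative assumptions about the naming convention, not the actual Petals code.

BLOCK_BRANCH_PREFIX = "int8_block"  # "block_" on main; changed in this branch for testing

def get_block_branch(block_index: int) -> str:
    # Each converted transformer block is stored on its own branch of the model repo,
    # named by combining the prefix and the block index.
    return f"{BLOCK_BRANCH_PREFIX}{block_index}"

# e.g. get_block_branch(3) -> "int8_block3"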