feat: add ray_serve as llm provider #178

Open

ChenZiHong-Gavin wants to merge 1 commit into main from feat/add-ray-serve

Conversation

@ChenZiHong-Gavin (Collaborator)

No description provided.

@github-actions (bot) added the core label Feb 6, 2026
@gemini-code-assist (Contributor)

Summary of Changes

Hello @ChenZiHong-Gavin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates Ray Serve as a new backend option for managing and serving Large Language Models. This enhancement provides a flexible and scalable infrastructure for deploying various LLM implementations, allowing for distributed inference and improved resource utilization.

Highlights

  • New LLM Provider: Added 'ray_serve' as a selectable backend option for initializing Large Language Models.
  • Ray Serve Client Integration: Introduced RayServeClient to enable interaction with Ray Serve deployments, allowing connection to existing services or dynamic deployment of new ones.
  • LLM Deployment Wrapper for Ray Serve: Created LLMDeployment, a Ray Serve deployment class that wraps existing LLM backends (vLLM, HuggingFace, SGLang) and exposes their generation capabilities via a web API.
Changelog
  • graphgen/common/init_llm.py
    • Added a new conditional branch to instantiate RayServeClient when the specified backend is 'ray_serve' (see the sketch after this changelog).
  • graphgen/models/llm/api/ray_serve_client.py
    • New file defining the RayServeClient class.
    • Provides functionality to obtain a handle to an existing Ray Serve application or deployment.
    • Includes logic to deploy a new Ray Serve application if a serve_backend is provided and no existing handle is found.
    • Implements asynchronous methods (generate_answer, generate_topk_per_token, generate_inputs_prob) to interact with the deployed LLM service.
  • graphgen/models/llm/local/ray_serve_deployment.py
    • New file defining the LLMDeployment class, decorated as a Ray Serve deployment.
    • Initializes an internal LLM instance based on the provided backend (e.g., 'vllm', 'huggingface', 'sglang').
    • Exposes the LLM's generation methods (generate_answer, generate_topk_per_token, generate_inputs_prob) for remote calls.
    • Includes an __call__ method to handle incoming HTTP requests, parsing method and arguments to invoke the appropriate LLM function.
    • Provides an app_builder function to facilitate command-line deployment using serve run.
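
For orientation, a minimal, hypothetical sketch of the new branch in graphgen/common/init_llm.py is shown below. Only the class name, module path, and the 'ray_serve' backend string come from the changelog above; the function name, signature, and the other branches are assumptions.

    # Hypothetical sketch only -- the actual init_llm.py in this PR may differ.
    from graphgen.models.llm.api.ray_serve_client import RayServeClient

    def init_llm(backend: str, **kwargs):
        if backend == "ray_serve":
            # RayServeClient attaches to an existing Ray Serve app/deployment,
            # or deploys a new one when a serve_backend is supplied.
            return RayServeClient(**kwargs)
        # ... existing branches (e.g. openai, vllm, huggingface, sglang) ...
        raise ValueError(f"Unsupported LLM backend: {backend}")
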
Activity
  • No human activity (comments, reviews, etc.) has been recorded for this pull request yet.

# Diff context quoted by the review (RayServeClient handle lookup):
if app_name:
    try:
        self.handle = serve.get_app_handle(app_name)
    except Exception:
        pass
elif deployment_name:
    try:
        self.handle = serve.get_deployment(deployment_name).get_handle()
    except Exception:
        pass
@gemini-code-assist (bot) left a comment

Code Review

This pull request successfully integrates Ray Serve as a new LLM provider, enhancing the system's ability to leverage distributed LLM deployments. However, it introduces several security concerns, primarily related to the exposure of an unauthenticated HTTP endpoint. The LLMDeployment class lacks authentication, exposes verbose error messages, and passes untrusted input to LLM wrappers that are vulnerable to prompt injection. Additionally, there are areas for improvement related to general error handling, code organization, and potential blocking operations in an asynchronous context.

    ) -> List[Token]:
        return await self.llm_instance.generate_inputs_prob(text, history, **extra)

    async def __call__(self, request: Request) -> Dict:

security-high

The __call__ method, which serves as the HTTP entry point for the Ray Serve deployment, lacks any authentication or authorization checks. This allows any user with network access to the Ray Serve port to execute LLM queries, potentially leading to unauthorized resource consumption and abuse of the service.
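
One way to address this, sketched below, is to require a shared-secret bearer token on every request. The LLM_SERVE_TOKEN environment variable and the exact response shape are illustrative assumptions, not part of this PR.

    import os

    from starlette.requests import Request
    from starlette.responses import JSONResponse

    def authorized(request: Request) -> bool:
        # LLM_SERVE_TOKEN is an assumed environment variable for illustration.
        expected = os.environ.get("LLM_SERVE_TOKEN")
        provided = request.headers.get("authorization", "")
        return bool(expected) and provided == f"Bearer {expected}"

    # Hypothetical placement inside LLMDeployment.__call__:
    #     if not authorized(request):
    #         return JSONResponse({"error": "unauthorized"}, status_code=401)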

        kwargs = data.get("kwargs", {})

        if method == "generate_answer":
            result = await self.generate_answer(text, history, **kwargs)

security-high

The __call__ method passes untrusted user input (text and history) directly to the LLM's generate_answer method. Several LLM backends in this repository (e.g., HuggingFaceWrapper, SGLangWrapper) use manual string concatenation to build prompts, making them highly susceptible to prompt injection attacks. An attacker could provide crafted input to manipulate the LLM's behavior or spoof conversation history.
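
One defence-in-depth option, independent of hardening the prompt construction in the wrappers themselves, is to validate the payload before dispatching it. The shape assumed below for text and history (a string plus a list of role/content dicts) is an illustration, not necessarily what the PR's wrappers expect.

    from typing import Any

    ALLOWED_ROLES = {"system", "user", "assistant"}

    def validate_payload(text: Any, history: Any) -> None:
        """Reject obviously malformed input before it reaches the LLM backend."""
        if not isinstance(text, str) or not text.strip():
            raise ValueError("text must be a non-empty string")
        if history is None:
            return
        if not isinstance(history, list):
            raise ValueError("history must be a list of {'role', 'content'} dicts")
        for turn in history:
            if (
                not isinstance(turn, dict)
                or turn.get("role") not in ALLOWED_ROLES
                or not isinstance(turn.get("content"), str)
            ):
                raise ValueError("each history turn needs an allowed role and string content")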

Comment on lines +34 to +35
except Exception:
    pass

high

Catching a generic Exception can hide unexpected errors and make debugging difficult. It's better to catch more specific exceptions that you anticipate, or at least log the full traceback if a generic Exception must be caught.
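
A sketch of that suggestion, assuming serve.get_app_handle raises ray.serve.exceptions.RayServeException when the application cannot be found (the exact exception type should be verified against the installed Ray version):

    import logging

    from ray import serve
    from ray.serve.exceptions import RayServeException

    logger = logging.getLogger(__name__)

    def try_get_app_handle(app_name: str):
        """Return a handle to an existing Serve app, or None after logging why."""
        try:
            return serve.get_app_handle(app_name)
        except RayServeException:
            logger.exception("No reachable Ray Serve application named %r", app_name)
            return None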

Comment on lines +38 to +39
    self.handle = serve.get_deployment(deployment_name).get_handle()
except Exception:

high

Similar to the previous comment, catching a generic Exception here can obscure the root cause of issues. Consider catching more specific exceptions related to serve.get_deployment or get_handle failures.

Comment on lines +58 to +59
deployment = LLMDeployment.bind(backend=serve_backend, config=kwargs)
serve.run(deployment, name=app_name, route_prefix=f"/{app_name}")

high

serve.run() blocks until the Ray Serve application is deployed and running. If RayServeClient is intended to be used within an asynchronous application, this call will block the event loop, potentially causing performance issues or deadlocks. Consider deploying the Ray Serve application out-of-band (e.g., as a separate script), or moving the blocking call off the event loop if the client is responsible for the app's lifecycle within an async context.
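
If the client does need to trigger deployment from async code, one pattern is to push the blocking call onto a worker thread so the event loop stays responsive. This is an illustrative alternative, not what the PR currently does.

    import asyncio
    import functools

    from ray import serve

    async def deploy_app_async(deployment, app_name: str):
        """Run the blocking serve.run() call in a worker thread, not on the event loop."""
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(
            None,
            functools.partial(
                serve.run, deployment, name=app_name, route_prefix=f"/{app_name}"
            ),
        )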


return {"result": result}
except Exception as e:
return {"error": str(e)}

security-medium

The current implementation exposes sensitive internal information by returning raw exception details, which could be exploited by attackers to gain deeper insights into the system's architecture and vulnerabilities. It's crucial to prevent the leakage of stack traces, file paths, or configuration details. Instead, catch specific exceptions and return generalized error messages, logging full details internally without exposing them to the client.

        except Exception:
            # Log full details internally for debugging (requires `import logging`
            # at the top of the module); never expose them to the client.
            logging.exception("Error in LLMDeployment")
            return {"error": "An internal error occurred while processing your request."}

Comment on lines +46 to +48
                    import uuid

                    app_name = f"llm_app_{serve_backend}_{uuid.uuid4().hex[:8]}"

medium

Imports should generally be placed at the top of the file (PEP 8 guideline). Moving import uuid to the top of the file improves readability and ensures consistency, even if Python allows conditional imports.

import uuid

                    app_name = f"llm_app_{serve_backend}_{uuid.uuid4().hex[:8]}"

Comment on lines +55 to +57
                # Filter kwargs to avoid passing unrelated args if necessary,
                # but LLMDeployment config accepts everything for now.
                # Note: We need to pass kwargs as the config dict.

medium

The comment indicates a potential issue with kwargs being passed directly. It's safer and clearer to explicitly construct a config dictionary with only the parameters relevant to LLMDeployment to avoid passing unintended arguments. This also makes the code more robust to future changes in LLMDeployment's constructor.

                # Explicitly construct config for LLMDeployment
                deployment_config = {"model": kwargs.get("model"), "tokenizer": kwargs.get("tokenizer")} # Add other relevant config parameters
                deployment = LLMDeployment.bind(backend=serve_backend, config=deployment_config)

Comment on lines +17 to +20
        tokenizer_model = os.environ.get("TOKENIZER_MODEL", "cl100k_base")
        if "tokenizer" not in config:
            tokenizer = Tokenizer(model_name=tokenizer_model)
            config["tokenizer"] = tokenizer

medium

The tokenizer initialization logic is duplicated here and in LLMServiceActor. To improve maintainability and avoid redundancy, consider extracting this logic into a shared utility function or a common base class if applicable. This ensures consistent tokenizer handling across different LLM wrappers.
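
As a sketch, the duplicated block could become a small shared helper; the module location, function name, and Tokenizer import path below are hypothetical.

    import os

    from graphgen.models import Tokenizer  # import path assumed for illustration

    def ensure_tokenizer(config: dict) -> dict:
        """Populate config['tokenizer'] from TOKENIZER_MODEL when the caller did not supply one."""
        if "tokenizer" not in config:
            tokenizer_model = os.environ.get("TOKENIZER_MODEL", "cl100k_base")
            config["tokenizer"] = Tokenizer(model_name=tokenizer_model)
        return config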

    Usage: serve run graphgen.models.llm.local.ray_serve_deployment:app_builder backend=vllm model=...
    """
    # args comes from the command line key=value pairs
    backend = args.pop("backend", "vllm")

medium

The app_builder function defaults backend to "vllm" if not provided. This default might not align with the user's expectation or the LLMDeployment's intended behavior if other backends are more common or desired as a default. Consider making the backend explicit or ensuring the default is well-documented and consistent with the overall system design.
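
If an explicit backend is preferred over a silent default, app_builder could fail fast instead. A sketch under that assumption (the real signature and error handling may differ):

    def app_builder(args: dict):
        """Build the Serve app from `serve run ... key=value` arguments."""
        backend = args.pop("backend", None)  # no implicit default
        if backend is None:
            raise ValueError(
                "backend is required, e.g. `serve run "
                "graphgen.models.llm.local.ray_serve_deployment:app_builder backend=vllm model=...`"
            )
        # LLMDeployment is the Serve deployment class defined in this module.
        return LLMDeployment.bind(backend=backend, config=args)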
