Conversation
Summary of Changes: Hello @ChenZiHong-Gavin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates Ray Serve as a new backend option for managing and serving Large Language Models. This enhancement provides a flexible and scalable infrastructure for deploying various LLM implementations, allowing for distributed inference and improved resource utilization.
Code Review
This pull request successfully integrates Ray Serve as a new LLM provider, enhancing the system's ability to leverage distributed LLM deployments. However, it introduces several security concerns, primarily related to the exposure of an unauthenticated HTTP endpoint. The LLMDeployment class lacks authentication, exposes verbose error messages, and passes untrusted input to LLM wrappers that are vulnerable to prompt injection. Additionally, there are areas for improvement related to general error handling, code organization, and potential blocking operations in an asynchronous context.
    ) -> List[Token]:
        return await self.llm_instance.generate_inputs_prob(text, history, **extra)

    async def __call__(self, request: Request) -> Dict:
The __call__ method, which serves as the HTTP entry point for the Ray Serve deployment, lacks any authentication or authorization checks. This allows any user with network access to the Ray Serve port to execute LLM queries, potentially leading to unauthorized resource consumption and abuse of the service.
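As one possible mitigation, the handler could require a shared secret before dispatching any work. The sketch below is illustrative only; the `RAY_SERVE_API_TOKEN` variable and the bearer-token scheme are assumptions, not part of this PR.

```python
import os

from starlette.requests import Request
from starlette.responses import JSONResponse


class LLMDeployment:  # decorator and constructor elided; names follow this PR
    async def __call__(self, request: Request):
        # Reject callers that do not present the expected bearer token (assumed env var).
        expected = os.environ.get("RAY_SERVE_API_TOKEN", "")
        if not expected or request.headers.get("Authorization") != f"Bearer {expected}":
            return JSONResponse({"error": "unauthorized"}, status_code=401)

        data = await request.json()
        ...  # existing method dispatch (generate_answer, generate_inputs_prob, ...)
```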
        kwargs = data.get("kwargs", {})

        if method == "generate_answer":
            result = await self.generate_answer(text, history, **kwargs)
The __call__ method passes untrusted user input (text and history) directly to the LLM's generate_answer method. Several LLM backends in this repository (e.g., HuggingFaceWrapper, SGLangWrapper) use manual string concatenation to build prompts, making them highly susceptible to prompt injection attacks. An attacker could provide crafted input to manipulate the LLM's behavior or spoof conversation history.
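One way to reduce this risk, sketched below under the assumption that the underlying tokenizer exposes a Hugging Face-style chat template, is to keep user text confined to a `user` role message instead of splicing it into a hand-built prompt string. The helper name and the role whitelist are illustrative, not taken from this PR.

```python
# Illustrative helper: build prompts via the tokenizer's chat template rather than
# manual string concatenation, so user-supplied text cannot masquerade as system or
# assistant turns. Assumes a Hugging Face tokenizer with `apply_chat_template`.
ALLOWED_ROLES = {"system", "user", "assistant"}


def build_prompt(tokenizer, text: str, history: list | None = None) -> str:
    # Drop history entries with unexpected roles instead of trusting them verbatim.
    messages = [m for m in (history or []) if m.get("role") in ALLOWED_ROLES]
    messages.append({"role": "user", "content": str(text)})
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```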
except Exception:
    pass

self.handle = serve.get_deployment(deployment_name).get_handle()
except Exception:

deployment = LLMDeployment.bind(backend=serve_backend, config=kwargs)
serve.run(deployment, name=app_name, route_prefix=f"/{app_name}")
serve.run() is a blocking call by default. If RayServeClient is intended to be used within an asynchronous application, this call will block the event loop, potentially causing performance issues or deadlocks. Consider whether the Ray Serve application should be deployed out-of-band (e.g., as a separate script), or, if the client is responsible for the deployment's lifecycle within an async context, whether the deployment should be started in a non-blocking manner (e.g., via serve.start() plus a non-blocking deployment call).
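If the client does need to deploy from inside a running event loop, one option (a sketch, not the PR's approach) is to push the blocking call onto a worker thread so the loop stays responsive:

```python
import asyncio

from ray import serve


async def deploy_app(deployment, app_name: str) -> None:
    # Run the potentially long, blocking serve.run call in a worker thread so it
    # does not stall the event loop; `deployment` and `app_name` follow the PR's names.
    await asyncio.to_thread(
        serve.run, deployment, name=app_name, route_prefix=f"/{app_name}"
    )
```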
            return {"result": result}
        except Exception as e:
            return {"error": str(e)}
The current implementation exposes sensitive internal information by returning raw exception details, which could be exploited by attackers to gain deeper insights into the system's architecture and vulnerabilities. It's crucial to prevent the leakage of stack traces, file paths, or configuration details. Instead, catch specific exceptions and return generalized error messages, logging full details internally without exposing them to the client.
        except Exception:
            # Log the error internally for debugging (requires `import logging` at module level)
            logging.exception("Error in LLMDeployment")
            return {"error": "An internal error occurred while processing your request."}

import uuid

app_name = f"llm_app_{serve_backend}_{uuid.uuid4().hex[:8]}"
# Filter kwargs to avoid passing unrelated args if necessary,
# but LLMDeployment config accepts everything for now.
# Note: We need to pass kwargs as the config dict.
The comment indicates a potential issue with kwargs being passed directly. It's safer and clearer to explicitly construct a config dictionary with only the parameters relevant to LLMDeployment to avoid passing unintended arguments. This also makes the code more robust to future changes in LLMDeployment's constructor.
# Explicitly construct config for LLMDeployment
deployment_config = {
    "model": kwargs.get("model"),
    "tokenizer": kwargs.get("tokenizer"),
    # add other config parameters relevant to LLMDeployment here
}
deployment = LLMDeployment.bind(backend=serve_backend, config=deployment_config)

tokenizer_model = os.environ.get("TOKENIZER_MODEL", "cl100k_base")
if "tokenizer" not in config:
    tokenizer = Tokenizer(model_name=tokenizer_model)
    config["tokenizer"] = tokenizer
The tokenizer initialization logic is duplicated here and in LLMServiceActor. To improve maintainability and avoid redundancy, consider extracting this logic into a shared utility function or a common base class if applicable. This ensures consistent tokenizer handling across different LLM wrappers.
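A small shared helper would keep the fallback logic in one place. This is a minimal sketch; the helper name and the Tokenizer import path are assumptions based on how the class is used in this PR, not confirmed module locations.

```python
import os

from graphgen.models import Tokenizer  # assumed import path


def ensure_tokenizer(config: dict) -> dict:
    """Fill in config["tokenizer"] from TOKENIZER_MODEL when the caller did not supply one."""
    if "tokenizer" not in config:
        model_name = os.environ.get("TOKENIZER_MODEL", "cl100k_base")
        config["tokenizer"] = Tokenizer(model_name=model_name)
    return config
```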
    Usage: serve run graphgen.models.llm.local.ray_serve_deployment:app_builder backend=vllm model=...
    """
    # args comes from the command line as key=value pairs
    backend = args.pop("backend", "vllm")
The app_builder function defaults backend to "vllm" if not provided. This default might not align with the user's expectation or the LLMDeployment's intended behavior if other backends are more common or desired as a default. Consider making the backend explicit or ensuring the default is well-documented and consistent with the overall system design.
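One defensive option, sketched here with an illustrative backend list (not taken from this PR), is to validate the value and fail fast on anything unexpected rather than silently assuming vllm:

```python
# Illustrative validation; the set of supported backends is an assumption.
SUPPORTED_BACKENDS = {"vllm", "sglang", "huggingface"}


def app_builder(args: dict):
    backend = args.pop("backend", "vllm")  # keep the documented default, but validate it
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unsupported backend {backend!r}; expected one of {sorted(SUPPORTED_BACKENDS)}"
        )
    ...  # build and return the LLMDeployment application as in this PR
```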