feat: support synthesizing masked fill_in_blank QA pairs #173

Open
superfarther wants to merge 2 commits into InternScience:main from superfarther:yzh/masked_fill_in_blank
@superfarther
This PR supports synthesizing masked fill_in_blank QA pairs.

@github-actions github-actions bot added the documentation, core, and examples labels Feb 4, 2026
@gemini-code-assist

Summary of Changes

Hello @superfarther, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the system's question generation capabilities by introducing a novel method for creating masked fill-in-blank QA pairs. It integrates a new generator that intelligently rephrases graph data and masks key entities, alongside a dedicated partitioner for extracting graph triples. The inclusion of a comprehensive example pipeline ensures immediate usability and demonstration of this new feature.

Highlights

  • New QA Generation Method: Introduced support for synthesizing masked fill-in-blank Question-Answering (QA) pairs, enhancing the variety of generative tasks.
  • MaskedFillInBlankGenerator: Added a new MaskedFillInBlankGenerator class that rephrases graph nodes and edges into coherent text, then randomly masks a node's name to create fill-in-blank questions.
  • TriplePartitioner: Implemented a TriplePartitioner to extract distinct (node, edge, node) triples from a graph, which serves as input for the new generator.
  • Example Pipeline: Provided a complete example, including a README, a shell script, and a YAML configuration, to demonstrate how to generate masked fill-in-blank QAs.
  • Output Format Support: Extended the base generator to support a new QA_pairs output data format for generated questions and answers.
Changelog
  • examples/generate/generate_masked_fill_in_blank_qa/README.md
    • Added a new README file for the masked fill-in-blank QA generation example.
  • examples/generate/generate_masked_fill_in_blank_qa/generate_masked_fill_in_blank.sh
    • Added a new shell script to execute the masked fill-in-blank QA generation pipeline.
  • examples/generate/generate_masked_fill_in_blank_qa/masked_fill_in_blank_config.yaml
    • Added a new YAML configuration file to define the pipeline for generating masked fill-in-blank QA pairs.
  • graphgen/bases/base_generator.py
    • Modified format_generation_results to support the new QA_pairs output format, returning question and answer as a dictionary.
  • graphgen/models/__init__.py
    • Imported and registered MaskedFillInBlankGenerator and TriplePartitioner for dynamic loading within the graphgen.models module.
  • graphgen/models/generator/__init__.py
    • Imported MaskedFillInBlankGenerator to make it available within the generator module.
  • graphgen/models/generator/masked_fill_in_blank_generator.py
    • Added a new MaskedFillInBlankGenerator class that rephrases input graph nodes and edges into coherent text and then masks a randomly selected node to form a fill-in-blank question.
  • graphgen/models/partitioner/__init__.py
    • Imported TriplePartitioner to make it available within the partitioner module.
  • graphgen/models/partitioner/triple_partitioner.py
    • Added a new TriplePartitioner class that partitions a graph into distinct (node, edge, node) triples, ignoring isolated points and yielding triples in BFS order.
  • graphgen/operators/generate/generate_service.py
    • Integrated the MaskedFillInBlankGenerator into the GenerateService to handle the masked_fill_in_blank generation method.
  • graphgen/operators/partition/partition_service.py
    • Integrated the TriplePartitioner into the PartitionService to handle the triple partitioning method.
    • Made method_params optional in the PartitionService constructor to accommodate partitioners that may not require them.
Activity
  • The pull request introduces a new feature to generate masked fill-in-blank QA pairs.
  • New example files, including a README, a shell script, and a configuration file, have been added to demonstrate the new functionality.
  • Core logic for the MaskedFillInBlankGenerator and TriplePartitioner has been implemented.
  • Existing service classes (GenerateService and PartitionService) have been updated to integrate the new generator and partitioner.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for synthesizing masked fill-in-the-blank QA pairs. It adds a new TriplePartitioner to break down the graph into triples and a MaskedFillInBlankGenerator for generating the QA pairs. The overall approach is sound and the changes are well-structured. I've identified one critical issue that could lead to a runtime error, along with several high-severity issues concerning violations of base class contracts and the use of global state, which should be addressed. I've also included some medium-severity suggestions to enhance code quality and maintainability.

mask_pattern = re.compile(re.escape(mask_node_name), re.IGNORECASE)
masked_context = mask_pattern.sub("___", context)
# For accuracy, extract the actual replaced text from the context as the ground truth (keeping the original case)
gth = re.search(mask_pattern, context).group(0)

critical

The call to re.search(mask_pattern, context) can return None if the mask_node_name is not found in the context. This would cause a crash with an AttributeError when .group(0) is called. You should add a check to handle this case gracefully, for example by logging a warning and returning an empty list.

Suggested change
- gth = re.search(mask_pattern, context).group(0)
+ match = re.search(mask_pattern, context)
+ if not match:
+     logger.warning(
+         "Could not find mask_node_name '%s' in the rephrased context. Context: %s",
+         mask_node_name,
+         context,
+     )
+     return []
+ gth = match.group(0)
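
To make the failure mode concrete, here is a minimal, self-contained sketch of the guarded masking step. The helper name `mask_entity`, the `blank` parameter, and the sample sentences are hypothetical and not part of the PR; only the `re` calls mirror the snippet under review.

```python
import logging
import re
from typing import Optional, Tuple

logger = logging.getLogger(__name__)


def mask_entity(context: str, entity: str, blank: str = "___") -> Optional[Tuple[str, str]]:
    """Return (masked_context, ground_truth), or None when the entity is absent.

    Hypothetical helper for illustration; not the PR's actual method.
    """
    pattern = re.compile(re.escape(entity), re.IGNORECASE)
    match = pattern.search(context)
    if match is None:
        # Without this guard, match.group(0) would raise AttributeError.
        logger.warning("Entity %r not found in context; skipping.", entity)
        return None
    # Use the casing that actually appears in the context as the answer.
    return pattern.sub(blank, context), match.group(0)


found = mask_entity("Marie curie discovered polonium.", "Marie Curie")
missing = mask_entity("She discovered polonium.", "Marie Curie")
print(found)
print(missing)
```

Returning `None` (or an empty list, as the suggestion does) lets the caller skip the sample instead of aborting the whole batch; which of the two fits best depends on how the generator's callers consume the result.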

from graphgen.templates import AGGREGATED_GENERATION_PROMPT
from graphgen.utils import detect_main_language, logger

random.seed(42)

high

Setting a global random seed with random.seed(42) is generally discouraged as it affects the entire application's random number generation, which can lead to unexpected behavior in other parts of the code. For reproducibility, it's better to create a local random.Random instance within your class, for example in the __init__ method, and use that for random operations like random.choice on line 103.
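
As a rough illustration of the local-RNG approach described above, the sketch below keeps a private `random.Random` instance on the object. The class name `SeededPicker` and its methods are hypothetical and only stand in for the generator and partitioner classes in the PR.

```python
import random


class SeededPicker:
    """Hypothetical helper that owns its RNG instead of seeding the global one."""

    def __init__(self, seed: int = 42) -> None:
        # A private generator: seeding it does not touch the module-level state.
        self._rng = random.Random(seed)

    def pick(self, items):
        return self._rng.choice(items)

    def shuffled(self, items):
        out = list(items)  # copy so the caller's list is left untouched
        self._rng.shuffle(out)
        return out


names = ["alpha", "beta", "gamma", "delta"]
global_state_before = random.getstate()

first = SeededPicker(42)
second = SeededPicker(42)
picks_first = [first.pick(names) for _ in range(5)]
picks_second = [second.pick(names) for _ in range(5)]

# Same seed gives the same sequence, and the global RNG state is unchanged.
print(picks_first == picks_second)
print(random.getstate() == global_state_before)
```

The same instance can serve both `choice` and `shuffle`, so one seeded RNG per object is enough for reproducible runs.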

Comment on lines +77 to +79
@staticmethod
def parse_response(response: str) -> dict:
    pass

high

The parse_response method is defined as an abstract method in the BaseGenerator class but is implemented with pass here. Additionally, the return type hint dict is incompatible with the base class's list[dict]. Since this method is not used in the overridden generate method, it should either be implemented correctly or raise NotImplementedError to adhere to the abstract base class contract.

Suggested change
- @staticmethod
- def parse_response(response: str) -> dict:
-     pass
+ @staticmethod
+ def parse_response(response: str) -> list[dict]:
+     raise NotImplementedError("This method is not used in MaskedFillInBlankGenerator as it overrides the `generate` method.")
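
The difference between the two variants can be seen in a small stand-alone sketch. `BaseGen`, `SilentGen`, and `LoudGen` are hypothetical stand-ins; how `BaseGenerator` actually declares `parse_response` is assumed here, not taken from the repository.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class BaseGen(ABC):
    """Hypothetical stand-in for BaseGenerator."""

    @staticmethod
    @abstractmethod
    def parse_response(response: str) -> list[dict]:
        ...


class SilentGen(BaseGen):
    @staticmethod
    def parse_response(response: str) -> list[dict]:
        pass  # satisfies the ABC check but silently returns None


class LoudGen(BaseGen):
    @staticmethod
    def parse_response(response: str) -> list[dict]:
        raise NotImplementedError("parse_response is unused; generate() is overridden.")


silent_result = SilentGen.parse_response("raw text")
try:
    LoudGen.parse_response("raw text")
    loud_raised = False
except NotImplementedError:
    loud_raised = True

print(silent_result, loud_raised)
```

With `pass`, an accidental caller receives `None` where a list is expected and fails later, far from the cause; raising makes the unsupported call fail at the call site.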

from graphgen.bases import BaseGraphStorage, BasePartitioner
from graphgen.bases.datatypes import Community

random.seed(42)

high

Setting a global random seed with random.seed(42) is generally discouraged as it affects the entire application's random number generation. This can lead to unexpected behavior in other parts of the code. For reproducibility, it's better to create a local random.Random instance within your class, for example in the __init__ method, and use that for random operations like random.shuffle.

Comment on lines 17 to 57
def partition(
    self,
    g: BaseGraphStorage,
    **kwargs: Any,
) -> Iterable[Community]:
    nodes = [n[0] for n in g.get_all_nodes()]
    random.shuffle(nodes)

    visited_nodes: Set[str] = set()
    used_edges: Set[frozenset[str]] = set()

    for seed in nodes:
        if seed in visited_nodes:
            continue

        # start BFS in a connected component
        queue = deque([seed])
        visited_nodes.add(seed)

        while queue:
            u = queue.popleft()

            for v in g.get_neighbors(u):
                edge_key = frozenset((u, v))

                # if this edge has not been used, a new triple has been found
                if edge_key not in used_edges:
                    used_edges.add(edge_key)

                    # use the edge name to ensure the uniqueness of the ID
                    u_sorted, v_sorted = sorted((u, v))
                    yield Community(
                        id=f"{u_sorted}-{v_sorted}",
                        nodes=[u_sorted, v_sorted],
                        edges=[(u_sorted, v_sorted)]
                    )

                # continue to BFS
                if v not in visited_nodes:
                    visited_nodes.add(v)
                    queue.append(v)
No newline at end of file

high

The partition method in the base class BasePartitioner is type-hinted to return a List[Community]. This implementation returns an Iterable[Community] because it uses yield, which violates the Liskov Substitution Principle and the base class contract. To fix this, you should change this function to build and return a list of communities instead of using yield.
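
A list-returning version of the same traversal might look like the sketch below. It is written against a plain adjacency dictionary rather than `BaseGraphStorage`, uses a hypothetical `Triple` dataclass in place of `Community`, and omits the shuffle so that the output is deterministic; all of these are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Triple:
    """Hypothetical stand-in for Community."""
    id: str
    nodes: List[str]
    edges: List[Tuple[str, str]]


def partition_triples(adjacency: Dict[str, List[str]]) -> List[Triple]:
    """Collect one triple per undirected edge, visiting nodes in BFS order."""
    triples: List[Triple] = []
    visited: Set[str] = set()
    used_edges: Set[frozenset] = set()

    for seed in adjacency:
        if seed in visited:
            continue
        queue = deque([seed])
        visited.add(seed)
        while queue:
            u = queue.popleft()
            for v in adjacency.get(u, []):
                key = frozenset((u, v))
                if key not in used_edges:
                    used_edges.add(key)
                    a, b = sorted((u, v))
                    # Append to a list instead of yielding.
                    triples.append(Triple(id=f"{a}-{b}", nodes=[a, b], edges=[(a, b)]))
                if v not in visited:
                    visited.add(v)
                    queue.append(v)
    return triples


graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"], "D": []}
result = partition_triples(graph)
print([t.id for t in result])  # isolated node "D" contributes no triple
```

If lazy evaluation matters for large graphs, the alternative is to widen the base class annotation to `Iterable[Community]` instead; which of the two fits depends on how `PartitionService` consumes the result.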

    dependencies:
      - build_kg
    params:
      method: triple

medium

There is a trailing whitespace character on this line. While YAML is often tolerant of this, it's best practice to remove it to avoid potential parsing issues and maintain a clean configuration file.

      method: triple


if output_data_format == "QA_pairs":
    return {
        "question": question,

medium

There is a trailing whitespace after question,. Please remove it to maintain code style consistency.

Suggested change
- "question": question,
+ "question": question,

Comment on lines +42 to +54
# TODO: configure add_context
# if add_context:
#     original_ids = [
#         node["source_id"].split("<SEP>")[0] for node in _process_nodes
#     ] + [edge[2]["source_id"].split("<SEP>")[0] for edge in _process_edges]
#     original_ids = list(set(original_ids))
#     original_text = await text_chunks_storage.get_by_ids(original_ids)
#     original_text = "\n".join(
#         [
#             f"{index + 1}. {text['content']}"
#             for index, text in enumerate(original_text)
#         ]
#     )

medium

This large block of commented-out code seems to be related to a future feature (add_context). It's better to remove commented-out code from the codebase to improve readability. If this logic is needed for future reference, it should be tracked in an issue or a separate branch.

if not context:
    return []

nodes, edge = batch

medium

The variable edge is used to unpack the second item from batch, but based on the assertion on line 100 (len(edge) == 1), it appears to be a list of edges. For clarity and to avoid confusion, consider renaming it to edges here and on line 100.

Suggested change
- nodes, edge = batch
+ nodes, edges = batch

# continue to BFS
if v not in visited_nodes:
    visited_nodes.add(v)
    queue.append(v)
No newline at end of file

medium

The file is missing a newline character at the end. It's a common convention to end files with a newline to prevent issues with file concatenation and some version control tools.

Suggested change
- queue.append(v)
+ queue.append(v)
