The following code (evaluate method in core_logic.py) seems suspicious:
answer_matches = re.findall(DatasetSpecificProcessing.ANSWER_DELIMITER_PATTERN, generated_text)
# answer_matches = [self.chat_completion(FINAL_ANSWER_EXTRACTION_PROMPT.format(text=generated_text), "You are an AI assistant. Please follow the users requests.")]
answer_matches = [generated_text]
#
answers_len, dataset_len = len(answer_matches), len(dataset_subset)
Since generated_text is a single string, answer_matches = [generated_text] makes len(answer_matches) always 1, no matter how many answers the generated text actually contains, so comparing answers_len against dataset_len is effectively meaningless. I don't think this is intended. Also, answer_matches is assigned twice: the re.findall result on the first line is immediately overwritten, so the regex extraction (and the commented-out chat_completion variant) is dead code.
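
If the delimiter-based extraction was meant to be the primary path, a minimal sketch of a possible fix (my guess at the intent, assuming ANSWER_DELIMITER_PATTERN extracts one answer per example; falling back to the raw text only when nothing matches) would be:

import re

# Keep the delimiter-based extraction as the primary path; only fall back to
# the whole generation when no delimited answer is found. Names mirror the
# snippet above; this is a sketch of the presumed intent, not the repo's fix.
answer_matches = re.findall(
    DatasetSpecificProcessing.ANSWER_DELIMITER_PATTERN, generated_text
)
if not answer_matches:
    # Fallback: treat the entire output as a single answer
    answer_matches = [generated_text]

answers_len, dataset_len = len(answer_matches), len(dataset_subset)

With something like this, answers_len would actually reflect the number of extracted answers, and the comparison with dataset_len would be meaningful.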