
[Bug]: total_score, head_data is not always populated for LinkPreviewConfig #1749

@bkennedy-improving

Description

crawl4ai version

0.7.8

Expected Behavior

Per Documentation:

  1. Total Score - Smart Combination

Intelligently combines intrinsic and contextual scores with fallbacks:

  • When both scores available: (intrinsic * 0.3) + (contextual * 0.7)
  • When only intrinsic: uses intrinsic score
  • When only contextual: uses contextual score
  • When neither: not calculated
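The documented rules can be sketched as a small standalone function. This only illustrates the behavior quoted above; it is not crawl4ai's actual implementation, and the name `documented_total_score` is hypothetical.

```python
from typing import Optional

# Weights quoted from the documentation excerpt above.
INTRINSIC_WEIGHT = 0.3
CONTEXTUAL_WEIGHT = 0.7


def documented_total_score(
    intrinsic: Optional[float], contextual: Optional[float]
) -> Optional[float]:
    """Hypothetical helper mirroring the documented fallback rules."""
    if intrinsic is not None and contextual is not None:
        # Both scores available: weighted combination.
        return intrinsic * INTRINSIC_WEIGHT + contextual * CONTEXTUAL_WEIGHT
    if intrinsic is not None:
        # Only intrinsic: fall back to the raw intrinsic score.
        return intrinsic
    if contextual is not None:
        # Only contextual: fall back to the raw contextual score.
        return contextual
    # Neither score available: total is not calculated.
    return None
```

Under these rules, a link with an intrinsic score of 5.0 and no contextual score would be expected to get a total_score of 5.0 rather than null.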

head_data should also be returned for each link.

Current Behavior

As the example output below shows, no fallback is applied: total_score is null even though an intrinsic score is available.

Relatedly, I'm not getting head_data back either, though that could be due to the PDF files I'm looking at.

{
  "https://www.pa.gov/en/grants/search/grant-details/dced/9": [
    {
      "url": "https://dced.pa.gov/programs/ben-franklin-technology-development-authority-venture-investment-program",
      "text": "Program Page",
      "intrinsic_score": 5.0,
      "contextual_score": null,
      "total_score": null,
      "head_data": null
    },
    {
      "url": "https://grants.pa.gov/",
      "text": "Go to Application  (opens in a new tab)",
      "intrinsic_score": 4.0,
      "contextual_score": null,
      "total_score": null,
      "head_data": null
    },
    {
      "url": "https://dced.pa.gov/download/bftda-venture-investment-program-guidelines?wpdmdl=87903",
      "text": "Program Guidelines",
      "intrinsic_score": 4.0,
      "contextual_score": null,
      "total_score": null,
      "head_data": null
    },
    {
      "url": "https://www.pa.gov/privacy-policy",
      "text": "Privacy Policy(opens in a new tab)",
      "intrinsic_score": 3.5,
      "contextual_score": null,
      "total_score": null,
      "head_data": null
    },
    {
      "url": "https://www.pa.gov/en/agencies/dced.html",
      "text": "Visit the DCED Website  (opens in a new tab)",
      "intrinsic_score": 2.7857142857142856,
      "contextual_score": null,
      "total_score": null,
      "head_data": null
    }
  ]
}
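A quick way to confirm the symptom on a result like the one above is to scan it for links that carry a partial score but no total_score. The sketch below works on plain dictionaries shaped like that JSON; the helper name `links_missing_total_score` is hypothetical and it does not call crawl4ai.

```python
from typing import Any


def links_missing_total_score(
    results: dict[str, list[dict[str, Any]]]
) -> list[str]:
    """Return URLs of links that have a partial score but no total_score."""
    missing = []
    for links in results.values():
        for link in links:
            has_partial = (
                link.get("intrinsic_score") is not None
                or link.get("contextual_score") is not None
            )
            if has_partial and link.get("total_score") is None:
                missing.append(link["url"])
    return missing


# Two entries copied from the output above (JSON null becomes Python None).
sample = {
    "https://www.pa.gov/en/grants/search/grant-details/dced/9": [
        {
            "url": "https://grants.pa.gov/",
            "intrinsic_score": 4.0,
            "contextual_score": None,
            "total_score": None,
            "head_data": None,
        },
        {
            "url": "https://www.pa.gov/privacy-policy",
            "intrinsic_score": 3.5,
            "contextual_score": None,
            "total_score": None,
            "head_data": None,
        },
    ]
}

print(links_missing_total_score(sample))
```

Every link in the reported output would be flagged by this check, since each has an intrinsic score and a null total_score.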

Additionally, for an apples-to-apples comparison between links, total_score should not simply equal the raw intrinsic score when the contextual score is unavailable; it should be the weighted intrinsic score. As I read the documentation, the fallback returns the raw intrinsic score in that case (an intrinsic score of 5 returns 5, not 5 × 0.3 = 1.5).

| Scenario | Apples-to-Apples Approach | Resulting Formula / Score |
| --- | --- | --- |
| Both available | Weighted average | $(intrinsic \times 0.3) + (contextual \times 0.7)$ |
| Only intrinsic | Uses weighted intrinsic score | $intrinsic \times 0.3$ |
| Only contextual | Uses weighted contextual score | $contextual \times 0.7$ |

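The proposal in the table above can be sketched as follows. This is a suggestion for discussion, not existing crawl4ai behavior; the function name `weighted_total_score` and its parameters are hypothetical.

```python
from typing import Optional


def weighted_total_score(
    intrinsic: Optional[float],
    contextual: Optional[float],
    w_intrinsic: float = 0.3,
    w_contextual: float = 0.7,
) -> Optional[float]:
    """Hypothetical apples-to-apples variant: always apply the weights."""
    if intrinsic is None and contextual is None:
        # Neither score available: total is not calculated.
        return None
    total = 0.0
    if intrinsic is not None:
        total += intrinsic * w_intrinsic
    if contextual is not None:
        total += contextual * w_contextual
    return total
```

With this variant, a missing score simply contributes nothing, so totals stay on the same scale whether or not a contextual score exists: an intrinsic score of 5 with no contextual score gives 5 × 0.3 = 1.5.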
Is this reproducible?

Yes

Inputs Causing the Bug

See example above.

Steps to Reproduce

See Code Snippet

Code snippets

The relevant code looks something like this:


        md_generator = DefaultMarkdownGenerator()
        config = CrawlerRunConfig(
            url_matcher=str(t.url),
            markdown_generator=md_generator,
            excluded_tags=['nav', 'footer', 'header'],
            extraction_strategy=llm_strategy,
            cache_mode=CacheMode.BYPASS,
            stream=True,
            score_links=True,
            exclude_all_images=True,
            link_preview_config=LinkPreviewConfig(
                include_internal=True,
                include_external=True,
                max_links=20,
                concurrency=5,
                timeout=10,
                query='my query here',
                score_threshold=0.2,
            ),
        )


Happy to provide more details privately.

OS

Linux

Python version

3.12.3

Browser

Chrome

Browser version

144.0.7559.59

Error logs & Screenshots (if applicable)

No response

Metadata

Assignees

No one assigned

    Labels

    🐞 Bug (Something isn't working), 📌 Root caused (identified the root cause of bug)
