Skip to content

request JSON-LD from Link rel=alternate#129

Open
alpha-beta-soup wants to merge 8 commits intodigitalbazaar:masterfrom
alpha-beta-soup:iss-128
Open

request JSON-LD from Link rel=alternate#129
alpha-beta-soup wants to merge 8 commits intodigitalbazaar:masterfrom
alpha-beta-soup:iss-128

Conversation

@alpha-beta-soup
Copy link

@alpha-beta-soup alpha-beta-soup commented Jun 24, 2020

Ref #128

With the following test cases defined:

context.jsonld

{
  "@context": {
    "@vocab":   "https://w3c.github.io/json-ld-api/tests/vocab#",
    "dcterms":       "http://purl.org/dc/terms/",
    "jld":      "https://w3c.github.io/json-ld-api/tests/vocab#",
    "mf":       "http://www.w3.org/2001/sw/DataAccess/tests/test-manifest#",
    "rdfs":     "http://www.w3.org/2000/01/rdf-schema#",
    "xsd":      "http://www.w3.org/2001/XMLSchema#",

    "context":         { "@type": "@id" },
    "expect":          { "@id": "mf:result", "@type": "@id" },
    "expectErrorCode": { "@id": "mf:result" },
    "frame":           { "@type": "@id" },
    "input":           { "@id": "mf:action", "@type": "@id" },
    "option":          { "@type": "@id"},
    "sequence":        { "@id": "mf:entries", "@type": "@id", "@container": "@list" },
    "redirectTo":      { "@type": "@id"},

    "name":                 "mf:name",
    "purpose":              "rdfs:comment",
    "description":          "rdfs:comment",
    "base":                 { "@type": "@id" },
    "compactArrays":        { "@type": "xsd:boolean" },
    "compactToRelative":    { "@type": "xsd:boolean" },
    "contentType":          { "@type": "xsd:string" },
    "expandContext":        { "@type": "@id" },
    "extractAllScripts":    { "@type": "xsd:boolean" },
    "httpLink":             { "@type": "xsd:string", "@container": "@set" },
    "httpStatus":           { "@type": "xsd:integer" },
    "normative":            { "@type": "xsd:boolean" },
    "processingMode":       { "@type": "xsd:string" },
    "processorFeature":     { "@type": "xsd:string" },
    "produceGeneralizedRdf":{ "@type": "xsd:boolean" },
    "specVersion":          { "@type": "xsd:string" },
    "useNativeTypes":       { "@type": "xsd:boolean" }
  }
}

manifest.jsonld

{
  "@context": ["context.jsonld", {"@base": "manifest"}],
  "@id": "",
  "@type": "mf:Manifest",
  "name": "JSON-LD Test Suite",
  "description": "This manifest loads some tests for resolving https://github.com/digitalbazaar/pyld/issues/128",
  "sequence": [{
  	"@id": "#t1",
  	"@type": ["jld:PositiveEvaluationTest", "jld:ExpandTest"],
  	"name": "Test for JSON-LD via Link header",
  	"purpose": "Tests for correct retrieval of remote JSON-LD when it is present as a Link HTTP header",
  	"input": "/full/path/to/pyld/tests/sample.jsonld",
  	"expect": "/full/path/to/pyld/tests/output.jsonld"
  }, {
    "@id": "#t2",
    "@type": ["jld:PositiveEvaluationTest", "jld:ExpandTest"],
    "name": "Test for JSON-LD via direct JSON-LD URL",
    "purpose": "Tests for correct retrieval of remote JSON-LD when it is given as a direct link within a context",
    "input": "/full/path/to/pyld/tests/sample2.jsonld",
    "expect": "/full/path/to/pyld/tests/output.jsonld"
  }]
}

sample.jsonld

{
	"@context": "https://schema.org",
	"@type":"Dataset",
	"@id":"http://localhost:5000/collections/obs",
	"url":"http://localhost:5000/collections/obs"
}

sample2.jsonld

{
	"@context": "https://schema.org/docs/jsonldcontext.jsonld",
	"@type":"Dataset",
	"@id":"http://localhost:5000/collections/obs",
	"url":"http://localhost:5000/collections/obs"
}

output.jsonld

[
  {
    "@id": "http://localhost:5000/collections/obs",
    "@type": [
      "http://schema.org/Dataset"
    ],
    "http://schema.org/url": [
      {
        "@id": "http://localhost:5000/collections/obs"
      }
    ]
  }
]

The tests both fail before the changes. The tests both passs after the changes. Since all the test cases of this repository are remote, I am not sure whether or where to contribute these test cases.

However, there is one regression, which I do not currently know how to resolve. This test does not have an error before the changes, and does have an error after the changes. The report is:

======================================================================
ERROR: Remote document: https://w3c.github.io/json-ld-api/tests/remote-doc-manifest#t0002: Document loader loads a JSON document.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/runtests.py", line 372, in runTest
    raise e
  File "tests/runtests.py", line 319, in runTest
    result = getattr(jsonld, fn)(*params)
  File "tests/../lib/pyld/jsonld.py", line 163, in expand
    return JsonLdProcessor().expand(input_, options)
  File "tests/../lib/pyld/jsonld.py", line 820, in expand
    remote_doc = load_document(input_, options)
  File "tests/../lib/pyld/jsonld.py", line 6591, in load_document
    code='loading document failed')
pyld.jsonld.JsonLdError: ('No remote document found at the given URL.',)
Type: jsonld.NullRemoteDocument
Code: loading document failed

All other tests are unaffected. However, as mentioned in #128, running the tests as described with no changes elicits 5 failures, 2 errors, and 34 skipped tests.

@datadavev datadavev mentioned this pull request Aug 3, 2020
datadavev added a commit to datadavev/pyld that referenced this pull request Aug 3, 2020
@JulienPalard
Copy link

This PR does not fixes the following test:

from pyld import jsonld

jsonld.expand(
    {
        "@context": "http://schema.org/",
        "@type": "Person",
        "name": "Jane Doe",
        "jobTitle": "Professor",
        "telephone": "(425) 123-4567",
        "url": "http://www.janedoe.com",
    }
)

because https://schema.org gives Content-Type: text/html, which is in headers["Accept"]:

(Pdb) p headers["Accept"]
'application/ld+json;profile=http://www.w3.org/ns/json-ld#context, application/ld+json, application/json;q=0.5, text/html;q=0.8, application/xhtml+xml;q=0.8'

it's here due to pyld/jsonld.py adding it:

6573 	    # FIXME: only if html5lib loaded?
6574 	    headers['Accept'] = headers['Accept'] + ', text/html;q=0.8, application/xhtml+xml;q=0.8'

According to json-ld:

A processor seeing a non-JSON result will note the presence of the link header and load that document instead.

So even if we accept HTML, if it's HTML and there's a json-ld alternative, let's use it.

To do this, I'll move doc['document'] = response.json() right before returning, and just drop the if content_type not in headers['Accept']: you added, it works for me.

In other words (and with a correct indentation) I mean:

--- a/lib/pyld/documentloader/requests.py
+++ b/lib/pyld/documentloader/requests.py
@@ -69,7 +69,6 @@ def requests_document_loader(secure=False, **kwargs):
                 'contentType': content_type,
                 'contextUrl': None,
                 'documentUrl': response.url,
-                'document': response.json() if content_type in headers['Accept'] else None
             }
             link_header = response.headers.get('link')
             if link_header:
@@ -77,15 +76,15 @@ def requests_document_loader(secure=False, **kwargs):
                     LINK_HEADER_REL)
                 # only 1 related link header permitted
                 if linked_context and content_type != 'application/ld+json':
-                  if isinstance(linked_context, list):
-                      raise JsonLdError(
-                          'URL could not be dereferenced, '
-                          'it has more than one '
-                          'associated HTTP Link Header.',
-                          'jsonld.LoadDocumentError',
-                          {'url': url},
-                          code='multiple context link headers')
-                  doc['contextUrl'] = linked_context['target']
+                    if isinstance(linked_context, list):
+                        raise JsonLdError(
+                            'URL could not be dereferenced, '
+                            'it has more than one '
+                            'associated HTTP Link Header.',
+                            'jsonld.LoadDocumentError',
+                            {'url': url},
+                            code='multiple context link headers')
+                    doc['contextUrl'] = linked_context['target']
                 linked_alternate = parse_link_header(link_header).get('alternate')
                 # if not JSON-LD, alternate may point there
                 if (linked_alternate and
@@ -93,9 +92,8 @@ def requests_document_loader(secure=False, **kwargs):
                         not re.match(r'^application\/(\w*\+)?json$', content_type)):
                     doc['contentType'] = 'application/ld+json'
                     doc['documentUrl'] = prepend_base(url, linked_alternate['target'])
-                    if content_type not in headers['Accept']:
-                        # Original was not JSON/JSON-LD; fetch linked JSON-LD
-                        return loader(doc['documentUrl'], options=options)
+                    return loader(doc['documentUrl'], options=options)
+            doc['document'] = response.json()
             return doc
         except JsonLdError as e:
             raise e

@alpha-beta-soup
Copy link
Author

Hmm, that's a bit odd. Yes, Pyld adds text/html (in the line you pointed out, i.e. headers['Accept'] = headers['Accept'] + ', text/html;q=0.8, application/xhtml+xml;q=0.8') but that's a default header value and it shouldn't affect the server response as long as application/json+ld is included in the Accept header value with a higher precedence. That check

if content_type not in headers['Accept']:
    # Original was not JSON/JSON-LD; fetch linked JSON-LD
    return loader(doc['documentUrl'], options=options)

is important, it checks whether the server responded with application/json+ld, and if not, attempts to fetch the linked resource. There's no value doing this if the response is already JSON-LD.

@JulienPalard
Copy link

it shouldn't affect the server response as long as application/json+ld is included in the Accept header value with a higher precedence.

Totally agree. But looks like https://schema.org/ does not have a application/json-ld variant at all, so it always reply with text/html, independently of the Accept header. But this text/html response links to the ld+json, see:

$ curl -I -H "Accept: application/ld+json" https://schema.org
HTTP/2 200 
[...]
link: </docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"
[...]
content-type: text/html
[...]

Which looks legit to me, even though I still didn't read and understood rfc8288 entierly.

That check

if content_type not in headers['Accept']:
    # Original was not JSON/JSON-LD; fetch linked JSON-LD
    return loader(doc['documentUrl'], options=options)

is important, it checks whether the server responded with application/json+ld, and if not, attempts to fetch the linked resource. There's no value doing this if the response is already JSON-LD.

IIRC It looks already covered by the:

not re.match(r'^application/(\w*+)?json$', content_type)):

check a few line before.

@alpha-beta-soup
Copy link
Author

Ah I see. Sorry it's been a while since I've looked at this. I think your version of the PR makes sense, and as long as both our tests pass, it gets my vote.

@JulienPalard
Copy link

and as long as both our tests pass, it gets my vote.

I'd like to add both tests to the test suite, but I don't understand how to do so. If someone can explain before merging I'd gladly do so.

@mielvds
Copy link
Collaborator

mielvds commented Feb 5, 2026

@anatoly-scherbakov first a github noob question, but how ddid you link iss-128 to this PR?
Second: I re-enabled the https://schema.org/ context in a test in the iss-128 branch because of this fix. However, it seems like I'm stil getting the pyld.jsonld.JsonLdError: ('Could not retrieve a JSON-LD document from the URL.',)

@anatoly-scherbakov
Copy link
Collaborator

@mielvds thanks for review!

  1. The PR was already present, I just was able to push to it. I think I have that ability since I am a contributor to digitalbazaar/pyld. But the exact mechanics is due to an LLM agent helping me, it's going to be difficult to resurrect the necessary commands 😅

  2. I am adding tests to verify this. It seems that .json() was called too early.

@alpha-beta-soup
Copy link
Author

Thanks for resurrecting this, let me know if you need anything from me.

@anatoly-scherbakov
Copy link
Collaborator

@alpha-beta-soup thank you for your contribution! I'd welcome it if you would look through the additional changes I've done to see if they are what you had intended.

Copy link
Author

@alpha-beta-soup alpha-beta-soup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exactly my intention, has my approval subject to tests passsing without regressions

alpha-beta-soup and others added 6 commits February 10, 2026 09:06
Co-authored-by: Anatoly Scherbakov <altaisoft@gmail.com>
Test that both requests and aiohttp document loaders can expand
a JSON-LD document whose context is https://schema.org, which
serves text/html with a Link rel="alternate" header pointing to
the actual JSON-LD context.

Ref digitalbazaar#128

Co-authored-by: Cursor <cursoragent@cursor.com>
Defer response.json() until after Link header processing so that
non-JSON responses (e.g. text/html from schema.org) don't crash
the loader. When a Link rel="alternate" of type application/ld+json
is found and the response isn't already JSON, follow it unconditionally.

Also fixes 2-space indentation to 4-space.

Fixes digitalbazaar#128

Co-authored-by: Cursor <cursoragent@cursor.com>
Same fix as the requests loader: defer response.json() until after
Link header processing, and actually fetch the alternate URL when
found (previously it only updated doc['documentUrl'] without making
a second request).

Also fixes 2-space indentation to 4-space.

Fixes digitalbazaar#128

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Collaborator

@mielvds mielvds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! The tests that were failing before now pass, so the schema.org context URL seems to work now. I've reverted them (and figured out how to push to this branch ;)).

I don't like the test_schema_org.py file though (probably an AI artefact). Those test cases should be in something close to

  • a test_aiohttp.py and test_requests.py,
  • or simpler: test_loader.py with classes TestRequests and TestAiohttp

The tests themselves can use schema.org, but should be named after the functionality they are testing, so something like test_remote_context_without_jsonld_mimetype

@mielvds
Copy link
Collaborator

mielvds commented Feb 11, 2026

@anatoly-scherbakov I just merged #170 introducing tests/test_document_loader.py. I'd rather have you merge these tests in there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants