Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Percent encoding spec and : and / #39

Open
jdillon opened this issue Apr 6, 2018 · 13 comments
Open

Percent encoding spec and : and / #39

jdillon opened this issue Apr 6, 2018 · 13 comments
Labels
PURL core specification Format and syntax that define PURL (excludes PURL type definitions)
Milestone

Comments

@jdillon
Copy link

jdillon commented Apr 6, 2018

The notes in the specification about percent encoding of ":" are a bit confusing:

the '#', '?', '@' and ':' characters must NOT be encoded when used as separators. They may need to be encoded elsewhere
the ':' type separator does not need to and must NOT be encoded. It is unambiguous unencoded everywhere

It seems like these 2 contradict either other. The former indicates that ":" may need to be encoded elsewhere. The later indicates that ":" is "unambiguous unencoded everywhere".

Similarly the qualifier component documentation says:

A value must be must be a percent-encoded string

... and does not mention anything about "/". But the test-suite-data.json references canonical_purl representations like repository_url=repo.spring.io/release.

And the percent-encoding docs state:

the '/' used as namespace/name and subpath segments separator does not need to and must NOT be percent-encoded. It is unambiguous unencoded everywhere

My interpretation of this boils down to... "/" is never encoded, but the language in the parts of the specification are unclear. The name, namespace and subpath parts are clear wrt to "/", which leaves the qualifier and version bits as vague as to if "/" is supposed to be percent-encoded or not.

@stevespringett
Copy link
Member

stevespringett commented Aug 16, 2018

With reference to:

pkg:docker/customer/dockerimage@sha256:244fd47e07d1004f0aed9c?repository_url=gcr.io

It is my opinion that the testsuite in the .NET, Python, and Java implementations as well as in the spec repo are incorrect and violate the spec. Whereas the testsuite in the Rust implementation adheres to the spec (as written).

Spec states:

  • The version is prefixed by a '@' separator when not empty
  • A version must be a percent-encoded string
  • the '#', '?', '@' and ':' characters must NOT be encoded when used as separators. They may need to be encoded elsewhere

Therefore, the canonicalized purl should be changed to:

pkg:docker/customer/dockerimage@sha256%3A244fd47e07d1004f0aed9c?repository_url=gcr.io

We need clarification on this.

@stevespringett
Copy link
Member

stevespringett commented Aug 16, 2018

With reference to:

pkg:Maven/org.apache.xmlgraphics/batik-anim@1.9.1?repositorY_url=repo.spring.io/release&classifier=sources

It is my opinion that the testsuite in the .NET, Python, and Java implementations as well as in the spec repo are incorrect and violate the spec. Whereas the testsuite in the Rust implementation adheres to the spec (as written).

Spec states:

  • the '=' qualifiers key/value separator must NOT be encoded
  • the '/' used as type/namespace/name and subpath segments separator does not need to and must NOT be percent-encoded. It is unambiguous unencoded everywhere
  • A value must be must be a percent-encoded string

Therefore, the canonicalized purl should be changed to:

pkg:maven/org.apache.xmlgraphics/batik-anim@1.9.1?classifier=sources&repository_url=repo.spring.io%2Frelease

We need clarification on this.

@stevespringett
Copy link
Member

Based on my understanding of the spec, there are a total of three test cases that violate the spec as its currently written. The Rust implementation has went ahead and changed its test cases, which I'm in agreement with. But we need clarification as I've run into a situation with the Java implementation which requires additional encoding, but the additional encoding breaks the above test cases. However, adopting the testcase changes made by the Rust impl makes the Java impl happy as well.

stevespringett added a commit to package-url/packageurl-java that referenced this issue Aug 17, 2018
…ing decoding to version, qualifier values, and subpath when parsing a purl string. Added missing encoding to name, version, qualifier value when canonicalizing. These corrections break the testsuites. Corrected testsuites (now varies from spec testsuites). package-url/purl-spec#39 . Added additional validation as part of  package-url/purl-spec#46
@stevespringett
Copy link
Member

@pombredanne this ticket should be resolved prior to 1.0 #50

@gernot-h
Copy link
Contributor

gernot-h commented Apr 3, 2019

While I also stumbled hard about this and also would welcome a clearer wording, I see no contradiction here and would consider current testcases correct.

First statement summarizes where the #', '?', '@' and ':' characters MUST NOT be encoded, but it makes no definite statement about where they MUST be encoded. This is further clarified in the following statement which clearly says that it is "unambiguous unencoded everywhere".

To keep purls readable for humans, I think encoding shall be avoided where possible. Think about Debian versions - something like pkg:deb/debian/file@1%3A5.30-1%2Bdeb9u2?arch=amd64 looks quite unreadable to me.

@pombredanne
Copy link
Member

@stevespringett
re

Therefore, the canonicalized purl should be changed to:
pkg:maven/org.apache.xmlgraphics/batik-anim@1.9.1?classifier=sources&repository_url=repo.spring.io%2Frelease
We need clarification on this.

let's update the tests to match the spec and ...

@gernot-h very good point. In fact on common packages such as scoped npms that have a namespace that has an @ sign prefix the encoding was added because it was encoded in the public npm registry. But this is no longer the case and this is not needed as there are never any ambiguities there.
And your debian example is also telling.
So let's devise the bare absolute minimum encoding that would be needed such that Package URLs stay non ambiguous and are the most readable in the majority of the cases.

@gernot-h
Copy link
Contributor

gernot-h commented Apr 9, 2019

@pombredanne Great, anything I can do to help sorting this out? Prepare some pull requests? (I will be on vacation for the next 1.5 weeks, but be interested to help afterwards)

Regarding the confusion about the two statements regarding ":", what about rephrasing the first statement to:

From:

the '#', '?', '@' and ':' characters must NOT be encoded when used as separators. They may need to be encoded elsewhere

To:

the '#', '?', '@' and ':' characters must NOT be encoded when used as separators. Some of them need to be encoded elsewhere as specified in the rules below.

@stevespringett stevespringett added the PURL core specification Format and syntax that define PURL (excludes PURL type definitions) label Apr 1, 2020
@stevespringett stevespringett added this to the 1.0 milestone Apr 1, 2020
@matt-phylum
Copy link

The WHATWG URL spec has a great section on percent encoding that defines a set of character classes and their exact encoding rules. It would be great if the PURL spec defined its percent encoding rules in relation to those. For example, "the PURL path percent-encode set is the WHATWG URL path percent-encode set and U+0040 (@)".

@michaelbprice
Copy link

My interpretation of the rules when putting together a PR (#245) for a vcpkg purl type, was that any value in a key=value qualifier should be percent-encoded, and the tests I've added there reflect that decision. It tests that after parsing, the resulting values are not percent-encoded. IMO, readability of purl strings themselves are secondary to unambiguous and fast parsing by tools (which then might have a presentation layer that helps users understand the data).

@matt-phylum
Copy link

This comment is about qualifiers only. The PURL spec is fairly clear (IMO) that colons should not be encoded in version numbers, but could be more explicit.

This ticket was originally about qualifiers and I've found a potentially serious problem with the way qualifiers are encoded and decoded.

Neither PURL nor the WHATWG or RFC URL specs say that : or / need to be encoded when used in a query string. Any character is allowed to be encoded, so technically it's allowable, even if not canonical, and the qualifier should parse correctly either way.

However, PURL qualifiers look suspiciously like x-www-form-urlencoded parameters, and people may be tempted to use the x-www-form-urlencoded rules for encoding and decoding qualifiers. The WHATWG x-www-form-url-encoded spec does not require them to be encoded when parsing, but says that : and / do need to be encoded when serializing (as well as many other characters that the PURL spec never mentions like ().

If the set of encoded characters were the only difference, it would be fine if some implementations were using x-www-form-urlencoded and some were not, but there is one critical difference with x-www-form-urlencoded: + is replaced with during parsing, so it must be encoded as %2B when serializing. The current PURL spec never says anything about special handling of +. If some PURL implementations implement qualifiers using x-www-form-urlencoded and some implement qualifiers exactly as written in the current PURL spec, the implementations will interoperate just fine most of the time, but for certain cases the meaning of the qualifier changes when passing from one implementation to the other.

  • Serializing git+https:// using a PURL qualifier serializer and then parsing using an x-www-form-urlencoded parser changes the value to git https://
  • Serializing my project-1.0.0.tar.gz using a x-www-form-urlencoded serializer that replaces spaces with + (not all of them do) and then parsing using a PURL qualifier parser changes the value to my+project-1.0.0.tar.gz.

Besides needing to be more explicit about what characters need to be encoded in what parts of the PURL, to avoid PURLs that have different meanings to different implementations, the spec needs to clearly specify whether the qualifiers are x-www-form-urlencoded or if they are a simpler format that is like x-www-form-urlencoded but not.

@jdalton
Copy link

jdalton commented Aug 20, 2024

FYI package-url/packageurl-js has taken the following approach in their v2.0.0 release.
(From comment package-url/packageurl-js#43 (comment)):

This is handled in package-url/packageurl-js#73 by using URLSearchParams to encode and then turning + into %20 for better portability. I sided with the Rust implementation.

Also leveraging standard URLSearchParams. Deferring to standard encoders like URLSearchParams and encodeURIComponent for base encoding and then applying tweaks allows for less chances of mistakes (I trust standard implementations over myself).

@matt-phylum
Copy link

It's inconsistent in packageurl-js 2.0.0. Plus signs parse as plus signs, except for in qualifier values where they parse as spaces.

@jdalton
Copy link

jdalton commented Aug 20, 2024

@matt-phylum

Plus signs parse as plus signs, except for in qualifier values where they parse as spaces.

URLSearchParams (docs) are qualifier specific. I was referring to how qualifiers are encoded. I'm saying that in v2+ we encode the slashes in the qualifier as Rust does. We also follow Rust in that we do NOT encode spaces as +. The JS implementation is aligned with Rust on encoding of slashes as %2F and spaces as %20 in encoded qualifiers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PURL core specification Format and syntax that define PURL (excludes PURL type definitions)
7 participants