Kythe URI Specification
This document defines the schema for Kythe uniform resource identifiers ("Kythe URI").
The primary purpose of a Kythe URI is to provide a textual encoding of a Kythe VName, which is a unique identifier for a node in the semantic graph generated by Kythe-compatible tools. A Kythe URI may also be extended to encode simple queries about a particular VName in a transportable format.
Scheme Label
The scheme label for Kythe URIs will be "kythe:".
Character Set
A Kythe URI is a string of UCS (Unicode) characters. For storage and transmission, a Kythe URI will be encoded as UTF-8 with no byte-order mark, using Normalization Form NFKC.
Except as restricted by the syntax, all UCS characters are valid in a Kythe URI.
Reserved characters (e.g., "/", "?") and whitespace must be percent-escaped per Section 2.1 of RFC 3986, e.g., " " becomes "%20".
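The normalization and escaping rules above can be sketched in a few lines of Python. This is an illustration, not part of the specification: the helper name escape_component and its safe parameter are hypothetical, and which reserved characters may stay literal in a given field (for example "/" inside a path) is an assumption based on the examples in this document.

```python
import unicodedata
from urllib.parse import quote


def escape_component(text: str, safe: str = "") -> str:
    """NFKC-normalize text, then percent-escape it as UTF-8.

    Hypothetical helper. Characters listed in `safe` are left literal;
    letters, digits and "-", ".", "_", "~" are never escaped.
    """
    normalized = unicodedata.normalize("NFKC", text)
    # quote() encodes to UTF-8 by default and emits no byte-order mark.
    return quote(normalized, safe=safe)


print(escape_component("c++"))                         # c%2B%2B
print(escape_component("a b"))                         # a%20b
print(escape_component("file/base/file.h", safe="/"))  # file/base/file.h
```

Normalizing before escaping means that compatibility forms, such as the "ﬁ" ligature (U+FB01), collapse to their plain equivalents before any bytes are emitted.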
Syntax
The following grammar defines the syntax of a Kythe URI. Some productions have provisional values and will change as the Kythe schema evolves.
kythe-uri    = "kythe:" [corpus] attrs ["#" signature]
corpus       = "//" label 0*{"/" path-segment}
label        = ireg-name                                   -- RFC 3987
attrs        = ["?" lang-attr] ["?" path-attr] ["?" root-attr]
lang-attr    = "lang=" language
path-attr    = "path=" path-segment 0*{"/" path-segment}
root-attr    = "root=" root-segment 0*{"/" root-segment}
language     = 1*ipchar                                    -- RFC 3987
signature    = 1*ipchar                                    -- RFC 3987
root-segment = 1*ipchar                                    -- RFC 3987
path-segment = 1*{unreserved | pct-encoded | "/"}          -- RFC 3987
Note that the order of the attributes (the attrs production) is fixed, to ensure that a Kythe URI has a canonical string encoding.
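As a rough sketch of how the fixed attribute order yields a canonical string, the following Python builds a Kythe URI from the five VName fields. The VName tuple and format_kythe_uri are hypothetical names chosen for this illustration, and leaving "/" unescaped in the corpus, path, and root is an assumption drawn from the examples in this document rather than a rule stated by the grammar.

```python
from typing import NamedTuple
from urllib.parse import quote


class VName(NamedTuple):
    """Hypothetical container for the fields a Kythe URI encodes."""
    signature: str = ""
    corpus: str = ""
    root: str = ""
    path: str = ""
    language: str = ""


def format_kythe_uri(v: VName) -> str:
    """Emit the fields in grammar order: corpus, lang, path, root, signature."""
    parts = ["kythe:"]
    if v.corpus:
        parts.append("//" + quote(v.corpus, safe="/"))
    # The attrs production fixes this order; empty attributes are omitted.
    for key, value in (("lang", v.language), ("path", v.path), ("root", v.root)):
        if value:
            parts.append("?" + key + "=" + quote(value, safe="/"))
    if v.signature:
        parts.append("#" + quote(v.signature, safe=""))
    return "".join(parts)


print(format_kythe_uri(VName(signature="class-Foo", corpus="corpusname",
                             path="file/base/file.h", language="c++")))
# kythe://corpusname?lang=c%2B%2B?path=file/base/file.h#class-Foo
```

Because the order never depends on how the caller supplied the fields, two equal VNames always produce the same string under this sketch.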
For queries, path-segment is resolved as specified in RFC 3986 Section 5.2.4 (Remove Dot Segments).
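For reference, the dot-segment removal procedure of RFC 3986 Section 5.2.4 can be written out as below. This is a direct transcription of the RFC's steps into Python under the hypothetical name remove_dot_segments; how a Kythe query service actually applies it is not specified here.

```python
def remove_dot_segments(path: str) -> str:
    """Apply RFC 3986 Section 5.2.4 to a path string (illustrative sketch)."""
    out = []  # output buffer, one entry per kept segment
    inp = path
    while inp:
        if inp.startswith("../"):      # step A
            inp = inp[3:]
        elif inp.startswith("./"):     # step A
            inp = inp[2:]
        elif inp.startswith("/./"):    # step B: "/./" becomes "/"
            inp = inp[2:]
        elif inp == "/.":              # step B
            inp = "/"
        elif inp.startswith("/../"):   # step C: drop the last kept segment
            inp = inp[3:]
            if out:
                out.pop()
        elif inp == "/..":             # step C
            inp = "/"
            if out:
                out.pop()
        elif inp in (".", ".."):       # step D
            inp = ""
        else:                          # step E: move one segment to the output
            start = 1 if inp.startswith("/") else 0
            idx = inp.find("/", start)
            if idx == -1:
                out.append(inp)
                inp = ""
            else:
                out.append(inp[:idx])
                inp = inp[idx:]
    return "".join(out)


print(remove_dot_segments("a/b/../c/./d.go"))  # a/c/d.go
```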
See also: Vector-Name (VName).
Examples (subject to change):
- Empty (no fields):
  kythe:
- Signature only:
  kythe:#loc-a90320dafd60
- Ad-hoc corpus (signature, corpus, path, language):
  kythe://corpusname?lang=c%2B%2B?path=file/base/file.h#class-Foo
- Bitbucket (corpus, path):
  kythe://bitbucket.org/creachadair/stringset?path=README.md
- Maven (corpus, path, language):
  kythe://maven.org/central/org/apache/thrift?lang=java?path=libthrift/0.9.1
- Language, path, signature:
  kythe:?lang=go?path=mapreduce/go/contrib/plan.go#MR
- Corpus, path, language:
  kythe://code.google.com/p/go.tools?lang=go?path=cmd/godoc/doc.go
- Alternate root:
  kythe://chromium.org/chrome?path=openssl/crypto/bf/bf_pi.h?root=third_party/openssl/1650
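To make the examples concrete, here is a minimal Python sketch that decodes a Kythe URI back into its fields. The function name parse_kythe_uri and the dictionary keys are hypothetical. The sketch is deliberately lenient: it accepts attributes in any order and does not validate characters against the grammar, so it should not be read as a conforming parser.

```python
from urllib.parse import unquote

# Maps attribute names in the URI to the field names used in the result.
_ATTR_FIELDS = {"lang": "language", "path": "path", "root": "root"}


def parse_kythe_uri(uri: str) -> dict:
    """Split a Kythe URI into signature, corpus, root, path and language."""
    if not uri.startswith("kythe:"):
        raise ValueError("missing 'kythe:' scheme label")
    fields = {"signature": "", "corpus": "", "root": "", "path": "", "language": ""}
    # The signature is everything after the first "#".
    rest, _, signature = uri[len("kythe:"):].partition("#")
    fields["signature"] = unquote(signature)
    # The corpus (if any) comes first; each attribute is introduced by "?".
    head, *attrs = rest.split("?")
    if head:
        if not head.startswith("//"):
            raise ValueError("corpus must start with '//'")
        fields["corpus"] = unquote(head[2:])
    for attr in attrs:
        key, _, value = attr.partition("=")
        if key not in _ATTR_FIELDS:
            raise ValueError("unknown attribute: " + key)
        fields[_ATTR_FIELDS[key]] = unquote(value)
    return fields


print(parse_kythe_uri("kythe:?lang=go?path=mapreduce/go/contrib/plan.go#MR"))
```

Percent-escapes are decoded only after the string has been split, so an escaped "?" or "#" inside a field cannot be mistaken for a delimiter.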
Rationale
The grammar for kythe-uri is compatible with the generic URI syntax defined in RFC 3986, to the extent that a fairly naive parser should be able to handle parsing a Kythe URI into its high-level components: the "hostname" and "path" components of the generic URI will represent the corpus, the "query" component will capture the attrs, and the "fragment" component will capture the signature.
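The compatibility claim can be checked with an off-the-shelf generic URI splitter. The sketch below uses urlsplit from Python's standard library; the wrapper split_generic and its result keys are hypothetical. Note that the values remain percent-encoded, and that the "query" component still has to be split on "?" because Kythe separates attributes with "?" rather than "&".

```python
from urllib.parse import urlsplit


def split_generic(uri: str) -> dict:
    """Map generic URI components onto Kythe URI productions (sketch)."""
    parts = urlsplit(uri)
    return {
        # "hostname" plus "path" of the generic URI form the corpus.
        "corpus": parts.netloc + parts.path,
        # The generic "query" holds all attrs, separated by "?".
        "attrs": parts.query.split("?") if parts.query else [],
        # The generic "fragment" is the signature.
        "signature": parts.fragment,
    }


print(split_generic("kythe://bitbucket.org/creachadair/stringset?path=README.md"))
```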
The meaning of the strings generated by the corpus production is not defined in this specification; the intent is to allow a corpus to behave like a hostname, so that a server providing Kythe data can use the corpus string to locate the data for that corpus. For services that support many independent corpora (e.g., github.com, bitbucket.org, code.google.com), the corpus field will probably include information about the project directly (e.g., "code.google.com/p/go.text"). In cases where there is only a single corpus with a body of different branches or subdivisions, some of that context may be stored in the root attribute instead.
The decision about which representation to choose is mainly controlled by whether the "project" label is likely to vary. A github.com repo will not frequently change name, so it makes sense to include the repo name as part of the corpus, and reserve the root field for branches. The encoding of the URI is agnostic to the decision.