Proposal: Per-language "key compilations" for language intrinsics


Aside from pure lambda calculus, most programming languages come equipped with "intrinsics", which for discussion I'll define as the collection of entities (types, functions, variables, properties, etc.) that are built-in or intrinsic to the language. Examples include Go's built-in append function, core Python types like str and unicode that are (essentially) hardwired into __builtin__, built-in types like int, float, and char in C and C-derived languages,[1] and so forth. More broadly, this terminology may be slightly abused to include code that is technically library code (in that it may be separately importable), but which is usually not seen in source form because it is precompiled in a well-known location. (Many of the JVM built-in types like java.lang.String are of this character, and likewise the Go standard library as consumed by the go build command.)

For indexing, intrinsics present a tricky challenge: You'd like to index them, since they are an important part of the language. But the nature of built-in objects is that they do not present as "compilations" in the usual sense: You don't necessarily get source for them as part of the build.

In some cases you could argue that cross-references for language intrinsics aren't necessarily that useful: It isn't especially helpful for a code browser to show you all the places where the built-in bool type is used, for example.[2] But even if you choose not to emit cross-references for intrinsics, tools need to be able to find links to things like canonical documentation (e.g., to have uses of str link to its documentation page). This becomes especially important for "hosted" languages such as Python, JavaScript, or Lua, where there are often additional intrinsics provided by the host environment (e.g., the browser, for ECMAScript applications) that aren't necessarily visible to the programmer given only the language specification.

We have traditionally handled this by having the language indexer generate data for intrinsics opportunistically: The first time the indexer sees a usage of some intrinsic object, it generates the relevant graph artifacts for it and then sets a flag to remember that it did so, so that it won't repeat the effort (or the data). This works reasonably well for small corpora, but has some pernicious side-effects:

  1. The indexer is no longer hermetic, since the cache carries state across multiple compilations.
  2. (For that reason) Indexer output can't be safely cached, since outputs may vary depending on order of analysis.
  3. On a large corpus, there may be a lot of duplicate data attributable to intrinsics, which increases storage and processing costs (sometimes quite substantially).
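To make the first two side-effects concrete, here is a toy model of the opportunistic scheme in Python. It is only a sketch: the class, the fact tuples, and the unit names are all hypothetical and do not correspond to any real indexer's API. It shows that the facts emitted for one compilation depend on which compilations were analyzed before it.

```python
# Hypothetical model of the opportunistic scheme; all names are illustrative.
INTRINSICS = {"int", "str", "bool"}


class OpportunisticIndexer:
    """Emits an intrinsic's node the first time that intrinsic is seen."""

    def __init__(self):
        # State that survives across compilations (the "already emitted" flags).
        self.emitted = set()

    def index(self, unit_name, used_names):
        """Return the facts emitted while analyzing one compilation unit."""
        facts = []
        for name in used_names:
            if name in INTRINSICS and name not in self.emitted:
                self.emitted.add(name)
                facts.append(("node", name))
            facts.append(("ref", unit_name, name))
        return facts


# Two toy compilations and the names each one uses.
units = {"a": ["str", "int"], "b": ["str"]}


def run(order):
    """Analyze the units in the given order with one shared indexer."""
    indexer = OpportunisticIndexer()
    return {name: indexer.index(name, units[name]) for name in order}


forward = run(["a", "b"])
backward = run(["b", "a"])

# The output for unit "b" differs depending on whether "a" was analyzed first,
# so caching per-compilation output is unsafe.
order_dependent = forward["b"] != backward["b"]
```

In this model the node for str lands in whichever unit happens to be analyzed first, which is exactly why a cached output for a single compilation cannot be trusted.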

Because this issue affects virtually all languages in a similar way, I'd like to propose a more hermetic and reliable solution.

Proposal: For any language X, define a key compilation for X as a compilation unit with no source files or required inputs, whose VName has the following structure:

   "language":  "X",
   "corpus":    "kythe/intrinsics",
   "path":      "",
   "root":      ""

In other words, the compilation has the language label for X and the special corpus label kythe/intrinsics, and all other VName fields are empty except (optionally) the signature, which may contain a language-version label (e.g., java8, python3, etc.). The intended use of such a compilation is as follows:

  1. When indexing, generate exactly one such key compilation unit for each language to be indexed.
  2. Add the key compilation to the work list for the indexers of that language.
  3. Teach the indexers to recognize the key compilation, and in response to generate whatever data are useful for the intrinsics of the language (once).
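The VName structure and the three steps above might be sketched roughly as follows in Python. This is a simplified illustration under stated assumptions: a compilation unit is represented only by its VName, and the helper names (key_compilation, is_key_compilation, build_work_list, index) and the fact tuples are hypothetical, not part of any existing indexer.

```python
from collections import namedtuple

# Field names follow the VName structure described above; this tuple is a
# stand-in for the real message type.
VName = namedtuple("VName", ["signature", "corpus", "root", "path", "language"])

INTRINSICS_CORPUS = "kythe/intrinsics"


def key_compilation(language, version=""):
    """Build the key compilation's VName for a language (hypothetical helper)."""
    return VName(signature=version, corpus=INTRINSICS_CORPUS,
                 root="", path="", language=language)


def is_key_compilation(vname, language):
    """True if vname denotes the key compilation for the given language."""
    return (vname.corpus == INTRINSICS_CORPUS
            and vname.language == language
            and vname.root == ""
            and vname.path == "")


def build_work_list(units):
    """Steps 1 and 2: prepend exactly one key compilation per language seen."""
    languages = sorted({u.language for u in units})
    return [key_compilation(lang) for lang in languages] + list(units)


def index(vname, language, intrinsics):
    """Step 3: emit intrinsic data only for the key compilation."""
    if is_key_compilation(vname, language):
        return [("node", name) for name in intrinsics]
    # Ordinary compilations would be analyzed normally; elided in this sketch.
    return [("unit", vname.path)]


units = [
    VName("", "demo", "", "a.py", "python"),
    VName("", "demo", "", "b.py", "python"),
    VName("", "demo", "", "main.go", "go"),
]
work = build_work_list(units)
```

Because the key compilation carries no inputs and no shared state, the indexer's output for it depends only on the language (and optional version label), which is what makes the scheme hermetic.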

This approach has the benefit that you need only generate the data for intrinsics once per corpus, and it also gives you the ability to index intrinsics entirely separately and cache their results. Moreover, given that intrinsics change only rarely, you can sensibly afford to cache these data for a long time: It may only be necessary to update them when the indexer binary changes, or when a new release of the language is encountered.
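One way to picture the caching described above is a cache keyed on the language, its version label, and some identifier for the indexer binary. The Python sketch below is an assumption-laden illustration: the IntrinsicsCache class, the cache_key function, and the idea of identifying the indexer by a digest string are all hypothetical choices, not an existing mechanism.

```python
import hashlib


def cache_key(language, language_version, indexer_digest):
    """Derive a stable key from the inputs that should invalidate the cache."""
    raw = "\0".join([language, language_version, indexer_digest])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class IntrinsicsCache:
    """In-memory stand-in for a long-lived store of intrinsics output."""

    def __init__(self):
        self._store = {}
        self.misses = 0

    def get_or_index(self, language, language_version, indexer_digest, index_fn):
        """Return cached output, running index_fn only on a cache miss."""
        key = cache_key(language, language_version, indexer_digest)
        if key not in self._store:
            self.misses += 1
            self._store[key] = index_fn()
        return self._store[key]


def emit_python_intrinsics():
    # Placeholder for indexing the key compilation; the facts are illustrative.
    return [("node", "str"), ("node", "int")]


cache = IntrinsicsCache()
first = cache.get_or_index("python", "python3", "indexer-v1", emit_python_intrinsics)
second = cache.get_or_index("python", "python3", "indexer-v1", emit_python_intrinsics)
# A new indexer binary (different digest) forces regeneration.
third = cache.get_or_index("python", "python3", "indexer-v2", emit_python_intrinsics)
```

Here the second lookup is served from the cache, and only the change of indexer digest triggers a second run of the indexing function.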

Implementing this proposal fully will require that each language indexer be taught to recognize key compilations for its primary language label, and that any indexing installation be instrumented to inject a key compilation for each language sometime during the process. Otherwise, however, this should not require any schema changes (unless otherwise needed to model some specific language's intrinsics).


[1] Actually int is a somewhat subtle case. A C compiler may define (e.g.) int via a typedef in a header file, mapped to some "real" (concrete) type with a different name; so the concrete type may vary in size and signedness depending on platform, flags, etc.
[2] It might be a fun curiosity, but probably not that productive. And answering such a query quickly may impose a substantial storage footprint.

fromberger created this task. (Via Web, Tue, Apr 18, 11:49 PM)
fromberger claimed this task.
fromberger added a project: Indexing.
Herald added a subscriber: Core Team. (Via Herald, Tue, Apr 18, 11:49 PM)
fromberger added a comment. (Via Web, Tue, Apr 18, 11:56 PM)

If we want this to be able to work for "library intrinsics" we may need to relax the constraints somewhat, to allow (say) path/root arguments and maybe other metadata in the compilation. I was intentionally minimalistic in the above; I don't think it's crucial that a solution follow exactly this format.

craigbarber added a subscriber: craigbarber. (Via Web, Wed, Apr 19, 9:45 AM)

In this proposal, would said key compilation unit contain source code definitions for the intrinsics which the indexer consumes normally, or would the indexer just have logic to recognize the key compilation and emit some hard coded nodes as it were representing each intrinsic?

fromberger added a comment. (Via Web, Wed, Apr 19, 9:53 AM)

> In this proposal, would said key compilation unit contain source code definitions for the intrinsics which the indexer consumes normally, or would the indexer just have logic to recognize the key compilation and emit some hard coded nodes as it were representing each intrinsic?

Probably not. If there are source definitions in the language itself, you don't necessarily need this mechanism. That said, the boundary is fuzzy: In some cases it may be difficult to index the source, or the source might just be stubs around a native-code implementation, and so on. At some point it becomes a judgement call about whether to instrument the build for the language tools (so you can index these directly) or to teach the indexer to treat them as "special".

I'd say if there are sources available, it's better to index directly, but you could make a case on either side.
