Aside from pure lambda calculus, most programming languages come equipped with "intrinsics", which for this discussion I'll define as the collection of entities (types, functions, variables, properties, etc.) that are built in or intrinsic to the language. Examples include Go's built-in append function, core Python types like str and unicode that are (essentially) hardwired into __builtin__, built-in types like int, float, and char in C and C-derived languages,[1] and so forth. More broadly, the term may be slightly abused to include code that is technically library code (in that it may be separately importable), but which is usually not seen in source form because it is precompiled in a well-known location. (Many of the JVM built-in types like java.lang.String are of this character, and likewise the Go standard library as consumed by the go build command.)
For indexing, intrinsics present a tricky challenge: You'd like to index them, since they are an important part of the language. But the nature of built-in objects is that they do not present as "compilations" in the usual sense: You don't necessarily get source for them as part of the build.
In some cases you could argue that cross-references for language intrinsics aren't that useful: It isn't necessarily helpful for a code browser to show you all the places where the built-in bool type is used, for example.[2] But even if you choose not to emit cross-references for intrinsics, tools need to be able to find links to things like canonical documentation (e.g., to have uses of str link to http://docs.python.org/3/library/stdtypes.html#str). This becomes especially important for "hosted" languages such as Python, JavaScript, or Lua, where there are often additional intrinsics provided by the host environment (e.g., the browser, for ECMAScript applications) that aren't necessarily visible to the programmer given only the language specification.
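To make the documentation-linking idea concrete, here is a minimal sketch in Go of a table mapping intrinsic names to canonical documentation URLs. The table contents and the DocURL helper are illustrative assumptions, not part of any existing tool; only the str entry comes from the text above.

```go
package main

import "fmt"

// docLinks is an illustrative table mapping Python intrinsic names to
// canonical documentation URLs. Only the str entry is taken from the
// discussion above; a real table would cover the full set of intrinsics
// (and, for hosted environments, the host's additions).
var docLinks = map[string]string{
	"str": "http://docs.python.org/3/library/stdtypes.html#str",
}

// DocURL returns the canonical documentation link for an intrinsic name,
// or "" if none is known.
func DocURL(name string) string { return docLinks[name] }

func main() {
	fmt.Println(DocURL("str"))
}
```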
We have traditionally handled this by having the language indexer generate data for intrinsics opportunistically: The first time the indexer sees a usage of some intrinsic object, it generates the relevant graph artifacts for it and sets a flag to remember it did so, so that it won't repeat the effort (or duplicate the data). This works reasonably well for small corpora, but has some pernicious side-effects:
- The indexer is no longer hermetic, since the cache carries state across multiple compilations.
- (For that reason) Indexer output can't be safely cached, since outputs may vary depending on order of analysis.
- On a large corpus, there may be a lot of duplicate data attributable to intrinsics, which increases storage and processing costs (sometimes quite substantially).
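The opportunistic scheme and its order dependence can be sketched as follows. The types and names here are hypothetical stand-ins, not any real indexer's API; the emitted map is the cross-compilation state that makes the indexer non-hermetic.

```go
package main

import "fmt"

// Indexer sketches the traditional opportunistic scheme: the first
// compilation that mentions an intrinsic triggers emission of its graph
// artifacts, and a flag records that this was done. The emitted map
// carries state across compilations, which is exactly what makes the
// indexer non-hermetic: output for a given compilation depends on which
// compilations were processed before it.
type Indexer struct {
	emitted map[string]bool // intrinsics already emitted during this run
	output  []string        // stand-in for emitted graph artifacts
}

func NewIndexer() *Indexer { return &Indexer{emitted: make(map[string]bool)} }

// Index processes one compilation, emitting artifacts for any intrinsic
// it uses that has not yet been seen in this indexer's lifetime.
func (ix *Indexer) Index(unit string, intrinsicsUsed []string) {
	for _, name := range intrinsicsUsed {
		if !ix.emitted[name] {
			ix.emitted[name] = true
			ix.output = append(ix.output, unit+": intrinsic "+name)
		}
	}
}

func main() {
	ix := NewIndexer()
	ix.Index("a.go", []string{"append"})
	ix.Index("b.go", []string{"append", "len"}) // append already emitted for a.go
	fmt.Println(ix.output)                      // data for append is attributed to a.go, not b.go
}
```

Reversing the order of the two Index calls attributes the append data to b.go instead, which is why outputs cannot safely be cached per compilation.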
Because this issue affects virtually all languages in a similar way, I'd like to propose a more hermetic and reliable solution.
Proposal: For any language X, define a key compilation for X as a compilation unit with no source files or required inputs, whose VName has the following structure:
{ "language": "X", "corpus": "kythe/intrinsics", "path": "", "root": "" }
In other words, the compilation has the language label for X, the special corpus label kythe/intrinsics, and all other VName fields empty except (optionally) the signature, which may carry a language-version label (e.g., java8, python3, etc.). The intended use of such a compilation is that:
- When indexing, generate exactly one such key compilation unit for each language to be indexed.
- Add the key compilation to the work list for the indexers of that language.
- Teach the indexers to recognize the key compilation, and in response generate (once) whatever data are useful for the intrinsics of the language.
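A constructor for the key compilation's VName might look like the following sketch, using a simplified local struct rather than Kythe's actual protobuf types; the function name is a hypothetical choice.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VName is a simplified stand-in for Kythe's VName message, carrying only
// the fields relevant to this proposal.
type VName struct {
	Signature string `json:"signature,omitempty"`
	Corpus    string `json:"corpus"`
	Root      string `json:"root"`
	Path      string `json:"path"`
	Language  string `json:"language"`
}

// KeyCompilationVName constructs the VName of the key compilation for
// language lang: the special corpus label, the language label, all other
// fields empty except an optional language-version label in the signature
// (e.g., "java8" or "python3").
func KeyCompilationVName(lang, version string) VName {
	return VName{
		Signature: version,
		Corpus:    "kythe/intrinsics",
		Language:  lang,
	}
}

func main() {
	b, _ := json.Marshal(KeyCompilationVName("python", "python3"))
	fmt.Println(string(b))
}
```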
This approach has the benefit that you need only generate the data for intrinsics once per corpus, and it lets you index intrinsics entirely separately and cache the results. Moreover, given that intrinsics change only rarely, you can afford to cache these data for a long time: it may only be necessary to update them when the indexer binary changes, or when a new release of the language is encountered.
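Since the text identifies only two inputs that should invalidate cached intrinsics data (the indexer binary and the language release), a cache key could be derived from just those. This is a speculative sketch; the function name and hashing scheme are assumptions, not part of the proposal.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// IntrinsicsCacheKey derives a cache key for a language's intrinsics data
// that depends only on the indexer binary version and the language
// release. Cached results stay valid until one of those inputs changes.
// A NUL separator keeps distinct input triples from colliding.
func IntrinsicsCacheKey(lang, indexerVersion, languageRelease string) string {
	h := sha256.Sum256([]byte(lang + "\x00" + indexerVersion + "\x00" + languageRelease))
	return fmt.Sprintf("%x", h[:8])
}

func main() {
	fmt.Println(IntrinsicsCacheKey("python", "indexer-v1.2", "python3"))
}
```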
Implementing this proposal fully will require that each language indexer be taught to recognize key compilations for its primary language label, and that any indexing installation be instrumented to inject a key compilation for each language sometime during the process. Otherwise, however, this should not require any schema changes (unless otherwise needed to model some specific language's intrinsics).
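The recognition step each indexer would need might look like the following sketch. The CompilationUnit struct and helper names are simplified, hypothetical stand-ins for the real Kythe types.

```go
package main

import "fmt"

const intrinsicsCorpus = "kythe/intrinsics"

// CompilationUnit is a minimal stand-in for a Kythe compilation unit:
// only the VName fields used here, plus the unit's source files.
type CompilationUnit struct {
	Corpus, Language string
	SourceFiles      []string
}

// IsKeyCompilation reports whether unit is the key compilation for lang:
// the special corpus label, the matching language label, and no source
// files or required inputs.
func IsKeyCompilation(unit CompilationUnit, lang string) bool {
	return unit.Corpus == intrinsicsCorpus &&
		unit.Language == lang &&
		len(unit.SourceFiles) == 0
}

// Index dispatches on the unit kind: a key compilation triggers a one-time
// emission of the language's intrinsics data, and anything else is indexed
// normally. The return value stands in for the indexer's real output.
func Index(unit CompilationUnit, lang string) string {
	if IsKeyCompilation(unit, lang) {
		return "emit intrinsics for " + lang
	}
	return "index " + fmt.Sprint(unit.SourceFiles)
}

func main() {
	key := CompilationUnit{Corpus: intrinsicsCorpus, Language: "go"}
	fmt.Println(Index(key, "go"))
}
```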
Notes:
[1] Actually, int is a somewhat subtle case. A C compiler may define a built-in type such as int via a typedef in a header file, mapped to some "real" (concrete) type with a different name; so the concrete type may vary in size (and, for types like char, in signedness) depending on platform, flags, etc.
[2] It might be a fun curiosity, but probably not that productive. And answering such a query quickly may impose a substantial storage footprint.