Kythe - Kythe Schema Overview

Terminology Overview
Edge and Node Examples

This document is intended as a high-level overview of the Kythe graph schema, and explains some of the core features that make up the overall graph structure. It does not replace the Kythe graph schema doc, which should still be used as a canonical reference. Instead, this is intended to help orient the reader with what various pieces of the schema mean, and how to interpret it.

Terminology Overview

This document relies on the following basic terminology:

An anchor denotes a region of a file. For example, in kcd.go, the Reader type/interface has many anchors:

type Reader interface {
  Revisions(_ context.Context, filter *RevisionsFilter, f func(Revision) error) error
  ^^^^^^^^^
}

For example, there will be an anchor (indicated above by ^ marks) covering the identifier Revisions from kcd.go. The span of an anchor is the region of the file containing the entity of interest.

A semantic node represents an abstract entity that may or may not be associated directly with a file. Semantic nodes obtain location by association with anchors.

From the same example above, we can think of Revisions as the definition site of a semantic node (function) belonging to the type Reader. It has an anchor in kcd.go that defines where it is in the file. There is also another anchor in memdb.go where a Revisions function is implemented for the (concrete) DB type in that file.

A fact is a named bytestring value associated with a node. The Kythe schema, defines the names and expected formats for various facts; for example the doc/uri fact is commonly used to attach a reference to some externally-located documentation to a node in the graph.

An edge is a directed, labelled relationship between two nodes. In the above example, we would paint a define/binding edge from the anchor at Revisions to the function semantic node. This indicates that the anchor is a "definition" for the function (specifically, one that binds it to an identifier).

Note	In the diagrams below, not all the possible anchors, semantic nodes, facts, or edges are shown. We will continue to refer back to the same `kcd.go` file where possible.

Edge and Node Examples

Now we will explain some of the basic edges and nodes that you’ll see in a Kythe schema.

Jump-to-Definition

Jump-to-definition allows navigating from a usage of some entity (e.g., a variable or function) to its definition. To support this, the indexer must, at minimum, emit an anchor for a definition site, a semantic node to represent the entity that is defined, and an anchor for the usage site. In this example, we see one definition and two references for a variable named matchRevision (from kcd.go):

func (rf *RevisionsFilter) Compile() (func(Revision) bool, error) {
  if rf == nil || (rf.Revision == "" && rf.Corpus == "" && rf.Until.IsZero() && rf.Since.IsZero()) {
    return func(Revision) bool { return true }, nil
  }
  var matchRevision, matchCorpus func(...string) bool // definition
  var err error
  if matchRevision, err = singleMatcher(rf.Revision); err != nil {  // reference 1
    return nil, err
  }
  if matchCorpus, err = singleMatcher(regexp.QuoteMeta(rf.Corpus)); err != nil {
    return nil, err
  }
  return func(rev Revision) bool {
    return matchRevision(rev.Revision) &&  // reference 2
      matchCorpus(rev.Corpus) &&
      (rf.Until.IsZero() || !rf.Until.Before(rev.Timestamp.In(time.UTC))) &&
      (rf.Since.IsZero() || !rev.Timestamp.In(time.UTC).Before(rf.Since))
  }, nil
}

The first mention of matchRevision (commented as "definition") records an anchor with a defines/binding edge to the semantic node for that variable. The references (commented "reference 1" and "reference 2") have anchors with ref edges to the same semantic node.

There may in general be more than one definition site for a given semantic node, and in some cases there may be no definition at all. When a unique definition does exist we call it the target definition of the node.

An example with no definition sites is an implicit constructor in Java. If you define a class and do not write a constructor, the compiler writes one for you, and it participates in cross-references, but it does not have a location in the source text of your program.

Multiple definition sites occur for many C++ objects, including forward declarations of classes or function prototypes and their completions. Anchors for (re)declarations of a type also use defines/binding edges.

Callgraph

The callgraph consists of a set of triples, each of which associates a call site, a caller, and a callee. The call site is the location in the source where the call occurs (typically a call-expression of some kind); the caller is the function that contains the call site, and the callee is the function that was invoked.

In the Kythe schema, this structure is represented by three nodes and two edges: An anchor at the call site, and semantic nodes for the caller and callee respectively. The call site anchor has a childof edge pointing to the caller, and a ref/call edge pointing to the callee:

kcd.go:

func (rf *RevisionsFilter) Compile() (func(Revision) bool, error) {

memdb.go:

type DB struct {
...
}
...

// Revisions implements a method of kcd.Reader.
func (db DB) Revisions(_ context.Context, want *kcd.RevisionsFilter, f func(kcd.Revision) error) error {
  revisionMatches, err := want.Compile()

Note	This graph does not show all the nodes and edges arising from these constructs. In particular, the complete graph will include additional nodes for the overrides (implementations) of the interface function in `memdb.go`. We’ve omitted that branch of the graph for the sake of brevity.

Here we have two more parts of kcd.go, surrounding usage of the Compile function. The function is defined in kcd.go, and then referenced with a function call in memdb.go. The anchor of code for Compile in kcd.go has a childof edge pointing to the function semantic node for Revisions.

Class/Interface Hierarchy & Overrides

Inheritance relationships are captured by the extends and satisfies edges. Override relationships are expressed by the overrides edge. These relationships are illustrated by this example from Go (which uses the satisfies edge to express Go’s implicit interface satisfaction). In this example, the childof edges capture the containment relationships between the methods and their types:

kcd.go:

type Reader interface {
  Revisions(_ context.Context, filter *RevisionsFilter, f func(Revision) error) error
}

memdb.go:

type DB struct {
...
}
...
// Revisions implements a method of kcd.Reader.
func (db DB) Revisions(_ context.Context, want *kcd.RevisionsFilter, f func(kcd.Revision) error) error {

Here we see the top level Reader interface again from kcd.go. The interface function Revisions from that same file has a childof edge pointing to the interface’s node. The implementation of Reader that lives in memdb.go has a satisfies edge back to the parent interface, and its own Revisions function has an overrides edge pointing to the top level Revisions function from kcd.go. For C++ and Java, interfaces, abstract classes, and so on, have similar relationships in the Kythe schema.

Note: For languages with explicitly-declared inheritance and interface implementation (such as Java), Kythe uses the extends edge label instead of satisfies.

Functions and Parameters

Functions are represented by semantic nodes that have additional edges to indicate their relationship to the parameters of the function. These edges are named param.0, param.1, etc., where the ordinal reflects the order of declaration.

As illustrated in this example, the parameters in turn may have their own declaration sites, following the conventions of the jump-to-definition example:

kcd.go:

type Reader interface {
  Revisions(ctx context.Context, filter *RevisionsFilter, f func(Revision) error) error

Back to kcd.go Revisions method, we see how its function parameters work in the Kythe schema. We see the same old anchor and semantic node definition for Revisions as before, but this time we also see the anchors and semantic nodes for the three parameters to the function (_, filter, and f) with the expected defines/binding edges. Finally, each of the function param nodes is pointed to from the Revisions node with a param.N edge, indicating that they are function parameters for the given indices.

Documentation

Documentation may be expressed either as an explicit doc node with a text fact containing a descriptive text, or as a plain anchor node spanning the literal text of the documentation (say, a comment). The latter may include parameter edges pointing to bracket-delimited substrings that should be considered references to other objects in the graph.

Text Fact Style

FileDataCache.java:

/**
 * {@link FileDataProvider} that looks up file data from a given {@link Iterable}
 * of {@link FileData}.
 */

Here we look at FileDataCache.java to get the two different types of documents alluded to above. The left half of the graph describes the javadoc string for the top level FileDataCache class, with a text fact pointing to a node with the actual comment of the javadoc. It also has param.0 and param.1 edges pointing to the named bits of code in the doc string, the Iterable interface and the FileData class.

Anchor-Doc Style

kcd.go:

// Reader represents read-only access to an underlying storage layer used to
// implement a compilation database.
type Reader interface {

The second way of handling documentation is via an anchor for the whole comment text, with a documents edge pointing to the semantic node of the thing that it is documenting. That is shown here with the text comment for kcd.go Reader interface.