Computer Science Dept.
Vassar College
Poughkeepsie, NY 12604-0462
Tel: (914) 437-5992
Fax: (914) 437-7498
weltyc@cs.vassar.edu
Program Understanding efforts by individual maintainers are dominated by a process known as discovery, which is characterized by low-level searches through the source code and documentation to obtain information important to the maintenance task. Discovery is complicated by the delocalization of information in the source code, and can consume from 40-60% of a maintainers time. This paper presents an ontology for representing code-level knowledge based on abstract syntax trees, that was developed in the context of studying maintenance problems in a small software company. The ontology enables the utilization of automated reasoning to counter delocalization, and thus to speed up discovery.
Subject Areas: Program Understanding.
While this point has been often repeated, the lions share of software engineering research, particularly AI&SE research, continues to be devoted to upstream parts of the software life-cycle. The research community has traditionally turned away from solving real industry problems because they involve too much that is not relevant, or interesting, from a research perspective. In turn, industry has tended to turn away from research because its application shows little present or near-future value.
A new research paradigm known as industry as laboratory [Potts, 1993] is beginning to catch on amongst some software engineering researchers, and was the subject of a recent workshop [Welty and Selfridge, 1995]. One of the keys to research fulfillment under this paradigm is understanding, and making clear, that it is not the same as technology transfer [Henninger, 1997]. This is a subtle and often misunderstood issue that has prevented some researchers from using industry as a laboratory. There is a lot of interesting research that can be based on the real and current problems of the software industry, without having to go through the slings and arrows of creating and supporting a product. Research in program understanding is one example of this, and although the early work pre-dates the industry as laboratory term itself, it clearly fits the definition.
This paper reports on a research effort that was motivated by the landmark work on Software Information Systems (SISs) [Devanbu, et al., 1990] [Selfridge, 1990], and our experiences trying to apply this technology to the problems being experienced by a small software company. While we found that the most critical need of the maintainers we studied was domain knowledge, and a clear picture of how the software fit into the domain, that aspect of this research is not discussed here. Instead we focus on the domain-independent framework that was developed, and how it helped maintainers understand the software. The ways in which this framework fits into a solution to the more important domain knowledge problems has been documented somewhat ([Welty, 1995a] and [Welty, 1995b]), and will be the subject of future reports.
The most significant of the domain-independent research that came out of this effort has been the marriage of some of the powerful automated reasoning capabilities of description logics [Brachman, et al., 1991] with abstract syntax trees in support of program understanding, yielding a tool for visualizing object-oriented programs. This tool highlights pertinent information, such as identifying side-effects, and facilitates browsing the code in a way that dramatically speeds up the understanding process.

GROUP?"While intrinsically a very simple process, discovery takes so much time because of the nature of the source code. Programming languages, while adequate for representing control-flow, tend to delocalize other kinds of information [Soloway, et al., 1986]. That is, when examining a piece of code, all that a maintainer can see is what is in the current window of the text editor. The likelihood of a maintainer finding the answer to a question decreases with the distance the answer is from the current visual context [Lampert, et al., 1988]. In other words, maintainers tend to base their understanding on what they can see, what they already know, and what they can easily ascertain.
The best known and most closely studied example of this is the delocalized plan [Soloway and Letovsky, 1986], in which some abstract goal is achieved by lines of code that are spread out through the source program (probably across different source files). The problem is more than just delocalized plans, however, it's delocalized information in general. In the C++ method shown in Figure 1, the delocalization of the source code relationships is manifested in numerous ways. Take, for example, the simple question, "What is the class GROUP?" Class definitions are sometimes at the top of a source code file, but more often are located in separate header files. To get the precise answer to this question, the maintainer would have to find the definition. The more effort this search requires, the less likely it is that the maintainer will follow through. If the maintainer does give up before finding the definition, chances are she will proceed trying to understand this method based on whatever information she already knows or can guess, and this incomplete understanding can lead to mistakes [Lampert, et al., 1988].
Furthermore, recent studies have shown that object-oriented languages actually increase understanding problems by delocalizing much more information than their imperative predecessors [Huitt and Wilde, 1992]. Inheritance, in particular, spreads method and slot (instance variable) declarations up the class hierarchy, making it harder to find answers to questions about class composition, among other things.

Its powerful domain model combined with the efficient inference and expressive query capabilities of the underlying description logic CLASSIC [Brachman, et al., 1991], made LaSSIE an effective tool for assistance during maintenance. We began our research with the goal of studying how the SIS technology might be used in a different environment than that which spawned LaSSIE. Where LaSSIE was the result of studying a huge maintenance group in one of the world's largest corporations, we chose a maintenance and support group in a small software company.
Aside from being small, this company was of interest because they had already made a commitment to use Smalltalk, and to at least try and impose more strict methodologies on their maintenance team. The company was, however, very strapped for resources and their most experienced developers were rarely available. This was creating a snowballing maintenance problem which in turn affected all aspects of their business.
We began our study in the usual ways, interviewing management and staff in groups and one-on-one and observing the maintainers at work.
We used the contextual inquiry interviewing technique [Holzblatt and Jones, 1993] for studying the maintainers at work. There were some flaws in this from an empirical perspective, since the maintainers admitted they worked much harder while we were present, and frequently followed through on tasks that normally they would have waited to ask someone about. Nevertheless, we quickly confirmed that these maintainers spent close to 60% of their time engaged in discovery.[1]
We were also able to confirm that the delocalization of information in the source code was the biggest factor contributing to the time spent in discovery. For example, the most common activity was "grepping" the source code for all uses of a variable in order to see the places it was changed. The second most common activity was browsing the class hierarchy looking for a method or slot definition to better understand the use of a variable in the code.

a=a+1; is shown in Figure 3. This tree represents only the syntactic structure of the code. We felt that the power of the SIS approach lay in the use of semantic information. For example, the LaSSIE ontology was able to represent the fact that a variable definition was in a specific file. This relationship between a variable and a file requires more semantics than an AST can provide.

a=a+1; is shown in Figure 5, and it is clear that our AST has now become a semantic network. Note that in our ontology we have chosen to treat basic arithmetic operators as methods, not as syntactic elements of the language.
In addition to instances of a few concepts in the taxonomy, Figure 5 shows some of the relationships defined in the ontology, such as that between an assignment operation and the variable it changes (changes), an assignment operation and the new value to be assigned (new-value), a message and the object it is invoked on (send-to), a message and a parameter (has-parameter), and a message and the method it invokes (calls-method). Figure 5 uses the convention that individuals[3] are given names which include the concept they are individuals of followed by a random number to make each name unique.
The ontology, including all the relationships and constraints, is described fully elsewhere [Welty, 1995a], and though there is clearly more here than in an AST, the true augmentation has yet to be introduced.

start, next, while-true, and when-false) and data-type definition (has-data-type) relations to those shown in Figure 5.
With the augmented AST, the distance between pieces of information is no longer measured in terms of lines of code, or different files, but in terms of the number of links that must be traversed. For example, in Figure 5 to get from the object assignment-05 to method-04, we must traverse the new-value link to get to message-16, and then the calls-method link. These two objects, then, are separated by two links, and we will say they are two hops apart.
Our interface, which is discussed in [Welty, 1996], supports this notion by presenting hypertext links for every relationship an object is involved in. This approach clearly changes the meaning of delocalization, and it also reinforces the point that normal source code delocalizes everything but control flow. In our representation, far more information is localized.
In addition, any relationships that would prove useful in discovery (such as answers to the questions listed in Section 3.2 on page 5), could be added as links. For example, parameter-02, as an instance of data-type-08, has a slot represented by slot-13. The individuals parameter-02 and slot-13 are two hops apart in Figure 6, but adding a has-slot relationship between the two would cut that down to one, thus localizing slot information on class instances.
Such added semantic links would certainly speed up understanding if present, but the data shown in Figure 6 is only that which can be extracted automatically from the code by parsing it. These other links are not explicitly there, and requiring a maintainer to put them in would make the discovery process more time consuming than not having this representation at all.
We believed early on that we would get a lot of use out of simple path-tracing grammars, once the code was represented as a semantic net. We were quite surprised to find that even simpler forms of inference yielded a big payoff as well.
The simplest inference the system provides is through relation inverses. Every code-level relation has an inverse, and this provides a tremendous amount of cross-reference information. Consider the simple code shown in Figure 5; the inverse of the changes relationship is changed-by, and this now gives us the ability to localize within one hop all places a variable is changed. It would be impossible to over-emphasize the importance of this simple inference, as it localized information regarding the most common question the maintainers reported asking during discovery (Question 1.), and it was considered by that group to be the most significant achievement of this work.
The data-type-of relation, which is the inverse of the has-data-type relation, was used to localize to within one hop answers to Question 7. (Is this data-type used?) The answer to the question is "yes" if the data-type-of relation exists on the data-type, and "no" if it does not.
has-slot and has-method relations local to each instance of a data-type.

has-slot relation on parameter-02 from Figure 6 is illustrated in Figure 7. In this case the inference specification in Classic is:
Where the right side of the specification is the path, and the left side is the relation to be derived. This notation would correspond to the following Prolog assertion:has-slot <- (has-data-type has-slot)
We chose to call this class of inferences "path-tracing" because its implementation required just that: tracing paths through the semantic network of individuals. When inverses are brought into the picture, we found that all path-tracing inferences were expressions of transitivity [Fox, Wright, and Adam, 1985]. In the example above,has-slot(?x,?z) :- has-data-type(?x,?y), has-slot(?y,?z).
slot-of (the inverse of has-slot) is transitive over data-type-of (the inverse of has-data-type). Expressing the rule that way, of course, would lead to the standard kinds of infinite loops that occur when a relation is the first step in its own derivation.There were actually many, questions for which path tracing proved a very succinct and effective way of localizing information, and with the exception of side-effects, we used path-tracing inference to localize answers to all the remaining top questions. In most cases these path-tracing inferences required relation inverses and took place in multiple steps.
The key to side effect detection was in classifying the different kinds of side-effects that occur in software: I/O related side-effects, and changes to global variables. Any method that calls an I/O method or that changes a global variable is therefore a method with a side-effect. In addition, a method that does not call an I/O method nor changes a global variable can have an indirect side-effect if it calls a method with a side-effect. Finally, a side-effect (direct or indirect) is conditional if it occurs in a conditional branch of the control flow.
The facility for detecting side-effects is described completely in [Welty, 1995a]. Although it only requires a few lines of specification in Classic, the full capability involves eight levels of inference and its explanation (in English) is quite involved. We will instead focus here on the central part of this reasoning, upon which all the different kinds of side-effect related information is based: the identification of an assignment-side-effect.
Assignment side-effects are direct side-effects involving a change to a global variable. Referring back to the code-level ontology shown in Figure 4, the only way a variable can be changed is in an assignment statement, which is represented by an individual of the concept assignment. Not every assignment is a side-effect, only those which change a global variable. Description logics are designed to make it easy to specify the sufficient conditions for distinguishing a concept from its sub-concepts. In fact, to say that "a side-effect is a kind of assignment in which the variable being changed is a global" requires only
In other words, any assignment whoseassignment-side-effect:: (and assignment (all changes global-variable))
changes relation points to an individual of global-variable is automatically classified as a side-effect. The two assignments shown in Figure 6 are not side effects, since they change local variables.
The full side-effect facility produces similar inferences for propagating the existence of side-effects up to the methods they are contained in, out to methods that call them, and so forth. The facility is fully automated, and the result is the localization of information about side effects. Each method invocation (individuals of the concept message) is identified as a call to a method with a direct side-effect, or a call to a method with an indirect side-effect. This, again, brings the answers to a common discovery question to within one hop.
Significant steps remain to completely transfer the techniques reported here to industry. In particular, a system for translating back and forth between the Smalltalk source code and the code-level representation of that source code. Such a translation would be easy in principle: there is a one to one correspondence between the syntactic elements of the language and the ontology. This did not seem to be a particularly interesting research problem, so we did not pursue it. The company we worked with is looking into developing this.
There were so many interesting aspects of this project, and it was difficult to decide which to place in this paper. The dynamic between the research group and the employees of the company was quite complex, and many things reported here have been over-simplified due to space constraints. It may seem strange, for example, that a company so strapped for resources would let us in to study what they were doing without expecting us to apply our research directly to their software system. This was a problem, in fact, but we solved it by bartering some of our own experience with solving some of their smaller, more mundane, problems.
Another barrier we encountered was that our ontology required us to be there to explain. We did design a web-based interface, but it still presented information in a (hypertextual) semantic network form, which the maintainers were not entirely comfortable with.
The results were, however, very well received. The management and the maintenance group remain quite impressed with the capabilities these techniques demonstrated, and were amazed that with each discovery question the maintenance team came up with, we were easily able to specify a mechanism to localize the answers. They are looking forward to eventually being able to employ a full-blown system for performing maintenance. It was interesting to us that some of the simpler things to do (such as tracking variable usage through relation inverses) were the most impressive to them.
Our work with this company continues, and we are currently focusing on the much more serious problems of representing domain knowledge and integrating it with the representations of the source code.
[Brachman, et al., 1991] Brachman, R., McGuinness, D., Patel-Schneider, P., Borgida, A. and Resnick, L. Living with CLASSIC: When and How to Use a KL-ONE-Like Language. Principles of Semantic Networks. Morgan Kaufman. Pp. 401-456. May, 1991.
[Devanbu, et al., 1990] Devanbu, P., Selfridge, P., Brachman, R., and Ballard, R. The LaSSIE Software Information System. Communications of the ACM. September, 1990.
[Devanbu, Selfridge, and Brachman, 1990] Devanbu, P., Selfridge, P., and Brachman, R. LaSSIE - A Classification-Based Software Information System. Proceedings of the 12th International Conference on Software Engineering. 1990.
[Fox, Wright, and Adam, 1985] Fox, M., Wright, J. and Adam, D. Experiences with SRL: An analysis of a frame-based knowledge representation. Expert Database Systems. Benjamin/Cummings. 1985.
[Henninger, 1997] Henninger, S. Case-Based Knowledge Management Tools for Software Development. Journal of Automated Software Engineering. 4(3). Kluwer Academic Press. July, 1997.
[Holzblatt and Jones, 1993] Holzblatt, K. and Jones, S. Contextual Inquiry: A Participatory Technique for System Design. Lawrence Erlbaum. 1993.
[Huitt and Wilde, 1992] Huitt, R., and Wilde, N. Maintenance Support for Object-Oriented Programs. IEEE Transactions on Software Engineering. 18(12), December, 1992.
[Lampert, et al., 1988] Lampert, R., Littman, D., Pinto, J., Soloway, E. and Letovsky, S. Designing Documentation to Compensate for Delocalized Plans. Communications of the ACM. 31(11). Nov, 1988.
[Lewis, 1996] Lewis, T. The big software chill. IEEE Computer. 29(3), Pp. 12-14. March, 1996.
[Ning, et al., 1994] Ning, J., Engberts, A., and Kozaczynski, W. Automated Support for Legacy Code Understanding. Communications of the ACM. 37(5), Pp. 50-58. May, 1994.
[Nishimoto, Chen, and Ramamoorthy, 1990] Nishimoto, M., Chen, Y. and Ramamoorthy, C. The C Information Abstraction System. IEEE Transactions on Software Engineering. March, 1990.
[Potts, 1993] Potts, C. Software Engineering Research Revisited. IEEE Software. 10(5): 19-28. 1993.
[Quilici, 1994] Quilici, A. A Memory-Based Approach to Recognizing Programming Plans. Communications of the ACM. 37(5), Pp. 84-93. May, 1994.
[Quilici, et al., 1996] Quilici, A., Yang, Q., and Woods, S. Applying Plan Recognition Algorithms to Program Understanding. Proceedings of KBSE-96. IEEE Computer Society Press. September, 1996.
[Quilici and Woods, 1997] Quilici, A. and Woods, S. Toward a Constraint-Satisfaction Framework for Evaluating Program-Understanding Algorithms. Journal of Automated Software Engineering. 4(3). Kluwer Academic Press. July, 1997.
[Rieman, 1993] Rieman, J. The diary study: A workplace-oriented research tool to guide laboratory efforts. Proceedings of INTERCHI'93. Pp. 321-326. ACM Press. 1993.
[Ross and Schoman, 1977] Structured Analysis: A language for communicating ideas. IEEE Transactions on Software Engineering. SE-3(1):16-34. 1977.
[Selfridge, 1990] Selfridge, P. Integrating Code Knowledge with a Software Information System., in Hoebel, L. ed., Proceedings of the 1990 Knowledge-Based Software Engineering Conference. Pp. 183-195. Rome Laboratory Technical Report. Sept., 1990.
[Selfridge, 1991] Selfridge, P. Knowledge Representation Support for a Software Information System. Proceedings of the Seventh Conference on Artificial Intelligence Applications. Pp. 134-140. IEEE Computer Society Press. March, 1991.
[Soloway and Ehrlich, 1984] Soloway, E. and Ehrlich, K. An Empirical Investigation of Tacit Plan Knowledge in Programming. Human Factors in Computer Systems. Ablex Publishers. 1984.
[Soloway, et al., 1986] Soloway, E., Letovsky, S., Pinto, J. and Littman, D. Mental Models and Software Maintenance. Proceedings of the Conference on Empirical Studies of Programmers. Ablex Publishers. Pp. 80-98. 1986.
[Soloway and Letovsky, 1986] Soloway, E. and Letovsky, S. Delocalized Plans and Program Comprehension. IEEE Software. 3(3). May, 1986.
[Shrobe, 1995] Shrobe, H. DARPA Evolutionary Design of Complex Software - Additional Program Information. Available at http://www.ito.darpa.mil/ResearchAreas/EDCS_Detail.html.
[Welty, 1995a] Welty, C. An Integrated Representation for Software Development and Discovery. PhD Thesis. Rensselaer Polytechnic Institute. Troy, NY. 1995. Available at http://www.cs.vassar.edu/faculty/welty/papers/phd/.
[Welty, 1995b] Welty, C. Towards an Epistemology for Software Representations, in Setliff, D. ed, Proceedings of the Tenth Knowledge-Based Software Engineering Conference. IEEE Computer Society Press. November, 1995.
[Welty, 1996] Welty, C. An HTML Interface for Classic. Proceedings of the 1996 International Workshop on Description Logics. AAAI Press, November, 1996.
[Welty and Selfridge, 1995] Welty, C. and Selfridge, P., eds. The IJCAI-95 Third Workshop on AI and Software Engineering. AAAI Press. August, 1995.
[Welty and Selfridge, 1997] Welty, C., and Selfridge, P. Breaking the Toy Mold. In The International Journal of Automated Software Engineering. 4(3). Kluwer. January, 1997.
variable-10 in Figure 5 is an individual of the Classic concept variable, and it represents the variable "a", which in the software is an instance of some class. This rather odd (but necessary) "double instance" representation, and the trouble it causes, is discussed further in [Welty, 1995b].