Some Ideas about UML-Diffing

A collection of my notes on the topic of UML diffing (especially class diagrams)

Topics

First I'll explain which kind of UML diagrams I'm talking about here, especially their data representation. Then I'll give some generic ideas about diffing in general and UML diffing in particular.
After that introduction I present two papers and their algorithms for computing good diffs.
  • UML diagrams and their representation
  • Generic thoughts about UML diffing
  • Paper: UMLDiff by Zhenchang Xing, Eleni Stroulia
  • Paper: Difference Detection in UML Class Diagrams by Martin Girschick

UML diagrams and their representation

With UML diagrams I mostly mean class diagrams specified with the UML metamodel, which can be created by tools directly from code or by point-and-click modelling tools. Their data representation is usually an XML structure.
XML makes sense: the entities of a UML diagram are hierarchically ordered, like packages > classes, interfaces > methods, attributes.
They form exactly a tree - one element has 1..n subelements and a subelement has only one parent.

This structure is also standardized under the name XMI - but I won't go into it here, since the difference algorithms only look at the basic elements of the diagram, for which an internal tree representation is sufficient. Because all UML models have to follow the metamodel in some way, the differences between tool formats can't be that big.

What's special about diffing UML-Diagrams?

Diffing in general means comparing two entities and producing a description of the changes that have to be applied to one entity to obtain the other.
The most popular tool for this is "diff", which is used inside many versioning systems like git and svn.

UML diagrams are in the first place text documents, in the second place XML documents, and in the third place XML documents with a special syntax and semantics (the UML metamodel).
So in theory it would be possible to diff UML diagrams with the "diff" tool, but in practice there is not much gain, since diff only recognizes changes on the lexical level and not on a higher (e.g. structural) level.
For example, elements inside XML can be rearranged since their order carries no meaning in a UML model; diff would report those rearranged tags as changes, but semantically they are none.

Next it would be possible to use a generic XML diff algorithm. In general those algorithms compare trees and look for the shortest transformation that turns one tree into the other. This is already a good idea and will probably match quite well, but there are some flaws:
  • most XML differs don't support similarities and can't detect renamings or moves between levels
  • XML differs don't take the UML metamodel into account - which states, for example, that a package only has classes as direct children and a class only has inner classes, attributes and operations

So the following papers mostly concentrate on how similarities can be found using the metamodel - in both papers the similarity detection is the key algorithm.

To shorten things a bit: all algorithms I know about use breadth-first search, so they go down the hierarchy level by level and search for matching items. What differs is how items are matched and what is done with unmatched ones.
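
To make this a bit more concrete, here is my own rough sketch of such a level-by-level matching loop. It is not taken from either paper; Element, similarity() and the threshold are made up for this illustration.
	
import java.util.*;

// My own illustration of the generic level-by-level (breadth-first) matching idea.
// Element, similarity() and THRESHOLD are invented for this sketch.
class Element {
	String name;
	List<Element> children = new ArrayList<>();
	Element(String name) { this.name = name; }
}

public class LevelwiseMatcher {
	static final double THRESHOLD = 0.5; // arbitrary cut-off for this sketch

	// placeholder similarity; the papers below plug in their own metrics here
	static double similarity(Element a, Element b) {
		return a.name.equals(b.name) ? 1.0 : 0.0;
	}

	public static Map<Element, Element> match(Element root1, Element root2) {
		Map<Element, Element> matches = new HashMap<>();
		Deque<Element[]> queue = new ArrayDeque<>();
		queue.add(new Element[] { root1, root2 });
		while (!queue.isEmpty()) {
			Element[] pair = queue.poll();
			// compare only the direct children of an already matched pair
			for (Element c1 : pair[0].children) {
				Element best = null;
				double bestScore = 0;
				for (Element c2 : pair[1].children) {
					if (matches.containsValue(c2)) continue; // already taken
					double s = similarity(c1, c2);
					if (s > bestScore) { bestScore = s; best = c2; }
				}
				if (best != null && bestScore >= THRESHOLD) {
					matches.put(c1, best);
					queue.add(new Element[] { c1, best }); // descend one level
				}
				// unmatched children end up as added/removed later
			}
		}
		return matches;
	}
}
	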

UMLDiff by Zhenchang Xing, Eleni Stroulia

TODO link to paper

The idea behind this paper is a tool called "JDEvAn" which parses existing Java code, creates a UML diagram from it and then diffs that diagram against an analysis of an older version of the same project. This lets the developer see the evolution of the code, which helps in making decisions for the future or in looking into the past to understand why the code got into its current state.
The paper mostly describes the diffing algorithm.
The data model is described as a graph (a tree) with a virtual root node.

Breadth-first, the algorithm goes down the hierarchy and identifies corresponding entities with:
a) a name-similarity heuristic - fairly safe because it is rare that an entity is removed and a new one re-added with the same name
b) similar relations to other entities (for example classes with the same attributes and methods)

The first matches are used as "landmarks" from which further matches are propagated until everything is matched (or only the newly added elements remain).

name similarity heuristics

There exists a widely accepted algorithm:

Longest Common Subsequence (LCS)

It compares two strings and searches for the longest matching part (in the examples here I use the longest contiguous common substring); for example, Motorbike and Mountainbike have "bike" in common.
The metric is then just:
	
getLCSMetric(string s1, string s2)
{
	return length(LCS(s1, s2)) * 2 / (length(s1) + length(s2));
}
	
This is just done to keep the LCS relative to the length of both strings
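
As a runnable sketch (my own code, using the contiguous-substring reading that matches the examples further down):
	
public class LcsMetric {
	// length of the longest common contiguous substring
	static int lcsLength(String s1, String s2) {
		int best = 0;
		int[][] dp = new int[s1.length() + 1][s2.length() + 1];
		for (int i = 1; i <= s1.length(); i++)
			for (int j = 1; j <= s2.length(); j++)
				if (s1.charAt(i - 1) == s2.charAt(j - 1)) {
					dp[i][j] = dp[i - 1][j - 1] + 1;
					best = Math.max(best, dp[i][j]);
				}
		return best;
	}

	static double lcsMetric(String s1, String s2) {
		// normalize against the length of both strings
		return 2.0 * lcsLength(s1, s2) / (s1.length() + s2.length());
	}

	public static void main(String[] args) {
		System.out.println(lcsMetric("mobey", "money")); // 0.4
		System.out.println(lcsMetric("mobey", "grey"));  // 0.444...
	}
}
	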

This method has a drawback: it only reflects lexical similarity, not changes in word order, which leads to:

Most Common Pairs

The algorithm was introduced in the paper; the name is my own. The idea is to split the strings into blocks of length 2 and then look how many blocks the strings have in common.
	
getMostCommonPairs(string s1, string s2)
{
	s1 = upper(s1); // otherwise reordered words would break the comparison through case (e.g. camelCase)
	s2 = upper(s2);
	s1 = pairs(s1); // splits "abcde" into ['ab', 'bc', 'cd', 'de']
	s2 = pairs(s2);
	total = len(s1) + len(s2);
	s1.retainAll(s2); // keep only those blocks both strings have in common
	return 2.0 * len(s1) / total;
}
	
The division at the end is again just done to keep the count in relation to the length of both strings.
The author only had word reordering in mind - but in my opinion it also helps with typos in short words (for longer words LCS is good again).
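
Again a small runnable sketch of my own - essentially a Dice coefficient over character pairs:
	
import java.util.*;

public class PairMetric {
	// split a string into overlapping character pairs: "abcde" -> [ab, bc, cd, de]
	static List<String> pairs(String s) {
		List<String> result = new ArrayList<>();
		for (int i = 0; i + 1 < s.length(); i++)
			result.add(s.substring(i, i + 2));
		return result;
	}

	static double pairMetric(String s1, String s2) {
		List<String> p1 = pairs(s1.toUpperCase());
		List<String> p2 = pairs(s2.toUpperCase());
		int total = p1.size() + p2.size();
		p1.retainAll(p2); // keep only the pairs both strings have in common
		return 2.0 * p1.size() / total;
	}

	public static void main(String[] args) {
		System.out.println(pairMetric("mobey", "money")); // 0.5
		System.out.println(pairMetric("mobey", "grey"));  // 0.2857...
	}
}
	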
Constructed example to let LCS look bad:
mobey money grey
LCS:
LCS(mobey, money) = 2 ("mo" or "ey") -> score: 2*2/(5+5) = 0.40
LCS(mobey, grey)  = 2 ("ey")    -> score: 2*2/(5+4) = 0.44 -> better

MCP:
MCP(mobey, money) = 2 ("mo", "ey") -> score: 2*2/(4+4) = 0.5 -> better
MCP(mobey, grey)  = 1 ("ey")    -> score: 1*2/(4+3) = 0.29

Structure Similarity

Overall Similarity Metric

Global Similarity Algorithm

The algorithm works with 3 queues:
  • next - matched entities on a deeper level, which will be processed in the next round
  • match - entries matched on the current level
  • result - already processed entries
The queues always hold tuples of the form (element from model 1, element from model 2).
	
m1, m2 = getRootsOfBothModels();
next.add([m1, m2]); // add both roots as a tuple
while (!next.isEmpty())
{
	match = next;
	foreach (match as x)
	{
		for (all children of x[0] with type ?? as child)
		{
			if (child matches inside x[1].children())
			{
				if (hierarchyLevel == currentLevel)
					match.add(child); // entries added here are still processed inside THIS foreach (so a bit depth-first when there are many matches)
				else
					next.add(child);
			}
		}
		compare direct children of the same type
	}

	while (!match.isEmpty())
	{
		result.add(match.pop());
		if (!children_notMatched())
			compare direct children
		identifyRenamedChildren(); // uses the similarity metric
	}
	identifyMoveCandidates();
}
// finally report everything still unmatched as added or removed
	

The similarity metric is only applied inside a class - otherwise it would require too much CPU (consider the goal of the project: real source code has hundreds of attributes.. but maybe this restriction should be loosened to package level?)

productive use

A project with 800 classes needed 2.5 hours to extract and 25 minutes to compare. They manually checked the correctness of their algorithm and came to 95%.

opinion

I think the accuracy mostly comes from the underlying code, since for example the function parameters can be taken into account.
The string-similarity functions could be reusable, and the main loop also looks good since it first tries to find exact matches and only afterwards searches for similarities.

Difference Detection in UML Class Diagrams by Martin Girschick

This project's main goal really is UML diffing of class diagrams - it generates graphical output showing what has changed and what hasn't.
He describes very analytically how such a difference algorithm should be built and which properties can be used. It is also the only paper where I've seen so many distinct difference results: unmatched, matched, added, deleted, moved-new, moved-old, cloned, modified.

distilled ideas after his analytical process (a small data-model sketch follows the list):
  • the type of a class-diagram element doesn't change (a class will never become an attribute)
  • the order of classes, attributes and methods is irrelevant
  • parameter order is relevant
  • well-defined hierarchy (package > class > attribute)
  • classes, attributes, methods and parameters have names
  • a name is unique inside its container
  • unique identifiers from modelling tools could be used -> but that would restrict the comparison to models from the same source
  • stereotypes cannot be changed -> only removed and added
  • the supplier of a generalization can only be changed to another superclass (otherwise it has to be deleted and added again with the new supplier)
  • associations are only moved inside the inheritance hierarchy
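
As an illustration of some of these constraints (fixed hierarchy, unique names inside a container), here is a minimal data-model sketch of my own - the class and field names are made up and not from the paper:
	
import java.util.*;

// My own sketch of a minimal class-diagram model reflecting the constraints above:
// fixed hierarchy (package > class > attribute/operation), names unique inside a container.
class UmlPackage {
	final String name;
	final Map<String, UmlClass> classes = new LinkedHashMap<>(); // class name -> class, unique per package
	UmlPackage(String name) { this.name = name; }

	UmlClass addClass(String className) {
		if (classes.containsKey(className))
			throw new IllegalArgumentException("name must be unique inside its container");
		UmlClass c = new UmlClass(className);
		classes.put(className, c);
		return c;
	}
}

class UmlClass {
	final String name;
	final Map<String, String> attributes = new LinkedHashMap<>();         // attribute name -> type
	final Map<String, List<String>> operations = new LinkedHashMap<>();   // operation name -> parameter types (order matters)
	UmlClass(String name) { this.name = name; }
}
	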

algorithm

Breadth-first - classes can be moved to other packages, operations to other classes - this minimizes wrong matches because operations are only matched after the classes have been matched.

The algorithm walks down the variant tree and compares it against the base tree.
findMatch() - an intelligent function which returns a score between -16 (no match) over 0 (unknown) to +16 (definitely matches), based on:
  • location (even predecessor and successor)
  • name (identical or similar)
  • stereotype
  • type

The elements are processed in this order: package > class > generalization > attribute > association > operation
	
findMatch(varElement, baseDiagram)
{
	baseContainer = getMatch(varElement.Container, baseDiagram);
	foreach (element in baseContainer)
		maxElement = element with the highest match score;
	if (value(maxElement) < DEFINED_LIMIT)
		maxElement = search in other containers on the same hierarchy level;
	if (value(maxElement) < DEFINED_LIMIT)
		return null;
	return maxElement;
}
	
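
To make the scoring a bit more tangible, here is a sketch of my own of what such a function could look like - the individual weights are invented for illustration, only the criteria and the -16..+16 range are from the paper:
	
import java.util.Objects;

// A sketch of a findMatch-style scoring function. The weights are my own invention;
// only the criteria (location, name, stereotype, type) and the range come from the paper.
public class MatchScore {
	static class UmlElement {
		String kind;          // "class", "attribute", "operation", ...
		String name;
		String stereotype;    // may be null
		String containerName; // name of the owning package/class
	}

	// placeholder for one of the name-similarity metrics from the first paper
	static double nameSimilarity(String a, String b) {
		return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
	}

	static int score(UmlElement varElem, UmlElement baseElem) {
		// elements of different kinds can never match
		if (!varElem.kind.equals(baseElem.kind)) return -16;
		int score = 0;
		if (varElem.name.equals(baseElem.name)) score += 8;                     // identical name
		else if (nameSimilarity(varElem.name, baseElem.name) > 0.5) score += 4; // similar name
		if (Objects.equals(varElem.stereotype, baseElem.stereotype)) score += 4;
		if (Objects.equals(varElem.containerName, baseElem.containerName)) score += 4;
		return score;
	}
}
	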
Then he defines a transformation function which attaches the diff information to the elements:
	
transform(Element a, Element b)
{
	a == null                    => added
	a was matched somewhere else => cloned
	a != b                       => modified
	a.container != b.container   => moved
}
	
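
And my own runnable reading of those rules - the parameter names, the order of the checks and the enum are my interpretation, not the paper's:
	
import java.util.Objects;

public class Transform {
	enum DiffKind { ADDED, CLONED, MOVED, MODIFIED, UNCHANGED }

	static class Element {
		String name;
		String container; // name of the owning package/class
		Element(String name, String container) { this.name = name; this.container = container; }
	}

	static DiffKind transform(Element variant, Element baseMatch, boolean baseMatchedElsewhere) {
		if (baseMatch == null) return DiffKind.ADDED;             // no counterpart in the base diagram
		if (baseMatchedElsewhere) return DiffKind.CLONED;         // the base element already matched another variant element
		if (!Objects.equals(variant.container, baseMatch.container)) return DiffKind.MOVED;
		if (!variant.name.equals(baseMatch.name)) return DiffKind.MODIFIED; // e.g. renamed
		return DiffKind.UNCHANGED;
	}
}
	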
The signature of an operation is very important - with more parameters it even becomes more important than the name.


TODO: go through the algorithms again in more detail - I noticed that I no longer understand my own wording of the algorithm :)