Publication
The goal, the procedure, and the results of the contest are described in detail in the following publication. On this webpage we present an excerpt from this paper that highlights and discusses the most important results.
Goncalo Antunes, Marzieh Bakhshandeh, Jose Borbinha, Joao Cardoso, Sharam Dadashnia, Chiara Di Francescomarino, Mauro Dragoni, Peter Fettke, Avigdor Gal, Chiara Ghidini, Philip Hake, Abderrahmane Khiat, Christopher Klinkmüller, Elena Kuss, Henrik Leopold, Peter Loos, Christian Meilicke, Tim Niesen, Catia Pesquita, Timo Péus, Andreas Schoknecht, Eitam Sheetrit, Andreas Sonntag, Heiner Stuckenschmidt, Tom Thaler, Ingo Weber, Matthias Weidlich: The Process Model Matching Contest 2015. In: 6th International Workshop on Enterprise Modelling and Information Systems Architectures (EMISA 2015), September 3-4, 2015, Innsbruck, Austria.
Please cite this paper if you refer to results reported in the contest, use a data set from the contest, or use any of the other materials made available here. An author's version of this paper is available here.
Results
Tables 1 to 4 give an overview of the results for the data sets. To provide a better understanding of the result details, we report the average (∅) and the standard deviation (SD) for each metric. The highest value for each metric is marked using bold font. In our evaluation we distinguish between micro and macro average. The macro average is defined as the average of the precision, recall, and f-measure scores over all test cases. In contrast, the micro average is computed by summing up the TP, TN, FP, and FN counts and applying the precision, recall, and f-measure formulas once to the resulting values. Micro average scores thus take the different sizes of the test cases into account; for example, bad recall on a small test case has only a limited impact on the micro average recall score. Two agreements are required to compute macro average scores for special cases. First, a matcher might generate an empty set of correspondences. In this case we set the precision score for computing the macro average to 1.0, because an empty set of correspondences contains no incorrect correspondences. Second, some of the test cases of the AM data set have empty gold standards. In this case we set the recall score for computing the macro average to 1.0, because all correct matches have been detected.
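To make the distinction concrete, the following sketch shows how micro and macro averages could be computed from per-test-case TP/FP/FN counts, including the two agreements described above. This is an illustrative reimplementation, not the official evaluation code of the contest; the function names and data structures are our own.

```python
# Illustrative sketch of the evaluation metrics described above; this is not
# the official contest evaluation code, and the names are our own.

def prf(tp, fp, fn):
    """Precision, recall, and f-measure for a single test case.
    The two agreements from the text are encoded directly: an empty set of
    generated correspondences (tp + fp == 0) yields precision 1.0, and an
    empty gold standard (tp + fn == 0) yields recall 1.0."""
    precision = 1.0 if tp + fp == 0 else tp / (tp + fp)
    recall = 1.0 if tp + fn == 0 else tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

def macro_average(testcases):
    """Average the per-test-case precision/recall/f-measure scores."""
    scores = [prf(tp, fp, fn) for tp, fp, fn in testcases]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

def micro_average(testcases):
    """Sum the TP/FP/FN counts over all test cases, then apply the
    formulas once to the pooled counts."""
    tp = sum(t for t, _, _ in testcases)
    fp = sum(f for _, f, _ in testcases)
    fn = sum(n for _, _, n in testcases)
    return prf(tp, fp, fn)

# testcases: one (TP, FP, FN) triple per test case of a data set (made-up numbers)
testcases = [(40, 10, 10), (3, 1, 2), (0, 0, 4)]
print(macro_average(testcases))
print(micro_average(testcases))
```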
Table 1: Results of University Admission Matching
| Approach | Precision ∅-mic | Precision ∅-mac | Precision SD | Recall ∅-mic | Recall ∅-mac | Recall SD | F-Measure ∅-mic | F-Measure ∅-mac | F-Measure SD |
|---|---|---|---|---|---|---|---|---|---|
| RMM/NHCM | .686 | .597 | .248 | .651 | .61 | .277 | .668 | .566 | .224 |
| RMM/NLM | .768 | .673 | .261 | .543 | .466 | .279 | .636 | .509 | .236 |
| MSSS | .807 | .855 | .232 | .487 | .343 | .353 | .608 | .378 | .343 |
| OPBOT | .598 | .636 | .335 | .603 | .623 | .312 | .601 | .603 | .3 |
| KMSSS | .513 | .386 | .32 | .578 | .402 | .357 | .544 | .374 | .305 |
| RMM/SMSL | .511 | .445 | .239 | .578 | .578 | .336 | .543 | .477 | .253 |
| TripleS | .487 | .685 | .329 | .483 | .297 | .361 | .485 | .249 | .278 |
| BPLangMatch | .365 | .291 | .229 | .435 | .314 | .265 | .397 | .295 | .236 |
| KnoMa-Proc | .337 | .223 | .282 | .474 | .292 | .329 | .394 | .243 | .285 |
| AML-PM | .269 | .25 | .205 | .672 | .626 | .319 | .385 | .341 | .236 |
| RMM/VM2 | .214 | .186 | .227 | .466 | .332 | .283 | .293 | .227 | .246 |
| pPalm-DS | .162 | .125 | .157 | .578 | .381 | .38 | .253 | .18 | .209 |
The results for the UA data set (Table 1) illustrate large differences in the quality of the generated correspondences. Note that we ordered the matchers in Table 1 and in the other results tables by micro average f-measure. The best results in terms of micro average f-measure are obtained by the RMM/NHCM approach (0.668), followed by RMM/NLM (0.636) and MSSS (0.608). At the same time, five matching systems generate results with an f-measure of less than 0.4. When we compare these results against the results of the 2013 edition of the contest, we have to focus on the macro average scores, which were also computed in the 2013 edition. This year, several matchers achieve a macro average f-measure above 0.5, while the best approach in 2013 achieved 0.41. This improvement indicates that the techniques for process model matching have progressed over the last two years. However, we also have to take into account that the gold standard has been improved and the format of the models has been changed to BPMN. Thus, the results are only partially comparable.
Comparing micro and macro f-measure averages in 2015, there are, at times, significant differences. In most cases, the macro scores are significantly lower. This is caused by several small test cases (small in the number of correspondences) in the collection that seem to be hard for some matchers to deal with. These test cases have a strong negative impact on the macro averages and only a moderate impact on the micro averages. This is also one of the reasons why we prefer to discuss the results in terms of micro average.
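A small constructed example (the numbers are ours, not contest data) illustrates this effect: a single small test case with zero recall halves the macro average recall, while the pooled micro average barely moves.

```python
# Constructed example (not contest data): one large and one small test case.
# Large test case: 50 gold correspondences, 40 found; small: 4 gold, 0 found.
tp = [40, 0]
fn = [10, 4]

# Macro average: score each test case first, then average.
recalls = [40 / 50, 0 / 4]                    # 0.8 and 0.0
macro_recall = sum(recalls) / len(recalls)    # 0.40

# Micro average: pool the counts, then apply the formula once.
micro_recall = sum(tp) / (sum(tp) + sum(fn))  # 40 / 54 ~ 0.74

print(macro_recall, micro_recall)
```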
It is interesting to see that the good results are not only based on strict settings that aim for high precision scores; matchers like RMM/NHCM and OPBOT manage to achieve good f-measure scores with well-balanced precision and recall. Above, we described the gold standard of this data set as rather strict in terms of 1:n correspondences, which might indicate that the matching task should not be too complex. Nevertheless, some of the approaches failed to generate good results. Note that this is caused by low precision, while recall was not, or only slightly, improved in return. A detailed matcher-specific analysis, which goes beyond the scope of this paper, would have to reveal the underlying reasons.
Table 2: Results of University Admission Matching with Subsumption
| Approach | Precision ∅-mic | Precision ∅-mac | Precision SD | Recall ∅-mic | Recall ∅-mac | Recall SD | F-Measure ∅-mic | F-Measure ∅-mac | F-Measure SD |
|---|---|---|---|---|---|---|---|---|---|
| RMM/NHCM | .855 | .82 | .194 | .308 | .326 | .282 | .452 | .424 | .253 |
| OPBOT | .744 | .776 | .249 | .285 | .3 | .254 | .412 | .389 | .239 |
| RMM/SMSL | .645 | .713 | .263 | .277 | .283 | .217 | .387 | .36 | .205 |
| KMSSS | .64 | .667 | .252 | .273 | .289 | .299 | .383 | .336 | .235 |
| AML-PM | .385 | .403 | .2 | .365 | .378 | .273 | .375 | .363 | .22 |
| KnoMa-Proc | .528 | .517 | .296 | .282 | .281 | .278 | .367 | .319 | .25 |
| BPLangMatch | .545 | .495 | .21 | .247 | .256 | .228 | .34 | .316 | .209 |
| RMM/NLM | .787 | .68 | .267 | .211 | .229 | .308 | .333 | .286 | .299 |
| MSSS | .829 | .862 | .233 | .19 | .212 | .312 | .309 | .255 | .318 |
| TripleS | .543 | .716 | .307 | .205 | .224 | .336 | .297 | .217 | .284 |
| RMM/VM2 | .327 | .317 | .209 | .27 | .278 | .248 | .296 | .284 | .226 |
| pPalm-DS | .233 | .273 | .163 | .316 | .328 | .302 | .268 | .25 | .184 |
Table 2 shows the results for the UA data set with the extended gold standard that includes subsumption correspondences. Due to the experimental status of this gold standard, the results are less conclusive. However, we decided to include them because subsumption correspondences often occur when two process models differ in terms of granularity. A comparison against the strict version of the gold standard (Table 1) reveals some slight changes in the f-measure-based ordering of the matchers. OPBOT climbs from rank #4 to rank #2, and AML-PM climbs from rank #10 to rank #5, while the other matchers are only slightly affected. This shows that some of the implemented methods can be used to detect subsumption correspondences, while other techniques are designed to focus on direct 1:1 correspondences only.
Table 3: Results of Birth Certificate Matching
| Approach | Precision ∅-mic | Precision ∅-mac | Precision SD | Recall ∅-mic | Recall ∅-mac | Recall SD | F-Measure ∅-mic | F-Measure ∅-mac | F-Measure SD |
|---|---|---|---|---|---|---|---|---|---|
| OPBOT | .713 | .679 | .184 | .468 | .474 | .239 | .565 | .54 | .216 |
| pPalm-DS | .502 | .499 | .172 | .422 | .429 | .245 | .459 | .426 | .187 |
| RMM/NHCM | .727 | .715 | .197 | .333 | .325 | .189 | .456 | .416 | .175 |
| RMM/VM | .474 | .44 | .2 | .4 | .397 | .241 | .433 | .404 | .21 |
| BPLangMatch | .645 | .558 | .205 | .309 | .297 | .22 | .418 | .369 | .221 |
| AML-PM | .423 | .402 | .168 | .365 | .366 | .186 | .392 | .367 | .164 |
| KMSSS | .8 | .768 | .238 | .254 | .237 | .238 | .385 | .313 | .254 |
| RMM/SMSL | .508 | .499 | .151 | .309 | .305 | .233 | .384 | .342 | .178 |
| TripleS | .613 | .553 | .26 | .28 | .265 | .264 | .384 | .306 | .237 |
| MSSS | .922 | .972 | .057 | .202 | .177 | .223 | .332 | .244 | .261 |
| RMM/NLM | .859 | .948 | .096 | .189 | .164 | .211 | .309 | .225 | .244 |
| KnoMa-Proc | .234 | .217 | .188 | .297 | .278 | .234 | .262 | .237 | .205 |
The BR data set has not been modified compared to its 2013 version. Thus, we can directly compare the 2015 results against the 2013 results. Again, we have to focus on the macro average scores. In 2013, the top result was achieved by RefMod-Mine/NSCM with a macro average f-measure of 0.45. In 2015, the best performing matcher on this data set is the OPBOT approach with a macro average f-measure of 0.54, which is a significant improvement compared to 2013. The systems on the follow-up positions, which are pPalm-DS (0.426), RMM/NHCM (0.416), and RMM/VM2 (0.402), could not outperform the 2013 result. However, the average macro f-measure over all approaches in 2015 (≈0.35) is clearly better than in 2013 (≈0.29), which can be understood as an indicator of an overall improvement.
While it is possible on the UA data set to generate high f-measures with a balanced approach in terms of precision and recall, the BR data set does not share this characteristic. All matchers, with the exception of KnoMa-Proc, favor precision over recall. Moreover, a large number of non-trivial correspondences could not be found by the participants of the contest. We conducted an additional analysis in which we computed the union of all matcher-generated alignments. For this union alignment we measured a recall of 0.631. This means that there is a large fraction of non-trivial correspondences in the BR data set that cannot be found by any of the matchers. Note that we measured the analogous score also for the other data sets, with an outcome of 0.871 for the UA data set (0.494 for the extended UA data set) and 0.68 for the AM data set. These numbers illustrate that the BR data set is a challenging data set that requires specific methods to overcome low recall scores. This may also be the reason why some of the systems that do not perform well on the UA data set are among the top five systems for the BR data set. These systems are OPBOT, pPalm-DS, and RMM/VM2.
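The union-based recall analysis mentioned above can be expressed in a few lines. The following sketch is only an assumption about how such a check could be implemented, with correspondences represented as pairs of activity identifiers; the representation and function name are ours, not the contest tooling.

```python
# Hypothetical sketch of the union-recall analysis described above.
# A correspondence is modelled as a pair (activity in model A, activity in model B).

def union_recall(alignments, gold):
    """alignments: one set of correspondences per matcher,
    gold: the set of gold standard correspondences for the same test cases."""
    union = set().union(*alignments)
    return len(union & gold) / len(gold) if gold else 1.0

# A correspondence missing from every matcher's output can never enter the
# union, so it caps the achievable recall, as observed for the BR data set.
```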
Table 4: Results of Asset Management Matching
| Approach | Precision ∅-mic | Precision ∅-mac | Precision SD | Recall ∅-mic | Recall ∅-mac | Recall SD | F-Measure ∅-mic | F-Measure ∅-mac | F-Measure SD |
|---|---|---|---|---|---|---|---|---|---|
| AML-PM | .786 | .664 | .408 | .595 | .635 | .407 | .677 | .48 | .422 |
| RMM/NHCM | .957 | .887 | .314 | .505 | .521 | .422 | .661 | .485 | .426 |
| RMM/NLM | .991 | .998 | .012 | .486 | .492 | .436 | .653 | .531 | .438 |
| BPLangMatch | .758 | .567 | .436 | .563 | .612 | .389 | .646 | .475 | .402 |
| OPBOT | .662 | .695 | .379 | .617 | .634 | .409 | .639 | .514 | .403 |
| MSSS | .897 | .979 | .079 | .473 | .486 | .432 | .619 | .519 | .429 |
| RMM/VM2 | .676 | .621 | .376 | .545 | .6 | .386 | .603 | .454 | .384 |
| KMSSS | .643 | .834 | .282 | .527 | .532 | .417 | .579 | .482 | .382 |
| TripleS | .614 | .814 | .261 | .545 | .546 | .434 | .578 | .481 | .389 |
| pPalm-DS | .394 | .724 | .348 | .595 | .615 | .431 | .474 | .451 | .376 |
| KnoMa-Proc | .271 | .421 | .383 | .514 | .556 | .42 | .355 | .268 | .279 |
| RMM/SMSL | .722 | .84 | .307 | .234 | .37 | .366 | .354 | .333 | .327 |
The results for the AM data set are presented in Table 4. The top performing matchers in terms of micro average f-measure are AML-PM (0.677), RMM/NHCM (0.661), and RMM/NLM (0.653). While these systems are close in terms of f-measure, they have different characteristics in terms of precision and recall. The two RMM-based systems have high precision in common. In particular, RMM/NLM has a precision of 0.991, which means that fewer than 1 out of 100 generated correspondences is incorrect. AML-PM, the top performing system, has only a precision of .786 and a (relatively high) recall of .595. It is notable that these results have been achieved by using a standard ontology matching system instead of an approach specific to process model matching. For the details we refer the reader to the respective system description in the previous section. The best result in terms of recall has been achieved by the OPBOT matcher (0.617). Looking at the recall scores in general, it can be concluded that it is hard to top a recall of 0.6 without a significant loss in precision.
The results of our evaluation show that there is a high variance in terms of the identified correspondences across the different data sets. However, there are also some systems that perform well over all three data sets (we exclude the extended UA data set with subsumption correspondences from this consideration due to its experimental character). These systems are RMM/NHCM and OPBOT. In terms of micro average f-measure, RMM/NHCM is ranked #1, #3, and #2, while OPBOT is ranked #4, #1, and #5. None of the other approaches is among the top five with respect to all three data sets. This illustrates again how hard it is to propose a mechanism that works well for the different modeling styles and labeling conventions that can be found in our test data collection.