Publication
The goal, the procedure, and the results of the contest are described in detail in the following publication. On this webpage we present an excerpt from this paper that highlights and discusses the most important results.
 
Goncalo Antunes, Marzieh Bakhshandeh, Jose Borbinha, Joao Cardoso, Sharam Dadashnia, Chiara Di Francescomarino, Mauro Dragoni, Peter Fettke, Avigdor Gal, Chiara Ghidini, Philip Hake, Abderrahmane Khiat, Christopher Klinkmüller, Elena Kuss, Henrik Leopold, Peter Loos, Christian Meilicke, Tim Niesen, Catia Pesquita, Timo Péus, Andreas Schoknecht, Eitam Sheetrit, Andreas Sonntag, Heiner Stuckenschmidt, Tom Thaler, Ingo Weber, Matthias Weidlich: The Process Model Matching Contest 2015. In: 6th International Workshop on Enterprise Modelling and Information Systems Architectures (EMISA 2015), September 3-4, 2015, Innsbruck, Austria.
 
Please cite this paper if you refer to results reported in the contest, if you use a data set from the contest, or if you use any of the other materials made available here. An author's version of this paper is available here.
Results
Tables 1 to 4 give an overview of the results for the data sets. To provide a better understanding of the result details, we report the average (∅) and the standard deviation (SD) for each metric. The highest value for each metric is marked using bold font. In our evaluation we distinguish between micro and macro average. The macro average is defined as the average of the precision, recall, and f-measure scores over all testcases. In contrast, the micro average is computed by summing up the TP, TN, FP, and FN counts over all testcases and applying the precision, recall, and f-measure formulas once to the resulting values. Micro average scores thus take the different sizes of the testcases into account; e.g., bad recall on a small testcase has only a limited impact on the micro average recall score. Two special cases require conventions when computing the macro average scores. A matcher might generate an empty set of correspondences; in this case we set the precision score for computing the macro average to 1.0, since an empty set of correspondences contains no incorrect correspondences. Moreover, some of the testcases of the AM data set have empty gold standards; in this case we set the recall score for computing the macro average to 1.0, because all correct matches have been detected.
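
To make these scores concrete, the following is a minimal sketch (in Python, assuming alignments and gold standards are represented as sets of activity-label pairs) of how the micro and macro averages and the two conventions described above can be computed. The function and type names are illustrative and not taken from the contest tooling.

```python
from typing import FrozenSet, List, Tuple

# Illustrative representation: an alignment is a set of (activity1, activity2) pairs.
Alignment = FrozenSet[Tuple[str, str]]


def micro_macro(matcher: List[Alignment], gold: List[Alignment]):
    """Return ((micro P, R, F), (macro P, R, F)) over all testcases."""
    tp = fp = fn = 0  # TN does not enter the precision/recall/f-measure formulas
    precisions, recalls, f_measures = [], [], []

    for m, g in zip(matcher, gold):
        tp_i, fp_i, fn_i = len(m & g), len(m - g), len(g - m)
        tp, fp, fn = tp + tp_i, fp + fp_i, fn + fn_i

        # Conventions for the macro average: an empty matcher output yields
        # precision 1.0, an empty gold standard yields recall 1.0.
        p = tp_i / len(m) if m else 1.0
        r = tp_i / len(g) if g else 1.0
        f = 2 * p * r / (p + r) if p + r > 0 else 0.0
        precisions.append(p)
        recalls.append(r)
        f_measures.append(f)

    # Micro average: apply the formulas once to the summed counts.
    mp = tp / (tp + fp) if tp + fp > 0 else 1.0
    mr = tp / (tp + fn) if tp + fn > 0 else 1.0
    mf = 2 * mp * mr / (mp + mr) if mp + mr > 0 else 0.0

    n = len(matcher)
    macro = (sum(precisions) / n, sum(recalls) / n, sum(f_measures) / n)
    return (mp, mr, mf), macro
```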
Table 1: Results of University Admission Matching

| Approach | Precision ∅-mic | Precision ∅-mac | Precision SD | Recall ∅-mic | Recall ∅-mac | Recall SD | F-Measure ∅-mic | F-Measure ∅-mac | F-Measure SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RMM/NHCM | .686 | .597 | .248 | .651 | .61 | .277 | .668 | .566 | .224 |
| RMM/NLM | .768 | .673 | .261 | .543 | .466 | .279 | .636 | .509 | .236 |
| MSSS | .807 | .855 | .232 | .487 | .343 | .353 | .608 | .378 | .343 |
| OPBOT | .598 | .636 | .335 | .603 | .623 | .312 | .601 | .603 | .3 |
| KMSSS | .513 | .386 | .32 | .578 | .402 | .357 | .544 | .374 | .305 |
| RMM/SMSL | .511 | .445 | .239 | .578 | .578 | .336 | .543 | .477 | .253 |
| TripleS | .487 | .685 | .329 | .483 | .297 | .361 | .485 | .249 | .278 |
| BPLangMatch | .365 | .291 | .229 | .435 | .314 | .265 | .397 | .295 | .236 |
| KnoMa-Proc | .337 | .223 | .282 | .474 | .292 | .329 | .394 | .243 | .285 |
| AML-PM | .269 | .25 | .205 | .672 | .626 | .319 | .385 | .341 | .236 |
| RMM/VM2 | .214 | .186 | .227 | .466 | .332 | .283 | .293 | .227 | .246 |
| pPalm-DS | .162 | .125 | .157 | .578 | .381 | .38 | .253 | .18 | .209 |
The results for the UA data set (Table 1) illustrate large differences in the quality of the generated correspondences. Note that we ordered the matchers in Table 1 and in the other results tables by micro average f-measure. The best results in terms of f-measure (micro average) are obtained by the RMM/NHCM approach (0.668), followed by RMM/NLM (0.636) and MSSS (0.608). At the same time, five matching systems generate results with an f-measure of less than 0.4. When we compare these results against the results achieved in the 2013 edition of the contest, we have to focus on the macro average scores, which were also computed in the 2013 edition. This year, several matchers achieve a macro average of >0.5, while the best approach achieved 0.41 in 2013. This improvement indicates that the techniques for process matching have progressed over the last two years. However, we also have to take into account that the gold standard has been improved and the format of the models has been changed to BPMN. Thus, the results are only partially comparable.
Comparing micro and macro f-measure averages in 2015, there are at times significant differences. In most cases, the macro scores are significantly lower. This is caused by several small testcases (small in the number of correspondences) in the collection that seem to be hard to deal with for some matchers. These testcases have a strong negative impact on the macro averages and only a moderate impact on the micro averages. This is also one of the reasons why we prefer to discuss the results in terms of micro average.
It is interesting to see that the good results are not only based on strict settings that aim for high precision scores; matchers like RMM/NHCM and OPBOT manage to achieve good f-measure scores based on well-balanced precision/recall scores. Above, we described the gold standard of this data set as rather strict in terms of 1:n correspondences, which might indicate that the matching task should not be too complex. However, some of the approaches failed to generate good results. Note that this is caused by a low precision, while at the same time the recall values are not or only slightly affected positively. A detailed matcher-specific analysis, which goes beyond the scope of this paper, would be required to reveal the underlying reasons.
Table 2: Results of University Admission Matching with Subsumption

| Approach | Precision ∅-mic | Precision ∅-mac | Precision SD | Recall ∅-mic | Recall ∅-mac | Recall SD | F-Measure ∅-mic | F-Measure ∅-mac | F-Measure SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RMM/NHCM | .855 | .82 | .194 | .308 | .326 | .282 | .452 | .424 | .253 |
| OPBOT | .744 | .776 | .249 | .285 | .3 | .254 | .412 | .389 | .239 |
| RMM/SMSL | .645 | .713 | .263 | .277 | .283 | .217 | .387 | .36 | .205 |
| KMSSS | .64 | .667 | .252 | .273 | .289 | .299 | .383 | .336 | .235 |
| AML-PM | .385 | .403 | .2 | .365 | .378 | .273 | .375 | .363 | .22 |
| KnoMa-Proc | .528 | .517 | .296 | .282 | .281 | .278 | .367 | .319 | .25 |
| BPLangMatch | .545 | .495 | .21 | .247 | .256 | .228 | .34 | .316 | .209 |
| RMM/NLM | .787 | .68 | .267 | .211 | .229 | .308 | .333 | .286 | .299 |
| MSSS | .829 | .862 | .233 | .19 | .212 | .312 | .309 | .255 | .318 |
| TripleS | .543 | .716 | .307 | .205 | .224 | .336 | .297 | .217 | .284 |
| RMM/VM2 | .327 | .317 | .209 | .27 | .278 | .248 | .296 | .284 | .226 |
| pPalm-DS | .233 | .273 | .163 | .316 | .328 | .302 | .268 | .25 | .184 |
The results for the UA data set where we used the extended gold standard including subsumption correspondences are shown in Table 2. Due to the experimental status of this gold standard, the results shown here are less conclusive. However, we finally decided to include these results because subsumption correspondences will often occur when two process models differ in terms of granularity. A comparison against the strict version of the gold standard (Table 1) reveals some slight changes in the f-measure-based ordering of the matchers. OPBOT climbs from rank #4 to rank #2 and AML-PM climbs from rank #10 to rank #5, while other matchers are only slightly affected. This shows that some of the implemented methods can be used to detect subsumption correspondences, while other techniques are designed, in particular, to focus on direct 1:1 correspondences only.
Table 3: Results of Birth Certificate Matching

| Approach | Precision ∅-mic | Precision ∅-mac | Precision SD | Recall ∅-mic | Recall ∅-mac | Recall SD | F-Measure ∅-mic | F-Measure ∅-mac | F-Measure SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OPBOT | .713 | .679 | .184 | .468 | .474 | .239 | .565 | .54 | .216 |
| pPalm-DS | .502 | .499 | .172 | .422 | .429 | .245 | .459 | .426 | .187 |
| RMM/NHCM | .727 | .715 | .197 | .333 | .325 | .189 | .456 | .416 | .175 |
| RMM/VM | .474 | .44 | .2 | .4 | .397 | .241 | .433 | .404 | .21 |
| BPLangMatch | .645 | .558 | .205 | .309 | .297 | .22 | .418 | .369 | .221 |
| AML-PM | .423 | .402 | .168 | .365 | .366 | .186 | .392 | .367 | .164 |
| KMSSS | .8 | .768 | .238 | .254 | .237 | .238 | .385 | .313 | .254 |
| RMM/SMSL | .508 | .499 | .151 | .309 | .305 | .233 | .384 | .342 | .178 |
| TripleS | .613 | .553 | .26 | .28 | .265 | .264 | .384 | .306 | .237 |
| MSSS | .922 | .972 | .057 | .202 | .177 | .223 | .332 | .244 | .261 |
| RMM/NLM | .859 | .948 | .096 | .189 | .164 | .211 | .309 | .225 | .244 |
| KnoMa-Proc | .234 | .217 | .188 | .297 | .278 | .234 | .262 | .237 | .205 |
The BR data set has not been modified compared to its 2013 version. Thus, we can directly compare the 2015 results against the 2013 results. Again, we have to focus on the macro average scores. In 2013, the top results were achieved by RefMod-Mine/NSCM with a macro average f-measure of 0.45. In 2015, the best performing matcher on this data set is the OPBOT approach with a macro average f-measure of 0.54, which is a significant improvement compared to 2013. The systems on the follow-up positions, which are pPalm-DS (0.426), RMM/NHCM (0.416), and RMM/VM2 (0.402), could not outperform the 2013 results. However, the average approach in 2015 (≈0.35) is clearly better than the average approach in 2013 (≈0.29), which can be understood as an indicator of an overall improvement.
While it is possible for the UA data set to generate high f-measures with a balanced approach in terms of precision and recall, the BR data set does not share this characteristic. All matchers, with the exception of KnoMa-Proc, favor precision over recall. Moreover, a high number of non-trivial correspondences cannot be found by the participants of the contest. We conducted an additional analysis in which we computed the union of all matcher-generated alignments. For this alignment we measured a recall of 0.631. This means that there is a large fraction of non-trivial correspondences in the BR data set that cannot be found by any of the matchers. Note that we measured the analogous score also for the other data sets, with the outcome of 0.871 for the UA data set (0.494 for the extended UA data set) and 0.68 for the AM data set. These numbers illustrate that the BR data set is a challenging data set, which requires specific methods to overcome low recall scores. This can also be the reason why some of the systems that do not perform so well on the UA data set are among the top-5 systems for the BR data set. These systems are OPBOT, pPalm-DS, and RMM/VM2.
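
The union-based analysis mentioned above can be reproduced with a short sketch along the same lines: per testcase, take the union of the alignments produced by all matchers and measure its recall against the gold standard (computed here as a micro average, which is an assumption on our side). The representation and names below are again illustrative and not the contest's evaluation code.

```python
from typing import FrozenSet, List, Tuple

Alignment = FrozenSet[Tuple[str, str]]  # set of (activity1, activity2) pairs


def union_recall(all_matchers: List[List[Alignment]], gold: List[Alignment]) -> float:
    """Micro average recall of the per-testcase union of all matcher outputs."""
    tp = fn = 0
    for i, g in enumerate(gold):
        # Union of the correspondences that any matcher found for testcase i.
        union = frozenset().union(*(matcher[i] for matcher in all_matchers))
        tp += len(union & g)
        fn += len(g - union)
    return tp / (tp + fn) if tp + fn > 0 else 1.0
```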
Table 4: Results of Asset Management Matching

| Approach | Precision ∅-mic | Precision ∅-mac | Precision SD | Recall ∅-mic | Recall ∅-mac | Recall SD | F-Measure ∅-mic | F-Measure ∅-mac | F-Measure SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AML-PM | .786 | .664 | .408 | .595 | .635 | .407 | .677 | .48 | .422 |
| RMM/NHCM | .957 | .887 | .314 | .505 | .521 | .422 | .661 | .485 | .426 |
| RMM/NLM | .991 | .998 | .012 | .486 | .492 | .436 | .653 | .531 | .438 |
| BPLangMatch | .758 | .567 | .436 | .563 | .612 | .389 | .646 | .475 | .402 |
| OPBOT | .662 | .695 | .379 | .617 | .634 | .409 | .639 | .514 | .403 |
| MSSS | .897 | .979 | .079 | .473 | .486 | .432 | .619 | .519 | .429 |
| RMM/VM2 | .676 | .621 | .376 | .545 | .6 | .386 | .603 | .454 | .384 |
| KMSSS | .643 | .834 | .282 | .527 | .532 | .417 | .579 | .482 | .382 |
| TripleS | .614 | .814 | .261 | .545 | .546 | .434 | .578 | .481 | .389 |
| pPalm-DS | .394 | .724 | .348 | .595 | .615 | .431 | .474 | .451 | .376 |
| KnoMa-Proc | .271 | .421 | .383 | .514 | .556 | .42 | .355 | .268 | .279 |
| RMM/SMSL | .722 | .84 | .307 | .234 | .37 | .366 | .354 | .333 | .327 |
The results for the AM data set are presented in Table 4. The top performing matchers in terms of micro f-measure are AML-PM (0.677), RMM/NHCM (0.661), and RMM/NLM (0.653). While these systems are close in terms of f-measure, they have different characteristics in terms of precision and recall. The two RMM-based systems have a high precision in common. Especially RMM/NLM has a precision of 0.991, which means that less than 1 out of 100 generated correspondences is incorrect. AML-PM, the top performing system, has only a precision of 0.786 and a (relatively high) recall of 0.595. It is notable that these results have been achieved by the use of a standard ontology matching system instead of a specific approach for process model matching. For the details we refer the reader to the respective system description in the previous section. The best results in terms of recall have been achieved by the OPBOT matcher (0.617). Looking at the recall scores in general, it can be concluded that it is hard to top a recall of 0.6 without a significant loss in precision.
The results of our evaluation show that there is a high variance in terms of the identified correspondences across the different data sets. However, there are also some systems that perform well over all three data sets (we exclude the UAS data set in this consideration due to its experimental character). These systems are RMM/NHCM and OPBOT. RMM/NHCM is ranked #1, #3, and #2, and OPBOT is ranked #4, #1, and #5 in terms of micro average f-measure. None of the other approaches is among the top five with respect to all three data sets. This again illustrates how hard it is to propose a mechanism that works well for the different modeling styles and labeling conventions found in our test data collection.