
Process Model Matching Contest
@EMISA 2015

The Second Process Model Matching Contest

EMISA 2015

About the Matching Contest


The increasing interest in process matching and the growing number of approaches to it are the main motivation for a systematic and continuous evaluation of the proposed methods. The Process Model Matching Contest is an initiative to evaluate the strengths and weaknesses of these methods. The first contest was conducted in 2013. For 2015 we have extended the evaluation by improving the gold standards and adding new test cases.

The goals of the contest are:

  • assessing strengths and weaknesses of matching systems on real-world matching problems
  • increasing the communication among developers and the BPM community
  • developing, improving, and applying evaluation techniques specific to the field of process matching
The means to achieve these goals are the organization of a yearly evaluation event and the publication of the tests and results. In 2015 the main results will be published as an article in the EMISA proceedings.

The idea to set up the Process Model Matching Contest as an annual campaign is motivated by the constant success of the Ontology Alignment Evaluation Initiative (OAEI), which is an annual evaluation campaign concerned with the matching of static models (ontologies).


The matching problems to be solved within the contest consist of the following three datasets (all datasets contain English text only).

  1. University Admission: This set consists of 36 model pairs that were derived from 9 models representing the application procedure for Master students of nine German universities. The process models are available in BPMN format. Compared to the 2013 version of the dataset, we have fixed several issues with the models that have to be matched, changed the format of the models, and have strongly improved the quality of the gold standard.
    Download dataset and gold standard.
    Hot Fix: Minor faults in the reference alignment were fixed on 26 June 2015. If you are using an older version, we recommend downloading the dataset with the fixes included.
  2. Birth Registration: This set consists of 36 model pairs that were derived from 9 models representing the birth registration processes of Germany, Russia, South Africa, and the Netherlands. The models are available as Petri nets (PNML format). This version of the dataset has also been used in the 2013 contest.
    Download dataset and gold standard.
  3. Asset Management: This set consists of 36 model pairs that were derived from 72 models from the SAP Reference Model Collection. The selected process models cover different aspects from the area of finance and accounting. The models are available as EPCs (in EPML format). The dataset is new to the evaluation contest. The evaluation of this dataset is done blind, i.e., the participants do not know the gold standard of the dataset in advance.
    Download dataset.
    UPDATE (23.07.2015) The gold standard contains mappings between functions (which correspond to activities) only. Events are not mapped!
    UPDATE (27.07.2015) The output generated for this dataset is expected to use a certain URI scheme, which is illustrated in the reference alignment for testcase1. We make this alignment available as an example of the URI scheme required by our evaluation scripts.
    UPDATE (14.08.2015) At the request of several participants, we now publish the complete gold standard of the third dataset in the format of the Alignment API.

We plan to extend these datasets in the coming years. If you know or own an interesting dataset (especially one with a gold standard), you are always welcome to contact the organizers or, even better, to join in the organization of the contest. The third dataset was developed by Christopher Klinkmüller based on the SAP Reference Model. We thank Christopher for making that dataset available to the contest.


The results of the matching contest will be discussed at EMISA 2015 co-located with the 13th International Conference on Business Process Management (BPM 2015). Both events will be hosted by the University of Innsbruck and BPM Research Cluster, and will take place in Innsbruck.

Participation as a system developer

Course of Action

First of all, we ask all participants to inform us about their intention to participate via an informal email. In this email, the name of the matching system (not longer than 12 letters) and the research lab / university affiliation should be mentioned explicitly. This helps us to gain a rough overview. Once the datasets have been published, participants should download them (see availability above) and check whether their system can (a) read the input data correctly and (b) generate the required output data. Aside from specifics of the system that are related to the format of the input data, the matcher should be applied to each matching task (i.e., each different test case of each dataset) with the same setting. Once the results have been generated, they should be submitted to the contact address via email together with a short description of the matching system, which can also highlight some results or experiences gained in working with the datasets.

The organizers of the contest will use the alignment files provided by the participants to evaluate the matching technique in terms of precision and recall. These results will be summarized on the contest website and presented at the workshop. Furthermore, there will be a joint paper in the workshop proceedings that includes the descriptions of the matching techniques and discusses the obtained results. The requirement for inclusion is that the matching technique has indeed been developed by the participants and that the code will be provided to the organizers on request. Note that the code will not be published and will be treated confidentially. It will only be used to confirm the functionality stated in the paper.

Results Format

As a results format, i.e., the format to describe the generated alignments (also referred to as mappings or matchings), we use the Alignment API. The Alignment API defines a format to describe alignments and provides a Java implementation for generating, manipulating, reading and storing them. While it was originally developed for ontology alignments, it also allows us to share process model alignments in a convenient way. Please note in particular that the Alignment API does not allow special characters or blank spaces. Also make sure that the mapping uses activity IDs encoded as URIs rather than activity labels. Examples of alignment files are available as gold standard files within the datasets specified above.
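The two constraints above (no blanks or special characters, IDs instead of labels) can be sketched in a few lines of Python. This is only an illustration: the base URI `http://example.org/pmmc` and the function names are hypothetical, and the scheme actually expected by the evaluation scripts is the one shown in the gold standard files. The dictionary mirrors the parts of an alignment cell (two entities, a relation, a confidence measure) but is not the Alignment API's file format.

```python
from urllib.parse import quote

# Hypothetical base URI; the real scheme is shown in the gold standard files.
BASE = "http://example.org/pmmc"


def activity_uri(model_name: str, activity_id: str) -> str:
    """Build a URI for an activity from its model name and ID (not its label)."""
    # quote() percent-encodes blanks and other special characters.
    return f"{BASE}/{quote(model_name, safe='')}#{quote(activity_id, safe='')}"


def correspondence(uri1: str, uri2: str, confidence: float = 1.0) -> dict:
    """One correspondence: two entity URIs, an equivalence relation, a confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return {"entity1": uri1, "entity2": uri2, "relation": "=", "measure": confidence}


cell = correspondence(
    activity_uri("Model A", "sid-12 34"),
    activity_uri("Model_B", "t7"),
)
print(cell["entity1"])
```

Writing such correspondences to an actual alignment file is best left to the Alignment API itself, as the Eclipse project examples do.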

You can download an Eclipse project that illustrates how to use the Alignment API via Java. The project contains several examples in the respective package. In particular, you can see how to create an alignment and how to store it, how to evaluate a single alignment against a reference alignment, and how to conduct a complete evaluation for one of the datasets, generating detailed and aggregated results. We strongly encourage you to run these examples and to look at the code and the comments. Within 15 minutes you should be able to use the Alignment API for your purpose.

Minor issue with the code (23.07.2015): One of the participants reported a mistake in one of the example files. Indeed, in the second of the two nested loops you have to remove the - 1 in order to iterate over all testcases.
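The affected file is not reproduced here, so the following Python sketch only illustrates the kind of off-by-one error described. It assumes, hypothetically, that the test cases of a dataset are enumerated as all unordered pairs of its nine models, which would match the 36 model pairs of the first two datasets. The function names are invented for illustration and are not taken from the Eclipse project.

```python
from itertools import combinations


def pairs_fixed(n: int) -> list:
    """Enumerate all unordered pairs (i, j) with i < j of n models."""
    pairs = []
    for i in range(n - 1):           # first model of the pair
        for j in range(i + 1, n):    # upper bound is n, without a "- 1"
            pairs.append((i, j))
    return pairs


def pairs_buggy(n: int) -> list:
    """Same loops, but with a stray "- 1" in the inner bound."""
    pairs = []
    for i in range(n - 1):
        for j in range(i + 1, n - 1):  # never reaches the last model
            pairs.append((i, j))
    return pairs


print(len(pairs_fixed(9)), len(pairs_buggy(9)))
```

With nine models, the corrected loops yield all 36 pairs, whereas the variant with the extra `- 1` silently skips every pair that involves the last model.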

The Eclipse project also contains the core jar-files of the Alignment API that are required to use it as in the examples. The path to the jars is already specified in the build path. Using the libraries in the Eclipse project means that you do not have to download and use the complete Alignment API.

However, in the long term you might want to use more of the Alignment API's functionality. For this, we refer the reader to the tutorials available here.

Input Format

As described above, the Process Model Matching Contest 2015 requires the participants to match three different process modeling notations: Petri Nets, Event-driven Process Chains, and BPMN models. Hence, the participants also have to process three different input formats. To keep the effort for parsing and analyzing the models as low as possible, we recommend using the jBPT code library, which is a comprehensive collection of techniques for process model analysis. The Java code and tutorials for its use can be found here.

In case you have problems with reading the input files, do not hesitate to contact us.

Result Submission

The results should be submitted in a zip-file that has the following folder structure

	... images, other included resources
	... overall 36 files
	... overall 36 files

This zip-file should contain neither the input dataset nor any resources other than the generated alignments. An exception is the paper folder, which contains the short description (only 1 to 1.5 pages) of the matching system / matching techniques that have been applied to generate the results. Please use the paper format defined in the EMISA submission section. Note that the folder has to contain the LaTeX sources, because we need to create one paper out of all submissions.

Publication of the Results

The overall results of the evaluation will be published at the workshop. Prior to that date, we will not publish any results. We are looking forward to an exciting results presentation and fruitful discussion at the workshop. Meanwhile we have also published the most important results on this webpage.

Main Contact

The first registration as well as the final results should be submitted to Henrik Leopold (h.leopold [AT] If you have any questions related to the datasets or if you have any problems in generating the required output, please also write an email to Henrik.

Results of the Contest


The goal, the procedure and the results of the contest have been described in detail in the following publication. On this webpage we present the excerpt from this paper that highlights and discusses the most important results.

Goncalo Antunes, Marzieh Bakhshandeh, Jose Borbinha, Joao Cardoso, Sharam Dadashnia, Chiara Di Francescomarino, Mauro Dragoni, Peter Fettke, Avigdor Gal, Chiara Ghidini, Philip Hake, Abderrahmane Khiat, Christopher Klinkmüller, Elena Kuss, Henrik Leopold, Peter Loos, Christian Meilicke, Tim Niesen, Catia Pesquita, Timo Péus, Andreas Schoknecht, Eitam Sheetrit, Andreas Sonntag, Heiner Stuckenschmidt, Tom Thaler, Ingo Weber, Matthias Weidlich: The Process Model Matching Contest 2015. In: 6th International Workshop on Enterprise Modelling and Information Systems Architectures (EMISA 2015), September 3-4, 2015, Innsbruck, Austria.

Please cite this paper if you refer to results reported in the contest, or if you use a data set from the contest or any of the other materials made available here. An author's version of this paper is available here.

Participating Systems

We are happy that 12 different systems participated in the contest and submitted results for all three datasets.

  1. AML-PM: Marzieh Bakhshandeh, Joao Cardoso, Goncalo Antunes, Catia Pesquita, Jose Borbinha
  2. BPLangMatch: Eitam Sheetrit, Matthias Weidlich, Avigdor Gal
  3. KnoMa-Proc: Mauro Dragoni, Chiara Di Francescomarino, Chiara Ghidini
  4. Know-Match-SSS (KMSSS): Abderrahmane Khiat
  5. Match-SSS (MSSS): Abderrahmane Khiat
  6. RefMod-Mine/VM2 (RMM/VM2): Sharam Dadashnia, Tim Niesen, Philip Hake, Andreas Sonntag, Tom Thaler, Peter Fettke, Peter Loos
  7. RefMod-Mine/NHCM (RMM/NHCM): Tom Thaler, Philip Hake, Sharam Dadashnia, Tim Niesen, Andreas Sonntag, Peter Fettke, Peter Loos
  8. RefMod-Mine/NLM (RMM/NLM): Philip Hake, Tom Thaler, Sharam Dadashnia, Tim Niesen, Andreas Sonntag, Peter Fettke, Peter Loos
  9. RefMod-Mine/SMSL (RMM/SMSL): Andreas Sonntag, Philip Hake, Sharam Dadashnia, Tim Niesen, Tom Thaler, Peter Fettke, Peter Loos
  10. OPBOT: Christopher Klinkmüller, Ingo Weber
  11. pPalm-DS: Timo Péus
  12. TripleS: Andreas Schoknecht
The submitted results of all systems are available via this download link.


For assessing the submitted process model matching techniques, we compare the computed correspondences against a manually created gold standard. Using the gold standard, we classify each computed activity match as either true-positive (TP), true-negative (TN), false-positive (FP) or false-negative (FN). Based on this classification, we calculate the precision (TP/(TP+FP)), the recall (TP/(TP+FN)), and the f-measure, which is the harmonic mean of precision and recall (2*precision*recall/(precision+recall)).
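As a minimal sketch of these formulas, the following Python snippet treats an alignment as a set of activity pairs. The representation and the sample IDs are assumptions made for illustration, not the contest's evaluation scripts. True negatives are not counted because none of the three metrics uses them.

```python
def confusion_counts(computed: set, gold: set) -> tuple:
    """Return (TP, FP, FN) for a computed alignment against a gold standard."""
    tp = len(computed & gold)   # correspondences found and correct
    fp = len(computed - gold)   # correspondences found but not in the gold standard
    fn = len(gold - computed)   # gold correspondences that were missed
    return tp, fp, fn


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical activity pairs (ID in model 1, ID in model 2).
gold = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")}
computed = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}

tp, fp, fn = confusion_counts(computed, gold)
precision = tp / (tp + fp)   # 2 / 3
recall = tp / (tp + fn)      # 2 / 4
f = f_measure(precision, recall)
print(round(precision, 3), round(recall, 3), round(f, 3))
```

Here two of the three computed correspondences are correct and two of the four gold correspondences are found, so the f-measure is 4/7.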


Tables 1 to 4 give an overview of the results for the datasets. To give a better understanding of the result details, we report the average (∅) and the standard deviation (SD) for each metric. The highest value for each metric is marked using bold font. In our evaluation we distinguish between micro and macro average. Macro average is defined as the average of precision, recall and f-measure scores over all testcases. In contrast, micro average is computed by summing up the TP, TN, FP, and FN counts over all testcases and applying the precision, recall and f-measure formulas once to the resulting values. Micro average scores take the different sizes of the testcases into account, e.g., bad recall on a small testcase has only limited impact on the micro average recall score. Two special cases require a convention for computing macro average scores. First, a matcher might generate an empty set of correspondences. In this case, we set the precision score for computing the macro average to 1.0, since an empty set of correspondences contains no incorrect correspondences. Second, some of the testcases of the AM data set have empty gold standards. In this case, we set the recall score for computing the macro average to 1.0, because all correct matches have been detected.
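The difference between the two averages, including the two conventions for empty sets, can be sketched as follows. This Python snippet is an illustration under assumptions, not the contest's evaluation code: alignments are modelled as sets of pairs, and reusing the 1.0 conventions inside the micro average is our own choice, since the text defines them only for the macro average.

```python
from statistics import mean


def counts(computed: set, gold: set) -> tuple:
    """Return (TP, FP, FN) for one testcase."""
    return len(computed & gold), len(computed - gold), len(gold - computed)


def prf(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, f-measure with the conventions for empty sets."""
    precision = tp / (tp + fp) if tp + fp else 1.0  # empty computed alignment
    recall = tp / (tp + fn) if tp + fn else 1.0     # empty gold standard
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


def macro_average(testcases: list) -> tuple:
    """Average the per-testcase precision, recall and f-measure scores."""
    scores = [prf(*counts(c, g)) for c, g in testcases]
    return tuple(mean(column) for column in zip(*scores))


def micro_average(testcases: list) -> tuple:
    """Sum the counts over all testcases, then apply the formulas once."""
    totals = [sum(column) for column in zip(*(counts(c, g) for c, g in testcases))]
    return prf(*totals)


# One large testcase with good results and one small testcase that fails.
large = ({("a", i) for i in range(2, 12)}, {("a", i) for i in range(10)})
small = ({("b", 1)}, {("b", 0)})
testcases = [large, small]

print(macro_average(testcases))  # every score is 0.4
print(micro_average(testcases))  # every score is 8/11
```

The failed small testcase halves the macro scores but lowers the micro scores only from 0.8 to 8/11, which is the effect described above.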

Table 1: Results of University Admission Matching

             Precision            Recall               F-Measure
Approach     ∅-mic  ∅-mac  SD     ∅-mic  ∅-mac  SD     ∅-mic  ∅-mac  SD

RMM/NHCM     .686   .597   .248   .651   .61    .277   .668   .566   .224
RMM/NLM      .768   .673   .261   .543   .466   .279   .636   .509   .236
MSSS         .807   .855   .232   .487   .343   .353   .608   .378   .343
OPBOT        .598   .636   .335   .603   .623   .312   .601   .603   .3
KMSSS        .513   .386   .32    .578   .402   .357   .544   .374   .305
RMM/SMSL     .511   .445   .239   .578   .578   .336   .543   .477   .253
TripleS      .487   .685   .329   .483   .297   .361   .485   .249   .278
BPLangMatch  .365   .291   .229   .435   .314   .265   .397   .295   .236
KnoMa-Proc   .337   .223   .282   .474   .292   .329   .394   .243   .285
AML-PM       .269   .25    .205   .672   .626   .319   .385   .341   .236
RMM/VM2      .214   .186   .227   .466   .332   .283   .293   .227   .246
pPalm-DS     .162   .125   .157   .578   .381   .38    .253   .18    .209

The results for the UA data set (Table 1) illustrate large differences in the quality of the generated correspondences. Note that we ordered the matchers in Table 1 and in the other results tables by micro average f-measure. The best results in terms of f-measure (micro average) are obtained by the RMM/NHCM approach (0.668), followed by RMM/NLM (0.636) and MSSS (0.608). At the same time, five matching systems generate results with an f-measure of less than 0.4. When we compare these results against the results achieved in the 2013 edition of the contest, we have to focus on macro average scores, which were also computed in the 2013 edition. This year, there are several matchers with a macro average of >0.5, while the best approach achieved 0.41 in 2013. This improvement indicates that the techniques for process matching have progressed over the last two years. However, we also have to take into account that the gold standard has been improved and the format of the models has been changed to BPMN. Thus, results are only partially comparable.

Comparing micro and macro f-measure averages in 2015, there are, at times, significant differences. In most cases, macro scores are significantly lower. This is caused by the existence of several small testcases (small in number of correspondences) in the collection that seem to be hard to deal with for some matchers. These testcases have a strong negative impact on the macro average and only a moderate impact on the micro average. This is also one of the reasons why we prefer to discuss the results in terms of micro average.

It is interesting to see that the good results are not only based on a strict setting that aims for high precision scores, but that matchers like RMM/NHCM and OPBOT manage to achieve good f-measure scores based on well-balanced precision/recall scores. Above, we have described the gold standard of this data set as rather strict in terms of 1:n correspondences. This might indicate that the matching task should not be too complex. However, some of the approaches failed to generate good results. Note that this is caused by low precision, while recall has not improved, or has improved only slightly, in return. A detailed matcher-specific analysis, which goes beyond the scope of this paper, would be needed to reveal the underlying reasons.

Table 2: Results of University Admission Matching with Subsumption

             Precision            Recall               F-Measure
Approach     ∅-mic  ∅-mac  SD     ∅-mic  ∅-mac  SD     ∅-mic  ∅-mac  SD

RMM/NHCM     .855   .82    .194   .308   .326   .282   .452   .424   .253
OPBOT        .744   .776   .249   .285   .3     .254   .412   .389   .239
RMM/SMSL     .645   .713   .263   .277   .283   .217   .387   .36    .205
KMSSS        .64    .667   .252   .273   .289   .299   .383   .336   .235
AML-PM       .385   .403   .2     .365   .378   .273   .375   .363   .22
KnoMa-Proc   .528   .517   .296   .282   .281   .278   .367   .319   .25
BPLangMatch  .545   .495   .21    .247   .256   .228   .34    .316   .209
RMM/NLM      .787   .68    .267   .211   .229   .308   .333   .286   .299
MSSS         .829   .862   .233   .19    .212   .312   .309   .255   .318
TripleS      .543   .716   .307   .205   .224   .336   .297   .217   .284
RMM/VM2      .327   .317   .209   .27    .278   .248   .296   .284   .226
pPalm-DS     .233   .273   .163   .316   .328   .302   .268   .25    .184

The results for the UA data set where we used the extended gold standard including subsumption correspondences are shown in Table 2. Due to the experimental status of this gold standard, the results shown are less conclusive. However, we finally decided to include these results because subsumption correspondences will often occur when two process models differ in terms of granularity. A comparison against the strict version of the gold standard (Table 1) reveals that there are some slight changes in the f-measure based ordering of the matchers. OPBOT climbs from rank #4 to rank #2 and AML-PM climbs from rank #10 to rank #5, while other matchers are only slightly affected. This shows that some of the implemented methods can be used to detect subsumption correspondences, while other techniques are designed to focus on direct 1:1 correspondences only.

Table 3: Results of Birth Certificate Matching

             Precision            Recall               F-Measure
Approach     ∅-mic  ∅-mac  SD     ∅-mic  ∅-mac  SD     ∅-mic  ∅-mac  SD

OPBOT        .713   .679   .184   .468   .474   .239   .565   .54    .216
pPalm-DS     .502   .499   .172   .422   .429   .245   .459   .426   .187
RMM/NHCM     .727   .715   .197   .333   .325   .189   .456   .416   .175
RMM/VM2      .474   .44    .2     .4     .397   .241   .433   .404   .21
BPLangMatch  .645   .558   .205   .309   .297   .22    .418   .369   .221
AML-PM       .423   .402   .168   .365   .366   .186   .392   .367   .164
KMSSS        .8     .768   .238   .254   .237   .238   .385   .313   .254
RMM/SMSL     .508   .499   .151   .309   .305   .233   .384   .342   .178
TripleS      .613   .553   .26    .28    .265   .264   .384   .306   .237
MSSS         .922   .972   .057   .202   .177   .223   .332   .244   .261
RMM/NLM      .859   .948   .096   .189   .164   .211   .309   .225   .244
KnoMa-Proc   .234   .217   .188   .297   .278   .234   .262   .237   .205

The BR data set has not been modified compared to its 2013 version. Thus, we can directly compare the 2015 results against the 2013 results. Again, we have to focus on the macro average scores. In 2013, the top results were achieved by RefMod-Mine/NSCM with a macro average f-measure of 0.45. In 2015 the best performing matcher on this data set is the OPBOT approach with a macro average f-measure of 0.54, which is a significant improvement compared to 2013. The systems in the following positions, which are pPalm-DS (0.426), RMM/NHCM (0.416), and RMM/VM2 (0.402), could not outperform the 2013 results. However, the average approach (≈0.35) in 2015 is clearly better than the average approach in 2013 (≈0.29), which can be understood as an indicator of an overall improvement.

While it is possible for the UA data set to generate high f-measures with a balanced approach in terms of precision and recall, the BR data set does not share this characteristic. All matchers, with the exception of KnoMa-Proc, favor precision over recall. Moreover, a high number of non-trivial correspondences cannot be found by the participants of the contest. We conducted an additional analysis where we computed the union of all matcher-generated alignments. For this alignment we measured a recall of 0.631. This means that there is a large fraction of non-trivial correspondences in the BR data set that cannot be found by any of the matchers. Note that we also measured the analogous score for the other data sets, with the outcome of 0.871 for the UA data set (0.494 for the extended UA data set) and 0.68 for the AM data set. These numbers illustrate that the BR data set is a challenging data set, which requires specific methods to overcome low recall scores. This can also be the reason why some of the systems that do not perform so well on the UA data set are among the top-5 systems for the BR data set. These systems are OPBOT, pPalm-DS, and RMM/VM2.
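The union analysis described above can be sketched in Python as follows. The set-of-pairs representation and the sample data are assumptions for illustration; the snippet is not the script used to obtain the reported numbers. Returning 1.0 for an empty gold standard mirrors the convention stated earlier for the macro average.

```python
def union_recall(alignments: list, gold: set) -> float:
    """Recall of the union of several matchers' alignments for one gold standard."""
    if not gold:
        return 1.0  # assumption: same convention as for empty gold standards
    union = set().union(*alignments)
    return len(union & gold) / len(gold)


# Hypothetical correspondences (activity ID in model 1, activity ID in model 2).
gold = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
matcher_a = {("a1", "b1"), ("a2", "b2")}
matcher_b = {("a2", "b2"), ("a3", "b3"), ("a9", "b9")}

print(union_recall([matcher_a, matcher_b], gold))
```

In this example neither matcher finds ("a4", "b4"), so even the combined alignment reaches a recall of only 0.75. The score is an upper bound on the recall of any single matcher in the set.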

Table 4: Results of Asset Management Matching

             Precision            Recall               F-Measure
Approach     ∅-mic  ∅-mac  SD     ∅-mic  ∅-mac  SD     ∅-mic  ∅-mac  SD

AML-PM       .786   .664   .408   .595   .635   .407   .677   .48    .422
RMM/NHCM     .957   .887   .314   .505   .521   .422   .661   .485   .426
RMM/NLM      .991   .998   .012   .486   .492   .436   .653   .531   .438
BPLangMatch  .758   .567   .436   .563   .612   .389   .646   .475   .402
OPBOT        .662   .695   .379   .617   .634   .409   .639   .514   .403
MSSS         .897   .979   .079   .473   .486   .432   .619   .519   .429
RMM/VM2      .676   .621   .376   .545   .6     .386   .603   .454   .384
KMSSS        .643   .834   .282   .527   .532   .417   .579   .482   .382
TripleS      .614   .814   .261   .545   .546   .434   .578   .481   .389
pPalm-DS     .394   .724   .348   .595   .615   .431   .474   .451   .376
KnoMa-Proc   .271   .421   .383   .514   .556   .42    .355   .268   .279
RMM/SMSL     .722   .84    .307   .234   .37    .366   .354   .333   .327

The results for the AM data set are presented in Table 4. The top performing matchers in terms of micro average f-measure are AML-PM (0.677), RMM/NHCM (0.661), and RMM/NLM (0.653). While these systems are close in terms of f-measure, they have different characteristics in terms of precision and recall. The two RMM-based systems have a high precision in common. RMM/NLM in particular has a precision of 0.991, which means that fewer than 1 out of 100 correspondences are incorrect. AML-PM, the top performing system, has a precision of only .786 and a (relatively high) recall of .595. It is notable that these results have been achieved by using a standard ontology matching system instead of a specific approach for process model matching. For the details we refer the reader to the respective system description in the previous section. The best results in terms of recall have been achieved by the OPBOT matcher (0.617). Looking at the recall scores in general, it can be concluded that it is hard to exceed a recall of 0.6 without a significant loss in precision.

The results of our evaluation show that there is a high variance in terms of the identified correspondences across the different data sets. However, there are also some systems that perform well over all three data sets (we exclude the UAS data set in this consideration due to its experimental character). These systems are RMM/NHCM and OPBOT. In terms of micro average f-measure, RMM/NHCM is ranked #1, #3 and #2, and OPBOT is ranked #4, #1, and #5. None of the other approaches is among the top five with respect to all three data sets. This illustrates again how hard it is to propose a mechanism that works well for the different modeling styles and labeling conventions that can be found in our test data collection.

Important dates

Publication of datasets

Submission deadline (match results and matcher description)



The matching contest in 2015 is organized by

We would also like to acknowledge the support of Christopher Klinkmüller for making available the SAP Reference Model.

Any concrete questions related to participation / submission of results should be addressed directly to Henrik Leopold.