Title: An Unsupervised Bayesian Probabilistic Model based Semantic Matchmaking Approach for Web Service Discovery

Web service discovery is a fundamental task in service-oriented architecture (SOA): finding suitable web services that match a particular interest. This paper presents an intelligent and “user-friendly” service discovery approach that enables requesters to search for services by entering various types of query content, including words, phrases, or even sentences. Specifically, an unsupervised Bayesian probabilistic model, the bi-Directional Hybrid Priors Topic Model (bi-SWTM), is proposed to semantically match possible query contents (words, phrases, or sentences) against the sentences in web service descriptions by mapping words and sentences into the same semantic space. bi-SWTM captures the textual semantics of words and sentences in a probabilistic simplex, which provides a flexible way to build semantic links from a query to a service description. Moreover, the textual semantics generated by bi-SWTM are highly interpretable, which helps in understanding user requirements. The proposed model is evaluated on ProgrammableWeb. Experimental results demonstrate that bi-SWTM outperforms state-of-the-art methods for semantic service discovery on service classification and retrieval. Visualizations of the nearest-neighbor queries and descriptions offer insight into how our model captures the latent semantics of web services.
The main contributions of this paper are as follows:

  • We propose a novel bi-Directional Hybrid Priors Topic Model (bi-SWTM) to model the words and sentences in web services. It is capable of discovering the latent semantics of both complex queries and service descriptions.
  • The semantics learned from the service descriptions are highly interpretable, and service queries and service descriptions can be embedded into the same semantic space, which provides an effective way to better understand services when matching queries against descriptions.
  • Comprehensive experiments on ProgrammableWeb demonstrate that the semantics revealed by our model are of high quality and that the model significantly improves the performance of service classification and retrieval compared with state-of-the-art baselines.

    Data and Source Code

    We provide the code, the scripts, and the data here. Our code is written in C++ and Python 3.5, so both g++ and Python are required. Click here to download the code of the proposed bi-SWTM.

    The preprocessed data comes from ProgrammableWeb and consists of 22,000 web services, each described by a short unstructured text. We use the released repository and remove the web services whose descriptions contain fewer than 5 words, which yields a subset of 12,912 web services. The package contains the original data, the preprocessed data, the categories, the dictionary, and the stop words.
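
    The length filter described above can be sketched as follows. This is only an illustration, not the released preprocessing script: the function names, the dictionary layout (service name mapped to description), and the choice to count whitespace-separated words in the raw description are all assumptions.

```python
from typing import Dict

MIN_WORDS = 5  # threshold stated in the text; descriptions below it are dropped


def has_enough_words(description: str, min_words: int = MIN_WORDS) -> bool:
    """Return True if the description has at least `min_words` words.

    Assumption: a "word" is a whitespace-separated token of the raw text.
    The released scripts may count words differently (for example, after
    stop-word removal).
    """
    return len(description.split()) >= min_words


def filter_services(services: Dict[str, str],
                    min_words: int = MIN_WORDS) -> Dict[str, str]:
    """Keep only the services whose description is long enough."""
    return {
        name: description
        for name, description in services.items()
        if has_enough_words(description, min_words)
    }


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    toy = {
        "maps-api": "Provides geocoding and routing for web applications",
        "tiny": "Too short",
    }
    print(sorted(filter_services(toy)))
```

    Applied to the full repository, a filter of this kind would be the step that reduces the 22,000 services to the 12,912 used in the experiments.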

    Experimental Results

    The experiments include Services Classification, Services Discovery, Nearest Services, and Language Modeling.

    The scripts can be found here. You can download all the code and data packages to reproduce each part of the experiments.

    In addition, some intermediate results of bi-SWTM and the baseline models are provided here, such as the classification results, the retrieval results, and the vectors learned by the models.
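
    As a rough illustration of how vectors on a probability simplex could be used for retrieval, the sketch below ranks service descriptions by their distance to a query vector. The Hellinger distance, the function names, and the toy vectors are assumptions made for this example; they are not necessarily the measure or the data format used in the paper or in the released files.

```python
import math
from typing import Dict, List, Sequence, Tuple


def hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Hellinger distance between two discrete probability distributions.

    The result lies in [0, 1]: 0 for identical distributions, 1 for
    distributions with disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same dimension")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def rank_services(query: Sequence[float],
                  services: Dict[str, Sequence[float]],
                  top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the `top_k` services closest to the query, nearest first."""
    scored = [(name, hellinger(query, vec)) for name, vec in services.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical 3-topic vectors; real vectors would come from a trained model.
    services = {
        "maps": [0.6, 0.3, 0.1],
        "payments": [0.1, 0.1, 0.8],
        "weather": [0.3, 0.6, 0.1],
    }
    query = [0.7, 0.2, 0.1]
    for name, dist in rank_services(query, services):
        print(name, round(dist, 3))
```

    Because queries and descriptions are placed in the same space, the same function can compare a single word, a phrase, or a sentence against any service description, provided each is represented as a distribution of the same dimension.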

    Additional Experiments

    These experiments were added in response to the reviewers' helpful comments.

    They focus on service classification at the phrase level and the paragraph level.

    The scripts and data can be found here. You can download all the code and data packages and reproduce the experiments by following the steps in readme.txt.

    Word2Vec and LDA are included as baseline models.