In developing intelligent systems, learning effective text representations, especially sentence features, is increasingly important. Numerous previous studies have concentrated on sentence representation learning based on deep learning approaches. However, existing approaches are mostly designed for a single task or rely on labeled corpora when learning sentence embeddings. In this paper, we assess the factors in learning sentence representations and propose an efficient unsupervised learning framework with multi-task learning (USR-MTL), in which various text learning tasks are merged into a unified framework. Considering the syntactic and semantic features of sentences, three factors shape sentence representation learning to some extent: the wording, the word order, and the ordering of the neighboring sentences of a target sentence. Hence, we integrate the word prediction task, the word-order learning task, and the sentence-order learning task into the proposed framework to attain meaningful sentence embeddings. In this way, the process of sentence embedding learning is reformulated as multi-task learning over one sentence-level task and two word-level tasks. Moreover, the proposed framework is trained with an unsupervised algorithm on an unlabeled corpus. Experimental results show that our approach achieves state-of-the-art performance on downstream natural language processing tasks compared to popular unsupervised representation learning techniques. Experiments on representation visualization and task analysis demonstrate the effectiveness of the tasks in the proposed framework in producing reasonable sentence representations, proving the capacity of the proposed unsupervised multi-task framework for sentence representation learning.
The main contributions of this paper are as follows:
We propose USR-MTL, an unsupervised sentence representation learning framework with multi-task learning; it is a novel unsupervised multi-task learning technique that learns sentence representations from an unlabeled corpus. Experiments demonstrate that the proposed technique outperforms state-of-the-art unsupervised approaches on nine downstream tasks.
Three different types of unsupervised text learning tasks are integrated into the proposed framework with an autoencoder to learn high-quality sentence embeddings: the sentence-order learning task, the word prediction task, and the word-order learning task. To the best of our knowledge, this is the first work to merge unsupervised text learning tasks at both the sentence level and the word level.
We design an efficient algorithm for the proposed framework that jointly trains an encoder and the three language learning tasks; a minimal sketch of this joint objective is given after this list. Experiments on text classification, semantic relatedness, paraphrase detection, and image-sentence ranking demonstrate the effectiveness of USR-MTL, and experiments on task analysis and sentence representation visualization confirm the contribution of the three unsupervised text learning tasks used in USR-MTL.
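To make the joint objective concrete, the following is a minimal PyTorch sketch of a shared sentence encoder trained with three task heads. The GRU encoder, the binary formulations of the word-order and sentence-order tasks, the bag-of-words formulation of word prediction, and the uniform loss weights are all illustrative assumptions rather than the paper's exact configuration (which integrates the tasks with an autoencoder).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceEncoder(nn.Module):
    """Shared encoder: embed tokens, run a GRU, keep the last hidden state."""
    def __init__(self, vocab_size, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, tokens):               # tokens: (batch, seq_len)
        _, h = self.gru(self.embed(tokens))  # h: (1, batch, hid_dim)
        return h.squeeze(0)                  # sentence embedding

class USRMTL(nn.Module):
    """One encoder shared by three unsupervised task heads (illustrative)."""
    def __init__(self, vocab_size, hid_dim=512):
        super().__init__()
        self.encoder = SentenceEncoder(vocab_size, hid_dim=hid_dim)
        self.word_pred = nn.Linear(hid_dim, vocab_size)  # which words occur?
        self.word_order = nn.Linear(hid_dim, 2)          # shuffled or not?
        self.sent_order = nn.Linear(2 * hid_dim, 2)      # true neighbor or not?

    def forward(self, sent, maybe_shuffled, candidate_next):
        z = self.encoder(sent)
        word_logits = self.word_pred(z)                       # word prediction
        wo_logits = self.word_order(self.encoder(maybe_shuffled))
        so_logits = self.sent_order(
            torch.cat([z, self.encoder(candidate_next)], dim=-1))
        return word_logits, wo_logits, so_logits

def joint_loss(model, batch, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses; `batch` holds ids and labels."""
    word_logits, wo_logits, so_logits = model(
        batch["sent"], batch["maybe_shuffled"], batch["candidate_next"])
    l_word = F.binary_cross_entropy_with_logits(word_logits, batch["bow"])
    l_wo = F.cross_entropy(wo_logits, batch["wo_label"])
    l_so = F.cross_entropy(so_logits, batch["so_label"])
    return weights[0] * l_word + weights[1] * l_wo + weights[2] * l_so
```

Sharing one encoder across all three losses is the point of the multi-task setup: each task supplies a different training signal, but the gradients all shape the same sentence embedding.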
We provide the code as well as the data below.
Our code is based on Python 3.5 and PyTorch, so both Python and PyTorch are required. Click here to download the code of the proposed USR-MTL.
After downloading the code, follow the README to set up and run our algorithm. The README is provided in both English and Chinese.
The data comes from here; it consists of 7,000 novels and 74M sentences. You can also visit smashwords.com to collect your own version of BookCorpus.
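The three tasks only need raw, ordered sentences. Below is a minimal sketch of turning such a corpus into training examples, assuming one tokenized sentence per line; the 50/50 shuffling and negative-sampling scheme is an illustrative choice, not necessarily the paper's.

```python
import random

def make_examples(path, seed=0):
    """Build (sent, maybe_shuffled, wo_label, candidate_next, so_label)
    tuples from a corpus file with one tokenized sentence per line."""
    rng = random.Random(seed)
    with open(path, encoding="utf-8") as f:
        sents = [line.split() for line in f if line.strip()]
    examples = []
    for i in range(len(sents) - 1):
        sent = sents[i]
        # Word-order task: shuffle the words of half of the sentences.
        maybe_shuffled, wo_label = sent[:], 0
        if rng.random() < 0.5:
            rng.shuffle(maybe_shuffled)
            wo_label = 1
        # Sentence-order task: half true neighbors, half random sentences.
        if rng.random() < 0.5:
            candidate_next, so_label = sents[i + 1], 1
        else:
            candidate_next, so_label = rng.choice(sents), 0
        examples.append((sent, maybe_shuffled, wo_label,
                         candidate_next, so_label))
    return examples
```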
The experiments include Text Classification, Paraphrase Detection, Semantic Relatedness, Semantic Textual Similarity, and Image-Sentence Ranking, as well as task analysis and visualization of the sentence representations.
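For the classification-style downstream tasks, we assume the standard protocol of freezing the trained encoder, embedding each sentence once, and fitting a simple classifier on the fixed embeddings; the function names below are hypothetical.

```python
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def embed(encoder, batches):
    """Encode tokenized batches into one (n_sentences, dim) matrix."""
    encoder.eval()
    return torch.cat([encoder(b) for b in batches]).cpu().numpy()

def evaluate(encoder, train_batches, y_train, test_batches, y_test):
    """Fit logistic regression on frozen embeddings; return test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embed(encoder, train_batches), y_train)
    return clf.score(embed(encoder, test_batches), y_test)
```

Because the encoder stays frozen, scores on these tasks measure the quality of the learned sentence representations themselves rather than task-specific fine-tuning.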