There are two project levels; choose one of them to submit.
Design a simple QA system (or dialog system)
You can use FAQ from the SMS Spam Collection Data Set, which contains 100M examples. The reference paper is "The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems". GitHub: rkadlec/ubuntu-ranking-dataset-creator
Design a Chinese-English translation system
You can use the data from the United Nations Parallel Corpus (https://conferences.unite.un.org/UNCorpus/zh#introduction)
Design an automatic summary extractor with Baidu wiki
Design an information retrieval system with Baidu wiki
Text classifier for news
You can use the data from ByteDance (https://github.com/aceimnorstuvwxz/toutiao-text-classfication-dataset)
*Any competition released by Alibaba, ByteDance, Baidu, Tencent, Huawei, etc.
Read an NLP paper published in the last three years at a top conference, such as AAAI, IJCAI, ACL, or EMNLP. You need to implement it and show your code when you report on it.
Note that you will most likely receive a lower score if you choose Level 2 rather than Level 1.