Cyberspace of Shujun LI


General AI


Natural Language Processing and Computational Linguistics

General Tools:
  NLTK (Natural Language Toolkit)
  Stanford CoreNLP (Java)
  TextBlob
  spaCy (textacy: NLP, before and after spaCy)
  PyTorch-NLP (GitHub)
  NLP.js
  Natural
  Apache OpenNLP
  CogCompNLP
  Hugging Face (datasets; Write With Transformer)
  Talk to Transformer (InferKit online demo)
  quanteda: Quantitative Analysis of Textual Data in R (GitHub)
  Linguistic Inquiry and Word Count (LIWC)
  gensim – Topic Modelling in Python
  BERTopic
  Transformers
  Transformer-XL
  bert-as-service
  BERTweet: A pre-trained language model for English Tweets (EMNLP 2020)
  COVID-Twitter-BERT (CT-BERT)
  RNNTagger
  TreeTagger
  Python Word Segmentation
  Word Ninja
  SymSpell (Python port: symspellpy)
  Language Style Transfer (NIPS 2017)
  GeoTxt (Transactions in GIS 2019)
  Edinburgh Geoparser
  GeoPy
  XAI for Natural Language Processing (AACL-IJCNLP 2020)

Pretrained Models:
  Pretrained model repository (预训练模型仓库)
  OpenAI
  Google's BERT
  Microsoft DeepSpeed (GitHub)
  悟道 (Wudao) (WuDaoCorpora; GitHub, GLM, CLM; BMInf)
  PaddleNLP

Chinese NLP Resources:
  Baidu ERNIE
  Pretrained language models and related optimization techniques developed by Huawei Noah's Ark Lab (PengCheng PanGu-α / 鹏程.盘古α)
  Chinese BERT-wwm model series
  awesome-chinese-nlp (Guan Wang)
  "Jieba" (结巴) Chinese word segmentation
  THUAIPoet (Jiuge / 九歌) research group (Jiuge V2.0; BERT-CCPoem, MixPoet @ AAAI 2020, Stylistic Poetry @ EMNLP 2018, WMPoetry @ IJCAI 2018; CCPM = Chinese Classical Poetry Matching Dataset, Other datasets)
  Xiaoice (小冰), the girl poet
  tensorflow_poems / LiBai AI Composer / automatic classical Chinese poetry writing bot
  Small Chinese corpus datasets (中文语料小数据)

Datasets:
  Nicolas Iderhoff's nlp-datasets
  WordNet
  Wikimedia Downloads
  Wiktionary (Frequency lists)
  Amazon MASSIVE dataset
  WebNLG Challenge
  Wiktextract (data)
  British National Corpus
  Use of corpora in translation studies @ Centre for Translation Studies, University of Leeds
  OpenLexicon
  Lexique (WorldLex: Blog, Twitter and Newspapers Word Frequencies for 66 languages)
  Datasets of Automatic Keyphrase Extraction @ LIAAD, INESC TEC
  KPTimes Corpus @ INLG 2019
  dewiki-wordrank
  OAGSX Title Generation Dataset
  OAGKX Keyword Generation Dataset
  GeoNames

Privacy-related resources: PrivaSeer (PrivaSeer Corpus @ ACL 2021, PrivBERT @ ACL 2021)

Federated Learning

General Resources:
  Google AI's Federated Learning online comic
  Awesome-Federated-Learning
  The Federated Learning Portal

Open-source Tools:
  TensorFlow Federated (TFF) (GitHub)
  NVIDIA Clara
  FedML: A Research Library and Benchmark for Federated Machine Learning
  FedML-AI (GitHub)
  FedAI WeBank AI's Federated AI Ecosystem (Federated Learning Research at WeBank AI)

Commercial Solutions:
  Owkin
  Rhino Health
