Cyberspace of Shujun LI


Shujun's Publications

e-Data and Data Analytics Services: Zenodo Figshare Dryad DataCite Wolfram Data Repository The GDELT Project: A Global Database of Society Global Open Data Index (Open Knowledge) Sunlight Foundation JRC (Joint Research Centre) Data Catalogue UK Data Service Police Foundation’s Public Safety Open Data Portal mldata (machine learning data set repository) MLcomp datasets UCI Machine Learning Repository Kaggle KNIME Google Dataset Search Google Public Data Explorer Common Crawl People Data Labs (Free Company Dataset, Largest US Employers by Metro Dataset, Free Job Title Dataset, Free Engineering Skills Dataset) Awesome Public Datasets 数据堂 DataSift DataGenetics Informatica Corporation Splunk Inc. Tableau Software Social Media Analysis Toolkit (SMAT) (GitLab) Newspaper3k: Article scraping & curation Trint Scrintal
Natural Language Processing and Computational Linguistics: NLTK (Natural Language Toolkit) Stanford CoreNLP (Java) TextBlob spaCy (textacy: NLP, before and after spaCy) PyTorch-NLP (GitHub) NLP.js Apache OpenNLP CogCompNLP Hugging Face (datasets; Write With Transformer) Talk to Transformer (InferKit online demo) quanteda: Quantitative Analysis of Textual Data in R (GitHub) Linguistic Inquiry and Word Count (LIWC) gensim – Topic Modelling in Python BERTopic Transformers Transformer-XL bert-as-service BERTweet: A pre-trained language model for English Tweets (EMNLP 2020) COVID-Twitter-BERT (CT-BERT) RNNTagger TreeTagger Python Word Segmentation Word Ninja SymSpell (Python port: symspellpy) Language Style Transfer (NIPS 2017) 预训练模型仓库 (pre-trained model repository) 悟道 (Wudao) (WuDaoCorpora; GitHub, GLM, CLM; BMInf) PaddleNLP 百度ERNIE Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab (鹏程.盘古α / PanGu-α) 中文BERT-wwm系列模型 (Chinese BERT-wwm model series) awesome-chinese-nlp (Guan Wang) “结巴”中文分词 (jieba Chinese word segmentation) THUAIPoet (九歌) research group (九歌V2.0; BERT-CCPoem, MixPoet @ AAAI 2020, Stylistic Poetry @ EMNLP 2018, WMPoetry @ IJCAI 2018; 中国古典诗歌匹配数据集 / CCPM = Chinese Classical Poetry Matching Dataset, Other datasets) 少女诗人小冰 tensorflow_poems / LiBai AI Composer / 中文古诗自动作诗机器人 (automatic classical Chinese poetry generator) Nicolas Iderhoff's nlp-datasets WordNet Wikimedia Downloads Wiktionary (Frequency lists) WordNet WebNLG Challenge Wiktextract (data @ British National Corpus Use of corpora in translation studies @ Centre for Translation Studies, University of Leeds OpenLexicon Lexique (WorldLex: Blog, Twitter and Newspapers Word Frequencies for 66 languages) Datasets of Automatic Keyphrase Extraction @ LIAAD, INESCTEC KPTimes Corpus @ INLG 2019 dewiki-wordrank OAGSX Title Generation Dataset OAGKX Keyword Generation Dataset 中文语料小数据 (small Chinese corpus datasets) PrivaSeer (PrivaSeer Corpus @ ACL 2021, PrivBERT @ ACL 2021)
Personal Data Management Platforms: MyData Global Solid HAT (Hub-of-All-Things) (HATDeX - HAT Data Exchange Ltd, HAT Community Foundation (HCF), HAT Accelerator, Documentation for Developers) DataBox Project Aircloak openPDS/SafeAnswers: Personal Data with Privacy

False Information

Organizations, Tools and Data: W3C Credible Web Community Group (GitHub, Credible Web CG Area-2 (Corroboration-Based Strategies)) WikiCred Truth Decay @ RAND (Fighting Disinformation Online: A Database of Web Tools) NewsGuard (Firefox extension - NewsGuard, Google Chrome extension - NewsGuard, Firefox extension - HealthGuard, Android app - NewsGuard; COVID-19 Misinformation Resources, Coronavirus Misinformation Tracking Center) misinformation datasets @ FakeNewsTracker Google Fact Check (Google Fact Check Tools API, Google Fact Check Explorer, Google Fact Check Markup Tool) Fact-check Feed @ SMAT: The Social Media Analysis Toolkit Verifi! News Landscape (NELA) Toolkit FakerFact Media Manipulation Casebook Fact-checking organizations with FACTS-NFT 台灣事實查核中心 (Taiwan FactCheck Center) Lead Stories EU DisinfoLab Full Fact First Draft Media Bias/Fact Check (MBFC) Science Media Centre MisinfoCon Credibility Coalition Fake News Challenge (FNC) (Stance Detection dataset for FNC-1) Poynter Institute (International Fact-Checking Network - IFCN, IFCN Code of Principles; #CoronaVirusFacts Alliance, CoronaVirusFacts/DatosCoronaVirus Alliance Database) Fairness & Accuracy In Reporting (FAIR) Content Authenticity Initiative (CAI) Project Origin: Protecting Trusted Media Coalition for Content Provenance and Authenticity (C2PA) BBC Disinformation Watch BBC Reality Check FactCheck @ Channel 4 News Fact check @ The Ferret The Reporters' Lab (Fact-Checking, The Duke Tech & Check Cooperative, ClaimReview) Truth or Fiction Check Your Fact FactsCan AFP Fact Check Africa Check Bad News game Go Viral! Snopes PolitiFact Global Disinformation Index (GDI) Gossip Cop Fact Checker @ The Washington Post Hoax-Slayer COVID-19 Misinformation Newsletter @ Programme on Democracy and Technology (DemTech), Oxford University Factmata AuCoDe Arkose Labs (Fake Reviews, Fake Users) Lie Detectors Fakespot Masterpiece Generator Tweetgen fake-resume-generator

Multimedia False Information: Awesome Deepfakes fake-face-detection: some collected paper and personal notes relevant to Fake Face Detection DeepFake-o-meter: An open platform integrating state-of-the-art DeepFake detection methods This Person Does Not Exist (datasets, face generator, free generated photos) DeepFaceDrawing: Deep Generation of Face Images from Sketches (SIGGRAPH 2020) (DeepFaceDrawing-Jittor @ GitHub) DeepNude-an-Image-to-Image-technology pix2pix: Image-to-Image Translation with Conditional Adversarial Nets (CVPR 2017) (CycleGAN and pix2pix in PyTorch @ GitHub, original code @ GitHub, Christopher Hesse's interactive demo) Deepware Scanner (GitHub) Adversarial Deepfakes (WACV 2021) DefakeHop (ICME 2021) DeepFaceLab Faceswap (GitHub) MyFakeApp (based on Faceswap) ZAO App Reface App WOMBO Botika Virtual Humans FaceForensics Benchmark Partnership on AI's AI and Media Integrity Steering Committee (Deepfake Detection Challenge = DFDC) WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection Celeb-DF (v2): A New Dataset for DeepFake Forensics (CVPR 2020) (GitHub) KoDF: A Large-scale Korean DeepFake Detection Dataset (CVPR 2021) CoMoFoD - Image Database for Copy-Move Forgery Detection Copy-Move Forgery Database with Similar but Genuine Objects (COVERAGE) Truthmark GANDCTAnalysis
MKLab-ITI's image-verification-corpus Assembler (Google's project) A corpus of debunked and verified user-generated videos (Online Information Review 2019) Fake-EmoReact 2021 Challenge @ SocialNLP 2021 EmotionGIF 2020 Challenge @ SocialNLP 2020

Other Research Related: WeVerify GATE (General Architecture for Text Engineering) PHEME project PAN (a series of scientific events and shared tasks on digital text forensics and stylometry) CLEF2020 CheckThat! Lab (Enabling Automatic Identification and Verification of Claims in Social Media) CLEF2019 CheckThat! Lab CLEF2018 CheckThat! Lab FEVER Datasets (scientific claims) ClaimBuster: Automated Live Fact-checking (ClaimPortal, ICWSM 2020 dataset) Claim Detection in Social Media via Fusion of Transformer and Syntactic Features (CLEF CheckThat! 2020) ClaimsKG claim-rank (RANLP 2017) Claim Extraction for Scientific Publications SciFact (GitHub) Too Many Claims to Fact-Check: Prioritizing Political Claims Based on Check-Worthiness (MAISoN'2020 @ CIKM'2020) entity-fishing - Entity Recognition and Disambiguation Full Fact's Fast & Furious Fact Check Challenge (2016) (Iffy Index of Unreliable Sources, Wayback Workshop) OSoMe (Observatory on Social Media) @ Network Science Institute (IUNI), Center for Complex Networks and Systems Research (CNetS), Indiana University (Tools and Datasets: Hoaxy®, Fakey, Botometer, BotSlayer, CoVaxxy, EchoDemo) Graph-based Fraud Detection Papers and Resources VoterFraud2020 (@ GitHub, @ Figshare) FakeNewsNet Maciej Szpakowski's Fake News Corpus Fakeddit (GitHub) Dichotomies of Disinformation Avax (anti-vaccine) tweets dataset (2021) The COVID-19 Infodemic: Can the Crowd Judge Recent Misinformation Objectively? (SIGIR 2020 + ECIR 2020 + CIKM 2020) ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research (CIKM 2020) FakeCovid: Fact Checked data for COVID-19 (ICWSM 2020 workshop) Dataset for COVID-19 Misinformation on Twitter (2020) CHECKED: Chinese COVID-19 Fake News Dataset (2020) Factuality and Bias Prediction of News Media (ACL 2020 + EMNLP 2018) FakeHealth repository (ICWSM 2020) FiveThirtyEight's dataset of 3 million Russian troll tweets Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board (ICWSM 2020) Learning from Fact-checkers (SIGIR 2019) The Rise of Guardians (SIGIR 2018) LIAR-PLUS fake news database (FEVER 2018) LIAR fake news database (ACL 2017) CREDBANK-data (ICWSM 2015) 中文谣言数据 (Chinese rumor data) (中国科学: 信息科学 2015)

Information Visualization

Tools: Transparency Vis


Tools: GetOldTweets-java (GetOldTweets3)
Data: COVID-19 @ AMiner (COVID-19 Open Datasets, dashboard)


