Maosong Sun

Professor, Department of Computer Science and Technology, Tsinghua University
2025

Anticipation Committee member

My research interests are computational linguistics, statistical and corpus-based natural language processing (NLP), Chinese language computing (computational morphology, bilingual terminology extraction), information retrieval (Chinese text categorization, graphical model-based keyword extraction), collective intelligence (tag generation, Web trend analysis) and social computing (query log analysis, community discovery). I have participated as project leader or principal researcher in over 20 projects funded by the National Natural Science Foundation of China, the National Social Science Foundation of China, the National 863 High-Tech Program and the National 973 Basic Research Program, as well as in projects funded by a number of international IT companies. Together with my students, I have published about 130 papers in academic journals and international conferences in the above fields. These papers have roughly 1,400 citations in total on Google Scholar. I have served as a program committee member for numerous national and international conferences, and many times as conference chair or program committee chair.

One of my research focuses is Chinese word segmentation, the most fundamental issue in Chinese information processing. I have proposed some key concepts in word segmentation (such as maximal overlapping segmentation ambiguities, true and pseudo segmentation ambiguities, and local and global statistics), and have developed an integrated Chinese word segmentation and part-of-speech tagging system. The system draws on many kinds of knowledge: bigrams of words, parts of speech and characters; statistical and structural information about named entities; and local statistics of character strings. I have also tried to extend my experience in Chinese word segmentation to other languages in which word segmentation is needed, resulting in an international standard, "ISO/FDIS 24614-1: Language Resource Management -- Word Segmentation of Written Texts -- Part 1: Basic Concepts and General Principles". I am the sole project leader for this ISO standard.

Recently, I have presented an original viewpoint in NLP: NLP based on huge-scale naturally annotated corpora. The basic idea is that, with Web-scale corpora, natural annotation may help machines perform some NLP tasks better. There are two types of natural annotation: explicit (such as punctuation, anchor text, query logs, Wikipedia and blog tags) and implicit (such as language usage patterns). I have further put forward a fundamental problem: if we could integrate all the information drawn from naturally annotated corpora from different perspectives, would we be able to achieve some degree of deep understanding of language? A preliminary work by my students and me, published in Computational Linguistics in 2009, showed the usefulness of punctuation in Chinese word segmentation, suggesting that this idea deserves further study.
