Represent-ngram (Index of Posts):

This categoy of tools learns to represent ngrams (a window of n continous tokens)..
This includes:

GaKCo-SVM- a Fast GApped k-mer string Kernel using COunting

Tool GaKCo-SVM: a Fast GApped k-mer string Kernel using COunting

Paper: @Arxiv | @ECML17


Talk PDF



String Kernel (SK) techniques, especially those using gapped k-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slow when we increase the dictionary size (Σ) or allow more mismatches (M). This is because current gk-SK uses a trie-based algorithm to calculate co-occurrence of mismatched substrings resulting in a time cost proportional to O(ΣM). We propose a \textbf{fast} algorithm for calculating \underline{Ga}pped k-mer \underline{K}ernel using \underline{Co}unting (GaKCo). GaKCo uses associative arrays to calculate the co-occurrence of substrings using cumulative counting. This algorithm is fast, scalable to larger Σ and M, and naturally parallelizable. We provide a rigorous asymptotic analysis that compares GaKCo with the state-of-the-art gk-SK. Theoretically, the time cost of GaKCo is independent of the ΣM term that slows down the trie-based approach. Experimentally, we observe that GaKCo achieves the same accuracy as the state-of-the-art and outperforms its speed by factors of 2, 100, and 4, on classifying sequences of DNA (5 datasets), protein (12 datasets), and character-based English text (2 datasets), respectively.



	location = {Cham},
	title = {GaKCo: A Fast Gapped k-mer String Kernel Using Counting},
	isbn = {978-3-319-71249-9},
	pages = {356--373},
	booktitle = {Machine Learning and Knowledge Discovery in Databases},
	publisher = {Springer International Publishing},
	author = {Singh, Ritambhara and Sekhon, Arshdeep and Kowsari, Kamran and Lanchantin, Jack and Wang, Beilun and Qi, Yanjun},
	editor = {Ceci, Michelangelo and Hollmén, Jaakko and Todorovski, Ljupčo and Vens, Celine and Džeroski, Sašo},
	date = {2017}

Support or Contact

Having trouble with our tools? Please contact Rita and we’ll help you sort it out.

Deep Learning for Character-based Information Extraction on Chinese and Protein Sequence

Title: Deep Learning for Character-based Information Extraction on Chinese and Protein Sequence

  • authors: Yanjun Qi, Sujatha Das, Ronan Collobert, Jason Weston

Paper ECIR

Supplementary Here

Talk: Slide


In this paper we introduce a deep neural network architecture to perform information extraction on character-based sequences, e.g. named-entity recognition on Chinese text or secondary-structure detection on protein sequences. With a task-independent architecture, the deep network relies only on simple character-based features, which obviates the need for task-specific feature engineering. The proposed discriminative framework includes three important strategies, (1) a deep learning module mapping characters to vector representations is included to capture the semantic relationship between characters; (2) abundant online sequences (unlabeled) are utilized to improve the vector representation through semi-supervised learning; and (3) the constraints of spatial dependency among output labels are modeled explicitly in the deep architecture. The experiments on four benchmark datasets have demonstrated that, the proposed architecture consistently leads to the state-of-the-art performance.


  title={Deep learning for character-based information extraction},
  author={Qi, Yanjun and Das, Sujatha G and Collobert, Ronan and Weston, Jason},
  booktitle={European Conference on Information Retrieval},

Support or Contact

Having trouble with our tools? Please contact Yanjun Qi and we’ll help you sort it out.

Document classification with weighted supervised n-gram embedding


  • Methods and systems for document classification include embedding n-grams from an input text in a latent space, embedding the input text in the latent space based on the embedded n-grams and weighting said n-grams according to spatial evidence of the respective n-grams in the input text, classifying the document along one or more axes, and adjusting weights used to weight the n-grams based on the output of the classifying step.

  • authors: Qi, Yanjun and Bai, Bing

Paper1: Sentiment classification with supervised sequence embedding

  • PDF
  • Talk: Slide

  • Abstract In this paper, we introduce a novel approach for modeling n-grams in a latent space learned from supervised signals. The proposed procedure uses only unigram features to model short phrases (n-grams) in the latent space. The phrases are then combined to form document-level latent representation for a given text, where position of an n-gram in the document is used to compute corresponding combining weight. The resulting two-stage supervised embedding is then coupled with a classifier to form an end-to-end system that we apply to the large-scale sentiment classification task. The proposed model does not require feature selection to retain effective features during pre-processing, and its parameter space grows linearly with size of n-gram. We present comparative evaluations of this method using two large-scale datasets for sentiment classification in online reviews (Amazon and TripAdvisor). The proposed method outperforms standard baselines that rely on bag-of-words representation populated with n-gram features.

Paper2: Sentiment Classification Based on Supervised Latent n-gram Analysis

  • PDF
  • Talk: Slide

  • Abstract In this paper, we propose an efficient embedding for modeling higher-order (n-gram) phrases that projects the n-grams to low-dimensional latent semantic space, where a classification function can be defined. We utilize a deep neural network to build a unified discriminative framework that allows for estimating the parameters of the latent space as well as the classification function with a bias for the target classification task at hand. We apply the framework to large-scale sentimental classification task. We present comparative evaluation of the proposed method on two (large) benchmark data sets for online product reviews. The proposed method achieves superior performance in comparison to the state of the art.


  title={Document classification with weighted supervised n-gram embedding},
  author={Qi, Yanjun and Bai, Bing},
  month=nov # "~18",
  publisher={Google Patents},
  note={US Patent 8,892,488}

Support or Contact

Having trouble with our tools? Please contact Yanjun Qi and we’ll help you sort it out.

Systems and methods for semi-supervised relationship extraction

Title: Systems and methods for semi-supervised relationship extraction

  • authors: Qi, Yanjun and Bai, Bing and Ning, Xia and Kuksa, Pavel

Paper1: Semi-supervised abstraction-augmented string kernel for multi-level bio-relation extraction

  • PDF
  • Talk: Slide

  • Abstract Bio-relation extraction (bRE), an important goal in bio-text mining, involves subtasks identifying relationships between bio-entities in text at multiple levels, e.g., at the article, sentence or relation level. A key limitation of current bRE systems is that they are restricted by the availability of annotated corpora. In this work we introduce a semi-supervised approach that can tackle multi-level bRE via string comparisons with mismatches in the string kernel framework. Our string kernel implements an abstraction step, which groups similar words to generate more abstract entities, which can be learnt with unlabeled data. Specifically, two unsupervised models are proposed to capture contextual (local or global) semantic similarities between words from a large unannotated corpus. This Abstraction-augmented String Kernel (ASK) allows for better generalization of patterns learned from annotated data and provides a unified framework for solving bRE with multiple degrees of detail. ASK shows effective improvements over classic string kernels on four datasets and achieves state-of-the-art bRE performance without the need for complex linguistic features.

ask1 ask2

Paper2: Semi-Supervised Convolution Graph Kernels for Relation Extraction

  • PDF

  • Talk: Slide
  • URL More

  • Abstract Extracting semantic relations between entities is an important step towards automatic text understanding. In this paper, we propose a novel Semi-supervised Convolution Graph Kernel (SCGK) method for semantic Relation Extraction (RE) from natural language. By encoding English sentences as dependence graphs among words, SCGK computes kernels (similarities) between sentences using a convolution strategy, i.e., calculating similarities over all possible short single paths from two dependence graphs. Furthermore, SCGK adds three semi-supervised strategies in the kernel calculation to incorporate soft-matches between (1) words, (2) grammatical dependencies, and (3) entire sentences, respectively. From a large unannotated corpus, these semi-supervision steps learn to capture contextual semantic patterns of elements in natural sentences, which therefore alleviate the lack of annotated examples in most RE corpora. Through convolutions and multi-level semi-supervisions, SCGK provides a powerful model to encode both syntactic and semantic evidence existing in natural English sentences, which effectively recovers the target relational patterns of interest. We perform extensive experiments on five RE benchmark datasets which aim to identify interaction relations from biomedical literature. Our results demonstrate that SCGK achieves the state-of-the-art performance on the task of semantic relation extraction.

Paper3: Semi-Supervised Bio-Named Entity Recognition with Word-Codebook Learning

  • Pavel P. Kuksa, Yanjun Qi,
  • PDF

  • Abstract We describe a novel semi-supervised method called WordCodebook Learning (WCL), and apply it to the task of bionamed entity recognition (bioNER). Typical bioNER systems can be seen as tasks of assigning labels to words in bioliterature text. To improve supervised tagging, WCL learns a class of word-level feature embeddings to capture word semantic meanings or word label patterns from a large unlabeled corpus. Words are then clustered according to their embedding vectors through a vector quantization step, where each word is assigned into one of the codewords in a codebook. Finally codewords are treated as new word attributes and are added for entity labeling. Two types of wordcodebook learning are proposed: (1) General WCL, where an unsupervised method uses contextual semantic similarity of words to learn accurate word representations; (2) Task-oriented WCL, where for every word a semi-supervised method learns target-class label patterns from unlabeled data using supervised signals from trained bioNER model. Without the need for complex linguistic features, we demonstrate utility of WCL on the BioCreativeII gene name recognition competition data, where WCL yields state-of-the-art performance and shows great improvements over supervised baselines and semi-supervised counter peers.


  author = {Pavel P. Kuksa and Yanjun Qi and Bing Bai and Ronan Collobert and
	Jason Weston and Vladimir Pavlovic and Xia Ning},
  title = {Semi-Supervised Abstraction-Augmented String Kernel for Multi-Level
	Bio-Relation Extraction},
  booktitle = {ECML},
  year = {2010},
  note = {Acceptance rate: 106/658 (16%)},
  bib2html_pubtype = {Refereed Conference},

gsk1 gsk2 gsk3 gsk4 gsk5

Support or Contact

Having trouble with our tools? Please contact Yanjun Qi and we’ll help you sort it out.