This page summarizes our efforts for re-implementing gkm-svm algorithm with a novel and faster algorithmic design and better programming structure. Please feel free to email me when you find my typos.

overview of fast-gkm-svm

Existing gkm-SVM algorithms suffer from the slow kernel computation time, as they depend exponentially on the sub-sequence feature-length, number of mismatch positions, and the task’s alphabet size. In this project, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks.

Important wrappers we develop to help users

python wrapper

R wrapper

  • (to come)


Have questions or suggestions? Feel free to ask me on Twitter or email me.

Thanks for reading!