The website jointnets.org introduces updates to a suite of graphical-model tools we have developed for estimating relationships (in the form of graphs) among variables from heterogeneous data sets. Feel free to submit a pull request if you find a typo.

# Blog Posts

Technological revolutions in the past decade have produced large-scale heterogeneous samples across many scientific domains. For instance, genomic technologies have delivered petabytes of molecular measurements across hundreds of types of cells and tissues through national projects like ENCODE and TCGA. Neuroimaging technologies have generated petabytes of functional magnetic resonance imaging (fMRI) data across thousands of human subjects (shared publicly through projects like OpenfMRI). Given such data, understanding and quantifying variable graphs from heterogeneous samples (spanning multiple contexts) is a fundamental analysis task.

Such variable graphs can significantly simplify network-driven studies of disease, help characterize the neural underpinnings of clinical disorders, and shed light on genetic or neural pathways and systems. The number of contexts (denoted $K$) that these applications must consider grows extremely fast, ranging from tens (e.g., cancer types in TCGA) to thousands (e.g., subjects in OpenfMRI). The number of variables (denoted $p$) ranges from hundreds (e.g., brain regions) to tens of thousands (e.g., human genes).

One typical approach to this analysis problem is to jointly estimate $K$ different but related conditional dependency graphs through a multi-task formulation of the sparse Gaussian Graphical Model (multi-sGGM). Most current studies of multi-sGGMs, however, involve expensive and difficult non-smooth optimizations, making them hard to scale to high dimensions (large $p$) or many contexts (large $K$).

We aim to fill this gap and have designed a family of novel estimators that achieve fast and scalable joint structure estimation of multiple sGGMs. Four important tasks arise when learning multi-sGGMs from heterogeneous samples:

• (1) Enforcing graph relatedness through structural norms. The first type of multi-sGGM seeks to optimize a sparsity-regularized data likelihood function plus an extra norm function that enforces structural similarity among the multiple networks to be estimated.
• (2) Estimating the change of variable dependencies directly. The second category aims to estimate changes in the dependency structure of two $p$-dimensional Gaussian Graphical Models (GGMs), based on $n_c$ and $n_d$ samples from two contexts respectively. For instance, in a controlled disease study, 'c' may represent the 'control' group and 'd' the 'disease' group.
• (3) Learning task-specific edges explicitly. Explicitly quantifying the context-specific substructures is a very challenging optimization task under the traditional MLE-based multi-sGGM formulation.
• (4) Incorporating existing knowledge of the variable nodes or of the relationships among nodes. In addition to the samples themselves, extra information is widely available in real-world applications, for instance the spatial and anatomical evidence about brain regions when estimating functional brain connectivity networks from fMRI samples.

Targeting each challenge, our work introduces estimators that are both computationally efficient and theoretically guaranteed. The website JointNets.org introduces the suite of tools we have developed to help researchers effectively translate aggregated data into knowledge in the form of graphs.

## Background: Sparse Gaussian Graphical Model (sGGM) and multi-sGGMs

The sparse Gaussian Graphical Model (sGGM) assumes data samples are independently and identically drawn from a multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$. The graph structure $G$ among the $p$ features is encoded by the sparsity pattern of the inverse covariance matrix (also called the precision matrix) $\Omega$.

In $G$, no edge connects the $j$-th and $k$-th nodes (i.e., they are conditionally independent given the remaining variables) if and only if $\Omega_{jk} = 0$. sGGM imposes a sparsity-inducing L1 penalty on $\Omega$.
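
This zero pattern is easy to check numerically. Below is a minimal NumPy illustration we wrote for this page (it is not part of the released R packages): for a three-variable chain $X_1 - X_2 - X_3$, the precision matrix $\Omega$ is tridiagonal, so $\Omega_{13} = 0$ encodes that $X_1$ and $X_3$ are conditionally independent given $X_2$, even though their marginal covariance is nonzero.

```python
import numpy as np

# Precision matrix of a 3-variable chain X1 - X2 - X3:
# the (1,3) entry is zero, i.e. no edge between X1 and X3.
Omega = np.array([[ 2.0, -0.8,  0.0],
                  [-0.8,  2.0, -0.8],
                  [ 0.0, -0.8,  2.0]])

Sigma = np.linalg.inv(Omega)  # the implied covariance matrix

# The MARGINAL covariance between X1 and X3 is nonzero...
print(abs(Sigma[0, 2]) > 1e-6)                  # True

# ...but inverting Sigma recovers the zero in the precision matrix,
# i.e. the missing edge in G (conditional independence given X2).
print(abs(np.linalg.inv(Sigma)[0, 2]) < 1e-8)   # True
```

This is exactly why sGGM penalizes $\Omega$ rather than $\Sigma$: sparsity in the precision matrix, not the covariance, corresponds to missing edges.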

Modern multi-context molecular datasets are high-dimensional, heterogeneous, and noisy. For such data, rather than estimating an sGGM for each condition separately, a multi-task formulation that jointly estimates $K$ different but related sGGMs can lead to better generalization.

## A List of Our Tools for Jointly Learning Multiple Sparse Gaussian Graphical Models (multi-sGGMs) in a Scalable Way

We have designed a suite of novel and fast machine-learning algorithms to identify context-specific interaction graphs from such data.

#### So far, we have released the following R packages:

| No. | Tool | Short Description | Venue |
|-----|------|-------------------|-------|
| 1 | JEEK | Fast and Scalable Joint Estimator for **Integrating Additional Knowledge** in Learning Multiple Related Sparse Gaussian Graphical Models | ICML 2018 |
| 2 | DIFFEE | Fast and Scalable Learning of **Sparse Changes** in High-Dimensional Gaussian Graphical Model Structure | AISTATS 2018 |
| 3 | FASJEM | A Fast and Scalable Joint Estimator for Learning **Multiple Related** Sparse Gaussian Graphical Models | AISTATS 2017 |
| 4 | SIMULE | A Constrained L1 Minimization Approach for Estimating Multiple Sparse **Gaussian or Nonparanormal** Graphical Models | Machine Learning 2017 |
| 5 | W-SIMULE | A Constrained, Weighted-L1 Minimization Approach for Joint Discovery of **Heterogeneous Neural Connectivity** Graphs with Additional Prior Knowledge | NIPS 2017 Network Workshop |

## Contact

Have questions or suggestions? Feel free to ask me on Twitter or email me.

# A Series of Tutorials We Wrote to Explain the JointSGGM Tools We Built

#### So far, we have released the following Tutorials:

| No. | Tutorial Name |
|-----|---------------|
| 1 | Review I: Probability Foundations |
| 2 | Review II: Gaussian Graphical Model Basics |
| 3 | Review III: Markov Random Field and Log-Linear Model |
| 4 | Review IV: A Unified Framework for M-estimators and Elementary Estimators |
| 5 | Review V: Sparse Gaussian Graphical Model Estimators |
| 6 | Review VI: Multi-task sGGMs and Optimization Challenges |
| 7 | Review VII: Multi-task sGGM Estimators |
| 8 | Review VIII: Three Metrics for Evaluating Estimators/Learners |
| 9 | Reviews: All Tutorials for Joint-sGGMs Combined |
| 10 | 201807-Beilun-Defense Talk |
| 11 | 2018-BeilunDefense + 2017-AllJointGGTutorials |


# JEEK - Fast and Scalable Joint Estimator for Integrating Additional Knowledge in Learning Multiple Related Sparse Gaussian Graphical Models

### GitRepo for R package: URL

```r
install.packages("jeek")
library(jeek)
demo(jeek)
```


### Abstract

We consider the problem of including additional knowledge in estimating sparse Gaussian graphical models (sGGMs) from aggregated samples, arising often in bioinformatics and neuroimaging applications. Previous joint sGGM estimators either fail to use existing knowledge or cannot scale up to many tasks (large $K$) under a high-dimensional (large $p$) situation. In this paper, we propose a novel \underline{J}oint \underline{E}lementary \underline{E}stimator incorporating additional \underline{K}nowledge (JEEK) to infer multiple related sparse Gaussian Graphical models from large-scale heterogeneous data. Using domain knowledge as weights, we design a novel hybrid norm as the minimization objective to enforce the superposition of two weighted sparsity constraints, one on the shared interactions and the other on the task-specific structural patterns. This enables JEEK to elegantly consider various forms of existing knowledge based on the domain at hand and avoid the need to design knowledge-specific optimization. JEEK is solved through a fast and entry-wise parallelizable solution that largely improves the computational efficiency of the state-of-the-art $O(p^5K^4)$ to $O(p^2K^4)$. We conduct a rigorous statistical analysis showing that JEEK achieves the same convergence rate $O(\log(Kp)/n_{tot})$ as the state-of-the-art estimators that are much harder to compute. Empirically, on multiple synthetic datasets and two real-world data sets, JEEK significantly outperforms the speed of the state-of-the-art methods while achieving the same level of prediction accuracy.

One significant caveat of state-of-the-art joint sGGM estimators is that little attention has been paid to incorporating existing knowledge of the nodes or of the relationships among nodes. In addition to the samples themselves, such information is widely available in real-world applications, and incorporating it is of great scientific interest. A prime example: when estimating functional brain connectivity networks among brain regions from fMRI samples, the spatial positions of the regions are readily available. Neuroscientists have gathered considerable spatial and anatomical evidence underlying brain connectivity (e.g., short edges and certain anatomical regions are more likely to be connected \cite{watts1998collective}). Another important example is identifying gene-gene interactions from patients' gene expression profiles across multiple cancer types. Learning the statistical dependencies among genes from such heterogeneous datasets can help us understand how these dependencies vary from normal to abnormal states and help discover contributing markers that influence or cause the diseases. Besides the patient samples, state-of-the-art bio-databases like HPRD \cite{prasad2009human} have collected a significant amount of information about direct physical interactions among corresponding proteins, regulatory gene pairs, and signaling relationships gathered from high-quality bio-experiments.

Although this type of information provides strong evidence of the structural patterns we aim to discover, it has rarely been considered in joint sGGM formulations for such samples. This paper aims to fill that gap by incorporating additional knowledge, as effectively as possible, into scalable and fast joint sGGM estimation.

The proposed JEEK estimator provides the flexibility of using ($K+1$) different weight matrices representing the extra knowledge. We showcase a few possible designs of the weight matrices, including (but not limited to):

• Spatial or anatomy knowledge about brain regions;
• Knowledge of known co-hub nodes or perturbed nodes;
• Known group information about nodes, such as genes belonging to the same biological pathway or cellular location;
• Using existing known edges as the knowledge, like the known protein interaction databases for discovering gene networks (a semi-supervised setting for such estimations).
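
As a concrete illustration of the first design, here is a hypothetical NumPy sketch (the variable names and the exponential-decay mapping are our own choices for this page, not the jeek package API): pairwise distances between brain-region centers are turned into an edge-penalty weight matrix, so that spatially close pairs, which are more likely to be connected, are penalized less.

```python
import numpy as np

# Toy region centers in 2D (hypothetical coordinates).
coords = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [5.0, 5.0]])

# Pairwise Euclidean distances between regions.
dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# Map distance to a penalty weight in [0, 1): nearby pairs get small
# penalties (edges are cheap to include), distant pairs get penalties
# approaching 1. The exponential decay is our assumption, chosen only
# to illustrate the idea of knowledge-as-weights.
W = 1.0 - np.exp(-dists / dists.max())
np.fill_diagonal(W, 0.0)  # no penalty on diagonal entries

print(W[0, 1] < W[0, 2])  # True: the nearby pair (0,1) is penalized less
```

Any monotone map from prior plausibility to penalty weight would serve the same purpose; the point is that the knowledge enters only through $W$, so no knowledge-specific optimizer is needed.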

We sincerely believe the scalability and flexibility provided by JEEK can make joint sGGM structure learning feasible in many real-world tasks.

### Citations

@conference{wang2018jeek,
Author = {Wang, Beilun and Sekhon, Arshdeep and Qi, Yanjun},
Booktitle = {Proceedings of The 35th International Conference on Machine Learning (ICML)},
Title = {A Fast and Scalable Joint Estimator for Integrating Additional Knowledge in Learning Multiple Related Sparse Gaussian Graphical Models},
Year = {2018}}


# DIFFEE to identify Sparse Changes in High-Dimensional Gaussian Graphical Model Structure

## Tool DIFFEE: Fast and Scalable Learning of Sparse Changes in High-Dimensional Gaussian Graphical Model Structure

### R package: CRAN

```r
install.packages("diffee")
library(diffee)
demo(diffee)
```


### Abstract

We focus on the problem of estimating the change in the dependency structures of two $p$-dimensional Gaussian Graphical models (GGMs). Previous studies for sparse change estimation in GGMs involve expensive and difficult non-smooth optimization. We propose a novel method, DIFFEE, for estimating DIFFerential networks via an Elementary Estimator under a high-dimensional situation. DIFFEE is solved through a faster and closed-form solution that enables it to work in large-scale settings. We conduct a rigorous statistical analysis showing that, surprisingly, DIFFEE achieves the same asymptotic convergence rates as the state-of-the-art estimators that are much more difficult to compute. Our experimental results on multiple synthetic datasets and one real-world dataset about brain connectivity show strong performance improvements over baselines, as well as significant computational benefits.
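
The closed-form idea can be sketched in a few lines of NumPy. This is a simplification we wrote for illustration only (the ridge term `eps` stands in for the proper backward mapping of the actual estimator, and all names are our own, not the diffee package API): invert regularized sample covariances for the two contexts, difference them, and soft-threshold the result to keep only sparse changes.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold(A, lam):
    """Entry-wise soft-thresholding: shrink each entry toward zero by lam."""
    return np.sign(A) * np.maximum(np.abs(A) - lam, 0.0)

def diff_network_sketch(X_c, X_d, eps=0.1, lam=0.2):
    """Toy differential-network sketch: difference of inverted
    (ridge-regularized) sample covariances, then soft-threshold."""
    p = X_c.shape[1]
    S_c = np.cov(X_c, rowvar=False) + eps * np.eye(p)
    S_d = np.cov(X_d, rowvar=False) + eps * np.eye(p)
    delta = np.linalg.inv(S_d) - np.linalg.inv(S_c)
    return soft_threshold(delta, lam)

X_c = rng.standard_normal((200, 5))  # 'control' samples
X_d = rng.standard_normal((200, 5))  # 'disease' samples
Delta = diff_network_sketch(X_c, X_d)
print(Delta.shape)  # (5, 5)
```

Note that nothing here iterates: the estimate is one matrix expression plus a thresholding step, which is the source of DIFFEE's speed advantage over penalized-likelihood solvers.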

### Citations

@InProceedings{pmlr-v84-wang18f,
title =    {Fast and Scalable Learning of Sparse Changes in High-Dimensional Gaussian Graphical Model Structure},
author =   {Beilun Wang and arshdeep Sekhon and Yanjun Qi},
booktitle =    {Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics},
pages =    {1691--1700},
year =   {2018},
editor =   {Amos Storkey and Fernando Perez-Cruz},
volume =   {84},
series =   {Proceedings of Machine Learning Research},
address =    {Playa Blanca, Lanzarote, Canary Islands},
month =    {09--11 Apr},
publisher =    {PMLR},
pdf =    {http://proceedings.mlr.press/v84/wang18f/wang18f.pdf},
url =    {http://proceedings.mlr.press/v84/wang18f.html},
abstract =   {We focus on the problem of estimating the change in the dependency structures of two $p$-dimensional Gaussian Graphical models (GGMs). Previous studies for sparse change estimation in GGMs involve expensive and difficult non-smooth optimization. We propose a novel method, DIFFEE for estimating DIFFerential networks via an Elementary Estimator under a high-dimensional situation. DIFFEE is solved through a faster and closed form solution that enables it to work in large-scale settings. We conduct a rigorous statistical analysis showing that surprisingly DIFFEE achieves the same asymptotic convergence rates as the state-of-the-art estimators that are much more difficult to compute. Our experimental results on multiple synthetic datasets and one real-world data about brain connectivity show strong performance improvements over baselines, as well as significant computational benefits.}
}


# W-SIMULE

## Tool W-SIMULE: A Constrained, Weighted-L1 Minimization Approach for Joint Discovery of Heterogeneous Neural Connectivity Graphs with Additional Prior knowledge

### We are updating the R package simule with one more function: W-SIMULE

```r
install.packages("simule")
library(simule)
demo(wsimule)
```


### Abstract

Determining functional brain connectivity is crucial to understanding the brain and neural differences underlying disorders such as autism. Recent studies have used Gaussian graphical models to learn brain connectivity via statistical dependencies across brain regions from neuroimaging. However, previous studies often fail to properly incorporate priors tailored to neuroscience, such as preferring shorter connections. To remedy this problem, the paper here introduces a novel, weighted-ℓ1, multi-task graphical model (W-SIMULE). This model elegantly incorporates a flexible prior, along with a parallelizable formulation. Additionally, W-SIMULE extends the often-used Gaussian assumption, leading to considerable performance increases. Here, applications to fMRI data show that W-SIMULE succeeds in determining functional connectivity in terms of (1) log-likelihood, (2) finding edges that differentiate groups, and (3) classifying different groups based on their connectivity, achieving 58.6\% accuracy on the ABIDE dataset. Having established W-SIMULE's effectiveness, we use it to link four key areas to autism, all of which are consistent with the literature. Due to its elegant domain adaptivity, W-SIMULE can be readily applied to various data types to effectively estimate connectivity.

### Citations

@article{singh2017constrained,
title={A Constrained, Weighted-L1 Minimization Approach for Joint Discovery of Heterogeneous Neural Connectivity Graphs},
author={Singh, Chandan and Wang, Beilun and Qi, Yanjun},
journal={arXiv preprint arXiv:1709.04090},
year={2017}
}


# FASJEM R package is released!

### R package: fasjem

```r
install.packages("fasjem")
library(fasjem)
demo(fasjem)
```


### Abstract

Estimating multiple sparse Gaussian Graphical Models (sGGMs) jointly for many related tasks (large $K$) under a high-dimensional (large $p$) situation is an important task. Most previous studies of the joint estimation of multiple sGGMs rely on penalized log-likelihood estimators that involve expensive and difficult non-smooth optimizations. We propose a novel approach, FASJEM, for fast and scalable joint structure estimation of multiple sGGMs at a large scale. As the first study of joint sGGM estimation using the M-estimator framework, our work has three major contributions: (1) We solve FASJEM in an entry-wise manner that is parallelizable. (2) We choose a proximal algorithm to optimize FASJEM. This improves the computational efficiency from $O(Kp^3)$ to $O(Kp^2)$ and reduces the memory requirement from $O(Kp^2)$ to $O(K)$. (3) We theoretically prove that FASJEM achieves a consistent estimation with a convergence rate of $O(\log(Kp)/n_{tot})$. On several synthetic and four real-world datasets, FASJEM shows significant improvements over baselines in accuracy, computational complexity, and memory costs.
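
Contribution (1), the entry-wise update, can be illustrated with a generic group soft-thresholding step (our own sketch for this page, not the fasjem package internals): the length-$K$ vector formed by entry $(j,k)$ across all $K$ estimated matrices is shrunk as a group, and every $(j,k)$ group is independent of all the others, which is exactly what makes the scheme embarrassingly parallel.

```python
import numpy as np

def group_soft_threshold(stack, lam):
    """Group soft-thresholding on a (K, p, p) stack of matrices.

    Each (j, k) position defines a group: the K values that entry
    takes across tasks. The whole group is scaled toward zero, and
    groups with norm <= lam are zeroed out entirely. Every group is
    updated independently, so the operation parallelizes per entry.
    """
    norms = np.linalg.norm(stack, axis=0)                       # (p, p)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return stack * scale                                        # broadcast over K

K, p = 3, 4
stack = np.zeros((K, p, p))
stack[:, 0, 1] = [0.6, 0.8, 0.0]    # a strong shared edge, group norm 1.0
stack[:, 2, 3] = [0.05, 0.05, 0.0]  # a weak edge, group norm ~0.07

out = group_soft_threshold(stack, lam=0.5)
print(out[:, 0, 1])  # shrunk but kept: [0.3 0.4 0. ]
print(out[:, 2, 3])  # zeroed out entirely
```

In the actual estimator this step alternates with a similar entry-wise L1 step inside a proximal loop; the sketch only shows why each iteration costs $O(Kp^2)$ and needs no cross-entry communication.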

### Citations

@inproceedings{wang2017fast,
title={A Fast and Scalable Joint Estimator for Learning Multiple Related Sparse Gaussian Graphical Models},
author={Wang, Beilun and Gao, Ji and Qi, Yanjun},
booktitle={Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS)},
volume={54},
pages={1168--1177},
year={2017}
}


# SIMULE R package is released!

## Tool SIMULE: A constrained l1 minimization approach for estimating multiple Sparse Gaussian or Nonparanormal Graphical Models

### R package: simule

```r
install.packages("simule")
library(simule)
demo(simule)
```


### Abstract

Identifying context-specific entity networks from aggregated data is an important task, arising often in bioinformatics and neuroimaging. Computationally, this task can be formulated as jointly estimating multiple different, but related, sparse Undirected Graphical Models (UGM) from aggregated samples across several contexts. Previous joint-UGM studies have mostly focused on sparse Gaussian Graphical Models (sGGMs) and can’t identify context-specific edge patterns directly. We, therefore, propose a novel approach, SIMULE (detecting Shared and Individual parts of MULtiple graphs Explicitly) to learn multi-UGM via a constrained L1 minimization. SIMULE automatically infers both specific edge patterns that are unique to each context and shared interactions preserved among all the contexts. Through the L1 constrained formulation, this problem is cast as multiple independent subtasks of linear programming that can be solved efficiently in parallel. In addition to Gaussian data, SIMULE can also handle multivariate Nonparanormal data that greatly relaxes the normality assumption that many real-world applications do not follow. We provide a novel theoretical proof showing that SIMULE achieves a consistent result at the rate O(log(Kp)/n_{tot}). On multiple synthetic datasets and two biomedical datasets, SIMULE shows significant improvement over state-of-the-art multi-sGGM and single-UGM baselines.
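
SIMULE's nonparanormal extension rests on a standard rank-based trick, which we sketch here in plain NumPy (a textbook illustration, not the simule package code): Kendall's tau is invariant to monotone transforms of each variable, and under the nonparanormal model $\sin(\frac{\pi}{2}\tau)$ recovers the latent Gaussian correlation, so the normality assumption on the observed data can be dropped.

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall's tau via sign agreement over all pairs (ties ignored).

    Summing over ordered pairs counts each unordered pair twice,
    hence the n*(n-1) denominator.
    """
    n = len(x)
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    return (sx * sy).sum() / (n * (n - 1))

rng = np.random.default_rng(1)
z = rng.standard_normal(300)
x = z + 0.1 * rng.standard_normal(300)
y = np.exp(z)   # a monotone, heavily non-Gaussian transform of z

tau = kendall_tau(x, y)
rho_hat = np.sin(np.pi / 2 * tau)   # latent correlation estimate
print(rho_hat > 0.8)  # True: strong latent dependence is recovered
```

Plugging such rank-based correlation matrices into the constrained L1 formulation, in place of the Pearson sample covariance, is what lets SIMULE handle nonparanormal data without changing the optimization.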

### Citations

@Article{Wang2017,
author="Wang, Beilun and Singh, Ritambhara and Qi, Yanjun",
title="A constrained L1 minimization approach for estimating multiple sparse Gaussian or nonparanormal graphical models",
journal="Machine Learning",
year="2017",
month="Oct",
day="01",
volume="106",
number="9",
pages="1381--1417",
abstract="Identifying context-specific entity networks from aggregated data is an important task, arising often in bioinformatics and neuroimaging applications. Computationally, this task can be formulated as jointly estimating multiple different, but related, sparse undirected graphical models (UGM) from aggregated samples across several contexts. Previous joint-UGM studies have mostly focused on sparse Gaussian graphical models (sGGMs) and can't identify context-specific edge patterns directly. We, therefore, propose a novel approach, SIMULE (detecting Shared and Individual parts of MULtiple graphs Explicitly) to learn multi-UGM via a constrained L1 minimization. SIMULE automatically infers both specific edge patterns that are unique to each context and shared interactions preserved among all the contexts. Through the L1 constrained formulation, this problem is cast as multiple independent subtasks of linear programming that can be solved efficiently in parallel. In addition to Gaussian data, SIMULE can also handle multivariate Nonparanormal data that greatly relaxes the normality assumption that many real-world applications do not follow. We provide a novel theoretical proof showing that SIMULE achieves a consistent result at the rate O(log(Kp)/n_tot). On multiple synthetic datasets and two biomedical datasets, SIMULE shows significant improvement over state-of-the-art multi-sGGM and single-UGM baselines (SIMULE implementation and the used datasets @ https://github.com/QData/SIMULE).",
issn="1573-0565",
doi="10.1007/s10994-017-5635-7",
url="https://doi.org/10.1007/s10994-017-5635-7"
}