Modified Detecting Deviating Cells (MDDC) Algorithm

Modified Detecting Deviating Cells (MDDC) Algorithm#

The Modified Detecting Deviating Cells (MDDC) algorithm is described as follows:

  1. Standardized Pearson Residual Calculation

    Let the row marginals of a contingency table be denoted as \(n_{i\bullet} = \sum_{j = 1}^{J} n_{ij}\), the column marginals as \(n_{\bullet j} = \sum_{i = 1}^{I} n_{ij}\), and the total number of reports as \(n_{\bullet \bullet} = \sum_{i = 1}^{I} \sum_{j = 1}^{J} n_{ij}\).

    For each cell in the contingency table, compute the standardized Pearson residual:

    \[e_{ij} = \frac{n_{ij} - \frac{n_{i\bullet} n_{\bullet j}}{n_{\bullet \bullet}}}{\sqrt{\frac{n_{i\bullet} n_{\bullet j}}{n_{\bullet \bullet}} \left(1 - \frac{n_{i\bullet}}{n_{\bullet \bullet}}\right) \left(1 - \frac{n_{\bullet j}}{n_{\bullet \bullet}}\right)}}\]
  2. Separating Residuals and Determining Cutoff Values

    Separate the set of residuals into two groups:

    • \(\{e^+_{ij}\}\) for cells with \(n_{ij} > 0\)

    • \(\{e^0_{ij}\}\) for cells with \(n_{ij} = 0\)

    To identify the cutoff value use either the boxplot statistic defined below or the Monte Carlo (MC) method applied to the standardized Pearson residuals.

    \[ \begin{align}\begin{aligned}c_{univ,j^*}^0 = Q_1(\{e^0_{ij^*}\}) - 1.5 \times IQR(\{e^0_{ij^*}\})\\c_{univ,j^*}^+ = Q_3(\{e^+_{ij^*}\}) + 1.5 \times IQR(\{e^+_{ij^*}\})\end{aligned}\end{align} \]

    Define a matrix \(\mathbf{U}\) with entries:

    \[\begin{split}u_{ij} = \begin{cases} e^+_{ij} & \text{if } |e^+_{ij}| \leq c^+_{univ,j} \\ \text{NA} & \text{if } |e^+_{ij}| > c^+_{univ,j} \\ e^0_{ij} & \text{if } e^0_{ij} \geq c^0_{univ,j} \\ \text{NA} & \text{if } e^0_{ij} < c^0_{univ,j} \end{cases}\end{split}\]

    Cells with \(e^+_{ij} > c^+_{univ}\) are labeled as signals.

  3. Correlation and Connection Determination

    For any two AE rows \(i \neq k\), compute their Pearson correlation:

    \[cor_{ik} = Corr(\tilde{u}_{i}, \tilde{u}_{k})\]

    where \(\tilde{u}_{i} = (u_{i1}, u_{i2}, \ldots, u_{iJ})\) and \(\tilde{u}_{k} = (u_{k1}, u_{k2}, \ldots, u_{kJ})\). Let \(c_{corr}\) be the correlation threshold. For the \(i\)-th AE, if \(|cor_{ik}| \geq c_{corr}\) for \(k \neq i\), then AE \(k\) is called a “connected” AE to AE \(i\).

  4. Prediction and Weighted Mean Calculation

    Obtain predicted values for each cell based on the connected AEs. Suppose there are \(m\) connected AEs for AE \(i\), with correlations \(cor_{ik_1}, cor_{ik_2}, \ldots, cor_{ik_m}\). For each connected AE \(k\), fit a simple linear regression with intercept \(\alpha_{ik}\) and slope \(\beta_{ik}\) using \(\tilde{u}_{i}\) as the response variable and \(\tilde{u}_{k}\) as the explanatory variable. The fitted value for AE \(i\) based on AE \(k\) is:

    \[\hat{u}_{ikj} = \alpha_{ik} + \beta_{ik} u_{kj}\]

    The fitted value for AE \(i\) based on all the connected AEs is obtained as a weighted mean:

    \[\hat{u}_{ij} = \sum_{k=1}^{m} w_{ik} \hat{u}_{ikj}\]

    where \(w_{ik} = \frac{|cor_{ik}|}{\sum_{l=k_1}^{k_m} |cor_{il}|}\).

  5. Final Residual Calculation and P-value

    Compute:

    \[r_{ij} = \frac{(e_{ij} - \hat{u}_{ij}) - A_j}{\sqrt{B_j}}\]

    where:

    \[ \begin{align}\begin{aligned}A_j = \frac{1}{I_j} \sum_{i \in I_j} (e_{ij} - \hat{u}_{ij})\\B_j = \frac{1}{I_j} \sum_{i \in I_j} \left[(e_{ij} - \hat{u}_{ij}) - A_j\right]^2\end{aligned}\end{align} \]

    The computation is over all \(i\) within drug \(j\) where neither \(e_{ij}\) nor \(\hat{u}_{ij}\) is NA. Calculate the upper tail probability of \(r_{ij}\) in the standard normal distribution, which is used as the p-value for each cell. Obtain adjusted p-values via the Benjamini-Hochberg procedure to control the false discovery rate. The second set of signals are the cells with adjusted p-values less than 0.05.