Researcher at the Department of Computational Biology and Medical Sciences, The University of Tokyo.
Deep learning models often achieve high accuracy but lack interpretability, making them unsuitable for critical applications such as medical diagnosis, biomolecule design, and criminal justice. The Sparse High-Order Interaction Model (SHIM) addresses this limitation by providing both transparency and predictive reliability. However, real-world data often contain outliers, which can distort model performance. To overcome this, we propose Huberized-SHIM, an extension of SHIM that integrates Huber loss-based robust regression to mitigate the impact of outliers. We introduce a homotopy-based exact regularization path algorithm and a novel tree-pruning criterion to efficiently manage interaction complexity. Additionally, we incorporate the conformal prediction framework to enhance statistical reliability. Empirical evaluations on synthetic and real-world datasets demonstrate the superior robustness and accuracy of Huberized-SHIM in high-stakes decision-making contexts.
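To make the role of the Huber loss concrete, here is a minimal sketch in plain Python. It is not the Huberized-SHIM implementation: the function names, the threshold `delta = 1.0`, and the toy residuals are assumptions chosen for illustration. It only shows the standard Huber loss, which is quadratic for small residuals and linear for large ones, so a single outlier contributes far less than it would under squared loss.

```python
def huber_loss(residual: float, delta: float = 1.0) -> float:
    """Standard Huber loss: quadratic inside [-delta, delta], linear outside."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def squared_loss(residual: float) -> float:
    """Half squared error, scaled to match the quadratic part of the Huber loss."""
    return 0.5 * residual * residual


# Hypothetical residuals; the last value plays the role of an outlier.
residuals = [0.1, -0.5, 0.3, 8.0]

total_huber = sum(huber_loss(r) for r in residuals)
total_squared = sum(squared_loss(r) for r in residuals)

print("Huber total:  ", total_huber)
print("Squared total:", total_squared)
```

In this toy example the outlier contributes 7.5 to the Huber total but 32.0 to the squared total, which is why a Huber-based fit is pulled less strongly toward outlying observations.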
[paper]
Interpretable (or explainable) machine learning models, such as decision trees, play a crucial role in the context of trustworthy AI. However, finding optimal decision trees (i.e., minimum size and maximum accuracy trees) is not a simple task and remains an active area of research. While a single decision tree has limited expressivity, using an ensemble of decision trees can effectively capture the complex structures found in many real-world applications. Many existing tree ensemble methods are greedy and suboptimal, and often suffer from randomness in the tree generation process. In this paper, we introduce DT-sampler, a SAT-based decision tree ensemble that allows explicit control over both the size and accuracy of the sampled trees. We developed a novel SAT-based encoding method that utilizes only branch nodes, resulting in a compact representation of the decision tree space. Additionally, standard point predictions made using decision tree ensembles do not offer any statistical guarantee on the miscoverage rate. We employ conformal prediction (CP), a distribution-free statistical framework that provides a valid finite-sample coverage guarantee, to demonstrate that DT-sampler is statistically more efficient and produces stable results when compared with a random forest classifier. We demonstrate the effectiveness of our method on several benchmark and real-world datasets.
[paper] [code]
Solving black-box optimization problems with Ising machines is increasingly common in materials science. However, their application to crystal structure prediction (CSP) is still ineffective due to the symmetry-agnostic encoding of atomic coordinates. We introduce CRYSIM, an algorithm that encodes the space group, the combination of Wyckoff positions, and the coordinates of independent atomic sites as separate variables. This encoding reduces the search space substantially by exploiting the symmetry in space groups. When CRYSIM is interfaced to Fixstars Amplify, a GPU-based Ising machine, its prediction performance is competitive with CALYPSO and Bayesian optimization for crystals containing more than 150 atoms in a unit cell. Although it is not realistic to interface CRYSIM to current small-scale quantum devices, it has the potential to become the standard CSP algorithm in the coming quantum age.
[paper] [code]
Message passing neural networks have demonstrated significant efficacy in predicting molecular interactions. Introducing equivariant vectorial representations augments expressivity by capturing geometric data symmetries, thereby improving model accuracy. However, two-body bond vectors in opposition may cancel each other out during message passing, leading to the loss of directional information on their shared node. In this study, we develop Equivariant N-body Interaction Networks (ENINet) that explicitly integrate l = 1 equivariant many-body interactions to enhance directional symmetric information in the message passing scheme. We provide a mathematical analysis demonstrating the necessity of incorporating many-body equivariant interactions and generalize the formulation to N-body interactions. Experiments indicate that integrating many-body equivariant representations enhances prediction accuracy across diverse scalar and tensorial quantum chemical properties.
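The cancellation problem described above can be seen in a toy example. This sketch is not ENINet: it assumes the simplest possible aggregation, an unweighted sum of unit bond vectors at a shared node, and the helper `sum_messages` and both neighbor configurations are hypothetical. It only shows that two opposed bond vectors sum to the zero vector, so the node receives no directional signal, while a bent arrangement does not cancel.

```python
def sum_messages(vectors):
    """Aggregate vector messages at a node by summing them component-wise."""
    return tuple(sum(components) for components in zip(*vectors))


# Two neighbors on opposite sides of the central node (a linear arrangement).
opposed = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]

# Two neighbors at a right angle (a bent arrangement).
bent = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

print(sum_messages(opposed))  # the two bond vectors cancel
print(sum_messages(bent))     # a net direction survives
```

After summation, the linear arrangement is indistinguishable from a node that received no directional message at all, which is the information loss that motivates going beyond two-body vectors.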
[paper] [code]
Multi-Objective Optimization (MOO) presents a significant challenge in various real-world applications. For complex problems, it is usually impossible to find a single solution that optimizes all objectives simultaneously. In experimental design scenarios, obtaining the entire Pareto set (PS) is beneficial as it allows for flexible exploration of the design space. We have developed an efficient Pareto set learning (PSL) algorithm that learns the continuous manifold of the Pareto front (PF). This enables a robot or a domain expert to explore the PF in real time, eliminating the need to reconstruct the PF for new trade-off preferences among objectives.
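For readers unfamiliar with the terminology, the following sketch shows what a Pareto front is for a finite set of candidates. It is not the PSL algorithm, which learns a continuous manifold; it is a plain non-dominated filter. It assumes every objective is minimized, and the functions `dominates` and `pareto_front` and the candidate points are made up for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates (minimization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical two-objective outcomes, e.g. (cost, error).
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (5.0, 5.0)]

front = pareto_front(candidates)
print(front)
```

Here `(3.0, 4.0)` is dominated by `(2.0, 3.0)` and `(5.0, 5.0)` is dominated by `(1.0, 5.0)`, so three points remain. None of them is better than the others in both objectives, which is why a trade-off preference is needed to choose among them.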
[paper] [code]
The Sparse High-order Interaction Model (SHIM) is an interpretable yet non-linear machine learning model. It can capture the interactions of many features, which is crucial in many real-world applications, such as modeling gene-gene interactions and identifying groups of mutations. However, finding a point prediction in regression is often not enough, and many real-world high-stakes decision-making problems demand a prediction band (or interval) that encloses the point prediction. We developed an efficient algorithm that can produce statistically efficient (narrow) prediction intervals containing the point prediction of a SHIM.
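As a rough illustration of what a prediction interval around a point prediction looks like, the sketch below uses split conformal prediction with absolute residuals. This is a simpler, generic procedure and should not be read as the algorithm developed for SHIM. The function name, the calibration data, and the miscoverage level `alpha = 0.25` are all assumptions.

```python
import math


def split_conformal_interval(cal_y, cal_pred, new_pred, alpha):
    """Interval around new_pred built from held-out calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest absolute residual as the
    half-width; if that rank exceeds n, the interval is unbounded.
    """
    residuals = sorted(abs(y - p) for y, p in zip(cal_y, cal_pred))
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    half_width = float("inf") if k > n else residuals[k - 1]
    return new_pred - half_width, new_pred + half_width


# Hypothetical calibration targets and the model's predictions for them.
cal_y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
cal_pred = [1.5, 2.0, 2.0, 4.5, 5.0, 8.0, 7.25]

lower, upper = split_conformal_interval(cal_y, cal_pred, new_pred=3.0, alpha=0.25)
print(lower, upper)
```

With seven calibration points and `alpha = 0.25`, the half-width is the sixth smallest absolute residual, 1.0, giving the interval (2.0, 4.0) around the point prediction 3.0. A narrower interval at the same coverage level is what "statistically efficient" refers to.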
[paper] [code]
Finding statistically significant (low p-value) high-order feature interactions is challenging because of the intrinsic high dimensionality of the combinatorial effects. Another problem in data-driven modeling is the effect of “cherry-picking” (i.e., selection bias). We developed a fast algorithm using a branch-and-bound tree pruning strategy that corrects the selection bias and provides statistically valid high-order feature interactions, i.e., interactions with selection-bias-corrected p-values.
[paper] [code]
SHIMR is an interpretable, non-linear machine learning model that includes a rejection option, which is essential for high-stakes decision-making, such as in medical diagnosis. This model can identify uncertain areas within the data and has the ability to refrain from making a decision when it lacks confidence. For instance, it can automatically pinpoint samples that are close to the decision boundary and choose not to make a decision for those instances. SHIMR is equipped to address class imbalance issues, and its visualization module illustrates the relationships between model scores, feature interactions, and their importance, making it a valuable tool for promoting trustworthy AI.
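The idea of a rejection option can be sketched as follows. This is not SHIMR's actual decision rule: it assumes a binary classifier that outputs a signed score with the decision boundary at zero, and the function `predict_with_reject`, the `margin` value, and the scores are hypothetical.

```python
def predict_with_reject(score: float, margin: float = 0.5):
    """Return +1 or -1 for confident scores, or None to abstain near the boundary."""
    if abs(score) < margin:
        return None  # too close to the decision boundary: refrain from deciding
    return 1 if score > 0 else -1


# Hypothetical model scores for four samples.
scores = [2.1, -0.2, 0.4, -1.7]

decisions = [predict_with_reject(s) for s in scores]
print(decisions)
```

The two samples with scores inside the margin are left undecided and could be referred to a human expert, while the confident ones receive a label.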
[paper] [code]