
09:00-12:00 Session 1: Tutorial

Causal Reasoning Tutorial

12:30-13:30 Session 2: Keynote


AutoML and interpretability: powering the machine learning revolution in healthcare

ABSTRACT. AutoML and interpretability are both fundamental to the successful uptake of machine learning by non-expert end users. The former will lower barriers to entry and unlock potent new capabilities that are out of reach when working with ad-hoc models, while the latter will ensure that outputs are transparent, trustworthy, and meaningful. In healthcare, AutoML and interpretability are already beginning to empower the clinical community by enabling the crafting of actionable analytics that can inform and improve decision-making by clinicians, administrators, researchers, policymakers, and beyond.

This keynote presents state-of-the-art AutoML and interpretability methods for healthcare developed in our lab, shows how they have been applied in various clinical settings (including cancer, cardiovascular disease, cystic fibrosis, and recently Covid-19), and explains how these approaches form part of a broader vision for the future of machine learning in healthcare.

14:00-15:30 Session 3: Plenary Session


ADAGES: adaptive aggregation with stability for distributed feature selection

ABSTRACT. In this era of big data, the sheer volume of data continues to motivate distributed computing, while concerns about data privacy further emphasize the need for distributed learning. To conduct feature selection and control the false discovery rate in a distributed setting across multiple machines or institutions, an efficient aggregation method is necessary. In this paper, we propose an adaptive aggregation method called ADAGES, which can be flexibly applied on top of any machine-wise feature selection method. We show that our method controls the overall FDR with a theoretical foundation while maintaining power comparable to the union aggregation rule in practice.
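The adaptive threshold between the union rule (keep a feature any machine selected) and the intersection rule (keep only features every machine selected) can be pictured with a toy sketch. The voting criterion below ("largest threshold keeping at least half the union's size") is a hypothetical stand-in for the paper's data-driven choice, not the authors' actual rule:

```python
import numpy as np

def adages_sketch(selections):
    """Toy ADAGES-style aggregation (simplified sketch, not the paper's
    exact criterion). selections is a (k, p) boolean array where
    selections[i, j] is True if machine i selected feature j. A feature
    is kept if at least c machines voted for it, with c chosen adaptively
    between the union rule (c = 1) and the intersection rule (c = k)."""
    k, p = selections.shape
    counts = selections.sum(axis=0)          # votes per feature
    union_size = int((counts >= 1).sum())
    if union_size == 0:
        return np.zeros(p, dtype=bool)
    # Hypothetical adaptive rule: use the largest (most conservative)
    # threshold whose selected set keeps at least half the union's size.
    sizes = {c: int((counts >= c).sum()) for c in range(1, k + 1)}
    best_c = max(c for c in sizes if sizes[c] >= union_size / 2)
    return counts >= best_c
```

Raising the threshold trades power (the union rule selects the most features) for stability and FDR control (the intersection rule selects the fewest); the adaptive choice sits in between.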

Classification Acceleration via Merging Decision Trees

ABSTRACT. Tree (ensemble) methods have been extremely popular in web search and data mining in general. Since a single tree typically does not provide good accuracy, an ensemble of many trees is often used as the model, resulting in large model size and slow testing speed. This is the major drawback of tree ensemble methods.

In this paper we study the problem of merging decision trees: given $k$ decision trees $T_1, T_2, T_3, \ldots, T_k$, we merge them into one super tree $T$ of much smaller size. The tree $T$ integrates the $k$ decision trees, each of its leaves carries a majority label, and it can be viewed as a compression of a random forest. For any test instance, $T$ gives the same prediction as the random forest consisting of $T_1, T_2, T_3, \ldots, T_k$ but saves the computational effort of traversing multiple trees. The method is therefore well suited to classification problems with time constraints, such as online classification, where a label must be predicted for a new instance before the next instance arrives. Experiments on six real-world data sets show that the super tree $T$ runs significantly faster than the random forest when $k$ is large. Merging also saves the space needed to store the trees, and it makes the forest model more interpretable, since a single tree is naturally easier to interpret than $k$ trees.
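The merging idea can be illustrated in miniature with one-dimensional depth-1 trees (stumps): collect all split thresholds, precompute the forest's majority vote on each resulting interval, and answer a query with a single binary search instead of $k$ traversals. This is only a sketch of the general tree-merging construction under simplifying assumptions (one feature, depth-1 trees); the function names are made up for illustration:

```python
import bisect

def merge_stumps(stumps):
    """stumps: list of (threshold, left_label, right_label); each stump
    predicts left_label if x < threshold, else right_label.
    Returns (thresholds, labels): the merged structure assigns labels[i]
    to the i-th interval of the real line cut at the sorted thresholds."""
    thresholds = sorted({t for t, _, _ in stumps})
    # One representative point per interval, including the two open ends.
    reps = ([thresholds[0] - 1.0] +
            [(a + b) / 2 for a, b in zip(thresholds, thresholds[1:])] +
            [thresholds[-1] + 1.0])
    labels = []
    for x in reps:
        votes = [l if x < t else r for t, l, r in stumps]
        labels.append(max(set(votes), key=votes.count))  # majority vote
    return thresholds, labels

def predict(merged, x):
    """One binary search replaces traversing all k stumps."""
    thresholds, labels = merged
    return labels[bisect.bisect_right(thresholds, x)]
```

By construction the merged structure agrees with the stump forest's majority vote everywhere, which mirrors the paper's claim that $T$ reproduces the forest's predictions at lower test-time cost.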

Tree Space Prototypes: Another Look at Making Tree Ensembles Interpretable

ABSTRACT. Ensembles of decision trees are known to perform well on many problems, but are not interpretable. In contrast to existing explanations of tree ensembles that explain relationships between features and predictions, we propose an alternative approach to interpreting tree ensembles by surfacing representative points for each class: we explain a prediction by presenting points with similar predictions, called prototypes. We introduce a new distance for Gradient Boosted Tree models, and propose new prototype selection methods with theoretical guarantees and the flexibility to choose a different number of prototypes in each class. We demonstrate our methods on random forests and gradient boosted trees, showing that the selected prototypes perform as well as or even better than the original tree ensemble when used as a nearest-prototype classifier. In a user study, we demonstrate that prototypes are better than a popular feature attribution method, Shapley values, at making a tree ensemble classifier predictable to humans.
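At prediction time, selected prototypes act as a nearest-prototype classifier. The sketch below shows that usage pattern with Euclidean distance as a placeholder; the paper's actual contributions (the tree-ensemble-derived distance and the selection methods with guarantees) are not reproduced here:

```python
import numpy as np

def nearest_prototype_predict(X, prototypes, proto_labels, dist=None):
    """Classify each row of X by the label of its closest prototype.
    `dist` defaults to Euclidean distance; the paper instead uses a
    distance derived from the tree ensemble itself."""
    if dist is None:
        dist = lambda a, b: float(np.linalg.norm(a - b))
    X = np.asarray(X, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    preds = []
    for x in X:
        distances = [dist(x, p) for p in prototypes]
        preds.append(proto_labels[int(np.argmin(distances))])
    return preds
```

Because every prediction is justified by a single named training point, the model's behavior is auditable in a way a raw ensemble is not.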

Ensembles of bagged TAO trees consistently improve over Random Forests, AdaBoost and Gradient Boosting

ABSTRACT. Ensemble methods based on trees, such as Random Forests, AdaBoost and gradient boosting, are widely recognized as among the best off-the-shelf classifiers: they typically achieve state-of-the-art accuracy in many problems with little effort in tuning hyperparameters, and they are often used in applications, possibly combined with other methods such as neural nets. While many variations of forest methods exist, using different diversity mechanisms (such as bagging, feature sampling or boosting), nearly all rely on training individual trees in a highly suboptimal way using greedy top-down tree induction algorithms such as CART or C5.0. We study forests where each tree is trained on a bootstrapped or random sample but using the recently proposed tree alternating optimization (TAO), which is able to learn trees that have both fewer nodes and lower error. The better optimization of individual trees translates into forests that achieve higher accuracy while using fewer, smaller trees with oblique nodes. We demonstrate this in a range of datasets and with a careful study of the complementary effect of optimization and diversity in the construction of the forest. These bagged TAO trees improve consistently and by a considerable margin over Random Forests, AdaBoost, gradient boosting and other forest algorithms in every single dataset we tried.

15:45-17:15 Session 4: Plenary Session

Fairness, Privacy, Interpretability

Interpreting Black Box Models via Hypothesis Testing

ABSTRACT. In science and medicine, model interpretations may be reported as discoveries of natural phenomena or used to guide patient treatments. In such high-stakes tasks, false discoveries may lead investigators astray. These applications would therefore benefit from control over the finite-sample error rate of interpretations. We reframe black box model interpretability as a multiple hypothesis testing problem. The task is to discover "important" features by testing whether the model prediction is significantly different from what would be expected if the features were replaced with uninformative counterfactuals. We propose two testing methods: one that provably controls the false discovery rate but which is not yet feasible for large-scale applications, and an approximate testing method which can be applied to real-world data sets. In simulation, both tests have high power relative to existing interpretability methods. When applied to state-of-the-art vision and language models, the framework selects features that intuitively explain model predictions. The resulting explanations have the additional advantage that they are themselves easy to interpret.
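As a rough sketch of the interpretability-as-testing framework, the code below uses a permutation-based stand-in for the paper's uninformative counterfactuals: for each feature, the model's loss on the real data is ranked against its loss when that feature is scrambled across rows, and Benjamini-Hochberg is applied across the resulting p-values. The counterfactual construction and the squared-error loss are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def feature_discovery_sketch(model, X, y, n_perm=99, alpha=0.2, seed=0):
    """Permutation-test sketch of the framework. For each feature j,
    the p-value is the rank of the true loss among losses obtained when
    feature j is replaced by a permuted (uninformative) copy; under the
    null these losses are exchangeable. Benjamini-Hochberg then controls
    the FDR across features. Returns a boolean 'discovered' mask."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n, p = X.shape
    loss0 = np.mean((model(X) - y) ** 2)
    pvals = np.empty(p)
    for j in range(p):
        null_losses = []
        for _ in range(n_perm):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            null_losses.append(np.mean((model(Xp) - y) ** 2))
        pvals[j] = (1 + sum(l <= loss0 for l in null_losses)) / (n_perm + 1)
    # Benjamini-Hochberg step-up procedure over the p features.
    order = np.argsort(pvals)
    passed = pvals[order] <= alpha * np.arange(1, p + 1) / p
    k = int(np.max(np.nonzero(passed)[0]) + 1) if passed.any() else 0
    discovered = np.zeros(p, dtype=bool)
    discovered[order[:k]] = True
    return discovered
```

An important feature drives the loss up when scrambled, so the true loss ranks low among the null losses and the p-value is small.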

Congenial Differential Privacy under Mandated Disclosure

ABSTRACT. Differentially private data releases are often required to satisfy a set of external constraints that reflect the legal, ethical, and logical mandates to which the data curator is obligated. The enforcement of constraints, when treated as post-processing, adds an extra phase in the production of privatized data. It is well understood in the theory of multi-phase processing that congeniality, a form of procedural compatibility between phases, is critical to enabling end users to obtain statistically valid results more straightforwardly. Congenial differential privacy is theoretically principled, which facilitates transparency and intelligibility of the mechanism that would otherwise be undermined by ad-hoc post-processing procedures. We therefore advocate for the systematic integration of mandated disclosure into the design of the privacy mechanism via standard probabilistic conditioning on the invariant margins. Conditioning automatically yields congeniality because any extra post-processing phase becomes unnecessary. We provide both initial theoretical guarantees and a Markov chain algorithm for our proposal. We also discuss intriguing theoretical issues that arise in comparing congenial differential privacy with optimization-based post-processing, as well as directions for further research.
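The conditioning idea can be sketched with a small rejection sampler, a simple stand-in for the paper's Markov chain algorithm: add discrete two-sided geometric noise to a count vector, but accept only noise vectors summing to zero, so the released counts exactly preserve the mandated total (the invariant margin) with no post-processing phase. The parameterization is an illustrative assumption; the privacy guarantees of the conditional mechanism are analyzed in the paper, not here:

```python
import numpy as np

def release_with_invariant_total(counts, epsilon, seed=0, max_tries=100_000):
    """Sketch of conditioning on an invariant margin. Two-sided geometric
    noise (difference of two geometric variables, a discrete analogue of
    Laplace noise) is added to each count, conditioned on the noise
    summing to zero so the released total is exactly preserved."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=int)
    p = 1 - np.exp(-epsilon)  # geometric parameter, illustrative choice
    for _ in range(max_tries):
        noise = rng.geometric(p, counts.size) - rng.geometric(p, counts.size)
        if noise.sum() == 0:  # condition on the invariant margin
            return counts + noise
    raise RuntimeError("rejection sampling failed; use MCMC for long vectors")
```

Rejection sampling is only practical for short vectors, since the acceptance probability shrinks as the dimension grows; this is exactly where a Markov chain sampler, as proposed in the paper, becomes necessary.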

Incentives Needed for Low-Cost Fair Data Reuse

ABSTRACT. One of the central goals in algorithmic fairness is to build systems with fairness properties that compose gracefully. Although the importance of this goal was recognized early, limited progress has been made. A major effort and step towards this objective in data science has been the development of fair representations, which guarantee fairness under sequential composition by imposing a stringent demographic secrecy constraint. In this work, we propose a fresh approach to building fairly composable data-science pipelines by incorporating information about parties' incentives into fairness interventions. Specifically, we show that in a stylized but natural model, it is possible to relax the demographic secrecy requirement to obtain incentive-compatible representations, where rational parties obtain exponentially greater utilities vis-a-vis any demographically secret representation and will still be fair. These substantial utility gains are recovered not from the well-known cost of fairness, but rather from a cost of demographic secrecy, which we formalize and quantify for the first time. Our results open several new directions for research on fair data-science pipelines, fair machine learning, and algorithmic fairness more broadly.

Applying Algorithmic Accountability Frameworks with Domain-specific Codes of Ethics: A Case Study in Ecosystem Forecasting for Shellfish Toxicity in the Gulf of Maine
PRESENTER: Isabella Grasso

ABSTRACT. Ecological forecasts are used to drive decisions that can have significant impacts on the lives of individuals and on the health of ecosystems. These forecasts, or models, embody the ethics of their creators as well as many seemingly arbitrary implementation choices made along the way. They can contain implementation errors and can reflect patterns of bias learned from ingesting datasets derived from past biased decision-making. Principles and frameworks for algorithmic accountability allow a wide range of stakeholders to place the results of models and software systems into context. We demonstrate how the combination of algorithmic accountability frameworks and domain-specific codes of ethics offers a key way to answer calls to uphold fairness and human values in each domain that utilizes machine learning algorithms, rather than searching for one universal definition of fairness. This helps avoid many of the unintended consequences that can result from deploying "black box" systems to solve complex problems. In this paper, we discuss our experience applying algorithmic accountability principles and frameworks to ecosystem forecasting, in particular to forecasting shellfish toxicity in the Gulf of Maine. We adapt existing frameworks such as Datasheets for Datasets and Model Cards for Model Reporting from their original focus on personally identifiable private data to cover public datasets such as those often used in eco-forecasting applications, and use them to audit a case study: a shellfish toxicity forecast. We find that high-level algorithmic accountability frameworks and domain-level codes of ethics complement each other, incentivizing more transparency, accountability, and fairness in automated decision-making systems.