See main program oral, spotlight, and poster presentation times below
ORAL
Friday, December 10, 2021
11:30 am EDT
Bellman-consistent Pessimism for Offline Reinforcement Learning
Tengyang Xie (University of Illinois at Urbana-Champaign) · Ching-An Cheng (Georgia Tech) · Nan Jiang (University of Illinois at Urbana-Champaign) · Paul Mineiro (Microsoft) · Alekh Agarwal (Microsoft Research)
The use of pessimism, when reasoning about datasets lacking exhaustive exploration, has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness as standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by O(d) in its sample complexity (when the action space is finite). Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
Friday, December 10, 2021
11:30 am EDT
Drop, Swap, and Generate: A Self-Supervised Approach for Generating Neural Activity
Ran Liu (Georgia Institute of Technology) · Mehdi Azabou (Georgia Institute of Technology) · Max Dabagia (Georgia Institute of Technology) · Chi-Heng Lin (Georgia Institute of Technology) · Mohammad Gheshlaghi Azar (DeepMind) · Keith Hengen (Washington University, St. Louis) · Michal Valko (DeepMind Paris / Inria / ENS Paris-Saclay) · Eva Dyer (Georgia Institute of Technology)
Meaningful and simplified representations of neural activity can yield insights into how and what information is being processed within a neural circuit. However, without labels, finding representations that reveal the link between the brain and behavior can be challenging. Here, we introduce a novel unsupervised approach for learning disentangled representations of neural activity called Swap-VAE. Our approach combines a generative modeling framework with an instance-specific alignment loss that tries to maximize the representational similarity between transformed views of the input (brain state). These transformed (or augmented) views are created by dropping out neurons and jittering samples in time, which intuitively should lead the network to a representation that maintains both temporal consistency and invariance to the specific neurons used to represent the neural state. Through evaluations on both synthetic data and neural recordings from hundreds of neurons in different primate brains, we show that it is possible to build representations that disentangle neural datasets along relevant latent dimensions linked to behavior.
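As a rough illustration of the augmented views described above, the sketch below applies the two transformations (neuron dropout and temporal jitter) to a binned spike-count matrix; the array shapes, firing rates, and parameter names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def augment_view(x, drop_prob=0.2, max_jitter=2, rng=None):
    """Create one augmented view of neural activity x (time_bins x neurons).

    Two transformations from the abstract above:
      1. randomly drop (zero out) a subset of neurons,
      2. jitter the sample in time by a small random shift.
    """
    rng = np.random.default_rng() if rng is None else rng
    t, n = x.shape

    # 1. neuron dropout: zero each neuron's column with probability drop_prob
    keep_mask = rng.random(n) > drop_prob
    view = x * keep_mask[None, :]

    # 2. temporal jitter: shift the window by up to max_jitter bins (edges zero-padded)
    shift = rng.integers(-max_jitter, max_jitter + 1)
    view = np.roll(view, shift, axis=0)
    if shift > 0:
        view[:shift] = 0.0
    elif shift < 0:
        view[shift:] = 0.0
    return view

# Example: two views of the same "brain state" fed to the alignment loss
x = np.random.poisson(1.0, size=(50, 100)).astype(float)  # 50 time bins, 100 neurons
view_a, view_b = augment_view(x), augment_view(x)
```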
SPOTLIGHT
Tuesday, December 7, 2021
Tuesday 7:30 pm EDT
Iterative Teaching by Label Synthesis
Weiyang Liu (Georgia Tech) · Zhen Liu (University of Montreal, MILA) · Hanchen Wang (University of Cambridge) · Liam Paull (Université de Montréal) · Bernhard Schölkopf (MPI for Biological Cybernetics) · Adrian Weller (University of Cambridge)
In this paper, we consider the problem of iterative machine teaching, where a teacher provides examples sequentially based on the current iterative learner. In contrast to previous methods that have to scan over the entire pool and select teaching examples from it in each iteration, we propose a label synthesis teaching framework where the teacher randomly selects input teaching examples (e.g., images) and then synthesizes suitable outputs (e.g., labels) for them. We show that this framework can avoid costly example selection while still provably achieving exponential teachability. We propose multiple novel teaching algorithms in this framework. Finally, we empirically demonstrate the value of our framework.
Wednesday, December 8, 2021
Wednesday 7:30 pm EDT
Second-Order Neural ODE Optimizer
Guan-Horng Liu (Georgia Institute of Technology) · Tianrong Chen (Georgia Institute of Technology) · Evangelos Theodorou (Georgia Institute of Technology)
We propose a novel second-order optimization framework for training the emerging deep continuous-time models, specifically the Neural Ordinary Differential Equations (Neural ODEs). Since their training already involves expensive gradient computation by solving a backward ODE, deriving efficient second-order methods becomes highly nontrivial. Nevertheless, inspired by the recent Optimal Control (OC) interpretation of training deep networks, we show that a specific continuous-time OC methodology, called Differential Programming, can be adopted to derive backward ODEs for higher-order derivatives at the same O(1) memory cost. We further explore a low-rank representation of the second-order derivatives and show that it leads to efficient preconditioned updates with the aid of Kronecker-based factorization. The resulting method – named SNOpt – converges much faster than first-order baselines in wall-clock time, and the improvement remains consistent across various applications, e.g. image classification, generative flow, and time-series prediction. Our framework also enables direct architecture optimization, such as the integration time of Neural ODEs, with second-order feedback policies, strengthening the OC perspective as a principled tool of analyzing optimization in deep learning. Our code is available at https://github.com/ghliu/snopt.
Thursday, December 9, 2021
Thursday 11:30 am EDT
A Geometric Perspective towards Neural Calibration via Sensitivity Decomposition
Junjiao Tian (Georgia Institute of Technology) · Dylan Yung (Georgia Institute of Technology) · Yen-Chang Hsu (Georgia Institute of Technology) · Zsolt Kira (Georgia Institute of Technology)
It is well known that vision classification models suffer from poor calibration in the face of data distribution shifts. In this paper, we take a geometric approach to this problem. We propose Geometric Sensitivity Decomposition (GSD) which decomposes the norm of a sample feature embedding and the angular similarity to a target classifier into an instance-dependent and an instance-independent component. The instance-dependent component captures the sensitive information about changes in the input while the instance-independent component represents the insensitive information serving solely to minimize the loss on the training dataset. Inspired by the decomposition, we analytically derive a simple extension to current softmax-linear models, which learns to disentangle the two components during training. On several common vision models, the disentangled model outperforms other calibration methods on standard calibration metrics in the face of out-of-distribution (OOD) data and corruption with significantly less complexity. Specifically, we surpass the current state of the art by 30.8% relative improvement on corrupted CIFAR100 in Expected Calibration Error.
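For intuition about the geometry involved, the following sketch decomposes the logits of a softmax-linear model into a feature-norm term and an angular-similarity term; it only illustrates the decomposition itself and is not the paper's GSD training procedure.

```python
import numpy as np

def norm_angle_decomposition(feature, class_weights):
    """Decompose softmax-linear logits, logit_k = w_k . f, into ||f|| (norm)
    and cos(theta_k) (angular similarity to each class weight).

    Illustrative only: these are the geometric quantities discussed above,
    not the full GSD training scheme.
    """
    f_norm = np.linalg.norm(feature)
    w_norms = np.linalg.norm(class_weights, axis=1)
    cosines = class_weights @ feature / (w_norms * f_norm + 1e-12)
    logits = f_norm * w_norms * cosines  # recovers class_weights @ feature
    return f_norm, cosines, logits

feature = np.random.randn(64)             # a sample's feature embedding
class_weights = np.random.randn(10, 64)   # a linear classifier for 10 classes
f_norm, cosines, logits = norm_angle_decomposition(feature, class_weights)
assert np.allclose(logits, class_weights @ feature)
```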
Thursday, December 9, 2021
Thursday 3:30 am EDT
Habitat 2.0: Training Home Assistants to Rearrange their Habitat
Andrew Szot (Georgia Institute of Technology) · Alexander Clegg (Facebook (FAIR Labs)) · Eric Undersander (Facebook) · Erik Wijmans (Georgia Institute of Technology) · Yili Zhao (Facebook AI Research) · John Turner (Facebook) · Noah Maestre (Facebook) · Mustafa Mukadam (Facebook AI Research) · Devendra Singh Chaplot (Carnegie Mellon University) · Oleksandr Maksymets (Facebook AI Research) · Aaron Gokaslan (Facebook) · Vladimír Vondruš (Magnum Engine) · Sameer Dharur (Georgia Tech) · Franziska Meier (Facebook AI Research) · Wojciech Galuba (Facebook AI Research) · Angel Chang (Simon Fraser University) · Zsolt Kira (Georgia Institute of Technology) · Vladlen Koltun (Apple) · Jitendra Malik (UC Berkeley) · Manolis Savva (Simon Fraser University) · Dhruv Batra (Georgia Tech / Facebook AI Research)
We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments and complex physics-enabled scenarios. We make comprehensive contributions to all levels of the embodied AI stack – data, simulation, and benchmark tasks. Specifically, we present: (i) ReplicaCAD: an artist-authored, annotated, reconfigurable 3D dataset of apartments (matching real spaces) with articulated objects (e.g. cabinets and drawers that can open/close); (ii) H2.0: a high-performance physics-enabled 3D simulator with speeds exceeding 25,000 simulation steps per second (850x real-time) on an 8-GPU node, representing 100x speed-ups over prior work; and, (iii) Home Assistant Benchmark (HAB): a suite of common tasks for assistive robots (tidy the house, stock groceries, set the table) that test a range of mobile manipulation capabilities. These large-scale engineering contributions allow us to systematically compare deep reinforcement learning (RL) at scale and classical sense-plan-act (SPA) pipelines in long-horizon structured tasks, with an emphasis on generalization to new objects, receptacles, and layouts. We find that (1) flat RL policies struggle on HAB compared to hierarchical ones; (2) a hierarchy with independent skills suffers from ‘hand-off problems’, and (3) SPA pipelines are more brittle than RL policies.
Thursday, December 9, 2021
Thursday 7:30 pm EDT
Combiner: Full Attention Transformer with Sparse Computation Cost
Hongyu Ren (Stanford University) · Hanjun Dai (Georgia Institute of Technology) · Zihang Dai (Google Brain) · Mengjiao Yang (Google Brain) · Jure Leskovec (Stanford University/Pinterest) · Dale Schuurmans (Google Brain & University of Alberta) · Bo Dai (Google Brain)
Transformers provide a class of expressive architectures that are extremely effective for sequence modeling. However, the key limitation of transformers is their quadratic memory and time complexity O(L²) with respect to the sequence length in attention layers, which restricts application in extremely long sequences. Most existing approaches leverage sparsity or low-rank assumptions in the attention matrix to reduce cost, but sacrifice expressiveness. Instead, we propose Combiner, which provides full attention capability in each attention head while maintaining low computation and memory complexity. The key idea is to treat the self-attention mechanism as a conditional expectation over embeddings at each location, and approximate the conditional distribution with a structured factorization. Each location can attend to all other locations, either via direct attention, or through indirect attention to abstractions, which are again conditional expectations of embeddings from corresponding local regions. We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention, resulting in the same sub-quadratic cost (O(L log L) or O(L√L)). Combiner is a drop-in replacement for attention layers in existing transformers and can be easily implemented in common frameworks. An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach, yielding state-of-the-art results on several image and text modeling tasks.
Friday, December 10, 2021
Friday 11:30 am EDT
Generic Neural Architecture Search via Regression
Yuhong Li (University of Illinois at Urbana-Champaign) · Cong Hao (Georgia Institute of Technology) · Pan Li (Stanford University) · Jinjun Xiong (IBM Research) · Deming Chen (University of Illinois, Urbana Champaign)
Most existing neural architecture search (NAS) algorithms are dedicated to and evaluated by the downstream tasks, e.g., image classification in computer vision. However, extensive experiments have shown that prominent neural architectures, such as ResNet in computer vision and LSTM in natural language processing, are generally good at extracting patterns from the input data and perform well on different downstream tasks. In this paper, we attempt to answer two fundamental questions related to NAS. (1) Is it necessary to use the performance of specific downstream tasks to evaluate and search for good neural architectures? (2) Can we perform NAS effectively and efficiently while being agnostic to the downstream tasks? To answer these questions, we propose a novel and generic NAS framework, termed Generic NAS (GenNAS). GenNAS does not use task-specific labels but instead adopts regression on a set of manually designed synthetic signal bases for architecture evaluation. Such a self-supervised regression task can effectively evaluate the intrinsic power of an architecture to capture and transform the input signal patterns, and allow fuller usage of training samples. Extensive experiments across 13 CNN search spaces and one NLP space demonstrate the remarkable efficiency of GenNAS using regression, in terms of both evaluating the neural architectures (quantified by the ranking correlation Spearman’s rho between the approximated performances and the downstream task performances) and the convergence speed for training (within a few seconds). For example, on NAS-Bench-101, GenNAS achieves 0.85 rho while the existing efficient methods only achieve 0.38. We then propose an automatic task search to optimize the combination of synthetic signals using limited downstream-task-specific labels, further improving the performance of GenNAS. We also thoroughly evaluate GenNAS’s generality and end-to-end NAS performance on all search spaces, which outperforms almost all existing works with significant speedup. For example, on NAS-Bench-201, GenNAS can find near-optimal architectures within 0.3 GPU hours.
POSTER
Tuesday, December 7, 2021
Tuesday 11:30 am EDT
Adversarial Graph Augmentation to Improve Graph Contrastive Learning
Susheel Suresh (Purdue University) · Pan Li (Stanford University) · Cong Hao (Georgia Institute of Technology) · Jennifer Neville (Purdue University)
Self-supervised learning of graph neural networks (GNN) is in great demand because of the widespread label scarcity issue in real-world graph/network data. Graph contrastive learning (GCL), by training GNNs to maximize the correspondence between the representations of the same graph in its different augmented forms, may yield robust and transferable GNNs even without using labels. However, GNNs trained by traditional GCL often risk capturing redundant graph features and thus may be brittle and provide sub-par performance in downstream tasks. Here, we propose a novel principle, termed adversarial-GCL (AD-GCL), which enables GNNs to avoid capturing redundant information during the training by optimizing adversarial graph augmentation strategies used in GCL. We pair AD-GCL with theoretical explanations and design a practical instantiation based on trainable edge-dropping graph augmentation. We experimentally validate AD-GCL by comparing with the state-of-the-art GCL methods and achieve performance gains of up to 14% in unsupervised, 6% in transfer, and 3% in semi-supervised learning settings overall with 18 different benchmark datasets for the tasks of molecule property regression and classification, and social network classification.
Tuesday, December 7, 2021
Tuesday 11:30 am EDT
Deep inference of latent dynamics with spatio-temporal super-resolution using selective backpropagation through time
Feng Zhu (Emory University) · Andrew Sedler (Georgia Institute of Technology) · Harrison A Grier (The University of Chicago) · Nauman Ahad (Georgia Institute of Technology) · Mark Davenport (Georgia Institute of Technology) · Matthew Kaufman (University of Chicago) · Andrea Giovannucci (University of North Carolina, Chapel Hill) · Chethan Pandarinath (Emory University)
Modern neural interfaces allow access to the activity of up to a million neurons within brain circuits. However, bandwidth limits often create a trade-off between greater spatial sampling (more channels or pixels) and the temporal frequency of sampling. Here we demonstrate that it is possible to obtain spatio-temporal super-resolution in neuronal time series by exploiting relationships among neurons, embedded in latent low-dimensional population dynamics. Our novel neural network training strategy, selective backpropagation through time (SBTT), enables learning of deep generative models of latent dynamics from data in which the set of observed variables changes at each time step. The resulting models are able to infer activity for missing samples by combining observations with learned latent dynamics. We test SBTT applied to sequential autoencoders and demonstrate more efficient and higher-fidelity characterization of neural population dynamics in electrophysiological and calcium imaging data. In electrophysiology, SBTT enables accurate inference of neuronal population dynamics with lower interface bandwidths, providing an avenue to significant power savings for implanted neuroelectronic interfaces. In applications to two-photon calcium imaging, SBTT accurately uncovers high-frequency temporal structure underlying neural population activity, substantially outperforming the current state-of-the-art. Finally, we demonstrate that performance could be further improved by using limited, high-bandwidth sampling to pretrain dynamics models, and then using SBTT to adapt these models for sparsely-sampled data.
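A minimal sketch of the core idea in SBTT: mask the reconstruction loss so that only observed entries contribute, so that in an autodiff framework gradients flow only from observed samples. The Poisson likelihood, array shapes, and sampling pattern below are illustrative assumptions, not the paper's full sequential autoencoder.

```python
import numpy as np

def masked_poisson_nll(rates, observed_counts, observed_mask):
    """Selective loss: only entries flagged as observed contribute.

    rates, observed_counts, observed_mask: arrays of shape (time, channels).
    The mask encodes which channels were actually sampled at each time step,
    mirroring the idea that the set of observed variables changes per step.
    """
    nll = rates - observed_counts * np.log(rates + 1e-8)  # Poisson NLL up to a constant
    return (nll * observed_mask).sum() / observed_mask.sum()

# Example: only about half the channels are sampled at each time step
rates = np.random.rand(100, 30) + 0.1
counts = np.random.poisson(rates)
mask = (np.random.rand(100, 30) < 0.5).astype(float)
loss = masked_poisson_nll(rates, counts, mask)
```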
Tuesday, December 7, 2021
Tuesday 11:30 am EDT
Generalizable Imitation Learning from Observation via Inferring Goal Proximity
Youngwoon Lee (University of Southern California) · Andrew Szot (Georgia Institute of Technology) · Shao-Hua Sun (University of Southern California) · Joseph Lim (MIT)
Task progress is intuitive and readily available task information that can guide an agent closer to the desired goal. Furthermore, a progress estimator can generalize to new situations. From this intuition, we propose a simple yet effective imitation learning from observation method for a goal-directed task using a learned goal proximity function as a task progress estimator, for better generalization to unseen states and goals. We obtain this goal proximity function from expert demonstrations and online agent experience, and then use the learned goal proximity as a dense reward for policy training. We demonstrate that our proposed method can robustly generalize compared to prior imitation learning methods on a set of goal-directed tasks in navigation, locomotion, and robotic manipulation, even with demonstrations that cover only a part of the states.
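As an illustrative sketch (not necessarily the paper's exact definitions), goal-proximity targets can be derived from expert demonstrations as below, and a regressor trained on such (state, label) pairs then supplies a dense reward for policy learning.

```python
import numpy as np

def proximity_labels(trajectory_length, discount=0.95):
    """Proximity targets for one expert demonstration: the final state has
    proximity 1 and earlier states decay with their distance-to-goal in steps.
    This labeling scheme is an assumption consistent with a task-progress
    estimator, not the paper's precise formulation.
    """
    steps_to_goal = np.arange(trajectory_length)[::-1]
    return discount ** steps_to_goal

labels = proximity_labels(10)  # [0.63, 0.66, ..., 0.95, 1.0]
# A regressor trained on (state, label) pairs can then serve as a dense reward
# signal, e.g. rewarding increases in predicted proximity during policy training.
```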
Tuesday, December 7, 2021
Tuesday 11:30 am EDT
Human-Adversarial Visual Question Answering
Sasha Sheng · Amanpreet Singh (Facebook AI Research) · Vedanuj Goswami (Facebook) · Jose Magana (Instituto Tecnológico y de Estudios Superiores de Monterrey (ITESM)) · Tristan Thrush (Facebook) · Wojciech Galuba (Facebook AI Research) · Devi Parikh (Georgia Tech / Facebook AI Research (FAIR)) · Douwe Kiela (Facebook AI Research)
Performance on the most commonly used Visual Question Answering dataset (VQA v2) is starting to approach human accuracy. However, in interacting with state-of-the-art VQA models, it is clear that the problem is far from being solved. In order to stress test VQA models, we benchmark them against human-adversarial examples. Human subjects interact with a state-of-the-art VQA model, and for each image in the dataset, attempt to find a question where the model’s predicted answer is incorrect. We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples. We conduct an extensive analysis of the collected adversarial examples and provide guidance on future research directions. We hope that this Adversarial VQA (AdVQA) benchmark can help drive progress in the field and advance the state of the art.
Tuesday, December 7, 2021
Tuesday 11:30 am EDT
Random Noise Defense Against Query-Based Black-Box Attacks
Zeyu Qin (The Chinese University of Hong Kong, Shenzhen) · Yanbo Fan (NLPR, CASIA) · Hongyuan Zha (Georgia Tech) · Baoyuan Wu (Tencent AI Lab)
Query-based black-box attacks have raised serious threats to machine learning models in many real applications. In this work, we study a lightweight defense method, dubbed Random Noise Defense (RND), which adds proper Gaussian noise to each query. We conduct a theoretical analysis of the effectiveness of RND against query-based black-box attacks and the corresponding adaptive attacks. Our theoretical results reveal that the defense performance of RND is determined by the magnitude ratio between the noise induced by RND and the noise added by the attackers for gradient estimation or local search. A larger magnitude ratio leads to stronger defense performance of RND, and it is also critical for mitigating adaptive attacks. Based on our analysis, we further propose to combine RND with Gaussian augmentation fine-tuning (RND-GF). It enables RND to add larger noise to each query while maintaining the clean accuracy to obtain a better trade-off between clean accuracy and defense performance. Additionally, RND can be flexibly combined with existing defense methods, such as adversarial training (AT), to further boost adversarial robustness. Extensive experiments on CIFAR-10 and ImageNet verify our theoretical findings and the effectiveness of RND and RND-GF.
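The defense itself is simple to sketch: add Gaussian noise to every incoming query before the model evaluates it. The stand-in model, input shapes, and noise scale below are illustrative assumptions.

```python
import numpy as np

def rnd_defended_predict(model_fn, x, sigma=0.02, rng=None):
    """Random Noise Defense: perturb each query with Gaussian noise before
    evaluating the model. model_fn maps a batch of inputs to predictions;
    sigma is the defender's noise scale (the magnitude discussed above).
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy_x = x + sigma * rng.standard_normal(x.shape)
    return model_fn(noisy_x)

# Example with a stand-in "model": a fixed linear scorer on flattened inputs
w = np.random.randn(3 * 32 * 32, 10)
model_fn = lambda batch: batch.reshape(len(batch), -1) @ w
queries = np.random.rand(8, 3, 32, 32)  # an attacker's query batch
preds = rnd_defended_predict(model_fn, queries, sigma=0.02)
```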
Tuesday, December 7, 2021
Tuesday 11:30 am EDT
Reusing Combinatorial Structure: Faster Iterative Projections over Submodular Base Polytopes
Jai Moondra (Georgia Institute of Technology) · Hassan Mortagy (Georgia Institute of Technology) · Swati Gupta (Georgia Institute of Technology)
Optimization algorithms such as projected Newton’s method, FISTA, mirror descent and its variants enjoy near-optimal regret bounds and convergence rates, but suffer from a computational bottleneck of computing “projections” in potentially each iteration (e.g., O(T^{1/2}) regret of online mirror descent). On the other hand, conditional gradient variants solve a linear optimization in each iteration, but result in suboptimal rates (e.g., O(T^{3/4}) regret of online Frank-Wolfe). Motivated by this trade-off in runtime vs. convergence rates, we consider iterative projections of close-by points over widely-prevalent submodular base polytopes B(f). We develop a toolkit to speed up the computation of projections using both discrete and continuous perspectives. We subsequently adapt the away-step Frank-Wolfe algorithm to use this information and enable early termination. For the special case of cardinality-based submodular polytopes, we improve the runtime of computing certain Bregman projections by a factor of Ω(n/log(n)). Our theoretical results show orders of magnitude reduction in runtime in preliminary computational experiments.
Tuesday, December 7, 2021
Tuesday 11:30 am EDT
Scalable Diverse Model Selection for Accessible Transfer Learning
Daniel Bolya (Georgia Institute of Technology) · Rohit Mittapalli (Georgia Institute of Technology) · Judy Hoffman (FAIR and Georgia Tech)
With the preponderance of pretrained deep learning models available off-the-shelf from model banks today, finding the best weights to fine-tune to your use-case can be a daunting task. Several methods have recently been proposed to find good models for transfer learning, but they either don’t scale well to large model banks or don’t perform well on the diversity of off-the-shelf models. Ideally the question we want to answer is, “given some data and a source model, can you quickly predict the model’s accuracy after fine-tuning?” In this paper, we formalize this setting as “Scalable Diverse Model Selection” and propose several benchmarks for evaluating on this task. We find that existing model selection and transferability estimation methods perform poorly here and analyze why this is the case. We then introduce simple techniques to improve the performance and speed of these algorithms. Finally, we iterate on existing methods to create PARC, which outperforms all other methods on diverse model selection. We have released the benchmarks and method code in hope to inspire future work in model selection for accessible transfer learning.
Tuesday, December 7, 2021
Tuesday 11:30 am EDT
Scallop: From Probabilistic Deductive Databases to Scalable Differentiable Reasoning
Jiani Huang (School of Engineering and Applied Science, University of Pennsylvania) · Ziyang Li (University of Pennsylvania) · Binghong Chen (Georgia Institute of Technology) · Karan Samel (Georgia Institute of Technology) · Mayur Naik (University of Pennsylvania) · Le Song (Georgia Institute of Technology) · Xujie Si (University of Pennsylvania)
Deep learning and symbolic reasoning are complementary techniques for an intelligent system. However, principled combinations of these techniques have limited scalability, rendering them ill-suited for real-world applications. We propose Scallop, a system that builds upon probabilistic deductive databases, to bridge this gap. The key insight underlying Scallop is a provenance framework that introduces a tunable parameter to specify the level of reasoning granularity. Scallop thereby i) generalizes exact probabilistic reasoning, ii) asymptotically reduces computational cost, and iii) provides relative accuracy guarantees. On a suite of tasks that involve mathematical and logical reasoning, Scallop scales significantly better without sacrificing accuracy compared to DeepProbLog, a principled neural logic programming approach. We also create and evaluate on a real-world Visual Question Answering (VQA) benchmark that requires multi-hop reasoning. Scallop outperforms two VQA-tailored models, a Neural Module Networks-based model and a transformer-based model, by 12.42% and 21.66%, respectively.
Tuesday, December 7, 2021
Tuesday 7:30 pm EDT
Diffusion Normalizing Flow
Qinsheng Zhang (Georgia Institute of Technology) · Yongxin Chen (Georgia Institute of Technology)
We present a novel generative modeling method called diffusion normalizing flow based on stochastic differential equations (SDEs). The algorithm consists of two neural SDEs: a forward SDE that gradually adds noise to the data to transform the data into Gaussian random noise, and a backward SDE that gradually removes the noise to sample from the data distribution. By jointly training the two neural SDEs to minimize a common cost function that quantifies the difference between the two, the backward SDE converges to a diffusion process that starts with a Gaussian distribution and ends with the desired data distribution. Our method is closely related to normalizing flow and diffusion probabilistic models, and can be viewed as a combination of the two. Compared with normalizing flow, diffusion normalizing flow is able to learn distributions with sharp boundaries. Compared with diffusion probabilistic models, diffusion normalizing flow requires fewer discretization steps and thus has better sampling efficiency. Our algorithm demonstrates competitive performance in both high-dimension data density estimation and image generation tasks.
Tuesday, December 7, 2021
Tuesday 7:30 pm EDT
Simple steps are all you need: Frank-Wolfe and generalized self-concordant functions
Alejandro Carderera (Georgia Institute of Technology) · Mathieu Besançon (Zuse Institute Berlin) · Sebastian Pokutta (Zuse Institute Berlin)
Generalized self-concordance is a key property present in the objective function of many important learning problems. We establish the convergence rate of a simple Frank-Wolfe variant that uses the open-loop step size strategy γ_t = 2/(t+2), obtaining an O(1/t) convergence rate for this class of functions in terms of primal gap and Frank-Wolfe gap, where t is the iteration count. This avoids the use of second-order information or the need to estimate local smoothness parameters of previous work. We also show improved convergence rates for various common cases, e.g., when the feasible region under consideration is uniformly convex or polyhedral.
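A minimal sketch of a Frank-Wolfe loop with the open-loop step size γ_t = 2/(t+2) on a toy problem (a quadratic over the probability simplex); the feasible set and objective are illustrative assumptions, not the paper's setting.

```python
import numpy as np

def frank_wolfe_simplex(grad_fn, dim, iters=200):
    """Frank-Wolfe with the open-loop step size gamma_t = 2 / (t + 2).

    Feasible region: the probability simplex, whose linear minimization
    oracle (LMO) simply picks the vertex with the smallest gradient
    coordinate. Illustrative example only.
    """
    x = np.ones(dim) / dim
    for t in range(iters):
        g = grad_fn(x)
        v = np.zeros(dim)
        v[np.argmin(g)] = 1.0          # LMO over the simplex
        gamma = 2.0 / (t + 2.0)        # open-loop step size from the abstract
        x = (1 - gamma) * x + gamma * v
    return x

# Minimize ||x - b||^2 over the simplex
b = np.array([0.7, 0.2, 0.1, -0.3])
x_star = frank_wolfe_simplex(lambda x: 2 * (x - b), dim=4)
```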
Tuesday, December 7, 2021
Tuesday 7:30 pm EDT
Transfer Learning of Graph Neural Networks with Ego-graph Information Maximization
Qi Zhu (University of Illinois, Urbana Champaign) · Carl Yang (Emory University) · Yidan Xu (University of Washington) · Haonan Wang (University of Illinois at Urbana-Champaign) · Chao Zhang (Georgia Institute of Technology) · Jiawei Han (University of Illinois at Urbana-Champaign)
Graph neural networks (GNNs) have achieved superior performance in various applications, but training dedicated GNNs can be costly for large-scale graphs. Some recent work started to study the pre-training of GNNs. However, none of them provide theoretical insights into the design of their frameworks, or clear requirements and guarantees towards their transferability. In this work, we establish a theoretically grounded and practically useful framework for the transfer learning of GNNs. Firstly, we propose a novel view towards the essential graph information and advocate the capturing of it as the goal of transferable GNN training, which motivates the design of EGI (Ego-Graph Information maximization) to analytically achieve this goal. Secondly, when node features are structure-relevant, we conduct an analysis of EGI transferability regarding the difference between the local graph Laplacians of the source and target graphs. We conduct controlled synthetic experiments to directly justify our theoretical conclusions. Comprehensive experiments on two real-world network datasets show consistent results in the analyzed setting of direct transferring, while those on large-scale knowledge graphs show promising results in the more practical setting of transferring with fine-tuning.
Wednesday, December 8, 2021
Wednesday 3:30 am EDT
Bridging Explicit and Implicit Deep Generative Models via Neural Stein Estimators
Qitian Wu (Shanghai Jiao Tong University) · Rui Gao (University of Texas at Austin) · Hongyuan Zha (Georgia Tech)
There are two types of deep generative models: explicit and implicit. The former defines an explicit density form that allows likelihood inference, while the latter targets a flexible transformation from random noise to generated samples. While the two classes of generative models have shown great power in many applications, both of them, when used alone, suffer from respective limitations and drawbacks. To take full advantage of both models and enable mutual compensation, we propose a novel joint training framework that bridges an explicit (unnormalized) density estimator and an implicit sample generator via Stein discrepancy. We show that our method 1) induces novel mutual regularization via kernel Sobolev norm penalization and Moreau-Yosida regularization, and 2) stabilizes the training dynamics. Empirically, we demonstrate that the proposed method can help the density estimator more accurately identify data modes and guide the generator to output higher-quality samples, compared with training a single counterpart. The new approach also shows promising results when the training samples are contaminated or limited.
Wednesday, December 8, 2021
Wednesday 3:30 am EDT
Only Train Once: A One-Shot Neural Network Training And Pruning Framework
Tianyi Chen (Microsoft) · Bo Ji (National University of Singapore) · Tianyu Ding (Johns Hopkins University) · Biyi Fang (Microsoft) · Guanyi Wang (Georgia Institute of Technology) · Zhihui Zhu (University of Denver) · Luming Liang (Microsoft) · Yixin Shi (Microsoft) · Sheng Yi (North Carolina State University) · Xiao Tu (Microsoft)
Structured pruning is a commonly used technique in deploying deep neural networks (DNNs) onto resource-constrained devices. However, the existing pruning methods are usually heuristic, task-specific, and require an extra fine-tuning procedure. To overcome these limitations, we propose a framework that compresses DNNs into slimmer architectures with competitive performance and significant FLOPs reductions by Only-Train-Once (OTO). OTO contains two key steps: (i) we partition the parameters of DNNs into zero-invariant groups, enabling us to prune zero groups without affecting the output; and (ii) to promote zero groups, we then formulate a structured-sparsity optimization problem, and propose a novel optimization algorithm, Half-Space Stochastic Projected Gradient (HSPG), to solve it, which outperforms the standard proximal methods on group sparsity exploration, and maintains comparable convergence. To demonstrate the effectiveness of OTO, we train and compress full models simultaneously from scratch without fine-tuning for inference speedup and parameter reduction, and achieve state-of-the-art results on VGG16 for CIFAR10, ResNet50 for CIFAR10 and Bert for SQuAD, and competitive results on ResNet50 for ImageNet. The source code is available at https://github.com/tianyic/onlytrainonce.
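A small sketch of the bookkeeping implied by zero-invariant groups: once training drives every parameter in a group to zero, that group can be removed without changing the network's output. The group structure below is a toy assumption, and the HSPG optimizer itself is not shown.

```python
import numpy as np

def prune_zero_groups(grouped_params, tol=0.0):
    """Drop parameter groups whose entries are all (numerically) zero.

    grouped_params: dict mapping a group name to a flat parameter vector.
    A group is kept only if its l2 norm exceeds tol; with zero-invariant
    groups, removing zero groups leaves the network output unchanged.
    Illustrative bookkeeping only, not the OTO training pipeline.
    """
    return {name: p for name, p in grouped_params.items()
            if np.linalg.norm(p) > tol}

groups = {
    "conv1.filter0": np.zeros(27),            # driven to zero during training
    "conv1.filter1": np.random.randn(27),
    "conv1.filter2": np.zeros(27),
}
slim = prune_zero_groups(groups)              # only filter1 survives
```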
Wednesday, December 8, 2021
Wednesday 7:30 pm EDT
Finite Sample Analysis of Average-Reward TD Learning and Q-Learning
Sheng Zhang (Georgia Institute of Technology) · Siva Theja Maguluri (Georgia Institute of Technology)
The focus of this paper is on sample complexity guarantees of average-reward reinforcement learning algorithms, which are known to be more challenging to study than their discounted-reward counterparts. To the best of our knowledge, we provide the first known finite sample guarantees using both constant and diminishing step sizes of (i) average-reward TD(λ) with linear function approximation for policy evaluation and (ii) average-reward Q-learning in the tabular setting to find the optimal policy. A major challenge is that since the value functions are agnostic to an additive constant, the corresponding Bellman operators are no longer contraction mappings under any norm. We obtain the results for TD(λ) by working in an appropriately defined subspace that ensures uniqueness of the solution. For Q-learning, we exploit the span seminorm contractive property of the Bellman operator, and construct a novel Lyapunov function obtained by infimal convolution of a generalized Moreau envelope and the indicator function of a set.
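For concreteness, a relative-value-iteration-style tabular average-reward Q-learning update is sketched below; subtracting a reference entry handles the additive-constant ambiguity mentioned above. This is a generic illustration, not the specific algorithm or step-size analysis of the paper.

```python
import numpy as np

def rvi_q_update(Q, s, a, r, s_next, alpha=0.1, ref=(0, 0)):
    """One tabular average-reward Q-learning step (relative value iteration
    flavor). Because average-reward value functions are only defined up to an
    additive constant, Q[ref] is subtracted as a reference to keep the
    iterates bounded. Illustrative sketch only.
    """
    target = r - Q[ref] + Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((5, 2))                       # 5 states, 2 actions
Q = rvi_q_update(Q, s=3, a=1, r=1.0, s_next=0)
```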
Wednesday, December 8, 2021
Wednesday 7:30 pm EDT
Locally Valid and Discriminative Prediction Intervals for Deep Learning Models
Zhen Lin (University of Illinois, Urbana Champaign) · Shubhendu Trivedi (MIT) · Jimeng Sun (Georgia Tech)
Crucial for building trust in deep learning models for critical real-world applications is efficient and theoretically sound uncertainty quantification, a task that continues to be challenging. Useful uncertainty information is expected to have two key properties: It should be valid (guaranteeing coverage) and discriminative (more uncertain when the expected risk is high). Moreover, when combined with deep learning (DL) methods, it should be scalable and affect the DL model performance minimally. Most existing Bayesian methods lack frequentist coverage guarantees and usually affect model performance. The few available frequentist methods are rarely discriminative and/or violate coverage guarantees due to unrealistic assumptions. Moreover, many methods are expensive or require substantial modifications to the base neural network. Building upon recent advances in conformal prediction [13, 33] and leveraging the classical idea of kernel regression, we propose Locally Valid and Discriminative prediction intervals (LVD), a simple, efficient, and lightweight method to construct discriminative prediction intervals (PIs) for almost any DL model. With no assumptions on the data distribution, such PIs also offer finite-sample local coverage guarantees (contrasted to the simpler marginal coverage). We empirically verify, using diverse datasets, that besides being the only locally valid method for DL, LVD also exceeds or matches the performance (including coverage rate and prediction accuracy) of existing uncertainty quantification methods, while offering additional benefits in scalability and flexibility.
Wednesday, December 8, 2021
Wednesday 7:30 pm EDT
Multi-task Learning of Order-Consistent Causal Graphs
Xinshi Chen (Georgia Institute of Technology) · Haoran Sun (Georgia Institute of Technology) · Caleb Ellington (School of Computer Science, Carnegie Mellon University) · Eric Xing (Petuum Inc. / Carnegie Mellon University) · Le Song (Georgia Institute of Technology)
We consider the problem of discovering K related Gaussian directed acyclic graphs (DAGs), where the involved graph structures share a consistent causal order and sparse unions of supports. Under the multi-task learning setting, we propose an ℓ1/ℓ2-regularized maximum likelihood estimator (MLE) for learning K linear structural equation models. We theoretically show that the joint estimator, by leveraging data across related tasks, can achieve a better sample complexity for recovering the causal order (or topological order) than separate estimations. Moreover, the joint estimator is able to recover non-identifiable DAGs, by estimating them together with some identifiable DAGs. Lastly, our analysis also shows the consistency of union support recovery of the structures. To allow practical implementation, we design a continuous optimization problem whose optimizer is the same as the joint estimator and can be approximated efficiently by an iterative algorithm. We validate the theoretical analysis and the effectiveness of the joint estimator in experiments.
Wednesday, December 8, 2021
Wednesday 7:30 pm EDT
RoMA: Robust Model Adaptation for Offline Model-based Optimization
Sihyun Yu (Korea Advanced Institute of Science and Technology) · Sungsoo Ahn (MBZUAI) · Le Song (Georgia Institute of Technology) · Jinwoo Shin (KAIST)
We consider the problem of searching for an input that maximizes a black-box objective function given a static dataset of input-output queries. A popular approach to solving this problem is maintaining a proxy model, e.g., a deep neural network (DNN), that approximates the true objective function. Here, the main challenge is how to avoid adversarially optimized inputs during the search, i.e., the inputs where the DNN highly overestimates the true objective function. To handle the issue, we propose a new framework, coined robust model adaptation (RoMA), based on gradient-based optimization of inputs over the DNN. Specifically, it consists of two steps: (a) a pre-training strategy to robustly train the proxy model and (b) a novel adaptation procedure of the proxy model to have robust estimates for a specific set of candidate solutions. At a high level, our scheme utilizes the local smoothness prior to overcome the brittleness of the DNN. Experiments under various tasks show the effectiveness of RoMA compared with previous methods, obtaining state-of-the-art results, e.g., RoMA outperforms all baselines on 4 out of 6 tasks and achieves runner-up results on the remaining tasks.
Thursday, December 9, 2021
Thursday 11:30 am EDT
Benign Overfitting in Multiclass Classification: All Roads Lead to Interpolation
Ke Wang (University of California, Santa Barbara) · Vidya Muthukumar (Georgia Institute of Technology) · Christos Thrampoulidis (University of British Columbia)
The growing literature on “benign overfitting” in overparameterized models has been mostly restricted to regression or binary classification settings; however, most success stories of modern machine learning have been recorded in multiclass settings. Motivated by this discrepancy, we study benign overfitting in multiclass linear classification. Specifically, we consider the following popular training algorithms on separable data: (i) empirical risk minimization (ERM) with cross-entropy loss, which converges to the multiclass support vector machine (SVM) solution; (ii) ERM with least-squares loss, which converges to the min-norm interpolating (MNI) solution; and, (iii) the one-vs-all SVM classifier. Our first key finding is that under a simple sufficient condition, all three algorithms lead to classifiers that interpolate the training data and have equal accuracy. When the data is generated from Gaussian mixtures or a multinomial logistic model, this condition holds under high enough effective overparameterization. Second, we derive novel error bounds on the accuracy of the MNI classifier, thereby showing that all three training algorithms lead to benign overfitting under sufficient overparameterization. Ultimately, our analysis shows that good generalization is possible for SVM solutions beyond the realm in which typical margin-based bounds apply.
Thursday, December 9, 2021
Thursday 11:30 am EDT
Fast and Memory Efficient Differentially Private-SGD via JL Projections
Zhiqi Bu (University of Pennsylvania) · Sivakanth Gopi (Microsoft Research) · Janardhan Kulkarni (Microsoft Research) · Yin Tat Lee (UW) · Hanwen Shen (Stanford) · Uthaipon Tantipongpipat (Georgia Tech)
Differentially Private-SGD (DP-SGD) of Abadi et al. and its variations are the only known algorithms for private training of large scale neural networks. This algorithm requires computation of per-sample gradient norms, which is extremely slow and memory intensive in practice. In this paper, we present a new framework to design differentially private optimizers called DP-SGD-JL and DP-Adam-JL. Our approach uses Johnson–Lindenstrauss (JL) projections to quickly approximate the per-sample gradient norms without exactly computing them, thus making the training time and memory requirements of our optimizers closer to that of their non-DP versions. Unlike previous attempts to make DP-SGD faster, which work only on a subset of network architectures or use compiler techniques, we propose an algorithmic solution that works for any network in a black-box manner, which is the main contribution of this paper. To illustrate this, on the IMDb dataset, we train a Recurrent Neural Network (RNN) to achieve a good privacy-vs-accuracy tradeoff, while being significantly faster than DP-SGD and with a similar memory footprint as non-private SGD.
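The estimator at the heart of this approach can be sketched as follows: for v ~ N(0, I), E[(g·v)²] = ||g||², so a few random projections estimate each per-sample gradient norm. In a real DP-SGD-JL-style optimizer the products g_i·v would be obtained with Jacobian-vector products rather than by materializing per-sample gradients; the toy version below takes explicit gradients purely for illustration.

```python
import numpy as np

def jl_norm_estimates(per_sample_grads, k=8, rng=None):
    """Estimate per-sample gradient norms from k random projections.

    For v ~ N(0, I), E[(g . v)^2] = ||g||^2, so averaging squared projections
    gives an unbiased estimate of the squared norm. Illustrative sketch only:
    a practical implementation avoids forming per-sample gradients explicitly.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = per_sample_grads.shape
    V = rng.standard_normal((d, k))
    proj = per_sample_grads @ V                 # shape (n, k)
    return np.sqrt((proj ** 2).mean(axis=1))    # estimated ||g_i||

grads = np.random.randn(16, 1000)               # 16 samples, 1000 parameters
approx = jl_norm_estimates(grads, k=32)
exact = np.linalg.norm(grads, axis=1)            # for comparison
```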
Thursday, December 9, 2021
Thursday 11:30 am EDT
Learning Hard Optimization Problems: A Data Generation Perspective
James Kotary (Syracuse University) · Ferdinando Fioretto (Syracuse University) · Pascal Van Hentenryck (Georgia Institute of Technology)
Optimization problems are ubiquitous in our societies and are present in almost every segment of the economy. Most of these optimization problems are NP-hard and computationally demanding, often requiring approximate solutions for large-scale instances. Machine learning frameworks that learn to approximate solutions to such hard optimization problems are a potentially promising avenue to address these difficulties, particularly when many closely related problem instances must be solved repeatedly. Supervised learning frameworks can train a model using the outputs of pre-solved instances. However, when the outputs are themselves approximations, when the optimization problem has symmetric solutions, and/or when the solver uses randomization, solutions to closely related instances may exhibit large differences and the learning task can become inherently more difficult. This paper demonstrates this critical challenge, connects the volatility of the training data to the ability of a model to approximate it, and proposes a method for producing (exact or approximate) solutions to optimization problems that are more amenable to supervised learning tasks. The effectiveness of the method is tested on hard non-linear nonconvex and discrete combinatorial problems.
Thursday, December 9, 2021
Thursday 11:30 am EDT
SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation
Abhinav Moudgil (Georgia Institute of Technology) · Arjun Majumdar (Georgia Institute of Technology) · Harsh Agrawal (Snap Inc.) · Stefan Lee (Oregon State University) · Dhruv Batra (Georgia Tech / Facebook AI Research)
Natural language instructions for visual navigation often use scene descriptions (e.g., bedroom) and object references (e.g., green chairs) to provide a breadcrumb trail to a goal location. This work presents a transformer-based vision-and-language navigation (VLN) agent that uses two different visual encoders — a scene classification network and an object detector — which produce features that match these two distinct types of visual cues. In our method, scene features contribute high-level contextual information that supports object-level processing. With this design, our model is able to use vision-and-language pretraining (i.e., learning the alignment between images and text from large-scale web data) to substantially improve performance on the Room-to-Room (R2R) and Room-Across-Room (RxR) benchmarks. Specifically, our approach leads to improvements of 1.8% absolute in SPL on R2R and 3.7% absolute in SR on RxR. Our analysis reveals even larger gains for navigation instructions that contain six or more object references, which further suggests that our approach is better able to use object features and align them to references in the instructions.
Thursday, December 9, 2021
Thursday 11:30 am EDT
When in Doubt: Neural Non-Parametric Uncertainty Quantification for Epidemic Forecasting
Harshavardhan Kamarthi (Georgia Institute of Technology) · Lingkai Kong (Georgia Tech) · Alexander Rodriguez (Georgia Institute of Technology) · Chao Zhang (Georgia Institute of Technology) · B. Aditya Prakash (Georgia Institute of Technology)
Accurate and trustworthy epidemic forecasting is an important problem for public health planning and disease mitigation. Most existing epidemic forecasting models disregard uncertainty quantification, resulting in mis-calibrated predictions. Recent works in deep neural models for uncertainty-aware time-series forecasting also have several limitations; e.g., it is difficult to specify proper priors in Bayesian NNs, while methods like deep ensembling can be computationally expensive. In this paper, we propose to use neural functional processes to fill this gap. We model epidemic time-series with a probabilistic generative process and propose a functional neural process model called EpiFNP, which directly models the probability distribution of the forecast value in a non-parametric way. In EpiFNP, we use a dynamic stochastic correlation graph to model the correlations between sequences, and design different stochastic latent variables to capture functional uncertainty from different perspectives. Our experiments in a real-time flu forecasting setting show that EpiFNP significantly outperforms state-of-the-art models in both accuracy and calibration metrics, up to 2.5x in accuracy and 2.4x in calibration. Additionally, as EpiFNP learns the relations between the current season and similar patterns of historical seasons, it enables interpretable forecasts. Beyond epidemic forecasting, EpiFNP can be of independent interest for advancing uncertainty quantification in deep sequential models for predictive analytics.
Thursday, December 9, 2021
Thursday 3:30 am EDT
Learning Knowledge Graph-based World Models of Textual Environments
Prithviraj Ammanabrolu (Georgia Institute of Technology) · Mark Riedl (Georgia Institute of Technology)
World models improve a learning agent’s ability to efficiently operate in interactive and situated environments. This work focuses on the task of building world models of text-based game environments. Text-based games, or interactive narratives, are reinforcement learning environments in which agents perceive and interact with the world using textual natural language. These environments contain long, multi-step puzzles or quests woven through a world that is filled with hundreds of characters, locations, and objects. Our world model learns to simultaneously: (1) predict changes in the world caused by an agent’s actions when representing the world as a knowledge graph; and (2) generate the set of contextually relevant natural language actions required to operate in the world. We frame this task as a Set of Sequences generation problem by exploiting the inherent structure of knowledge graphs and actions and introduce both a transformer-based multi-task architecture and a loss function to train it. A zero-shot ablation study on never-before-seen textual worlds shows that our methodology significantly outperforms existing textual world modeling techniques and demonstrates the importance of each of our contributions.
Thursday, December 9, 2021
Thursday 7:30 pm EDT
A Biased Graph Neural Network Sampler with Near-Optimal Regret
Qingru Zhang (Georgia Institute of Technology) · David P Wipf (AWS) · Quan Gan (New York University) · Le Song (Georgia Institute of Technology)
Graph neural networks (GNN) have recently emerged as a vehicle for applying deep network architectures to graph and relational data. However, given the increasing size of industrial datasets, in many practical situations, the message passing computations required for sharing information across GNN layers are no longer scalable. Although various sampling methods have been introduced to approximate full-graph training within a tractable budget, there remain unresolved complications such as high variances and limited theoretical guarantees. To address these issues, we build upon existing work and treat GNN neighbor sampling as a multi-armed bandit problem but with a newly-designed reward function that introduces some degree of bias designed to reduce variance and avoid unstable, possibly unbounded payouts. And unlike prior bandit-GNN use cases, the resulting policy leads to near-optimal regret while accounting for the GNN training dynamics introduced by SGD. From a practical standpoint, this translates into lower variance estimates and competitive or superior test accuracy across several benchmarks.
Thursday, December 9, 2021
Thursday 7:30 pm EDT
Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators
Zaiwei Chen (Georgia Institute of Technology) · Siva Theja Maguluri (Georgia Institute of Technology) · Sanjay Shakkottai (University of Texas at Austin) · Karthikeyan Shanmugam (IBM Research, NY)
In TD-learning, off-policy sampling is known to be more practical than on-policy sampling, and by decoupling learning from data collection, it enables data reuse. It is known that policy evaluation has the interpretation of solving a generalized Bellman equation. In this paper, we derive finite-sample bounds for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed-point of this generalized Bellman operator. Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted ℓp-norm for each p in [1,∞), with a common contraction factor. Off-policy TD-learning is known to suffer from high variance due to the product of importance sampling ratios. A number of algorithms (e.g. Qπ(λ), Tree-Backup(λ), Retrace(λ), and Q-trace) have been proposed in the literature to address this issue. Our results immediately imply finite-sample bounds of these algorithms. In particular, we provide first-known finite-sample guarantees for Qπ(λ), Tree-Backup(λ), and Retrace(λ), and improve the best known bounds of Q-trace in Chen et al. (2021). Moreover, we show the bias-variance trade-offs in each of these algorithms.
Thursday, December 9, 2021
Thursday 7:30 pm EDT
Heuristic-Guided Reinforcement Learning
Ching-An Cheng (Georgia Tech) · Andrey Kolobov (Microsoft Research) · Adith Swaminathan (Microsoft Research)
We provide a framework to accelerate reinforcement learning (RL) algorithms by heuristics that are constructed by domain knowledge or offline data. Tabula rasa RL algorithms require environment interactions or computation that scales with the horizon of the sequential decision-making task. Using our framework, we show how heuristic-guided RL induces a much shorter horizon sub-problem that provably solves the original task. Our framework can be viewed as a horizon-based regularization for controlling bias and variance in RL under a finite interaction budget. In theory, we characterize the properties of a good heuristic and the resulting impact on RL acceleration. In particular, we introduce the novel concept of an improvable heuristic that can allow any RL agent to conservatively extrapolate beyond its prior knowledge. In practice, we instantiate our framework to accelerate several state-of-the-art algorithms in simulated robotic control tasks and procedurally generated games. Our framework complements the rich literature on warm-starting RL using expert demonstrations or exploratory data-sets, and creates a unified channel to inject prior knowledge into RL.
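One way to picture the shorter-horizon sub-problem is to truncate rollouts after H steps and bootstrap with the heuristic as a terminal value, as in the sketch below; the paper's actual reshaping and regularization may differ from this simple form.

```python
import numpy as np

def truncated_return(rewards, heuristic_terminal_value, gamma=0.99):
    """Return of a rollout truncated to H = len(rewards) steps, bootstrapped
    with a heuristic value at the cut-off state:
        G = sum_t gamma^t r_t + gamma^H * h(s_H).
    A sketch of how a heuristic can stand in for the tail of a long horizon;
    an assumption for illustration, not the paper's precise construction.
    """
    H = len(rewards)
    discounts = gamma ** np.arange(H)
    return float(discounts @ np.asarray(rewards) + gamma ** H * heuristic_terminal_value)

G = truncated_return([0.0, 0.0, 1.0], heuristic_terminal_value=5.0)
```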
Thursday, December 9, 2021
Thursday 7:30 pm EDT
Locality Sensitive Teaching
Zhaozhuo Xu (Rice University) · Beidi Chen (Rice University) · Chaojian Li (Rice University) · Weiyang Liu (Georgia Tech) · Le Song (Georgia Institute of Technology) · Yingyan Lin (Rice University) · Anshumali Shrivastava (Rice University / ThirdAI Corp.)
The emergence of the Internet-of-Things (IoT) sheds light on applying the machine teaching (MT) algorithms for online personalized education on home devices. This direction becomes more promising during the COVID-19 pandemic when in-person education becomes infeasible. However, as one of the most influential and practical MT paradigms, iterative machine teaching (IMT) is prohibited on IoT devices due to its inefficient and unscalable algorithms. IMT is a paradigm where a teacher feeds examples iteratively and intelligently based on the learner’s status. In each iteration, current IMT algorithms greedily traverse the whole training set to find an example for the learner, which is computationally expensive in practice. We propose a novel teaching framework, Locality Sensitive Teaching (LST), based on locality sensitive sampling, to overcome these challenges. LST has provable near-constant time complexity, which is exponentially better than the existing baseline. With at most 425.12x speedups and 99.76% energy savings over IMT, LST is the first algorithm that enables energy and time efficient machine teaching on IoT devices. Owing to LST’s substantial efficiency and scalability, it is readily applicable in real-world education scenarios.
Thursday, December 9, 2021
Thursday 7:30 pm EDT
Neural Tangent Kernel Maximum Mean Discrepancy
Xiuyuan Cheng (Duke University) · Yao Xie (Georgia Tech)
We present a novel neural network Maximum Mean Discrepancy (MMD) statistic by identifying a new connection between neural tangent kernel (NTK) and MMD. This connection enables us to develop a computationally efficient and memory-efficient approach to compute the MMD statistic and perform NTK-based two-sample tests, addressing the long-standing challenge of the memory and computational complexity of the MMD statistic, which is essential for online implementations that assimilate new samples. Theoretically, such a connection allows us to understand the NTK test statistic properties, such as the Type-I error and testing power for performing the two-sample test, by adapting existing theories for kernel MMD. Numerical experiments on synthetic and real-world datasets validate the theory and demonstrate the effectiveness of the proposed NTK-MMD statistic.
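For reference, the sketch below computes the standard unbiased kernel MMD² statistic for a generic kernel; NTK-MMD corresponds to choosing the kernel as a network's neural tangent kernel, while a Gaussian kernel stands in here so the example is self-contained.

```python
import numpy as np

def mmd2_unbiased(X, Y, kernel):
    """Unbiased estimate of MMD^2 between samples X and Y under a kernel k.
    NTK-MMD corresponds to choosing k as the neural tangent kernel of a
    network; the Gaussian kernel below is only a self-contained stand-in.
    """
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)   # drop i == j terms for the unbiased estimate
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2 * Kxy.mean()

def gaussian_kernel(A, B, bandwidth=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

X = np.random.randn(100, 2)
Y = np.random.randn(100, 2) + 0.5   # shifted distribution
stat = mmd2_unbiased(X, Y, gaussian_kernel)
```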
Thursday, December 9, 2021
Thursday 7:30 pm EDT
Observation-Free Attacks on Stochastic Bandits
Yinglun Xu (University of Illinois, Urbana Champaign) · Bhuvesh Kumar (Georgia Tech) · Jacob Abernethy (Georgia Institute of Technology)
We study data corruption attacks on stochastic multi-armed bandit algorithms. Existing attack methodologies assume that the attacker can observe the multi-armed bandit algorithm’s realized behavior, which is in contrast to the adversaries modeled in the robust multi-armed bandit algorithms literature. To the best of our knowledge, we develop the first data corruption attack on stochastic multi-armed bandit algorithms which works without observing the algorithm’s realized behavior. Through this attack, we also discover a sufficient condition for a stochastic multi-armed bandit algorithm to be susceptible to adversarial data corruptions. We show that any bandit algorithm that makes decisions just using the empirical mean reward, and the number of times that arm has been pulled in the past, can suffer from linear regret under data corruption attacks. We further show that various popular stochastic multi-armed bandit algorithms such as UCB, ϵ-greedy and Thompson Sampling satisfy this sufficient condition and are thus prone to data corruption attacks. We further analyze the behavior of our attack for these algorithms and show that using only o(T) corruptions, our attack can force these algorithms to select a potentially non-optimal target arm preferred by the attacker for all but o(T) rounds.
Thursday, December 9, 2021
Thursday 7:30 pm EDT
Pessimism Meets Invariance: Provably Efficient Offline Mean-Field Multi-Agent RL
Minshuo Chen (Georgia Tech) · Yan Li (Georgia Institute of Technology) · Ethan Wang (Georgia Institute of Technology) · Zhuoran Yang (Princeton) · Zhaoran Wang (Princeton University) · Tuo Zhao (Georgia Tech)
Mean-Field Multi-Agent Reinforcement Learning (MF-MARL) is attractive in applications involving a large population of homogeneous agents, as it exploits the permutation invariance of agents and avoids the curse of many agents. Most existing results focus only on online settings, in which agents can interact with the environment during training. In some applications, such as social welfare optimization, however, interaction during training can be prohibitive or even unethical. To bridge this gap, we propose SAFARI (peSsimistic meAn-Field vAlue iteRatIon), an algorithm for offline MF-MARL that requires only a handful of pre-collected experience data. Theoretically, under a weak coverage assumption that the experience dataset contains enough information about the optimal policy, we prove that for an episodic mean-field MDP with horizon H and N training trajectories, SAFARI attains a sub-optimality gap of O(H^2 d_eff/√N), where d_eff is the effective dimension of the function class parameterizing the value function; the bound is independent of the number of agents. Numerical experiments are provided.
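A minimal tabular sketch of the pessimism principle that SAFARI instantiates (ours, heavily simplified: SAFARI handles mean-field states with general function approximation, not tables): value iteration on offline estimates with a count-based penalty subtracted wherever the dataset provides little coverage.

```python
import numpy as np

def pessimistic_value_iteration(P_hat, R_hat, counts, H, beta=1.0):
    """Finite-horizon value iteration with a pessimism penalty on poorly covered pairs.

    P_hat: (S, A, S) empirical transitions, R_hat: (S, A) empirical rewards,
    counts: (S, A) visitation counts in the offline dataset.
    """
    penalty = beta / np.sqrt(np.maximum(counts, 1))    # larger where data is scarce
    V = np.zeros(P_hat.shape[0])
    for _ in range(H):
        Q = R_hat - penalty + P_hat @ V                # pessimistic Bellman backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V                         # greedy policy and its value estimate
```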
Thursday, December 9, 2021
Thursday 7:30 pm EDT
ProTo: Program-Guided Transformer for Program-Guided Tasks
Zelin Zhao (The Chinese University of Hong Kong) · Karan Samel (Georgia Institute of Technology) · Binghong Chen (Georgia Institute of Technology) · Le Song (Georgia Institute of Technology)
Programs, consisting of semantic and structural information, play an important role in the communication between humans and agents. Towards learning general program executors that unify perception, reasoning, and decision making, we formulate program-guided tasks, which require learning to execute a given program on an observed task specification. Furthermore, we propose the Program-Guided Transformer (ProTo), which integrates both the semantic and structural guidance of a program by leveraging cross-attention and masked self-attention to pass messages between the specification and the routines in the program. ProTo executes a program in a learned latent space and enjoys stronger representation ability than previous neural-symbolic approaches. We demonstrate that ProTo significantly outperforms previous state-of-the-art methods on the GQA visual reasoning and 2D Minecraft policy learning datasets. Additionally, ProTo demonstrates better generalization to unseen, complex, and human-written programs.
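The message-passing idea can be made concrete with a small PyTorch sketch (ours, not the released ProTo code; module and argument names are hypothetical): routine embeddings attend to the task specification through cross-attention, then to each other through self-attention masked by the program's structure.

```python
import torch
import torch.nn as nn

class ProgramGuidedBlock(nn.Module):
    """Illustrative block: cross-attention to the spec, masked self-attention over routines."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, routines, spec, structure_mask):
        # routines: (B, R, dim) program routine embeddings
        # spec:     (B, S, dim) task-specification embeddings
        # structure_mask: (R, R) bool, True where the program structure disallows attention
        r, _ = self.cross(routines, spec, spec)
        r, _ = self.self_attn(r, r, r, attn_mask=structure_mask)
        return r

# Usage with random tensors and an all-permissive mask:
block = ProgramGuidedBlock()
out = block(torch.randn(2, 5, 64), torch.randn(2, 10, 64),
            structure_mask=torch.zeros(5, 5, dtype=torch.bool))
```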
Thursday, December 9, 2021
Thursday 7:30 pm EDT
Towards understanding retrosynthesis by energy-based models
Ruoxi Sun (Columbia University) · Hanjun Dai (Georgia Institute of Technology) · Li Li (Google) · Steven Kearnes (Google Research) · Bo Dai (Google Brain)
Retrosynthesis is the process of identifying a set of reactants to synthesize a target molecule. It is of vital importance to materials design and drug discovery. Existing machine learning approaches based on language models and graph neural networks have achieved encouraging results. However, the inner connections between these models are rarely discussed, and rigorous evaluations of them are largely lacking. In this paper, we propose a framework that unifies sequence- and graph-based methods as energy-based models (EBMs) with different energy functions. This unified view establishes connections and reveals the differences between models, thereby enhancing our understanding of model design. We also provide the community with a comprehensive assessment of model performance. Moreover, we present a novel dual variant within the framework that performs consistent training to encourage agreement between forward and backward prediction. This model improves the state of the art among template-free methods, with or without reaction types.
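To illustrate the unified EBM view (our sketch, with a placeholder energy network rather than the paper's sequence or graph parameterizations): each candidate reactant set is assigned an energy given the product, and prediction reduces to ranking candidates by that energy.

```python
import torch
import torch.nn as nn

class ReactionEnergy(nn.Module):
    """Placeholder energy function E(product, reactants); lower is better."""

    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, product_emb, reactant_emb):
        return self.score(torch.cat([product_emb, reactant_emb], dim=-1)).squeeze(-1)

def rank_candidates(energy_fn, product_emb, candidate_embs):
    """Return candidate indices sorted from lowest to highest energy."""
    energies = energy_fn(product_emb.expand(len(candidate_embs), -1), candidate_embs)
    return torch.argsort(energies)

# Usage: in practice the embeddings would come from a sequence or graph encoder.
e = ReactionEnergy()
order = rank_candidates(e, torch.randn(128), torch.randn(10, 128))
```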
Friday, December 10, 2021
Friday 11:30 am EDT
EDGE: Explaining Deep Reinforcement Learning Policies
Wenbo Guo (Pennsylvania State University) · Xian Wu (Pennsylvania State University) · Usmann Khan (Georgia Institute of Technology) · Xinyu Xing (Penn State University)
With the rapid development of deep reinforcement learning (DRL) techniques, there is an increasing need to understand and interpret DRL policies. While recent research has developed explanation methods to interpret how an agent determines its moves, these methods cannot capture the importance of individual actions/states to a game’s final result. In this work, we propose a novel self-explainable model that augments a Gaussian process with a customized kernel function and an interpretable predictor. Together with the proposed model, we also develop a parameter learning procedure that leverages inducing points and variational inference to improve learning efficiency. Using our proposed model, we can predict an agent’s final rewards from its game episodes and extract time-step importance within episodes as strategy-level explanations for that agent. Through experiments on Atari and MuJoCo games, we verify the explanation fidelity of our method and demonstrate how to employ interpretation to understand agent behavior, discover policy vulnerabilities, remediate policy errors, and even defend against adversarial attacks.
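As a loose illustration of the overall shape of this approach (ours; the kernel, names, and training are placeholders, and the inducing-point variational procedure is omitted), a GP with an additive per-time-step kernel can regress final reward on an episode while the kernel's step weights expose time-step importance.

```python
import numpy as np

def step_kernel(a, b, lengthscale=1.0):
    # RBF similarity between per-step features; a, b have shape (T, d) -> returns (T,).
    return np.exp(-((a - b) ** 2).sum(-1) / (2 * lengthscale ** 2))

def episode_kernel(ep_a, ep_b, step_weights):
    # Additive kernel: weighted sum of per-time-step similarities.
    return float(step_weights @ step_kernel(ep_a, ep_b))

def gp_predict(train_eps, train_rewards, test_ep, step_weights, noise=1e-2):
    """GP regression of final reward on episodes using the additive step kernel."""
    n = len(train_eps)
    K = np.array([[episode_kernel(a, b, step_weights) for b in train_eps] for a in train_eps])
    k_star = np.array([episode_kernel(test_ep, e, step_weights) for e in train_eps])
    alpha = np.linalg.solve(K + noise * np.eye(n), train_rewards)
    return k_star @ alpha

# The T-length step_weights (learned in the real method, fixed here) act as per-step importance.
```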
Friday, December 10, 2021
Friday 11:30 am EDT
No RL, No Simulation: Learning to Navigate without Navigating
Meera Hahn (Georgia Institute of Technology) · Devendra Singh Chaplot (Carnegie Mellon University) · Shubham Tulsiani (UC Berkeley) · Mustafa Mukadam (Facebook AI Research) · James M Rehg (Georgia Institute of Technology) · Abhinav Gupta (Facebook AI Research/CMU)
Most prior methods for learning navigation policies require access to simulation environments, as they need online policy interaction and rely on ground-truth maps for rewards. However, building simulators is expensive (it requires manual effort for each and every scene) and creates challenges in transferring learned policies to robotic platforms in the real world, due to the sim-to-real domain gap. In this paper, we pose a simple question: do we really need active interaction, ground-truth maps, or even reinforcement learning (RL) in order to solve the image-goal navigation task? We propose a self-supervised approach that learns to navigate from only passive videos of roaming. Our approach, No RL, No Simulator (NRNS), is simple and scalable, yet highly effective. NRNS outperforms RL-based formulations by a significant margin. We present NRNS as a strong baseline for any future image-based navigation tasks that use RL or simulation.
Friday, December 10, 2021
Friday 11:30 am EDT
The Utility of Explainable AI in Ad Hoc Human-Machine Teaming
Rohan Paleja (Georgia Institute of Technology) · Muyleng Ghuy (Georgia Institute of Technology) · Nadun Ranawaka Arachchige (Georgia Institute of Technology) · Reed Jensen (MIT Lincoln Laboratory, Massachusetts Institute of Technology) · Matthew Gombolay (Georgia Institute of Technology)
Recent advances in machine learning have led to growing interest in Explainable AI (xAI) to enable humans to gain insight into the decision-making of machine learning models. Despite this recent interest, the utility of xAI techniques has not yet been characterized in human-machine teaming. Importantly, xAI offers the promise of enhancing team situational awareness (SA) and shared mental model development, which are key characteristics of effective human-machine teams. Rapidly developing such mental models is especially critical in ad hoc human-machine teaming, where agents do not have a priori knowledge of others’ decision-making strategies. In this paper, we present two novel human-subject experiments quantifying the benefits of deploying xAI techniques within a human-machine teaming scenario. First, we show that xAI techniques can support SA (p<0.05). Second, we examine how different SA levels induced via a collaborative AI policy abstraction affect ad hoc human-machine teaming performance. Importantly, we find that the benefits of xAI are not universal, as there is a strong dependence on the composition of the human-machine team. Novices benefit from xAI providing increased SA (p<0.05) but are susceptible to cognitive overhead (p<0.05). On the other hand, expert performance degrades with the addition of xAI-based support (p<0.05), indicating that the cost of paying attention to the xAI outweighs the benefit of the additional information provided to enhance SA. Our results demonstrate that researchers must deliberately design and deploy the right xAI techniques in the right scenario by carefully considering human-machine team composition and how the xAI method augments SA.
Other
Datasets and Benchmarks
A Large-Scale Database for Graph Representation Learning
Scott Freitas, Yuxiao Dong, Joshua Neil, Duen Horng Chau
With the rapid emergence of graph representation learning, the construction of new large-scale datasets is necessary to distinguish model capabilities and accurately assess the strengths and weaknesses of each technique. By carefully analyzing existing graph databases, we identify three critical components for advancing the field of graph representation learning: (1) large graphs, (2) many graphs, and (3) class diversity. To date, no single graph database offers all of these desired properties. We introduce MalNet, the largest public graph database ever constructed, representing a large-scale ontology of malicious software function call graphs. MalNet contains over 1.2 million graphs, averaging over 15k nodes and 35k edges per graph, across a hierarchy of 47 types and 696 families. Compared to the popular REDDIT-12K database, MalNet offers 105x more graphs, 44x larger graphs on average, and 63x more classes. We provide a detailed analysis of MalNet, discussing its properties and provenance, along with an evaluation of state-of-the-art machine learning and graph neural network techniques. The unprecedented scale and diversity of MalNet offer exciting opportunities to advance the frontiers of graph representation learning, enabling new discoveries and research into imbalanced classification, explainability, and the impact of class hardness. The database is publicly available at www.mal-net.org.
Argoverse 2.0: Next Generation Datasets for Self-driving Perception and Forecasting
Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, James Hays
We introduce Argoverse 2.0, a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated Sensor Dataset contains 1,000 sequences of multimodal data, encompassing high-resolution imagery from seven ring cameras and two stereo cameras, in addition to lidar point clouds and 6-DOF map-aligned pose. Sequences contain 3D cuboid annotations for 25 object categories, all of which are sufficiently sampled to support training and evaluation of 3D perception models. The Lidar Dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose. It is the largest collection of lidar sensor data ever released and supports self-supervised learning and the emerging task of point cloud forecasting. Finally, the Motion Forecasting Dataset contains 100,000 scenarios mined for interesting and challenging interactions between the AV and other actors in each local scene. Models are tasked with predicting the future motion of “scored actors” in each scenario and are provided with track histories that capture object location, heading, velocity, and category. In all three datasets, each scenario contains its own HD map with 3D lane and crosswalk geometry, sourced from data captured in six distinct cities. We believe these datasets will support new and existing machine learning research problems in ways that existing datasets do not. All datasets are released under the CC BY-NC-SA 4.0 license.
Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel Chang, Manolis Savva, Yili Zhao, Dhruv Batra
We present the Habitat-Matterport 3D (HM3D) dataset. HM3D is a large-scale dataset of 1,000 building-scale 3D reconstructions from a diverse set of real-world locations. Each scene in the dataset consists of a textured 3D mesh reconstruction of interiors such as multi-floor residences, stores, and other private indoor spaces. HM3D surpasses existing datasets available for academic research in terms of physical scale, completeness of the reconstruction, and visual fidelity. HM3D contains 112.5k m^2 of navigable space, which is 1.4 – 3.7× larger than other building-scale datasets (MP3D, Gibson). When compared to existing photorealistic 3D datasets (Replica, MP3D, Gibson, ScanNet), rendered images from HM3D have 20 – 85% higher visual fidelity w.r.t. counterpart images captured with real cameras, and HM3D meshes have 34 – 91% fewer artifacts due to incomplete surface reconstruction. The increased scale, fidelity, and diversity of HM3D directly impact the performance of embodied AI agents trained on it. In fact, we find that HM3D is ‘Pareto optimal’ in the following sense: agents trained to perform PointGoal navigation on HM3D achieve the highest performance regardless of whether they are evaluated on HM3D, Gibson, or MP3D. No similar claim can be made about training on other datasets. HM3D-trained PointNav agents achieve 100% performance on the Gibson test set, suggesting that it might be time to retire that episode dataset.
Neural Latents Benchmark ’21: Evaluating latent models of neural population activity
F. Pei, Joel Ye, David Zoltowski, Anqi Wu, Raeed H Chowdhury, Hansem Sohn, Joseph E O’Doherty, Krishna V Shenoy, Matthew T Kaufman, Mark Churchland, Mehrdad Jazayeri, Lee E Miller, Jonathan Pillow, Il Memming Park, Eva L Dyer, Chethan Pandarinath
Advances in neural recording present increasing opportunities to study neural activity in unprecedented detail. Latent variable models (LVMs) are promising tools for analyzing this rich activity across diverse neural systems and behaviors, as LVMs do not depend on known relationships between the activity and external experimental variables. However, progress with LVMs for neuronal population activity is currently impeded by a lack of standardization, resulting in methods being developed and compared in an ad hoc manner. To coordinate these modeling efforts, we introduce a benchmark suite for latent variable modeling of neural population activity. We curate four datasets of neural spiking activity from cognitive, sensory, and motor areas to promote models that apply to the wide variety of activity seen across these areas. We identify unsupervised evaluation as a common framework for evaluating models across datasets, and apply several baselines that demonstrate benchmark diversity.
Teach Me to Explain: A Review of Datasets for Explainable Natural Language Processing
Sarah Wiegreffe, Ana Marasovic
Explainable Natural Language Processing (EXNLP) has increasingly focused on collecting human-annotated textual explanations. These explanations are used downstream in three ways: as data augmentation to improve performance on a predictive task, as supervision to train models to produce explanations for their predictions, and as ground truth to evaluate model-generated explanations. In this review, we identify 65 datasets with three predominant classes of textual explanations (highlights, free-text, and structured), organize the literature on annotating each type, identify strengths and shortcomings of existing collection methodologies, and give recommendations for collecting EXNLP datasets in the future.
Trust, but Verify: Cross-Modality Fusion for HD Map Change Detection
John Lambert (Georgia Tech), James Hays (Georgia Tech)
High-definition (HD) map change detection is the task of determining when sensor data and map data are no longer in agreement with one another due to real-world changes. We collect the first dataset for the task, which we entitle the Trust, but Verify (TbV) dataset, by mining thousands of hours of data from over 9 months of autonomous vehicle fleet operations. We present learning-based formulations for solving the problem in the bird’s eye view and ego-view. Because real map changes are infrequent and vector maps are easy to synthetically manipulate, we lean on simulated data to train our model. Perhaps surprisingly, we show that such models can generalize to real world distributions. The dataset, consisting of maps and logs collected in six North American cities, is one of the largest AV datasets to date with more than 7.9 million images and will be made available to the public, along with code and models.
Demos
An Interactive Tool for Computation with Assemblies of Neurons
(last session in Demos 2)
Seung Je Jung (Georgia Institute of Technology), Daniel Mitropolsky (Columbia University), Christos H. Papadimitriou (Columbia University), Santosh S. Vempala (Georgia Institute of Technology)
The Assembly Calculus (AC) is a novel framework intended to bridge the gap between the level of neurons and synapses and that of cognition. AC is a computational system comprising (1) a basic data item called an assembly, a stable set of neurons; (2) a set of operations that create and manipulate assemblies; and (3) an execution model grounded in basic tenets of neuroscience. Importantly, it allows the creation of biologically plausible, flexible, and interpretable programs, enabling one to develop tangible hypotheses on how specific brain functions may work. To facilitate such experimentation, we present a tool that allows real-time simulation, modification, and visualization of this computational system, including several prepared examples.
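For a sense of what such a simulation involves, here is a hedged, self-contained sketch (ours, not the demo's code) of one AC-style projection loop: the k most strongly driven neurons fire (a winner-take-all cap) and synapses from previous to current winners are strengthened Hebbian-style, so repeated projection settles into a stable assembly.

```python
import numpy as np

def project(stimulus, W, k=50, beta=0.1, steps=10):
    """Repeatedly project a stimulus into an area and return the winning assembly.

    stimulus: (n,) external input drive; W: (n, n) recurrent synaptic weights,
    where W[i, j] is the synapse from neuron j to neuron i.
    """
    winners = np.zeros(W.shape[0], dtype=bool)
    for _ in range(steps):
        drive = stimulus + W @ winners                       # feedforward + recurrent input
        new_winners = np.zeros_like(winners)
        new_winners[np.argpartition(drive, -k)[-k:]] = True  # k-cap: only the top-k fire
        W[np.ix_(new_winners, winners)] *= (1 + beta)        # Hebbian strengthening
        winners = new_winners
    return winners

n = 1000
rng = np.random.default_rng(0)
W = (rng.random((n, n)) < 0.05).astype(float)                # sparse random connectivity
assembly = project(rng.random(n), W)
```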
Workshops
Human-Centered AI
Explainability Pitfalls: Beyond Dark Patterns in Explainable AI
Upol Ehsan, Mark O. Riedl
To make Explainable AI (XAI) systems trustworthy, understanding harmful effects is just as important as producing well-designed explanations. In this paper, we address an important yet unarticulated type of negative effect in XAI. We introduce explainability pitfalls (EPs): unanticipated negative downstream effects of AI explanations that manifest even when there is no intention to manipulate users. EPs are different from, yet related to, dark patterns, which are intentionally deceptive practices. We articulate the concept of EPs by demarcating it from dark patterns and highlighting the challenges arising from uncertainties around pitfalls. We situate and operationalize the concept using a case study that showcases how, despite best intentions, unsuspected negative effects such as unwarranted trust in numerical explanations can emerge. We propose proactive and preventative strategies to address EPs at three interconnected levels: research, design, and organizational.
NeurIPS Workshop on Self-Supervised Learning in Theory and Practice
Mine Your Own View: Self-supervised learning through across-sample prediction
M. Azabou, M. Gheshlaghi Azar, R. Liu, C-H Lin, E. Johnson, K. Bhaskharan-Nair, M. Dabagia, K.B. Hengen, W. Gray-Roncal, M. Valko, E.L. Dyer
Traditionally, neural decoding has been performed through supervised approaches that aim to map specific behaviors or stimuli to specific neural activity patterns through labeled data. However, the representations learned through a supervised approach typically require simple trial structure and repetitive behaviors, and fail to generalize to new datasets. Here, we ask whether we can use self-supervised learning principles to learn more robust and generalizable representations of neural activity. Rather than using labels to guide learning, we essentially ask the network to build a representation that makes it easy to predict across nearby points in time, as well as across adaptively “mined” samples that are nonlocal but close in terms of their representations in the network. We show that by incorporating nonlocal mined views into the system, and predicting across distinct time points, the network can build representations that allow for more faithful decoding on downstream tasks.
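To make the mined views concrete, the following is a minimal sketch (ours; function names and the memory-bank setup are hypothetical) of the across-sample step: each anchor embedding selects its nearest nonlocal neighbor in representation space as an extra positive, and the network is trained to predict that neighbor's (stop-gradient) representation.

```python
import torch
import torch.nn.functional as F

def mine_views(anchors, candidates, k=1):
    """Return, for each anchor embedding, its nearest nonlocal candidate embedding."""
    sim = F.normalize(anchors, dim=1) @ F.normalize(candidates, dim=1).T
    idx = sim.topk(k, dim=1).indices[:, 0]
    return candidates[idx]

def across_sample_loss(predicted, mined_targets):
    """Cosine prediction loss between predicted embeddings and mined targets (stop-grad)."""
    p = F.normalize(predicted, dim=1)
    t = F.normalize(mined_targets.detach(), dim=1)
    return (2 - 2 * (p * t).sum(dim=1)).mean()

# Usage: anchors/predicted come from the online network, candidates from a memory bank.
z = torch.randn(32, 64)
bank = torch.randn(512, 64)
loss = across_sample_loss(z, mine_views(z, bank))
```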
NeurIPS Workshop on Self-Supervised Learning in Theory and Practice
Using self-supervision and augmentations to build insights into neural coding
M. Azabou+, M. Dabagia+, R. Liu+, C-H Lin, Keith B. Hengen, E.L. Dyer
Recent results have shown that self-supervised approaches can be used to build robust representations of brain states, outperforming supervised methods on downstream brain decoding tasks. We discuss the implications of these results, how self-supervised learning can reveal interesting properties of neural computation, and how different augmentations can be used and designed to dissect competing theories of neural computation.