Garrett Bingham and Risto Miikkulainen
Recent studies have shown that the choice of activation function can significantly affect the performance of deep learning networks. However, the benefits of novel activation functions have been inconsistent and task-dependent, and therefore the rectified linear unit (ReLU) is still the most commonly used. This paper proposes a technique for customizing activation functions automatically, resulting in reliable improvements in performance. Evolutionary search is used to discover the general form of the function, and gradient descent to optimize its parameters for different parts of the network and over the learning process. Experiments with three different neural network architectures on the CIFAR-100 image classification dataset show that this approach is effective. It discovers different activation functions for different architectures, and consistently improves accuracy over ReLU and other recently proposed activation functions by significant margins. The approach can therefore be used as an automated optimization step in applying deep learning to new tasks.
Garrett Bingham*, William Macke*, and Risto Miikkulainen
The choice of activation function can have a large effect on the performance of a neural network. While there have been some attempts to hand-engineer novel activation functions, the Rectified Linear Unit (ReLU) remains the most commonly-used in practice. This paper shows that evolutionary algorithms can discover novel activation functions that outperform ReLU. A tree-based search space of candidate activation functions is defined and explored with mutation, crossover, and exhaustive search. Experiments on training wide residual networks on the CIFAR-10 and CIFAR-100 image datasets show that this approach is effective. Replacing ReLU with evolved activation functions results in statistically significant increases in network accuracy. Optimal performance is achieved when evolution is allowed to customize activation functions to a particular task; however, these novel activation functions are shown to generalize, achieving high performance across tasks. Evolutionary optimization of activation functions is therefore a promising new dimension of metalearning in neural networks.
Rui Zhang, Caitlin Westerfield, Sungrok Shim, Garrett Bingham, Alexander Fabbri, William Hu, Neha Verma, Dragomir Radev
Low-resource cross-lingual document retrieval performance is improved with deep bilingual query-document
representations. Experimental results on the MATERIAL dataset show that our model outperforms the competitive translation-based baselines on English-Swahili, English-Tagalog,
and English-Somali cross-lingual information
An automatically discovered bidirectional recurrent architecture nearly matches state-of-the-art accuracy for part of speech tagging across 60 treebanks.
Image credit: https://github.com/quark0/darts
Most neural architecture search approaches utilize reinforcement learning or neuroevolutionary methods. Architecture optimization by gradient descent has been considered as a possible alternative.
However, by training language models on
Penn treebank, we demonstrate that gradient
descent explores the search space ineffectively,
and find that randomly initialized architectures
are often able to outperform those discovered
after extensive searching. We argue that gradient descent simply serves as a proxy for arbitrarily modifying the architecture, and show
that gradient descent does not discover more
capable architectures with each iteration of architecture search.
Benjamin Yip, Garrett Bingham, Katherine Kempfert, Jonathan Fabish, Troy Kling, Cuixian Chen, and Yishi Wang
2018 IEEE International Conference on Big Data
I discovered thousands of gender, race, and birthdate inconsistencies in the MORPH-II face image dataset that previously published
research had missed. In this paper we discuss our strategy to fix these errors and release these corrections in the hope that future research utilizing MORPH-II
will be more accurate.
Random Subspace Two-dimensional LDA (RS-2DLDA) improves upon a 2D generalization of LDA in which
the input data is left in matrix form instead of being vectorized. RS-2DLDA builds an ensemble of classifiers by performing k-nearest neighbor
classification in subspaces defined by random selections of the feature vectors learned during training. This gives high accuracy and prevents
overfitting. Applied to face recognition, RS-2DLDA outperformed similar approaches on the MORPH-II and ORL datasets.