Machine Learning Latest Submitted Preprints | 2019-03-13

in #learning

Machine Learning


Communication-efficient distributed SGD with Sketching (1903.04488v1)

Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Vladimir Braverman, Ion Stoica, Raman Arora

2019-03-12

Large-scale distributed training of neural networks is often limited by network bandwidth, wherein the communication time overwhelms the local computation time. Motivated by the success of sketching methods in sub-linear/streaming algorithms, we propose a sketching-based approach to minimize the communication costs between nodes without losing accuracy. In our proposed method, workers in a distributed, synchronous training setting send sketches of their gradient vectors to the parameter server instead of the full gradient vector. Leveraging the theoretical properties of sketches, we show that this method recovers the favorable convergence guarantees of single-machine top-k SGD. Furthermore, when applied to a model with d dimensions on W workers, our method requires only O(kW log(dW)) bytes of communication, compared to O(dW) for vanilla distributed SGD. To validate our method, we run experiments using a residual network trained on the CIFAR-10 dataset. We achieve no drop in validation accuracy with a compression ratio of 4, or about 1 percentage point drop with a compression ratio of 8. We also demonstrate that our method scales to many workers.
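The key property such an approach exploits is that sketches are linear, so the parameter server can simply sum the workers' sketches and then recover the heavy (top-k) gradient coordinates from the merged sketch. The snippet below is a minimal count-sketch illustration of that idea in NumPy; it is not the authors' implementation, and the class, the sketch sizes (r, c), and the toy gradients are placeholders.

```python
import numpy as np

class CountSketch:
    """Minimal count-sketch of a d-dimensional vector (r rows, c buckets)."""
    def __init__(self, d, r=5, c=256, seed=0):
        rng = np.random.default_rng(seed)                 # shared seed -> identical hashes on all workers
        self.buckets = rng.integers(0, c, size=(r, d))    # bucket index per (row, coordinate)
        self.signs = rng.choice([-1.0, 1.0], size=(r, d))
        self.table = np.zeros((r, c))
        self.r = r

    def accumulate(self, vec):
        for row in range(self.r):
            np.add.at(self.table[row], self.buckets[row], self.signs[row] * vec)

    def merge(self, other):
        self.table += other.table                         # sketches are linear, so they just add

    def estimate(self):
        est = self.signs * self.table[np.arange(self.r)[:, None], self.buckets]
        return np.median(est, axis=0)                     # per-coordinate median across rows

# Toy round: W workers send sketches instead of full gradient vectors.
d, W, k = 10_000, 4, 50
rng = np.random.default_rng(1)
grads = [rng.standard_normal(d) * (rng.random(d) < 0.01) for _ in range(W)]  # sparse-ish toy gradients

server = CountSketch(d)
for g in grads:
    sketch = CountSketch(d)      # same seed, so hash functions match across workers
    sketch.accumulate(g)
    server.merge(sketch)         # only the small (r x c) table crosses the network

approx = server.estimate()
topk = np.argsort(np.abs(approx))[-k:]                    # approximate top-k coordinates
exact = np.sum(grads, axis=0)
print(np.linalg.norm(approx[topk] - exact[topk]) / np.linalg.norm(exact[topk]))
```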

Efficient Optimization of Echo State Networks for Time Series Datasets (1903.05071v1)

Jacob Reinier Maat, Nikos Gianniotis, Pavlos Protopapas

2019-03-12

Echo State Networks (ESNs) are recurrent neural networks that only train their output layer, thereby precluding the need to backpropagate gradients through time, which leads to significant computational gains. Nevertheless, a common issue with ESNs is determining their hyperparameters, which are crucial for instantiating a well-performing reservoir but are often set manually or using heuristics. In this work we optimize the ESN hyperparameters using Bayesian optimization which, given a limited budget of function evaluations, outperforms a grid search strategy. In the context of large volumes of time series data, such as light curves in the field of astronomy, we can further reduce the optimization cost of ESNs. In particular, we wish to avoid tuning hyperparameters per individual time series as this is costly; instead, we want to find ESNs with hyperparameters that perform well not just on individual time series but rather on groups of similar time series, without sacrificing predictive performance significantly. This naturally leads to a notion of clusters, where each cluster is represented by an ESN tuned to model a group of time series of similar temporal behavior. We demonstrate this approach both on synthetic datasets and on real-world light curves from the MACHO survey. We show that our approach results in a significant reduction in the number of ESN models required to model a whole dataset, while retaining predictive performance for the series in each cluster.
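For readers unfamiliar with ESNs: only the linear readout is trained, while the recurrent reservoir is fixed and governed by a handful of hyperparameters (reservoir size, spectral radius, input scaling, ridge penalty). The sketch below is a minimal ESN whose one-step-ahead training error could serve as the objective inside a Bayesian-optimization loop; the hyperparameter values and function names are illustrative, not taken from the paper.

```python
import numpy as np

def run_esn(u, n_res=200, spectral_radius=0.9, input_scale=0.5, ridge=1e-6, seed=0):
    """Fit a minimal echo state network to predict u[t+1] from u[t].

    n_res, spectral_radius and input_scale are the kind of reservoir
    hyperparameters the paper tunes with Bayesian optimization.
    """
    rng = np.random.default_rng(seed)
    w_in = input_scale * rng.uniform(-1, 1, size=n_res)
    w_res = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
    w_res *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w_res)))  # rescale to target spectral radius

    # Collect reservoir states; only the linear readout below is trained.
    x = np.zeros(n_res)
    states = []
    for t in range(len(u) - 1):
        x = np.tanh(w_in * u[t] + w_res @ x)
        states.append(x)
    X, y = np.array(states), u[1:]

    w_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)    # ridge-regression readout
    return np.mean((X @ w_out - y) ** 2)                                 # one-step-ahead training MSE

u = np.sin(np.linspace(0, 50, 1000)) + 0.05 * np.random.default_rng(1).standard_normal(1000)
print(run_esn(u))
```

A Bayesian-optimization wrapper would simply treat `run_esn` as the expensive function to minimize over `(n_res, spectral_radius, input_scale, ridge)`.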

Application of Duration-of-Stay Storage Assignment with Deep Neural Networks under Uncertainty (1903.05063v1)

Michael Lingzhi Li, Elliott Wolf, Daniel Wintz

2019-03-12

Optimizing storage assignment is a central problem in warehousing. Past literature has shown the superiority of the Duration-of-Stay (DoS) method in assigning pallets, but the methodology requires perfect prior knowledge of the DoS of each pallet, which is unknown and uncertain under realistic conditions. In this paper, we introduce a new framework that accounts for such uncertainty using a novel combination of convolutional and recurrent neural network models, ParallelNet. Through collaboration with a large cold storage company, we demonstrate that ParallelNet achieves a 10% decrease in mean squared logarithmic error (MSLE) and a 29% decrease in mean absolute percentage error (MAPE) compared to CNN-LSTM on unseen future shipments, and suffers less performance decay as time increases. The framework is then integrated into a first-of-its-kind Storage Assignment system, which is being piloted in warehouses across the country.
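The abstract does not spell out the architecture, but the general pattern of running a convolutional branch and a recurrent branch in parallel over the same input and merging them for regression can be sketched as follows (PyTorch; the layer sizes and feature layout are assumptions, not the paper's):

```python
import torch
import torch.nn as nn

class TwoBranchDoSNet(nn.Module):
    """Illustrative parallel CNN + RNN regressor for duration-of-stay.

    Only the parallel-branch idea is taken from the abstract; all layer
    sizes and the input representation here are placeholders.
    """
    def __init__(self, n_features=16, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(                       # convolutional branch over the time axis
            nn.Conv1d(n_features, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.rnn = nn.LSTM(n_features, hidden, batch_first=True)        # recurrent branch
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):                               # x: (batch, time, n_features)
        c = self.cnn(x.transpose(1, 2)).squeeze(-1)     # (batch, hidden)
        _, (h, _) = self.rnn(x)
        r = h[-1]                                       # (batch, hidden)
        return self.head(torch.cat([c, r], dim=1)).squeeze(-1)          # predicted duration of stay

model = TwoBranchDoSNet()
dummy = torch.randn(8, 30, 16)                          # 8 shipments, 30 time steps, 16 features
print(model(dummy).shape)                               # torch.Size([8])
```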

On the convex geometry of blind deconvolution and matrix completion (1902.11156v2)

Felix Krahmer, Dominik Stöger

2019-02-28

Low-rank matrix recovery from structured measurements has been a topic of intense study in the last decade and many important problems like matrix completion and blind deconvolution have been formulated in this framework. An important benchmark method to solve these problems is to minimize the nuclear norm, a convex proxy for the rank. A common approach to establish recovery guarantees for this convex program relies on the construction of a so-called approximate dual certificate. However, this approach provides only limited insight in various respects. Most prominently, the noise bounds exhibit seemingly suboptimal dimension factors. In this paper we take a novel, more geometric viewpoint to analyze both the matrix completion and the blind deconvolution scenario. We find that for both these applications the dimension factors in the noise bounds are not an artifact of the proof, but the problems are intrinsically badly conditioned. We show, however, that bad conditioning only arises for very small noise levels: Under mild assumptions that include many realistic noise levels we derive near-optimal error estimates for blind deconvolution under adversarial noise.
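For concreteness, the convex benchmark program discussed here is nuclear-norm minimization. A toy noiseless matrix-completion instance can be written in a few lines with CVXPY (the problem sizes below are arbitrary, and the paper's analysis concerns the noisy version of such programs):

```python
import cvxpy as cp
import numpy as np

# Low-rank matrix completion via nuclear-norm minimization: the convex
# benchmark program the paper analyzes, here in a toy noiseless instance.
rng = np.random.default_rng(0)
m, n, r = 20, 20, 2
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))    # ground-truth rank-2 matrix
observed = [(i, j) for i in range(m) for j in range(n) if rng.random() < 0.5]

X = cp.Variable((m, n))
objective = cp.Minimize(cp.normNuc(X))                  # nuclear norm as a convex proxy for rank
constraints = [X[i, j] == M[i, j] for (i, j) in observed]
cp.Problem(objective, constraints).solve()

print(np.linalg.norm(X.value - M) / np.linalg.norm(M))  # relative recovery error
```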

Generating Compact Geometric Track-Maps for Train Positioning Applications (1903.05014v1)

Hanno Winter, Stefan Luthardt, Volker Willert, Jürgen Adamy

2019-03-12

In this paper we present a method to generate compact geometric track-maps for train-borne localization applications. We first give a brief overview of the role of track maps, from which it becomes apparent that there are hardly any adequate methods for generating suitable geometric track-maps. We therefore present a novel map generation procedure that uses an optimization formulation to find the continuous sequence of track geometries that best fits the available measurement data. The optimization is initialized with the results from a localization filter developed in our previous work; the filter also provides the information required for shape identification and measurement association. The approach is evaluated on simulated data and compared against the typically used data-point-based maps.
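As a rough illustration of fitting a single geometric track element to noisy position measurements (the paper optimizes a whole continuous sequence of such elements, which this sketch does not attempt), a circular arc can be fitted by nonlinear least squares:

```python
import numpy as np
from scipy.optimize import least_squares

# Toy illustration: fit one circular-arc track element to noisy 2D position
# measurements. The "true" arc parameters and noise level below are made up.
rng = np.random.default_rng(0)
phi = np.linspace(0.0, 0.6, 80)
cx, cy, R = 10.0, -5.0, 300.0
pts = np.c_[cx + R * np.cos(phi), cy + R * np.sin(phi)] + rng.normal(0, 0.5, (80, 2))

def residuals(p):
    x0, y0, r = p
    return np.hypot(pts[:, 0] - x0, pts[:, 1] - y0) - r   # distance of each point to the arc

fit = least_squares(residuals, x0=[0.0, 0.0, 100.0])
print(fit.x)                                              # estimated centre and radius
```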

An Efficient Augmented Lagrangian Based Method for Constrained Lasso (1903.05006v1)

Zengde Deng, Anthony Man-Cho So

2019-03-12

Variable selection is one of the most important tasks in statistics and machine learning. To incorporate more prior information about the regression coefficients, the constrained Lasso model has been proposed in the literature. In this paper, we present an inexact augmented Lagrangian method to solve the Lasso problem with linear equality constraints. By fully exploiting second-order sparsity of the problem, we are able to greatly reduce the computational cost and obtain highly efficient implementations. Furthermore, numerical results on both synthetic data and real data show that our algorithm is superior to existing first-order methods in terms of both running time and solution accuracy.
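To make the problem concrete: the constrained Lasso is min over x of 0.5*||Ax - b||^2 + lambda*||x||_1 subject to Cx = d. The sketch below runs a plain augmented-Lagrangian loop with a proximal-gradient inner solver as a baseline; the paper's method handles the subproblems differently (exploiting second-order sparsity), so this is illustrative only.

```python
import numpy as np

def constrained_lasso_al(A, b, C, d, lam=0.1, rho=10.0, outer=50, inner=200):
    """Augmented-Lagrangian sketch for min 0.5||Ax-b||^2 + lam*||x||_1 s.t. Cx = d.

    The inner subproblem is solved (inexactly) by proximal gradient; this is a
    generic baseline, not the paper's second-order-sparsity-based solver.
    """
    n = A.shape[1]
    x, y = np.zeros(n), np.zeros(C.shape[0])
    L = np.linalg.norm(A, 2) ** 2 + rho * np.linalg.norm(C, 2) ** 2   # Lipschitz constant of smooth part
    step = 1.0 / L
    for _ in range(outer):
        for _ in range(inner):                                        # proximal gradient on L_rho(., y)
            grad = A.T @ (A @ x - b) + C.T @ y + rho * C.T @ (C @ x - d)
            z = x - step * grad
            x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding (prox of l1)
        y = y + rho * (C @ x - d)                                     # multiplier (dual) update
    return x

rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 100)), rng.standard_normal(50)
C, d = np.ones((1, 100)), np.zeros(1)                                 # e.g. coefficients sum to zero
x = constrained_lasso_al(A, b, C, d)
print(np.abs(C @ x - d), np.count_nonzero(np.abs(x) > 1e-6))          # constraint violation, sparsity
```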

Efficient Metric Learning for the Analysis of Motion Data (1610.05083v3)

Babak Hosseini, Barbara Hammer

2016-10-17

We investigate metric learning in the context of dynamic time warping (DTW), by far the most popular dissimilarity measure for the comparison and analysis of motion capture data. While metric learning enables a problem-adapted representation of data, the majority of methods have been proposed for vectorial data only. In this contribution, we extend the popular principle offered by the large margin nearest neighbors learner (LMNN) to DTW by treating the resulting component-wise dissimilarity values as features. We demonstrate that this principle greatly enhances the classification accuracy in several benchmarks. Further, we show that recent auxiliary concepts such as metric regularization can be transferred from the vectorial case to component-wise DTW in a similar way. We illustrate that metric regularization constitutes a crucial prerequisite for the interpretation of the resulting relevance profiles.
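The core construction is simple to state: compute the DTW dissimilarity separately for each motion channel and treat the resulting vector of component-wise dissimilarities as a feature vector that a metric learner such as LMNN can then weight. A minimal NumPy version (sequence lengths and channel count are placeholders):

```python
import numpy as np

def dtw(a, b):
    """Classic dynamic-time-warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def componentwise_dtw_features(seq_x, seq_y):
    """Per-channel DTW dissimilarities, usable as features for a metric learner."""
    return np.array([dtw(seq_x[:, k], seq_y[:, k]) for k in range(seq_x.shape[1])])

rng = np.random.default_rng(0)
x, y = rng.standard_normal((40, 5)), rng.standard_normal((55, 5))   # two motions, 5 channels each
print(componentwise_dtw_features(x, y))                             # 5 component-wise dissimilarities
```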

Theory III: Dynamics and Generalization in Deep Networks (1903.04991v1)

Andrzej Banburski, Qianli Liao, Brando Miranda, Lorenzo Rosasco, Bob Liang, Jack Hidary, Tomaso Poggio

2019-03-12

We review recent observations on the dynamical systems induced by gradient descent methods used for training deep networks and summarize properties of the solutions they converge to. Recent results illuminate the absence of overfitting in the special case of linear networks for binary classification. They prove that minimization of loss functions such as the logistic, the cross-entropy and the exponential loss yields asymptotic convergence to the maximum margin solution for linearly separable datasets, independently of the initial conditions. Here we discuss the case of nonlinear DNNs near zero minima of the empirical loss, under exponential-type and square losses, for several variations of the basic gradient descent algorithm, including a new NMGD (norm minimizing gradient descent) version that converges to the minimum norm fixed points of the gradient descent iteration. Our main results are: 1) gradient descent algorithms with weight normalization constraint achieve generalization; 2) the fundamental reason for the effectiveness of existing weight normalization and batch normalization techniques is that they are approximate implementations of maximizing the margin under unit norm constraint; 3) without unit norm constraints some level of generalization can still be obtained for not-too-deep networks because the balance of the weights across different layers, if present at initialization, is maintained by the gradient flow. In light of these theoretical results, we discuss experimental evidence around the apparent absence of overfitting, that is, the observation that the expected classification error does not get worse when the number of parameters increases. Our explanation focuses on the implicit normalization enforced by algorithms such as batch normalization. In particular, the control of the norm of the weights is related to Halpern iterations for minimum norm solutions.
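One of the cited phenomena, convergence of the weight direction to the maximum-margin separator under exponential-type losses on linearly separable data, is easy to observe numerically. The toy script below (not the paper's NMGD algorithm; the data and step size are arbitrary) shows the normalized weights stabilizing in direction while their norm keeps growing:

```python
import numpy as np

# Toy illustration of the separable linear case: gradient descent on the
# exponential loss drives the *direction* of w toward the max-margin
# separator while ||w|| diverges.
rng = np.random.default_rng(0)
X = np.r_[rng.normal([2, 2], 0.3, (50, 2)), rng.normal([-2, -2], 0.3, (50, 2))]
y = np.r_[np.ones(50), -np.ones(50)]

w = np.zeros(2)
for t in range(1, 200001):
    margins = y * (X @ w)
    grad = -(X * (y * np.exp(-margins))[:, None]).sum(axis=0)   # gradient of sum_i exp(-y_i w.x_i)
    w -= 0.01 * grad
    if t % 50000 == 0:
        print(t, w / np.linalg.norm(w), np.linalg.norm(w))      # direction stabilizes, norm grows
```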

Adaptive Sample-Efficient Blackbox Optimization via ES-active Subspaces (1903.04268v2)

Krzysztof Choromanski, Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang

2019-03-07

We present a new algorithm, ASEBO, for optimizing high-dimensional blackbox functions. ASEBO adapts to the geometry of the function and learns, on the fly, optimal sets of sensing directions used to probe it. It addresses the exploration-exploitation trade-off of blackbox optimization, where each single function query is expensive, by continuously learning the bias of the lower-dimensional model used to approximate gradients of smoothings of the function, using compressed sensing and contextual bandit methods. To obtain this model, it uses techniques from the emerging theory of active subspaces in the novel ES blackbox optimization context. As a result, ASEBO learns the dynamically changing intrinsic dimensionality of the gradient space and adapts to the hardness of different stages of the optimization without external supervision. Consequently, it leads to more sample-efficient blackbox optimization than state-of-the-art algorithms. We provide rigorous theoretical justification of the effectiveness of our method. We also evaluate it empirically on reinforcement learning policy optimization tasks as well as on functions from the recently open-sourced Nevergrad library, demonstrating that it consistently finds optimal inputs with fewer queries to the blackbox function than other methods.
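ASEBO builds on the standard antithetic ES gradient estimator for smoothed blackbox objectives, replacing isotropic sensing directions with directions learned from an active subspace. The baseline estimator it starts from looks roughly like this (toy objective and step size are placeholders; this is not ASEBO itself):

```python
import numpy as np

def es_gradient(f, x, sigma=0.1, n_directions=50, rng=None):
    """Antithetic evolution-strategies estimate of the gradient of a smoothed f.

    Vanilla ES estimator with isotropic Gaussian sensing directions; ASEBO
    additionally learns an active subspace to choose these directions adaptively.
    """
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal((n_directions, x.size))             # isotropic sensing directions
    diffs = np.array([f(x + sigma * e) - f(x - sigma * e) for e in eps])
    return (eps * diffs[:, None]).sum(axis=0) / (2 * sigma * n_directions)

f = lambda x: -np.sum(x ** 2)                                     # toy blackbox objective (maximize)
x = np.full(20, 3.0)
for _ in range(200):
    x += 0.1 * es_gradient(f, x)                                  # gradient ascent on the smoothed objective
print(f(x))                                                       # approaches 0
```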

Cascaded Projection: End-to-End Network Compression and Acceleration (1903.04988v1)

Breton Minnehan, Andreas Savakis

2019-03-12

We propose a data-driven approach for deep convolutional neural network compression that achieves high accuracy with high throughput and low memory requirements. Current network compression methods either find a low-rank factorization of the features that requires more memory, or select only a subset of features by pruning entire filter channels. We propose the Cascaded Projection (CaP) compression method that projects the output and input filter channels of successive layers to a unified low dimensional space based on a low-rank projection. We optimize the projection to minimize classification loss and the difference between the next layer's features in the compressed and uncompressed networks. To solve this non-convex optimization problem we propose a new optimization method of a proxy matrix using backpropagation and Stochastic Gradient Descent (SGD) with geometric constraints. Our cascaded projection approach leads to improvements in all critical areas of network compression: high accuracy, low memory consumption, low parameter count and high processing speed. The proposed CaP method demonstrates state-of-the-art results compressing VGG16 and ResNet networks with over 4x reduction in the number of computations and excellent performance in top-5 accuracy on the ImageNet dataset before and after fine-tuning.
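The generic building block behind such methods is a low-rank projection of a layer's weights, e.g. via truncated SVD, which replaces one large matrix by two thin factors; CaP's contribution is to optimize these projections end to end across successive layers rather than factorizing each layer in isolation. A minimal single-layer version for intuition (sizes are arbitrary):

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Replace a dense layer W (out x in) by two thinner factors via truncated SVD.

    Generic low-rank channel projection for intuition only; not the CaP method,
    which optimizes the projections jointly across layers with SGD.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    P = U[:, :rank] * s[:rank]          # (out x rank) factor mapping back to output channels
    Q = Vt[:rank]                       # (rank x in) projection of input channels
    return P, Q

W = np.random.default_rng(0).standard_normal((256, 512))
P, Q = low_rank_factorize(W, rank=64)
approx_error = np.linalg.norm(W - P @ Q) / np.linalg.norm(W)
params_before, params_after = W.size, P.size + Q.size
print(approx_error, params_before, params_after)    # fewer parameters, bounded reconstruction error
```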


