Loris Bazzani

Amazon (Germany)

Author Statistics

Papers

Citation

H-Index

i-10 index

Research Trends

Author Order

Document Type

Co-Authors

Vittorio Murino

Italian Institute of Technology

Marco Cristani

University of Verona

Hà Quang Minh

RIKEN Center for Advanced Intelligence Project

Michael Donoser

Amazon (Germany)

Giulia Paggetti

University of Verona

Hugo Larochelle

Google (Canada)

Erhan Gündoğdu

Amazon (Germany)

Michela Farenzena

University of Verona

Diego Tosato

Istituto Nazionale di Fisica Nucleare, Galileo Galilei Institute for Theoretical Physics

Gloria Menegaz

University of Verona

Cooperative Institutions

Italian Institute of Technology

University of Verona

Amazon (Germany)

Imperial College London

Institute of Informatics and Telematics

University of Genoa

Institut national de recherche en informatique et en automatique

Google (United States)

Australian Centre for Robotic Vision

Centre National de la Recherche Scientifique

Author Statistics

Papers

Citation

H-Index

i-10 index

Research Field

Kernel Methods on Approximate Infinite-Dimensional Covariance Operators for Image Classification

arXiv (Cornell University) (2016)

Hà Quang Minh Marco San Biagio Loris Bazzani Vittorio Murino

This paper presents a novel framework for visual object recognition using infinite-dimensional covariance operators of input features in the paradigm of kernel methods on infinite-dimensional Riemannian manifolds. Our formulation provides in particular a rich representation of image features by exploiting their non-linear correlations. Theoretically, we provide a finite-dimensional approximation of the Log-Hilbert-Schmidt (Log-HS) distance between covariance operators that is scalable to large datasets, while maintaining an effective discriminating capability. This allows us to efficiently approximate any continuous shift-invariant kernel defined using the Log-HS distance. At the same time, we prove that the Log-HS inner product between covariance operators is only approximable by its finite-dimensional counterpart in a very limited scenario. Consequently, kernels defined using the Log-HS inner product, such as polynomial kernels, are not scalable in the same way as shift-invariant kernels. Computationally, we apply the approximate Log-HS distance formulation to covariance operators of both handcrafted and convolutional features, exploiting both the expressiveness of these features and the power of the covariance representation. Empirically, we tested our framework on the task of image classification on twelve challenging datasets. In almost all cases, the results obtained outperform other state of the art methods, demonstrating the competitiveness and potential of our framework.

Kernel (algebra)

Representation

Inner product space

10.48550/arxiv.1609.09251

Cite

Citations (2)

Learning attentional policies for tracking and recognition in video with deep networks

International Conference on Machine Learning (2011)

Loris Bazzani Hugo Larochelle Vittorio Murino Jo-Anne Ting Nando de Freitas

We propose a novel attentional model for simultaneous object tracking and recognition that is driven by gaze data. Motivated by theories of the human perceptual system, the model consists of two interacting pathways: ventral and dorsal. The ventral pathway models object appearance and classification using deep (factored)-restricted Boltzmann machines. At each point in time, the observations consist of retinal images, with decaying resolution toward the periphery of the gaze. The dorsal pathway models the location, orientation, scale and speed of the attended object. The posterior distribution of these states is estimated with particle filtering. Deeper in the dorsal pathway, we encounter an attentional mechanism that learns to control gazes so as to minimize tracking uncertainty. The approach is modular (with each module easily replaceable with more sophisticated algorithms), straightforward to implement, practically efficient, and works well in simple video sequences.

Tracking (education)

Source

Cite

Citations (47)

Online Bayesian Non-parametrics for Social Group Detection

Matteo Zanotto Loris Bazzani Marco Cristani Vittorio Murino

10.5244/c.26.111

Cite

Citations (17)

Recurrent Mixture Density Network for Spatiotemporal Visual Attention

arXiv (Cornell University) (2016)

Loris Bazzani Hugo Larochelle Lorenzo Torresani

In many computer vision tasks, the relevant information to solve the problem at hand is mixed to irrelevant, distracting information. This has motivated researchers to design attentional models that can dynamically focus on parts of images or videos that are salient, e.g., by down-weighting irrelevant pixels. In this work, we propose a spatiotemporal attentional model that learns where to look in a video directly from human fixation data. We model visual attention with a mixture of Gaussians at each frame. This distribution is used to express the probability of saliency for each pixel. Time consistency in videos is modeled hierarchically by: 1) deep 3D convolutional features to represent spatial and short-term time relations and 2) a long short-term memory network on top that aggregates the clip-level representation of sequential clips and therefore expands the temporal domain from few frames to seconds. The parameters of the proposed model are optimized via maximum likelihood estimation using human fixations as training data, without knowledge of the action in each video. Our experiments on Hollywood2 show state-of-the-art performance on saliency prediction for video. We also show that our attentional model trained on Hollywood2 generalizes well to UCF101 and it can be leveraged to improve action classification accuracy on both datasets.

10.48550/arxiv.1603.08199

Cite

Citations (115)

Learning Joint Visual Semantic Matching Embeddings for Language-Guided Retrieval

Lecture notes in computer science (2020)

Yanbei Chen Loris Bazzani

Semantic Matching

Discriminative model

10.1007/978-3-030-58542-6_9

Cite

Citations (53)

A unifying framework in vector-valued reproducing kernel Hilbert spaces for manifold regularization and co-regularized multi-view learning

Journal of Machine Learning Research (2016)

Hà Quang Minh Loris Bazzani Vittorio Murino

This paper presents a general vector-valued reproducing kernel Hilbert spaces (RKHS) framework for the problem of learning an unknown functional dependency between a structured input space and a structured output space. Our formulation encompasses both Vector-valued Manifold Regularization and Co-regularized Multi-view Learning, providing in particular a unifying framework linking these two important learning approaches. In the case of the least square loss function, we provide a closed form solution, which is obtained by solving a system of linear equations. In the case of Support Vector Machine (SVM) classification, our formulation generalizes in particular both the binary Laplacian SVM to the multi-class, multi-view settings and the multi-class Simplex Cone SVM to the semi-supervised, multi-view settings. The solution is obtained by solving a single quadratic optimization problem, as in standard SVM, via the Sequential Minimal Optimization (SMO) approach. Empirical results obtained on the task of object recognition, using several challenging data sets, demonstrate the competitiveness of our algorithms compared with other state-of-the-art methods.

Representer theorem

Kernel (algebra)

Regularization

10.5555/2946645.2946670

Cite

Citations (46)

A Comparison of Multi Hypothesis Kalman Filter and Particle Filter for Multi-target Tracking

Computer Vision and Pattern Recognition (2009)

Loris Bazzani Domenico D. Bloisi Vittorio Murino

Tracking (education)

Alpha beta filter

Source

Cite

Citations (22)

Symmetry-driven accumulation of local features for human characterization and re-identification

Computer Vision and Image Understanding (2012)

Loris Bazzani Marco Cristani Vittorio Murino

Robustness

10.1016/j.cviu.2012.10.008

Cite

Citations (288)

LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts

arXiv (Cornell University) (2024)

Anh-Quan Cao Maximilian Jaritz Matthieu Guillaumin Raoul de Charette Loris Bazzani

Large-scale vision-language pre-trained (VLP) models (e.g., CLIP) are renowned for their versatility, as they can be applied to diverse applications in a zero-shot setup. However, when these models are used in specific domains, their performance often falls short due to domain gaps or the under-representation of these domains in the training data. While fine-tuning VLP models on custom datasets with human-annotated labels can address this issue, annotating even a small-scale dataset (e.g., 100k samples) can be an expensive endeavor, often requiring expert annotators if the task is complex. To address these challenges, we propose LatteCLIP, an unsupervised method for fine-tuning CLIP models on classification with known class names in custom domains, without relying on human annotations. Our method leverages Large Multimodal Models (LMMs) to generate expressive textual descriptions for both individual images and groups of images. These provide additional contextual information to guide the fine-tuning process in the custom domains. Since LMM-generated descriptions are prone to hallucination or missing details, we introduce a novel strategy to distill only the useful information and stabilize the training. Specifically, we learn rich per-class prototype representations from noisy generated texts and dual pseudo-labels. Our experiments on 10 domain-specific datasets show that LatteCLIP outperforms pre-trained zero-shot methods by an average improvement of +4.74 points in top-1 accuracy and other state-of-the-art unsupervised methods by +3.45 points.

10.48550/arxiv.2410.08211

Cite

Citations (0)

SDALF: Modeling Human Appearance with Symmetry-Driven Accumulation of Local Features

Advances in computer vision and pattern recognition (2014)

Loris Bazzani Marco Cristani Vittorio Murino

Identification

10.1007/978-1-4471-6296-4_3

Cite

Citations (28)