Journal Information
Published by: ArXiv
Total articles ≈ 1,375,082

Latest articles in this journal

Medhini Narasimhan, Licheng Yu, Sean Bell, Ning Zhang, Trevor Darrell
Published: 23 March 2023
by ArXiv
Journal: ArXiv
Abstract:
Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling). Compared to prior work which learns step representations locally, our approach involves learning them globally, leveraging video of the entire surrounding task as context. From these learned representations, we can verify if an unseen video correctly executes a given task, as well as forecast which steps are likely to be taken after a given step. We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order. We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. Our method outperforms previous baselines on these tasks, and we believe the tasks will be a valuable way for the community to measure the quality of step representations. Additionally, we evaluate VideoTaskformer on three existing benchmarks (procedural activity recognition, step classification, and step forecasting) and demonstrate on each that our method outperforms existing baselines and achieves new state-of-the-art performance.
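The masked step modeling objective above lends itself to a short sketch. The snippet below is an illustrative reading of that objective, assuming per-step video features and a fixed vocabulary of weakly supervised step labels; the module names, sizes, and masking ratio are placeholders, not the authors' released implementation.

```python
# Illustrative sketch of masked step modeling: randomly mask step features
# from a whole instructional video and predict their step labels from the
# remaining (global) context. All module names and sizes are assumptions.
import torch
import torch.nn as nn

class MaskedStepModel(nn.Module):
    def __init__(self, feat_dim=512, num_step_labels=1000, num_layers=4):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(feat_dim, num_step_labels)

    def forward(self, step_feats, mask):
        # step_feats: (B, num_steps, feat_dim) per-step video features
        # mask: (B, num_steps) boolean, True where a step is masked out
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(step_feats), step_feats)
        ctx = self.encoder(x)            # global context over the whole task
        return self.classifier(ctx)      # (B, num_steps, num_step_labels)

def masked_step_loss(model, step_feats, step_labels, mask_ratio=0.3):
    mask = torch.rand(step_feats.shape[:2]) < mask_ratio
    logits = model(step_feats, mask)
    # supervise only the masked positions with their weak textual labels
    return nn.functional.cross_entropy(logits[mask], step_labels[mask])
```

The point mirrored from the abstract is that the transformer attends over all steps of the video, so each masked step is classified from global task context rather than from its local clip alone.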
Relja Arandjelović, Alex Andonian, Arthur Mensch, Olivier J. Hénaff, Jean-Baptiste Alayrac, Andrew Zisserman
Published: 23 March 2023
by ArXiv
Journal: ArXiv
Abstract:
The core problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining, and struggles to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings which prevents overfitting to a small number of classes seen during training, while simultaneously saving memory and computation. Secondly, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourages vision-text feature alignment and guarantees it at the start of detection training. Finally, a self-training approach is used to leverage a larger corpus of image-text pairs thus improving detection performance on classes with no human annotated bounding boxes. Our three methods are evaluated on the zero-shot version of the LVIS benchmark, each of them showing clear and significant benefits. Our final network achieves the new state-of-the-art on the mAP-all metric and demonstrates competitive performance for mAP-rare, as well as superior transfer to COCO and Objects365.
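The trainable gated shortcut mentioned above can be illustrated with a minimal sketch: a new branch scaled by a learnable gate initialised to zero, so the block acts as an identity at the start of detection training and the pretrained vision-text alignment is preserved. The layer choices below are assumptions, not the paper's architecture.

```python
# Sketch of a gated shortcut: a learned scalar gate, initialised to zero,
# scales the new trainable branch so the block initially acts as identity
# and the pretrained alignment is untouched at the start of training.
import torch
import torch.nn as nn

class GatedShortcutBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.new_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> identity at init

    def forward(self, x):
        return x + torch.tanh(self.gate) * self.new_branch(x)
```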
Ioannis Argyriou, Craig Lage, George H. Rieke, Danny Gasman, Jeroen Bouwman, Jane Morrison, Mattia Libralato, Daniel Dicken, Bernhard R. Brandl, Javier Álvarez-Márquez, et al.
Published: 23 March 2023
by ArXiv
Journal: ArXiv
Abstract:
The Mid-Infrared Instrument (MIRI) on board the James Webb Space Telescope (JWST) uses three Si:As impurity band conduction (IBC) detector arrays. The output voltage level of each MIRI detector pixel is digitally recorded by sampling-up-the-ramp. For uniform or low-contrast illumination, the pixel ramps become non-linear in a predictable way, but in areas of high contrast, the non-linearity curve becomes much more complex. We provide observational evidence of the Brighter-Fatter Effect (BFE) in MIRI conventional and high-contrast coronagraphic imaging, low-resolution spectroscopy, and medium-resolution spectroscopy data and investigate the physical mechanism that gives rise to the effect on the detector pixel raw voltage integration ramps. We use public data from the JWST/MIRI commissioning and Cycle 1 phase. We also develop a numerical electrostatic model of the MIRI detectors using a modified version of the public Poisson_CCD code. The physical mechanism behind the MIRI BFE is fundamentally different from that of CCDs and Hawaii-2RG (H2RG) detectors. This is because the vast majority of the MIRI photo-excited electric charges are stored not at the pixels but at the input to the MIRI detector unit cell buffer amplifier capacitance. The resulting detector voltage debiasing alters the electrostatic potential inside the infrared-active layer and causes new photo-excited electrons, generated at a bright pixel, to be guided to the neighboring fainter pixels. Observationally, the debiasing-induced BFE makes the JWST MIRI data yield 10-25% larger and 0.5-8% brighter point sources and spectral line profiles as a function of the output level covered by the detector pixels. We find that the profile of the shrinking detector depletion region has implications for developing a pixel ramp non-linearity model for point sources observed with MIRI.
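As a purely qualitative illustration of the debiasing mechanism described above (and emphatically not the paper's Poisson_CCD electrostatic model), the toy ramp simulation below shows how charge accumulating at a bright pixel's integrating node can both bend its ramp and divert part of the newly generated signal to neighboring pixels; the flux, full-well scale, and spill fraction are invented numbers.

```python
# Toy illustration of debiasing-driven brighter-fatter behaviour on a 1-D
# strip of pixels: as charge builds up, the bright pixel's ramp becomes
# sub-linear and a growing fraction of new electrons reaches its neighbours.
import numpy as np

n_frames, n_pix = 50, 5
flux = np.array([0.0, 0.0, 1000.0, 0.0, 0.0])   # e-/frame, bright centre pixel
full_well = 30_000.0                             # assumed debiasing scale (e-)
charge = np.zeros(n_pix)
ramps = np.zeros((n_frames, n_pix))

for t in range(n_frames):
    debias = charge / full_well                  # 0 (fresh) .. 1 (fully debiased)
    spilled = flux * debias * 0.1                # fraction guided to neighbours
    collected = flux - spilled
    charge += collected
    charge[:-1] += 0.5 * spilled[1:]             # share spill with left neighbour
    charge[1:] += 0.5 * spilled[:-1]             # and with right neighbour
    ramps[t] = charge                            # sampled-up-the-ramp signal

# The central ramp flattens at high output levels while neighbours brighten,
# i.e. the point source appears "fatter" later in the integration.
print(ramps[-1])
```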
Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, Jun-Yan Zhu
Published: 23 March 2023
by ArXiv
Journal: ArXiv
Abstract:
Large-scale text-to-image diffusion models can generate high-fidelity images with powerful compositional ability. However, these models are typically trained on an enormous amount of Internet data, often containing copyrighted material, licensed images, and personal photos. Furthermore, they have been found to replicate the style of various living artists or memorize exact training samples. How can we remove such copyrighted concepts or images without retraining the model from scratch? To achieve this goal, we propose an efficient method of ablating concepts in the pretrained model, i.e., preventing the generation of a target concept. Our algorithm learns to match the image distribution for a target style, instance, or text prompt we wish to ablate to the distribution corresponding to an anchor concept. This prevents the model from generating target concepts given its text condition. Extensive experiments show that our method can successfully prevent the generation of the ablated concept while preserving closely related concepts in the model.
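A minimal sketch of the distribution-matching idea described in the abstract: fine-tune the model so that its denoising prediction conditioned on the target concept's prompt matches what a frozen copy predicts for an anchor prompt. All names here (unet, frozen_unet, encode_text) and the example prompts are placeholders, not the authors' released code or exact loss.

```python
# Sketch of concept ablation by matching the target-conditioned prediction
# of the fine-tuned model to the anchor-conditioned prediction of a frozen
# copy. unet / frozen_unet / encode_text are placeholder callables.
import torch
import torch.nn.functional as F

def ablation_step(unet, frozen_unet, encode_text, noisy_latents, timesteps,
                  target_prompt="painting in the style of a living artist",
                  anchor_prompt="painting"):
    target_emb = encode_text(target_prompt)
    anchor_emb = encode_text(anchor_prompt)
    with torch.no_grad():
        # what the original (frozen) model would predict for the anchor concept
        anchor_pred = frozen_unet(noisy_latents, timesteps, anchor_emb)
    # push the fine-tuned model's target-conditioned prediction toward it
    target_pred = unet(noisy_latents, timesteps, target_emb)
    return F.mse_loss(target_pred, anchor_pred)
```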
Lucy Chai, Richard Tucker, Zhengqi Li, Phillip Isola, Noah Snavely
Published: 23 March 2023
by ArXiv
Journal: ArXiv
Abstract:
Despite increasingly realistic image quality, recent 3D image generative models often operate on 3D volumes of fixed extent with limited camera motions. We investigate the task of unconditionally synthesizing unbounded nature scenes, enabling arbitrarily large camera motion while maintaining a persistent 3D world model. Our scene representation consists of an extendable, planar scene layout grid, which can be rendered from arbitrary camera poses via a 3D decoder and volume rendering, and a panoramic skydome. Based on this representation, we learn a generative world model solely from single-view internet photos. Our method enables simulating long flights through 3D landscapes, while maintaining global scene consistency; for instance, returning to the starting point yields the same view of the scene. Our approach enables scene extrapolation beyond the fixed bounds of current 3D generative models, while also supporting a persistent, camera-independent world representation that stands in contrast to auto-regressive 3D prediction models. Our project page: https://chail.github.io/persistent-nature/.
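The persistent, extendable layout grid can be illustrated with a small sketch: latent tiles are created lazily, keyed by their ground-plane cell, so revisiting a location retrieves exactly the same scene content. The tile generator below is a stand-in (seeded random features), not the paper's generative model or its 3D decoder.

```python
# Sketch of a persistent, extendable planar layout grid: tiles are created
# on demand and cached by ground-plane cell, so the world stays consistent
# however far the camera flies and wherever it returns.
import torch

class PersistentLayoutGrid:
    def __init__(self, tile_resolution=32, feat_dim=64, cell_size=10.0, seed=0):
        self.tiles = {}                        # (i, j) -> latent feature tile
        self.tile_resolution = tile_resolution
        self.feat_dim = feat_dim
        self.cell_size = cell_size
        self.generator = torch.Generator().manual_seed(seed)

    def tile_at(self, x, z):
        key = (int(x // self.cell_size), int(z // self.cell_size))
        if key not in self.tiles:
            # stand-in for the generative model that fills new layout cells
            self.tiles[key] = torch.randn(
                self.feat_dim, self.tile_resolution, self.tile_resolution,
                generator=self.generator)
        return self.tiles[key]

grid = PersistentLayoutGrid()
a = grid.tile_at(3.0, 95.0)      # fly far from the origin ...
b = grid.tile_at(3.0, 95.0)      # ... and back: identical content
assert torch.equal(a, b)
```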
Mehmet Aygün, Oisin Mac Aodha
Published: 23 March 2023
by ArXiv
Journal: ArXiv
Abstract:
We introduce SAOR, a novel approach for estimating the 3D shape, texture, and viewpoint of an articulated object from a single image captured in the wild. Unlike prior approaches that rely on pre-defined category-specific 3D templates or tailored 3D skeletons, SAOR learns to articulate shapes from single-view image collections with a skeleton-free part-based model without requiring any 3D object shape priors. To prevent ill-posed solutions, we propose a cross-instance consistency loss that exploits disentangled object shape deformation and articulation. This is aided by a new silhouette-based sampling mechanism that enhances viewpoint diversity during training. Our method only requires estimated object silhouettes and relative depth maps from off-the-shelf pre-trained networks during training. At inference time, given a single-view image, it efficiently outputs an explicit mesh representation. We obtain improved qualitative and quantitative results on challenging quadruped animals compared to relevant existing work.
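One way a cross-instance consistency term of this kind could look is sketched below: swap the disentangled articulation codes between two instances of the same category, re-encode the resulting renderings, and require the swapped codes to be recovered. The encoder, decoder, and renderer are placeholders, and this is a generic swap-consistency pattern rather than SAOR's exact formulation.

```python
# Generic swap-consistency sketch over disentangled shape and articulation
# codes; encoder / decoder / renderer are placeholder callables.
import torch
import torch.nn.functional as F

def cross_instance_consistency(encoder, decoder, renderer, img_a, img_b):
    shape_a, artic_a = encoder(img_a)
    shape_b, artic_b = encoder(img_b)
    # articulate each instance's shape with the *other* instance's pose
    swapped_a = renderer(decoder(shape_a, artic_b))
    swapped_b = renderer(decoder(shape_b, artic_a))
    _, artic_b_rec = encoder(swapped_a)
    _, artic_a_rec = encoder(swapped_b)
    # the swapped articulation should survive the render-and-re-encode cycle
    return F.mse_loss(artic_b_rec, artic_b) + F.mse_loss(artic_a_rec, artic_a)
```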
Pei-Xin Shen, Vivien Perrin, Mircea Trif, Pascal Simon
Published: 23 March 2023
by ArXiv
Journal: ArXiv
Abstract:
A chain of magnetic impurities deposited on the surface of a superconductor can form a topological Shiba band that supports Majorana zero modes and holds promise for topological quantum computing. Yet, most experiments scrutinizing these zero modes rely on transport measurements, which only capture local properties. Here we propose to leverage the intrinsic dynamics of the magnetic impurities to access their non-local character. We use linear response theory to determine the dynamics of the uniform magnonic mode in the presence of external $ac$ magnetic fields and the coupling to the Shiba electrons. We demonstrate that this mode, which spreads over the entire chain of atoms, becomes imprinted with the parity of the ground state and, moreover, can discriminate between Majorana and trivial zero modes located at the end of the chain. Our approach offers a non-invasive alternative to the scanning tunnelling microscopy techniques used to probe Majorana zero modes. Conversely, the magnons could facilitate the manipulation of Majorana zero modes in topological Shiba chains.
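For readers less familiar with the linear-response language used above, the central object is the dynamical susceptibility of the uniform ($q=0$) magnon mode; a standard Kubo-formula expression (generic, not the paper's specific result) reads

$$\chi_{+-}(\omega) = -\frac{i}{\hbar}\int_{0}^{\infty} dt\, e^{i(\omega + i0^{+})t}\,\big\langle\big[S^{+}_{q=0}(t),\,S^{-}_{q=0}(0)\big]\big\rangle, \qquad S^{\pm}_{q=0}=\sum_{i} S^{\pm}_{i},$$

and the abstract's claim is that, once the coupling to the Shiba electrons enters this average, the resonance position and linewidth of the uniform mode become sensitive to the ground-state parity and to whether the end modes are Majorana or trivial.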
Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Sharada Mohanty, Byron Galbraith, Ke Chen, Yan Song, Tianze Zhou, et al.
Published: 23 March 2023
by ArXiv
Journal: ArXiv
Abstract:
To facilitate research in the direction of fine-tuning foundation models from human feedback, we held the MineRL BASALT Competition on Fine-Tuning from Human Feedback at NeurIPS 2022. The BASALT challenge asks teams to compete to develop algorithms to solve tasks with hard-to-specify reward functions in Minecraft. Through this competition, we aimed to promote the development of algorithms that use human feedback as a channel for learning the desired behavior. We describe the competition and provide an overview of the top solutions. We conclude by discussing the impact of the competition and future directions for improvement.
Zhanghan Ke, Yuhao Liu, Lei Zhu, Nanxuan Zhao, Rynson W. H. Lau
Published: 23 March 2023
by ArXiv
Journal: ArXiv
Abstract:
In this paper, we present a Neural Preset technique to address the limitations of existing color style transfer methods, including visual artifacts, vast memory requirement, and slow style switching speed. Our method is based on two core designs. First, we propose Deterministic Neural Color Mapping (DNCM) to consistently operate on each pixel via an image-adaptive color mapping matrix, avoiding artifacts and supporting high-resolution inputs with a small memory footprint. Second, we develop a two-stage pipeline by dividing the task into color normalization and stylization, which allows efficient style switching by extracting color styles as presets and reusing them on normalized input images. Due to the unavailability of pairwise datasets, we describe how to train Neural Preset via a self-supervised strategy. Various advantages of Neural Preset over existing methods are demonstrated through comprehensive evaluations. Besides, we show that our trained model can naturally support multiple applications without fine-tuning, including low-light image enhancement, underwater image correction, image dehazing, and image harmonization.
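A simplified sketch of the image-adaptive color mapping idea: a small encoder looks at a downsampled copy of the image and predicts a single color transform that is then applied identically to every full-resolution pixel, which is why no spatial artifacts can appear and why the memory footprint stays small. For brevity the sketch uses a plain 3x3 matrix plus bias rather than the paper's DNCM parameterization, and the encoder architecture is an assumption.

```python
# Sketch of an image-adaptive, per-pixel-identical colour mapping: the
# transform is predicted from a thumbnail and applied at full resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColorMappingSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 12),                 # 3x3 matrix + 3-vector bias
        )

    def forward(self, image):
        # predict the mapping from a small thumbnail, apply it at full resolution
        thumb = F.interpolate(image, size=256, mode='bilinear', align_corners=False)
        params = self.encoder(thumb)
        matrix = params[:, :9].view(-1, 3, 3)
        bias = params[:, 9:].view(-1, 3, 1)
        b, c, h, w = image.shape
        flat = image.view(b, 3, h * w)         # every pixel gets the same mapping
        return (matrix @ flat + bias).view(b, 3, h, w)
```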
Runsen Xu, Tai Wang, Wenwei Zhang, Runjian Chen, Jinkun Cao, Jiangmiao Pang, Dahua Lin
Published: 23 March 2023
by ArXiv
Journal: ArXiv
Abstract:
This paper introduces the Masked Voxel Jigsaw and Reconstruction (MV-JAR) method for LiDAR-based self-supervised pre-training and a carefully designed data-efficient 3D object detection benchmark on the Waymo dataset. Inspired by the scene-voxel-point hierarchy in downstream 3D object detectors, we design masking and reconstruction strategies accounting for voxel distributions in the scene and local point distributions within the voxel. We employ a Reversed-Furthest-Voxel-Sampling strategy to address the uneven distribution of LiDAR points and propose MV-JAR, which combines two techniques for modeling the aforementioned distributions, resulting in superior performance. Our experiments reveal limitations in previous data-efficient experiments, which uniformly sample fine-tuning splits with varying data proportions from each LiDAR sequence, leading to similar data diversity across splits. To address this, we propose a new benchmark that samples scene sequences for diverse fine-tuning splits, ensuring adequate model convergence and providing a more accurate evaluation of pre-training methods. Experiments on our Waymo benchmark and the KITTI dataset demonstrate that MV-JAR consistently and significantly improves 3D detection performance across various data scales, achieving up to a 6.3% increase in mAPH compared to training from scratch. Code and the benchmark will be available at https://github.com/SmartBot-PJLab/MV-JAR.
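The Reversed-Furthest-Voxel-Sampling strategy builds on farthest-point-style sampling over voxel positions; the toy sketch below shows only that underlying primitive (the actual reversal and masking logic in MV-JAR is more involved, and the voxel coordinates here are synthetic).

```python
# Farthest-point-style sampling over voxel centres: iteratively pick the
# voxel farthest from everything selected so far, giving a spatially
# spread-out subset of voxels (e.g. candidates for masking).
import numpy as np

def farthest_voxel_sampling(voxel_centers, num_samples, seed=0):
    rng = np.random.default_rng(seed)
    n = len(voxel_centers)
    selected = [rng.integers(n)]
    dist = np.full(n, np.inf)
    for _ in range(num_samples - 1):
        # distance of every voxel to its nearest already-selected voxel
        dist = np.minimum(dist, np.linalg.norm(
            voxel_centers - voxel_centers[selected[-1]], axis=1))
        selected.append(int(dist.argmax()))
    return np.array(selected)

centers = np.random.rand(1000, 3) * 50.0      # synthetic voxel centres (metres)
idx = farthest_voxel_sampling(centers, 64)    # 64 spatially spread-out voxels
```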