Masterarbeit, 2022
39 Seiten, Note: 7.50
1 Chapter 1
1.1 Introduction
1.2 Purpose Statement
1.3 Approach
1.3.1 Natural Language Processing (NLP)
1.3.2 Computer Vision (CV)
2 Chapter 2
2.1 Transformer
2.2 Transformer - Building Blocks
2.2.1 Transformer - Workflow
2.2.2 Transformers - Digest
2.3 Vision Transformer (ViT)
2.3.1 Key Ideas
2.4 ViT in CNN Realm
2.4.1 ViT - State of the Art (SOTA)
2.4.2 ViT and CNN: A Shared Vision?
3 Chapter 3
3.1 Perspectives for Transformers and ViTs
3.2 Selected Learning Paradigms
3.2.1 Model Soups - Ensemble Learning
3.2.2 Multimodal Learning
3.2.3 Self-Supervised Learning
3.2.4 Other Approaches and Open Question
3.3 Beyond Transformers?
3.4 Personal Path Of Exploration
3.5 Conclusion
The primary objective of this thesis is to identify, analyze, and extract the key elements that enable Transformer-based architectures to transition successfully from natural language processing to the domain of computer vision, and to evaluate how Vision Transformers (ViTs) compete with traditional convolutional neural networks.
2.4.2 ViT and CNN: A Shared Vision?
As mentioned earlier in this work, CNN-based models are the state of the art for computer vision tasks. In this section, we explore the commonalities, or the lack thereof, between CNNs and ViTs.
Tuli et al. (2021) ask which neural architecture, CNN-based or Transformer-based, is closer to the human vision process.
Their rationale is the following: in a classification context, there is one way to be right but many ways to be wrong. On this basis, they measure the error consistency of two categories of models: Transformer-based, i.e. ViT (ViT-B/16, ViT-B/32, ViT-L/16 and ViT-L/32), and CNN-based, i.e. BiT-M-R50x1. They run their tests (Cohen's κ and Jensen-Shannon distance) on the Stylized ImageNet (SIN) dataset (Geirhos et al. 2019).
The results indicate that the errors ViTs make are more consistent with human errors than those of the CNN-based approach.
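To make the error-consistency idea concrete, the following is a minimal sketch of Cohen's κ computed over two models' binary error vectors. This is an illustration of the general measure, not the exact evaluation protocol of Tuli et al.; the toy error vectors are invented for demonstration.

```python
import numpy as np

def error_consistency_kappa(errors_a, errors_b):
    """Cohen's kappa between two binary error vectors (1 = misclassified).

    Measures how much two models agree on *which* samples they get wrong,
    beyond the agreement expected by chance from their error rates alone.
    """
    a = np.asarray(errors_a, dtype=bool)
    b = np.asarray(errors_b, dtype=bool)
    # Observed agreement: fraction of samples where both are right or both wrong.
    p_obs = np.mean(a == b)
    # Expected agreement under independence, from the two marginal error rates.
    ea, eb = a.mean(), b.mean()
    p_exp = ea * eb + (1 - ea) * (1 - eb)
    return (p_obs - p_exp) / (1 - p_exp)

# Toy example: two models evaluated on the same 8 samples.
model_1 = [1, 1, 0, 0, 0, 1, 0, 0]  # errors of model 1
model_2 = [1, 1, 0, 0, 0, 0, 0, 0]  # errors of model 2
print(round(error_consistency_kappa(model_1, model_2), 3))  # → 0.714
```

A κ of 1 means the two error patterns coincide exactly; a κ near 0 means the overlap is what chance alone would predict, which is the baseline against which human-vs-model consistency is judged.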
Another possible point of comparison concerns model robustness in adversarial settings.
In a related work, Benz et al. (2021) conduct an empirical study of image classification robustness across three architecture families: CNNs, ViTs and MLP-Mixer. We are concerned with the first two. The attacks considered are:
Robustness against white-box attacks
Robustness against black-box attacks
– Query-based black-box attacks
– Transfer-based black-box attacks
Robustness against common corruptions
Robustness against Universal Adversarial Perturbations (UAPs)
By applying high-pass and low-pass filtering, they propose a frequency analysis to explain why CNNs are less robust: CNN models rely more on high-frequency features, whereas ViTs depend on low-frequency ones. This reliance on low-frequency content is one element behind the ViTs' increased robustness.
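A minimal sketch of the kind of filtering such an analysis relies on: a circular mask in the centred Fourier domain keeps either the low or the high frequencies of a grayscale image. This is an assumed, simplified setup for illustration, not the specific pipeline used by Benz et al.

```python
import numpy as np

def frequency_filter(img, radius, mode="low"):
    """Keep only low- or high-frequency content of a grayscale image.

    A circular mask of the given radius is applied in the centred Fourier
    domain: "low" keeps frequencies inside the circle, "high" those outside.
    """
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    mask = dist <= radius if mode == "low" else dist > radius
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

# Sanity check: a constant image is pure low frequency (DC only), so a
# low-pass keeps it intact while a high-pass removes almost everything.
img = np.ones((32, 32))
low = frequency_filter(img, radius=4, mode="low")
high = frequency_filter(img, radius=4, mode="high")
```

Feeding low-passed versus high-passed inputs to a trained classifier and comparing the accuracy drop is the basic probe for which frequency band a model depends on.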
Chapter 1: Provides an introduction to the research context, establishing the role of Transformers in both NLP and the evolving landscape of computer vision.
Chapter 2: Details the core Transformer architecture, its building blocks like self-attention, and specifically examines the adaptation of Vision Transformers (ViTs) and their performance relative to CNNs.
Chapter 3: Offers reflections on future extensions, discusses machine learning paradigms like self-supervised learning, and explores the theoretical limits and potential of Transformer-based designs.
Transformer, Computer Vision, Vision Transformer, ViT, Self-Attention, Convolutional Neural Networks, CNN, Deep Learning, Natural Language Processing, Inductive Bias, Multi-Head Attention, Foundational Models, Artificial Intelligence, Model Robustness, Self-Supervised Learning
The thesis investigates the underlying mechanisms that allow the Transformer neural architecture to effectively transfer its success from natural language processing to computer vision.
The main themes include the mechanics of self-attention, the architectural transition from sequence data to image patches, the comparative performance of ViTs versus CNNs, and the broader implications of inductive bias.
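The self-attention mechanics mentioned above can be sketched in a few lines: single-head scaled dot-product attention over a token sequence, which is what gives the Transformer its global receptive field. The matrix shapes and random inputs here are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) input tokens; w_q/w_k/w_v: projection matrices.
    Every position attends to every other position in one step.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # 4 tokens, d_model = 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # → (4, 8)
```

Multi-head attention simply runs several such heads in parallel on lower-dimensional projections and concatenates their outputs.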
The goal is to identify and extract the "fuel"—the specific components or architectural features—that enables Vision Transformers to compete with or exceed established convolutional neural networks.
The study utilizes a backward analysis approach, examining the origins of Transformer architectures in NLP and systematically reviewing influential research papers to assess their application and effectiveness in vision tasks.
The main body deconstructs the Transformer architecture, details the workflow of Vision Transformers (including, e.g., patch division), and analyzes empirical benchmarks comparing them against traditional CNN-based paradigms.
The research is best characterized by terms such as Transformer, Vision Transformer, self-attention, computer vision, CNN, inductive bias, and model robustness.
Unlike CNNs, which use convolutional filters with strong inductive spatial bias, ViTs divide images into non-overlapping patches, flattening them into sequences to leverage the Transformer's self-attention mechanism for capturing global dependencies.
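The patch-and-flatten step described above can be sketched directly; the example below assumes ViT-B/16-style tokenisation (224×224 RGB input, 16×16 patches) and stops before the learned linear projection and positional embeddings.

```python
import numpy as np

def image_to_patch_sequence(img, patch_size):
    """Split an image into non-overlapping patches and flatten each one.

    img: (H, W, C) array; returns (num_patches, patch_size**2 * C), the
    raw token sequence a ViT feeds to the Transformer encoder.
    """
    h, w, c = img.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image must divide evenly into patches"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    patches = img.reshape(h // p, p, w // p, p, c).swapaxes(1, 2)
    return patches.reshape(-1, p * p * c)

img = np.zeros((224, 224, 3))
seq = image_to_patch_sequence(img, patch_size=16)
print(seq.shape)  # → (196, 768)
```

Each 16×16×3 patch becomes one 768-dimensional token, so a 224×224 image yields a sequence of 14×14 = 196 tokens, to which the class token and positional embeddings are then added.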
The author discusses Turing-completeness to argue that the flexibility of the attention mechanism allows Transformers to serve as a general-purpose computational primitive capable of expressing any desired algorithm.
The thesis highlights that ViTs show higher robustness against certain adversarial attacks than CNNs, likely because ViTs are more receptive to low-frequency features, whereas CNNs rely more heavily on high-frequency information.
The author explores a novel visualization approach using binary analysis to interpret the internal structure of models like GPT-2, suggesting that such representations could offer new insights into how these architectures embody themselves in high-dimensional space.

