Author: Khajuria, Tarun
Supervisor: Dias, Braian Olmiro
Institution: Tartu Ülikool, Loodus- ja täppisteaduste valdkond; Tartu Ülikool, Arvutiteaduse instituut
Date: 2023 (accessioned 2023-10-18; available 2023-10-18)
URI: https://hdl.handle.net/10062/93580
Title: Content based analysis of compositionality in Vision Transformers
Type: Thesis (magistritöö / master's thesis)
Language: English (eng)
Rights: openAccess; Attribution-NonCommercial-NoDerivatives 4.0 International
Keywords: Transformers; Vision transformer; compositionality; computer vision; neural networks; image captioning models; informaatika (informatics); infotehnoloogia (information technology)

Abstract:
Neural network models have achieved state-of-the-art results in various vision and language tasks, yet questions remain about their logical reasoning capabilities. In particular, it is not clear whether these models can reason beyond analogy. For example, an image captioning model can either learn to correlate a scene representation with a caption in text space, or it can learn to bind objects explicitly and then utilise the explicit composition of their individual representations. The inability of models to do the latter has been linked to their failure to generalise to wider scenarios in various tasks. Transformer-based models have achieved high performance in many language and vision tasks, and their success has been credited to their ability to model long-range relations within sequences. In vision transformers, it has been argued that the use of patches as tokens, together with the interactions between them, lets the model flexibly bind and model compositional relations between objects at different distances, thereby exhibiting aspects of explicit compositional ability. In this thesis, we perform experiments on the Transformer-based (ViT) vision encoder of an image captioning model. In particular, we probe the internal representations of the encoder at various layers to examine whether a single token captures the representation of 1) an object, 2) related objects in the scene, and 3) the composition of two objects in the scene. Our results show some evidence hinting that object properties are bound into a single token as the image is processed by the transformer.
Further, this work provides a set of methods for creating and setting up a dataset to study internal compositionality in Vision Transformer models, and suggests future lines of study to expand this analysis.
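The layer-wise probing described in the abstract can be sketched as a linear probe trained on per-token embeddings. The snippet below is a minimal illustration only, not the thesis's actual pipeline: synthetic 768-dimensional vectors stand in for ViT token representations (in practice these would come from an encoder's hidden states at a chosen layer and token index), and the two classes stand in for two object labels. All names and dimensions here are assumptions for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in for per-token embeddings taken from one
# encoder layer (e.g. hidden_states[layer][:, token_index, :]).
# Two object classes, 100 synthetic examples each, 768-d features.
d_model, n_per_class = 768, 100
class_means = rng.normal(size=(2, d_model))
X = np.vstack(
    [m + 0.5 * rng.normal(size=(n_per_class, d_model)) for m in class_means]
)
y = np.repeat([0, 1], n_per_class)

# Linear probe: if a single token's embedding linearly encodes the
# object's identity, a logistic-regression classifier fit on that
# embedding should score well above chance (0.5 for two classes).
probe = LogisticRegression(max_iter=1000).fit(X, y)
accuracy = probe.score(X, y)
print(accuracy)
```

In a real probing study the probe would be evaluated on held-out images, and the accuracy compared across layers to see where (if anywhere) object information becomes linearly decodable from a single token.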