Last modified: 2024-12-11

HTB - Like-A-Glove Writeup by McShooty

Challenge Description

Words carry semantic information, and similar to how people can infer meaning based on a word's context, AI can derive representations for words based on their context too. However, the kinds of meaning that a model uses may not match our own. In this challenge, we've encountered a pair of AIs communicating in metaphors that are challenging to decode! The embedding model used is GloVe (Global Vectors for Word Representation), specifically the glove-twitter-25 variant.

Key Points:

- The AI model used in this challenge is glove-twitter-25. To reverse-engineer the flag, we will use the same model.

Analyzing the Provided File

Input File: `chal.txt`

The file contains several lines formatted as follows:

Like <word1> is to <word2>, <word3> is to?

Here are a few examples from the file:

Like non-mainstream is to efl, battery-powered is to?
Like sycophancy is to بالشهادة, cont is to?
Like беспощадно is to indépendance, rs is to?
Like ajaajjajaja is to hahahahahahahahaahah, ２ is to?
...
Like raving is to سگن, happy is to?

Considering that there are two AIs communicating with each other, we can infer that they are exchanging the flag in this peculiar manner.

To tackle this challenge, it helps to picture the model's vocabulary as points on a two-dimensional word plane, even though the actual embeddings live in a higher-dimensional space (25 dimensions for glove-twitter-25):

[Figure: Word Plane Visualization]

Each of these points can be treated as a vector. For example, "hackthebox" might sit at the point (1.36, 2.48). Our goal is to find the vector of the word that replaces the question mark in a sentence such as:

Like non-mainstream is to efl, battery-powered is to?

Visualizing the Analogy

We can visualize the relationship between the words:

[Figure: Analogy Visualization]

By taking the offset between the first two word vectors and applying it to the third, we can identify the word that relates to the third word in the same way the second relates to the first.

Mathematical Calculations

To express this mathematically, let $\vec{x}$ denote the vector of the word we are looking for (word4). The relationship is then:

$$ \vec{x} \approx \vec{\text{word2}} - \vec{\text{word1}} + \vec{\text{word3}} $$

This formula says that we find $\vec{x}$ by applying the transformation from word1 to word2 onto word3.
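To make the arithmetic concrete, here is a minimal sketch in plain Python. The two-dimensional vectors and the words in `toy_vectors` are invented for illustration and are not taken from glove-twitter-25; `analogy_target` is a hypothetical helper name.

```python
# Toy 2-D "embeddings"; the numbers are made up for illustration only.
toy_vectors = {
    "man": (1.0, 1.0),
    "woman": (1.0, 3.0),
    "king": (4.0, 1.0),
    "queen": (4.0, 3.0),
}


def analogy_target(vectors, word1, word2, word3):
    """Return word2 - word1 + word3, computed component by component."""
    v1, v2, v3 = vectors[word1], vectors[word2], vectors[word3]
    return tuple(b - a + c for a, b, c in zip(v1, v2, v3))


# "Like man is to woman, king is to?"
target = analogy_target(toy_vectors, "man", "woman", "king")
print(target)  # (4.0, 3.0), which is the toy vector for "queen"
```

In a real embedding the result rarely lands exactly on a vocabulary word, which is why the formula uses an approximation sign and a nearest-neighbour search is still needed.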

Candidate words are then ranked by their cosine similarity to $\vec{x}$, which is defined as:

$$ \text{cosine\_similarity}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| |\vec{b}|} $$

where $\cdot$ denotes the dot product, and $|\vec{a}|$ and $|\vec{b}|$ are the magnitudes (or norms) of the vectors. The word whose vector has the highest cosine similarity to $\vec{x}$ is taken as the hidden word.
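The formula translates almost directly into code. The following is a small sketch using only the standard library; the function name `cosine_similarity` mirrors the formula and is not a Gensim API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Raises ZeroDivisionError if either vector has zero length.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Perpendicular vectors share no direction at all.
print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))  # 0.0
# Vectors pointing the same way score close to 1, whatever their lengths.
print(cosine_similarity((1.0, 2.0), (2.0, 4.0)))
```

Because the result depends only on direction, a long vector and a short vector pointing the same way count as maximally similar.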

Utilizing Existing Tools

Fortunately, there is no need to do these calculations by hand. Gensim, a Python library, can download the glove-twitter-25 model and run similarity queries against it.

The Script

Here's how the script is structured:

Import the Libraries:

import gensim.downloader as api
import re

Load the Model:

def load_model(model_name='glove-twitter-25'):
    model = api.load(model_name)
    return model

Retrieve the Word Vector:

def get_word_vector(model, word):
    try:
        vector = model[word]
        return vector
    except KeyError:
        return None

Extract Words Using Regular Expressions:

match = re.match(r"Like (.+?) is to (.+?), (.+?) is to\?", line.strip())
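As a quick check that the pattern copes with the non-ASCII words in `chal.txt`, here is a standalone sketch. The names `ANALOGY_RE` and `parse_line` are hypothetical and only used for this illustration.

```python
import re

# Same pattern as above, compiled once for reuse.
ANALOGY_RE = re.compile(r"Like (.+?) is to (.+?), (.+?) is to\?")


def parse_line(line):
    """Return (word1, word2, word3), or None if the line does not match."""
    match = ANALOGY_RE.match(line.strip())
    return match.groups() if match else None


print(parse_line("Like non-mainstream is to efl, battery-powered is to?"))
# ('non-mainstream', 'efl', 'battery-powered')
print(parse_line("Like беспощадно is to indépendance, rs is to?"))
# ('беспощадно', 'indépendance', 'rs')
print(parse_line("..."))
# None
```

The lazy `(.+?)` groups stop at the first ` is to ` and `, ` they meet, so hyphenated and non-Latin words are captured whole.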

Calculate the Analogy:

if match:
    word1, word2, word3 = match.groups()
    vector1 = get_word_vector(model, word1)
    vector2 = get_word_vector(model, word2)
    vec_target = get_word_vector(model, word3)

    if vector1 is not None and vector2 is not None and vec_target is not None:
        analogy_vector = vec_target + (vector2 - vector1)
        result = model.similar_by_vector(analogy_vector, topn=1)
        print(f"'{word1} is to {word2} as {word3} is to {result[0][0]}' with similarity {result[0][1]}")
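One caveat worth keeping in mind: a lookup by raw vector searches the whole vocabulary, so the top hit can be one of the three input words rather than a new one. The sketch below illustrates the effect with invented two-dimensional vectors and a hypothetical `nearest_word` helper that can skip the query words. It is plain Python, not Gensim code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nearest_word(vectors, target, exclude=()):
    """Return (word, score) for the vocabulary word closest to target."""
    best_word, best_score = None, -2.0  # cosine similarity is never below -1
    for word, vec in vectors.items():
        if word in exclude:
            continue  # skip the words that appeared in the question
        score = cosine_similarity(vec, target)
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


# Made-up vectors, chosen so that the analogy target lands near "king".
toy_vectors = {
    "man": (1.0, 1.0),
    "woman": (1.0, 2.0),
    "king": (5.0, 1.0),
    "queen": (5.0, 4.0),
}
target = (5.0, 2.0)  # woman - man + king

print(nearest_word(toy_vectors, target)[0])  # king
print(nearest_word(toy_vectors, target, exclude={"man", "woman", "king"})[0])  # queen
```

Gensim's `most_similar(positive=[word2, word3], negative=[word1])` performs a comparable lookup and leaves the query words out of its results. It works on length-normalized vectors, so its ranking can differ from the raw-vector arithmetic used in this writeup.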

Full Script Example

The complete script looks like this:

import gensim.downloader as api
import re

def load_model(model_name='glove-twitter-25'):
    model = api.load(model_name)
    return model

def get_word_vector(model, word):
    try:
        vector = model[word]
        return vector
    except KeyError:
        return None

def process_line(line, model):
    """Solve one analogy line and return the predicted word, or None."""
    match = re.match(r"Like (.+?) is to (.+?), (.+?) is to\?", line.strip())
    if not match:
        print(f"Line format is incorrect: {line.strip()}")
        return None

    word1, word2, word3 = match.groups()
    vector1 = get_word_vector(model, word1)
    vector2 = get_word_vector(model, word2)
    vec_target = get_word_vector(model, word3)

    if vector1 is None or vector2 is None or vec_target is None:
        missing_words = [word for word, vec in zip([word1, word2, word3], [vector1, vector2, vec_target]) if vec is None]
        print(f"The following words were not found in the model: {', '.join(missing_words)}")
        return None

    # Apply the word1 -> word2 offset to word3 and look up the nearest word.
    analogy_vector = vec_target + (vector2 - vector1)
    result = model.similar_by_vector(analogy_vector, topn=1)
    print(f"'{word1} is to {word2} as {word3} is to {result[0][0]}' with similarity {result[0][1]}")
    return result[0][0]

def process_file(filename, model):
    """Return the predicted words for every solvable line, in file order."""
    answers = []
    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:
            answer = process_line(line, model)
            if answer is not None:
                answers.append(answer)
    return answers

def main():
    model = load_model()
    filename = 'chal.txt'
    # Assumption: the flag is the predicted words joined in file order.
    return "".join(process_file(filename, model))

if __name__ == "__main__":
    # The original version printed an empty string because nothing was ever
    # assigned to flag; main() now returns the assembled flag instead.
    flag = main()
    print(flag)

Conclusion

By utilizing the GloVe embedding model and understanding vector arithmetic, we successfully decipher the hidden messages exchanged between the AIs. This challenge showcases the power of word embeddings in capturing semantic relationships and enables us to find the flag hidden within the metaphorical dialogue.

If you have any questions or need further clarification on specific parts, feel free to reach out!