tig-monorepo/docs/challenges/vector_search.md
2025-05-20 22:26:10 +01:00

3.6 KiB

Vector Range Search

Vector range search (or vector search engine) is the task where, given 2 sets of vectors with the same number of dimensions, a database set and a query set, find for each query vector a nearby vector in the database set, such that the mean distance between the query vectors and their corresponding vector in the database is within a threshold value.

Example

  • 10 vectors in the database set.
  • 2-dimensional space.
  • 3 vectors in the query set.
  • Mean distance threshold is 0.2.
  • Distance is Euclidean distance
vector_database = [
    [0.05, 0.16],
    [0.31, 0.74],
    [0.32, 0.8 ],
    [0.03, 0.25],
    [0.33, 0.07],
    [0.88, 0.77],
    [0.91, 0.29],
    [0.7 , 0.02],
    [0.53, 0.04],
    [0.72, 0.38]
]

query_vectors = [
    [0.89, 0.86],
    [0.26, 0.88],
    [0.17, 0.41]
]

The Euclidean distance from each query vector to the database set is approximately:

distances = [
    [1.09, 0.59, 0.57, 1.05, 0.97, 0.09, 0.57, 0.86, 0.9 , 0.51],
    [0.75, 0.15, 0.1 , 0.67, 0.81, 0.63, 0.88, 0.97, 0.88, 0.68],
    [0.28, 0.36, 0.42, 0.21, 0.38, 0.8 , 0.75, 0.66, 0.52, 0.55]
]

It can be seen that, for each query vector, if we select the following vectors in the database, the mean Euclidean distance will be below 0.2:

indexes = [
    5, // select vector 5 in database as "nearby" to query vector 0
    1, // select vector 1 in database as "nearby" to query vector 1
    0, // select vector 0 in database as "nearby" to query vector 2
]
total_distance = 0.09 + 0.1 + 0.28 = 0.47
mean_distance = 0.47 / 3 = 0.16

Our Challenge

In TIG, the vector search challenge features vectors with 250 dimensions, and uses the Euclidean distance. The set we sample from is the hypercube [-1,1]^{250}. The number of Database vectors scales with the number of Query vectors, such that the number Database vectors is the Query vectors multiplied by 100. There are two parameters that can be adjusted in order to vary the difficulty of the challenge instance:

  • Parameter 1: num\textunderscore{ }queries = The number of queries.
  • Parameter 2: better\textunderscore{ }than\textunderscore{ }baseline = The mean Euclidean distance of query vectors to selected nearby vectors in the database have to be below threshold = 11 - better_than_baseline / 1000.

Real-world data is typically clustered, we generate cluster sizes from the log-normal distribution, such that the mean number of points in a cluster is 700. All vectors in the Query and Database sets are generated in the following way. When a vector is generated it is assigned a cluster center with a probability proportional to that clusters size. Once a vector is assigned a cluster center it is generated from a anisotropic Guassian with mean equal to the cluster center.

Application

Vector search has a wide range of applications an example of which is Threshold-Based Anomaly Detection, where the vector database represents operational data in a high-dimensional space, and query vectors represent new incoming data points to be monitored for anomalies. If the average distance exceeds a predefined threshold, the query vectors are flagged as anomalies.

See also for example Outlier detection for high dimensional data: https://dl.acm.org/doi/abs/10.1145/375663.375668

Another example application of vector search is in network security, where the vector database corresponds to historical traffic patterns, and query vectors are new traffic data. By tracking the mean distance between sets new data points and historic "regular" data, any deviation exceeding a threshold can indicate a potential intrusion.