Vector index algorithms are the core technology of a vector database: they build the structures that make vector similarity search efficient. This module implements several advanced approximate nearest neighbor (ANN) algorithms, including graph-based, quantization-based, and hash-based indexes, to meet the performance requirements of different scenarios.
2. Mathematical Foundations
2.1 Vector Spaces and Distance Metrics
2.1.1 Vector Space Definition
Let $\mathcal{V} \subset \mathbb{R}^d$ be a vector space, where $d$ is the vector dimension. For a vector $\mathbf{v} = (v_1, v_2, \ldots, v_d) \in \mathcal{V}$, define the following norms:
L2 norm (Euclidean norm):
$\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{d} v_i^2}$
L1 norm (Manhattan norm):
$\|\mathbf{v}\|_1 = \sum_{i=1}^{d} |v_i|$
L∞ norm (Chebyshev norm):
$\|\mathbf{v}\|_\infty = \max_{1 \leq i \leq d} |v_i|$
2.1.2 Distance Metric Functions
Define a distance function $\delta: \mathcal{V} \times \mathcal{V} \rightarrow \mathbb{R}^+$:
Euclidean distance:
$\delta_2(\mathbf{u}, \mathbf{v}) = \sqrt{\sum_{i=1}^{d} (u_i - v_i)^2}$
Cosine distance:
$\delta_{\cos}(\mathbf{u}, \mathbf{v}) = 1 - \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|_2 \|\mathbf{v}\|_2}$
Inner-product distance:
$\delta_{ip}(\mathbf{u}, \mathbf{v}) = -\mathbf{u} \cdot \mathbf{v}$
Hamming distance (binary vectors):
$\delta_H(\mathbf{u}, \mathbf{v}) = \sum_{i=1}^{d} \mathbf{1}[u_i \neq v_i]$
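These four metrics translate directly into a few lines of NumPy. The sketch below is illustrative only; the function names are mine, not this module's API:

import numpy as np

def euclidean(u, v):
    return float(np.sqrt(np.sum((u - v) ** 2)))

def cosine_distance(u, v):
    return 1.0 - float(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))

def inner_product_distance(u, v):
    return -float(np.dot(u, v))

def hamming(u, v):
    # u and v are binary (0/1) vectors of equal length
    return int(np.sum(u != v))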
2.2 The Approximate Nearest Neighbor Problem
2.2.1 Exact Nearest Neighbor
Given a query vector $\mathbf{q} \in \mathcal{V}$ and a dataset $\mathcal{D} = \{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n\}$, the exact nearest neighbor is defined as:
$\text{NN}(\mathbf{q}) = \arg\min_{\mathbf{v} \in \mathcal{D}} \delta(\mathbf{q}, \mathbf{v})$
The exact $k$ nearest neighbors are defined as:
$\text{KNN}(\mathbf{q}, k) = \{\mathbf{v}_1^*, \mathbf{v}_2^*, \ldots, \mathbf{v}_k^*\}$
where $\delta(\mathbf{q}, \mathbf{v}_1^*) \leq \delta(\mathbf{q}, \mathbf{v}_2^*) \leq \ldots \leq \delta(\mathbf{q}, \mathbf{v}_k^*)$.
2.2.2 Approximate Nearest Neighbor
A $(1+\epsilon)$-approximate nearest neighbor is defined as:
$\text{ANN}(\mathbf{q}, \epsilon) = \mathbf{v}' \text{ s.t. } \delta(\mathbf{q}, \mathbf{v}') \leq (1+\epsilon) \cdot \delta(\mathbf{q}, \mathbf{v}^*)$
where $\mathbf{v}^*$ is the exact nearest neighbor and $\epsilon \geq 0$ is the approximation error.
3. Graph-Based Index Algorithms
3.1 HNSW (Hierarchical Navigable Small World)
3.1.1 Algorithm Principle
HNSW builds a multi-layer graph in which every layer is a small-world network. Let $G = (V, E)$ be the graph, where:
Vertex set: $V = \{1, 2, \ldots, n\}$
Edge set: $E \subseteq V \times V$
3.1.2 Layer Structure
The level $l_i$ of node $i$ follows a geometric distribution:
$P(l_i = l) = (1-p)^l \cdot p$
Levels are drawn as $l = \lfloor -\ln(\text{unif}(0,1)) \cdot m_L \rfloor$, where $m_L$ is the level-generation parameter (the HNSW paper recommends $m_L = 1/\ln M$). With $m_L = \frac{1}{\ln 2}$ this yields $p = 1 - e^{-1/m_L} = \frac{1}{2}$, and the expected value of the underlying exponential variable is $m_L \approx 1.44$.
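The level rule transcribes directly into code. A minimal sketch (the function name is illustrative):

import math
import random

def random_level(m_L: float = 1.0 / math.log(2.0)) -> int:
    # P(level >= x) = exp(-x / m_L); use 1 - random() to stay in (0, 1]
    u = 1.0 - random.random()
    return int(-math.log(u) * m_L)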
3.1.3 Graph Construction Algorithm
Algorithm: HNSW Construction
Input: dataset D, parameters M, ef_construction
Output: multi-layer graph G
1. Initialize empty graph G with layers 0 to L_max
2. For each vector v in D:
   a. Determine layer l = get_random_level()
   b. Find entry point ep at top layer
   c. For layer lc from L_max down to l+1:
      - ep = search_layer(v, ep, 1, lc)
   d. For layer lc from l down to 0:
      - candidates = search_layer(v, ep, ef_construction, lc)
      - neighbors = select_neighbors(candidates, M)
      - Add bidirectional links between v and neighbors
      - Prune connections if necessary
3. Return G
3.1.4 Neighbor Selection Strategy
A heuristic neighbor-selection algorithm is used:
Algorithm: Select Neighbors Heuristic
Input: candidates C, number of connections M
Output: selected neighbors N
1. Sort candidates by distance to query
2. N = empty set
3. For each candidate c in C:
   a. If |N| < M:
      - Add c to N
   b. Else:
      - If c improves connectivity:
        - Remove worst neighbor from N
        - Add c to N
4. Return N
The connectivity score used by the heuristic:
$\text{Connectivity}(\mathbf{v}, \mathcal{N}) = \sum_{\mathbf{u} \in \mathcal{N}} \frac{1}{\delta(\mathbf{v}, \mathbf{u})}$
3.1.5 Search Algorithm
Algorithm: HNSW Search
Input: query q, graph G, parameter ef
Output: k nearest neighbors
1. ep = entry_point
2. For layer lc from L_max down to 1:
   - ep = search_layer(q, ep, 1, lc)
3. candidates = search_layer(q, ep, ef, 0)
4. Return top-k from candidates

Function: search_layer(q, ep, num_closest, lc)
1. visited = {ep}
2. candidates = priority_queue with ep
3. dynamic_candidates = priority_queue with ep
4. While candidates is not empty:
   a. current = candidates.pop_closest()
   b. If distance(q, current) > distance(q, furthest_in_dynamic):
      - Break
   c. For each neighbor of current at layer lc:
      - If neighbor not in visited:
        - Add to visited
        - If distance(q, neighbor) < distance(q, furthest_in_dynamic) or |dynamic_candidates| < num_closest:
          - Add neighbor to candidates and dynamic_candidates
          - If |dynamic_candidates| > num_closest:
            - Remove furthest from dynamic_candidates
5. Return dynamic_candidates
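For concreteness, here is search_layer as a runnable Python sketch over an adjacency list, under assumed data structures (vectors is an array of points, graph_layer maps a node id to its neighbor ids at this layer); nothing here is the module's actual API:

import heapq
import numpy as np

def search_layer(q, entry, num_closest, graph_layer, vectors):
    # Beam search over one layer; returns up to num_closest node ids, closest first.
    dist = lambda i: float(np.linalg.norm(vectors[i] - q))
    d0 = dist(entry)
    visited = {entry}
    candidates = [(d0, entry)]    # min-heap: pop the closest unexplored node
    dynamic = [(-d0, entry)]      # max-heap via negation: current best results
    while candidates:
        d_cur, cur = heapq.heappop(candidates)
        if d_cur > -dynamic[0][0]:          # best candidate worse than worst result
            break
        for nb in graph_layer.get(cur, []):
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(dynamic) < num_closest or d_nb < -dynamic[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(dynamic, (-d_nb, nb))
                if len(dynamic) > num_closest:
                    heapq.heappop(dynamic)  # evict the furthest result
    return [n for _, n in sorted((-d, n) for d, n in dynamic)]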
3.1.6 Complexity Analysis
Construction complexity: $O(n \log n \cdot M \cdot d)$
Search complexity: $O(\log n \cdot d)$
Space complexity: $O(n \cdot M)$
where $n$ is the number of data points, $M$ is the maximum number of connections per layer, and $d$ is the vector dimension.
3.2 NSW (Navigable Small World)
3.2.1 Small-World Network Properties
A small-world network satisfies the following properties:
High clustering coefficient: $C = \frac{3 \times \text{triangles}}{\text{connected triples}}$
Short average path length: $L = \frac{1}{n(n-1)} \sum_{i \neq j} d_{ij}$
where $d_{ij}$ is the shortest-path length between nodes $i$ and $j$.
3.2.2 Greedy Search Algorithm
Algorithm: Greedy Search in NSW
Input: query q, graph G, entry point ep
Output: nearest neighbor
1. current = ep
2. While True:
   a. found_closer = False
   b. For each neighbor n of current:
      - If distance(q, n) < distance(q, current):
        - current = n
        - found_closer = True
        - Break
   c. If not found_closer:
      - Return current
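The same hill-climbing loop in Python, a sketch with the same assumed graph representation as above:

import numpy as np

def greedy_search(q, entry, graph, vectors):
    # Move to the first neighbor closer to q; stop at a local minimum.
    dist = lambda i: float(np.linalg.norm(vectors[i] - q))
    current, d_cur = entry, dist(entry)
    improved = True
    while improved:
        improved = False
        for nb in graph.get(current, []):
            d_nb = dist(nb)
            if d_nb < d_cur:
                current, d_cur = nb, d_nb
                improved = True
                break
    return current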
3.2.3 Performance Analysis
Average search complexity: $O(\log n)$
Worst-case complexity: $O(n)$
Success probability: $P(\text{success}) \geq 1 - \frac{1}{n^{\alpha}}$ for some $\alpha > 0$.
3.3 Graph Index Optimization Techniques
3.3.1 Dynamic Graph Maintenance
Insertion:
Algorithm: Dynamic Insert
Input: new vector v, graph G
Output: updated graph G'
1. Find insertion layer l = get_random_level()
2. Search for closest neighbors at each layer
3. Insert v and create connections
4. Update affected neighborhoods
5. Rebalance if necessary
Deletion:
Algorithm: Dynamic Delete
Input: vector v to delete, graph G
Output: updated graph G'
1. Find all neighbors of v
2. Remove v from graph
3. Reconnect orphaned neighbors
4. Optimize local connectivity
3.3.2 Graph Pruning Strategies
Angle-based pruning:
For neighbors $j$ and $k$ of node $i$, if the angle $\theta_{jik}$ is too small, remove the farther of the two neighbors:
$\cos(\theta_{jik}) = \frac{(\mathbf{v}_j - \mathbf{v}_i) \cdot (\mathbf{v}_k - \mathbf{v}_i)}{\|\mathbf{v}_j - \mathbf{v}_i\| \|\mathbf{v}_k - \mathbf{v}_i\|}$
Pruning condition: $\cos(\theta_{jik}) > \cos(\theta_{\text{threshold}})$
Distance-based pruning:
Remove an edge $(i, j)$ that satisfies:
$\exists k \in N(i): \delta(\mathbf{v}_i, \mathbf{v}_k) + \delta(\mathbf{v}_k, \mathbf{v}_j) \leq (1+\epsilon) \cdot \delta(\mathbf{v}_i, \mathbf{v}_j)$
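The distance-based condition is a simple triangle-detour test. A sketch (names are illustrative):

import numpy as np

def edge_is_redundant(i, j, neighbors_of_i, vectors, eps=0.1):
    # True if some k in N(i) offers a detour i -> k -> j almost as short
    # as the direct edge i -> j, in which case (i, j) can be pruned.
    d_ij = np.linalg.norm(vectors[i] - vectors[j])
    for k in neighbors_of_i:
        if k == j:
            continue
        detour = (np.linalg.norm(vectors[i] - vectors[k])
                  + np.linalg.norm(vectors[k] - vectors[j]))
        if detour <= (1.0 + eps) * d_ij:
            return True
    return False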
4. Quantization-Based Index Algorithms
4.1 Product Quantization (PQ)
4.1.1 Algorithm Principle
Decompose a $d$-dimensional vector into $m$ sub-vectors:
$\mathbf{v} = [\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \ldots, \mathbf{v}^{(m)}]$
where each sub-vector $\mathbf{v}^{(j)} \in \mathbb{R}^{d/m}$.
Run K-means clustering independently in each subspace:
$\mathbf{v}^{(j)} \approx \mathbf{c}_{j,q_j}$
where $\mathbf{c}_{j,q_j}$ is the $q_j$-th cluster centroid of the $j$-th subspace.
4.1.2 Distance Computation
For a query $\mathbf{q} = [\mathbf{q}^{(1)}, \ldots, \mathbf{q}^{(m)}]$ and a quantized vector $\mathbf{v}$:
$\delta^2(\mathbf{q}, \mathbf{v}) \approx \sum_{j=1}^{m} \delta^2(\mathbf{q}^{(j)}, \mathbf{c}_{j,q_j})$
4.1.3 Lookup Table Optimization
Precompute a distance lookup table:
$\text{LUT}[j][k] = \delta^2(\mathbf{q}^{(j)}, \mathbf{c}_{j,k})$
The distance computation then reduces to:
$\delta^2(\mathbf{q}, \mathbf{v}) \approx \sum_{j=1}^{m} \text{LUT}[j][q_j]$
cutting the per-vector time complexity from $O(d)$ to $O(m)$.
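The whole train/encode/LUT pipeline fits in a short NumPy sketch. Codebooks are trained with a few plain Lloyd iterations so the example stays self-contained; the class and method names are illustrative, not this module's API, and fit assumes len(X) >= k:

import numpy as np

class ProductQuantizer:
    def __init__(self, m=4, k=16, iters=10):
        self.m, self.k, self.iters = m, k, iters

    def fit(self, X):
        n, d = X.shape
        self.ds = d // self.m                  # sub-vector dimension d/m
        self.codebooks = []                    # m codebooks, each of shape (k, ds)
        rng = np.random.default_rng(0)
        for j in range(self.m):
            sub = X[:, j * self.ds:(j + 1) * self.ds]
            C = sub[rng.choice(n, self.k, replace=False)].astype(float)
            for _ in range(self.iters):        # plain K-means (Lloyd) iterations
                assign = np.argmin(((sub[:, None] - C[None]) ** 2).sum(-1), axis=1)
                for c in range(self.k):
                    pts = sub[assign == c]
                    if len(pts):
                        C[c] = pts.mean(axis=0)
            self.codebooks.append(C)
        return self

    def encode(self, X):
        codes = np.empty((len(X), self.m), dtype=np.int32)
        for j, C in enumerate(self.codebooks):
            sub = X[:, j * self.ds:(j + 1) * self.ds]
            codes[:, j] = np.argmin(((sub[:, None] - C[None]) ** 2).sum(-1), axis=1)
        return codes

    def adc_distances(self, q, codes):
        # LUT[j][k] = squared distance from q's j-th sub-vector to centroid k
        lut = np.stack([((C - q[j * self.ds:(j + 1) * self.ds]) ** 2).sum(-1)
                        for j, C in enumerate(self.codebooks)])
        # O(m) table lookups per encoded vector instead of O(d) arithmetic
        return lut[np.arange(self.m), codes].sum(axis=1)

Typical use: pq = ProductQuantizer().fit(X), codes = pq.encode(X), then rank the database by pq.adc_distances(q, codes).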
4.1.4 Quantization Error Analysis
Expected quantization error:
$E[\|\mathbf{v} - \hat{\mathbf{v}}\|^2] = \sum_{j=1}^{m} E[\|\mathbf{v}^{(j)} - \mathbf{c}_{j,q_j}\|^2]$
For Gaussian-distributed data, the error is approximately:
$E[\text{error}] \approx \frac{d}{m} \cdot \frac{\sigma^2}{k^{2/d}}$
where $\sigma^2$ is the data variance and $k$ is the number of centroids per subspace.
4.2 Scalar Quantization (SQ)
4.2.1 Uniform Quantization
For a scalar $x \in [x_{\min}, x_{\max}]$, $b$-bit uniform quantization is:
$Q(x) = \text{round}\left(\frac{x - x_{\min}}{x_{\max} - x_{\min}} \cdot (2^b - 1)\right)$
Dequantization:
$\hat{x} = x_{\min} + \frac{Q(x)}{2^b - 1} \cdot (x_{\max} - x_{\min})$
Quantization error:
$\epsilon = |x - \hat{x}| \leq \frac{x_{\max} - x_{\min}}{2^{b+1}}$
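A sketch of uniform quantization and dequantization applied element-wise to a vector (8-bit by default; names are illustrative):

import numpy as np

def sq_quantize(x, x_min, x_max, bits=8):
    levels = (1 << bits) - 1                     # 2^b - 1 quantization steps
    q = np.round((x - x_min) / (x_max - x_min) * levels)
    return np.clip(q, 0, levels).astype(np.uint8)

def sq_dequantize(q, x_min, x_max, bits=8):
    levels = (1 << bits) - 1
    return x_min + q.astype(np.float32) / levels * (x_max - x_min)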
4.2.2 Non-Uniform Quantization
Design the optimal quantizer from the data distribution. For a probability density $f(x)$, the optimal boundaries and centroids satisfy the Lloyd conditions:
Centroid condition:
$c_i = \frac{\int_{b_{i-1}}^{b_i} x f(x)\,dx}{\int_{b_{i-1}}^{b_i} f(x)\,dx}$
Boundary condition:
$b_i = \frac{c_i + c_{i+1}}{2}$
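On empirical data the two Lloyd conditions become an alternating update (1-D K-means). A minimal sketch, assuming samples is a 1-D NumPy array:

import numpy as np

def lloyd_max(samples, k=8, iters=50):
    # Initialize centroids at interior quantiles, then alternate the two conditions.
    c = np.quantile(samples, np.linspace(0, 1, k + 2)[1:-1])
    for _ in range(iters):
        b = (c[:-1] + c[1:]) / 2           # boundary condition
        assign = np.digitize(samples, b)   # region index of each sample
        for i in range(k):
            pts = samples[assign == i]
            if len(pts):
                c[i] = pts.mean()          # centroid condition
    return c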
4.3 Residual Quantization (RQ)
4.3.1 Multi-Level Quantization
First level: $\mathbf{v} \approx \mathbf{c}_1 + \mathbf{r}_1$
Second level: $\mathbf{r}_1 \approx \mathbf{c}_2 + \mathbf{r}_2$
Recursing over $L$ levels:
$\mathbf{v} \approx \sum_{l=1}^{L} \mathbf{c}_l$
4.3.2 Error Decay
Quantization error at level $l$:
$\epsilon_l = \|\mathbf{r}_{l-1} - \mathbf{c}_l\|$
Total error:
$\epsilon_{\text{total}} = \|\mathbf{v} - \sum_{l=1}^{L} \mathbf{c}_l\| \leq \prod_{l=1}^{L} \alpha_l \cdot \epsilon_0$
where $\alpha_l < 1$ is the error decay factor of level $l$.
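Given per-level codebooks (trainable in the same way as the PQ example above), encoding is a residual loop. A sketch with illustrative names:

import numpy as np

def rq_encode(v, codebooks):
    # codebooks: list of (k, d) centroid arrays, one per level.
    residual = v.astype(float).copy()
    codes, recon = [], np.zeros_like(residual)
    for C in codebooks:
        idx = int(np.argmin(((C - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        recon += C[idx]
        residual -= C[idx]   # the next level quantizes what remains
    return codes, recon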
5. Hash-Based Index Algorithms
5.1 Locality-Sensitive Hashing (LSH)
5.1.1 LSH Function Families
For a distance function $\delta$ and thresholds $r_1 < r_2$, an LSH family $\mathcal{H}$ satisfies:
If $\delta(\mathbf{u}, \mathbf{v}) \leq r_1$, then $P[h(\mathbf{u}) = h(\mathbf{v})] \geq p_1$;
If $\delta(\mathbf{u}, \mathbf{v}) \geq r_2$, then $P[h(\mathbf{u}) = h(\mathbf{v})] \leq p_2$;
where $p_1 > p_2$.
5.1.2 Random-Projection LSH
For Euclidean space, the LSH function is defined as:
$h_{\mathbf{a},b}(\mathbf{v}) = \left\lfloor \frac{\mathbf{a} \cdot \mathbf{v} + b}{w} \right\rfloor$
where:
$\mathbf{a} \sim \mathcal{N}(0, I)$: random projection vector
$b \sim \text{Uniform}[0, w]$: random offset
$w > 0$: bucket width
Collision probability:
$P[h(\mathbf{u}) = h(\mathbf{v})] = \int_0^w \frac{1}{w} f\left(\frac{t}{w}\right) dt$
where $f(x) = 2\Phi(x) - 1 - \frac{2x}{\sqrt{2\pi}} e^{-x^2/2}$ and $\Phi$ is the standard normal CDF.
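A sketch of the projection hash with k concatenated functions, following the parameters above (this is not any particular library's API):

import numpy as np

class RandomProjectionLSH:
    def __init__(self, dim, k=8, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(size=(k, dim))      # each row is a ~ N(0, I)
        self.b = rng.uniform(0.0, w, size=k)    # b ~ Uniform[0, w]
        self.w = w

    def hash(self, v):
        # One bucket id per hash function; the tuple is the concatenated key g(v).
        return tuple(np.floor((self.A @ v + self.b) / self.w).astype(int))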
5.1.3 Cosine LSH
For cosine similarity, use sign random projections:
$h_\mathbf{a}(\mathbf{v}) = \text{sign}(\mathbf{a} \cdot \mathbf{v})$
Collision probability (for unit-norm vectors):
$P[h(\mathbf{u}) = h(\mathbf{v})] = 1 - \frac{\arccos(\mathbf{u} \cdot \mathbf{v})}{\pi}$
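The sign-projection version (often called SimHash) is even shorter; a sketch:

import numpy as np

def simhash(v, A):
    # A: (num_bits, dim) Gaussian matrix; returns one sign bit per projection.
    return tuple((A @ v > 0).astype(np.uint8))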
5.2 Multi-Hashing
5.2.1 AND Construction
Concatenate $k$ LSH functions:
$g(\mathbf{v}) = (h_1(\mathbf{v}), h_2(\mathbf{v}), \ldots, h_k(\mathbf{v}))$
Collision probability:
$P[g(\mathbf{u}) = g(\mathbf{v})] = P[h(\mathbf{u}) = h(\mathbf{v})]^k$
5.2.2 OR Construction
Use $L$ independent hash tables.
The probability of a collision in at least one table is:
$P_{\text{OR}} = 1 - (1 - P[g(\mathbf{u}) = g(\mathbf{v})])^L$
5.2.3 Parameter Optimization
For given $p_1, p_2$, the optimal parameters are:
$k^* = \frac{\ln(1/p_2)}{\ln(1/p_1)}$
$L^* = \frac{\ln(1/\delta)}{p_1^{k^*}}$
where $\delta$ is the allowed failure probability.
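A quick numeric check of these formulas (the values of p1, p2, and delta are chosen purely for illustration):

import math

p1, p2, delta = 0.8, 0.4, 0.01            # illustrative LSH family and failure prob.
k = math.log(1 / p2) / math.log(1 / p1)   # AND width
L = math.log(1 / delta) / (p1 ** math.ceil(k))
print(f"k* = {k:.2f} -> use k = {math.ceil(k)}, L* = {L:.1f} tables")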
5.3 Learned Hashing
5.3.1 Supervised Hashing
Given similarity labels $S = \{(\mathbf{v}_i, \mathbf{v}_j, s_{ij})\}$, optimize:
$\min_{\mathbf{W}} \sum_{(i,j) \in S} \ell(s_{ij}, \text{sim}(h(\mathbf{v}_i), h(\mathbf{v}_j)))$
where $h(\mathbf{v}) = \text{sign}(\mathbf{W}^T \mathbf{v})$ and $\ell$ is a loss function.
5.3.2 Unsupervised Hashing
Learn hash functions that preserve the data distribution:
$\min_{\mathbf{W}} \sum_{i=1}^{n} \|\mathbf{v}_i - \mathbf{W} h(\mathbf{v}_i)\|^2$
subject to $h(\mathbf{v}_i) \in \{-1, +1\}^k$.
6. Hybrid Index Strategies
6.1 Multi-Level Indexing
6.1.1 Coarse-Grained Filtering
Level 1: a fast but inexact index (e.g., LSH) performs coarse candidate filtering.
Level 2: an exact but slower index (e.g., a graph index) performs the precise search.
6.1.2 Cascaded Search
Algorithm: Cascaded Search
Input: query q, coarse index I₁, fine index I₂
Output: k nearest neighbors
1. candidates₁ = I₁.search(q, k₁)   // k₁ >> k
2. candidates₂ = I₂.search(q, candidates₁, k)
3. Return candidates₂
Search complexity: $O(\log k_1) + O(k_1 \log k)$
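A minimal two-stage cascade, reusing the SimHash buckets from Section 5.1.3 as the coarse stage and exact re-ranking as the fine stage (all names illustrative; a production fine stage would be a graph index rather than a scan):

import numpy as np

def cascaded_search(q, X, buckets, A, k=10, k1=200):
    # buckets: dict mapping a simhash key to the row ids of X stored under it.
    key = tuple((A @ q > 0).astype(np.uint8))
    cand = buckets.get(key, [])[:k1]          # stage 1: coarse filter, k1 >> k
    if not cand:
        cand = range(len(X))                  # empty bucket: fall back to full scan
    cand = np.fromiter(cand, dtype=int)
    d = np.linalg.norm(X[cand] - q, axis=1)   # stage 2: exact re-ranking
    return cand[np.argsort(d)[:k]]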
6.2 Adaptive Index Selection
6.2.1 Query Feature Analysis
Define a query feature vector:
$\mathbf{f}_q = [\|\mathbf{q}\|, \text{sparsity}(\mathbf{q}), \text{entropy}(\mathbf{q}), k, \epsilon]$
6.2.2 Index Selection Model
Train a classifier $C: \mathbf{f}_q \rightarrow \mathcal{I}$, where $\mathcal{I}$ is the set of index types.
Selection criterion:
$\mathcal{I}^* = \arg\min_{\mathcal{I}} \left(\alpha \cdot \text{Latency}(\mathcal{I}, \mathbf{f}_q) + \beta \cdot \text{Error}(\mathcal{I}, \mathbf{f}_q)\right)$
7. Index Construction Optimization
7.1 Parallel Construction
7.1.1 Data Parallelism
Partition the dataset into $P$ subsets:
$\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup \ldots \cup \mathcal{D}_P$
Each processor builds a sub-index independently, and the sub-indexes are merged at the end.
7.1.2 Model Parallelism
For a graph index, different layers are processed in parallel:
Algorithm: Parallel Graph Construction
1. Distribute nodes across processors
2. For each layer l in parallel:
   - Each processor builds local connections
   - Synchronize and merge global connections
3. Optimize global connectivity
7.1.3 Load Balancing
Use a work-stealing scheduler to balance load:
Each processor maintains a local task queue.
Idle processors steal tasks from busy ones.
Task granularity is adjusted dynamically.
7.2 Incremental Construction
7.2.1 Online Insertion
Algorithm: Online Insertion
Input: new vector v, existing index I
Output: updated index I'
1. Find insertion position using existing index
2. Update local neighborhood
3. Propagate changes to affected regions
4. Rebalance if necessary
Insertion complexity: $O(\log n \cdot d)$
7.2.2 Batch Updates
Algorithm: Batch Update
Input: vector batch B, existing index I
Output: updated index I'
1. Sort batch by insertion order
2. Group nearby insertions
3. Process groups in parallel
4. Merge updates atomically
Batch updates amortize the rebuild overhead:
$\text{Speedup} = \frac{|B| \cdot \text{SingleInsertCost}}{\text{BatchUpdateCost}} \approx \sqrt{|B|}$
7.3 Memory Optimization
7.3.1 Cache-Friendly Data Layout
Array of structures (AoS) vs. structure of arrays (SoA):
// AoS layout
struct Node {
    float vector[d];
    int   neighbors[M];
};
Node nodes[n];

// SoA layout
float vectors[n][d];
int   neighbors[n][M];
The SoA layout usually exhibits better cache locality.
7.3.2 Memory Pool Management
Use memory pools to reduce allocation overhead:
Memory Pool Design:
├── Small Object Pool (< 1KB)
├── Medium Object Pool (1KB - 1MB)
└── Large Object Pool (> 1MB)
Target memory utilization: > 95%
Target allocation latency: < 100 ns
8. Query Optimization
8.1 Query Planning
8.1.1 Cost Model
Define the query cost function:
$\text{Cost}(\text{Plan}) = \alpha \cdot \text{CPU\_Cost} + \beta \cdot \text{IO\_Cost} + \gamma \cdot \text{Memory\_Cost}$
where:
$\text{CPU\_Cost} = \text{distance\_computations} \times \text{cpu\_unit\_cost}$
$\text{IO\_Cost} = \text{disk\_accesses} \times \text{io\_unit\_cost}$
$\text{Memory\_Cost} = \text{memory\_usage} \times \text{memory\_unit\_cost}$
8.1.2 Plan Enumeration
Algorithm: Query Plan Enumeration
Input: query q, available indexes I
Output: optimal plan P*
1. Generate candidate plans:
   - Single-index plans
   - Multi-index plans
   - Hybrid plans
2. Estimate the cost of each plan
3. Select the plan with minimum cost
4. Return P*
8.2 Parallel Query Execution
8.2.1 Intra-Query Parallelism
Algorithm: Parallel Query Execution
Input: query q, parallelism degree P
Output: k nearest neighbors
1. Partition the search space into P regions
2. Execute the search in parallel:
   For each processor p:
   - local_results[p] = search_region(q, region[p])
3. Merge results:
   - global_results = merge(local_results)
4. Return top-k from global_results
Theoretical speedup: $\text{Speedup} = \frac{P}{1 + \text{overhead}}$
The speedup achieved in practice is typically $0.7P$ to $0.9P$.
8.2.2 Pipelined Execution
Pipeline stages:
Stage 1: Candidate Generation
  ↓
Stage 2: Distance Computation
  ↓
Stage 3: Result Ranking
  ↓
Stage 4: Post-processing
Pipeline throughput: $\text{Throughput} = \frac{1}{\max(\text{Stage\_Time})}$
8.3 Cache Optimization
8.3.1 Query Result Caching
Use an LRU-K cache policy with score:
$\text{Score}(q) = \sum_{i=1}^{K} w_i \cdot (\text{current\_time} - \text{access\_time}_i)$
Cache hit rate: $\text{Hit\_Rate} = \frac{\text{Cache\_Hits}}{\text{Total\_Queries}}$
Target hit rate: > 80%
8.3.2 Precomputation
Precompute partial results for common queries:
Hot vectors: precompute distances to frequently queried vectors
Cluster centroids: precompute query-to-centroid distances
Entry points: precompute optimal entry points
Precomputation payoff:
$\text{Benefit} = \text{Query\_Frequency} \times \text{Computation\_Saved} - \text{Storage\_Cost}$
9. Performance Evaluation and Tuning
9.1 Performance Metrics
9.1.1 Accuracy Metrics
Recall:
$\text{Recall@k} = \frac{|\text{Retrieved} \cap \text{Relevant}|}{|\text{Relevant}|}$
Precision:
$\text{Precision@k} = \frac{|\text{Retrieved} \cap \text{Relevant}|}{|\text{Retrieved}|}$
Mean average precision (MAP):
$\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)$
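Recall@k against a brute-force ground truth is the standard ANN benchmark loop; a sketch, assuming both inputs are (num_queries, k) arrays of neighbor ids:

import numpy as np

def recall_at_k(retrieved, ground_truth, k):
    hits = sum(len(set(r[:k]) & set(g[:k]))
               for r, g in zip(retrieved, ground_truth))
    return hits / (len(ground_truth) * k)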
9.1.2 Efficiency Metrics
Query latency targets:
P50 latency: < 10 ms
P95 latency: < 50 ms
P99 latency: < 100 ms
Query throughput targets:
Single thread: > 1K QPS
Multi-threaded: > 10K QPS
Index construction targets:
Build speed: > 10K vectors/sec
Memory usage: < 2x the raw data size
9.2 Parameter Tuning
9.2.1 Grid Search
Algorithm: Grid Search Tuning
Input: parameter space P, evaluation metric M
Output: optimal parameters P*
1. Define the parameter grid:
   - M ∈ {8, 16, 32, 64}
   - ef_construction ∈ {100, 200, 400}
   - ef_search ∈ {50, 100, 200}
2. For each parameter combination p:
   - Build index with parameters p
   - Evaluate performance M(p)
3. Return P* = argmax M(p)
9.2.2 Bayesian Optimization
Model the parameter-performance relationship with a Gaussian process:
$f(\mathbf{p}) \sim \mathcal{GP}(\mu(\mathbf{p}), k(\mathbf{p}, \mathbf{p}'))$
Acquisition function (Expected Improvement):
$\text{EI}(\mathbf{p}) = \sigma(\mathbf{p}) \left[ Z \Phi(Z) + \phi(Z) \right]$
where $Z = \frac{\mu(\mathbf{p}) - f(\mathbf{p}^+)}{\sigma(\mathbf{p})}$.
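The EI formula evaluates directly with SciPy's normal CDF and PDF. A sketch, where mu and sigma come from a fitted GP and f_best is the incumbent $f(\mathbf{p}^+)$:

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    # EI for maximization, vectorized over candidate parameter points.
    sigma = np.maximum(sigma, 1e-12)   # guard against zero predictive variance
    z = (mu - f_best) / sigma
    return sigma * (z * norm.cdf(z) + norm.pdf(z))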
9.3 A/B Testing
9.3.1 Experiment Design
Control group: the current algorithm configuration
Treatment group: the new algorithm configuration
Traffic split: 50%-50% random assignment
Evaluation metrics: latency, accuracy, resource usage
9.3.2 Statistical Significance Testing
Use a t-test to assess whether performance differences are significant:
$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
Significance level: $\alpha = 0.05$
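The statistic above is Welch's two-sample t-test, which SciPy provides directly; a sketch of the comparison on two latency samples:

from scipy import stats

def compare_groups(control, treatment, alpha=0.05):
    # equal_var=False selects Welch's t-test, matching the formula above
    t, p = stats.ttest_ind(control, treatment, equal_var=False)
    return {"t": t, "p": p, "significant": p < alpha}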
10. Frontier Techniques and Trends
10.1 Deep-Learning-Enhanced Indexes
10.1.1 Learned Indexes
Use a neural network to learn the data distribution and build a more efficient index:
$\text{Index}(\mathbf{q}) = \text{NN}(\mathbf{q}; \theta)$
where $\text{NN}$ is a neural network and $\theta$ its parameters.
10.1.2 End-to-End Optimization
Jointly optimize the vector representation and the index structure:
$\min_{\theta, \phi} \mathcal{L}_{\text{task}} + \lambda \mathcal{L}_{\text{index}}$
where $\mathcal{L}_{\text{task}}$ is the task loss and $\mathcal{L}_{\text{index}}$ is the index efficiency loss.
10.2 Hardware Acceleration
10.2.1 GPU Acceleration
Exploit the GPU's massively parallel compute:
SIMD operations: vectorized distance computation
Shared memory: reduced memory-access latency
Asynchronous execution: overlap computation with memory transfers
10.2.2 Specialized Hardware
Purpose-built chips for vector search:
Vector processing units (VPU): dedicated vector-arithmetic hardware
Content-addressable memory (CAM): hardware-level parallel lookup
Near-data processing: computation placed next to storage
10.3 Distributed Indexes
10.3.1 Sharding Strategies
Hash-based sharding:
$\text{Shard}(\mathbf{v}) = \text{hash}(\mathbf{v}) \bmod N$
Cluster-based sharding:
$\text{Shard}(\mathbf{v}) = \arg\min_i \delta(\mathbf{v}, \mathbf{c}_i)$
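Both routing rules in a few lines (a sketch: it hashes a stable record id rather than the raw float values, which is the usual practical choice, and assumes centroids come from K-means over a sample of the collection):

import hashlib
import numpy as np

def shard_by_hash(record_id: str, num_shards: int) -> int:
    # Stable hash of the record's id, reduced modulo the shard count.
    h = int(hashlib.md5(record_id.encode()).hexdigest(), 16)
    return h % num_shards

def shard_by_cluster(v, centroids):
    # Route to the shard whose centroid is nearest to v.
    return int(np.argmin(np.linalg.norm(centroids - v, axis=1)))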
10.3.2 Consistency Guarantees
Use the Raft protocol to keep the distributed index consistent:
Log replication: guarantees a consistent operation order
Leader election: handles node failures
Membership changes: resizes the cluster dynamically