Fountain Voyage

A Comprehensive Guide to Mixture of Experts (MoE): Exploring Mixtral 8X7B, DBRX, and Deepseek-v2 Architectures and Applications

blog@zair.top (Tim) — Wed, 25 Dec 2024 15:44:22 +0800

Dive into the architecture and working principles of Mixture of Experts (MoE) models, exploring popular frameworks like Mixtral 8X7B, DBRX, and Deepseek-v2. Learn their applications and advantages, implement an MoE model using Python, and evaluate its performance on tasks like logical reasoning, summarization, and entity extraction.

Mixture of Experts (MoE) has been a viral concept in the world of Large Language Models (LLMs). It not only marks a breakthrough in efficiency and scalability but also offers novel solutions to handle complex tasks. In simple terms, MoE splits a large model into multiple smaller models, where each smaller model, called an “expert,” specializes in a specific task or dataset type. When processing a particular task, the model activates only the relevant “experts,” without engaging the entire network, thereby saving computational resources significantly.

In this article, you will:

Master the basics of MoE model architecture and its working principles.
Learn about several popular MoE models, such as Mixtral 8X7B, DBRX, and Deepseek-v2.
Implement an MoE model in Google Colab using Python code.
Assess the performance of a typical MoE model on tasks like logical reasoning, summarization, and entity extraction.
Understand the advantages and challenges of using MoE models in complex natural language processing tasks and code generation.

1 What is a Mixture of Experts (MoE) Model?

Most modern deep learning models rely on neural networks with multiple layers, each containing numerous “neurons.” These neurons process input data, perform mathematical operations (e.g., activation functions), and pass the results to the subsequent layers. More advanced architectures, such as Transformers, employ self-attention mechanisms to capture complex patterns within data.

However, traditional dense architectures engage the entire network when solving any single task, which leads to extremely high computational costs. To address this, Mixture of Experts (MoE) introduces sparse architectures, where only the specific parts of the network—relevant “experts”—are activated based on the input task. This results in a significant reduction in computational requirements, particularly for resource-intensive tasks like natural language processing.

Imagine a team project where team members are divided into smaller groups, each focusing on a unique task. MoE operates in a similar way—it breaks a complex problem into smaller sub-tasks, with each “expert” network handling a specific sub-task.

1.1 Key Advantages of MoE Models

Faster Pretraining: For a given compute budget, MoE models reach a target quality faster than comparable dense models.
Improved Inference Speed: Because only a few experts run per token, an MoE model is faster at inference than a dense model with the same total parameter count.
Trade-off, Higher VRAM Requirements: MoE models demand more video memory since all “experts” must be stored simultaneously, even though only a few are active at a time.

An MoE model consists of two core components:

Experts: Smaller neural networks specialized in different tasks.
Router: This module dynamically selects and activates the relevant experts for a given input. By activating only relevant experts, MoE optimizes performance and computational efficiency.
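To make the router and expert roles concrete, here is a minimal sketch of top-k routing in plain Python. The toy expert functions, the gate scores, and the `route` helper are all hypothetical illustrations; this is not the implementation used by any of the models discussed in this article.

```python
import math

def softmax(scores):
    """Turn raw gate scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_scores, experts, x, k=2):
    """Run only the k highest-scoring experts and mix their outputs."""
    probs = softmax(gate_scores)
    # Indices of the k experts with the largest gate probability.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalise the selected weights so they sum to 1.
    norm = sum(probs[i] for i in top)
    output = sum(probs[i] / norm * experts[i](x) for i in top)
    return output, top

# Four toy "experts": each is just a scalar function (hypothetical).
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
# In a real model these scores come from a learned gating layer.
gate_scores = [0.1, 2.0, 1.5, -1.0]

output, active = route(gate_scores, experts, x=3.0, k=2)
print(active)  # indices of the experts that actually ran
print(output)  # weighted mix of the two selected experts' outputs
```

Only two of the four experts are evaluated for this input, which is where the compute saving comes from; all four still have to be kept in memory.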

2 Popular MoE Models

MoE models have gained significant attention in AI research due to their ability to efficiently scale large language models while maintaining high performance. Notable examples like Mixtral 8X7B leverage sparse MoE architectures to activate only subsets of experts for specific inputs. This approach matches the performance of much larger dense models while significantly improving efficiency.

Let’s explore some prominent MoE models and implement them in Python using Ollama on Google Colab.

2.1 1. Mixtral 8X7B

Mixtral 8X7B is a decoder-only transformer model, where the input tokens are embedded as vectors, processed through decoder layers, and outputted as probabilities of each token occurring. The architecture incorporates a Sparse Mixture of Experts (SMoE) to handle word vectors efficiently, significantly reducing computation costs.

Notable Features:

Total experts: 8
Active experts per input: 2
Decoder layers: 32
Vocabulary size: 32,000
Embedding size: 4,096
Individual expert size: 5.6 billion parameters (shared with embedding, normalization layers, etc.).
Activated parameters: about 12.9 billion per token.
Context length: 32k tokens.

Mixtral 8X7B has demonstrated proficiency in various tasks, including text generation, translation, summarization, sentiment analysis, educational content, customer support automation, and research assistance. Its architecture ensures versatility across domains.

2.2 2. DBRX

DBRX, developed by Databricks, is a decoder-only, Transformer-based LLM trained on next-token prediction. It uses fine-grained MoE architectures, achieving a total of 132 billion parameters where only 36 billion are activated per input. Notably, DBRX consists of more, smaller experts compared to Mixtral and Grok-1.

Key Architectural Features:

Fine-Grained Experts: Experts are divided into segments, enabling higher specialization without parameter inflation.
Number of experts: 16
Active experts per layer: 4
Decoder layers: 40
Active parameters: 36 billion
Total parameters: 132 billion
Context length: 32k tokens
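One way to see what “more, smaller experts” buys is to count how many distinct expert subsets the router can choose from in each layer. The sketch below uses only the expert counts listed in this article (8 choose 2 for Mixtral, 16 choose 4 for DBRX); the variable names are illustrative.

```python
from math import comb

# (total experts, experts active per token) as listed in this article
mixtral = (8, 2)
dbrx = (16, 4)

mixtral_combos = comb(*mixtral)
dbrx_combos = comb(*dbrx)

print(mixtral_combos)                # possible expert pairs in Mixtral
print(dbrx_combos)                   # possible expert quadruples in DBRX
print(dbrx_combos / mixtral_combos)  # how many times more combinations
```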

DBRX excels in use cases like code generation, mathematical reasoning, and complex language understanding.

2.3 3. Deepseek-v2

Deepseek-v2 employs two core ideas:

Fine-Grained Experts: Divided into smaller segments for more focused specialization and knowledge retrieval.
Shared Experts: Certain universally relevant experts remain constantly activated to generalize knowledge across tasks.

Key Features:

Total parameters: 236 billion
Active parameters: 21 billion
Routed experts per layer: 160 (6 activated per token)
Shared experts per layer: 2 (always active)
Active experts per layer: 8 (6 routed + 2 shared)
Decoder layers: 60
Context length: 128k tokens
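The gap between total and active parameters is the core of the efficiency argument. As a rough illustration, the share of parameters used per token can be computed from the figures quoted in this article (the dictionary below simply restates them):

```python
# (total parameters, active parameters) in billions, as listed in this article
models = {
    "DBRX": (132, 36),
    "Deepseek-v2": (236, 21),
}

active_share = {name: active / total for name, (total, active) in models.items()}

for name, share in active_share.items():
    print(f"{name}: {share:.1%} of parameters active per token")
```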

Deepseek-v2 is exceptionally skilled in conversation-based applications like chatbots, content creation, language translation, and summarization while also excelling at code generation.

3 Implementing an MoE Model in Python

Now, let’s implement an MoE model using Python.

3.1 Step 1: Install Required Python Libraries


!sudo apt update
!sudo apt install -y pciutils
!pip install langchain-ollama
!curl -fsSL https://ollama.com/install.sh | sh
!pip install ollama==0.4.2

3.2 Step 2: Enable Multi-Threading

Launch the Ollama server using threading:


import threading
import subprocess
import time

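def run_ollama_serve():
  # Start the Ollama server as a background process.
  subprocess.Popen(["ollama", "serve"])

thread = threading.Thread(target=run_ollama_serve)
thread.start()
# Give the server a few seconds to come up before sending requests.
time.sleep(5)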

3.3 Step 3: Pull an Ollama Model (e.g., DBRX)

!ollama pull dbrx

3.4 Step 4: Query the Model

Use the DBRX model for text summarization:


from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
from IPython.display import Markdown, display

template = """Question: {question}

Answer: Let's think step by step."""

prompt = ChatPromptTemplate.from_template(template)

model = OllamaLLM(model="dbrx")

chain = prompt | model

# Input for processing
input_data = {
    "question": 'Summarize this text into one sentence: "Bob is a boy. Bob has a dog. Bob takes his dog for a walk. At the park, Bob throws a stick for the dog to retrieve. The dog chases a squirrel; Bob chases his dog. Bob finds his dog, and they return home together."'
}

# Process input and display output
response = chain.invoke(input_data)
display(Markdown(response))

4 Performance Evaluation of MoE Models

To better understand the practical implications of MoE models, let’s assess their performance across common natural language processing (NLP) tasks such as logical reasoning, summarization, entity extraction, and mathematical reasoning.

4.1 1. Logical Reasoning Task

Input Prompt:
“Provide a list of 13 words, each containing exactly 9 letters.”

Model Output:


Here are 13 words, each containing 9 letters:  
1. Agriculture  
2. Beautiful  
3. Chocolate  
4. Dangerous  
5. Encyclopedia  
6. Fireplace  
7. Grammarly  
8. Hamburger  
9. Important  
10. Juxtapose  
11. Kitchener  
12. Landscape  
13. Necessary

Analysis:
Out of the 13 words listed by the model, 11 have exactly 9 letters; “Agriculture” (11 letters) and “Encyclopedia” (12 letters) do not, and “Grammarly” and “Kitchener” are proper nouns. This reveals a partial failure in following a precise constraint. While MoE models excel in efficiency and general knowledge-based tasks, handling highly specific constraints like word length may require fine-tuning or training data aligned with the task.
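A response like this is easy to check programmatically rather than by eye. The short sketch below counts the letters of each word in the model output above; the variable names are illustrative.

```python
words = [
    "Agriculture", "Beautiful", "Chocolate", "Dangerous", "Encyclopedia",
    "Fireplace", "Grammarly", "Hamburger", "Important", "Juxtapose",
    "Kitchener", "Landscape", "Necessary",
]

# Words that satisfy the 9-letter constraint, and the lengths of those that do not.
correct = [w for w in words if len(w) == 9]
wrong = {w: len(w) for w in words if len(w) != 9}

print(len(correct), "of", len(words), "words have exactly 9 letters")
print(wrong)
```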

4.2 2. Summarization Task

Input Prompt:
“Summarize the following text into one sentence: ‘Bob is a boy. Bob has a dog. Bob takes his dog for a walk. At the park, Bob throws a stick for the dog to retrieve. The dog chases a squirrel; Bob chases his dog. Bob finds his dog, and they return home together.’”

Model Output:

Bob enjoys a lively outing with his dog at the park, playing fetch, chasing a squirrel, and returning home together.

Analysis:
The output demonstrates strong summarization capabilities. The response is concise yet captures the key elements of the input text. This showcases the model’s competence for tasks requiring high-level understanding and condensing information for clarity.

4.3 3. Entity Extraction Task

Input Prompt:
“Extract all numeric values and their corresponding units from the following text: ‘A marathon covers a distance of 42 kilometers, with over 30,000 participants.’”

Model Output:


1. 42 (kilometers, marathon distance)  
2. 30,000 (number of participants)

Analysis:
The model correctly identifies both numeric entities and provides their corresponding contexts and units. This demonstrates its proficiency in extracting meaningful information in structured or semi-structured formats, making it suitable for tasks like data mining, financial analysis, or legal document review.

4.4 4. Mathematical Reasoning Task

Input Prompt:
“I have 2 apples. I bought 2 more. I used 2 apples to bake a pie. After eating half the pie, how many apples do I have left?”

Model Output:


1. Starting with 2 apples.  
2. Buying 2 more brings the total to 4 apples (2 + 2 = 4).  
3. Using 2 apples for a pie leaves 2 apples (4 - 2 = 2).  
4. Eating half the pie doesn't impact the number of apples left, as it's unrelated to pie contents.  
Final count: 2 apples.

Analysis:
The reasoning is accurate, and the model successfully arrives at the correct answer. This demonstrates the model’s ability to perform chained reasoning and disentangle task-independent details.

5 Final Thoughts

Mixture of Experts (MoE) represents a major evolution in deep learning, offering an efficient and scalable alternative to traditional dense architectures. Models like Mixtral 8X7B, DBRX, and Deepseek-v2 demonstrate how sparse activation, fine-grained expert segmentation, and shared knowledge mechanisms contribute to groundbreaking advances across diverse domains like NLP, code generation, and summarization.

As MoE technologies continue evolving, we expect further innovations to address their memory and routing bottlenecks, opening the door to even more complex, capable AI systems.

6 Frequently Asked Questions (FAQ)

Q1: What sets MoE models apart from traditional dense models?
A: MoE models activate only task-relevant experts, reducing computational demands and improving efficiency without compromising performance.

Q2: How are experts selected in MoE models?
A: A routing mechanism dynamically selects the most relevant experts based on the input.

Q3: Can MoE models handle highly complex tasks like math reasoning or programming?
A: Yes, models like DBRX are specifically designed for complex tasks, although some challenges persist in precision-critical queries.

Q4: What are the hardware requirements for deploying MoE models?
A: All experts, active or not, must be held in memory, so GPUs with ample VRAM are critical; techniques such as quantization or offloading experts to CPU memory can reduce the footprint.

Q5: Which tasks benefit the most from MoE models?
A: NLP, summarization, conversational AI, code generation, and entity extraction are some of the most common and effective applications of MoE models.

Support Vector Machine (SVM) from a Mathematical Perspective: Solving Optimization Problems

blog@zair.top (Tim) — Wed, 27 Nov 2024 16:53:06 +0800

Support Vector Machine (SVM) is a classic algorithm in machine learning. This article focuses on the formula derivation in SVM, such as detailed reasoning of the margin distance formula, and the formulation of the primal and dual problems. It delves into optimization problems, including constructing the Lagrangian function to handle constrained optimization problems and using KKT conditions to find optimal solutions. It also covers the characteristics of polynomial and Gaussian kernel functions.

Margin Distance Derivation

In Support Vector Machine (SVM), the equations for the positive and negative hyperplanes are respectively: $$ \vec{w} \cdot \vec{x} + b = 1 \quad \text{(Positive Hyperplane)} $$ $$ \vec{w} \cdot \vec{x} + b = -1 \quad \text{(Negative Hyperplane)} $$ where $\vec{w}=(w_1, w_2)$ is the weight vector, $b$ is the bias term, and $\vec{x}=(x_1, x_2)$ is the data point.

Assume $\vec{x_m}$ is a point on the positive hyperplane, and $\vec{x_n}$ is a point on the negative hyperplane, then: $$ w_1 x_{1m} + w_2 x_{2m} + b = 1 \quad \text{(1)} $$ $$ w_1 x_{1n} + w_2 x_{2n} + b = -1 \quad \text{(2)} $$

Subtracting equation (2) from equation (1), we get: $$ w_1 (x_{1m} - x_{1n}) + w_2 (x_{2m} - x_{2n}) = 2 $$ In vector form: $$ \vec{w} \cdot (\vec{x_m} - \vec{x_n}) = 2 \quad \text{(3)} $$ Consider two points $\vec{x_0}$ and $\vec{x_p}$ on the decision hyperplane, which satisfy the decision hyperplane equation $\vec{w} \cdot \vec{x} + b = 0$, i.e.: $$ w_1 x_{10} + w_2 x_{20} + b = 0 $$ $$ w_1 x_{1p} + w_2 x_{2p} + b = 0 $$ Subtracting these two equations gives: $$ w_1 (x_{10} - x_{1p}) + w_2 (x_{20} - x_{2p}) = 0 $$ In vector form: $$ \vec{w} \cdot (\vec{x_0} - \vec{x_p}) = 0 \quad \text{(4)} $$ Equation (4) indicates that $\vec{w}$ is perpendicular to the vector difference between any two points on the decision hyperplane.

From equations (3) and (4), we know that the dot product of $\vec{w}$ and $(\vec{x_m} - \vec{x_n})$ is 2. According to the definition of vector dot product $\vec{a} \cdot \vec{b}=|\vec{a}| \cdot |\vec{b}| \cdot \cos \theta$, where $\theta$ is the angle between $\vec{w}$ and $(\vec{x_m} - \vec{x_n})$, we have: $$ |\vec{x_m} - \vec{x_n}| \cdot \cos \theta \cdot |\vec{w}| = 2 $$ Let $L = |\vec{x_m} - \vec{x_n}| \cdot \cos \theta$, then: $$ L \cdot |\vec{w}| = 2 $$ Solving for $L$ gives: $$ L=\frac{2}{|\vec{w}|} $$

Here, $L$ is the margin distance of the SVM.

In deriving the margin distance, we utilized the geometric meaning of the vector dot product, i.e., $\vec{a} \cdot \vec{b}=|\vec{a}| \cdot |\vec{b}| \cdot \cos \theta$, where $\theta$ is the angle between the two vectors. Through this relationship, we transformed the dot product into a relationship involving vector magnitudes and angles, thus deriving the expression for the margin distance.
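As a quick numerical sanity check of $L = \frac{2}{|\vec{w}|}$, the sketch below picks one point on each margin hyperplane along the direction of $\vec{w}$ and measures the distance between them. The weight vector and bias are arbitrary values chosen for illustration.

```python
import math

w = (3.0, 4.0)  # hypothetical weight vector
b = -2.0        # hypothetical bias

norm_w = math.hypot(*w)
margin = 2 / norm_w  # the formula derived above

# Points of the form x = t * w that satisfy w.x + b = +1 and w.x + b = -1.
t_pos = (1 - b) / norm_w**2
t_neg = (-1 - b) / norm_w**2
x_pos = tuple(t_pos * wi for wi in w)
x_neg = tuple(t_neg * wi for wi in w)

# Both points lie on a line parallel to w, so their distance is the margin.
distance = math.dist(x_pos, x_neg)
print(margin, distance)
```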

Dual Equivalence Proof

In linear Support Vector Machine (SVM), the primal problem is to find the weight vector $w$ and bias $b$ that minimize the objective function:

$$ \min_w f(w) = \frac{1}{2} |w|^2 $$

Here, $|w|^2$ is the squared Euclidean ($L_2$) norm of the vector $w$. Since the margin width is $\frac{2}{|w|}$, minimizing $|w|^2$ maximizes the margin, which leads to better generalization ability. This problem is subject to the following constraints:

$$ y_j (w^T x_j + b) - 1 \geq 0 $$

Here, $x_j$ is the $j$-th training sample, and $y_j$ is the corresponding label, taking values of +1 or -1. This ensures that all data points are correctly classified and lie on or outside the margin hyperplanes, i.e., at a distance of at least $\frac{1}{|w|}$ from the decision boundary.

To handle this constrained optimization problem, we construct the Lagrangian function:

$$ L(w, b, \alpha) = f(w) - \sum_{j = 1}^n \alpha_j g_j(w, b) $$

Here, $\alpha_j \geq 0$ are the Lagrange multipliers used to introduce the constraint conditions of the primal problem $g_j(w, b) = y_j (w^T x_j + b) - 1 \geq 0$.

Next, we define the dual function $q(\alpha)$ as:

$$ q(\alpha) = \min_{w, b} L(w, b, \alpha) = \min_{w, b} \left( f(w) - \sum_{j = 1}^n \alpha_j g_j(w, b) \right) $$

Since $\alpha_j \geq 0$ and $g_j(w^{*}, b^{*}) \geq 0$, we can derive:

$$ q(\alpha) = \min_{w, b} \left( f(w) - \sum_{j = 1}^n \alpha_j g_j(w, b) \right) \leq f(w^*) - \sum_{j = 1}^n \alpha_j g_j(w^*, b^*) \leq f(w^*) \leq f(w) $$

This means the dual function provides a lower bound for the primal problem. Next, we need to find an $\alpha^*$ such that:

$$ q(\alpha) \leq q(\alpha^*) \leq f(w^*) \leq f(w) $$

The dual problem of SVM can be expressed as:

$$ \max_{\alpha} q(\alpha) = \max_{\alpha} \min_{w, b} L(w, b, \alpha) $$

With the constraint: $ \alpha_i \geq 0 $

Weak duality, $q(\alpha^*) \leq f(w^*)$, always holds. Strong duality, $q(\alpha^*) = f(w^*)$, holds for a convex problem when Slater’s condition is satisfied. Slater’s condition requires the existence of a feasible solution such that all inequality constraints are strictly satisfied (for affine constraints, plain feasibility is enough). When the training data are linearly separable, the linear SVM problem satisfies Slater’s condition.

Thus, for any feasible $w$ and any $\alpha \geq 0$, we have:

$$ f(w) \geq q(\alpha^*) = f(w^*) \geq q(\alpha) $$

From the above equation, we can deduce:

$$ q(\alpha^*) \geq q(\alpha) $$ $$ f(w^*) \leq f(w) $$

The primal problem minimizes $f(w)$, the dual problem maximizes $q(\alpha)$, and their optimal values are equal: $ w^*, b^* $ are the solutions to the primal problem, $\alpha^*$ is the solution to the dual problem, and $f(w^*) = q(\alpha^*)$.

We can see that in linear SVM, when specific conditions (Slater’s condition) are met, the solutions to the primal and dual problems are consistent. This provides an effective way to solve complex optimization problems: when the primal problem is difficult to solve directly, we can solve the dual problem instead.

Simple Example

To more intuitively understand the equivalence of the solutions to the primal and dual problems mentioned above, consider a simple optimization problem defined as follows:

The primal problem is: $$ \min_x f(x) = x^2 $$ With the constraint: $$ x - 1 \geq 0 $$

The goal of this problem is to minimize the function $f(x) = x^2$, while $x$ needs to satisfy $x \geq 1$. Intuitively, we know that when $x = 1$, $f(x) = 1$, which is the minimum value under the given constraint.

To verify duality, we construct the Lagrangian function $L(x, \alpha) = x^2 - \alpha(x - 1)$ and the corresponding dual function:

$$ q(\alpha) = \min_x L(x, \alpha) = \min_x (x^2 - \alpha(x - 1)) $$

Here, $\alpha \geq 0$ is the Lagrange multiplier used to introduce the constraint condition $x - 1 \geq 0$ from the primal problem. By constructing the Lagrangian function, we convert the constrained optimization problem into an unconstrained problem.

Next, we take the partial derivative of $L(x, \alpha)$ with respect to $x$ and set it to zero:

$$ \frac{\partial L}{\partial x} = 2x - \alpha = 0 $$

Solving for $x$ gives:

$$ x = \frac{\alpha}{2} $$

Substituting $x = \frac{\alpha}{2}$ into $q(\alpha)$:

$$ q(\alpha) = - \frac{\alpha^2}{4} + \alpha $$

Now we have obtained the form of the dual function $q(\alpha)$. Next, we need to solve for the maximum value of the dual problem $\max_{\alpha} q(\alpha) $

To do this, we take the derivative of $q(\alpha)$ with respect to $\alpha$ and set it to zero:

$$ \frac{dq}{d\alpha} = - \frac{\alpha}{2} + 1 = 0 $$

Solving for $\alpha$ gives $$ \alpha = 2 $$

Substituting $\alpha = 2$ into $x = \frac{\alpha}{2}$, we get: $$ x = 1 $$

At this point, substituting $\alpha = 2$ into $q(\alpha)$, we calculate:

$$ q(\alpha) = - \frac{2^2}{4} + 2 = 1 $$

Through this simple example, we can see that the solution to the primal problem $x = 1$, $f(x) = 1$, is equivalent to the solution to the dual problem $\alpha = 2$, $q(\alpha) = 1$. This verifies that under certain conditions, the solutions to the dual problem and the primal problem are consistent.

By applying duality theory, we obtained the same optimal value from the dual problem as from the primal problem, which verifies the equivalence of the two in this example.
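The same example can be checked numerically. The sketch below is a simple grid search, not a general solver: it minimizes $f(x) = x^2$ over $x \geq 1$, maximizes $q(\alpha)$ over $\alpha \geq 0$ using the closed form $x = \frac{\alpha}{2}$ from above, and compares the two optimal values. The grid ranges and step size are arbitrary choices.

```python
def f(x):
    """Primal objective."""
    return x * x

def q(alpha):
    """Dual function: L(x, alpha) evaluated at its minimiser x = alpha / 2."""
    x = alpha / 2
    return x * x - alpha * (x - 1)

# Coarse grids with step 0.001 over the feasible sets.
xs = [1 + i / 1000 for i in range(4001)]   # x >= 1
alphas = [i / 1000 for i in range(5001)]   # alpha >= 0

x_star = min(xs, key=f)
alpha_star = max(alphas, key=q)

print(x_star, f(x_star))          # primal optimum
print(alpha_star, q(alpha_star))  # dual optimum
```

On this grid every value of $q(\alpha)$ stays at or below the primal optimum (weak duality), and the two optima coincide (strong duality).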

Solving with KKT Conditions

SVM Satisfying KKT Conditions

The original optimization problem of SVM is a convex optimization problem. The objective function of SVM $\frac{1}{2}|w|^2$ is a quadratic function, which is a convex function with respect to $w$. At the same time, the constraint condition $y_i(w \cdot x_i + b) \geq 1$ is linear (affine constraint), and therefore also convex. In convex optimization problems, a local optimal solution is a global optimal solution, and the KKT conditions are necessary and sufficient conditions. This means that if a point satisfies the KKT conditions, it is a global optimal solution.

The objective function $\frac{1}{2}|w|^2$ is continuous and differentiable, and the constraint condition $y_i(w \cdot x_i + b) \geq 1$ is also continuous and differentiable. This smoothness ensures the existence and uniqueness of gradients, allowing the gradient conditions in the KKT conditions (i.e., taking partial derivatives with respect to $w$ and $b$ and setting them to zero) to be effectively applied.

In convex optimization problems, the KKT conditions are not only necessary conditions but also sufficient conditions. That is, if a point satisfies the KKT conditions, it must be a global optimal solution. For SVM, by solving the KKT conditions, we can find the optimal $w^*$ and $b^*$, thereby determining the best separating hyperplane.

Solving Linear Support Vector Machine Using KKT Conditions

The original SVM optimization problem is to minimize $\frac{1}{2}|w|^{2}$ while satisfying the constraint $y_{i}(w\cdot x_{i}+b)\geqslant1$, where $i = 1,2,\cdots,N$.

First, construct the Lagrangian function $L(w,b,\alpha)=\frac{1}{2}|w|^{2}-\sum_{i = 1}^{N}\alpha_{i}(y_{i}(w\cdot x_{i}+b)-1)$, where $\alpha_{i}\geqslant0$ are the Lagrange multipliers. According to the KKT conditions, we have:

$$ \nabla_{w}L(w^*,b^*,\alpha^*) = w^*-\sum_{i = 1}^{N}\alpha_{i}^*y_{i}x_{i}=0 $$

$$ \nabla_{b}L(w^*,b^*,\alpha^*)=-\sum_{i = 1}^{N}\alpha_{i}^*y_{i}=0 $$

$$ \alpha_{i}^*(y_{i}(w^*\cdot x_{i}+b^*)-1)=0 $$

$$ y_{i}(w^*\cdot x_{i}+b^*)-1\geqslant0 $$

$$ \alpha_{i}^*\geqslant0 $$

These conditions apply to all $i = 1,2,\cdots,N$.

From $\nabla_{w}L(w^*,b^*,\alpha^*) = w^*-\sum_{i = 1}^{N}\alpha_{i}^*y_{i}x_{i}=0$, we can derive

$$ w^*=\sum_{i = 1}^{N}\alpha_{i}^*y_{i}x_{i} \quad \text{(5)} $$ At least one $\alpha_{j}^*>0$ must exist: if every $\alpha_{i}^*$ were zero, equation (5) would give $w^*=0$, and the constraints $y_{i}(w^*\cdot x_{i}+b^*)\geqslant1$ would reduce to $y_{i}b^*\geqslant1$, which cannot hold for both classes at once.

For solving $b^*$, by substituting $w^*=\sum_{i = 1}^{N}\alpha_{i}^*y_{i}x_{i}$ into $y_{j}(w^*\cdot x_{j}+b^*)-1 = 0$ (considering the case where $\alpha_{j}^*>0$ exists), and noting that $y_{j}^{2}=1$, we obtain:

$$ b^*=y_{j}-\sum_{i = 1}^{N}\alpha_{i}^*y_{i}(x_{i}\cdot x_{j}) \quad \text{(6)} $$

Based on the above theory, the separating hyperplane can be expressed as:

$$ \sum_{i = 1}^{N}\alpha_{i}^*y_{i}(x\cdot x_{i})+b^*=0 $$

Thus, the classification decision function can be written as:

$$ f(x)=\text{sign}(\sum_{i = 1}^{N}\alpha_{i}^*y_{i}(x\cdot x_{i})+b^*) $$

In SVM, the complementary slackness condition $\alpha_i (y_i(w \cdot x_i + b) - 1) = 0$ indicates that if a sample point $x_i$ is not a support vector (i.e., $y_i(w \cdot x_i + b) > 1$), then the corresponding Lagrange multiplier $\alpha_i$ must be zero. Conversely, if a sample point is a support vector (i.e., $y_i(w \cdot x_i + b) = 1$), then the corresponding $\alpha_i$ can be non-zero. This condition ensures that only support vectors contribute to the solution of the optimization problem, simplifying the problem-solving process.
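To see equations (5) and (6) at work, here is a small sketch on a two-point dataset. The data and the multipliers $\alpha = (0.25, 0.25)$ are assumptions worked out by hand for this toy case; a real SVM would obtain $\alpha^*$ from a quadratic programming solver.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy training set: one point per class (hypothetical data).
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
alpha = [0.25, 0.25]  # optimal multipliers for this toy set, derived by hand

# Equation (5): w* = sum_i alpha_i y_i x_i
w = [sum(a * yi * xi[d] for a, yi, xi in zip(alpha, y, X)) for d in range(2)]

# Equation (6): b* = y_j - sum_i alpha_i y_i (x_i . x_j), using j = 0 (alpha_0 > 0)
j = 0
b = y[j] - sum(a * yi * dot(xi, X[j]) for a, yi, xi in zip(alpha, y, X))

def decide(x):
    """Decision function written only in terms of inner products with training points."""
    score = sum(a * yi * dot(x, xi) for a, yi, xi in zip(alpha, y, X)) + b
    return 1 if score >= 0 else -1  # a score of exactly 0 is assigned to +1 here

# KKT checks: sum_i alpha_i y_i = 0 and complementary slackness.
stationarity_b = sum(a * yi for a, yi in zip(alpha, y))
slack = [yi * (dot(w, xi) + b) - 1 for xi, yi in zip(X, y)]

print(w, b)
print(stationarity_b, slack)
print(decide((2.0, 0.5)), decide((-0.3, -1.0)))
```

Both training points are support vectors here, so both slack values are zero and both multipliers are positive.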

Polynomial and Gaussian Kernel Functions

If the existing problem is not linearly separable, we can map the existing data to a higher-dimensional space, making it a linearly separable problem in that space. However, directly performing calculations in a high-dimensional feature space can be very complex. From equations (5) and (6), we know that we do not need to actually map the data to a high-dimensional space; we only need to know the inner product between data points. The role of the kernel function is to avoid explicitly performing high-dimensional feature mapping by indirectly calculating the inner product in the high-dimensional feature space through the kernel function value in the original feature space.

The Gaussian kernel function is a common kernel function, with the form: $$ K(x, y) = \exp\left(-\gamma |x - y|^2\right) $$

where $\gamma$ is a positive parameter that controls the width of the kernel function.

We can perform a Taylor expansion on the exponential function:

$$ \exp(z) = \sum_{k=0}^{\infty} \frac{z^k}{k!} $$

Substituting $ z = -\gamma |x - y|^2 $ into the above formula, we get:

$$ K(x, y) = \exp\left(-\gamma |x - y|^2\right) = \sum_{k=0}^{\infty} \frac{(-\gamma |x - y|^2)^k}{k!} $$

The polynomial kernel function has the form:

$$ K_{\text{poly}}(x, y) = (x \cdot y + c)^d $$

where $ c $ is a constant term, and $ d $ is the degree of the polynomial.

$|x - y|^2$ can be expanded as:

$$ |x - y|^2 = (x - y) \cdot (x - y) = x \cdot x + y \cdot y - 2 x \cdot y $$

Substituting this expression into the Taylor expansion of the Gaussian kernel function:

$$ K(x, y) = \sum_{k=0}^{\infty} \frac{(-\gamma (x \cdot x + y \cdot y - 2 x \cdot y))^k}{k!} $$

We can see that each term $ \frac{(-\gamma (x \cdot x + y \cdot y - 2 x \cdot y))^k}{k!} $ is a polynomial in the inner products $ x \cdot x $, $ y \cdot y $, and $ x \cdot y $, meaning each term can be expressed as a combination of different powers of $ x $ and $ y $.

If we closely observe each term, we can find that the Gaussian kernel function is actually obtained by combining polynomial kernel functions of all orders. The connection becomes explicit after factoring the kernel as $ K(x, y) = \exp(-\gamma |x|^2) \exp(-\gamma |y|^2) \exp(2 \gamma \, x \cdot y) $ and expanding the last factor: $ \exp(2 \gamma \, x \cdot y) = \sum_{k=0}^{\infty} \frac{(2\gamma)^k}{k!} (x \cdot y)^k $, where $ (x \cdot y)^k $ is the polynomial kernel function of degree $ k $ with $ c = 0 $.

For example, when $ k = 1 $:

$$ \frac{(-\gamma (x \cdot x + y \cdot y - 2 x \cdot y))^1}{1!} = -\gamma (x \cdot x + y \cdot y - 2 x \cdot y) $$

When $ k = 2 $:

$$ \frac{(-\gamma (x \cdot x + y \cdot y - 2 x \cdot y))^2}{2!} = \frac{\gamma^2 (x \cdot x + y \cdot y - 2 x \cdot y)^2}{2} $$

These terms are polynomial forms of $ x $ and $ y $, each weighted by $ \frac{1}{k!} $.

The Gaussian kernel function can therefore be viewed as an infinite weighted combination of polynomial kernel functions of every order, which corresponds to an infinite-dimensional feature space. This allows the Gaussian kernel function to capture more complex nonlinear relationships. Therefore, in many nonlinear task scenarios, the Gaussian kernel function is a good choice.
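The relationship between the Gaussian kernel and polynomial terms can be checked numerically. The sketch below relies on the identity $\exp(-\gamma |x - y|^2) = \exp(-\gamma |x|^2) \exp(-\gamma |y|^2) \sum_{k} \frac{(2\gamma)^k (x \cdot y)^k}{k!}$ and compares a truncated series against the exact kernel value. The sample points, $\gamma$, and the helper names are arbitrary choices for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gaussian_kernel(x, y, gamma):
    diff = [a - b for a, b in zip(x, y)]
    return math.exp(-gamma * dot(diff, diff))

def series_kernel(x, y, gamma, terms):
    """Truncated sum of degree-k polynomial terms (x.y)^k, weighted by (2*gamma)^k / k!."""
    prefactor = math.exp(-gamma * dot(x, x)) * math.exp(-gamma * dot(y, y))
    xy = dot(x, y)
    total = sum((2 * gamma) ** k * xy ** k / math.factorial(k) for k in range(terms))
    return prefactor * total

x, y, gamma = (1.0, 2.0), (0.5, -1.0), 0.5
exact = gaussian_kernel(x, y, gamma)

# The approximation improves as more polynomial orders are included.
for terms in (1, 3, 10, 20):
    print(terms, series_kernel(x, y, gamma, terms), exact)
```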

VLDB 2024 Guangzhou Trip

blog@zair.top (Tim) — Mon, 23 Sep 2024 11:02:06 +0800

VLDB 2024 is a top international conference in the field of data management and databases, covering areas such as data management, database architecture, graph data management, data privacy and security, data mining, machine learning, artificial intelligence, and database systems research. This year’s conference was held from August 26 to 30 in Guangzhou, China, bringing together researchers and enterprises from around the world. The conference featured a rich program with over 250 research presentations, 11 keynote speeches, and multiple tutorials and workshops, providing attendees with extensive learning and networking opportunities.

Recently, I went to Guangzhou to attend VLDB 2024 and experienced the full academic conference process. I encountered many interesting things during the trip and thought of writing a blog to share.

Note

This article is a travel-oriented write-up, covering impressions of Guangzhou and an introduction to how an academic conference runs. It does not go into the content of specific talks, so feel free to read it at leisure.

The trip lasted 6 days. The first 5 days were spent near the Canton Fair Complex on Pazhou Island, mainly attending the conference and taking the Pearl River night cruise. The last day was a whirlwind “special forces-style” tour: starting from Shamian Island in the west, passing the Thirteen Hongs Museum, walking to the Sacred Heart Cathedral, and finishing at the Huangpu Military Academy Memorial in the east.

Day 0: A Little Guangzhou Shock for the Socialites

At 3 PM, I took the subway from Guangzhou South Station to Pazhou Station. Before even exiting the station, I realized something was off as the subway was packed with cosplayers. Coincidentally, the mischievous map chose the exit at the Poly World Trade Center, which turned out to be the nest of the second-dimensional creatures: from the subway exit to the underground of the venue, from the Poly World Trade Center to the Canton Fair Complex, Pazhou Island was crowded with them! Not only were there many participants, but the variety of cosplay characters was also rich, ranging from domestic games to Japanese anime, from Laoda to Trump assassination… The parking lot was also filled with itasha cars, almost none of which were electric cars, probably because the owners were also deep into the second dimension.

After finally breaking through the crowd to the hotel, I found out when I went to eat at 9 PM that the comic convention hadn’t ended yet, and a night market had even been set up along the riverbank, even though the next day was a Monday! The second-dimensional atmosphere in Guangzhou is just too strong.

Day 1: Academic Locusts and Banquet Blunders

VLDB2024 was hosted by the Hong Kong University of Science and Technology (Guangzhou) at the Langham Hotel in Guangzhou. Registration on the first day started at 8 AM, and being too excited for my first academic conference, I arrived early. During registration, they handed out badges, conference manuals, schedules, meal tickets, boat tickets, a box of souvenirs, and a bunch of sponsor advertisements. The souvenirs included a hardcover notebook, a ballpoint pen, and an umbrella, available in black and blue themes.

The first day’s conference content was entirely organized in the form of workshops. There were workshops on different themes running simultaneously, so it was necessary to choose the themes of interest in advance. I chose themes related to LLM, Knowledge Graph, and Vector Database. Understanding the presentation content was not difficult, but during the discussion session, a few guys got really into it, and their speech speed gradually increased, making it hard to follow. My English listening and speaking skills still need practice.

What are the sessions at an academic conference?

1. Opening Ceremony

The formal start of the conference.
Speeches by the organizers or heads of the hosting institutions, introducing the conference theme, agenda, and important attendees.
Sometimes includes special welcome ceremonies or cultural displays.

2. Keynote Speech

Delivered by a renowned scholar or expert in a particular field, usually centered around the conference theme.
Keynote speeches are often inspiring, providing forward-looking or summarizing perspectives for attendees.
It is a highly anticipated session, generally lasting 45 minutes to an hour.

3. Parallel Sessions

Sessions are grouped by research direction or theme and held simultaneously in multiple rooms.
Each group has multiple presentations where researchers showcase their research findings.
Each presentation is followed by a brief Q&A session for audience questions and discussions.

4. Poster Session

Researchers display their research findings through posters, usually held in the exhibition area or during breaks.
Attendees can freely visit various poster booths during designated times, interact with researchers, and have discussions.
Poster sessions are suitable for showcasing novel or immature research projects.

5. Panel Discussion

A discussion group composed of multiple experts, engaging in dialogue on a specific topic.
Experts first present their views, followed by guided interaction and discussion by a moderator, and finally open to audience questions.
Panel discussions help explore research issues or hot topics in an academic field from multiple perspectives.

6. Roundtable Discussion

Participants engage in open discussion around a specific theme in a roundtable format.
Roundtable discussions emphasize interactivity, allowing participants to speak, ask questions, and share views at any time.

7. Workshop

Focuses on hands-on and interactive learning, where participants learn new skills, methods, or tools at the conference.
Workshops usually require pre-registration and sometimes require participants to bring their own data or materials.

8. Networking Sessions

Social sessions within the conference aimed at providing attendees with a relaxed environment for informal exchanges and building collaborations.
Networking activities may include dinners, tea breaks, lunches, and receptions.

9. Closing Ceremony

The summary and closing session of the conference.
Organizers review the main content and outcomes of the conference, possibly recognizing outstanding presentations, papers, or research achievements.
Usually announces the time and location of the next conference.

10. Q&A Sessions

Follow each presentation or panel discussion, allowing the audience to ask questions.
Researchers or speakers provide detailed answers to audience questions, promoting two-way communication.

11. Breakout Sessions

In large conferences, attendees are often divided into several groups to discuss specific subtopics or research directions.
Each group focuses on a specific theme, and may report discussion outcomes or suggestions.

12. Exhibition Booths

In some academic conferences, especially in technical or applied fields, there are usually exhibition areas.
Related institutions, publishers, and companies showcase their products, technologies, services, or publications, and attendees can freely visit.

13. Awards and Recognition

Some conferences have award sessions for papers, presentations, or research achievements.
Outstanding researchers, papers, poster displays, etc., may be recognized during the closing ceremony or a dedicated award ceremony.

14. Coffee Breaks and Lunches

Break times during the conference, offering refreshments or lunch.
Coffee breaks and lunch times are also important opportunities for attendees to engage in informal exchanges and socializing.

Tea breaks were at 10 AM and 3 PM, and the 8 AM breakfast was also in the format of a tea break. For the big shots, tea breaks are a good time for socializing, discussing research content, or exchanging contact information. But for someone like me, an academic locust, tea breaks are just tea breaks. Although the food at each tea break varied a bit, it generally consisted of cookies, bread, cakes, fruits, and drinks. Among them, the mousse cake and Thai fish cake with sweet chili sauce were my favorites.

The evening welcome banquet was also very sumptuous, served buffet-style. However, there were so many people in line that I didn’t pay attention to the dishes beforehand and ended up scooping two spoons of rice, missing out on many good dishes. The dishes included about 30% Chinese cuisine, with the rest being Japanese (sushi and sashimi), curry, vegetable salad, Western desserts, fruits, and drinks (red wine and juice), catering to different dietary preferences. However, the Chinese dishes were quite average, leading me to end up with carbs on carbs—curry rice and small cakes. Alas, I must say goodbye to the small cakes.

During the banquet, there was a mix-up. I attended the conference with another classmate, and neither of us had met our advisor in person. After selecting our dishes and randomly finding a table to sit at, we started listening to what others at the table were discussing. Then we had the following conversation: “Don’t you think the person across from us looks like our advisor?” “No way, such a coincidence?” “The research direction matches, they just said…” “The age doesn’t match…” “The number of people matches…” “The height matches too…” “It’s too similar…” “It’s exactly like…” “What should we do, should we introduce ourselves now?” “Let’s eat first.” So we nervously finished our meal, went up to introduce ourselves, and found out that this was not our advisor. The people at the table were PhD students from Tianjin University and East China Normal University, with similar research directions. After introducing ourselves, we became friends. Later, we found out their paper was nominated for Best Research Paper, and we were really envious.

Day 2: Lion Dance, Opening!

The most impressive part of the opening ceremony was the Guangzhou Shakeng Lion Dance performance. At the beginning, the dancers made the lion perform various lifelike movements imitating a lion’s habits, such as shaking its head, scratching, and rolling. The most distinctive feature was the lion’s movable eyes (this was my basis for judging it to be a Shakeng lion). Then the lion performed difficult jumps and pile-stepping on stage, showcasing the dancers’ precise skills and physical strength. Finally, the leaders went on stage for the eye-dotting ceremony, and the lion unfurled congratulatory couplets. Accompanied by the rhythm of gongs and drums, the lion was by turns mighty and agile, conveying both power and playfulness.

This was my first time seeing a real lion dance performance, and it was the most distinctive Guangdong lion dance. Although many games have scenes based on the Guangdong lion dance, especially the iconic pile-stepping action, these scenes often focus on rendering the performance setting, lacking the fun brought by the lion’s imitating actions, and the sense of power is not as strong as the live performance.

After the lion dance performance, the conference chair, Professor Chen Lei from HKUST, shared some general facts about the conference: it had the highest number of attendees in VLDB’s history, with 60% of them from mainland China; it was the richest conference (half a million dollars); AI-related papers were the most numerous; and papers on low-level database design were the fewest.

After the opening ceremony, a full day of parallel sessions officially began. Parallel sessions also require selecting themes of interest in advance, and it’s best to read the papers of interest beforehand and even prepare questions to ask. Dinner was as sumptuous as the previous day, with Italian vegetable soup and lasagna being very good! (The only) This day also had skewers, with two types: chicken wings and gluten. After dinner, I walked around and found that at any table with Chinese people, there was no one who didn’t take skewers, and some even had plain rice with skewers, indicating that everyone still had a Chinese palate. ~~Code farmer barbecue, great potential.~~

Day 3: From Cantonese Cuisine Pinnacle to Deck

The morning started with a keynote from Google, focusing on the impact of large models on database research. This was followed by parallel sessions similar to the previous day.

Exquisite Cantonese Cuisine

The evening banquet was held at Bingsheng Taste Haiyin Main Store. On the way, I met two students studying in the United States, who commented that when Chinese people organize conferences, they make sure to do it in style. This was no exaggeration, as I later learned that this restaurant is both Michelin-rated and known as the pinnacle of Cantonese cuisine. One of them, living on the West Coast, described life there as being like living in a village, with nothing to eat and nothing to do; the other, living on the East Coast, also mentioned the poor food, which at least left them free to keep up a different sport every day. However, when discussing the purpose of pursuing a graduate degree, one of them said it was to maintain a sense of insight and a lifelong learning mindset, which I couldn’t agree with more.

The dinner dishes were indeed very rich. Since it was inconvenient to use a phone at the table, I can only recall a few impressive dishes from memory: American ginseng pork rib soup, whole roasted suckling pig (but with two red light bulbs stuffed in its eyes, as an outsider I would call it a fel energy roasted suckling pig), garlic steamed Boston lobster, sea cucumber, fish, chicken and duck, sweet and sour pork, sweet water, tomatoes and honeydew melon, Sachima, and Pu’er tea. Overall, it was good, with the roast suckling pig being crispy and tender, and the sweet and sour pork being tangy and delicious, with ingredients selected having a strong Lingnan characteristic. However, some of the distinctive dishes in Cantonese cuisine were the hardest to eat: the sweet water was really sweet to the point of being cloying, the American ginseng in the soup was sour and astringent, and there was a dim sum that tasted like mooncake skin wrapped with raw celery pork dumpling filling, which was really hard to accept. Interestingly, the restaurant did not provide knives and forks by default, forcing some foreigners to use chopsticks, but they seemed to have no difficulty.

At our table, along with our advisor, there were researchers from Amazon and an assistant professor (AP) from an American university. The Amazon researcher, in front of my advisor, talked about some things from the industry’s perspective that were not suitable for students to hear. The AP talked about the relaxed atmosphere in American universities, where you can get tenure by working at your own pace. The pork rib soup might have been soured by the American ginseng at this point.

After dinner, all attendees walked to the pier together for the Pearl River night cruise. On the way, I met a guy in his first year of PhD at Nagoya University, who praised the PhD life in Japan, including the free-range supervision, high subsidies, seller’s job market, and cultural environment. Can pursuing a PhD be this enjoyable?

Pearl River Night Cruise

As night fell, the banks of the Pearl River lit up. We boarded the cruise ship, slowly sailing into the dazzling night. The high-rise buildings on both sides were brightly lit, reflecting on the shimmering river, like stars falling into the blue waves.

Along the way, we passed iconic buildings, with the Canton Tower towering majestically, its changing lights dreamlike. The skyscrapers of Zhujiang New Town shone brightly, outlining the skyline of a modern city. When the ship passed under the Haizhu Bridge, the traffic on the bridge was heavy, and the ships below were bustling, presenting a busy scene.

The breeze gently brushed my face, and the river water lightly tapped the ship’s side. Folk music came from below the deck, adding a touch of leisure to the night. Tourists leaned on the railing to gaze into the distance or chatted with drinks in hand, immersed in the charming night.

The two-hour journey passed in a flash. Looking back at the entire journey, the bustling city and long history perfectly blended on this mother river, leaving a lasting impression.

New Type of Socializing

While my classmate and I were enjoying the view at the bow of the third deck, a foreigner approached us. He was carrying a backpack, sweating profusely, and seemed a bit rushed. He asked if there was Wi-Fi on the ship and if we knew the password, as he needed to contact his family. We found out there wasn’t, so we offered to share our hotspot. After successfully connecting, we realized that foreign social media apps are mostly blocked in China, and this guy might have been looking for Wi-Fi in Guangzhou but kept getting blocked. From his perspective: being in a distant and mysterious East, unable to bypass the firewall with any Wi-Fi, losing contact with family for three days, and people unwilling to “share” their hotspots—it’s really pitiful. So we tried to share the magic cat’s ability over the local network, but this simple task on a computer couldn’t be accomplished on a phone.

However, we soon discovered that this guy was a social butterfly. He kept finding topics to talk about: “What does ICBC mean? What does ‘Liede’ mean? Where are you from? What’s that logo?”… and kept asking to take photos with us.

We chatted all the way, adding a different kind of fun to the night cruise.

Later on Whova, we found out that this guy ranked second in activity, worked at a German university, and had a company—this didn’t seem like someone who had lost contact with family and needed to borrow Wi-Fi everywhere.

Day 4: Freebies Collection

During VLDB2024, there were booths from sponsors like Google, Meta, Alibaba, and ByteDance. At these booths, you could fill out surveys and receive freebies. The surveys mainly investigated research directions and job preferences, and some directly collected business cards. Among all the freebies, I liked Google’s white shirt and the penguin in the picture below the most. ~~If you don’t know which company this penguin belongs to, once you see the genuine anti-counterfeiting sticker on it, you’ll know it’s Tencent’s.~~

The afternoon presentations gave me a lot of inspiration. Some work seemed simple, even just a naive idea, but after in-depth research, it could form a complete paper, which was impressive. One memorable moment was a lone wolf from Cornell University who presented two independently completed works in the same session. Holding a microphone, with one hand in his pocket and both arms propped on the podium, his PPT adopted a simple large-font style (no slide had more than 30 words), treating the presentation as a speech. This might be what it means to be a big shot, completely next level.

This day marked the end of all our conference schedules, and when leaving the hotel, we encountered extremely severe thunderstorm weather with torrential rain.

Day 5: Half a Day Through Five Hundred Years of China

With the “since we’re here” principle, I arranged a special forces-style intensive itinerary for my last day in Guangzhou. On this day, I visited several famous attractions in Guangzhou, experiencing the city’s rich historical and cultural heritage. In the early morning, I first arrived at the exotic Shamian Island. Strolling through this former concession area, where European-style buildings and tropical plants complement each other, it felt like traveling back in time. A standard for judging whether a place’s scenery is beautiful enough is whether people take wedding photos there, like the Pukou Railway Station, Green Expo Garden, and Sun Yat-sen Mausoleum in Nanjing, and Shamian Island in Guangzhou.

After spending the morning on Shamian Island, I walked to the nearby Thirteen Hongs Museum at noon to learn about Guangzhou’s glorious history as an important port on the Maritime Silk Road. The museum’s exhibits and pictures vividly depict Guangzhou’s prosperous trade from the 17th century to the semi-colonial scene of the 19th century. I learned that after Guangzhou was forced to open for trade, foreign trading houses were initially located along the river in the city center opposite the current Canton Tower, but later moved to Shamian Island due to a fire.

In the afternoon, I visited the Sacred Heart Cathedral, a large Gothic building and a must-visit spot recommended by Xiaohongshu. It turned out to be just a photo spot, with only the entrance accessible most of the time.

Later in the afternoon, I visited the Huangpu Military Academy site. Strolling through this place that cultivated countless revolutionary martyrs, I hoped to feel the passion of that era. However, because I hadn’t done my homework in advance, it was somewhat disappointing and different from what I had imagined. The original site of the Huangpu Military Academy was bombed to ashes by the Japanese during the Anti-Japanese War, and the current attractions are reconstructions. Looking at the neat and orderly restored buildings, I kept having a “Confucius Temple” déjà vu, feeling only that I was in a scenic spot. After searching for a long time, I couldn’t find the couplet at the main gate: “If you want to get promoted and get rich, please go elsewhere; if you are afraid of death, don’t enter this door.”

As night fell, I returned to the Pearl River, strolling along the riverside. The night view of Guangzhou, with its horizon and skyline, the Little Waist and Grand Theater, is never tiring. From the Haizhu Bridge to the Liede Bridge, the same Pearl River presents a different scene in each century from the 17th to the 21st century.

These days in Guangzhou felt like a journey through ancient and modern times: from the academic halls of Pazhou to the dazzling lights on both sides of the Pearl River, from the exotic charm of Shamian Island to the revolutionary relics of the Huangpu Military Academy. This city, in its unique way, integrates academia, culture, history, and modernity. In the shadows of the skyscrapers there is also the everyday bustle of the urban villages; under the busy bridges there are also the figures of the homeless, a reminder that this city is still working toward progress and inclusiveness. Far from diminishing Guangzhou’s charm, these scenes make the city feel more real and more human.

Goodbye, Guangzhou.

Data Exhaustion Crisis: The Data Bottleneck in AI Development and Countermeasures

blog@zair.top (Tim) — Wed, 21 Aug 2024 15:05:44 +0800

We estimate that the stock of public text generated by humans is about 300 trillion tokens. If trends continue, language models will completely exhaust this stock between 2026 and 2032, or even earlier if overtrained. ——Epoch AI

In 2006, Fei-Fei Li, then a professor at the University of Illinois (now a professor at Stanford University), saw the potential of the internet to change artificial intelligence (AI) research. Research in the field of linguistics had identified 80,000 “noun synonym sets,” which are collections of synonyms describing the same type of thing. Fei-Fei Li hypothesized that among the billions of images on the internet, there must be countless instances of these synonym sets. If enough of these images could be collected, a massive database could be created that surpasses any previous AI training resources. She said, “Many people focus on models, we should focus on data.” Thus, the ImageNet project was born.

The internet not only provided images but also resources for image annotation. After finding images of cats, dogs, chairs, etc., through search engines, people on Amazon’s crowdsourcing platform Mechanical Turk manually checked and annotated these images. Eventually, a database containing millions of verified images was constructed. It was the use of part of the ImageNet data to train AlexNet in 2012 that demonstrated the great potential of “deep learning,” kicking off the last AI cycle and spawning an industry dependent on large amounts of annotated data.

1 The Data-Driven AI Era

In this AI cycle, AI development has extended to large language models (LLM), which also rely on internet data for training, but in a different way. The classic training task in the field of computer vision (CV) is to predict the content of an image (image classification task), but the classic task for LLM training is to predict the missing words in a text based on context.

This training method does not require manually annotated data; the system can leave out words, infer, and evaluate the correctness of the answers through “self-supervised training.” However, this method requires a large amount of data. Generally speaking, the more text a model obtains and the larger the data volume, the better its performance (Scaling Law). The internet provides tens of billions of texts, which are as valuable to LLM as carbon deposited over billions of years is to modern industry—a precious resource that can be refined into fuel.
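As a rough illustration of why no manual labels are needed, the sketch below turns raw text into (context, target) training pairs by hiding a word and using the words before it as context. The `make_pairs` helper and the whitespace tokenization are assumptions made for illustration; real LLMs use subword tokenizers and far longer contexts.

```python
def make_pairs(text, context_size=3):
    """Build (context, target) pairs from raw text; the text labels itself."""
    tokens = text.split()  # naive whitespace tokenization (illustrative assumption)
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])  # the words the model can see
        target = tokens[i]                           # the hidden word it must predict
        pairs.append((context, target))
    return pairs


pairs = make_pairs("the cat sat on the mat", context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

Every sentence yields its own supervision signal this way, which is why the amount of available text, rather than the amount of annotation labor, becomes the limiting factor.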

A common source of training data is Common Crawl, an internet archive containing 50 billion web pages. As AI models develop, more data is added, such as Books3, a database containing thousands of books. However, as AI’s demand for text data grows, the supply of high-quality data on the internet is running short. According to Epoch AI’s estimates, high-quality text data on the internet will be fully used up by 2028, and the industry will hit the so-called “data wall.” The situation is more severe on the Chinese internet, from “the Chinese internet is collapsing” to major platforms locking their doors, as everyone realizes the value of data and puts it in a vault. How to overcome this barrier may be one of the most challenging issues in the future development of AI, and the one most likely to slow its progress.

2 Data Ownership and Copyright Issues

AI models increasingly rely on internet data, but the copyright issues of data are also full of controversy. Many of the data used to train large language models are often used without the consent of the copyright holders, and some AI companies have even utilized content behind paywalls. Although AI companies claim that such use falls under the “fair use” doctrine of copyright law, copyright holders do not agree. Getty Images sued image generation company Stability AI, accusing it of unauthorized use of its image library. The New York Times sued OpenAI and Microsoft, accusing them of infringing the copyrights of millions of articles. Stack Overflow, Reddit, and X (formerly Twitter) now charge AI companies fees. Zhihu is also interfering with crawlers like Bing and Google by using garbled text to restrict its Chinese content from being used as datasets for AI training.

Different regions have different attitudes toward this issue. Japan and Israel have taken a lenient stance to promote their AI industries. The EU does not have a general concept of “fair use” and may be stricter. In China, so far only a national data bureau has been established, clarifying that data has a dual identity as both production material and production object.

3 Existing Data Usage Strategies

Facing the data wall, the AI field has proposed several countermeasures. One key strategy is to focus on data quality rather than quantity. AI labs no longer blindly use the entire internet’s data to train models but instead focus more on data filtering, cleaning, and optimization to ensure that models can extract the most valuable content from it. In the past year (2024), OpenAI’s models no longer seem to be “far ahead,” and everyone’s models are performing comparably, with differences in performance on different tasks coming from the construction of training data. After all, there are many open-source algorithms and models, but very few open-source datasets.

Obtaining “real-world information” is crucial, especially when models involve a lot of reasoning, making authoritative resources like academic textbooks particularly valuable. However, finding the optimal balance between different data sources remains a mysterious art.

During data usage, models also face the problem of “catastrophic forgetting”—that is, when a system is overtrained on certain types of data, it may excel in that field while forgetting previously learned knowledge. Therefore, the order of data during training also needs careful consideration. If all data on a particular topic (such as mathematics) is concentrated at the end of the training process, the model may perform well on math problems but may weaken its abilities in other areas. This unbalanced training method exacerbates the risk of catastrophic forgetting.
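The ordering concern can be made concrete with a small sketch. It does not train anything and therefore does not demonstrate forgetting itself; it only compares two schedules by measuring how much of the final stretch of training is taken up by one domain. The domain names, dataset sizes, and helper functions are hypothetical.

```python
import random


def interleave(datasets, seed=0):
    """Shuffle examples from all domains into a single training order."""
    rng = random.Random(seed)
    mixed = [(domain, ex) for domain, examples in datasets.items() for ex in examples]
    rng.shuffle(mixed)
    return mixed


def tail_share(order, domain, fraction=0.25):
    """Share of the final `fraction` of training steps that come from `domain`."""
    tail = order[int(len(order) * (1 - fraction)):]
    return sum(1 for d, _ in tail if d == domain) / len(tail)


datasets = {
    "web": [f"web-{i}" for i in range(300)],
    "math": [f"math-{i}" for i in range(100)],
}

# Schedule A: domains concatenated, so all math data lands at the very end.
sequential = [(d, ex) for d, examples in datasets.items() for ex in examples]
# Schedule B: the same examples, mixed throughout training.
mixed = interleave(datasets)

print("math share of last 25% (sequential):", tail_share(sequential, "math"))
print("math share of last 25% (mixed):     ", tail_share(mixed, "math"))
```

In the sequential schedule the last quarter of training is entirely math, which is the situation the paragraph above warns about; in the mixed schedule the share stays near the overall proportion of math data.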

When data involves different fields and different forms (modalities), these strategies become more complex. As new text data becomes scarce, leading models like OpenAI’s GPT-4 and Google’s Gemini use not only text but also images, videos, and audio for training during self-supervised learning. However, video data is particularly tricky because video files contain extremely dense data points. To simplify the problem, existing models usually extract only a few frames for simplified processing, and academia is still searching for more efficient solutions.

4 Synthetic Data and AI Self-Training

Model capabilities can also be enhanced by fine-tuning the pre-trained model (the version produced by self-supervised learning) on additional data. For example, “supervised fine-tuning” (SFT) involves providing the model with question-answer pairs collected or created by humans to teach the model what constitutes a good answer. Another method, “reinforcement learning from human feedback” (RLHF), tells the model whether an answer satisfies the questioner.

In RLHF, users provide feedback on the quality of the model’s output, which is then used to adjust the model’s parameters (weights). User interactions with chatbots, such as likes or dislikes, are particularly useful for RLHF. This is the mechanism of the “data flywheel”: more users bring in more data, which in turn produces better models. AI companies closely monitor the questions users pose to their models, then collect data to adjust the models to cover these topics. Companies like Alibaba, ByteDance, and Minimax have launched price wars for models, and it’s hard to say this consideration plays no part.

As pre-training data on the internet gradually depletes, the importance of post-training is becoming increasingly prominent. Annotation companies like Scale AI and Surge AI earn hundreds of millions of dollars annually by collecting post-training data. Scale recently raised $1 billion at a valuation of $14 billion. Today’s annotation work has surpassed the era of Mechanical Turk: top annotators can earn up to $100 per hour. Although post-training helps generate better models and meets the needs of many commercial applications, it remains an incremental improvement, addressing symptoms but not the root cause.

In addition to gradually overcoming the data wall, another solution is to completely bypass it by using machine-generated synthetic data. DeepMind (a subsidiary of Google) launched the AlphaGo Zero model, which is a good example. The company’s first successful Go model was trained using data from millions of amateur games, while AlphaGo Zero did not use any existing data. Instead, it learned Go by playing 4.9 million games against itself over three days and recorded successful strategies. This “reinforcement learning” taught it how to respond to opponents’ moves by simulating a large number of possible responses and choosing the strategy with the highest probability of success.

Similar methods can also be used for LLMs, such as the currently strongest open-source large model, Llama 3.1. A significant proportion of the SFT data in Llama 3.1 consists of synthetic data generated by the model, while a large proportion of the data in the SFT phase of Gemma2 is synthesized by larger models, suggesting that the quality of synthetic data need not be inferior to that of human-annotated data.

Can we generate synthetic data without limit and pull ourselves up by our own bootstraps? I believe the answer is no. A study published in Nature last month found that “abusing” synthetic data in model training can lead to “irreversible defects.” When models are repeatedly fine-tuned on data synthesized by the model itself, it only takes a few rounds for the model to start producing nonsense, a phenomenon researchers call “model collapse.”
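A toy analogy may help build intuition for why training on one’s own output degrades a model. The sketch below is not the experiment from the Nature study; it assumes a one-dimensional Gaussian as the “model” and a hypothetical `simulate_collapse` helper. Each generation fits a Gaussian to a small sample drawn from the previous generation’s fit, so estimation error compounds and the spread of the distribution shrinks.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly refit a Gaussian to samples drawn from its own previous fit."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0   # generation 0 stands in for the real data distribution
    history = [std]
    for _ in range(generations):
        # "Synthetic data": samples generated by the current model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # "Training": the next model is fitted only to that synthetic data.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


history = simulate_collapse()
print(f"std at generation 0:   {history[0]:.4f}")
print(f"std at generation 200: {history[-1]:.2e}")
```

The fitted spread drifts toward zero: rare values in the tails stop being sampled, so later generations never see them. Language models are vastly more complex, but the loss of diversity has a similar flavor.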

The bigger issue is how to extend this method to vertical fields such as healthcare or education. In games, the definition of victory is clear, and it is easier to collect data on whether a move is beneficial. In other fields, this is much more complex. Data on “good” decisions is usually collected from experts, but this is both expensive and time-consuming, and the solutions are not comprehensive. How to determine whether an expert is correct is also a recursive problem.

5 Conclusion

Acquiring more data will be key to maintaining rapid AI progress. Whether it is specialized data obtained from expert sources or machine-generated synthetic data, AI’s progress depends on the continuous supply of data. As the most easily accessible data reserves gradually deplete, the AI industry has made many efforts to alleviate this issue:

Emphasizing data quality and data cleaning
Increasing the proportion of mathematical, logical, and code data, adjusting training order
Using synthetic data to supplement real data

But these measures seem unsustainable. New data sources or sustainable alternatives must be found, or new architectures designed at the algorithmic level that do not depend on ever more data, thus ushering in the next AI cycle.

Fine-tuning GPT-4o-mini to Generate Blog Articles

blog@zair.top (Tim) — Wed, 24 Jul 2024 15:05:44 +0800

The new model GPT-4o-mini, released on July 18, surpasses GPT-3.5 and approaches GPT-4 in performance, while costing only half as much as GPT-3.5. It is also the fastest in response time among the entire series of models. OpenAI officially opened the fine-tuning interface for GPT-4o-mini today, offering a daily free quota of 2M tokens until September 23, 2024.

~~It’s not that Llama 3.1 405B is unaffordable, but GPT-4o-mini offers better cost-effectiveness.~~

1 Suitable Scenarios for Fine-tuning

For general simple tasks, writing prompts is sufficient for the model to perform well. For more complex tasks, you can try using a Chain of Thought to break down the complex task into multiple steps and reason through them gradually. However, for tasks requiring high precision and consistent output, fine-tuning is necessary.

The table below compares the advantages, disadvantages, and application scenarios of these three methods.

Method	Advantages	Disadvantages	Application Scenarios
Fine-tuning	Provides high-quality results	Requires a lot of time and resources to prepare and train data	Requires stable, reliable, and high-quality output
	Suitable for complex tasks and customization in specific fields	Feedback loop is slow, training cost is high	Improves model performance in specific tasks or fields
	Saves tokens, reduces latency	Requires foundational knowledge of deep learning	When tasks require high precision or unique style, tone, format
Prompting	Fast iteration and testing	Depends on the quality of prompt design	Quick prototyping and testing of common tasks
	Suitable for initial exploration and general tasks	May not be accurate enough for complex tasks	When flexible adjustment of model output is needed
	No need for additional data preparation and training resources		Not suitable for tasks with many examples and complex logic
Chain of Thought	Provides step-by-step logic and reasoning	Increases the complexity and length of prompts	Handles tasks requiring reasoning and logical steps
	Improves performance on complex tasks	Increases token usage and latency	Involves multi-step problem-solving scenarios
	Easily combines multiple strategies and tools	May still not be enough for very complex tasks	When a clear logical process and step-by-step execution are needed

The No Free Lunch (NFL) theorem tells us that no method is suitable for all scenarios, and the same applies here; fine-tuning is not necessarily better than the other two methods. However, fine-tuning is clearly well suited to “hard-to-describe” tasks, such as matching a specific style and tone. Moreover, these three methods are not mutually exclusive; a fine-tuned model used with carefully designed prompts, or even combined with a Chain of Thought, might achieve better results.

For simply writing an article or paragraph, prompts are enough. However, for a blog article considering SEO, there are many details, such as the frequency of core keywords. These details may not be fully understood by a large model, and as a user, you may not be able to describe them well in prompts. Therefore, writing such a blog article can use fine-tuning.

2 Preparing Data

Data needs to be organized in JSONL format, with one JSON object per line. For example:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

In multi-turn dialogues you can also set a weight on each assistant message: a weight of 0 excludes that response from training, so the model does not learn from it, while a weight of 1 includes it.

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already.", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "William Shakespeare", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?", "weight": 1}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "384,400 kilometers", "weight": 0}, {"role": "user", "content": "Can you be more sarcastic?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters.", "weight": 1}]}

Of course, processing data is the most time-consuming part, and you can directly use the dataset I created. This dataset is used for fine-tuning large models, sourced from scraping over 3000 pages across 13 categories from the reads.alibaba.com website. The open-source content includes not only the processed data but also the raw data and crawler code.
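
Before uploading, the format described above can be sanity-checked with a few lines of standard-library Python. This is a minimal sketch, not the official validator: the function name `validate_chat_jsonl`, the set of allowed roles, and the individual checks are assumptions drawn only from the examples shown in this section.

```python
import json

# Assumption: only the roles that appear in the examples above are expected.
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_chat_jsonl(lines):
    """Return (line_number, problem) tuples; an empty list means nothing was flagged."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        if not all(isinstance(m, dict) for m in messages):
            problems.append((number, "every message must be a JSON object"))
            continue
        for message in messages:
            if message.get("role") not in ALLOWED_ROLES:
                problems.append((number, f"unexpected role: {message.get('role')!r}"))
            if not isinstance(message.get("content"), str):
                problems.append((number, "message content must be a string"))
            if message.get("weight", 1) not in (0, 1):
                problems.append((number, "weight must be 0 or 1"))
        if not any(m.get("role") == "assistant" for m in messages):
            problems.append((number, "no assistant message to learn from"))
    return problems
```

Iterating over an open text file yields its lines, so a file opened with `open("all_filter_2120.jsonl", encoding="utf-8")` can be passed to the function directly.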

Upload the prepared data and record the returned file ID.

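from openai import OpenAI
client = OpenAI()

# Upload the training file; the context manager closes it afterwards
with open("all_filter_2120.jsonl", "rb") as f:
    uploaded = client.files.create(file=f, purpose="fine-tune")

# Record this ID; it is passed as training_file when creating the job
print(uploaded.id)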

3 Fine-tuning the Model

Once the data is prepared, verified, and the token cost is confirmed, you can create a fine-tuning task.

from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
  training_file="file-zWptPbsD37ZnemssjpsK6CnF", 
  model="gpt-4o-mini"
)

More detailed parameter configurations for this step can be found in the official API documentation.

These two steps can also be completed quickly in the web UI. After submitting the task, you can view the progress and loss curve there in real time.

4 Invoking the Model

Use the following code to query the status of the fine-tuning task. Once the job is successful, you will see the fine_tuned_model field filled with the model’s name. Note this name for invocation.

from openai import OpenAI
client = OpenAI()

# The calls below are independent; run only the one you need.

# List recent fine-tuning jobs
client.fine_tuning.jobs.list(limit=10)

# Retrieve one job's details (status, fine_tuned_model)
client.fine_tuning.jobs.retrieve("ftjob-gvP0VB7RlWcF3QHdQrEVf49Y")

# Cancel a job that is still in progress
client.fine_tuning.jobs.cancel("ftjob-gvP0VB7RlWcF3QHdQrEVf49Y")

# List the job's events (training log)
client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-gvP0VB7RlWcF3QHdQrEVf49Y", limit=10)

# Delete a fine-tuned model
client.models.delete("ft:gpt-3.5-turbo:acemeco:suffix:abc123")
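
Since a job takes a while to finish, the status query is usually repeated until the job stops. The helper below is a rough sketch of such a polling loop; `wait_for_job` and its parameters are hypothetical names, and `retrieve` stands for any callable that returns an object with `status` and `fine_tuned_model` attributes, for example `client.fine_tuning.jobs.retrieve`.

```python
import time

# Statuses after which a fine-tuning job no longer changes.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve, job_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call retrieve(job_id) repeatedly until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)  # wait before asking again
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Passing the retrieval function as an argument keeps the loop testable without network access; with the real client, the returned job's `fine_tuned_model` is the name to use for invocation.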

The invocation method is the same as the official model; you only need to change the model name, for example:

from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
  model="ft:gpt-4o-mini-2024-07-18:personal:0724:9oMH6S7A",
  messages=[
    {"role": "system", "content": "Please write an SEO article of no less than 800 words based on the title I gave you, including at least 4 subtitles by HTML format. Do not include the  ,  tag.  Do not include the  tag in the start and end of the content. Directly start with the content."},
    {"role": "user", "content": f"title:{task.title},core keyword:{task.coreKeywords},related keyword:{task.relatedKeywords}"}
  ]
)
print(completion.choices[0].message)

5 Evaluating Results

During training, there are two metrics available for reference: loss value and token accuracy. The official explanation is as follows:

Validation loss and validation token accuracy are calculated in two different ways—on a small batch of data during each step and on the entire validation set at the end of each epoch. The entire validation loss and entire validation token accuracy metrics are the most accurate indicators of tracking the overall performance of the model. These statistics are intended to provide a sanity check to ensure that training is proceeding smoothly (loss should decrease, token accuracy should increase).

However, metrics are only a reference; you still need to evaluate the actual output yourself. The fine-tuned model shows at least the following improvements:

Article length increased by 20%
Article structure is closer to the training data
No more formatting errors (such as markdown format, adding CSS, etc.)
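
A claim such as the length increase above can be checked with a very small script. The sketch below is only an illustration of one way to measure it: the function names are hypothetical, and it assumes you have collected articles from the base model and the fine-tuned model for the same titles.

```python
from statistics import mean

def count_words(text):
    """Whitespace-based word count; crude, but enough for a relative comparison."""
    return len(text.split())

def relative_length_change(baseline_articles, tuned_articles):
    """Fractional change in mean word count, e.g. 0.2 for a 20% increase."""
    base = mean(count_words(a) for a in baseline_articles)
    tuned = mean(count_words(a) for a in tuned_articles)
    return (tuned - base) / base
```

Structure and formatting are harder to score automatically and are better judged by reading a sample of outputs.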

An article generated with the title “What is the Difference Between a Mural and a Mosaic?” is as follows:

6 Reference Articles

https://platform.openai.com/docs/guides/fine-tuning

Four Design Patterns of AI Agents: A Pathway to Artificial General Intelligence

blog@zair.top (Tim) — Wed, 22 May 2024 18:00:22 +0800

If using AI to complete tasks is likened to writing an essay, then the non-agent approach is to have AI write it from start to finish without any modifications, while the agent approach allows AI to make multiple modifications and use tools and collaborate with the outside world during the process. In 2024, agents are seen as one of the pathways to achieving Artificial General Intelligence (AGI).

Foundation models have driven the development of generative AI, enabling AI agents to automatically assist users in completing tasks. Over the past year, various agents have emerged. Inspired by Andrew Ng’s talk at Sequoia Capital’s AI Ascent event, this article draws on papers published in the past year and engineering blogs such as LangChain’s to organize existing agents and summarize design paradigms, hoping to aid in designing agents based on foundation models.

The following table preliminarily organizes 16 patterns:

Pattern	Description
Passive goal creator	Analyzes explicit cues from users through a conversational interface to maintain interactivity, goal tracking, and intuitiveness.
Proactive goal creator	Anticipates user goals by understanding human interactions and capturing context to enhance interactivity, goal tracking, and accessibility.
Prompt/response optimiser	Optimizes prompts/responses based on expected input or output content and format to provide standardization, response accuracy, interoperability, and adaptability.
Retrieval augmented generation	Enhances the knowledge update capability of agents while maintaining data privacy in local foundational model agent systems.
One-shot model querying	Accesses the foundational model in a single instance to generate all steps needed for planning, improving cost efficiency and simplifying processes.
Incremental model querying	Accesses the foundational model at each step of the plan generation process to provide supplementary context, improve response accuracy, and interpretability.
Single-path plan generator	Coordinates the generation of intermediate steps to achieve user goals, improving reasoning certainty, coherence, and efficiency.
Multi-path plan generator	Allows multiple options to be created at each step of achieving user goals, enhancing reasoning certainty, coherence, alignment with human preferences, and inclusivity.
Self-reflection	Enables agents to generate feedback on the planning and reasoning process and provide self-improvement guidance to enhance reasoning certainty, interpretability, continuous improvement, and efficiency.
Cross-reflection	Uses different agents or foundational models to provide feedback and improve the generated planning and reasoning process, enhancing reasoning certainty, interpretability, interoperability, inclusivity, scalability, and continuous improvement.
Human reflection	Collects human feedback to improve planning and reasoning processes, effectively aligning with human preferences, enhancing contestability, effectiveness, fairness, and continuous improvement.
Voting-based cooperation	Allows agents to freely express opinions and reach consensus through voting, improving diversity, effective division of labor, and fault tolerance.
Role-based cooperation	Assigns different roles and finalizes decisions based on the roles of agents, improving decision certainty, division of labor, fault tolerance, scalability, and accountability.
Debate-based cooperation	Agents provide and receive feedback through debate, adjusting their ideas and actions until consensus is reached, improving decision certainty, adaptability, interpretability, response accuracy, and critical thinking.
Multimodal guardrails	Controls the input and output of foundational models to meet specific requirements, such as user demands, ethical standards, and legal regulations, enhancing robustness, safety, standard alignment, and adaptability.
Tool/agent registry	Maintains a unified and convenient source for selecting different agents and tools, improving discoverability, efficiency, and tool applicability.

These 16 patterns can all be attributed to the four paradigms proposed by Andrew Ng in his talk at Sequoia Capital’s AI Ascent event, namely:

Reflection
Tool Use
Planning
Multiagent Collaboration

1 Reflection

1.1 Basic Reflection

In the context of LLM Agent construction, reflection refers to the process of prompting LLMs to observe their past steps (and potential observations from tools/environments) to assess the quality of selected actions. This feedback is then used for downstream tasks such as replanning, searching, or evaluation. The figure below shows a basic reflection pattern.

Basic Reflection
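
The basic pattern can be sketched as a generate-critique-revise loop. This is a minimal illustration under stated assumptions: `reflect_loop`, `generate`, and `critique` are hypothetical names, and in a real agent both callables would wrap LLM calls rather than the toy functions used here.

```python
def reflect_loop(generate, critique, task, max_rounds=3):
    """Generate a draft, then alternate critique and revision.

    generate(task, feedback) -> draft
    critique(task, draft) -> feedback string, or None when the draft is acceptable
    """
    draft = generate(task, None)
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if feedback is None:
            break  # the critic is satisfied
        draft = generate(task, feedback)
    return draft

# Toy stand-ins for the two LLM calls.
def toy_generate(task, feedback):
    return task.upper() if feedback else task

def toy_critique(task, draft):
    return None if draft.isupper() else "use upper case"
```

The `max_rounds` cap matters in practice, because a critic that is never satisfied would otherwise loop forever.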

1.2 Reflexion Actor

Proposed by Shinn et al., Reflexion is an architecture that learns through language feedback and self-reflection. This agent reviews its task results to generate higher quality final outcomes, but at the cost of longer execution time. It mainly consists of three components:

1. Actor (agent) with self-reflection
2. External evaluator (task-specific, e.g., code compilation steps)
3. Episodic memory that stores the reflections from (1).

Reflexion Actor
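
How the three components interact can be sketched roughly as follows. This is a simplified illustration, not the implementation from Shinn et al.: the function names and signatures are assumptions, and the actor, evaluator, and reflection step would each be backed by an LLM or a task-specific checker.

```python
def reflexion(actor, evaluator, reflect, task, max_trials=3):
    """Retry a task, carrying verbal reflections forward between trials."""
    memory = []  # episodic memory of past reflections
    attempt = None
    for _ in range(max_trials):
        attempt = actor(task, memory)  # the actor sees earlier reflections
        passed, feedback = evaluator(task, attempt)  # external, task-specific check
        if passed:
            break
        memory.append(reflect(task, attempt, feedback))
    return attempt, memory

# Toy example: the "task" is a target number, and each failed trial nudges the guess up.
def toy_actor(task, memory):
    return len(memory)

def toy_evaluator(task, attempt):
    return attempt == task, "too low"

def toy_reflect(task, attempt, feedback):
    return f"{attempt} was {feedback}"
```

The extra trials are where the longer execution time mentioned above comes from.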

1.3 LATS

Language Agent Tree Search (LATS), proposed by Zhou et al., is a general LLM agent search algorithm that combines reflection/evaluation with search (specifically Monte Carlo Tree Search), achieving better overall task performance than similar techniques such as ReAct, Reflexion, or Tree of Thoughts.

It has four main steps:

1. Select: pick the best next actions based on the aggregate rewards from step (2). Either respond (if a solution is found or the max search depth is reached) or continue searching.
2. Expand and simulate: select the “best” 5 potential actions to take and execute them in parallel.
3. Reflect + Evaluate: observe the outcomes of these actions and score the decisions based on reflection (and possibly external feedback).
4. Backpropagate: update the scores of the root trajectories based on the outcomes.

LATS
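
The four steps map onto a tree-search skeleton like the one below. It is a heavily simplified sketch, not LATS itself: the LLM calls for expansion, reflection, and scoring are replaced by plain functions (`expand`, `evaluate`, `is_solution` are hypothetical), the candidates are evaluated sequentially rather than in parallel, and selection uses a standard UCT rule.

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # running mean of rewards seen below this node

    def uct(self, c=1.0):
        if self.visits == 0:
            return float("inf")
        return self.value + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def select(root):
    """Walk down the tree, always taking the child with the highest UCT score."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: child.uct())
    return node

def backpropagate(node, reward):
    """Update visit counts and mean values from the node up to the root."""
    while node is not None:
        node.visits += 1
        node.value += (reward - node.value) / node.visits
        node = node.parent

def tree_search(root_state, expand, evaluate, is_solution, iterations=10):
    root = Node(root_state)
    for _ in range(iterations):
        leaf = select(root)
        candidates = expand(leaf.state)
        if not candidates:
            backpropagate(leaf, 0.0)  # dead end: make it less attractive
            continue
        for state in candidates:
            child = Node(state, parent=leaf)
            leaf.children.append(child)
            backpropagate(child, evaluate(state))
            if is_solution(state):
                return state
    best = max(root.children, key=lambda n: n.value, default=root)
    return best.state

# Toy problem: reach 5 from 0 using +1 / +2 moves.
def toy_expand(s):
    return [s + 1, s + 2] if s < 5 else []

def toy_evaluate(s):
    return 1 - abs(5 - s) / 5
```

With the toy functions, `tree_search(0, toy_expand, toy_evaluate, lambda s: s == 5)` explores a few branches and stops as soon as a child state equals 5.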

2 Tool Use

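The agent invokes external tools, which are exposed to the model in the form of callable functions: the model outputs the name of a function and its arguments, and the surrounding program executes the call and returns the result.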
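
A minimal sketch of this dispatch step is shown below. The request shape (`name` plus `arguments`) and the two example tools are assumptions made for illustration; real function-calling APIs define their own schemas.

```python
import json

def add(a, b):
    return a + b

def count_words(text):
    return len(text.split())

# Registry mapping the names the model may emit to real Python functions.
TOOLS = {"add": add, "count_words": count_words}

def call_tool(request_json):
    """Execute a model-produced request such as {"name": "add", "arguments": {...}}."""
    request = json.loads(request_json)
    name = request["name"]
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return {"result": TOOLS[name](**request["arguments"])}
```

Returning an error object instead of raising lets the agent feed the failure back to the model so it can pick a different tool.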

3 Planning

3.1 ReAct

ReAct enhances agents by interleaving reasoning and acting. At each step the agent writes out a reasoning trace (a thought), takes an action such as calling a tool, and then observes the result before deciding on the next step, rather than committing to a full plan up front. Reasoning thus guides which action to take, while observations from the environment feed back into the reasoning. The advantage of this approach lies in its flexibility and adaptability to the environment.

ReAct Framework
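
The thought-action-observation cycle can be sketched as a short loop. This is an illustrative skeleton only: `policy` stands in for the LLM, and its return shape, the `"finish"` action, and the toy lookup tool are all assumptions.

```python
def react(policy, tools, question, max_steps=5):
    """Alternate thought -> action -> observation until the policy finishes.

    policy(question, trace) returns (thought, action, argument); the action
    "finish" ends the loop with `argument` as the answer.
    """
    trace = []
    for _ in range(max_steps):
        thought, action, argument = policy(question, trace)
        if action == "finish":
            return argument, trace
        observation = tools[action](argument)
        trace.append((thought, action, argument, observation))
    return None, trace  # step budget exhausted without an answer

# Toy stand-ins for the LLM and a lookup tool.
def scripted_policy(question, trace):
    if not trace:
        return ("I should look this up.", "lookup", question)
    return ("The observation answers the question.", "finish", trace[-1][3])

TOY_TOOLS = {"lookup": lambda key: {"capital of France": "Paris"}.get(key, "unknown")}
```

Because the whole trace is passed back to the policy at every step, the prompt grows with each iteration.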

3.2 Plan and Execute

The core idea of Plan and Execute is to first develop a multi-step plan and then execute the plan item by item. After completing specific tasks, the plan can be revisited and appropriately modified.

Compared to typical ReAct-style agents, which think one step at a time, this “plan and execute” style has two advantages:

Explicit long-term planning (which even very powerful LLMs can struggle with)
The ability to use smaller/weaker models during execution steps, using larger/better models only during planning steps

Plan-and-Execute
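
In code, the idea reduces to a planner, an executor, and a replanning step. The sketch below uses hypothetical callables for all three; in practice the planner and replanner would call a stronger model and the executor could use a smaller one.

```python
def plan_and_execute(planner, executor, replanner, goal, max_steps=10):
    """Plan once, then execute step by step, revisiting the plan after each step."""
    plan = list(planner(goal))
    done = []  # (step, result) pairs completed so far
    while plan and len(done) < max_steps:
        step = plan.pop(0)
        done.append((step, executor(step)))
        plan = list(replanner(goal, plan, done))  # may keep, reorder, or extend the plan
    return done

# Toy stand-ins: a fixed plan, an executor that "does" a step, and a no-op replanner.
def toy_planner(goal):
    return ["outline", "draft", "edit"]

def toy_replanner(goal, remaining, done):
    return remaining
```

With `str.upper` as a stand-in executor, `plan_and_execute(toy_planner, str.upper, toy_replanner, "write a post")` walks through the three steps in order.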

3.3 ReWOO

In ReWOO, Xu et al. propose an agent that combines multi-step planning and variable substitution to achieve efficient tool use. It improves on the ReAct-style agent architecture in the following ways:

Reduces token consumption and execution time by generating the complete chain of tool calls in a single pass. (A ReAct-style agent requires many LLM calls with redundant prefixes, because the system prompt and previous steps are provided to the LLM at each reasoning step.)
Simplifies fine-tuning. Since planning data does not depend on tool outputs, the model can theoretically be fine-tuned without actually calling the tools.

Reasoning without Observation
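
The variable-substitution idea can be illustrated with a tiny plan executor. This is a sketch under assumptions: the plan format (tuples of variable, tool name, and argument) and the `#E1`-style placeholders are chosen for the example, and the plan itself would normally come from a single LLM call.

```python
def run_rewoo_plan(plan, tools):
    """Execute (variable, tool_name, argument) steps, substituting earlier results."""
    evidence = {}
    for variable, tool_name, argument in plan:
        # Replace longer names first so "#E1" never clobbers part of "#E10".
        for name in sorted(evidence, key=len, reverse=True):
            argument = argument.replace(name, str(evidence[name]))
        evidence[variable] = tools[tool_name](argument)
    return evidence

# Toy tools and a two-step plan where step 2 depends on the result of step 1.
TOY_TOOLS = {
    "search": lambda q: {"capital of France": "Paris"}.get(q, "unknown"),
    "length": lambda s: len(s),
}
TOY_PLAN = [
    ("#E1", "search", "capital of France"),
    ("#E2", "length", "#E1"),
]
```

Since the plan is fixed before any tool runs, the planning data does not depend on tool outputs, which is what makes the fine-tuning simplification above possible.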

3.4 LLMCompiler

LLMCompiler is an agent architecture that speeds up agent task execution by eagerly executing tasks within a DAG. It also saves the cost of redundant token usage by reducing the number of LLM calls. It mainly consists of three parts:

Planner: streams a DAG of tasks.
Task Fetching Unit: schedules and executes the tasks as soon as they are executable.
Joiner: Responds to the user or triggers a second plan.

LLMCompiler
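
The role of the Task Fetching Unit can be approximated with the standard library: run every task whose dependencies are satisfied, in parallel, and release dependents as results arrive. This is a sketch of eager DAG execution in general, not the LLMCompiler implementation; `run_dag` and its argument shapes are assumptions, and `graphlib` requires Python 3.9 or later.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

def run_dag(tasks, dependencies, max_workers=4):
    """Run each task as soon as all of its dependencies have finished.

    tasks: name -> callable taking a dict of its dependencies' results
    dependencies: name -> set of task names it depends on
    """
    sorter = TopologicalSorter({name: dependencies.get(name, set()) for name in tasks})
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    results, running = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            for name in sorter.get_ready():  # tasks whose inputs are all available
                inputs = {dep: results[dep] for dep in dependencies.get(name, ())}
                running[pool.submit(tasks[name], inputs)] = name
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in finished:
                name = running.pop(future)
                results[name] = future.result()
                sorter.done(name)  # may unlock dependent tasks
    return results

# Toy DAG: "a" and "b" are independent, "c" needs both.
TOY_TASKS = {
    "a": lambda inputs: 2,
    "b": lambda inputs: 3,
    "c": lambda inputs: inputs["a"] + inputs["b"],
}
TOY_DEPENDENCIES = {"c": {"a", "b"}}
```

Here `a` and `b` start immediately and `c` runs once both have returned.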

4 Multiagent Collaboration

4.1 Supervision

A supervisor agent manages and schedules multiple agents, deciding which one should act next.

Supervision Method
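
A supervisor is essentially a routing loop. The sketch below is a bare-bones illustration with hypothetical names: `route` stands in for the supervisor LLM, and each worker is a plain function rather than a full agent.

```python
def supervise(route, workers, task, max_turns=5):
    """Let a supervisor pick the next worker until it answers "FINISH"."""
    history = []  # (worker_name, output) pairs
    for _ in range(max_turns):
        choice = route(task, history)
        if choice == "FINISH":
            break
        history.append((choice, workers[choice](task, history)))
    return history

# Toy supervisor: research first, then write, then stop.
def toy_route(task, history):
    order = ["researcher", "writer"]
    return order[len(history)] if len(history) < len(order) else "FINISH"

TOY_WORKERS = {
    "researcher": lambda task, history: f"notes on {task}",
    "writer": lambda task, history: f"article from {history[-1][1]}",
}
```

Every worker receives the shared history, so later workers can build on earlier outputs.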

4.2 Hierarchical Teams

This approach completes complex, large-scale tasks by organizing agents in a hierarchical, tiered manner. AutoGen is a typical representative of this approach.

Hierarchical Team Method

4.3 Collaboration

A single agent’s ability to use multiple (domain) tools is limited, requiring multiple agents to collaborate using more types of tools. The “divide and conquer” approach can be used, allowing each agent to become an “expert” focused on handling a specific type of problem, and then having them collaborate.

A Basic Multi-Agent Collaboration

5 Evaluation

One of the most straightforward ideas is to use an agent as a “virtual user” for evaluation, as many task results cannot be quantitatively evaluated. However, for tasks with clear metrics (classification, regression), a tool can be directly utilized for evaluation.

Agent-based Evaluation

6 Other Ways to Achieve AGI

Agents are just one promising way to achieve AGI, but not the only method. The agent approach can be organically combined with methods like RAG and user involvement. For example, Shi et al. combined agents and retrieval to solve Olympiad programming problems with large models.

7 Recommended Reading

https://github.com/AGI-Edgerunners/LLM-Agents-Papers

https://github.com/zjunlp/LLMAgentPapers

AI agent task decomposition and scheduling classic articles - bonelee - Blog Park (cnblogs.com)

Four Agent Paradigms | CRITIC: Andrew Ng’s Recommended Agent Design Paradigms - Zhihu (zhihu.com)

8 References

Kim, Sehoon, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. “An LLM Compiler for Parallel Function Calling.” arXiv, February 6, 2024. https://doi.org/10.48550/arXiv.2312.04511.
Liu, Yue, Sin Kit Lo, Qinghua Lu, Liming Zhu, Dehai Zhao, Xiwei Xu, Stefan Harrer, and Jon Whittle. “Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model Based Agents.” arXiv, May 16, 2024. https://doi.org/10.48550/arXiv.2405.10467.
Shi, Quan, Michael Tang, Karthik Narasimhan, and Shunyu Yao. “Can Language Models Solve Olympiad Programming?” arXiv, April 16, 2024. https://doi.org/10.48550/arXiv.2404.10952.
Shinn, Noah, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. “Reflexion: Language Agents with Verbal Reinforcement Learning.” arXiv, October 10, 2023. https://doi.org/10.48550/arXiv.2303.11366.
Wang, Lei, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. “Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models.” arXiv, May 26, 2023. https://doi.org/10.48550/arXiv.2305.04091.
Xu, Binfeng, Zhiyuan Peng, Bowen Lei, Subhabrata Mukherjee, Yuchen Liu, and Dongkuan Xu. “ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models.” arXiv, May 22, 2023. https://doi.org/10.48550/arXiv.2305.18323.
Yao, Shunyu, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. “ReAct: Synergizing Reasoning and Acting in Language Models.” arXiv, March 9, 2023. https://doi.org/10.48550/arXiv.2210.03629.
“Yoheinakajima/Babyagi.” Accessed May 21, 2024. https://github.com/yoheinakajima/babyagi/tree/main.
“LangGraph tutorials.” Accessed May 21, 2024. https://langchain-ai.github.io/langgraph/tutorials/
Zhou, Andy, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. “Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models.” arXiv, December 5, 2023. https://doi.org/10.48550/arXiv.2310.04406.
Zhou, Pei, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra, and Huaixiu Steven Zheng. “Self-Discover: Large Language Models Self-Compose Reasoning Structures.” arXiv, February 5, 2024. https://doi.org/10.48550/arXiv.2402.03620.

Vector Database Comparison: Weaviate, Milvus, and Qdrant

blog@zair.top (Tim) — Tue, 02 Apr 2024 15:44:22 +0800

The success of RAG systems largely depends on their ability to efficiently acquire and process massive amounts of information. Vector databases play an irreplaceable role in this and form the core of RAG systems. Vector databases are specifically designed to store and manage high-dimensional vector data: text, images, audio, and even video are converted into vectors by embedding models and then stored (this will be elaborated on later in the article). The ultimate effect that RAG systems can achieve depends on the performance of these underlying vector databases.

Among the many vector databases and vector libraries, each has its own characteristics, and choosing one suitable for your application scenario requires evaluation. This article will delve into the key factors to consider when choosing a vector database for RAG, including open-source availability, CRUD (Create, Read, Update, Delete) support, distributed architecture, replica support, scalability, performance, and continuous maintenance.

Currently, databases specifically designed for vectors like Weaviate, Milvus, Qdrant, Vespa, and Pinecone are highly regarded in the industry. In addition, some earlier vector libraries also have this functionality. This article will also compare various vector libraries, such as FAISS, HNSWLib, ANNOY, and SQL databases supporting vector functions, such as pgvector and Supabase.

Image Semantic Search Implemented with Milvus

1 Vector Libraries (FAISS, HNSWLib, ANNOY)

The difference between vector databases and vector libraries is that vector libraries are mainly used for storing static data, where indexed data is immutable. This is because vector libraries only store vector embeddings and do not store the associated objects that generate these vector embeddings. Therefore, unlike vector databases, vector libraries do not support CRUD (Create, Read, Update, Delete) operations. This means adding new documents to existing indexes in vector libraries like FAISS or ANNOY can be difficult. HNSWLib is an exception, as it has CRUD functionality and uniquely supports concurrent read and write operations. However, it also suffers from the limitations of being a vector library, such as not providing deployment ecosystems, replication capabilities, and fault tolerance.

2 Full-Text Search Databases (ElasticSearch, OpenSearch)

Full-text search databases (e.g., ElasticSearch and OpenSearch) can support comprehensive text retrieval and advanced analysis functions. However, when it comes to performing vector similarity searches and handling high-dimensional data, they are not as strong as specialized vector databases. These databases often need to be used in conjunction with other tools to achieve semantic search, as they mainly rely on inverted indexes rather than vector indexes. According to Qdrant’s test results, Elasticsearch lags in performance compared to vector databases like Weaviate, Milvus, and Qdrant.

3 SQL Databases Supporting Vectors (pgvector, Supabase, StarRocks)

SQL databases like pgvector provide a way to integrate vector data into existing data storage systems through their vector support extensions, but they also have some obvious drawbacks compared to dedicated vector databases.

The most obvious drawback is the mismatch between the relational model of traditional SQL databases and the nature of unstructured vector data. This mismatch leads to inefficient operations involving vector similarity searches, and these databases do not perform well in building indexes and handling large amounts of vector data, as detailed in the ANN benchmarks. Additionally, the upper limit of vector dimensions supported by pgvector (2000 dimensions) is lower compared to dedicated vector databases like Weaviate, which can handle up to 65535-dimensional vector data. In terms of scalability and efficiency, dedicated vector databases also have more advantages. SQL database extensions supporting vectors, such as pgvector, are more suitable for scenarios where the amount of vector data is small (fewer than 100,000 vectors) and vector data is only a supplementary function of the application. Conversely, if vector data is the core of the application or if there are high requirements for scalability, dedicated vector databases would be a more suitable choice.

As for StarRocks, it is another system running on the SQL framework, optimized for online analytical processing (OLAP) and online transaction processing (OLTP) scenarios, but not specifically optimized for vector similarity searches.

4 NoSQL Databases Supporting Vectors (Redis, MongoDB)

The newly added vector support features in NoSQL databases are still in the early stages and have not been fully tested and verified. Taking Redis Vector Similarity Search (VSS) as an example, this feature was only released in April 2022, less than two years ago. Although Redis VSS can serve as a multifunctional database, it is not optimized and designed specifically for vector similarity search.

5 Specialized Vector Databases (Pinecone, Milvus, Weaviate, Qdrant, Vald, Chroma, Vespa, Vearch)

Specialized vector databases inherently support various vector operations, such as dot product, cosine similarity, etc. These databases are designed to handle high-dimensional data, capable of handling a large number of query requests, and can quickly complete similarity searches between vectors. To achieve these goals, they employ various indexing strategies, usually based on approximate nearest neighbor (ANN) algorithms. These algorithms need to balance efficiency, storage space usage, and search accuracy. For example, the FLAT index is a vector index that does not use any optimization or approximation techniques, meaning it can achieve 100% recall and accuracy, but it is slower and less efficient than other types of vector indexes; relatively speaking, the IVF_FLAT index sacrifices some accuracy for faster search speed; the HNSW index provides a compromise between accuracy and search speed.
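
The trade-off between FLAT and IVF_FLAT can be seen in a few lines of pure Python. This is a toy sketch, not how any of these databases implement their indexes: the centroids are supplied by hand instead of being trained with k-means, and `nprobe` (the number of clusters scanned) is the only tuning knob.

```python
import math

def flat_search(vectors, query, k):
    """Exact search: compare the query with every stored vector."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], query))
    return order[:k]

def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid (one inverted list per centroid)."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(centroids[c], v))
        lists[nearest].append(i)
    return lists

def ivf_search(vectors, centroids, lists, query, k, nprobe=1):
    """Approximate search: scan only the nprobe lists closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: math.dist(centroids[c], query))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: math.dist(vectors[i], query))[:k]

VECTORS = [(0, 0), (1, 0), (4.9, 0), (10, 0), (11, 0)]
CENTROIDS = [(0.5, 0), (8, 0)]
LISTS = build_ivf(VECTORS, CENTROIDS)
```

For the query `(3, 0)`, the exact search returns vector 2 as the nearest neighbour, while the approximate search with `nprobe=1` scans only the first cluster and misses it; raising `nprobe` to 2 recovers the exact result at the cost of scanning more vectors.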

Pinecone is a closed-source vector database maintained by a professional team, with limited scalability features in its free version. Chroma is an open-source embedding database geared toward LLM applications. Compared to other mainstream vector databases, comprehensive performance benchmark data for Chroma is relatively scarce. Since Chroma uses SQLite for document storage as of version 0.4, it may not be as scalable and efficient as storage solutions designed specifically for vector data.

Vearch and Vald have shortcomings in their integration with LangChain, which makes development less convenient. Compared to competitors like Milvus, their developer communities are smaller and their open-source maintenance is less active.

Therefore, for RAG, Weaviate, Milvus, Qdrant, and Vespa may be the best choices. In theory, the most suitable system should be selected based on performance and scalability benchmarks (see ANN Benchmarks below). However, there are also some system design and feature characteristics that need to be compared. The table below provides a visual comparison from these aspects.

Database	Qdrant	Weaviate	Milvus
Open-source and self-hostable	✅	✅	✅
Open-source license	Apache-2.0	BSD	Apache-2.0
Development language	Rust	Go	Go, C++
Github Stars	17k	9.2k	26.2k
First release date	2021	2019	2019
SDK	Python, JS, Go, Java, .Net, Rust	Python, JS, Java, Go	Python, Java, JS, Go
Hosted cloud service	✅	✅	✅
Built-in text embedding	✅FastEmbed	✅	❌
Hybrid retrieval	❌	✅RRF+RSF	✅In-table multi-vector hybrid
Metadata filtering	✅	✅	✅
BM25 support	❌	✅	✅
Text search	✅	✅	❌
Single-point multi-vector	✅	✅
Tensor search	❌	❌	❌
Langchain integration	✅	✅	✅
Llama index integration	✅	✅	✅
Geo-geographic information search	✅	✅	❌
Multi-tenant support	✅ via collections/metadata	✅
Metadata and document size limit	Unlimited
Maximum dimension	Unlimited	65535	32768
Index types	HNSW	HNSW	ANNOY, FAISS, HNSW, ScANN …
Streaming index	❌
Sparse vector support	❌	❌	❌
Temporary index support (excluding server)	✅	❌
Sharding
Price
Facets (aggregation with counts)	❌	✅
Built-in image embedding		✅
Recommendation API	✅
Personalization
User events
Call built-in LLM for RAG		✅Generative Search

Database	Qdrant	Weaviate	Milvus
Subjective advantages	1. Can store multiple types of vectors (images, text, etc.) in one collection 2. Very low resource usage	1. Relatively good performance 2. Supports built-in embedding 3. Supports text search 4. GraphQL API 5. Supports S3 backup	1. Officially supported visual operation interface 2. High search accuracy 3. Rich SDK 4. GPU acceleration

In summary, Qdrant has particularly low overhead, Weaviate supports a combination of vector search, object storage, and inverted index, and Milvus has the strongest performance and the most features.

6 Comparison of Search Methods in Vector Databases

	Milvus	Weaviate	Qdrant
Unique search methods	Multi-vector search	BM25 keyword search + hybrid search	Keyword filtering applied to vector search

6.1 Milvus

Milvus supports two types of searches, depending on the number of vector fields in the collection: single-vector search and multi-vector search.

Single-vector search uses the search() method, comparing the query vector with existing vectors in the collection, returning the IDs of the most similar entities and their distances, and optionally returning the vector values and metadata of the results.

Multi-vector search applies to collections with two or more vector fields and is executed through the hybrid_search() method, which performs multiple approximate nearest neighbor (ANN) search requests and combines the results for re-ranking to return the most relevant matches. (Supported only in the latest 2.4.x version, with a maximum of 10 vectors for search)

Multi-vector search is particularly suitable for complex situations requiring high precision, especially when an entity can be represented by multiple different vectors. This applies to the same data (e.g., a sentence) processed by different embedding models or when multimodal information (e.g., a person’s image, fingerprint, and voiceprint) is converted into various vector formats. By performing “multi-path recall” across the table and assigning weights to these vectors, their combined effect can significantly increase recall capability and improve the effectiveness of search results.
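
Combining several ranked result lists into one is the core of this re-ranking step. The sketch below shows reciprocal rank fusion (RRF) with optional per-list weights, which is one common fusion strategy; it is an illustration of the idea rather than Milvus's own ranker, and the function name and the constant `k=60` are conventional choices, not values taken from this article.

```python
def rrf_fuse(rankings, weights=None, k=60):
    """Fuse ranked ID lists: score(d) = sum of weight / (k + rank of d in each list)."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))
```

For example, fusing `["a", "b", "c"]` with `["b", "c", "a"]` puts `b` first, because it ranks near the top of both lists.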

Other basic search operations:

Basic searches include single-vector search, batch vector search, partition search, and searches with specified output fields.
Filtered search refines search results based on filtering conditions of scalar fields.
Range search finds vectors within a specific distance range from the query vector.
Grouped search groups search results based on specific fields to ensure diversity in the results.

6.2 Weaviate

Vector similarity search: Covers a range of approximate search methods, seeking objects most similar to the query vector representation.
Image search: Uses images as input for similarity search.
Keyword search: A keyword search using the BM25F algorithm to rank results.
Hybrid search: Combines BM25 and similarity search to rank results.
Generative search: Uses search results as prompts for LLM.
Re-ranking: Re-ranks retrieved search results using a re-ranking module.
Aggregation: Aggregates data from the result set.
Filters: Applies conditional filters to searches.

6.3 Qdrant

Supported basic search operations:

Filtering by relevance score
Loading multiple search operations in a single request
Recommendation API
Grouping operations

Other search methods supported by Qdrant, as described in Qdrant’s own FAQ:

Does Qdrant support a full-text search or a hybrid search?

Qdrant is primarily a vector search engine, and we only implement full-text support when it does not affect vector search use cases. This includes interfaces and performance.

What Qdrant can do:

Use full-text filters for search

Apply full-text filters to vector searches (i.e., perform vector searches within records containing specific words or phrases)

Perform prefix search and semantic instant search

Features Qdrant plans to introduce in the future:

Support for sparse vectors, such as those used in SPLADE or similar models

Features Qdrant does not intend to support:

BM25 or other non-vector-based retrieval or ranking functions

Built-in ontologies or knowledge graphs

Query analyzers and other NLP tools

What is the difference between BM25 and simple keyword search?

Relevance Scoring:
- Simple keyword search is usually based on term frequency: if a term appears in a document, then the document is considered relevant. This method may only count the occurrence of keywords, and all keywords are considered equally important.
- BM25 uses a more complex algorithm that considers term frequency, document length, and the inverse document frequency of the term (i.e., its rarity across all documents). This means BM25 can provide a more refined relevance score, better reflecting the match between the query and the document.
Document Length Handling:
- Simple keyword search may not consider the length of the document. This may lead to longer documents (containing more words) being overly prioritized simply because they have more opportunities to contain the keywords.
- BM25 considers the length of the document through a normalization process within its algorithm, avoiding this bias and ensuring fairness in relevance scoring for both long and short documents.
Importance of Query Terms:
- In simple keyword search, all keywords may be treated equally, regardless of their commonality.
- BM25 uses inverse document frequency (IDF) to adjust the importance of each query term. This means terms that appear in fewer documents (more unique terms) will have a greater impact on the document’s relevance score.
Parameter Tuning:
- Simple keyword search usually does not have many configurable parameters to optimize search results.
- BM25 provides parameters (such as k1 and b) that allow fine-tuning of the algorithm’s sensitivity to suit different types of text and search needs.

Compared to simple keyword search, BM25 offers a more complex and refined method for evaluating the relevance between documents and queries, capable of producing more accurate and user-expected search results.
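To make the comparison concrete, here is a minimal sketch of BM25 scoring in plain Python. It uses one common variant of the formula (a smoothed IDF, log(1 + (N - n + 0.5) / (n + 0.5))). The toy corpus, the whitespace tokenizer, and the defaults k1 = 1.5 and b = 0.75 are illustrative assumptions, not settings taken from any of the databases discussed here.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with a BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms receive a larger IDF weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # b controls how strongly longer-than-average documents are penalized.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

# Toy corpus (assumption): two documents mention "cat" once, one does not.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat around the garden and the house all day",
    "vector databases store embeddings",
]
scores = bm25_scores("cat", docs)
```

A presence-only keyword match would treat the first two documents as equally relevant. With this sketch, the shorter document scores higher because of length normalization, and the third document scores zero.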

The open question is whether one solution can combine the semantic search of vector databases with the precision of traditional keyword search.

7 Appendix

7.1 ANN Benchmarks

Benchmark results depend on many factors that affect database performance, such as search type (filtered or regular search), configuration settings, indexing algorithms, data embeddings, and hardware. Beyond benchmark performance, the choice of a vector database should also take into account distributed capabilities, support for memory replicas and caching, the indexing algorithms used, vector similarity search capabilities (including hybrid search, filtering, and multiple similarity metrics), sharding mechanisms, clustering methods, scalability potential, data consistency, and overall system availability.

ANN-Benchmarks is a primary benchmarking platform for evaluating the performance of approximate nearest neighbor search algorithms. In text retrieval, the performance of vector databases on angular metrics is often more important than their performance on Euclidean metrics. This is because angular metrics are more sensitive to the semantic similarity of text documents, while Euclidean metrics are more sensitive to document length and scale. Therefore, in the context of retrieval-augmented generation, more attention should be paid to the performance of vector databases on angular datasets across different dimensions.

7.1.1 glove-100-angular

Evidently, Milvus has the highest throughput when the recall value is below 0.95. When the recall value exceeds 0.95, the throughput gap narrows. Vespa has the longest build time. Weaviate and Milvus have comparable build times, but Milvus is slightly longer. In terms of index size, Weaviate's index is the smallest. Although Milvus's index is the largest, it is still less than 1.5GB (for a dataset containing 1.2 million vectors, each with 100 dimensions).

7.1.2 nytimes-256-angular

The results on this dataset are similar to those on the glove-100-angular dataset. Weaviate has the longest build time and the smallest index on this dataset. Milvus's index is the largest, but it is only 440MB (for a dataset containing 290,000 vectors, each with 256 dimensions).

7.2 Vector Similarity Metrics

Metric	Description	Supported Databases
Cosine Distance	Measures the cosine of the angle between two vectors	pgvector, Pinecone, Weaviate, Qdrant, Milvus, Vespa
Euclidean Distance (L2)	Calculates the straight-line distance between two vectors in multidimensional space	pgvector, Pinecone, Qdrant, Milvus, Vespa
Inner Product (Dot Product)	Calculates the sum of the products of corresponding vector components	pgvector, Pinecone, Weaviate, Qdrant, Milvus
L2 Squared Distance	The square of the Euclidean distance between two vectors	Weaviate
Hamming Distance	Measures the number of differences between vectors in each dimension	Weaviate, Milvus, Vespa
Manhattan Distance	Measures the distance between vector dimensions along right-angle axes	Weaviate

Below is a detailed introduction to each metric, including their relative advantages, disadvantages, and suitable use cases.

7.2.1 Cosine Distance

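Cosine distance is based on the cosine of the angle between two vectors (the cosine similarity); the distance itself is usually defined as one minus that similarity. It is commonly used for handling normalized or convex sets.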

Advantages: Mainly considers the direction of vectors, making it very suitable for high-dimensional spaces, such as text comparison, where document length is less important.
Disadvantages: Not suitable for scenarios requiring matching vector dimensions, such as comparing image embeddings based on pixel density. If the data does not form a convex set, it may not provide an accurate similarity measure.

Cosine distance is suitable for document classification, semantic search, recommendation systems, and any other tasks involving high-dimensional and standardized data. In information retrieval, cosine distance is often used to measure the similarity between query content and document vectors, ignoring their length but focusing on semantic meaning.

7.2.2 Euclidean Distance (L2)

Euclidean distance calculates the straight-line distance between two vectors in multidimensional space, that is, the L2 norm of their difference.

Advantages: Intuitive, easy to calculate, sensitive to both the size and direction of vectors.
Disadvantages: May perform poorly in high-dimensional spaces due to the “curse of dimensionality.”

Suitable for image recognition, speech recognition, handwriting analysis, and other scenarios.

7.2.3 Inner Product

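Inner product calculates the sum of the products of corresponding vector components, also known as the dot product. Unlike the L1 and L2 distances, it is not a norm, and larger values mean greater similarity.

Advantages: Fast calculation, reflects both the size and direction of vectors.
Disadvantages: Because it is sensitive to vector size, vectors with large magnitudes can dominate the ranking unless the data is normalized.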

The classic application of inner product is in the field of recommendation systems. In recommendation systems, the inner product can be used to determine the similarity between user vectors and item vectors, helping predict a user’s interest in an item. Inner product is suitable for recommendation systems, collaborative filtering, and matrix decomposition.

7.2.4 L2 Squared Distance

The square of the Euclidean distance between two vectors.

Advantages: Penalizes large differences between vector elements, which can be useful in some situations.
Disadvantages: The square operation may distort distances and is sensitive to outliers.

L2 squared distance is particularly suitable for problems involving differences in individual dimensions, such as comparing the differences between two images in image processing.

7.2.5 Hamming Distance

Counts the number of dimensions in which two vectors differ.

Advantages: Suitable for comparing binary or categorical data.
Disadvantages: Not applicable to continuous or numerical data.

The applicable scenarios are quite specific, such as error detection and correction, or measuring the genetic distance between two DNA strands (categorical data).

7.2.6 Manhattan Distance (L1)

Measures the distance between two vectors as the sum of the absolute differences along each axis, that is, the L1 norm of their difference.

Advantages: More resistant to outliers than Euclidean distance.
Disadvantages: Less intuitive in geometric terms compared to Euclidean distance.

Suitable for grid-based (city-block) distances and shortest path problems in logistics planning.
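The six metrics in section 7.2 are short enough to write out directly. The sketch below uses plain Python lists as vectors; the function names are illustrative and do not correspond to the API of any database listed in the table.

```python
import math

def inner_product(a, b):
    # Sum of the products of corresponding components (dot product).
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # One minus the cosine of the angle between a and b.
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)

def l2_squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(l2_squared_distance(a, b))

def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming_distance(a, b):
    # Number of positions where the two vectors differ.
    return sum(x != y for x, y in zip(a, b))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length
```

Because `b` points in the same direction as `a`, the cosine distance between them is (numerically) zero, while the Euclidean, L2 squared, and Manhattan distances are all positive. This is the sense in which angular metrics ignore vector length and the others do not.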

From AI Knowledge Base to RAG

blog@zair.top (Tim) — Thu, 21 Mar 2024 17:14:22 +0800

When building AI applications, a common problem is that the AI has not seen the data a task depends on. For example, for enterprises, AI cannot know the information of every customer; for individuals, AI is unaware of personal and private information. However capable the AI is (an ideal world model is no exception), without task-specific data it loses the ability to analyze each specific problem on its own terms.

1 What is RAG

Retrieval-Augmented Generation (RAG) improves the accuracy and reliability of generative AI models by retrieving external information. If the process of a large language model (LLM) completing a task is compared to an exam, then a large model with RAG is taking an open-book exam, while one without RAG is taking a closed-book exam. RAG is a technology that helps LLMs retrieve information to improve generation results.

RAG was first proposed by Patrick Lewis and others in this paper. Lewis now works at Cohere, which currently provides API services including Embedding and Rerank models with good performance.

2 Why RAG is needed

RAG emerged to address several shortcomings of large language models in applications. The most prominent is hallucination, where the output of a large model does not match the facts or fabricates answers. In addition, the data used to train LLMs may be outdated, so LLMs know nothing about relatively new information.

RAG allows LLMs to access the latest or customized information and allows users to verify the information sources of LLMs to ensure their accuracy. The data retrieved by RAG can be public (such as search engines) or private (such as company information, personal sensitive data), which gives RAG broad application prospects. RAG is already widely used: for example, Nvidia’s NeMo Retriever reads internal company information, and Kimi Chat from Moonshot AI uses search engines to answer questions.

Jensen Huang introducing NeMo Retriever at GTC 2024

3 Knowledge Base Built Around RAG

AI knowledge bases are important tools that let AI be tailored to a specific task and complete it better. Currently, an AI knowledge base can be built in the following three ways:

Prompt Engineering
Fine Tuning
Embedding

Prompt engineering builds the knowledge base directly in the prompt by putting all the information into it. This method is suitable for small-scale use, but the number of tokens that current AI models accept is generally too small for this approach. Even if the input window one day becomes large enough to hold a general knowledge base, building a knowledge base will still have value, because the length of the input affects AI’s performance (~~at least the current models are like this~~); see Needle In A Haystack - Pressure Testing LLMs for details.

Fine-tuning, which is popular in academia, uses task-specific data to further train a pre-trained model. This approach is better suited to building an industry-wide large model, such as a legal or medical large model. On one hand, fine-tuning requires a considerable amount of training data and is costly; on the other hand, it is not flexible enough, for example when timely adjustments are needed based on one or two documents. Fine-tuning learns from and generalizes over the training data rather than memorizing its content, so it mainly enhances ability in a certain field.

So the mainstream way to build a knowledge base today is the Embedding method, and this form of knowledge base needs to be combined with RAG to be effective.

4 Basic Components of RAG

A classic, basic RAG composition is shown in the figure below.

The RAG system mainly includes three stages: indexing, retrieval, and generation.

4.1 Embedding

In this process, users first upload documents, and the system embeds them and stores the resulting vectors in a vector database. Embedding converts texts into vectors such that semantically similar texts end up close to each other, so this process is commonly known as vectorization.

4.2 Retrieval

When users ask the LLM a question, the question is embedded and matched against the vector database, which returns a series of candidate passages. This is the first stage of retrieval.

4.3 Rerank

The content returned directly by the vector database may not be ideal, and the results often do not match the query well, so a second stage of retrieval is needed: Rerank. In this stage, the Rerank model reorders the content returned in the previous stage by relevance. After Rerank is completed, the Top K results are passed to the subsequent generation stage.
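The three steps above (embed, retrieve, rerank) can be sketched without any model at all. In the toy pipeline below, everything is a stand-in chosen for illustration: `embed` is a hashed bag-of-words vector rather than a real embedding model, the "vector database" is a Python list, and `rerank` scores candidates by token overlap where a real system would call a Rerank model. All function names are hypothetical.

```python
import math
import zlib

DIM = 256  # size of the toy embedding vectors (assumption)

def embed(text):
    """Toy embedding: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(documents):
    # Indexing stage: store each document next to its vector.
    return [(doc, embed(doc)) for doc in documents]

def retrieve(index, question, top_n=3):
    # First-stage retrieval: rank by cosine similarity (vectors are unit length).
    q = embed(question)
    scored = [(sum(x * y for x, y in zip(q, vec)), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def rerank(question, candidates, top_k=1):
    # Second stage: a stand-in relevance score based on shared tokens.
    q_tokens = set(question.lower().split())

    def overlap(doc):
        return len(q_tokens & set(doc.lower().split())) / len(q_tokens)

    return sorted(candidates, key=overlap, reverse=True)[:top_k]

def build_prompt(question, context_docs):
    # Generation stage input: the Top K passages are placed in the prompt.
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

documents = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "the eiffel tower is in paris",
    "bananas are rich in potassium",
]
index = build_index(documents)
question = "what is the capital of france"
candidates = retrieve(index, question, top_n=3)
best = rerank(question, candidates, top_k=1)
prompt = build_prompt(question, best)
```

The resulting `prompt` is what would be sent to the LLM in the generation stage. Swapping `embed` for a real embedding model and `rerank` for a real Rerank model changes the quality of the results but not the shape of the pipeline.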

5 Implementing RAG in 5 Lines of Code

~~An assignment statement counts as one line~~

from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# RagRetriever also needs the `datasets` and `faiss` packages to be installed
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq",  # Model name is passed positionally
    index_name="compressed",  # Type of index
    use_dummy_dataset=True,  # Use a dummy dataset for testing
    dataset="wiki_dpr",  # Dataset used for retrieval
)

# Load the pre-trained tokenizer and model
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

# Tokenize the input question into input IDs
input_ids = tokenizer("What is the capital of France?", return_tensors="pt").input_ids

# Generate answer
outputs = model.generate(input_ids=input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

RagTokenizer is used for tokenizing text, RagTokenForGeneration is the generator part of the RAG model, and RagRetriever is responsible for retrieval. RagTokenizer.from_pretrained("facebook/rag-token-nq") loads a pre-trained tokenizer to convert text into a format that the model can understand (i.e., tokenization). RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever) loads a pre-trained RAG model. facebook/rag-token-nq is the name of the model and tokenizer, which are pre-trained on the Natural Questions dataset.

6 Open-source RAG Implementations

Dify is an LLM application development platform, with over 100,000 applications built based on Dify.AI. It integrates the concept of Backend as Service and LLMOps, covering the core technology stack needed to build generative AI native applications, including a built-in RAG engine. With Dify, you can deploy capabilities similar to Assistants API and GPTs based on any model. This project is hosted by a company in Suzhou and provides SaaS services.

Langchain-Chatchat is an open-source, offline deployable retrieval-augmented generation (RAG) large model knowledge base project based on large language models like ChatGLM and application frameworks like Langchain. Initially, it only supported the ChatGLM model, but later added support for many open-source models and online models.

The functional comparison of the two is shown in the table below:

	Dify-api	ChatChat
Peripheral Capabilities	General Document Reading	General Document Image OCR
Data Sources	Document Text Content, Vector Database	Search Engine, Vector Database
Model Support	Online Embedding Model, Online Rerank Model, Online LLM	Online Embedding Model, Offline Embedding Model, Offline LLM
Advanced Features	ES Hybrid Retrieval	None
Advanced RAG	Not Supported	Not Supported

In fact, there are some features that current open-source projects do not fully cover, such as:

Multimodal Capabilities
Traditional Relational Database Support
Multi-database Joint/Cross-database Information Retrieval
Citation Function
Advanced RAG
Evaluation Metrics

stderr and stdout: Understanding Logs and Output

blog@zair.top (Tim) — Sat, 17 Feb 2024 20:42:32 +0000

Standard output (stdout) and standard error (stderr) are simple concepts but play a central role in logging, error handling, and data stream management. This article explores the differences and applications of stdout and stderr, especially how to use them effectively in a Python environment.

1 Standard Output (`stdout`) and Standard Error (`stderr`)

In most operating systems, standard output and standard error are the two main output streams of a process. They provide a mechanism for a process to send information and error messages to a terminal or file. Although these two streams may be physically the same (e.g., both displayed on the same terminal interface), they are used for different logical purposes:

Standard Output (stdout): Typically used for outputting the results of program execution or normal operational information.
Standard Error (stderr): Specifically used for outputting error messages or warnings, which usually need to be seen or recorded even when standard output is redirected.

2 `print` and `logging` in Python

In Python, the print function sends information to stdout by default, while the logging module sends log messages to stderr by default. This distinction is made to separate normal program output from logs (including error and debug information), making it easier for developers to manage and filter output information.

2.1 Using `print`

print is the most basic output function in Python, used to send information to the standard output stream. It is simple and easy to use, suitable for quick debugging or displaying information to the user. For example:

print("Hello, world!")

2.2 Using `logging`

The logging module provides a flexible framework for adding log messages to an application. Unlike print, logging supports different log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL), allowing developers to adjust the detail level and output location of logs as needed. For example:

import logging

logging.error('This is an error message')
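The difference between the two defaults can be checked by capturing both streams. The sketch below temporarily replaces sys.stdout and sys.stderr with in-memory buffers; the logger name `stream_demo` and the helper `capture_streams` are hypothetical names used only for this illustration.

```python
import io
import logging
from contextlib import redirect_stderr, redirect_stdout

def capture_streams():
    """Run one print call and one logging call, returning what each stream received."""
    out, err = io.StringIO(), io.StringIO()
    with redirect_stdout(out), redirect_stderr(err):
        logger = logging.getLogger("stream_demo")
        logger.propagate = False  # keep the record away from any root handlers
        # StreamHandler() without arguments writes to the current sys.stderr.
        handler = logging.StreamHandler()
        logger.addHandler(handler)
        print("result: 42")               # goes to stdout
        logger.error("something failed")  # goes to stderr
        logger.removeHandler(handler)
    return out.getvalue(), err.getvalue()

stdout_text, stderr_text = capture_streams()
```

After the call, `stdout_text` holds only the printed result and `stderr_text` holds only the log message, which is why redirecting one stream leaves the other untouched.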

3 tqdm and `stderr`

In complex or long-running programs, using a progress bar is an effective way to show the progress of a process to the user. Python’s tqdm library is a widely used tool for adding progress bars to the command line. tqdm outputs progress information to stderr by default to avoid interfering with normal program output (stdout).

3.1 Splitting `stdout` and `stderr`

In some cases, it is useful to separate normal output from error or log messages, such as redirecting them to different files or terminals. On the command line, redirection operators > and 2> can be used to achieve this. In Python code, finer control can be achieved by configuring the logging module or using specific file objects.

python script.py > output.log 2> error.log

By using command line redirection, Python’s print function, or even the logging module, you can flexibly control and split these two types of output, making error handling, logging, and user interaction clearer and more orderly.

4 Managing `stdout` and `stderr` with `nohup`

When deploying long-running background processes, the nohup command becomes an important tool. nohup, or “no hang up”, allows commands to continue running after the user logs out, which is especially useful for remotely started tasks. A key feature of nohup is its ability to manage stdout and stderr.

By default, using nohup to run a command will merge stdout and stderr and redirect them to the nohup.out file unless otherwise specified. This means that both normal output and error messages will be captured in the same file for later review. However, in some cases, it may be more useful to separate these two outputs.

4.1 Separating `stdout` and `stderr` with `nohup`

To output stdout and stderr to different files when using nohup, you can use redirection operators in combination. For example:

nohup python script.py > output.log 2> error.log &

This command redirects stdout to output.log, stderr to error.log, and runs in the background using &. This way, even if the terminal or SSH session is closed, the program will continue to run, and its output will be properly recorded.

5 Buffering Behavior in Python

stdout and stderr behave differently when buffering data. By default, stdout is line-buffered when connected to a terminal, caching data until a newline character is received or the buffer is full; in non-interactive mode, stdout is block-buffered (like a file). stderr, however, is always line-buffered (before Python 3.9, it was block-buffered in non-interactive mode). The following content is from the official documentation sys — System-specific parameters and functions — Python 3.12.2 documentation

When interactive, the stdout stream is line-buffered. Otherwise, it is block-buffered like regular text files. The stderr stream is line-buffered in both cases. You can make both streams unbuffered by passing the -u command-line option or setting the PYTHONUNBUFFERED environment variable.

Changed in version 3.9: Non-interactive stderr is now line-buffered instead of fully buffered.

The smaller the buffering granularity, the more timely the output, but the greater the IO cost. In Python 3.8 and earlier, stdout and stderr had the same buffering granularity, which was not very reasonable; since version 3.9, stderr has a smaller buffering granularity, meaning each write reaches its destination sooner than with stdout. This difference makes stderr suitable for error and log information: even if a program crashes or exits abnormally, this information is more likely to have been written out than standard output.

In C++, standard error is unbuffered (see below), which is more aggressive, but I personally think this is more reasonable.

Fortunately, in Python, you can disable this buffering behavior using python -u or by setting the PYTHONUNBUFFERED environment variable, or directly control the output timing by calling sys.stdout.flush().
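The effect of line buffering can be observed without a terminal. The sketch below wraps an in-memory byte stream in io.TextIOWrapper, the same class that backs sys.stdout and sys.stderr, and checks what has reached the underlying stream before and after an explicit flush(). The helper name `visible_bytes` is hypothetical, and the in-memory stream is only a stand-in for a file or pipe.

```python
import io

def visible_bytes(line_buffering):
    """Return what reached the underlying byte stream before and after flush()."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(
        raw, encoding="utf-8", newline="\n", line_buffering=line_buffering
    )
    stream.write("error: disk full\n")
    before_flush = raw.getvalue()
    stream.flush()
    after_flush = raw.getvalue()
    stream.detach()  # keep `raw` usable once `stream` is discarded
    return before_flush, after_flush

block_buffered = visible_bytes(line_buffering=False)
line_buffered = visible_bytes(line_buffering=True)
```

On CPython, the line-buffered stream has already delivered the message when the newline is written, while the other stream holds it back until flush() is called. For a single important message, `print(..., flush=True)` has the same effect as calling `sys.stdout.flush()` right after the print.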

6 Behavior in Python Concurrent Environments

When using stdout and stderr in multithreaded or multiprocess environments, output may interleave or become chaotic because output from different threads or processes may interfere with each other when writing to a terminal or file. Ways to solve this problem include creating separate output files for each thread or process, or using thread locks (such as threading.Lock) or process synchronization mechanisms (such as multiprocessing.Lock) to synchronize access to stdout or stderr.
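A minimal sketch of the locking approach for threads follows. Each worker builds a line with two separate writes, so without the lock another thread could write in between them. The shared io.StringIO stands in for sys.stdout or a shared log file, and the names `report` and `run_workers` are hypothetical.

```python
import io
import threading

output_lock = threading.Lock()

def report(stream, worker_id, n_lines):
    for step in range(n_lines):
        # The lock makes the two writes below appear as one uninterrupted line.
        with output_lock:
            stream.write(f"worker {worker_id}: ")
            stream.write(f"step {step}\n")

def run_workers(n_workers=4, n_lines=50):
    stream = io.StringIO()  # stands in for sys.stdout or a shared log file
    threads = [
        threading.Thread(target=report, args=(stream, worker_id, n_lines))
        for worker_id in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return stream.getvalue().splitlines()

lines = run_workers()
```

The order of lines across workers still depends on scheduling, but every line is complete. The same pattern applies to processes with multiprocessing.Lock, provided the lock is shared with the child processes.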

7 Controlling `stdout` and `stderr` in Python

In complex applications, you may need more flexible control over the destination of output streams. Python provides several ways to achieve this:

Redirecting stdout and stderr: You can redirect the standard output and error output of a Python program by changing the values of sys.stdout and sys.stderr. This is particularly useful for capturing and analyzing output, or redirecting output to non-standard output devices such as graphical interfaces.
Using the subprocess module: When running external commands or scripts, the subprocess module allows you to control the stdout and stderr streams of the command, including redirecting them to variables within the Python program, or separating or merging them.
Advanced applications of the logging module: Python’s logging module supports outputting logs to multiple destinations, including files, standard output, networks, etc. By configuring different log handlers, you can implement complex log management schemes, such as splitting logs to different outputs based on log level or message content.
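Two of these approaches can be sketched in a few lines. The first part runs a child Python process with subprocess.run and captures its stdout and stderr separately. The second part configures a logger so that DEBUG and INFO records go to one stream and WARNING and above go to another. The names `MaxLevelFilter`, `build_split_logger`, and `split_demo` are hypothetical, and the in-memory buffers stand in for sys.stdout and sys.stderr.

```python
import io
import logging
import subprocess
import sys

# Part 1: run a child process and capture its two streams separately.
child_code = "import sys; print('data'); print('warning', file=sys.stderr)"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,  # collect stdout and stderr through separate pipes
    text=True,            # decode the captured bytes to str
)
child_stdout, child_stderr = result.stdout, result.stderr

# Part 2: split log records between two streams by level.
class MaxLevelFilter(logging.Filter):
    """Let through only records at or below `max_level`."""

    def __init__(self, max_level):
        super().__init__()
        self.max_level = max_level

    def filter(self, record):
        return record.levelno <= self.max_level

def build_split_logger(name, out_stream, err_stream):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep records away from the root logger's handlers

    low = logging.StreamHandler(out_stream)   # receives DEBUG and INFO
    low.addFilter(MaxLevelFilter(logging.INFO))
    high = logging.StreamHandler(err_stream)  # receives WARNING and above
    high.setLevel(logging.WARNING)

    logger.addHandler(low)
    logger.addHandler(high)
    return logger

out_buf, err_buf = io.StringIO(), io.StringIO()
demo_logger = build_split_logger("split_demo", out_buf, err_buf)
demo_logger.info("job started")
demo_logger.error("job failed")
```

In a real program, `build_split_logger("app", sys.stdout, sys.stderr)` would send routine messages to stdout and problems to stderr, so the two can be redirected independently on the command line.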

7.1 Recommendations

Manage output carefully: When designing software, clearly distinguish between output for user interaction (stdout) and output for error reporting or logging (stderr). This helps improve the usability and maintainability of the program.
Optimize performance: Consider the performance impact of output operations, especially in scenarios with high-frequency logging or data output. Reasonable use of buffering and batch processing can reduce the impact on performance.
Security considerations: Before outputting sensitive information, perform appropriate filtering and desensitization to avoid leaking sensitive data through logs.

By deeply understanding and flexibly applying stdout and stderr, you can build more robust and manageable Python applications, effectively handle logs and output, and improve user experience and application stability.

8 Buffering Behavior in C++

In C++, stdout (usually corresponding to std::cout) and stderr (corresponding to std::cerr) have different buffering strategies:

std::cout is typically line-buffered when it is connected to a terminal, which means the output is flushed on each newline or when the buffer is full.
std::cerr is unbuffered by default, so data written to std::cerr is immediately output, which is very useful for reporting error information because it reduces the risk of error information not being output due to program crashes.

9 Redirecting `stdout` and `stderr`

In a C++ program, stdout and stderr can be redirected in several ways. A common method is to use the freopen function to redirect standard output or error output to a file at runtime:

freopen("output.txt", "w", stdout);
freopen("error.log", "w", stderr);

This method can be used to redirect output to a file for later analysis and debugging.

10 Use in C++ Multithreaded Environments

When using std::cout and std::cerr in multithreaded C++ programs, race conditions may occur, leading to chaotic output. To avoid this, it is recommended to use mutex locks (such as std::mutex) to synchronize access to these streams:

#include <iostream>
#include <mutex>
#include <thread>

std::mutex cout_mutex;

void thread_function(int id) {
    std::lock_guard<std::mutex> lock(cout_mutex);
    std::cout << "Thread " << id << " is running\n";
}

int main() {
    std::thread t1(thread_function, 1);
    std::thread t2(thread_function, 2);

    t1.join();
    t2.join();

    return 0;
}

11 Controlling Output in C++

The C++ standard library provides std::streambuf, which can be used to implement finer control over std::cout and std::cerr, including redirection and custom buffering behavior. By inheriting from std::streambuf and overriding the corresponding member functions, you can create custom buffering strategies or redirect output to GUI components, network connections, etc.

11.1 Recommendations

Use buffering wisely: Choose an appropriate buffering strategy based on the application scenario. For error information that requires immediate feedback, use std::cerr or manually flush std::cout.
Avoid direct use of standard output in multithreading: Use mutex locks or other synchronization mechanisms to ensure the consistency and order of output.
Use redirection and custom streambuf: To handle output more flexibly, consider using redirection or custom streambuf to meet special output needs, such as logging, network transmission, etc.

By mastering these advanced techniques, you can effectively manage and control program output while ensuring the robustness and flexibility of C++ programs.

Automatic Segmentation Tool for Long Webpage Screenshots

blog@zair.top (Tim) — Tue, 06 Feb 2024 11:29:01 +0000

1 Background

When needing to share or analyze web content, long screenshots are a very practical form as they can fully display the page. However, processing these long screenshots while maintaining their information integrity and readability, and facilitating subsequent operations, has always been a challenge. For example, as of early 2024, mainstream AI image models on the market still cannot handle very large and complex images. If a long screenshot is forcibly input into the model, it will result in degraded performance (many details cannot be recognized). To solve this problem, I developed a tool based on OpenCV, aimed at simplifying the processing of long screenshots while maintaining their content integrity and readability.

This project is open source on my GitHub: https://github.com/Ryaang/Web-page-Screenshot-Segmentation

Unlike many existing tools or methods, Web-page-Screenshot-Segmentation uses OpenCV to automatically identify and follow the natural dividing lines of web content, automatically finding the most suitable segmentation points. This means that whether it is titles, paragraphs, or charts, they can be neatly retained in the segmented images without content breakage or omission.

Using Web-page-Screenshot-Segmentation is very simple. You only need to prepare a long screenshot, and the tool will automatically analyze the image content and intelligently decide the segmentation points. The result will be a series of complete and well-structured images, convenient for sharing and further processing.

2 Introduction

This project is used to segment long screenshots of web pages into several parts based on the height of the text. The main idea is to find the low-variation areas of the image and then find the segmentation lines in these areas.

The output is small but complete images of the webpage, which can be used to generate webpages or train models using Screen-to-code. More results can be found in the images directory.
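The main idea described above, finding low-variation areas and placing a segmentation line inside each one, can be illustrated with a short sketch. This is not the tool's actual implementation: it works on plain lists of grayscale values instead of an OpenCV image to stay dependency-free, it uses the per-row standard deviation as the variation measure, and the parameter names only mirror the tool's `variation_threshold` and `height_threshold` by assumption.

```python
from statistics import pstdev

def find_split_heights(gray_rows, variation_threshold=0.5, height_threshold=3):
    """Return the middle row of every band of consecutive low-variation rows.

    A band only counts if it is at least `height_threshold` rows tall.
    """
    heights = []
    band_start = None
    # The trailing None acts as a sentinel that closes a band reaching the bottom.
    for y, row in enumerate(list(gray_rows) + [None]):
        is_flat = row is not None and pstdev(row) <= variation_threshold
        if is_flat and band_start is None:
            band_start = y
        elif not is_flat and band_start is not None:
            if y - band_start >= height_threshold:
                heights.append((band_start + y - 1) // 2)
            band_start = None
    return heights

# Synthetic 17-row "screenshot": content, a blank gap, content, a short blank tail.
content_row = [0, 255] * 4
blank_row = [255] * 8
image = [content_row] * 5 + [blank_row] * 5 + [content_row] * 5 + [blank_row] * 2
heights = find_split_heights(image)
```

For this synthetic image the only band tall enough is rows 5 to 9, so the sketch proposes a single line in its middle. The two blank rows at the bottom are ignored because they are shorter than `height_threshold`.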

3 Getting Started

3.1 Installation

 pip install Web-page-Screenshot-Segmentation

4 Using in Command Line

Get the height of the segmentation lines of the image

python -m Web_page_Screenshot_Segmentation.master -f "path/to/img"

The output should be a list: [6, 868, 1912, 2672, 3568, 4444, 5124, 6036, 7698]. It is a list of the heights of the image segmentation lines. If you want to display this segmentation line in the image, you can add the -s True parameter:

python -m Web_page_Screenshot_Segmentation.master -f "path/to/img" -s True

4.1 Draw Segmentation Lines in the Image

python -m Web_page_Screenshot_Segmentation.drawer --image_file path/to/image.jpg --hl "[100,200]" --color "(0,255,0)"

4.2 Split Image

python -m Web_page_Screenshot_Segmentation.spliter --f path/to/image.jpg -ht "[233,456]"

You will get the segmented images saved in the path returned by the command.
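Conceptually, splitting at a list of heights is just slicing the image rows at those positions. The sketch below shows that step on a plain list of rows; it is an illustration of the idea rather than the code behind the spliter command, and `split_rows` is a hypothetical name.

```python
def split_rows(rows, heights):
    """Cut `rows` at each height and return the non-empty segments, top to bottom."""
    # Ignore heights outside the image, then add the top and bottom edges.
    bounds = [0] + sorted(h for h in heights if 0 < h < len(rows)) + [len(rows)]
    return [rows[top:bottom] for top, bottom in zip(bounds, bounds[1:]) if bottom > top]

rows = list(range(10))  # stand-in for 10 image rows
segments = split_rows(rows, [3, 7])
```

With an image loaded as a NumPy array, the same slicing (`image[top:bottom]`) would yield one sub-image per segment.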

For more usage explanations, please refer to the help:

python master.py --help
python spliter.py --help

5 Using from Source Code

5.1 split_heights Function

The split_heights function is used to segment the image into several parts based on various thresholds. It accepts the following parameters:

file_path: The path of the image file.
split: A boolean indicating whether to split the image.
height_threshold: The height threshold of the low-variation area.
variation_threshold: The variation threshold of the low-variation area.
color_threshold: The color difference threshold.
color_variation_threshold: The color difference variation threshold.
merge_threshold: The minimum distance threshold between two lines.

If split is False, the function returns a list of the heights of the segmentation lines; if split is True, it returns the path of the segmented images.

5.1.1 Example Usage

import Web_page_Screenshot_Segmentation
from Web_page_Screenshot_Segmentation.master import split_heights

# Split the image at 'path/to/image.jpg' into several parts
split_image_path = split_heights(
    file_path='path/to/image.jpg',
    split=True,
    height_threshold=102,
    variation_threshold=0.5,
    color_threshold=100,
    color_variation_threshold=15,
    merge_threshold=350
)

print(f"The segmented images are saved in {split_image_path}")

In this example, the image at ‘path/to/image.jpg’ is segmented into several parts based on the provided thresholds. The segmented images are saved in the path returned by the function.

5.2 draw_line_from_file Function

The draw_line_from_file function is used to draw lines on the image at specified heights. It accepts the following parameters:

image_file: The path of the image file.
heights: A list of heights at which to draw lines.
color: The color of the lines, given in OpenCV's BGR channel order. The default color is red (0, 0, 255).

The function reads the image from the provided file path, draws lines at the specified heights, and then saves the modified image to a new file. The new file is saved in the result directory, with the same name as the original file but with ‘result’ added before the file extension.

If the function cannot read the image file (e.g., because the file path contains ‘.’ or Chinese characters), it raises an exception.
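Because an unreadable path only shows up as an exception, it can help to check the path before calling the function. The sketch below is a hypothetical pre-flight check, not part of the library. It assumes that "contains ‘.’" refers to dots other than the one before the file extension, and that any non-ASCII character (such as Chinese) may cause trouble; both are guesses based on the sentence above.

```python
from pathlib import Path


def path_looks_risky(image_file):
    """Return True if the path matches the problem cases described above.

    Hypothetical helper. Assumptions: extra dots in the file name or in a
    directory name, and any non-ASCII character, count as risky.
    """
    p = Path(image_file)
    # Non-ASCII characters anywhere in the path (e.g. Chinese directory names).
    has_non_ascii = not str(p).isascii()
    # A dot inside the file name other than the extension separator.
    stem_has_dot = "." in p.stem
    # A dot inside a directory name, ignoring the "." and ".." components.
    dir_has_dot = any(
        "." in part for part in p.parent.parts if part not in (".", "..")
    )
    return has_non_ascii or stem_has_dot or dir_has_dot


print(path_looks_risky("path/to/image.jpg"))
print(path_looks_risky("shots/page.v2.png"))
```

If the check returns True, copying the image to a plain ASCII path before calling draw_line_from_file is one way to avoid the problem.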

5.2.1 Example Usage

import Web_page_Screenshot_Segmentation
from Web_page_Screenshot_Segmentation.spliter import draw_line_from_file

# Draw lines at heights 100 and 200 on the image at 'path/to/image.jpg'
result_image_path = draw_line_from_file(
    image_file='path/to/image.jpg',
    heights=[100, 200],
    color=(0, 255, 0)  # Draw lines in green
)

print(f"The modified image is saved in {result_image_path}")

In this example, the image at ‘path/to/image.jpg’ is modified to draw green lines at heights 100 and 200. The modified image is saved in the path returned by the function.
