Large Language Model Meets Graph Neural Network in Knowledge Distillation (2024)

Shengxiang Hu (School of Computer Engineering and Science, Shanghai University, Shanghai, China; shengxianghu@shu.edu.cn), Guobing Zou (School of Computer Engineering and Science, Shanghai University, Shanghai, China; gbzou@shu.edu.cn), Song Yang (School of Computer Engineering and Science, Shanghai University, Shanghai, China; yangsong@shu.edu.cn), Yanglan Gan (School of Computer Science and Technology, Donghua University, Shanghai, China; ylgan@dhu.edu.cn), Bofeng Zhang (School of Computer and Information Engineering, Shanghai Polytechnic University, Shanghai, China; bfzhang@sspu.edu.cn), and Yixin Chen (Department of Computer Science and Engineering, Washington University in St. Louis, MO, USA; chen@cse.wustl.edu)

Abstract.

Recent advancements in leveraging Large Language Models (LLMs) for learning on Text-Attributed Graphs (TAGs) have shown significant potential, but practical deployment is often hindered by substantial computational and storage demands. Conventional Graph Neural Networks (GNNs) are more efficient but struggle with the intricate semantics embedded in TAGs. To combine the semantic understanding of LLMs with the efficiency of GNNs, we propose a novel LLM-to-GNN knowledge distillation framework, Linguistic Graph Knowledge Distillation (LinguGKD), which employs TAG-oriented instruction tuning to train pre-trained LLMs as teachers and introduces a layer-adaptive contrastive distillation strategy to align node features between teacher LLMs and student GNNs within a shared latent space, effectively transferring the semantic and complex relational understanding of LLMs to GNNs. Extensive experiments across various LLM and GNN architectures on multiple datasets demonstrate that LinguGKD significantly enhances the predictive accuracy and convergence rate of GNNs without requiring additional training data or model parameters. Compared to teacher LLMs, the distilled GNNs offer superior inference speed and reduced resource requirements, making them highly practical for deployment in resource-constrained environments. Furthermore, our framework demonstrates significant potential for leveraging ongoing advancements in LLM research to continuously improve GNN performance.

1. Introduction

Text-Attributed Graphs (TAGs) integrate structured graph data with rich textual information, providing a comprehensive representation of complex systems and encapsulating extensive knowledge. These graphs are extensively utilized across diverse domains (Li et al., 2022; Yang and Shi, 2024). Recently, the revolutionary impact of Large Language Models (LLMs), such as ChatGPT (Ouyang et al., 2022) and Llama (Touvron et al., 2023), on natural language processing has brought new opportunities and challenges to the application and research of TAGs.

While LLMs exhibit remarkable reasoning and problem-solving capabilities for complex tasks, they do not perform satisfactorily in every context. Studies (Pan et al., 2024; Li et al., 2024) have shown that integrating knowledge graphs can enhance the reasoning and knowledge-handling capabilities of LLMs. Furthermore, traditional Graph Neural Networks (GNNs) (Veličković et al., 2018; Wu et al., 2019; Chen et al., 2020) excel at interpreting graph structures but struggle with semantic processing (Li et al., 2023b), especially as the complexity and volume of associated textual data increase. LLMs, with their exceptional contextual and relational understanding, offer a novel perspective on TAGs by effectively capturing the semantic nuances embedded in textual data. Integrating LLMs with GNNs bridges this semantic gap, leveraging the structural strengths of GNNs and the semantic prowess of LLMs. These observations highlight both the demand for and the challenges of LLM-based graph learning, which can not only amplify semantic interpretation in TAGs but also enhance overall performance on graph-based tasks, underscoring the necessity and significance of this research direction.

Recent advancements in LLM-based graph learning follow two main approaches: LLM as Enhancer (LaE) and LLM as Predictor (LaP) (Li et al., 2023a). LaE approaches (He et al., 2024; Chen et al., 2024; Wei et al., 2024) enhance node embeddings in GNNs by utilizing the semantic processing capabilities of LLMs, addressing traditional GNN limitations in extracting semantic features from TAGs. In contrast, LaP approaches (Wang et al., 2024; Fatemi et al., 2023; Ye et al., 2023) employ LLMs directly for prediction tasks in graph-related contexts, either by adapting Transformer-based models to incorporate graph structures or by encoding graph data in natural language for inference, thereby significantly improving both semantic processing and structural understanding in graph-related tasks. These works demonstrate the potential of LLMs as foundational models for graph learning. However, the practical application of LLMs in graph learning faces significant challenges, particularly due to their large parameter counts, which often exceed billions (see the Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). This leads to high computational and storage demands, making cost-effective and widespread deployment challenging. Additionally, the extended inference latency of LLMs presents practical limitations in operational settings.

We therefore ask whether it is possible to retain the powerful semantic and entity-relationship understanding of graph-oriented LLMs while reducing their computational and storage consumption, making them suitable for large-scale deployment in resource-constrained production environments. To address these challenges, knowledge distillation (KD) (Chen et al., 2022; Samy et al., 2023; Joshi et al., 2024) from LLMs to GNNs emerges as a promising strategy, enabling the transfer of LLM insights to more compact GNNs. This approach not only reduces model size and computational demands relative to LLMs but also leverages GNNs' strengths in processing structured graph data. By distilling the semantic and structural understanding of LLMs into lightweight GNNs, we expect to improve graph reasoning tasks and facilitate more effective and adaptable real-world applications. However, LLMs and GNNs are designed for different types of data and have significantly different architectures, posing substantial challenges for knowledge transfer from LLMs to GNNs. This challenge remains largely unexplored in current research. Developing effective methods for knowledge distillation between these models is thus crucial to unlocking their combined potential.

Motivated by the above, we propose a novel LLM-to-GNN knowledge distillation framework, Linguistic Graph Knowledge Distillation (LinguGKD). To the best of our knowledge, this is the first framework that directly distills knowledge from teacher LLMs to student GNNs. Given the early stage of graph-oriented LLM research and the lack of off-the-shelf LLMs designed specifically for graph tasks, and inspired by (Ye et al., 2023), we begin by instruction tuning a pre-trained LLM (PLM) with carefully designed, tailored graph instruction prompts. This process equips the PLM with the capability to understand and process graph structures, yielding an effective teacher LLM, named LinguGraph LLM. We then introduce a layer-adaptive contrastive distillation strategy, complemented by a feature alignment mechanism that synchronizes the feature spaces of the LLM and the GNN, ensuring that the hierarchical node features learned by the teacher LLM are effectively aligned with those extracted by the student GNN. By doing so, the teacher LLM's deep semantic knowledge and intricate understanding of graph structures are propagated to the student GNN, leading to better TAG understanding capabilities.

Our extensive experimental evaluations, focused on node classification, span various LLM and GNN models as well as multiple benchmark datasets, demonstrating the efficacy of the proposed LinguGKD framework in distilling graph knowledge from teacher LLMs to student GNNs and verifying its strong generality across different LLM and GNN architectures. Specifically, the distilled GNNs exhibit a significant reduction in model complexity, with a much lower parameter count and notably faster inference than LLMs, making them well suited to real-world applications. From the perspective of effectiveness, distilled GNNs not only achieve higher accuracy and faster convergence than their vanilla counterparts but also outperform GNNs with advanced designs in certain scenarios, all without the need for additional training data or architectural changes. These results validate the effectiveness of our knowledge distillation framework and underscore its potential to enhance the practicality of LLMs in graph data processing, achieving an optimal balance between performance and efficiency.

The main contributions of this paper are summarized as follows:

  • We conceptualize the novel research problem of knowledge distillation from LLMs to GNNs and propose an innovative graph knowledge distillation framework termed LinguGKD, which leverages the comprehensive semantic insights of graph-oriented teacher LLMs to enrich student GNNs’ structural learning capabilities while maintaining their high efficiency.

  • Within the LinguGKD framework, we develop a unique layer-adaptive contrastive distillation strategy, which ensures effective synchronization of hierarchical node features between the teacher LLM and the student GNN, guaranteeing the transfer of deep semantic knowledge and complex graph structural understanding.

  • Extensive experimental evaluations across diverse LLM and GNN models as well as multiple benchmark datasets demonstrate that LinguGKD significantly enhances the classification accuracy of student GNNs while maintaining their lightweight nature. The distilled GNNs strike an optimal balance between downstream-task performance and time and space efficiency, making them practical for deployment on user devices in real-world scenarios.

The remainder of this paper is organized as follows: Section 2 provides the foundations and formulates the research problem; Section 3 delves into the details of the LinguGKD framework; Section 4 presents extensive experimental results and analyses; Section 5 reviews relevant literature; and Section 6 concludes the paper.

2. Preliminaries

This section formalizes the key concepts central to our study.

Definition 1 (Text-Attributed Graph).

A Text-Attributed Graph (TAG) is a graph in which each node is associated with textual data. Formally, a TAG is denoted as $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{X})$, where $\mathcal{V}=\{v_i\}_{i=1}^{n}$ represents the set of nodes, with $n$ being the total number of nodes; $\mathcal{E}$ is the set of edges, with $e_{ij}\in\mathcal{E}$ indicating an edge between nodes $v_i$ and $v_j$; and $\mathcal{X}=\{x_i\}_{i=1}^{n}$ denotes the node attributes, where $x_i$ is the textual attribute of node $v_i$.

Given a TAG $\mathcal{G}$, GNNs are essential for interpreting the graph's topological dependencies:

Definition 2 (Graph Neural Network).

Graph Neural Networks (GNNs) are specialized for handling graph-structured data, primarily through a $k$-layer message-passing mechanism (Kipf and Welling, 2017), enabling the capture and analysis of $k$-hop node relationships and graph dynamics. This process is defined as:

(1) $\mathbf{h}_v^{(k)} = f\Big(\mathbf{h}_v^{(k-1)},\ \bigoplus_{u\in\mathcal{N}(v)} g\big(\mathbf{h}_u^{(k-1)}, \mathbf{h}_v^{(k-1)}\big)\Big)$

where $\mathbf{h}_v^{(k)}$ is the feature representation of node $v$ at the $k$-th layer, $\mathcal{N}(v)$ is the set of $v$'s neighboring nodes, and $g(\cdot)$ and $f(\cdot)$ are trainable functions responsible for aggregating neighbor features and updating node features, respectively. The operator $\bigoplus$ denotes an aggregation function, such as summation (Zhang, 2020) or averaging (Kipf and Welling, 2017).
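For concreteness, the following PyTorch sketch implements one message-passing layer in the spirit of Eq. (1), using mean aggregation for the $\bigoplus$ operator and a ReLU update; the class and tensor names are illustrative choices rather than a specific GNN from the literature.

```python
import torch
import torch.nn as nn

class MeanMessagePassingLayer(nn.Module):
    """One message-passing step of Eq. (1), sketched with mean aggregation."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.g = nn.Linear(2 * in_dim, in_dim)   # message function g(h_u, h_v)
        self.f = nn.Linear(2 * in_dim, out_dim)  # update function f(h_v, aggregated messages)

    def forward(self, h: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # h: [n, d] node features; edge_index: [2, m] with rows (source u, target v).
        src, dst = edge_index
        msg = self.g(torch.cat([h[src], h[dst]], dim=-1))           # one message per edge
        agg = torch.zeros(h.size(0), msg.size(1), device=h.device)
        agg = agg.index_add(0, dst, msg)                            # sum messages per target node
        deg = torch.bincount(dst, minlength=h.size(0)).clamp(min=1).unsqueeze(-1)
        agg = agg / deg                                             # mean over N(v)
        return torch.relu(self.f(torch.cat([h, agg], dim=-1)))      # updated h_v^{(k)}
```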

While GNNs excel at processing graph structures, they have limitations in understanding individual node semantics. To leverage the capability of LLMs in semantic feature learning, an LLM-to-GNN Graph Knowledge Distillation framework can be designed to distill knowledge from a teacher LLM to a student GNN, enhancing the GNN’s capability in interpreting node semantics and complex graph topology. This can be defined as follows:

Definition 3 (LLM-to-GNN Graph Knowledge Distillation).

An LLM-to-GNN Graph Knowledge Distillation (GKD) framework can be formulated as a quintuple $\langle \mathcal{M}_T, \mathcal{M}_S, \mathcal{S}, \mathcal{F}, \mathcal{A} \rangle$, where the teacher model $\mathcal{M}_T$ is a Transformer-based LLM (Vaswani et al., 2017) fine-tuned on the TAG $\mathcal{G}$ for generative graph inference tasks; $\mathcal{M}_S$ denotes the student GNN specified for discriminative tasks; and $\mathcal{F}$ represents the knowledge transferred from $\mathcal{M}_T$ to $\mathcal{M}_S$ via distillation scheme $\mathcal{S}$, learned by extraction algorithm $\mathcal{A}$. The distillation process is formulated as follows:

(2) $\mathcal{F}=\{\mathcal{F}_T,\mathcal{F}_S\}=\{\mathcal{A}_T(\mathcal{M}_T,\mathcal{G}),\ \mathcal{A}_S(\mathcal{M}_S,\mathcal{G})\}$
(3) $\mathcal{L}_D = \mathrm{loss}(\mathcal{F}_S, \mathcal{F}_T)$
(4) $\mathcal{M}_S^{\mathrm{new}} = \mathcal{S}(\mathcal{M}_S, \mathcal{L}_D)$

where $\mathcal{A}_T$ and $\mathcal{A}_S$ are the knowledge extraction algorithms of $\mathcal{M}_T$ and $\mathcal{M}_S$, respectively; $\mathrm{loss}(\cdot)$ denotes a divergence function (e.g., Kullback-Leibler divergence); and $\mathcal{M}_S^{\mathrm{new}}$ is the distilled student model that we ultimately require.

3. Approach

Figure 1 illustrates the LinguGKD framework for TAG-oriented graph knowledge distillation, highlighting three key components: teacher feature learning by the LLM, student feature learning by the GNN, and layer-adaptive contrastive distillation loss between the two feature sets. Before delving into these crucial modules, we elaborate on how to fine-tune a TAG-oriented Pre-trained Language Model (PLM) to understand graph structure and node semantics through our tailored instruction prompts.

3.1. TAG Instruction Tuning of Pre-trained LLM

Given a TAG $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{X})$, a center node $v_i$, and a specific neighbor hop $k$, we can obtain a collection of neighbor subgraphs from structure-free up to the $k$-th hop: $\mathcal{G}_i^k=\{\mathcal{G}_i^{(l)}\}_{l=0}^{k}$, where $\mathcal{G}_i^{(k)}=(v_i, \mathcal{N}_i^{(k)}, \mathcal{E}_{\mathcal{N}_i^{(k)}}, \mathcal{X}_{\mathcal{N}_i^{(k)}})$ denotes the $k$-th hop subgraph, in which $\mathcal{N}_i^{(k)}$ is the set of $k$-hop neighbors of $v_i$, and $\mathcal{E}_{\mathcal{N}_i^{(k)}}$ and $\mathcal{X}_{\mathcal{N}_i^{(k)}}$ are the corresponding edges and textual node attributes within the subgraph, respectively.

To enable the PLM to accurately understand graph structure and node semantics, it is essential to craft comprehensive instruction prompts for instruction tuning. For a given $\mathcal{G}_i^{(k)}$, we define a specific instruction prompt $\mathbf{p}_k$, which consists of three components: a task-specific instruction $\mathcal{I}$ that delineates the expected model action, a structural prompt $\tau_k$ that describes the subgraph in natural language, and a task-relevant query $\mathcal{Q}$, typically presented as a detailed question. The prompt $\mathbf{p}_k$ is the concatenation of these elements, as shown in the lower left part of Figure 1:

(5) $\mathbf{p}_k = \mathrm{concat}(\mathcal{I}, \tau_k, \mathcal{Q})$

Here, we employ node classification as a side task to enable the LLM to comprehend graphs. In alignment with OpenAI's Prompt Engineering principles (OpenAI, 2023), we carefully design $\mathcal{I}$ and $\mathcal{Q}$ as follows:

Prompt 1 (TAG Node Classification Instruction).

Implement a node classification system for {{type of graph}}, representing nodes as tuples (node_{{id}}, {{degree}}, {{attribute}}). Classify nodes into [{{list of classification categories}}] based on attributes and link relations. {{classification criteria for a specific graph}}.

Prompt 2 (TAG Node Classification Query).

Which category should (node_{{id}}, {{node degree}}, {{node attributes}}) be classified as?

In these templates, each node is represented by a tuple that encapsulates the node’s id, degree, and the corresponding textual attributes. The placeholders within curly braces {{}} are filled based on the specific graph data. More precisely, the term {{type of graph}} specifies the domain this graph belongs to, while {{list of classification categories}} enumerates the potential categories for node classification. Furthermore, {{classification criteria for a specific graph}} denotes the optional prior knowledge that assists the LLM in the precise generation of labels for the center node.

In crafting the structural prompt, we prioritize key elements such as node textual attributes, node degrees, and multi-order neighbor interactions, in line with the principles of conventional multi-layer message-passing GNNs. Following the strategies of (Ye et al., 2023), we develop a linguistic structural encoder $f_e$ that transforms $\mathcal{G}_i^{(k)}$ into a detailed natural language description, represented as $\tau_k$:

(6) $\tau_k = f_e(\mathcal{G}_i^{(k)})$

The prompt template of $f_e$ is designed as follows:

Prompt 3 (TAG Structural Prompt).

(node_{{id}}, {{node degree}}, {{node attributes}}) is connected within {{k}} hops to {{k-th hop neighbors}} through paths that may involve {{intermediate paths}}.

In this context, {{k-th hop neighbors}} refers to the set of neighbors reachable at the $k$-th hop, i.e., $\mathcal{N}_i^{(k)}$. Each neighbor in this list is characterized by a tuple that encapsulates its id, degree, and attributes, mirroring the representation of the central node. Furthermore, {{intermediate paths}} represents the sequences of paths connecting the central node $v_i$ to its $k$-th hop neighbors, encompassing the edges defined within $\mathcal{E}_{\mathcal{N}_i^{(k)}}$. This template outlines the connectivity and relational dynamics within the subgraph based on node degrees, attributes, and the specified hop distances.

Iterating the structural encoding process, we generate a structural prompt for every subgraph within $\mathcal{G}_i^k$, yielding the $k$-hop structural prompt set $\mathcal{T}=\{f_e(\mathcal{G}_i^{(l)})\}_{l=0}^{k}$ for the central node $v_i$. Notably, $l=0$ focuses exclusively on $v_i$'s textual attributes. This iterative generation of structural prompts systematically captures the nuanced relational dynamics at varying degrees of connectivity, from the immediate vicinity ($l=0$) to the broader $k$-hop neighborhood. By concatenating the instruction $\mathcal{I}$ and query $\mathcal{Q}$ with the structural prompts, we obtain a set of graph instruction prompts $\mathcal{P}$:

(7) $\mathcal{P}=\{\mathrm{concat}(\mathcal{I}, \tau_l, \mathcal{Q})\}, \quad \forall \tau_l \in \mathcal{T}$
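For illustration, the following sketch shows how the prompt set $\mathcal{P}$ of Eq. (7) could be assembled from Prompts 1-3; the helper names, argument layout, and the toy values in the example are our assumptions, not the authors' released tooling.

```python
def node_tuple(node_id, degree, attribute):
    # Render a node as the tuple (node_id, degree, attribute) used throughout Prompts 1-3.
    return f"(node_{node_id}, {degree}, {attribute})"

def structural_prompt(center, hop, neighbors, paths):
    # A simplified f_e from Eq. (6): describe the hop-specific subgraph in natural language (Prompt 3).
    neigh = ", ".join(node_tuple(*n) for n in neighbors)
    return (f"{node_tuple(*center)} is connected within {hop} hops to {neigh} "
            f"through paths that may involve {paths}.")

def graph_instruction_prompts(instruction, query, center, hop_subgraphs):
    # Eqs. (5) and (7): concatenate the instruction, structural prompt tau_l, and query for l = 0..k.
    prompts = [" ".join([instruction, node_tuple(*center), query])]      # l = 0: attributes only
    for l, (neighbors, paths) in enumerate(hop_subgraphs, start=1):      # l = 1..k
        prompts.append(" ".join([instruction,
                                 structural_prompt(center, l, neighbors, paths),
                                 query]))
    return prompts

# Toy example: center node 7 with degree 3 and a short title as its textual attribute.
prompts = graph_instruction_prompts(
    instruction="Implement a node classification system for a citation graph ...",
    query="Which category should (node_7, 3, 'Graph distillation survey') be classified as?",
    center=(7, 3, "'Graph distillation survey'"),
    hop_subgraphs=[([(12, 5, "'GNN pruning'"), (31, 2, "'LLM tuning'")], "direct citations")],
)
```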

Then we adopt instruction tuning (Zhang et al., 2023; Si et al., 2023; Honovich et al., 2023) to specifically tailor PLMs for generative node classification tasks. Specifically, each prompt $\mathbf{p}_l \in \mathcal{P}$ is used to directly fine-tune the PLM to generate the semantic category $\mathbf{y}_i$ of the corresponding center node, without any modification of the pre-trained tokenizer or vocabulary. We use the negative log-likelihood as our objective function:

(8) $\mathcal{L}_T(\mathcal{P}) = -\sum_{\mathbf{p}_l\in\mathcal{P}} \sum_{j=1}^{|\mathbf{y}|} \log p(\hat{y}_j \mid \mathbf{p}_l, \hat{y}_{<j})$

where $\hat{y}_j$ is the $j$-th token that the LLM generates for the node label. Through this instruction tuning methodology, we obtain an LLM highly proficient in TAG understanding. This fine-tuned LLM, which we call LinguGraph LLM, then acts as the teacher model $\mathcal{M}_T$ in the subsequent graph knowledge distillation process.
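A minimal sketch of the objective in Eq. (8) using the Hugging Face transformers API is given below; the backbone checkpoint and the convention of masking prompt tokens with -100 are assumptions on our part.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed backbone; any decoder-only PLM exposed through the same interface would work.
model_name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_name)
plm = AutoModelForCausalLM.from_pretrained(model_name)

def instruction_tuning_loss(prompt: str, label_text: str) -> torch.Tensor:
    """Negative log-likelihood of the label tokens given the prompt, as in Eq. (8)."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    label_ids = tok(label_text, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, label_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(1)] = -100   # mask prompt tokens; only label tokens are scored
    out = plm(input_ids=input_ids, labels=labels)
    return out.loss                           # accumulated over all prompts in a real training loop
```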

3.2. Knowledge Distillation from LinguGraph LLM to GNN

In this section, we detail the key components of the proposed LinguGKD framework, addressing the challenge of transferring insights from complex teacher LLMs to simpler student GNNs. This process emphasizes aligning the node latent features extracted by both the LLM and the GNN within a unified latent space. The following subsections elaborate on three pivotal phases of LinguGKD’s knowledge distillation process: extracting semantically-enhanced node features via LLMs, leveraging GNNs for structural node feature extraction, and implementing layer-adaptive alignment of semantic and structural features to distill knowledge from the teacher LLM to the student GNN.

3.2.1. Teacher Feature Learning via LinguGraph LLM

We begin by leveraging the fine-tuned LinguGraph LLM $\mathcal{M}_T$ to extract semantically enriched node features, encapsulating textual attributes and multi-order neighbor information, as depicted in the Teacher Feature Learning module in Figure 1.

Building upon insights from (Xiao et al., 2023), we observe that tailored instructions significantly enhance the LLM's proficiency in generating semantic features. Consequently, we use the entire instruction prompt set $\mathcal{P}$ for extracting node semantic features, rather than limiting it to the structural prompt set $\mathcal{T}$. LLMs are built primarily on the Transformer architecture (Vaswani et al., 2017), which comprises an embedding layer for token embedding, an $n$-layer transformer with multi-head self-attention for deriving word-level nonlinear interrelations, and an output layer for specific generative tasks. We therefore extract node semantic features as follows.

For an instruction prompt $\mathbf{p}_l \in \mathcal{P}$, we process its sequence of tokens through the embedding layer:

(9) $E^L = \mathrm{Embedding}^L(\{\rho_i\}_{i=1}^{|\mathbf{p}_l|}; W_{\mathrm{emb}}^L)$

where $\mathrm{Embedding}^L(\cdot)$ is the embedding layer of the LLM $\mathcal{M}_T$ and $W_{\mathrm{emb}}^L$ represents its parameters. $E^L$ is the embedded sequence, each row of which represents a token.

Next, the embedded sequence is fed into the transformer layers:

(10) $H = \mathrm{Transformer}(E^L; W_{\mathrm{tr}})$

where $\mathrm{Transformer}(\cdot)$ denotes the transformer module of the LLM $\mathcal{M}_T$ and $W_{\mathrm{tr}}$ represents its parameters. The transformer layers, using multi-head self-attention, compute the contextual relationships between tokens, resulting in a contextualized feature matrix $H$.

Finally, we take the feature of the last token $\rho_{|\mathbf{p}_l|}$ from the final transformer layer as the $l$-th order node latent feature:

(11) $\mathbf{h}_l^L = H_{|\mathbf{p}_l|,:}$

Here, $\mathbf{h}_l^L \in \mathbb{R}^{d_L}$ encapsulates the aggregated contextual information of the entire instruction prompt, integrating neighbor details and extensive node attribute data, where $d_L$ denotes the dimension of the node latent features extracted by the teacher LLM. Through this process, we obtain a rich semantic representation for each node, leveraging the comprehensive understanding capabilities of the LLM.
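A hedged sketch of Eq. (11), reusing the `plm` and `tok` objects from the earlier instruction-tuning sketch, is shown below; it simply reads the final-layer hidden state of the last prompt token.

```python
import torch

@torch.no_grad()
def teacher_node_feature(plm, tok, prompt: str) -> torch.Tensor:
    """h_l^L of Eq. (11): final-layer hidden state of the last token of the prompt."""
    enc = tok(prompt, return_tensors="pt")
    out = plm(**enc, output_hidden_states=True)
    last_layer = out.hidden_states[-1]   # [1, seq_len, d_L]
    return last_layer[0, -1, :]          # feature of the final token rho_{|p_l|}
```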

To distill hierarchical knowledge from the teacher LLM, we then pass each $l$-th order feature $\mathbf{h}_l^L$ through a hop-specific neural knowledge filter $\mathcal{M}_f^l$, which filters the pertinent layer knowledge:

(12) $\mathcal{M}_f^l(\mathbf{h}_l^L) = \sigma(W_l \mathbf{h}_l^L + b_l), \quad 0 \leq l \leq k$
(13) $\hat{\mathbf{h}}_l^L = \mathrm{LayerNorm}(\mathcal{M}_f^l(\mathbf{h}_l^L))$

where $W_l$ and $b_l$ are the trainable parameters of the filter $\mathcal{M}_f^l$, and $\sigma$ denotes a non-linear activation function.

Subsequently, to align the node features from different hops into the same lower-dimensional distillation vector space without altering their distribution, a cross-hop shared linear feature projector $\mathcal{M}_p$ restructures these features:

(14) $\mathbf{h}_l^T = \mathcal{M}_p(\hat{\mathbf{h}}_l^L; W_p, b_p) = W_p \hat{\mathbf{h}}_l^L + b_p$

where $\mathbf{h}_l^T \in \mathbb{R}^{d_k}$ is the adapted $l$-order teacher knowledge, ready for the subsequent distillation steps, and $W_p$ and $b_p$ are the trainable parameters of $\mathcal{M}_p$.

Applying this process across all subgraph orders up to the $k$-th, we obtain a set of hierarchical teacher node features $\mathcal{F}_T=\{\mathbf{h}_l^T\}_{l=0}^{k}$, which serves as the specific knowledge to be distilled from the teacher LLM to the student GNN.
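The hop-specific filters and the shared projector (Eqs. (12)-(14)) can be sketched as a small PyTorch module; the choice of sigmoid for $\sigma$ and the class name are our assumptions.

```python
import torch
import torch.nn as nn

class TeacherKnowledgeAdapter(nn.Module):
    """Hop-specific knowledge filters (Eqs. 12-13) and a cross-hop shared projector (Eq. 14)."""

    def __init__(self, k: int, d_llm: int, d_kd: int):
        super().__init__()
        self.filters = nn.ModuleList(nn.Linear(d_llm, d_llm) for _ in range(k + 1))  # M_f^l, l = 0..k
        self.norm = nn.LayerNorm(d_llm)
        self.projector = nn.Linear(d_llm, d_kd)                                      # shared M_p

    def forward(self, llm_features: list) -> list:
        # llm_features[l]: the l-th order LLM feature h_l^L, shape [d_llm] or [batch, d_llm].
        teacher_knowledge = []
        for l, h in enumerate(llm_features):
            filtered = self.norm(torch.sigmoid(self.filters[l](h)))   # sigma assumed to be sigmoid
            teacher_knowledge.append(self.projector(filtered))        # h_l^T in the distillation space
        return teacher_knowledge
```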

3.2.2. Student Feature Learning via GNN

We then leverage the student GNN $\mathcal{M}_S$ to extract multi-hop node features, as shown in the Student Feature Learning module in Figure 1. The chosen student model can be any off-the-shelf GNN. Although various GNN models have different architectures (Kipf and Welling, 2017; Veličković et al., 2018; Hamilton et al., 2017; Xu et al., 2018), they all interpret graph structures through a message-passing mechanism that enables central nodes to assimilate features from $k$-hop neighbors, capturing local graph structure nuances. The core principle of message passing remains consistent: a structured process involving message construction, aggregation, and node feature updating. These stages allow GNNs to effectively represent intricate graph structures.

For a given $k$-hop neighbor subgraph $\mathcal{G}_i^{(k)}$ of a central node $v_i$, the $k$-order message aggregation process in the GNN unfolds as follows:

(15) $\mathbf{h}_j^{(0)} = \mathrm{Embedding}^G(x_j; W_{\mathrm{emb}}^G), \quad \forall v_j \in v_i \cup \mathcal{N}_i^{(k)}$
(16) $\mathbf{m}_{i\leftarrow j}^{(l)} = \mathcal{M}_{msg}^{(l)}(\mathbf{h}_i^{(l-1)}, \mathbf{h}_j^{(l-1)}, e_{ij}; W_{msg}^{(l)}), \quad 0 < l \leq k$
(17) $\mathbf{h}_l^G = \mathcal{M}_{update}^{(l)}\big(\mathbf{h}_i^{(l-1)},\ \bigoplus_{v_j\in\mathcal{N}_i} \mathbf{m}_{i\leftarrow j}^{(l)}; W_{update}^{(l)}\big)$

where $x_j \in \mathcal{X}_{\mathcal{N}_i^{(k)}}$ is the attribute of node $v_j$, and $\mathrm{Embedding}^G(\cdot)$ is the text embedding model of $\mathcal{M}_S$, such as bag-of-words or TF-IDF, which converts node textual attributes into a low-dimensional vector space and establishes the initial node feature $\mathbf{h}_j^{(0)}$. The functions $\mathcal{M}_{msg}^{(l)}(\cdot)$ and $\mathcal{M}_{update}^{(l)}(\cdot)$ perform message construction and node feature updates, respectively, during $l$-order message passing. $\bigoplus$ denotes a differentiable, permutation-invariant function (e.g., sum, mean, or max) that performs message aggregation. The parameters $W_{msg}^{(l)}$ and $W_{update}^{(l)}$ are the associated trainable weights. The output of the $l$-th message-passing layer, $\mathbf{h}_l^G \in \mathbb{R}^{d_G}$, represents the $l$-th order feature of the central node $v_i$, reflecting its $l$-order neighbor structure and attributes.

Subsequently, we synchronize the node features into the unified distillation vector space via a normalization layer:

(18) $\mathbf{h}_l^S = \mathrm{Norm}(\mathbf{h}_l^G), \quad 0 \leq l \leq k$

where $\mathbf{h}_l^S \in \mathbb{R}^{d_k}$ denotes the student knowledge, and $\mathrm{Norm}(\cdot)$ is a normalization function, typically batch or layer normalization. Applying this process across all subgraph orders up to the $k$-th, we obtain a set of hierarchical student node features $\mathcal{F}_S=\{\mathbf{h}_l^S\}_{l=0}^{k}$.
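Below is a hedged sketch of the student-side feature extraction (Eqs. (15)-(18)) built on PyTorch Geometric's `GCNConv`; the extra linear map for the $l=0$ feature is a simplification we add so that all hops share one distillation dimension, and is not prescribed by the paper.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv   # assumes PyTorch Geometric; any GNN layer could be swapped in

class StudentGNN(nn.Module):
    """Collects the hierarchical student features F_S = {h_l^S} of Eqs. (15)-(18) — a sketch."""

    def __init__(self, k: int, d_in: int, d_kd: int):
        super().__init__()
        dims = [d_in] + [d_kd] * k
        self.convs = nn.ModuleList(GCNConv(dims[l], dims[l + 1]) for l in range(k))
        self.norms = nn.ModuleList(nn.LayerNorm(d_kd) for _ in range(k))
        # Structure-free (l = 0) feature mapped to the shared distillation dimension (our choice).
        self.proj0 = nn.Linear(d_in, d_kd)
        self.norm0 = nn.LayerNorm(d_kd)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> list:
        features = [self.norm0(self.proj0(x))]          # h_0^S
        h = x
        for conv, norm in zip(self.convs, self.norms):
            h = torch.relu(conv(h, edge_index))         # Eqs. (15)-(17) via the GCN layer
            features.append(norm(h))                    # Eq. (18): h_l^S for l = 1..k
        return features
```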

In the next phase, we focus on aligning the student GNN's learned node features $\mathcal{F}_S$ with those from the teacher LLM, $\mathcal{F}_T$. This alignment ensures that the LLM's rich semantic knowledge is effectively integrated into the GNN, enhancing its capability to process graph-structured data.

3.2.3. Layer-Adaptive Contrastive Distillation

In graph inference, the relevance of node features varies significantly with the order of their neighbors: structure-free features highlight a node's core attributes, lower-order features elucidate direct interactions, and higher-order features provide insight into distant connections, which are essential for understanding a node's role within the graph. In different downstream tasks, the importance of features encoding information from various neighbor orders can differ significantly. For instance, tasks such as community detection may rely more on higher-order features (Huang et al., 2019), while node classification might benefit more from lower-order features (Kipf and Welling, 2017).

Given these variations, a one-size-fits-all distillation approach may fail to capture the nuanced importance of features at different hops. Therefore, to fully leverage the teacher LLM’s comprehensive understanding of different neighbor orders, we propose a Layer-Adaptive Contrastive Distillation mechanism within the LinguGKD framework. This approach tailors the distillation process to the importance of node features at each hop, ensuring the student GNN effectively captures the teacher LLM’s nuanced knowledge of both local and global graph structures. This layer-adaptive strategy is essential for optimizing the student GNN’s performance across diverse tasks by facilitating precise and task-relevant knowledge transfer.

To achieve effective knowledge transfer, it is crucial to align the feature spaces of the teacher LLM and the student GNN. Contrastive learning with the InfoNCE loss (Oord et al., 2018) is particularly well suited to this task because it encourages the alignment of similar (positive) feature pairs while ensuring that dissimilar (negative) pairs remain distinguishable. By leveraging contrastive learning, we can measure and minimize the divergence between the layer-wise features extracted by both models, ensuring that the student GNN accurately mimics the teacher LLM's deep semantic understanding.

To design an effective contrastive distillation loss, we proceed as follows. For each node $v_i$ and its corresponding $l$-th order feature, the positive sample pairs the same node's feature as learned by the teacher LLM with that learned by the student GNN, ensuring that features of the same node are aligned across the two models. The negative samples are features of nodes from categories different from that of the center node $v_i$; this selection maintains a clear distinction between classes, thereby improving the classification accuracy of the student GNN. Formally, given a node $v_i$, its $l$-th order feature from the teacher LLM, $\mathbf{h}_l^T$, and from the student GNN, $\mathbf{h}_l^S$, the positive pair is defined as:

(19) $(\mathbf{h}_l^S, \mathbf{h}_l^T)$

For each positive pair, we sample $N$ negative pairs $\{(\mathbf{h}_l^S, \mathbf{h}_{l,(m)}^{T*})\}_{m=1}^{N}$, where $\mathbf{h}_{l,(m)}^{T*}$ is the $l$-th order teacher feature of a node from a different category. The InfoNCE loss for each $l$-th order feature, controlled by a temperature parameter $t$, is expressed as follows:

(20) $\mathcal{L}_D^l = -\,\mathbb{E}\left[\log \frac{\exp(\mathrm{sim}(\mathbf{h}_l^S, \mathbf{h}_l^T)/t)}{\sum_{m=1}^{N}\exp(\mathrm{sim}(\mathbf{h}_l^S, \mathbf{h}_{l,(m)}^{T*})/t)}\right]$

where $\mathrm{sim}(\cdot,\cdot)$ denotes a similarity function, such as cosine similarity. The temperature parameter $t$ controls the smoothness of the probability distribution, ensuring that the model focuses on hard negative samples that are more challenging to distinguish from the positives.
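A minimal sketch of the per-layer loss in Eq. (20) with cosine similarity is given below; the batch layout (precomputed positive and negative teacher features) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def layer_infonce(h_s: torch.Tensor, h_t: torch.Tensor, h_t_neg: torch.Tensor,
                  t: float = 0.1) -> torch.Tensor:
    """Per-layer contrastive distillation loss following Eq. (20).

    h_s:     [B, d]    student features h_l^S for a batch of center nodes
    h_t:     [B, d]    matching teacher features h_l^T (positives)
    h_t_neg: [B, N, d] teacher features of nodes from other categories (negatives)
    """
    pos = F.cosine_similarity(h_s, h_t, dim=-1) / t                    # [B]
    neg = F.cosine_similarity(h_s.unsqueeze(1), h_t_neg, dim=-1) / t   # [B, N]
    # -log( exp(pos) / sum_m exp(neg_m) ), averaged over the batch, as written in Eq. (20).
    return (torch.logsumexp(neg, dim=-1) - pos).mean()
```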

Recognizing the distinct importance of different-order neighbor structures in various downstream tasks, we introduce a trainable distillation factor $\gamma_l$ for each layer's distillation loss, allowing the model to adaptively focus on the layers that are most critical for the specific task at hand. The overall layer-adaptive contrastive distillation loss is then computed as the weighted sum of the layer-specific contrastive losses:

(21) $\mathcal{L}_D = \sum_{l=0}^{k} \gamma_l \mathcal{L}_D^l$

Here, each order's distillation factor $\gamma_l$ ensures a balanced knowledge distillation, enabling the effective transfer of complex semantic and structural insights from the LLM to the GNN.
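A sketch of the layer-adaptive combination in Eq. (21) with trainable factors $\gamma_l$ follows; parameterizing the factors through a softmax is our choice, not necessarily the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class LayerAdaptiveDistillLoss(nn.Module):
    """Weighted sum of per-layer contrastive losses (Eq. 21) with trainable factors gamma_l."""

    def __init__(self, k: int):
        super().__init__()
        self.gamma_raw = nn.Parameter(torch.zeros(k + 1))   # one factor per hop l = 0..k

    def forward(self, layer_losses: list) -> torch.Tensor:
        # Softmax keeps the factors positive and comparable across hops (an illustrative choice).
        gamma = torch.softmax(self.gamma_raw, dim=0)
        return torch.stack([g * loss for g, loss in zip(gamma, layer_losses)]).sum()
```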

3.3. Model Training

In training the student GNN, given the different inference frameworks of the teacher and student models, it is essential not only to distill knowledge from the teacher LLM but also to train a task-specific prediction layer for the GNN. We therefore treat the student GNN's training as a multi-task joint optimization problem. For instance, in the node classification scenario, we use a fully connected layer as the classifier:

(22) $\hat{y}=\mathrm{softmax}(W_{\text{G}}\mathbf{h}_{k}^{S}+b_{\text{G}})$

where $\mathbf{h}_{k}^{S}$ denotes the output of the $k$-th layer of the GNN, $\hat{y}$ represents the predicted node label, and $W_{\text{G}}$ and $b_{\text{G}}$ are the weights and biases of the fully connected layer.

Subsequently, the node classification loss function, formulated as cross-entropy, is computed as:

(23) $\mathcal{L}_{\text{G}}=-\sum_{i=1}^{|\mathcal{D}_{\text{tr}}|}y_{i}\log(\hat{y}_{i})$

where $y_{i}$ denotes the actual node label, $\hat{y}_{i}$ is the predicted probability for each category, and $\mathcal{D}_{\text{tr}}$ is the training set.

The overall training objective integrates the KD loss $\mathcal{L}_{\text{D}}$ with the classification loss $\mathcal{L}_{\text{G}}$, obtaining a joint loss function:

(24) $\mathcal{L}=\alpha\mathcal{L}_{\text{G}}+\beta\mathcal{L}_{\text{D}}$

where $\alpha$ and $\beta$ are tunable factors for adaptively balancing the influence of the knowledge distillation loss and the downstream task loss on the training process.

Finally, the student GNN undergoes end-to-end training with a mini-batch AdamW optimizer, which updates the model parameters efficiently for robust performance.
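The following sketch puts Eqs. (22)-(24) together in a single AdamW training step, assuming the student GNN returns its per-layer node features and that the teacher features consumed by the distillation loss have been pre-extracted; attribute names such as batch.teacher_feats are hypothetical.

```python
import torch
import torch.nn.functional as F

def train_step(gnn, classifier, distill_loss, batch, optimizer, alpha=1.0, beta=1.0):
    """One mini-batch of the joint optimization: L = alpha * L_G + beta * L_D (Eq. 24)."""
    optimizer.zero_grad()
    h_layers = gnn(batch.x, batch.edge_index)             # student features for layers 0..k
    logits = classifier(h_layers[-1])                     # Eq. (22): W_G h_k^S + b_G
    loss_g = F.cross_entropy(logits, batch.y)             # Eq. (23); softmax applied internally
    loss_d = distill_loss(h_layers, batch.teacher_feats)  # Eqs. (20)-(21): layer-adaptive KD loss
    loss = alpha * loss_g + beta * loss_d                 # Eq. (24): joint objective
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. optimizer = torch.optim.AdamW(list(gnn.parameters()) + list(classifier.parameters()), lr=1e-4)
```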

4. Experiments

Table 1. Statistics of the benchmark datasets.

| | Cora | PubMed | Arxiv |
| # Node | 2,708 | 19,717 | 169,343 |
| # Edge | 5,429 | 44,338 | 1,166,243 |
| # Class | 7 | 3 | 40 |
| # Features | 1,433 | 500 | 128 |
| Embedding Tech. | BoW | TF-IDF | Skip-gram |
| Train : Val : Test split | 6:2:2 | 6:2:2 | 5.4:1.8:2.8 |

4.1. Datasets and Backbone Models

Datasets

We validated the effectiveness of our proposed LinguGKD framework through node classification experiments on three widely-adopted benchmark datasets: Cora, PubMed (Yang etal., 2016), and Arxiv (Hu etal., 2020). These datasets represent academic papers as nodes and citations as edges. The node attributes consist of titles and abstracts, encapsulating the core content of each paper. For knowledge distillation and GNN training, we utilized the default node embeddings generated by various techniques (e.g., bag of words (BoW), TF-IDF) inherent to these datasets without any alterations.

Due to the lack of initial text attributes for each node in the original datasets, we reconstructed titles, abstracts, and other text attributes for each node following the method described in (He etal., 2024) for graph instruction tuning of the teacher LLM.

For dataset partitioning, we followed the split strategy adopted in (Ye etal., 2023; He etal., 2024). Specifically, we applied a 6:2:2 split for the Cora and PubMed datasets, while for the Arxiv dataset, we used the 5.4:1.8:2.8 split as in the OGB open benchmark (Hu etal., 2020). Comprehensive dataset statistics are summarized in Table 1.

Backbone Models

To verify the effectiveness and generality of our proposed LinguGKD framework, we selected multiple LLMs with different architectures as teachers and GNNs as students. Specifically, we chose Mistral-7B (Jiang etal., 2023), Llama2-7B (Touvron etal., 2023), and Llama3-8B (AI@Meta, 2024) as the teacher models. For student GNNs, we selected GCN (Kipf and Welling, 2017), GAT (Veličković etal., 2018), GraphSAGE (Hamilton etal., 2017), and GIN (Xu etal., 2018), known for their effectiveness and efficiency in graph-based tasks.

4.2. Experimental Settings

Table 2. Experimental settings.

| Parameter | Value |
| Instruction Tuning of PLM | |
| Maximum neighboring subgraph order ($k$) | 3 |
| Maximum number of instruction prompts per hop ($\theta$) | 2 |
| Maximum sequence length ($s_{max}$) | 512 |
| Hidden feature dimension ($d_{L}$) | 4096 |
| Batch size ($mb_{L}$) | 2 |
| Gradient accumulation ($g_{acc}$) | 2 |
| Learning rate ($lr_{L}$) | $2\times10^{-4}$ |
| Training duration ($ep_{L}$) | 1 epoch |
| LoRA parameters | $lora_{r}=64$, $lora_{\alpha}=32$ |
| Graph Knowledge Distillation | |
| Learning rate ($lr_{G}$) | $1\times10^{-4}$ |
| Batch size ($mb_{G}$) | 32 |
| Training epochs ($ep_{G}$) | 500 |
| Feature dimension ($d_{G}$, $d_{k}$) | [64, 128, 256, 512, 1024] |
Instruction Tuning of PLM

During the LLM's fine-tuning phase, for each center node $v_{i}$, we defined a maximum neighboring subgraph order $k$ of 3, constructing instruction prompts $\mathcal{P}=\{\mathcal{P}_{l}\}_{l=0}^{3}$ to cover structural prompts from structure-free to 3rd-hop subgraphs. We standardized the maximum number of instruction prompts per hop ($\theta$) at 2, resulting in totals of 62,216, 121,580, and 2,000,123 instruction prompts for Cora, PubMed, and Arxiv, respectively.

We utilized pre-trained models from huggingface.co. The LLM input's maximum sequence length ($s_{max}$) was capped at 512, with a hidden feature dimension ($d_{L}$) of 4096. Training settings included a minimal batch size ($mb_{L}$) of 2, gradient accumulation ($g_{acc}$) also set at 2, and a learning rate ($lr_{L}$) of $2\times10^{-4}$, all within a concise training duration of 1 epoch ($ep_{L}$). We integrated LoRA and 4-bit quantization techniques, with LoRA's parameters set to $lora_{r}=64$, dropout at 0.1, and alpha ($lora_{\alpha}$) at 32. This configuration resulted in approximately 1,800,000 trainable parameters, covering modules such as q_proj, k_proj, v_proj, o_proj, and lm_head.
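As a rough illustration of this fine-tuning setup, the sketch below configures 4-bit loading and LoRA with Hugging Face transformers and peft; only the values listed above (r=64, alpha=32, dropout 0.1, target modules) come from our settings, while details such as the nf4 quantization type and the exact checkpoint name are assumptions.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# load the teacher backbone in 4-bit (quantization type is an assumption)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # one of the teacher backbones from huggingface.co
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention projections and lm_head, as described above
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()        # only the adapter weights remain trainable
```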

In the node label generation step of the LinguGraph LLM, each node was classified for every instruction prompt via greedy decoding. A majority voting scheme was then applied, selecting the most frequently predicted class across prompts as the final prediction, which yields a balanced outcome across the prompt set.
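A minimal sketch of this voting step follows, with a hypothetical llm_generate helper standing in for greedy decoding with the tuned LinguGraph LLM:

```python
from collections import Counter

def predict_node_label(llm_generate, prompts):
    """Classify one center node by majority vote over its instruction prompts.

    llm_generate: maps an instruction prompt to a predicted class-label string
                  (greedy decoding assumed); prompts: all prompts built for the node.
    """
    votes = [llm_generate(p) for p in prompts]
    # the most frequently predicted class across prompts is the final label
    return Counter(votes).most_common(1)[0][0]
```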

Graph Knowledge Distillation

For distilling graph knowledge from the LinguGraph LLM to GNNs, across all benchmark datasets, we standardized the learning rate ($lr_{G}$) at $1\times10^{-4}$ and the batch size ($mb_{G}$) at 32, and extended the training to 500 epochs ($ep_{G}$). The performance of our LinguGKD was assessed across different message-passing layers [0, 1, 2, 3].

For simplicity, we aligned the feature dimension of the GNN ($d_{G}$) with that of the distilled knowledge ($d_{k}$). We then explored the effects of varying the hidden feature dimensions of GNNs and distilled knowledge over [64, 128, 256, 512, 1024].

For each central node $v_{i}$ within a specified neighbor order $l$, our experimental setup involved generating $\theta$ unique instruction prompts. This led to the extraction of $\theta$ $l$-hop features $\{\mathbf{h}_{l,(1)}^{L},\dots,\mathbf{h}_{l,(\theta)}^{L}\}$ from the LinguGraph LLM for $v_{i}$. During the layer-adaptive knowledge distillation phase, we utilized an average pooling operation to consolidate these features into a single representation $\mathbf{h}_{l}^{L}$ that accurately reflects the $l$-th order neighborhood's characteristics. The overall experimental settings are summarized in Table 2.
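For clarity, this consolidation amounts to a simple average pooling over the $\theta$ per-prompt teacher features of each hop; the sketch below assumes they are stacked into one tensor.

```python
import torch

def consolidate_hop_features(hop_feats):
    """Average-pool the theta per-prompt LLM features of one hop into h_l^L.

    hop_feats: (theta, d_L) features extracted by the LinguGraph LLM for the
               theta instruction prompts of the l-th neighbor order of node v_i.
    """
    return hop_feats.mean(dim=0)   # (d_L,)
```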

Table 3. Node classification accuracy and F1 score (%) of baselines, teacher LLMs, and distilled student GNNs on Cora, PubMed, and Arxiv.

| Methods | Cora Acc. | Cora F1 | PubMed Acc. | PubMed F1 | Methods | Arxiv Acc. | Arxiv F1 |
| GCN (Kipf and Welling, 2017) | 86.53±0.92 | 85.66±0.78 | 86.12±0.93 | 85.64±0.82 | GCN (Kipf and Welling, 2017) | 71.74±0.21 | 71.04±0.37 |
| GAT (Veličković etal., 2018) | 86.12±0.95 | 85.05±0.88 | 85.49±0.76 | 84.89±0.71 | GAT (Veličković etal., 2018) | 73.66±0.33 | 72.44±0.19 |
| GraphSAGE (Hamilton etal., 2017) | 87.08±0.85 | 85.96±0.73 | 87.69±0.92 | 87.38±0.68 | GraphSAGE (Hamilton etal., 2017) | 71.19±0.26 | 70.87±0.45 |
| GIN (Xu etal., 2018) | 86.60±0.91 | 85.37±0.74 | 85.84±0.92 | 85.31±0.63 | GIN (Xu etal., 2018) | 71.62±0.47 | 71.13±0.33 |
| SGC-v2 (Wu etal., 2019) | 85.48±1.48 | 85.04±0.69 | 85.36±0.52 | 84.96±0.77 | DeeperGCN (Li etal., 2020) | 71.92±0.16 | 71.24±0.39 |
| BernNet (He etal., 2021) | 88.52±0.95 | 87.96±0.85 | 88.48±0.41 | 87.52±0.79 | GTAN (Wu and Wang, 2022) | 72.97±0.17 | 71.77±0.22 |
| FAGCN (Bo etal., 2021) | 88.85±1.36 | 87.92±0.65 | 89.98±0.54 | 88.72±0.53 | UniMP (Shi etal., 2020) | 73.11±0.20 | 72.14±0.38 |
| GCNII (Chen etal., 2020) | 88.93±1.37 | 87.58±0.71 | 89.80±0.30 | 88.96±0.62 | GCNII (Chen etal., 2020) | 72.74±0.00 | 72.22±0.44 |
| RevGAT (Li etal., 2021) | 89.11±0.00 | 87.65±0.58 | 88.50±0.05 | 87.12±0.73 | RevGAT (Li etal., 2021) | 74.02±0.18 | 73.56±0.29 |
| ACM-GCN+ (Luan etal., 2022) | 89.75±1.16 | 88.94±0.54 | 90.96±0.62 | 89.77±0.51 | E2EG (Dinh etal., 2023) | 73.62±0.14 | 72.96±0.26 |
| GraphTransformer (Dwivedi and Bresson, 2020) | 86.42±0.82 | 85.96±0.67 | 88.75±0.16 | 87.91±0.59 | SGFormer (Wu etal., 2024) | 72.63±0.13 | 71.58±0.42 |
| Graphormer (Ying etal., 2021) | 80.41±0.30 | 79.98±0.56 | 88.24±1.50 | 87.52±0.71 | Graphormer (Ying etal., 2021) | 72.81±0.23 | 72.14±0.39 |
| LinguGraph-Mistral (7B) | 87.82±0.88 | 87.47±0.72 | 93.71±0.69 | 93.37±0.52 | LinguGraph-Mistral (7B) | 76.07±0.53 | 76.02±0.44 |
| LinguGraph-Llama2 (7B) | 88.19±0.83 | 88.12±0.73 | 94.09±0.78 | 93.55±0.61 | LinguGraph-Llama2 (7B) | 75.67±0.52 | 75.60±0.41 |
| LinguGraph-Llama3 (8B) | 91.51±0.46 | 91.53±0.18 | 95.59±0.29 | 95.55±0.10 | LinguGraph-Llama3 (8B) | 79.73±0.18 | 79.29±0.56 |
| GCN(Mistral) | 90.04±0.64 | 89.62±0.58 | 88.92±0.71 | 88.47±0.69 | GCN(Mistral) | 73.55±0.49 | 73.27±0.35 |
| GCN(Llama2) | 90.59±0.71 | 89.62±0.66 | 88.97±0.82 | 88.56±0.71 | GCN(Llama2) | 73.87±0.22 | 73.87±0.61 |
| GCN(Llama3) | 90.77±0.28 | 90.35±0.37 | 89.76±0.44 | 89.46±0.37 | GCN(Llama3) | 74.68±0.45 | 74.29±0.32 |
| GAT(Mistral) | 89.85±0.62 | 89.19±0.52 | 88.08±0.62 | 87.53±0.47 | GAT(Mistral) | 74.72±0.47 | 74.55±0.42 |
| GAT(Llama2) | 90.33±0.67 | 89.72±0.59 | 87.93±0.28 | 87.42±0.36 | GAT(Llama2) | 74.92±0.14 | 74.48±0.28 |
| GAT(Llama3) | 91.51±0.35 | 91.45±0.58 | 88.31±0.76 | 87.93±0.65 | GAT(Llama3) | 75.71±0.41 | 75.06±0.36 |
| GraphSAGE(Mistral) | 90.59±0.82 | 89.85±0.75 | 90.11±0.69 | 89.77±0.54 | GraphSAGE(Mistral) | 72.85±0.42 | 72.87±0.41 |
| GraphSAGE(Llama2) | 90.22±0.77 | 89.89±0.19 | 89.96±0.50 | 89.67±0.34 | GraphSAGE(Llama2) | 72.53±0.61 | 72.42±0.49 |
| GraphSAGE(Llama3) | 91.70±0.51 | 91.08±0.62 | 90.14±0.56 | 89.96±0.48 | GraphSAGE(Llama3) | 75.38±0.38 | 75.22±0.32 |
| GIN(Mistral) | 89.67±0.71 | 88.64±0.54 | 87.83±0.62 | 87.27±0.56 | GIN(Mistral) | 73.40±0.42 | 73.63±0.34 |
| GIN(Llama2) | 90.26±0.67 | 89.20±0.48 | 87.73±0.29 | 87.20±0.30 | GIN(Llama2) | 73.71±0.25 | 73.42±0.10 |
| GIN(Llama3) | 91.33±0.28 | 91.05±0.53 | 89.22±0.79 | 88.87±0.61 | GIN(Llama3) | 75.64±0.46 | 75.28±0.39 |
| Avg. Dist. Gains | 4.61% | 5.22% | 2.79% | 2.84% | Avg. Dist. Gains | 3.85% | 4.22% |

4.3. Experimental Results and Analyses

4.3.1. Comparison Performance Analyses

To validate the effectiveness of our proposed LinguGKD, we report the accuracy and F1 score of node classification on different datasets to evaluate the performance of various graph learning models. We selected a series of representative single-model graph learning approaches as baselines from the corresponding leaderboards (https://paperswithcode.com/sota/node-classification-on-cora-60-20-20-random, https://paperswithcode.com/sota/node-classification-on-pubmed-60-20-20-random, and https://ogb.stanford.edu/docs/leader_nodeprop/), ranging from simple to advanced architecture designs. These include message-passing-based GNNs (such as BernNet (He etal., 2021) and RevGAT (Li etal., 2021)) and more complex graph Transformer models (such as Graphormer (Ying etal., 2021) and E2EG (Dinh etal., 2023)). Table 3 shows the node classification results.

Effectiveness Analysis of LinguGraph LLMs Compared to Baselines

The experimental results demonstrate that, after graph instruction tuning, the LinguGraph LLMs exhibit strong semantic and entity-relationship understanding and generalize well across different datasets. Compared to baseline GNNs, the tuned LinguGraph LLMs achieve state-of-the-art results on all datasets (e.g., LinguGraph-Llama3 (8B) achieved 91.51%, 95.59%, and 79.73% accuracy on the Cora, PubMed, and Arxiv datasets, respectively). Additionally, the performance of the LinguGraph LLM improves with a larger pre-training corpus and more model parameters. For instance, Llama3-8B, which has more parameters and a larger pre-training corpus than Llama2-7B and Mistral-7B, consistently performs well across multiple datasets. These results strongly support the assertion in (Ye etal., 2023) that LLMs have the potential to become the next-generation foundation model for graph learning.

In graph knowledge distillation frameworks, student models often experience performance degradation compared to teacher models (Joshi etal., 2024; Chen etal., 2022; Samy etal., 2023), especially without additional training data. Therefore, selecting an exceptional teacher model is crucial for enhancing the performance of student GNNs. The experimental results above motivate our choice of LLM as the teacher model for graph knowledge distillation, ensuring more effective student GNNs.

Performance Gains Analysis of Distilled GNNs Over Baselines

From the results, we observe that the proposed LinguGKD framework can be seamlessly applied to different combinations of teacher LLMs and student GNNs, significantly improving the performance of various student GNNs by distilling knowledge from LLMs without requiring additional training data or modifications to the GNN architecture. The average distillation gains of the student GNNs range from 2.79% on PubMed to 4.61% on Cora. For instance, on the Cora dataset, the accuracy of the GCN model distilled with LinguGraph-Llama3 (8B) increased to 90.77%, compared to 86.53% for the vanilla GCN model.

Moreover, compared to other advanced GNNs and graph Transformer models, the LinguGKD framework enables basic GNN models to achieve competitive performance with these more complex models through knowledge distillation, and even outperform them in certain scenarios. For instance, the GAT model distilled by LinguGraph-Llama3 (8B) on the Cora dataset achieved 91.51% accuracy, surpassing the performance of more complex models such as RevGAT (89.11%) and GraphTransformer (86.42%). This demonstrates that the LinguGKD framework can significantly enhance the performance of basic GNNs without increasing model complexity, making them more efficient and practical for real-world applications.

This notable performance improvement indicates that the LinguGKD framework can effectively transfer the deep semantic knowledge and complex graph structural understanding from the teacher LLM to the student GNN, thus improving its performance in node classification tasks.Furthermore, student GNNs distilled from higher-performing teacher LLMs also achieve higher accuracy and F1 scores in their respective tasks, which indicates significant potential for leveraging the rapid advancements in pre-trained LLMs to continuously enhance the performance of student GNNs, thereby improving the efficiency of applications in production environments.

Analysis of Knowledge Distillation Effectiveness Across Different Datasets

The effectiveness of knowledge distillation varies across datasets. On the Cora dataset, the improvement in student GNNs was substantial, with some models even surpassing the teacher LLM; for example, GraphSAGE(Llama3) achieved an accuracy of 91.70%, exceeding its teacher LinguGraph-Llama3 (8B) at 91.51%. In contrast, the improvement on the PubMed dataset was relatively modest. A likely reason is that the LLM's predictions for node classification are primarily derived from the node's textual attributes, so it performs well even in a structure-free context, with graph structural information contributing less significantly. The student GNNs, by contrast, were trained on the pre-extracted node semantic embeddings provided by the datasets, which lose significant information compared to the raw text consumed by the LLM, leading the GNNs to rely more on graph structural information for classification.

Specifically, in the Cora dataset, the overlapping relevance between class labels in the paper titles and abstracts results in nodes that can reasonably belong to multiple categories, making it easier for the teacher LLMs to confuse these categories.The distilled GNNs, on the other hand, not only retained the advantage of understanding graph structure but also gained the semantic understanding capability of the teacher LLM through the layer-adaptive feature alignment, leading to noticeable performance improvements.In contrast, the PubMed dataset has fewer categories, which are well-reflected in the node’s textual attributes, allowing the LLM to achieve excellent performance. However, the student GNNs, constrained by the small initial feature dimensions provided by the dataset and significant loss of semantic information, faced limitations in aligning their hierarchical features with those of the LLMs due to extensive cross-category citations, resulting in less pronounced improvements post-knowledge distillation.

In summary, our proposed LinguGKD framework excels in knowledge distillation, significantly enhancing the performance of student GNNs. The teacher LLM, after graph instruction tuning, demonstrates strong semantic and entity relationship understanding capabilities and generalizes well to different graph datasets, providing robust support for student GNNs. The experimental results also indicate that basic GNN models can achieve performance comparable to, or even surpass, complex graph learning models after distillation through LinguGKD. With the continuous development and performance improvement of pre-trained LLMs, the LinguGKD framework is poised to further enhance the performance of GNNs, thereby advancing the efficiency of applications in production environments.


4.3.2. Convergence Efficiency of Vanilla GNNs vs. Distilled GNNs

Figure 2 illustrates that GNNs optimized through knowledge distillation not only achieve higher classification accuracy but also exhibit faster convergence rates. Here, we select LinguGraph-Llama2 as the teacher. The distilled GNNs, distinguished by the subscript (L), quickly reach high accuracy early in the training process, significantly outperforming their undistilled counterparts.

This enhanced convergence is primarily attributed to our joint optimization of knowledge distillation and downstream tasks in the training procedure, in which GNNs are trained to fit the node feature distributions learned by the teacher LLMs, facilitating rapid convergence. Teacher LLMs deliver high-quality node feature representations due to their robust semantic understanding and contextual modeling capabilities. These representations encapsulate complex semantic relationships and extensive contextual information among nodes. When student GNNs learn these feature distributions during the distillation process via the proposed LinguGKD framework, they can significantly reduce the training iterations required to achieve stable high accuracy, resulting in accelerated convergence.

In summary, knowledge distillation not only enhances the classification performance of GNNs but also significantly accelerates model convergence. This advantage renders distilled GNNs more efficient for practical applications, enabling them to complete training in a shorter time while maintaining excellent performance.


4.3.3. Application Trade-off of LinguGraph LLM vs. Distilled GNN

Figure 3 shows the differences between teacher LLMs and student GNNs in terms of model parameters, storage requirements, and inference latency. The right y-axis delineates the parameter quantity and storage requirements of various models through a line graph, while the left y-axis showcases the inference latency for these models across different datasets via a bar graph.

From the figure, we can observe that teacher LLMs have substantially higher parameter counts and storage needs than student GNNs. Specifically, Llama2 and Mistral have 6.74B and 7.24B parameters, respectively, while student GNNs have only a few million. In terms of storage, teacher LLMs require over 25GB, whereas student GNNs require just 0.03GB to 0.04GB. Additionally, the inference time of teacher LLMs exceeds 0.5 seconds on the Cora, PubMed, and Arxiv datasets, whereas the inference time of student GNNs is much smaller.

In practical applications, the choice between using an LLM and a distilled GNN involves several trade-offs. If the application demands the highest possible performance and accuracy, and there are ample computational and storage resources available, deploying a teacher LLM might be preferable. Furthermore, if the application involves tasks that combine natural language processing and graph-based tasks, using a teacher LLM may be essential due to its advanced capabilities in both natural language and graph understanding. However, if the application requires real-time performance or operates under limited resources, a distilled GNN is a more suitable choice. Distilled GNNs not only offer advantages in inference speed and resource consumption but also inherit some capabilities of the teacher LLM through knowledge distillation, striking a good balance between performance and efficiency.

4.4. Ablation Study

Table 4. Node classification accuracy (%) of pre-trained LLMs before and after graph instruction tuning.

| Methods | Cora | PubMed | Arxiv |
| Mistral (7B) | 8.76 | 24.91 | 10.28 |
| LinguGraph-Mistral (7B) | 87.82 | 93.71 | 76.07 |
| Llama2 (7B) | 9.23 | 25.04 | 10.72 |
| LinguGraph-Llama2 (7B) | 88.19 | 94.09 | 75.67 |
| Llama3 (8B) | 14.92 | 30.17 | 15.31 |
| LinguGraph-Llama3 (8B) | 91.51 | 95.59 | 79.73 |
Necessity Analysis for Graph Instruction Tuning of PLM

Table 4 illustrates the accuracy of node classification on different datasets for various LinguGraph LLMs and their pre-trained versions. The results indicate a substantial improvement in performance across all datasets following the tuning process.

Before tuning, the pre-trained models exhibited low accuracy: around 8-15% on Cora, 25-30% on PubMed, and 10-15% on Arxiv. Since LLM pre-training typically does not include corpora related to graph understanding, these models have only a basic grasp of graphs and are prone to issues such as repetition (Fu etal., 2021). After graph instruction tuning, however, their accuracies increased dramatically, with LinguGraph-Mistral (7B) and LinguGraph-Llama2 (7B) each averaging around 86%, and LinguGraph-Llama3 (8B) reaching up to 89% across these datasets.

These results underscore the effectiveness of graph instruction tuning with specifically designed graph instruction prompts in improving the LLMs’ ability to understand and classify nodes within graph structures, validating the efficacy of this approach in constructing powerful teacher LLMs for knowledge distillation.

Effectiveness Analysis of Layer-Adaptive Contrastive Distillation

To validate the effectiveness of the proposed layer-adaptive contrastive distillation, we conducted a set of experiments comparing the impact of using versus not using the layer-adaptive distillation strategy within our LinguGKD framework. We constructed a variant of LinguGKD, named LinguGKD-, which aligns only the last-order node features extracted by the LinguGraph LLM with the features output by the last message-passing layer of the GNN in the distillation space. We used LinguGraph-Llama2 (7B) as the teacher model and conducted comparative experiments on the Cora and PubMed datasets. The node classification performance of GNNs distilled using the full LinguGKD framework and the LinguGKD- variant across different hops is shown in Figure 4, where GNNs marked with the - superscript were distilled with LinguGKD-.

From the results, we observe that GNNs distilled with the full LinguGKD framework consistently achieve higher performance on both the Cora and PubMed datasets compared to those distilled with the LinguGKD- variant. This performance improvement indicates that the layer-adaptive contrastive knowledge distillation strategy effectively enhances the model’s ability to leverage multi-hop information and capture nuanced features necessary for accurate node classification. Theoretically, this approach allows the student GNN to progressively align with the hierarchical representations captured by the teacher LLM at different layers, facilitating the transfer of complex, multi-layered knowledge. This alignment helps the student GNN better mimic the teacher LLM’s deep semantic understanding and contextual modeling capabilities, resulting in more robust and accurate node representations. Consequently, student GNNs distilled with the layer-adaptive strategy demonstrate superior performance, validating the robustness and generalizability of our approach.

Impact of Layer-Adaptive Distillation and Loss Weights on Model Convergence

To further elucidate the contributions of layer-adaptive contrastive knowledge distillation and the interplay between distillation loss and downstream task loss to model convergence during the joint optimization process, we present the layer-adaptive factors $\gamma_{l}$ and the classification-distillation loss weights $\alpha$ and $\beta$ on the Cora and PubMed datasets in Figure 5. The heatmaps show that the distillation loss weights ($\beta$) are consistently high across both datasets, indicating the significant role of distillation in accelerating GNN convergence. This trend underscores the importance of contrastive distillation in the training process, as GNNs distilled with higher $\beta$ values converge faster and perform better.

The distribution of layer-adaptive factors $\gamma_{l}$ varies across different datasets. In Cora, there is a distinct emphasis on first-order neighbor knowledge during distillation, reflected by higher $\gamma_{1}$ values. This suggests that for Cora, immediate neighborhood information is crucial, likely because the textual attributes and node labels have lower semantic relevance, making structural information vital for effective classification. Conversely, the PubMed dataset emphasizes structure-free features. This suggests that semantic features are more significant in PubMed, where the higher semantic relevance between node textual attributes and labels makes the LLM's semantic understanding crucial. This analysis highlights the critical role of the proposed layer-adaptive contrastive distillation strategy in leveraging the unique structural and semantic characteristics of each dataset.

4.5. Impact of Hyperparameters on Performance

Based on the Cora dataset, we investigated how varying hyperparameters, specifically the neighbor order ($k$) and the hidden feature dimension ($d_{G}$) of the GNNs, affect model performance in node classification on TAGs. The hidden feature dimensions of the teacher LLMs were kept constant at 4096, as per the original model design.

Figures 6(a) and 6(b) reveal that teacher LLMs consistently outperformed vanilla GNNs regardless of neighbor hops, highlighting their superior semantic processing ability. Notably, even with no neighbor information (0-hop), LLMs showed a significant edge over GNNs. Under the LinguGKD framework, distilled GNNs surpassed original GNNs at all neighbor hops, demonstrating the effective transfer of multi-hop knowledge from LLMs to GNNs. However, while LLMs benefited from increasing neighbor orders, GNNs experienced performance declines past the 2-hop mark due to over-smoothing. Distilled GNNs alleviated this issue, but higher orders required longer fine-tuning of the teacher LLMs, leading us to choose a 2-hop setting as a balance of efficiency and effectiveness.

Figures 6(c) and 6(d) provide insights into the impact of hidden feature dimensions on model performance. Vanilla GNNs improved up to a 128-dimension limit, beyond which their performance plateaued. In contrast, distilled GNNs continued to show enhanced performance with higher dimensions, owing to their improved capacity for semantic and structural understanding from LLMs. Consequently, we set the hidden feature dimension at 1024 in our experiments to maximize the benefits from the teacher models.

5. Related Work

5.1. LLMs based Graph Learning

Recent advancements in graph learning have been significantly enriched by the integration of LLMs, marking a notable evolution in the field. Research in this domain can be primarily categorized into two distinct approaches: LLM as Enhancer (LaE) (He etal., 2024; Chen etal., 2024; Wei etal., 2024) and LLM as Predictor (LaP) (Wang etal., 2024; Fatemi etal., 2023; Ye etal., 2023), distinguished by their degree of integration with graph-structured data.

The LaE approach enhances node embedding quality in GNNs by leveraging the semantic processing capabilities of LLMs, addressing traditional GNNs’ limitations in extracting semantic features from TAGs. For instance, TAPE (He etal., 2024) generates interpretive explanations and pseudo-labels, enriching the graph’s textual attributes and fine-tuning a smaller-scale language model to transform textual semantics into robust node embeddings. Similarly, Chen et al. (Chen etal., 2024) propose the Knowledge Entity Augmentation (KEA) strategy, employing LLMs to generate knowledge entities with textual descriptions, enhancing graph nodes with nuanced semantic information. Other notable methods, such as Qian et al. (Qian etal., 2023), produce semantically-rich interpretations of strings for fine-tuning a compact language model, showing potential in fields like pharmaceutical discovery. Wei et al. (Wei etal., 2024) enhance user-item interaction edges in recommendation systems, creating a richer edge dataset and improving system precision.Several studies have also explored the direct application of LLMs in producing text-based node embeddings for GNNs. The GIANT method (Chien etal., 2022) refines language models through a self-supervised learning framework, utilizing XR-Transformers (Zhang etal., 2021) to address multi-label classification challenges in link prediction. Similarly, Duan et al. (Duan etal., 2023) and Zhu et al. (Zhu etal., 2023) enhance PLMs using link prediction analogues to improve structural awareness. Huang et al. (Huang etal., 2023) integrate a graph-tailored adapter at the terminus of PLMs to extract graph-aware node features, generating interpretable node representations. Tan et al. (Tan etal., 2023) introduce an unsupervised technique for universal graph representation learning, converting random walks on graphs into text sequences for fine-tuning LLMs.

The LaP methodologies utilize LLMs directly for prediction tasks in graph contexts, including classification and inference.Recent research leverages LLMs pre-trained on large-scale corpora for encoding graph structures in natural language, enabling direct inference. Studies by Wang et al. (Wang etal., 2024) and Fatemi et al. (Fatemi etal., 2023) explore LLMs’ ability to process textual descriptions of graphs, highlighting their potential and limitations. Ye et al. (Ye etal., 2023) propose scalable prompting techniques, creating direct relational links between nodes through natural language, outperforming traditional GNNs in node classification tasks across benchmarks.

Collectively, these advancements underscore the potential of LLMs in graph learning, skillfully deciphering both semantic content and complex graph structures, and introducing innovative methodologies to the field.

5.2. Graph Knowledge Distillation

In the realm of graph knowledge distillation, the strategic transfer of intricate knowledge from complex teacher models to simpler student models is crucial for enhancing GNNs’ effectiveness and efficiency. This technique maintains the student model’s lightweight nature while striving to emulate the teacher model’s advanced behavior. Predominantly, research in this field is categorized into three areas based on the type of knowledge distilled: output logits (He and Ma, 2022; Ahluwalia etal., 2023; Wu etal., 2022), latent features (Chen etal., 2022; Samy etal., 2023; Joshi etal., 2024), and graph structure (Wu etal., 2023; Deng and Zhang, 2021; Yang etal., 2024).

Research focusing on output logits in graph knowledge distillation aims at transferring the final output representations from the teacher model to the student model. He et al. (He and Ma, 2022) proposed the Scalable and Effective Knowledge Distillation Framework (SGKD) for graph representation learning, which includes feature propagation to provide MLPs with graph structure-aware features. Ahluwalia et al. (Ahluwalia etal., 2023) introduced Attention-Based Knowledge Distillation (ABKD) to compress large GNNs into smaller ones while maintaining accuracy. Wu et al. (Wu etal., 2022) developed an approach focused on model extraction attacks on GNNs, demonstrating effective duplication of models with high input-output correlation.

Latent feature distillation involves transferring intermediate representations from the teacher to the student model. Chen et al. (Chen etal., 2022) proposed a structure-aware MLP student and a structure-mixing distillation strategy to distill knowledge from GNNs into MLPs. Samy et al. (Samy etal., 2023) introduced Graph2Feat that enables inductive link prediction in graph learning through knowledge distillation, showing superior performance in terms of AUC and average precision. Joshi et al. (Joshi etal., 2024) introduced graph contrastive representation distillation (G-CRD), aligning student node embeddings with those of the teacher in a shared representation space to preserve global topology.

Graph structure knowledge distillation focuses on transferring structural information from the teacher to the student model. Wu et al.’s Prototype-Guided Knowledge Distillation (PGKD) (Wu etal., 2023) method distills graph structural information from GNNs to MLPs without requiring graph edges. The graph-free knowledge distillation (GFKD) approach by Deng et al. (Deng and Zhang, 2021) models graph topology structures for knowledge transfer without using graph data. Yang et al. (Yang etal., 2024) proposed VQGraph, a framework that transfers structural knowledge from GNNs to MLPs, achieving state-of-the-art performance in GNN-MLP distillation.

All these exceptional works offer valuable references and a robust theoretical foundation for the proposed LinguGKD framework in this paper.

6. Conclusion

In this study, we propose a novel LLM-to-GNN knowledge distillation framework termed LinguGKD, which integrates the semantic understanding capabilities of LLMs with the efficiency and structural insights of GNNs. LinguGKD employs TAG-oriented instruction tuning to train pre-trained LLMs as teacher models and introduces a layer-adaptive contrastive distillation strategy to align and transfer node features between teacher LLMs and student GNNs within a latent space. Extensive experiments across various LLM and GNN architectures on multiple datasets demonstrate that LinguGKD significantly enhances the predictive accuracy and convergence rate of GNNs without requiring additional training data or model parameters, making the distilled GNNs highly practical for deployment in resource-constrained environments. Moreover, LinguGKD shows great potential for leveraging advancements in LLM research to continuously augment GNN performance.

References

  • Ahluwalia etal. (2023)Anshul Ahluwalia, Rohit Das, Payman Behnam, Alind Khare, Pan Li, and Alexey Tumanov. 2023.ABKD: Graph Neural Network Compression With Attention-Based Knowledge Distillation.arXiv Preprint arXiv:2310.15938 (2023).
  • AI@Meta (2024)AI@Meta. 2024.Llama 3 Model Card.(2024).https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
  • Bo etal. (2021)Deyu Bo, Xiao Wang, Chuan Shi, and Huawei Shen. 2021.Beyond Low-Frequency Information in Graph Convolutional Networks. In AAAI Conference on Artificial Intelligence (AAAI), Vol.35. 3950–3957.
  • Chen etal. (2022)Jie Chen, Shouzhen Chen, Mingyuan Bai, Junbin Gao, Junping Zhang, and Jian Pu. 2022.SA-MLP: Distilling Graph Knowledge From GNNs Into Structure-Aware MLP.arXiv Preprint arXiv:2210.09609 (2022).
  • Chen etal. (2020)Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. 2020.Simple and Deep Graph Convolutional Networks. In International Conference on Machine Learning (ICML). 1725–1735.
  • Chen etal. (2024)Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, and Hui etal. Liu. 2024.Exploring The Potential Of Large Language Models (LLMs) In Learning On Graphs.ACM SIGKDD Explorations Newsletter 25, 2 (2024), 42–61.
  • Chien etal. (2022)Eli Chien, WeiCheng Chang, ChoJui Hsieh, HsiangFu Yu, Jiong Zhang, Olgica Milenkovic, and InderjitS Dhillon. 2022.Node Feature Extraction By Self-Supervised Multi-Scale Neighborhood Prediction. In International Conference on Learning Representations (ICLR).
  • Deng and Zhang (2021)Xiang Deng and Zhongfei Zhang. 2021.Graph-Free Knowledge Distillation for Graph Neural Networks. In International Joint Conference on Artificial Intelligence (IJCAI). 2318–2324.
  • Dinh etal. (2023)TuAnh Dinh, Jeroen den Boef, Joran Cornelisse, and Paul Groth. 2023.E2EG: End-to-End Node Classification Using Graph Topology and Text-Based Node Attributes. In IEEE International Conference on Data Mining Workshops (ICDMW). 1084–1091.
  • Duan etal. (2023)Keyu Duan, Qian Liu, Tat-Seng Chua, Shuicheng Yan, WeiTsang Ooi, Qizhe Xie, and Junxian He. 2023.Simteg: A Frustratingly Simple Approach Improves Textual Graph Learning.arXiv Preprint arXiv (2023).
  • Dwivedi and Bresson (2020)VijayPrakash Dwivedi and Xavier Bresson. 2020.A Generalization of Transformer Networks to Graphs.arXiv preprint arXiv:2012.09699 (2020).
  • Fatemi etal. (2023)Bahare Fatemi, Jonathan Halcrow, and Bryan Perozzi. 2023.Talk Like A Graph: Encoding Graphs For Large Language Models. In International Conference on Learning Representations (ICLR).
  • Fu etal. (2021)Zihao Fu, Wai Lam, Anthony Man-Cho So, and Bei Shi. 2021.A Theoretical Analysis of the Repetition Problem in Text Generation. In AAAI Conference on Artificial Intelligence (AAAI), Vol.35. 12848–12856.
  • Hamilton etal. (2017)Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017.Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems (NeurIPS), Vol.30.
  • He etal. (2021)Mingguo He, Zhewei Wei, Hongteng Xu, and Others. 2021.Bernnet: Learning Arbitrary Graph Spectral Filters via Bernstein Approximation. In Advances in Neural Information Processing Systems (NeurIPS), Vol.34. 14239–14251.
  • He etal. (2024)Xiaoxin He, Xavier Bresson, Thomas Laurent, Perold, and Bryan Hooi. 2024.Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning. In International Conference on Learning Representations (ICLR).
  • He and Ma (2022)Yufei He and Yao Ma. 2022.SGKD: A Scalable and Effective Knowledge Distillation Framework for Graph Representation Learning. In IEEE International Conference on Data Mining Workshops (ICDMW). 666–673.
  • Honovich etal. (2023)Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2023.Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor. In Annual Meeting of the Association for Computational Linguistics (ACL). 14409–14428.
  • Hu etal. (2020)Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020.Open Graph Benchmark: Datasets for Machine Learning on Graphs. In Advances in Neural Information Processing Systems (NeurIPS), Vol.33. 22118–22133.
  • Huang etal. (2019)Ling Huang, Chang-Dong Wang, and Hong-Yang Chao. 2019.Higher-Order Multi-Layer Community Detection. In The AAAI Conference on Artificial Intelligence (AAAI), Vol.33. 9945–9946.
  • Huang etal. (2023)Xuanwen Huang, Kaiqiao Han, Dezheng Bao, Quanjin Tao, Zhisheng Zhang, Yang Yang, and Qi Zhu. 2023.Prompt-Based Node Feature Extractor for Few-Shot Learning on Text-Attributed Graphs.arXiv Preprint arXiv:2309.02848 (2023).
  • Jiang etal. (2023)AlbertQ Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, DevendraSingh Chaplot, Diego delas Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, and Lucile etal. Saulnier. 2023.Mistral 7B.arXiv Preprint arXiv:2310.06825 (2023).
  • Joshi etal. (2024)ChaitanyaK. Joshi, Fayao Liu, Xu Xun, Jie Lin, and ChuanSheng Foo. 2024.On Representation Knowledge Distillation for Graph Neural Networks.IEEE Transactions on Neural Networks and Learning Systems 35, 4 (2024), 4656–4667.
  • Kipf and Welling (2017)ThomasN. Kipf and Max Welling. 2017.Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR).
  • Li etal. (2021)Guohao Li, Matthias Müller, Bernard Ghanem, and Vladlen Koltun. 2021.Training Graph Neural Networks with 1000 Layers. In International Conference on Machine Learning (ICML). 6437–6449.
  • Li etal. (2020)Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. 2020.DeeperGCN: All You Need to Train Deeper GCNs.arXiv preprint arXiv:2006.07739 (2020).
  • Li etal. (2023b)Jianxin Li, Hao Peng, Yuwei Cao, Yingtong Dou, Hekai Zhang, PhilipS. Yu, and Lifang He. 2023b.Higher-Order Attribute-Enhancing Heterogeneous Graph Neural Networks.IEEE Transactions on Knowledge and Data Engineering 35, 1 (2023), 560–574.
  • Li etal. (2023a)Yuhan Li, Zhixun Li, Peisong Wang, Jia Li, Xiangguo Sun, Hong Cheng, and JeffreyXu Yu. 2023a.A Survey of Graph Meets Large Language Model: Progress and Future Directions.arXiv Preprint arXiv:2311.12399 (2023).
  • Li etal. (2022)Yanying Li, Xiuling Wang, Yue Ning, and Hui Wang. 2022.Fairlp: Towards Fair Link Prediction on Social Network Graphs. In International AAAI Conference on Web and Social Media (ICWSM), Vol.16. 628–639.
  • Li etal. (2024)Yihao Li, Ru Zhang, Jianyi Liu, and Gongshen Liu. 2024.An Enhanced Prompt-Based LLM Reasoning Scheme via Knowledge Graph-Integrated Collaboration.arXiv Preprint arXiv:2402.04978 (2024).
  • Luan etal. (2022)Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, and Doina Precup. 2022.Revisiting Heterophily for Graph Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), Vol.35. 1362–1375.
  • Oord etal. (2018)Aaron vanden Oord, Yazhe Li, and Oriol Vinyals. 2018.Representation Learning with Contrastive Predictive Coding.arXiv preprint arXiv:1807.03748 (2018).
  • OpenAI (2023)OpenAI. 2023.Prompt Engineering.https://platform.openai.com/docs/guides/prompt-engineering.Accessed: 2023-12-20.
  • Ouyang etal. (2022)Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, PaulF. Christiano, Leike Jan, and Lowe Ryan. 2022.Training Language Models to Follow Instructions with Human Feedback. In Advances in Neural Information Processing Systems (NeurIPS), Vol.35. 27730–27744.
  • Pan etal. (2024)Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2024.Unifying Large Language Models and Knowledge Graphs: A Roadmap.IEEE Transactions on Knowledge and Data Engineering (2024).
  • Qian etal. (2023)Chen Qian, Huayi Tang, Zhirui Yang, Hong Liang, and Yong Liu. 2023.Can Large Language Models Empower Molecular Property Prediction?arXiv Preprint arXiv:2307.07443 (2023).
  • Samy etal. (2023)AhmedE. Samy, Zekarias T.Kefato, and Sarunas Girdzijauskas. 2023.Graph2Feat: Inductive Link Prediction via Knowledge Distillation. In ACM Web Conference (WWW). 805–812.
  • Shi etal. (2020)Yunsheng Shi, Zhengjie Huang, Shikun Feng, Hui Zhong, Wenjin Wang, and Yu Sun. 2020.Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification.arXiv preprint arXiv:2009.03509 (2020).
  • Si etal. (2023)Qingyi Si, Tong Wang, Zheng Lin, Xu Zhang, Yanan Cao, and Weiping Wang. 2023.An Empirical Study of Instruction-Tuning Large Language Models in Chinese. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Tan etal. (2023)Yanchao Tan, Zihao Zhou, Hang Lv, Weiming Liu, and Carl Yang. 2023.WalkLM: A Uniform Language Model Fine-Tuning Framework for Attributed Graph Embedding. In Conference on Neural Information Processing Systems (NeurIPS).
  • Touvron etal. (2023)Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, and Faisal Azhar. 2023.Llama: Open And Efficient Foundation Language Models.arXiv Preprint arXiv:2302.13971 (2023).
  • Vaswani etal. (2017)Ashish Vaswani, Noam Shazeer, Niki Parmar, Uszkoreit, AidanN Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017.Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS), Vol.30.
  • Veličković etal. (2018)Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018.Graph Attention Networks. In International Conference on Learning Representations (ICLR).
  • Wang etal. (2024)Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan Tan, Xiaochuang Han, and Yulia Tsvetkov. 2024.Can Language Models Solve Graph Problems in Natural Language?. In Advances in Neural Information Processing Systems (NeurIPS), Vol.36.
  • Wei etal. (2024)Wei Wei, Xubin Ren, Jiabin Tang, Qinyong Wang, Lixin Su, Suqi Cheng, Junfeng Wang, Dawei Yin, and Chao Huang. 2024.Llmrec: Large Language Models with Graph Augmentation for Recommendation. In ACM International Conference on Web Search and Data Mining (WSDM). 806–815.
  • Wu etal. (2022)Bang Wu, Xiangwen Yang, Shirui Pan, and Xingliang Yuan. 2022.Model Extraction Attacks on Graph Neural Networks: Taxonomy and Realisation. In Asia Conference on Computer and Communications Security (ASIA-CCS). 337–350.
  • Wu etal. (2019)Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019.Simplifying Graph Convolutional Networks. In International Conference on Machine Learning (ICML). 6861–6871.
  • Wu and Wang (2022)Nan Wu and Chaofan Wang. 2022.Gtnet: A Tree-based Deep Graph Learning Architecture.arXiv preprint arXiv:2204.12802 (2022).
  • Wu etal. (2024)Qitian Wu, Wentao Zhao, Chenxiao Yang, Hengrui Zhang, Fan Nie, Haitian Jiang, Yatao Bian, and Junchi Yan. 2024.Simplifying and Empowering Transformers for Large-Graph Representations. In Advances in Neural Information Processing Systems (NeurIPS), Vol.36.
  • Wu etal. (2023)Taiqiang Wu, Zhe Zhao, Jiahao Wang, Xingyu Bai, Lei Wang, Ngai Wong, and Yujiu Yang. 2023.Edge-Free But Structure-Aware: Prototype-Guided Knowledge Distillation From GNNs To MLPs.arXiv Preprint arXiv:2303.13763 (2023).
  • Xiao etal. (2023)Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023.C-Pack: Packaged Resources to Advance General Chinese Embedding.arXiv Preprint arXiv:2309.07597 (2023).
  • Xu etal. (2018)Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018.How Powerful Are Graph Neural Networks?arXiv Preprint arXiv:1810.00826 (2018).
  • Yang etal. (2024)Ling Yang, Ye Tian, Minkai Xu, Zhongyi Liu, Shenda Hong, Wei Qu, Wentao Zhang, Bin Cui, Muhan Zhang, and Jure Leskovec. 2024.VQGraph: Rethinking Graph Representation Space for Bridging GNNs and MLPs. In International Conference on Learning Representations (ICLR).
  • Yang and Shi (2024)Renchi Yang and Jieming Shi. 2024.Efficient High-Quality Clustering for Large Bipartite Graphs.Proceedings of the ACM on Management of Data 2, 1 (2024), 1–27.
  • Yang etal. (2016)Zhilin Yang, William Cohen, and Ruslan Salakhudinov. 2016.Revisiting Semi-Supervised Learning with Graph Embeddings. In International Conference on Machine Learning (ICML), Vol.48. 40–48.
  • Ye etal. (2023)Ruosong Ye, Caiqi Zhang, Runhui Wang, Shuyuan Xu, and Yongfeng Zhang. 2023.Natural Language Is All A Graph Needs.arXiv Preprint arXiv:2308.07134 (2023).
  • Ying etal. (2021)Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. 2021.Do Transformers Really Perform Badly for Graph Representation?. In Advances in Neural Information Processing Systems (NeurIPS), Vol.34. 28877–28888.
  • Zhang etal. (2021)Jiong Zhang, Wei-Cheng Chang, Hsiang-Fu Yu, and Inderjit Dhillon. 2021.Fast Multi-Resolution Transformer Fine-Tuning For Extreme Multi-Label Text Classification. In Advances In Neural Information Processing Systems (NeurIPS), Vol.34. 7267–7280.
  • Zhang etal. (2023)Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, etal. 2023.Instruction Tuning for Large Language Models: A Survey.arXiv preprint arXiv:2308.10792 (2023).
  • Zhang (2020)Xuanyu Zhang. 2020.Cfgnn: Cross Flow Graph Neural Networks for Question Answering on Complex Tables. In AAAI Conference on Artificial Intelligence (AAAI), Vol.34. 9596–9603.
  • Zhu etal. (2023)Jing Zhu, Xiang Song, VassilisN Ioannidis, Danai Koutra, and Christos Faloutsos. 2023.TouchUp-G: Improving Feature Representation Through Graph-Centric Finetuning.arXiv Preprint arXiv:2309.13885 (2023).