DLT Interoperability and More ⛓️#23 —Blockchain Large Language Models⛓️

Rafael Belchior
May 2


In this series, we analyze papers on blockchain and interoperability.

This edition covers a paper that uses large language models (LLMs) for vulnerability detection in blockchains.

➡️ Title: Blockchain Large Language Models
➡️ Authors: Yu Gai, Liyi Zhou, Kaihua Qin, Dawn Song, and Arthur Gervais

➡️ Paper source: https://arxiv.org/pdf/2304.12749.pdf

➡️ Background:

LLMs are statistical models that generate text one token at a time. Each next token is chosen from a probability distribution conditioned on all the previous text in that session. Tokens here are words or word fragments, and their likely arrangements are learned from a large training corpus spanning several languages. That is, ChatGPT received books, essays, articles, and so on as training input. A naive approach would be to compute a probability distribution over n-word arrangements (n-grams) directly from that text and use it to compose responses to queries. However, the number of possible n-grams grows exponentially with n, so there is no practical way to calculate the distribution for, say, 20-word sequences: there are far more possible sequences than any corpus could ever contain. Therefore, the trick is not to tabulate these probabilities explicitly but to “estimate the probabilities with which sequences should occur — even though we’ve never explicitly seen those sequences in the corpus of text we’ve looked at”.
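
To make the combinatorial growth concrete, here is a minimal Python sketch that counts how many distinct n-word sequences exist for a given vocabulary. The vocabulary size of 40,000 words is an assumption chosen only for illustration, and `possible_ngrams` is a hypothetical helper, not something taken from the paper.

```python
def possible_ngrams(vocab_size: int, n: int) -> int:
    """Number of distinct n-word sequences over a vocabulary of vocab_size words."""
    return vocab_size ** n


# Hypothetical vocabulary size, chosen only for illustration.
VOCAB_SIZE = 40_000

for n in (1, 2, 3, 20):
    count = possible_ngrams(VOCAB_SIZE, n)
    # Print the order of magnitude rather than the full integer.
    print(f"n={n:2d}: about 10^{len(str(count)) - 1} possible sequences")
```

Under this assumption, 20-word sequences already number on the order of 10^92, which is why their probabilities have to be estimated by a model rather than counted in a corpus.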

Stephen Wolfram wrote an excellent article on how ChatGPT, a very popular LLM, works behind the scenes.

Several solutions for monitoring blockchains (in particular, sets of smart contracts) have emerged in recent months, with great heterogeneity in methods, empirical evaluation, and industry deployment. The authors do a great job illustrating this in the related work, which they organize into (i) reward-based approaches, (ii) pattern-based techniques, and (iii) proof-based methods. Our work, Hephaestus, uses a security-by-specification approach (category ii) that defines heuristics to deduce blockchain transaction intents and user address behavior, which is useful for capturing attacks (see Section 7).

Source: https://www.techrxiv.org/articles/preprint/Hephaestus_Modelling_Analysis_and_Performance_Evaluation_of_Cross-Chain_Transactions/20718058

Onto the paper!

➡️ Motivation:

The motivation is simple and tackles a very important problem. Blockchain security best practices are still maturing, and, unfortunately, hackers still exploit DeFi protocols on a consistent basis.

From Hephaestus: “Looking at the facts, many of the largest decentralized finance hacks in blockchain history were performed in bridges [15], [16], in a grand total of more than $2B in damages [17], [18]. The facts show that the community still has a long way to implement secure bridges. The trend for attackers to exploit bridges will likely not disappear soon, as the more value bridges they hold, the more incentive criminals will have to attack those systems [19].”

Moreover, “In a cross-chain setting, automating the discovery of models and enabling its monitoring becomes very challenging, as there need to exist more tools to secure and monitor cross-chain applications. This is where our work fills the gap in current knowledge. In summary, we present the following contributions: • We propose Hephaestus, a system that creates models for fine-grain monitoring and auditing multiple blockchain use cases. Our system uses and extends a state-of-the-art BI solution, Hyperledger Cacti [23].”

This paper's motivation is similar to that of Hephaestus: to identify vulnerabilities in real time and avoid or mitigate attacks.

➡️ Contributions:

  • This paper presents a dynamic, real-time approach to detecting anomalous blockchain transactions, called BlockGPT.
Source: paper, https://arxiv.org/pdf/2304.12749.pdf

💪 Strong points:

  • The authors propose a tree-based encoding for representing transaction traces, which enables training the model with a dataset of these transactions.
  • The main contribution is a generalized way (no search space restrictions) to build an intrusion detection system (IDS). “This approach is orthogonal to other methods, and every method has its own limitations and strengths. The more orthogonal methods exist, the better and more secure the systems become.”

🤞 Suggestions for improvement:

  • The LLM-based system explored in this work could be integrated with a system like Hephaestus to increase the true positive rate while reducing false positives. How could we achieve this? At a very high level, the original system would first predict whether a transaction is abnormal; the execution trace would then be cross-checked against the Hephaestus cross-chain model evaluation of that specific trace, taking the current state into account. The final output could then be calculated by applying different formulas to the outputs of both systems.
  • It would be useful to have access to 1) the transaction tracer, 2) the dataset, and 3) the implementation. Without these, it is a bit difficult to visualize what is happening behind the scenes.
  • “Additionally, it is worth noting that each attack may involve more than one malicious transaction, some of which may appear benign but in fact prepare for the attack. Therefore, our approach may not be able to identify any anomalies based on the transaction traces of these transactions, which is another limitation of our study”. Often, before attacks are realized, an attacker smart contract that calls the exploited smart contract is deployed, paving the way for the attack to occur. This first transaction should already be considered malicious.
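
The integration suggested in the first point above (combining the LLM's anomaly prediction with a Hephaestus-style model check) could be sketched roughly as follows. Everything here is an assumption made for illustration: the `Verdict` type, the score range, the thresholds, and both combination formulas are hypothetical and are not taken from BlockGPT or Hephaestus.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    anomaly_score: float  # hypothetical LLM output in [0, 1]; higher means more abnormal
    rule_violated: bool   # hypothetical result of a specification-based model check


def flag_conjunctive(v: Verdict, threshold: float = 0.5) -> bool:
    """Alert only when both detectors agree (favors fewer false positives)."""
    return v.anomaly_score >= threshold and v.rule_violated


def flag_weighted(v: Verdict, weight: float = 0.5, threshold: float = 0.5) -> bool:
    """Alert on a weighted blend of the two signals."""
    rule_signal = 1.0 if v.rule_violated else 0.0
    blended = weight * v.anomaly_score + (1 - weight) * rule_signal
    return blended >= threshold


# Usage: an abnormal-looking trace that does not violate the model, and the reverse.
suspicious_but_compliant = Verdict(anomaly_score=0.9, rule_violated=False)
normal_looking_violation = Verdict(anomaly_score=0.2, rule_violated=True)
print(flag_conjunctive(suspicious_but_compliant), flag_weighted(normal_looking_violation))
```

The conjunctive rule trades recall for precision, while the weighted rule lets a strong signal from one detector compensate for a weak signal from the other. Which formula fits best would have to be determined empirically.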

🔥 Points of interest:

  • A 24% success rate is low (but, on the other hand, the false positive rate is very low), so there is a trade-off that can be expressed in terms of the F-score metric. Some transactions could seem innocuous but hide an attack, especially if we consider contract upgrades (for example, an invalid Merkle proof could be sent to a frequently called endpoint whose logic was compromised by a contract update, as in the Nomad hack). I understand we could improve this rate at the expense of more false positives; I think that crossing this type of LLM with a symbolic reasoning system (in parallel to ChatGPT + Wolfram Alpha) could increase the true positive rate while lowering false positives.
  • “This rapid detection of malicious blockchain transactions enables the triggering of a smart contract pause mechanism to prevent an attack as an Intrusion Prevention System. Approximately 50% of the attacked contracts we investigate already have such a pause mechanism deployed.” It would be interesting to see whether there are other protection mechanisms (is it only a pause?).
  • “This rapid detection of malicious blockchain transactions enables the triggering of a smart contract pause mechanism to prevent an attack as an Intrusion Prevention System. Approximately 50% of the attacked contracts we investigate already have such a pause mechanism deployed.” Furthermore, the authors say that validating a transaction takes, at most, around half a second. So it is reasonable to assume that verifying a transaction and signing a transaction that acts upon the issued alert could be done in less than a second. Considering 12-second slots in Ethereum, this seems like a reasonable latency for acting quickly. In a practical scenario, an adversary could monitor counter-attack mechanisms and exploit MEV to cancel or get ahead of these mechanisms (the mentioned hidden adversary). I believe there is research to be done in these areas.
  • BlockGPT learns the probability distribution over entire blockchain transaction traces. Would it be an improvement if BlockGPT were instantiated for specific dApps and learned from the transaction space restricted to that specific app? This could be feasible, as some dApps have tens of thousands of transactions.
  • It also seems possible to apply this work to the cross-chain scenario, although managing several different domains (and transaction pools) could be difficult to coordinate due to the heterogeneity of transaction representation and functionality. We made an initial effort on this front by mapping transactions to a common format, which is then used to initialize a cross-chain model.
  • “We also choose to preprocess numerics to capture only the first two significant figures and the scale of numbers rather than the precise amounts (e.g., 1254 → 1300). This is necessary to avoid vocabulary explosion because smart contracts frequently operate with big integers beyond 18 decimals.” I wonder what the practical trade-offs between precision and scalability are. Could we have a more fine-grained representation of transactions? I suppose it would depend on the context of the application being inspected: small deviations might matter for some use cases.
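
The numeric preprocessing quoted in the last point can be illustrated with a small Python sketch. It assumes round-to-nearest on non-negative integers, which matches the paper's example (1254 → 1300) but may differ from the authors' actual implementation; `round_two_sig` is a hypothetical helper name.

```python
def round_two_sig(value: int) -> int:
    """Keep the first two significant figures and the scale of a non-negative integer."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    digits = len(str(value))
    if digits <= 2:
        return value
    scale = 10 ** (digits - 2)
    # Round half up using integer arithmetic only, so big integers stay exact.
    return (value + scale // 2) // scale * scale


# Usage: distinct amounts collapse onto the same token value.
for amount in (1254, 1296, 10**18 + 1):
    print(amount, "->", round_two_sig(amount))
```

Both 1254 and 1296 map to 1300, which shows the precision loss discussed above: the vocabulary stays small, but two transfers that differ by a few percent become indistinguishable to the model.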

🚀 What are the implications for our work?

  • We are researching and developing several methods to detect vulnerabilities in real time. This pioneering work opens up many future research directions that inspire our own.



Rafael Belchior

R&D Engineer at Blockdaemon. Opinions and articles are my own and do not necessarily reflect the view of my employer. https://rafaelapb.github.io