SNF Blockchain

From EU COST Fin-AI

Grant link

Blockchain networks are increasingly being implemented in healthcare, supply chain, and retail systems through smart contracts, smart devices, and smart identity management. Although this technology brings benefits, it can still cause problems. A particular problem derives from the immutability property, which means that fraudulent transactions or transfers of information cannot be reversed.

Rationale: Blockchains can be attacked via a deluge of requests or transactions within a short time span, resulting in the loss of connectivity to the blockchain for users, businesses, or even financial institutions. Therefore, the rapid detection of anomalies from such activities is critical in order to prevent damage from occurring, or to correct any damage as soon as possible and reduce the severity of its impact.

Overall objectives: This project will study the problem of anomaly and fraud detection from the perspective of blockchain-based networks. Anomaly and fraud detection in blockchain-based networks is more complex because unique properties such as decentralisation, global reach, and anonymity make them different from traditional networks.

Specific aims: To further the understanding of the sources and behaviours of anomalies and fraud in blockchain-based networks, and to develop new and improved methods for both static and dynamic anomaly detection that can be used alongside blockchain-based systems for real-time fraud detection.

Methods: Developing and implementing static anomaly detection methods via a hybrid approach, and developing dynamic anomaly detection methods using extreme value theory.

Expected results: This research work will contribute to improving the security of blockchain-based networks by providing more accurate and efficient methods for detecting anomalies and fraud, and by reducing the impact of losses resulting from these anomalies.

Impact for the field: The project will be particularly beneficial alongside real-world blockchain-based networks, allowing for the fast detection of anomalous or fraudulent data so that damage can be prevented or corrected as soon as possible. For cryptocurrency networks, this will reduce the impact of market manipulation and fraud, and more widely their effects on global financial markets, currencies, and trade. In addition, the project will be of interest to a broad range of cryptocurrency and blockchain stakeholders including (but not limited to) academics, financial institutions, policymakers, regulators, and cybercrime agencies.


Aims and Relevance

This project aims to study the problem of anomaly and fraud detection from the perspective of blockchain-based networks. The major developments of blockchain technology and cryptocurrencies have brought benefits such as increased efficiency and transparency, but the immutability property means that fraudulent transactions or transfers of information cannot be reversed. Rapid detection of anomalies from such activities is critical in order to prevent damage from occurring, or to correct any damage as soon as possible and reduce the severity of its impact. Anomaly and fraud detection in blockchain-based networks is more complex because unique properties such as decentralization, global reach, and anonymity make them different from traditional networks.

The proposed research work comprises three main parts:

  1. Studying the evolution of blockchain-based networks over time.
  2. Investigating static anomaly detection methods for blockchain-based networks.
  3. Developing dynamic anomaly detection methods for blockchain-based networks.

This research aims to contribute to a better understanding of the sources and behaviors of anomalies and fraud in blockchain-based networks, as well as to the development of new and improved methods for anomaly detection, especially methods that reduce the false positive rate. Additionally, it will help to develop new methods that can be used alongside blockchain-based systems to detect anomalies and fraud in real time as new data is generated.

Methods

The proposed research work focuses on the problem of anomaly and fraud detection in blockchain-based and cryptocurrency networks. Due to the rising popularity of these systems in the financial sector and their potential benefits, it has become increasingly important to detect anomalies and outliers, which may stem from genuine errors or, more likely, from monetary or information fraud. Our first goal is therefore to extend and improve upon the accuracy of existing methods of static anomaly detection for blockchain-based network graphs by combining methods from statistics and data mining. Our second goal is to develop a new method for dynamic anomaly detection based on data streams and statistical extreme value theory. This methodology will be particularly beneficial alongside real-world blockchain-based networks, allowing for the fast detection of anomalous or fraudulent data so that damage can be prevented or corrected as soon as possible. For cryptocurrency networks, this will reduce the impact of market manipulation and fraud, and more widely their effects on global financial markets, currencies, and trade. For blockchain-based networks in general, this will assist in reducing the impact of information loss. The proposed research design can be split into three main targets as outlined below and illustrated in Figure 1.

Figure 1: Summary of the methodology

Analysis of the Evolution of Blockchain-Based Network Graphs and Their Properties

The initial goal involves studying and analyzing the key properties of blockchain-based network graphs and how they have evolved over time. The key difference between blockchain-based networks and other systems that can be represented as a network graph is that blockchain technology is relatively young, having existed for just over 10 years, and is still developing. Therefore, it is likely that the structures of blockchain-based networks have changed since they were first implemented and have continued to evolve. This analysis needs to be completed before we start extending existing methods and developing new methods for anomaly detection, mainly because many assumptions regarding anomalies in other types of networks may not be directly applicable. For example, in credit card transaction networks, anomalies may be classed as transactions where the value of the transaction is significantly higher, the number of transactions is significantly higher, or transactions occur in locations that are far away from the majority. However, in blockchain-based networks, the concepts of normal and anomalous data are not so clear-cut or well established.

To address this problem, we propose to perform a comprehensive analysis of the network graphs of large blockchain-based networks. These will include the network graphs of large cryptocurrencies such as Bitcoin and Ethereum, in addition to other blockchains for which network data can be obtained. A starting point is to investigate the fundamental result that the network graphs of many real-world systems follow the power-law model. This states that in a network graph, the probability that a node has a degree (number of edges) of $k$ is given by the relationship $P(k) \propto k^{-\gamma}$, or equivalently $\log P \propto -\gamma \log k$, which forms a straight line on a logarithmic scale (Boginski et al., 2005). This indicates that a large number of nodes have a very small degree, while a small number of nodes have a very large degree. For example, in a blockchain transaction graph, this would suggest that a large number of accounts make very few transactions, while a small number of accounts make a large number of transactions. This would provide a general idea of whether the structure and behavior of the blockchain show any similarities to traditional networks. In addition, other common network graph statistics such as the clustering coefficient, cliques, and independent sets will also be computed.

Due to the lack of labeled data relating to anomalies and fraud, there do not appear to be any benchmarks for distinguishing between normal and anomalous data in blockchain-based networks. Therefore, we propose to split our network data into subsamples of months and years and construct a large number of different network graphs from our datasets, covering transaction graphs, user graphs, and graphs based on other network variables. By analyzing the distribution of these network graph statistics across the graphs, we will be able to see how the distributions and their parameters have changed over time. This can then provide a benchmark time series of parameters and statistics that can be used as possible baselines and inputs in the anomaly detection methods.
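
As a rough illustration of the power-law check described above, the following sketch computes node degrees from an edge list and estimates the exponent by fitting a straight line to the log-log degree distribution. The function names and the edge-list input format are assumptions made for this example, not part of the project. Ordinary least squares on a log-log plot is only a quick diagnostic; maximum likelihood estimation is generally preferred for a careful fit.

```python
import math
from collections import Counter


def degree_sequence(edges):
    """Return the degree of every node in an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return list(degree.values())


def fit_power_law_exponent(degrees):
    """Estimate gamma in P(k) ~ k^(-gamma) by least squares on log P vs log k.

    Hypothetical helper: a quick diagnostic, not a rigorous estimator.
    """
    counts = Counter(d for d in degrees if d > 0)
    if len(counts) < 2:
        raise ValueError("need at least two distinct degrees to fit a line")
    n = sum(counts.values())
    xs = [math.log(k) for k in counts]                 # log k
    ys = [math.log(c / n) for c in counts.values()]    # log P(k)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope  # the slope of the log-log line is -gamma


if __name__ == "__main__":
    # Toy transaction graph: one hub account and several small accounts.
    edges = [("hub", f"acct{i}") for i in range(8)] + [("acct0", "acct1")]
    print(sorted(degree_sequence(edges)))
```

Applying the same fit to each monthly or yearly subsample would give one estimate of the exponent per period, which is the kind of benchmark time series described above.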

Analysis of static anomaly detection methods

After obtaining a comprehensive overview of the structures of blockchain-based networks and cryptocurrency networks and how they have evolved over time, the second phase will focus on improving existing methods for anomaly detection in blockchain-based networks. We classify anomalies and outliers into three different groups as follows (Chandola et al., 2009): a) point anomalies: these are the simplest types of anomalies, where single data points are classed as anomalies if they are located far enough away from the centroid of the data set; b) collective anomalies: these are sets of point anomalies that are linked to each other; c) contextual anomalies: these anomalies are conditional and usually occur in time series data.

Point anomalies and collective anomalies can be considered as part of the static anomaly and fraud detection problem. In theory, these types of anomalies will generally be more pronounced and easier to detect. Existing data on anomalies that have previously occurred in blockchain-based networks is limited. We propose to build our real data sample from a combination of publicly available data from two main sources: a) cryptocurrency networks, using data from previously reported anomalous events such as hacks, including date and time, type of anomaly, total loss, etc.; b) other blockchain-based networks, using data from previously reported anomalous events such as attacks on user wallets, smart contracts, double spending, distributed denial of service (DDoS) attacks, etc. Although this data will likely be very general, it can still provide an indication of approximate time periods that can be focused on for detecting anomalies.

The main part of our method will be based on a hybrid approach, where individual anomaly detection methods are combined and used in parallel, or consecutively. The motivation for this approach is provided in the current literature, which has found that anomalies detected in blockchain-based networks using existing methods do not show a significant overlap (Mansourifar et al., 2020). To overcome this, network graphs of the cryptocurrency and blockchain-based networks will be constructed from our real data using the undirected graph model defined in Section 1. These will correspond to the network graphs of the time periods when anomalous events actually occurred.

An existing method such as k-means clustering will be used to search for clusters in the network graphs, splitting the data into groups based on their similarity in order to determine which data points may be anomalies. In the case of k-means clustering, the goal is to partition the sample data corresponding to a particular network graph into $k$ distinct and non-overlapping clusters. In the simplest case, the number of clusters $k$ can be set manually to be equal to the number of types of anomalies we expect to see, or set to a value of two based on the premise of data being normal or anomalous. More formally, letting $C_1, C_2, \cdots, C_k$ denote sets containing the data points in each cluster, we aim to minimize the within-cluster variation for each of the $k$ clusters as follows:


Math formula.JPG

where $|C_k|$ denotes the number of observations within the $k$th cluster, and the within-cluster variation is denoted by the term in brackets (pairwise squared Euclidean distances between observations within the $k$th cluster). Depending on the graph type, these data points may correspond with transactions, accounts, etc., and may be clustered according to the frequency of transactions, values of transactions, or other attributes.
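
The clustering step can be sketched in a few lines of plain Python. The formula image is not reproduced here, so the sketch follows the verbal description: a common textbook form of the objective, assumed to match the image, is $\min_{C_1,\dots,C_K} \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$, writing $K$ for the number of clusters and $p$ for the number of features. The function names, the choice of the first points as initial centroids, and the toy features (transaction count, total value) are assumptions made for illustration only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_cluster_variation(cluster):
    """Pairwise squared distances within one cluster, divided by its size."""
    if not cluster:
        return 0.0
    total = sum(sq_dist(a, b) for a in cluster for b in cluster)
    return total / len(cluster)


def centroid(cluster):
    """Coordinate-wise mean of a non-empty cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; assumes the first k points are distinct.

    Initialisation from the first k points is an assumption chosen to keep
    the sketch deterministic; random restarts are more common in practice.
    """
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: recompute centroids, keeping the old one if empty.
        updated = [centroid(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return clusters, centroids


if __name__ == "__main__":
    # Hypothetical per-account features: (transaction count, total value).
    accounts = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (50.0, 40.0)]
    clusters, _ = kmeans(accounts, k=2)
    print(min(clusters, key=len))  # the smaller cluster is a candidate anomaly
```

With $k = 2$, treating the smaller cluster as the candidate anomalous group mirrors the normal-versus-anomalous premise mentioned above; it is one possible reading rather than a prescribed rule.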

Results will be obtained for a large number of graphs for each blockchain-based network. Point and collective anomalies can then be identified and compared against the true anomalies in the real data, and to the benchmark values computed from the analysis in Part 1. In addition, anomalous trends and patterns in network graph statistics may also be revealed that can also be indicative of an anomalous event and possible fraud. However, to improve the accuracy of the detection of anomalies, we propose to combine these methods with the use of extreme value theory (EVT). This is because anomalies resulting from fraud usually correspond with extreme data – for example, in cryptocurrency networks, a small number of transactions with large values, or a large number of transactions with small values.

One possibility is for the most extreme values in network statistics to be modelled using the generalised extreme value (GEV) distribution from extreme value theory. Suppose anomalies have been detected in a network graph representing the number of transactions between user accounts. Considering all data points exceeding some threshold as extreme values, we can model the distribution of these values, x, by the GEV distribution. This can provide a probabilistic interpretation of how likely the data, in this case very large or small numbers of transactions, are to occur. The anomalies detected by single methods such as k-means clustering can then be analysed using the fitted model to provide a further confirmation of how likely the true anomalies were to occur. Simulations of these extreme data can also be generated using this model, which can be used to analyse possible anomalies during periods when there were no confirmed anomalies.
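
To make the probabilistic interpretation concrete, the sketch below evaluates the GEV distribution function and the probability of exceeding a given value. The parameters `mu`, `sigma`, and `xi` (location, scale, shape) are assumed to have been fitted already; no fitting routine is shown, and the function names are hypothetical. In classical extreme value theory the GEV arises as the limiting distribution of block maxima, so a small block-maxima helper is included as one possible way of selecting the extremes.

```python
import math


def gev_cdf(x, mu, sigma, xi):
    """GEV distribution function with location mu, scale sigma, shape xi."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    if xi == 0.0:
        # Gumbel case, the limit of the general formula as xi -> 0.
        return math.exp(-math.exp(-z))
    base = 1.0 + xi * z
    if base <= 0.0:
        # Outside the support: below the lower bound (xi > 0)
        # or above the upper bound (xi < 0).
        return 0.0 if xi > 0 else 1.0
    return math.exp(-base ** (-1.0 / xi))


def gev_exceedance_probability(x, mu, sigma, xi):
    """Probability that an extreme value is larger than x under the model."""
    return 1.0 - gev_cdf(x, mu, sigma, xi)


def block_maxima(values, block_size):
    """Maximum of each complete block; an incomplete final block is dropped."""
    last_start = len(values) - block_size + 1
    return [max(values[i:i + block_size])
            for i in range(0, last_start, block_size)]


if __name__ == "__main__":
    # Hypothetical daily transaction counts, grouped into blocks of three.
    counts = [12, 15, 11, 40, 13, 14, 9]
    print(block_maxima(counts, 3))
    # Illustrative parameters only; real values would come from a fit.
    print(gev_exceedance_probability(40, mu=14.0, sigma=3.0, xi=0.2))
```

A very small exceedance probability for a point already flagged by clustering would support treating it as a genuine anomaly rather than ordinary variation.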

Development of dynamic anomaly detection methods

The third phase of the proposed work will focus on developing dynamic methods for the detection of contextual anomalies in the network graphs of blockchain-based networks. This problem is more complex as these anomalies are dependent on the context or the conditions at the time when they occur. In addition, many static anomaly detection methods are not suitable as they require scanning of the network data multiple times. To solve this problem, we propose to develop our method by treating the data from blockchain-based networks as a data stream. As new data is continuously generated, the structures of the corresponding network graphs will change over time and so will the network graph statistics.

We suppose that data from blockchain-based networks can be represented as a streaming time series $X_t$, $t > 0$, of independent and identically distributed (i.i.d.) observations. To determine a level of “normality” with respect to network graph statistics, we use the results from the analysis in Part 1 as benchmark values. The focus will be on determining a threshold level $\alpha$ such that network graph statistics exceeding it are classed as possible anomalies and can then be analysed further. Instead of analysing the network graphs of the whole sample period of our data, we account for the time component by analysing network graphs and statistics corresponding to rolling windows of pre-defined lengths such as one hour, one day, or one week.
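
The rolling-window idea can be sketched as a single pass over the stream, so that no data point is scanned more than once. The example assumes timestamps in seconds that arrive in non-decreasing order and uses the number of transactions in the window as the network statistic; both the statistic and the function names are assumptions chosen for illustration.

```python
from collections import deque


def rolling_counts(timestamps, window):
    """Yield (t, number of events in the half-open window (t - window, t])."""
    buffer = deque()
    for t in timestamps:
        buffer.append(t)
        # Drop events that have fallen out of the window ending at t.
        while buffer[0] <= t - window:
            buffer.popleft()
        yield t, len(buffer)


def flag_windows(timestamps, window, alpha):
    """Return the times at which the rolling count exceeds the threshold alpha."""
    return [t for t, count in rolling_counts(timestamps, window) if count > alpha]


if __name__ == "__main__":
    # Hypothetical transaction times (seconds) and a one-hour window.
    times = [0, 10, 20, 3600, 3605]
    print(list(rolling_counts(times, window=3600)))
    print(flag_windows(times, window=3600, alpha=3))
```

Because each timestamp is appended and removed at most once, the cost per new observation stays small regardless of how long the stream has been running.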

Inspired by extreme value theory, we can use the peaks over threshold method to determine which data points in a rolling window are defined as extreme. The simplest way is to define an initial threshold value so that within a rolling window a small percentage (e.g. 2.5%) of data points are above or below $\alpha$. These extreme data points, $x$, can then be modelled by the generalised Pareto distribution (GPD), which is closely related to the GEV distribution used in Part 2. This model can again provide a probabilistic interpretation of how likely these anomalous data points are to occur. To check whether the data points are true anomalies that should be investigated, we require additional conditions. One way is to define anomalies as data points which have a very small probability of occurring according to the fitted model, in addition to exceeding a percentage relating to the benchmark values obtained in Part 1 of the analysis. Another possibility is to also require data points to exceed a percentage relating to average values of the network statistics in the current rolling window, or in a number of the most recent rolling windows. Selecting the most appropriate conditions will require testing on both real and simulated anomaly data to find the optimal detection conditions.
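
A minimal sketch of the peaks over threshold step is given below, under several stated assumptions: only the upper tail is considered, the threshold is taken from the sorted window so that roughly the chosen fraction of points exceed it, and the GPD is fitted by the method of moments (valid when the shape is below 1/2) rather than by maximum likelihood. All function names are hypothetical.

```python
import math


def upper_threshold(values, tail=0.025):
    """Pick a threshold so that about `tail` of the values strictly exceed it."""
    ordered = sorted(values)
    m = max(1, round(tail * len(ordered)))  # target number of exceedances
    if m >= len(ordered):
        raise ValueError("window too small for the requested tail fraction")
    return ordered[len(ordered) - m - 1]


def fit_gpd_moments(excesses):
    """Method-of-moments estimates (sigma, xi) from excesses over the threshold."""
    n = len(excesses)
    mean = sum(excesses) / n
    var = sum((y - mean) ** 2 for y in excesses) / n
    if var == 0:
        raise ValueError("excesses have zero variance")
    ratio = mean * mean / var
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * mean * (ratio + 1.0)
    return sigma, xi


def gpd_survival(y, sigma, xi):
    """P(excess > y) under a GPD with scale sigma and shape xi, for y >= 0."""
    if xi == 0.0:
        return math.exp(-y / sigma)
    base = 1.0 + xi * y / sigma
    if base <= 0.0:
        return 0.0  # beyond the upper end point (only possible when xi < 0)
    return base ** (-1.0 / xi)


def exceedance_probability(x, threshold, tail, sigma, xi):
    """Approximate P(X > x) for x above the threshold."""
    return tail * gpd_survival(x - threshold, sigma, xi)


if __name__ == "__main__":
    window = list(range(1, 101)) + [250, 400]  # hypothetical window statistics
    u = upper_threshold(window, tail=0.05)
    excesses = [x - u for x in window if x > u]
    sigma, xi = fit_gpd_moments(excesses)
    print(u, exceedance_probability(400, u, 0.05, sigma, xi))
```

A data point could then be flagged only if this probability is very small and the point also exceeds the benchmark-based condition, in line with the additional conditions described above.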

In summary, due to the rising popularity and use of blockchain-based and cryptocurrency networks, the risks from these networks are also growing due to anomalous data and events. Much of the current literature focuses only on static anomaly detection in blockchain-based networks. The uniqueness and innovation of the proposed work is that our methodology attempts to extend static anomaly detection methods via a hybrid approach, and also to develop dynamic anomaly detection methods using extreme value theory. This research work will contribute to improving the security of blockchain-based networks by providing more accurate and efficient methods for detecting anomalies and fraud, and by reducing the impact of losses resulting from these anomalies.