Sansan Tech Blog

Sharing the technology, design, and product management know-how of the members who support Sansan's product development

Hands-on guidance to DGL library _ (3) Negative Sampling Methods


Hi, I am XING LI, a researcher from Sansan DSOC.

We have discussed two of DGL (Deep Graph Library)'s fundamental mechanisms: Message Passing and the NodeFlow data structure. Today, we are going to dive into a small but vital area: negative edge sampling methods in graph learning.

Strictly speaking, this post is a relatively independent topic within "Hands-on guidance to DGL library"; it is more of a personal summary of the graph edge negative sampling problem. There is no need to argue for the universality and importance of negative sampling here; you can find some useful insights in [1]. I would like to compare several negative sampling methods I know and analyse the situations where each is applicable. There will not be much theoretical study here; it is mostly hands-on guidance. Let's get started!

Graph Edge Negative Sampling

There are many ways to store a graph in memory and many forms a negative sampling request can take. To simplify the discussion, we only consider the common situation where a graph is defined as  G = \lbrace N, E, R \rbrace, where  N is the set of all nodes,  E is the set of all edges, and  R is the set of all relations used in the knowledge graph domain, which we may ignore in this blog. An edge is represented as  ( n_{i}, n_{j} ), together with a relation  r_{k} in the knowledge graph case. A typical request is to sample a number of negative edges proportional to a given set of positive edges.
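
To make the notation concrete, here is a minimal sketch of how such a graph and a sampling request could look in Python (the variable names are my own, not from DGL):

# A toy graph G = {N, E, R}; the relation set R is ignored in this post.
num_nodes = 6                                              # n(N)
positive_edges = [(0, 1), (0, 2), (1, 3), (2, 4), (4, 5)]  # E, so n(E) = 5

# The sampling request: for the given positive edges, return
# ratio * n(E) node pairs that do NOT appear in E.
ratio = 2
num_negatives_wanted = ratio * len(positive_edges)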

Idea One

In most cases the number of positive edges is much smaller than the number of negative edges; if not, we can simply swap the positive and negative labels so that this assumption always holds. We only need to store all positive edges by recording the node IDs on both sides of each edge. We then randomly sample two node IDs and check whether there is a positive edge between them. If there is not, we have a negative edge; otherwise we repeat the process until we find one.

The main space cost here is O(n(E)), where n(E) is the number of edges (likewise, n(N) and n(R) denote the numbers of nodes and relations).

The time cost varies with the programming language and the degree of optimisation. With plain Python and no optimisation, a typical approach is to keep the edges in a list and rely on the in operator, whose time cost is O(n(E))[2].

This is a rather slow method: in the worst case we have to check all positive edges just to confirm one randomly sampled negative edge. Graph representation learning typically needs to sample tonnes of negative edges, so you would have a terrible experience with this method on a "large" graph (one with a huge number of nodes and edges).
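
As a reference, here is a minimal, unoptimised sketch of idea one in plain Python, assuming a directed graph and the toy representation above (the function name is my own):

import random

def sample_negative_edges_naive(num_nodes, positive_edges, num_samples):
    """Idea one: rejection sampling with a linear membership check."""
    negatives = []
    while len(negatives) < num_samples:
        i = random.randrange(num_nodes)
        j = random.randrange(num_nodes)
        # The in check on a list scans all positive edges: O(n(E)) per draw.
        if i != j and (i, j) not in positive_edges:
            negatives.append((i, j))
    return negatives

# Example usage with the toy graph above:
# negatives = sample_negative_edges_naive(num_nodes, positive_edges, num_negatives_wanted)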

Idea Two

The second idea is a smart trick. Idea one works on a "small" graph and only fails on a "large" one, and a phenomenon that holds for most "large" graphs is that they are very sparse: the ratio  \dfrac{n(E)}{n(N)^2} becomes smaller as  n(N) grows.

Here is the clever method: for each given positive edge, such as  ( n_{i}, n_{j} ), we corrupt one side of it. Corruption means we randomly sample a node and replace one of the original endpoints with it, without any further check on whether a positive edge exists between the non-corrupted side and the corrupted side. Because the graph is large and sparse, the number of nodes adjacent to the non-corrupted node is much smaller than the total number of nodes, so with high probability we obtain a true negative edge. The space complexity is still O(n(E)) to store the graph, and the time complexity per sample is only O(1).

Of course, this method samples false negative edges with a small but non-negligible probability, which restricts its scope of application. For most graph representation learning algorithms this flaw is tolerable, but if you need true negative samples, idea two is not satisfying enough.
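
A sketch of idea two, again with my own function name: one endpoint of every positive edge is replaced by a random node, and no membership check is performed, so a small fraction of the returned edges may in fact be positive.

import random

def corrupt_edges(num_nodes, positive_edges, corrupt_head=False):
    """Idea two: corrupt one side of each positive edge, no existence check."""
    negatives = []
    for i, j in positive_edges:
        k = random.randrange(num_nodes)
        # Replace the head or the tail with a randomly sampled node.
        negatives.append((k, j) if corrupt_head else (i, k))
    return negatives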

Idea Three

Idea three covers several methods that trade time against space. They share the same core, so we put them together under one category.

An important direction for optimising idea one is to minimise the time spent on the existence check. Idea one spends most of its time in the in operation on a Python list. A natural question is why this search needs O(n(E)) time. The reason is that a plain list is neither sorted nor hashed, so the in operator has to scan the elements one by one and, in the worst case, examine the whole list.

A Python user will immediately say: but you know, we have set. Exactly: set is our first optimisation of idea one. By relying on hashing, the average time complexity of the in operation on a set drops to O(1)[2]. However, if you read the source code or just run a small demo, you will find that the set data structure costs a constant multiple of extra space because of the hash table's over-allocation. A naive demo:

import sys

# Compare the memory footprint of a list and a set holding the same integers.
for N in (10000, 100000, 1000000):
    list_data = list(range(N))
    set_data = set(i for i in range(N))
    print(sys.getsizeof(list_data))
    print(sys.getsizeof(set_data))
    print(sys.getsizeof(set_data) / sys.getsizeof(list_data), '\n=====\n')

#Result:
#90120
#524520
#5.8202396804260985 
#=====

#900120
#4194536
#4.659974225658801 
#=====

#9000120
#33554664
#3.728246290049466 
#=====
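
For completeness, this is what the set-based variant of idea one looks like: the edge check becomes O(1) on average, at the memory cost shown above (a sketch with my own function name):

import random

def sample_negative_edges_set(num_nodes, positive_edges, num_samples):
    """Idea one with a set: O(1) average membership check, extra memory."""
    positive_set = set(positive_edges)   # hash table over all positive edges
    negatives = []
    while len(negatives) < num_samples:
        i = random.randrange(num_nodes)
        j = random.randrange(num_nodes)
        if i != j and (i, j) not in positive_set:
            negatives.append((i, j))
    return negatives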

In most cases, time is more precious than space. But graph data often contains millions or even billions of edges, and at that scale space becomes another significant bottleneck.

So let us keep the space complexity close to O(n(E)) and think about how to reduce the time cost.

There are many answers. numpy.ndarray stores data contiguously, just like a C++ array. You can also encode every positive edge as the integer  n_{i}*n(N) + n_{j}, keep these codes sorted in a list or array, and then use binary search to decide whether a sampled edge is positive, with time complexity O( log_{2}n(E) ). Furthermore, you can write the sampling code in C++ and glue it into your Python code; don't forget that Python is called a glue language!
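
As one possible sketch of the sorted-index option (not DGL's implementation; the function names are my own): encode every positive edge as a single integer, keep the codes sorted in a numpy array, and answer membership queries with binary search.

import numpy as np

def build_edge_index(num_nodes, positive_edges):
    """Encode each positive edge (i, j) as i * n(N) + j and sort the codes."""
    codes = np.fromiter((i * num_nodes + j for i, j in positive_edges),
                        dtype=np.int64, count=len(positive_edges))
    codes.sort()
    return codes

def is_positive(codes, num_nodes, i, j):
    """Binary search over the sorted codes: O(log n(E)) per query."""
    code = i * num_nodes + j
    pos = np.searchsorted(codes, code)
    return pos < len(codes) and codes[pos] == code

# Example usage with the toy graph above:
# codes = build_edge_index(num_nodes, positive_edges)
# is_positive(codes, num_nodes, 0, 1)   # True: (0, 1) is a positive edge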

Summary

In this blog, we talked about a small but common and important part of graph learning: negative sampling. Even though it has no direct relation to DGL, DGL is built on Python, and you will eventually face this problem when implementing graph learning algorithms under time or space constraints.

Each particular case may have its own restrictions or conditions, so don't stick rigidly to these methods; instead, exploit these ideas to create the most customised and suitable method of your own! See you in the next blog!

References

[1] Yang, Z., Ding, M., Zhou, C., Yang, H., Zhou, J. and Tang, J., 2020. Understanding Negative Sampling in Graph Representation Learning. arXiv preprint arXiv:2005.09863.

[2] TimeComplexity - Python Wiki



© Sansan, Inc.