# Decentralized Search

## Abstract

Milgrim's small world experiments in den 1960s resulted in the notion of six degress of separation, stating that any two human beings could be connected by just six friendship-relations betweens people. We analyse this concept of social search and link it to the Watts-Strogatz model for the creation of random graphs, fit for decentralized search. We will then construct a model to generate grids with optimal conditions for decentralized search and prove that our approach is optimal.

Furthermore we will test the aforementioned theories and models against real world examples and a set of simulations.

## 1 Concept of Decentralized Search

### 1.1 Milgram's small-world experiment and six degrees of separation

In the 1960s Stanley Milgram performed a series of experiments in order to analyse,how human beings are connected. In his most famous experiment he chose 296 participants as random seeds. These seeds each received a letter with the instruction to forward it to a target person; but only if they already knew that person. If they were not familiar with the target, they were supposed to forward the letter to one of their friends whom they suspected to have a better chance of knowing the target. To make such a decision possible, the letter included some personal information about the target and mentioned, among other things, their address and line of work.

The definition of being friends with somebody (necessary to forward them the letter) was defined as knowing them on a first name basis.

In the end, 64 of the original 296 letters reached their target, with an average path length of 6. The path length in this case is the number of people involved in forwarding the letter [1, 2].

To discuss these results we will look at this experiment as a graph. Every person is a node and friendship between people is represented by an edge between the corresponding nodes. A letter/message can only travel from one node to another when an edge exists between them.

#### 1.1.1. Problems with the Milgrim experiment

Even though 78% of letters never reached the target, this experiment is often cited as the justification for the term six degrees of separation, implying that every pair of humans on the planet could be connected by just six friendship links. While one can accept that a lack of participation accounts for all the letters that never reached the target (as opposed to the absence of a friendship-path from the original seed to the target) there are two other factors we want to take a look at: Geographic location and social status.

Figure 1.1: A route from Omaha, NE and Wichita, KS to Boston, MA. [3]

All the random seeds were chosen in Omaha, NE and Wichita, KS, while the target was from Boston. While these cities are geographically somewhat distant1, as can be seen in figure 1.1, the fact remains that they are in the same country and that their inhabitants speak the same language. And even the fact that the cities are socially distant2 does not make their selection truly random, when you take a global point of view.

1. more than 2,200 km
2. Omaha and Wichita are rural centres in the mainland, surrounded by farming communities, while Boston is a harbour city and financial center, known for its universities.

The second problem we mentioned is social status. The target person in Boston was someone of high social status, working in the financial industry. We feel that in a setting as was used by Milgrim, it is much easier to reach a person of high social status because such a person will be recognized by the peers around them. Especially on a global scale, high-status people are easier to reach. They generally speak more languages, travel more and have easier access to better means of communication. These three factors alone give them a much greater possibility to know people that live far away from them. A condition that is necessary to cover great geographic distances with just a few friendship-edges.

The search that was used in Milgrims experiment was not a true shortest path search, because no participant had knowledge of the whole friendship graph and there was no communication going back and forth between participants. Since the letter was forwarded only to the most promising friend, the notion of tunnelling a message through the graph comes to mind. While this approach seems to work pretty well, as has been shown in section 1.1, there is always the possibility of unnecessary failure, meaning that a letter might never reach the target even though there is a path between the seed and the target. We will call this kind of search social search.
A theoretical improvement would be to simply have each participant forward the letter1 to all of their friends. While it is guaranteed that this breadth-first search will find the shortest path, such a ooding of the network is not applicable in real world scenarios.

1. A copy of the letter to be more precise.

Just based on the observations made so far, we feel that it is safe to assume, that the structure of the network is an important factor in determining how well a social search will perform.

### 1.3. Watts-Strogatz model

In 1998 Duncan J. Watts and Steven Strogatz introduced a model to generate random graphs `G = (V,E)` with small-world attributes. These attributes are high clustering and a small average shortest path length. They use a homogeneous grid to model homophily and add random edges to represent weak ties1.

The structure of the grid depends on just four variables. `N` defines the total number of nodes. The dimension `n` defines the dimension in which the grid lies. A onedimensional grid is a circle2 and a two-dimensional grid corresponds to a chessboard, as can be seen in figure 1.2.

1. The terms weak tie, random link and long distant link essentially all refer to the same thing.
2. Actually it is a line whose start- and endpoint are connected.

Figure 1.2.: 2D Watts-Strogatz grid

The third variable `r` defines an area around every node `v`, that includes all the nodes whose distance from `v` on the grid is less or equal to `r`. `v` is directly connected to a all nodes in this area. Figure 1.3 shows a two-dimensional Watts-Strogatz model with `r = 2` for just one node. For reasons of clearness, we will always use `r = 1` for our visualisations.

Figure 1.3.: 2D Watts-Strogatz grid with r = 2

The fourth variable is `k`. Every node v forms edges to `k` additional nodes, chosen uniformly random among all the nodes, that did not previously share a connection with `v`. Figure 1.4 shows a Watts-Strogatz grid with `4` random edges.

Figure 1.4.: 2D Watts-Strogatz grid with 4 random edges

There exist numerous modifications of this model. For example sometimes only `1` in every `k` nodes forms a random edge. Another well-known modification is to redirect existing edges to form random links, instead of introducing new ones. And in every dimension greater one, there can be additional cross-links on the grid. Figure 1.5 shows a two-dimensional grid with cross-links on the left and no cross-links on the right.

Figure 1.5.: 2D Watts-Strogatz grid with and without cross-links.

For decentralized search, two nodes `s` (start) and `t` (target) are chosen randomly. The task is, to deliver a message from `s` to `t`, i.e. to find a path from `s` to `t`. In every step of the algorithm, there is only one active node. This currently active node `c` knows the structure of the underlying grid, including its own position and the position of `t` on the grid, and all the random edges that are connected to `c`. Any other random edges are unknown to `c` and there are no mechanisms to poll other nodes for information.

Neither did `c` receive any additional information from its predecessor, apart from the target `t`, nor is `c` even aware who its predecessor is.

With this information every nodes chooses whom to forward the message to by itself. The sum of these local choices is a collective procedure, but does not represent any kind of hive mind of swarm intelligence, since every decision is made truly local.

#### 1.4.1 Example of Decentralized Search on a Watts-Strogatz grid

An example of decentralized search on a two-dimensional grid follows:

Figure A.1.: Example: Step 0
The aim is to find a path from node `A` to node `B`. In this step node `A` has two choices, it can pass the message down or to the right.

Figure A.2.: Example:
The shortest path between nodes `A` and `B` has a length of `3` steps.

Figure A.3.: Example: Step 1
Node A randomly choose to pass the message down. The now active node is connected to a random edge, that is a better choice than its grid connections.

Figure A.4.: Example: Step 2
The active node is connected to a random edge, but it is a worse choice than passing the message either down or to the right.

Figure A.5.: Example: Step 3
The active node in the last step randomly choose to pass the message to the right. The now active node is connected to a random edge but the best option is pass the message down.

Figure A.6.: Example: Step 4
Again the now active node is connected to a random edge and again the best option is pass the message down.

Figure A.7.: Example: Step 5
Node `B` has been reached in `5` steps. A length of `3` would have been ideal.

### 1.5. Watts-Strogatz model fails

While decentralized search on graphs genrated by the Watts-Strogatz model yields decent results, the paths are still too long. This can be explained by the fact that there are too many long distance weak ties, when more short distance weak ties would help much more in reducing the average path length. A detailed explanation of this assumption follows in section 2.2. To change this, we need to stop placing weak ties randomly.

## 2. Modelling Decentralized Search

We will now modify the Watts-Strogatz model to make the resulting graphs more suitable for decentralized search.

### 2.1. Modified Watts-Strogatz grid

We modify the Watts-Strogatz grid as follows: Again each node is connected to all neighbours within `r` grid steps. But now the probability for the existence of a random edge gets lower by increasing distance. The probability for a random edge between the nodes v and w is proportional to `d(v,w)-q`, for a `q≥0`. In this formula `q` is the clustering component and the function d gives us the distance between two nodes on the pure grid, without taking the preexisting random edges into account.

We will show the process step by step. In figure 2.1 we start with a node `v` (blue) that is supposed to be the origin of a new random edge.

Figure 2.1.: A random edge is supposed to start from the blue node `v`.

For reasons of clarity we fade out the underlying grid and annotate every node with its distance to `v` in figure 2.2. The grey nodes with a distance of `1` from `v` are already connected to `v` via the grid and are thus not available as candidates for a random edge starting in `v`.

Figure 2.2.: Every node is annotated with its distance to `v`.

In figure 2.3 we annotate every node `w` with the value `1` `d(v,w)2`, which we introduced in section 2.1 as the value, to which the final probabilities will be proportional to. An explanation, why we use a value of `2` for `q` will fallow at the end of section 2.1.1.

Figure 2.3.: Every node is annotated with `1 / d(v,w)2`

.

Since we have a finite set of 20 nodes and have the proportionality value for each node, we can easily calculate the actual probabilities. They are shown in figure 2.4.

Figure 2.4.: Every node is annotated with its probability to be `v`'s random edge target.

#### 2.1.1. Clustering component q

We can observe a couple of things regarding the clustering component `q`:

• When the distance between two nodes a and b is greater than 1, the probability for the existence of a random edge between them decreases with growing q:
`limq→∞ 1 ⁄ d(a,b)q =`
`1`; if `d(a; b) = 1` (`a` und `b` are neighbours)
`0`; if `d(a; b) > 1`
• For every q greater 0, a growing distance between two nodes a and b leads to a lower probability for the existence of a random edge between these nodes:
`limd(a,b)→∞ 1 ⁄ d(a,b)q = 0`, for a `q > 0`

A clustering component equal to 0 will lead to true randomness, meaning that the length of the random edges will be distributed uniformly random. Such a distribution corresponds to the original Watts-strogatz model. For values of q that are too small the distribution will still be too random and form too many long distance random edges as can be seen in figure 2.5.

Figure 2.5.: 2D Watts-Strogatz grid with small clustering component

A clustering component that is to large, could on the other hand lead to a lack of necessary long distance random edges, as illustrated in figure 2.6.

Figure 2.6.: 2D Watts-Strogatz grid with large clustering component

We will at this point assume that a value of `q = 2` is ideal for huge two-dimensional networks. A proof will follow in chapter 4.

#### 2.1.2. Simulation

In a simulation with 400 million nodes on a two-dimensional grid, John Kleinberg has shown, that clustering components between 1.5 and 2.0 generate random graphs that yield good results for decentralized search, with a value of approximately 1.9 being ideal. He deducts that the formula `lim→∞ qideal = 2` holds true [4].

### 2.2. Scales of resolution

If given the task, of travelling from one country to a specific address in a different country, a very basic and naive approach, as displayed in figure 2.7, is to first get on the right continent, then into the right country, state, town, borough, block and finally street. This is based on the assumption, that we have a general idea, where a specific address in a foreign country is, but do not know what shortcuts (like airplanes, public transportation, etc.) can aid us in our goal to reach the target, unless we happen to be at an airport or trainstation, which would give us knowledge about ights, trains, etc. leaving at this station (and only this station).

#### 2.2.1 Idealized path

Figure 2.7.: Closing in on a target

Effective decentralized search is essentially doing the same thing. We will call the real world examples of continent, country, etc. scales of resolution. In the two-dimensional case, a scale around a given node v is a circle with a given radius. Every node within the circle is part of the scale that is described by said radius.

In figure 2.8 we can see two scales with radii of `d` and `2d`. If we lay this sketch over a two-dimensional Watts-Strogatz grid, the number of nodes in the first scale would be proportional to `d2` and the number of nodes in the second scale proportional to `(2d)2`.

Figure 2.8.: Scales with radii of d and 2d

If we now look at a random node w in the inner scale, the probability for a random edge going from the blue node v in the center to w is proportional to `1 ⁄ d2`. This follows directly from our definitions in section 2.1.

So for every given scale-radius `d`, we know that the number of nodes in the scale is proportional to `d2`. And the probability to hit a specific one of these nodes with a random edge, that starts in the center of the scale, is proportional to `1 ⁄ d2`. Thus the probability to hit any node in the scale is proportional to `1`. Particularly, this probability is independent of the radius `d`. Thus, the probability is the same for every scale.
This translates to a uniform distribution of random edges over all scales.

## 3 Empirical Data and Generalized Distance

In this chapter we will take a look at some real world examples.

### 3.1 LiveJournal

LiveJournal (short LJ) is a platform for personal blogs. Liben-Nowell et. al. took a sample of 500,000 users, who had provided a valid continental US ZIP code and build a grid based on the users' geographical location, with each node representing one user [5]. Between these users there were 4 million friendship relations on LJ. This puts every users at an average of 8 friends. While friendships on LJ are directed, they were symmetrical in 80% of observed cases. The LJ-friendships were represented by long-range links.

A noteworthy observation it, that 380,000 users, roughly 78% of all users, form a giant component.

#### 3.1.1 GEOGREEDY

The algorithm used by Liben-Nowell et al., to deliver a message from a start-node s to a target-node t, is called GEOGREEDY. In every step the current node c chooses the friend, that is geographically closest to t. If none of c's friends is truly nearer to t than c itself, the algorithm aborts. The algorithm finishes successfully when a node has been reached that has the same ZIP code as t. This is owed to the fact, that the data set did not provide street-level data. Thus, the geographic location of individuals living in the same ZIP area could not be differentiated. This means, that the GEOGREEDY algorithm only solves global routing problems.

The problem of local routing, e.g. finding the right person once the message has reached the correct city, can be solved with non-geographic factors like age, cultural background, profession or interests [6, 7]. The GEOGREEDY algorithm has a 13% success rate with an average of 4 steps on the LJ data.

##### 3.1.1.1 Modified GEOGREEDY

In a modified version, the current node passes the message to another node in its city, instead of aborting. This modification leads to an 80% success rate with an average of 12 steps.

#### 3.1.2 Results

The research from Liben-Nowell et al. yields an interesting finding: Friendship depends heavily on distance, even in the context of online-friendship (This puts the hypothesis to question, that the role of an individual's location diminishes in the face of the modern communication devices of a globally connected world.). The more distant two people are, the less likely they are to be friends on LJ. At least up to 1,000km. From that distance on, the probability stays constant. To be precise, about 5.5 out of 8 friendships, roughly two thirds, are based on low geographic distance. Another finding is somewhat disturbing: When modelling friendship based on distance in the LJ dataset, probabilities proportional to `1 / d(a,b)` are ideal to create networks, suitable for decentralized search. This is a contradiction to the earlier observation by Kleinberg in section 2.1.2, that a value proportional to `1 / d(a,b)2` should yield the best results.

To explain this phenomenon, we need to look at the population density in the USA as shown in figure 3.1. The center of the graphic is Ithaca, NY and every circle contains an additional 50,000 people. The giant gap between between 350,000 and 400,000 is caused by the large variation in population density in the USA. This puts the hypothesis to question, that the role of an individual's location diminishes in the face of the modern communication devices of a globally connected world.

Figure 3.1.: A visualisation of the population density in the USA with each circle containing an additional 50,000 people.

An intuitive example to put this observation in context: In rural Iowa, a person living 500m away from your home might be your direct neighbour and you will probably know them. In Manhattan, there are a couple of thousand people living in a 500m radius around you and you will never get to know all of them. Thus, a uniform two-dimensional grid seems to be an ill approximation for modelling friendship based on distance.

#### 3.1.3 Solution: Rank-Based Friendship

As a solution for the problem described above, Liben-Nowell et al. introduce for every node `a`:
`ranka(b) = |{x ∈ V\{a} | d(a,x) < d(a,b)}|`

In other words: The rank of a node `b` (in relation to a certain center-node `a`) is equal to the number of nodes that have a smaller distance to `a` than `b` does. This combination of distance and density can be applied uniformly over the whole network, even when the density is non-uniform, as is the case in the LJ example. An example of rank-based friendship can be found in figure 3.2.

Figure 3.2.: Rank-based friendship values for a group of nodes in correspondence to node a.

In this model, the probability for the existence of a random-edge from node `a` to node `b` is inversely proportional to `ranka(b)`, or to be more precise:
`P( (a,b) ∈ E) ∝ 1 / ranka(b)`

It is important to note, that the ranked-based friendship and the probabilities it creates, are independent of the graphs dimension.

### 3.2. Generalized similarity

So far, we have always used geographic proximity as an indicator for the homophily between nodes, that their arrangement on the grid is supposed to express. However there are many other factors, that can be used to calculate similarity, like place of birth, area of profession, age, income, race, gender, hobbies, political views or combinations of the aforementioned factors.

## 4 Proof

To show, that `q = 2` yields the best results on a two-dimensional grid, we will first show that `q = 1` is very good in the one-dimensional case. In a second step we will show, that all other exponents perform worse. Once we have established, that `q = 1` is optimal in one dimension, we will generalize this statement to `q = n` for the n-dimensional case [8].

### 4.1 Proof in 1D

A one-dimensional grid is often not the best representation of reality but it is much easier and cleaner to handle than the higher dimensions. We use a Watts-Strogatz grid as shown in figure 4.1. Every node knows its two neighbours and has one directed random edge.

Figure 4.1.: Grid: 1D, 2 neighbours, 1 random edge

#### 4.1.1 Plan

We will generate a random graph as described in section 4.1, select two random nodes `s` and `t` and perform a decentralized search to get from `s` to `t`. We will use the terms `c` and `d(a,b)` as introduced in section 2.1. The number of steps necessary to reach `t` from `s` using decentralized search will be denoted by the random variable `X`. Our goal is to show, that the expected value of `X`, `E[X]`, is relatively small.

#### 4.1.2 Phases

We say that the decentralized search is in phase `j`, when:
`2j ≤ d(c,t) < 2j+1`

Figure 4.2 illustrates the fact, that the maximum number of phases is `log2(n)`. We will now show, that it does not take long to reduce the distance to the target by factors of 2. Such a reduction would of course result in entering the next phase.

Figure 4.2.: Phases

We can break down our random variable `X` into a sum of random variables `Xj`, that each count the number of steps in a certain phase:
`X = X1 + X2 + ... + Xlog2(n)`

The same can be done for the expected value:
`E[X] = E[X1] + E[X2] + ... + E[Xlog2(n)]`

We will now show, that the number of steps in each phase is proportional to `log2(n)`, or:
`E[Xi] ∝ log2(n), ∀; 1 ≤ i ≤ log2(n)`

And from the fact that `E[X]` is a sum of `log2(n)` terms, each proportional to `log2(n)`, we will be able to conclude, that `E[X]` is proportional to `(log2(n))2`.

#### 4.1.3 Normalization Constant

At this point it is necessary to determine the exact constant of proportionality. So far all we know is, that the probability for the existence of a random edge between two nodes `a` and `b` is inversely proportional to their distance, or:
`P((a, b) ∈ E) ∝ 1 / d(a,b)`

We will now determine for a node `a ∈ V`:
`Z = ∑b ∈ V\{a} 1 / d(a,b)`

The term `Z` will then be used to calculate the exact probability, via:
`P((a,b) ∈ E) = 1/Z × 1/d(a,b), ∀; a,b ∈ V with d(a,b)>1`

##### 4.1.3.1 Calculation of Z

When `N` is the number of nodes on the grid, then `⌊N/2⌋` is the maximum distance at which any node can be from the current node `c`. And at every distance between `1` and `⌊N/2⌋` there is is maximum of two nodes, confer figure 4.3.

`Z ≤ 2 × ∑1n/2 1/i`

Figure 4.3.: 1D-grids with even and uneven number of nodes.

And since we only need to gauge the maximum:
`Z ≤ 2 × log2(n)`

The resulting probability is:
`P((a,b) ∈ E) ≥ (1 / (2 × log2(n))) × (1/d(a,b))`

#### 4.1.4 Analysis of the time spent in one phase

The search is in phase `j` as long as the following formula holds true:
`2j ≤ d < 2j+1 for d = d(c,t)`

We will now show, that the following requirement for leaving the current phase is met pretty soon:
`d(cnew, t) < 2j`

##### 4.1.4.1 How to leave a phase immediately

If the edge `(c,v)` exists and the distance between `v` and `t` is half or less then half the distance between `c` and `t`, `c` is the last node in the current phase. We will now show, that this halving of `d(c,t)` happens relatively often.

To do so, we introduce `I` as the set of nodes, that are beneficial for us, with:
`I = { x ∈ V | d(x,t) ≤ d/2 }`

We will now look at the probability for the existence of an edge `(c,y)` for a node `y ∈ I`.

As figure 4.4 illustrates, the number of nodes in `I` is `d+1`

Figure 4.4.: Number of nodes at distance `d/2` from `t`

As exemplified in figure 4.5, the maximum distance from `c` to a node in `I` is `3 × d/2`

Figure 4.5.: Maximum distance from `c` to a node in `I`

###### 4.1.4.1.1 From distance to probability

We know:
`P((c,w) ∈ E) = (1/(2 × log2(n)) × (1/d(c,w)), ∀; w ∈ I`

We can gouge `d(c,w)` with its maximum `3 × d/2`, which gives us:
`P((c,w) ∈ E) ≥ 1 / (3 × d × log2(n))`

Since we know that `|I| ≥ d`, we can conclude:
`∑w ∈ I P((c,w) ∈ E ) ≥ 1 / (3 × log2(n))`

At this point we need to keep in mind that, if the edge `(c,w)` exists for a `w ∈ I`, then the next phase will be reached in one step. In every step of phase `j`, there is a chance of at least `1 / (3 × log2(n))` for this to happen. These chances are independent. So to remain in phase `j` for `i` steps, the decentralized search would have to fail `(i-1)` times in a row.
This results in the following probability:
`P(Xj ≥ i) ≤ (1 -(1/(3 × log2(n))))i-1`

With `Xj ≥ i` meaning, that phase `j` requires at least `i` steps.

#### 4.1.5. Expected value of time spent in each phase

Normally we express the expected value for the random variable `X` like this:
`E[Xj] = 1 × P(Xj = 1) + 2 × P(Xj = 2) + 3 × P(Xj = 3) + ...`

This can also be expressed the following way:
`E[Xj] = P(Xj ≥ 1) + P(Xj ≥ 2) + P(Xj ≥ 3) + ...`

If combined with
`P(Xj ≥ i) ≤ (1-(1/(3 × log2(n)))i-1`

we get the result
`E[Xj] ≤ 3 × log2(n)`

We have now proven that the time spent in each phase of decentralized search is proportional to `log2(n)`, or:
`E[Xj] ∝ log2(n)`

#### 4.1.6 Final step

We have shown that `E[X]` is the sum of `log2(n)` terms `E[Xi]` and that each of these terms `E[Xi]` is proportional to `log2(n)`. The combination of both facts results in the conclusion, that `E[X]` is proportional to `(log2(n))2`, or formal:
`E[X] ∝ (log2(n))2`

We showed that `q=1` works pretty good in one dimension. We still need to show, that all other exponents are less effective and that `q=n` is very effective in n dimensions.

### 4.2 Why other exponents are less effective

We will again focus on the one-dimensional case. For `q = 1` we argued, that it is easy to enter ever shrinking sets of nodes around the target node `t`. For `q<1` we will argue, that there is a set around `t` that is very hard to reach with decentralized search.

#### 4.2.1. Why q<1 is bad

We focus on `q=0`, but similar arguments can be found for all `q` less than `1`.
Let
`K = {v ∈ V | d(v,t) ≤ √(n)}`

Then the probability for `s` to be in `K` is the number of nodes in `K`, divided by the total number of nodes in the grid:
`P(s ∈ K) ≤ |K|/|V|`

Since the distribution of random links is uniformly random, the probability for any node to have a long-range link to a node in `K` is the number of nodes in `K`, divided by the total number of nodes in the grid:
`P((c,v) ∈ E|v ∈ K) ≤ |K|/|V|`

Since we know, that `|K|/|V| = 2/√(n)` and `limn → ∞ 2/√(n) = 0`, we now know, that it is very unlikely for `s` to be in `K` and in every stop of the decentralized search it is very unlikely for the current node to have a long range connection to a node in `K`. Thus we expect to require `√(n)/2` steps to reach K. This is necessary in order to reduce the distance to `t` significantly. We can conclude
`E[X] ∝ √(n)`

The argument for all `q` less than `1` is very similar [9].

#### 4.2.2. Why q>1 is bad

When we have a large clustering coefficient the average long-range links are relatively short. Therefore it takes a long time to find links that span sufficient distances. This makes it hard to quickly reduce the distance.

#### 4.2.3. Generalization

We can generalize, that
`∀q≠1 ∃c>0 : E[X] ∝ nc`

### 4.3. Decentralized search in higher dimensions

We will now take a look at the n-dimensional case.

#### 4.3.1. 1D

In section 4.1 we only used the fact, that we were in the one-dimensional case, in two steps: When calculating `Z` and when calculating the number of nodes within distance `d/2` from `t`.
`d` cancelled `1/d` when we calculated to probability to halve the distance. Thus, the probability is proportional to `1/log2(n)`.

#### 4.3.2. Adapt 1D to 2D

The number of nodes within distance `d/2` from `t` will be proportional to `d2`. This `d2` will be cancelled by the `1/d2` from the probability:
`P((c,y) ∈ E) ≥ (d2 / (d2 × log2(n))) = 1 / log2(n), for any y ∈ I`

This is essentially the same argumentation we used to distribute our random links uniformly over all scales. The argumentation for dimensions `≥3` is similar.

## 5 Our Simulations

We performed some simulations in order to better understand how import the choice of the right clustering component is and how its impact depends on the total number of nodes. Additionally we looked at the impact increasing or reducing the number of random edges per node has on the average length of decentralized search (ALDS) and tried to combine two different clustering components in one graph.

All simulations take place in the one-dimensional case, with every node only knowing its two direct neighbours. Every value presented in this chapter is the average of `10` graphs, with `100` searches performed on each.

If not explicitly mentioned otherwise, the x-axis is the value for q and the y-axis is the ALDS.

When evaluating these results, one must keep in mind, that the ALDS on a circle with `10,000` nodes and no random edges is `2,500`.

### 5.1 Simulation 1

In the first simulation, graphs with sizes between `10,000` and `200,000` nodes were used. Figure 5.1 shows the ALDS depending on `q` for a graph with `10,000` and a graph with `25,000` notes. Function names represent graphs with thousandfold nodes.

Based on these results, we didn't calculate the ALDS for `q` greater `1.5` and focused on values between `0.0` and `1.5`. The result can be seen for six graphs of different sizes in figure 5.2.

Values between `0.7` and `1.1` deliver the best results, with `0.9` and `1.0` being optimal. This is in correspondence with the simulation mentioned in section 2.1.2. Apparently, the positive effect of random edges on ALDS is bigger for graphs with more nodes.

Figure 5.1.: ALDS for graphs with `10,000` and `25,000` nodes for `q`-values between `0.0` and `3.0`.

Figure 5.2.: ALDS for graphs with `10,000` to `200,000` nodes for `q`-values between `0.0` and `1.5`.

Since the curves for the graphs have roughly the same shape we will now focus on graphs with `10,000` nodes.

### 5.2 Simulation 2

In our second simulation only every `k`-th node was allowed to form one outgoing random link. As can be seen in figure 5.3 a decrease in random-links increases the ALDS significantly. One striking detail can be observed:

With so few random edges, there is almost no difference between values from `0.0` up to `0.7` for `q`. Additionally, these values yield better results than the formerly optimal value of `1.0`. This can be explained by the fact, that a `q` between `0.0` and `0.7` leads to more randomness and thus more long distance edges, than a value of `1.0`. Since there are very few random edges to begin with, the formerly negative factor of finding to many unnecessary long range edges, does not play a role anymore.

Figure 5.3.: ALDS for graphs with only one random edge per `10`, `50` and `100` nodes.

### 5.3 Simulation 3

In our third experiment, we gave every node multiple random links. For the results see figure 5.4. Function names represent graphs with `x`-many random links per node.

Figure 5.4.: ALDS for graphs with up to `100` random links per node.

As could be expected, adding more random edges reduced the ALDS tremendously. But only up to `75` random links per node. At this point every node is connected to 0.77% of all nodes in the graph, which could be some kind of saturation. A model, that can predict how much the ALDS benefits from adding more random edges per node, depending in the used clustering coefficient and the total number of nodes, could be of interest for further research.

### 5.4 Simulation 4

For the forth simulation we looked at how the combination of two different values for `q` in uences the ALDS. For every node in the graph, there was a 50% chance to get his random edge target calculated by either `q1` or `q2`. This was done for all combinations of `q1,q2 ∈ {0.0, 0.1, ..., 1.9, 2.0}`. The normalized result can be seen in figure 5.5. Normalized in this case means that each ALDS has been reduced by `39`, the previously established optimal value, resulting in values between `0` and `380`.

Clearly, values between `0.6` and `1.0` generate extremely good results in every combination with values between `0.6` and `2.0`.

Figure 5.5.: ALDS for combinations of two clustering components.

Values of `0.0` and `0.1` never yield good results, regardless of the value they are combined with. However they never perform extremely bad either. No combination of `qs` yielded better results than the previous optimum.

## 6 Conclusion

We motivated the use of decentralized search as an approximation of search in social networks and showed that a clustering component of `n` is the best choice to generate `n`-dimensional Watts-Strogatz grids, that yield good results for decentralized search. Additionally we explained how to model networks that base homgenity on geographic proximity and can handle variations in population density. In our own research, we analysed how the total code count and the number of random edges per node in uence the average length of decentralized search and in some cases can even alters, which values for `q` are optimal.

For future research, we propose to look more closely at the following aspect: In artificial real-world networks, that perform decentralized search, the cost of nodes and edges is a very important factor. Examples for this are ad-hoc networks that enable mobile phones to work in an area recently hit by an earthquake. Or a network of buoyages that measure data for a tsunami warning system. In these real-world examples, every node only has a limited range and needs to send a message through other nodes in order to reach a special node that has a satellite uplink or a similar connection to entities outside the network. In both examples, letting every node perform a decentralized search might be a more efficient option, than using an extra software layer to implement routing.

Since resources are limited and even the forming of wireless connections consumes energy, knowing how thinly the nodes can be distributed and how many connections every node needs, would be helpful information.

Findings that could factor into this are our results from the simulation in section 5.2 that clearly show, that `q=n` is not the optimal choice for networks with very few random edges. And the results from our simulation in section 5.3 suggest, that there is a rather low threshold for the benefit gained from additional random edges.

## 7 References

[1] S. Milgram: The small-world problem. Psychology Today, 2:60-67, 1967.

[2] J. Travers and S. Milgram: An experimental study of the small world problem. Sociometry, 32(4):425-443, 1969.

[4] J. Kleinberg: Navigation in a small world. Nature, 406:845, 2000.

[5] D. Liben-Nowell, J. Novak, R. Kumar, P. Raghavan and A. Tomkins (from MIT, IBM, Yahoo and others):G eographic routing in social networks. Proceedings of the National Academy of Sciences of the United States of America. 2005
http://www.pnas.org/content/102/33/11623.long, visited 06/02/2012

[6] S. Milgram: Psychol. Today 1, 61:67. 1967

[7] P. S. Dodds, R. Muhamad and D. J. Watts: Science 301, 827:829. 2003

[8] D. Easly and J. Kleinberg (both from Cornell University): Networks, Crowds and Markets
http://www.cs.cornell.edu/home/kleinber/networks-book/, visited 02/13/2012

[9] J. Kleinberg (Cornell University): The small-world phenomenon: an algorithmic perspective. 32nd ACM Symposium on Theory of Computing. 2000