## Section: New Results

### Other Topics Related to Security or Distributed Computing

#### Detection of distributed denial of service attacks

A Denial of Service (DoS) attack tries to progressively take down an Internet
resource by flooding this resource with more requests than it is capable of
handling. A Distributed Denial of Service (DDoS) attack is a DoS attack triggered
by thousands of machines that have been infected by malicious software,
with the immediate consequence being the total shutdown of the targeted web
resources (*e.g.*, e-commerce websites).
A solution to detect and mitigate DDoS attacks is to monitor network
traffic at routers and to look for highly frequent signatures that might
suggest ongoing attacks.
A recent strategy followed by attackers is to spread their massive flow of
requests over a multitude of routes, so that locally these flows do not
appear frequent, while globally they represent a significant portion of
the network traffic. The term “iceberg” has recently been introduced to describe such an attack, as only a very small part of the iceberg can be observed from any single router. The approach adopted to defend against such attacks is to rely on multiple routers that locally monitor their network traffic and, upon detection of potential icebergs, inform a monitoring server that aggregates all the monitored information to accurately detect icebergs [29]. To prevent the server from being overloaded by all the monitored information, routers continuously keep track of the $c$ (among $n$) most recent high flows (modeled as items) prior to sending them to the server, and discard all the items that appear with a small probability ${p}_{i}$, where the sum of these small probabilities is modeled by a probability ${p}_{0}$. Parameter $c$ is dimensioned so that the frequency at which all the routers send their $c$ last frequent items is low enough to enable the server to aggregate all of them and to trigger a DDoS alarm when needed. This amounts to computing the time needed to collect $c$ distinct items among $n$ frequent ones. A thorough analysis of the time needed to collect $c$ distinct items appears in [16], [15].
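
As a rough illustration of the quantity being analyzed, the sketch below computes the expected number of observations needed to collect $c$ distinct items among $n$ in the simplest possible setting. It assumes, for illustration only, that the $n$ frequent items are equiprobable and that each observation is a discarded item with probability ${p}_{0}$; the function name `expected_collection_time` is hypothetical, and the analysis in [16], [15] is not limited to this uniform case.

```python
from fractions import Fraction


def expected_collection_time(n, c, p0=Fraction(0)):
    """Expected number of draws needed to see c distinct items among n.

    Assumption (ours): each draw yields one of the n frequent items with
    probability (1 - p0) / n each, and a discarded item with probability p0.
    """
    if not 1 <= c <= n:
        raise ValueError("c must satisfy 1 <= c <= n")
    if not 0 <= p0 < 1:
        raise ValueError("p0 must satisfy 0 <= p0 < 1")
    per_item = (1 - p0) / n
    # Once i distinct items are held, a new one shows up with probability
    # (n - i) * per_item, so the waiting time is geometric with mean
    # 1 / ((n - i) * per_item). Summing these means gives the total.
    return sum(1 / ((n - i) * per_item) for i in range(c))
```

With exact fractions, `expected_collection_time(2, 2)` evaluates to 3, and raising ${p}_{0}$ to one half doubles it to 6, which shows how discarded items stretch the collection time.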

#### Metrics Estimation on Very Large Data Streams

Huge data flows have become very common in the last decade. This has motivated the design of online algorithms that allow the accurate estimation of statistics on very large data flows. A rich body of algorithms and techniques has been proposed over the past several years to efficiently compute statistics on massive data streams. In particular, estimating in real time the number of times data items recur in data streams enables, for example, the detection of worms and denial of service attacks in intrusion detection services, or traffic monitoring in cloud computing applications.
Two main approaches exist to monitor massive data streams in real time. The first one consists in regularly sampling the input streams so that only a limited number of data items is kept locally. This allows functions to be computed exactly on these samples. However, the accuracy of this computation with respect to the stream in its entirety fully depends on the volume of data items that has been sampled and on their order in the stream.
In contrast, the streaming approach consists in scanning each piece of data of the input stream on the fly, and in locally keeping only compact synopses or *sketches* that contain the most important information about these data. This approach enables us to derive some data stream statistics with guaranteed error bounds without making any assumptions on the order in which data items are received at nodes. Sketches rely heavily on the properties of hash functions to extract statistics. They vary according to the number of hash functions they use, and the type of operations they use to extract statistics. The *Count-Min sketch* algorithm proposed by Cormode and Muthukrishnan in 2005 so far outperforms all the others in terms of the space and time needed to guarantee an additive $\epsilon $-accuracy on the estimation of item frequencies. Briefly, this technique performs $t$ random projections of the set of items of the input stream into a much smaller co-domain of size $k$, with $k=\lceil e/\epsilon \rceil $ and $t=\lceil \log (1/\delta )\rceil $, in which $0<\epsilon ,\delta <1$.
The user-defined parameters $\epsilon $ and $\delta $ represent respectively the accuracy of the approximation, and the probability with which this accuracy fails to hold (the guarantee holds with probability at least $1-\delta $).
However, because $k$ is typically much smaller than the total number of distinct items in the input stream, hash collisions do occur. This affects the estimation of item frequency when the size of the stream is large.
In this work, we have proposed an alternative approach to reduce the impact of collisions on the estimation of item frequency. The intuition behind our idea is that by keeping track of the most frequent items of the stream, and by removing their weight from that of the items with which these frequent items collide, the over-estimation of infrequent items is drastically decreased [21].
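
For readers unfamiliar with the baseline, the following is a minimal sketch of the standard Count-Min sketch with the dimensions $k$ and $t$ given above. It is not the collision-reducing variant of [21]. The class name, the seed parameter, and the choice of hash family (affine maps modulo a Mersenne prime) are our own assumptions for illustration.

```python
import math
import random


class CountMinSketch:
    """Plain Count-Min sketch: t rows of k counters, one hash per row."""

    _PRIME = (1 << 61) - 1  # Mersenne prime used by the assumed hash family

    def __init__(self, epsilon, delta, seed=0):
        if not (0 < epsilon < 1 and 0 < delta < 1):
            raise ValueError("epsilon and delta must lie in (0, 1)")
        self.k = math.ceil(math.e / epsilon)      # columns per row
        self.t = math.ceil(math.log(1 / delta))   # number of rows
        rng = random.Random(seed)
        # One (a, b) pair per row defines h(x) = ((a*x + b) mod p) mod k.
        self._coeffs = [
            (rng.randrange(1, self._PRIME), rng.randrange(self._PRIME))
            for _ in range(self.t)
        ]
        self._rows = [[0] * self.k for _ in range(self.t)]

    def _columns(self, item):
        x = hash(item) % self._PRIME
        return [((a * x + b) % self._PRIME) % self.k for a, b in self._coeffs]

    def update(self, item, count=1):
        """Record `count` occurrences of `item` in every row."""
        for row, col in zip(self._rows, self._columns(item)):
            row[col] += count

    def estimate(self, item):
        """Return the smallest counter hit by `item`; never an underestimate."""
        return min(row[col] for row, col in zip(self._rows, self._columns(item)))
```

Because colliding items only ever add to a counter, `estimate` can overshoot but never undershoot the true frequency, which is exactly the over-estimation that the approach above aims to reduce.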

We have also proposed a metric, called codeviation, that makes it possible to evaluate the correlation between distributed streams [27]. This metric is inspired by a classical metric in statistics and probability theory, and as such allows us to understand how observed quantities change together, and in which proportion. We then propose to estimate the codeviation in the data stream model. In this model, functions are estimated on a huge sequence of data items, in an online fashion, and with a very small amount of memory with respect to both the size of the input stream and the value domain from which data items are drawn. We give upper and lower bounds on the quality of the codeviation, and provide both local and distributed algorithms that additively approximate the codeviation among $n$ data streams by using a number of bits of space that is sublinear in the size of the value domain from which data items are drawn and in the maximal stream length. To the best of our knowledge, such a metric has not been proposed before.
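
The sketch below gives an exact, non-streaming reference computation for two streams. It assumes that the codeviation is the classical covariance applied to the item-frequency vectors of the two streams; this reading is our assumption, and the precise definition is the one given in [27]. The function name and the explicit `domain` argument are hypothetical. Unlike the algorithms described above, this version stores one counter per item and is therefore linear in the domain size.

```python
from collections import Counter


def codeviation(stream_x, stream_y, domain):
    """Covariance of the item-frequency vectors of two streams.

    Assumed definition (ours): with x_v and y_v the number of occurrences of
    value v in each stream, return the mean over the domain of
    (x_v - mean_x) * (y_v - mean_y).
    """
    domain = list(domain)
    if not domain:
        raise ValueError("domain must not be empty")
    freq_x, freq_y = Counter(stream_x), Counter(stream_y)
    size = len(domain)
    mean_x = sum(freq_x[v] for v in domain) / size
    mean_y = sum(freq_y[v] for v in domain) / size
    return sum(
        (freq_x[v] - mean_x) * (freq_y[v] - mean_y) for v in domain
    ) / size
```

Under this assumed definition, two streams that favor the same values yield a positive result, streams that favor different values yield a negative one, and a stream with uniform frequencies yields zero against any other stream.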

#### Stream Processing Systems

Stream processing systems are today gaining momentum as a tool to perform analytics on continuous data streams. Their ability to produce analysis results with sub-second latencies, coupled with their scalability, makes them the preferred choice for many big data companies.

A stream processing application is commonly modeled as a directed acyclic graph where data operators, represented by nodes, are interconnected by streams of tuples containing data to be analyzed, the directed edges. Scalability is usually attained at the deployment phase, where each data operator can be parallelized using multiple instances, each of which handles a subset of the tuples conveyed by the operator's incoming stream. Balancing the load among the instances of a parallel operator is important as it yields better resource utilization and thus larger throughputs and reduced tuple processing latencies. We have proposed a new key grouping technique targeted toward applications working on input streams characterized by a skewed value distribution [44]. Our solution is based on the observation that when the values used to perform the grouping have skewed frequencies, e.g., they can be approximated with a Zipfian distribution, the few most frequent values (the *heavy hitters*) drive the load distribution, while the remaining, much larger fraction of the values (the *sparse items*) appear so rarely in the stream that the relative impact of each of them on the global load balance is negligible. We have shown, through a theoretical analysis, that our solution provides on average near-optimal mappings using space that is sub-linear in the number of tuples read from the input stream in the learning phase and in the support (value domain) of the tuples. In particular, this analysis presents new results regarding the expected error made on the estimation of the frequency of heavy hitters.
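
The observation above can be illustrated with a deliberately simplified sketch: heavy hitters are placed one by one on the least loaded instance, while sparse items fall back to plain hashing. This is not the algorithm of [44]; in particular it uses exact counters during the learning phase instead of sub-linear space, and all names (`build_key_grouping`, `num_heavy`, the CRC32-based hash) are assumptions made for illustration.

```python
import zlib
from collections import Counter


def _hash_instance(key, num_instances):
    """Deterministic hash-based placement used for sparse items."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_instances


def build_key_grouping(sample_keys, num_instances, num_heavy):
    """Learn a key-to-instance routing function from a sample of keys.

    Returns (route, loads): `route(key)` gives the instance index and `loads`
    is the per-instance load observed on the learning sample.
    """
    counts = Counter(sample_keys)
    heavy = counts.most_common(num_heavy)  # sorted by decreasing frequency
    heavy_keys = {key for key, _ in heavy}

    # Sparse items are hashed; account for the load they already induce.
    loads = [0] * num_instances
    for key, count in counts.items():
        if key not in heavy_keys:
            loads[_hash_instance(key, num_instances)] += count

    # Greedy placement: each heavy hitter goes to the least loaded instance.
    mapping = {}
    for key, count in heavy:
        target = min(range(num_instances), key=loads.__getitem__)
        mapping[key] = target
        loads[target] += count

    def route(key):
        return mapping.get(key, _hash_instance(key, num_instances))

    return route, loads
```

Only the heavy hitters need an explicit table entry, which is why tracking a few frequent values is enough to steer the overall balance in this toy version.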

#### Randomized Message-Passing Test-and-Set

In [30], we have presented a solution to the well-known Test&Set operation in an asynchronous system prone to process crashes. Test&Set is a synchronization operation that, when invoked by a set of processes, returns yes to a unique process and returns no to all the others. Many advances in implementing Test&Set objects have recently been achieved; however, all of them target the shared memory model. In this paper we propose an implementation of a Test&Set object in the message passing model. This implementation can be invoked by any number $p\le n$ of processes, where $n$ is the total number of processes in the system. It has an expected individual step complexity in $O(\log p)$ against an oblivious adversary, and an expected individual message complexity in $O\left(n\right)$. The proposed Test&Set object is built atop a new basic building block, called selector, that selects a winning group among two groups of processes. We propose a message-passing implementation of the selector whose step complexity is constant. We are not aware of any other implementation of the Test&Set operation in the message passing model.

#### Population Protocol Model

The population protocol model, introduced by Angluin et al. in 2006, provides theoretical foundations for analyzing global properties emerging from pairwise interactions among a large number of anonymous agents.
In the population protocol model, agents are modeled as identical and
deterministic finite state machines, *i.e.* each agent can be in a
finite number of states while waiting to execute a transition. When
two agents interact, they communicate their local state, and can move
from one state to another according to a joint transition function.
The patterns of interaction are unpredictable; however, they must be fair, in the sense that any interaction that can possibly occur cannot be avoided forever.
The ultimate goal of population protocols is for all the agents to
converge to a correct value independently of the interaction pattern. Examples of systems whose behavior can be modeled by population protocols range from molecule interactions of a chemical process to sensor networks in which agents, which are small devices embedded on animals, interact each time two animals are in the same radio range.

In this work, we focus on a quite important related question. Namely, is there a population protocol that exactly counts the difference $\kappa $ between the number of agents that initially set their state to $A$ and the number that initially set it to $B$, and can this be done efficiently, that is, with the guarantee that each agent converges to the exact value of $\kappa $ after having triggered a number of interactions that is sub-linear in the size of the system [43]?

We answer this question in the affirmative by presenting an $O\left({n}^{3/2}\right)$-state population protocol that allows each agent to converge to the exact solution by interacting no more than $O(\log n)$ times. The proposed protocol is very simple (as is true for most known population protocols), but is general enough to be used to solve different types of tasks.
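
To make the interaction model concrete, the sketch below simulates a toy population protocol under a uniformly random scheduler. It is not the protocol of [43]: it uses a simple cancellation rule, chosen by us because it keeps the sum of all states equal to $\kappa $ at every step, and no single agent ever learns $\kappa $. The encoding of $A$ as $+1$ and $B$ as $-1$, the function names, and the seed are assumptions made for illustration.

```python
import random


def cancel(u, v):
    """Joint transition function: opposite opinions annihilate each other."""
    if u != 0 and u + v == 0:
        return 0, 0
    return u, v


def simulate(num_a, num_b, max_interactions=1_000_000, seed=0):
    """Run random pairwise interactions until one opinion has disappeared.

    Agents starting in A hold +1 and agents starting in B hold -1, so the
    sum of all states is the difference kappa throughout the execution.
    """
    rng = random.Random(seed)
    states = [1] * num_a + [-1] * num_b
    for _ in range(max_interactions):
        if 1 not in states or -1 not in states:
            break  # no further cancellation is possible
        i, j = rng.sample(range(len(states)), 2)  # two distinct agents
        states[i], states[j] = cancel(states[i], states[j])
    return states
```

Starting from 7 agents in $A$ and 4 in $B$, the run ends with three agents still holding $+1$ and all others holding $0$, so the population as a whole encodes $\kappa =3$ even though individual agents do not know it.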