Stream Mining — Window Models

This article explain various window models for keeping relevant data from continuous stream of data.

Let’s assume we have a stream of tweets coming for a particular hashtag. Initially there were 2 clusters based on emotions, positive and negative we could map the tweets to. However, as the tweets are coming continuously there is a high chance that the sentiments of the tweets are changing, there might be tweets which can not be put into these two clusters and a new cluster of “neutral” sentiments needs to be created. Question arises how do we do know that? Do we keep all the tweets coming? How do we select which tweets to consider? Do I keep only recent tweets? if, yes, which recent ones? What about the older tweets?

Well, to address the same questions there are window models consider part of the design designs. These models provides how you can select tweets from the streams. Below are the popular ones-

Sliding Window Model

Just like you can slide a normal window at home, there you can slide it over tweets. Let’s say I want to keep taking only 3 tweets then over the coming tweets I would always take latest 3 tweets. Now, 3 is my window length. This is a hyperparameter. Something like below-

Source- Lecture Slide LMU[3]

Landmark Window Model

Well, if you want to keep all the tweets from a point till now, you may decide a landmark and keep all the tweets from then. So, below its gonna keep all the tweets from e1 till now. The landmark can be a day/ week or number of tweets in previous landmark. Everything after the landmark is reached is forgotten, in English- if the landmark is set for daily, then tomorrow all the tweets from previous day are removed.

Source- Lecture Slide LMU[3]

Damped Window Model

Just like sliding window model Damped window model also considers latest data but by assigning weights to the each tweets coming. So, the latest coming tweets shall have higher weight then the tweet came last week. The weights are assigned as per below equation-

higher the lambda, less weight to tweet, t is the timepoint tweet arrived.

Source — Survey by Silva et al. [2]

In weighting terms, the weights of all points in a sliding windows and landmark are 1.

References-

[2][Silva et al., 2014] Silva, J., Faria, E., Barros, R., Hruschka, E., De Carvalho, A., and Gama, J. (2014). Data Stream Clustering: A Survey. ACM Computing Surveys, 4.

[3] https://www.dbs.ifi.lmu.de/Lehre/KDD_II/WS1516/skript/KDD2-4-DataStreamsClustering