Section: Research Program

Big data stream mining and processing

Deriving insights from the Internet of Things (IoT) is recognized as one of the most exciting opportunities for both academia and industry. Advanced analysis of big data streams from sensors embedded in the environment and from wearable or mobile user devices is bound to become a key area of data mining research as the number of applications requiring such processing increases. However, the high speed (velocity) of IoT data streams, combined with their low quality (veracity), challenges traditional machine-learning (ML) approaches, which assume that a good-quality training set is available a priori to learn models that can be applied effectively to new data collected under very similar conditions. As previous work has observed, data quality issues are detrimental to data analysis (US National Research Council. 2013. Frontiers in Massive Data Analysis. The National Academies Press. http://www.nap.edu/openbook.php?record_id=18374).

Good-quality training data are typically the result of thorough pre-processing, comprising data aggregation/integration, data cleaning/normalization, data dimensionality reduction, etc. The offline nature of these data engineering tasks is today one of the biggest technical barriers to high-value, real-time data analytics in various IoT settings (e.g., residential, industrial, urban). Furthermore, existing techniques for data quality management are usually agnostic of the analytical process that is to be applied to the data. For this reason, analysts either clean everything, which is impossible for Big Data, or clean random subsets and hope for the best.

We are interested in studying the following research questions:

(a) Which specific characteristics of data quality (e.g., incomplete, extreme or erroneous values) lead to an improvement, or lack thereof, in the performance of an ML algorithm (e.g., regression, classification)?

(b) How can we identify influential data, i.e., data that are both unusual in the predictor variables and do not follow the general trend of the data relative to a prediction?

(c) To what extent can we automate data pre-processing tasks (in particular cleaning) for specific streaming data analytics scenarios?
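To make question (b) concrete, the sketch below scores each observation with Cook's distance, a classical regression diagnostic that combines leverage (how unusual a point is in the predictor) with the size of its residual (how far it departs from the fitted trend). This is only an illustration under stated assumptions, not the method pursued in this research program: the function name `cooks_distance`, the single-predictor linear model, and the sensor readings are all hypothetical, and the computation is offline rather than streaming.

```python
from statistics import mean


def cooks_distance(x, y):
    """Cook's distance of each point for a least-squares line y ~ a + b*x."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)

    # Ordinary least-squares fit.
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    # Residuals measure departure from the general trend.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)

    # Leverage measures how unusual a point is in the predictor variable.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    # Cook's distance grows with both the residual and the leverage.
    return [
        (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]


# Hypothetical readings: the last point is far out in x and off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 4.0]

distances = cooks_distance(x, y)
most_influential = max(range(len(x)), key=distances.__getitem__)
print(most_influential, [round(d, 3) for d in distances])
```

In this toy data the last observation is flagged as the most influential even though its raw residual is not the largest: because of its high leverage it pulls the fitted line toward itself, which is why residuals alone are a poor guide to influence.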