sequence data mining

sequence data mining | springerlink

Many interesting real-life mining applications rely on modeling data as sequences of discrete multi-attribute records. Existing literature on sequence mining is partitioned on application-specific boundaries. In this article we distill the basic operations and techniques that are common to these applications. These include conventional mining operations, such as classification and clustering, and sequence specific operations, such as tagging and segmentation. We review state-of-the-art techniques for sequential labeling and show how these apply in two real-life applications arising in address cleaning and information extraction from websites.

sequence mining - an overview | sciencedirect topics

Sequence mining has already proven to be quite beneficial in many domains such as marketing analysis or Web click-stream analysis [19]. A sequence s is defined as a set of ordered items denoted by <s1, s2, ..., sn>. In activity recognition problems, the sequence is typically ordered using timestamps. The goal of sequence mining is to discover interesting patterns in data with respect to some subjective or objective measure of how interesting they are. Typically, this task involves discovering frequent sequential patterns with respect to a frequency support measure.

The task of discovering all the frequent sequences is not a trivial one. In fact, it can be quite challenging due to the combinatorial and exponential search space [19]. Over the past decade, a number of sequence mining methods have been proposed that handle the exponential search by using various heuristics. The first sequence mining algorithm was called GSP [3], which was based on the Apriori approach for mining frequent itemsets [2]. GSP makes several passes over the database to count the support of each sequence and to generate candidates. Then, it prunes the sequences with a support count below the minimum support.

Many other algorithms have been proposed to extend the GSP algorithm. One example is the PSP algorithm, which uses a prefix-based tree to represent candidate patterns [38]. FREESPAN [26] and PREFIXSPAN [43] are among the first algorithms to consider a projection method for mining sequential patterns, by recursively projecting sequence databases into smaller projected databases. SPADE is another algorithm that needs only three passes over the database to discover sequential patterns [71]. SPAM was the first algorithm to use a vertical bitmap representation of a database [5]. Some other algorithms focus on discovering specific types of frequent patterns. For example, BIDE is an efficient algorithm for mining frequent closed sequences without candidate maintenance [66]; there are also methods for constraint-based sequential pattern mining [44].
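The projection idea behind PREFIXSPAN can be sketched in a few lines. The version below is a simplified illustration, not the published algorithm: it assumes each sequence is a list of single items rather than itemsets, and it copies suffixes instead of using pseudo-projection. The function name and parameters are chosen here for illustration only.

```python
from collections import Counter
from typing import Dict, List, Tuple


def prefixspan(database: List[List[str]], min_support: int) -> Dict[Tuple[str, ...], int]:
    """Return every frequent sequential pattern with its support count."""
    patterns: Dict[Tuple[str, ...], int] = {}

    def mine(prefix: Tuple[str, ...], projected: List[List[str]]) -> None:
        # Count each item at most once per projected sequence.
        counts: Counter = Counter()
        for suffix in projected:
            counts.update(set(suffix))
        for item, count in counts.items():
            if count < min_support:
                continue  # prune infrequent extensions
            new_prefix = prefix + (item,)
            patterns[new_prefix] = count
            # Project: keep only what follows the first occurrence of the item.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(new_prefix, new_projected)

    mine((), database)
    return patterns
```

Each recursive call works on a smaller projected database, so the full database is never rescanned for longer patterns.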

This chapter presents a high-level overview of mining complex data types, which includes mining sequence data such as time series, symbolic sequences, and biological sequences; mining graphs and networks; and mining other kinds of data, including spatiotemporal and cyber-physical system data, multimedia, text and Web data, and data streams. The focus is on trends and research frontiers in data mining. An overview of methodologies for mining complex data types is presented. Such mining includes mining time series, sequential patterns, and biological sequences; graphs and networks; spatiotemporal data, including geospatial data, moving-object data, and cyber-physical system data; multimedia data; text data; Web data; and data streams. Other approaches to data mining, including statistical methods, theoretical foundations, and visual and audio data mining, are briefly introduced. Several well-established statistical methods have been proposed for data analysis, such as regression, generalized linear models, analysis of variance, mixed-effect models, factor analysis, discriminant analysis, survival analysis, and quality control. Data mining applications in business and in science, including the financial, retail, and telecommunication industries, science and engineering, and recommender systems, are introduced. The social impacts of data mining are discussed, including ubiquitous and invisible data mining, and privacy-preserving data mining. Finally, current and expected data mining trends that arise in response to new challenges in the field are speculated on.

An important precursor to the task of activity recognition is the discovery phase: identifying and modeling important and frequently repeated event patterns [43]. Two chapters in the book focus on this emerging research area: Rashidi's chapter on "Stream Sequence Mining and Human Activity Discovery" and "Learning Latent Activities from Social Signals with Hierarchical Dirichlet Processes" by Phung et al. Rashidi's chapter discusses the problem of analyzing activity sequences in smart homes. Smart homes are dwellings equipped with an array of sensors and actuators that monitor and adjust home control system settings to improve the safety and comfort of the inhabitants. Key advances in this area have been driven by several research groups who have made activities of daily living (ADL) datasets publicly available [48,71,70]. Rashidi's work was conducted using data from the CASAS testbed at Washington State [56]; examples of other smart environment projects include Georgia Tech's Aware Home [1] and MIT's House_n [68].

Smart environments pose a challenging data-analysis problem because they output nonstationary streams of data; new elements are continuously generated and patterns can change over time. Many activity discovery approaches (e.g., Minnen et al. [43] and Vahdatpour et al. [72]) use time-series motif detection, the unsupervised identification of frequently repeated subsequences, as an element in the discovery process. The term motif originated from the bioinformatics community, in which it is used to describe recurring patterns in DNA and proteins. Even though these techniques are unsupervised, they make the implicit assumption that it is possible to characterize the user's activity with one dataset sampled from a fixed period of time. Problems arise when the action distribution describing the user's past activity differs from the distribution used to generate future activity due to changes in the user's habits. Thus, it can be beneficial to continue updating the library of activity models, both to add emerging patterns and to discard obsolete ones.

Rashidi proposes that activity discovery can be modeled as a datastream processing problem in which patterns are constantly added, modified, and deleted as new data arrives. Patterns are difficult to discover when they are discontinuous because of interruptions by other events, and also when they appear in varied order. Rashidi's approach, STREAMCom, combines a tilted-time window data representation with pruning strategies to discover discontinuous patterns that occur across multiple time scales and sensors. In a fixed-time window, older data are forgotten once they fall outside the window of interest; however, with the tilted-time representation, the older data are retained at a coarser level. During the pruning phase, infrequent or highly discontinuous patterns are periodically discarded based on a compression objective that accounts for the pattern's ability to compress the dataset. The chapter presents an evaluation of STREAMCom's performance on discovering patterns from several months of data generated by sensors within two smart apartments.

The second chapter in this part, by Phung et al., describes a method for analyzing data generated from personal devices (e.g., mobile phones [37], sociometric badges [49], and wearable RFID readers [18]). Wearable RFID readers, such as Intel's iBracelet and iGlove, are well suited for reliably detecting the user's interactions with objects in the environment, which can be highly predictive of many ADL [51]. Sociometric badges are wearable electronic devices designed to measure body movement and physical proximity to nearby badge wearers. The badges can be used to collect data on interpersonal interactions and study community dynamics in the workplace. Two datasets of particular importance, Reality Mining [16] and Badge [49], were released by the MIT Human Dynamics lab to facilitate the study of social signal processing [50].

Phung et al. describe how a Bayesian nonparametric method, the hierarchical Dirichlet process [69], can be used to infer latent activities (e.g., driving, playing games, and working on the computer). The strength of this type of approach is twofold: (1) the set of activity patterns (including its cardinality) can be inferred directly from the data and (2) statistical signals from personal data generated by different individuals can be combined for more robust estimation using a principled hierarchical Bayesian framework. The authors also show how their method can be used to extract social patterns such as community membership from the Bluetooth data that captures colocation of users in the Reality Mining dataset. The activity discovery techniques described in these two chapters will be of interest to readers working with large quantities of data who are seeking to model unconstrained human activities using both personal and environmental sensors.

Based on the personal data collected on each mobile device, we have devised Proactive Suggestion (PS), an application that makes context-aware recommendations. In Fig. 3.2, the individual components of the PS are laid out.

The analytics engines of PS produce hierarchical personal data that are interdependent. Raw data such as GPS coordinates, call logs, application usage, and search queries are fed to a Cooccurrence Analysis engine, which is responsible for identifying activities that occurred at the same time [18]. For example, the Cooccurrence Analysis engine may recognize that a user listens to live streaming music while walking in the park. Given such cooccurrence data, the Sequence Mining engine can infer causal relationships between personal activities that occurred over time [19]. The recognized sequential patterns can be fed into the Predictive Analysis engine to assess the probability of a particular activity taking place in a certain context [19].

Fig. 3.3 illustrates how PS implements the device/cloud collaboration framework. The master device can discover neighboring devices that the end user is authorized to use (Device Discovery). The master device can send over the data to one of the neighboring devices that has sufficient compute capacity (Device Binding). The neighboring device can retrieve an appropriate analytics engine for processing the data sent by the master device (Application Component Sharing). In this example, the highlighted pieces of data on the master device are shared between cloud and neighboring devices.

Note that the PS application initially opted for Hierarchical Data Sandboxing as an explicit and declarative privacy-protection method. We could not afford to run an alternative privacy-protection method based on data obfuscation, due to the limited resources on the device, which was already bogged down by the analytics work. However, recall that our framework is flexible enough to allow user-defined cost functions. For example, if the cost of running an analytics operation (e.g., the cost of consuming battery life) is excessive, then the Device/Cloud Selection module in the framework may decide to transfer the analytics task to the cloud or simply wait for the battery level to rise above the configured thresholds. It turned out that transferring the data over the network consumed as much energy as running the analytics operation within the device. Thus, the Device/Cloud Selection module opted for waiting until the battery was charged above the configured level.

A modern mine is an extremely capital-intensive industrial operation, and a major fraction of the initial investment is the acquisition of the mining equipment. In addition, during the life of the mine, equipment replacements or additions usually will become necessary. Equipment selection therefore is an integral part of mining engineering.

Mine equipment selection is a complex process, continuing for nearly the entire life of a mine. It must be integrated with overall mine planning and design, and it must take into account numerous site-specific factors, such as climate, power supply, and labor supply, as well as rock and ore conditions, mine size, etc. The requirement that equipment selection must be integrated with mine planning is illustrated by Fig. 8: The feasibility of the planned mining sequence depends entirely on the availability of a dragline that can strip the coal seam to the proposed depth and spoil the overburden to the required distance and height. This figure also illustrates the next complication, the fact that several entirely different types of equipment often can be used to accomplish the same goal. In the case under consideration, it might be possible to strip with a large stripping shovel, with a bucketwheel excavator (Fig. 16), or with a fleet of scrapers (Fig. 5). Although in many cases particular site conditions (e.g., rock hardness, depth, rainfall) will dictate or strongly favor a particular type of equipment, in many other cases only detailed analysis will provide the necessary information to allow an optimum selection.

Once mine design has been narrowed down to the fairly specific description of unit operations, several choices usually remain. At this point it becomes particularly important to match the equipment to the specific task it will be performing. One aspect of the procedure followed is illustrated for stripping and for loading equipment in Fig. 28 and in Tables VII and VIII. The figures and tables present general specifications of the equipment, describing the range of parameters over which it can perform, as well as the space requirements for the equipment. Such figures and tables are available from all manufacturers who supply the type of equipment under consideration. In general, a range of individual equipment models will be able to perform the required task. Final selection usually is based on extensive computer simulations, matching various types of equipment (e.g., shovels, trucks) with each other as well as with the range of mining conditions expected.

FIGURE 28. Working ranges for a large hydraulic excavator (DEMAG H241). Boom and bucket can be selected for the excavator to dig primarily at grade (left), or below grade (middle and right), over a longer range (right), but then at a reduced bucket load capacity compared to the shorter range (middle). Numerical values of the range parameters and bucket capacities are given in Tables VI and VII. [Figure courtesy of Mannesman Demag Baumaschinen, Düsseldorf, West Germany.]

Our activity discovery method (ADM) performs frequent sequence mining using DVSM and then maps the similar discovered patterns onto clusters. We use DVSM to find sequence patterns from discontinuous instances that might also be misplaced (exhibit varied order). As an example, DVSM can extract the pattern <a, b> from instances {b, x, c, a}, {a, b, q}, and {a, u, b}. Our approach is different from frequent itemset mining because we consider the order of items as they occur in the data. Unlike many other sequence mining algorithms, we report a general pattern that comprises all variations of a single pattern that occur in the input dataset D; we also report the core pattern that is present in all these variations. For general pattern a, we denote the ith variation of the pattern as a_i, and the core pattern as a_c. We also refer to each single component of a pattern as an event (such as a in the pattern <a, b>).

To find these discontinuous order-varying sequences from the input data D, DVSM first creates a reduced dataset Dr containing the top most frequent events. Next, DVSM slides a window of size 2 across D to find patterns of length 2. After this first iteration, the whole dataset does not need to be scanned again. Instead, DVSM extends the patterns discovered in the previous iteration by their prefix and suffix events, and will match the extended pattern against the already discovered patterns (in the same iteration) to see if it is a variation of a previous pattern or if it is a new pattern [7]. To facilitate comparisons, we save general patterns along with their discovered variations in a hash table.
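The first iteration and the hash table of variations can be illustrated roughly as follows. This is a simplified sketch, not DVSM itself: it assumes that two length-2 patterns are variations of the same general pattern exactly when they contain the same events in a different order, and the function name is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def length2_patterns(events: List[str]) -> Dict[Pair, Counter]:
    """Slide a window of size 2 over the events and group order variations."""
    table: Dict[Pair, Counter] = defaultdict(Counter)
    for first, second in zip(events, events[1:]):
        # Assumption: the sorted pair serves as the key of the general pattern.
        general = tuple(sorted((first, second)))
        # Each observed ordering is stored as a variation with its frequency.
        table[general][(first, second)] += 1
    return table
```

Later iterations would extend the stored patterns by neighboring events instead of scanning the data again; that step is omitted here.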

To see if two patterns should be considered as permutations of the same pattern, we use the Levenshtein distance [41] to define a similarity measure sim(A,B) between the two patterns. The edit distance, e(A,B), represents the number of edits (e.g., event insertions, deletions, and substitutions) required to transform an event sequence A into another event sequence B. We define the similarity measure based on the edit distance as in Eq. (19-1).
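Eq. (19-1) is not reproduced in this text, so the sketch below assumes one common normalization: the edit distance divided by the length of the longer sequence, subtracted from 1. The edit distance itself is the standard Levenshtein dynamic program; the function names are chosen for illustration.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity(a: Sequence, b: Sequence) -> float:
    """Assumed normalization: 1 - e(A, B) / max(|A|, |B|)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - edit_distance(a, b) / longest
```

Under this assumed definition, identical sequences score 1 and sequences that share no aligned events score 0.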

At the end of each iteration, we prune infrequent variations of a general pattern, as well as infrequent general patterns. We identify general patterns as of interest if they satisfy inequality (19-2), and variation i of the pattern as interesting if it satisfies inequality (19-3). In these inequalities, DL computes the description length of the data D, of the pattern a, and of the dataset compressed by replacing occurrences with a pointer to the pattern definition. C and C_v are minimum compression value thresholds.

Our approach to identifying interesting patterns aligns with the minimum description length principle [42], which advocates that the pattern which best describes a dataset is the one which maximally compresses the dataset by replacing instances of the pattern by pointers to the pattern definition. However, since we allow discontinuities to occur, each instance of the pattern needs to be encoded not only with a pointer to the pattern definition but also with a discontinuity factor. The discontinuity of a pattern instance is calculated as the number of bits required to express how the instance varies from the general definition.

To understand what the discontinuity function measures, consider a general pattern <a, b, c> as shown in Figure 19-2. An instance of the pattern is found in the sequence {a, b, g, e, q, y, d, c}, where symbols g, e, q, y, d separate the pattern subsequences {a, b} and {c}. Though this sequence may be considered as an instance of the general pattern <a, b, c>, we still need to take into account the number of symbols that appear between subsequences {a, b} and {c}. In terms of calculating a pattern's compression, discontinuities increase the description length of the data because the way in which the pattern is broken up needs to be encoded.

The continuity between component events is defined for each two consecutive events in an instance. For each frequent event e, we record how far apart (or separated, denoted by s_e) it is from the preceding frequent event in terms of the number of events that separate them in D (in the above example, s_c = 5). The event continuity for e is then defined as in Eq. (19-4).
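The separation count can be made concrete with a small helper. This sketch only counts the separating events; it does not compute the event continuity of Eq. (19-4), which is not reproduced here. The function name is hypothetical, and the code assumes the pattern events occur in order in the instance.

```python
from typing import Dict, List


def separations(instance: List[str], pattern: List[str]) -> Dict[str, int]:
    """For each pattern event after the first, count the events between it
    and the preceding pattern event in the instance."""
    gaps: Dict[str, int] = {}
    position = instance.index(pattern[0])
    for event in pattern[1:]:
        # list.index raises ValueError if the event does not occur later on.
        next_position = instance.index(event, position + 1)
        gaps[event] = next_position - position - 1
        position = next_position
    return gaps
```

For the instance {a, b, g, e, q, y, d, c} and the pattern a, b, c, the helper reports a separation of 0 for b and 5 for c, matching the example in the text.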

The greater the separation between two frequent events, the lower the event continuity. Based on event continuity, the instance continuity reflects how continuous the component events of an instance are. The instance continuity for an instance j of a variation a_i is defined as in Eq. (19-5).

The continuity of a general pattern a is defined as the weighted average continuity of its variations, according to Eq. (19-7), where the continuity of each variation a_i is weighted by its frequency f_ai, and n_a is the total number of variations of the general pattern a.

Patterns that satisfy inequality (19-8) are flagged as interesting, as are variations that satisfy inequality (19-9). The rest of the patterns and variations are pruned. In every iteration, we also prune redundant nonmaximal patterns, that is, those patterns that are totally contained in another larger pattern. This considerably reduces the number of discovered patterns. We continue extending the patterns by prefix and suffix until no more interesting patterns are found. A postprocessing step records attributes of the patterns, such as event durations.

an introduction to sequential pattern mining | the data mining blog

In this blog post, I will give an introduction to sequential pattern mining, an important data mining task with a wide range of applications from text analysis to market basket analysis. This blog post is intended as a short introduction. If you want to read a more detailed introduction to sequential pattern mining, you can read a survey paper that I recently wrote on this topic.

Data mining consists of extracting information from data stored in databases to understand the data and/or make decisions. Some of the most fundamental data mining tasks are clustering, classification, outlier analysis, and pattern mining. Pattern mining consists of discovering interesting, useful, and unexpected patterns in databases. Various types of patterns can be discovered in databases, such as frequent itemsets, associations, subgraphs, sequential rules, and periodic patterns.

The task of sequential pattern mining is a data mining task specialized for analyzing sequential data to discover sequential patterns. More precisely, it consists of discovering interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured in terms of various criteria such as its occurrence frequency, length, and profit. Sequential pattern mining has numerous real-life applications because data is naturally encoded as sequences of symbols in many fields such as bioinformatics, e-learning, market basket analysis, texts, and webpage click-stream analysis.

The example database contains four sequences. Each sequence represents the items purchased by a customer at different times. A sequence is an ordered list of itemsets (sets of items bought together). For example, in this database, the first sequence (SID 1) indicates that a customer bought items a and b together, then purchased an item c, then purchased items f and g together, then purchased an item g, and finally purchased an item e.

Traditionally, sequential pattern mining is used to find subsequences that appear often in a sequence database, i.e., that are common to several sequences. Those subsequences are called the frequent sequential patterns. For example, in the context of our example, sequential pattern mining can be used to find the sequences of items frequently bought by customers. This can be useful to understand the behavior of customers and to make marketing decisions.

To do sequential pattern mining, a user must provide a sequence database and specify a parameter called the minimum support threshold. This parameter indicates the minimum number of sequences in which a pattern must appear to be considered frequent and be shown to the user. For example, if a user sets the minimum support threshold to 2 sequences, the task of sequential pattern mining consists of finding all subsequences appearing in at least 2 sequences of the input database. In the example database, many subsequences meet this requirement. Some of these sequential patterns are shown in the table below, where the number of sequences containing each pattern (called the support) is indicated in the right column of the table.

For example, the patterns <{a}> and <{a}, {g}> are frequent and have a support of 3 and 2 sequences, respectively. In other words, these patterns appear in 3 and 2 sequences of the input database, respectively. The pattern <{a}> appears in sequences 1, 2 and 3, while the pattern <{a}, {g}> appears in sequences 1 and 3. These patterns are interesting as they represent some behavior common to several customers. Of course, this is a toy example. Sequential pattern mining can actually be applied to databases containing hundreds of thousands of sequences.
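Support counting for the patterns above can be sketched as follows. Only the first sequence (SID 1) is spelled out in this post; the other three sequences in the code are assumptions, chosen so that the supports match the ones stated above. The helper names are illustrative.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[str]]


def seq(*itemsets: str) -> Sequence:
    """Build a sequence from strings, one character per item: seq("ab", "c")."""
    return [frozenset(s) for s in itemsets]


def contains(sequence: Sequence, pattern: Sequence) -> bool:
    """True if each pattern itemset is a subset of a distinct itemset of the
    sequence, with the order of the itemsets preserved."""
    pos = 0
    for wanted in pattern:
        # Greedily take the earliest itemset that includes the wanted items.
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(database: List[Sequence], pattern: Sequence) -> int:
    """Number of sequences in the database that contain the pattern."""
    return sum(contains(s, pattern) for s in database)


DATABASE = [
    seq("ab", "c", "fg", "g", "e"),  # SID 1, as described in the text
    seq("ad", "c", "b", "abef"),     # SID 2, assumed for illustration
    seq("a", "b", "fg", "e"),        # SID 3, assumed for illustration
    seq("b", "fg"),                  # SID 4, assumed for illustration
]
```

With a minimum support threshold of 2, a pattern is reported as frequent when `support(DATABASE, pattern) >= 2`.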

Another example of an application of sequential pattern mining is text analysis. In this context, a set of sentences from a text can be viewed as a sequence database, and the goal of sequential pattern mining is then to find subsequences of words frequently used in the text. If such sequences are contiguous, they are called n-grams in this context. If you want to know more about this application, you can read this blog post, where sequential patterns are discovered in a Sherlock Holmes novel.

Besides sequences, sequential pattern mining can also be applied to time series (e.g., stock data) when discretization is performed as a pre-processing step. For example, the figure below shows a time series (an ordered list of numbers) on the left. On the right, a sequence (a sequence of symbols) is shown representing the same data after applying a transformation. Various transformations can be used to transform a time series into a sequence, such as the popular SAX transformation. After performing the transformation, any sequential pattern mining algorithm can be applied.
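As a rough illustration of such a transformation, here is a simplified SAX-style discretization using only the Python standard library. It is a sketch under stated assumptions rather than a reference implementation: the series length must be a multiple of the number of segments, and the function name and parameters are chosen for this example.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev
from string import ascii_lowercase
from typing import List


def sax(series: List[float], n_segments: int, alphabet_size: int) -> str:
    """Convert a numeric time series into a string of symbols."""
    if len(series) % n_segments:
        raise ValueError("series length must be a multiple of n_segments")
    # 1. Z-normalize the series (a constant series maps to all zeros).
    mu, sigma = mean(series), pstdev(series)
    normalized = [(x - mu) / sigma if sigma else 0.0 for x in series]
    # 2. Piecewise aggregate approximation: the mean of each segment.
    size = len(series) // n_segments
    paa = [mean(normalized[i * size:(i + 1) * size]) for i in range(n_segments)]
    # 3. Breakpoints that split the standard normal into equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(k / alphabet_size)
                   for k in range(1, alphabet_size)]
    # 4. Map each segment mean to the symbol of the region it falls in.
    return "".join(ascii_lowercase[bisect_right(breakpoints, v)] for v in paa)
```

The resulting string can be split into sequences and passed to any sequential pattern mining algorithm.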

To try sequential pattern mining with your datasets, you may try the open-source SPMF data mining software, which provides implementations of numerous sequential pattern mining algorithms: http://www.philippe-fournier-viger.com/spmf/

It provides implementations of several algorithms for sequential pattern mining, as well as several variations of the problem such as discovering maximal sequential patterns, closed sequential patterns, and sequential rules. Sequential rules are especially useful for the purpose of performing predictions, as they also include the concept of confidence.

There exist several sequential pattern mining algorithms. Some of the classic algorithms for this problem are PrefixSpan, SPADE, SPAM, and GSP. However, in the recent decade, several novel and more efficient algorithms have been proposed, such as CM-SPADE and CM-SPAM (2014), and FCloSM and FGenSM (2017), to name a few. Besides, numerous algorithms have been proposed for extensions of the problem of sequential pattern mining, such as finding the sequential patterns that generate the most profit (high utility sequential pattern mining).

In this blog post, I have given a brief overview of sequential pattern mining, a very useful set of techniques for analyzing sequential data. If you want to know more about this topic, you may read the following recent survey paper that I wrote, which gives an easy-to-read overview of this topic, including the algorithms for sequential pattern mining, extensions, research challenges, and opportunities.

what is sequence mining? (with pictures)

Sequence mining is a type of structured data mining in which the database and administrator look for sequences or trends in the data. This data mining is split into two fields. Itemset sequence mining typically is used in marketing, and string sequence mining is used in biology research. Sequence mining is different from regular trend mining, because the data are more specific, which makes building an effective database difficult for database designers, and it can sometimes go awry if the sequence is any different from the common sequence.

At one point or another, all databases are used to mine for data. This mining helps businesses and research parties find something they need. Usually, they are looking for some sort of trend, but what that trend is and how specific the information is will depend on the database design. In sequence mining, the database is built to find very specific sequences, with little to no variation. This is a unique form of structured data mining in which the database looks through the structured data for similarities.

Sequence mining can be broken into two categories. Itemset mining is used in marketing and business to find specific trends in sales numbers, product types, product placement in a store and the use of a product. These figures are taken and applied to marketing algorithms to help strategize a marketing project and to bolster sales. Information about a product and how it does typically is taken from the database, but the defining aspect of itemset sequence mining is that the sequence is taken from multi-symbol database cells.

String mining is the opposite of itemset mining because it looks at each symbol individually rather than as a cluster. In string mining, the database might be set to find a sequence from a protein source or gene samples. This helps in comparing many gene samples to see whether they are the same or to break down large sequences and find which sequences they contain. Mostly biological and medical research teams use this.

Creating a database for sequence mining can be difficult because, unlike trend mining and other structured data mining, the sequences must specifically match each other. This also leads to the problem of mining for sequences. If the sequence is any different, it won't be recognized, which might make itemset mining more difficult. String mining typically benefits from this, because the slightest difference in a tissue sample could make the organism or whatever the research team is researching completely distinct from other samples.
