Nguyễn Tuấn Anh - Blog: Data Mining

04/01/2024

What is KNIME?

KNIME stands for Konstanz Information Miner. It is a free, open-source platform for data analytics, reporting, and integration. KNIME integrates various components for machine learning and data mining through its modular data-pipelining concept, the "Lego of Analytics". A graphical user interface and the use of JDBC allow nodes to be assembled that blend different data sources together.

31/12/2015

A New Method for Generating All Positive and Negative Association Rules (2011)

Association rules play a very important role in current data mining. Usually only positive rules are generated, but negative rules are also useful in today's data mining tasks. In this paper we propose “A new method for generating all positive and negative Association Rules” (NRGA). NRGA generates all the association rules that remain hidden when the Apriori algorithm is applied. To represent negative rules we introduce new names for them: CNR, ANR, and ACNR. We also modify the correlation coefficient (CRC) equation, and the generated results are very promising. First we apply the Apriori algorithm for frequent itemset generation, which also generates the positive rules; then we apply the NRGA algorithm to the frequent itemsets to generate all negative rules, and optimize the generated rules using a genetic algorithm.
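As a rough illustration of the quantities involved, the sketch below computes support and confidence for a positive rule A => B and for its three negated forms, together with the textbook phi correlation coefficient between A and B. The toy transactions and the function name `rule_measures` are assumptions made for this example. The negated forms are labeled generically rather than with the paper's CNR/ANR/ACNR names, and the coefficient is the standard one, not the modified CRC equation from the paper.

```python
from math import sqrt

# Toy transactions (assumed data, for illustration only).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"tea"},
]


def rule_measures(a, b, transactions):
    """Support and confidence of A => B and its negated forms, plus phi."""
    n = len(transactions)
    has_a = [set(a) <= t for t in transactions]
    has_b = [set(b) <= t for t in transactions]
    p_a = sum(has_a) / n
    p_b = sum(has_b) / n
    p_ab = sum(x and y for x, y in zip(has_a, has_b)) / n

    # Supports of the four forms follow from inclusion-exclusion.
    supports = {
        "A => B": p_ab,
        "A => not B": p_a - p_ab,
        "not A => B": p_b - p_ab,
        "not A => not B": 1 - p_a - p_b + p_ab,
    }
    # Probability of each rule's antecedent.
    antecedent = {
        "A => B": p_a,
        "A => not B": p_a,
        "not A => B": 1 - p_a,
        "not A => not B": 1 - p_a,
    }
    confidences = {
        rule: (supports[rule] / antecedent[rule] if antecedent[rule] > 0 else 0.0)
        for rule in supports
    }
    # Standard phi coefficient: positive means A and B tend to co-occur.
    denom = sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    phi = (p_ab - p_a * p_b) / denom if denom > 0 else 0.0
    return supports, confidences, phi


supports, confidences, phi = rule_measures({"bread"}, {"milk"}, TRANSACTIONS)
for rule in supports:
    print(f"{rule}: support={supports[rule]:.2f}, confidence={confidences[rule]:.2f}")
print(f"phi={phi:.3f}")
```

A negative phi would suggest looking at the negated forms of the rule, while a positive phi favors the positive rule; how the paper actually selects rules is not shown here.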

24/07/2015

Efficient mining fuzzy association rules from ubiquitous data streams

Abstract:
Due to the development in technology, a number of applications such as smart mobile phones, sensor networks and GPS devices produce huge amounts of ubiquitous data in the form of streams. Unlike data in traditional static databases, ubiquitous data streams typically arrive continuously, at high speed, in huge volumes, and with a changing data distribution. Dealing with and extracting useful information from that data is a real challenge. This raises new issues that need to be considered when developing association rule mining techniques for these data. It should be noted that real-world data are not represented only in binary and numeric forms; they may also be represented as quantitative values. Thus, fuzzy sets are well suited to handling these values.

IMPROVE EFFICIENCY FUZZY ASSOCIATION RULE USING HEDGE ALGEBRA APPROACH

TRAN THAI SON, NGUYEN TUAN ANH

Institute of Information Technology, Vietnam Academy of Science and Technology; trn_thaison@yahoo.com
University of Information and Communication Technology, Thai Nguyen University; anhnt@ictu.edu.vn

Abstract: A major problem when mining fuzzy association rules from a database (DB) is the large computation time and memory required. In addition, the selection of fuzzy sets for each attribute of the database is very important because it affects the quality of the mined rules. This paper proposes a method for mining fuzzy association rules using a compressed database. We also use the Hedge Algebra (HA) approach to build the membership functions for attributes instead of the usual approach of fuzzy set theory. This allows us to find fuzzy association rules with a relatively simple algorithm that runs faster, yet still yields association rules as good as those of the classical association rule mining algorithms.

Keywords: Data mining, Association rules, Compressed transactions, Knowledge discovery, hedge algebras

04/07/2014

Dataset for Datamining

http://fimi.ua.ac.be/data
http://archive.ics.uci.edu/ml

29/08/2013

Big data in the cloud

Data velocity, volume, variety, and veracity

Sam B. Siewert, Assistant Professor, University of Alaska Anchorage

Summary: Big data is an inherent feature of the cloud and offers unprecedented opportunities to use traditional database information together with social networking data, sensor network data, and, further still, multimedia data. Big data applications require a data-centric architecture, and many solutions include cloud platform APIs to integrate with advanced search, machine learning algorithms, and advanced analytics such as computer vision, video analytics, and visualization tools. This article examines how to use the R language and similar popular tools to analyze big data, and methods for scaling big data services in the cloud. It takes an in-depth look at digital photo management as a basic big data service that uses the key elements of search, analytics, and machine learning on unstructured data.

Data Preprocessing in Data Mining

Nguyễn Văn Chức – chuc1803@gmail.com

1. Introduction to data preprocessing

In the data mining process, processing the data before feeding it into models is essential. This step makes the data initially obtained through data collection (the original data) usable with, and suitable for, a specific data mining model. The specific tasks of data preprocessing include:

Some international research results on association rules, 2011-2012

1. A Study on Efficient Data Mining Approach on Compressed Transaction (2011)

Data mining can be viewed as a result of the natural evolution of information technology. The spread of computing has led to an explosion in the volume of data to be stored on hard disks and sent over the Internet. This growth has led to a need for data compression, that is, the ability to reduce the amount of storage or Internet bandwidth required to handle the data. This paper analyzes the various data mining approaches that are used to compress the original database into a smaller one and perform the data mining process on compressed transactions, such as M2TQT, the PINCER-SEARCH algorithm, the APRIORI & ID3 algorithms, the TM algorithm, AIS & SETM, the CT-Apriori algorithm, CBMine, the CT-ITL algorithm, and FIUT-Tree. Among these techniques, M2TQT uses the relationship between transactions to merge related transactions and builds a quantification table to prune candidate itemsets that cannot become frequent, in order to improve the performance of mining association rules. Thus M2TQT is observed to perform better than existing approaches.

Core idea: compress the data before mining association rules.
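The simplest form of this idea can be sketched as follows: identical transactions are merged into one entry with a count, and support is then computed over the smaller table. This is only an illustration under assumed toy data, and the names `compress` and `support_count` are hypothetical. It is not M2TQT, which merges related (not just identical) transactions and uses a quantification table.

```python
from collections import Counter


def compress(transactions):
    """Merge identical transactions, keeping a count for each distinct one."""
    return Counter(frozenset(t) for t in transactions)


def support_count(itemset, compressed):
    """Number of original transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(count for t, count in compressed.items() if itemset <= t)


# Assumed toy database: five transactions, three of them identical as sets.
transactions = [["a", "b"], ["b", "a"], ["a", "b", "c"], ["a", "b"], ["c"]]
compressed = compress(transactions)

print(len(transactions), "transactions compressed to", len(compressed), "entries")
print("support count of {a, b}:", support_count({"a", "b"}, compressed))
```

Each support query now scans the distinct transactions only, which pays off when the database contains many repeated baskets.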

Applications of hedge algebras in control

Author: Nguyễn Tiến Duy - Department of Computer Engineering - Faculty of Electronics

To build a computational methodology for simulating human thinking and reasoning processes, we must establish a mapping that assigns to each fuzzy concept a fuzzy set in the space of all functions F(U, [0, 1]). In other words, we borrow the very rich computational structure of sets to model human reasoning, which is normally carried out in natural language.
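The mapping described above can be sketched in a few lines: each fuzzy concept is represented by a membership function from the universe U to [0, 1]. The triangular shape, the temperature universe, and the concept names below are all assumptions chosen for illustration; the original post does not specify them.

```python
def triangular(a, b, c):
    """Return a membership function U -> [0, 1] that peaks at b (requires a < b < c)."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu


# Hypothetical fuzzy concepts over a temperature universe U = [0, 40] degrees Celsius.
CONCEPTS = {
    "cool": triangular(0, 10, 20),
    "warm": triangular(10, 20, 30),
    "hot": triangular(20, 30, 40),
}


def fuzzify(x):
    """Degree to which the crisp value x belongs to each fuzzy concept."""
    return {name: mu(x) for name, mu in CONCEPTS.items()}


print(fuzzify(15))
print(fuzzify(20))
```

A reading of 15 degrees belongs partly to "cool" and partly to "warm", which is the kind of graded judgment a fuzzy controller reasons with.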

The fuzzy aggregation problem: a hedge algebra approach

Trần Thái Sơn, Institute of Information Technology, Vietnam Academy of Science and Technology

Nguyễn Tuấn Anh, Thai Nguyen University of Technology, Thai Nguyen University

Abstract:

This paper presents a method for solving the fuzzy aggregation problem using the theory of hedge algebras. The method addresses shortcomings of Herrera's 2-tuple method, which carries out its computation on the order indices of the assessment values. The hedge algebra approach relies on fairly simple calculations and gives more accurate aggregation results, so it can be applied well in fields that require decisions based on experts' assessments of one or more objects.

Keywords: hedge algebras, aggregation, fuzzy theory, ordering index
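For context, the baseline that the abstract refers to can be sketched as follows. In the 2-tuple linguistic representation, assessments are averaged through their order indices, and the result is expressed as the nearest label plus a symbolic translation. The label scale and function names below are assumptions for illustration, and this is the baseline 2-tuple mean only, not the authors' hedge algebra method.

```python
import math

# Assumed ordered label scale s_0 .. s_4 (hypothetical, for illustration).
LABELS = ["very poor", "poor", "fair", "good", "very good"]


def to_two_tuple(beta):
    """Map beta in [0, g] to (label index, symbolic translation in [-0.5, 0.5))."""
    index = int(math.floor(beta + 0.5))  # round half up, unlike Python's round()
    return index, beta - index


def aggregate(ratings):
    """Arithmetic mean of the ratings' order indices, returned as a 2-tuple."""
    indices = [LABELS.index(r) for r in ratings]
    beta = sum(indices) / len(indices)
    return to_two_tuple(beta)


index, alpha = aggregate(["good", "fair", "fair"])
print(f"({LABELS[index]}, {alpha:+.2f})")
```

Because the computation sees only the positions of the labels, it implicitly treats neighboring labels as equally spaced, which is the kind of limitation an approach based on label semantics aims to address.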

09/04/2013

2013: Big Data – a promising market (Part 1)

Big Data is becoming the key for every business. Analysts George Gilbert and Jo Maitland discussed what to expect from this market over the next 12 months and offered some interesting predictions.

What is 'Big Data'?

Humanity now produces about 2.5 quintillion (2.5 × 10^18) bytes of data every day, far more than ordinary desktop computers can handle. Mining this enormous volume to filter out the useful data is one of the greatest challenges of modern society. A new generation of data analysis tools is helping us keep this phenomenon under control, and the term Big Data was coined to describe it.
Big Data is an umbrella term for working with extremely large data sets.
Collecting large amounts of data is not especially hard. Since the 1980s, the world's per-capita data storage capacity has doubled every 40 months. Data now comes from every source imaginable: social media and websites, weather information, digitized multimedia files, online purchase receipts, and countless others. The real challenge is how to process it, because it cannot be processed and analyzed efficiently with ordinary commercial software.

Twitter generates about 12 terabytes of data per day, while the Large Hadron Collider (LHC) produced 13 petabytes in 2010 alone. Even Wal-Mart processes more than 1 million customer transactions per hour. Analyzing these unending data streams to spot trends quickly, track down the Higgs boson, or pinpoint errors that may occur in transmission takes far more computing power than MS Access can provide.

An Efficient Tree-based Fuzzy Data Mining Approach

Chun-Wei Lin, Tzung-Pei Hong, and Wen-Hsiang Lu

Abstract

In the past, many algorithms were proposed for mining association rules, most of which were based on items with binary values. In this paper, a novel tree structure called the compressed fuzzy frequent pattern tree (CFFP tree) is designed to store the related information in the fuzzy mining process. A mining algorithm called the CFFP-growth mining algorithm is then proposed based on the tree structure to mine the fuzzy frequent itemsets. Each node in the tree has to keep the membership value of the contained item as well as the membership values of its super-itemsets in the path. The database scans can thus be greatly reduced with the help of the additional information. Experimental results also compare the performance of the proposed approach both in the execution time and the number of tree nodes at two different numbers of regions, respectively.

Association Rule Mining with WEKA

Nguyễn Văn Chức – chucnv@ud.edu.vn

In data mining, the goal of association rules (AR) is to find associations or correlations among items in large volumes of data. Association rules are widely applied in many fields, especially in business, for example in Market Basket Analysis (cross-selling, product placement, affinity promotion, customer behavior analysis). See the article on applying association rules to Market Basket Analysis at http://bis.net.vn/forums/t/382.aspx
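To make the notion concrete, here is a minimal sketch of Apriori-style mining on a toy set of market baskets: frequent itemsets are found level by level, and rules are kept when their confidence clears a threshold. It is plain Python written for illustration, not WEKA's Apriori implementation, and the basket data, thresholds, and function names are assumptions.

```python
from itertools import combinations

# Assumed toy market baskets (illustration only).
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        level = {c: s for c in current if (s := supp(c)) >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules meeting min_confidence."""
    found = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                confidence = s / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((lhs, itemset - lhs, s, confidence))
    return found


frequent = apriori(BASKETS, min_support=0.6)
for lhs, rhs, s, c in rules(frequent, min_confidence=0.9):
    print(sorted(lhs), "=>", sorted(rhs), f"support={s:.2f} confidence={c:.2f}")
```

The prune step relies on the fact that every subset of a frequent itemset is frequent, which is also why the lookup `frequent[lhs]` in `rules` always succeeds.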

Code association Rule

1. Some references on association rules
http://en.wikipedia.org/wiki/Association_rule_learning
http://cgi.csc.liv.ac.uk/~frans/KDD/Software/Apriori-T_GUI/aprioriT_GUI.html

http://www.kdnuggets.com/software/associations.html

http://www.cse.msu.edu/~cse980/software.html
http://michael.hahsler.net/research/association_rules

2. Reference code for association rules
http://www.codeproject.com/Articles/70371/Apriori-Algorithm
http://www.codeding.com/?article=13
http://cgi.csc.liv.ac.uk/~frans/KDD/Software/FPgrowth/fpGrowth.html
http://cgi.csc.liv.ac.uk/~frans/KDD/aprioriTdemo.html
http://s.pudn.com/search_hot_en.asp?k=fp+growth#
http://www.borgelt.net//software.html

Fuzzy Apriori algorithm: https://cgi.csc.liv.ac.uk/~frans/KDD/Software/FuzzyAprioriT/fuzzyAprioriT.html

Association rule mining code: http://www.codeding.com/?article=13 I found it easy to follow; within a day or two I had read it, understood it, modified it, and had a complete C# implementation working.

FP-Growth code in C#: http://www.myfirm.cn/fptree
FP-Growth code in C++: http://adrem.ua.ac.be/~goethals/software

Reference: algorithms from Herrera's group
http://www.uco.es/grupos/kdis/ARMBibliography/index.html

Some international research results on association rules, 2011-2012

1. A Study on Efficient Data Mining Approach on Compressed Transaction (2011)

Core idea: compress the data before mining association rules.

2. An Efficient Algorithm for Mining Multilevel Association Rule Based on Pincer Search (2012)

Discovering frequent itemsets is a key difficulty in significant data mining applications, such as the discovery of association rules, strong rules, episodes, and minimal keys. The problem of developing models and algorithms for multilevel association mining poses new challenges for mathematics and computer science. In this paper, we present a model for mining multilevel association rules that satisfies a different minimum support at each level; we employ pincer search concepts, a multilevel taxonomy, and different minimum supports to find multilevel association rules in a given transaction data set. This search is used only for maintaining and updating a new data structure. It is used to prune early the candidates that would normally be encountered in the top-down search. A main characteristic of the algorithm is that it does not require explicit examination of every frequent itemset. An example is also given to demonstrate that the proposed mining algorithm can derive multiple-level association rules under different supports in a simple and effective manner.

3. Fast Mining of Fuzzy Association Rules (2012)

Fuzzy association rules described in natural language are well suited to human thinking and help increase flexibility in supporting users who make decisions or design fuzzy systems. However, the efficiency of the algorithms needs to be improved to handle large real-world datasets. In this paper, we present an efficient algorithm named fuzzy cluster-based (FCB) along with its parallel version named parallel fuzzy cluster-based (PFCB). The FCB method creates cluster tables by scanning the database once, and then clusters the transaction records into the i-th cluster table, where the length of a record is i. Moreover, the fuzzy large itemsets are generated by contrasts with the partial cluster tables. Similarly, the PFCB method creates cluster tables by scanning the database once, and then clusters the transaction records into the i-th cluster table, which is on the i-th processor, where the length of a record is i. Moreover, the large itemsets are generated by contrasts with the partial cluster tables. Then, to calculate the fuzzy support of the candidate itemsets at each level, each processor calculates the support of the candidate itemsets in its own cluster and forwards the result to the coordinator. The final fuzzy support of the candidate itemsets is then calculated from these results in the coordinator. We have performed extensive experiments and compared the performance of our algorithms with two of the best existing algorithms.

4. Detection of Fuzzy Association Rules by Fuzzy Transforms (2012)

Ferdinando Di Martino and Salvatore Sessa

Dipartimento di Costruzioni e Metodi Matematici in Architettura, Università degli Studi di Napoli Federico II, Via Monteoliveto 3, 80134 Napoli, Italy. Correspondence should be addressed to Salvatore Sessa, sessa@unina.it. Received 3 March 2012; Revised 25 June 2012; Accepted 25 June 2012. Academic Editor: Irina G. Perfilieva. Copyright © 2012 F. Di Martino and S. Sessa. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We present a new method based on the use of fuzzy transforms for detecting coarse-grained association rules in the datasets. The fuzzy association rules are represented in the form of linguistic expressions and we introduce a pre-processing phase to determine the optimal fuzzy partition of the domains of the quantitative attributes. In the extraction of the fuzzy association rules we use the AprioriGen algorithm and a confidence index calculated via the inverse fuzzy transform. Our method is applied to datasets of the 2001 census database of the district of Naples (Italy); the results show that the extracted fuzzy association rules provide a correct coarse-grained view of the data association rule set.

5. Fuzzy Associative Rule-based Approach for Pattern Mining and Identification and Pattern-based Classification (2011)

Associative Classification leverages Association Rule Mining (ARM) to train Rule-based classifiers. The classifiers are built on high quality Association Rules mined from the given dataset. Associative Classifiers are very accurate because Association Rules encapsulate all the dominant and statistically significant relationships between items in the dataset. They are also very robust as noise in the form of insignificant and low-frequency itemsets are eliminated during the mining and training stages. Moreover, the rules are easy-to-comprehend, thus making the classifier transparent. Conventional Associative Classification and Association Rule Mining (ARM) algorithms are inherently designed to work only with binary attributes, and expect any quantitative attributes to be converted to binary ones using ranges, like “Age = [25, 60]”. In order to mitigate this constraint, Fuzzy logic is used to convert quantitative attributes to fuzzy binary attributes, like “Age = Middle-aged”, so as to eliminate any loss of information arising due to sharp partitioning, especially at partition boundaries, and then generate Fuzzy Association Rules using an appropriate Fuzzy ARM algorithm. These Fuzzy Association Rules can then be used to train a Fuzzy Associative Classifier. In this paper, we also show how Fuzzy Associative Classifiers so built can be used in a wide variety of domains and datasets, like transactional datasets and image datasets.
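The conversion from sharp ranges to fuzzy attributes can be illustrated with a small sketch: each record gets a membership degree in "Age = Middle-aged" and "Income = High", and the fuzzy support of the pair is the average of the per-record minimum. The membership shapes, breakpoints, records, and function names are assumptions for this example, and the minimum is just one common choice of t-norm; the paper's own algorithm is not reproduced here.

```python
def trapezoid(a, b, c, d):
    """Membership rising on [a, b], equal to 1 on [b, c], falling on [c, d]."""
    def mu(x):
        if x <= a or x >= d:
            return 0.0
        if x < b:
            return (x - a) / (b - a)
        if x <= c:
            return 1.0
        return (d - x) / (d - c)
    return mu


def ramp_up(a, b):
    """Membership that is 0 up to a, rises linearly, and is 1 from b onward."""
    def mu(x):
        if x <= a:
            return 0.0
        if x >= b:
            return 1.0
        return (x - a) / (b - a)
    return mu


# Hypothetical fuzzy attributes; the breakpoints are assumptions.
FUZZY_SETS = {
    ("Age", "Middle-aged"): trapezoid(20, 30, 55, 65),
    ("Income", "High"): ramp_up(40, 60),
}

# Assumed toy records (income in thousands).
RECORDS = [
    {"Age": 25, "Income": 50},
    {"Age": 40, "Income": 70},
    {"Age": 62, "Income": 30},
    {"Age": 58, "Income": 60},
]


def fuzzy_support(fuzzy_items, records):
    """Average over records of the minimum membership across the fuzzy items."""
    total = 0.0
    for record in records:
        total += min(FUZZY_SETS[(attr, label)](record[attr]) for attr, label in fuzzy_items)
    return total / len(records)


itemset = [("Age", "Middle-aged"), ("Income", "High")]
print(f"fuzzy support = {fuzzy_support(itemset, RECORDS):.2f}")
```

With a sharp range such as Age = [25, 60], the 62-year-old would count for nothing and the 58-year-old would count fully; the fuzzy version gives them degrees 0.3 and 0.7 instead, which is the boundary effect the abstract describes.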

6. Efficient Parallel Pruning of Associative Rules with Optimized Search (2012)

The main focus of this research work is to propose an improved association rule mining algorithm that minimizes the number of candidate sets while generating association rules, with efficient pruning time and search space optimization. The relative association with a reduced candidate itemset reduces the overall execution time. The scalability of this work is measured by the number of itemsets used in the transactions and the size of the data set. Further, a fuzzy-based rule mining principle is adopted in this work to obtain more informative associative rules and frequent items with increased sensitivity. The requirement for sensitive items is to have a semantic connection between the components of the item-value pairs. The effectiveness of item-value pairs minimizes the search space to its optimality. Optimality of the search space indicates the trade-off between pruning time and the size of the data set.

7. Using Support Vector Machine in Fuzzy Association Rule Mining (2012)

Fuzzy rule-based classification systems are among the most popular approaches to pattern classification problems. The rules in fuzzy models can be weighted to show the importance of the generated rules, while all attributes in the antecedent part of the rules have usually been weighted equally. Because the contributing attributes in a fuzzy model may have different influences on decision making, a new method based on support vector machine-recursive feature elimination (SVM-RFE) is proposed in this study to show the effects of attributes through weighting factors. The Apriori algorithm and fuzzy association rule mining (FARM) are used to generate suitable rules, which are weighted by fuzzy support value. The combination of the proposed method for attribute weighting and the fuzzy support value for weighting the generated rules is used to discriminate the samples of two well-known datasets, iris and wine. The results show that this simple method can increase accuracy and reduce the model's dependency on the fuzzy support value in the Apriori algorithm and on the number of rules.

Nguyễn Tuấn Anh - Blog

04/01/2024

What is KNIME?

31/12/2015

A New Method for Generating All Positive and Negative Association Rules (2011)

24/07/2015

Efficient mining fuzzy association rules from ubiquitous data streams

13/06/2015

IMPROVE EFFICIENCY FUZZY ASSOCIATION RULE USING HEDGE ALGEBRA APPROACH

04/07/2014

Dataset for Datamining

29/08/2013

Big data in the cloud

20/08/2013

Data Preprocessing in Data Mining

17/08/2013

Some international research results on association rules, 2011-2012

13/08/2013

Applications of hedge algebras in control

05/08/2013

The fuzzy aggregation problem: a hedge algebra approach

09/04/2013

2013: Big Data – a promising market (Part 1)

09/11/2012

An Efficient Tree-based Fuzzy Data Mining Approach

24/10/2012

Association Rule Mining with WEKA

17/10/2012

Code association Rule