Global Journal of Enterprise Information System January-June 2012 Volume-4 Issue-I. Theme Based Paper - PDF

Performance Evaluation of Data Mining Clustering Algorithm in WEKA

Mahendra Tiwari, Research Scholar, Department of Comp. Science, UPRTOU Allahabad
Yashpal Singh, Head, Deptt of CSE, BIET Jhansi

ABSTRACT
Data mining is a computerized technology that uses complicated algorithms to find relationships and trends in large databases, real or perceived, previously unknown to the retailer, to promote decision support. Data mining is widely touted, and there is widespread recognition of the potential for analysis of past transaction data to improve the quality of future business decisions. The purpose of this paper is to critique data mining technology by comparing familiar data mining algorithms in the well-known tool Weka, in support of strategic decision making by small to medium size retailers. The context for this study includes current and future industry applications and practices for research performed in data mining applications within the retail sector.

KEYWORDS
WEKA, Algorithm, Cluster, Data Mining

INTRODUCTION
As the data accumulated from various fields grows exponentially, data mining techniques that extract information from huge amounts of data have become popular in commercial and scientific domains, including marketing and customer relationship management. During the evaluation, the input datasets and the number of clusterers used are varied to measure the performance of the data mining algorithms. I present the results based on characteristics such as scalability and accuracy, to identify the algorithms' characteristics in the world-famous data mining tool Weka.
RELATED WORK
I studied various journals and articles regarding performance evaluation of data mining algorithms on various tools; some of them are described here. Ying Liu et al. worked on classification algorithms, Osama Abu Abbas worked on clustering algorithms, and Abdullah compared various classifiers with different types of data sets on WEKA. I present their results as well as the tools and data sets used in performing the evaluations.

Ying Liu, Wei-keng Liao et al., in the article "Performance evaluation and characterization of scalable data mining algorithms" by Ying Liu, Jayaprakash, Wei-keng, and Alok Chaudhary, investigated data mining applications to identify their characteristics in a sequential as well as parallel execution environment. They first established MineBench, a benchmarking suite containing data mining applications. The selection principle is to include categories and applications that are commonly used in industry and are likely to be used in the future, thereby achieving a realistic representation of the existing applications. MineBench can be used by both programmers and processor designers for efficient system design.

They conducted their evaluation on an Intel IA-32 multiprocessor platform, which consists of an Intel Xeon 8-way shared memory parallel (SMP) machine running Linux OS, with 4 GB of shared memory and a 124 KB L2 cache for each processor. Each processor has 16 KB non-blocking integrated L1 instruction and data caches. The number of processors is varied to study the scalability. In all the experiments, they used the VTune performance analyzer for profiling the functions within their applications and for measuring their breakdown execution times. The VTune counter monitor provides a wide assortment of metrics. They looked at different characteristics of the applications: execution time, fraction of time spent in the OS space, communication/synchronization complexity, and I/O complexity. The data comprises 25, records.
This notation denotes that the dataset contains 2,, transactions, the average transaction size is 2, and the average size of the maximal potentially large itemset is 6. The number of items is 1 and the number of maximal potentially large itemsets is 2. The algorithms for comparison are ScalParC, Bayesian, K-means, Fuzzy K-means, BIRCH, HOP, Apriori, and ECLAT.

Fig 1: OS overheads of MineBench applications as a percentage of the total execution time
Fig 2: Percentage of I/O time with respect to the overall execution times

Osama Abu Abbas, in his article "Comparison between data clustering algorithms", compared four different clustering algorithms (K-means, hierarchical, SOM, EM) according to the size of the dataset, the number of clusters, and the type of S/W. The general reasons for selecting these 4 algorithms are:
o Popularity
o Flexibility
o Applicability
o Handling high dimensionality

Osama tested all the algorithms in LNKnet S/W, public domain S/W made available by MIT Lincoln Lab for analyzing data from different data sets. The dataset used to test the clustering algorithms and compare among them is stored in an ASCII file of 6 rows and 6 columns, with a single chart per line. The chart classes are normal, cyclic (11-2), increasing trend (21-3), decreasing trend (31-4), upward shift (41-5), and downward shift (51-6).

T. Velmurugan, in his research paper "Performance evaluation of K-means & Fuzzy C-means clustering algorithm for statistical distribution of input data points", studied the performance of the K-means and Fuzzy C-means algorithms. These two algorithms are implemented and their performance is analyzed based on the quality of their clustering results. The behavior of both algorithms depended on the number of data points as well as on the number of clusters. The input data points are generated in two ways, one using a normal distribution (via the Box-Muller formula) and the other using a uniform distribution.
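The data generation and clustering steps mentioned above can be illustrated with a short Python sketch. It uses the Box-Muller transform to turn uniform random numbers into normally distributed ones, then groups the resulting 2-D points with a plain K-means loop. This is only an illustration under assumptions, not the code used by Velmurugan and Santhanam: the function names, the `initial` parameter, the cluster centers, the spread, the seed, and the iteration count are all hypothetical choices.

```python
import math
import random


def box_muller(rng):
    """Return one standard-normal sample built from two uniform samples."""
    u1 = 1.0 - rng.random()  # shift to (0, 1] so that log(u1) is defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def normal_points(n, center, spread, rng):
    """Generate n 2-D points scattered normally around `center`."""
    cx, cy = center
    return [(cx + spread * box_muller(rng), cy + spread * box_muller(rng))
            for _ in range(n)]


def kmeans(points, k, iterations=20, initial=None, rng=None):
    """Plain K-means: assign each point to its nearest centroid, then
    replace every centroid by the mean of its cluster.

    Returns the final centroids and the clusters from the last assignment.
    """
    rng = rng or random.Random(0)
    # Assumption: start from k randomly chosen points unless given a start.
    centroids = list(initial) if initial else rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


if __name__ == "__main__":
    rng = random.Random(42)
    points = (normal_points(100, (0.0, 0.0), 1.0, rng)
              + normal_points(100, (10.0, 10.0), 1.0, rng))
    centroids, clusters = kmeans(points, k=2, rng=rng)
    for centroid, cluster in zip(centroids, clusters):
        print(centroid, len(cluster))
```

Timing a call to `kmeans` for different values of `k` and different numbers of points would mirror the kind of execution-time comparison described here, although absolute timings depend on the machine and implementation.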
The performance of the algorithms was investigated during different executions of the program on the input data points. The execution time for each algorithm was also analyzed and the results were compared with one another. Both unsupervised clustering methods were examined based on the distance between the various input data points. The clusters were formed according to the distance between data points, and cluster centers were formed for each cluster. The implementation plan was in two parts, one with a normal distribution and the other with a uniform distribution of input data points. The data points in each cluster were displayed in different colors and the execution time was calculated in milliseconds. Velmurugan and Santhanam chose 1 (k=1) clusters and 5 data points for the experiment. The algorithm was repeated 5 times (one iteration for each data point) to get efficient output. The cluster centers (centroids) were calculated for each cluster by its mean value, and clusters were formed depending upon the distance between data points.

Fig 3: Relationship between number of clusters and the performance of the algorithms
Fig 4: The effect of data type on the algorithms
Fig 5: Clusters on 5 data points

Jayaprakash et al., in their paper "Performance characterization of data mining applications using MineBench", presented a set of representative data mining applications called MineBench. They evaluated the MineBench applications on an 8-way shared memory machine and analyzed some important performance characteristics. MineBench encompasses many algorithms commonly used in data mining. They analyzed the architectural properties of these applications to investigate the performance bottlenecks associated with them. For performance characterization, they chose an Intel IA-32 multiprocessor platform, an Intel Xeon 8-way shared memory parallel (SMP) machine running Red Hat Advanced Server 2.1. The system had 4 GB of shared memory. Each processor had a 16 KB non-blocking integrated L1 cache and a 124 KB L2 cache. For evaluation they used the VTune performance analyzer. Each application was compiled with version 7.1 of the Intel C++ compiler for Linux.

The data used in the experiments were either real-world data obtained from various fields or widely accepted synthetic data generated using existing tools that are used in scientific and statistical simulations. During evaluation, multiple data sizes were used to investigate the characteristics of the MineBench applications. For non-bioinformatics applications, the input datasets were classified into 3 different sizes: small, medium, and large. The data sources were the IBM Quest data generator, ENZO, and a real image database by Corel Corporation.

Table 1: Summary of selected references with goals

Reference: Abdullah H. Wahbeh et al. (IJACSA)
Goal: Comparative study between a number of freely available data mining tools
Database/Data description: UCI repository
Data size used: 1 to 2,
Preprocessing: Data integration
Data mining algorithm: NB, OneR, C4.5, SVM, KNN, ZeroR
Software: Weka, KNIME, Orange, TANAGRA

Reference: Ying Liu et al.
Goal: To investigate data mining applications to identify their characteristics in a sequential as well as parallel execution environment
Database/Data description: IBM Quest data generator, ENZO
Data size used: 25, records; 2,, transactions
Data mining algorithm: HOP, K-means, BIRCH, ScalParC, Bayesian, Apriori, ECLAT
Software: VTune performance analyzer

Reference: P.T. Kavitha et al. (IJCSE)
Goal: To develop efficient ARM on DDM framework
Database/Data description: Transaction data by Point-of-Sale (PoS) system
Data mining algorithm: Apriori, AprioriTid, AprioriHybrid, FP-growth
Software: Java

Reference: T. Velmurugan & T. Santhanam (EJOSR)
Goal: To analyze K-means & Fuzzy C-means clustering result quality by Box-Muller formula
Database/Data description: Normal & uniform distribution of data points
Data size used: 5 to 1 data points
Data mining algorithm: K-means, Fuzzy C-means
Software: Applet Viewer

Reference: Jayaprakash et al.
Goal: To evaluate MineBench applications on an 8-way shared memory machine
Database/Data description: IBM Quest data generator, ENZO, synthetic data set
Data size used: Dense database, 1k to 8k transactions, 73mb real data set
Preprocessing: Data cleaning
Data mining algorithm: ScalParC, K-means, HOP, Apriori, Utility, SNP, GeneNet, SEMPHY, Rsearch, SVM, PLSA
Software: VTune performance analyzer

Reference: Pramod S. & O.P. Vyas
Goal: To assess the changing behavior of customers through ARM
Database/Data description: Frequent Itemset Mining (FIM) data set repository
Data size used: Sorted & unsorted transaction set
Preprocessing: Data cleaning
Data mining algorithm: CARMA, DSCA, estDec
Software: Java

Reference: Osama Abu Abbas
Goal: To compare 4 clustering algorithms
Database/Data description: ASCII file
Data size used: 6 rows, 6 columns
Data mining algorithm: K-means, hierarchical, SOM, EM
Software: LNKnet

As the number of available tools continues to grow, the choice of one particular tool becomes increasingly difficult for each potential user. This decision-making process can be supported by a performance evaluation of the various clusterers used in the open source data mining tool Weka.

ANALYSIS OF DATA MINING ALGORITHM

Clustering Program
Clustering is the process of discovering groups of similar objects in a database to characterize the underlying data distribution. K-means is a partition-based method and arguably the most commonly used clustering technique. The K-means clusterer assigns each object to its nearest cluster center based on some similarity function. Once the assignments are complete, new centers are found by taking the mean of all the objects in each cluster. BIRCH is a hierarchical clustering method that employs a hierarchical tree to represent the closeness of data objects. BIRCH first scans the database to build a clustering-feature tree that summarizes the cluster representation. Density-based methods grow clusters according to some density function, and DBSCAN is a typical density-based clustering method. A related density-based method, HOP, was originally proposed in astrophysics: after each particle is assigned an estimate of its density and linked to its densest neighbor, the assignment process continues until the densest neighbor of a particle is itself. All particles reaching this state are clustered as a group.

EVALUATION STRATEGY/METHODOLOGY

H/W tools
I conducted my evaluation on a Pentium 4 processor platform which consists of 512 MB memory, Linux enterprise server operating system, a 4GB memory, and a 124kb L1 cache.

S/W tool
In all the experiments, I used Weka 3-6-6. I looked at different characteristics of the applications: using classifiers to measure the accuracy on different data sets, and using clusterers to generate the number of clusters, the time taken to build models, etc. The Weka toolkit is a widely used toolkit for machine learning and data mining that was originally developed at the University of Waikato in New Zealand. It contains a large collection of state-of-the-art machine learning and data mining algorithms written in Java. Weka contains tools for regression, classification, clustering, association rules, visualization, and data processing.

Input data sets
Input data is an integral part of data mining applications. The data used in my experiment is either real-world data obtained from the UCI data repository or widely accepted datasets available in the Weka toolkit. During evaluation, multiple data sizes were used. Each dataset is described by the data type being used, the types of attributes, and the number of instances stored within the dataset; the table also shows that all the selected data sets are used for the classification and clustering tasks. These datasets were chosen because they have different characteristics and address different areas. The Zoo dataset and the Letter image recognition dataset are in CSV format, whereas the Labor and Supermarket datasets are in ARFF format. The Zoo, Letter, and Labor datasets have 17 attributes, while the Supermarket dataset has 2 attributes.
The Zoo dataset encompasses 101 instances, the Letter image dataset contains 20,000 (of which I have taken just 174), and Labor comprises 57. All datasets, including Supermarket, are categorical and integer with multivariate characteristics.

Experimental result and Discussion
To evaluate the selected tool using the given datasets, several experiments are conducted. For evaluation purposes, two test modes are used: the full training set mode and the percentage split (holdout method) mode. The first is described here in terms of a widely used experimental testing procedure, k-fold cross-validation, in which the database is randomly divided into k disjoint blocks of objects; the data mining algorithm is then trained using k-1 blocks and the remaining block is used to test the performance of the algorithm. This process is repeated k times, and at the end the recorded measures are averaged. It is common to choose k=10 or any other size, depending mainly on the size of the original dataset.

In percentage split (holdout method), the database is randomly split into two disjoint datasets. The first set, from which the data mining system tries to extract knowledge, is called the training set. The extracted knowledge may be tested against the second set, which is called the test set. It is common to have 66% of the objects of the original database as the training set and the rest of the objects as the test set. Once the tests are carried out using the selected datasets, results are collected using the available classification and test modes, and an overall comparison is conducted.

Performance Measures
For each characteristic, I analyzed how the results vary whenever the test mode is changed. My measures of interest include the analysis of clusterers on different datasets. The results are described in terms of the number of clusters generated, the clustered instances, the time taken to build the model, and the unclustered instances, after applying the cross-validation or holdout method.
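The two splitting procedures described above, k-fold cross-validation and the percentage split (holdout), can be sketched in plain Python as index bookkeeping. This is a minimal illustration and not Weka's implementation: the function names, the seeds, and the default of 10 folds are assumptions made for the example, while the 66% training fraction is the one quoted in the text.

```python
import random


def holdout_split(n_items, train_fraction=0.66, seed=0):
    """Randomly split indices 0..n_items-1 into disjoint train/test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_items * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k disjoint, randomly filled blocks."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # block i takes every k-th index
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_items, evaluate, k=10, seed=0):
    """Average a user-supplied evaluate(train, test) measure over the k folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_splits(n_items, k, seed)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    train, test = holdout_split(101)  # e.g. a dataset with 101 instances
    print(len(train), len(test))
    for fold_train, fold_test in k_fold_splits(101, k=10):
        print(len(fold_train), len(fold_test))
```

Here `evaluate` stands for whatever measure is being recorded, such as accuracy or model build time; each block serves as the test set exactly once, and the recorded measures are averaged at the end.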
For performance measurement, I used 3 other datasets: Letter image recognition, Labor, and Supermarket. The details of the classifiers applied to those datasets are as follows:

Dataset: Letter image recognition
Classifier: Lazy: IBK, KStar; Tree: Decision stump, REP; Function: Linear regression; Rule: ZeroR

Dataset: Labor
Classifier: Lazy: IBK, KStar; Tree: Decision stump, REP; Function: Linear regression; Rule: ZeroR; Bayesian: Naïve Bayes

Dataset: Supermarket
Classifier: Lazy: IBK, KStar; Tree: Decision stump, CART; Function: SMO; Rule: ZeroR, OneR; Bayesian: Naïve Bayes

The details of the clusterers used with the different datasets are as follows:

Dataset: Zoo
Clusterer: DBscan, EM, Hierarchical, K-means

Dataset: Letter image recognition
Clusterer: DBscan, EM, Hierarchical, K-means

Dataset: Labor
Clusterer: DBscan, EM, Hierarchical, K-means

Dataset: Supermarket
Clusterer: DBscan, EM, K-means

Clustering in Weka:

Fig 6: Clustering window

Selecting a Clusterer: By now you will be familiar with the process of selecting and configuring objects. Clicking on the clustering scheme listed in the Clusterer box at the top of the window brings up a Generic Object Editor dialog with which to choose a new clustering scheme.

Cluster Modes: The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first three options are the same as for classification: Use training set, Supplied test set and Percentage split, except that now the data is assigned to clusters instead of trying to predict a specific class. The fourth mode, Classes to clusters evaluation, compares how well the chosen clusters match up with a pre-assigned class in the data. The drop-down box below this option selects the class, just as in the Classify panel.

Ignoring Attributes: Often, some attributes in the data should be ignored when clustering. The Ignore attributes button brings up a small window that allows you to select which attributes are ignored. Clicking on an attribute in the window highlights it, holding down the SHIFT
key selects a range of consecutive attributes, and holding down CTRL toggles individual attributes on and off. To cancel the selection, back out with the Cancel button. To activate it, click the Select button.

Working with Filters: The FilteredClusterer meta-clusterer offers the user the possibility to apply filters directly before the clusterer is learned. This approach eliminates the manual application of a filter in the Preprocess panel, since the data gets processed on the fly. This is useful if one needs to try out different filter setups.

Learning Clusters: The Cluster section, like the Classify section, has Start/Stop buttons, a result text area and a result list. These all behave just like their classification counterparts. Right-clicking an entry in the result list brings up a similar menu, except that it shows only two visualization options: Visualize cluster assignments and Visualize tree.

DETAILS OF DATA SET
I used 4 data sets for the clustering evaluation in WEKA. Two of them, the Zoo data set and the Letter image recognition data set, are from the UCI data repository; the other two, the Labor and Supermarket data sets, are built into WEKA. The Zoo and Letter image recognition data sets are in CSV file format, and the Labor and Supermarket data sets are in ARFF file format. Details of the data sets used in the evaluation:

Table 2: Detail of data set
Columns: name of data set, type of file, number of attributes, number of instances, attribute characteristics, dataset characteristics, missing value.

Name of data set: Zoo
Type of file: CSV (comma separated value)
Attribute characteristics: Categorical, Integer
Dataset characteristics: Multivariate
Missing value: No

Name of data set: Letter Image Recognition
Type of file: CSV (comma separated value)
Attribute characteristics: Categorical, Integer
Dataset characteristics: Multivariate
Missing value: No

Name of data set: Labor
Type of file: ARFF (Attribute Relation File Format)
Attribute characteristics: Categorical, Integer
Dataset characteristics: Multivariate
Missing value: No

Name of data set: Supermarket
Type of file: ARFF (Attribute Relation File Format)
Attribute characteristics: Categorical, Integer
Dataset characteristics: Multivariate
Missing value: No

ZOO DATA SET

Fig 7: Zoo data set (UCI repository)

Title: Zoo database
Source Information
-- Creator: Richard Forsyth
-- Donor: Richard S. Forsyth, 8 Grosvenor Avenue, Mapperley Park, Nottingham NG3 5DX
-- Date: 5/15/1990
Relevant Information:
-- A simple database containing 17 Boolean-valued attributes. The type attribute appears to be the class attribute. Here is a breakdown of which animals are in which type: (I find it unusual that there are 2 instances of frog and one of girl!)

Class# Set of animals
1 (41) aardvark, antelope, bear, boar, buffalo, calf, cavy, cheetah, deer, dolphin, elephant, fruitbat, giraffe, girl, goat, gorilla, hamster, hare, leopard, lion, lynx, mink, mole, mongoose, opossum, oryx, platypus, polecat, pony, porpoise, puma, pussycat, raccoon, reindeer, seal, sealion, squirrel, vampire, vole, wallaby, wolf
2 (20) chicken, crow, dove, duck, flamingo, gull, hawk, kiwi, lark, ostrich, parakeet, penguin, pheasant, rhea, skimmer, skua, sparrow, swan, vulture, wren
3 (5) pitviper, seasnake, slowworm, tortoise, tuatara
4 (13) bass, carp, catfish, chub, dogfish, haddock, herring, pike, piranha, seahorse, sole, stingray, tuna
5 (4) frog, frog, newt, toad
6 (8) flea, gnat, honeybee, housefly, ladybird, moth, termite, wasp
7 (10) clam, crab, crayfish, lobster, octopus, scorpion, seawasp, slug, starfish, worm

Number of Instances: 101
Number of Attributes: 18 (animal name, 15 at
8. Missing Attribute Values: None
9. Class Distribution: Given above

Letter image recognition data set :-