
International Seminar on Scientific Issues and Trends (ISSIT) 2014

Proceeding ISSIT 2014, Page: A-31

PREDICTION OF STUDENTS GRADUATION USING ALGORITHM C4.5

Hilda Amalia

Akademi Manajemen Informatika dan Komputer Bina Sarana Informatika

email : Hilda.ham@bsi.ac.id

Abstract-College is where students gain knowledge before facing the competitive world of work. The number of students who graduate on time is an indicator of the success of a college, whether public or private. It is therefore important for a college to predict its students' graduation so that it can take precautions. Much research on predicting student graduation has already been carried out. This research uses the C4.5 algorithm. C4.5 is one of the decision tree methods and can be used for prediction and classification. The decision tree is a data mining method, and data mining is a way of exploring data to turn it into useful information. The information that results from the data mining process can be used to estimate future probabilities more accurately than before. The result of this research is a set of decision tree rules with an accuracy of 74.33%.

Key Words: Data Mining, Algorithm C4.5

I. INTRODUCTION

Colleges today need a competitive advantage, which they gain by making use of all the resources they have. Students and lecturers are a college's major assets, and a college that wants to keep improving its key indicators must use these assets effectively and efficiently (Qudri & Kalyankar, 2010). Students who do not graduate on time affect how the public assesses the credibility of a school or educational institution, so it is important for a college to monitor how many of its students graduate on time. A decline in the student graduation rate also affects the accreditation of the college. A graduation rate that declines significantly and continuously is therefore a problem for a college. Several methods have been used to predict on-time graduation, among them neural networks (Karamouzis & Vrettos, 2008), a comparison of naïve Bayes and the C4.5 algorithm (Suhartinah, 2010), a comparison of ANN, decision tree and linear regression (Ibrahim & Rush, 2007), and k-means (Oyelade, 2010). The C4.5 algorithm is a decision tree method. The decision tree is a powerful and well-known classification and prediction method. A decision tree turns a very large set of facts into a tree that represents rules (Suhartinah, 2010). Decision trees have been widely used to make predictions in various fields. In this study, prediction accuracy is calculated using the C4.5 algorithm.

II. THEORY

2.1 Data Mining

Data mining is the process of finding meaningful relationships, patterns and trends by examining large collections of stored data, using pattern recognition techniques together with statistical and mathematical techniques (Larose, 2005).

The data mining process:

1. Data Selection: identify the source of the data to be mined.

2. Data Pre-processing: this stage prepares the data, for example by removing duplicate data, handling invalid data, and combining data from multiple databases.

3. Transformation: convert the data into a format suitable for the algorithm.

4. Data Mining: the main stage, in which data mining methods are applied to find a model.

5. Interpretation and evaluation: interpret and evaluate the results (Yingkuarchat, 2007).

2.2 C4.5 Algorithm

A decision tree resembles a flowchart structure: each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. The top node of a tree is the root node (Han & Kamber, 2007).
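As a minimal sketch of this structure in Python, a tree can be represented as nested nodes, and a record is classified by following branches from the root down to a leaf. The class names `Leaf` and `Node`, the attribute names, and the tiny example tree are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Leaf:
    """Terminal node: holds the class label."""
    label: str


@dataclass
class Node:
    """Internal node: tests one attribute; each branch is one test outcome."""
    attribute: str
    branches: Dict[str, Union["Node", Leaf]] = field(default_factory=dict)


def classify(tree: Union[Node, Leaf], record: Dict[str, str]) -> str:
    """Walk from the root to a leaf, following the branch that matches the record."""
    while isinstance(tree, Node):
        tree = tree.branches[record[tree.attribute]]
    return tree.label


# Hypothetical two-level tree, for illustration only.
example_tree = Node("faculty", {
    "engineering": Leaf("LATE"),
    "economics": Node("gpa_band", {"high": Leaf("RIGHT"), "low": Leaf("LATE")}),
})

print(classify(example_tree, {"faculty": "economics", "gpa_band": "high"}))
```

The root node here tests the hypothetical attribute `faculty`; only the `economics` branch needs a second test before a class label is reached.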

There are several stages in building a decision tree with the C4.5 algorithm (Kusrini & Luthfi, 2009), namely:

1. Prepare the training data. Training data are usually taken from historical data, that is, cases that have already occurred and have been grouped into certain classes.

2. Determine the root of the tree. The root is the attribute with the highest gain, found by calculating the gain of each attribute. Before calculating the gain of an attribute, first calculate the entropy:

$$Entropy(S) = \sum_{i=1}^{n} -p_i \log_2 p_i$$

where S is the set of cases, n is the number of classes in S, and p_i is the proportion of cases in S that belong to class i.

3. Then calculate the gain value using the information gain method:

$$Gain(S, A) = Entropy(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|} \, Entropy(S_i)$$

where A is the attribute, n is the number of partitions of attribute A, |S_i| is the number of cases in partition i, and |S| is the number of cases in S.

4. Repeat step 2 until all tuples are partitioned.

5. The partitioning process of the decision tree stops when:

a) All tuples in node N belong to the same class.

b) There are no attributes left on which the tuples can be partitioned further.

c) A branch is empty, that is, it contains no tuples.
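The root selection in steps 2 and 3 can be sketched in Python as follows. This is a minimal illustration that assumes categorical attributes stored as dictionaries. The function names and the four-row training set are hypothetical, and the sketch uses the information gain criterion described in the steps above.

```python
import math
from collections import Counter
from typing import Dict, List


def entropy(labels: List[str]) -> float:
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows: List[Dict[str, str]], attribute: str, target: str) -> float:
    """Entropy of the whole set minus the weighted entropy of each partition."""
    base = entropy([r[target] for r in rows])
    partitions: Dict[str, List[str]] = {}
    for r in rows:
        partitions.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return base - remainder


def choose_root(rows: List[Dict[str, str]], attributes: List[str], target: str) -> str:
    """Step 2: the attribute with the highest gain becomes the root."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy training data, not the paper's data set.
rows = [
    {"faculty": "A", "gender": "F", "result": "RIGHT"},
    {"faculty": "A", "gender": "M", "result": "RIGHT"},
    {"faculty": "B", "gender": "F", "result": "LATE"},
    {"faculty": "B", "gender": "M", "result": "LATE"},
]
print(choose_root(rows, ["faculty", "gender"], "result"))
```

In this toy set `faculty` separates the two classes perfectly while `gender` does not separate them at all, so `faculty` is chosen as the root.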

III. RESEARCH METHODS

In this study, the data used are the graduation data of students at a university in Jakarta. The study was carried out in several steps, or stages, as described below:

Figure 1. Stages of the study

a. Data Collection

The data are secondary data, obtained from the student database of a university in Jakarta through the campus computer center. Both qualitative and quantitative data were obtained. The data cover undergraduate (S1) students of the university for the September 2011 graduation period. There are 1633 graduation records with the attributes NIM (student ID number), name, age, school, and semester GPA (IP semester) for semesters 1 through 8, each labeled as on time (RIGHT) or late (LATE).

b. Data Preparation

To obtain high-quality data, several techniques were applied, as follows (Vercellis, 2009):

(1) Data validation, to identify and remove anomalous data (outliers/noise), inconsistent data, and incomplete data (missing values).

(2) Data integration and transformation, to improve the accuracy and efficiency of the algorithm. The data used in this paper are categorical.

(3) Data size reduction and discretization, to obtain a data set with fewer attributes and records that is still informative.

After this initial processing, 1582 usable records remained, consisting of 671 records with the class label "RIGHT" (on time) and 911 records with the class label "LATE".
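The preparation steps can be illustrated with a small Python sketch. It assumes records held as dictionaries. The field names (`nim`, `faculty`, `ips1`, `label`), the helper names, and the four sample records are hypothetical, and the threshold 3.19 is used only as an example cut point.

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, object]


def validate(records: List[Record], required: List[str]) -> List[Record]:
    """Drop records with missing values and exact duplicates."""
    seen = set()
    clean = []
    for rec in records:
        if any(rec.get(k) in (None, "") for k in required):
            continue  # incomplete record (missing value)
        key = tuple(rec[k] for k in required)
        if key in seen:
            continue  # duplicate record
        seen.add(key)
        clean.append(rec)
    return clean


def discretize_gpa(value: float, threshold: float) -> str:
    """Turn a numeric GPA into a categorical value around one threshold."""
    return f"<={threshold}" if value <= threshold else f">{threshold}"


# Hypothetical sample records, for illustration only.
raw = [
    {"nim": "001", "faculty": "Engineering", "ips1": 3.40, "label": "RIGHT"},
    {"nim": "001", "faculty": "Engineering", "ips1": 3.40, "label": "RIGHT"},  # duplicate
    {"nim": "002", "faculty": "Economics", "ips1": None, "label": "LATE"},     # missing value
    {"nim": "003", "faculty": "Economics", "ips1": 2.90, "label": "LATE"},
]
fields = ["nim", "faculty", "ips1", "label"]
clean = validate(raw, fields)
for rec in clean:
    rec["ips1"] = discretize_gpa(rec["ips1"], 3.19)

# Class distribution after cleaning, analogous to the RIGHT/LATE counts above.
print(Counter(rec["label"] for rec in clean))
```

Counting the class labels after cleaning gives the kind of RIGHT/LATE totals that the entropy calculation starts from.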

c. The Proposed Method

In this study, the accuracy of graduation prediction is calculated using a data mining method, namely the C4.5 algorithm. The use of the proposed method in the study is illustrated below:

Figure 2. Illustration of the use of the proposed method

III. THEORY

3.1 C4.5 Algorithm

The C4.5 algorithm is a decision tree algorithm that converts data into a decision tree using entropy calculations. The stages of the entropy calculation in the C4.5 algorithm for the graduation data are as follows:

a. Prepare the training data. The training data used are those in Table 2, the training data table.

b. Calculate the overall entropy for the "RIGHT" (on-time) and "LATE" cases. In the training data, 671 records graduated on time ("RIGHT") and 911 records graduated late ("LATE"), for a total of 1582 cases. The overall entropy is therefore:

Entropy(S) = (-671/1582 * log2(671/1582)) + (-911/1582 * log2(911/1582))

= 0.983
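This arithmetic can be checked with a few lines of Python. The helper name `entropy_from_counts` is an assumption made for this sketch. The two counts are the ones stated above.

```python
import math


def entropy_from_counts(counts):
    """Base-2 entropy of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


# 671 on-time ("RIGHT") and 911 late ("LATE") records, 1582 in total.
overall = entropy_from_counts([671, 911])
print(round(overall, 3))  # 0.983
```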

c. Calculate the entropy and the gain of each attribute. The attribute with the highest gain becomes the root of the decision tree. For example, the entropy for each value of the faculty (fakultas) attribute, with the [RIGHT, LATE] counts in brackets, is:

Entropy(Education) [129, 229] = (-129/358 * log2(129/358)) + (-229/358 * log2(229/358))

= 0.943

Entropy(Languages and Arts) [166, 96] = (-166/262 * log2(166/262)) + (-96/262 * log2(96/262))

= 0.948

Entropy(Mathematics and Natural Sciences) [44, 154] = (-44/198 * log2(44/198)) + (-154/198 * log2(154/198))

= 0.764

Entropy(Social Sciences) [99, 123] = (-99/222 * log2(99/222)) + (-123/222 * log2(123/222))

= 0.992

Entropy(Engineering) [43, 127] = (-43/170 * log2(43/170)) + (-127/170 * log2(127/170))

= 0.816

Entropy(Economics) [187, 96] = (-187/283 * log2(187/283)) + (-96/283 * log2(96/283))

= 0.924

Entropy(Sports Science) [3, 86] = (-3/89 * log2(3/89)) + (-86/89 * log2(86/89))

= 0.213
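The per-faculty entropies can be reproduced with the short Python sketch below. The [RIGHT, LATE] pairs are chosen so that each pair sums to the faculty size used as the weight in the gain formula (358, 262, 198, 222, 170, 283 and 89 records). The LATE counts of 154 for Mathematics and Natural Sciences and 127 for Engineering, and the total of 89 for Sports Science, are inferred from those sizes and should be read as assumptions. The helper name is also an assumption.

```python
import math


def entropy_from_counts(counts):
    """Base-2 entropy of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


# (RIGHT, LATE) counts per faculty; Indonesian names from the source in comments.
faculty_counts = {
    "Education": (129, 229),                         # Ilmu Pendidikan
    "Languages and Arts": (166, 96),                 # Bahasa dan Seni
    "Mathematics and Natural Sciences": (44, 154),   # Matematika dan IPA (154 inferred)
    "Social Sciences": (99, 123),                    # Ilmu Sosial
    "Engineering": (43, 127),                        # Teknik (127 inferred)
    "Economics": (187, 96),                          # Ekonomi
    "Sports Science": (3, 86),                       # Ilmu Keolahragaan
}

entropies = {name: entropy_from_counts(c) for name, c in faculty_counts.items()}
for name, value in entropies.items():
    print(f"{name}: {value:.3f}")

# Sanity check: the faculty sizes add up to the 1582 training records.
print(sum(sum(c) for c in faculty_counts.values()))  # 1582
```

With these counts the rounded values agree with the seven entropy results listed above, and the RIGHT and LATE columns add up to 671 and 911 respectively.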

Then calculate the gain for the faculty attribute:

Gain(S, A) = 0.983 - ((358/1582 * 0.943) + (262/1582 * 0.948) + (198/1582 * 0.7642) + (222/1582 * 0.992) + (170/1582 * 0.816) + (89/1582 * 0.213) + (283/1582 * 0.924)) = 0.290

Gain(faculty) = 0.290

The gain values calculated when determining the root are shown below:

Attribute                      Gain value
Faculty                        0.290
Gender                         0.021
Age <= 26 and > 26             0.054
Age <= 25 and > 25             0.0521
Age <= 38 and > 38             -0.058
Age <= 42 and > 42             0.0016
IPS1 <= 3.190 and > 3.190      0.119
IPS1 <= 3.455 and > 3.455      0.082
IPS1 <= 3.310 and > 3.310      0.125
IPS1 <= 2.350 and > 2.350      0.040
IPS1 <= 3.565 and > 3.565      0.062
IPS1 <= 3.705 and > 3.705      0.032
IPS1 <= 3.685 and > 3.685      0.062
IPS1 <= 3.545 and > 3.545      0.079
IPS1 <= 3.295 and > 3.295      0.116
IPS2 <= 3.790 and > 3.790      0.060
IPS2 <= 2.690 and > 2.690      0.050
IPS3 <= 3.150 and > 3.150      0.058
IPS4 <= 2.365 and > 2.365      0.036
IPS4 <= 2.900 and > 2.900      0.065
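Rows such as "IPS1 <= 3.310 and > 3.310" are binary splits of a numeric attribute at a candidate threshold. The Python sketch below shows one way the gain of such a split could be evaluated, using a hypothetical handful of GPA values rather than the paper's data. Taking midpoints between consecutive sorted values as candidate thresholds is a common choice and is an assumption here, as are the function names.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def threshold_gain(values, labels, threshold):
    """Information gain of the binary split value <= threshold vs. value > threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct sorted values; keep the best gain."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: threshold_gain(values, labels, t))


# Hypothetical data for illustration only.
gpa = [2.1, 2.8, 3.0, 3.4, 3.6, 3.8]
result = ["LATE", "LATE", "LATE", "RIGHT", "RIGHT", "RIGHT"]
print(best_threshold(gpa, result))
```

In this toy example the best cut falls between 3.0 and 3.4, where the two classes separate completely.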

After the entropy and gain of all attributes were calculated, the faculty attribute was found to have the highest gain, so faculty becomes the root of the tree. The entropy and gain values are then recalculated for each faculty branch, and so on, until the decision tree is obtained.

The decision tree obtained for the graduation data using the C4.5 algorithm is as follows:

The rules produced by the C4.5 algorithm are listed below, where RIGHT (tepat) means graduating on time and LATE (terlambat) means graduating late:

R1: IF faculty = Languages and Arts AND IPS4 > 2.455 AND IPS1 > 2.350 THEN result = RIGHT.

R2: IF faculty = Languages and Arts AND IPS4 > 2.455 AND IPS1 <= 2.350 THEN result = LATE.

R3: IF faculty = Languages and Arts AND IPS4 <= 2.455 THEN result = LATE.

R4: IF faculty = Economics AND IPS1 > 3.190 THEN result = RIGHT.

R5: IF faculty = Economics AND IPS1 <= 3.190 THEN result = RIGHT.

R6: IF faculty = Sports Science AND IPS1 > 3.455 THEN result = RIGHT.

R7: IF faculty = Sports Science AND IPS1 <= 3.455 AND IPS1 > 3.310 THEN result = RIGHT.

R8: IF faculty = Sports Science AND IPS1 <= 3.455 AND IPS1 <= 3.310 THEN result = LATE.

R9: IF faculty = Education AND IPS1 > 3.545 AND age > 26 AND IPS1 > 3.566 THEN result = LATE.

R10: IF faculty = Education AND IPS1 > 3.545 AND age > 26 AND IPS1 <= 3.566 THEN result = RIGHT.

R11: IF faculty = Education AND IPS1 > 3.545 AND age <= 26 AND IPS2 > 3.790 THEN result = LATE.

R12: IF faculty = Education AND IPS1 > 3.545 AND age <= 26 AND IPS2 <= 3.790 AND IPS1 > 3.705 THEN result = RIGHT.

R13: IF faculty = Education AND IPS1 > 3.545 AND age <= 26 AND IPS2 <= 3.790 AND IPS1 <= 3.705 AND IPS1 > 3.685 THEN result = LATE.

R15: IF faculty = Education AND IPS1 > 3.545 AND age <= 26 AND IPS2 <= 3.790 AND IPS1 <= 3.705 AND IPS1 <= 3.685 THEN result = RIGHT.

R16: IF faculty = Education AND IPS1 <= 3.545 AND age > 25 AND age > 38 AND age > 42 AND IPS3 > 3.150 THEN result = RIGHT.

R17: IF faculty = Education AND IPS1 <= 3.545 AND age > 25 AND age > 38 AND age > 42 AND IPS3 <= 3.150 THEN result = LATE.

R18: IF faculty = Education AND IPS1 <= 3.545 AND age > 25 AND age <= 38 AND IPS1 > 3.295 AND IPS4 > 2.900 THEN result = LATE.

R19: IF faculty = Education AND IPS1 <= 3.545 AND age > 25 AND age <= 38 AND IPS1 > 3.295 AND IPS4 <= 2.900 THEN result = RIGHT.

R20: IF faculty = Education AND IPS1 <= 3.545 AND age > 25 AND age <= 38 AND IPS1 <= 3.295 THEN result = LATE.

R21: IF faculty = Education AND IPS1 <= 3.545 AND age > 25 AND age <= 38 AND IPS1 <= 3.295 THEN result = LATE.

R22: IF faculty = Education AND IPS1 <= 3.545 AND age <= 25 THEN result = LATE.

R23: IF faculty = Social Sciences AND IPS2 > 2.690 THEN result = RIGHT.

R24: IF faculty = Social Sciences AND IPS2 <= 2.690 THEN result = LATE.

R25: IF faculty = Mathematics and Natural Sciences THEN result = LATE.

R26: IF faculty = Engineering THEN result = LATE.
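Rules of this kind translate directly into nested conditionals. The Python sketch below encodes only a subset (R1 to R5 and R23 to R26) to show the idea. The function name, the parameter names and the English faculty labels are assumptions. It also assumes that R3 and R24 are the complements of the splits in R1/R2 and R23, which is how the branches of a decision tree are normally formed.

```python
from typing import Optional


def predict(faculty: str, ips1: float = 0.0, ips2: float = 0.0,
            ips4: float = 0.0) -> Optional[str]:
    """Apply rules R1-R5 and R23-R26; return None for faculties not covered here."""
    if faculty == "Languages and Arts":
        if ips4 > 2.455:
            return "RIGHT" if ips1 > 2.350 else "LATE"  # R1, R2
        return "LATE"                                    # R3 (assumed: IPS4 <= 2.455)
    if faculty == "Economics":
        return "RIGHT"                                   # R4, R5: both branches give RIGHT
    if faculty == "Social Sciences":
        return "RIGHT" if ips2 > 2.690 else "LATE"       # R23, R24 (assumed complement)
    if faculty in ("Mathematics and Natural Sciences", "Engineering"):
        return "LATE"                                    # R25, R26
    return None  # Education and Sports Science rules are omitted in this sketch


print(predict("Languages and Arts", ips1=3.0, ips4=3.0))
print(predict("Social Sciences", ips2=2.5))
```

The remaining rules for Education and Sports Science follow the same pattern, with one nested condition per threshold test.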

The C4.5 model was tested with k-fold cross-validation using the RapidMiner application:

The confusion matrix for the C4.5 algorithm is shown in Table 2. The accuracy is 74.33%. Of the 1582 records, 473 were predicted RIGHT and were in fact RIGHT, 208 were predicted RIGHT but were in fact LATE, 198 were predicted LATE but were in fact RIGHT, and 703 were predicted LATE and were in fact LATE. Figure 4.5 shows the AUC (ROC) graph for the C4.5 method, in which the horizontal axis is the false positive rate and the vertical axis is the true positive rate.

Table 2. Confusion matrix for the graduation data

                   Actual RIGHT    Actual LATE
Predicted RIGHT    473             208
Predicted LATE     198             703
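The accuracy can be recomputed from the four confusion-matrix counts with a few lines of Python. The variable names are assumptions, and "RIGHT" is treated as the positive class purely for the purpose of this sketch. Precision and recall are not reported in the paper and are included only as derived quantities.

```python
# Counts as stated in the text, with "RIGHT" taken as the positive class.
tp = 473  # predicted RIGHT, actually RIGHT
fp = 208  # predicted RIGHT, actually LATE
fn = 198  # predicted LATE, actually RIGHT
tn = 703  # predicted LATE, actually LATE

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision_right = tp / (tp + fp)  # share of RIGHT predictions that were correct
recall_right = tp / (tp + fn)     # share of actual RIGHT records that were found

print(total)  # 1582
print(f"accuracy = {accuracy:.4f}")
print(f"precision(RIGHT) = {precision_right:.4f}, recall(RIGHT) = {recall_right:.4f}")
```

The result, 1176 correct out of 1582, is about 74.3%, in line with the accuracy reported for the model.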

IV. CONCLUSION

From the research conducted, in which a data mining process was applied to student graduation data, it can be concluded that the C4.5 algorithm produces an accuracy of 74.33% and an AUC value of 0.787, so the method can be considered fairly accurate in making predictions for the existing graduation data.

REFERENCES

[1] Azwar, S. (2004). Penyusunan Skala Psikologi. Yogyakarta: Pustaka Pelajar.

[2] Nawawi, H., & M, M. (1994). Kebijaksanaan Pendidikan di Indonesia di tinjau dari Sudut Hukum. Yogyakarta: Gajah Mada University Press.

[3] Qudri, M. N., & Kalyankar, N. V. (2010). Drop Out Feature of Student Data for Academic Performance Using Decision Tree Techniques. Global Journal of Computer Science and Technology, 2-4.

[4] Karamouzis, T. S., & Vrettos, A. (2008). An Artificial Neural Network for Predicting Student Graduation Outcomes. Proceedings of the World Congress on Engineering and Computer Science, 978-988-98671-02.

[6] Han, J., & Kamber, M. (2007). Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann Publishers.

[7] Kusrini, & Luthfi, E. T. (2009). Algoritma Data Mining. Yogyakarta: Andi Publishing.

[8] Bramer, M. (2007). Principles of Data Mining. London: Springer.


Thursday, 05 November 2015
