Decision tree is a traditional algorithm using tree structure to do the
classification and regression. The algorithm break down a dataset into smaller
and smaller subsets. This algorithm has been widely used for many years, however, in some private
situation one cannot only use the traditional information to do the classification. The privacy of
important information should be also take into consideration. Thus, we here implement a revised decision tree
algorithm to do the task in privacy situation.
The following section will fist introduce the
traditional ID3 algorithm, and then illustrate the two-party joint decision
tree construction algorithms.
ID3 Algorithm
The algorithm used for building decision tree is ID3 by J.R.Quinlan
whose method employs a top-down, greedy search through the space of possible branches with no backtracking.
ID3 uses Entropy and Information Gain to construct a decision tree. Next we use an example to illustrate how the algorithm implement
Dataset
The dataset used for explain how the ID3 algorithm work is the
well-known play golf dataset shown as follows. The
right side displays the whole structure of the constructed decision
tree.
Play Golf Dataset
Outlook
Temp
Humidity
Windy
Play Golf
Rainy
Hot
High
False
No
Rainy
Hot
High
True
No
Overcast
Hot
High
False
Yes
Sunny
Mild
High
False
Yes
Sunny
Cool
Normal
False
Yes
Sunny
Cool
Normal
True
No
Overcast
Cool
Normal
True
Yes
Rainy
Mild
High
False
No
Rainy
Cool
Normal
False
Yes
Sunny
Mild
Normal
False
Yes
Rainy
Mild
Normal
True
Yes
Overcast
Mild
High
True
Yes
Overcast
Hot
Normal
False
Yes
Sunny
Mild
High
True
No
Privacy Two-party ID3 Algorithm
The two-party ID3 algorithm is based on the traditional ID3 algorithm
introduced above. The algorithm is proposed in order to deal with some
practical situations. Assume that there are two organizations, and they
both have some data, and they want to corperate with each other to
mine some knowledge based on their data. However, these data involve
confidential information of their organizations and they cannot share all
the detailed information with each other, but they both still want to
discover some interesting rules based on their data. The following parts
we will introduce one intuitive approach to solve the problem.
Assuming that Angelina is the database manager of organization $A$, and Bob is the
database manager of organization $B$. They both possessed part of the dataset, and
in this example we assume that the joint of their dataset is the whole dataset.
Angelina
Angelina Dataset
Outlook
Temp
Humidity
Windy
Play Golf
Rainy
Hot
High
False
No
Rainy
Hot
High
True
No
Overcast
Hot
High
False
Yes
Sunny
Mild
High
False
Yes
Sunny
Cool
Normal
False
Yes
Sunny
Cool
Normal
True
No
Bob
Bob Dataset
Outlook
Temp
Humidity
Windy
Play Golf
Overcast
Cool
Normal
True
Yes
Rainy
Mild
High
False
No
Rainy
Cool
Normal
False
Yes
Sunny
Mild
Normal
False
Yes
Rainy
Mild
Normal
True
Yes
Overcast
Mild
High
True
Yes
Overcast
Hot
Normal
False
Yes
Sunny
Mild
High
True
No
If they build the decision tree based on their own
dataset, they both will get a "tree" like:
Angelina
Bob
Angelina
Play Golf
Yes
No
Outlook
Sunny
2
1
3
Overcast
1
0
1
Rainy
0
2
2
3
3
6
Bob
transInfo --> Outlook
Yes
No
Sunny
1
1
2
Overcast
3
0
3
Rainy
2
1
3
6
2
8
      
Thus, the information gain calculation is change to:
Thus the rest part of construction is similar with the traditional ID3 algorithm, and the only difference is
every time when calculate the information gain, one should take both the Angelina data and transInfo into consideration.