Privacy Decision Tree (Two-party jointly build decision tree)

author: Nan Meng

project repository: PrivacyDecisionTree

Motivation

Decision tree is a traditional algorithm using tree structure to do the classification and regression. The algorithm break down a dataset into smaller and smaller subsets. This algorithm has been widely used for many years, however, in some private situation one cannot only use the traditional information to do the classification. The privacy of important information should be also take into consideration. Thus, we here implement a revised decision tree algorithm to do the task in privacy situation. The following section will fist introduce the traditional ID3 algorithm, and then illustrate the two-party joint decision tree construction algorithms.

ID3 Algorithm

The algorithm used for building decision tree is ID3 by J.R.Quinlan whose method employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses Entropy and Information Gain to construct a decision tree. Next we use an example to illustrate how the algorithm implement

Dataset

The dataset used for explain how the ID3 algorithm work is the well-known play golf dataset shown as follows. The right side displays the whole structure of the constructed decision tree.

Play Golf Dataset

Outlook Temp Humidity Windy Play Golf
Rainy Hot High False No
Rainy Hot High True No
Overcast Hot High False Yes
Sunny Mild High False Yes
Sunny Cool Normal False Yes
Sunny Cool Normal True No
Overcast Cool Normal True Yes
Rainy Mild High False No
Rainy Cool Normal False Yes
Sunny Mild Normal False Yes
Rainy Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Sunny Mild High True No


















Privacy Two-party ID3 Algorithm

The two-party ID3 algorithm is based on the traditional ID3 algorithm introduced above. The algorithm is proposed in order to deal with some practical situations. Assume that there are two organizations, and they both have some data, and they want to corperate with each other to mine some knowledge based on their data. However, these data involve confidential information of their organizations and they cannot share all the detailed information with each other, but they both still want to discover some interesting rules based on their data. The following parts we will introduce one intuitive approach to solve the problem.

Assuming that Angelina is the database manager of organization $A$, and Bob is the database manager of organization $B$. They both possessed part of the dataset, and in this example we assume that the joint of their dataset is the whole dataset.

Angelina

Angelina Dataset

Outlook Temp Humidity Windy Play Golf
Rainy Hot High False No
Rainy Hot High True No
Overcast Hot High False Yes
Sunny Mild High False Yes
Sunny Cool Normal False Yes
Sunny Cool Normal True No

Bob

Bob Dataset

Outlook Temp Humidity Windy Play Golf
Overcast Cool Normal True Yes
Rainy Mild High False No
Rainy Cool Normal False Yes
Sunny Mild Normal False Yes
Rainy Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Sunny Mild High True No

If they build the decision tree based on their own dataset, they both will get a "tree" like:

Angelina

Bob

Apparently, the result is worse than the decision tree constructed based on the original dataset. Now they want to build the same decision tree shown above at step 3. From the algorithm illustrated above, the decision tree has been constructed hierarchically. At every spliting node, we need to calculate the information gain and it can be calculated if the number of objects of each class are known. Thus, when Angelina wants to construct a tree based on universal dataset, she only need to know the number of objects of each class in Bob dataset.

Let's define a new variable transInfo which contains the information about number of objects of each class in Bob dataset. First, Angelina use the traditional algorithm to determine which attribute should be the first one to divide the dataset, i.e. which attribute should be the root.

When Angelina wants to calculate the information gain of Outlook, she first count the number of objects belong to each class in her own dataset, and then based on the transInfo, she will get the final result. The calculation is shown as follows:

Angelina

Play Golf
Yes No
Outlook Sunny 2 1 3
Overcast 1 0 1
Rainy 0 2 2
3 3 6

Bob

transInfo --> Outlook
Yes No
Sunny 1 1 2
Overcast 3 0 3
Rainy 2 1 3
6 2 8

       Thus, the information gain calculation is change to:

       $Gain(Outlook) = Entropy(PlayGolf, transInfo) - Entropy(PlayGolf, transInfo, Outlook)$

                                   $ = I(\frac{3+6}{3+3+6+2},\frac{3+2}{3+3+6+2}) - Entropy(PlayGolf, transInfo, Outlook)$

                                   $ = I(\frac{3+6}{3+3+6+2},\frac{3+2}{3+3+6+2}) - (P(Sunny)*E(2+1,1+1)+P(Overcast)*E(1+3,0)+P(Rainy)*E(0+2,2+1))$

                                   $ = I(0.36,0.64) - (\frac{5}{14} \times I(0.6,0.4)+\frac{4}{14} \times I(1,0)+\frac{5}{14} \times I(0.4,0.6))$

                                   $ = 0.94 - 0.693 = 0.247$

       The information gain of rest three attributes can be calculated in the same way:

$Gain(Temp) = Entropy(PlayGolf,transInfo) - Entropy(PlayGolf, transInfo, Temp) = 0.029$

$Gain(Humidity) = Entropy(PlayGolf,transInfo) - Entropy(PlayGolf, transInfo, Humidity) = 0.152$

$Gain(Windy) = Entropy(PlayGolf,transInfo) - Entropy(PlayGolf, transInfo, Windy) = 0.048$

Thus the rest part of construction is similar with the traditional ID3 algorithm, and the only difference is every time when calculate the information gain, one should take both the Angelina data and transInfo into consideration.