Privacy Decision Tree (Two-party jointly build decision tree)

Motivation

Decision tree is a traditional algorithm using tree structure to do the classification and regression. The algorithm break down a dataset into smaller and smaller subsets. This algorithm has been widely used for many years, however, in some private situation one cannot only use the traditional information to do the classification. The privacy of important information should be also take into consideration. Thus, we here implement a revised decision tree algorithm to do the task in privacy situation. The following section will fist introduce the traditional ID3 algorithm, and then illustrate the two-party joint decision tree construction algorithms.

ID3 Algorithm

The algorithm used for building decision tree is ID3 by J.R.Quinlan whose method employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses Entropy and Information Gain to construct a decision tree. Next we use an example to illustrate how the algorithm implement

Dataset

The dataset used for explain how the ID3 algorithm work is the well-known play golf dataset shown as follows. The right side displays the whole structure of the constructed decision tree.

Play Golf Dataset

Outlook	Temp	Humidity	Windy	Play Golf
Rainy	Hot	High	False	No
Rainy	Hot	High	True	No
Overcast	Hot	High	False	Yes
Sunny	Mild	High	False	Yes
Sunny	Cool	Normal	False	Yes
Sunny	Cool	Normal	True	No
Overcast	Cool	Normal	True	Yes
Rainy	Mild	High	False	No
Rainy	Cool	Normal	False	Yes
Sunny	Mild	Normal	False	Yes
Rainy	Mild	Normal	True	Yes
Overcast	Mild	High	True	Yes
Overcast	Hot	Normal	False	Yes
Sunny	Mild	High	True	No

Entropy

ID3 adopted an information based method and measure the homogeneity of dataset based on entropy. If all the subset data belongs to a single class, i.e. the subset data is totally homogenous, the entropy is zero and if all the subset data evenly divided into each class, the entropy is one. More generally, if dataset $D$ contain $p$ objects of class $P$ and $n$ of class $N$. An arbitrary object will be classified to class $P$ with probability $p/(p+n)$ and to class $N$ with probability $n/(p+n)$.

$Entropy = I(p,n) = -\frac{p}{p+n} \times log_{2}(\frac{p}{p+n}) - \frac{n}{p+n} \times log_{2}(\frac{n}{p+n})$

a) Entropy of target class of the dataset(Whole entropy):

$Entropy(PlayGolf)$
$= Entropy(5, 9) = I(\frac{5}{5+9},\frac{9}{5+9})$
$= I(0.36, 0.64)$

$= -(0.36log_{2}0.36)-(0.64log_{2}0.64)$

$= 0.94$

Play Golf
Yes	No
9	5

If attribute $A$ has multiple values ${A_1,A_2,A_3,...,A_k}$, the decision tree will classify the dataset $D$ into ${D_1,D_2,D_3,...,D_k}$, where $D_i$ with the attribute value $A_i$ of attribute $A$. Assume that the subset $D_i$ contains $p_i$ objects of class $P$ and $n_i$ objects of class $N$. Then the expected information of subset $C_i$ is $I(p_i,n_i)$. The expected information for attribute $A$ is the summation of all the subset, which can be expressed as:

$Entropy(A) = \sum_{i=1}^{k}\frac{p_i+n_i}{p+n}I(p_i,n_i)$

b) Entropy of every attribute(take attribute Outlook for example).

$Entropy(PlayGolf, Outlook)$

$= P(Sunny) * E(3,2) + P(Overcast) * E(4,0) + P(Rainy) * E(2,3)$
$= \frac{5}{14} \times I(0.6,0.4)+\frac{4}{14} \times I(1,0)+\frac{5}{14} \times I(0.4,0.6)$
$= 0.693$

		Play Golf
		Yes	No
Outlook	Sunny	3	2	5
	Overcast	4	0	4
	Rainy	2	3	5
				14

Information Gain

The information gain by dividing dataset on attribute $A$ is the decrease in entropy after a dataset is split on attribute $A$. Thus, the calculation of attribute Outlook is:

$Gain(Outlook) = Entropy(PlayGolf) - Entropy(PlayGolf, Outlook) = 0.940-0.693 = 0.247$

The information gain of rest three attributes can be calculated in the same way:

$Gain(Temp) = Entropy(PlayGolf) - Entropy(PlayGolf, Temp) = 0.029$

$Gain(Humidity) = Entropy(PlayGolf) - Entropy(PlayGolf, Humidity) = 0.152$

$Gain(Windy) = Entropy(PlayGolf) - Entropy(PlayGolf, Windy) = 0.048$

Then choose the attribute with the largest information gain as the decision node. So the tree-forming method used in ID3 would choose Outlook as the attribute for the root of the decision tree. Based on the values of Outlook, the branch with entropy of zero is a leaf node and the branch with entropy larger than 0 needs further dividing.

The following steps of tree construction displayed as follows:

Privacy Two-party ID3 Algorithm

The two-party ID3 algorithm is based on the traditional ID3 algorithm introduced above. The algorithm is proposed in order to deal with some practical situations. Assume that there are two organizations, and they both have some data, and they want to corperate with each other to mine some knowledge based on their data. However, these data involve confidential information of their organizations and they cannot share all the detailed information with each other, but they both still want to discover some interesting rules based on their data. The following parts we will introduce one intuitive approach to solve the problem.

Assuming that Angelina is the database manager of organization $A$, and Bob is the database manager of organization $B$. They both possessed part of the dataset, and in this example we assume that the joint of their dataset is the whole dataset.

Angelina

Angelina Dataset

Outlook	Temp	Humidity	Windy	Play Golf
Rainy	Hot	High	False	No
Rainy	Hot	High	True	No
Overcast	Hot	High	False	Yes
Sunny	Mild	High	False	Yes
Sunny	Cool	Normal	False	Yes
Sunny	Cool	Normal	True	No

Bob

Bob Dataset

Outlook	Temp	Humidity	Windy	Play Golf
Overcast	Cool	Normal	True	Yes
Rainy	Mild	High	False	No
Rainy	Cool	Normal	False	Yes
Sunny	Mild	Normal	False	Yes
Rainy	Mild	Normal	True	Yes
Overcast	Mild	High	True	Yes
Overcast	Hot	Normal	False	Yes
Sunny	Mild	High	True	No

If they build the decision tree based on their own dataset, they both will get a "tree" like:

Angelina

Bob

Angelina

		Play Golf
		Yes	No
Outlook	Sunny	2	1	3
	Overcast	1	0	1
	Rainy	0	2	2
		3	3	6

Bob

	transInfo --> Outlook
	Yes	No
Sunny	1	1	2
Overcast	3	0	3
Rainy	2	1	3
	6	2	8

Thus, the information gain calculation is change to:

       $Gain(Outlook) = Entropy(PlayGolf, transInfo) - Entropy(PlayGolf, transInfo, Outlook)$

                                   $ = I(\frac{3+6}{3+3+6+2},\frac{3+2}{3+3+6+2}) - Entropy(PlayGolf, transInfo, Outlook)$

                                   $ = I(\frac{3+6}{3+3+6+2},\frac{3+2}{3+3+6+2}) - (P(Sunny)*E(2+1,1+1)+P(Overcast)*E(1+3,0)+P(Rainy)*E(0+2,2+1))$

                                   $ = I(0.36,0.64) - (\frac{5}{14} \times I(0.6,0.4)+\frac{4}{14} \times I(1,0)+\frac{5}{14} \times I(0.4,0.6))$

                                   $ = 0.94 - 0.693 = 0.247$

The information gain of rest three attributes can be calculated in the same way:

$Gain(Temp) = Entropy(PlayGolf,transInfo) - Entropy(PlayGolf, transInfo, Temp) = 0.029$

$Gain(Humidity) = Entropy(PlayGolf,transInfo) - Entropy(PlayGolf, transInfo, Humidity) = 0.152$

$Gain(Windy) = Entropy(PlayGolf,transInfo) - Entropy(PlayGolf, transInfo, Windy) = 0.048$

Thus the rest part of construction is similar with the traditional ID3 algorithm, and the only difference is every time when calculate the information gain, one should take both the Angelina data and transInfo into consideration.