
Privacy Preserving Distributed Decision Forest based Inductive Learning

Sanya Ralhan, Stacey Truex

[email protected], [email protected]

Abstract

One of the biggest technology trends that almost all companies are investing in is machine learning and the knowledge generated from it. Organizations spend a great deal of money and time obtaining clean, noise-free data to train their models, in order to build machine learning infrastructure with high accuracy and a low false positive rate. Data-driven machine learning can vastly improve the quality of our daily lives and is already doing so in many ways. Healthcare providers use systems based on machine learning to diagnose patients; wearable devices connect to fitness tracking apps that use machine learning to make personal health recommendations; search engines and social media sites rely on machine learning to decide what to show each individual user, including which advertisements; e-commerce companies leverage machine learning to determine which products or movies to recommend to customers based on their prior purchase behaviour; content distributors like Netflix and HBO use customers' viewing patterns to suggest watch lists; and online dating services use machine learning in an attempt to connect people and find love. There are many other examples where companies earn revenue or tailor their services to users. To benefit from any of these personalized services, the personal data of users, such as personal preferences, browsing behaviour or medical lab results, needs to be scored against a trained machine learning model. If companies and organizations could combine their common learning in a way that never leaks the privacy of the training data or the model details, the involved parties could obtain a result more accurate than any individual classification.

Categories and Subject Descriptors: Security, Machine Learning, End User Computing

Keywords: private classification; decision trees; secure multiparty computation; secret sharing; privacy preserving computation; ensemble learning

1. Introduction

In this paper, we extend the work of Martine De Cock et al. [1] to make the protocol work for multiple parties and to combine their results while maintaining the privacy-preserving properties of the protocol. More specifically, we deal with scenarios where Alice holds a feature vector for which she needs a classification result and can score it in parallel against multiple parties (Bobs), such that at the end of the protocol Alice obtains an aggregate of the results from these Bobs, yet learns nothing about the individual results and nothing besides the depths of the Bobs' models. The Bobs learn nothing about the actual input from Alice.

Our results are proven in the so-called commodity-based model [2], [3], in which correlated data is distributed to Alice and Bob during a setup phase. Later, during an online phase, Alice and Bob use these commodities to run the desired computation on their respective inputs. This correlated randomness can be pre-distributed by a trusted authority or pre-computed by the players during a setup phase using well-known protocols from the literature (see, for instance, [4], [5], [6]). The commodities do not depend on the actual inputs of Alice or Bob. Thus, when a trusted authority is used to distribute the commodities, it never engages in the actual computation during the online phase and never learns any information about the model held by Bob or the data possessed by Alice. The protocols in our online phase are information-theoretically secure; that is, if the commodities are provided in an information-theoretically secure fashion, the overall protocol is information-theoretically secure. Finally, unlike previously proposed solutions [7], [8], our protocols use only modular additions and multiplications; no modular exponentiations are required.

The main idea behind our solutions is to decompose the problem of obtaining privacy-preserving classifiers into the problem of obtaining secure versions of a few building blocks: distributed multiplication, distributed comparison, bit decomposition of shares and oblivious input selection.
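As a concrete illustration of the substrate underlying these blocks, here is a minimal single-process sketch, in Python, of additive secret sharing over Z_q and of distributed multiplication from a pre-distributed Beaver-style multiplication triple. All names (share, reconstruct, beaver_triple, distributed_multiplication) are illustrative assumptions, not the protocol specification of [1]; note that only modular additions and multiplications are used, matching the claim above.

import secrets

Q = 2**61 - 1  # an example prime modulus; any sufficiently large q works

def share(x):
    """Split x into two additive shares: x = x0 + x1 (mod Q)."""
    x0 = secrets.randbelow(Q)
    return x0, (x - x0) % Q

def reconstruct(x0, x1):
    """Recombine the two shares."""
    return (x0 + x1) % Q

def beaver_triple():
    """Setup-phase commodity: shares of random a, b and of c = a*b.
    Crucially, the triple is independent of the parties' inputs."""
    a, b = secrets.randbelow(Q), secrets.randbelow(Q)
    return share(a), share(b), share((a * b) % Q)

def distributed_multiplication(x_sh, y_sh):
    """Online phase: shares of x*y from shares of x and y. Both parties'
    shares appear together only because this sketch runs in one process;
    d and e are the only values the parties would actually exchange."""
    (a0, a1), (b0, b1), (c0, c1) = beaver_triple()
    d = (x_sh[0] - a0 + x_sh[1] - a1) % Q  # opened: d = x - a
    e = (y_sh[0] - b0 + y_sh[1] - b1) % Q  # opened: e = y - b
    z0 = (c0 + d * b0 + e * a0 + d * e) % Q  # one party also adds d*e
    z1 = (c1 + d * b1 + e * a1) % Q
    return z0, z1

# Example: shares of 6 and 7 yield shares of 42; neither share alone
# reveals anything about the underlying values.
x_sh, y_sh = share(6), share(7)
assert reconstruct(*distributed_multiplication(x_sh, y_sh)) == 42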

2. Applications

• Government / public agencies

- The Centers for Disease Control want to identify disease outbreaks for releasing annual reports.

- Insurance companies and hospitals have data on disease incidents, severity, patient background, etc.

- Releasing such sensitive PII data is not allowed under HIPAA.

• Industry Collaborations / Trade Groups

- An industry or trade group may want to identify best practices to help its members.

- Some practices are trade secrets.

- How can common results be provided to all (e.g., manufacturing using chemical supplies from supplier X has high failure rates) while still preserving secrets (manufacturing process Y gives low failure rates)?

• Enterprise / Network Security

- Intrusion detection, malware detection and firewalls leverage a variety of machine learning tools, including decision trees and random forests.

- DGA (domain generation algorithm) classification identifies domains used by bots on infected computers to contact their command and control servers. Tree-based DGA classification is common.


- Each organization's trained model can be used as a base model input for a more accurate and efficient ensemble.

• Supply chain planning

- Very large companies such as Home Depot and Walmart have large distributed databases holding information on distributed facilities, company offices, industrial customers, distributors, suppliers, etc.

- While the umbrella company is the same, some or all of this information can be sensitive to a specific location, in which case individual models can be trained privately and results shared without privacy loss.

- Similarly, big multinational firms such as PWC and EY have offices across the globe, and privacy laws differ across continents, making knowledge sharing difficult even within the 'big' company.

• Social Networks

- Social networks such as Facebook and Instagram heavily deploy decision forests to improve features such as friend and page suggestions, newsfeeds, etc.

- Privacy is a very big concern.

- Even though they are in some ways competitors, all of them could use collaborative decision models to make their systems more accurate and efficient without releasing customer information.

Besides these, there may be many more applications in critical organizations such as the FBI and the US Armed Forces, where information sharing is not an option but a leakage-proof way of sharing learning would be extremely valuable.

Two interesting problems that were not answered in [1], but which our protocol overcomes, are the following.

Scenario 1: Two parties, Alice and Bob, jointly train a decision tree model with each providing its own private input data. Alice can then independently execute the decision tree training algorithm with the same parameters used in the joint training to see how Bob's dataset impacted the decision tree structure. The black-box access to Bob's data in the joint training can easily lead to unwanted privacy leakage of Bob's private input data, as Alice is able to infer detailed statistics about Bob's private training data through careful reverse engineering.

Scenario 2: An adversarial client may query a decision tree repeatedly with cleverly engineered classification instances and, using only the classification results and a priori information about the decision tree algorithm, reverse-engineer the decision tree itself.

Beyond the loss of privacy for the decision tree structure, this leakage also reveals some amount of information about the underlying data on which the tree was trained. In some domains, this data leakage can be even more alarming than the model leakage. Solutions to privacy-preserving decision tree learning using secure multiparty computation techniques often suffer from this pitfall of black-box access to computation results. Our protocol effectively overcomes these problems.

3. Privacy Preserving Evaluation Protocols

3.1 Decision Forests: A decision forest consists of a set of decision trees. A random forest [10], for example, is a popular decision forest model with different, specific attributes in its deployment. In our scenario, each tree in the forest is trained separately on a separate dataset. This way, each organization can train its own tree privately on its own data. The total set of trees from all participating organizations then makes up the ensemble model. To evaluate an unseen instance, each tree outputs its own prediction and the overall prediction is the average of the individual trees' outputs. Whoever is available at the time of evaluation is considered part of the ensemble for that specific evaluation run. Dietterich notes that '[a] necessary and sufficient condition for an ensemble of classifiers to be more accurate is that the classifiers are accurate and diverse [11].' There are existing works ([12], [13], [14]) which provide techniques for accurate decision tree training that each organization can leverage locally. The diversity requirement, we argue, follows directly from the disjoint training processes: each decision tree is trained using the different datasets and strategies of the different organizations. Therefore, in a distributed decision forest setup, organizations can benefit from the knowledge of other institutions in addition to the improved accuracy of ensemble learning.

For our semi-honest decision forest evaluation, we use the same decision tree notation as in [1]. That is, each tree D is represented as D = (d, G, H, w), where d is the depth of the tree, G : {1, ..., 2^d} -> {1, ..., k} maps the index of each leaf node to its corresponding class, H : {1, ..., 2^d − 1} -> {1, ..., t} maps the index of each internal node to the appropriate feature index in the evaluation instance vector, and w = (w_1, ..., w_{2^d − 1}) with each w_i ∈ R being the threshold of internal node v_i. For a more comprehensive introduction to decision tree structures we direct the interested reader to [15]. For each internal node v_i with 1 ≤ i ≤ 2^d − 1, let z_i denote the result of comparing x_H(i) with w_i (i.e., z_i is one if x_H(i) ≥ w_i and zero otherwise). The classification process for a decision forest with N trees then goes as follows:

• For each tree D_t in the decision forest:

– Starting from the root node, for the current internal node v_i, evaluate z_i. If z_i = 1, take the left branch; otherwise, take the right branch.

– The tree terminates when a leaf is reached. If the jth leaf is reached, then the output of this tree is c_t = c_G(j).

• The output of the forest is the average of the individual trees' outputs, c = (1/N) · Σ_{t=1}^{N} c_t.
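For concreteness, the following is a minimal sketch of this plaintext evaluation procedure, before any cryptography is applied. The array-based complete-binary-tree layout (node i has children 2i+1 and 2i+2) and all names are illustrative assumptions, not the paper's implementation.

from dataclasses import dataclass

@dataclass
class Tree:
    d: int     # depth of the tree
    G: list    # leaf index (0-based, length 2^d) -> class label
    H: list    # internal node index (0-based, length 2^d - 1) -> feature index
    w: list    # internal node index (0-based, length 2^d - 1) -> threshold

def evaluate_tree(tree, x):
    """Walk from the root: z_i = 1 (go left) iff x[H(i)] >= w_i."""
    i = 0
    for _ in range(tree.d):
        z = 1 if x[tree.H[i]] >= tree.w[i] else 0
        i = 2 * i + 1 if z == 1 else 2 * i + 2
    return tree.G[i - (2 ** tree.d - 1)]  # array index -> leaf index

def evaluate_forest(trees, x):
    """Forest output: the average of the individual tree outputs."""
    return sum(evaluate_tree(t, x) for t in trees) / len(trees)

# Example: a depth-1 stump that outputs class 1 when x[0] >= 5.0, else 0.
stump = Tree(d=1, G=[1, 0], H=[0], w=[5.0])
assert evaluate_forest([stump], [7.0]) == 1.0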

3.2 Security Model: We work in the commodity-based model under the semi-honest adversarial model. The commodity-based model relies on the idea of a trusted initializer (TI). In general, a TI is a trusted third party outside of the system who distributes cryptographic data to the participating parties before the protocol execution [16]. Though the model was formalized in [2], there are multiple interpretations of the TI model throughout the literature. For example, a key/certification distribution center can be considered a trusted initializer that pre-distributes confidential data to protocol participants to ensure their confidential communication. Another use of the TI is seen in [9], where the TI is used to implement bit commitment and oblivious transfer protocols. In our context, we model the trusted initializer by an ideal functionality F_DTI, formalized in Figure 1. In practice, there are multiple ways to implement the trusted initializer functionality, as sketched after this list:

• a trusted server can be set up to distribute the correlated commodities to participating parties in the set-up stage through secure channels; or

• a server with an unknown trust level can be set up to distribute the correlated commodities by utilizing techniques from zero-knowledge proof systems [17].
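As a minimal sketch of the first option, the following hypothetical TrustedInitializer deals input-independent multiplication triples (the commodities from the earlier sketch) to two parties during setup and then goes offline; delivery over the secure channels of Section 3.3 is assumed, and all names are illustrative rather than the F_DTI specification.

import secrets

Q = 2**61 - 1  # example modulus, as in the earlier sketch

def _share(x):
    """Additive sharing of x over Z_Q."""
    x0 = secrets.randbelow(Q)
    return x0, (x - x0) % Q

class TrustedInitializer:
    def deal(self, n_triples):
        """Pre-compute n multiplication triples and return one package per
        party. The triples depend only on fresh randomness, never on the
        parties' inputs, so the TI learns nothing about Bob's model or
        Alice's feature vector and never joins the online phase."""
        alice_pkg, bob_pkg = [], []
        for _ in range(n_triples):
            a, b = secrets.randbelow(Q), secrets.randbelow(Q)
            a0, a1 = _share(a)
            b0, b1 = _share(b)
            c0, c1 = _share((a * b) % Q)
            alice_pkg.append((a0, b0, c0))
            bob_pkg.append((a1, b1, c1))
        return alice_pkg, bob_pkg

# Setup phase: each party receives only its own package.
alice_pkg, bob_pkg = TrustedInitializer().deal(n_triples=128)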

3.3 Assumptions: In this paper, we assume, in addition to the trusted initializer functionality, that there exist secure channels between each pair of participating parties. A secure channel is a secure bidirectional communication tunnel between two principals that is resistant to eavesdropping and tampering. In the decision tree aggregation stage, we also assume the existence of a secure broadcast channel between all servers or model holders. A secure broadcast channel ensures that "a common message can be reliably and securely transmitted at a rate independent of the number of receivers" [18].

4. Building Blocks

We next outline two important protocols which we will use as building blocks in our privacy-preserving decision forest evaluation protocol.

1) Distributed Comparison: Originally proposed by Garay et al. [18], the secure distributed comparison protocol Π_DC operates through secret sharing in the field Z_2, taking ⌈log l⌉ + 1 rounds and 3l − ⌈log l⌉ − 2 multiplications for l-bit inputs. Intuitively, the protocol Π_DC allows parties, each holding shares in Z_2 of each bit of two values x and y, to receive shares in Z_2 of 1 if x ≥ y and of 0 otherwise.
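To illustrate this functionality (though not the log-round construction of [18] itself), here is a single-process sketch that computes XOR shares of [x ≥ y] from bit-wise shares in Z_2, scanning from the most significant bit. It takes a linear number of AND gates, each consuming one Beaver-style triple, and all names are illustrative assumptions.

import secrets

def share_bit(b):
    """Split a bit into two XOR shares in Z_2."""
    r = secrets.randbits(1)
    return r, b ^ r

def xor_sh(x, y):
    """XOR of shared bits is free: each party XORs locally."""
    return x[0] ^ y[0], x[1] ^ y[1]

def not_sh(x):
    """NOT: one designated party flips its share."""
    return x[0] ^ 1, x[1]

def and_sh(x, y):
    """Secure AND via a Beaver triple over Z_2; d and e are the only
    values the parties would exchange (one process here for brevity)."""
    a, b = secrets.randbits(1), secrets.randbits(1)
    (a0, a1), (b0, b1), (c0, c1) = share_bit(a), share_bit(b), share_bit(a & b)
    d = x[0] ^ a0 ^ x[1] ^ a1  # opened: d = x XOR a
    e = y[0] ^ b0 ^ y[1] ^ b1  # opened: e = y XOR b
    z0 = c0 ^ (d & b0) ^ (e & a0) ^ (d & e)  # party 0 also adds d AND e
    z1 = c1 ^ (d & b1) ^ (e & a1)
    return z0, z1

def compare_ge(x_bits, y_bits):
    """Shares of [x >= y] from bit shares, most significant bit first:
    x >= y iff all bits are equal or x has 1 at the first difference."""
    first_gt = (0, 0)   # shares of 0
    eq = share_bit(1)   # "all bits so far are equal"
    for xb, yb in zip(x_bits, y_bits):
        gt_here = and_sh(xb, not_sh(yb))                  # x_i = 1, y_i = 0
        first_gt = xor_sh(first_gt, and_sh(eq, gt_here))  # disjoint events, so XOR acts as OR
        eq = and_sh(eq, not_sh(xor_sh(xb, yb)))
    return xor_sh(first_gt, eq)

# Example: 5 >= 3 over 3-bit inputs; XOR the two output shares to open.
x_sh = [share_bit(int(b)) for b in format(5, "03b")]
y_sh = [share_bit(int(b)) for b in format(3, "03b")]
r0, r1 = compare_ge(x_sh, y_sh)
assert r0 ^ r1 == 1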
