
Content-Based Access Control

Submitted to the Department of Electrical Engineering and Computer Science and the

Graduate Faculty of the University of Kansas

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

Date defended: April 3, 2015

The Dissertation Committee for Wenrong Zeng certifies that this is the approved version of the following dissertation:

Date approved:


Abstract

In conventional databases, the most popular access control models specify policies explicitly and manually, for each role of every user against each data object. Nowadays, in large-scale content-centric data sharing, such conventional approaches can be impractical, due to the exponential explosion of data growth and the sensitivity of data objects. Moreover, conventional database access control policies are not functional when the semantic content of data is expected to play a role in access decisions. Users are often over-privileged, and ex post facto auditing is enforced to detect misuse of the privileges. Unfortunately, it is usually difficult to reverse the damage, as (large amounts of) data have already been disclosed.

In this dissertation, we first introduce Content-Based Access Control (CBAC), an innovative access control model for content-centric information sharing. As a complement to conventional access control models, the CBAC model makes access control decisions automatically, based on the content similarity between user credentials and data content. In CBAC, each user is allowed by a meta-rule to access "a subset" of the designated data objects of a content-centric database, while the boundary of the subset is dynamically determined by the textual content of the data objects. We then present an enforcement mechanism for CBAC that exploits Oracle's Virtual Private Database (VPD) to implement row-wise access control and to prevent data objects from being abused through unnecessary access admission. To further improve the performance of the proposed approach, we introduce a content-based blocking mechanism that improves the efficiency of CBAC enforcement and reveals a more relevant portion of the data objects than using the user credentials and data content alone. We also utilize several tagging mechanisms for more accurate textual content matching on short text snippets (e.g., short VarChar attributes), extracting topics rather than pure word occurrences to represent the content of data. In the tagging mechanism, content similarity is calculated based not purely on word occurrences but on the semantic topics underlying the text content. Experimental results show that CBAC makes accurate access control decisions with a small overhead.


Acknowledgements

In this section, I would like to express my gratitude to my advisors, my colleagues, my committee members, and my family for their encouragement, support, and assistance along the road of my PhD study.

First of all, I would like to thank my advisor Dr. Bo Luo for his valuable guidance throughout my thesis work. He was a great advisor to work with. He originally led me to the field of access control and patiently explained its fundamental background: what it is, why it is important to database security, and its potential impact on big data platforms. He has been very supportive and enthusiastic in all our discussions. He often inspired me with his solid background knowledge of database security and kindly provided insights for drawing conclusions from experimental results, which helped me push the experiments forward while consolidating my work. I would also like to thank my previous advisor Dr. Xue-wen Chen. When I began my PhD, he led me to the field of machine learning. He directed me to multi-label learning, which is another major part of my PhD work. His guidance led me to explore multi-label applications in image analysis with graphical modeling, and theoretical optimization of multi-label methods. I am also very grateful to my committee members: Dr. Arvin Agah, Dr. Jerzy Grzymala-Busse, Dr. Prasad Kulkarni, and Dr. Alfred Tat-kei Ho. They offered professional suggestions on my proposal and dissertation, and kindly provided their insights on further directions and experiments for my work. Without their help, I could not have finished the entire course of my PhD.

Secondly, I would like to thank all my colleagues at the University of Kansas. They began as colleagues and ended up being my best friends. They have provided a lot of happiness, support, and assistance in my life and study. Working together with everyone is a memorable part of the entire course of my study. Dr. Hongliang Fei, Dr. Yi Jia, Dr. Jintao Zhang, and Junyan Li: I have been grateful to meet them during the painful yet rewarding PhD study. Whenever I ran into problems, they always provided valuable suggestions. Besides, I owe special thanks to Dr. Jong Cheol Jeong. He is a valuable colleague to work with, thorough in mind and detail-oriented in execution. He was willing to help whenever I felt confused and lost in research. His determination in research has always made him my role model.

Last but not least, I want to thank my family. My parents gave me endless courage and love during my PhD. They visited me three times from China and supported me with all their assistance in my household. They are the best parents one could ask for. My husband, the one I am super lucky to have, is the most supportive, patient, generous, and humorous man in my life, who has brought happiness and healed the pain. My baby daughter, Brenda: I would like to thank her for being the best project I have ever done. Although she does cry a lot, she laughs more. Her smile is the best gift after work.


Contents

1 Introduction
   1.1 Introduction
2 Related Works
   2.1 Access Control Models
      2.1.1 Discretionary Access Control
      2.1.2 Role-Based Access Control
      2.1.3 Attribute-Based Access Control
      2.1.4 Policy-Based Access Control
      2.1.5 Risk-Adaptable Access Control
      2.1.6 Access Control Based on Content
   2.2 Oracle Virtual Private Database (VPD)
3 Text Feature Extraction
   3.1 TF-IDF
      3.1.1 Stop Word
      3.1.2 Stemming
   3.2 n-Gram
   3.3 Topic Modeling
      3.3.1 Latent Dirichlet Allocation
      3.3.2 Non-negative Matrix Factorization
   3.4 TAGME
4 Content-Based Access Control Model
   4.1 Background and Assumptions
   4.2 Contribution
   4.3 Model Definition
   4.4 Content Similarity
   4.5 Top-K Similarity
5 CBAC Enforcement
   5.1 CBAC On-the-Fly Enforcement
      5.1.1 The Basic CBAC Model
      5.1.2 Experiments
   5.2 Offline CBAC
      5.2.1 Unsupervised Nearest Neighbor Offline Training
         5.2.1.1 Brute Force Algorithm
         5.2.1.2 K-D Tree
         5.2.1.3 Ball Tree Algorithm
      5.2.2 Experiments
6 CBAC Optimizing Strategies
   6.1 Content-Based Blocking
      6.1.1 Naive k-means Clustering
      6.1.2 The Advantage of Careful Seeding: k-means++
      6.1.3 Scaled k-means++ with Mini-Batch Strategy
      6.1.4 Experiments
   6.2 Content-Based Labeling
      6.2.1 Document Labeling
      6.2.2 Soundness of CBAC Enforcement
      6.2.3 Experiments
7 Labeling Improvement with Multi-Label Learning (MLL)
   7.1 Motivation
   7.2 Problem Definition and Challenges
   7.3 Background
   7.4 Related Work
   7.5 Methodology
      7.5.1 Preliminary
      7.5.2 Objective Function
      7.5.3 Algorithm
   7.6 Experiment
      7.6.1 Data Set Statistics
      7.6.2 Comparison Methods
      7.6.3 Evaluation Metric
      7.6.4 Results
   7.7 Conclusion
8 Discussions
   8.1 Computational Complexity
   8.2 Negative Rules and Conflict Resolution
   8.3 CBAC for XML Data
9 Conclusion
A The Top 10 Words of Non-Negative Matrix Factorization

List of Figures

2.1 Selected Access Control Models (NIST (2009))
2.2 Access Control List
2.3 Capability List
2.4 Role-Based Access Control Model (Sandhu et al. (1996))
2.5 Risk-Adaptable Access Control Notional Process (McGraw (2009))
3.1 Plate Notation of Latent Dirichlet Allocation
3.2 Plate Notation of Smoothed Latent Dirichlet Allocation
3.3 TAGME Annotation Example
5.1 ABAC Efficiency with QUERY1
5.2 ABAC Efficiency with QUERY2
5.3 Threshold CBAC Efficiency with QUERY1
5.4 Threshold CBAC Efficiency with QUERY2
5.5 Threshold CBAC + ABAC Efficiency with QUERY1
5.6 Threshold CBAC + ABAC Efficiency with QUERY2
5.7 Top-10 CBAC Efficiency
5.8 2-D K-D Tree Subspace Splits
5.9 K-D Tree Example
5.10 Offline Efficiency
6.1 Threshold CBAC + Blocking Efficiency with QUERY1
6.2 Threshold CBAC + Blocking Efficiency with QUERY2
6.3 Threshold CBAC + ABAC + Blocking Efficiency with QUERY1
6.4 Threshold CBAC + ABAC + Blocking Efficiency with QUERY2
6.5 Top-10 CBAC + Blocking Efficiency
6.6 Soundness of CBAC Enforcement
6.7 Threshold CBAC + Labeling Efficiency with QUERY1
6.8 Threshold CBAC + Labeling Efficiency with QUERY2
6.9 Threshold CBAC + ABAC + Labeling Efficiency with QUERY1
6.10 Threshold CBAC + ABAC + Labeling Efficiency with QUERY2
6.11 Top-10 CBAC + Labeling Efficiency
6.12 Top-10 CBAC + Blocking + Labeling Efficiency
6.13 Density Fit
6.14 Cumulative Probability Fit
6.15 NMF 100 Density Fit
6.16 NMF 100 Cumulative Probability Fit
7.1 Scene of Sunset at Sea
7.2 Molecular Function Annotation of P75957

List of Tables

2.1 Access Control Matrix Example
2.2 VPD Function Example
2.3 VPD Policy Example
5.1 Schemas
5.2 Column Description
5.3 CBAC Top-10 Example
5.4 CBAC Threshold Example
7.1 Multi-Label Example
7.2 Binary Relevance Matrix Example
7.3 Label Power-Set Example
7.4 Label Power-Set Matrix Example
7.5 Statistics of Data Sets
7.6 Imbalance Rate (%)
7.7 Sample Sizes of Data Sets
7.8 Macro-Averaging F1 Measure (%) ↑
7.9 Micro-Averaging F1 Measure (%) ↑
7.10 Subset Accuracy (%) ↑
A.1 The Top 10 Words of Non-Negative Matrix Factorization with 10 Topics
A.2 The Top 10 Words of Non-Negative Matrix Factorization with 20 Topics
A.3 The Top 10 Words of Non-Negative Matrix Factorization with 50 Topics
A.4 The Top 10 Words of Non-Negative Matrix Factorization with 100 Topics

Chapter 1

Introduction

1.1 Introduction

Simply put, database access control models and enforcement mechanisms define and enforce "who can access what". Here, "who" represents a set of users/roles, and "what" represents a set of data objects, e.g., tuples or XML nodes, or attributes of SQL databases. In conventional database access control models, database administrators (DBAs) or data owners/users explicitly specify the access rights of each data object for each role, by GRANTing or REVOKing certain rights for each role. However, due to the exponential explosion of data, especially content-centric data, such approaches may not be suitable or even practical. The reason is threefold. First, it is determined by the characteristics of content-centric data, which usually contains a lot of free text. For example, an electronic health record (EHR) is a content-centric kind of data. In an EHR, besides listing the basic information of patients (e.g., name, gender, age), doctors describe the symptoms of every patient. Instead of choosing exact words to describe the patients' symptoms, a doctor usually uses more descriptive language to record the symptoms reported by the patients. That is why EHR data is free text rather than formatted text. Free text, as the example shows, can express the same semantic meaning with different term distributions. Second, in a content-centric database, the data content is expected to play a role in making access control decisions. Let us continue with the EHR example. Before reaching a conclusive decision on what a patient's problem is, a doctor may need to review other patients' records with similar symptoms, especially for unusual diseases. In this kind of situation, it can be very difficult to explicitly describe access rights for very large numbers of data objects, especially when the decisions are based on content: it is too labor-intensive to require a system administrator to manually examine every record in the database and assign access rights to each user/role. Third, in distributed and dynamic environments, it can be difficult to explicitly define access rights for every user from remote peers; e.g., an organization can easily create new roles without notifying its collaborators, which happens a lot in information sharing. In this case, access control decisions can be based on a remote requestor's knowledge that is dynamically submitted with every query. Meanwhile, in distributed information sharing, data owners may only want to share with people who contribute similar data (which might reveal similar interests), due to the sensitivity of the data content; yet they cannot specify access control rules unless they explore the content of others' data. To further motivate this research, let us see the following examples:

Example 1: A law enforcement agency (e.g., the FBI) holds a database of highly sensitive case records. A director, Bob, assigns a case to agent Alice for investigation. Naturally, the director also needs to grant Alice access to all related or similar cases. In this scenario, the concept of "related cases" is determined by the semantic content similarity of the records, which could be geographic, temporal, modus operandi, or just similarity in the textual description of the case records. Moreover, when new cases are added to the database, cases that are similar to Alice's should automatically be made accessible to Alice, without requiring the director to intervene further. For example, a newly added related case could be a crucial key to the case being investigated. Unfortunately, in the existing database access control paradigm, this type of access control description is not supported. Meanwhile, it is too labor-intensive for the director to manually examine every record to grant/revoke access. In practice, the Multi-Level Security (MLS) model is often adopted, and every agent is granted access to a large number of records: everything lower than or equal to his/her security level. Similarly, many content processing companies (e.g., survey processing and telemarketing firms) allow every employee to access all the (potentially sensitive) customer records in their databases, due to a lack of capability to enforce access control based on the textual content of the records. In all these scenarios, information sharing is either too conservative or abused through unnecessary information leakage.

Example 2: In traditional subscription systems, users pay for access to entire periodicals. For instance, a researcher interested in "information security" may subscribe to IEEE Transactions on Knowledge and Data Engineering, though he is only interested in a small portion of the papers in the journal. An alternative approach would be for each user to subscribe to a set of tags, with each paper (as a record in the database) tagged by keywords. Thus, access control decisions could be made by matching the user's tags with each paper's tags. However, such an approach suffers from two serious drawbacks: (1) tag quality is essential to the approach, but quality control is a non-trivial problem; and (2) the number of accessible papers could be too small or too large; for instance, a paper carrying a tag may be only slightly related to the tag topic. In a desired solution, the subscriber is expected to submit his interests as a textual description or to identify some seed articles (e.g., his own papers), and then be granted access to articles with similar content. In the ideal solution, access control policies granted on the basis of content similarity would further improve users' work efficiency through well-qualified, selectively chosen articles.

Example 3: In distributed information sharing scenarios, some data owners will only share their records with peers who contribute relevant data, so that the sharing is mutually beneficial. For instance, in a collaborative project with the Department of Public Administration studying citizen engagement, surveyees were found to be willing to share their opinions with others who have similar opinions. In this case, opinions are represented by a short paragraph of text. In other scientific research domains, we also see investigators sharing research data (in a shared and access-controlled repository) with colleagues who contribute similar data. Let us revisit Example 1: when the FBI collaborates with another law enforcement agency (say, the CIA), they only share "related cases", where the case relationships are assessed by semantic content similarity. Privacy-preserving similar document matching (Murugesan et al. (2010); Scannapieco et al. (2007)) has been used to identify and share similar documents. However, in the scenario where the FBI is willing to disclose cases that are similar to a known CIA case, an alternative solution is to employ database access control to allow the CIA to access the "similar cases".

Example 4: Healthcare information sharing is strictly governed by HIPAA. Medical records are well protected by healthcare providers, and are only shared under very rigorous rules. However, within the facility, users (doctors, nurses, researchers) are often given broader access privileges, while ex post facto auditing is enforced to detect and punish misuse of the privileges (Appari & Johnson (2010); Malin et al. (2007); Boxwala et al. (2011); Rostad & Edsberg (2006)). Another thrust of solutions employs the "break the glass" (BTG) mechanism, which allows users to break access control rules in a controlled manner in special circumstances (Ferreira et al. (2006)). Additional auditing will be performed once a user invokes the BTG policy.

From the examples, we can see that conventional deterministic database access control models fall short in content-centric data sharing scenarios. In such cases, a new access control model is needed that makes access decisions based on the semantic content similarity of the data. Another desirable capability of such a content-based access control model is that semantic content similarity should be measured by native functions provided by the RDBMS, requiring only minimal intervention from database administrators (DBAs). In this dissertation, we present a first attempt towards this endeavor: we present the content-based access control model and enforcement mechanisms, where access rights are granted based on the lexical similarity between the requestor's credentials and the requested records. The new model, as a complement to existing access control approaches, provides an effective and efficient means of access control that exploits content features in content-rich data sharing, and constitutes a first effort to solve the difficulties of content-centric database access control in the big data era. In the dissertation, we explore the new needs of security and privacy in distributed information systems and tackle these issues with innovative designs. We formally propose a new data-driven access control model, the content-based access control (CBAC) model, which exploits data content to achieve more flexible and powerful access control semantics for content-centric databases in information sharing. CBAC is the first attempt to create an access control model that introduces the notion of approximate security, and it is capable of dealing with situations where explicit access control policies are not available at all. In CBAC, we use machine learning methods (i.e., text mining techniques) for access control modeling and enforcement. By introducing these methods, access control principles are translated into algorithmic implementations; in this sense, we aim to bring dynamic properties, automation, and "intelligence" to access control models through these techniques.
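To make the idea concrete, the following is a minimal sketch of a content-similarity access check, assuming a simple bag-of-words cosine similarity and an illustrative threshold of 0.2; the dissertation's actual enforcement (inside Oracle VPD, Chapter 5) is more involved.

from collections import Counter
from math import sqrt

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def can_access(user_credential: str, record_text: str, threshold: float = 0.2) -> bool:
    """CBAC-style decision: grant access when the requestor's credential text
    is sufficiently similar to the record's content."""
    return cosine_similarity(user_credential, record_text) >= threshold

# A user whose credential mentions burglary cases can read a similar record.
print(can_access("agent assigned to burglary cases in downtown area",
                 "case report: burglary at a downtown warehouse"))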


Chapter 2

Related Works

Computer technology has transformed daily life, including education, careers, and entertainment. It makes it convenient for people to seek information, find jobs, and enjoy music and films. Meanwhile, computer technology has also transformed the way companies are run, including hunting for suppliers and comparing their offers; collecting, storing, and broadcasting product information; and maintaining close work with clients. Not only has computer technology improved the efficiency of everyone's daily life, it has also changed how information is created, processed, transferred, stored, and concealed. Nowadays, one of the most important security problems is to prevent unauthorized access to information, that is, to prevent unauthorized people from accessing confidential information they are not allowed to see. The common risks from unauthorized access include, but are not limited to:

- Unauthorized disclosure of information
- Disruption of computer services
- Loss of productivity, with normal computer activities delayed in time-critical applications
- Financial loss, such as corruption of information or disruption of services
- Legal implications due to lawsuits from investors, customers, or the public
- Blackmail, where intruders extort money from the company by threatening the security system

Figure 2.1: Selected Access Control Models (NIST (2009))

To avoid these risks, researchers have developed different access control models to define the paradigm of "who" has the authority to access "what". In this chapter, we select some common access control models for introduction. Figure 2.1 is modified from Figure 1 of NIST (2009) to show the relationships among these models. We follow the list of access control models in NIST (2009) and add more details about models that have concrete mathematical definitions.

Database access control research can be roughly categorized into access control models and access control enforcement. Relational access control models can be classified into mandatory access control (Jajodia & Sandhu (1991); Sandhu (1993); Sandhu & Chen (1998); Winslett et al. (1994); McCune et al. (2006); Lindqvist (2006); Thuraisingham (2009); Upadhyaya (2011)), discretionary access control (DAC) (Moffett et al. (1990); Thomas et al. (1993); Ahn (2009); Li (2011); Downs et al. (1985)), and role-based access control (RBAC) (Ferraiolo et al. (2001); Osborn et al. (2000); Sandhu et al. (1996)).

Mandatory access control (MAC) emphasizes that only the database administrators have the authority to manage access control policies and their usage. These policies cannot be modified by any users other than the administrators. Therefore, MAC is most often used in systems or databases where the highest priority is placed on confidentiality. The assignment and enforcement of access control policies under MAC models place strict restrictions on users. Dynamically altering any access control policy requires a detailed investigation of the policy itself, performed manually by database administrators. One obvious shortcoming is that any update might introduce conflicts into the overall set of access control policies; frequent database updates are also labor-intensive for administrators. Another shortcoming of MAC is that it can be too protective: it unnecessarily over-classifies data through the "high-water mark" principle and limits the ability to transfer information between users and databases. On the other hand, most real-world RDBMS implement a table/column-level DAC or RBAC similar to the one in System R (Griffiths & Wade (1976)).

Discretionary access control (DAC) is the type of access control where users have complete authority over all the data they own. They also have the authority to GRANT or REVOKE other users' access to their own data. DAC requires permission assignment between users who hold the data and users who want to access it; thus, it is commonly known as the "need-to-know" model. Compared to MAC, DAC shows an obvious advantage in enabling fine-grained control over system or database objects: data objects can carry access control restrictions with the minimum rights needed. However, system-wide security policies are extremely difficult to guarantee under DAC, as access control rights are owned by users; compromised users could pass potential threats on to the database and, further, to other users. Thus, DAC is highly prone to security problems.

Role-based access control (RBAC) is the type of access control model where users are first assigned to different roles according to their job functions in an enterprise, and permissions are then assigned not directly to users but to roles. This contrasts with the two methods of access control above, which GRANT/REVOKE user access on a rigid, object-by-object basis. In RBAC, access is easily granted to or revoked from users as their work status changes. In large organizations, clustering many users into a single role allows much more convenient management. RBAC also integrates support for the least-privilege principle, separation of duties, and central administration of role membership. Although RBAC shows great advantages over the two conventional access control models above, it also has its own limitations. In large systems, role membership, the hierarchical structure among roles, and the need to maintain the least-privilege principle make administration overwhelming. Besides, although RBAC supports data abstraction, it cannot be used to ensure that permissions on sequences of operations are controlled, as discussed in Section 2.1.2.

To enforce access control policies, view-based approaches are the traditional method of enabling row-level access control (Bertino & Haas (1988); Bertino et al. (1983)). Over the years, many models and enforcement mechanisms have been proposed (a survey is available (Samarati & de Vimercati (2001))), such as the Flexible Authorization Manager (FAM) (Jajodia et al. (1997)), temporal DAC and RBAC models (Bertino et al. (1996, 2001a)), credential-based access control (Bertino et al. (2001b); Winslett et al. (1997)), and group-centric models (Krishnan et al. (2009)); and more recently: purpose-based access control (Byun & Li (2008)), policy-based access control for the semantic web (Bhatti et al. (2007); Kagal et al. (2003)), Oracle VPD (Oracle (2012)), XML access control (Bertino & Ferrari (2002); Damiani et al. (2002, 2000); Yu et al. (2002)), and access control for the Web (Hicks et al. (2010); Park et al. (2001)). In Bertino et al. (2003a), a formal framework for logic-based reasoning about access control models is proposed to analyze the relationships between access control rules. Much effort has been devoted to facilitating effective management of users, roles, and rules in different applications, e.g., role mining and administration (Dekker et al. (2008); Molloy et al. (2010); Takabi & Joshi (2009)), policy integration and user provisioning (Li et al. (2009); Molloy et al. (2009); Ni et al. (2009); Rao et al. (2009)), and mediation in distributed systems (Leighton & Barbosa (2010); Rao & Jaeger (2009)).

More relevant to the proposed research, the notion of content-based access control has been proposed in relational access control specification (Bertino et al. (1997)) and in the context of digital libraries (Adam et al. (2002)). In Adam et al. (2002) and Bertino et al. (1997), the notion of content refers to attribute values or definitive concepts extracted from digital library objects. Access privileges are (statically) specified based on relationships between user credentials and data attributes (concepts). Meanwhile, policy-based access control models (Bhatti et al. (2007); Kagal et al. (2003)) bind access rights to user credentials (attributes); however, the decision is still based on definitive values of the attributes (e.g., users with title = "physician" could access patient records in their department). On the other hand, concept-level access control has also been proposed for the semantic web (Qin & Atluri (2003)). Our notion of content-based access control is significantly different from existing approaches: we refer to the semantic content and semantic similarity of data, as well as the notion of approximation.

Selected access control models are introduced in detail in Section 2.1.

2.1 Access Control Models

2.1.1 Discretionary Access Control

The popularity of discretionary access control (DAC) (Moffett et al. (1990); Thomas et al. (1993)) started in about the late 1970s due to the emergence of multi-user systems. Even nowadays, DAC is the most widely used form of access control in multi-user systems such as Windows and UNIX. The emergence of multi-user systems created a huge need to limit access to objects owned by different users on the same file system when sharing information. Under UNIX systems, it is well known that three primitive permissions are attached to each object (e.g., files and folders): read, write, and execute, which determine which user can do what on an object. The scenario in DAC is that each object (e.g., a data object) has an owner who has primary control over the object. This means the owner is able to assign the access control policy on his/her own data to other users of the same system.

The concept behind DAC is very simple: each object (e.g., files or folders on a file system; data objects or attributes in databases) has its own associated mappings between the set of users requesting access to it and the set of actions that can be performed on the object. In fact, different DAC implementations use different data structures to handle these mappings. For example, the access control matrix denotes DAC's deployment of access control by constructing a subject set (users) S and an object set O, with explicit annotation of access rights (namely read, write, and execute) and administration rights (e.g., control). All these components are then arranged in a matrix with dimensions |S| × (|S| + |O|).

Table 2.1: Access Control Matrix Example

          Alice     Bob       Carl    File1               Folder1    Process1
Alice     control                     read, write, own
Bob                 control                               write
Carl                                                                 execute

From Table 2.1, we can see that for three users (Alice, Bob, and Carl) and three objects (File1, Folder1, and Process1), the access control matrix has 3-by-6 dimensions. As the numbers of subjects and objects increase, the access control matrix tends to become much larger and sparser. This makes storing access control policies in matrices inefficient, wasting storage space. Also, before any action can be performed by a subject on an object, the access permission must be checked first; searching a large matrix to verify every action this way is inefficient. Therefore, DAC uses two kinds of lists to replace the matrix data structure: access control lists (ACLs) and capability lists (CLs). The two lists differ in their standpoint. Access control lists are commonly observed in UNIX file systems as a property appended to a file, folder, or application: a string constructed from r for read, w for write, x for execute, and - for no access. Thus, the standpoint of an access control list is the object. Contrary to access control lists, capability lists show, for a certain subject, the overall access rights over a set of objects. Examples of ACLs and CLs are illustrated in Figures 2.2 and 2.3. In this sense, ACLs provide access review on a per-object basis, while CLs provide access review on a per-subject basis. Additionally, in the access assessment process, CLs require unforgeability and control of the propagation of capabilities, while ACLs just require authentication of subjects. This relative simplicity enhances the popularity of ACLs, which makes them commonly used in all modern operating systems and even in other contexts. For instance, they have been applied in network contexts where access to the target resource is requested by remote systems. Despite the wide usage of ACLs, they do have limitations. To assess permissions, ACLs need to authenticate subjects every time, which wastes time. In situations where an enterprise requires many people to have different levels of access rights to many different resources, updating the ACLs on each individual object can be really time-consuming and may also introduce errors that are hard to detect.
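As a minimal sketch of the two standpoints, the same toy policy from Table 2.1 can be stored per object (ACL) or per subject (CL); the dictionaries below are illustrative, not any particular system's API.

# The rights from Table 2.1, stored from the object's standpoint (ACL)
# and from the subject's standpoint (CL).
acl = {
    "File1":    {"Alice": {"read", "write", "own"}},
    "Folder1":  {"Bob": {"write"}},
    "Process1": {"Carl": {"execute"}},
}
cl = {
    "Alice": {"File1": {"read", "write", "own"}},
    "Bob":   {"Folder1": {"write"}},
    "Carl":  {"Process1": {"execute"}},
}

def acl_check(obj: str, user: str, action: str) -> bool:
    """Per-object review: walk the object's list, as ACLs answer a request."""
    return action in acl.get(obj, {}).get(user, set())

def cl_check(user: str, obj: str, action: str) -> bool:
    """Per-subject review: walk the subject's capabilities."""
    return action in cl.get(user, {}).get(obj, set())

print(acl_check("File1", "Alice", "write"))   # True
print(cl_check("Carl", "Process1", "read"))   # False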

2.1.2 Role-Based Access Control

A comprehensive model of role-based access control is shown in Figure 2.4, cited from Sandhu et al. (1996). The feature of RBAC models (Ferraiolo et al. (2001); Osborn et al. (2000); Sandhu et al. (1996)) is that the model first constructs a User Assignment to assign users to roles according to their work sessions, and then uses a Permission Assignment to further assign roles to access rights on different objects. This scenario is basic RBAC. However, in realistic enterprise situations, users have different job titles and tasks, which form a hierarchical structure. The hierarchical structure is based not on the users themselves but on the roles. Therefore, in Figure 2.4, a role hierarchy is also specified. The role hierarchy can be written with ≥, where x ≥ y means that x inherits the permissions of y. To construct an RBAC model, administrator roles need to manage the User Assignment and Permission Assignment; they form a special group of roles. In the comprehensive model, this is also called administrative role-based access control (ARBAC).

To concretely define an RBAC model, the following concepts need to be addressed:

U = a set of users
R = roles determined by job functions
P = permissions
SE = sessions, a mapping involving U, R, and/or P
UA = user assignment
PA = permission assignment
RH = role hierarchy

Figure 2.2: Access Control List

Figure 2.3: Capability List

RBAC emerged later than the DAC paradigm, in about the mid-1990s. Unlike DAC's rigid object-to-object basis of access control, RBAC considers more the relationship between subjects and their organizations, according to sessions or job tasks, from a deployment view. In other words, instead of the subject, it is the subject's function or role that determines whether his/her access rights will be granted or revoked. RBAC successfully addresses some of the limitations of DAC. DAC, based on rigid subject-to-object access right assessment, suffers from a cumbersome process when applied to large organizations with large amounts of resources. By grouping subjects into clusters of roles, RBAC improves the scalability of access control at the enterprise level. Since RBAC determines access rights based on roles, and several people may share the same role (e.g., the role of data scientist), RBAC allows a group of people who have the same work function to share one set of access control permissions on a set of resources (e.g., databases, source code). This scales better than DAC. Meanwhile, RBAC also allows users to be assigned to different roles at the same time. For example, a data scientist can also be assigned the role of software engineer to share access to source code during an ongoing project. Another explanation for the multiple roles of a user comes from the role hierarchy: a supervisor in a role hierarchy can access resources which can be accessed by employees at a lower level of the role hierarchy than him/her.

Figure 2.4: Role-Based Access Control Model (Sandhu et al. (1996))

The assumptions of RBAC:

- A subject can have multiple roles.
- A role can have multiple subjects.
- A role can have many permissions.
- A permission can be assigned to many roles.
- An operation can be assigned many permissions.
- A permission can be assigned to many operations.
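A minimal sketch of these concepts follows, with a user assignment (UA), a permission assignment (PA), and a role hierarchy (RH) in which senior roles inherit junior roles' permissions; the role and permission names are made up for illustration.

# UA: user assignment, PA: permission assignment, RH: role hierarchy (senior -> juniors).
UA = {"alice": {"data_scientist", "software_engineer"}, "bob": {"intern"}}
PA = {"data_scientist": {("read", "warehouse")},
      "software_engineer": {("write", "repo")},
      "intern": {("read", "wiki")},
      "supervisor": set()}
RH = {"supervisor": {"data_scientist", "intern"}}  # supervisor >= data_scientist, intern

def roles_with_inherited(role: str) -> set:
    """All roles whose permissions `role` inherits (x >= y), including itself."""
    closure, stack = {role}, [role]
    while stack:
        for junior in RH.get(stack.pop(), set()):
            if junior not in closure:
                closure.add(junior)
                stack.append(junior)
    return closure

def has_permission(user: str, perm: tuple) -> bool:
    return any(perm in PA.get(r, set())
               for assigned in UA.get(user, set())
               for r in roles_with_inherited(assigned))

print(has_permission("alice", ("write", "repo")))  # True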

RBAC is increasingly used at the system and application level in enterprises due to its scalability and versatility. Applications include Microsoft SQL Server, Oracle DBMS, FusionForge, Solaris, SELinux, and many other implementations of RBAC. Although RBAC has advantages over the DAC model, it does have its own limitations. First, compared to DAC, RBAC needs infrastructure work to deploy the entire model. Second, one of the biggest limitations of RBAC is that, by clustering individuals into groups, the model makes it difficult to define granular access controls for each one of them. A more specific role might be required for an individual who belongs to a certain group but does not need the full access rights given by the group's permission assignments. In this case, the ability to differentiate individual members of a group and selectively grant or revoke access is needed. Therefore, attribute-based access control (ABAC) was created to deal with this problem.

2.1.3 Attribute-Based Access Control

Attribute-based access control (ABAC) (Hu et al. (2015)) is a type of access control model where the access determination is made by evaluating features, or attributes, associated with the subjects, the environment, and the objects.

To concretely define an ABAC model, the following concepts need to be addressed:

S = a set of subjects or users
O = a set of objects or resources
E = environment
SA = {SA_k | 1 ≤ k ≤ K}
OA = {OA_m | 1 ≤ m ≤ M}
EA = {EA_n | 1 ≤ n ≤ N}
attr(s) ⊆ SA_1 × SA_2 × ... × SA_K
attr(o) ⊆ OA_1 × OA_2 × ... × OA_M
attr(e) ⊆ EA_1 × EA_2 × ... × EA_N

attr(s), attr(o), and attr(e) denote the attribute sets of subjects, objects, and environments. In general form, whether a subject s can access an object o under a particular environment e is determined by a Boolean function of the attributes of s, o, and e:

can_access(s, o, e) ← f(attr(s), attr(o), attr(e))    (2.1)

The attributes in ABAC are distinct fields. Subjects' attributes are the subjects' characteristics, which can be used as tags to distinguish one subject from another in the access control sense, such as identifier, name, job title, and role. Objects' attributes usually include descriptions of their functionality, content, and any other information about the value of the object's usage in access control, such as modification time, title, tags, and file format. Environment attributes record the status of the environment, like the current date and time, CPU temperature, and current threat level. The access rule, therefore, determines the access right of a subject s to an object o under the environment e.
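As a minimal sketch of Equation 2.1, the Boolean function f below combines one predicate per attribute set, reusing the "physician accesses patient records in his/her department" example; the specific attribute names are illustrative assumptions.

def can_access(attr_s: dict, attr_o: dict, attr_e: dict) -> bool:
    """Equation 2.1 as code: a Boolean function of subject, object, and
    environment attributes (the attribute names here are made up)."""
    return (attr_s.get("job_title") == "physician"
            and attr_o.get("tag") == "patient_record"
            and attr_s.get("department") == attr_o.get("department")
            and attr_e.get("threat_level", "normal") == "normal")

s = {"job_title": "physician", "department": "cardiology"}
o = {"tag": "patient_record", "department": "cardiology"}
e = {"threat_level": "normal"}
print(can_access(s, o, e))  # True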


A key advantage of ABAC is that subjects are well abstracted by their attributes, which means the system or administrator does not need to know the subjects in advance. Therefore, ABAC is extremely useful for systems that constantly have unanticipated users, and it makes arbitrary access to resources more efficient. Unlike RBAC and DAC, with their application-level popularity, ABAC is usually implemented at an intermediary level, mediating access between a user or an application and the resource to which access is requested, as in XACML.

The shortcoming of ABAC is how to generate discriminative attributes of subjects, objects, and environments, especially in large-scale systems. In a large system, using a complete list of attributes to determine whether a subject can access an object under a specific environment is hugely wasteful. However, in large systems it is often desirable to harmonize the entire system and enforce uniform access control policies. Policy-based access control (PBAC) emerged to solve this problem.

2.1.4 Policy-Based Access Control

PBAC is said to be an evolutionary version of ABAC. Similar to ABAC, PBAC generates access control policies using attributes of subjects, objects, and the environment. However, there is a demand at the enterprise level that access control policies be harmonized across different departments. ABAC is suitable for local access control, as it has defined attributes for making decisions on access permissions. In an enterprise, however, one department may require only a password and job title to determine whether an employee can access a working spreadsheet, while another department may require all of a user's credentials to determine whether he/she can access the spreadsheet. Moreover, enterprises usually have descriptive access control principles, which are hard to apply due to their abstraction. To enforce enterprise-level harmonization, PBAC is introduced to derive policies according to the enterprise's demands on abstracted access control principles, by listing them concretely as rules in the ABAC style.

PBAC, as an evolutionary version of ABAC, is much more complicated than ABAC. The first problem is maintaining the attribute set across the enterprise. An Authoritative Attribute Source (AAS) is a single source of attribute data that can be used to standardize attributes in ABAC and PBAC. The second question is how to transform the abstracted access control principles into rules that can be implemented at the application level. One of the most widely used tools for this transformation is the eXtensible Access Control Markup Language (XACML), based upon XML, which was developed to specify the policies in a machine-readable format.

2.1.5 Risk-Adaptable Access Control

Enterprises are constantly evolving with progressing economic aims, financial realities, and market challenges. The dynamic nature of enterprises and other organizations requires highly adaptive access control policies that can constantly assess the risk of information propagation at the enterprise level. Thus, the risk-adaptable access control (RAdAC) model emerged, since the previous access control models, including ABAC and PBAC, cannot adequately meet the need for dynamism in risk assessment. RAdAC was intended to be a real-time, adaptable, risk-assessing access control facility for enterprises.

Figure 2.5 (McGraw (2009)) shows the flow chart of the RAdAC notional process. Before access requests are served, RAdAC evaluates the security risks to decide whether the existing access control policies need to be overridden. RAdAC leads a profound shift from static access control models to dynamic ones. For example, if the computer system is in a normal state, users can access the resources on the computer via username and password; however, if a security level is breached, RAdAC will enforce a much stricter access control policy to protect the resources from disruption.

Despite the attractiveness of RAdAC's dynamic properties, its implementation is daunting. First, RAdAC faces difficulty in sharing the access control model across different organizations; this requires a standard way of interpreting risks among organizations. Second, in assessing risks, RAdAC needs to gather as much trustworthy information about subjects, objects, environments, and risk factors as possible; however, there is no standard way to extract these useful sets of information. Third, data exchange for recording the information needed to assess risks requires standard formats to ensure RAdAC's efficiency in adapting to changing environments. Last but not least, the heuristics of assessing risks rely on advanced techniques, including machine learning and genetic algorithms, which are still open research topics far from being solved and defined.

Figure 2.5: Risk-Adaptable Access Control Notional Process (McGraw (2009))

2.1.6 Access Control Based on Content

Bertino et al. pointed out that "mechanisms for enforcing access control policies based on data contents" are needed for comprehensive data protection (Bertino et al. (2011)). More relevant to the proposed research, the notion of content-based access control has been used in relational access control specification (Bertino et al. (1997); Giuri & Iglio (1997)), multimedia databases (Tzelepi et al. (2001); Tran & Dang (2007)), Web 2.0 (Hart et al. (2007); Monte (2010); Hart (2006)), and digital libraries (Adam et al. (2002)), etc. However, their definition is quite different from ours. In particular, in Bertino et al. (1997); Giuri & Iglio (1997); Adam et al. (2002); Amjad (2007), the notion of content refers to attribute values or definitive concepts extracted from digital library objects. Access privileges are statically specified based on relationships between user credentials and attributes/concepts. Similarly, policy-based access control models (Kagal et al. (2003); Bhatti et al. (2007); Reddivari et al. (2005)) bind access rights to user credentials; however, the decision is still based on definitive values of the attributes (e.g., users with title = "physician" could access patient records in his/her department). In Tzelepi et al. (2001), RBAC is extended to specify access control policies on image content (captured as attributes). Bertino et al. (2000) and Tran & Dang (2007) enforce access control on video databases based on text annotations of videos, while Bertino et al. (2003b) manages videos in clusters (based on visual content) and supports more flexible access control. In all cases, explicit and static rules are required: user credentials, video content, and access control policies are all explicitly defined a priori.

More recently, Hart et al. (2007); Monte (2010); Hart (2006) enforce access control in Web 2.0 based on message tags, where tags are learned from the message content. Access control is explicitly specified on tags; for instance, there are explicit rules such as "[family members] are allowed to access messages tagged with [home]". To handle the dynamics in modern enterprise applications, a few recent proposals attempt to infer access control provisioning from known decisions using supervised learning, when a decision cannot be directly made from available policies (Molloy et al. (2012); Ni et al. (2009)). This approach is effective when a good number of training samples (known access decisions) are available, and training and testing samples statistically follow the same distribution. On the other hand, concept-level access control has also been proposed for the semantic web (Qin & Atluri (2003)). Last, the terms context and semantic have been used in various access control approaches. Context mostly refers to the operational context of the user (Toninelli et al. (2006)), while semantic is often used to indicate the semantics of data schemas and access control policies, especially in data integration and federation applications (Pan et al. (2006); Fabian et al. (2012)).

Our notion of content-based access control is significantly different from existing approaches: we refer to the semantic content and semantic similarity of data in an RDBMS or XML DB, as well as the notions of approximation and implicit access control specification. In our approach, content refers to the meanings of data objects, which is significantly different from those concepts. Last, Oracle's CONTEXT index is essentially an inverted index for text retrieval, which is very different from the context used in the access control literature.

2.2 Oracle Virtual Private Database (VPD)

Oracle offers special support in database security for database administrators (DBAs), called Virtual Private Database (VPD). Oracle VPD enables developers to create database access security policies at the row and column level. It essentially adds a dynamic WHERE clause to a SQL query issued by a user against tables or views, to prevent the result set from revealing the entire content of those tables or views. In this sense, VPD enforces security at a finer level of granularity, directly on database tables or views, as it requires developers or DBAs to attach the security policy, together with its implementation, to the database objects via PL/SQL functions or packages. When users log in to the database and issue a query against the database objects, the security policy enforced by VPD is automatically applied. The policy is usually determined by the credentials of the user, such as identity, job title, and department. In CBAC, the policy is enforced according to the data objects the user initially owns or can access. With VPD, there is no way to bypass the enforced security policy.

VPD appends a dynamic WHERE clause, or modifies the WHERE condition dynamically, to realize the enforced security policy. The modification is returned by a function implementing the security policy. Users can apply VPD policies to SELECT, INSERT, UPDATE, INDEX, and DELETE statements together with the implemented function. For example, suppose a user issues the following query against the table OE.ORDERS, for which VPD enforces the policy that each user can only see his/her own data objects:

SELECT * FROM OE.ORDERS;

VPD then appends a WHERE clause, rewriting the query as:

SELECT * FROM OE.ORDERS
WHERE SALES_REP_ID = SYS_CONTEXT('USERENV', 'SESSION_USER');

There are several benefits of using VPD. First, VPD can enforce security policies at a fine-grained access control level; it also makes on-the-fly access control possible. Second, the function implementing a VPD-enforced security policy is added once, which makes access control enforcement much simpler. Third, via VPD, developers and DBAs have the flexibility to add several security policies to a table or a view, and the policies can be applied differently to SELECT, INSERT, UPDATE, and DELETE.

To generate a dynamic WHERE clause, a function should first be implemented to define the concrete steps of the policy to be enforced. There are certain restrictions on the function. First, it must take a schema name and an object (table or view) name as inputs. Second, the return value should be of VARCHAR2 type, holding the WHERE clause string to be appended. Table 2.2 shows an example of a function enforcing the security policy described above.

In the example, the input arguments take the schema and the object on which the security policy is applied. The declared return value con provides a string that will be appended as a WHERE clause condition to the SQL statement; con restricts the result set to the records belonging to the current session user. Further examples can be found in Chapter 5.

After creating the VPD function, a policy attaching the function to the objects should be specified; in the function, the schema and object are used as input arguments. To create the policy, the DBMS_RLS package in Oracle has to be utilized. Please note that the developer of the security policy should be granted EXECUTE privileges on this package. Table 2.3 shows a concrete example of the policy creation.

Table 2.2: VPD Function Example

CREATE OR REPLACE FUNCTION hide_other (
  v_schema  IN VARCHAR2,
  v_objname IN VARCHAR2)
RETURN VARCHAR2 AS
  con VARCHAR2 (200);
BEGIN
  con := 'SALES_REP_ID = SYS_CONTEXT(''USERENV'', ''SESSION_USER'')';
  RETURN (con);
END hide_other;
/

Further examples can be found in Oracle (2012).

In CBAC, we utilize Oracle VPD to generate on-the-fly and offline access control policies. In the policies, we specify that when a user issues a query against a content-centric database, he/she will get a result set that is similar to the data objects he/she owns or was initially granted access to.


Table 2.3: VPD Policy Example

BEGIN
  DBMS_RLS.ADD_POLICY(
    object_schema   => 'OE',
    object_name     => 'ORDERS',
    policy_name     => 'secure_update',
    policy_function => 'hide_other',
    policy_type     => DBMS_RLS.SHARED_CONTEXT_SENSITIVE);
END;
/


Chapter 3

Text Feature Extraction

In content-centric databases, the first question before measuring any similarity between free texts is how we treat the free text, or, concretely, how we convert free text into some type of computable data. That is the fundamental question underlying all problems in text mining, and it is usually called text feature extraction. Traditionally in text mining, when the text is represented as a sequence of characters, one usual next step is to convert it into a sequence of words; the other is to treat the text as sequences of characters. The most commonly used word-based feature is term frequency-inverse document frequency (TF-IDF), a term-distribution feature that relies entirely on word occurrences in documents and entirely neglects the order of words. Meanwhile, character-based features consider sequences by calculating statistics (frequencies) of sequences of n characters or n words. Aside from features based purely on term distributions, topic modeling tries to extract semantic features from term distributions and document statistics in an unsupervised way. In this chapter, we first introduce two kinds of term-distribution features of free text. After that, we introduce two commonly used topic modeling algorithms. Last, we come to an annotation tool that tags text with Wikipedia topics, which goes a step further by bringing in an external annotation database as topic seeds for document tagging.


3.1 TF-IDF

TF-IDF (Joachims (1996)) is the abbreviation for term frequency-inverse document frequency, a statistic that reflects the importance of a word to a document in a corpus. It is commonly used in text mining and information retrieval. Before TF-IDF became the most popular term-distribution feature in text mining, different term weighting schemes had been explored (Salton & Buckley (1988)). TF-IDF considers two factors to determine how well a word can differentiate a document from the others in the corpus, namely term frequency and document frequency. Term frequency (TF) generally calculates the ratio of the number of times a word t appears in a document d over the total number of words in the document. In practice, there are variants of term frequency. If we denote the raw term frequency as f(t,d), the boolean term frequency is tf(t,d) = 1 if f(t,d) > 0, and 0 otherwise. f(t,d) can also be used directly as the term frequency. In realistic applications, to prevent bias towards longer documents, the augmented term frequency shown in Equation 3.1 is most commonly used.

tf(t,d) = 0.5 + 0.5 × f(t,d) / max{ f(w,d) : w ∈ d }    (3.1)

Document frequency is the ratio of the number of documents in the corpus in which the word t appears over the total number of documents in the set. The inverse document frequency (IDF) measures how common the term is in the set of documents: the more common the term, the less useful it is for differentiating a document from others, as calculated in Equation 3.2.

idf(t,D) = log ( |D| / |{d ∈ D : t ∈ d}| )    (3.2)

The formula requires |{d ∈ D : t ∈ d}| to be greater than 0, which means every term in the calculation must occur at least once in the set of documents. Based on this principle, we first need to collect all the terms in all documents to form a dictionary for the TF-IDF calculation. With both factors ready, TF-IDF is calculated as the product of TF and IDF.


tfidf(t,d,D) = tf(t,d) × idf(t,D)    (3.3)

After extracting TF-IDF features for every term of every document, each document is treated as a single sample represented by a vector; documents are thus transformed into a mathematical vector space. In Equation 3.4, wj,i denotes the TF-IDF value of the i-th term in the j-th document, and there are n terms in the specified dictionary.

dj = (wj,1, wj,2, ..., wj,n)    (3.4)

The similarity of two documents is then the cosine similarity of the two document vectors, as represented in Equation 4.1.
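To make the computation concrete, the following sketch builds TF-IDF vectors for a toy corpus and computes pair-wise cosine similarities; it assumes scikit-learn, whereas the experiments in this dissertation rely on Oracle's CONTEXT index instead.

# A minimal sketch of TF-IDF extraction and cosine similarity (scikit-learn
# assumed; the enforcement in later chapters uses Oracle's CONTEXT index).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Markov chain Monte Carlo methods are an important algorithmic device.",
    "MCMC methods are widely used for Bayesian inference.",
    "This award supports field research on coastal erosion.",
]

# Build the dictionary and the document-by-term TF-IDF matrix.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)   # shape: (3 documents, n terms)

# Pair-wise cosine similarities between document vectors.
sim = cosine_similarity(tfidf)
print(sim[0, 1], sim[0, 2])  # doc 0 should be closer to doc 1 than to doc 2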

3.1.1 Stop Word

In text feature extraction, there are words that occur frequently in documents but do not contribute to the semantic content, such as "a", "an", "the", etc. These words, although having grammatical meaning, are considered meaningless in text. Another type of word has very concrete meaning but is too general to contribute to a specific topic, and is sometimes also considered "meaningless", such as "with", "somehow", etc. Together they form a group called "stop words". Different languages have different lists of stop words, and even within English there are several versions of stop word lists. Nowadays in text mining, people tend to use the comprehensive stop word lists adopted by search engines such as Google. In the experiments, we filter out the stop words ignored by most search engines when calculating TF-IDF.


3.1.2 Stemming

Another important preprocessing step before text feature extraction is stemming (Porter (2001)). Without stemming, the feature calculation defined above would ignore the shared meaning of "apple" and "apples" and treat them as two different terms. In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, the base or root form from which a written word is generally derived. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since 1968. Many search engines treat words with the same stem as synonyms, a kind of query broadening called conflation. Stemming programs are commonly referred to as stemming algorithms or stemmers. One of the most famous stemmers is Porter's stemmer. The Porter stemming algorithm (or Porter stemmer) removes the commoner morphological and inflexional endings from English words. Its main use is as part of a term normalization process that is usually performed when setting up information retrieval systems. This, however, is a problem, as not all parts of speech have such a well-formulated set of rules. Thus the Porter stemmer, although simple to use, sometimes goes too far in collapsing distinct terms. Also, there is no clear evidence that text feature extraction with stemming boosts the performance of text understanding. Therefore, in the experiments, we keep all variations of words without stemming when calculating TF-IDF.
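For illustration, a short sketch of what a stemmer does, using NLTK's PorterStemmer (an assumption of this example; as noted, our experiments skip this step):

# A sketch of Porter stemming with NLTK (assumed installed).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Related inflections map to the same stem, which need not be a valid word:
print(stemmer.stem("apples"), stemmer.stem("apple"))        # appl appl
# ...but the stemmer can be overly aggressive and conflate distinct words:
print(stemmer.stem("university"), stemmer.stem("universe")) # univers univers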

3.2 n-Gram

Clearly, TF-IDF totally ignores the order of words. However, in text mining, sequences of words, especially multi-word phrases, play important roles in understanding text content. One of the common ways to address this problem is to use n-gram features instead of plain TF-IDF. The difference is that an n-gram treats n adjacent words (n > 1) as one single "term", thereby introducing local word sequences. Thus, every n-gram feature is represented as {ti, ti+1, ..., ti+n−1}. The features are collected from every document in the document set and unified into a specific "dictionary", which contains all n-gram features that occur in the documents. After n-gram features are extracted, every feature can be treated as a new special "term"; therefore, term frequency, inverse document frequency, and the similarity measurements can all be applied directly to n-gram features, as sketched below.
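As a sketch (scikit-learn assumed), word n-grams can be extracted by the same vectorizer used for TF-IDF, simply by widening the n-gram range:

# A sketch of n-gram feature extraction: with ngram_range=(1, 2) each bigram,
# e.g. "access control", becomes one special "term" in the dictionary.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["content based access control",
        "access control for content centric data"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # unigrams and bigrams together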

In n-gram extraction, stop word removal is vital. Although n-grams incorporate local word sequences to recover the sequence information lost in TF-IDF, they usually suffer from maintaining many fake phrases and noisy "terms". Keeping stop words makes the problem even worse, and it makes the "dictionary" longer than it needs to be. A general problem of term-distribution text features is the sparseness of the vectors in the feature space: a longer "dictionary" increases the vector dimensionality, which makes the text representation even sparser.

Another problem of term-distribution features is that, although they capture the appearances of words, terms, or even phrases, using them does not necessarily make text understanding much easier. There remains a gap to the semantic meaning behind the words, due to the flexibility of expressing the same meaning with different words and different meanings with similar words. Topic modeling was proposed to tackle this problem: researchers try to extract "topics" based on statistical analysis of term distributions. In the next section, two commonly used topic modeling algorithms are introduced. Intuitively, however, a model can only be as good as its input data. These "topics" are generated from pure term distributions, which means they only reflect what the term distributions can tell us. Therefore, the "true" semantic representation of documents remains a difficult topic that interests many researchers in academia and industry. Nowadays, researchers have started to use annotation tagging schemes that extract not the literal words, terms, or phrases, but related topics. One such annotation tool is TAGME (Ferragina & Scaiella (2010)), which uses Wikipedia as a backing data source to annotate topics on text content. The motivation of TAGME is to support text understanding for short texts, where it is hard to find two documents in the document set that share the same word; that is exactly the situation where term-distribution features fail. We further our discussion of TAGME, an annotation tagging tool for text understanding, in the last part of this chapter.

3.3 Topic Modeling

Topic modeling is a statistical modeling technique in natural language processing for extracting "topics" from a collection of documents in the statistical sense. The algorithm behind a topic model either treats documents as mixtures of topic distributions, or treats documents as linear combinations of topics that form a basis of the documents' space. The "topics" in topic modeling are represented as collections of words/terms. Here we introduce two models, Latent Dirichlet Allocation (Blei et al. (2003)) and Non-negative Matrix Factorization (Lee & Seung (2001)), which represent these two different strategies of topic modeling.

3.3.1 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a generative model capable of discovering "topics" in a document repository in an unsupervised way. It treats documents as mixtures of "topics", with the mixing proportions drawn from Dirichlet distributions. Innately, LDA is a Bayesian inference model.

Figure 3.1 is the plate notation for LDA. The outer plate denotes the document repository; the inner plate denotes the topics and words within each document. M is the number of documents in the repository. Therefore, we have the following:

α is the parameter of the Dirichlet prior on the topic distributions across documents,
β is the parameter of the Dirichlet prior on the word distributions of topics,
θi is the topic distribution for the i-th document,
φk is the word distribution for the k-th topic,
zij is the topic of the j-th word in the i-th document, and
wij is the j-th word in the i-th document.

The problem with basic LDA is that a newly arriving document is very likely to contain words that did not exist in the training document repository. In basic LDA, when using maximum likelihood estimates of the multinomial parameters, zero probability is assigned to such words, and thus zero probability to new documents. Therefore, a smoothed model of LDA was proposed to extend the Bayesian inference model, treating β as random variables. The plate notation of smoothed LDA is shown in Figure 3.2, in which K denotes the number of topics, and φ is a K × V matrix whose rows are the word distributions of the topics (V is the size of the vocabulary).

The generative process of LDA treats documents as random mixtures over latent "topics", and "topics" are viewed as multinomial distributions over words. The assumptions behind LDA are as follows. Given a corpus D of M documents, each with length Ni:

Choose θi ~ Dir(α), where i ∈ {1,...,M} and Dir(α) is the Dirichlet distribution.
Choose φk ~ Dir(β), where k ∈ {1,...,K}.
For each word wi,j, where i ∈ {1,...,M} and j ∈ {1,...,Ni}:
– Choose a topic zi,j ~ Multinomial(θi).
– Choose a word wi,j ~ Multinomial(φzi,j).
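A minimal sketch of fitting LDA over term counts, assuming scikit-learn's LatentDirichletAllocation (its doc_topic_prior and topic_word_prior correspond to α and β above):

# A sketch of smoothed LDA on a toy corpus (scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["markov chain monte carlo sampling",
        "bayesian inference with mcmc sampling",
        "coastal erosion field study",
        "erosion of coastal dunes"]

counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2,        # K topics
                                doc_topic_prior=0.1,   # alpha
                                topic_word_prior=0.01, # beta
                                random_state=0)
theta = lda.fit_transform(counts)  # per-document topic mixtures (theta_i)
phi = lda.components_              # per-topic word weights (phi_k, unnormalized)
print(theta.round(2))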

3.3.2 Non-negative Matrix Factorization

Non-negative matrix factorization (NMF) factorizes a matrix V into two matrices W and H, where the elements of all three matrices are greater than or equal to zero. Formally, given a non-negative matrix V, we seek two non-negative matrices W and H such that:


Figure 3.1: Plate Notation of Latent Dirichlet Allocation

Figure 3.2: Plate Notation of Smoothed Latent Dirichlet Allocation

V ≈ W × H    (3.6)

Consider the case where V is an m × n non-negative matrix containing the TF-IDF value of each word in the vocabulary for each document, so that every row represents the feature vector of a document. Then W (an m × k non-negative matrix) holds the topic weights of each document, while H (a k × n non-negative matrix) holds the topic distributions across the vocabulary. That is, NMF innately tries to reconstruct V by linear combinations (W) of "topics" (H).
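A corresponding sketch for NMF (scikit-learn assumed), where V is the TF-IDF matrix, W the per-document topic weights, and H the per-topic word distributions:

# A sketch of NMF-based topic extraction (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = ["markov chain monte carlo sampling",
        "bayesian inference with mcmc sampling",
        "coastal erosion field study"]

V = TfidfVectorizer().fit_transform(docs)  # m x n non-negative matrix

nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(V)  # m x k: topic weights of each document
H = nmf.components_       # k x n: topic distributions over the vocabulary
print(W.shape, H.shape)   # V is approximately W x H (Equation 3.6)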


3.4 TAGME

Although topic modeling attempts to extract semantic representations of documents based on statistical assumptions and analysis, it still inherits the disadvantages of TF-IDF: the "topics" are weighted sums or mixtures of terms in the existing term distributions, even in the smoothed version of LDA. Therefore, researchers have tried to bring in external annotation databases. TAGME does so with topic annotations from Wikipedia. Before TAGME (Ferragina & Scaiella (2010)), text annotation had been studied in Carpineto et al. (2009); Kulkarni et al. (2009); Mihalcea & Csomai (2007); Milne & Witten (2008). TAGME was the first web-based API tool designed to annotate short texts. It uses Wikipedia anchor texts as spots, and the pages linked to them in Wikipedia as their possible senses. By finding collective agreement among these pages, TAGME calculates a score that verifies the importance of a topic (a Wikipedia page title) in the input text. Considering the sparseness of anchors in short texts, TAGME combines a relatedness function among concepts (Witten & Milne (2008)) with probabilistic statistics drawn from Wikipedia. TAGME is available at http://tagme.di.unipi.it. Figure 3.3 shows an example posted on the web site. The text box at the top receives the user's input text. The bar on the right tunes the annotation threshold: the higher the bar is set, the more related topics are annotated to the input text. The box below shows the annotated results, where the topics are linked to Wikipedia web pages via the blue phrases. Internally, TAGME uses an XML file to represent the annotated topics with the titles of the linked Wikipedia pages; for each topic, a parameter called ρ, in the range of 0 to 1, denotes the importance or goodness of the topic annotation.
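As a sketch of how such annotations can be consumed downstream, assume the annotations have already been parsed from TAGME's XML response into (topic, ρ) pairs; the function and sample values below are illustrative assumptions, not TAGME's exact API.

# A minimal sketch: keep only TAGME-style topics whose rho score passes a
# threshold. The input list is a hypothetical, pre-parsed service response;
# see http://tagme.di.unipi.it for the actual XML format.
def filter_topics(annotations, rho_threshold=0.3):
    return [(topic, rho) for topic, rho in annotations if rho >= rho_threshold]

annotations = [("Markov chain Monte Carlo", 0.62),
               ("Algorithm", 0.21),
               ("Monte Carlo method", 0.35)]
print(filter_topics(annotations))  # drops the low-confidence "Algorithm" tag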


Figure 3.3: TAGME Annotation Example


Chapter 4

Content-Based Access Control Model

4.1 Background and Assumptions

In this chapter, let’s first introduce the basic assumptions of problems specified in the dissertation.

To satisfy the new needs of enforcing access control without explicitly identifying every subject

and object in the access control policies, we propose content-based access control (CBAC). CBAC,

as an addition to the existing database access control models, works for content-centric data sharing

scenarios that satisfy the following assumptions:

1. Subjects and objects: We assume there are large numbers of users, who are authenticated with some level of trust. Each user is expected to get access to a subset of the data objects in the database. In practice, we can treat this set of users as a special role in the database. There are also large numbers of data objects (e.g. records), and the data is content-rich in nature. Each data object features a block of unstructured textual content (e.g. a CLOB type attribute), which is a key part of every data object.

2. Data-driven access control decisions: In CBAC, it is assumed that the access control decision for each user against each data object is data (or content)-driven. In particular, the decision is determined by the content similarity between the textual data and the data objects held by the user, as we have illustrated in previous examples.


3. Lack of explicit authorization: Another important assumption is that there might be situations in which explicit authorizations (for each user against each data object) are not available, since it would require excessive labor to examine the textual content of each record and make a verdict for each user. Moreover, the data set is dynamic in that new records are added constantly; hence, access control verdicts need to be made on-the-fly.

4. Approximation: In contrast to the conventional data confidentiality notions, approximation is allowed in this scenario. That is, it is acceptable if a user (of the special role) accesses a few more (or a few fewer) records than would have been assigned by an administrator. This assumption states that there should be approximation tolerance in the access control policies.

Example 5: Let us revisit Example 1, and assume that only 15 cases in the database are relevant to Alice's case; that is, a careful and accurate director would only allow Alice to access those 15 records. However, if an automated mechanism blocks a small portion of these 15 records, or allows Alice to access a few other records, this is considered acceptable, especially compared with the current practice, which gives Alice access to all the records.

In a nutshell, in CBAC, each user is allowed by a meta-rule to access a subset of the designated data objects, but the boundary of the subset is dynamically determined during query processing, and the decision is data-driven. Our goal is to design an access control model that considers the textual content of data objects in making access control decisions, and to develop a mechanism that enforces this model efficiently.

Meanwhile, the access control enforcement mechanism is expected to have the following features:

Autonomous: The CBAC enforcement mechanism is expected to require minimal intervention from the system administrators and data owners.

Transparent: Users of the CBAC role are expected to issue queries as usual, without being affected by the existence of the access control mechanism.

Efficient: Although content similarity assessment could be computationally expensive, we still expect the CBAC enforcement mechanism to return answers promptly.

Off-the-shelf: The CBAC enforcement mechanism is expected to employ native access control capabilities of off-the-shelf database management systems, so that the proposed model and mechanisms can be easily adopted.

4.2 Contribution

Let’s revisit Example 1 and 4. Both examples demonstrate applications where explicit access

control specifications at record level are too labor-intensive; therefore, users become significantly

over-privileged due to the nonexistence of record-level content-based access control. The excessive

privilege is somehow mitigated with two controls: (1) RBAC or MLS is enforced so that users have

basic clearance to access the database; and (2) ex post facto auditing is enforced to punish misuse

of the privileges. However, with the size of data, the basic clearance still allows a user to access an

unacceptable amount of records. Meanwhile, ex post facto auditing does not reverse the damage,

since the suspicious user has already committed the misfeasance, and it is impractical to revoke

disclosed data. Ideally, we expect a more restrictive and automated access control model, instead of

allowing users to be significantly over-privileged or requiring excessive human intervention. That

is, the new model is expected to intelligently identify a smaller subset of records that are relevant

to the user’s task, and only grant access to this subset.

Attribute-based access control (ABAC) could be employed to partially mitigate the problem. For instance, in Example 2, we can specify access control based on a combination of doctors' and patients' attributes: a doctor may access records of patients who have ever been treated in his/her department. However, attribute-based access control may not work with unstructured text (free text) content. Moreover, when the database structure and the attributes are very complicated, it may be difficult to obtain closed-form expressions for ABAC policies. That is, in managing content-rich data, it may be difficult to describe administrators' access control intentions with a small set of explicit, closed-form policies. In such applications, "hard security" comes at the high price of excessive human labor and degraded usability (e.g. waiting for manual authorization in BTG). From the technology perspective, there does not exist a computational model that precisely describes semantic content, or models this human cognitive process: the rationale behind the decision is too vague and complicated.

In such use cases, it is desirable to have an access control model that extends ABAC to make access decisions based on the semantic content of the data. It is also desired that such content-based access control capability be provided by the RDBMS as a native function, requiring only minimal intervention from administrators. In this paper, we present our first attempt towards this endeavor: we introduce the content-based access control model and its enforcement mechanisms. In particular, we propose a two-phase hybrid solution: (1) the data owner or administrator manually identifies a small base set of records, the core of the set of records that are accessible to the user; and (2) at runtime, CBAC extends the base set and makes access verdicts according to the specified CBAC rules, which are based on the lexical similarity between the base set and the requested records. The new model, as an extension to ABAC and a complement to legacy access control approaches, provides an effective and efficient means of access control that exploits content features in content-rich data sharing.

We would like to emphasize that content-based access control does not imply weakened or relaxed security. Rather, it enforces an additional layer of access control on top of existing "precise" access control methods. CBAC allows approximation: it does not provide a static boundary for the accessible set of records. However, allowing the user to access a small set (of size n) of roughly (and automatically) selected records is more secure than allowing the user to access all the records in the pool (of size N), especially considering that n is usually orders of magnitude smaller than N.

Our contributions are three-fold. First, we formally propose a data-driven access control model that exploits the data content to achieve flexible and powerful access control semantics. Second, we develop an effective enforcement mechanism for CBAC utilizing native functions of off-the-shelf database systems. Last but not least, we further develop a blocking mechanism and labeling mechanisms to improve the efficiency of CBAC enforcement and the accuracy of textual content matching.

4.3 Model Definition

A simple access control policy could be specified as:

ACR = {subject, object, action, sign}

where the subject denotes a user, and the data object could be a table, an attribute, or a tuple in the relational data model, or an XML node in the XML model. The action identifies an operation on the data objects, such as read, delete, or update, and the sign denotes whether the operation is allowed or denied.

In conventional database access control models, the subject could be identified by a user ID, a role (in RBAC (Zhang et al. (2002); Sandhu et al. (1996); Yang et al. (2004))), or an attribute (in attribute/credential-based access control). In content-based access control, we assume that all CBAC users belong to a special role that is allowed to access "some data", as we have introduced. A CBAC user is further represented by the set of records owned by the subject; in particular, a CBAC user is able to access his/her own data initially. Alternatively, the subject could also be a short description, which is modeled as unstructured textual content.

The data object could be a table, an attribute, or a tuple in the relational data model, or an XML node in the XML model. We assume fine-grained access control in CBAC: access control is enforced at the record or node level. Hence, we consider each tuple in the relational model as a data object. Hereafter, we assume relational data, and we use the terms "record" and "data object" interchangeably. We will discuss the handling of XML data in Chapter 8.

The similarity between two data objects is defined as the weighted sum of the similarities across all N attributes of interest:

Simd(di, dj) = ∑x=1..N wax × simax(di,x, dj,x)

where simax is the normalized similarity function defined on the domain of attribute ax, and wax is the weight on the attribute. Similarity functions on simple types (e.g. integer) could be relatively trivial, and the corresponding access control functions could be implemented using views. For instance, it is straightforward to enforce "user Alice could access all the cases in 'San Francisco' that took place after 2010" using a view. On the other hand, we are more interested in content-rich unstructured text types, such as VarChar, CLOB or TEXT, as we have introduced. In the big data era, unstructured text attributes are becoming more and more popular, and they carry much more information. For example, news articles, blogs, tweets, and Facebook comments explode in volume every day, carrying rich information not only about what happens in the real world, but also about technology trends, citizens' opinions on a piece of news, possible investigation interests of large agencies, and so on. However, despite the advantages of and research interest in unstructured text (free text) data, it is also known as a difficult data source from which to extract meaningful and useful information, or to understand and assess, due to its unformatted nature and its exponential growth in quantity.

In content-based access control, the static, binary notion of sign ∈ {true, false} is extended to a content-based access control function, which is evaluated on-the-fly during access control enforcement. The CBAC policy could be represented as:

ACR = {subject, object, action, f(u, di)}    (4.2)

where u represents the user, and di represents a data object. For each query, the function evaluates to a value f(u, di) ∈ {true, false} for each di in the candidate result set. Access is granted when f(u, di) == true, and hence di is included in the result of the query. In most cases, the decision function f() consists of a similarity function and a threshold to compare with:

f(u, di) = (Simd(u, di) ≥ T)    (4.3)

Moreover, if the user is represented by a set of data objects (records) owned by him/her (as in Examples 1 and 3), the decision function takes the maximum similarity between the data object and the owner's records, and compares it with a preset threshold:

f(u, di) = (maxj Simd(du,j, di) ≥ T)    (4.4)

where du,j denotes the records owned by the user u, the similarity function Simd(·,·) is the content-based record similarity function described above, and maxj Simd(·,·) returns the largest similarity over all j.
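The decision functions can be read as the following sketch in plain Python, with a hypothetical sim() standing in for the similarity Simd:

# A sketch of the CBAC decision functions; sim(a, b) is a hypothetical
# stand-in for the content similarity Sim_d defined in Section 4.4.

def allow_single(sim, u_profile, d_i, T):
    # Equation 4.3: the user is represented by a single credential/description.
    return sim(u_profile, d_i) >= T

def allow_owned_set(sim, owned_records, d_i, T):
    # Equation 4.4: the user is represented by the records he/she owns;
    # access is granted if ANY owned record is similar enough to d_i.
    return max(sim(d_u, d_i) for d_u in owned_records) >= T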

Example 6: Let us revisit Example 1. The content-based access control rule for Alice and the other agents is defined as:

ACR = {u, di, read, f(Du, di)}
Du = {dj | Access(u, dj) = 1}
f(Du, di) = (Simd(dj, di) > T for some dj ∈ Du)

That is, agent u is granted "read" access to record di when the content similarity between di and any one of u's records is greater than a preset threshold T. In particular, the director (administrator) first manually grants agent u access to a set of records Du (mimicking the scenario in which the director assigns cases to Alice), and agent u is thus represented by Du in the CBAC policies. The CBAC policy further allows agent u to read records that are similar to the ones in Du. This policy needs to be enforced by the DBMS without requiring further attention from the administrators.


4.4 Content Similarity

In the CBAC model, the similarity function Simax(di,x, dj,x) for attribute ax is defined separately for each data type. As we have shown in the examples in Chapter 1, the CBAC model is mostly designed for content-rich unstructured text types, such as VarChar, CLOB or TEXT. Ideally, we would model such types by their linguistic semantics, and measure record-wise similarity based on the semantic content. However, natural language understanding remains an open problem, and it is very difficult to provide a reliable similarity assessment purely based on the linguistic content (Kao & Poteet (2007); Sharma (2010); Wu & Chen (2011)). For domain-specific data, customized semantic similarity measurements could be adopted, e.g. Pedersen et al. (2012) for medical data. While the specific attribute similarity measurement is not the focus of CBAC, in this paper, we present a generic measure that works for unstructured text attributes.

We model unstructured text by the statistical distribution of terms. The terms (words) from the selected textual attribute of all tuples are collected to construct a feature space (term space). Each cell is represented as a vector di in the term space:

di = (w1,i, w2,i, ..., wm,i)

where wt,i is the TF-IDF weight of document i on term t. The original TF-IDF weight is defined as:

wt,i = tft,i × idft = tft,i × log(N / dft)

where tft,i is the frequency of term t in document i, and dft is the number of documents that contain term t. Many variations of the TF-IDF weight have been used in the research community (Manning et al. (2008)). Furthermore, the similarity between two documents could be calculated as the cosine similarity of the two document vectors:

Simd(di, dj) = (di · dj) / (||di|| × ||dj||)    (4.1)

Please note that the choice of content modeling and similarity measurement is not determined by the CBAC model; rather, it is the choice of the system administrators or data owners, who specify the policies. In Chapter 5, we present a CBAC enforcement mechanism that exploits the content model and text similarity measurement embedded in the Oracle DBMS.

4.5 Top-K Similarity

In the basic CBAC model, content similarity is compared with a preset threshold, and the user is granted access to all of the "similar records". A potential problem is that the number of accessible records heavily depends on the preset threshold. In particular, it is difficult to estimate the range of pair-wise similarities (e.g. TF-IDF) among the records, especially for different seed records. Therefore, even for experts who are familiar with the similarity measurement, it could be difficult to give a reasonable threshold for an arbitrary user. With a bad threshold, the subject may be granted access to too many or too few records. Meanwhile, we also observed that using the same threshold for different seed records results in very different numbers of records in the positive set.

To tackle this problem, top-K similarity could be used. Instead of setting a threshold on record similarity values, the administrator presets the number of data objects to grant access to. For instance, in Example 1, we may define that Agent Alice is allowed to access the 30 cases that are most similar to each of her own cases. The top-K similarity measurement provides flexible and intuitive control to database administrators and data owners, especially to users who are not familiar with the content model or the similarity measurement.

Enforcing top-K similarity is very similar to enforcing basic CBAC. An additional step is required to sort the similarity values of all records and select the K largest. This step could be optimized by sorting only the top K elements, instead of sorting all records. Selecting the top K elements can be done in linear time (e.g. with QuickSelect). When sorting of the K records is not needed, the overall complexity is O(N), where N denotes the total number of records in the database. On the other hand, when the top K records are to be sorted, the overall computational complexity of the overhead is O(N + K log K) (Hoare (1961)).
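The selection step can be sketched as follows (NumPy assumed): numpy.argpartition performs the linear-time selection, and only the K survivors are sorted.

import numpy as np

def top_k_ids(similarities, record_ids, k):
    """Select the K most similar records in O(N + K log K)."""
    sims = np.asarray(similarities)
    idx = np.argpartition(sims, -k)[-k:]    # linear-time selection of K largest
    idx = idx[np.argsort(sims[idx])[::-1]]  # sort only the K survivors
    return [record_ids[i] for i in idx]

sims = [0.12, 0.83, 0.40, 0.77, 0.05]
print(top_k_ids(sims, ["d1", "d2", "d3", "d4", "d5"], k=2))  # ['d2', 'd4']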


Chapter 5

CBAC Enforcement

In this chapter, we set up the basic content-based access control model with two different strategies: the top-K similarity and threshold methods. First, we establish the framework of on-the-fly similarity assessment with Oracle's VPD module. Second, we discuss offline content-based access control training, especially for databases that are not updated frequently.

5.1 CBAC On-the-Fly Enforcement

5.1.1 The Basic CBAC Model

In content-based access control models, we exploit Oracle's VPD module on content-centric databases to perform row-wise access control. In this scenario, there are two basic assumptions. The first is that the content of the database is contributed by multiple users (e.g. user1, user2, etc.), where each user owns a certain number of data objects, and the database is maintained by a special user (e.g. John Doe) or a database administrator. Second, users are initially granted access to their own data, and based on the data they own, they are able to explore similar data objects.

In the content-based access control model, the overall aim is to rewrite the user's query by appending a dynamic WHERE clause that restrains the query to an accessible range rather than the entire database. The accessible range is determined by the data objects the user can initially access.


Algorithm 1 CBAC threshold strategy
Require: A threshold score T > 0.
Require: A user u in a database D.
Ensure: An array Cu which contains the d's in D that u can access.
1: u logs in to D with his/her password and issues a query q against D.
2: for all di in D which can be accessed by u do
3:   Access(di, u) = 1
4: end for
5: Let Du = {di | Access(di, u) = 1}.
6: Cu ← Du
7: for all di in Du do
8:   Calculate sim(dj, di), dj ∈ {dj | Access(dj, u) = 0}
9:   Select the ID's of the dj whose sim(dj, di) ≥ T, and add them into Cu.
10: end for
11: Append a dynamic WHERE clause to q, such that the range of q is restrained within the ID's in Cu.


Algorithm 2 CBAC top-K strategy
Require: An integer N > 0.
Require: A user u in a database D.
Ensure: An array Cu which contains the d's in D that u can access.
1: u logs in to D with his/her password and issues a query q against D.
2: for all di in D which can be accessed by u do
3:   Access(di, u) = 1
4: end for
5: Let Du = {di | Access(di, u) = 1}.
6: Cu ← Du
7: for all di in Du do
8:   Calculate sim(dj, di), dj ∈ {dj | Access(dj, u) = 0}
9:   Sort sim(·, di) in descending order, select the ID's of the top N, and add them into Cu.
10: end for
11: Append a dynamic WHERE clause to q, such that the range of q is restrained within the ID's in Cu.


In the content-based access control scenario, we consider the situation where a user logs into the database and issues a query against the entire database: the CBAC model first checks the user's accessible data objects according to the data objects he/she owns, and returns the relevant instances according to the content similarity between the user's data objects and the rest of the database. In effect, before handling any query from the user, the system first submits "pre-queries" to collect the similar data objects that the user is currently able to access. The two strategies of the content-based access control model are concretely described in Algorithm 1 and Algorithm 2, and a sketch of their shared skeleton follows below. The basic model aims to limit every user's accessible range and prevent the entire database from being revealed to unauthorized users.
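The core of both algorithms, computing the accessible ID set Cu and rendering the dynamic predicate, can be sketched as follows; in the actual system this logic lives in a PL/SQL VPD policy function (as in Table 2.2), and sim() and the ID sets here are hypothetical stand-ins.

# A sketch of the shared skeleton of Algorithms 1 and 2 (threshold variant).
# sim(a, b) is a hypothetical stand-in for the content similarity function.
def accessible_ids(owned_ids, other_ids, sim, T):
    # Cu starts with the records the user owns (Du), then grows with every
    # record whose similarity to some owned record reaches the threshold T.
    cu = set(owned_ids)
    for d_own in owned_ids:
        cu.update(d for d in other_ids if sim(d_own, d) >= T)
    return cu

def dynamic_predicate(cu):
    # Rendered as the WHERE condition that VPD appends to the user's query.
    return "A_ID IN (" + ", ".join(str(i) for i in sorted(cu)) + ")"

# The rewritten query then has the form:
#   SELECT ... FROM AWARD_BASIC WHERE <original predicate> AND A_ID IN (...)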

5.1.2 Experiments

In CBAC enforcement, we exploit Oracle's VPD module to implement record-level access control on content-centric databases. To enforce the CBAC model in VPD, we rewrite the user's query by appending a dynamic predicate, which represents the content-based access control semantics. The range of accessible records is determined by the user's base set, as well as the similarity-based access control function. As we have introduced, there are two types of CBAC policies: (1) threshold-based CBAC, and (2) top-K CBAC. In this section, we assume that similarity assessments are performed on-the-fly.

Settings. In the experiments, we utilize the NSF research awards data set from the UCI KDD repository (Bache & Lichman (2013)). The original data set contains 129,000 instances. Records containing empty abstracts are removed, which leaves 102,508 awards. Awards are extracted, parsed

and loaded into three tables: Award_Basic(A_ID, Title, A_Instr, Div, abs, S_date, E_date,

Ex_tol_amt); Aw_Intr(A_ID, I_ID); Investigator(I_ID, I_Name, I_Email). In particular,

attribute AWARD_BASIC.abs contains full-text abstracts of NSF awards, representing the

content-rich information. To demonstrate the scalability of CBAC enforcement, we increase the

number of records in the database by adding synthetic dummy records.

Table 5.1: Schemas

TABLE COLUMNS

AW_INTR (A_ID, I_ID)

INVESTIGATOR (I_ID, I_NAME, I_EMAIL)

AWARD_BASIC (A_ID, TITLE, A_INSTR, ABS,

S_DATE, E_DATE, EX_TOL_AMT)

Table 5.2: Column Description

COLUMN DATA TYPE DESCRIPTION

A_ID NUMBER Award ID

I_ID VARCHAR2(20) Investigator’s ID

I_NAME VARCHAR2(100) Investigator’s Name

I_EMAIL VARCHAR2(100) Investigator’s Email

TITLE VARCHAR2(100) Award Title

A_INSTR VARCHAR2(100) Award Institution

ABS CLOB Award Abstract

S_DATE DATE Start Date

E_DATE DATE Expiration Date

EX_TOL_AMT NUMBER Expected Funding

We employ the automatic CS paper generator SCIgen1 to generate a very large number of content-rich but meaningless records. Eventually, we have constructed a database with 2,714,025 records for the experiments. Note that the content in this database is not sensitive and thus does not require access control; however, it mimics the content-centric databases for which CBAC is designed. Table 5.1 describes the columns of each relevant table, and Table 5.2 gives a description of the columns.

We use Oracle 11g for the experiments, and apply CONTEXT indexing on the AWARD_BASIC.ABS attribute. In order to optimize the similarity calculation, we remove the common English stop words that are ignored by most search engines.

1Available at: http://pdos.csail.mit.edu/scigen/


Table 5.3: CBAC Top-10 Example

SELECT A_ID FROM
  (SELECT ROWNUM m, score(1), A_ID
   FROM JOHNDOE.AWARD_BASIC
   WHERE A_ID NOT IN
     (SELECT A_ID FROM JOHNDOE.AW_INTR
      WHERE I_ID=SYS_CONTEXT('USERENV','SESSION_USER'))
   AND CONTAINS (ABS,
     '<query><textquery grammar="CONTEXT" lang="english">
        Markov chain Monte Carlo (MCMC)
        methods are an important algorithmic
        device in a variety of fields.
      </textquery>
      <score datatype="float" algorithm="DEFAULT"/>
      </query>', 1) >= 0
   ORDER BY score(1) DESC)
WHERE m BETWEEN 1 AND 10;

We follow the theory of TF-IDF, which considers the term frequency against the document frequency of words, and for simplicity ignores the order in which words appear in the content. Accordingly, we apply the ACCUMulate operator (,) to join the words of unstructured text content into a query against Oracle's CONTEXT full-text search. The following shows an example of how the pre-query collects data objects similar to one of the user's data objects. Suppose the user has access authority over the following document.

Markov chain Monte Carlo (MCMC) methods are an important

algorithmic device in a variety of fields.

Then an example of the Oracle SQL statement that uses the pre-query to collect the top-10 most similar documents in the database is shown in Table 5.3.

Also, an example of the threshold statement, which returns the documents in the database with a similarity score above 20, is shown in Table 5.4.

Table 5.4: CBAC Threshold Example

SELECT A_ID
FROM JOHNDOE.AWARD_BASIC
WHERE A_ID NOT IN
  (SELECT A_ID FROM JOHNDOE.AW_INTR
   WHERE I_ID=SYS_CONTEXT('USERENV','SESSION_USER'))
AND CONTAINS (ABS,
  '<query><textquery grammar="CONTEXT" lang="english">
     Markov chain Monte Carlo (MCMC)
     methods are an important algorithmic
     device in a variety of fields.
   </textquery>
   <score datatype="float" algorithm="DEFAULT"/>
   </query>', 1) >= 20;


The content of the "pre-queries" used to search against the database affects the parsing step directly, and it also restrains the number of unique terms in the TF-IDF calculation. Knowing this, two optimizing strategies are developed in Chapters 6 and 7. Before extending to Chapters 6 and 7, a naive solution is proposed to boost efficiency by skipping the sort step with the threshold method. The constraint of returning the same number of data objects is used to remove the bias of I/O time costs. In the experiments, thresholds of 10, 20 and 30 are tried separately, where the score is in the range from 0 to 100. The results reveal another factor that affects efficiency: the threshold. The threshold controls the overall number of returned similar instances; the lower the threshold, the more similar instances are returned. The whole process of searching relevant instances in the threshold method is implemented with a cursor in PL/SQL, and it therefore costs more to open a "large" cursor.

The database runs on a 64-bit Windows 7 system, with an Intel Core 2 Duo E8500 CPU @ 3.16GHz and 4.0GB RAM. To mimic the real-world use case, queries are issued from SQL*Plus, and the query evaluation time includes all I/O (e.g. network I/O).

Execution. We mimic the scenario in Example 1. In the initial authorization, the base set of each user is defined as the award records PI-ed by the user. Each user is explicitly granted access to such records. Next, we simulate the following access control scenarios: (R1) an attribute-based access control (ABAC) rule: the user is allowed to access records in a division where he/she has PI-ed an award; (R2) a content-based access control (CBAC) rule: the user is only allowed to access awards whose abstracts are similar to the awards in his/her base set; and (R3) a combined (ABAC+CBAC) rule: R1 AND R2. All three scenarios are implemented with Oracle VPD, to show that CBAC is capable of working either with an existing access control model or independently. In the experiments, we log in as 60 randomly selected users and issue the following queries.

QUERY1: SELECT TITLE, ABS FROM johndoe.AWARD_BASIC

WHERE S_DATE >= TO_DATE(’1996/01/01’, ’yyyy/mm/dd’)

AND ROWNUM<=10;

QUERY2: SELECT COUNT(*) FROM johndoe.AWARD_BASIC

WHERE S_DATE >= TO_DATE(’1996/01/01’, ’yyyy/mm/dd’);

The end-to-end query evaluation time for ABAC (R1) is shown in Figures 5.1 and 5.2. For threshold-based CBAC, the end-to-end query evaluation time is shown in Figures 5.3 and 5.4. The end-to-end query evaluation of threshold-based CBAC + ABAC is presented in Figures 5.5 and 5.6. Note that the rightmost bar in each group indicates the "w/o-CBAC" case, where no access control is enforced. We have performed the experiment with different thresholds; a larger threshold means a stricter constraint, which requires higher similarity between the queried records and the seeds. As shown, query processing for Query1 with CBAC is very efficient. A larger threshold leads to slower query processing, since Oracle needs to scan through more records to identify the first 10 records that satisfy the stricter CBAC condition. Query evaluation slows down with R3, which carries the overhead required by both ABAC and CBAC semantics. On the other hand, Query2 forces Oracle to go through all records. As shown in Figures 5.5 and 5.6, the overhead is acceptable, especially considering that CBAC models data content in a high-dimensional vector space, which requires extensive computation.

Top-K CBAC. We have developed two implementations of top-K CBAC.

Naive implementation. In the naive implementation, we simply included the top-K semantics in the dynamic predicate. Unfortunately, query performance was very slow, since the top-K ranking in the VPD predicate was repeatedly evaluated.

Optimized implementation. To improve query performance, we split the top-K semantics into two steps: (1) in PL/SQL, we select the top K records that are most similar to the base set; and (2) we identify the similarity score (Ts) of the K-th record, and generate a threshold-based predicate with threshold Ts. The average end-to-end query processing time is significantly reduced compared with the naive implementation. Note that query evaluation is still relatively slow compared with threshold-based CBAC, mainly due to the size of the database (see Figure 5.7). In particular, Oracle does not provide native support for selecting the first K records: it sorts the entire table to return the top K records (complexity: O(N log N)). However, as shown in Chapter 4, the computation of selecting and ranking the top K records could be as low as O(N + K log K). In Chapters 6 and 7, we will optimize top-K CBAC performance using blocking and tagging.


Figure 5.1: ABAC Efficiency with QUERY1

5.2 Offline CBAC

For databases that are not updated frequently, offline training is a much more efficient way to enforce the content-based access control model. In naive offline training, the accessible range is calculated and hard-coded in a VPD policy for every user. Here we adopt the unsupervised nearest neighbor algorithm to rank the top-K documents. In Section 5.2.1, three different strategies for nearest neighbor search, namely the brute force, k-d tree, and ball tree algorithms, are introduced. As shown in Kumar et al. (2008), the ball tree in k-nearest-neighbor search yields excellent results in terms of both accuracy and efficiency.

5.2.1 Unsupervised Nearest Neighbor Offline Training

5.2.1.1 Brute Force Algorithm

Brute-force search, also known as exhaustive search, is the basic problem-solving technique for the nearest neighbor problem: it enumerates all possible candidates for the solution and validates whether each candidate satisfies the problem criteria.

Figure 5.2: ABAC Efficiency with QUERY2

The brute-force nearest neighbor ranking iterates as shown in Algorithm 3. The algorithm is fairly simple to implement, and it will always converge to a solution if one exists. However, its cost is proportional to O(|X| × |Z|), which grows fast for many realistic problems. Therefore, the brute-force algorithm is recommended for problems of limited size, or when accuracy is emphasized over speed. Otherwise, brute force is usually used as a baseline for benchmarking other algorithms or heuristics.

5.2.1.2 K-D tree

The K-D tree follows the principle of binary search (Bentley (1975)). It uses a binary tree to store the training data X, where every node in the binary tree is a k-dimensional point. Every non-leaf point generates a hyper-plane that separates the feature space of interest into halves. Points to the left of the hyper-plane represent the left subtree of the node, while points to the right of the hyper-plane represent the right subtree. The testing data Z then follows Algorithm 5 to get the closest point in X for every zi ∈ Z.

Figure 5.3: Threshold CBAC Efficiency with QUERY1

Figures 5.8 and 5.9 illustrate a 2-D K-D tree example. The K-D tree guarantees O(n log n) complexity.

5.2.1.3 Ball Tree Algorithm

The ball tree algorithm, also known as the metric tree algorithm, is a variation of the K-D tree (Omohundro (1989); Uhlmann (1991)). Instead of using hyper-planes, the ball tree algorithm utilizes hyper-spheres to split the space (sub-space). Algorithm 6 explains the construction of a ball tree, and Algorithm 7 explains how to search for the k nearest neighbors of a testing data point z. Given a testing data set Z, Algorithm 7 is easily iterated over all points in Z. With the improved dimension selection for splits, the ball tree is considered a better choice than the K-D tree.
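In practice such a search is available off the shelf; the following sketch with scikit-learn's NearestNeighbors (an assumption of this illustration) mirrors the offline ranking used in our experiments:

# A sketch of ball-tree k-NN search (scikit-learn assumed); X stands in for
# the document feature vectors, Z for the documents a user already owns.
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(1000, 50)   # all records (e.g. densified TF-IDF vectors)
Z = X[:3]                      # the user's own records

nn = NearestNeighbors(n_neighbors=10, algorithm="ball_tree").fit(X)
dist, idx = nn.kneighbors(Z)   # 10 closest records per owned document
print(idx.shape)               # (3, 10): candidate accessible record indices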


Algorithm 3 Brute Force Algorithm
Require: Training Data X, Testing Data Z, and k.
1: Initialize Y with an empty matrix.
2: for all zi ∈ Z do
3:   for all xj ∈ X do
4:     Compute the distance: d(zi, xj)
5:   end for
6:   Rank the d(zi, ·) ascendingly.
7:   Append the indices of X with the top-k d(zi, ·) as a column to the right of Y.
8: end for
9: return Y

Algorithm 4 Calculate T = kdtree(X, d = 0)
Require: d ≥ 0
Ensure: X is an n×k matrix.
Ensure: T = kdtree(X, d).
1: axis ← d mod k
2: Sort points according to axis and choose the median as the pivot element.
3: Initialize an empty tree node N.
4: N.value ← median
5: X.left ← points in X before the median
6: X.right ← points in X after the median
7: N.leftChild ← kdtree(X.left, d + 1)
8: N.rightChild ← kdtree(X.right, d + 1)
9: return N


Algorithm 5 Search in K-D Tree
Require: Binary Tree T, Testing Data Z, and k.
1: Initialize Y with an empty matrix.
2: for all zi ∈ Z do
3:   Initialize y with an empty vector.
4:   B ← T
5:   for all i ∈ {1, ..., k} do
6:     Search zi against B based on the split dimensions.
7:     Once a leaf node bj is reached, store the node as the current best bc = bj.
8:     Calculate d(zi, bj) to be the radius of a hyper-sphere centered at zi.
9:     if the hyper-sphere has an intersection with the existing hyper-planes then
10:      Compute the nodes represented in the intersected hyper sub-spaces to get the closest point bc.
11:    end if
12:    Add bc to y.
13:    Delete bc in T
14:  end for
15:  Append y as a column to the right of Y.
16: end for
17: return Y.


Figure 5.4: Threshold CBAC Efficiency with QUERY2

5.2.2 Experiments

In the experiment, we utilize k-NN with the ball tree algorithm. The naive offline CBAC efficiency is tested against a baseline with no constraint on the users' accessible range; that is, the baseline is the situation where no access control policy is applied and users can query the entire database without constraints. In the test, we force the baseline to return the same number of instances as CBAC does for every user. Figure 5.10 shows that the offline CBAC model performs nearly the same as queries without any content-based access control policy.

The other option for offline similarity assessment is to build a table that stores the pair-wise similarities of instances across the database. The table holds the user's ID, the ID of an instance belonging to him or her, the ID of the other instance, and the similarity score. This option is suitable for small and medium-sized databases. An advantage of this option is that the database can be updated frequently: the only things that need to be changed are the scores associated with the updated data objects. The overall advantage of offline similarity assessment is that it makes indexing the pair-wise similarity scores possible, which further boosts the sorting efficiency of the CBAC model, and the time efficiency even for thresholding.
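A sketch of materializing such a table (names are hypothetical; in the deployed system these rows would be bulk-inserted into an indexed Oracle table):

# Sketch: build rows (user_id, owned_doc_id, other_doc_id, score) for the
# offline pair-wise similarity table. sim() and the ownership map are
# hypothetical stand-ins for the real similarity function and schema.
def build_similarity_rows(ownership, all_doc_ids, sim):
    rows = []
    for user_id, owned_ids in ownership.items():
        for own_id in owned_ids:
            for other_id in all_doc_ids:
                if other_id not in owned_ids:
                    rows.append((user_id, own_id, other_id,
                                 sim(own_id, other_id)))
    return rows  # indexing these by score makes thresholding and sorting fast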

Figure 5.5: Threshold CBAC + ABAC Efficiency with QUERY1



Figure 5.6: Threshold CBAC + ABAC Efficiency with QUERY2

Figure 5.7: Top-10 CBAC Efficiency


Figure 5.8: 2-D K-D Tree Subspace Splits

Algorithm 6 Calculate T = balltree(X)
Ensure: X is an n×k matrix.
Ensure: T = balltree(X).
1: c ← the dimension of greatest spread
2: Sort points according to dimension c and choose the median as the pivot element.
3: Initialize an empty tree node N.
4: N.value ← median
5: X.left ← points in X before the median
6: X.right ← points in X after the median
7: N.leftChild ← balltree(X.left)
8: N.rightChild ← balltree(X.right)
9: return N


Figure 5.9: K-D Tree Example

Figure 5.10: Offline Efficiency


Algorithm 7 knn_search(z, k, Q, T)
Ensure: z is a d-dimensional vector.
Ensure: k is the number of nearest neighbors of z to search for.
Ensure: Q is a max-first queue of length k.
Ensure: T is the constructed ball tree.
1: if distance(z, T.pivot) ≥ distance(z, Q.first) then
2:   return Q unchanged
3: else if T is a leaf node then
4:   for all nodes p in T do
5:     if distance(z, p) < distance(z, Q.first) then
6:       Add p to Q
7:       if size(Q) > k then
8:         Remove the furthest neighbor from Q
9:       end if
10:    end if
11:  end for
12: else
13:   Let T.child1 be the child node closest to z.
14:   Let T.child2 be the child node furthest from z.
15:   knn_search(z, k, Q, T.child1)
16:   knn_search(z, k, Q, T.child2)
17: end if


Chapter 6

CBAC Optimizing Strategies

In this chapter, we discuss two optimization approaches for enforcing content-based access control. First, we aim to improve the efficiency of the approach, especially when similarity assessment is performed on-the-fly, using a blocking/clustering technique. Next, we also try to improve the accuracy of content similarity assessment, especially for short text content, where term-distribution based approaches would fail.

6.1 Content-Based Blocking

In the previous chapter, we have shown that content-based access control can be efficiently enforced with offline similarity assessment. However, in some scenarios, the user credentials are constantly updated, or provided with the query. Hence, it is not possible to pre-compute similarities offline: the content similarity assessment needs to be performed on-the-fly for every record. To expedite query processing, we introduce the content-based blocking scheme.

When the similarity function Simd(·,·) is provided with the access control policies, we can pre-partition the records into non-overlapping blocks, based on the content similarity of the records. That is, we pre-cluster the records into c clusters, so that records with similar contents are labeled in the same cluster: Simd(di, dj) is large when di and dj belong to the same cluster, while Simd(di, dj) is small when di and dj belong to different clusters. The centroid of cluster Ck is defined as the average of the documents in the cluster:

μk = (1 / |Ck|) × ∑d∈Ck d

where |Ck| denotes the number of documents in the cluster. The centroid vectors of all clusters are stored separately. Note that the centroid vectors are not human comprehensible. Alternatively, it is also possible to store the document that is closest to each centroid, so that administrators can easily estimate the content of the clusters. On average, each cluster will have N/c documents; however, most clustering approaches do not guarantee that cluster sizes are balanced.

During query processing, each incoming query is first compared with the cluster centroids, to identify the most similar x clusters, where x ≪ c. In practice, x is usually a very small number, e.g. 1% of c or smaller. Next, the query is only evaluated against the records in the selected x clusters: a predicate is added to the query that requires the records to carry one of the x cluster labels. In this way, in content-based access control enforcement, similarity assessment is only performed between u and the records from the x most similar clusters, as sketched below.
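To make the query-time step concrete, the following is a minimal sketch (toy data; the column name CLUSTER_LABEL and the choice of cosine similarity are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
centroids = rng.random((227, 50))   # c = 227 cluster centroids (toy data)
u = rng.random(50)                  # the user's credential vector

# Rank centroids by cosine similarity to u and keep the top-x cluster labels.
sims = (centroids @ u) / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(u))
labels = np.argsort(sims)[::-1][:3]  # x = 3 most similar clusters

# The extra predicate appended to the user's query (column name is hypothetical).
predicate = "CLUSTER_LABEL IN (%s)" % ", ".join(map(str, labels))
print(predicate)
```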

The assumption is that when Sim_d(d_i, d_j) is high, Sim_d(u, d_i) and Sim_d(u, d_j) are expected to be similar. In fact, for the vector space model and Euclidean distance, we always have the triangle inequality: |u − d_j| ≤ |u − d_i| + |d_i − d_j|. Therefore, the clusters that are most similar to u are more likely to contain records that are most similar to u. For the same reason, we can eliminate clusters that are different from u, since their records are supposed to be different from u.

Any vector space clustering method could be employed in this approach. Since clustering is computed offline, the computation is not a big concern. On the other hand, there are clustering approaches that allow one item to belong to multiple clusters: overlapping clustering. When overlapping clustering is employed, each record is attached with multiple labels, and during query processing, a smaller value could be picked for x, e.g. 3.


6.1.1 Naive k-means Clustering

We borrow the concept of clustering from machine learning to group similar documents into clusters. When comparing the query and the original access range of the session user against the entire database, only related clusters are considered as result candidates, while unrelated clusters are blocked.

k-means clustering is a method to partition the feature space into k Voronoi cells (Aurenhammer (1991)). Algorithm 8 illustrates the process of basic k-means clustering. It starts with a random selection of k points/centroids in the training data X as seeds, and then iterates between assigning the points of X to the k cluster centroids and re-calculating the k centroids from the most recent assignments. Although k-means is very intuitive, it cannot guarantee convergence to the global optimum (Vattani (2011)), and it is very expensive, at a cost of O(n^{dk+1} log n), where n is the number of entities to be clustered and d is the feature dimensionality (Inaba et al. (1994)). The next two subsections introduce a methodology for careful selection of the k initial centroids, and a scaled algorithm for k-means clustering.

Algorithm 8 Basic k-means Clustering

Require: The training data set X, and the cluster number k

1: Randomly select k points in X to be the initial k centroids of clusters: C

2: Assign each point in X to the nearest centroid in C to get the clustering partition V.

3: Re-calculate the k centroids as the means of the partitions in V: C′.

4: while C′ ≠ C do

5: C ← C′

6: Assign each point in X to the nearest centroid in C to get the clustering partition V.

7: Re-calculate the k centroids as the means of the partitions in V: C′.

8: end while

9: return C and V


6.1.2 The Advantage of Careful Seeding: k-means++

In the previous subsection, we introduced k-means clustering and its two major shortcomings. In this subsection, we address the first one: its inability to guarantee convergence to the global optimum. The key cause of k-means' convergence to local optima is the arbitrary selection of the initial k centroids. k-means++ was introduced to carefully seed the initial k centroids by finding the most representative k − 1 centroids, given one arbitrary initial centroid (Arthur & Vassilvitskii (2007)). Algorithm 9 illustrates the careful seeding part; a standard k-means clustering follows the seeding procedure. Let D(x) denote the shortest distance from a data point to the closest centroid already chosen.

Algorithm 9 The Seeding Algorithm of k-means++

Require: The training data set X, and the cluster number k

1: Randomly select a point c1 in X to be one initial centroid, and add it to C.

2: while |C| < k do

3: Choose a new centroid ci by sampling x ∈ X with probability D(x)² / Σ_{x∈X} D(x)².

4: Add ci to C.

5: end while

6: return C
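A minimal numpy sketch of the seeding step (the RNG and the toy data are our additions):

```python
import numpy as np

def kmeanspp_seed(X, k, rng=np.random.default_rng(0)):
    """k-means++ seeding: pick each new centroid with probability
    proportional to D(x)^2, the squared distance to the closest chosen centroid."""
    n = len(X)
    centroids = [X[rng.integers(n)]]           # step 1: one arbitrary centroid
    while len(centroids) < k:
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()                  # D(x)^2 / sum of D(x)^2
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

seeds = kmeanspp_seed(np.random.rand(500, 10), k=5)
```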

6.1.3 Scaled k-means++ with mini-batch Strategy

k-means++ provides a heuristic method to seed the k centroids, as shown in the previous subsection; however, k-means++ is still very expensive for large-scale data with high dimensionality. In this subsection, we introduce the mini-batch strategy for k-means clustering to improve its scalability. The mini-batch strategy uses uniform random sampling to generate a small "training data batch". Under the uniform sampling assumption, the mini-batch proportionally reflects the structure of the true training data set, and by using stochastic gradient descent (SGD), mini-batch training converges to the most likely k Voronoi cells given that the mini-batch b is generated from X. As a golden standard, a model can only be as good as its input data: a mini-batch solution is an approximation to the solution with the entire training data set.

Algorithm 10 Mini-batch k-means

Require: The training data set X, the cluster number k, mini-batch size b, and iterations t.

1: Select k centroids from X, either randomly or by seeding.

2: v ← 0

3: for i ∈ {1, ···, t} do

4: Randomly select b points in X to form the mini-batch M.

5: for x ∈ M do

6: d[x] ← f(C, x)  ▷ cache the centroid nearest to x

7: end for

8: for x ∈ M do

9: c ← d[x]

10: v[c] ← v[c] + 1  ▷ per-centroid update count

11: η ← 1 / v[c]  ▷ per-centroid learning rate

12: c ← (1 − η)c + ηx  ▷ gradient step toward x

13: end for

14: end for

15: return C
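Assuming scikit-learn is available, its MiniBatchKMeans combines the k-means++ seeding of Algorithm 9 with the mini-batch updates of Algorithm 10; a usage sketch on toy data:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans  # assumes scikit-learn is available

X = np.random.rand(10000, 50)   # toy document-vector matrix

# k-means++ seeding plus the mini-batch update rule of Algorithm 10.
km = MiniBatchKMeans(n_clusters=227, init="k-means++", batch_size=1000,
                     max_iter=100, random_state=0)
labels = km.fit_predict(X)       # one cluster label per record
centroids = km.cluster_centers_  # stored separately for query-time blocking
```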

6.1.4 Experiments

In the experiments, we perform offline clustering on the entire repository. For document clustering, k-means is well suited to vector-space training, as the features in the vector share the same physical meaning, so no extra weighting scheme is needed, and sphere distance or cosine similarity is more appropriate than other distance or similarity functions. Here we use non-overlapping clustering, i.e., each record belongs to exactly one cluster, according to the top similarity score between each centroid and the record itself. Non-overlapping clustering performs exclusive assignment, as compared to mixture modeling.

Although k-means clustering has been criticized for its inability to infer k in advance, it is simple to implement and generalizes to a broad range of applications, including document clustering and multimedia clustering. For the setting of k, we refer to a simple rule of thumb (Mardia et al. (1979)): to partition n data objects into k clusters, take k = √(n/2). Concretely, in the experiments, k = 227. Figures 6.1, 6.2, 6.3, 6.4, and 6.5 show the results of blocking, which boosts efficiency by restraining the search to a small portion of the entire database.

Figure 6.1: Threshold CBAC + Blocking Efficiency with QUERY1


Figure 6.2: Threshold CBAC + Blocking Efficiency with QUERY2

6.2 Content-based labeling

6.2.1 Document labeling

As we have introduced in Chapter 5, the content-based access control model could take any attribute-similarity measurement. So far, we have heavily employed the vector space model to assess document similarity based on the distribution of terms. While this model is the most popular method in information retrieval applications, it suffers from a drawback: it relies on the bag-of-words representation, not the meanings behind the terms. More precisely, the bag-of-words model cannot reflect the semantic concepts beyond word occurrences, and hence cannot bridge the semantic gap in understanding the true meaning of unstructured text. In particular, the vector space model is especially ineffective for short documents. For instance, if we have a table of Twitter-style short text snippets, enforcing content-based access control with the vector space model would be impractical, since short text segments do not contain enough terms for term-distribution-based content modeling to be effective.

Figure 6.3: Threshold CBAC + ABAC + Blocking Efficiency with QUERY1

As an example, let us look at the following two short documents:

D1: privacy preserving similarity assessment for

semi-structured data

D2: private XML document matching

It is clear that D1 and D2 are both about the same topic. However, in vector space model, D1 and

D2 are likely to be orthogonal (if "private” and "privacy” are considered to be different terms, as

in most IR systems). Unfortunately, this type of short texts are heavily used in databases, and it is

expected to effectively enforce content-based access control on such short text attributes.

To tackle this problem, we need to employ information understanding methods that work better than the term-distribution-based models. In particular, a group of annotation-based approaches have been proposed in the information retrieval literature. The basic idea is to label short text documents with a set of pre-selected unambiguous terms (a.k.a. topics), so that documents are represented as vectors in the new unambiguous "topic space" instead of the "term space", and document-wise similarity can be measured in this topic space. The major concerns with these approaches are: (1) the quality of the constructed topic space, and (2) the accuracy of document tagging.

Figure 6.4: Threshold CBAC + ABAC + Blocking Efficiency with QUERY2

We employ both non-negative matrix factorization and TAGME (Ferragina & Scaiella (2010)) annotation on every abstract in the database, to transform the content representation from bag-of-words to bag-of-topics. In our experiments, we used the Python scikit-learn package to implement non-negative matrix factorization and extract different numbers of topics from the NSF corpus. The top 10 words of 10 topics, 20 topics, 50 topics, and 100 topics are shown in Tables A.1, A.2, A.3, and A.4 in the Appendix. We use RESTful API calls to extract the TAGME annotations. TAGME mainly relies on topics collected from Wikipedia, which yields a constructed topic space of relatively high quality. For the accuracy of document tagging, TAGME uses a weighted strategy to give users a soft boundary of choices: it introduces a parameter ρ to represent the accuracy, or goodness, of a topic tagged onto the document. By examining these ρ's, a certain cut-off can be determined according to the needs of different applications.

Figure 6.5: Top-10 CBAC + Blocking Efficiency
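As an illustration, a sketch of the two labeling pipelines on a toy corpus (the TAGME endpoint, parameter names, and response fields follow the public TAGME service and should be treated as assumptions; an API key is required):

```python
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

abstracts = ["privacy preserving similarity assessment for semi-structured data",
             "private XML document matching"]          # toy corpus

# Bag-of-words -> bag-of-topics via non-negative matrix factorization.
tfidf = TfidfVectorizer(stop_words="english")
V = tfidf.fit_transform(abstracts)
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(V)            # per-document topic weights
H = nmf.components_                 # per-topic term weights (top words per topic)

# TAGME annotation over REST (endpoint and token parameter are assumptions
# based on the public TAGME service documentation).
resp = requests.get("https://tagme.d4science.org/tagme/tag",
                    params={"lang": "en", "gcube-token": "<YOUR_KEY>",
                            "text": abstracts[0]})
annotations = resp.json().get("annotations", [])
topics = [(a["title"], a["rho"]) for a in annotations if a["rho"] >= 0.2]
```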

6.2.2 Soundness of CBAC Enforcement

CBAC is said to be sound when a CBAC enforcement mechanism makes access control decisions that are consistent with users' decisions. For instance, if we evaluate CBAC in Example 1, we would like to ask: "when the CBAC enforcement mechanism allows Alice to access a record, will the supervisor agree with this decision?" Practically, when CBAC enforcement allows users to access records that are semantically similar to the seed records in the base set, the enforcement mechanism is sound. Hence, the soundness of CBAC enforcement depends on the soundness of the content model and the similarity assessment. In practice, understanding semantic meanings in unstructured text is a very difficult task. In this dissertation, we follow the notions of the information retrieval community to evaluate the precision of the proposed approaches.

It is impractical to evaluate the relevance of a record against 100K records. Therefore, we attempt to evaluate the top-100 records identified by CBAC, to assess if a DBA would agree with CBAC's access decisions. In the experiments, we first use three rules to coarsely identify "relevant records", and then manually examine the content of these records to make adjustments. We noticed that every record in the NSF database is assigned a set of field identification numbers. If two records share one or more field identification number(s), they are initially considered to be relevant. Besides, if two records share two closely related field identification numbers, they are considered to be relevant. Last, if the seeds' content shows a close relationship to a record's field name, they are considered to be relevant. Finally, we manually examine all the "relevant documents" identified by these rules, and eliminate the ones that appear to be irrelevant to us. We tested queries from 60 different users in different disciplines, including biology, chemistry, mechanical engineering, mathematics, etc., and measured the precision of the top-K results (e.g. K = 10 to 100) for all the queries. As shown in Figure 6.6, our CBAC enforcement is sound, as the user would agree with approximately 80% of CBAC's (positive) decisions, except for non-negative matrix factorization (NMF) tagging. The reason NMF tagging does not improve the precision of top-K results is that NMF is essentially an approximation of the base document repository, without the facility of external resources (e.g. Wikipedia). Although it provides a "semantic" representation of the text, it also compresses the information, which impacts the accuracy of similarity assessment. With the facility of Wikipedia annotation, the TAGME approach is able to improve both query efficiency and CBAC accuracy. The efficiency results are shown in Figures 6.7, 6.8, 6.9, 6.10, and 6.11. The efficiency result for combining blocking and labeling in the top-K scenario is shown in Figure 6.12.

We do not evaluate recall, which is another popular metric in information retrieval. With this amount of data, it is hard to manually identify all the relevant documents for each user. Meanwhile, with the approximation assumption of CBAC (Section 2), it is tolerable if a small fraction of relevant records are not accessible to the user, or a small number of irrelevant records are made accessible.

Please note that the accuracy of the content similarity measurement is not a research problem of the security community; rather, we utilize methods from the information retrieval and NLP communities. Any content modeling and similarity assessment method could be used in CBAC.

6.2.3 Experiments

In the experiments, when using TAGME, each topic is associated with an attribute ρ in the range of 0 to 1, which reflects the quality of the annotation within the input text. Over 7 million topics are annotated over the entire database, i.e., about 70 topic annotations per abstract. Considering that the entire set of annotations is not necessary, we use a threshold on ρ as a filter to pick out qualified topic annotations. In estimating the threshold, we fit all the collected ρ's into a nonparametric distribution. Figure 6.13 shows the fit of the density function, and Figure 6.14 shows the fit of the cumulative probability. Figure 6.14 shows that by setting the threshold to 0.2, 80% of the topics are removed; effectively, we have used the Pareto principle (80-20 rule) to determine the cutoff on the thresholded ρ. Therefore, in the experiments, we use 0.2 as the cut line to filter topics. A similar filtering is also done with the non-negative matrix factorization (NMF) annotation. In NMF, we start the filtering with an NMF of 100 topics, and obtain the curves in Figures 6.15 and 6.16.
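The cutoff selection amounts to taking the empirical 80th percentile of the collected ρ's; a minimal sketch (the Beta-distributed stand-in for the real ρ's is our assumption):

```python
import numpy as np

rhos = np.random.beta(1.2, 8.0, size=7_000_000)  # stand-in for the collected rho's

# Pareto-style cutoff: keep only the top 20% of annotations by rho,
# i.e. the threshold is the empirical 80th percentile.
threshold = np.quantile(rhos, 0.80)
kept = rhos[rhos >= threshold]
print(threshold, kept.size / rhos.size)          # ~20% of annotations survive
```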

The filtered topics are then concatenated with the ACCUM operator (',') and added into a new attribute TAG of data type CLOB in the table AWARD_BASIC. A new CONTEXT index is created on this attribute. Thus, instead of calculating similarity over words, the model here calculates similarity over topics. Content labeling boosts efficiency, as shown in Table 6.6, in the sense of distilling the content into the most relevant topics. In general, content-based labeling makes the content much shorter than the bag-of-words representation, and it also removes ambiguities caused by word appearances.
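A sketch of this setup through the cx_Oracle driver (the connection string and the column AWARD_ID are hypothetical; the DDL and the CONTAINS/ACCUM syntax follow Oracle Text):

```python
import cx_Oracle  # assumes the Oracle client and the cx_Oracle driver are installed

conn = cx_Oracle.connect("scott", "tiger", "localhost/XEPDB1")  # hypothetical DSN
cur = conn.cursor()

# Store the filtered topics in a CLOB attribute and index it with Oracle Text.
cur.execute("ALTER TABLE award_basic ADD (tag CLOB)")
cur.execute("CREATE INDEX tag_ctx_idx ON award_basic(tag) "
            "INDEXTYPE IS CTXSYS.CONTEXT")

# Topic-level matching: ',' is Oracle Text's ACCUM operator, so records that
# accumulate more of the user's topics score higher.
cur.execute("SELECT award_id, SCORE(1) FROM award_basic "
            "WHERE CONTAINS(tag, :topics, 1) > 0 ORDER BY SCORE(1) DESC",
            topics="data privacy, access control, xml")
print(cur.fetchmany(10))
```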


Figure 6.6: Soundness of CBAC Enforcement


Figure 6.7: Threshold CBAC + Labeling Efficiency with QUERY1

Supervised labeling, in a scenario similar to TAGME, usually provides better tagging of documents; however, the labels are usually imbalanced, which means only a small fraction of the document repository can be labeled with a given pre-defined label. For this reason, in the next chapter, a supervised labeling method for multi-label classification is proposed. We try to expand the application field, so that our focus includes document tagging, but is not limited to it.


Figure 6.8: Threshold CBAC + Labeling Efficiency with QUERY2

Figure 6.9: Threshold CBAC + ABAC + Labeling Efficiency with QUERY1


Figure 6.10: Threshold CBAC + ABAC + Labeling Efficiency with QUERY2

Figure 6.11: Top-10 CBAC + Labeling Efficiency


Figure 6.12: Top-10 CBAC + Blocking + Labeling Efficiency

Figure 6.13: Density Fit


Figure 6.14: Cumulative Probability Fit

Figure 6.15: NMF 100 Density Fit


Figure 6.16: NMF 100 Cumulative Probability Fit


Chapter 7

Labeling Improvement with Multi-Label Learning (MLL)

7.1 Motivation

In Chapter 6, both unsupervised labeling and TAGME (with the facility of Wikipedia annotation) were shown to provide relevant tags for documents. Especially, TAGME, with its external content-rich resources, improves both efficiency and accuracy for CBAC. The reason TAGME is able to annotate documents with relevant tags is tied to the nature of our experimental data set: the NSF data set is an academic document data set, and Wikipedia has covered academic fields for years and has expanded to most disciplines. Consider the following example, where TAGME falls short in making relevant tags.

Example 7: A large manufacturer initiated a project to annotate products with potential category tags, which are internally defined. The project needs employees to read the product descriptions and then annotate every product. Given millions of different products and a finite set of categories, it is extremely labor-intensive for employees to read the descriptions and manually annotate all the products. Moreover, the quality of tagging is highly constrained by the employees' experience and understanding of the manufacturer's domain; the noise of the tagging is therefore unknown.

TAGME will not be functional for Example 7. First, since the tags are defined internally, even

if Wikipedia has similar "terms" for the tags, they are most likely to mean totally different things.

Second, the so-called "tags" has no well-defined content associated with them; while Wikipedia,

under each web page (topic), there is rich content to explain what it is about, and which other web

pages (topics), it is related with. In the scenario of Example 7, supervised labeling will be a possible

effective alternative to replace TAGME. Supervised labeling facilitate labeling by lowering

down the manual annotation cost. Given a sample of labeled data objects, supervised learning is

able to make predictions of tags/labels to the rest data objects in the same knowledge domain. That

being said is, first experienced employees will manually label a fraction of products, and second a

supervised learning machine is trained to learn how to label products from the guidance provided

the experience employees in their labelings. Thus, in this chapter, we want to theoretically study

the cases with manual annotation: how well we could recover the labels for the rest part of samples

thourgh supervised labelings.

7.2 Problem Definition and Challenges

Labeling documents is classified as multi-label learning (Tsoumakas et al. (2010)), where one sample can be annotated with anywhere from zero to L labels, given L pre-defined labels. In other words, learning with instances that can be tagged with any of the 2^L possible subsets of the pre-defined L labels is called multi-label learning. Multi-label learning is commonly applied in domains such as text, multimedia, web, and biological data analysis, including automatic tagging and function/topic prediction.

Multi-label learning, in ideal situations, cannot ignore the correlation between labels in the L label set. However, the main challenge is the dilemma between optimizing label correlations over an exponentially large label power-set and ignoring label correlations using the binary relevance strategy (the 1-vs-all heuristic). That is, multi-label learning either treats each "multi-label" as one single label in the label power-set, or treats each label as binary relevance, which extends multi-label learning to L parallel binary learnings. The shortcomings of the two scenarios are obvious. Classification with the label power-set usually encounters highly skewed data distributions, the so-called imbalanced problem (Chawla et al. (2004)). While the binary relevance strategy reduces the problem from exponential to linear, it totally neglects label correlations. Moreover, binary relevance keeps the imbalance rate as-is, which is, most of the time, highly skewed.

In this chapter, we propose a novel strategy that introduces Balanced Pseudo-Labels (BPL) to build more robust classifiers for imbalanced multi-label classification, a problem in which imbalanced data is innately embedded. By incorporating the new balanced labels, we aim to increase the average distances among the distinct label vectors for larger discriminative power. In this way, we also compress the label correlation implicitly in BPL. Another obvious advantage of the proposed method is that it can be combined with any classifier, and it amounts to a linear label transformation.

In the experiments, we choose five multi-label benchmark data sets, including two text data sets, one image data set, one biology data set, and one music data set, and compare our algorithm with the most state-of-the-art algorithms. Experimental results show that our algorithm outperforms them on standard multi-label evaluations in most scenarios. The reason we choose data sets from extra domains is that, although the motivation of the algorithm is to facilitate text tagging and labeling, the algorithm is general enough to be applied to other kinds of data sets. Therefore, in the rest of the chapter, we expand our focus to multi-label learning in general, not limited to the text domain.

7.3 Background

Multi-label learning (MLL), again, refers to the problem of classifying instances that can be tagged with any of the 2^L possible subsets of the predefined L labels. Intuitively, MLL is widely applied in a wide range of domains, such as text mining, multimedia classification, biological data analysis, and many more. With the rapid pace of multimedia application development and the accumulation of massive biological data, the amount of data sources such as documents, images, music, and videos in personal collections, public biological data sets, and web data streams is rapidly growing. To this point, the popularity of MLL is based on three uncommon features of multi-labeled data, namely, multi-components, multi-facets, and structural label taxonomy. That is, a sample in a learning problem that has at least one of the following characteristics is a multi-label sample:

- The sample has multiple components, and each component has at least one label to be mapped to.

- The sample has multiple facets, and each facet has at least one label to be mapped to.

- The sample is from a hierarchical taxonomy structure, where the parent nodes of the taxonomy are also treated as labels of the sample.

For example, automatic image tagging is a process that assigns multiple keywords to a digital image. Figure 7.1 shows a typical image labeled with sunset and sea, containing both a setting sun and a sea sight. That is, the image has at least two components, a sun and a sea sight, each of which is mapped to one of the labels sun and sea. Another typical example is news about David Beckham tagged with both entertainment and sports: entertainment and sports are two different facets of David Beckham, which are mapped to two different types of news. The third example is the gene function categorization shown in Figure 7.2. The protein P75957 is first annotated with the grey blocks. The gene ontology (GO) molecular function annotation taxonomy forms a forest data structure, and each protein receives its own node annotations from GO plus all of their parent node annotations. Based on this principle of GO molecular function annotation, the protein P75957 is annotated with the entire forest shown in Figure 7.2. As shown above, multi-label learning is omnipresent in our daily life. Based on the diversity and broadness of the applications to which MLL applies, it has become a hot topic of theoretical and applicative interest.


One of the commonly used MLL strategies is called problem transformation, which transforms MLL into multiple binary classifications. The most popular problem transformation strategy is binary relevance (BR) (Tables 7.1 and 7.2). Boutell et al. addressed BR transformations using SVM in scene analysis (Boutell et al. (2004)). Zhang and Zhou used an algorithm called ML-KNN to combine the maximum a posteriori (MAP) principle with k-Nearest Neighbors (kNN) on each individual label (Zhang & Zhou (2007)). Lin and Chen showed situations where, using BR-style kNN, multi-label samples might be viewed as outliers (Lin & Chen (2010)). They proposed a kNN-based MLL approach called voting Margin-Ratio kNN (Mr. kNN) to prevent the false negatives that happen in ML-KNN. While BR is simple to implement, the common criticism of BR-based methods is that they neglect label correlation. However, label correlation is crucial in many applications. For example, an image labeled with beach may likely be labeled with sea, while an image labeled with mountain is unlikely to be labeled with indoor. Importantly, in bioinformatics, digging out gene function correlations will probably reveal a functional pathway. This leads to another approach, called label power-set (LP) (Tables 7.1, 7.3, and 7.4). Given L predefined labels, every label subset occurring in the data is combined as a new label in LP. The advantage of LP is that it codes the label correlation into the classification process. However, it raises the complexity from linear (as in BR) to exponential. What's more, the imbalance rate deteriorates to the extreme. Tsoumakas and Vlahavas developed the RAKEL strategy to consider label correlation under random k-label combinations (Tsoumakas & Vlahavas (2007)). As a strategy between BR and LP, RAKEL neglects the problems lying beneath MLL other than label correlation, including the imbalanced problem and the ambiguity in label combinations. Other methods have been proposed to deal with the label correlation problem. Wang et al. extended Green's function to MLL (Wang et al. (2009)): they used a kernel-based method to modify the optimization function by introducing a penalty term constructed from the label correlation matrix. It is related to the algorithm of (Ji et al. (2010)), in the sense of extracting a shared subspace for MLL by adding a label dependency term. Kang et al. proposed correlated label propagation in MLL to introduce label dependency (Kang et al. (2006)).


In this chapter, we propose a strategy that introduces balanced pseudo-labels to improve MLL with skewed data distributions. Compared to previous work, the key contributions of this chapter are highlighted as follows:

(1) We introduce pseudo-labels to increase the distances between the new label vectors, so as to reduce the ambiguities in the original label space.

(2) We put emphasis on the differences among the pseudo-classifiers (trained on the pseudo-labels), so as to make the pseudo-classifiers work efficiently together with little redundancy.

(3) We consider the balance rate of the pseudo-labels, in order to build more robust pseudo-classifiers.

The rest of the chapter is organized as follows. In Section 7.4, we discuss the relationship between our framework and existing methods. In Section 7.5, we describe the proposed method with concrete mathematical definitions. In Section 7.6, we describe the data sets and compare our method with other state-of-the-art multi-label algorithms.

Table 7.1: Multi-Label Example

No. Feature Label

1 X1 l1,l2

2 X2 l3

3 X3 l2

4 X4 l1,l2,l3

5 X5 l1,l2

Table 7.2: Binary Relevance Matrix Example

Feature l1 l2 l3

X1 1 1 0

X2 0 0 1

X3 0 1 0

X4 1 1 1

X5 1 1 0


Table 7.3: Label Power-Set Example

Label Label Power-Set

l1,l2 Z1

l3 Z2

l2 Z3

l1,l2,l3 Z4

Table 7.4: Label Power-Set Matrix Example

Feature Label Power-Set

X1 Z1

X2 Z2

X3 Z3

X4 Z4

X5 Z1

7.4 Related Work

The proposed method is inspired by error-correcting codes (ECC). ECC is a technique developed in the 1950s for detecting and correcting errors due to channel noise in signal communication (Hamming (1950)). In ECC, the original message is coded into a longer code-word by adding more binary digits. The longer code-word is transferred over a signal channel from the transmitter to the receiver. At the receiving terminal, the longer code-word is decoded back to the original message.

The idea has been successfully applied to multi-class classification by transferring every label into a binary vector (Dietterich & Bakiri (1995)). Recently, researchers have applied the ECC method to MLL (Kouzani & Nasireding (2009); Ferng & Lin (2011)). It has been shown that the ECC method boosts the performance of MLL in Hamming loss and subset accuracy (Ferng & Lin (2011)). However, limited explanation has been given as to why the ECC method boosts performance. A possible explanation is that the original label space might be too tight to discriminate the distinct label vectors; by increasing the dimensionality of the label space, ECC helps to enlarge the distances between label vectors in the new space. In the proposed method, we formulate the objective function so that these distances increase during the optimization iterations. Meanwhile, as the imbalanced problem makes binary classifiers unstable, we try to make the newly added pseudo-labels balanced.

Figure 7.1: Scene of Sunset at Sea

Other methods have been applied to transform the label space. Compressed sensing (Hsu et al. (2009)), principal label space transformation (Tai & Lin (2012)), and canonical correlation analysis (Zhang & Schneider (2011); Sun et al. (2011)) have been used in MLL to transform the label vectors into a real-valued scale; the MLL problem is then transformed into a regression problem, and thresholds of prediction must be learned in these methods. A stable regression model usually needs more samples than classification, so transferring MLL to a regression method, compared to BR-based methods, can make the model unstable. Classifier chains (Read et al. (2011)) embed every previous label in the feature space; this method considers label correlation in a serial order. In our proposed method, we consider label correlation in a batch style rather than a serial order, since there is no evidence that label correlations come in a sequential order.

Figure 7.2: Molecular Function Annotation of P75957

In a word, with the aim of increasing the distances between label vectors and deriving discriminative pseudo-classifiers, the proposed method, inspired by ECC, makes the first effort to add balanced pseudo-labels to boost the performance of imbalanced MLL.

7.5 Methodology

In this section, we describe the proposed method with concrete mathematical definitions.

7.5.1 Preliminary

Given an input domain X ⊆ R^d and a binary output domain Y ⊆ B^K, where B consists of −1 and +1, let T represent the training data, T = {X, Y} = {(x_1, y_1), (x_2, y_2), ···, (x_n, y_n)}. Let U ⊆ B^{m×K} denote the unique label vectors in T, and let w ∈ R^m be the occurrence weight vector of U, where m is the number of unique label vectors in T; w is normalized so that the summation of its elements is equal to one. BPL's first goal is to find a Z ∈ B^{m×p} such that adding Z to the right of U maximizes the average distance between the newly formed label vectors. In the next subsection, we discuss how we form the objective function to find Z. It should be noted that m ≤ 2^p − 2, since the added pseudo-labels should at least suffice to represent the original unique label vectors, with at least one +1 and one −1 in each pseudo-label column.

7.5.2 Objective Function

In seeking Z, we consider row-separation, to increase the distances between label vectors, and column-separation, to make the pseudo-classifiers distinct from each other. We use Z = {z_r1, z_r2, ···, z_rm} to denote the m row vectors of Z, and U = {u_r1, u_r2, ···, u_rm} to denote the m row vectors of U. Similarly, we use Z = {z_c1, z_c2, ···, z_cp} to denote the p column vectors of Z. Row-wise, we try to increase the distance between the new label vectors (u_ri, z_ri) and (u_rj, z_rj), where i ≠ j and i, j = 1, 2, ···, m. The distance comes from two parts: the distance between u_ri and u_rj, which is fixed by the training data, and the distance between z_ri and z_rj, which is what we need to work on. The distance between any two rows of Z is:

d(z_ri, z_rj) = (z_ri − z_rj)(z_ri − z_rj)^T = z_ri z_ri^T + z_rj z_rj^T − 2 z_ri z_rj^T.   (7.1)

Since Z ∈ B^{m×p}, the inner product of a row of Z with itself is the constant p, so the distance is inversely proportional to the inner product of the two rows. Then, maximizing the row-wise distances of Z amounts to minimizing:

P1 = ∑∑ Z Z^T.   (7.2)

Column-wise, we need to build distinct pseudo-classifiers; in the mathematical sense, we also need to maximize the distances between the columns of Z. Therefore, similarly, we minimize:

P2 = ∑∑ Z^T Z.   (7.3)

The last part of BPL brings balance into the pseudo-labels. The balance rate is guaranteed by minimizing the difference between the counts of positive and negative samples. We therefore calculate the balance rate vector β of the pseudo-labels as:

β = 1_{1×m} × ((w × 1_{1×p}) · Z),   (7.4)

where 1_{1×m} denotes the all-ones matrix with one row and m columns. Thus, the third minimization term is:

P3 = β β^T.   (7.5)

To consider the three minimization parts together, a trade-off parameter λ is introduced, and the objective function becomes:

Q = ∑∑ Z Z^T + ∑∑ Z^T Z + λ · β β^T,   (7.6)

where ∑∑ denotes the sum over all entries of the matrix. This objective is a quadratic integer program; as is well known, integer programming is expensive (Papadimitriou (1981)), so convergence in polynomial time cannot be guaranteed. Thus, we use a greedy search for the improved label space, as shown in the later algorithms.

In the following section, we will introduce the learning and the predicting processes.

7.5.3 Algorithm

In this section, we describe the details of the learning and predicting processes of BPL. In BPL, the first step is to generate the pseudo-labels; the generation is described in Algorithm 11. In the experiments, maxNum is set to 10^6; although maxNum seems large, the permutation step is quick. The range of λ is Λ (see Section 7.6.2). The second step is to learn the individual label parameters by maximizing the F1 measure; here, parameters are tuned independently to maximize the performance of the individual classifiers.


Algorithm 11 Pseudo-Label Generation

Require: T, p, λ and maximum iteration M

1: Compute U and w from T.

2: Q ← ∞

3: Generate bV of size (2^p − 2) × p: all combinations of p bits, excluding the all −1's and the all +1's rows.

4: for all i ∈ {1, 2, ···, M} do

5: Permute the rows of bV, select m rows to form Z, and calculate the objective value Q_curr.

6: if Q_curr < Q then

7: Record the current Z and set Q ← Q_curr.

8: end if

9: end for

10: return U, w and Z
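A minimal numpy sketch of Algorithm 11 together with the objective of Equation 7.6 (the exhaustive enumeration of bV is only feasible for small p; toy helper names are ours):

```python
import itertools
import numpy as np

def objective(U, w, Z, lam):
    """Q = sum(Z Z^T) + sum(Z^T Z) + lam * (beta beta^T)  (Eq. 7.6)."""
    beta = w @ Z                       # weighted per-column balance (Eq. 7.4)
    return (Z @ Z.T).sum() + (Z.T @ Z).sum() + lam * (beta @ beta)

def generate_pseudo_labels(U, w, p, lam, M, rng=np.random.default_rng(0)):
    m = U.shape[0]
    # bV: all +/-1 rows of length p, excluding the all -1's and all +1's rows.
    bV = np.array([r for r in itertools.product([-1, 1], repeat=p)
                   if len(set(r)) > 1])
    assert m <= len(bV), "requires m <= 2^p - 2"
    best_Q, best_Z = np.inf, None
    for _ in range(M):                 # greedy random search over row subsets
        Z = bV[rng.permutation(len(bV))[:m]]
        Q = objective(U, w, Z, lam)
        if Q < best_Q:
            best_Q, best_Z = Q, Z      # record the current best Z
    return best_Z
```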

Algorithm 12 Individual Label Learning

Require: T, Z, k, p and classifier C

1: Compute the new label vectors L_T for the samples in T.

2: for all i ∈ {1, 2, ···, K + p} do

3: Train C's model f_i with k-fold cross-validation.

4: end for

5: return f


In the prediction part, a testing sample is first run through the models to generate the predicted K + p binary digits l′. Then, the distances from l′ to the rows of L in Algorithm 14 are calculated, to find the shortest distance and the corresponding label vector(s) in L. A weighted average of the nearest label vectors (if more than one) is computed and thresholded at 0 to return a label vector of +1's and −1's. The first K digits of the returned label vector are then given as the final prediction l. In the next section, we discuss the experiments, with a description of the data sets, the comparison algorithms, and the evaluation metrics.

7.6 Experiment

In the experiments, we applied our algorithm with random forest (Liaw & Wiener (2002)) and SVM (Cristianini & Shawe-Taylor (2000)) with linear and RBF kernels on five multi-label benchmark data sets: emotions, enron, medical, scene, and yeast. These data sets come from different domains, such as music categorization, scene image analysis, and gene function classification. They differ in training and testing sizes, category numbers, features, and other statistics related to MLL, which are presented in this section. In the following subsection, we present the statistics of these data sets.

Algorithm 13 Learning Process

Require: T, Z, p_max, k, Λ and classifier C

1: p_min ← ⌈log2(m+2)⌉

2: for all (p, λ) ∈ {p_min ≤ p ≤ p_max, λ ∈ Λ} do

3: Perform Pseudo-Label Generation.

4: Perform Individual Label Learning.

5: end for

6: Train the individual models with k-fold cross-validation, and tune p and λ to get the final p and λ.

7: for all i ∈ {1, 2, ···, K + p} do

8: Learn C's model h_i with f_i from X to L_Ti.

9: end for

10: return p, λ, U, w, Z and h


Table 7.5: Statistics of Data Sets

Data Set CATE CARD DENS |U| COVR

emotions 6 1.87 0.31 26 0.41

enron 53 3.39 0.06 545 0*

medical 45 1.2 0.03 61 0*

scene 6 1.0 0.18 14 0.22

yeast 14 4.2 0.30 164 0.01

Algorithm 14 Prediction

Require: the testing sample x, h, U, w and Z

1: Form L = (U, Z).

2: for all i ∈ {1, 2, ···, K + p} do

3: Predict l′_i via h_i.

4: end for

5: Find the shortest distances from the predicted label vector l′ to the rows of L.

6: Compute the weighted average of the label vectors with the shortest distances, and threshold the values at zero to make the final prediction t. The weights come from the corresponding elements of the weight vector w.

7: The predicted label vector l is the first K digits of t.

8: return l
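A sketch of the decoding in steps 5–7 (the Hamming-style distance and the helper names are our choices):

```python
import numpy as np

def decode(l_pred, U, Z, w, K):
    """Map a predicted (K+p)-digit vector back to the original K labels."""
    L = np.hstack([U, Z])                         # step 1: L = (U, Z)
    dist = (L != l_pred).sum(axis=1)              # Hamming distance to each row
    nearest = np.flatnonzero(dist == dist.min())  # nearest label vector(s)
    avg = (w[nearest, None] * L[nearest]).sum(axis=0)  # weighted average (step 6)
    t = np.where(avg >= 0, 1, -1)                 # threshold at zero
    return t[:K]                                  # step 7: first K digits
```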

7.6.1 Data Set Statistics

Table 7.7 shows the sample and feature sizes of the data sets. We also computed the category size (CATE), label cardinality (CARD), label density (DENS), the unique label set size |U| in the training data, and the label coverage (COVR) of each data set, shown in Table 7.5; 0* represents a number less than 10^{−10}. The cardinality and density, which are commonly used in MLL, are defined below.


Table 7.6: Imbalance Rate (%)

Average Positive Rate

emotions 30.22

enron 6.39

medical 2.79

scene 17.70

yeast 30.20

Table 7.7: Sample Sizes of Data Sets

Data Set Training Size Testing Size Feature Size

emotions 391 202 72

enron 1123 579 1001

medical 333 645 1449

scene 1211 1196 294

yeast 1500 917 103

We also introduce a new statistical metric in MLL, called label coverage, and define it together with cardinality and density as follows:

CARD = (1/n) ∑_{i=1}^{n} |Y_i|,   DENS = CARD / L,   COVR = |U| / 2^L,

where Y_i denotes the label set of sample i. Label cardinality shows the average label size per sample; it determines the degree of multi-label extent from the sample perspective. Label density shows the average label rate over the samples and the categories; it represents the label utility from the sample perspective. Label coverage reflects how crowded the original label space is.

Table 7.6 presents the imbalance rate of the five data sets. As shown in Table 7.6, every data set embeds a different rate of imbalance.


The following subsection will introduce the comparison methods in the experiment.

Table 7.8: Macro-Averaging F1 Measure (%) ↑

emo enr med sce yea

BR-LSVM 60.57 13.22 35.54 68.80 35.36

BPL-LSVM 65.92 13.57 33.63 70.85 39.68

BR-RSVM 58.85 12.08 37.21 60.06 25.46

BPL-RSVM 62.71 13.86 35.21 64.88 39.19

BR-RF 66.82 21.68 38.69 70.86 43.87

BPL-RF 70.46 23.39 36.32 78.14 50.44

IBLR 62.12 11.24 - 73.95 46.97

LS-CCA 63.14 12.78 35.86 58.36 36.99

LS-CCA-L1 62.87 20.34 42.46 63.69 36.90

LS-CCA-L2 62.80 21.53 39.75 64.89 37.76

L-M3L 64.77 17.80 - 67.04 36.14

R-M3L 57.81 22.69 - 77.17 49.83

7.6.2 Comparison Methods

We applied three different algorithms in the comparison, namely IBLR (Cheng & Hüllermeier (2009)), M3L (Hariharan et al. (2010)), and LS-CCA (Sun et al. (2011)). IBLR is an instance-based logistic regression method that combines logistic regression with k-nearest neighbors (kNN); in the experiments, we calculate the parameter α by maximizing the likelihood function, as proposed in (Cheng & Hüllermeier (2009)). M3L is a max-margin multi-label formulation method; it encodes the priors of label correlation via a correlation matrix. In the experiments, the correlation matrix is calculated from the training data via Pearson's correlation, and both linear and RBF kernels are applied. LS-CCA is canonical correlation analysis via a least squares formulation; in the experiments, LS-CCA is applied both with and without regularization, with L1 and L2 regularization implemented. The parameters and thresholds in these experiments are tuned with 3-fold cross-validation. In LS-CCA with L1 and L2 regularization, the regularization coefficient is tuned over the range of {…}.


Table 7.9: Micro-Averaging F1 Measure (%) ↑

emo enr med sce yea

BR-LSVM 62.44 53.60 78.08 68.29 63.45

BPL-LSVM 66.26 48.30 74.84 70.22 63.62

BR-RSVM 60.09 49.70 78.17 60.19 63.50

BPL-RSVM 64.71 45.02 72.50 63.22 63.87

BR-RF 67.03 58.11 80.65 70.64 64.40

BPL-RF 71.15 53.65 78.39 77.52 67.76

IBLR 62.60 43.17 - 70.99 67.02

LS-CCA 53.04 28.29 46.87 44.57 58.08

LS-CCA-L1 52.50 49.14 62.51 47.25 57.94

LS-CCA-L2 52.02 47.74 59.20 47.85 64.27

L-M3L 65.27 55.58 - 66.44 63.57

R-M3L 59.46 56.92 - 76.56 66.43

In M3L with the linear kernel, we tuned the bias with {0, 0.1, ···, 1} and the cost with {…, 6}. For the RBF kernel, we tuned the bias and the cost as in the linear kernel, and γ with {…, 6}. For all the thresholds, we learn the value for each individual label from the minimum scale S_min to the maximum scale S_max obtained from cross-validation, with an interval of (S_max − S_min)/100.

In the experiments, we applied random forest and SVMs with both linear and RBF kernels. For random forest, the node number is tuned with {10, 20, ···, 100} percent of the feature size, using the random forest package (Jaiantilal (2009)). We tuned the cost with {…, 6} for the linear and RBF SVMs with the libsvm package (Chang & Lin (2011)). In BPL, we tune p = {⌈log2(m+2)⌉, ⌈log2(m+2)⌉+1, ···, 16} and Λ = {2, ···, 2^10}.

7.6.3 Evaluation Metric

In evaluating the experiments, we use the standard metrics defined as follows, excluding metrics such as accuracy and Hamming loss, which are less informative here. In the following, TP_l, FN_l, and FP_l denote the true positive, false negative, and false positive counts for label l; MI represents micro-averaging, while MA represents macro-averaging. As recall and precision have a reciprocal relationship, the F1 measure balances the two:

F1^{MI} = 2 · recall^{MI} · precision^{MI} / (recall^{MI} + precision^{MI}),

F1^{MA} = 2 · recall^{MA} · precision^{MA} / (recall^{MA} + precision^{MA}).

In subset accuracy, δ is a boolean indicator function: if y_j = t_j is true, the result is one; otherwise, the result is zero. Subset accuracy is considered the strictest evaluation metric in multi-label classification, because it measures the exact matching of predictions.
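Assuming scikit-learn is available, these three metrics can be computed directly on binary label matrices:

```python
import numpy as np
from sklearn.metrics import f1_score, accuracy_score  # assumes scikit-learn

Y_true = np.array([[1, 1, 0], [0, 0, 1], [0, 1, 0]])  # toy binary label matrix
Y_pred = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0]])

micro = f1_score(Y_true, Y_pred, average="micro")  # sample-perspective average
macro = f1_score(Y_true, Y_pred, average="macro")  # label-perspective average
subset = accuracy_score(Y_true, Y_pred)            # exact-match (subset) accuracy
print(micro, macro, subset)
```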

7.6.4 Results

In this part, we show the results of our methods along with the comparison methods. In the tables, emo, enr, med, sce, and yea represent emotions, enron, medical, scene, and yeast.

In Table 7.8, the macro-averaging metric is presented. Macro-averaging metrics focus on the average performance from the label perspective; that is, they give equal weights to labels, regardless of the different positive sample sizes in MLL. As stated before, we consider the F1 measure the most important metric compared to recall and precision.

Table 7.10: Subset Accuracy (%) ↑

emo enr med sce yea

BR-LSVM 25.25 12.44 63.10 51.92 16.03

BPL-LSVM 32.67 14.68 64.81 65.64 21.81

BR-RSVM 25.25 7.94 62.48 43.65 16.03

BPL-RSVM 29.21 11.40 61.71 58.19 20.72

BR-RF 29.21 14.68 66.05 55.18 17.78

BPL-RF 36.63 16.75 68.68 71.49 25.52

IBLR 13.86 8.46 - 56.10 16.36

LS-CCA 0.05 0 12.40 11.04 0.11

LS-CCA-L1 0.05 0 21.86 9.45 0.11

LS-CCA-L2 0.05 0 21.24 8.19 0

L-M3L 29.21 14.51 - 50.33 16.36

R-M3L 19.80 15.37 - 63.63 18.97

The macro-averaging F1 measures in Table 7.8 show that BPL-RF outperforms the naive BR-based methods and the other methods, except on enron.

In Table 7.9, the micro-averaging metric is presented. Micro-averaging metrics focus on the average performance from the sample perspective; that is, they give equal weights to samples in MLC. The table shows that the random forest-based methods, BR-RF and BPL-RF, outperform the others in the micro-averaging F1 measure.

In Table 7.10, the subset accuracy is listed. This metric is viewed as the strictest evaluation criterion, as it only rewards samples with exact label matches. In the results, BPL-RF outperforms all the other methods on this metric. This is consistent with the proposed method: although we apply the BR strategy after generating the pseudo-labels, we aim to find the closest label vectors via BPL.

7.7 Conclusion

As presented in the experimental results, BPL is able to boost labeling accuracy when training on manual annotations, by lowering the imbalance rates and increasing the discriminative power between label vectors. Although training the labeler is expensive in time, it is performed offline. Therefore, a CBAC deployment with a content-centric database from a specific domain can train its tagging by initializing a set of manual tags covering the basic tags of that domain, using BPL together with a scalable classification algorithm like random forest.


Chapter 8

Discussions

8.1 Computational Complexity

Efficiency and scalability are major metrics in evaluating any database access control approach. As we have shown in the experiments, the content-based access control model can be efficiently enforced using native functions of the Oracle DBMS, especially with blocking and labeling. Meanwhile, we also give a formal analysis of the computational complexity.

First, without any indexing or blocking, the DBMS needs to perform a pairwise comparison between every record and the user's records, in order to make access control decisions. The computational complexity is O(N · m), where N denotes the total number of records (usually very large), and m denotes the number of records that represent the user. Assuming that we adopt the vector space model, the complexity of each comparison is O(D), where D denotes the dimensionality of the vector space (a.k.a. the size of the dictionary). D does not grow linearly with N; indeed, as long as a good number of records cover most of the words used in the context, D increases very slowly when new records are added. Meanwhile, the computation of each comparison can easily be reduced to O(d) (by only including terms from the user's records when computing cosine similarity), where d is the number of distinct terms in the user's records. Hence, the overall computational complexity of a query is O(N · m · d). As we can see, query processing time is linear in the number of records, while m and d are relatively small.

Next, with blocking, the DBMS first selects x clusters of records out of the total c clusters, and then performs pairwise comparisons between the user's records and the records within those x clusters. Assuming that the cluster sizes are relatively balanced and each record belongs to only one cluster, there are N/c records in each cluster on average. Note that content-based clustering is performed offline; hence, the computation for clustering is not a concern in our approach. The computation for selecting the top x clusters is O(c · m · d), while the computation for enforcing CBAC on the records within the x clusters is O((N/c) · m · d · x). Hence, the total computation is O(c · m · d + (N/c) · m · d · x), which is minimized by choosing c = √(N · x). Hence, the blocking mechanism reduces the overall computation to O(m · d · √(N · x)). It could be further reduced to O(m · d · log(N · x)) if we implement multi-level blocking.
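For completeness, the minimization step behind the choice of c (a one-line calculus sketch; f(c) denotes the total cost above):

```latex
f(c) = c\,md + \frac{N}{c}\,mdx, \qquad
f'(c) = md - \frac{Nx}{c^{2}}\,md = 0
\;\Longrightarrow\; c = \sqrt{Nx},
\qquad f\!\left(\sqrt{Nx}\right) = 2\,md\sqrt{Nx} = O\!\left(md\sqrt{Nx}\right).
```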

8.2 Negative Rules and Conflict Resolution

In database access control, negative rules are employed to prevent a user (or role) from accessing specified records. Usually, positive rules allow the user to access a (relatively large) set of records, while negative rules exclude some particular records from the set. It is still possible to use negative rules in CBAC; for instance, to specify that "Agent Alice cannot access records similar to Record X" in Example 1. Meanwhile, in the case of conflicting rules (e.g. a positive rule grants access to a record, while a negative rule forbids it), the negative rule usually takes precedence. In some access control models, the rule with the smaller scope takes precedence. In CBAC, we also have the capability to specify that the rule with higher content similarity takes precedence. However, the details of this topic are mostly the choice of the administrator, and are outside the scope of this dissertation.

8.3 CBAC for XML Data

Last but not least, CBAC can be effectively applied to XML data as well. We only need to redesign the record-wise similarity function in Equation 4.1 to adapt to XML data: the new function takes two XML nodes, traverses the entire subtrees, and returns their similarity value. XML similarity assessment is more complicated than for relational data, due to the semi-structured nature of the data. In particular, both structural similarity and textual similarity need to be considered when comparing two XML documents or nodes; for more details about XML similarity comparison, a survey is available (Tekli et al. (2009)). Meanwhile, all the content similarity assessment techniques discussed in Chapters 4, 5 and 6 remain valid for XML data.

Chapter 9

Conclusion

In this dissertation, we introduce content-based access control and its enforcement mechanisms. As a complement to conventional access control approaches, the CBAC model is most suitable for content-centric information sharing scenarios, where content plays a major role in access decision making and approximation is allowed by the application. CBAC makes access control decisions based on the semantic similarity between the requester's credentials (often represented by his/her own records) and the content of the data. We formally present the content-based access control model, and demonstrate an enforcement mechanism of this model on Oracle VPD. Meanwhile, to improve the computational efficiency of the enforcement mechanism, we introduce an offline similarity assessment approach (like an index) and a content-based blocking approach. We further improve the accuracy of semantic content matching with a content-based tagging mechanism. Experimental results show that the access control decisions made by CBAC are reasonable, and the overhead is acceptable.

Last but not least, the CBAC model places no restrictions on user and content modeling. We have presented a proof-of-concept implementation of the CBAC model with vector space and tag-based models. In practice, more sophisticated user and content modeling methods could be employed; for instance, it would be helpful to include advanced content models such as opinion extraction (Ku et al. (2006)) and sentiment analysis (Pang & Lee (2008)). However, understanding the semantic content of unstructured text is a very difficult problem, which is outside the scope of this dissertation. It is one of the main tasks of our future work.


References

Adam, N. R., Atluri, V., Bertino, E., & Ferrari, E. (2002). A content-based authorization model

for digital libraries. Knowledge and Data Engineering, IEEE Transactions on, 14(2), 296–315.

Ahn, G.-J. (2009). Discretionary access control. In Encyclopedia of Database Systems (pp. 864–

866). Springer.

Amjad, H. (2007). A context aware content based federated access control system for healthcare

domain. ECE Masters Theses, (pp. 13).

Appari, A. & Johnson, M. E. (2010). Information security and privacy in healthcare: current state

of research. International journal of Internet and enterprise management, 6(4), 279–314.

Arthur, D. & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings

of the eighteenth annual ACM-SIAM symposium on Discrete algorithms (pp. 1027–1035).:

Society for Industrial and Applied Mathematics.

Aurenhammer, F. (1991). Voronoi diagrams: a survey of a fundamental geometric data structure.

ACM Computing Surveys (CSUR), 23(3), 345–405.

Bache, K. & Lichman, M. (2013). UCI machine learning repository.

Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications

of the ACM, 18(9), 509–517.

Bertino, E., Bettini, C., Ferrari, E., & Samarati, P. (1996). A temporal access control mechanism

for database systems. Knowledge and Data Engineering, IEEE Transactions on, 8(1), 67–80.


Bertino, E., Bonatti, P. A., & Ferrari, E. (2001a). Trbac: A temporal role-based access control

model. ACM Transactions on Information and System Security (TISSEC), 4(3), 191–233.

Bertino, E., Castano, S., & Ferrari, E. (2001b). Securing xml documents with author-x. Internet

Computing, IEEE, 5(3), 21–31.

Bertino, E., Catania, B., Ferrari, E., & Perlasca, P. (2003a). A logical framework for reasoning

about access control models. ACM Transactions on Information and System Security (TISSEC),

6(1), 71–127.

Bertino, E., Fan, J., Ferrari, E., Hacid, M.-S., Elmagarmid, A. K., & Zhu, X. (2003b). A hierarchical

access control model for video database systems. ACM Transactions on Information Systems

(TOIS), 21(2), 155–191.

Bertino, E. & Ferrari, E. (2002). Secure and selective dissemination of xml documents. ACM

Transactions on Information and System Security (TISSEC), 5(3), 290–331.

Bertino, E., Ghinita, G., & Kamra, A. (2011). Access control for databases: concepts and systems,

volume 8. Now Publishers Inc.

Bertino, E. & Haas, L. M. (1988). Views and security in distributed database management systems.

In Advances in Database Technology—EDBT’88 (pp. 155–169). Springer.

Bertino, E., Haas, L. M., & Lindsay, B. G. (1983). View management in distributed data base

systems. In VLDB (pp. 376–378).

Bertino, E., Hammad, M. A., Aref, W. G., & Elmagarmid, A. K. (2000). An access control model

for video database systems. In Proceedings of the ninth international conference on Information

and knowledge management (pp. 336–343).: ACM.

Bertino, E., Samarati, P., & Jajodia, S. (1997). An extended authorization model for relational

databases. Knowledge and Data Engineering, IEEE Transactions on, 9(1), 85–101.


Bhatti, R., Sanz, D., Bertino, E., & Ghafoor, A. (2007). A policy-based authorization framework for web services: Integrating X-GTRBAC and WS-Policy. In Web Services, 2007. ICWS 2007. IEEE International Conference on (pp. 447–454). IEEE.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Boutell, M. R., Luo, J., Shen, X., & Brown, C. M. (2004). Learning multi-label scene classification. Pattern Recognition, 37(9), 1757–1771.
Boxwala, A. A., Kim, J., Grillo, J. M., & Ohno-Machado, L. (2011). Using statistical and machine learning to help institutions detect suspicious access to electronic health records. Journal of the American Medical Informatics Association, 18(4), 498–505.
Byun, J.-W. & Li, N. (2008). Purpose based access control for privacy protection in relational database systems. The VLDB Journal, 17(4), 603–619.
Carpineto, C., Osiński, S., Romano, G., & Weiss, D. (2009). A survey of web clustering engines. ACM Computing Surveys (CSUR), 41(3), 17.
Chang, C.-C. & Lin, C.-J. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27.
Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1), 1–6.
Cheng, W. & Hüllermeier, E. (2009). Combining instance-based learning and logistic regression for multilabel classification. Machine Learning, 76(2-3), 211–225.
Cristianini, N. & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press.

Damiani, E., De Capitani di Vimercati, S., Paraboschi, S., & Samarati, P. (2002). A fine-grained access control system for XML documents. ACM Transactions on Information and System Security (TISSEC), 5(2), 169–202.
Damiani, E., di Vimercati, S. D. C., Paraboschi, S., & Samarati, P. (2000). Securing XML documents. In Advances in Database Technology – EDBT 2000 (pp. 121–135). Springer.
Dekker, M., Crampton, J., & Etalle, S. (2008). RBAC administration in distributed systems. In Proceedings of the 13th ACM symposium on Access control models and technologies (pp. 93–102). ACM.
Dietterich, T. G. & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. arXiv preprint cs/9501101.
Downs, D. D., Rub, J. R., Kung, K. C., & Jordan, C. S. (1985). Issues in discretionary access control. In 1985 IEEE Symposium on Security and Privacy (pp. 208–208). IEEE Computer Society.
Fabian, B., Kunz, S., Konnegen, M., Müller, S., & Günther, O. (2012). Access control for semantic data federations in industrial product-lifecycle management. Computers in Industry, 63(9), 930–940.
Ferng, C.-S. & Lin, H.-T. (2011). Multi-label classification with error-correcting codes. In ACML (pp. 281–295).
Ferragina, P. & Scaiella, U. (2010). TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM international conference on Information and knowledge management (pp. 1625–1628). ACM.
Ferraiolo, D. F., Sandhu, R., Gavrila, S., Kuhn, D. R., & Chandramouli, R. (2001). Proposed NIST standard for role-based access control. ACM Transactions on Information and System Security (TISSEC), 4(3), 224–274.

Ferreira, A., Cruz-Correia, R., Antunes, L., Farinha, P., Oliveira-Palhares, E., Chadwick, D. W., & Costa-Pereira, A. (2006). How to break access control in a controlled manner. In Computer-Based Medical Systems, 2006. CBMS 2006. 19th IEEE International Symposium on (pp. 847–854). IEEE.
Giuri, L. & Iglio, P. (1997). Role templates for content-based access control. In Proceedings of the second ACM workshop on Role-based access control (pp. 153–159). ACM.
Griffiths, P. P. & Wade, B. W. (1976). An authorization mechanism for a relational database system. ACM Transactions on Database Systems (TODS), 1(3), 242–255.
Hamming, R. W. (1950). Error detecting and error correcting codes. Bell System Technical Journal, 29(2), 147–160.
Hariharan, B., Zelnik-Manor, L., Varma, M., & Vishwanathan, S. (2010). Large scale max-margin multi-label classification with priors. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 423–430).
Hart, M., Johnson, R., & Stent, A. (2007). More content-less control: Access control in the Web 2.0. IEEE Web, 2.
Hart, M. A. (2006). Content-based access control.
Hicks, B., Rueda, S., King, D., Moyer, T., Schiffman, J., Sreenivasan, Y., McDaniel, P., & Jaeger, T. (2010). An architecture for enforcing end-to-end access control over web applications. In Proceedings of the 15th ACM symposium on Access control models and technologies (pp. 163–172). ACM.
Hoare, C. A. R. (1961). Algorithm 65: find. Communications of the ACM, 4(7), 321–322.
Hsu, D., Kakade, S., Langford, J., & Zhang, T. (2009). Multi-label prediction via compressed sensing. In NIPS, volume 22 (pp. 772–780).

Hu, V. C., Kuhn, D. R., & Ferraiolo, D. F. (2015). Attribute-based access control. Computer, (2), 85–88.
Inaba, M., Katoh, N., & Imai, H. (1994). Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the tenth annual symposium on Computational geometry (pp. 332–339). ACM.
Jaiantilal, A. (2009). Random forest package (Matlab version).
Jajodia, S., Samarati, P., Subrahmanian, V., & Bertino, E. (1997). A unified framework for enforcing multiple access control policies. In ACM SIGMOD Record, volume 26 (pp. 474–485). ACM.
Jajodia, S. & Sandhu, R. (1991). Toward a multilevel secure relational data model. In ACM SIGMOD Record, volume 20 (pp. 50–59). ACM.
Ji, S., Tang, L., Yu, S., & Ye, J. (2010). A shared-subspace learning framework for multi-label classification. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(2), 8.
Joachims, T. (1996). A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Technical report, DTIC Document.
Kagal, L., Finin, T., & Joshi, A. (2003). A policy based approach to security for the semantic web. In The Semantic Web – ISWC 2003 (pp. 402–418). Springer.
Kang, F., Jin, R., & Sukthankar, R. (2006). Correlated label propagation with application to multi-label learning. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2 (pp. 1719–1726). IEEE.
Kao, A. & Poteet, S. R. (2007). Natural language processing and text mining. Springer.
Kouzani, A. Z. & Nasireding, G. (2009). Multilabel classification by BCH code and random forests. International Journal of Recent Trends in Engineering, 2(1), 113–116.

Krishnan, R., Sandhu, R., Niu, J., & Winsborough, W. H. (2009). Foundations for group-centric secure information sharing models. In Proceedings of the 14th ACM symposium on Access control models and technologies (pp. 115–124). ACM.
Ku, L.-W., Liang, Y.-T., & Chen, H.-H. (2006). Opinion extraction, summarization and tracking in news and blog corpora. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, volume 100107.
Kulkarni, S., Singh, A., Ramakrishnan, G., & Chakrabarti, S. (2009). Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 457–466). ACM.
Kumar, N., Zhang, L., & Nayar, S. (2008). What is a good nearest neighbors algorithm for finding similar patches in images? In Computer Vision – ECCV 2008 (pp. 364–378). Springer.
Lee, D. D. & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems (pp. 556–562).
Leighton, G. & Barbosa, D. (2010). Access control policy translation and verification within heterogeneous data federations. In Proceedings of the 15th ACM symposium on Access control models and technologies (pp. 173–182). ACM.
Li, N. (2011). Discretionary access control. In Encyclopedia of Cryptography and Security (pp. 353–356).
Li, N., Wang, Q., Qardaji, W., Bertino, E., Rao, P., Lobo, J., & Lin, D. (2009). Access control policy combining: theory meets practice. In Proceedings of the 14th ACM symposium on Access control models and technologies (pp. 135–144). ACM.
Liaw, A. & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22.

Lin, X. & Chen, X.-w. (2010). Mr. KNN: soft relevance for multi-label classification. In Proceedings of the 19th ACM international conference on Information and knowledge management (pp. 349–358). ACM.
Lindqvist, H. (2006). Mandatory access control. Master's Thesis in Computing Science, Umeå University, Department of Computing Science, SE-901 87.
Malin, B., Airoldi, E., et al. (2007). Confidentiality preserving audits of electronic medical record access. Studies in Health Technology and Informatics, 129(1), 320.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval, volume 1. Cambridge University Press, Cambridge.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis. Academic Press.
McCune, J. M., Jaeger, T., Berger, S., Caceres, R., & Sailer, R. (2006). Shamon: A system for distributed mandatory access control. In Computer Security Applications Conference, 2006. ACSAC'06. 22nd Annual (pp. 23–32). IEEE.
McGraw, R. (2009). Risk-adaptable access control (RAdAC). In Privilege (Access) Management Workshop. NIST – National Institute of Standards and Technology – Information Technology Laboratory.
Mihalcea, R. & Csomai, A. (2007). Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management (pp. 233–242). ACM.
Milne, D. & Witten, I. H. (2008). Learning to link with Wikipedia. In Proceedings of the 17th ACM conference on Information and knowledge management (pp. 509–518). ACM.
Moffett, J., Sloman, M., & Twidle, K. (1990). Specifying discretionary access control policy for distributed systems. Computer Communications, 13(9), 571–580.

Molloy, I., Dickens, L., Morisset, C., Cheng, P.-C., Lobo, J., & Russo, A. (2012). Risk-based security decisions under uncertainty. In Proceedings of the second ACM conference on Data and Application Security and Privacy (pp. 157–168). ACM.
Molloy, I., Li, N., Li, T., Mao, Z., Wang, Q., & Lobo, J. (2009). Evaluating role mining algorithms. In Proceedings of the 14th ACM symposium on Access control models and technologies (pp. 95–104). ACM.
Molloy, I., Li, N., Qi, Y. A., Lobo, J., & Dickens, L. (2010). Mining roles with noisy data. In Proceedings of the 15th ACM symposium on Access control models and technologies (pp. 45–54). ACM.
Monte, S. (2010). Access control based on content. Technical report, TKK Technical Reports in Computer Science and Engineering, B. TKK-CSE-B10. http://www.cse.tkk.fi/en/publications/B/10/papers/Monte_final.pdf.
Murugesan, M., Jiang, W., Clifton, C., Si, L., & Vaidya, J. (2010). Efficient privacy-preserving similar document detection. The VLDB Journal – The International Journal on Very Large Data Bases, 19(4), 457–475.
Ni, Q., Lobo, J., Calo, S., Rohatgi, P., & Bertino, E. (2009). Automating role-based provisioning by learning from examples. In Proceedings of the 14th ACM symposium on Access control models and technologies (pp. 75–84). ACM.
NIST (2009). A survey of access control models.
Omohundro, S. M. (1989). Five balltree construction algorithms. International Computer Science Institute, Berkeley.
Oracle (2012). Oracle database security guide 10g release 2 (10.2).
Osborn, S., Sandhu, R., & Munawer, Q. (2000). Configuring role-based access control to enforce mandatory and discretionary access control policies. ACM Transactions on Information and System Security (TISSEC), 3(2), 85–106.

Pan, C.-C., Mitra, P., & Liu, P. (2006). Semantic access control for information interoperation. In Proceedings of the eleventh ACM symposium on Access control models and technologies (pp. 237–246). ACM.
Pang, B. & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2), 1–135.
Papadimitriou, C. H. (1981). On the complexity of integer programming. Journal of the ACM (JACM), 28(4), 765–768.
Park, J. S., Sandhu, R., & Ahn, G.-J. (2001). Role-based access control on the web. ACM Transactions on Information and System Security (TISSEC), 4(1), 37–71.
Pedersen, T., Pakhomov, S., McInnes, B., & Liu, Y. (2012). Measuring the similarity and relatedness of concepts in the medical domain: IHI 2012 tutorial overview. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium (pp. 879–880). ACM.
Porter, M. F. (2001). Snowball: A language for stemming algorithms.
Qin, L. & Atluri, V. (2003). Concept-level access control for the semantic web. In Proceedings of the 2003 ACM workshop on XML security (pp. 94–103). ACM.
Rao, P., Lin, D., Bertino, E., Li, N., & Lobo, J. (2009). An algebra for fine-grained integration of XACML policies. In Proceedings of the 14th ACM symposium on Access control models and technologies (pp. 63–72). ACM.
Rao, V. & Jaeger, T. (2009). Dynamic mandatory access control for multiple stakeholders. In Proceedings of the 14th ACM symposium on Access control models and technologies (pp. 53–62). ACM.

Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2011). Classifier chains for multi-label classification. Machine Learning, 85(3), 333–359.
Reddivari, P., Finin, T., & Joshi, A. (2005). Policy-based access control for an RDF store. In Proceedings of the Policy Management for the Web workshop, volume 120 (pp. 78–83).
Rostad, L. & Edsberg, O. (2006). A study of access control requirements for healthcare systems based on audit trails from access logs. In Computer Security Applications Conference, 2006. ACSAC'06. 22nd Annual (pp. 175–186). IEEE.
Salton, G. & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513–523.
Samarati, P. & de Vimercati, S. C. (2001). Access control: Policies, models, and mechanisms. In Foundations of Security Analysis and Design (pp. 137–196). Springer.
Sandhu, R. & Chen, F. (1998). The multilevel relational (MLR) data model. ACM Transactions on Information and System Security (TISSEC), 1(1), 93–132.
Sandhu, R. S. (1993). Lattice-based access control models. Computer, 26(11), 9–19.
Sandhu, R. S., Coyne, E. J., Feinstein, H. L., & Youman, C. E. (1996). Role-based access control models. Computer, 29(2), 38–47.
Scannapieco, M., Figotin, I., Bertino, E., & Elmagarmid, A. K. (2007). Privacy preserving schema and data matching. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data (pp. 653–664). ACM.
Sharma, D. M. (2010). On the role of NLP in linguistics. In Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground (pp. 18–21). Association for Computational Linguistics.

Sun, L., Ji, S., & Ye, J. (2011). Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(1), 194–200.
Tai, F. & Lin, H.-T. (2012). Multilabel classification with principal label space transformation. Neural Computation, 24(9), 2508–2542.
Takabi, H. & Joshi, J. B. (2009). An efficient similarity-based approach for optimal mining of role hierarchy. In Proceedings of the 16th ACM Conference on Computer and Communications Security.
Tekli, J., Chbeir, R., & Yetongnon, K. (2009). An overview on XML similarity: background, current trends and future directions. Computer Science Review, 3(3), 151–173.
Thomas, R. K., Sandhu, R. S., et al. (1993). Discretionary access control in object-oriented databases: Issues and research directions. In Proc. 16th National Computer Security Conference (pp. 63–74).
Thuraisingham, B. (2009). Mandatory access control. In Encyclopedia of Database Systems (pp. 1684–1685). Springer.
Toninelli, A., Montanari, R., Kagal, L., & Lassila, O. (2006). A semantic context-aware access control framework for secure collaborations in pervasive computing environments. In The Semantic Web – ISWC 2006 (pp. 473–486). Springer.
Tran, N. A. & Dang, T. K. (2007). A novel approach to fine-grained content-based access control for video databases. In Database and Expert Systems Applications, 2007. DEXA'07. 18th International Workshop on (pp. 334–338). IEEE.
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010). Mining multi-label data. In Data Mining and Knowledge Discovery Handbook (pp. 667–685). Springer.

Tsoumakas, G. & Vlahavas, I. (2007). Random k-labelsets: An ensemble method for multilabel classification. In Machine Learning: ECML 2007 (pp. 406–417). Springer.
Tzelepi, S. K., Koukopoulos, D. K., & Pangalos, G. (2001). A flexible content and context-based access control model for multimedia medical image database systems. In Proceedings of the 2001 workshop on Multimedia and security: new challenges (pp. 52–55). ACM.
Uhlmann, J. K. (1991). Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40(4), 175–179.
Upadhyaya, S. (2011). Mandatory access control. In Encyclopedia of Cryptography and Security (pp. 756–758). Springer.
Vattani, A. (2011). k-means requires exponentially many iterations even in the plane. Discrete & Computational Geometry, 45(4), 596–616.
Wang, H., Huang, H., & Ding, C. (2009). Image annotation using multi-label correlated Green's function. In Computer Vision, 2009 IEEE 12th International Conference on (pp. 2029–2034). IEEE.
Winslett, M., Ching, N., Jones, V., & Slepchin, I. (1997). Using digital credentials on the world wide web. Journal of Computer Security, 5(3), 255–267.
Winslett, M., Smith, K., & Qian, X. (1994). Formal query languages for secure relational databases. ACM Transactions on Database Systems (TODS), 19(4), 626–662.
Witten, I. & Milne, D. (2008). An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy, AAAI Press, Chicago, USA (pp. 25–30).
Wu, C. & Chen, Y. (2011). A survey of researches on the application of natural language processing in internet public opinion monitor. In 2011 International Conference on Computer Science and Service System (CSSS) (pp. 1035–1038).

Yang, L., Ege, R. K., Ezenwoye, O., & Kharma, Q. (2004). A role-based access control model for information mediation. In Information Reuse and Integration, 2004. IRI 2004. Proceedings of the 2004 IEEE International Conference on (pp. 277–282). IEEE.
Yu, T., Srivastava, D., Lakshmanan, L. V., & Jagadish, H. (2002). Compressed accessibility map: efficient access control for XML. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 478–489). VLDB Endowment.
Zhang, L., Ahn, G.-J., & Chu, B.-T. (2002). A role-based delegation framework for healthcare information systems. In Proceedings of the seventh ACM symposium on Access control models and technologies (pp. 125–134). ACM.
Zhang, M.-L. & Zhou, Z.-H. (2007). ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7), 2038–2048.
Zhang, Y. & Schneider, J. G. (2011). Multi-label output codes using canonical correlation analysis. In International Conference on Artificial Intelligence and Statistics (pp. 873–882).

Appendix A

The Top 10 Words of Non-Negative Matrix Factorization
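Tables A.1 through A.4 list, for NMF runs with 10, 20, 50, and 100 topics respectively, the ten highest-weighted vocabulary words of every topic. In the factorization X ≈ WH of a document-term TF-IDF matrix X, each row of H assigns a nonnegative weight to every term, so sorting a row and keeping its ten largest entries yields one line of these tables. The sketch below is a minimal illustration of this procedure, assuming Python with scikit-learn and a tiny stand-in corpus; the actual corpus, preprocessing, and NMF settings behind the tables may differ.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    # Tiny stand-in corpus; the real tables come from a much larger collection.
    documents = [
        "protein cell gene expression molecular dna function",
        "students science teachers mathematics education program school",
        "ocean ice climate carbon sea arctic atmospheric",
        "materials films optical devices magnetic properties electron",
    ]

    # Build the document-term TF-IDF matrix, with English stop words removed.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)

    # Factorize X into W (documents x topics) and H (topics x terms).
    # Tables A.1 through A.4 correspond to n_topics = 10, 20, 50, and 100.
    n_topics = 2  # kept tiny here so the toy corpus runs
    model = NMF(n_components=n_topics, init="nndsvd", random_state=0)
    W = model.fit_transform(X)
    H = model.components_

    # Print the ten highest-weighted terms of each topic.
    terms = vectorizer.get_feature_names_out()
    for k, weights in enumerate(H, start=1):
        top = weights.argsort()[::-1][:10]
        print(f"Topic #{k}:", ", ".join(terms[i] for i in top))

Running the same steps with n_topics set to 10, 20, 50, or 100 on the full document collection would produce tables shaped like A.1 through A.4.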

Table A.1: The Top 10 Words of Non-Negative Matrix Factorization with 10 Topics

Topic #1: chemistry, organic, reactions, molecules, chemical, nmr, compounds, department, synthesis, metal
Topic #2: students, science, teachers, mathematics, school, education, program, faculty, courses, laboratory
Topic #3: theory, problems, equations, geometry, mathematical, algebraic, mathematics, differential, study, physics
Topic #4: species, plant, genetic, populations, plants, evolutionary, population, evolution, diversity, variation
Topic #5: data, social, project, information, economic, research, analysis, political, study, policy
Topic #6: research, conference, support, university, workshop, award, international, scientists, program, researchers
Topic #7: protein, cell, proteins, cells, gene, genes, expression, molecular, dna, function
Topic #8: materials, properties, magnetic, optical, high, films, phase, devices, surface, electron
Topic #9: ocean, ice, climate, water, carbon, sea, global, processes, arctic, atmospheric
Topic #10: design, systems, control, system, algorithms, software, performance, computer, engineering, research

Table A.2: The Top 10 Words of Non-Negative Matrix Factorization with 20 Topics

Topic #1: students, program, research, faculty, undergraduate, graduate, summer, reu, undergraduates, student
Topic #2: protein, cell, proteins, cells, gene, genes, expression, molecular, dna, function
Topic #3: chemistry, organic, reactions, chemical, compounds, metal, molecules, synthesis, reaction, complexes
Topic #4: theory, geometry, algebraic, mathematics, groups, spaces, geometric, algebra, study, number
Topic #5: laboratory, students, courses, equipment, computer, curriculum, experiments, undergraduate, biology, software
Topic #6: science, teachers, mathematics, school, education, teacher, learning, project, schools, teaching
Topic #7: ice, ocean, climate, water, carbon, sea, arctic, global, atmospheric, circulation
Topic #8: physics, quantum, theoretical, energy, particle, electron, systems, experimental, matter, experiments
Topic #9: research, months, support, university, dr, fellowship, award, sciences, postdoctoral, scientific
Topic #10: problems, equations, methods, models, nonlinear, solutions, differential, numerical, mathematical, algorithms
Topic #11: materials, properties, films, optical, devices, surface, thin, high, phase, polymer
Topic #12: mantle, seismic, earthquake, crust, rocks, deformation, crustal, continental, plate, data
Topic #13: network, college, internet, connection, access, resources, networks, services, nsfnet, information
Topic #14: workshop, participants, held, researchers, scientists, international, workshops, discuss, bring, report
Topic #15: engineering, design, education, technology, science, engineers, manufacturing, industry, mechanical, university
Topic #16: nmr, magnetic, spectrometer, molecules, resonance, nuclear, instrument, instrumentation, spectroscopy, mhz
Topic #17: species, plant, genetic, populations, plants, evolutionary, population, evolution, diversity, variation
Topic #18: systems, design, control, system, performance, research, software, algorithms, development, power
Topic #19: conference, held, international, meeting, travel, support, symposium, scientists, researchers, attend
Topic #20: data, social, project, information, research, economic, political, analysis, policy, study

Table A.3: The Top 10 Words of Non-Negative Matrix Factorization with 50 Topics

Topic #1: laboratory, courses, students, experiments, undergraduate, curriculum, majors, introductory, exercises, laboratories
Topic #2: students, program, faculty, summer, graduate, undergraduate, reu, student, minority, undergraduates
Topic #3: genes, gene, expression, dna, genetic, regulation, transcription, regulatory, genome, sequences
Topic #4: protein, proteins, binding, structure, membrane, function, enzyme, rna, enzymes, structural
Topic #5: ocean, carbon, water, marine, sea, organic, circulation, production, arctic, pacific
Topic #6: theory, geometry, algebraic, groups, mathematics, spaces, algebra, geometric, algebras, topology
Topic #7: software, computer, parallel, computing, programming, distributed, hardware, computational, tools, simulation
Topic #8: conference, held, international, meeting, travel, support, symposium, scientists, researchers, attend
Topic #9: cell, cells, membrane, growth, cellular, calcium, signaling, receptor, tissue, cycle
Topic #10: chemistry, organic, department, chemical, physical, analytical, division, biochemistry, instrumentation, professor
Topic #11: equipment, support, scientific, vessel, instrumentation, transceivers, acquisition, operated, retrieval, purchase
Topic #12: months, fellowship, support, postdoctoral, mathematical, sciences, fellowships, awards, option, doctoral
Topic #13: problems, algorithms, methods, problem, computational, optimization, efficient, techniques, solution, solving
Topic #14: ice, sea, antarctic, sheet, core, cores, arctic, glacial, west, record
Topic #15: data, analysis, information, database, sets, collection, collected, statistical, project, set
Topic #16: workshop, participants, held, researchers, workshops, discuss, bring, meeting, international, scientists
Topic #17: investigator, principal, proposal, pi, abstract, investigators, study, award, planning, young
Topic #18: species, populations, genetic, evolutionary, population, diversity, variation, evolution, relationships, phylogenetic
Topic #19: magnetic, field, properties, spin, fields, resonance, superconducting, superconductors, temperature, magnetization
Topic #20: physics, particle, nuclear, matter, particles, elementary, theoretical, condensed, atomic, experiments
Topic #21: surface, surfaces, adsorption, microscopy, scanning, analytical, atomic, interfaces, interface, adsorbed
Topic #22: nmr, molecules, spectrometer, nuclear, resonance, spectroscopy, chemists, mhz, structure, studies
Topic #23: social, political, economic, policy, project, study, cultural, public, dissertation, archaeological
Topic #24: reactions, metal, compounds, reaction, complexes, organic, chemical, synthesis, molecules, transition
Topic #25: solar, waves, measurements, energy, wind, atmospheric, plasma, wave, atmosphere, observations
Topic #26: learning, teaching, student, students, education, project, knowledge, skills, technology, concepts
Topic #27: models, model, modeling, statistical, develop, developed, simulation, theoretical, mathematical, methods
Topic #28: quantum, theoretical, theory, dynamics, systems, electron, states, mechanics, electronic, electrons
Topic #29: systems, control, system, power, nonlinear, dynamical, performance, adaptive, feedback, dynamic
Topic #30: network, college, internet, connection, access, networks, resources, services, nsfnet, community
Topic #31: teachers, mathematics, school, teacher, schools, project, teaching, middle, education, districts
Topic #32: engineering, education, technology, engineers, mechanical, chemical, electrical, civil, women, industry
Topic #33: phase, technology, business, small, ii, innovation, high, commercial, project, process
Topic #34: mantle, rocks, crust, isotopic, continental, crustal, evolution, samples, ridge, seismic
Topic #35: optical, laser, lasers, optics, devices, light, nonlinear, fiber, spectroscopy, imaging
Topic #36: films, thin, film, growth, deposition, devices, silicon, properties, semiconductor, metal
Topic #37: climate, change, global, lake, records, climatic, variability, record, arctic, regional
Topic #38: science, education, national, scientific, technology, activities, computer, programs, program, sciences
Topic #39: equations, differential, nonlinear, solutions, partial, mathematical, problems, equation, analysis, numerical
Topic #40: contract, abstract, required, contracts, services, support, nsf, agreement, firms, contracting
Topic #41: stars, galaxies, star, stellar, galaxy, formation, evolution, clusters, observations, universe
Topic #42: flow, fluid, flows, transport, numerical, heat, fluids, dynamics, turbulent, turbulence
Topic #43: brain, system, neurons, behavior, neural, visual, nervous, information, mechanisms, animals
Topic #44: plant, plants, soil, growth, arabidopsis, environmental, resistance, nitrogen, responses, host
Topic #45: earthquake, seismic, fault, earthquakes, hazard, structures, deformation, damage, california, ground
Topic #46: research, program, training, award, facilities, activities, projects, undergraduate, center, areas
Topic #47: biology, molecular, biological, dna, genetics, training, techniques, evolutionary, fellowship, cellular
Topic #48: design, manufacturing, process, tools, product, performance, designs, development, optimization, integrated
Topic #49: university, dr, award, collaboration, professor, state, cooperative, expertise, institute, center
Topic #50: materials, properties, polymers, polymer, material, processing, characterization, synthesis, electronic, composites

Table A.4: The Top 10 Words of Non-Negative Matrix Factorization with 100 Topics

Topic #1: research, training, facilities, award, graduate, area, areas, activities, infrastructure, proposed
Topic #2: students, graduate, student, undergraduate, experience, minority, skills, school, project, careers
Topic #3: chemistry, department, physical, division, chemical, analytical, biochemistry, supported, instrumentation, areas
Topic #4: genes, gene, expression, dna, genetic, regulation, transcription, regulatory, genome, sequences
Topic #5: university, center, state, award, cooperative, california, collaboration, professor, department, carolina
Topic #6: theory, geometry, algebraic, spaces, geometric, topology, algebra, number, algebras, space
Topic #7: solar, measurements, atmospheric, wind, observations, atmosphere, radar, plasma, particles, cloud
Topic #8: laboratory, courses, experiments, undergraduate, curriculum, exercises, majors, laboratories, introductory, lab
Topic #9: conference, held, gordon, conferences, researchers, participants, support, sessions, speakers, attend
Topic #10: teachers, school, teacher, schools, project, middle, districts, year, teaching, elementary
Topic #11: equipment, scientific, transceivers, support, retrieval, satellite, dedicated, acquisition, including, purchase
Topic #12: months, fellowship, support, postdoctoral, sciences, mathematical, fellowships, awards, option, choose
Topic #13: species, evolutionary, phylogenetic, relationships, diversity, morphological, genus, evolution, patterns, ecological
Topic #14: equations, differential, solutions, partial, nonlinear, mathematical, equation, numerical, boundary, elliptic
Topic #15: engineering, engineers, mechanical, education, civil, electrical, chemical, industrial, biomedical, disciplines
Topic #16: reu, summer, projects, site, undergraduates, undergraduate, faculty, experiences, participants, ten
Topic #17: connection, internet, access, network, nsfnet, resources, bits, libraries, supercomputers, midlevel
Topic #18: environmental, environment, sciences, conditions, management, pollution, natural, change, ecological, monitoring
Topic #19: physics, particle, nuclear, matter, theoretical, particles, elementary, condensed, atomic, experiments
Topic #20: methods, method, computational, statistical, techniques, developed, develop, numerical, applied, development
Topic #21: workshop, participants, held, researchers, discuss, bring, workshops, report, future, issues
Topic #22: nmr, molecules, nuclear, resonance, spectroscopy, chemists, mhz, spectrometer, studies, elucidation
Topic #23: learning, teaching, student, concepts, knowledge, interactive, modules, skills, courses, environment
Topic #24: control, nonlinear, adaptive, feedback, controllers, optimal, robust, controller, dynamic, realtime
Topic #25: design, designs, tools, performance, optimization, product, methodology, development, designers, project
Topic #26: plant, plants, arabidopsis, resistance, host, responses, seed, insect, crop, pollen
Topic #27: language, languages, programming, speech, linguistic, japanese, children, american, spoken, grammar
Topic #28: mathematics, mathematical, calculus, courses, reform, algebra, sciences, applied, mathematicians, statistics
Topic #29: ice, sea, antarctic, sheet, core, cores, glacial, west, snow, ross
Topic #30: data, sets, collected, set, database, statistical, base, acquisition, collect, analyze
Topic #31: brain, neurons, neural, cells, nerve, nervous, visual, activity, mechanisms, sensory
Topic #32: ocean, circulation, pacific, sea, atlantic, experiment, deep, southern, hydrographic, variability
Topic #33: growth, crystal, rates, rate, crystals, factors, nucleation, conditions, epitaxial, development
Topic #34: archaeological, sites, site, region, mr, remains, conduct, period, ms, societies
Topic #35: metal, complexes, transition, metals, compounds, ligands, clusters, ions, atoms, catalytic
Topic #36: high, temperature, low, pressure, temperatures, thermal, superconductors, performance, measurements, resolution
Topic #37: water, groundwater, deep, column, sea, quality, treatment, waters, vapor, oxygen
Topic #38: biology, biological, training, genetics, fellowship, dna, postdoctoral, physiology, developmental, ecology
Topic #39: cell, cells, cellular, calcium, signaling, cycle, tissue, division, receptor, differentiation
Topic #40: protein, proteins, binding, enzyme, rna, enzymes, function, folding, amino, acid
Topic #41: electron, transfer, energy, microscopy, microscope, electrons, scanning, atomic, charge, scattering
Topic #42: genetic, populations, population, variation, selection, natural, traits, reproductive, individuals, evolutionary
Topic #43: quantum, mechanics, theoretical, classical, theory, dots, states, atoms, matter, spin
Topic #44: parallel, computing, performance, memory, distributed, computational, programming, applications, processors, computation
Topic #45: phase, ii, small, business, innovation, commercial, project, applications, feasibility, gas
Topic #46: systems, dynamical, complex, distributed, biological, techniques, chaotic, hybrid, intelligent, embedded
Topic #47: devices, device, semiconductor, electronic, fabrication, circuits, silicon, mems, integrated, sensors
Topic #48: models, modeling, mathematical, statistical, stochastic, spatial, estimation, simulation, random, empirical
Topic #49: surface, surfaces, adsorption, analytical, adsorbed, interface, interfaces, layer, atomic, scanning
Topic #50: climate, change, global, variability, regional, atmospheric, climatic, records, vegetation, tropical
Topic #51: groups, group, lie, representation, finite, representations, relationships, algebras, symmetries, symmetry
Topic #52: reactions, chemical, reaction, intermediates, kinetics, catalytic, molecules, reactive, oxidation, studies
Topic #53: women, careers, gender, girls, minorities, minority, female, participation, career, academic
Topic #54: polymer, polymers, properties, polymerization, blends, chain, liquid, composites, chains, mechanical
Topic #55: system, nervous, override, components, error, prototype, acquisition, abstract, realtime, monitoring
Topic #56: behavior, experiments, behavioral, males, females, experimental, effects, animals, male, reproductive
Topic #57: dr, award, postdoctoral, work, collaboration, complementary, expertise, professor, scientists, visit
Topic #58: films, thin, film, deposition, properties, coatings, substrates, vapor, diamond, substrate
Topic #59: algorithms, efficient, optimization, algorithm, computational, complexity, graph, combinatorial, develop, techniques
Topic #60: problems, problem, optimization, solving, solution, solve, mathematical, inverse, solutions, optimal
Topic #61: analysis, techniques, tools, analytical, quantitative, harmonic, analyses, include, fourier, operators
Topic #62: college, faculty, community, education, colleges, institutions, programs, courses, curriculum, workshops
Topic #63: investigator, principal, proposal, pi, abstract, investigators, award, young, planning, title
Topic #64: collection, collections, specimens, museum, database, project, resource, history, natural, storage
Topic #65: earthquake, seismic, earthquakes, hazard, damage, fault, ground, reduction, california, national
Topic #66: technology, development, technologies, industry, education, center, institute, technical, technological, project
Topic #67: molecular, molecules, dna, level, techniques, genetics, atomic, evolution, molecule, experimental
Topic #68: waves, wave, nonlinear, propagation, gravity, gravitational, numerical, scattering, phenomena, acoustic
Topic #69: materials, properties, material, characterization, processing, composite, advanced, composites, electronic, synthesis
Topic #70: stars, galaxies, star, stellar, galaxy, formation, evolution, clusters, observations, universe
Topic #71: mantle, seismic, crust, melting, melt, upper, ridge, crustal, isotopic, earth
Topic #72: soil, forest, soils, ecosystem, ecosystems, forests, nitrogen, tropical, nutrient, vegetation
Topic #73: contract, abstract, required, contracts, services, agreement, nsf, support, firms, contracting
Topic #74: marine, microbial, organisms, bacteria, coastal, sea, food, communities, nitrogen, phytoplankton
Topic #75: symposium, meeting, international, scientists, travel, held, support, american, society, researchers
Topic #76: oceanographic, vessel, instrumentation, support, rv, operated, vessels, fleet, owned, aboard
Topic #77: carbon, dioxide, nitrogen, production, oxygen, organic, global, dissolved, cycle, flux
Topic #78: optical, laser, lasers, optics, light, fiber, spectroscopy, nonlinear, wavelength, pulses
Topic #79: transport, membrane, membranes, ion, channels, plasma, channel, separations, separation, sediment
Topic #80: organic, synthesis, compounds, synthetic, professor, focus, chiral, molecules, products, macromolecular
Topic #81: arctic, heat, basin, global, polar, warming, abstract, ocean, canadian, shelf
Topic #82: study, results, case, studied, studies, relationship, understanding, important, examine, part
Topic #83: flow, fluid, flows, heat, numerical, fluids, turbulent, turbulence, particle, particles
Topic #84: program, programs, year, minority, nsf, academic, career, support, activities, summer
Topic #85: science, education, national, scientific, activities, foundation, computer, earth, scientists, public
Topic #86: power, electric, energy, transmission, voltage, fuel, generation, electronics, low, electrical
Topic #87: deformation, plate, fault, zone, strain, crustal, continental, faults, gps, rocks
Topic #88: processes, understanding, physical, chemical, biological, fundamental, process, formation, interactions, important
Topic #89: social, political, economic, policy, project, public, cultural, interviews, dissertation, policies
Topic #90: model, modeling, developed, develop, based, results, test, simulations, approach, parameters
Topic #91: magnetic, field, fields, spin, properties, resonance, magnetization, superconducting, superconductors, ferromagnetic
Topic #92: instrument, mass, spectrometer, facility, instrumentation, spectrometry, acquisition, samples, resolution, microscope
Topic #93: information, provide, database, management, knowledge, gis, decision, users, geographic, web
Topic #94: sediment, lake, record, isotopic, sediments, history, basin, isotope, samples, cores
Topic #95: computer, software, hardware, tools, graphics, computers, simulation, visualization, workstations, interactive
Topic #96: processing, image, imaging, images, signal, digital, video, visual, vision, motion
Topic #97: manufacturing, process, production, industry, industrial, product, planning, quality, products, machining
Topic #98: dynamics, theoretical, dynamical, computational, simulations, interactions, experimental, population, studies, understanding
Topic #99: structure, structures, properties, structural, function, crystal, diffraction, understanding, studies, determination
Topic #100: network, networks, wireless, communication, traffic, neural, routing, performance, networking, distributed

