Nabil R. Adam, Aryya Gangopadhyay, Richard Holowczak

The proliferation of inexpensive and powerful computer systems and storage has led to the collection of unprecedented amounts of data by organizations from all facets of daily life. This has resulted in public concern regarding threats to individuals' sensitive information, especially at a time when the exchange of information is fast becoming an industry of its own. A significant problem in database security centers on the desire to provide accurate and reliable aggregates of data while preventing the disclosure of individual information. This paper surveys research into this problem and describes a framework of solution approaches.

**Keywords**: Statistical Databases, Database Management Systems

Ram Gopal, Paulo Goes and Robert Garfinkel.

A practical method is presented for giving unlimited, correct, numerical responses to ad-hoc queries to a database while not compromising confidential numerical data. The technique is appropriate for a database of any size, and no assumptions are needed about the statistical distribution of the confidential data. The responses are in the form of a number plus a guarantee, so that the user can determine an interval which is sure to contain the exact answer. Confidentiality is maintained by "hiding" the vector of sensitive data in an infinite set of vectors. Virtually any imaginable query type can be answered, and collusion among the users presents no problem. The manager of the database can control which query types, based on the non-confidential fields, are likely to yield the tightest guarantees.

*"An Audit Expert for Large Statistical Databases"*

F. M. Malvestuto and M. Moscarini

In an on-line database environment, "auditing" statistical queries is an effective policy for protecting confidential attributes of individual records from statistical disclosure. Existing implementations of auditing avoid the exact disclosure of confidential attributes of every individual record but not of sensitive data, that is, data which allow a confidential attribute of some individual record to be accurately estimated. Moreover, they are extremely costly, since they make use of a mathematical model having a number of variables equal to the size of the underlying database. We present an implementation of auditing that avoids exact disclosure of sensitive data, based on a mathematical model where the number of variables is never greater, and usually far less, than the size of the database.

*"Some superpopulation models for estimating the number of population
uniques".*

Akimichi Takemura

The number of unique individuals in the population is of great importance in evaluating the disclosure risk of a microdata set. We approach this problem by considering some basic superpopulation models, including the gamma-Poisson model of Bethlehem et al. (1990). We introduce the Dirichlet-multinomial model, which is closely related to but more basic than the gamma-Poisson model. We also discuss the Ewens model and show that it can be obtained from the Dirichlet-multinomial model by a limiting argument similar to the law of small numbers. Although these models might not necessarily fit actual populations well, they can be considered basic mathematical models for our problem, just as the binomial and Poisson distributions are considered basic models for count data.

*"Measuring Identification Disclosure Risk for Categorical Microdata
by Posterior Population Uniqueness"*

Yasuhiro Omori

This article evaluates the risk of identification disclosure for categorical microdata by the posterior probability of population uniqueness (*i.e.*, unique observations in a population) when there is no prior information. Bethlehem *et al.* (1990) introduced the concept of population uniques drawn from a superpopulation and proposed the estimated expected number (or fraction) of population uniques as the criterion for deciding whether additional measures for disclosure protection should be taken. But their model has been found too simple to fit real data, and it is not clear how to decide on acceptable numbers or fractions so that identification disclosure does not take place. Instead, this article generalizes their model and assesses the risk by the posterior probability as follows. Samples are assumed to be randomly drawn from the population and hence follow a multivariate hypergeometric distribution conditional on population cell frequencies. The prior distribution of population cell frequencies is assumed to be a multinomial distribution conditional on parameters. We first examine the posterior probability given hypothesized values of the parameters, and then consider a Dirichlet distribution for the prior distribution of the multinomial parameters. Given sample and population sizes, the maximum numbers of sample uniques are derived that attain a certain small probability of identification disclosure, and these are shown to be a function of the sampling fraction.

**Keywords**: Microdata; Identification disclosure; Population uniqueness;
Multivariate hypergeometric; Dirichlet-multinomial

*"A Bayesian, Population-Genetics-Inspired Approach to the Uniques
Problem in Microdata Disclosure Risk Assessment"*

Stephen M. Samuels

One important measure of disclosure risk for microdata is the proportion of sample uniques which are also population uniques. The distribution of this random variable depends on the population only through its partition structure: the distribution of the numbers of cells of each size. Partition distributions have been extensively studied in population genetics. Portions of that research can be adapted to provide the promise of a mathematical framework based on plausible prior distributions with easy-to-interpret parameters, and a modified Polya urn sampling model from which risk assessments are easily obtained.

**Keywords**: Partition Structure, Polya Urn, Poisson-Dirichlet

*"A method for data-oriented multivariate microaggregation"*

Josep M. Mateo-Sanz, Josep Domingo-Ferrer

Microaggregation is a statistical disclosure control technique for microdata. Raw microdata (*i.e.* individual records) are grouped into small aggregates prior to publication. Each aggregate should contain at least *k* records to prevent disclosure of individual information. So far, practical microaggregation has consisted of taking fixed-size microaggregates (size *k*). In this paper we consider a new approach to multivariate microaggregation in which the size of aggregates is a variable taking values >= *k*, depending on the data.

**Keywords**: Statistical disclosure control; Microaggregation; Hierarchical
clustering; Microdata protection
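As a point of reference for the fixed-size approach that this paper generalizes, the following sketch (ours, not the authors'; names are illustrative) microaggregates a univariate set of values into consecutive groups of at least *k* records and replaces each value by its group mean:

```python
import statistics

def microaggregate_fixed(values, k=3):
    # Classical fixed-size microaggregation (univariate sketch):
    # sort the records, partition them into consecutive groups of
    # size k, and replace each value by its group mean.
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for start in range(0, len(order), k):
        group = order[start:start + k]
        # Merge an undersized tail group into the previous one so
        # that every published aggregate contains >= k records.
        if len(group) < k and start > 0:
            group = order[start - k:]
        mean = statistics.fmean(values[i] for i in group)
        for i in group:
            out[i] = mean
    return out
```

The data-oriented method of the paper instead lets group sizes vary above *k* to follow the natural clustering of the data, reducing information loss.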

*"Estimation of variance loss following microaggregation by the individual
ranking method"*

Youri Baeyens, Daniel Defays

Thanks to computers, users of statistics are increasingly able to manipulate survey data themselves, in order to study individual behavior and the complex relationships between variables, to create ad-hoc models, and so on.

However, for obvious reasons of confidentiality, supplying data users directly with survey data is out of the question. The data need to be modified in some way in order to make it extremely difficult or even impossible to identify a respondent.

Many methods have been devised for this purpose, and we have looked into one of them: individual ranking, a microaggregation method for continuous variables. More particularly, we have attempted to study the effects of individual ranking on the variances of the distributions.

The article comprises 3 main sections. The first describes the individual ranking method. The second outlines two ways of analyzing variance loss due to microaggregation using the individual ranking method. The third section summarizes the results obtained by simulations.

*"An Application of Microaggregation Methods to Italian Business
Surveys"*

Veronica Corsini, Luisa Franconi, Daniela Pagliuca, Giovanni Seri.

A class of statistical techniques which has proved to be useful in protecting confidential business data is microaggregation, developed at Eurostat. In this paper we present some of the results obtained in an extensive study on the application of microaggregation methods to Italian business data, in order to evaluate how well the methods maintain the characteristics of the original data. In Section 2 we briefly present the techniques used. In Section 3 we describe the data analyzed, whereas Section 4 contains some of the results obtained. In Section 5 we present and estimate some economic models using the original and the microaggregated data.

*"Fréchet and Bonferroni Bounds for Multi-way Tables of Counts
With Applications to Disclosure Limitation"*

Stephen E. Fienberg

Upper and lower bounds on cell counts in cross-classifications of positive counts play important roles in a number of disclosure limitation procedures, e.g., cell suppression and data swapping. Some features of the Fréchet bounds are well known, intuitive, and regularly used by those working on disclosure limitation methods, especially those for two-dimensional tables. The multivariate versions of these bounds, and other related bounds such as those calculated using the Bonferroni approach, are more complex; however, they have potentially great import for current disclosure limitation methodology. The purpose of this paper is to describe the key results on this topic.
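For the two-dimensional case the Fréchet bounds have a simple closed form, sketched below (our illustration, not from the paper):

```python
def frechet_bounds(row_total, col_total, grand_total):
    # Fréchet bounds for a cell n_ij of a two-way table of counts,
    # given its row total n_i+, column total n_+j and grand total N:
    #   max(0, n_i+ + n_+j - N) <= n_ij <= min(n_i+, n_+j)
    lower = max(0, row_total + col_total - grand_total)
    upper = min(row_total, col_total)
    return lower, upper
```

If the resulting interval is narrow, releasing the margins alone already discloses the suppressed cell to within that interval; the multi-way generalizations discussed in the paper play the same role for higher-dimensional tables.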

*"An Algorithm to Calculate the Lower and Upper Bounds of the Elements
of an Array Given its Marginals"*

Lucia Buzzigoli, Antonio Giusti

The paper presents a new algorithm to calculate lower and upper bounds of the elements of an *n*-way array, starting from the complete set of its (*n*-1)-way marginals. The procedure is computationally simpler than linear programming, which is usually used to solve this problem. The paper includes proofs for arrays of limited dimensions. The proposed algorithm, very easy to implement with a matrix language, shows interesting properties and possibilities of application.

*"Disclosure Detection in Multiple Linked Categorical Datafiles:
A Unified Network Approach".*

Stephen F. Roehrig, Rema Padman, George Duncan, Ramayya Krishnan

This paper presents new research on the use of network models to evaluate the disclosure potential of categorical data tables linked over one or more attributes. Networks have been used in the past to model both disclosure detection and protection (e.g. via suppression) of two-dimensional tables. We present a new network model for higher-dimensional problems, including the case where released tables are derived as projections of a single underlying *n*-dimensional data cube.

*"Some remarks on Research Directions in Statistical Data Protection"*

Lawrence H. Cox

Modern research on statistical data protection (SDP) draws upon a rich, diverse subset of the mathematical sciences--statistics, mathematical programming, combinatorics, graph theory and theoretical computer science. This paper offers observations on selected recent SDP research directions.

*"Dike: A Prototype for Secure Delegation of Statistical Data".*

Josep Domingo-Ferrer, Ricardo X. Sànchez del Castillo, Javier
Castilla

The need for delegating statistical data arises when the data owner
(*e.g.* statistical office) wants to have its data handled by an external
party. If the external party is untrusted and data are confidential, delegation
should be performed in a way that preserves security. A cryptographic solution
to the secure delegation problem is outlined which provides data secrecy
and computation verifiability. Also, the design principles of Dike --an
implementation allowing secure delegation of information over the Internet--
are discussed in some detail.

**Keywords**: Delegation of information; Encrypted data processing;
Distributed computing; Statistical data protection.

*"A Secure Network of European Statistical Offices over the Internet".*

Despina Polemi, George Kokolakis.

A security solution for the interconnection of the European Statistical Offices (ESOs), and of ESOs with their users, over the Internet is proposed. It is based on the Trusted Third Party Services (TTPs) approach, considers statisticians' needs, and addresses technical, operational and functional aspects.

**Index terms**: Trusted Third Party Services, Internet, confidentiality,
integrity, authenticity

*"Investigating Key Qualities of an Automated Cell Suppression
System".*

Keith McLeod, John George Andrew Rae, Rodney Butler

*"Looking for Efficient Automated Secondary Cell Suppression Systems:
A Software Comparison".*

Sarah Giessing

The problem of secondary cell suppression is well known and well studied. We examined software for automated secondary cell suppression and compared the programs under methodological and conceptual aspects.

A major practical experiment was conducted: the programs were run on tables from the German 1995 Census of Manual Trades. The results of these runs are compared and presented in this paper.

*"ARGUS for Statistical Disclosure Control"*

Leon Willenborg, Anco J. Hundepool

The paper describes the main functionality of two related software packages for producing safe data: µ-ARGUS for microdata and τ-ARGUS for tabular data.

**Keywords**: Statistical disclosure control, software, microdata,
tables.

Official survey data are customarily released to the public as tabular aggregates or as files of unit records. Such releases have been subject to well established rules designed to preserve the confidentiality of respondent information as required by agency charter. Output databases that allow external users to access data across collections and time via a single contact point offer a third and distinct form of release. This paper outlines a protection strategy for such generalized table retrieval facilities.

*"Special Uniques, Random Uniques and Sticky Population: Some Counterintuitive
Effects of Geographical Detail on Disclosure Risk"*

Mark. J. Elliot, Chris J. Skinner, Angela Dale

Work in statistical disclosure control invariably assumes that disclosure risk increases as the level of detail in the released data increases. Using the 1991 GB census data, this paper describes some work using the UUSU ratio (the proportion of sample uniques which are also population uniques) which shows that disclosure risk, as measured by the UUSU ratio, has a non-monotonic relationship with the level of geographical detail.

Further, using the concept of population uniqueness, it is possible to demonstrate that uniqueness is a non-homogeneous categorization. The paper distinguishes *special uniques*, which are unique by virtue of an unusual combination of characteristics and whose uniqueness is insensitive to changes in geographical level, from *random uniques*, which are unique by virtue of the way in which the key variables have been constructed.

The paper concludes that the relationship between geographical level
and disclosure risk is more complex than was previously supposed and that
attending to the problem of special uniques may substantially reduce the
risk of disclosure.
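The UUSU ratio discussed above can be computed directly when both a sample and the population it was drawn from are available, as in this sketch (ours, for illustration; names are not from the paper):

```python
from collections import Counter

def uusu_ratio(population_keys, sample_keys):
    # Proportion of sample uniques that are also population uniques.
    # Keys are tuples of key-variable values (e.g. age, sex, area).
    pop_counts = Counter(population_keys)
    samp_counts = Counter(sample_keys)
    sample_uniques = [key for key, c in samp_counts.items() if c == 1]
    if not sample_uniques:
        return 0.0
    both = sum(1 for key in sample_uniques if pop_counts[key] == 1)
    return both / len(sample_uniques)
```

Recomputing this ratio at successive levels of geographical aggregation of the key is the kind of exercise that reveals the non-monotonic relationship the paper reports.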

*"Modeling population uniqueness using a mixture of two laws"*

Patrick St-Cyr

When a statistical agency wants to assess the risk of disclosure of a microdata file, one important measure that has to be estimated is the conditional probability that a record is unique in the population given that it is unique in the sample. This probability is a function of the sampling fraction and of the structure of the population, that is, the information on the population in terms of the key variables. The basic problem is to estimate or model the structure of a population. By observing the relationship between this probability and the sampling fraction for a real population, we were able to find constraints on the structure of the population. These constraints give us some clues as to which models should be considered. The main result of the paper is to propose a mixture of two distributions for modeling population uniqueness.

**Keywords**: risk of disclosure, population uniqueness, mixture

*"Per-record risk of disclosure in dependent data"*

Roberto Benedetti, Luisa Franconi, Federica Piersimoni

The disclosure protection problem, when the *identity* disclosure definition is used, can be set out as follows: a microdata file can be released if there is little chance that others will correctly link records to individual units. A person who attempts such a link will pursue this aim by exactly matching the values of individual units contained in a public register or external database against the corresponding values in the released microdata file. In this paper we propose a new methodology for the definition of per-unit risk of disclosure that allows for a structure of dependence amongst the individuals. The methodology that defines the probability of identification of each record is presented in Section 2. In Section 3 the risk of disclosure is defined, whereas in Section 4 the computational problems are described.

*"Statistical Methods to Limit Disclosure for the Manufacturing Energy
Consumption Survey: Collaborative Work of the Energy Information Administration
and the U. S. Census Bureau"*

Ramesh A. Dandekar

The Energy Information Administration (EIA) of the United States Department of Energy (DOE) collects energy consumption and related information for the manufacturing industries in the United States via its Manufacturing Energy Consumption Survey (MECS). MECS is a triennial survey and is collected for EIA by the United States Census Bureau using the Bureau's legislative authority under Title 13. Title 13 is a statute that describes the statistical mission of the Census Bureau and contains strict confidentiality provisions to protect sensitive information.

*"Confidentiality Auditing for Price Index Publications"*

Gordon Sande

The use of cell suppression to provide confidentiality protection for business statistics publications is standard within official statistical agencies. Automation for both designing and auditing cell suppression patterns has been in use for a considerable time. This automation has been developed for the important application of publishing tables of totals of economic activity. A common example is a census of manufacturing. Another group of examples is census of agriculture publications of either financial characteristics, like farm revenue, or physical characteristics, like land use or herd size.

*"Improving the Disclosure Testing Algorithm for ONS Business Statistics"*

Mark Pont

This paper describes a change to the way that disclosure testing is undertaken for estimates produced from data collected in business surveys conducted by ONS. The purpose of the change was to reduce costs by making processing simpler. The new algorithm identifies those cells that can be correctly declared disclosive or non-disclosive easily, i.e. without applying the full rule. The full test is then reserved for those cells where it is not so clear whether the cell is disclosive or not. The paper comments on the effectiveness of the new method.

**Keywords**: disclosure, threshold rule, *p*-percent rule
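For context, the *p*-percent rule mentioned in the keywords is a standard dominance test for magnitude cells; a minimal sketch (our illustration, not the ONS algorithm) looks like this:

```python
def p_percent_disclosive(contributions, p=10.0):
    # p-percent dominance rule (sketch): a magnitude cell is deemed
    # disclosive if the contributions other than the two largest sum
    # to less than p% of the largest contribution -- the second-largest
    # contributor could then estimate the largest to within p%.
    c = sorted(contributions, reverse=True)
    if len(c) < 3:
        return True  # too few contributors to protect anyone
    return sum(c[2:]) < (p / 100.0) * c[0]
```

Cells far from the threshold on either side can be classified from cheap summary quantities alone, which is the kind of shortcut the new ONS algorithm exploits before falling back to the full test.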

*"Re-identification Methods for Evaluating the Confidentiality of
Analytically Valid Microdata"*

William E. Winkler

A public-use microdata file should be analytically valid. For a very small number of uses, the microdata should yield analytic results that are approximately the same as those from the original, confidential file that is not distributed. If the microdata file contains a moderate number of variables and is required to meet a single set of analytic needs of, say, university researchers, then many more records are likely to be re-identifiable by the re-identification methods typically used in the confidentiality literature. This paper compares several masking methods in terms of their ability to produce analytically valid, confidential microdata.

*"Reflections on PRAM"*

Peter-Paul de Wolf, José M. Gouweleeuw, Peter Kooiman, Leon
Willenborg

PRAM is a probabilistic, perturbative method for disclosure protection of categorical variables in microdata files. If PRAM is to be applied, several issues should be carefully considered. The microdata file will usually contain a specific structure, e.g., a hierarchical structure when all members of a household are present in the data file. To what extent should PRAM conserve that structure? How should a user of the perturbed file deal with variables that can (logically) be deduced from other variables on which PRAM has been applied? How should the probability mechanism used to implement PRAM be chosen in the first place? How well does PRAM limit the risk of disclosing sensitive information? In this paper, these questions will be considered.

**Keywords**: Post Randomisation Method (PRAM), disclosure limitation,
perturbed data, Markov matrix, expectation ratio.
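The core of PRAM is simple to state: each record's category is replaced by a random draw from the row of a transition (Markov) matrix indexed by its current category. A minimal sketch (ours; the dict-based representation is an assumption, not the authors' implementation):

```python
import random

def pram(categories, transition, seed=0):
    # Apply PRAM: each record's category is independently replaced by
    # a draw from the transition-matrix row for its current category.
    # transition[c] maps category c to {new_category: probability},
    # with each row summing to 1.
    rng = random.Random(seed)
    out = []
    for c in categories:
        cats = list(transition[c])
        weights = [transition[c][nc] for nc in cats]
        out.append(rng.choices(cats, weights=weights, k=1)[0])
    return out
```

The questions raised in the abstract concern how this matrix should be chosen, e.g. so that expected category frequencies are preserved and logically related variables stay consistent.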

*"Obtaining Information while Preserving Privacy: A Markov Perturbation
Method for Tabular Data"*

George T. Duncan, Stephen E. Fienberg

Preserving privacy appears to conflict with providing information. Ways exist, however, to resolve this value paradox in an important context. Statistical information can be provided while preserving a specified level of confidentiality protection. The general approach is to provide disclosure-limited data that maximizes statistical utility subject to confidentiality constraints. Disclosure limitation based on Markov chain methods that respect the underlying uncertainty in real data is examined. For use with categorical data tables, a method called Markov perturbation is proposed as an extension of the PRAM method of Kooiman, Willenborg and Gouweleeuw (1997). Markov perturbation allows cross-classified marginal totals to be maintained and promises to provide more information than the commonly used cell suppression technique.

**Keywords**: Confidentiality, Data Access, Data Security, Hierarchical
Models, Markov Chains, Perturbation Methods, Simulated Data.

*"Disclosure Limitation for the 2000 Census of Housing and Population"*

Phil Steel, Laura Zayatz

The Bureau of the Census is required by law (Title 13 of the U.S. Code) to protect the confidentiality of the respondents to our surveys and censuses. At the same time, we want to maximize the amount of useful statistical information that we provide to all types of data users. We have to find a balance between these two objectives. We are investigating techniques that will be used for disclosure limitation (confidentiality protection) for all data products stemming from the 2000 Census of Population and Housing.

This paper describes *preliminary* proposals for disclosure limitation
techniques. In Section 2, we briefly describe the procedures that were
used for the 1990 Census. In Section 3, we describe why some changes in
those techniques may be called for. In Section 4, we give our initial proposals
for procedures for the 2000 Census, including procedures for the 100% census
tabular data, the sample tabular data, and the microdata. In Section 5,
we briefly describe methods of testing the resulting data in terms of retaining
the statistical qualities of the data and giving adequate protection. Section
6 contains references.

**Keywords**: Confidentiality

*"Factors Affecting Confidentiality Risks Involved in Releasing Census
Data for Small Areas"*

Oliver Duke-Williams, Phil Rees

The wish of data users to have data for a number of sets of areas, here called 'geographies', causes a potential problem for National Statistical Offices (NSOs). This paper reviews work done to investigate the extent to which the publication of data for multiple geographies poses a risk to the confidentiality of individuals.

*"Multiple Imputation and Disclosure Protection: The Case of the
1995 Survey of Consumer Finances"*

Arthur B. Kennickell

Recent developments in record linkage technology together with vast increases in the amount of personally identified information available in machine readable form raise serious concerns about the future of public use data sets. One possibility raised by Rubin [1993] is to release only simulated data created by multiple imputation techniques using the actual data. This paper uses the multiple imputation software developed for the Survey of Consumer Finances (Kennickell [1991]) to develop a series of experimental simulated versions of the 1995 survey data.

*"Modeling and Solving the Cell Suppression Problem for Linearly-Constrained
Tabular Data"*

Mateo Fischetti, Juan José Salazar

We study the problem of protecting sensitive data in a statistical table
whose entries are subject to a system of linear constraints. This very
general setting covers, among others, *k*-dimensional tables with
marginals as well as linked tables. In particular, we address the NP-hard
problem known in the literature as the (secondary) Cell Suppression Problem.
We introduce a new integer linear programming model and describe additional
inequalities used to strengthen the linear relaxation of the model. We
also outline a branch-and-cut algorithm for the exact solution of the problem,
which can also be used as a heuristic procedure to find near-optimal solutions.
Preliminary computational results are promising.

**Keywords**: Statistical Disclosure Control, Cell Suppression, Integer
Linear Programming.

*"Heuristic Methods for the Cell Suppression Problem in General Statistical
Tables"*

F. D. Carvalho, M. T. Almeida

One of the methods used to avoid disclosure of confidential data in statistical tables is to suppress confidential data from publication. Since row and column totals are also published, it is usually necessary to suppress the values of some non-confidential data as well. Assigning a cost to the information lost with the suppression of each non-confidential cell, the best solution is the one that minimizes the total cost. A table and its suppressions may be represented by an undirected bipartite network. An introduction to this subject and some approaches to its resolution by other authors are presented. We propose and test some heuristic improvement methods where network model techniques are used, and conclude that the quality of the solutions found by these new methods is considerably better than that of the existing ones.

**Keywords**: Cell Suppression Problem; Complementary Suppressions;
Primary Suppressions; Heuristics.

*"Lower-bounding Procedures for the Cell Suppression Problem in Nonnegative
Statistical Tables"*

F.D. Carvalho, M.T. Almeida

One of the methods used to avoid disclosure of confidential data in statistical tables is to suppress confidential data from publication. Omitting the confidential values, also known as primary cells, does not guarantee in every case that they cannot be disclosed or, at least, estimated within a narrow range, since some row and column values, if not all, are published. It is therefore necessary to make complementary suppressions, that is, to suppress values which are not confidential. A primary cell is considered protected if and only if the information provided by the final published table does not allow estimating its value within a narrower range than a prespecified safety range. Assigning a cost to every complementary suppression, the suppression problem is that of finding a set of complementary suppressions with minimum total cost. We present lower-bounding procedures for this problem and prove that our results dominate the results known from the literature.

**Keywords**: Cell Suppression Problem; Complementary Suppressions;
Primary Suppressions; Lower-Bounding Procedures.
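The protection criterion used in this and the preceding abstract can be stated in a few lines. In this sketch (ours, with illustrative names), `attacker_lower` and `attacker_upper` are the bounds an attacker can derive for the suppressed cell from the published table, however computed:

```python
def is_protected(attacker_lower, attacker_upper, value, safety):
    # A primary (confidential) cell is protected iff the interval the
    # attacker can derive from the published table is no narrower than
    # the prespecified safety range [value - safety, value + safety].
    return attacker_lower <= value - safety and attacker_upper >= value + safety
```

Complementary suppressions are added precisely until every primary cell passes this test, at minimum total information-loss cost.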

*"On Solving Huge Set-Cover Models of the Microdata Protection Problem"*

C.A. J. Hurkens, S.R. Tiourine

We discuss how to model the problem of dealing with microdata intended for public release. Protecting the information of individuals by recording and/or suppressions has to be balanced against the need for ending up with a data set that is still statistically valuable. Next we describe algorithms to compute good solutions. In addition to local search techniques to find these solutions, we develop a lower bounding mechanism, which will enable us to estimate the quality of our solutions.