The proliferation of inexpensive and powerful computer systems and storage has led to the collection of unprecedented amounts of data by organizations from all facets of daily life. This has resulted in public concern regarding threats to individuals' sensitive information, especially at a time when the exchange of information is fast becoming an industry of its own. A significant problem in database security centers on the desire to provide accurate and reliable aggregates of data while preventing the disclosure of individual information. This paper surveys research into this problem and describes a framework of solution approaches.
Keywords: Statistical Databases, Database Management Systems
A practical method is presented for giving unlimited, correct, numerical responses to ad-hoc queries to a database while not compromising confidential numerical data. The technique is appropriate for any size database and no assumptions are needed about the statistical distribution of the confidential data. The responses are in the form of a number plus a guarantee, so that the user can determine an interval which is sure to contain the exact answer. Confidentiality is maintained by "hiding" the vector of sensitive data in an infinite set of vectors. Virtually any imaginable query type can be answered and collusion among the users presents no problem. The manager of the database can control which query types, based on the non-confidential fields, are likely to yield the tightest guarantees.
"An Audit Expert for Large Statistical Databases"
F. M. Malvestuto and M. Moscarini
In an on-line database environment "auditing" statistical queries is an effective policy for protecting confidential attributes of individual records from statistical disclosure. Existing implementations of auditing avoid the exact disclosure of confidential attributes of every individual record but not of sensitive data, that is, data which allow a confidential attribute of some individual record to be accurately estimated. Moreover, they are extremely costly since they make use of a mathematical model having a number of variables equal to the size of the underlying database. We present an implementation of auditing avoiding exact disclosure of sensitive data, based on a mathematical model where the number of variables is never greater, and usually far less, than the size of the database.
"Some superpopulation models for estimating the number of population uniques".
The number of unique individuals in the population is of great importance in evaluating the disclosure risk of a microdata set. We approach this problem by considering some basic superpopulation models, including the gamma-Poisson model of Bethlehem et al. (1990). We introduce the Dirichlet-multinomial model, which is closely related to but more basic than the gamma-Poisson model. We also discuss the Ewens model and show that it can be obtained from the Dirichlet-multinomial model by a limiting argument similar to the law of small numbers. Although these models might not necessarily fit actual populations well, they can be considered as basic mathematical models for our problem, just as the binomial and Poisson distributions are considered basic models for count data.
"Measuring Identification Disclosure Risk for Categorical Microdata by Posterior Population Uniqueness"
This article evaluates the risk of identification disclosure for categorical microdata by a posterior probability of population uniqueness (i.e., unique observations in a population) when there is no prior information. Bethlehem et al. (1990) introduced the concept of population uniques drawn from a superpopulation and proposed an estimated expected number (or fraction) of population uniques as the criterion to determine whether any additional measures for disclosure protection should be taken. But their model has been found too simple to fit real data, and it is not clear how to decide acceptable numbers or fractions so that an identification disclosure does not take place. Instead, this article generalizes their model and assesses the risk by the posterior probability as follows. Samples are assumed to be randomly drawn from the population and hence follow a multivariate hypergeometric distribution conditional on population cell frequencies. The prior distribution of population cell frequencies is assumed to be a multinomial distribution conditional on parameters. We first examine the posterior probability given hypothesized values of the parameters, and then consider a Dirichlet distribution for the prior distribution of the multinomial parameters. Given sample and population sizes, the maximum numbers of sample uniques are derived that attain a certain small probability of identification disclosure and are shown to be a function of the sampling fraction.
Keywords: Microdata; Identification disclosure; Population uniqueness; Multivariate hypergeometric; Dirichlet-multinomial
"A Bayesian, Population-Genetics-Inspired Approach to the Uniques Problem in Microdata Disclosure Risk Assessment"
Stephen M. Samuels
One important measure of disclosure risk for microdata is the proportion of sample uniques which are also population uniques. The distribution of this random variable depends on the population only through its partition structure: the distribution of the numbers of cells of each size. Partition distributions have been extensively studied in population genetics. Portions of that research can be adapted to provide us with the promise of a mathematical framework based on plausible prior distributions with easy to interpret parameters, and a modified Polya urn sampling model from which risk assessment is easily obtained.
Keywords: Partition Structure, Polya Urn, Poisson-Dirichlet
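As a concrete illustration of the quantity the abstract studies, the proportion of sample uniques that are also population uniques can be estimated directly by simulation. A minimal sketch (the function name and toy population are illustrative, not from the paper):

```python
import random
from collections import Counter

def unique_overlap(population, sample_size, seed=0):
    """Estimate the share of sample uniques that are also population uniques.

    `population` is a list of key-variable combinations, one per individual.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    pop_counts = Counter(population)
    samp_counts = Counter(sample)
    # Cells observed exactly once in the sample
    sample_uniques = [key for key, c in samp_counts.items() if c == 1]
    if not sample_uniques:
        return 0.0
    # ... of which, how many are also unique in the whole population?
    both = sum(1 for key in sample_uniques if pop_counts[key] == 1)
    return both / len(sample_uniques)

# Toy population: a few common key combinations plus some rare ones
population = ["A"] * 500 + ["B"] * 300 + [f"rare{i}" for i in range(20)]
print(unique_overlap(population, sample_size=100))
```

Repeating the draw over many seeds (or many simulated populations, as in the partition-structure framework above) yields the distribution of this ratio rather than a single point estimate.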
"A method for data-oriented multivariate microaggregation"
Josep M. Mateo-Sanz, Josep Domingo-Ferrer
Microaggregation is a statistical disclosure control technique for microdata. Raw microdata (i.e. individual records) are grouped into small aggregates prior to publication. Each aggregate should contain at least k records to prevent disclosure of individual information. So far, practical microaggregation has consisted of taking fixed-size microaggregates (size k). In this paper we consider a new approach to multivariate microaggregation in which the size of aggregates is a variable taking values >= k depending on the data.
Keywords: Statistical disclosure control; Microaggregation; Hierarchical clustering; Microdata protection
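The fixed-size baseline that the abstract contrasts with can be sketched in a few lines. This is a univariate illustration under assumed conventions (sort, take groups of k, merge the short tail into the final group), not the paper's variable-size multivariate method:

```python
def microaggregate(values, k=3):
    """Fixed-size univariate microaggregation: sort the records, split them
    into groups of k, and replace each value by its group mean. The final
    group absorbs any remainder so every group holds at least k records."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        # If fewer than 2k records remain, one last (larger) group takes all
        group = order[i:] if len(order) - i < 2 * k else order[i:i + k]
        mean = sum(values[j] for j in group) / len(group)
        for j in group:
            out[j] = mean
        i += len(group)
    return out

print(microaggregate([10, 12, 11, 40, 42, 41, 90], k=3))
```

Each published value is now shared by at least k records, which is the disclosure-protection property the abstract describes.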
"Estimation of variance loss following microaggregation by the individual ranking method"
Youri Baeyens, Daniel Defays
Thanks to computers, users of statistics are increasingly able to manipulate survey data themselves, in order to study individual behavior and the complex relationships between variables, to create ad-hoc models, and so on.
However, for obvious reasons of confidentiality, supplying data users directly with survey data is out of the question. The data need to be modified in some way in order to make it extremely difficult or even impossible to identify a respondent.
Many methods have been devised for this purpose and we have looked into one of them: individual ranking, a microaggregation method for continuous variables. More particularly, we have attempted to study the effects of individual ranking on the variances of the distributions.
The article comprises three main sections. The first describes the individual ranking method. The second outlines two ways of analyzing variance loss due to microaggregation using the individual ranking method. The third section summarizes the results obtained by simulations.
"An Application of Microaggregation Methods to Italian Business Surveys"
Veronica Corsini, Luisa Franconi, Daniela Pagliuca, Giovanni Seri
A class of statistical techniques which has proved useful in protecting confidential business data is microaggregation, developed at Eurostat. In this paper we present some of the results obtained in an extensive study on the application of microaggregation methods to Italian business data, in order to evaluate the performance of the methods as far as the maintenance of the characteristics of the original data is concerned. In Section 2 we briefly present the techniques used. In Section 3 we describe the data analyzed, whereas Section 4 contains some of the results obtained. In Section 5 we present and estimate some economic models using the original and the microaggregated data.
"Fréchet and Bonferroni Bounds for Multi-way Tables of Counts With Applications to Disclosure Limitation"
Stephen E. Fienberg
Upper and lower bounds on cell counts in cross-classifications of positive counts play important roles in a number of disclosure limitation procedures, e.g., cell suppression and data swapping. Some features of the Fréchet bounds are well known, intuitive, and regularly used by those working on disclosure limitation methods, especially those for two-dimensional tables. The multivariate versions of these bounds and other related bounds, such as those calculated using the Bonferroni approach, are more complex, but they have potentially great import for current disclosure limitation methodology. The purpose of this paper is to describe the key results on this topic.
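For a two-way table with row totals r_i, column totals c_j and grand total N, the two-dimensional bounds referred to above are max(0, r_i + c_j - N) <= n_ij <= min(r_i, c_j). A minimal sketch of computing them from published marginals:

```python
def frechet_bounds(row_totals, col_totals):
    """Two-way Fréchet bounds: for each cell (i, j) return the pair
    (max(0, r_i + c_j - N), min(r_i, c_j))."""
    N = sum(row_totals)
    assert N == sum(col_totals), "marginals must share the grand total"
    return [[(max(0, r + c - N), min(r, c)) for c in col_totals]
            for r in row_totals]

# 2x2 table with marginals r = (60, 40), c = (70, 30), N = 100
print(frechet_bounds([60, 40], [70, 30]))
```

If a bound pair is tight (e.g. a lower bound well above zero), an intruder learns a narrow interval for the suppressed cell, which is exactly why these bounds matter for disclosure limitation.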
"An Algorithm to Calculate the Lower and Upper Bounds of the Elements of an Array Given its Marginals"
Lucia Buzzigoli, Antonio Giusti
The paper presents a new algorithm to calculate lower and upper bounds of the elements of an n-way array, starting from the complete set of its (n-1)-way marginals. The procedure is computationally simpler than linear programming, which is usually utilized to solve this problem. The paper includes proofs for arrays of limited dimensions. The proposed algorithm, very easy to implement with a matrix language, shows interesting properties and possibilities of application.
"Disclosure Detection in Multiple Linked Categorical Datafiles: A Unified Network Approach".
Stephen F. Roehrig, Rema Padman, George Duncan, Ramayya Krishnan
This paper presents new research on the use of network models to evaluate the disclosure potential of categorical data tables linked over one or more attributes. Networks have been used in the past to model both disclosure detection and protection (e.g. via suppression) of two-dimensional tables. We present a new network model for higher-dimensional problems, including the case where released tables are derived as projections of a single underlying n-dimensional data cube.
"Some remarks on Research Directions in Statistical Data Protection"
Lawrence H. Cox
Modern research on statistical data protection (SDP) draws upon a rich, diverse subset of the mathematical sciences--statistics, mathematical programming, combinatorics, graph theory and theoretical computer science. This paper offers observations on selected recent SDP research directions.
"Dike: A Prototype for Secure Delegation of Statistical Data".
Josep Domingo-Ferrer, Ricardo X. Sànchez del Castillo, Javier Castilla
The need for delegating statistical data arises when the data owner (e.g. statistical office) wants to have its data handled by an external party. If the external party is untrusted and data are confidential, delegation should be performed in a way that preserves security. A cryptographic solution to the secure delegation problem is outlined which provides data secrecy and computation verifiability. Also, the design principles of Dike --an implementation allowing secure delegation of information over the Internet-- are discussed in some detail.
Keywords: Delegation of information; Encrypted data processing; Distributed computing; Statistical data protection.
"A Secure Network of European Statistical Offices over the Internet".
Despina Polemi, George Kokolakis.
A security solution is proposed for the interconnection of the European Statistical Offices (ESOs), and of ESOs with their users, over the Internet. It is based on the Trusted Third Party (TTP) services approach, considers statisticians' needs, and addresses technical, operational and functional aspects.
Index terms: Trusted Third Party Services, Internet, confidentiality, integrity, authenticity
"Investigating Key Qualitities of an Automated Cell Suppression System".
Keith McLeod, John George Andrew Rae, Rodney Butler
"Looking of Efficient Automated Secondary cell Suppression Systems: A Software Comparison".
The problem of secondary cell suppression is well known and studied. We examined software for automated secondary cell suppression and compared the programs under methodological and conceptual aspects.
A major practical experiment was conducted: The programs were run on tables from the German 1995 Census of Manual Trades. Results of these runs will be compared and presented in this paper.
"ARGUS for Statistical Disclosure Control"
Leon Willenborg, Anco J. Hundepool
The paper describes the main functionality of two related software packages for producing safe data: µ-ARGUS for microdata and τ-ARGUS for tabular data.
Keywords: Statistical disclosure control, software, microdata, tables.
"Protecting Output Databases"
Stephen Horn, Ross Morton
Official survey data are customarily released to the public as tabular aggregates or as files of unit records. Such releases have been subject to well established rules designed to preserve the confidentiality of respondent information as required by agency charter. Output databases that allow external users to access data across collections and time via a single contact point offer a third and distinct form of release. This paper outlines a protection strategy for such generalized table retrieval facilities.
"Special Uniques, Random Uniques and Sticky Population: Some Counterintuitive Effects of Geographical Detail on Disclosure Risk"
Mark J. Elliot, Chris J. Skinner, Angela Dale
Work on statistical disclosure control invariably assumes that disclosure risk increases as the level of detail in the released data increases. Using the 1991 GB census data, this paper describes some work using the UUSU ratio (the proportion of sample uniques which are also population uniques) which shows that disclosure risk, as measured by the UUSU ratio, has a non-monotonic relationship with the level of geographical detail.
Further, using the concept of population uniqueness, it is possible to demonstrate that uniqueness is a non-homogeneous categorization. The paper distinguishes special uniques, which are unique by virtue of an unusual combination of characteristics and whose uniqueness is insensitive to changes in geographical level, from random uniques, which are unique by virtue of the way in which the key variables have been constructed.
The paper concludes that the relationship between geographical level and disclosure risk is more complex than was previously supposed and that attending to the problem of special uniques may substantially reduce the risk of disclosure.
"Modeling population uniqueness using a mixture of two laws"
When a statistical agency wants to assess the risk of disclosure of a microdata file, one important measure that has to be estimated is the conditional probability that a record is unique in the population given that it is unique in the sample. The expression of this probability is a function of the sampling fraction and the structure of the population, that is, the information on the population in terms of the key variables. The basic problem is to estimate or model the structure of a population. By observing the relationship between this probability and the sampling fraction for a real population, we were able to find constraints on the structure of the population. These constraints give us some clues as to which models should be considered. The main result of the paper is the proposal of a mixture of two distributions for modeling population uniqueness.
Keywords: risk of disclosure, population uniqueness, mixture
"Pre-record risk of disclosure in dependent data"
Roberto Benedetti, Luisa Franconi, Federica Piersimoni
The disclosure protection problem, when the identity disclosure definition is used, can be set as follows: a microdata file can be released if there is little chance that others will correctly link records to individual units. The person who attempts such a link will pursue this aim by exactly matching the values of individual units contained in a public register or external database against the corresponding values in the released microdata file. In this paper we propose a new methodology for the definition of per-unit risk of disclosure that allows for a structure of dependence amongst the individuals. The methodology that defines the probability of identification of each record is presented in Section 2. In Section 3 the risk of disclosure is defined, whereas in Section 4 the computational problems are described.
"Statistical Methods to Limit Disclosure for the Manufacturing Energy Consumption Survey: Collaborative Work of the Energy Information Administration and the U. S. Census Bureau"
Ramesh A. Dandekar
The Energy Information Administration (EIA) of the United States Department of Energy (DOE) collects energy consumption and related information for the manufacturing industries in the United States via its Manufacturing Energy Consumption Survey (MECS). MECS is a triennial survey and is collected for EIA by the United States Census Bureau using the Bureau's legislative authority under Title 13. Title 13 is a statute that describes the statistical mission of the Census Bureau and contains strict confidentiality provisions to protect sensitive information.
"Confidentially Auditing for Price Index Publications"
The use of cell suppression to provide confidentiality protection for business statistics publications is standard within official statistical agencies. Automation for both designing and auditing cell suppression patterns has been in use for a considerable time. This automation has been developed for the important application of publishing tables of totals of economic activity. A common example is a census of manufacturing. Another group of examples is census of agriculture publications of either financial characteristics, like farm revenue, or physical characteristics, like land use or herd size.
"Improving the Disclosure Testing Algorithm for ONS Business Statistics"
This paper describes a change to the way that disclosure testing is undertaken for estimates produced from data collected in business surveys conducted by ONS. The purpose of the change was to reduce costs by making processing simpler. The new algorithm identifies those cells that can be correctly declared disclosive or non-disclosive easily, i.e. without applying the full rule. The full test is then reserved only for those cells where it is not so clear whether the cell is disclosive or not. The paper comments on the effectiveness of the new method.
Keywords: disclosure, threshold rule, p-percent rule
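The p-percent rule named in the keywords can be sketched as follows: a cell is treated as disclosive when the cell total minus the two largest contributions is less than p% of the largest contribution, since the second-largest contributor could then estimate the largest one too closely. This is a generic sketch of the rule, not ONS's exact algorithm:

```python
def p_percent_disclosive(contributions, p=10.0):
    """p%-rule: disclosive if the remainder after removing the two largest
    contributors is less than p% of the largest contributor."""
    if len(contributions) < 2:
        return True  # too few contributors to protect anyone
    s = sorted(contributions, reverse=True)
    remainder = sum(s) - s[0] - s[1]
    return remainder < (p / 100.0) * s[0]

print(p_percent_disclosive([100, 20, 5, 3]))   # remainder 8 < 10% of 100
print(p_percent_disclosive([100, 20, 9, 8]))   # remainder 17 >= 10
```

The "easy" cells the paper mentions are those far from the threshold on either side; only borderline cells need the full test.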
"Re-identification Methods for Evaluating the Confidentiality of Analytically Valid Microdata"
William E. Winkler
A public-use microdata file should be analytically valid. For a very small number of uses, the microdata should yield analytic results that are approximately the same as those from the original, confidential file that is not distributed. If the microdata file contains a moderate number of variables and is required to meet a single set of analytic needs of, say, university researchers, then many more records are likely to be re-identified by the re-identification methods typically used in the confidentiality literature. This paper compares several masking methods in terms of their ability to produce analytically valid, confidential microdata.
"Reflections on PRAM"
Peter-Paul de Wolf, José M. Gouweleeuw, Peter Kooiman, Leon Willenborg
PRAM is a probabilistic, perturbative method for disclosure protection of categorical variables in microdata files. If PRAM is to be applied, several issues should be carefully considered. The microdata file will usually contain a specific structure, e.g., a hierarchical structure when all members of a household are present in the data file. To what extent should PRAM conserve that structure? How should a user of the perturbed file deal with variables that can (logically) be deduced from other variables to which PRAM has been applied? How should the probability mechanism used to implement PRAM be chosen in the first place? How well does PRAM limit the risk of disclosing sensitive information? In this paper, these questions will be considered.
Keywords: Post Randomisation Method (PRAM), disclosure limitation, perturbed data, Markov matrix, expectation ratio.
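The core PRAM step described above can be sketched directly: each categorical value is replaced by a draw from the corresponding row of a Markov transition matrix. A minimal sketch (the categories and matrix are illustrative):

```python
import random

def pram(values, categories, transition, seed=0):
    """Apply PRAM: replace each categorical value by a random draw from its
    row of the Markov matrix, where transition[i][j] = Pr(category i -> j)."""
    rng = random.Random(seed)
    index = {c: i for i, c in enumerate(categories)}
    out = []
    for v in values:
        row = transition[index[v]]
        out.append(rng.choices(categories, weights=row, k=1)[0])
    return out

categories = ["employed", "unemployed", "inactive"]
# Rows sum to 1; a dominant diagonal keeps most values unchanged
P = [[0.90, 0.05, 0.05],
     [0.05, 0.90, 0.05],
     [0.05, 0.05, 0.90]]
print(pram(["employed", "inactive", "employed"], categories, P))
```

Because P is published (or at least known to the analyst), unbiased estimates of the original category frequencies can be recovered from the perturbed file, which is the expectation-ratio idea in the keywords.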
"Obtaining Information while Preserving Privacy: A Markov Perturbation Method for Tabular Data"
George T. Duncan, Stephen E. Fienberg
Preserving privacy appears to conflict with providing information. Ways exist, however, to resolve this value paradox in an important context. Statistical information can be provided while preserving a specified level of confidentiality protection. The general approach is to provide disclosure-limited data that maximizes its statistical utility subject to confidentiality constraints. Disclosure limitation based on Markov chain methods that respect the underlying uncertainty in real data is examined. For use with categorical data tables, a method called Markov perturbation is proposed as an extension of the PRAM method of Kooiman, Willenborg and Gouweleeuw (1997). Markov perturbation allows cross-classified marginal totals to be maintained and promises to provide more information than the commonly used cell suppression technique.
Keywords: Confidentiality, Data Access, Data Security, Hierarchical Models, Markov Chains, Perturbation Methods, Simulated Data.
"Disclosure Limitation for the 2000 Census of Housing and Population"
Phil Steel, Laura Zayatz
The Bureau of the Census is required by law (Title 13 of the U.S. Code) to protect the confidentiality of the respondents to our surveys and censuses. At the same time, we want to maximize the amount of useful statistical information that we provide to all types of data users. We have to find a balance between these two objectives. We are investigating techniques that will be used for disclosure limitation (confidentiality protection) for all data products stemming from the 2000 Census of Population and Housing.
This paper describes preliminary proposals for disclosure limitation techniques. In Section 2, we briefly describe the procedures that were used for the 1990 Census. In Section 3, we describe why some changes in those techniques may be called for. In Section 4, we give our initial proposals for procedures for the 2000 Census, including procedures for the 100% census tabular data, the sample tabular data, and the microdata. In Section 5, we briefly describe methods of testing the resulting data in terms of retaining the statistical qualities of the data and giving adequate protection. Section 6 contains references.
"Factors Affecting Confidentiality Risks Involved in Releasing Census Data for Small Areas"
Oliver Duke-Williams, Phil Rees
The wish of data users to have data for a number of sets of areas, here called 'geographies', causes a potential problem for National Statistical Offices (NSOs). This paper reviews work done to investigate the extent to which the publication of data for multiple geographies poses a risk to the confidentiality of individuals.
"Multiple Imputation and Disclosure Protection: The Case of the 1995 Survey of Consumer Finances"
Arthur B. Kennickell
Recent developments in record linkage technology, together with vast increases in the amount of personally identified information available in machine-readable form, raise serious concerns about the future of public use data sets. One possibility raised by Rubin is to release only simulated data created by multiple imputation techniques using the actual data. This paper uses the multiple imputation software developed for the Survey of Consumer Finances (Kennickell) to develop a series of experimental simulated versions of the 1995 survey data.
"Modeling and Solving the Cell Suppression Problem for Linearly-Constrained Tabular Data"
Matteo Fischetti, Juan José Salazar
We study the problem of protecting sensitive data in a statistical table whose entries are subject to a system of linear constraints. This very general setting covers, among others, k-dimensional tables with marginals as well as linked tables. In particular, we address the NP-hard problem known in the literature as the (secondary) Cell Suppression Problem. We introduce a new integer linear programming model and describe additional inequalities used to strengthen the linear relaxation of the model. We also outline a branch-and-cut algorithm for the exact solution of the problem, which can also be used as a heuristic procedure to find near-optimal solutions. Preliminary computational results are promising.
Keywords: Statistical Disclosure Control, Cell Suppression, Integer Linear Programming.
"Heuristic Methods for the Cell Suppression Problem in General Statistical Tables"
F. D. Carvalho, M. T. Almeida
One of the methods used to avoid disclosure of confidential data in statistical tables is to suppress the confidential data from publication. Since row and column totals are also published, it is usually necessary to suppress the values of some non-confidential data as well. Assigning a cost to the information lost with the suppression of each non-confidential cell, the best solution is the one that minimizes the total cost. A table and its suppressions may be represented by an undirected bipartite network. An introduction to this subject and some approaches to its resolution by other authors are presented. We propose and test some heuristic improvement methods where network model techniques are used, and conclude that the quality of the solutions found by these new methods is considerably better when compared to the existing ones.
Keywords: Cell Suppression Problem; Complementary Suppressions; Primary Suppressions; Heuristics.
"Lower-bounding Procedures for the Cell Suppression Problem in Nonnegative Statistical Tables"
F.D. Carvalho, M.T. Almeida
One of the methods used to avoid disclosure of confidential data in statistical tables is to suppress the confidential data from publication. Omitting the confidential values, also known as primary cells, does not guarantee in every case that they cannot be disclosed or, at least, estimated within a narrow range, since some row and column values, if not all, are published. It is therefore necessary to make complementary suppressions, that is, to suppress values which are not confidential. A primary cell is considered protected if and only if the information provided by the final published table does not allow estimating its value within a narrower range than a prespecified safety range. Assigning a cost to every complementary suppression, the suppression problem is that of finding a set of complementary suppressions with minimum total cost. We present lower-bounding procedures for this problem and prove that our results dominate those known from the literature.
Keywords: Cell Suppression Problem; Complementary Suppressions; Primary Suppressions; Lower-Bounding Procedures.
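Why complementary suppressions are needed at all can be seen in one line: if a row publishes its total and all but one cell, the primary suppression is recovered exactly by subtraction. A minimal illustration (function name and data are ours):

```python
def recover_single_suppression(published_row, row_total):
    """If exactly one cell in a row is suppressed (None) and the row total
    is published, the 'protected' value is recovered exactly by subtraction.
    With two or more holes, only interval estimates remain possible."""
    suppressed = [i for i, v in enumerate(published_row) if v is None]
    if len(suppressed) != 1:
        return None
    return row_total - sum(v for v in published_row if v is not None)

# Row (5, None, 12) with published total 25: the hidden cell must be 8
print(recover_single_suppression([5, None, 12], 25))
```

The cell suppression problem is then to choose which additional non-confidential cells to blank, at minimum information cost, so that every primary cell can only be estimated within its safety range.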
"On Solving huge Set-cover Models of the Microdata Protection Problem"
C.A. J. Hurkens, S.R. Tiourine
We discuss how to model the problem of dealing with microdata intended for public release. Protecting the information of individuals by recoding and/or suppression has to be balanced against the need to end up with a data set that is still statistically valuable. Next we describe algorithms to compute good solutions. In addition to local search techniques to find these solutions, we develop a lower bounding mechanism, which will enable us to estimate the quality of our solutions.