June 22, 2004 12:00 AM

Under Wraps

Selling data with privacy guarantees
SQL Server Pro
InstantDoc ID #42731

No ID Required
An organization can use the first two techniques when selling access to statistical information that it hosts or when selling data in bulk. Sometimes, however, only individual data is useful. For example, medical research requires diagnosis and treatment information about specific people. The problem in this case is how to provide detailed information about individuals while making sure that researchers can't determine exactly who those people are.

One solution is de-identification, in which you scrub individual data records of identifying data fields. The Health Insurance Portability and Accountability Act (HIPAA) defines 18 classes of information that can identify individuals and sets policies about how to modify or entirely suppress each of them to preserve medical privacy. Obviously, you need to remove specific identifiers such as Social Security numbers. You need to generalize birth dates to birth years and, because so few individuals are age 90 or older, combine everyone in that age range into one category. In addition, you can provide only the first three digits of a ZIP code. This restriction is part of a general geographic constraint that limits locality specification to areas with more than 20,000 individuals. All these data changes are designed to give end users a "fuzzy" view of each individual: specific enough to be valuable to researchers, but vague enough to prevent identification of particular people.
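As a rough illustration of the generalization steps just described, the following Python sketch drops direct identifiers, reduces a birth date to a birth year, groups the oldest individuals into one category, and truncates the ZIP code. The record layout, the field names, and the `deidentify` function are hypothetical and are not taken from HIPAA or from any product; the population check for three-digit ZIP areas is omitted for brevity.

```python
from datetime import date

def deidentify(record, today, top_age=90):
    """Return a generalized copy of `record` (hypothetical layout) without direct identifiers."""
    birth = record["birth_date"]
    # Age in whole years as of `today`.
    age = today.year - birth.year - ((today.month, today.day) < (birth.month, birth.day))
    return {
        # Name and Social Security number are simply never copied.
        # Everyone at or above `top_age` (an assumed cutoff) shares one category.
        "birth_year": f"{top_age}+" if age >= top_age else birth.year,
        # Keep only the first three digits of the ZIP code.
        "zip3": record["zip"][:3],
        "diagnosis": record["diagnosis"],
    }

patient = {
    "name": "Pat Doe",             # invented sample data
    "ssn": "000-00-0000",
    "birth_date": date(1910, 5, 1),
    "zip": "15213",
    "diagnosis": "flu",
}
scrubbed = deidentify(patient, today=date(2004, 6, 22))
```

In this sketch `scrubbed` holds only the generalized birth year, the three-digit ZIP prefix, and the diagnosis. A real implementation would have to address all 18 HIPAA identifier classes, not just the three shown here.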

Who Was That Masked Man?
De-identification is a good start, but to have real confidence that you're preserving your customers' privacy, you need to take further steps. When disseminating data, the originating organization must realize that the end user might have other sets of data about the same individuals, so it's essential to ensure that the end user can't link your disseminated data with other data sets. Users might easily have access to public information such as Social Security death indexes, voter registration records, and motor-vehicle data. The Carnegie Mellon Data Privacy Lab Web site provides case studies and online demonstrations of how users can make the link between disclosed information and public records. These studies demonstrate that organizations that rely purely on de-identification can inadvertently compromise their customers' privacy.
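To make the linking risk concrete, here is a minimal Python sketch of how an end user might match released rows against a public list on shared fields. The data, the field names, and the `link_records` helper are all invented for illustration; this is not one of the Data Privacy Lab demonstrations.

```python
from collections import defaultdict

def link_records(released, public, key_fields):
    """For each released row, list the names in `public` that share its key fields."""
    by_key = defaultdict(list)
    for person in public:
        by_key[tuple(person[f] for f in key_fields)].append(person["name"])
    return [by_key.get(tuple(row[f] for f in key_fields), []) for row in released]

# Invented sample data: a de-identified release and a voter-style public list.
released = [
    {"birth_year": 1961, "zip3": "152", "sex": "F", "diagnosis": "asthma"},
    {"birth_year": 1975, "zip3": "152", "sex": "M", "diagnosis": "flu"},
]
public = [
    {"name": "Pat Doe", "birth_year": 1961, "zip3": "152", "sex": "F"},
    {"name": "Sam Roe", "birth_year": 1975, "zip3": "152", "sex": "M"},
    {"name": "Lee Poe", "birth_year": 1975, "zip3": "152", "sex": "M"},
]

matches = link_records(released, public, ["birth_year", "zip3", "sex"])
# A released row with exactly one candidate name has effectively been re-identified.
reidentified = [row["diagnosis"] for row, names in zip(released, matches) if len(names) == 1]
```

In this toy data the first released row matches only one public record, so its diagnosis can be tied to a name even though the release contains no names. The second row matches two people and stays ambiguous.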

Techniques for modifying released data usually begin with record de-identification. For example, Dr. Latanya Sweeney of the Data Privacy Lab has developed a technique called k-Anonymity. k-Anonymity is one of a class of algorithms designed to transform data sets in a way that lets you make quantifiable statements of privacy. The general approach is to first apply standard de-identification techniques, which essentially consist of generalizing individual characteristics such as age and location. Next, the k-Anonymity algorithm further generalizes individual characteristics to ensure that, for any one individual, at least k-1 others have the same set of key individual characteristics. This means that if users relate the released data to any conceivable set of external data, at least k individuals will always match a row of released data. Here, k serves as a quantitative measure of privacy.
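The k-Anonymity property itself is easy to check, even though choosing good generalizations is the hard part. The following Python sketch measures the smallest group of rows that share the same key characteristics and shows how a coarser age range raises that number. It is not Dr. Sweeney's algorithm; the function names, field names, and sample rows are assumptions made for illustration.

```python
from collections import Counter

def anonymity_level(rows, key_fields):
    """Size of the smallest group of rows sharing the same key-field values.

    The table satisfies k-Anonymity on `key_fields` for any k up to this value.
    """
    groups = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return min(groups.values()) if groups else 0

def generalize_age(rows, width):
    """Return copies of `rows` with `age` replaced by a range `width` years wide."""
    out = []
    for row in rows:
        low = row["age"] // width * width
        out.append({**row, "age": f"{low}-{low + width - 1}"})
    return out

# Invented sample data, already reduced to age and three-digit ZIP.
rows = [
    {"age": 34, "zip3": "152", "diagnosis": "asthma"},
    {"age": 36, "zip3": "152", "diagnosis": "flu"},
    {"age": 41, "zip3": "152", "diagnosis": "diabetes"},
    {"age": 47, "zip3": "152", "diagnosis": "flu"},
]

before = anonymity_level(rows, ["age", "zip3"])                     # every row is unique
after = anonymity_level(generalize_age(rows, 10), ["age", "zip3"])  # ages grouped by decade
```

With exact ages, each row is unique and the level is 1. After grouping ages into ten-year ranges, every row shares its key characteristics with one other row, so the level rises to 2 at the cost of less precise data.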

Under Cover
Organizations can profit by disseminating information from their databases to researchers, but privacy concerns are real. These industry techniques enable organizations to ensure data privacy for their customers. But using these techniques requires detailed planning and analysis. You need to select the right technique based on whether you're delivering summary data or individual records and whether individual records have fields that end users could link to other, public information about individuals. The Carnegie Mellon Data Privacy Lab is an excellent place to begin evaluating the techniques in detail.

Privacy Resources
"Achieving k-Anonymity Privacy Protection Using Generalization and Suppression"
by Dr. Latanya Sweeney:
http://privacy.cs.cmu.edu/people/sweeney/kanonymity2.pdf

"A Comparative Study of Microaggregation Techniques"
by Josep M. Mateo-Sanz and Josep Domingo-Ferrer, Qüestiió:
http://citeseer.ist.psu.edu/293093.html

The Carnegie Mellon Data Privacy Lab:
http://privacy.cs.cmu.edu/index.html

Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule:
http://www.hhs.gov/ocr/hipaa

HIPAA Final Privacy Rule:
http://aspe.hhs.gov/admnsimp/final/pvctxt01.htm

The National Agricultural Statistics Service (NASS):
http://www.nass.usda.gov:81/ipedb/

NASS Privacy Design:
http://www.niss.org/technicalreports/tr107.pdf

