Most large businesses have an untapped asset: their warehouses of data about individual customer purchases. Market research firms, for example, will pay for this information because they forecast future buying trends based on demographic analysis of past activity. But while this market could be lucrative, privacy concerns make most enterprises reluctant to sell their customer data.
Privacy is important. Organizations have an ethical obligation to respect their customers' privacy. Organizations also have a legal obligation associated with recent legislation aimed at protecting individual privacy in the computer age. (For information about recent privacy legislation, see Sean Maloney's article "For Your Eyes Only," page 15, InstantDoc ID 42615.) Consumers are becoming increasingly aware that their purchases and online activity leave trails of personal information scattered throughout cyberspace. They fear that companies, criminals, or even the government could knit this disparate data together to form a profile of them that's too personal for comfort.
Companies don't necessarily have to lock up or destroy their customer data. But they do have to be diligent about ensuring that the data they provide doesn't permit identification of their customers. Several well-known techniques in the business intelligence (BI) industry let organizations disseminate data about customers while protecting individual customer privacy. These techniques fall into four main areas:
Data-access control
Microaggregation
De-identification
Release-data modification
With data-access control, you control end users' access to individual records. In the microaggregation approach, you aggregate groups of individual records into an accurate summary of a population while hiding the records of specific individuals. De-identification means that you can release personal records only after scrubbing field data of any identifying information. And modifying release data, an extension of de-identification, means scrubbing the entire data set in a prescribed way so that you can release individual data with quantifiable measures of privacy. The first two techniques apply when you release only statistical summary data; the last two apply when you disseminate individual records. Let's examine each of these techniques and look at some examples of how organizations have used them successfully.
Limited Access
Data-access control is an appropriate strategy when an organization wants to host its data on its own Web site and let external researchers access statistical aggregates of the data. In this approach, the organization can use a tool such as SQL Server Analysis Services to create an OLAP cube of the data, then sell online access to this data. Analysis Services lets you specify access-control limits that restrict access to particular cube cells.
However, restricting access to particular cells isn't enough. Suppose an organization allows access to aggregate data that summarizes individual cell information. You can roll this data up along several dimensions (for example, purchases by ZIP code or purchases by age group), but even if you restrict access to the cell for a particular individual, a user can still query several dimensions and infer that cell's contents from the totals. Outliers are especially exposed: an individual case that is extreme enough (such as a family with a dozen children) to skew the summary data is also unusual enough that users can deduce its characteristics. You need to develop an intelligent monitor to detect these cases. A monitor records the data previously sent to a client and limits new query results to ensure that the accumulated data won't let users reliably infer individual data. The monitor stands between the user and Analysis Services, intercepting requests and modifying the rows that are returned.
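To see why a monitor has to remember earlier answers, consider a minimal Python sketch. It is not the NASS algorithm or an Analysis Services feature; the class QueryMonitor, its min_size threshold, and the sample record IDs and values are hypothetical names and figures chosen for illustration. The monitor refuses any total over too few records, and it also refuses a total whose difference from an earlier answer would isolate too few records.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional


class QueryMonitor:
    """Hypothetical monitor that answers SUM queries over sets of record IDs."""

    def __init__(self, values: Dict[int, float], min_size: int = 3) -> None:
        self.values = values
        self.min_size = min_size          # smallest group we are willing to summarize
        self.answered: List[FrozenSet[int]] = []  # query sets already released

    def query_sum(self, record_ids: Iterable[int]) -> Optional[float]:
        ids = frozenset(record_ids)
        # Rule 1: never summarize a group that is too small.
        if len(ids) < self.min_size:
            return None
        # Rule 2: refuse if subtracting this answer from an earlier one
        # (or the reverse) would expose a group that is too small.
        for prev in self.answered:
            if prev < ids and len(ids - prev) < self.min_size:
                return None
            if ids < prev and len(prev - ids) < self.min_size:
                return None
        self.answered.append(ids)
        return sum(self.values[i] for i in ids)


# Record 4 is the outlier the monitor should protect.
purchases = {1: 10.0, 2: 20.0, 3: 30.0, 4: 1000.0}
monitor = QueryMonitor(purchases, min_size=3)

print(monitor.query_sum({1, 2, 3}))     # 60.0
print(monitor.query_sum({1, 2, 3, 4}))  # None: the difference would reveal record 4
```

This sketch checks only pairs of nested query sets. A production monitor would also have to consider combinations of several earlier answers, which is a considerably harder problem.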
The National Agricultural Statistics Service (NASS) database serves as an example of how to implement an intelligent monitor. (See "Privacy Resources," page 14, for the NASS URL and more information about privacy designs.) The database holds survey data about various farms' chemical fertilizer usage. The system provides query tools that let researchers discover accurate aggregate statistical data about geographical areas while preventing them from inferring information about specific farms. NASS maintains privacy by aggregating several smaller geographical units into larger ones, which makes it more difficult to infer the contribution of tiny clusters of farms to aggregate totals. Because the system allows ad hoc queries, it also monitors the successive data sent to clients to ensure that the aggregate data is sufficiently general to prevent identification of individual farms.
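The idea of folding small geographical units into larger ones can be sketched in a few lines of Python. This is an illustration only, not the NASS procedure: the function name merge_small_regions, the min_farms threshold, the sample figures, and the assumption that neighbors in the input list are also geographic neighbors are all invented here.

```python
from typing import List, Tuple

Region = Tuple[str, int, float]  # (name, number of farms, total usage)


def merge_small_regions(regions: List[Region], min_farms: int) -> List[Region]:
    """Greedily merge consecutive regions until each group has min_farms farms.

    Assumes the list is ordered so that consecutive entries are adjacent areas.
    """
    merged: List[Region] = []
    names: List[str] = []
    farms, total = 0, 0.0
    for name, n_farms, usage in regions:
        names.append(name)
        farms += n_farms
        total += usage
        if farms >= min_farms:
            merged.append(("+".join(names), farms, total))
            names, farms, total = [], 0, 0.0
    if names:
        # Leftover regions are still too small: fold them into the last group.
        if merged:
            last_name, last_farms, last_total = merged.pop()
            names.insert(0, last_name)
            farms += last_farms
            total += last_total
        merged.append(("+".join(names), farms, total))
    return merged


counties = [("A", 2, 10.0), ("B", 5, 40.0), ("C", 12, 90.0), ("D", 1, 5.0)]
print(merge_small_regions(counties, min_farms=5))
# [('A+B', 7, 50.0), ('C+D', 13, 95.0)]
```

County A (two farms) and county D (one farm) never appear on their own, so their totals can't be read directly from the released figures.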
Privacy isn't an absolute. You can characterize it only in a probabilistic sense, by quantifying the likelihood that a set of information will let users correctly guess who an individual is. Researchers who view a set of demographic data about many individuals can still identify one individual if only one person has that particular combination of characteristics. If two people share that combination, the system is a little more private: a researcher can correctly guess which individual is which only half the time. In the case of NASS, the system enforces the privacy constraint by ensuring that it doesn't disclose data about small counties that contain only a few farms. The monitor aggregates these counties into adjacent areas to reduce the risk of identification. In the NASS implementation, the monitor performs this aggregation dynamically; as the user makes successive queries, it limits the information that the application can disclose based on the information the user has already received. (The reference links in the "Privacy Resources" box provide the exact algorithm.) The monitor thus ensures that researchers can't use the complete set of information they receive to reliably infer data about specific farms.
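The probabilistic view of privacy lends itself to a simple calculation: if k people share a combination of characteristics, a guess about which of them is which succeeds with probability 1/k. The Python sketch below applies that rule to a made-up table. The function name reidentification_risk and the field names zip, age_group, and children are assumptions for illustration, not part of any system described here.

```python
from collections import Counter
from typing import Dict, List, Sequence


def reidentification_risk(records: List[Dict], quasi_identifiers: Sequence[str]) -> List[float]:
    """Return, for each record, 1/k where k is the number of records
    that share its combination of quasi-identifying fields."""
    combos = [tuple(rec[q] for q in quasi_identifiers) for rec in records]
    group_sizes = Counter(combos)
    return [1 / group_sizes[combo] for combo in combos]


customers = [
    {"zip": "80301", "age_group": "30-39", "children": 2},
    {"zip": "80301", "age_group": "30-39", "children": 2},
    {"zip": "80301", "age_group": "40-49", "children": 12},  # unique combination
]
risks = reidentification_risk(customers, ["zip", "age_group", "children"])
print(risks)  # [0.5, 0.5, 1.0]
```

A risk of 1.0 marks a record that is unique in the data set. An organization could set a maximum acceptable risk and aggregate or withhold any record that exceeds it.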
Think Small
NASS hosts its database system on its own Web site, so it can control the sequence of statistics that it provides for successive queries. An organization that sells a complete statistical data set has a harder problem. Once you release a complete data set, an end user can subject it to exhaustive analysis. The originating organization therefore has to combine individual records into higher-level aggregates before release, a technique called microaggregation. The paper "A Comparative Study of Microaggregation Techniques" by Josep M. Mateo-Sanz and Josep Domingo-Ferrer, published in Qüestiió (see "Privacy Resources"), summarizes various microaggregation techniques that privacy researchers have developed. In general, these techniques analyze a source data set of individual records and generate a set of aggregated records. Each aggregated record is a statistical summary of a cluster of individuals. The intent is to create a data set that is accurate for analysis but that doesn't contain any information that would let users infer individual data.
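One of the simplest forms of microaggregation works on a single numeric field: sort the values, cut them into groups of at least k records, and release only each group's size and mean. The Python sketch below shows that fixed-size variant under stated assumptions. It is not taken from the paper cited above, and the function name microaggregate, the parameter k, and the sample values are hypothetical.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def microaggregate(values: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Group sorted values into clusters of at least k records and
    return (cluster size, cluster mean) for each cluster."""
    ordered = sorted(values)
    groups = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    # A short final group would be too easy to identify, so merge it
    # into the previous group.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    return [(len(group), mean(group)) for group in groups]


annual_spend = [10, 12, 11, 50, 52, 51, 90]
print(microaggregate(annual_spend, k=3))
# [(3, 11), (4, 60.75)]
```

Sorting first keeps similar records together, so the group means stay close to the original values. Note that the outlier (90) still pulls its group's mean upward, which is why more elaborate techniques choose group boundaries to limit that kind of distortion.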