Optim Data Privacy Providers  11.3.0
Data Swapping

Modules

 Specific Parameters
 

Detailed Description

Data Swapping

Service Identifier: DSWAP

This Service Provider protects the identity of individuals in the provided set of data. Confidentiality is achieved by modifying a fraction of the rows: a subset of fields is swapped across selected pairs of rows, so that it becomes impossible, or at least very hard, for an intruder to be certain of having identified an individual or entity in the data.

The primary purpose of the Data Swapping Service Provider is to swap the content of selected fields between random pairs of rows. Additional capabilities are:

The caller can also provide a seed to make the sequence of permutations repeatable in later calls.
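The effect of a seed on the pairing step can be sketched as follows. This is a minimal Python illustration, not the provider's implementation: the helper name `random_pairs` and the pairing strategy (shuffle the row indices, then pair neighbours) are assumptions made for the example.

```python
import random


def random_pairs(n_rows, seed=None):
    """Return disjoint random pairs of 0-based row indices.

    Hypothetical helper: shuffles the indices and pairs neighbours.
    With an odd number of rows, one row is left unpaired.
    """
    rng = random.Random(seed)  # private generator, so the seed only affects this call
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return [(indices[i], indices[i + 1]) for i in range(0, n_rows - 1, 2)]


# The same seed yields the same sequence of pairs in a later call.
first = random_pairs(6, seed=42)
second = random_pairs(6, seed=42)
```

Calling the helper without a seed gives a different pairing on each run, which is the non-repeatable behaviour the seed is meant to avoid.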

Example

A microdata dataset contains the age and income for 6 individuals. Note that the row number (#) is only used for illustration. In order to protect the anonymity of the individuals, sensitive values are randomly swapped among the rows. In the following, we describe the output of the Data Swapping Service Provider when used in default mode.

Source Buffer


To swap the sensitive fields, a sequence of random pairs is determined for each field. We begin with field Income: the Income values on rows 1 and 6, on rows 2 and 3, and on rows 4 and 5 are swapped pairwise. These swaps lead to table "After Swap #1" shown below. Next, the Age values are swapped, assuming that row 1 is paired with row 2, row 3 with row 4, and row 5 with row 6. The final result is given in table "Destination buffer".

Data Swapping


With respect to field Income, rows 2 and 3 appear unchanged because row 2 was swapped with a row having the same income. For these individuals, the swap has provided no masking. In general, the more frequently a value appears in the set of data, the less likely a swap is to mask a row holding that value. For large sets of data, this is acceptable: an income value that appears frequently in a microdata dataset does not identify an individual as easily as one that appears very rarely.
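The two swap passes of this example can be reproduced with a short Python sketch. The swap pairs are the ones given in the text; the Age and Income values are illustrative only (they are not the figures from the original table), and the helper name `swap_field` is hypothetical.

```python
def swap_field(rows, field, pairs):
    """Swap the value of `field` between each (i, j) pair of 1-based row numbers."""
    result = [dict(row) for row in rows]  # copy so the source buffer stays intact
    for i, j in pairs:
        a, b = result[i - 1], result[j - 1]
        a[field], b[field] = b[field], a[field]
    return result


# Illustrative values only; rows 2 and 3 share the same Income on purpose.
source = [
    {"Age": 25, "Income": 30000},
    {"Age": 31, "Income": 45000},
    {"Age": 38, "Income": 45000},
    {"Age": 44, "Income": 52000},
    {"Age": 52, "Income": 61000},
    {"Age": 60, "Income": 75000},
]

# Pass 1 swaps Income, pass 2 swaps Age, each with its own pairs.
after_swap_1 = swap_field(source, "Income", [(1, 6), (2, 3), (4, 5)])
destination = swap_field(after_swap_1, "Age", [(1, 2), (3, 4), (5, 6)])

# Row numbers whose Income looks the same before and after the first pass.
unchanged = [
    number
    for number, (before, after) in enumerate(zip(source, after_swap_1), start=1)
    if before["Income"] == after["Income"]
]
```

With these values, `unchanged` contains rows 2 and 3: they exchanged identical incomes, so the swap did not mask them.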

Locked Swap

In case of multiple selected fields, all these fields in a row may be swapped at once. Assuming the same swapping order as used above with field Age, the result of a locked swap of fields Age and Income is shown in the table below.

Locked Data Swapping


Note that if all fields of the rowset are selected, the result of a locked swap is equivalent to a reordering of the rows in the rowset. Furthermore, swapping the selected fields, each in a separate run, is equivalent to performing the swapping in Unlocked mode. In general, data swapping with an independent order for each field achieves higher protection, because the random swap of Age can differ from the random swap of Income. In contrast, a Locked Swap maintains the correlation between the selected fields.
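A locked swap can be sketched in Python as one pass that moves all selected fields together. The pairs are those used for Age in the example above; the data values and the helper name `locked_swap` are assumptions made for illustration.

```python
def locked_swap(rows, fields, pairs):
    """Swap all `fields` together between each (i, j) pair of 1-based row numbers."""
    result = [dict(row) for row in rows]  # copy so the source buffer stays intact
    for i, j in pairs:
        a, b = result[i - 1], result[j - 1]
        for field in fields:
            a[field], b[field] = b[field], a[field]
    return result


# Illustrative values only.
source = [
    {"Age": 25, "Income": 30000},
    {"Age": 31, "Income": 45000},
    {"Age": 38, "Income": 45000},
    {"Age": 44, "Income": 52000},
    {"Age": 52, "Income": 61000},
    {"Age": 60, "Income": 75000},
]

locked = locked_swap(source, ["Age", "Income"], [(1, 2), (3, 4), (5, 6)])

# Every (Age, Income) combination of the source still exists in the result,
# so the correlation between the two fields is preserved.
source_combinations = sorted((r["Age"], r["Income"]) for r in source)
locked_combinations = sorted((r["Age"], r["Income"]) for r in locked)
```

Because Age and Income are the only fields in this rowset, the result is simply the source rows in a different order, as noted above.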

Class-based Data Swapping

Service Identifier: CDSWAP

This Service Provider randomly swaps pairs of fields in selected rows but only if a condition on the pairs holds. This condition (or constraint) is defined with respect to a particular field, called the control field.

Example

In some situations there may be conditions on pairs of rows, defined by fields not in the collection of swapping fields, in order for one row in the pair to be a feasible swap candidate for the other. For instance, constraints may be necessary to prevent physically infeasible (and hence detectable) swapped records, such as males who have undergone hysterectomies. We call such a field whose values define the feasibility of swap candidates a control field.

In order to protect the anonymity of the individuals, sensitive values are randomly swapped within the groups of rows defined by the condition. In the following, we describe the output of the Class-based Data Swapping Service Provider that limits data swapping to the members of the corresponding category (also known as equivalence class).

A simple way to specify an equivalence class is to use the identity relation on a control field: rows with the same control value belong to the same class. This does not require user input to define the groups. In the following table, we have a Source buffer with fields Age, Income, and City. Income is the swap field and City the control field.

Class-based Data Swapping


In this example, the control field has three values: 'Dublin', 'Zurich', and 'Athens'. For Dublin there is only one allowable swap [(#1, #2)]. For Zurich there are three rows so swap pairs can be generated among rows #3, #4, and #5. As Athens has only one row, it cannot be swapped. In such cases, to preserve privacy, the provider notifies the user and offers the option to suppress the values of the sensitive fields (i.e. Age, Income). In the example, the sensitive values are suppressed for row #6. The result is shown in the Destination buffer in the table above.
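The grouping, swapping, and suppression steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the provider's algorithm: the function name `class_based_swap`, its parameters, the use of `None` as the suppressed value, and the way pairs are chained inside a class are all hypothetical, and the Age and Income values are invented.

```python
import random
from collections import defaultdict


def class_based_swap(rows, swap_fields, control_field, suppress_fields=(), seed=None):
    """Swap `swap_fields` only among rows that share a control field value.

    Returns the new rows and the 1-based numbers of rows that had no partner.
    """
    rng = random.Random(seed)
    result = [dict(row) for row in rows]  # leave the source buffer untouched

    # Identity relation on the control field: equal values form one class.
    classes = defaultdict(list)
    for index, row in enumerate(result):
        classes[row[control_field]].append(index)

    unmatched = []
    for members in classes.values():
        if len(members) == 1:
            # No swap partner exists, so suppress the sensitive values instead.
            for field in suppress_fields:
                result[members[0]][field] = None
            unmatched.append(members[0] + 1)
            continue
        rng.shuffle(members)
        # Assumption: chain swaps along the shuffled order so every row takes part.
        for a, b in zip(members, members[1:]):
            for field in swap_fields:
                result[a][field], result[b][field] = result[b][field], result[a][field]
    return result, unmatched


# Illustrative values only.
source = [
    {"Age": 25, "Income": 30000, "City": "Dublin"},
    {"Age": 31, "Income": 45000, "City": "Dublin"},
    {"Age": 38, "Income": 48000, "City": "Zurich"},
    {"Age": 44, "Income": 52000, "City": "Zurich"},
    {"Age": 52, "Income": 61000, "City": "Zurich"},
    {"Age": 60, "Income": 75000, "City": "Athens"},
]

destination, unmatched = class_based_swap(
    source, ["Income"], "City", suppress_fields=["Age", "Income"], seed=7
)
```

In this sketch suppression is always applied to single-row classes; a caller who wants it to be optional, as the text describes, could pass an empty `suppress_fields`.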

Distance-based Data Swapping

Service Identifier: DDSWAP

This Service Provider swaps values only between rows that lie close to each other with respect to a given control field, which keeps the distance between the rows of each swapped pair small.

Example

Let field Income be the swap field and field Age the control field. Furthermore, let p = 3 be the user-specified block size.

Distance-based Data Swapping


In the Source buffer, the rows are already sorted in ascending order with respect to the control field. The block size of 3 splits the table into two blocks. Swapping is performed separately for each block. Let the swap order for Block #1 be [(1,3), (2,3)] and the swap order for Block #2 be [(4,5), (5,6)]. The result is shown in the Destination buffer.
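The block-wise procedure can be sketched in Python as follows. The block size and swap orders are taken from the example; the Age and Income values are illustrative, and the function name `distance_based_swap` and its parameters are assumptions, not the provider's interface.

```python
def distance_based_swap(rows, swap_field, control_field, block_size, swap_orders):
    """Sort by the control field, then apply each block's swap order in sequence.

    `swap_orders` holds one list of (i, j) pairs per block, using 1-based
    row numbers of the sorted rowset.
    """
    result = [dict(row) for row in sorted(rows, key=lambda r: r[control_field])]
    for block_index, pairs in enumerate(swap_orders):
        first = block_index * block_size + 1
        last = min(first + block_size - 1, len(result))
        for i, j in pairs:
            # A pair must stay inside its block so swapped rows remain close.
            if not (first <= i <= last and first <= j <= last):
                raise ValueError(f"pair ({i}, {j}) leaves block {block_index + 1}")
            a, b = result[i - 1], result[j - 1]
            a[swap_field], b[swap_field] = b[swap_field], a[swap_field]
    return result


# Illustrative values only; rows are already sorted by Age.
source = [
    {"Age": 23, "Income": 28000},
    {"Age": 27, "Income": 31000},
    {"Age": 31, "Income": 36000},
    {"Age": 40, "Income": 47000},
    {"Age": 46, "Income": 52000},
    {"Age": 58, "Income": 64000},
]

destination = distance_based_swap(
    source,
    swap_field="Income",
    control_field="Age",
    block_size=3,
    swap_orders=[[(1, 3), (2, 3)], [(4, 5), (5, 6)]],
)
```

Because the pairs within a block overlap, the order in which they are applied matters: in Block #1, row 3 first receives the Income of row 1 and then passes it on to row 2.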



The following common parameters are applicable to the Data Swapping Service Provider:


The following parameters are applicable only for Data Swapping:


The following parameters are applicable only for Class-based Data Swapping:


The following parameters are applicable only for Distance-based Data Swapping: