## CS 4365/CS 5354 Data Processing Under Security and Privacy Summer 2016, Test 2

Name: __________________________________________________________

1-2. In class, we learned the following algorithm for eliminating outliers:

• we compute the sample mean μ = (x1 + ... + xn) / n,
• we compute sample variance V = ((x1 − μ)2 + ... + (xn − μ)2) / (n − 1), and the sample standard deviation σ as the square root of V;
• then, we dismiss all the records which are not in the 2-sigma interval [μ − 2σ, μ − 2σ] as outliers;
• after that, we repeat the same procedure for the new database, with some outliers deleted: we re-compute μ and σ and, if needed, eliminate values which are not in the new 2-sigma interval, etc.;
• this process continues until, at some stage, no more outliers are eliminated.

1. Use this algorithm to eliminate outliers from the following database:

• we have 100 records each equal to 1.9;
• we have 100 records each equal to 2.1;
• we have one record with value 10; and
• we have one record with value 1,000.
The algorithm is straightforward:

2. Explain how outlier elimination is used in computer security.

3. In class, we derived a general formula for the optimal sizes Δi of a privacy-enhancing box for which the inaccuracy in the resulting estimation of a statistical characteristic C is the smallest possible under the condition of k-anonymity: Δi = c / ai, where ai is the absolute value of the partial derivative of C with respect to the i-th variable xi, and c = (1/2) * (k * a1 * a2 * ...) / ρ(x) , where ρ(x) is the data density, i.e., number of records per unit volume.

Use the general formula to find the sizes of the cell that provides the smallest possible inaccuracy in computing the covariance between weight and age for adults. Assume that:

• we are looking for k-anonymity with k = 10,
• we deal with a population of El Paso, N = 700,000 students,
• age is uniformly distributed on the interval [20, 90];
• weight is uniformly distributed on the interval [50, 90]; and
• we are interested in the cell that covers a record with age x1 = 65 and weight x2 = 80 kg.
For covariance, the derivative with respect to x1 is equal to (x2 − μ2) / N and the derivative with respect to x2 is equal to (x1 − μ1) / N, where μi are the sample means of the corresponding values.

4. What if in Problem 3, in addition to k-anonymity, we also require l-diversity, with l = 2, and the thresholds ε1 = 0.2 and ε2 = 0.01?

Reminder:

• If for the k-anonymous solution, we have 2Δi ≥ l * εi, then l-anonymity is also satisfied.
• If for some of the variables xi, this inequality is not satisfied, then for this variable, we select:
• for this variable, we select Δi = (1/2) * l * εi, and
• for other variables, we select Δj from the condition that the cell contain k records, i.e., that ρ(x) * 2n * Δ1 * Δ2 * ... = k.

5. What percentage of privacy do we lose if someone detects the first digit x in the age xy? the second digit y? Assume that the age is between 20 and 90.

6. Based on the Problem 5, it may look like the last decimal digit -- the remainder of dividing a number by 10 -- is irrelevant, and should not be protected much. Explain that it is not so: what if someone knows the remainder by 10 and the remainder in the hex system (when divided by 16) -- can we then uniquely determine the adult age (from 20 to 90)? Explain your answer.

7. Use the efficient raising-to-the-power algorithm to compute 521 mod 11. Where is this algorithm used in RSA coding?

8-9. Use Euclid's algorithm to compute the greatest common divisor gcd(11, 21). Then find a number d for which d * 11 mod 21 = 1. Where is this used in RSA coding?

10. Briefly describe your project for this class, and what you have done so far for this project.