Summer 2016, Test 2

1-2. In class, we learned the following algorithm for eliminating outliers:

- we compute the sample mean μ =
(x
_{1}+ ... + x_{n}) / n, - we compute sample
variance V = ((x
_{1}− μ)^{2}+ ... + (x_{n}− μ)^{2}) / (n − 1), and the sample standard deviation σ as the square root of V; - then, we dismiss all the records which are not in the 2-sigma interval [μ − 2σ, μ − 2σ] as outliers;
- after that, we repeat the same procedure for the new database, with some outliers deleted: we re-compute μ and σ and, if needed, eliminate values which are not in the new 2-sigma interval, etc.;
- this process continues until, at some stage, no more outliers are eliminated.

1. Use this algorithm to eliminate outliers from the following database:

- we have 100 records each equal to 1.9;
- we have 100 records each equal to 2.1;
- we have one record with value 10; and
- we have one record with value 1,000.

2. Explain how outlier elimination is used in computer security.

3. In class, we derived a general formula for the optimal sizes
Δ_{i} of a privacy-enhancing box for which the
inaccuracy in the resulting estimation of a statistical
characteristic C is the smallest possible under the condition of
k-anonymity: Δ_{i} = c / a_{i}, where
a_{i} is the absolute value of the partial derivative of C
with respect to the i-th variable x_{i}, and c = (1/2)
*
√(k * a_{1}
* a_{2} * ...) / ρ(x) , where
ρ(x) is the data density, i.e., number of records per unit
volume._{1} is equal to
(x_{2} − μ_{2}) / N and the derivative
with respect to x_{2} is equal to (x_{1} −
μ_{1}) / N, where μi are the sample
means of the corresponding values.

Use the general formula to find the sizes of the cell that provides the smallest possible inaccuracy in computing the covariance between weight and age for adults. Assume that:

- we are looking for k-anonymity with k = 10,
- we deal with a population of El Paso, N = 700,000 students,
- age is uniformly distributed on the interval [20, 90];
- weight is uniformly distributed on the interval [50, 90]; and
- we are
interested in the cell that covers a record with age x
_{1}= 65 and weight x_{2}= 80 kg.

4. What if in Problem 3, in addition to k-anonymity, we also
require l-diversity, with l = 2, and the thresholds
ε_{1} = 0.2 and ε_{2} = 0.01?

Reminder:

- If for the k-anonymous solution, we have
2Δ
_{i}≥ l * ε_{i}, then l-anonymity is also satisfied. - If for some of the variables x
_{i}, this inequality is not satisfied, then for this variable, we select:- for
this variable, we select Δ
_{i}= (1/2) * l * ε_{i}, and - for other variables, we select Δ
_{j}from the condition that the cell contain k records, i.e., that ρ(x) * 2^{n}* Δ_{1}* Δ_{2}* ... = k.

- for
this variable, we select Δ

5. What percentage of privacy do we lose if someone detects the
first digit x in the age xy? the second digit y? Assume that the
age is between 20 and 90.

6. Based on the Problem 5, it may look like the last decimal digit
-- the remainder of dividing a number by 10 -- is irrelevant, and
should not be protected much. Explain that it is not so: what if
someone knows the remainder by 10 and the remainder in the hex
system (when divided by 16) -- can we then uniquely determine the
adult age (from 20 to 90)? Explain your answer.

7. Use the efficient raising-to-the-power algorithm to compute
5^{21} mod 11. Where is this algorithm used in RSA
coding?

8-9. Use Euclid's algorithm to compute the greatest common divisor
gcd(11, 21). Then find a number d for which d * 11 mod 21 = 1.
Where is this used in RSA coding?

10. Briefly describe your project for this class, and what you
have done so far for this project.