MAIN OBJECTIVE: to teach data processing under security and privacy.
NEED FOR PROCESSING LARGE AMOUNTS OF DATA. Modern computers allow us to store and process large amounts of the data. There is hope that, e.g., by analyzing these huge amounts of medical data, we will be able to better understand causes of different diseases, to better trace which cure is better for each patient -- and as a result, to arrive at more personalized medical practices. Current data analytics has indeed led to several interesting medical breakthroughs. Similar interesting results have been obtained in agriculture, in biology, in social sciences, etc.
POTENTIAL PROBLEM WITH PRIVACY. Since we do not know a priori what kind of dependence we are looking for, it make sense to allow researchers to ask all kinds of statistical questions. At first glance, it sounds reasonable, but it is known that that such possibility leads to a violation of privacy.
At first glance, it looks like if we anonymize the data by deleting the names, and if we allow only statistical questions, then privacy will be preserved. Not so. For example, we may want to analyze how blood pressure depends on the closeness to I-10. So, we take a street -- e.g., Robinson Street -- and measure the blood pressure of all the people who live in that street. We then take an average blood pressure of those who live within a certain distance from I-10. Since we do not know how big is the effect, we compute averages corresponding to different distances. The problem is that if we know the average blood pressure in all the houses up to 501, and then up to 503, we will thus be able to reconstruct the blood pressure of a person living in 503 Robinson.
HOW CAN WE PRESERVE PRIVACY AND STILL ALLOW RESEARCHERS TO ANALYZE DATA. If we allow researchers to ask all possible questions based on the exact data, then privacy is not preserved. We do not want to limit possible queries: we do not know what is the actual dependence, and by limiting queries, we may inadvertently prevent the researchers from finding the actual dependence. So the only remaining choice is to modify either directly the query result or the data values on which this result is based.
This modification should be unpredictable -- so that it will not be possible to reverse it and get the actual value. From the mathematical viewpoint, unpredictable means random, so we arrive at the idea of adding random noise either directly to the query result or, indirectly, to the data values. Thus, we arrive at probabilistic methods of maintaining privacy.
PRIVACY IS PRESERVED, BUT DATA PROCESSING BECOMES MORE COMPLEX. To a large extend, probabilistic methods do preserve privacy, but now we have an additional problem: the added noise affects the results of data processing. We therefore need to know how big is this effect, i.e., how accurate are the results. When we uses this data processing to check a certain hypothesis, we need to take this noise into account in the criteria for testing the hypothesis.
PROBABILISTIC METHODS MAY ALSO LEAD TO LOSS OF PRIVACY. Another problem with the probabilistic methods is that leave room open for privacy loss. For example, if we add random noise to the query, then, by asking the same query many times and averaging the result, an adversary will be able to get the original exact value -- and thus, to gain supposedly protected information.
If we add noise to the original data, then the adversary can take into account that the same person is usually listed in many different databases. Based on each database, we can reconstruct the stored value. These stored values are obtained by adding different instances of random noise to the actual value. Thus, we can again average these values and reconstruct the actual data.
STORING RANGES INSTEAD OF ACTUAL VALUES: INTERVAL APPROACH TO PRESERVING PRIVACY. These problems can be avoided if, instead of the actual value (with or without noise), we only store a range of values. For example, instead of the exact age, we only mark whether a person is between 20 and 30, between 30 and 40, etc. In other words, we only keep an interval (such as [20, 30]) that contains the actual (unknown) value. Since this is the only data that is stored in the database, this is the only information that can be reconstructed.
PRIVACY IS PRESERVED, BUT DATA PROCESSING AGAIN BECOMES MORE COMPLEX. In such situations, data processing also becomes complex: instead of the original algorithms for processing real values, we now need to come up with new algorithms for processing intervals.
WHAT WILL BE COVERED IN THIS CLASS:
General data processing techniques. We start with a brief overview of the standard statistical methods, such as methods for finding the parameters of a linear dependence or methods for checking whether the dependence is indeed linear. For students who have already taken probability and statistics class, this will be an easy part, but since not everyone has taken this class, we will go slowly.
Once we learn the algorithms, we will program them -- and this will be a pattern throughout the class: to solidify the knowledge of the algorithms and to practice using them, we will write programs implementing these algorithms.
How probabilistic approaches to privacy preservation affect data processing. Then, we will analyze how addition of random noise affects the data processing results. We will learn some techniques for estimating this effect, and then, after learning how to simulate random noise, we will experimentally test how well these methods work.
How interval approach to privacy preservation affects data processing. Once we learn how to process data under probabilistic modifications, we will move to a more complex topic -- how interval approach to privacy preservation affects data processing. We will consider two important cases:
How to gauge the corresponding level of privacy. In all these methods, some information is still revealed -- e.g., the range of possible values. So, some privacy is lost. How can we gauge the resulting level of privacy? we will the basic way of gauging privacy level: k-anonymity (for each piece of information that can be extracted, there are at least k different individuals whose data is consistent with this information), l-diversity, differential privacy, and utility-based measures or privacy and of privacy loss.
What is the optimal way of modifying the data? Within each method of preserving privacy, there are many different options. In the probabilistic approach, we can add different noise. In the interval approach, we can divide into different ranges: e.g., by dozens instead of by tens. Once we know what we want to estimate, a natural question is which option should we select so that -- within the desired level of privacy -- we can the most accurate estimations of the desired quantities. On simple examples, we will learn how to solve the corresponding optimization problems.
Auxiliary questions. The above are the major topics related to data processing under security and privacy. In addition to these major topics, there are many auxiliary topics, and we will cover some of them.
For example, the best ways to gauge the reliability of a program and the uncertainty of the results of data processing is to take into account the structure of the code. But what is the code is proprietary, only available as a black box? In this case, we need to come up with different techniques for estimating the corresponding uncertainty. Some of these techniques will be described in the class.
THERE IS NO TEXTBOOK: we will use handouts and links
PROJECTS: An important part of the class is a project. There are three possible types of projects:
TESTS AND GRADES: There will be two tests and one final exam. Each topic means home assignments -- both on a sheet of paper and related to programming. Maximum number of points:
A good project can help but it cannot completely cover possible deficiencies of knowledge as shown on the test and on the homeworks. In general, up to 80 points come from tests and home assignments. So:
STANDARDS OF CONDUCT: Students are expected to conduct themselves in a professional and courteous manner, as prescribed by the Standards of Conduct. Students may discuss programming exercises in a general way with other students, but the solutions must be done independently. Similarly, groups may discuss project assignments with other groups, but the solutions must be done by the group itself. Graded work should be unmistakably your own. You may not transcribe or copy a solution taken from another person, book, or other source, e.g., a web page). Professors are required to -- and will -- report academic dishonesty and any other violation of the Standards of Conduct to the Dean of Students.
If you feel you may have a disability that requires accommodation, contact the The Center for Accommodations and Support Services (CASS) at 747-5148, go to Room 106 E. Union, or e-mail to email@example.com. For additional information, please visit the CASS website.