Lectures 18-20: Data linkage and privacy
-understanding what the record (data) linkage problem is
- combining related/equivalent records across data sources
-understand why matching a database against itself can be regarded as a data linkage task
- business wishes to carry out an advertising campaign
- the customer database changes over time, people move address, change their names.
- duplicate records about individuals - business wishes to know if the same person appears more than once.
-be able to explain where record linkage is applied and why
Why:
- For businesses purpose, need to identify if two or more records fefer to the same individual.
- build connection between the similar record in separate table
-appreciate why record linkage can be difficult
Privacy
-be able to describe the methodology of blocking, as applied to record linkage, explain why it is useful and why it is challenging
Blocking
- process
- preprocessing
- clean the data
- Blocking
- represent complex records as simple values(blocks). Only score records with simple value in common.
- Scoring
- comparing two records and asses their similarity
- can use Jaccard similarity or edit distance here.
- Matching
- match sufficiently similar records
- Merging
- merge two groups, resolve conflicting attributes
- preprocessing
- Why?
- since blocking is much faster than linear scan and compare (n*m), but blocking only takes (m/b*n/b*b = m*n/b)
- challenges
- to determine the size of b
- when b is too small the speed is not significantly improved
- when b is too large the accuracy will drop dramatically
- for a good blocking strategy(uniform spread across blocks), then accuracy may be almost as good as strategy without blocking, but much much faster.
- to determine the size of b
-understand the record linkage pipeline of preparation, blocking, scoring, matching and merging
- Record linkage
- preparation
- Clean records
- 8J9I
- Blocking
- Represent complex records as simple values (blocks)
- Only score records with block in common
- Key words:
- Process:
- Each record from A allocated to one or more of the blocks
- Each record from B allocated to one or more of the blocks
- Within each block, compare the records from A against those from B & find those that match
- If two records are not assigned to the same block = they are not a match
- Number of blocks ⬆️
- Speed ⬆️
- Accuracy ⬇️ (miss other correct blocks)
- score
- Compare two records -> assess their similarity
- Jaccard similarity
Treat si as a set of words
| si | = count number of words in set
Denominator normalises the score based on how long each word is
- Edit distance
- Minimum number of character insertions, deletion & substitutions to go from s1 to s2
- Expensive to compute
- Score Combination
- Idea 1: sum up feature scores
- Idea 2: similarities, - dissimilarities
- Idea 3: weighted sum
- Idea 4: label examples of non-matching record pairs, train a classifier using machine learning
- matching
- Match sufficiently similar records
- Threshold
- i.e. Match when score > 0.5
- merging
- Merge matched records
- Resolve conflicting attributes
-be able to explain in what circumstances privacy is an important issue for data linkage
- When important?
- Matched data is being passed to another organisation or being made public
- Data matching is being conducted across databases from different organisation.
-understand the objective of privacy preserving data linkage
- the objective of privacy preserving
- without revealing any information about individuals who do not get linked across the databases
-understand the use of one way hashing for exact matching in a 2 party privacy preserving data linkage protocol
- One way hashing
- also means non invertible hash function
- for each organisation, applies a one way hash function to the attribute used to join the databases and shares its hashed values with other organisation. Each checks which ones match. These are linked records.
- Disadvantages
- little big difference results in a completely different output
- Dictionary attack: an organisation could mount a dictionary attack to "invert" the hash function. for example, organisation A scans the hashed values received from Organisation B. Checks if any match its hash dictionary. If yes, privacy is lost for those items.
- 2 party protocal
- Each organisation
- Applies a one-way hash function to the attribute used to join the databases
- Share its hashed values with the other organisation
- Each checks which one matches
- Disadvantage
- single character differences in the original value -> completely different hashed value
- dictionary attack: hash value could be inverted
- Each organisation
-understand the vulnerabilities of 2 party privacy preserving data linkage protocol to i) small differences in input, ii) dictionary attack
-understand the operation of the 3 party protocol for privacy preserving linkage, using hash encoding with salt for exact matching. Understand the disadvantages of this protocol
3 party protocal
- Organisation C = trusted 3rd party
- Organisations A & B send their hashed values to Organisation C
- Organisation C checks for matches
- If Organisation C is malicious -> using salt
- Organisations A & B concatenate a secret word (salt) to every name field in their data before hashing
- Organisation C does not know what this word is -> cannot perform a dictionary attack
Third party protocol
- using salt
- organisation A and B concatenate a secret word to every name field in their data before hashing(known as salt). Organisation C does not know what this word is and thus can't perform a dictionary attack to "reverse" the hashed values it received. When we send
- disadvantages
- may not robust to frequent attck,that is, 3rd party compares the distribution of hashed values to some known distribution. E.g. distribution of surename frequencies in a public database versus distribution of hash value. To prevent this, we also need to add some random record.
- adding salt does not help two parties protocol
- doesnt help approximate matching
-understand when and in what circumstances dictonary attacks and frequency attacks might be mounted against 2 party and 3 party data linkage protocols
-understand the steps of the 3 party protocol for privacy preserving data linkage with approximate matching (using Bloom filters)
Bloom filter
- how it works
- A bloom filter is an array of n bits, with all bits initially set to zero. We may store strings in the bloom filter by using hash functions H1...Hk to turn on certain bit. If a bit was set to 1 before, no change is made.
- similarity is (number of bits set to 1 in bot bloom filter) / (sum of 1 bitss in bot bloom filter)
- need to set the thread for similar comparison.
- how to protect string
- Very hard for third party to guess the exact word by only using the bloom filter but still perform similar to exact string compare.
- notice that chose a proper size of bloom filter and number of hash function is very important, when bloom filter is too small or use too many hash functions the comparison is useless since almost all string has full 1 bits in bloom filter.
-understand how to compute similarity of two strings based on 2-grams and why this method is useful
Similar match
- using 2-grams to calculate the approximate similarity
- 2*(number of 2 common 2-grams)/ (total number of 2-grams in both string)
- easy comparison and effective method
- approximate similarity
- two strings X and Y
- Convert each string into 2-grams ( lower case and list all substring of length 2)
- 0 <= sim(X,Y) <= 1
h = number of common 2-grams
x = number of 2-grams in X
y = number of 2-grams in Y
- E.g.
-understand how a Bloom filter works (how it is created, how strings are inserted, how strings are checked for membership)
- bloom filter
- An array of I bits & all bits initially set to 0
- Store strings in the bloom filter by using hash functions H1…Hk to turn on certain bits
- Each hash function Hi (1<=i<=k) maps an input string X to a value in the range [1,i-1]
- To store a string X in the bloom filter, set array index Hi(X) in the bloom filter to 1 for each Hi(X)
- Hash vale = index
- Example: k=2, I = 25
- If a bit was set to 1 before, no change is made
- Check if a string is present:
- If at least one of the bits is set to 0, then X is definitely not a member of the bloom filter
- If all of the bits are set to 1 -> X appears to be a member of the bloom filter, but it might be false positive (same bloom filter -> different words)
- Google Chrome Browser
- use bloom filter to store malicious URLs
- when visiting a website the browser checked if that URL is in the filter
- if the answer is yes -> browser would query Google’s servers for more detailed information.
- Otherwise, browsing continues as normal
- The bloom filter was a compressed data structure of all malicious URLs Google new about, stored by the browser
- Much less space than maintaining a full database of URLs
- Comparing two strings for approximate similarity
- Given two strings: X and Y
- 2-grams of X are stored in bloom filter B1
- 2-grams of Y are stored in bloom filter B2
- Both bloom filters are the same length & use same hash function
- If two strings have a lot of 2-grams in common -> their bloom filters will have a large number of identical bit positions set to 1
-understand how a Bloom filter provides a private representation for strings
- A and B agree on a bit array length I and k hash functions
- also agree on a salt and use it in their hash functions
- A and B many add dummy records to lessen the chance of C mounting a frequency attack
- May be problems with using bloom filters represent short strings (not very secure)
- Might add random 2-grams / random bits to the bloom filters (for extra security)
-understand how Bloom filters can be used to compare two strings for approximate similarity and the formula for doing this (Dice similarity coefficient)
- similarity of bloom filter
h = number of bits set to 1 in both bloom filters
b1 = number of bits set to 1 in bloom filter B1
b2 = number of bits set to 1 in bloom filter B2
- Need to choose a threshold value for deciding whether to report that X matches Y