Appendix B: Linking datasets

Data linking or data matching is the process of combining two or more datasets. It allows program administrations to provide more integrated and client-friendly government services.

Data linking also provides policy analysts and researchers a wider lens to draw insights and generate improvements to services. There are two ways to link datasets — deterministic and probabilistic.

Deterministic data linking

Deterministic data linking combines individual records only if the fields that are being compared match exactly. For example, two agencies could use social security numbers to combine their datasets. This type of data linking is most suitable when both datasets have a consistent, unique identifier.

Probabilistic data linking

Probabilistic data linking combines individual records using a special algorithm that compares multiple fields to determine if two records are the same entity. For example, P20 WIN’s data linking process uses identifiers such as name, birthday and other fields present in both datasets to combine datasets. Probabilistic data linking is best suited for datasets that don’t have a unique identifier. It’s also most applicable when two datasets have a unique identifier that’s inconsistently present or untrusted.

Appendix B: Linking datasets

Deterministic data linking

Probabilistic data linking

Recommended reading

Get expert answers.