Duplicate Parties
Duplicate parties may already exist in the data, and despite our best efforts, duplicates will certainly be created in the future.
What is the process for handling duplicates?
Report
Someone must notice the potential duplicate and report it to the data steward. The touchpoints for reporting are numerous: a phone call, an email, a Slack message, a hallway conversation, etc.
Identify
The data steward must identify which party is the duplicate and which is the valid one. The data steward must also determine whether the duplicate party has any properties or relationships that need to be merged into the valid party.
In our design this stewarding process is not automated and requires human judgement. Users often make mistakes when identifying duplicates, so the steward will usually research both candidate parties to confirm which one is the duplicate and, as a courtesy to downstream systems, to estimate how many transactions are attached to the potentially duplicate party. This simple impact analysis guides the steps that follow.
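As a rough illustration of that impact analysis, the sketch below counts the transactions attached to each candidate party before any merge decision is made. It assumes the party data lives in Neo4j and is read through the official Python driver; the Party and Transaction labels, the ATTACHED_TO relationship type, the party_id property, and the connection details are illustrative assumptions, not settled schema.

```python
from neo4j import GraphDatabase

# Connection details are placeholders -- adjust to the real environment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def transaction_count(party_id: str) -> int:
    """Count transactions attached to a candidate party (labels and types are assumed)."""
    query = (
        "MATCH (p:Party {party_id: $party_id})<-[:ATTACHED_TO]-(t:Transaction) "
        "RETURN count(t) AS n"
    )
    with driver.session() as session:
        record = session.run(query, party_id=party_id).single()
        return record["n"] if record else 0

# Compare the two candidates before deciding which one is the duplicate.
for candidate in ("PARTY-123", "PARTY-456"):
    print(candidate, transaction_count(candidate))
```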
Merge
Assuming the steward has identified the duplicate and the valid parties, we must now try to create a good master record by merging properties, and relationships if necessary, from the duplicate party into the valid party. This will require de-dupe tooling that has not yet been specified.
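Since the tooling is unspecified, the following is only one possible shape for the property merge: fill any gaps in the valid party from the duplicate and flag conflicts for the steward to resolve by hand. It is plain Python over property dictionaries; the function name and the sample properties are hypothetical.

```python
def merge_properties(valid: dict, duplicate: dict) -> tuple[dict, dict]:
    """Fill gaps in the valid party from the duplicate; collect conflicts for manual review."""
    merged = dict(valid)
    conflicts = {}
    for key, dup_value in duplicate.items():
        if merged.get(key) in (None, ""):
            merged[key] = dup_value                    # valid party had no value; take the duplicate's
        elif merged[key] != dup_value:
            conflicts[key] = (merged[key], dup_value)  # differing values; steward must choose
    return merged, conflicts

merged, conflicts = merge_properties(
    {"name": "Acme Corp", "tax_id": None},
    {"name": "ACME Corporation", "tax_id": "12-3456789"},
)
# merged picks up tax_id from the duplicate; conflicts flags the differing names
```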
Deactivate and Record Dupe
These two steps should be completed in a single transaction. Once the merge is complete, the duplicate party is deactivated as follows:
- the duplicate party's relationships should all be deactivated
- the duplicate party itself should be deactivated
Then, we must create an IS_DUPLICATE_OF relationship between the duplicate party and the valid party. This relationship will be used by downstream systems to identify and resolve duplicates.
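A minimal sketch of performing both steps in one transaction follows, again assuming a Neo4j store accessed through the official Python driver; the Party label, the active flag used for deactivation, and the party_id property are assumptions, while IS_DUPLICATE_OF is the relationship described above.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def deactivate_and_record_dupe(tx, dup_id: str, valid_id: str) -> None:
    """Deactivate the duplicate party and its relationships, then record IS_DUPLICATE_OF."""
    # Deactivate all of the duplicate party's relationships (via an assumed 'active' flag).
    tx.run(
        "MATCH (:Party {party_id: $dup_id})-[r]-() SET r.active = false",
        dup_id=dup_id,
    )
    # Deactivate the duplicate party itself.
    tx.run(
        "MATCH (d:Party {party_id: $dup_id}) SET d.active = false",
        dup_id=dup_id,
    )
    # Record the pointer that downstream systems use to resolve the duplicate.
    tx.run(
        "MATCH (d:Party {party_id: $dup_id}), (v:Party {party_id: $valid_id}) "
        "MERGE (d)-[:IS_DUPLICATE_OF]->(v)",
        dup_id=dup_id,
        valid_id=valid_id,
    )

# execute_write runs all three statements in one managed transaction, so the
# deactivation and the IS_DUPLICATE_OF record succeed or fail together.
with driver.session() as session:
    session.execute_write(deactivate_and_record_dupe, "PARTY-456", "PARTY-123")
```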
DOWNSTREAM SYSTEMS MUST RESOLVE DUPLICATES
Remember, it is the job of consuming (downstream) systems to resolve duplicates in their own data. The IS_DUPLICATE_OF relationship is a signal to downstream systems that they should resolve the duplicate. It is an important design principle in our product approach that the authoritative data product does not "reach into" other systems to fix their data.
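To make the consumer side concrete, a downstream system might periodically fetch the published IS_DUPLICATE_OF pairs and remap party references in its own data. The sketch below reuses the same assumed Neo4j setup as the earlier sketches; how a given consumer actually applies the mapping is entirely up to that system.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def fetch_duplicate_mappings() -> dict[str, str]:
    """Return {duplicate party id: valid party id} pairs published by the data product."""
    query = (
        "MATCH (d:Party)-[:IS_DUPLICATE_OF]->(v:Party) "
        "RETURN d.party_id AS dup_id, v.party_id AS valid_id"
    )
    with driver.session() as session:
        return {rec["dup_id"]: rec["valid_id"] for rec in session.run(query)}

# The consumer remaps its own records; the data product never reaches into
# the consumer's store to do this on its behalf.
mappings = fetch_duplicate_mappings()
local_party_id = "PARTY-456"
resolved_party_id = mappings.get(local_party_id, local_party_id)
```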
Refer to the documentation for IS_DUPLICATE_OF for more information.