Privacy by design
Privacy by design (PbD), a systems engineering approach first developed by Cavoukian in 1995, calls for proactive, privacy-preserving design choices embedded throughout the process life cycle.49 Since the advent of electronic medical records (EMRs), experts have recognised the need to embed technological safeguards that protect privacy and prevent data breaches.50 51 Advances in data science help address several of the aforementioned limitations, either by manipulating the data through strategies like minimisation, separation or abstraction, or by regulating the process through defined conditions for control and notification.51 52
In many settings in India, personal data can often be accessed by people who have no need for such access: for example, clinic-based facilitators who liaise with state or private insurance companies, insurance agents themselves and, in the public sector, administrative officials. There is little recognition that such access, however unintentional or inadvertent, is unethical, and will very soon be illegal.53 The NDHM strategy calls for PbD tools without providing greater detail.12 Below, we describe the dominant tools in current use that apply PbD principles to address gaps in health data protection.54 These examples are meant to be illustrative, not exhaustive.
Data minimisation
When health data are collected, whether through clinical operations or during research, there is a temptation to collect more rather than less, given the opportunity costs associated with each collection exercise. The result is exhaustive data sets, archived across the public and private health sectors, that pose significant privacy risks.53 Restricting data collection to the essentials has in fact been shown to declutter and improve the user interface, and consequently user experience and compliance, while reducing privacy risks.55 While the NDHM espouses data minimisation, existing legacy digital public health systems continue to collect vast amounts of redundant data on millions of beneficiaries, without demonstrable justification.14 53
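To illustrate, a minimal sketch (in Python) of data minimisation applied at the point of collection; the field names are hypothetical, and a real programme would derive its allow-list from a documented justification for each element:

    # Hypothetical allow-list of fields with a documented clinical need.
    ESSENTIAL_FIELDS = {"patient_id", "visit_date", "diagnosis_code"}

    def minimise(record: dict) -> dict:
        """Discard any attribute without a documented need before storage."""
        return {k: v for k, v in record.items() if k in ESSENTIAL_FIELDS}

    raw = {"patient_id": "P-001", "visit_date": "2023-04-01",
           "diagnosis_code": "J10", "religion": "...", "income": "..."}
    stored = minimise(raw)  # extraneous attributes are never archived

Because extraneous attributes are dropped before storage, they cannot later be breached, linked or repurposed.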
Role-based access
Role-based access is a standard feature in most advanced EMRs.56 Open source tools like Dataverse provide scientists differential access to research databases as well.57 Multi-authority attribute-based encryption schemes allow role-based models to scale by granting access to users based on a set of attributes rather than on individual identities.58 59 For example, by virtue of being a verified clinician (regardless of who), physicians are generally able to look up most medical records at their institution; by virtue of being a public health administrator (regardless of who), officers should have no access to personal health information; and by virtue of being a research laboratory, a team would have access to authorised de-identified data, provided third-party regulators can affirm the veracity of each of these attributes (clinician, administrator, researcher).49 60 The Account Aggregator, a similar consent management framework already in play in India’s fintech ecosystem, lends itself to such selective, verifiable, pre-authenticated access and has been proposed as the backbone for the NDHM.61 Since user consent can be sought asynchronously (prior to actual data processing), this model somewhat mitigates the inadvertent coercion associated with point-of-care consent seeking. The NDHM seeks to verify attributes by developing and maintaining ‘registries’ of providers.62
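The following sketch (in Python) illustrates attribute-based access of this kind; the attribute names and policy table are illustrative assumptions, not the NDHM’s actual registry design:

    # Illustrative policy: access is granted on verified attributes,
    # not on individual identities.
    POLICY = {
        "identified_record":   {"clinician"},
        "deidentified_record": {"clinician", "researcher"},
        "aggregate_report":    {"clinician", "researcher", "administrator"},
    }

    def may_access(verified_attributes: set, resource: str) -> bool:
        """Grant access if any verified attribute is authorised for the resource."""
        return bool(verified_attributes & POLICY.get(resource, set()))

    assert may_access({"clinician"}, "identified_record")
    assert not may_access({"administrator"}, "identified_record")  # no PHI

Verification of the attributes themselves (eg, against a provider registry) is assumed to happen upstream, by a third-party regulator.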
User preference
The General Data Protection Regulation in the European Union gives users more control over their data by requiring companies to provide a consent management platform through which users select from a menu of data-use options.14 In India, the Data Empowerment and Protection Architecture and the NDHM seek to empower users by allowing them to place revocable time and purpose limitations on the use of their data, the sorts of choices that would be extremely beneficial to patients.63 In theory, patients would control who accesses their data at all times, receive notification of third-party access (whether authorised or not), and be able to revoke access at will, where permitted by law.
Others have elaborated on this idea by allowing data principals to opt into certain ‘data trusts’ or stewards with pre-negotiated access controls, where general attributes guide future data sharing: for example, a patient may elect to always allow healthcare providers to access her data but always deny access to pharmaceutical companies, regardless of the identifiability of the data.64–66 This approach entails data principals communicating their preferences to the consent manager, which accordingly directs data toward select categories of data processors: for example, to clinical health information users and public research agencies like the Indian Council of Medical Research, but not to pharmaceutical companies.12 The asynchronous and one-time (but revocable and changeable) nature of the process, made possible by the consent manager framework, may allow users to make a more informed and coercion-free choice, if citizens are encouraged to actively enrol in the system prior to clinical care.
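A minimal sketch (in Python) of such a consent artefact, with purpose and time limits, category-level allow and deny lists, and revocation; the class and field names are hypothetical and do not reflect the NDHM specification:

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class ConsentArtefact:
        data_principal: str
        allowed: set        # eg {"healthcare_provider", "public_research"}
        denied: set         # eg {"pharmaceutical_company"}
        purpose: str        # eg "treatment"
        expires: datetime
        revoked: bool = False

        def permits(self, category: str, purpose: str, now: datetime) -> bool:
            """A request succeeds only within the stated limits."""
            return (not self.revoked and now < self.expires
                    and purpose == self.purpose
                    and category in self.allowed
                    and category not in self.denied)

    consent = ConsentArtefact("P-001",
                              allowed={"healthcare_provider", "public_research"},
                              denied={"pharmaceutical_company"},
                              purpose="treatment",
                              expires=datetime(2026, 1, 1))
    consent.revoked = True  # the data principal may withdraw at any time

The consent manager would evaluate every request against the artefact, so that revocation or expiry takes effect without the data principal being present.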
Differential privacy
The current NDHM guidelines require all health information processors to make aggregated data available. Not only are aggregation and anonymisation inadequate for protecting privacy, for the reasons described above, but many aspects of clinical and population health will require non-anonymised, high-resolution data to be genuinely usable and useful.12 The NDHM’s Health Data Management Policy prohibits re-identification during data processing, even where inadvertent or unforeseen.14
Differential privacy (DP) seeks to balance such access to rich data with preserving privacy. It achieves this balance by introducing ‘statistical noise’ into the data set differentially, depending on what is being queried and by whom, thus combining the aforementioned approaches. The noise masks the contribution of each individual data point without significantly affecting the accuracy of the analysis. Moreover, the amount of information revealed by each query is calculated and deducted from an overall privacy budget, so that additional queries are halted when personal privacy may be compromised. If effective, this approach will help alleviate some of the concerns about combining large data sets; its utility in the clinical setting is yet to be determined. There is precedent for DP as a model for collaborative research.67 Open source platforms like OpenDP are likely to accelerate the application of DP across disciplines.68 DP may, however, lead to noisy aggregates with poor utility for analytical tasks in public health.69 70 Given the nascency of DP applications, it is premature to assess their utility on the basis of field impact.
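A minimal sketch (in Python) of the mechanism described above: a counting query answered with Laplace noise, with each answer deducted from a fixed privacy budget. The parameters are illustrative and not tuned for any real deployment:

    import random

    class PrivateCounter:
        def __init__(self, data, total_budget=1.0):
            self.data = data
            self.budget = total_budget   # total epsilon available

        def noisy_count(self, predicate, epsilon=0.25):
            """Answer a counting query; each answer spends epsilon."""
            if epsilon > self.budget:
                raise RuntimeError("privacy budget exhausted; query refused")
            self.budget -= epsilon
            true_count = sum(1 for row in self.data if predicate(row))
            # A counting query changes by at most 1 when one record is
            # added or removed, so Laplace noise of scale 1/epsilon (here
            # the difference of two exponential variates) suffices.
            noise = random.expovariate(epsilon) - random.expovariate(epsilon)
            return true_count + noise

    db = [{"age": a} for a in (23, 31, 45, 52, 67)]
    counter = PrivateCounter(db)
    print(counter.noisy_count(lambda row: row["age"] > 40))

After four such queries at epsilon = 0.25 the budget is exhausted and further queries are refused, which is exactly the behaviour the privacy budget is meant to enforce.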