As Artificial Intelligence (AI) continues to transform research across various fields, effective data management has become more important than ever. Proper AI data management is not just a technical necessity; it is crucial for ensuring the credibility, consistency, and ethical integrity of research outcomes. By adopting the right practices, you can significantly improve the accuracy and reliability of your results.
This guide highlights key best practices in AI data management, covering data privacy and security, consistency and reproducibility, transparency, ethical use, efficiency, and funder requirements.
Data privacy and security are crucial in AI research, especially when working with sensitive data such as personal health information or financial records. Researchers must implement rigorous data management practices to protect information from unauthorised access, misuse, or breaches, particularly before the data is input into AI tools. In healthcare applications, for example, managing health data and addressing the ethical implications of AI require careful attention: researchers must obtain informed consent, implement strict data-sharing protocols, and ensure AI systems do not misuse or expose personal information.
To ensure sensitive data does not leak into AI tools, follow these best-practice measures:
Only input the minimum amount of sensitive data required for the AI tool's operation. From an ethical standpoint, data minimisation helps prevent unnecessary exposure of personal data and reduces the risk of misuse or discrimination.
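As a rough illustration, the Python sketch below strips a record down to only the fields an AI task actually needs before anything is sent to the tool; the field names and the record shown are hypothetical.

```python
# A minimal sketch of data minimisation before records reach an AI tool.
# The field names and the sample record are hypothetical placeholders.

REQUIRED_FIELDS = {"age_band", "symptom_notes"}  # only what the task needs

def minimise(record: dict) -> dict:
    """Return a copy of the record containing only the fields the AI task requires."""
    return {k: v for k, v in record.items() if k in REQUIRED_FIELDS}

record = {
    "name": "Jane Citizen",          # not needed -> excluded
    "medicare_number": "1234 56789", # not needed -> excluded
    "age_band": "40-49",
    "symptom_notes": "persistent cough, 3 weeks",
}

print(minimise(record))  # {'age_band': '40-49', 'symptom_notes': 'persistent cough, 3 weeks'}
```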
Encrypt sensitive data both at rest (when stored) and in transit (when being transferred). This ensures that data remains protected from unauthorised access at all stages, including before and after being input into AI tools.
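The sketch below shows one common way to encrypt data at rest in Python, using the `cryptography` package's Fernet interface; encryption in transit is normally provided by TLS/HTTPS rather than application code. Key handling here is deliberately simplified, and in practice keys belong in a secrets manager.

```python
# A minimal sketch of encryption at rest with the `cryptography` package
# (pip install cryptography). Data in transit should travel over TLS/HTTPS.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice, store keys in a secrets manager
fernet = Fernet(key)

plaintext = b"participant_id=0421, diagnosis=asthma"
ciphertext = fernet.encrypt(plaintext)   # safe to store at rest in this form

# Decrypt only at the point of authorised use
assert fernet.decrypt(ciphertext) == plaintext
```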
Ensure that all data handling, storage, and processing practices fully comply with privacy laws, including the Privacy Act 1988 and any sector-specific regulations. This is crucial before any data is entered into AI tools.
To protect sensitive data, restrict access to authorised personnel only, especially when the data is used in AI tools. Encryption helps prevent unauthorised access or misuse, and regular security audits are essential to verify compliance with access restrictions and encryption protocols.
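A minimal sketch of this idea, assuming a simple role-based model: access to a sensitive file is gated by role, and every attempt (granted or denied) is written to an audit log that later security reviews can inspect. The role names and file path are placeholders; real systems would use institutional identity management.

```python
# A minimal sketch of role-based access control with an audit trail.
import logging

logging.basicConfig(filename="access_audit.log", level=logging.INFO)

AUTHORISED_ROLES = {"chief_investigator", "data_custodian"}  # hypothetical roles

def load_sensitive_dataset(user_role: str, path: str) -> str:
    """Gate access by role and record every attempt for security audits."""
    if user_role not in AUTHORISED_ROLES:
        logging.warning("DENIED: role=%s path=%s", user_role, path)
        raise PermissionError(f"Role '{user_role}' is not authorised for {path}")
    logging.info("GRANTED: role=%s path=%s", user_role, path)
    with open(path, encoding="utf-8") as f:
        return f.read()
```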
Before inputting data into AI models, anonymise or de-identify it to remove personally identifiable information. This ensures that even if data is exposed, it cannot be linked back to individuals.
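As an illustration only, the sketch below applies simple rule-based redaction before text enters an AI tool. The patterns shown are far from exhaustive; real projects should rely on validated de-identification tools and expert review.

```python
# A minimal sketch of rule-based de-identification. These patterns are
# illustrative only and will not catch every identifier.
import re

PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b(?:\+?61|0)[\d\s-]{8,12}\b"),
    "NAME":  re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+\b"),
}

def deidentify(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Dr Lee reviewed the scan; contact jane.citizen@example.com or 0412 345 678."
print(deidentify(note))
# "[NAME] reviewed the scan; contact [EMAIL] or [PHONE]."
```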
Ensure that the data input into AI tools is accurate, relevant, and up to date. Maintaining data integrity helps prevent biases and ensures that sensitive information is used correctly and responsibly in AI-driven research. It also involves documenting the source and history of the data (data provenance) to establish its authenticity and reliability for AI-driven analysis.
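One lightweight way to record provenance, sketched below, is to store the data source alongside a SHA-256 checksum of the file: recomputing the checksum later reveals whether the data has been altered. The file name and source URL are hypothetical.

```python
# A minimal sketch of a provenance record with an integrity checksum.
import hashlib, json
from datetime import datetime, timezone

def provenance_record(path: str, source: str) -> dict:
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "file": path,
        "source": source,
        "sha256": digest,  # re-compute later to detect changes
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

record = provenance_record("survey_responses.csv", "https://example.org/export")
print(json.dumps(record, indent=2))
```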
Consistency and reproducibility are fundamental to credible AI research. By managing AI data effectively, researchers can ensure that results are consistent across multiple iterations, regardless of time or researcher. For example, when using AI to analyse text data, keeping the structure and wording of prompts consistent allows results to be reliably reproduced.
Keep a detailed log of each prompt version, documenting the changes made and their rationale. This process ensures that any modifications are consistent with the research goals, allows you to track how each change impacts AI output, and makes it easier to evaluate the potential effects of these changes on model performance.
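A minimal sketch of such a log, assuming a JSON Lines file: each entry records the prompt version, its wording, the date, and the rationale for the change. The file name and fields are illustrative.

```python
# A minimal sketch of an append-only prompt version log in JSON Lines.
import json
from datetime import date

def log_prompt_version(version: str, prompt: str, rationale: str,
                       log_path: str = "prompt_log.jsonl") -> None:
    entry = {"version": version, "date": date.today().isoformat(),
             "prompt": prompt, "rationale": rationale}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

log_prompt_version(
    "v2", "Summarise the methodology section in three sentences.",
    "v1 produced overly long summaries; added an explicit length limit.",
)
```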
Ensure that all relevant information, such as input data, expected outcomes, model parameters, and any pre-processing steps, is well-documented. This transparency supports clarity for future research and collaboration, and ensures that results can be replicated.
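For instance, a small metadata file written alongside each run can capture this information; the field names and values below are illustrative assumptions, not a required schema.

```python
# A minimal sketch of documenting one AI run: input data, pre-processing,
# model parameters, and the expected outcome. All names are placeholders.
import json

run_metadata = {
    "input_data": "interviews_2024_deidentified.csv",
    "preprocessing": ["lowercased text", "removed stopwords"],
    "model": "example-llm",  # placeholder model name
    "parameters": {"temperature": 0.2, "max_tokens": 512},
    "expected_outcome": "one theme label per interview",
}

with open("run_metadata.json", "w", encoding="utf-8") as f:
    json.dump(run_metadata, f, indent=2)
```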
Keep a detailed record of how AI prompts change over time, as well as the reasons for these changes. This is especially important in AI research, where the prompt given to the AI model can significantly affect the output.
Transparency is essential in AI-driven research to ensure that methods and processes are well-documented. This facilitates reproducibility and validation.
Keep detailed records of AI-generated outputs, especially unexpected or unusual results, to allow for proper evaluation. Such records allow you to assess the validity and reliability of the AI model and to identify potential issues (e.g. biases or errors).
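A minimal sketch, again assuming a JSON Lines log: each AI output is timestamped and stored with its prompt, and a placeholder rule flags suspicious results for later review (real flagging criteria would depend on the study).

```python
# A minimal sketch of logging AI outputs with a simple review flag.
import json
from datetime import datetime, timezone

def log_output(prompt: str, output: str, log_path: str = "ai_outputs.jsonl") -> None:
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "output": output,
        # Placeholder rule: flag empty or refusing outputs for human review
        "flag_for_review": len(output.strip()) == 0 or "cannot" in output.lower(),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```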
Document any modifications to prompts and the rationale behind them to track how these adjustments influence outcomes.
Provide clear, accessible logs that describe the purpose and context of each prompt, ensuring that the AI’s outputs align with the research objectives. This helps ensure that others can understand the research methodology and reproduce the findings.
Ensuring that the AI models themselves are explainable, so that you can understand and describe how the AI reaches its conclusions, further enhances transparency.
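As a simple illustration of the idea, the sketch below inspects the feature importances of a tree-based scikit-learn model as a first transparency check; dedicated tools such as SHAP or LIME provide richer, per-prediction explanations.

```python
# A minimal sketch of one basic explainability check: ranking the features
# a trained tree-based model relies on most (scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

# Rank features by importance as a first transparency check
ranked = sorted(zip(data.feature_names, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```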
Ethical AI usage is essential to prevent harm and ensure fairness. Proper data management practices help researchers identify and address potential biases in AI models, ensuring that results are ethical and equitable. Regularly assessing and adjusting AI models is crucial to identify and correct any biases. It is important to use datasets that are representative of diverse populations to minimise bias and ensure that the AI models provide equitable results. For example, in health research, AI models are often used to predict disease outcomes or recommend treatments based on patient data. If the data used to train these models is not representative of various demographic groups (e.g. age, gender, race), the model could generate biased results that favour one group over others.
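One basic check, sketched below, compares group proportions in a dataset against reference population figures before training; the age groups and reference shares are illustrative assumptions, not real statistics.

```python
# A minimal sketch of a representativeness check before model training.
from collections import Counter

records = [{"age_group": "18-34"}, {"age_group": "18-34"},
           {"age_group": "35-54"}, {"age_group": "55+"}]

counts = Counter(r["age_group"] for r in records)
total = sum(counts.values())

REFERENCE = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}  # hypothetical population shares

for group, expected in REFERENCE.items():
    observed = counts.get(group, 0) / total
    print(f"{group}: dataset {observed:.0%} vs population {expected:.0%}")
```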
AI tools, when properly managed, streamline research workflows. When managed effectively, AI systems allow researchers to focus on in-depth analysis and insights by automating repetitive tasks and ensuring consistency across datasets.
Create a consistent set of prompts to avoid redundant work and ensure consistency in data handling. Standardising prompts ensures that AI tools process information in the same way each time, which is crucial for consistency in research outcomes. For example, in large-scale literature reviews, a standardised prompt might ask AI to extract specific themes, findings, or methodologies from each article, ensuring the AI's outputs remain consistent across hundreds of papers.
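A minimal sketch of this pattern: a single template is defined once and reused verbatim for every article, so only the article text varies between runs. The template wording and the commented-out model call are assumptions, not a specific tool's API.

```python
# A minimal sketch of a standardised prompt template for literature review.
TEMPLATE = (
    "From the article below, extract: (1) the main themes, "
    "(2) key findings, and (3) the methodology used. "
    "Answer under those three headings only.\n\nARTICLE:\n{article_text}"
)

def build_prompt(article_text: str) -> str:
    return TEMPLATE.format(article_text=article_text)

# The same prompt structure is reused for every paper in the corpus:
for article in ["<full text of paper 1>", "<full text of paper 2>"]:
    prompt = build_prompt(article)
    # response = run_model(prompt)  # hypothetical model call
```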
Streamline workflows to reduce repetitive tasks, saving time and allowing effort to focus on high-level analysis.
Both the Australian Research Council (ARC) and the National Health and Medical Research Council (NHMRC) have clear and specific requirements for data management and transparency, especially for research involving Artificial Intelligence (AI).
While the exact requirements may vary based on the specific funding scheme, the expectation is that researchers handle data responsibly, maintain transparency, and adhere to ethical standards. For AI-based research, both agencies require a Data Management Plan (DMP) that clearly outlines how data will be managed throughout the research lifecycle, including how it will be collected, stored, secured, shared, and preserved. Relevant policies include:
Policy on Use of Generative Artificial Intelligence in the ARC’s grants programs
Policy on Use of Generative Artificial Intelligence in Grant Applications and Peer Review
Submitting a comprehensive and well-prepared Data Management Plan (DMP) is critical for securing funding from both the ARC and NHMRC. A clear DMP not only ensures transparency in how data will be handled throughout the research, but also demonstrates the researcher's commitment to high ethical standards.