

AI Data Management in Research: Key Practices

As Artificial Intelligence (AI) continues to transform research across various fields, effective data management has become more important than ever. Proper AI data management is not just a technical necessity; it is crucial for ensuring the credibility, consistency, and ethical integrity of research outcomes. By adopting the right practices, you can significantly improve the accuracy and reliability of your results.

This guide highlights key best practices in AI data management to help you:

  • Improve the quality of your work
  • Promote transparency, efficiency, and ethical standards

These practices include:

  • Implement version control for your data, including data generated from prompts, to monitor changes and ensure reproducibility.
  • Maintain detailed documentation, including a prompt log, to record each step of your research process. This makes it easier to track changes and enhance data comprehension.
  • Focus on bias reduction to ensure AI-generated data meets fairness and accuracy standards.
  • Prioritise data security to protect sensitive information, especially when entering data into AI tools. Avoid inputting personal or unpublished data and ensure compliance with relevant regulations.

Data privacy and security are crucial in AI research, especially when working with sensitive data such as personal health information or financial records. Researchers must implement rigorous data management practices to protect information from unauthorised access, misuse, or breaches, particularly before the data is input into AI tools. For example, in healthcare applications, managing health data and addressing the ethical implications of AI require careful attention. Researchers must ensure informed consent, implement strict data sharing protocols, and ensure AI systems do not misuse or expose personal information.

To ensure sensitive data does not leak into AI tools, follow these best-practice measures:

Data Minimisation

Only input the minimum amount of sensitive data required for the AI tool’s operation. From an ethical standpoint, data minimisation helps prevent unnecessary exposure of personal data and minimises risks of misuse or discrimination.
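As an illustration, the following Python sketch keeps only the columns an AI task actually needs before export. The file name and column list are hypothetical placeholders; substitute the minimal fields your own task requires.

```python
# Minimal-data export before use in an AI tool.
# "participants.csv" and the column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("participants.csv")

# Keep only the fields the AI task actually needs; identifiers stay out.
REQUIRED_COLUMNS = ["age_band", "symptom_text"]  # assumption: sufficient for the task
minimal_df = df[REQUIRED_COLUMNS]

minimal_df.to_csv("ai_input.csv", index=False)
```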

Data Encryption

Encrypt sensitive data both at rest (when stored) and in transit (when being transferred). This ensures that data remains protected from unauthorised access at all stages, including before and after being input into AI tools.
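A minimal sketch of encryption at rest using the widely used `cryptography` package is shown below; encryption in transit is typically handled by the transfer channel itself (e.g. HTTPS/TLS). Key handling is deliberately simplified here; in practice the key must be stored securely and separately from the data.

```python
# Encrypting a dataset at rest with Fernet (symmetric encryption).
from cryptography.fernet import Fernet

key = Fernet.generate_key()                  # store securely, never beside the data
fernet = Fernet(key)

with open("ai_input.csv", "rb") as f:        # hypothetical file name
    token = fernet.encrypt(f.read())

with open("ai_input.csv.enc", "wb") as f:    # ciphertext stored at rest
    f.write(token)

plaintext = fernet.decrypt(token)            # decrypt only at the point of use
```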

Compliance with Regulations

Ensure that all data handling, storage, and processing practices fully comply with privacy laws, including the Privacy Act 1988 and any sector-specific regulations. This is crucial before any data is entered into AI tools.

Access Control

To protect sensitive data, access to it must be restricted to authorised personnel only, especially when it is used in AI tools. Combined with encryption, access controls help prevent unauthorised access or misuse. Regular security audits are essential to check compliance with access restrictions and encryption protocols.

Anonymisation & De-Identification

Before inputting data into AI models, anonymise or de-identify it to remove personally identifiable information. This reduces the risk that exposed data can be linked back to individuals.
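The sketch below illustrates one simple approach, assuming hypothetical column names: direct identifiers are dropped and a linking key is replaced with a salted hash. Real projects should follow an approved de-identification protocol rather than this minimal example.

```python
# Dropping direct identifiers and pseudonymising a linking key.
# The salt is a placeholder; keep any real salt out of version control.
import hashlib
import pandas as pd

SALT = "project-specific-secret"

def pseudonymise(value: str) -> str:
    """Salted SHA-256 digest: records stay linkable without storing names."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:16]

df = pd.read_csv("participants.csv")
df["participant_id"] = df["name"].map(pseudonymise)
df = df.drop(columns=["name", "email", "date_of_birth"])  # direct identifiers
df.to_csv("deidentified.csv", index=False)
```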

Data Integrity

Ensure that the data input into AI tools is accurate, relevant, and up-to-date. Maintaining data integrity helps prevent biases and ensures that sensitive information is used correctly and responsibly in AI-driven research. Maintaining data integrity also involves documenting the source and history of the data (data provenance) to ensure its authenticity and reliability for AI-driven analysis.
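As one way to record provenance, the following sketch writes a SHA-256 checksum and basic source metadata alongside the dataset so later analyses can verify the file is unchanged. The fields are illustrative, not a formal provenance standard.

```python
# Writing a checksum and source metadata alongside a dataset.
import hashlib
import json
from datetime import datetime, timezone

def checksum(path: str) -> str:
    """SHA-256 digest of a file, used to verify it is unchanged later."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

record = {
    "file": "deidentified.csv",            # hypothetical file name
    "sha256": checksum("deidentified.csv"),
    "source": "Survey wave 2 export",      # where the data came from
    "recorded_at": datetime.now(timezone.utc).isoformat(),
}

with open("deidentified.provenance.json", "w") as f:
    json.dump(record, f, indent=2)
```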

Consistency and reproducibility are fundamental to credible AI research. By managing AI data effectively, researchers can ensure that results are consistent across multiple iterations, regardless of time or researcher. For example, when using AI to analyse text data, keeping the structure and wording of prompts consistent allows results to be reliably reproduced.


Best Practices:
Track Prompt Versions

Keep a detailed log of each prompt version, documenting the changes made and their rationale. This process ensures that any modifications are consistent with the research goals, allows you to track how each change impacts AI output, and makes it easier to evaluate the potential effects of these changes on model performance. (A minimal logging sketch follows the list below.)

  • Ensuring consistency: By recording each version of the prompt, you maintain a clear record of how the prompt has evolved over time. This ensures that any changes made are consistent with the intended direction of the research, allowing you to reproduce the same results when needed.
  • Tracking modifications: Documenting every modification to the prompt allows you to identify exactly what was changed at each stage. You can review and track changes to understand their potential influence on the results.
  • Evaluating impact on the AI model’s output: When a prompt is modified, it can change the way the AI interprets and responds to the input. By keeping track of each prompt version, you can assess how these changes affect the AI's results. This helps you understand how small adjustments may influence the accuracy, fairness, or relevance of the AI’s output.
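A minimal sketch of such a log, written as JSON Lines so each version, its rationale, and a timestamp are preserved, might look like this; the file name and fields are assumptions, not a required format.

```python
# Appending prompt versions, with rationale and timestamp, to a JSON Lines log.
import json
from datetime import datetime, timezone

def log_prompt_version(version: int, prompt: str, rationale: str,
                       path: str = "prompt_log.jsonl") -> None:
    entry = {
        "version": version,
        "prompt": prompt,
        "rationale": rationale,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

log_prompt_version(1, "Summarise the key findings of this article.",
                   "Initial prompt")
log_prompt_version(2, "Summarise the key findings of this article in 3 bullet points.",
                   "Constrained length for consistency across papers")
```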
Document Key Details

Ensure that all relevant information, such as input data, expected outcomes, model parameters, and any pre-processing steps, is well-documented. This transparency supports clarity for future research and collaboration, and ensures that results can be replicated. (A sample metadata record follows the list below.)

  • Input data: This refers to the raw data or information that is fed into the AI system. Researchers should document where this data comes from, how it was collected, and any characteristics it has (e.g., format, size, variables) to ensure others can understand and replicate the research process.
  • Expected outcomes: This refers to the goals or results the researcher wants to achieve with the AI model. It could include specific performance metrics, predictions, or classifications the AI is expected to produce.
  • Model parameters: These are the settings or configurations of the AI model. Recording these details ensures others can reproduce the model with the same configurations, which is crucial for verifying results and comparisons.
  • Pre-processing steps: This refers to any data cleaning, transformation, or manipulation done before feeding the data into the AI model. It might include normalising data, or handling missing values.
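For example, a simple run-metadata record covering these four areas might be written as JSON. Every value below is an illustrative placeholder.

```python
# A single documented record of input data, expected outcome,
# model parameters, and pre-processing steps.
import json

run_metadata = {
    "input_data": {
        "file": "deidentified.csv",
        "collected": "Online survey, 2024",
        "rows": 1200,
    },
    "expected_outcome": "Binary classification of symptom severity",
    "model_parameters": {"model": "example-model", "temperature": 0.2},
    "preprocessing": [
        "dropped rows with missing symptom_text",
        "normalised dates to DD/MM/YYYY",
    ],
}

with open("run_metadata.json", "w") as f:
    json.dump(run_metadata, f, indent=2)
```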

Log Prompt Evolution

Keep a detailed record of how AI prompts change over time, as well as the reasons for these changes. This is especially important in AI research, where the prompt given to the AI model can significantly affect the outputs. (A short sketch for comparing prompt versions follows the list below.)

  • Tracking Changes: As you work with AI tools, the initial prompt (the question or instruction you provide to the AI) might evolve as you refine your research, address challenges, or improve the AI's performance. Keeping track of each version of the prompt (what it was, when it was updated, and how it changed) ensures that any modifications are documented.
  • Rationale Behind Changes: Along with each prompt change, you should record the rationale for making the modification. For instance, you may update the prompt to improve the AI's understanding or to make it more specific. Documenting these reasons helps others understand why certain decisions were made and ensures transparency.
  • Promotes Transparency and Accountability: By keeping a log of prompt evolution, you can provide clear evidence of how the AI's inputs have been adjusted over time. This helps other researchers, collaborators, or stakeholders trace the decision-making process and understand the logic behind changes.
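To make individual changes visible, two logged prompt versions can be compared with the standard library's difflib, as in this small sketch (the prompts are examples).

```python
# Showing exactly what changed between two logged prompt versions.
import difflib

v1 = "Summarise the key findings of this article."
v2 = "Summarise the key findings of this article in 3 bullet points."

for line in difflib.unified_diff(v1.splitlines(), v2.splitlines(),
                                 fromfile="prompt_v1", tofile="prompt_v2",
                                 lineterm=""):
    print(line)
```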

Transparency is essential in AI-driven research to ensure that methods and processes are well-documented. This facilitates reproducibility and validation.


Best Practices:
Record AI Results

Keep detailed records of AI-generated outputs, especially unexpected or unusual results, to allow for proper evaluation. Such records allow you to assess the validity and reliability of the AI model and to identify potential issues (e.g. biases or errors).

Track Changes

Document any modifications to prompts and the rationale behind them to track how these adjustments influence outcomes.

Clear Documentation

Provide clear, accessible logs that describe the purpose and context of each prompt, ensuring that the AI’s outputs align with the research objectives. This helps ensure that others can understand the research methodology and reproduce the findings.

Explainability of AI Models

Ensure that the AI models themselves are explainable, so that you can understand and explain how the AI reaches its conclusions. This enhances transparency.

Ethical AI usage is essential to prevent harm and ensure fairness. Proper data management practices help researchers identify and address potential biases in AI models, ensuring that results are ethical and equitable. Regularly assessing and adjusting AI models is crucial to identify and correct any biases. It is important to use datasets that are representative of diverse populations to minimise bias and ensure that the AI models provide equitable results. For example, in health research, AI models are often used to predict disease outcomes or recommend treatments based on patient data. If the data used to train these models is not representative of various demographic groups (e.g. age, gender, race), the model could generate biased results that favour one group over others.


Best Practices:
Bias Mitigation
  • Use diverse datasets: Ensure that datasets used to train AI models are inclusive, representing diverse populations across ethnicity, age, gender, and other demographic factors. For example, health AI models should be trained on data that reflects the full spectrum of age, gender, and racial diversity to avoid biased outcomes.
  • Regularly update datasets: As society and demographics evolve, continuously update the datasets to reflect these changes, ensuring that the AI model remains relevant and fair.
Fairness Checks
  • Cross-Demographic Testing: Test AI models across various demographic groups to ensure they perform equitably. This helps in identifying and correcting any disparities in outcomes.

Transparency
  • Explainability: Develop AI models that can explain their decision-making processes. This is crucial for understanding how conclusions are drawn and for identifying any potential biases.
Accountability
  • Engage with stakeholders, including those who might be affected by the research outcomes, to gather feedback and ensure the research is aligned with societal values.
Continuous Monitoring
  • Ethical Audits: Conduct regular ethical audits of AI research projects to ensure compliance with ethical guidelines. This includes reviewing data usage, model performance, and impact on different demographic groups.

When properly managed, AI tools streamline research workflows: by automating repetitive tasks and ensuring consistency across datasets, they allow researchers to focus on in-depth analysis and insights.


Best Practices:
Standardised Prompts

Create a consistent set of prompts to avoid redundant work and ensure consistency in data handling. Standardising prompts ensures that AI tools process information in the same way each time, which is crucial for consistency in research outcomes. For example, in large-scale literature reviews, a standardised prompt might ask AI to extract specific themes, findings, or methodologies from each article, ensuring the AI's outputs remain consistent across hundreds of papers.
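One simple way to standardise prompts is a fixed template into which each article's text is inserted, as in the hypothetical sketch below; the template wording is an example, not a recommended prompt.

```python
# One fixed template ensures every article receives identically
# structured instructions.
PROMPT_TEMPLATE = (
    "From the article below, extract: (1) main themes, (2) key findings, "
    "(3) methodology. Answer under those three headings only.\n\n"
    "Article:\n{article_text}"
)

def build_prompt(article_text: str) -> str:
    return PROMPT_TEMPLATE.format(article_text=article_text)

# The same structure is reused for every paper in the review:
prompt = build_prompt("Full text of article 1 ...")
```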

Optimise Workflows

Streamline workflows to reduce repetitive tasks, saving time and focusing effort on high-level analysis. (A combined sketch of the steps below follows the list.)

  • Automating Data Validation: Instead of manually checking every data entry for errors (e.g., missing values, incorrect formats), AI tools can automatically flag data that does not meet predefined criteria.
  • Standardising Data: Data cleaning often involves making sure that the format and structure of the data are consistent across the dataset (e.g. ensuring that dates, addresses, and names follow a consistent format). AI tools can automate this process. For example, AI can automatically convert all date formats to a uniform standard (e.g., DD/MM/YYYY) or standardise text entries (e.g., converting "male" and "M" to the same value).
  • Error Detection: Instead of manually scanning for errors, AI can flag unusual data points or discrepancies based on predefined rules or patterns, such as detecting duplicate entries.
  • Data Transformation: Data may need to be transformed into a specific format for use in AI models (e.g., converting categorical data into numerical format). For example, in a machine learning project, categorical variables might be encoded numerically (1 for "Yes" and 0 for "No").
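A combined sketch of these four steps using pandas is shown below; the column names and value mappings are hypothetical.

```python
# Validation, standardisation, error detection, and transformation in pandas.
import pandas as pd

df = pd.read_csv("raw_data.csv")

# Automated validation: flag rows with missing values.
flagged = df[df.isna().any(axis=1)]
print(f"{len(flagged)} rows contain missing values")

# Standardisation: uniform date format and consistent category labels.
df["visit_date"] = pd.to_datetime(df["visit_date"], dayfirst=True).dt.strftime("%d/%m/%Y")
df["sex"] = df["sex"].replace({"M": "male", "F": "female"})

# Error detection: remove exact duplicate entries.
df = df.drop_duplicates()

# Transformation: encode a categorical Yes/No column numerically.
df["smoker"] = df["smoker"].map({"Yes": 1, "No": 0})
```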

Both the Australian Research Council (ARC) and the National Health and Medical Research Council (NHMRC) have clear and specific requirements for data management and transparency, especially for research involving Artificial Intelligence (AI).

While the exact requirements may vary based on the specific funding scheme, the expectation is that researchers handle data responsibly, maintain transparency, and adhere to ethical standards.

For AI-based research, both agencies require a Data Management Plan (DMP) to clearly outline how data will be managed throughout the research lifecycle. The DMP should cover the following key areas:

Data Collection
  • Sources: Clearly state where the data will come from (e.g., surveys, experiments, or third-party datasets). If AI tools generate data, this should also be documented.
  • Ethical Considerations: Detail how data collection complies with ethical guidelines and privacy laws, such as the Privacy Act 1988. Consideration of ethical AI use is also critical, particularly when the data involves human participants or sensitive health information.
Data Security and Privacy
  • Ensure compliance with Australian privacy laws and regulations, such as the Privacy Act 1988. Data must be securely handled before it is input into AI systems, to prevent unauthorised access.
AI Tools and Models Documentation
  • Specify the AI models and tools being used in the research, explaining how they will be applied to the research data.
Data Processing
  • Pre-Processing: Describe the steps to clean and transform data before using it in AI models, addressing issues such as missing data.
  • Quality Control: Explain how the accuracy, completeness, and consistency of the data will be maintained throughout the research process.
Bias and Fairness
  • Bias Detection: Outline techniques for identifying bias in training data or AI outputs.
  • Bias Mitigation: Describe steps to reduce or eliminate biases to ensure fairness and prevent discriminatory outcomes in AI models.
  • Equitable Research Design: Show how AI tools and the research design will lead to fair and inclusive outcomes.
Ethical AI Use
  • Transparency: Ensure that the AI models and their decision-making processes are interpretable and clear to both researchers and anyone else involved or interested in the research.
  • Accountability: Researchers must take responsibility for how AI tools are applied and ensure they align with ethical standards and funding guidelines.


For more specific and up-to-date requirements, please check guidelines from the ARC or NHMRC:
  • Australian Research Council (ARC):

Policy on Use of Generative Artificial Intelligence in the ARC’s grants programs 

  • National Health and Medical Research Council (NHMRC):

Policy on Use of Generative Artificial Intelligence in Grant Applications and Peer Review 


Submitting a comprehensive and well-prepared Data Management Plan (DMP) is critical for securing funding from both the ARC and NHMRC. A clear DMP not only ensures transparency in how data will be handled throughout the research but also demonstrates the researcher’s commitment to maintaining high ethical standards.