This lesson is still being designed and assembled (Pre-Alpha version)

This lesson is part of The Carpentries Incubator, a place to share and use each other's Carpentries-style lessons. This lesson has not been reviewed by and is not endorsed by The Carpentries.

Introduction to Data Science and AI for senior researchers: Glossary

Key Points

Welcome to this workshop	This workshop is developed for mid-career and senior researchers in biomedical and biosciences fields. This workshop aims to build a shared understanding of data science and AI in the context of biomedicine and related fields. Without going into underlying technical details, the contents provide a general overview and present selected case studies of biomedical relevance.
Data Science, AI, and Machine Learning	There are three main types of AI: Simulations, Symbolic AI, and Machine Learning. Computers are very unlikely ever to replace human cognition and intellect. Simulation AI uses equations to run a model forwards from a given state. Symbolic AI was used to beat a chess grandmaster in the 1990s and works by calculating many eventualities in order to find the best solution. The most recent type, and the one that has made extensive progress in recent years, is Machine Learning. In the next episodes, we will consider what tasks AI can do and its role in scientific research. The ethical concerns raised by AI include bias, privacy, and accountability; they require continuous monitoring and correction tailored to the applied context. See Episode 05 for the limitations and potential biases of Machine Learning models, as well as their ethical implications.
AI for Automation	Even without working at the cutting edge of AI development, researchers can still benefit from the improved efficiency and accuracy that AI brings to research data processing. Examples of AI applications for automation include database search, image analysis, and motion tracking. While not every researcher develops AI algorithms and models, there are plenty of tools with numerous applications in research that use machine learning. Automating tasks results in larger data sets and less manual work, and it is worth joining the online communities that work on the tools showcased here.
AI for Data Insights	Very large, highly interconnected data sets can be too complex for humans to analyse manually; AI can help by scaling up data processing in various ways. While some methods are easier to understand and query than others, AI applications can enhance human analysis. Examples of AI-facilitated methodologies for data insights include classification, clustering, and identifying connections in big data sets.
Problems with AI	Machine learning (ML) should be used with scepticism to prevent biased results. We are vulnerable to unfounded claims from ML, whether they describe new results or applications to medical care. A combination of data cleaning techniques is needed to minimise limitations and ensure that an AI model generalises well. Guard against biases and privacy/security concerns by ensuring full transparency and accountability in the documentation and reporting of AI applications.
Practical Considerations for Researchers	90% of Machine Learning is data preparation. It is easy to misrepresent machine learning results. Learning to build and read confusion matrices, ROC curves, and common metrics will help you interpret most ML results.
Practical Considerations: Reporting Results	Just like any other statistical analysis, reporting ML results involves presenting metrics that assess the utility of a model. While Supervised Learning focusses on optimizing metrics such as accuracy, sensitivity and specificity, Unsupervised Learning is used to find hidden groups. Supervised learning optimizes against a ground truth, so we can use metrics such as accuracy and sensitivity, where the ideal score is 100%. Unsupervised learning is more nuanced because we don’t have a ground truth: while we optimize against the total sum of squared errors (and the many variations on this theme), we often have to rely on exploratory analyses in conjunction with our use case to decide what an optimum number of clusters would be. Common frameworks can be used to present your work and keep reporting consistent across ML-facilitated research. ML models are intended for deployment in real-world settings, so you should report their predictive power and limitations accurately.
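
The metrics named in the key points above can be computed by hand, which may help when reading other people's ML results. The following is a minimal sketch in plain Python, not part of the lesson material: the function names (`confusion_counts`, `summary_metrics`, `within_cluster_sse`) and all of the data are made up for illustration. It builds a binary confusion matrix, derives accuracy, sensitivity and specificity from it, and computes the total sum of squared errors that clustering methods typically try to minimise.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four cells of a binary confusion matrix."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # true positive
        elif pred == positive:
            fp += 1  # false positive
        elif truth == positive:
            fn += 1  # false negative
        else:
            tn += 1  # true negative
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def summary_metrics(counts):
    """Derive accuracy, sensitivity and specificity from the counts."""
    tp, fp, tn, fn = (counts[k] for k in ("tp", "fp", "tn", "fn"))
    total = tp + fp + tn + fn
    nan = float("nan")  # returned when a metric is undefined
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


def within_cluster_sse(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    sse = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members)
                    for d in range(dim)]
        for p in members:
            sse += sum((p[d] - centroid[d]) ** 2 for d in range(dim))
    return sse


# Supervised example: made-up ground truth and predictions (1 = positive).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
counts = confusion_counts(y_true, y_pred)
metrics = summary_metrics(counts)
print(counts)
print(metrics)

# Unsupervised example: four made-up 2-D points, scored as one cluster
# and as two clusters. A lower total means tighter clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
sse_one_cluster = within_cluster_sse(points, [0, 0, 0, 0])
sse_two_clusters = within_cluster_sse(points, [0, 0, 1, 1])
print(sse_one_cluster, sse_two_clusters)
```

In this toy data the classifier reaches 70% accuracy, yet sensitivity and specificity differ, which is one reason a single metric can misrepresent a model. The squared-error total always falls as more clusters are added, so it cannot choose the number of clusters on its own; that choice still relies on exploratory analysis and the use case.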

Glossary

FIXME