Applied Data Science

Roger French
rxf131@case.edu

Applied Data Science (ADS)

Administered by the Department of Materials Science and Engineering

312 White Building (7204)
http://engineering.case.edu/emse/
Phone: 216.368.4230, Fax: 216.368.3209
Roger French, EMSE / CSE Faculty Director (ADS)

The Applied Data Science program, based in the Case School of Engineering, includes faculty from schools across the university and provides courses in applied data science for undergraduates and graduate students from across the schools of the university.  The Applied Data Science program is directed to undergraduate and graduate students studying in the domains of Engineering and Physical Sciences (including Engineering, Energy and Manufacturing, Astronomy, Linguistics, Geology, Physics, and Chemistry), Health (including Translational and Clinical), Business (including Finance, Marketing, and Economics), and Social Sciences.

Successful completion of the Undergraduate Minor in Applied Data Science requirements leads to a "Minor in Applied Data Science" for the graduating student. The minor indicates that the student has developed knowledge of the essential elements of Data Science and Analytics in the area of their major (their domain of expertise).

Additionally, the Applied Data Science courses are offered as DSCI 4xx graduate-level classes, in which graduate students do additional work on a semester project related to their domain area or thesis research topic.

Minor in Applied Data Science (ADS)

Elements of the Minor:

The undergraduate minor is structured so that the students who qualify for the minor have a working understanding of the basic ADS tools and their application in their domain area. This includes: 

  • Formulation of Data Science analyses of real-world datasets to answer critical questions in various domain and application areas;

  • Data Management: datastores, sources, streams; 

  • High Performance and Distributed Computing: local computer, high performance computing clusters, distributed computing (such as Hadoop), or other cloud computing environments;

  • Informatics, Ontology, Query: including search, data assembly, annotation; 

  • Statistical Analytics: tools such as R statistics and high-level scripting languages (such as Python3); and

  • Machine Learning and Deep Learning: Machine learning approaches such as support vector machines, or neural networks, and deep learning frameworks such as Keras and TensorFlow2.

The data types found in these domains are diverse. They include time series and spectral data for Energy, Physics, Chemistry and Astronomy, and sensor, production, image, and volumetric data for Manufacturing. In Health, Translational ADS includes Genomic, Proteomic, and other Omics data, while Clinical ADS includes patient data, medical data, physiological time series, and mobile data. Social Sciences data include natural language datasets, both written and oral. Business data types include stock and other financial market data for Finance, time series and cross-section data for Economics, and operations and consumer behavior data for Marketing.

Students will develop comprehensive experience in the steps of data analysis:

  • Define the Applied Data Science questions.

  • Identify, locate, and/or generate the necessary data, including defining the ideal data set and variables of interest, determining and obtaining accessible data and cleaning the data in preparation for analysis.

  • Exploratory data analysis to start identifying the significant characteristics of the data and information it contains.

  • Statistical modeling, inference and prediction, including interpretation of results, challenging results, and developing insights and actions.

  • Machine learning, deep learning and approaches to data visualization, images, natural language and artificial intelligence implementations.

  • Synthesizing the results in the context of the domain and the initial questions, and writing this up.

  • The creation of reproducible research, including code, datasets, documentation, and reports, which are easily transferable and verifiable.

  • Communicating data science results in context, with consideration of privacy, openness, security, ethics, and value.
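As a rough illustration only, the steps above can be sketched in a few lines of Python using the standard library. This is a minimal sketch under stated assumptions, not course material: the question, the synthetic dataset, and the helper name fit_line are all hypothetical and stand in for a real domain problem and real data.

```python
"""Minimal sketch of the data analysis steps (all names and data are hypothetical)."""
import random
from statistics import mean

# Step 1: define the question (assumed for illustration): does y increase with x?

# Step 2: generate the data. Synthetic values stand in for a real dataset.
random.seed(0)  # fixed seed so the analysis can be reproduced
raw = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.5)) for x in range(20)]
raw.append((5, None))  # one deliberately missing measurement

# Step 2, continued: clean the data by dropping incomplete records.
clean = [(x, y) for x, y in raw if y is not None]

# Step 3: exploratory data analysis, here just simple summary statistics.
xs = [x for x, _ in clean]
ys = [y for _, y in clean]
summary = {"n": len(clean), "mean_x": mean(xs), "mean_y": mean(ys)}


# Step 4: statistical modeling, an ordinary least squares fit of a line.
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(xs, ys)

# Step 5: synthesize the result in the context of the initial question.
print(f"Records analyzed: {summary['n']}")
print(f"Estimated change in y per unit x: {slope:.2f} (intercept {intercept:.2f})")
```

Keeping the random seed, the cleaning rule, and the model fit in one script is what makes the result easy for someone else to rerun and verify, which is the point of the reproducible research step.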

The ADS minor curriculum

The undergraduate minor curriculum is based on five 3-credit courses, with one class chosen from each of Levels 1 through 5, which together cover the spectrum of learning needed to achieve domain area expertise in data science and analytics. The courses are chosen to be both cross-cutting, i.e., intermixing students from across the university in fundamental concepts such as scripting and statistics (Levels 1, 2, and 4), and domain-focused (Levels 3 and 5). The Level 5 advanced topics course will be either a 3-credit semester research project approved by the minor advisor or an advanced data science topic class. This will provide minor students with both the domain-focused learning they need and a broadening perspective on applications, methods, and uses of ADS in other domains.

Courses Counted Toward Minor Requirements

Established courses included in the Minor are found in the Case School of Engineering (Materials Science, Electrical Engineering and Computer Science, Manufacturing), the College of Arts and Sciences (Mathematics, Astronomy, Philosophy, Cognitive Science), the School of Medicine, the School of Nursing, the Weatherhead School of Management (Marketing, Finance, Operations, and Economics), and the Mandel School for Social Sciences.

The courses that meet the requirements for the Minor can also be taken by students to meet requirements in Major programs, and therefore serve a dual purpose in our academic offerings. However, each program, department, and school may have its own criteria on whether a given course could be "double counted" towards major and minor requirements. 

Level 5:
DSCI 352/352M/452. Applied Data Science Research. 3 Units.
DSCI 330. Cognition and Computation. 3 Units.
DSCI 332/432. Spatial Statistics for Near Surface, Surface, and Subsurface Modeling. 3 Units.
SYBB 387. Undergraduate Research in Systems Biology. 1 - 3 Units.
Level 4:
DSCI 353/353M/453. Data Science: Statistical Learning, Modeling and Prediction. 3 Units.
ASTR 306. Astronomical Techniques. 3 Units.
BAFI 361. Empirical Analysis in Finance. 3 Units.
MKMR 308. Measuring Marketing Performance. 3 Units.
MKMR 310. Marketing Analytics. 3 Units.
ECON 327. Advanced Econometrics. 3 Units.
SYBB 459. Bioinformatics for Systems Biology. 3 Units.
SYBB 421. Fundamentals of Clinical Information Systems. 3 Units.
SYBB 311A/311B/311C. Survey of Bioinformatics: Technologies in Bioinformatics. 1 Unit.
Level 3:
DSCI 351/351M/451. Exploratory Data Science. 3 Units.
SYBB 412. Survey of Bioinformatics: Programming for Bioinformatics. 3 Units.
Level 2:
STAT 312R. Basic Statistics for Engineering and Science Using R Programming. 3 Units.
PQHS 431. Statistical Methods I. 3 Units.
STAT 201R. Basic Statistics for Social and Life Sciences Using R Programming. 3 Units.
OPRE 207. Statistics for Business and Management Science I. 3 Units.
Level 1:
ENGR 131. Elementary Computer Programming. 3 Units.
CSDS 132. Introduction to Programming in Java. 3 Units.
DESN 210. Introduction to Programming for Business Applications. 3 Units.

Courses

DSCI 134. Introduction to Applied Data Science. 3 Units.

This course is an introduction to data science and analytics. In the first half of the course, students will develop a basic understanding of how to manipulate, analyze and visualize large data in a distributed computing environment, with an appreciation of open source development, security and privacy issues. In the second half of the course, students will gain experience in data manipulation and analysis using scripted programming languages such as Python.

DSCI 330. Cognition and Computation. 3 Units.

An introduction to (1) theories of the relationship between cognition and computation; (2) computational models of human cognition (e.g. models of decision-making or concept creation); and (3) computational tools for the study of human cognition. All three dimensions involve data science: theories are tested against archives of brain imaging data; models are derived from and tested against datasets of e.g., financial decisions (markets), legal rulings and findings (juries, judges, courts), legislative actions, and healthcare decisions; computational tools aggregate data and operate upon it analytically, for search, recognition, tagging, machine learning, statistical description, and hypothesis testing. Offered as COGS 330, COGS 430, DSCI 330 and DSCI 430.

DSCI 332. Spatial Statistics for Near Surface, Surface, and Subsurface Modeling. 3 Units.

This course is on spatial modeling of near surface, surface, and subsurface data, also known as geostatistical modeling. Spatial modeling has its origins in predictive modeling of minerals in subsurface formations, from which many examples are used in this class. Students will learn the basics of spatial models in order to understand how they are built from various data types, how their uncertainties are assessed, and how risk is reduced. Students will be expected to learn the rudimentary navigation of RStudio, execute pre-written, publicly available R code (provided), and make simple modifications. Graduate students will be expected to learn the above and develop a 10-week modeling project focused on the use of spatial modeling methods with R using data relevant to their specific discipline or interest. These projects will include preparing datasets to be executed in R code scripts. Resulting scripts will be placed in a git repository for use by other students as open source resources along with documentation demonstrating the reproducible spatial modeling science and analyses for these problems. Geostatistical (spatial) mapping is applicable across many disciplines. Examples of graduate projects from previous classes include subsurface modeling (geology), earthquake mapping (geophysics/civil engineering), soil stability modeling (civil engineering), aquifer characterization (hydrology), and pollution/contaminant mapping (environmental studies/medicine). Offered as DSCI 332 and DSCI 432.

DSCI 351. Exploratory Data Science. 3 Units.

In this course, we will learn data science and analysis approaches to identify statistically significant relationships and better model and predict the behavior of complex systems. We will assemble and explore real-world datasets, perform clustering and pair plot analyses to investigate correlations, and employ logistic regression to develop associated predictive models. Results will be interpreted, visualized and discussed. We will introduce basic elements of statistical analysis using R Project open source software for exploratory data analysis and model development. R is an open-source software project with broad abilities to access machine-readable open-data resources, data cleaning and munging functions, and a rich selection of statistical packages, used for data analytics, model development and prediction. This will include an introduction to R data types, reading and writing data, looping, plotting and regular expressions, so that one can start performing variable transformations for linear fitting and developing structural equation models, while exploring for statistically significant relationships. The M section of DSCI 351 is for students focusing on Materials Data Science. Offered as DSCI 351, DSCI 351M and DSCI 451. Prereq: (ENGR 131 or EECS 132 or DSCI 134) and (STAT 312R or STAT 201R or SYBB 310 or PQHS/EPBI 431).

DSCI 351M. Exploratory Data Science. 3 Units.

In this course, we will learn data science and analysis approaches to identify statistically significant relationships and better model and predict the behavior of complex systems. We will assemble and explore real-world datasets, perform clustering and pair plot analyses to investigate correlations, and employ logistic regression to develop associated predictive models. Results will be interpreted, visualized and discussed. We will introduce basic elements of statistical analysis using R Project open source software for exploratory data analysis and model development. R is an open-source software project with broad abilities to access machine-readable open-data resources, data cleaning and munging functions, and a rich selection of statistical packages, used for data analytics, model development and prediction. This will include an introduction to R data types, reading and writing data, looping, plotting and regular expressions, so that one can start performing variable transformations for linear fitting and developing structural equation models, while exploring for statistically significant relationships. The M section of DSCI 351 is for students focusing on Materials Data Science. Offered as DSCI 351, DSCI 351M and DSCI 451. Prereq: (ENGR 131 or EECS 132 or DSCI 134) and (STAT 312R or STAT 201R or SYBB 310 or PQHS/EPBI 431).

DSCI 352. Applied Data Science Research. 3 Units.

This is a project-based data science research class in which project teams identify a research project under the guidance of a domain expert professor. The research is structured as a data analysis project following the 6 steps of developing a reproducible data science project: 1: Define the ADS question; 2: Identify, locate, and/or generate the data; 3: Exploratory data analysis; 4: Statistical modeling and prediction; 5: Synthesizing the results in the domain context; 6: Creation of reproducible research, including code, datasets, documentation and reports. During the course, special topic lectures will include Ethics, Privacy, Openness, Security, and Value. The M section of DSCI 352 is for students focusing on Materials Data Science. Offered as DSCI 352, DSCI 352M and DSCI 452. Prereq: (DSCI 133 or DSCI 134 or ENGR 131 or EECS 132) and (STAT 312R or STAT 201R or SYBB 310 or PQHS/EPBI 431 or OPRE 207) and (DSCI 351 or (SYBB 311A and SYBB 311B and SYBB 311C and SYBB 311D) or SYBB 321 or MKMR 201).

DSCI 352M. Applied Data Science Research. 3 Units.

This is a project-based data science research class in which project teams identify a research project under the guidance of a domain expert professor. The research is structured as a data analysis project following the 6 steps of developing a reproducible data science project: 1: Define the ADS question; 2: Identify, locate, and/or generate the data; 3: Exploratory data analysis; 4: Statistical modeling and prediction; 5: Synthesizing the results in the domain context; 6: Creation of reproducible research, including code, datasets, documentation and reports. During the course, special topic lectures will include Ethics, Privacy, Openness, Security, and Value. The M section of DSCI 352 is for students focusing on Materials Data Science. Offered as DSCI 352, DSCI 352M and DSCI 452. Prereq: (DSCI 133 or DSCI 134 or ENGR 131 or EECS 132) and (STAT 312R or STAT 201R or SYBB 310 or PQHS/EPBI 431 or OPRE 207) and (DSCI 351 or (SYBB 311A and SYBB 311B and SYBB 311C and SYBB 311D) or SYBB 321 or MKMR 201).

DSCI 353. Data Science: Statistical Learning, Modeling and Prediction. 3 Units.

In this course, we will use an open data science tool chain to develop reproducible data analyses useful for inference, modeling and prediction of the behavior of complex systems. In addition to the standard data cleaning, assembly and exploratory data analysis steps essential to all data analyses, we will identify statistically significant relationships from datasets derived from population samples, and infer the reliability of these findings. We will use regression methods to model a number of both real-world and lab-based systems, producing predictive models applicable in comparable populations. We will assemble and explore real-world datasets, use pair-wise plots to explore correlations, perform clustering, self-similarity, and logistic regression analyses, and develop both fixed-effect and mixed-effect predictive models. We will introduce machine-learning approaches for classification and tree-based methods. Results will be interpreted, visualized and discussed. We will introduce the basic elements of data science and analytics using R Project open source software. R is an open-source software project with broad abilities to access machine-readable open-data resources, data cleaning and assembly functions, and a rich selection of statistical packages, used for data analytics, model development, prediction, inference and clustering. With this background, it becomes possible to start performing variable transformations for linear regression fitting and developing structural equation models, fixed-effects and mixed-effects models along with other statistical learning techniques, while exploring for statistically significant relationships. The class will be structured to have a balance of theory and practice. We will split class into Foundation and Practicum: a) Foundation: lectures, presentations, discussion; b) Practicum: coding, demonstrations and hands-on data science work. The M section of DSCI 353 is for students focusing on Materials Data Science.
Offered as DSCI 353, DSCI 353M and DSCI 453.

DSCI 353M. Data Science: Statistical Learning, Modeling and Prediction. 3 Units.

In this course, we will use an open data science tool chain to develop reproducible data analyses useful for inference, modeling and prediction of the behavior of complex systems. In addition to the standard data cleaning, assembly and exploratory data analysis steps essential to all data analyses, we will identify statistically significant relationships from datasets derived from population samples, and infer the reliability of these findings. We will use regression methods to model a number of both real-world and lab-based systems, producing predictive models applicable in comparable populations. We will assemble and explore real-world datasets, use pair-wise plots to explore correlations, perform clustering, self-similarity, and logistic regression analyses, and develop both fixed-effect and mixed-effect predictive models. We will introduce machine-learning approaches for classification and tree-based methods. Results will be interpreted, visualized and discussed. We will introduce the basic elements of data science and analytics using R Project open source software. R is an open-source software project with broad abilities to access machine-readable open-data resources, data cleaning and assembly functions, and a rich selection of statistical packages, used for data analytics, model development, prediction, inference and clustering. With this background, it becomes possible to start performing variable transformations for linear regression fitting and developing structural equation models, fixed-effects and mixed-effects models along with other statistical learning techniques, while exploring for statistically significant relationships. The class will be structured to have a balance of theory and practice. We will split class into Foundation and Practicum: a) Foundation: lectures, presentations, discussion; b) Practicum: coding, demonstrations and hands-on data science work. The M section of DSCI 353 is for students focusing on Materials Data Science.
Offered as DSCI 353, DSCI 353M and DSCI 453.

DSCI 354. Data Visualization and Analytics. 3 Units.

Data Visualization and Analytics students will learn data visualization and analytics techniques focused on different types of data such as time-series, spectral, or image data science problems. This class will focus on increasing analysis of complex data sets through visualization by enhancing exploratory data analysis and data cleaning. This class will focus on creating effective data visualizations to communicate data analytics results to different audiences. Different datasets will be provided to develop different types of visualizations and analytics. Types of data visualizations include interactive plots (e.g., bar graphs that change over time), applications that allow users to adjust the visualizations based on their decisions (e.g., Shiny applications), interactive maps, 3-D plots of data, etc. How an audience understands information and brings in data, as well as the ethics of making data visualizations, will be discussed. The class will also include ways to increase modeling and analysis with effective visualizations for credible, data-driven decision making. This will include a git repository for other students to use this code as open source resources and the preparation of reproducible data science analyses for different types of problems. Offered as DSCI 354, DSCI 354M, and DSCI 454. Prereq: (DSCI 351 or DSCI 351M) and (DSCI 353 or DSCI 353M).

DSCI 354M. Data Visualization and Analytics. 3 Units.

Data Visualization and Analytics students will learn data visualization and analytics techniques focused on different types of data such as time-series, spectral, or image data science problems. This class will focus on increasing analysis of complex data sets through visualization by enhancing exploratory data analysis and data cleaning. This class will focus on creating effective data visualizations to communicate data analytics results to different audiences. Different datasets will be provided to develop different types of visualizations and analytics. Types of data visualizations include interactive plots (e.g., bar graphs that change over time), applications that allow users to adjust the visualizations based on their decisions (e.g., Shiny applications), interactive maps, 3-D plots of data, etc. How an audience understands information and brings in data, as well as the ethics of making data visualizations, will be discussed. The class will also include ways to increase modeling and analysis with effective visualizations for credible, data-driven decision making. This will include a git repository for other students to use this code as open source resources and the preparation of reproducible data science analyses for different types of problems. Offered as DSCI 354, DSCI 354M, and DSCI 454. Prereq: (DSCI 351 or DSCI 351M) and (DSCI 353 or DSCI 353M).

DSCI 430. Cognition and Computation. 3 Units.

An introduction to (1) theories of the relationship between cognition and computation; (2) computational models of human cognition (e.g. models of decision-making or concept creation); and (3) computational tools for the study of human cognition. All three dimensions involve data science: theories are tested against archives of brain imaging data; models are derived from and tested against datasets of e.g., financial decisions (markets), legal rulings and findings (juries, judges, courts), legislative actions, and healthcare decisions; computational tools aggregate data and operate upon it analytically, for search, recognition, tagging, machine learning, statistical description, and hypothesis testing. Offered as COGS 330, COGS 430, DSCI 330 and DSCI 430.

DSCI 432. Spatial Statistics for Near Surface, Surface, and Subsurface Modeling. 3 Units.

This course is on spatial modeling of near surface, surface, and subsurface data, also known as geostatistical modeling. Spatial modeling has its origins in predictive modeling of minerals in subsurface formations, from which many examples are used in this class. Students will learn the basics of spatial models in order to understand how they are built from various data types, how their uncertainties are assessed, and how risk is reduced. Students will be expected to learn the rudimentary navigation of RStudio, execute pre-written, publicly available R code (provided), and make simple modifications. Graduate students will be expected to learn the above and develop a 10-week modeling project focused on the use of spatial modeling methods with R using data relevant to their specific discipline or interest. These projects will include preparing datasets to be executed in R code scripts. Resulting scripts will be placed in a git repository for use by other students as open source resources along with documentation demonstrating the reproducible spatial modeling science and analyses for these problems. Geostatistical (spatial) mapping is applicable across many disciplines. Examples of graduate projects from previous classes include subsurface modeling (geology), earthquake mapping (geophysics/civil engineering), soil stability modeling (civil engineering), aquifer characterization (hydrology), and pollution/contaminant mapping (environmental studies/medicine). Offered as DSCI 332 and DSCI 432.

DSCI 451. Exploratory Data Science. 3 Units.

In this course, we will learn data science and analysis approaches to identify statistically significant relationships and better model and predict the behavior of complex systems. We will assemble and explore real-world datasets, perform clustering and pair plot analyses to investigate correlations, and employ logistic regression to develop associated predictive models. Results will be interpreted, visualized and discussed. We will introduce basic elements of statistical analysis using R Project open source software for exploratory data analysis and model development. R is an open-source software project with broad abilities to access machine-readable open-data resources, data cleaning and munging functions, and a rich selection of statistical packages, used for data analytics, model development and prediction. This will include an introduction to R data types, reading and writing data, looping, plotting and regular expressions, so that one can start performing variable transformations for linear fitting and developing structural equation models, while exploring for statistically significant relationships. The M section of DSCI 351 is for students focusing on Materials Data Science. Offered as DSCI 351, DSCI 351M and DSCI 451.

DSCI 452. Applied Data Science Research. 3 Units.

This is a project-based data science research class in which project teams identify a research project under the guidance of a domain expert professor. The research is structured as a data analysis project following the 6 steps of developing a reproducible data science project: 1: Define the ADS question; 2: Identify, locate, and/or generate the data; 3: Exploratory data analysis; 4: Statistical modeling and prediction; 5: Synthesizing the results in the domain context; 6: Creation of reproducible research, including code, datasets, documentation and reports. During the course, special topic lectures will include Ethics, Privacy, Openness, Security, and Value. The M section of DSCI 352 is for students focusing on Materials Data Science. Offered as DSCI 352, DSCI 352M and DSCI 452.

DSCI 453. Data Science: Statistical Learning, Modeling and Prediction. 3 Units.

In this course, we will use an open data science tool chain to develop reproducible data analyses useful for inference, modeling and prediction of the behavior of complex systems. In addition to the standard data cleaning, assembly and exploratory data analysis steps essential to all data analyses, we will identify statistically significant relationships from datasets derived from population samples, and infer the reliability of these findings. We will use regression methods to model a number of both real-world and lab-based systems, producing predictive models applicable in comparable populations. We will assemble and explore real-world datasets, use pair-wise plots to explore correlations, perform clustering, self-similarity, and logistic regression analyses, and develop both fixed-effect and mixed-effect predictive models. We will introduce machine-learning approaches for classification and tree-based methods. Results will be interpreted, visualized and discussed. We will introduce the basic elements of data science and analytics using R Project open source software. R is an open-source software project with broad abilities to access machine-readable open-data resources, data cleaning and assembly functions, and a rich selection of statistical packages, used for data analytics, model development, prediction, inference and clustering. With this background, it becomes possible to start performing variable transformations for linear regression fitting and developing structural equation models, fixed-effects and mixed-effects models along with other statistical learning techniques, while exploring for statistically significant relationships. The class will be structured to have a balance of theory and practice. We will split class into Foundation and Practicum: a) Foundation: lectures, presentations, discussion; b) Practicum: coding, demonstrations and hands-on data science work. The M section of DSCI 353 is for students focusing on Materials Data Science.
Offered as DSCI 353, DSCI 353M and DSCI 453.

DSCI 454. Data Visualization and Analytics. 3 Units.

Data Visualization and Analytics students will learn data visualization and analytics techniques focused on different types of data such as time-series, spectral, or image data science problems. This class will focus on increasing analysis of complex data sets through visualization by enhancing exploratory data analysis and data cleaning. This class will focus on creating effective data visualizations to communicate data analytics results to different audiences. Different datasets will be provided to develop different types of visualizations and analytics. Types of data visualizations include interactive plots (e.g., bar graphs that change over time), applications that allow users to adjust the visualizations based on their decisions (e.g., Shiny applications), interactive maps, 3-D plots of data, etc. How an audience understands information and brings in data, as well as the ethics of making data visualizations, will be discussed. The class will also include ways to increase modeling and analysis with effective visualizations for credible, data-driven decision making. This will include a git repository for other students to use this code as open source resources and the preparation of reproducible data science analyses for different types of problems. Offered as DSCI 354, DSCI 354M, and DSCI 454. Prereq: DSCI 451 and DSCI 453.