====== GE461: Introduction to Data Science - Spring 2025 ======

Introduction to data science fundamentals, techniques and applications; data collection, preparation, storage and querying; parametric models for data; models and methods for fitting, analysis, evaluation, and validation; dimensionality reduction, visualization; various learning methods, classifiers, clustering, data and text mining; applications in diverse domains such as business, medicine, social networks, computer vision; breadth knowledge on topics and hands-on experience through projects and computer assignments. [[https://stars.bilkent.edu.tr/syllabus/view/GE/461/|STARS Syllabus]]

**Prerequisites**: (CS 101 or CS 114 or CS 115) and (MATH 230 or MATH 255 or MATH 260) and (MATH 225 or MATH 241 or MATH 220)\\
**Credits**: 3

**Course Management Systems:** [[https://moodle.bilkent.edu.tr/2024-2025-spring/course/view.php?id=341|Moodle]]\\
**Course Website:** http://www.cs.bilkent.edu.tr/~ge461/2025Spring

**Instructor Team**
  * S. Aksoy, C. Alkan, S. Arashloo, O. Arıkan, F. Can, E. Çiçek, T. Çukur, H. Dibeklioğlu, İ. Körpeoğlu, C. Tekin, E. Tüzün
  * Course Coordinator (contact point): S. Aksoy (saksoy AT cs.bilkent.edu.tr)

**TAs**
  * Farzad Hallaji Azad (farzad.hallaji AT bilkent.edu.tr)
  * Batuhan Uykulu (batuhan.uykulu AT bilkent.edu.tr)

**Classroom and Hours**
  * Classroom: **EB-101**
  * Class hours:
    * Mon 13:30-15:20
    * Thu 08:30-10:20

**Grading Policy**
  * Final: 40%
  * Projects: 60%. Multiple computer/programming/exercise assignments of various sizes.
  * There will be 5 projects. **Each project is 12%**.

**Attendance**
  * Attendance is mandatory. A student who misses **more than 9 hours** will fail the course automatically.

**Exam**
  * TBD

**Projects**
  * Multiple computer/programming/exercise assignments of various sizes.
  * A project can be assigned earlier than the indicated date on the weekly plan.
  * Projects can be individual or group based. Instructors will decide.
  * Projects will be uploaded to Moodle.
  * Programming languages like Python, Java, R or Matlab can be used in the projects.
  * Gaining hands-on experience and experimenting will be important. Real-world data sets can be used (economic/financial data sets, medical/biological data sets, image/video data sets, social network data sets, IT data sets, etc.).

**Other**
  * Grades will be posted in SAPS.
  * There is **no mandatory textbook** for the course.

----

==== Week 1 (Jan 27, Jan 30) ====

**Introduction; what is data science; data science applications.** [Çiçek, Tüzün] \\
Topic Details: Introductory concepts in data science and applications. Overview of the data science process.\\
Slides and Additional Material:\\
{{ :ge461-lecture1-course_information-spring-2025.pdf |}}\\
Topic Details: Software engineering applications.\\
Slides and Additional Material:\\
Project/Exercise-Problem-Set/Homework: None this week.\\
References: \\
Events: \\

==== Week 2 (Feb 3, Feb 6) ====

**Data science applications; data science pipeline.** [Alkan, Dibeklioğlu] \\
Topic Details: Genomics applications.\\
Slides and Additional Material:\\
Topic Details: Computer vision applications.\\
Slides and Additional Material: {{ :ge461_applications_vision_2025s.pdf |}}\\
Project/Exercise-Problem-Set/Homework: None this week.\\
References: [[https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195|"Big Data: Astronomical or Genomical?"]], Stephens et al., 2015\\
Events: \\

==== Week 3 (Feb 10, Feb 13) ====

**Data representation; preprocessing; preparation; crowdsourcing.** [Arashloo, Çiçek] \\
Topic Details: Normalization, Noise Removal (Filtering), Anomaly Detection, Data Compression, Noise Removal (ICA).\\
Slides and Additional Material: {{ :Data Pre-processing.pdf |}}\\
Topic Details: Crowdsourcing applications and usage in data science.\\
Slides and Additional Material: {{ :ge_461-lecture_6-_crowdsourcing.pdf |}}\\
Project/Exercise-Problem-Set/Homework: None this week.\\
References:\\
Events: \\

==== Week 4 (Feb 17, Feb 20) ====

**Data collection; storage; querying; SQL, NoSQL; cloud; distributed storage and computing.** [Körpeoğlu] \\
Topic Details: RDBMS, SQL; SQLite, Pandas; NoSQL; MapReduce and Hadoop; Spark.\\
Slides and Additional Material: {{:slides.pdf | data_storage_and_access.pdf}}\\
Project/Exercise-Problem-Set/Homework: None this week.\\
References: [[https://www.sqlite.org/index.html|SQLite]], [[https://pandas.pydata.org/docs/user_guide/index.html|Pandas]], [[https://en.wikipedia.org/wiki/MapReduce|MapReduce]], [[https://hadoop.apache.org/|ApacheHadoop]], [[https://spark.apache.org/|ApacheSpark]]\\
Events: \\

==== Week 5 (Feb 24, Feb 27) ====

**Basic models; parametric models; fitting.** [Arıkan] \\
Topic Details: Multiparameter Linear Regression\\
Slides and Additional Material: {{ :ch3_linear_regression.pdf |}} \\
Project: Solve the following questions using linear regression: Exercises 3.7.8 and 3.7.9 in the ISLR reference book given below. \\
References: An Introduction to Statistical Learning with Applications in Python, R, Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani and Jonathan Taylor. \\
Events: \\

==== Week 6 (Mar 3, Mar 6) ====

**Application** [Arıkan] \\
Topic Details: Model Selection in Multiparameter Regression \\
Slides: {{ :Ch6_Model_Selection.pdf |}}\\
Project/Exercise-Problem-Set/Homework: None this week.\\
References: An Introduction to Statistical Learning with Applications in Python, R, Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani and Jonathan Taylor.
\\
Events: \\

==== Week 7 (Mar 10, Mar 13) ====

**Spring Break**

==== Week 8 (Mar 17, Mar 20) ====

**Dimensionality reduction; visualization.** [Aksoy] \\
Topic Details: Feature reduction, feature selection, high-dimensional data visualization.\\
Slides and Additional Material: {{ :ge461_dimensionality.pdf |Dimensionality slides}}, {{ :knaw_t-sne_talk.pptx |t-SNE slides}}\\
Project/Exercise-Problem-Set/Homework: [{{ :ge461_project_dimensionality.pdf |Project}} ({{ :fashion_mnist.zip |data}})] (due 23:59 on April 7, 2025)\\
References: [[https://www.mathworks.com/help/stats/dimensionality-reduction.html|Matlab: dimensionality reduction]], [[https://scikit-learn.org/stable/modules/decomposition.html|Scikit-learn: decomposition]], [[https://scikit-learn.org/stable/auto_examples/index.html#decomposition|Scikit-learn: decomposition examples]], [[https://scikit-learn.org/stable/modules/manifold.html|Scikit-learn: manifold learning]], [[https://www.mathworks.com/discovery/data-visualization.html|Matlab: data visualization]], [[https://matplotlib.org/|Matplotlib: data visualization]], [[https://lvdmaaten.github.io/tsne/|t-SNE]]\\
Events: \\

==== Week 9 (Mar 24, Mar 27) ====

**Unsupervised learning, clustering.** [Aksoy] \\
Topic Details: K-means clustering, mixture models, hierarchical clustering.\\
Slides and Additional Material: {{ :ge461_clustering.pdf |Clustering slides}}\\
Project/Exercise-Problem-Set/Homework: \\
References: [[https://www.mathworks.com/help/stats/cluster-analysis.html|Matlab: cluster analysis]], [[https://scikit-learn.org/stable/modules/clustering.html|Scikit-learn: clustering]], [[https://scikit-learn.org/stable/auto_examples/index.html#clustering|Scikit-learn: clustering examples]]\\
Events: \\

==== Week 10 (Mar 31, Apr 3) ====

**Ramadan Holiday**

==== Week 11 (Apr 7, Apr 10) ====

**Machine learning; supervised learning; classifiers; deep learning.** [Dibeklioğlu]\\
Topic Details: Bayesian decision theory, linear discriminants, introduction to neural networks, support vector machines, decision trees.\\
Slides and Additional Material:\\
Project/Exercise-Problem-Set/Homework: \\
References: \\
Events:\\

==== Week 12 (Apr 14, Apr 17) ====

**Machine learning; supervised learning; classifiers; deep learning.** [Dibeklioğlu] \\
Topic Details: Activation functions, convolutional neural networks, recurrent architectures.\\
Slides and Additional Material:\\
Project/Exercise-Problem-Set/Homework:\\
References: \\
Events: \\

==== Week 13 (Apr 21, Apr 24) ====

**Machine learning in healthcare.** [Çukur] \\
Topic Details: Healthcare analytics: diagnostics, medical imaging, in-patient care, hospital management, risk analytics, wearables. Deep learning architectures for medical applications. \\
Slides and Additional Material: {{ ::ge461_ml_in_healthcare.pdf |}} \\
Project/Exercise-Problem-Set/Homework: {{ ::ge461_pw13_description.pdf |}}; {{ ::ge461_pw13_data.zip |}} (due date: 11 May 2025, 17:00)\\
References: Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, Ch. 11 and 14; Mead, Analog VLSI and Neural Systems, Ch. 4; Bishop, Pattern Recognition and Machine Learning, Ch. 5\\
Events: National Sovereignty and Children's Day (Apr 23)\\

==== Week 14 (Apr 28, May 1) ====

**Data mining; online data stream classification; applications.** [Can] \\
Topic Details: Concept drift, ensemble-based classification, text mining. \\
Slides and Additional Material: {{ :ge461_datastreamminingspring25.pdf |}}\\
Project Tentative Dates: Announcement **April 28** or earlier, Due date: **May 18, 23:59.** \\
Project/Exercise-Problem-Set/Homework:\\
References: \\
Events: Labor and Solidarity Day (May 1)\\

==== Week 15 (May 5, May 8) ====

**Reinforcement learning; applications.
** [Tekin] \\
Topic Details: Applications of Reinforcement Learning, Markov Decision Processes, Value Iteration, Q-Learning\\
Slides and Additional Material:\\
Project/Exercise-Problem-Set/Homework: \\
References: \\
Events: \\

==== Week 16 (May 12) ====

**No class**

==== Textbooks ====

  * [[https://www.textbook.ds100.org/intro|Principles and Techniques of Data Science - Online]]
  * [[http://shop.oreilly.com/product/0636920023784.do|Python for Data Analysis, by Wes McKinney]]
  * [[http://shop.oreilly.com/product/0636920028529.do|Doing Data Science, by Cathy O’Neil and Rachel Schutt. O’Reilly. 2014.]]
  * [[https://www.oreilly.com/library/view/data-science-from/9781492041122/|Data Science from Scratch, second edition, O'Reilly, 2019.]]
  * [[http://shop.oreilly.com/product/0636920034919.do|Python Data Science Handbook, O'Reilly, 2016.]]
  * [[https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291/ref=pd_sim_14_2?_encoding=UTF8&pd_rd_i=1491962291&pd_rd_r=a661bb45-d0b9-11e8-9fea-e722222b4194&pd_rd_w=hDZAL&pd_rd_wg=TW8F8&pf_rd_i=desktop-dp-sims&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=18bb0b78-4200-49b9-ac91-f141d61a1780&pf_rd_r=5H464VA0VJ0JFK1QFQXJ&pf_rd_s=desktop-dp-sims&pf_rd_t=40701&psc=1&refRID=5H464VA0VJ0JFK1QFQXJ| Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O'Reilly, 2017.]]
  * [[https://www.cs.ubc.ca/~murphyk/MLbook/|Machine Learning: a Probabilistic Perspective]]
  * [[https://www-bcf.usc.edu/~gareth/ISL/|An Introduction to Statistical Learning, R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.]]
  * [[https://hastie.su.domains/ISLP/ISLP_website.pdf.download.html|An Introduction to Statistical Learning with Applications in Python, R, Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani and Jonathan Taylor.]]
  * [[https://www.springer.com/gp/book/9780387310732|Pattern Recognition and Machine Learning, Christopher Bishop]]
  * [[https://www.amazon.com/Neural-Networks-Learning-Machines-3rd/dp/0131471392|Neural Networks and Learning Machines]]
  * [[https://rafalab.github.io/dsbook/|Introduction to Data Science - Data Analysis and Prediction Algorithms with R, Rafael A. Irizarry. Online Book.]]
  * [[http://www.mmds.org/|Mining Massive Datasets, third edition, Ullman et al., 2020.]]

==== Similar / Complementary Courses ====

  * [[https://bcourses.berkeley.edu/courses/1267848|CS194, Introduction to Data Science, Berkeley]]
  * [[http://data8.org/ | Data 8: Introduction to Data Science, Berkeley]]
  * [[http://www.ds100.org/ | Data 100: Principles and Techniques of Data Science, Berkeley]]
  * [[https://www.cs.purdue.edu/homes/neville/courses/CS24200.html|CS24200, Introduction to Data Science, Purdue]]
  * [[https://web.stanford.edu/class/stats101/|Data Science 101, Stanford]]
  * [[http://cs109.github.io/2015/index.html|CS109, Data Science, Harvard]]
  * [[https://www.cs.umd.edu/class/spring2017/cmsc320/|Introduction to Data Science I, Maryland]]
  * [[http://users.umiacs.umd.edu/~hcorrada/IntroDataSci/syllabus.html|Introduction to Data Science II, Maryland]]
  * [[https://www.eecs.wsu.edu/~assefaw/CptS483-06/|Introduction to Data Science, WSU]]
  * [[https://www.conted.ox.ac.uk/courses/applied-data-science|An Overview of Data Science, Oxford]]
  * [[https://www.cambridgenetwork.co.uk/events/applied-data-science-course-become-a-data-scientist-in-6-months/|Applied Data Science, Cambridge]]
  * [[https://studiegids.tudelft.nl/a101_displayCourse.do?restoreContext=true&SIS_SwitchLang=en&course_id=41759|Data Analysis, Delft]]
  * [[https://www.studocu.com/en/course/technische-universiteit-delft/programming-and-data-science-for-the-99/77128|Programming and Data Science for 99 Percent, Delft]]
  * [[https://datasciencedegree.wisconsin.edu/data-science-700-foundations-of-data-science/|Foundations of Data Science, Wisconsin]]
  * [[https://stars.bilkent.edu.tr/syllabus/view/CS/464/|Introduction to Machine Learning, Bilkent]]
  * [[https://stars.bilkent.edu.tr/syllabus/view/EEE/443/EE_BS/| Neural Networks, Bilkent]]
  * [[https://stars.bilkent.edu.tr/syllabus/view/EEE/485/EE_BS/| Statistical Learning and Data Analytics, Bilkent]]
  * [[https://stars.bilkent.edu.tr/syllabus/view/IE/451/IE_BS/| Applied Data Analysis, Bilkent]]
  * [[http://www.cs.bilkent.edu.tr/~gunduz/teaching/cs550/|Machine Learning, Bilkent]]
  * [[http://www.cs.bilkent.edu.tr/~saksoy/courses/cs551/|Pattern Recognition, Bilkent]]
  * [[http://cs.brown.edu/courses/csci1951-a/|CS1951A Data Science, Brown]]
  * [[http://www.datasciencecourse.org/|CMU15-388/688 Practical Data Science, CMU]]
  * [[https://www.hse.ru/data/2016/10/06/1087301973/program-869867030-nOE2xyWAyH.pdf| Introduction to Data Science, Russia]]
  * [[https://ci.uky.edu/sis/sites/default/files/syllabi/Syllabus-LIS690-Introduction%20to%20Data%20Science-20160101_1.pdf|LIS690, Introduction to Data Science, Kentucky]]

==== Tools, Libraries, Systems, Languages ====

  * [[https://pandas.pydata.org/|Pandas: Python Data Analysis Library]]
  * [[https://aws.amazon.com/machine-learning/?sc_channel=PS&sc_campaign=acquisition_TR&sc_publisher=google&sc_medium=ACQ-P%7CPS-GO%7CNon-Brand%7CDesktop%7CSU%7CMachine%20Learning%7CMachine%20Learning%7CTR%7CEN%7CText&sc_content=ml_general_bmm&sc_detail=%2Bmachine%20%2Blearning&sc_category=Machine%20Learning&sc_segment=293640020615&sc_matchtype=b&sc_country=TR&s_kwcid=AL!4422!3!293640020615!b!!g!!%2Bmachine%20%2Blearning&ef_id=W4g8FAAAAM1g4jhU:20181015203314:s|Machine Learning on AWS]]
  * [[https://spark.apache.org/|Apache Spark]]