I completed Udacity's "Intro to Machine Learning" course. The content itself is fairly easy: none of the math that is usually unavoidable when studying machine learning appears at all. Instead, you learn the concepts behind machine learning models by actually writing code with the Scikit-Learn library. For a programmer wondering what machine learning even is, it's a good course. That said, since no math appears, it steps away from studying each model in depth; the natural approach is to use this course to get a picture of what machine learning can do and what the workflow looks like, and then fill in the details with other courses and material. So while the course felt a bit light to me personally, it was worthwhile because it let me organize the machine learning knowledge I had picked up in bits and pieces.
- Lessons 1-4: Supervised Classification
  - Naive Bayes: We jump in headfirst, learning perhaps the world’s greatest algorithm for classifying text.
  - Support Vector Machines (SVMs): One of the top 10 algorithms in machine learning, and a must-try for many classification tasks. What makes it special? The ability to generate new features independently and on the fly.
  - Decision Trees: Extremely straightforward, often just as accurate as an SVM but (usually) way faster. The launch point for more sophisticated methods, like random forests and boosting.
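Since the course teaches these classifiers through Scikit-Learn, here is a minimal sketch of what trying all three looks like, using the library's bundled iris dataset as a stand-in for the course's data (the dataset and split parameters are my own choices, not the course's):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Load a toy dataset and hold out 30% for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# All three classifiers share the same fit/score interface,
# which is what makes it cheap to try each one on a task.
scores = {}
for name, clf in [("naive_bayes", GaussianNB()),
                  ("svm", SVC(kernel="rbf")),
                  ("decision_tree", DecisionTreeClassifier(random_state=42))]:
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # accuracy on held-out data
```

The uniform `fit`/`predict`/`score` API is what the course leans on: once you know one classifier, swapping in another is a one-line change.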
- Lesson 5: Datasets and Questions
Behind any great machine learning project is a great dataset that the algorithm can learn from. We were inspired by a treasure trove of email and financial data from the Enron corporation, which would normally be strictly confidential but became public when the company went bankrupt in a blizzard of fraud. Follow our lead as we wrestle this dataset into a machine-learning-ready format, in anticipation of trying to predict cases of fraud.
- Lessons 6 and 7: Regressions and Outliers
Regressions are some of the most widely used machine learning algorithms, and rightly share prominence with classification. What’s a fast way to make mistakes in regression, though? Have troublesome outliers in your data. We’ll tackle how to identify and clean away those pesky data points.
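One simple outlier-cleaning recipe (the course covers a variant of this idea) is to fit a regression, drop the points with the largest residuals, and refit. A sketch on synthetic data, with the thresholds and data generation being my own illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3x + noise, with a few injected outliers.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1, 100)
y[:5] += 40  # five troublesome outliers

# First fit: the outliers drag the line off target.
reg = LinearRegression().fit(X, y)

# Drop the 10% of points with the largest absolute residuals, then refit.
residuals = np.abs(y - reg.predict(X))
keep = residuals < np.percentile(residuals, 90)
reg_clean = LinearRegression().fit(X[keep], y[keep])
```

After cleaning, the recovered slope sits much closer to the true value of 3 than the contaminated fit does.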
- Lesson 8: Unsupervised Learning
K-Means Clustering: The flagship algorithm when you don’t have labeled data to work with, and a quick method for pattern-searching when approaching a dataset for the first time.
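A minimal k-means sketch in Scikit-Learn, on two synthetic blobs of my own making (no labels are given to the algorithm; it recovers the grouping on its own):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated unlabeled blobs of 50 points each.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=(0, 0), scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=(5, 5), scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Ask k-means for two clusters; it assigns each point a cluster label.
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
labels = km.labels_
```

Note that you must choose the number of clusters `n_clusters` up front; picking it is part of the pattern-searching the lesson describes.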
- Lessons 9-12: Features, Features, Features
  - Feature Creation: Taking your human intuition about the world and turning it into data that a computer can use.
  - Feature Selection: Einstein said it best: make everything as simple as possible, and no simpler. In this case, that means identifying the most important features of your data.
  - Principal Component Analysis: A more sophisticated take on feature selection, and one of the crown jewels of unsupervised learning.
  - Feature Scaling: Simple tricks for making sure your data and your algorithm play nicely together.
  - Learning from Text: More information is in text than any other format, and there are some effective but simple tools for extracting that information.
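Scaling, PCA, and text vectorization are all one-liners in Scikit-Learn. A sketch combining them, again on toy data of my own choosing rather than the course's:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Feature scaling: rescale every feature into [0, 1] so that
# features with large raw ranges don't dominate distance-based methods.
X, _ = load_iris(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)

# PCA: project the 4 scaled features down to the 2 directions
# of greatest variance.
pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

# Learning from text: turn raw strings into a TF-IDF feature matrix,
# one row per document, one column per vocabulary word.
docs = ["machine learning from data", "learning from text data"]
tfidf = TfidfVectorizer().fit_transform(docs)
```

`pca.explained_variance_ratio_` tells you how much of the data's variance each retained component captures, which is how you judge how many components to keep.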
- Lessons 13-14: Validation and Evaluation
  - Training/testing data split: How do you know that what you’re doing is working? You don’t, unless you validate. The train-test split is simple to do, and the gold standard for understanding your results.
  - Cross-validation: Take the training/testing split and put it on steroids. Validate your machine learning results like a pro.
  - Precision, recall, and F1 score: After all this data-driven work, quantify your results with metrics tailored to what is most important to you.
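All three validation tools above are available directly in Scikit-Learn. A sketch wiring them together around one classifier (the dataset and the macro averaging are my illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Train-test split: fit on 70% of the data, evaluate on the held-out 30%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

# Precision, recall, and F1 on the held-out set
# (macro-averaged across the three classes).
precision = precision_score(y_test, pred, average="macro")
recall = recall_score(y_test, pred, average="macro")
f1 = f1_score(y_test, pred, average="macro")

# Cross-validation: 5 rotating train/test splits instead of one,
# giving 5 scores whose spread shows how stable the result is.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
```

The point of reporting cross-validation scores rather than one split is that a single lucky (or unlucky) split can badly misrepresent how the model will generalize.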
- Lesson 15: Wrapping it all Up
We take a step back and review what we’ve learned, and how it all fits together.
Each lesson ends with a mini-project. Final project: searching for signs of corporate fraud in the Enron data.