Friday, December 02, 2016

How to Write an Academic Teaching Statement

A while back, I wrote a post on how to write an academic research statement. This is a follow-up post on how to write an academic teaching statement, and contains my thoughts on what makes for a good teaching statement when applying to computer science departments in US research universities.

Like I said about research statements, the teaching statement is not the most important part of your application package. In fact, for research universities, the teaching statement is probably the least important part. Nonetheless there are pitfalls that should be avoided, because a bad teaching statement can hurt your application. Being an ineffective teacher is grounds for not getting tenure at many schools, and schools don't like to hire faculty that they don't think can get tenure.

For reference, here is my teaching statement from when I went on the job market in Fall 2012. I want to emphasize a few points:

(1) Do not blather on and on about how much you LOVE teaching. No one wants to read this, just like no one wants to read in your grad school application about how you've loved computers since you were three years old. Stuff like this lacks substance and makes you seem immature.

(2) Keep it short. No one wants to read a long teaching statement, and long teaching statements are usually just fluffed up anyway. Just like for the research statement, optimize the wording to be as concise as possible. I'd recommend keeping the teaching statement to one page.

(3) List the courses and topics you can teach. Some faculty search committees like to explicitly see what courses you can teach, so this is an important thing to mention in the teaching statement. I typically like to list this at the very end of the teaching statement. I would recommend being somewhat open-minded about what courses you can teach. You can always negotiate these things later.

(4) Briefly list your credentials. I typically like to do this at the very beginning. You don't need anything flashy here, but you just can't count on your recommendation letters to talk about all of your teaching and mentorship experience.

(5) Have a teaching philosophy. No one is expecting you to know exactly how you'd like to teach your courses, but not having an explicit teaching philosophy of some kind can be an indicator of a complete lack of preparation. For reference, this part took up the 2nd paragraph of my teaching statement. One strategy is to connect this part to prior teaching and mentoring experience, where you can talk about what teaching styles have been successful for you.

(6) Say something meaningful about outreach and interdisciplinary education. Computer science and data science are quickly becoming vital enabling disciplines for virtually all scientific and engineering disciplines. Recognize this, and have some kind of plan for how you want to contribute to it, or why it matters to you.

Wednesday, September 28, 2016

SoCal ML Symposium

Julian McAuley and I are organizing the Southern California Machine Learning Symposium, on Friday November 18 at Caltech!
http://dolcit.cms.caltech.edu/scmls/

** CFP Deadline is October 4th!!

The SoCal ML Symposium brings together students and faculty to promote machine learning in the Southern California region. The workshop serves as a forum for researchers from a variety of fields working on machine learning to share and discuss their latest findings.

Topics to be covered at the symposium include, but are not limited to:
+ Machine learning with graphs, social networks, and structured data.
+ Active learning, reinforcement learning, crowdsourcing.
+ Learning with images and natural language.
+ Learning with high-dimensional data.
+ Neural networks, deep learning, and graphical models.
+ Learning from dynamic and streaming data.
+ Applications to interesting new domains.
+ Addressing each of these issues at scale.

The majority of the workshop will be focused on student contributions, in the form of contributed talks and posters.

We invite submissions in the form of 1-2 page extended abstracts, to be presented as posters and oral presentations at the symposium. Submissions may be made on our EasyChair page:
https://easychair.org/conferences/?conf=scmls16

A $500 first prize and a $250 runner-up prize, sponsored by Google Research, will be awarded for the best student presentations.

Timeline:
Oct 4: Abstract submission
Oct 14: Notification
Nov 11: Registration deadline
Nov 18: Symposium

For more details, including submission and registration instructions, visit our symposium webpage:
http://dolcit.cms.caltech.edu/scmls/
and please help distribute our flyer:
http://dolcit.cms.caltech.edu/scmls/scmls.pdf

Friday, January 01, 2016

Data Science Positions for Sports Analytics

I want to give a plug for STATS LLC, which is building a data science team and has several openings for data scientist positions. For those who don't know, STATS is a sports data company that provides the tracking data for the National Basketball Association, amongst other sports and leagues. STATS also recently acquired Prozone, which provides tracking data for many professional soccer leagues around the world. Sports analytics is definitely entering an exciting phase due to the rapid growth of new data sources that offer far greater granularity than was possible before. See, e.g., these papers that analyze tracking data provided by STATS and Prozone.

Patrick Lucey is the new Director of Data Science. I previously worked with Patrick at Disney Research, and I can vouch for him being a great collaborator with lots of fantastic ideas and unbounded enthusiasm for sports analytics research.

Thursday, December 31, 2015

Thoughts on NIPS 2015 and OpenAI

A few weeks ago, I attended NIPS 2015, which turned out to be (by far) the largest machine learning conference ever. With nearly 4000 attendees, the conference saw a roughly 50% increase from the previous year. Much of this growth seems fueled by industry interest, especially in topics such as deep learning and large scale learning. Deep learning, in particular, seems to be all the rage these days, at least in the public zeitgeist. I think this is great for the field, because this degree of interest will also percolate to the rest of machine learning more broadly.

There have been plenty of posts regarding NIPS already (see: Sebastien Bubeck, Neil Lawrence, John Langford, Paul Mineiro, and Hal Daume), with plenty of great pointers to interesting NIPS papers that I'll hopefully get around to reading soon. On my end, I didn't get a chance to see too many papers, in part because I was helping present a poster during one poster session, and a demo during another. But I did very much enjoy many of the talks, especially during the workshops.

OpenAI

Perhaps the biggest sensation at NIPS was the announcement of OpenAI, which is a non-profit artificial intelligence research company with $1B in endowment donated by people such as Sam Altman, Elon Musk, Peter Thiel, and others. The core ideal of OpenAI is to promote open research in Artificial Intelligence. For the most part, not much is known about how OpenAI will operate (and from what I've gathered, the people at OpenAI haven't fully decided on a strategy yet either). One thing that I do know on good authority is that OpenAI will NOT be patenting their research.

Nonetheless, there have already been many reactions to OpenAI, from the usual "robots will steal our jobs" trope, to nuanced concerns voiced by machine learning expert Neil Lawrence observing that open access to data is just as important as open access to research and systems. I do very much agree with Neil's point and I think that one of the best things that OpenAI can do for the research community is to generate interesting new datasets and testbeds. There have also been concerns voiced that the founding team is overwhelmingly deep learning people. I don't think this is much of an issue at the moment, because representation learning has been the biggest practical leap forward and giving broader access to learned representations is a great thing.

The announcement has even caught the attention of rationalists such as Scott Alexander, who voiced concerns about whether AI research should be open at all, for risk of losing control of the technology and potentially leading to catastrophic results. Scott's concern is a meta-concern about the current mentality of AI research being an arms race and institutions such as OpenAI not focusing on "controlling" access to AI that could become dangerous. These meta-concerns are predicated on the assumptions that a hard takeoff of AGI is a legitimate existential threat to humanity (which I agree with), and that existing institutions such as OpenAI could directly lead to that happening (which I strongly disagree with). I realize that OpenAI ponders human-level intelligence in their opening blog post, but that's just a mission statement of sorts. For instance, Google, while awesome, has (thus far) fallen quite short of their mission to "organize the world's information and make it universally accessible and useful". Likewise, I don't expect OpenAI to succeed in their mission statement anytime soon.

Most machine learning experts probably do take an overly myopic view of machine learning progress, which is partly due to the aforementioned research arms race but also just due to how research works (i.e., it is REALLY hard to make tangible progress on something that you can't even begin to rigorously and precisely reason about). However, from what I've read, rationalist non-experts conversely tend to phrase things in such imprecise terms that it's hard to have a substantive discussion between the two communities. I imagine the "truth", such as it is, is somewhere in the middle. Perhaps one should gather both camps together for a common discussion.

What is definitely going to happen, in the near term, is that access to AI technologies will become an increasingly important competitive advantage. And it's great that institutions such as OpenAI will help promote open access to those technologies.

I am optimistic that the crew at OpenAI will explore alternative mechanisms to NSF-style funding of research, and to how places like the Allen Institute engage in research. I think it'll be exciting to see what comes out of that process. Hopefully, OpenAI will also engage with places like the Future of Humanity Institute, and maybe even create forums that bring together people like Stuart Russell, Eric Horvitz, Scott Alexander and Eliezer Yudkowsky.

Cynthia Dwork on Universal Adaptive Data Analysis

Cynthia Dwork gave a great talk on using differential privacy to guard against overfitting when re-using a validation set multiple times. See this Science paper for more details. The basic idea is that, when you use your validation set to evaluate the performance of a model, do so in a differentially private way so that you don't overfit to the idiosyncrasies of the validation set. See, for instance, this paper describing an application to Kaggle-style competitions. This result demonstrates a great instance of (unexpected?) convergence between different areas of study: privacy-preserving computation and machine learning.
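To make the flavor of the result concrete, here is a minimal sketch (in Python/NumPy) of a Thresholdout-style mechanism. This is my own illustrative paraphrase rather than the authors' code, and the threshold and noise scale below are placeholder values, not the constants analyzed in the paper.

import numpy as np

def thresholdout(train_score, holdout_score, threshold=0.04, sigma=0.01, rng=None):
    # Answer a query about the holdout set in a differentially private way.
    # train_score:   a statistic (e.g., accuracy) computed on the training set
    # holdout_score: the same statistic computed on the holdout set
    rng = rng or np.random.default_rng()
    # If the training and holdout answers roughly agree, report the training
    # answer: essentially no information about the holdout set leaks.
    if abs(train_score - holdout_score) < threshold + rng.laplace(0, sigma):
        return train_score
    # Otherwise report a noised version of the holdout answer.
    return holdout_score + rng.laplace(0, sigma)

# Hypothetical usage: screening many candidate models against the same holdout
# set without overfitting to it as badly as naive reuse would.
# for model in candidates:
#     est = thresholdout(model.score(X_train, y_train), model.score(X_val, y_val))

The point of the construction is that each query only "spends" holdout information when the training and holdout answers disagree, which is what lets the same validation set be reused many more times than the naive analysis would allow.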

Jerry Zhu on Machine Teaching

Jerry Zhu has been doing very interesting work on Machine Teaching, which he talked about at the NIPS workshop on adaptive machine learning. Roughly speaking, machine teaching is the computational and statistical problem of how to select training examples to teach a learner as quickly as possible. One can think of machine teaching as the converse of active learning, where instead of the learner actively querying for training examples, a teacher actively provides them.

Machine teaching has a wide range of applications, but the one that I'm most interested in is when the learner is a human. As models necessarily become more complex in the quest for predictive accuracy, it is important that we devise methods to keep these models somehow interpretable to humans. One way is to use a machine teaching approach to quickly show the human what concepts the trained model has learned. For instance, this approach would have applications in debugging complicated machine learning models.
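To illustrate the flavor of the problem (a standard toy example, not Jerry's formulation), consider teaching a 1D threshold classifier. If the teacher knows how the learner behaves, two well-chosen examples suffice, whereas a passive learner would need on the order of 1/eps random examples to reach the same precision:

import numpy as np

def learner(examples):
    # A simple learner: place the threshold at the midpoint between the
    # largest negative example and the smallest positive example.
    xs = np.array([x for x, _ in examples])
    ys = np.array([y for _, y in examples])
    return (xs[ys == 0].max() + xs[ys == 1].min()) / 2.0

def teach(true_threshold, eps=1e-3):
    # An optimal teacher for this learner: two examples straddling the true
    # threshold pin it down to within eps, regardless of how much data exists.
    return [(true_threshold - eps, 0), (true_threshold + eps, 1)]

true_t = 0.37
print(learner(teach(true_t)))  # recovers ~0.37 from only two teaching examples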

Rich Caruana on Interpretable Machine Learning for Health Care

On the flip side, Rich Caruana talked about training models that are inherently interpretable by domain experts, such as medical professionals. Of course, these models are only applicable in restricted domains, such as when there is a "sufficient" set of hand-crafted features such that a generalized additive model can accurately capture the phenomenon of interest. The approach was applied to two settings: predicting the risk of pneumonia and predicting 30-day hospital readmission.

One interesting consequence of this study was that these interpretable models could be used to tease out biases in the data collection process. For instance, the model predicted that patients with asthma are at lower risk of dying from pneumonia. Consulting with medical experts revealed that, historically, patients with asthma are more closely monitored for signs of pneumonia and so the disease is detected much earlier than for the general populace. Nonetheless, it's clear that one wouldn't want a predictive model to predict a lower risk of pneumonia for patients with asthma -- that was simply a consequence of how the historical data was collected. See this paper for details.
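For readers unfamiliar with generalized additive models, here is a minimal sketch of the additive structure that makes this kind of inspection possible. The shape functions below are invented purely for illustration (including the asthma effect echoing the anecdote above) and are not the boosted shape functions actually used in the paper:

import numpy as np

# A GAM predicts via a sum of one-dimensional "shape functions", one per
# feature:  logit(p) = bias + f_age(age) + f_bp(bp) + ...
# Each f_j can be plotted on its own, which is what lets domain experts
# inspect the model feature by feature.
shape_functions = {
    "age":            lambda v: 0.03 * (v - 50),      # risk rises with age
    "blood_pressure": lambda v: 0.01 * abs(v - 120),  # risk rises away from normal
    "has_asthma":     lambda v: -0.4 * v,             # the suspicious learned effect
}
bias = -2.0

def predict_risk(patient):
    logit = bias + sum(f(patient[name]) for name, f in shape_functions.items())
    return 1.0 / (1.0 + np.exp(-logit))

print(predict_risk({"age": 70, "blood_pressure": 160, "has_asthma": 1}))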

Zoubin Ghahramani on Probabilistic Models

Zoubin Ghahramani gave a keynote talk on probabilistic models. During this deep learning craze, it's important to keep in mind that properly quantifying uncertainty is often a critical component as well. We are rarely given perfect information, and so we can rarely make perfect predictions. In order to make informed decisions, our models should output calibrated probabilities so that we can properly weigh different tradeoffs. Recall that one of the critical aspects of the Jeopardy!-winning IBM Watson machine was being able to properly calibrate its own confidence in the right answer (or question). Another point that Zoubin touched on was rational allocation of computational resources under uncertainty. See also this great essay on the interplay between machine learning and statistics by Max Welling.
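As a concrete notion of what "calibrated" means here: among all cases where a model predicts, say, a 70% probability, the event should actually occur about 70% of the time. A minimal reliability-diagram style check (my own sketch, in plain NumPy) looks like this:

import numpy as np

def reliability_bins(y_true, y_prob, n_bins=10):
    # For each probability bin, compare the mean predicted probability with the
    # empirical frequency of the positive class; a calibrated model has these
    # two roughly equal in every bin.
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if mask.any():
            rows.append((y_prob[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows  # (mean predicted prob, empirical frequency, count) per bin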

Interesting Papers

As I mentioned earlier, I didn't get a chance to check out too many posters, but here are a few that I did see which I found quite interesting.

Generalization in Adaptive Data Analysis and Holdout Reuse
by Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth
This paper generalizes previous work on adaptive data analysis by: 1) allowing queries to the validation set to be adaptive to the results of previous queries, and 2) providing a more general definition of adaptive data analysis.

Logarithmic Time Online Multiclass Prediction
by Anna Choromanska, John Langford
This paper studies how to quickly construct multiclass classifiers whose running time is logarithmic in the number of classes. This approach is especially useful for settings where the number of classes is enormous, which is also known as Extreme Multiclass Classification.

Spatial Transformer Networks
by Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu
This paper studies how to incorporate more invariances into convolutional neural networks beyond just shift invariance. The most obvious cases are invariance to rotation and skew. See also this post.

Optimization as Estimation with Gaussian Processes in Bandit Settings
by Zi Wang, Bolei Zhou, Stefanie Jegelka
A preliminary version of this paper was presented at the Women in Machine Learning Workshop at NIPS, and it will be formally published at AISTATS 2016. This is a really wonderful paper that unifies, to some extent, two of the most popular views in Bayesian optimization: UCB-style bandit algorithms and probability of improvement (PI) algorithms. One obvious future direction is to unify with expected improvement (EI) algorithms as well.

Fast Convergence of Regularized Learning in Games
by Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, Robert E. Schapire
This paper won a best paper award at NIPS, and analyzed the setting of learning in a repeated game. Previous results showed a regret convergence rate of O(T^{-1/2}), and this paper demonstrates an asymptotic improvement to O(T^{-3/4}) for individual regret and O(T^{-1}) for the sum of utilities.

Data Generation as Sequential Decision Making
by Philip Bachman, Doina Precup
This paper takes the view of sampling from sequential generative models as sequential decision making. For instance, can we view sequential sampling as a Markov decision process? In particular, this paper focuses on the problem of data imputation, or filling in missing values. This style of research has been piquing my interest recently, since it offers the potential to dramatically speed up computation when sampling or prediction is very computationally intensive.

Sampling from Probabilistic Submodular Models
by Alkis Gotovos, S. Hamed Hassani, Andreas Krause
Andreas's group has been working on a general class of probabilistic models called log-submodular and log-supermodular models. These models generalize models such as determinantal point processes. This paper studies how to do inference on these models via MCMC sampling, and establishes conditions for fast mixing.

The Self-Normalized Estimator for Counterfactual Learning
by Adith Swaminathan, Thorsten Joachims
This paper addresses a significant limitation of previous work on counterfactual risk minimization, which is overfitting to hypotheses that match or avoid the logged (bandit) training data, which the authors call propensity overfitting. The authors propose a new risk estimator that deals with this issue.

Monday, September 07, 2015

Thoughts on KDD 2015

Last month I attended KDD 2015 in beautiful Sydney, Australia. For those who don't know, KDD is the premier international conference for applied machine learning & data mining, and is often the venue for some of the most interesting data analysis research projects. Despite concerns that KDD 2015 would be a letdown after KDD 2014 was such a great success in New York City, overall KDD 2015 was a fantastic conference, with an excellent lineup of invited speakers and plenty of interesting papers. Congratulations also to my PhD advisor Thorsten Joachims, who not only did a great job as PC Co-Chair, but also was the recipient of a Test of Time Award for his work on Optimizing Search Engines using Clickthrough Data.

Data Science for Science

One of the biggest themes at KDD 2015 was applying data science to support the sciences, which is something that's been on my mind a lot recently. Hugh Durrant-Whyte gave a great keynote on applying machine learning to discovery processes in geology and ecology. One thing that jumped out from his talk was how challenging it is to develop models that are interpretable to domain experts. This issue is ameliorated in his settings because he largely focused on spatial models, which are easier to visualize and interpret.

Susan Athey gave another keynote on the interplay between machine learning and causal inference in policy evaluation, which is an important issue for the sciences as well. I must admit, most of the talk went over my head, but there was some interesting debate after the talk about whether causality should be the goal or rather just more "robust" correlations (whatever that might mean).

I also really enjoyed the Data-Driven Science Panel, where the debate got quite heated at times. Two issues in particular stood out. First, what should be the role of machine learning and data mining experts in the ecosystem of data-driven science? On the one hand, computer scientists have historically had a large impact by developing systems and platforms that abstract away low-level complexity and empower the end user to be more productive. However, how to achieve such a solution in a data-rich world is a much messier (or at least different) type of endeavor. There are, of course, plenty of startups that address aspects of this problem, but a genuinely scalable solution for science remains elusive.

A second issue that was raised was whether computational researchers have made much of a direct impact on the sciences. The particular area, raised by Tina Eliassi-Rad, was the social sciences. Machine learning and data mining have taken great interest in computational social science via studying large social networks. However, it is not clear to what extent computational researchers have directly made an impact on traditional social science fields. Of course, this issue ties back to what the role of computational researchers should be. On the one hand, many social scientists do use tools made by computational people, so the indirect impact is quite clear. Does it really matter that there hasn't been much direct impact?

Update on MOOCs

Daphne Koller gave a great keynote on the state of MOOCs and Coursera in particular. It seems that MOOCs nowadays are much smarter about their consumer base, and have diversified the way they deliver content and measure success for a wide range of students. For example, people now understand much better the different needs of college aspirants (who use MOOCs to supplement high school & college education) versus young professionals (who use MOOCs to get ahead in their careers) versus those seeking vocational skills (which is very popular in less developed countries).

One striking omission that was pointed out during the Q&A was that MOOCs have mostly abandoned the pre-college demographic, especially before high school. In retrospect, this is not too surprising, in large part due to the very different requirements for primary and secondary education across different states and school districts. But it does put a damper on the current MOOC enthusiasm, since many problems with education start much earlier than college.

Lessons Learned from Large-Scale A/B Testing

Ron Kohavi gave a keynote on lessons learned from online A/B testing. The most interesting aspect of his talk was just how well-tuned the existing systems are. One symptom of a highly tuned system is that it becomes very difficult to intuit about whether certain modifications will increase or decrease the performance of the system (or have no effect). For example, he posed a number of questions to the audience, such as: "Does lengthening the description of sponsored advertisements lead to increased overall clicks on ads?" Basically, the audience could not guess better than random. So the main lesson is basically to follow the data and not be too (emotionally) tied to your own intuitions when it comes to optimizing large complex industrial systems.

Sports Analytics Workshop

I co-organized the 2nd workshop on Large-Scale Sports Analytics. I tried to get more eSports into the workshop this year, but alas fell a bit short. Thorsten did give an interesting talk that used eSports data, although the phenomenon he was studying was not specific to eSports. In many ways, eSports is an even better test bed for sports analytics than traditional sports because game replays track literally everything.

Within the more traditional sports regimes, it's clear that access to data remains a large bottleneck. Many professional leagues are hoarding their data like gold, but sadly do not have the expertise to leverage the data effectively. The situation actually seems better in Europe, where access to tracked soccer (sorry, futbol) games is relatively common. In the US, it seems like the data is only available to a select few sports analytics companies such as Second Spectrum. I'm hopeful that this situation will change in the near future as the various stakeholders become more comfortable with the idea that it's not the raw data that has value, but the processed artifacts built on top of that data.

Interesting Papers

There were plenty of interesting research papers at KDD, of which I'll just list a few that I particularly liked.

A Decision Tree Framework for Spatiotemporal Sequence Prediction
by Taehwan Kim, Yisong Yue, Sarah Taylor, and Iain Matthews
I'll start with a shameless piece of self-advertising. In collaboration with Disney Research, we trained a model to generate visual speech, i.e., animate the lower face in response to audio or phonetic inputs. See the demo video below:

More details here.

Inside Jokes: Identifying Humorous Cartoon Captions
by Dafna Shahaf, Eric Horvitz, and Robert Mankoff
Probably the most interesting application at KDD was on studying the anatomy of a joke. While the results may not seem too surprising in retrospect (e.g., the punchline should be at the end of the joke), what was really cool was that the model could quantify whether one joke was funnier than another (i.e., rank jokes).

Cinema Data Mining: The Smell of Fear
by Jörg Wicker, Nicolas Krauter, Bettina Derstorff, Christof Stönner, Efstratios Bourtsoukidis, Thomas Klüpfel, Jonathan Williams, and Stefan Kramer
This was a cool paper that studied how the chemicals people exhale vary in response to different emotions. The authors instrumented a movie theater's air circulation system with chemical sensors, and found that the compounds you exhale are indicative of various emotions such as fear or amusement. The authors repeatedly lamented the fact that they didn't do this for any erotic films, and so they don't know what the cinematic chemical signature of arousal would look like.

Who supported Obama in 2012? Ecological inference through distribution regression
by Seth Flaxman, Yu-Xiang Wang, and Alex Smola
This paper presents a new solution to the ecological inference problem of inferring individual-level preferences from aggregate data. The primary testbed was county-level election outcomes combined with demographic data reported at a different granularity or overlay. The main issue is how to estimate, e.g., female preference for one presidential candidate, using just these kinds of aggregate data.

Certifying and removing disparate impact
by Michael Feldman, Sorelle Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian
Many people assume that, because algorithms are "objective", they can't be biased or discriminatory. This assumption is invalid because the data or features themselves can be biased (cf. this interview with Cynthia Dwork). The authors of this paper propose a way to detect & remove bias in machine learning models that is tailored to the US legal definition of bias. The work is, of course, preliminary, but this paper was arguably the most thought-provoking of the entire conference.

Edge-Weighted Personalized PageRank: Breaking A Decade-Old Performance Barrier
by Wenlei Xie, David Bindel, Alan Demers, and Johannes Gehrke
This paper proposes a reduction approach to personalized PageRank that yields a computational boost by several orders of magnitude, thus allowing, for the first time, personalized PageRank to be computed at interactive speeds. This paper was also the recipient of the best paper award.

Thursday, April 09, 2015

KDD 2015 Workshop on Large-Scale Sports Analytics

We are pleased to announce that the KDD Workshop on Large-Scale Sports Analytics will be taking place in Sydney this year on August 10th at KDD 2015. Similar to last year, it will be a full-day workshop consisting of invited speakers as well as poster sessions for submitted papers. A call for paper submissions is below.

=== Call for Submissions ===
When: August 10th, 2015
Where: Sydney, Australia
Website: http://large-scale-sports-analytics.org/

Description:

Virtually every aspect of sports analytics is now entering the “Big Data” phase, and the interest in effectively mining, modeling, and learning from such data has also been correspondingly growing. Relevant data sources include detailed play-by-play game logs, tracking data, physiological sensor data to monitor the health of players, social media and text-based content, and video recordings of games.

The objective of this workshop is to bring together researchers and analysts from academia and industry who work in sports analytics, data mining and machine learning. We hope to enable meaningful discussions about the state of the art in sports analytics research, and how it might be improved upon.

We seek poster submissions (which can be both preliminary research as well as recently published work) on topics including but not limited to:
* Spatiotemporal modeling
* Video, text and social media analysis
* Feature selection and dimensionality reduction
* Feature learning and latent factor models
* Computational rationality
* Real-time predictive modeling
* Interactive analysis & visualization tools
* Sensor technology and reliability
* Labeling and annotation of events/activities/tactics
* Real-time/deployed analytical systems
* Knowledge discovery of player/team/league behaviors
* Game Theory
* eSports


Submission Details:
Poster submissions should be extended abstracts no more than 4 pages in length (in KDD format; submissions do not need to be anonymous). Extended abstracts should be submitted by June 5th, 11:59 PM PDT. Details can be found at:

http://www.large-scale-sports-analytics.org/Large-Scale-Sports-Analytics/Submissions.html

Important Dates:
Submission - 5th June 2015 11:59 PM PDT
Notification - 30th June 2015
Workshop - 10th August 2015

Organizers:
Patrick Lucey (Disney Research) (patrick.lucey@disneyresearch.com)
Yisong Yue (Caltech) (yyue@caltech.edu)
Jenna Wiens (University of Michigan) (wiensj@umich.edu)
Stuart Morgan (Australian Institute of Sport) (stuart.morgan@ausport.gov.au)