Dr. Stan Matwin
Tier I Canada Research Chair
The Canada Research Chairs Program is funded by the federal government and forms part of a strategy to make Canada one of the world's top countries in research and development.
What is your line of research?
I work in Machine Learning and Data Mining. Machine Learning is a research area in which a computer is given examples of something (e.g. what is and what isn't an oil spill in a satellite image of the sea) and from these examples it learns how to classify or predict new examples of that 'something' (e.g. to recognize oil spills in new, unseen images). This is an old idea, dating back to the 1950s, and it was part of the original Artificial Intelligence manifesto: everybody agrees that learning is an inherent part of intelligence. I like to see it more pragmatically: I am interested in the use of learning programs to learn practical things, e.g. to predict who in the emergency room will need hospitalization, to recognize oil spills, to categorize medical articles, or to catch emerging trends in a political campaign or in public opinion.
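As a rough illustration of "learning from examples" (not taken from the interview, and not the method used in the oil spill work), here is a minimal nearest-neighbour sketch in Python. The two features, `darkness` and `elongation`, and all the numbers are invented for illustration; a real system would use far richer image features and many more examples.

```python
import math

# Hypothetical labelled examples: (darkness, elongation) -> label.
# Feature names and values are invented for illustration only.
TRAINING_EXAMPLES = [
    ((0.90, 0.80), "spill"),
    ((0.85, 0.70), "spill"),
    ((0.80, 0.90), "spill"),
    ((0.20, 0.10), "not_spill"),
    ((0.30, 0.20), "not_spill"),
    ((0.10, 0.30), "not_spill"),
]


def classify(features, examples=TRAINING_EXAMPLES):
    """Return the label of the training example closest to `features`.

    This is 1-nearest-neighbour classification with Euclidean distance:
    the "learning" is simply remembering the labelled examples.
    """
    nearest = min(examples, key=lambda ex: math.dist(ex[0], features))
    return nearest[1]


if __name__ == "__main__":
    # A new, unseen region that is dark and elongated.
    print(classify((0.88, 0.75)))
    # A new, unseen region that is bright and compact.
    print(classify((0.15, 0.20)))
```

The design choice here is deliberate simplicity: the classifier never sees a rule for what a spill is, only labelled examples, which is the core idea described above.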
Of particular interest for me is learning from text data: papers, blogs, tweets, notes, etc. I believe that such data calls for methods that take into account its linguistic character - we will have stronger methods if they understand the lexical, syntactic and semantic character of such data. That is the main topic of my Canada Research Chair here at Dal.
Data Mining, for me, is Machine Learning in the large: first, one is dealing with large data sets, with millions of records and terabytes of volume. Second, in data mining it is recognized that one spends most of the effort not in the "model building" phase, but instead in the data cleaning and data preparation phase, e.g. doing "attribute engineering". In order to do this, the data miner must learn the basics of the domain from which the data is coming: he or she will have to create in their head a fundamental "ontology" of that domain, i.e. what the main entities are, and what the relationships between those entities are.
I am also interested in data privacy. I work on methods that make it hard, or practically impossible, to identify a given person in a dataset.
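One common way to make "hard to identify a given person" measurable is k-anonymity: every combination of potentially identifying attributes must be shared by at least k records. The interview does not name this technique, so the sketch below is only an assumed illustration; the field names (`age_range`, `postal_prefix`, `diagnosis`) and the records are hypothetical.

```python
from collections import Counter

# Hypothetical, already-generalized records (invented for illustration).
RECORDS = [
    {"age_range": "30-39", "postal_prefix": "B3H", "diagnosis": "flu"},
    {"age_range": "30-39", "postal_prefix": "B3H", "diagnosis": "asthma"},
    {"age_range": "40-49", "postal_prefix": "B3J", "diagnosis": "flu"},
    {"age_range": "40-49", "postal_prefix": "B3J", "diagnosis": "diabetes"},
]


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same
    quasi-identifier values; 0 if there are no records."""
    groups = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group_size(records, quasi_identifiers) >= k


if __name__ == "__main__":
    quasi = ["age_range", "postal_prefix"]
    print(is_k_anonymous(RECORDS, quasi, 2))
    print(is_k_anonymous(RECORDS, quasi, 3))
```

A check like this only measures one notion of privacy; it says nothing about whether the sensitive values inside each group are themselves revealing.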
How did you get interested in that?
Well, it all started many years ago, when I was involved in one of the early projects in Expert Systems, a joint project with Cognos. At that time we were trying to build an ES that would process (or assist in processing) government travel claims. I got to learn more than I ever wished about that "fascinating" topic! A question which arose was: how does one acquire the rules which form the knowledge base of an expert system? Somebody suggested that I look at Machine Learning - indeed, one of its early goals was to replace the classical "Knowledge Acquisition" approach with learning the rules from examples. I went to spend a sabbatical at one of the leading centres of Machine Learning at that time, George Mason U. in Virginia, and I caught the bug. I liked the fact that Machine Learning was drawing on a variety of disciplines (AI, logic, databases and statistics) to build its tools. I also liked the fact that it was directly applicable almost everywhere. I am always interested in applications - they are an opportunity to learn about something completely new, from neuro-ophthalmology to forestry to electronic components, just to name a few applications I was involved in. Applications also attract students and, last but not least, research funds. And done well, they often present a general research problem that can be shared with the community and initiate a new line of research. That happened with our work on oil spill detection with R.C. Holte and M. Kubat, which opened the active field of learning from imbalanced data.
My interest in data privacy is a little different. I am concerned about the fact that modern computers may become a tool that can be used to breach and violate people's privacy more easily and on a much larger scale than was possible, say, 30 years ago. I believe that since the computer research community invented the tools that make it possible - databases, the internet, image and voice recognition, barcodes etc. - it is then our moral obligation to at least think about tools that would make privacy easier to protect and that would avoid many privacy-violating incidents.
What do you hope to achieve in the next five years?
I have several goals. First and foremost, I hope to create, together with colleagues from Dal, an active, dynamic centre of excellence in our joint field of research, which we call Big Data Analytics. We are working to soon open the Institute for Big Data Analytics Research to focus on this area. The Institute will attract talent, ideas and applications, and will make Dalhousie a globally visible centre for this type of research. We're getting a very powerful computer, an IBM Netezza, a machine that is unique not only here but on campuses generally, which will provide an excellent infrastructure for Big Data applications.
At the research level, I hope to make inroads into a linguistically informed but still scalable text model ("representation"). I want to complete several real-life, deployed applications of data and text mining techniques. I also want to continue with a start-up, Devera Logic, that I founded several years ago with colleagues in Ottawa in the area of computer security, and to bring it to a fruitful completion.
Who else is involved in this research?
Here at Dal there are several excellent researchers involved in this type of research. My closest collaborators in text analytics will be Vlado Keselj, Evangelos Milios and Mike Shepherd. I will also collaborate with other faculty members at Dal in the areas of visualization, HCI, databases, data structures and privacy: Raza Abidi, Dirk Arnold, Robert Beiko, Jamie Blustein, Stephen Brooks, Qigang Gao, Kirstie Hawkey, Andrew Rau-Chaplin, Derek Reilly, Thomas Trappenberg, Caroline Watters, Norbert Zeh, and Nur Zincir-Heywood.
In Canada, I cooperate actively with several researchers across the country: Nick Cercone (former CS Dean at Dal, now at York), Fred Popowich at Simon Fraser, Diana Inkpen and Nathalie Japkowicz at the University of Ottawa, Chris Drummond at NRC, and Guy Lapalme at Université de Montréal.
I also plan to continue and further develop my rich international collaboration, mainly with Brazil where I already have a very active, ongoing cooperation; with France and Spain through Dal's partnership in the DMKM Erasmus Mundus program; and with my native Poland, where I hold a Professorship with the Academy of Sciences and have many contacts with several leading academic and research centres.
What attracts your interest outside your research area?
I am interested in current affairs and politics - I believe we have to be informed to influence decision makers on matters that concern us. I spend a lot of time reading (on-line) newspapers in at least three languages - English, French and Polish. I am also an avid reader of literature in these three languages. Classical music is my major hobby - I have a large CD collection, and I go to concerts wherever I can, including during my frequent travel. I like hiking and swimming, but I do not do enough of either.