I have talked some about the generalities of machine learning and data science. We have gone into some level of detail on the types of learning: supervised, unsupervised, and reinforcement (not yet discussed, but coming). While these are interesting topics, I want to step back in the process and focus on data munging, or data wrangling. This is usually the first step in machine learning, before all of the model building, fitting, and applying. You see, you have to have data to work with if you expect a machine to learn anything. There have to be some past examples, or experiences, to learn from. I was reading an article this morning from ThoughtWorks.com in which the author, David Johnston, disagrees with the idea that data munging is a mundane process. After reading the article, I came up with some thoughts of my own, based on past experience. David does not dispute the adage that 80% of data science is data wrangling and that data scientists spend only 20% of their time building models and reporting on findings. What he disagrees with is the notion that the 80% of time spent on data munging is boring and useless.
I have worked for a big data company for two years now, and before that I was a customer with a military unit where my role was also data analysis. We had tools that made storing and visualizing the data much easier; however, there was always the issue of getting the data into a usable format for discovery. Some tools absolutely make data collection easier and provide indexing and mapping for the data sets, but some manual intervention always has to occur at this point. Some software can apply labels to the data; other systems rely on the administrator to define the labels. Then there is the matter of standardization. If you are going to use labels for your supervised learning algorithms, you want to standardize those labels across all of your data sets. You can find the common fields, features, and so on, and ensure that they are always recognized the same way by the algorithms you will eventually use to do your analysis.
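To make that standardization idea concrete, here is a minimal sketch in plain Python. The data sets, field names, and labels are all invented for illustration; the point is simply that every variant spelling of a field or label gets mapped to one canonical form before any algorithm sees the data.

```python
# Two hypothetical data sets describing the same kind of records with
# inconsistent field names and label spellings (all names are made up).
dataset_a = [{"Cust_ID": 1, "Region": "North East"},
             {"Cust_ID": 2, "Region": "north"}]
dataset_b = [{"customer id": 3, "region": "NORTHEAST"},
             {"customer id": 4, "region": "North"}]

# One standard name for every known variant of a field.
FIELD_MAP = {"cust_id": "customer_id",
             "customer id": "customer_id",
             "region": "region"}

# One standard spelling for every known variant of a label.
LABEL_MAP = {"north": "north",
             "north east": "northeast",
             "northeast": "northeast"}

def standardize(record):
    """Return a copy of a record with canonical field names and labels."""
    clean = {}
    for field, value in record.items():
        name = FIELD_MAP.get(field.strip().lower(), field.strip().lower())
        if isinstance(value, str):
            value = LABEL_MAP.get(value.strip().lower(), value.strip().lower())
        clean[name] = value
    return clean

combined = [standardize(r) for r in dataset_a + dataset_b]
for row in combined:
    print(row)
```

After this pass, every record carries the same field names and the same label spellings, so the downstream algorithms always recognize them the same way.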
David Johnston, a self-proclaimed data scientist (which I am not), believes there is extreme value in the data munging process. This is where the scientist, or analyst, begins to understand their data. I am of the opinion that you cannot choose or tune the machine learning algorithms that best suit your data if you are not familiar with your data at some level. Some organizations are quite familiar with their data sets, but I can tell you from experience that many organizations have no idea what is in their data or how it is structured, if it is structured at all. This practice of cleaning up the data gives analysts time to see what features exist that might be useful for supervised learning, or what patterns exist that might provide clues in unsupervised learning. The lesson here is that many of the processes we deem mundane are actually critical to the work we do. We must not dismiss them to move on to the “cool” stuff, or we might miss the very value we were trying to realize in the first place.
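One way that cleaning time doubles as discovery time is a quick profiling pass: counting missing values and distinct values per field gives a first picture of which features might be useful. A small sketch, with records invented purely for illustration:

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "plan": "basic", "churned": "yes"},
    {"age": None, "plan": "basic", "churned": "no"},
    {"age": 52, "plan": "premium", "churned": "no"},
    {"age": 29, "plan": None, "churned": "yes"},
]

profile = {}
for field in records[0]:
    values = [r[field] for r in records]
    # How often is the field missing, and what values does it take?
    profile[field] = {
        "missing": sum(v is None for v in values),
        "distinct": Counter(v for v in values if v is not None),
    }

for field, stats in profile.items():
    print(f"{field}: {stats['missing']} missing, "
          f"distinct values: {dict(stats['distinct'])}")
```

Even a report this simple tells you which fields are complete enough to use as features and which labels exist for supervised learning, before any model is built.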
I encourage you to read David’s article for a much more articulate treatment of this topic. I would also suggest that you do some reading on the process of data munging. I have included some reference material below as well as a link to David’s article.
As usual, I have included a video from Stanford’s Computer Science Department. This is lecture 4 in a much larger series of lessons on machine learning. It is amazing to me that we live in a time where this type of instruction is available for free on the Internet. Please do take time to watch, and let them know in the comments that you discovered them through Geekzweb.com. I always appreciate your support.