This post might require a cup of coffee. I want to take time to explain a few terms as we move forward in our machine learning series, discuss some common algorithms, and speak to data wrangling, or data munging. I’ll start with data wrangling, since it is one of the first obstacles to data science. Simply put, data is required, and it is required in a format that can easily be piped through a process. There are two aspects to this: data collection and data pre-processing. It might be hard to believe, but in many organizations the real obstacle to true analysis, from a security or business perspective, is getting data out of silos and into a central repository. It is not the technical challenges that present the largest hurdles. In most organizations, the problem is political: who owns the data, and what do they stand to lose if they make it available?
I’ll leave the political discussion for another day, but I do want to speak to data ingest. Once your sources have been defined and configured for sending data flows, you still need some way of preparing that data for processing. This is the data wrangling stage. There are products on the market, my company makes one, that make ingest and pre-processing a breeze. You really want to get this data into some type of information store and then break it apart. The more granular you can be in this data standardization process, the easier your analysis will be in the end. This is where you should identify common fields in your data sets and begin to build some structure. Some tools on the market will build this structure for you, either during or after ingest.
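To make the wrangling stage a little more concrete, here is a minimal sketch in Python of breaking raw log lines apart into common, named fields. The log format, the field names, and the `wrangle` helper are all hypothetical, just to illustrate the idea of building structure during ingest:

```python
# Hypothetical raw log lines; the format and field names are illustrative only.
raw_lines = [
    "2024-03-01T08:15:02 host1 sshd login user=alice status=success",
    "2024-03-01T08:17:45 host2 sshd login user=bob status=failure",
]

def wrangle(lines):
    """Break each raw line into a dict of common, named fields."""
    records = []
    for line in lines:
        # First four whitespace-separated tokens are fixed positions;
        # the rest are key=value pairs we split into their own fields.
        ts, host, service, event, *kvs = line.split()
        record = {"timestamp": ts, "host": host,
                  "service": service, "event": event}
        for kv in kvs:
            key, _, value = kv.partition("=")
            record[key] = value
        records.append(record)
    return records

for rec in wrangle(raw_lines):
    print(rec)
```

Once every record shares the same named fields, comparing and joining across data sources becomes a simple lookup instead of a parsing exercise.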
Now let’s discuss some terminology. Two terms I mentioned earlier in this series were supervised and unsupervised learning. Supervised learning means that you are building machine learning processes or functions around labeled data. There are fewer unknowns than when dealing with unsupervised learning and unlabeled data. For instance, if I have a set of data with labels identifying fields and functions for a manufacturing process, I might create a function to help me determine the root cause of plant failures. I could use those labels, such as recorded failure times, parts, etc., and build out the process of using all of the data available, or a subset if desired, to determine that causality.
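A minimal sketch of that supervised idea, using a tiny hand-rolled nearest-neighbor classifier: the labels on the training data are the supervision, and the function predicts a label for a new, unlabeled reading. The sensor readings, the label values, and the `predict` helper are all illustrative assumptions, not real plant data:

```python
# Hypothetical labeled sensor readings: (temperature, vibration) -> outcome.
# The labels ("ok" / "failure") are what make this supervised learning.
training_data = [
    ((70.0, 0.2), "ok"),
    ((72.0, 0.3), "ok"),
    ((95.0, 1.1), "failure"),
    ((98.0, 1.4), "failure"),
]

def predict(sample):
    """1-nearest-neighbor: return the label of the closest training point."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest_point, nearest_label = min(
        training_data, key=lambda pair: sq_dist(pair[0], sample))
    return nearest_label

print(predict((96.0, 1.2)))  # falls near the labeled failure readings
```

Real tools bring far better algorithms than this, but the shape is the same: labeled examples in, a labeling function out.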
Unsupervised learning is more difficult, but it can be much more interesting because you are dealing with more unknowns. For instance, I might take some fields within my data that I do know, like time of day for logins, file access, etc., and build a function to determine who is most likely to log in at a certain time of day. Better yet, I can create a function to forecast when users might log in next or what files they might access within a given timeframe. I am using the information I do know, which is labeled, and looking for the unknown. This is the best way I currently know to explain the process, so I hope some of the more educated readers will help me out here. Also, please watch the Stanford videos, because the instructor does a great job of explaining the basics so that even I can understand them.
Now, I promised some algorithm discussion. Let’s start with a very common algorithm in data science and machine learning: linear regression. Linear is the key word here. We take one or more explanatory variables and use them to predict the value of another variable by fitting a straight line to the relationship between them. Plotting our results along that line helps us determine which variables most accurately describe, or contribute to, our comparison subject. We can delve into this a little deeper as we move forward; it is a deep subject and will take some time to flesh out.
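Here is a minimal sketch of simple linear regression, computing the ordinary least squares fit by hand. The numbers and the `fit_line` helper are mine, purely for illustration; real work would use a library, but the math underneath is just this:

```python
# Toy data: x is an explanatory variable, y the outcome we want to predict.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y over the variance of x.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f}x + {intercept:.2f}")  # prints "y = 1.99x + 0.09"
```

With the fitted line in hand, predicting the outcome for a new x is a single multiply and add, which is exactly what makes linear regression such a common starting point.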