The Data Mining Trilogy: I. Preparation
Overview
In this Data Mining trilogy, I'm going to walk through the critical steps every data scientist should perform when handling data:
- Data Preparation: load and integrate data sources
- Data Cleaning: preprocess the data to make it meaningful and usable
- Exploratory Data Analysis: analyze the data and understand its patterns, making corresponding adjustments to the data along the way
In this blog, let's talk about the first one – Data Preparation.
Sources of data
A great variety of data sources has emerged since the age of machine learning began. While we may come across all sorts of data types (image, video, text, spreadsheet, signal… you name it), we can often get this data from a few popular websites:
- Google Dataset Search
  - Pros:
    - Wide coverage; you can find datasets on almost any topic
  - Cons:
    - Some indexed datasets are not actually accessible, and the site does not indicate that at all!
- Government Datasets
  - Pros:
    - Most of them are publicly available, which is good
    - Most of the data is well preprocessed, so you don't have to worry much about it
  - Cons:
    - Sometimes the dataset is not large enough for meaningful projects
    - Governments may not release the most up-to-date datasets, so the effectiveness of prediction models trained on them may be questionable
- Kaggle Datasets
  - Pros:
    - Really easy to get the data via the command line, which is often preferred by professional engineers; see the tutorial on the Kaggle API and the example command after this list
    - Aside from the datasets themselves, you can often find a community of machine learning enthusiasts on Kaggle, along with excellent tutorials on the datasets you find
  - Cons:
    - It takes a bit of practice to get used to Kaggle; passion and drive are the keys to success there
- UCI Machine Learning Repository
  - Pros:
    - One of the most widely used repositories for machine learning datasets
    - Very often, the datasets are tied to academic or industrial research projects, which makes them extremely helpful to researchers
  - Cons:
    - As a long-standing repo, its datasets have been studied extensively, so it may not be the place for new breakthroughs (though it is still very helpful for beginners)
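For the Kaggle route mentioned above, here is a minimal sketch of the command-line workflow. The `owner/dataset-name` slug is a placeholder (copy the real one from the dataset's Kaggle page), and it assumes the `kaggle` package is installed with an API token in `~/.kaggle/kaggle.json`:

```bash
# Download a dataset by its slug and extract it in place
kaggle datasets download -d owner/dataset-name --unzip
```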
Code for loading the data
The most common format for machine learning data is the CSV file, and we are using Python 3.x here for the actual code. This step should mark the start of your notebook (right after the usual np/pd/sklearn/plt imports).
```python
import pandas as pd
```
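For instance, a minimal sketch of the usual pandas route (the file name `data.csv` is a placeholder for your own dataset):

```python
import pandas as pd

# "data.csv" is a placeholder; point it at your own file
df = pd.read_csv("data.csv")
print(df.head())  # quick sanity check of the first few rows
```

`read_csv` also handles separators, header rows, and missing-value markers via keyword arguments (`sep`, `header`, `na_values`).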
Note that pandas is used so pervasively that people forget about other ways to load data:
Load CSV with the Python Standard Library
```python
import csv
```
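A minimal sketch with the built-in `csv` module, under the same placeholder file name:

```python
import csv

# Read the file row by row; every field comes back as a string
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)  # first row as column names
    rows = list(reader)
print(header, len(rows))
```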
Load CSV with NumPy
```python
import numpy
```
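A minimal sketch with `numpy.loadtxt`, assuming a purely numeric CSV (pass `skiprows=1` if the file has a header row):

```python
import numpy

# loadtxt expects numeric fields; the delimiter defaults to whitespace
data = numpy.loadtxt("data.csv", delimiter=",")
print(data.shape)
```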
Load CSV from a URL
```python
from numpy import loadtxt
```
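A minimal sketch of loading straight from the web; the URL below is a placeholder, and `loadtxt` happily accepts the file-like object that `urlopen` returns:

```python
from urllib.request import urlopen
from numpy import loadtxt

# Placeholder URL; substitute a real link to a numeric CSV
url = "https://example.com/data.csv"
data = loadtxt(urlopen(url), delimiter=",")
print(data.shape)
```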
Now be cautioned that these recipes are just for CSV files. There are plenty of other data formats, and Google is always your best friend for finding methods to load them.
{: .notice--info .notice--x-large}
That was input; how about output?
Converting Python objects into byte streams is known as pickling, a form of serialization. Serialized data can be passed around efficiently, and is very often stored as .pkl or .json files.
Python Pickle and JSON
The following table is inspired by this tutorial.
| | Python Pickle | JSON |
| --- | --- | --- |
| Definition | Pickling converts Python objects (lists, dicts, tuples, etc.) into byte streams that can be saved to disk or transferred over the network. The saved byte stream contains the information needed to reconstruct the original Python object; converting byte streams back into objects is called de-serialization. | JSON stands for JavaScript Object Notation, a text format for structured data. Stored data can be loaded without having to recreate it from scratch. |
| Storage format | Binary serialization format | Simple text serialization format, human-readable |
| Storage versatility | Not only data entries but also classes and methods can be serialized and de-serialized | Limited to certain Python objects; it cannot serialize everything (classes and functions, for example) |
| Language dependency | Python-specific and version-sensitive (2.x pickle files may not be readable in a 3.x environment) | Supported by almost all programming languages |
| Speed | Slower serialization and de-serialization | Lightweight, much faster than pickle |
| Security | Pickle files always carry security risks, since unpickling untrusted data can execute arbitrary code | Generally secure |
With the above table in mind, you can choose your output format accordingly.
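As a quick illustration of both routes, here is a minimal sketch that round-trips the same small object each way (the file names are placeholders):

```python
import json
import pickle

record = {"name": "example", "rows": 150, "features": ["a", "b"]}

# Pickle: binary and Python-specific, hence the "b" in the file mode
with open("record.pkl", "wb") as f:
    pickle.dump(record, f)
with open("record.pkl", "rb") as f:
    from_pickle = pickle.load(f)

# JSON: human-readable text, portable across languages
with open("record.json", "w") as f:
    json.dump(record, f)
with open("record.json") as f:
    from_json = json.load(f)

print(from_pickle == from_json)  # True for simple JSON-compatible objects
```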
Conclusion
Loading data is merely the first step, and people can quickly learn to do it. However, I/O choices do matter, and one should make them with care. Now, let's step into the second stage of data mining: cleaning the data.