1. Introduction
As a software developer I have had to face different challenges throughout my career. The use of Python as a programming language is becoming more and more widespread, and it serves as a basis for web development, AI, crypto, etc.
Big Data, according to this Oracle article, encompasses the phenomenon of “larger and more complex data sets.” In software development with Big Data, the volume of data and the speed of processing are fundamental. I consider that a large part of the development life cycle depends on these two factors, so these workloads are usually expected to run on hardware that can quickly and efficiently handle anywhere from thousands to billions of records.
However, as of mid-2024, I think circumstances have changed in favor of developers. In this article, I will show you how to do Big Data on an everyday laptop (under $300 USD) using open-source tools and optimization techniques.
2. Realistic minimum hardware requirements
Hewlett-Packard, also known as HP, is a multinational company considered one of the leaders in technology and computer hardware. They explain here what they consider the minimum requirements for doing data science with Big Data. The list they share is clear:
- Min. 16GB of RAM
- A GPU with a minimum of 4GB of memory (they emphasize the use of NVIDIA as an option for GPUs)
- An Intel® Core™ i7, i9, or Xeon® processor with a minimum of 4 cores and a base speed of 2.0GHz
- Windows 11 or Ubuntu operating system
However, not everyone has such equipment at their fingertips. In my case, I code on an ASUS Vivobook laptop that I bought for $295 USD, which has:
- 8GB RAM
- An Intel i5-1135G7 4-core processor
- An integrated Intel Iris Xe Graphics GPU.
So, in this article, we will take these specifications as our minimum (and we will see whether we can reduce them even further) for Big Data development.
3. Work tools
These are the tools we are going to use:
- Pandas and NumPy
- Dask
- Google Colab
3.1. Pandas and NumPy
Pandas and NumPy are two Python libraries widely used in data science, for data manipulation and scientific computing, respectively. We will use them because they handle tabular data structures and multidimensional arrays efficiently, which will help us deal with large amounts of data.
3.2. Dask
Dask is a library whose API closely mirrors that of Pandas and NumPy, with the difference that it is focused on large-scale parallel and distributed data processing. We will use it for its efficiency when processing large data sets.
3.3. Google Colaboratory (Colab)
For the last use case, we will use the Google Colaboratory service to run Python code from the web browser. We will use it because it offers free access to GPUs and TPUs and works with the aforementioned libraries. It also has subscription plans for access to more powerful cloud machines. Alternatively, you can run the code locally if you have the necessary hardware and want to try it anyway.
In short, our work environment will consist of a Google Colab notebook running Pandas, NumPy, and Dask.
4. Exercise
We are going to build a simple ETL (extract, transform, load) process using Google Colab and the libraries mentioned above.
4.1. Create a Google Colab Notebook
We will create a Google Colab notebook using the following link.
4.2. Import modules
4.3. Create dataset
We are going to create an example dataset for the exercise. For this, we will create a new code block in the notebook and run a script that generates it.
This will create a new dummy dataset called restaurant_reviews.csv.
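A generator script along these lines could look like the sketch below. Only the file name restaurant_reviews.csv comes from the exercise; the columns, the sample values, the 5% share of missing ratings, and the row count are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def make_reviews(n_rows: int, seed: int = 42) -> pd.DataFrame:
    """Build a dummy restaurant-review table (all columns are hypothetical)."""
    rng = np.random.default_rng(seed)
    restaurants = np.array(["La Parrilla", "Sushi Go", "Pasta Nostra", "Taco Loco"])
    comments = np.array(["Great food", "Too slow", "Would return", "Average"])
    df = pd.DataFrame({
        "review_id": np.arange(1, n_rows + 1),
        "restaurant": rng.choice(restaurants, size=n_rows),
        # integers(1, 6) draws from 1..5; float so that missing values fit
        "rating": rng.integers(1, 6, size=n_rows).astype(float),
        "comment": rng.choice(comments, size=n_rows),
        "price": rng.uniform(5, 100, size=n_rows).round(2),
    })
    # Blank out roughly 5% of ratings so the cleaning step has work to do
    missing = rng.random(n_rows) < 0.05
    df.loc[missing, "rating"] = np.nan
    return df


def write_dataset(path: str, n_rows: int) -> None:
    """Generate the dummy data and save it as a CSV file."""
    make_reviews(n_rows).to_csv(path, index=False)


if __name__ == "__main__":
    # 100,000 rows keeps the demo quick; raise n_rows for a larger file
    write_dataset("restaurant_reviews.csv", n_rows=100_000)
```

Fixing the seed keeps the generated file identical between runs, which makes the later timing comparisons easier to repeat.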
4.3.1. Check dataset size
In another code block, we will run a short script to check the size of the dataset we created.
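One simple way to do this check is with `os.path.getsize`. The sketch below assumes the file sits in the notebook's working directory; the helper name `file_size_mb` is my own.

```python
import os


def file_size_mb(path: str) -> float:
    """Return the size of a file in megabytes (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


if __name__ == "__main__":
    path = "restaurant_reviews.csv"
    if os.path.exists(path):
        print(f"{path}: {file_size_mb(path):.2f} MB")
    else:
        print(f"{path} not found; run the dataset step first")
```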
4.4. Extract using Pandas & Dask
In another code block, we will load the data into a data frame in Python and compare the loading speeds of Pandas and Dask.
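We conclude that, for this case, Dask strongly outperforms Pandas in loading the dataset. Keep in mind that Dask evaluates lazily, so part of this gap may come from deferring the actual read until a result is requested.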
4.5. Transform & clean data using Pandas & NumPy
Now, we will perform some data cleaning and transformation using both Pandas and NumPy functions. We will start with the Pandas functions.
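As a sketch of what the Pandas cleaning step might involve, the function below removes duplicates, drops rows without a rating, and normalizes the restaurant names. The column names and the specific cleaning rules are assumptions for illustration, shown on a tiny sample instead of the full CSV.

```python
import pandas as pd


def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Apply basic Pandas cleaning steps (columns are hypothetical)."""
    out = df.drop_duplicates().copy()      # remove exact duplicate rows
    out = out.dropna(subset=["rating"])    # drop reviews without a rating
    # Trim stray whitespace and use consistent capitalization
    out["restaurant"] = out["restaurant"].str.strip().str.title()
    out["rating"] = out["rating"].astype(int)
    return out.reset_index(drop=True)


# Small sample with a duplicate, a missing rating, and messy names
sample = pd.DataFrame({
    "restaurant": ["  taco loco", "sushi go ", "sushi go ", "pasta nostra"],
    "rating": [5.0, 3.0, 3.0, None],
})
cleaned = clean_reviews(sample)
print(cleaned)
```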
Then, we will use NumPy functions to continue transforming the data frame.
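A NumPy-based transformation step could look like the sketch below, which derives new columns with vectorized operations instead of Python loops. The columns (`rating`, `price`), the sentiment threshold of 4, and the derived column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def add_numpy_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive new columns with vectorized NumPy functions."""
    out = df.copy()
    rating = out["rating"].to_numpy()
    price = out["price"].to_numpy()
    # Label each review in one vectorized pass (threshold is an assumption)
    out["sentiment"] = np.where(rating >= 4, "positive", "negative")
    # log1p compresses the long tail of expensive outliers
    out["log_price"] = np.log1p(price)
    return out


sample = pd.DataFrame({
    "rating": [5, 3, 4],
    "price": [10.0, 20.0, 1000.0],
})
featured = add_numpy_features(sample)
print(featured)
```

Working on the underlying arrays with `to_numpy()` keeps each operation to a single pass over the column, which matters once the data frame has millions of rows.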
4.6. Load data into DB via API
Now, we are going to simulate an upload process using a fake normalization template. This will allow us to convert each data frame entry into a request to a dummy REST API that simulates uploading data to a server.
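The simulated upload can be sketched as follows. Everything here is hypothetical: the payload schema in `to_payload`, the `fake_post` stand-in (which makes no network call and always reports success), and the URL. In a real pipeline, `send` would be replaced by an actual HTTP client call.

```python
import json

import pandas as pd


def to_payload(row: dict) -> dict:
    """Normalize one data frame row into a hypothetical API schema."""
    return {
        "restaurant": row["restaurant"],
        "rating": int(row["rating"]),
    }


def fake_post(url: str, payload: dict) -> int:
    """Stand-in for an HTTP POST: no network call is made."""
    json.dumps(payload)  # raises if the payload is not JSON-serializable
    return 201           # pretend the server created the record


def upload(df: pd.DataFrame, url: str, send=fake_post) -> int:
    """Send every row as one request and count the successful ones."""
    sent = 0
    for row in df.to_dict(orient="records"):
        if send(url, to_payload(row)) == 201:
            sent += 1
    return sent


demo = pd.DataFrame({"restaurant": ["Taco Loco", "Sushi Go"], "rating": [5, 3]})
uploaded = upload(demo, "https://api.example.com/reviews")  # placeholder URL
print(f"Uploaded {uploaded} of {len(demo)} rows")
```

Passing the sender as a parameter keeps the row-to-request logic testable without a server.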
5. Conclusions
In summary, this article serves as a comprehensive guide for developers who want to address Big Data challenges efficiently on affordable laptops.
By using Python, Google Colab, and core libraries like Pandas, NumPy, and Dask, users can successfully manage huge data sets with ease. Performance comparison, data cleansing processes, and simulated data loads underscore the practicality and affordability of these tools, allowing developers to do complicated tasks seamlessly.