Data Engineering

Budget-Friendly Big Data Analysis: Python & Google Colab On an Everyday Laptop

This article is a guide to performing Big Data analysis on a budget, using Python and Google Colab. It outlines realistic minimum hardware requirements, introduces essential tools like Pandas, NumPy, and Dask, and walks through practical exercises for managing data effectively on an everyday laptop.

Santiago Sánchez
April 2, 2024

1. Introduction

As a software developer I have had to face different challenges throughout my career. The use of Python as a programming language is becoming more and more widespread, and it serves as a basis for web development, AI, crypto, etc.

Big Data, according to Oracle, encompasses the phenomenon of “larger and more complex data sets.” In software development with Big Data, the volume of data and the speed of processing are fundamental. A large part of the development life cycle depends on these two factors, so these processes are usually expected to run on hardware that can quickly and efficiently handle volumes ranging from thousands to billions of records.

However, in 2024, I think circumstances have changed in favor of developers, so in this article I will show you how you can do Big Data on an everyday laptop (under $300 USD) using open-source tools and optimization techniques.

2. Realistic minimum hardware requirements

Hewlett-Packard, also known as HP, is a multinational company considered one of the leaders in technology and computer equipment. HP publishes what it considers the minimum requirements for doing data science and big data work. The list is clear:

  • At least 16 GB of RAM
  • A GPU with at least 4 GB of memory (they highlight NVIDIA as an option for GPUs)
  • An Intel® Core™ i7, i9, or Xeon® processor, with at least 4 cores and a base speed of 2.0 GHz
  • Windows 11 or Ubuntu as the operating system

However, not everyone has such equipment at their fingertips. In my case, I code on an ASUS Vivobook laptop that I bought for $295 USD, which has:

  • 8GB RAM
  • An Intel i5-1135G7 4-core processor
  • An integrated Intel Iris Xe Graphics GPU.

So, in this article, we will take my laptop's specifications as the minimum for Big Data development (and we will see whether we can reduce them even further).

3. Work tools

These are the tools we are going to use:

  • Pandas and NumPy
  • Dask
  • Google Colab

3.1. Pandas and NumPy

Pandas and NumPy are two Python libraries popularly used in data science. They are used for data manipulation and scientific computing, respectively. We will use these because they can efficiently handle data structures and multidimensional arrays, which will help us deal with large amounts of data.

3.2. Dask

Dask is a library whose API closely mirrors Pandas and NumPy, with the difference that it is focused on parallel, large-scale data processing. We will use it because it splits data sets into partitions and processes them in parallel, which lets it handle large amounts of data efficiently.

3.3. Google Colaboratory (Colab)

For the last use case, we will use Google Colaboratory to run Python code from the web browser. We will use it because it offers free access to GPUs and TPUs and works with the libraries mentioned above. It also has subscription plans for access to more powerful cloud machines. Alternatively, you can run the code locally if you have the necessary hardware and want to try it anyway.

4. Exercise

We are going to build a simple ETL (extract, transform, load) pipeline using Google Colab and the libraries mentioned above.

4.1. Create a Google Colab Notebook

We will create a Google Colab notebook using the following link.

4.2. Import modules

4.3. Create dataset

We are going to create an example dataset for the exercise. For this, we will create a new code block in the notebook and execute the following script:
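
A minimal sketch of such a script, under stated assumptions: the column names, restaurant names, and row count are hypothetical, and only the file name `restaurant_reviews.csv` comes from this article. Increase `N_ROWS` to produce a larger file.

```python
import numpy as np
import pandas as pd

N_ROWS = 100_000  # hypothetical size; raise it to stress your machine


def make_reviews(n_rows, seed=42):
    """Build a dummy reviews table (all column names are illustrative)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "review_id": np.arange(n_rows),
        "restaurant": rng.choice(
            ["Alpha Grill", "Beta Bistro", "Gamma Cafe"], size=n_rows
        ),
        "rating": rng.integers(1, 6, size=n_rows),  # 1 to 5 inclusive
        "price": rng.uniform(5, 100, size=n_rows).round(2),
    })
    # Blank out roughly 5% of prices so there is something to clean later.
    df.loc[rng.random(n_rows) < 0.05, "price"] = np.nan
    return df


make_reviews(N_ROWS).to_csv("restaurant_reviews.csv", index=False)
```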

This will create a new dummy dataset called restaurant_reviews.csv.

4.3.1. Check dataset size

In another block of code, we are going to execute the following script to validate the size of the created dataset.
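
One way to check the size on disk is `os.path.getsize`. The helper name `file_size_mb` is an illustrative choice, and the sketch assumes the CSV from the previous step is in the current working directory.

```python
import os


def file_size_mb(path):
    """Return the size of a file in mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 ** 2)


path = "restaurant_reviews.csv"
if os.path.exists(path):
    print(f"{path}: {file_size_mb(path):.2f} MB")
else:
    print(f"{path} not found; run the dataset step first")
```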

4.4. Extract using Pandas & Dask

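In another block of code, we are going to load the data into a data frame and compare the loading speeds of Pandas and Dask:

In this run, Dask returned far sooner than Pandas when loading the dataset. Note, however, that Dask's `read_csv` is lazy: it builds a task graph and defers reading most of the data until a result is requested with `.compute()`, so the gap reflects deferred work as much as raw speed.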

4.5. Transform & clean data using Pandas & NumPy

Now, we will clean and transform the data using both Pandas and NumPy functions. We will start with Pandas functions:
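
As a sketch of typical Pandas cleaning steps, the example below removes duplicates, drops rows with missing key fields, tidies text, and fixes column types. The sample rows and column names are hypothetical stand-ins for the generated dataset.

```python
import pandas as pd

# Hypothetical raw sample with a duplicate row, missing values, and messy text
raw = pd.DataFrame({
    "restaurant": [" alpha grill", "Beta Bistro ", "Beta Bistro ", "gamma cafe", None],
    "rating": [5, 3, 3, None, 4],
    "price": ["12.50", "30.00", "30.00", "8.75", "15.00"],
})

clean = (
    raw.drop_duplicates()                           # remove exact duplicate rows
       .dropna(subset=["restaurant", "rating"])     # require both key fields
       .assign(
           restaurant=lambda d: d["restaurant"].str.strip().str.title(),
           rating=lambda d: d["rating"].astype("int64"),
           price=lambda d: pd.to_numeric(d["price"]),
       )
       .reset_index(drop=True)
)
print(clean)
```

Chaining the steps keeps the raw frame untouched, which makes it easier to rerun the cell while experimenting.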

Then, we will use NumPy functions to continue with the data frame transformation process:
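
A sketch of NumPy-based transformations on a data frame: capping outliers with `np.clip`, labeling rows with `np.where`, and bucketing values with `np.select`. The columns, thresholds, and labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned sample
df = pd.DataFrame({
    "rating": [5, 3, 1, 4],
    "price": [12.5, 30.0, 250.0, 8.75],
})

price = df["price"].to_numpy()
rating = df["rating"].to_numpy()

# Cap prices to the 0-100 range to limit the effect of outliers
df["price_capped"] = np.clip(price, 0, 100)

# Binary label from the rating (vectorized if/else)
df["sentiment"] = np.where(rating >= 4, "positive", "negative")

# Multi-way bucketing: the first matching condition wins
df["price_band"] = np.select(
    [price < 10, price < 50],
    ["low", "mid"],
    default="high",
)
print(df)
```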

4.6. Load data into DB via API

Now, we are going to simulate an upload process using a fake normalization template. The template converts each data frame row into a request to a dummy REST API, which simulates uploading the data to a server.
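
The idea can be sketched as follows. Everything named here is hypothetical: the `TEMPLATE` mapping, the `fake_post` stand-in, and the `https://example.com/api/reviews` URL. No network traffic is sent; in a real pipeline `fake_post` would be replaced by an HTTP client call.

```python
import json

import pandas as pd

# Hypothetical template: payload field -> data frame column
TEMPLATE = {"name": "restaurant", "score": "rating", "amount": "price"}
API_URL = "https://example.com/api/reviews"  # placeholder endpoint


def to_payload(row, template=TEMPLATE):
    """Rename the columns of one record according to the template."""
    return {field: row[column] for field, column in template.items()}


def fake_post(url, payload):
    """Stand-in for an HTTP POST; only serializes the payload locally."""
    body = json.dumps(payload)
    return {"status": 201, "url": url, "bytes": len(body)}


def upload(df, url=API_URL, post=fake_post):
    """Send one simulated request per row and collect the responses."""
    # to_json/loads guarantees plain JSON types (no NumPy scalars)
    records = json.loads(df.to_json(orient="records"))
    return [post(url, to_payload(record)) for record in records]


df = pd.DataFrame({
    "restaurant": ["Alpha Grill", "Beta Bistro"],
    "rating": [5, 3],
    "price": [12.5, 30.0],
})
responses = upload(df)
print(f"{len(responses)} rows uploaded, statuses: {[r['status'] for r in responses]}")
```

Passing the `post` function as a parameter keeps the loop unchanged when the fake is swapped for a real client.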

5. Conclusions

In summary, this article serves as a comprehensive guide for developers who want to address Big Data challenges efficiently on affordable laptops.

By using Python, Google Colab, and core libraries like Pandas, NumPy, and Dask, users can successfully manage huge data sets with ease. Performance comparison, data cleansing processes, and simulated data loads underscore the practicality and affordability of these tools, allowing developers to do complicated tasks seamlessly.
