Working with huge datasets (800K+ files) in Google Colab and Google Drive

colab + drive

1. Upload the entire folder containing the 800K+ images to Google Drive

2. Zip the dataset folder, upload it to Google Drive, and then unzip it

3. Create the dataset directly on Google Colab and write the files to Drive

This is a damn long time to process a single background image; each of these BGs creates 200*20 = 4,000 images.

4. Maybe use threads?
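The threads idea could look roughly like the sketch below. It is only an illustration: `write_one` is a hypothetical stand-in for generating and saving one image, the output folder is a temporary directory rather than a mounted Drive path, and whether threads actually help depends on where the bottleneck is (per-file overhead on Drive may still dominate).

```python
from __future__ import annotations

import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_one(out_dir: Path, index: int) -> Path:
    """Hypothetical stand-in for generating and saving a single image."""
    path = out_dir / f"img_{index:06d}.bin"
    path.write_bytes(index.to_bytes(4, "big"))  # placeholder payload, not a real image
    return path


def write_many(out_dir: Path, count: int, workers: int = 8) -> list[Path]:
    """Write `count` files with a thread pool; results come back in index order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: write_one(out_dir, i), range(count)))


# Demo against a temporary directory (a Drive folder is assumed in real use).
with tempfile.TemporaryDirectory() as tmp:
    paths = write_many(Path(tmp) / "dataset", count=100)
    written = len(paths)
```

Threads only overlap the waiting on I/O; they do not reduce the number of files Drive has to track, which is the problem the next section gets around.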

Hmm... so what worked?!

Create the dataset on Google Drive, written directly into a .zip/.tar file 🥳🎊
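Here is a minimal sketch of that idea using Python's standard `zipfile` module: each generated image goes straight into the archive as an entry, so Drive only ever sees one file. The names `fake_image_bytes` and `write_dataset_zip` are made up for illustration, and the demo writes to a temporary directory; in Colab the archive path would presumably sit under the mounted Drive folder.

```python
import tempfile
import zipfile
from pathlib import Path


def fake_image_bytes(index: int) -> bytes:
    """Hypothetical stand-in for an encoded image (e.g. JPEG bytes)."""
    return f"image-{index}".encode()


def write_dataset_zip(zip_path: Path, count: int) -> None:
    """Write `count` entries directly into one archive, with no loose files on disk."""
    # ZIP_STORED skips recompression, a reasonable choice for already-compressed
    # formats such as JPEG; switch to ZIP_DEFLATED for raw data.
    with zipfile.ZipFile(zip_path, mode="w", compression=zipfile.ZIP_STORED) as zf:
        for i in range(count):
            zf.writestr(f"images/img_{i:06d}.jpg", fake_image_bytes(i))


# Demo: the archive path is a temp file here (assumption: a Drive path in real use).
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "dataset.zip"
    write_dataset_zip(archive, count=50)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
        first_payload = zf.read("images/img_000000.jpg")
```

Reading works the same way in reverse: open the archive once and pull entries by name, instead of listing a folder with hundreds of thousands of files.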

  • Always work with your huge datasets in batches!
  • Save your work to Google Drive periodically. Use .zip files for huge datasets, and consider splitting them into parts if possible.
  • You might need to call Python's garbage collector (the gc module) to free up memory.
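The three tips above can be combined in one loop. This is a sketch under assumptions, not the exact code used for this dataset: the batch size, the `dataset_partNNN.zip` naming, and the placeholder payloads are all made up for illustration.

```python
import gc
import tempfile
import zipfile
from pathlib import Path


def batched(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def save_in_parts(out_dir: Path, names, batch_size: int):
    """Write each batch to its own zip part and free memory between batches."""
    out_dir.mkdir(parents=True, exist_ok=True)
    part_paths = []
    for part, batch in enumerate(batched(names, batch_size)):
        part_path = out_dir / f"dataset_part{part:03d}.zip"  # hypothetical naming scheme
        with zipfile.ZipFile(part_path, mode="w") as zf:
            for name in batch:
                payload = name.encode()  # placeholder for real image bytes
                zf.writestr(name, payload)
        part_paths.append(part_path)
        gc.collect()  # ask Python to reclaim unreachable objects before the next batch
    return part_paths


# Demo: 25 entries in batches of 10 gives three parts (10, 10, and 5 entries).
with tempfile.TemporaryDirectory() as tmp:
    all_names = [f"img_{i:06d}.jpg" for i in range(25)]
    parts = save_in_parts(Path(tmp) / "parts", all_names, batch_size=10)
    part_sizes = []
    for p in parts:
        with zipfile.ZipFile(p) as zf:
            part_sizes.append(len(zf.namelist()))
```

Because every part is closed before the next one starts, a crashed or disconnected Colab session costs at most the batch that was in progress.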

Depth Estimation model run on my dataset

That’s all Folks!




Satyajit Ghana
