BIG DATA: A Technology or a Problem?

Mar 7, 2021

The world is one big data problem — Andrew McAfee

BIG DATA :

As the name suggests, Big Data refers to data that is enormous. More precisely, Big Data is a problem: the data is so large, fast-moving, or complex that it is difficult to store, manage, and manipulate using traditional methods. Many top MNCs such as Facebook, Google, and Instagram generate thousands of terabytes of data per day.

>>Facebook generates 4 petabytes of data every day.

>>Google processes more than 20 petabytes of data every day.

There are four main problems in Big Data:

Volume: Storing such a large amount of data requires storage capacity on the scale of petabytes, exabytes, or even more, which is very expensive. Now imagine you also want to keep a redundant copy of the data for disaster recovery: you would need even more disk space. Volume becomes a problem when the data grows beyond normal limits and storing it on local storage devices becomes inefficient and costly.
Velocity: Velocity measures how fast the data is coming in. It raises questions such as how to process every packet of data that arrives, and how to process high-frequency structured and unstructured data on the fly. A high velocity of data almost always means large swings in the amount of data processed every second. Writing such a large volume of data to storage takes a long time, and so does retrieving it, which has a negative impact on any business.
Variety: Variety is about heterogeneous data types. Data was once collected from one place and delivered in one format. Where it once took the shape of database files such as Excel, CSV, and Access, it now arrives in non-traditional forms such as video, text, PDF, and graphics on social media, as well as from technology such as wearable devices. Although this data is extremely useful, it creates more work and requires more analytical skill to decipher it, make it manageable, and put it to use.
Veracity: Real-world data is so dynamic that it is hard to know what is right and what is wrong. Veracity refers to how trustworthy or messy the data is: the more trustworthy the data, the less messy it is, and vice versa. Veracity together with Value (often counted as a fifth V) defines data quality, which can provide great insights to data scientists.
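
To get a feel for the Volume problem, here is a rough back-of-the-envelope calculation in Python. It uses the 4 petabytes per day figure quoted above; the one-year retention period and the three redundant copies are assumptions chosen only for illustration, and the function name is made up for this sketch.

```python
# Back-of-the-envelope storage estimate.
# The 4 PB/day figure is the one quoted earlier in the article; the
# retention period (365 days) and the number of copies are assumptions.

def raw_storage_pb(daily_pb: float, days: int, replication: int) -> float:
    """Raw disk capacity (in petabytes) needed to keep `days` of data
    when every byte is stored `replication` times."""
    return daily_pb * days * replication

one_copy = raw_storage_pb(daily_pb=4, days=365, replication=1)
three_copies = raw_storage_pb(daily_pb=4, days=365, replication=3)

print(f"One year, single copy:  {one_copy:,.0f} PB")
print(f"One year, three copies: {three_copies:,.0f} PB")
```

Under these assumptions, a single copy of one year of data already needs 1,460 PB, and keeping three copies for disaster recovery triples that to 4,380 PB.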

To solve the challenge of Big Data, we use the concept of distributed storage:

In the distributed model, instead of storing data in one location, data is replicated across multiple physical servers called Data Nodes (or Slave Nodes). These nodes are managed by a node called the Name Node (or Master Node). Distributed data stores differ from traditional data storage in that your data is copied (in whole or in part) across several servers in a storage network. This creates redundancy for data availability: if a single server goes down or is lost, your data is still available from the copies held on other nodes.
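
The idea can be sketched with a toy model in Python. This is not how any real system is implemented; it is a minimal sketch that assumes a tiny block size, two copies per block, and a simple round-robin placement. The names `store`, `read`, `data_nodes`, and `name_node` are all made up for this example.

```python
BLOCK_SIZE = 4    # bytes per block; tiny on purpose (real systems use far larger blocks)
REPLICATION = 2   # number of copies kept of every block (assumed value)

def store(data, data_nodes, name_node, filename):
    """Split `data` into blocks and place each block on REPLICATION different nodes."""
    node_ids = sorted(data_nodes)
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    name_node[filename] = []
    for index, block in enumerate(blocks):
        block_id = f"{filename}#{index}"
        # Round-robin placement: consecutive nodes, wrapping around the cluster.
        holders = [node_ids[(index + r) % len(node_ids)] for r in range(REPLICATION)]
        for node in holders:
            data_nodes[node][block_id] = block
        # The master node stores only metadata: which nodes hold which block.
        name_node[filename].append((block_id, holders))

def read(filename, data_nodes, name_node, down=frozenset()):
    """Rebuild the file from its blocks, skipping any node listed in `down`."""
    result = b""
    for block_id, holders in name_node[filename]:
        alive = [node for node in holders if node not in down]
        if not alive:
            raise IOError(f"all replicas of {block_id} are unavailable")
        result += data_nodes[alive[0]][block_id]
    return result

data_nodes = {"node-a": {}, "node-b": {}, "node-c": {}}
name_node = {}
store(b"big data is big", data_nodes, name_node, "demo.txt")

print(read("demo.txt", data_nodes, name_node))                   # all nodes up
print(read("demo.txt", data_nodes, name_node, down={"node-b"}))  # one node lost
```

Both reads return the full file, because every block lives on two different nodes and losing one node never removes the last copy. If two nodes fail at once in this toy setup, some block loses both copies and `read` raises an error, which is why the number of replicas is a trade-off between safety and storage cost.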

Benefits of distributed storage:

1. Scalability: The primary motivation for distributing storage is to scale horizontally, adding more storage space by adding more storage nodes to the cluster.

2. Redundancy: Distributed storage systems can store more than one copy of the same data, for high availability, backup, and disaster recovery purposes.

3. Cost: Distributed storage makes it possible to use cheaper, commodity hardware to store large volumes of data at low cost.

4. Performance: Distributed storage can offer better performance than a single server in some scenarios. For example, it can store data closer to its consumers, or enable massively parallel access to large files.

To implement the concept of distributed storage, we will use a software tool called Hadoop.

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware. Hadoop is a framework that allows you to first store Big Data in a distributed environment, so that you can then process it in parallel.
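
The "store first, then process in parallel" idea can be illustrated with a small word count in plain Python. This sketch does not use Hadoop or its API: the three strings stand in for pieces of data held on different nodes, and a thread pool stands in for the cluster. The function names are made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk: str) -> Counter:
    """Map step: count the words of one chunk, independently of the others."""
    return Counter(chunk.lower().split())

def parallel_word_count(chunks: list[str]) -> Counter:
    """Count the words of every chunk concurrently, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(count_words, chunks):
            total += partial  # reduce step: combine the partial results
    return total

# Each string plays the role of a piece of data stored on a different node.
chunks = [
    "big data is big",
    "data is stored on many nodes",
    "many nodes process data in parallel",
]
totals = parallel_word_count(chunks)
print(totals["data"], totals["big"], totals["nodes"])  # prints: 3 2 2
```

Because each chunk is counted on its own, the work can be spread over as many workers as there are chunks, and only the small partial counts need to be brought together at the end. In a real cluster the workers would be separate machines, each processing the data it already stores.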

I hope this article helps you understand how big MNCs like Facebook and Google store, manage, and manipulate thousands of terabytes of data with high speed and high efficiency.

Thanks….