To simplify everything into a couple of sentences: Hadoop is an open-source technology modeled on the white papers that Google released about GFS, the “Google File System”, and about MapReduce. It was started by Doug Cutting and Mike Cafarella, and back in the day Yahoo hired Cutting and put a team of engineers to work on Hadoop, building on the ideas in those Google papers.
To fully understand this, I’m going to describe the kinds of data that exist. This is important so you can see what Hadoop is and where it really fits in:
Structured Data
This category of data has been around for a long time. It is nothing more than the normal relational database we are all familiar with, e.g. Oracle and MySQL. Most of our data used to be stored in relational databases, and in some important institutions it still is.
Semi Structured Data
You might be somewhat familiar with this kind of data. It sits in between structured and unstructured data (I will get to unstructured data in the next topic). To make this concrete, XML and JSON are two common examples of semi-structured data.
Unstructured Data
This data was sitting in offices for a long time, but nobody had the technology to load and analyze it, because it is not structured enough to get something out of easily. That is where Hadoop comes in: it helps companies load and analyze this kind of data. Examples include company e-mail, documents, text files and other data of that sort.
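To make the three categories a little more concrete, here is a minimal Python sketch. The table, the JSON record and the e-mail text are all made up for illustration; the point is only how much structure you can rely on when reading each kind of data.

```python
import json
import re
import sqlite3

# Structured: rows with a fixed schema in a relational table.
# (An in-memory SQLite table stands in for Oracle or MySQL here.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Alice', 'Boston')")
structured_city = conn.execute(
    "SELECT city FROM customers WHERE id = 1"
).fetchone()[0]

# Semi-structured: the data carries its own labels, but there is no fixed schema.
record = json.loads('{"id": 1, "name": "Alice", "tags": ["vip", "newsletter"]}')
semi_structured_tags = record["tags"]

# Unstructured: free text, so you have to search for what you need.
email_body = "Hi team, Alice from Boston called about invoice 4471 again."
unstructured_numbers = re.findall(r"\d+", email_body)

print(structured_city, semi_structured_tags, unstructured_numbers)
```

With the table you ask for a column by name, with JSON you follow the labels inside the record, and with the e-mail you are left pattern matching on raw text.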
An often-quoted estimate says that 90% of our data was created in the last two years, and that is a lot of data. In terms of Hadoop and other databases we call this Big Data: as a rough rule of thumb, any data that grows past the terabyte range counts as Big Data. This is where Hadoop comes in. It helps companies distribute the processing across many machines when one machine cannot handle the load. Here are some examples of big companies that work with data at this scale:
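- Google: they currently have some of the biggest data on the planet, indexing around 60 billion pages of the internet (websites, blogs, news), and even so, when you search for something it gives you the results within a couple of seconds, which is awesome. Note that Google runs its own systems, GFS and MapReduce, which are the ones Hadoop was modeled on.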
- Facebook: they have one of the biggest Hadoop clusters in the world, holding around 100 petabytes of data. On top of that, they generate around half a petabyte of new data every day, including status updates, likes, comments, pictures and so on.
- Twitter: another big example. They handle around 400 million tweets every day, which could be around 15 terabytes of data per day, and they run a lot of Hadoop mappers and reducers to work out the latest and greatest trends.
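To give a feel for what a mapper and a reducer do, here is a small Python sketch that counts hashtags to find trends. This is not Twitter's code and it does not use Hadoop itself. It only imitates the map, shuffle and reduce steps on one machine, and the function names and sample tweets are made up for illustration.

```python
from collections import defaultdict


def mapper(tweet):
    """Map step: emit a (hashtag, 1) pair for every hashtag in one tweet."""
    for word in tweet.split():
        if word.startswith("#") and len(word) > 1:
            yield word.lower(), 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(hashtag, counts):
    """Reduce step: add up the counts collected for one hashtag."""
    return hashtag, sum(counts)


def top_trends(tweets, n=2):
    """Run map, shuffle and reduce, then return the n most used hashtags."""
    pairs = [pair for tweet in tweets for pair in mapper(tweet)]
    totals = [reducer(tag, counts) for tag, counts in shuffle(pairs).items()]
    return sorted(totals, key=lambda item: (-item[1], item[0]))[:n]


sample_tweets = [
    "Loving the new #Hadoop release",
    "#hadoop and #bigdata go together",
    "Reading about #BigData today",
    "#hadoop cluster is up",
]

print(top_trends(sample_tweets))
```

On a real cluster the mappers would run on many machines at once, each on its own slice of the tweets, and the reducers would add up the partial counts. The logic stays the same; only the scale changes.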
One more thing: Hadoop has a feature called rack awareness. What this really means is that it knows how many racks you have and where your data is replicated. If you set the replication factor of your data to a value like 3, Hadoop will keep copies of your data on 3 different servers. If your Hadoop cluster is rack aware, it knows where the copies of a given piece of data are located, so it can do the work close to one of those copies instead of moving the data from one rack to another, which saves a lot of bandwidth and gets the job done faster.
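Here is a rough Python sketch of the rack awareness idea with a replication factor of 3. It loosely follows the default placement described in the HDFS documentation (first copy on the writer's server, second copy on a server in another rack, third copy on a different server in that same second rack), but it is a simplification: real Hadoop picks servers randomly and checks things like free space, while this sketch just takes the first match. The server and rack names are made up.

```python
def place_replicas(topology, writer_node):
    """Pick 3 servers for one block; topology maps server name -> rack name."""
    local_rack = topology[writer_node]
    # Candidate servers that sit in a different rack than the writer.
    remote_nodes = sorted(n for n, rack in topology.items() if rack != local_rack)
    if not remote_nodes:
        raise ValueError("need at least two racks")
    second = remote_nodes[0]
    # Third copy: another server in the same rack as the second copy.
    same_rack = [n for n in remote_nodes
                 if topology[n] == topology[second] and n != second]
    if not same_rack:
        raise ValueError("need two servers in the remote rack")
    return [writer_node, second, same_rack[0]]


def closest_replica(topology, replicas, reader_node):
    """Prefer a copy on the same server, then the same rack, then anywhere."""
    def distance(node):
        if node == reader_node:
            return 0
        if topology[node] == topology[reader_node]:
            return 1
        return 2
    return min(replicas, key=distance)


topology = {"node1": "rack1", "node2": "rack1",
            "node3": "rack2", "node4": "rack2"}
replicas = place_replicas(topology, "node1")
print(replicas)
print(closest_replica(topology, replicas, "node2"))
```

Spreading the copies over two racks means the data survives the loss of a whole rack, and reading from the closest copy is what keeps traffic between racks low.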
Best way to explain Hadoop for Non-Geeks