@lumarseg

Data Analytics 101 | Volume (Data Storage)

1. Challenge of handling a large volume of data.

In this note, we analyze the challenges of dealing with a large volume of data and survey the types of data sources you may need to ingest and store. We then discuss some options for storing data in the cloud, followed by an introduction to data lakes and data storage methods.

Exponential growth of enterprise data.

There are three major classifications of data sources:

  1. Structured data: Organized and stored as values grouped into rows and columns within a table. These data are usually stored in relational databases and are highly structured according to the rules and constraints established in the database. This type of data is central to transactional applications.
  2. Semi-structured data: Typically stored as a series of key-value pairs grouped into elements within a file. These data are usually stored in non-relational databases (often called NoSQL) or even in XML or JSON files. These data do not have a strict structure and are often of a more transient nature. Examples include player movements in an online video game, browser and internet cache, and social applications that automatically delete posts after a certain time.
  3. Unstructured data: Not consistently structured. Some of these data resemble semi-structured data, while others carry nothing more than metadata. They usually take the form of files or objects, have no single structure, and represent everything else that a company collects and generates. Many companies leave these data untouched because they do not fit conventional tools: they must be labeled and categorized before they can be analyzed, which keeps them out of many data analytics solutions. Examples include images, emails, text files, social media content, text messages, and videos.
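To make the three classifications more concrete, here is a minimal Python sketch. The sample records, field names, and file name are invented for illustration only; they do not come from any real system.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema.
structured = io.StringIO(
    "order_id,customer,amount\n1001,alice,25.50\n1002,bob,12.00\n"
)
rows = list(csv.DictReader(structured))

# Semi-structured: key-value pairs; fields may differ between records.
semi_structured = json.loads(
    '[{"player": "p1", "move": "jump"},'
    ' {"player": "p2", "move": "run", "speed": 3}]'
)

# Unstructured: raw bytes with no schema; only descriptive metadata exists.
unstructured = b"\x89PNG\r\n\x1a\n...image bytes..."
metadata = {"filename": "photo.png", "size_bytes": len(unstructured)}

print(rows[0]["amount"])                  # a known column can be queried directly
print(sorted(semi_structured[1].keys()))  # keys vary from record to record
print(metadata)                           # the content itself still needs labeling
```

Notice that the structured rows can be queried by column name right away, while the unstructured bytes reveal nothing until someone labels or interprets them.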

Many internet articles talk about the vast amount of information contained in unstructured data. New applications are being launched that can now catalog and provide incredible insights into this untapped resource.

But what exactly is unstructured data? It’s in all the files we store, all the photos we take, and the emails we send.

2. Data Storage in the Cloud

When we talk about data analytics solutions, we cannot ignore situations that involve processing a large volume of data, and that data must first be stored securely. A local server may be enough to store your company’s data at the beginning, but as the volume and importance of the data grow, on-premises IT resources become more expensive to manage. This is where storing data in the cloud becomes relevant. Let’s consider three very important concepts: Durability, Availability, and Scalability.

Durability

Durability refers to the capacity of a storage system to keep data persistently and securely, even in adverse situations or unexpected failures. When data is considered durable, it is protected from accidental loss or permanent damage. In practice, durability is achieved through techniques such as:

  • Data Redundancy: Storing multiple copies of the data on different devices or locations to ensure that, in the event of a device failure, the information can be accessed from another source.
  • Backups: Making periodic copies of the data and storing them outside the main system, so they can be restored in case of data loss.
  • Data Integrity: Using integrity verification techniques, such as checksums, to detect errors in the stored data so that a damaged copy can be repaired from a redundant copy or a backup.
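The following sketch shows one way redundancy and integrity checks could work together: a checksum detects a damaged copy, and an intact replica supplies the data. The `store` and `read` functions and the in-memory replicas are hypothetical helpers written for this note, not the behavior of any particular storage service.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def store(data: bytes, replica_count: int = 3) -> dict:
    """Keep several copies plus the checksum recorded at write time."""
    return {
        "checksum": sha256_hex(data),
        "replicas": [bytearray(data) for _ in range(replica_count)],
    }


def read(record: dict) -> bytes:
    """Return the first replica whose checksum still matches."""
    for replica in record["replicas"]:
        if sha256_hex(bytes(replica)) == record["checksum"]:
            return bytes(replica)
    raise IOError("all replicas are corrupted")


record = store(b"quarterly sales report")
record["replicas"][0][0] ^= 0xFF  # simulate corruption of the first copy
recovered = read(record)          # falls back to an intact replica
print(recovered)
```

The checksum alone only tells us that a copy is damaged; it is the extra replicas that make recovery possible.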

Availability

Availability refers to the capacity to access data when needed, quickly and reliably. A storage system with high availability ensures that data is accessible at all times, minimizing downtime and ensuring operational continuity. Some strategies to achieve high availability are:

  • Redundant architectures: Using hardware configurations or storage architectures that allow access to data even if one or several components fail.
  • Load balancing: Distributing the data access load among multiple resources to avoid single points of failure and improve performance.
  • Monitoring and fault detection: Implementing monitoring systems that can detect issues in real-time and activate automatic recovery mechanisms.
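A rough way to see why redundant architectures help: if each component is available a fraction of the time and failures are independent, the data is unreachable only when every replica is down at once. The sketch below assumes independent failures and a hypothetical 99% availability per component; real systems rarely fail fully independently, so treat the numbers as an upper bound.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, replicas: int) -> float:
    """Availability of a group where any one working replica is enough.

    Assumes failures are independent: the group is down only when all
    replicas are down at the same time.
    """
    return 1 - (1 - single) ** replicas


for replicas in (1, 2, 3):
    availability = combined_availability(0.99, replicas)
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{replicas} replica(s): {availability:.6f} available, "
          f"about {downtime_hours:.2f} hours of downtime per year")
```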

Scalability

Scalability refers to the capacity of a system or an application to adapt and handle changes in workload efficiently and without interruptions. In other words, it allows the infrastructure and associated resources to increase or decrease according to changing needs without negatively affecting performance. Scalability can be achieved in two ways:

  • Vertical Scalability (Scaling Up): It involves increasing the capacity of resources in a specific instance. For example, increasing the amount of RAM or processing power in a single virtual machine. This approach has a physical limit and may not be as flexible as horizontal scalability.
  • Horizontal Scalability (Scaling Out): It involves adding more instances or additional resources to distribute the load and improve performance. For example, increasing the number of servers in a cluster to handle more simultaneous requests. This approach is more flexible because growth is not bound by the limits of a single machine.
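As a simple illustration of scaling out and back in, the sketch below estimates how many identical instances are needed for a given load. The function name, the request rates, and the capacity of 200 requests per second per instance are assumptions chosen for the example.

```python
import math


def instances_needed(requests_per_second: float,
                     capacity_per_instance: float) -> int:
    """Number of identical instances required to serve the load.

    Always keeps at least one instance running, even when idle.
    """
    return max(1, math.ceil(requests_per_second / capacity_per_instance))


# Hypothetical load over a day; each instance handles 200 requests/second.
for load in (100, 950, 3000, 400):
    print(f"{load} req/s -> {instances_needed(load, 200)} instance(s)")
```

With vertical scaling, the same calculation would instead raise the capacity of the single instance, which only works until the largest available machine is reached.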

Because you may need to process a large amount of information of varied types (structured, semi-structured, and unstructured), storing data in the cloud is often the most convenient option. One of the available cloud storage services is Amazon Simple Storage Service (Amazon S3). This service is an object store, which means it can store almost any type of independent object; an object is how Amazon S3 refers to the stored data. It is scalable, as storage can grow to the size you need. It is durable, which means that your files are protected against loss, and it is highly available, so they can be reached when you need them. It is also designed with security and performance in mind.

Using Amazon S3 provides three advantages for data analytics:

  • Data Decoupling: With Amazon S3, you can separate the way you store data from the way you process it. This is known as decoupling storage from processing. You can have different buckets for raw data, temporary processing results, and final results.
  • Parallelism: With Amazon S3, you can access all these storage locations from any process in parallel without negatively impacting other processes.
  • Centralized data architecture: Lastly, Amazon S3 becomes a central location to store analytical datasets, enabling access for multiple analytical processes simultaneously. This allows the solution to avoid the costly process of transferring data between the storage system and the processing system.
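The three advantages above can be sketched with a small simulation. The `buckets` dictionary below is a hypothetical in-memory stand-in for three buckets (raw, temporary, and results); it is not the Amazon S3 API, and a real pipeline would use an S3 client instead. The sample data and key names are invented for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for three buckets (not the Amazon S3 API).
buckets = {
    "raw": {"sales.csv": "region,amount\nnorth,10\nsouth,20\nnorth,5\n"},
    "temp": {},
    "results": {},
}


def parse_stage() -> None:
    """Read raw data and write an intermediate result to the temp bucket."""
    lines = buckets["raw"]["sales.csv"].splitlines()[1:]  # skip the header
    records = [
        {"region": region, "amount": int(amount)}
        for region, amount in (line.split(",") for line in lines)
    ]
    buckets["temp"]["sales.json"] = json.dumps(records)


def total_job() -> None:
    """Compute the overall total from the shared intermediate data."""
    records = json.loads(buckets["temp"]["sales.json"])
    total = sum(rec["amount"] for rec in records)
    buckets["results"]["total.json"] = json.dumps(total)


def by_region_job() -> None:
    """Compute totals per region from the same intermediate data."""
    records = json.loads(buckets["temp"]["sales.json"])
    totals = {}
    for rec in records:
        totals[rec["region"]] = totals.get(rec["region"], 0) + rec["amount"]
    buckets["results"]["by_region.json"] = json.dumps(totals, sort_keys=True)


parse_stage()

# Two independent processes read the same central data in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    for future in [pool.submit(total_job), pool.submit(by_region_job)]:
        future.result()  # re-raise any error from the worker

print(buckets["results"])
```

Storage and processing are separate here: either job could be replaced or rerun without touching the stored data or the other job.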

 

3. Data Lakes

4. Data Storage Methods

It's under construction…