@lumarseg

Data science glossary

Term Definition
Algorithm
In mathematics and programming, an algorithm is an ordered set of well-defined instructions or rules. Following its steps in sequence carries out a computation, processes data, or solves a problem.
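As a small illustration of an ordered set of well-defined steps, here is a sketch of Euclid's algorithm for the greatest common divisor. The function name `gcd` and the sample numbers are chosen for this example only.

```python
def gcd(a: int, b: int) -> int:
    """Greatest common divisor of two non-negative integers (Euclid's algorithm)."""
    while b != 0:
        # Replace the pair (a, b) with (b, a mod b) until the remainder is zero.
        a, b = b, a % b
    return a


print(gcd(48, 18))  # 6
```

Each pass of the loop is one well-defined step, and the loop is guaranteed to stop because the remainder shrinks every time.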
API
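An API, or Application Programming Interface, is a defined set of functions, data formats, and rules through which one piece of software can use the features of another. Programming libraries and web services expose APIs so that other programs can call their pre-written functionality without knowing how it is implemented.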
Artificial Intelligence
In computer science, artificial intelligence is the intelligence exhibited by machines: a machine imitating cognitive functions of the human mind. In practice, it involves algorithms that let machines learn through repeated exposure to examples, recognizing patterns within vast amounts of data in order to simulate natural human intelligence.
Backend
This term refers to the web development layer where the processes that access and handle a web application's data are executed. It runs on servers and is not visible to the application's users.
Cloud Computing
Cloud computing is the delivery of computing resources such as servers, storage, databases, networks, and software as services for storing, managing, and processing data. The term "cloud" signifies that these resources are reached via the internet. Instead of keeping a physical server within our company, we rent services from providers such as Amazon Web Services, Google Cloud Platform, or Microsoft Azure.
Cloud Servers
A virtual server that resides on a physical server. This server is hosted by a service provider like AWS, Microsoft Azure, or Google Cloud Platform. It offers functionalities akin to those of a physical server but can be more cost-effective since it's used through a rented service rather than being purchased and physically owned by a company.
Data
In data science, data refers to records of figures, numbers, or words stored in digital documents and in the databases of computer systems. They are the raw material one works with.
Data Cleaning
In data science, this term refers to the process of identifying erroneous or missing data within a table or database and then rectifying or removing them to facilitate analysis.
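The process described above can be sketched in plain Python. The sample records, the field names `name` and `age`, and the rule that a negative age counts as erroneous are all assumptions made for this illustration.

```python
rows = [
    {"name": "Ana", "age": "34"},
    {"name": "Luis", "age": ""},      # missing value
    {"name": "Marta", "age": "-5"},   # erroneous value
    {"name": "Ana", "age": "34"},     # duplicate record
]


def clean(records):
    """Drop records with a missing or impossible age, then drop duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        try:
            age = int(record["age"])
        except ValueError:
            continue  # missing or non-numeric age: remove the record
        if age < 0:
            continue  # impossible age: remove the record
        key = (record["name"], age)
        if key in seen:
            continue  # exact duplicate of a record already kept
        seen.add(key)
        cleaned.append({"name": record["name"], "age": age})
    return cleaned


print(clean(rows))  # [{'name': 'Ana', 'age': 34}]
```

Here bad records are removed; in a real project one might instead rectify them, for example by filling a missing value with a default.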
Data Pipeline
A data pipeline is a sequence of steps in which data is moved and processed, often using several technologies. In data science, pipelines are commonly used to transfer data from OLTP databases to OLAP databases.
Data Science
It's a field that encompasses scientific methods, processes, and systems for deriving insights through data analysis and machine learning. Essentially, it's the process of uncovering valuable information from data.
Data Storage
This refers to the act of storing and safeguarding computerized data within a database, ensuring accessibility to the information when needed.
Data Transformation
Data transformation involves tasks such as splitting data, cleaning null values, adding new columns with new information, and changing data formats.
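A minimal sketch of such tasks on a single record follows. The field names (`full_name`, `signup`) and the date formats are hypothetical, chosen only to show splitting a value, adding a column, and changing a format.

```python
from datetime import datetime


def transform(record):
    """Split a name, reformat a date, and add a derived column."""
    # Split one column into two.
    first, last = record["full_name"].split(" ", 1)
    # Parse a day/month/year string so it can be written in ISO format.
    signup = datetime.strptime(record["signup"], "%d/%m/%Y")
    return {
        "first_name": first,
        "last_name": last,
        "signup": signup.strftime("%Y-%m-%d"),  # changed format
        "signup_year": signup.year,             # new column
    }


print(transform({"full_name": "Ana Torres", "signup": "05/03/2023"}))
```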
Data Visualization
It's a field aimed at representing data through graphs and visual formats. Presenting information visually is more efficient for communication, especially when dealing with extensive data. This allows for quicker and clearer insights, enabling more efficient decision-making.
Database
It's a collection of data belonging to the same context that is stored for use within a software system. To manage databases, we use software known as a database engine. Some common ones include MySQL and PostgreSQL.
ETL
ETL stands for Extract, Transform, Load. It's a type of data pipeline in which we extract data from various sources, transform it into a form suitable for storage, and then load it into specialized databases for analytics.
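The three stages can be sketched with Python's standard library alone. In this illustration the source is an in-memory CSV string and the destination is an in-memory SQLite database standing in for an analytics database; the table name `sales` and the columns are assumptions for the example.

```python
import csv
import io
import sqlite3

RAW_CSV = "product,price\napple,1.50\nbread,2.25\napple,1.50\n"


def run_etl(raw_csv, conn):
    # Extract: read the rows from the source.
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    # Transform: convert price strings to integer cents.
    records = [(r["product"], round(float(r["price"]) * 100)) for r in rows]
    # Load: write the transformed rows into the destination table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (product TEXT, price_cents INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
run_etl(RAW_CSV, conn)
totals = conn.execute(
    "SELECT product, SUM(price_cents) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(totals)  # [('apple', 300), ('bread', 225)]
```

The final query is the kind of analytical question the loaded data is meant to answer.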
Frontend
It pertains to the web development layer where the user-facing portion of a web application is created, typically accessible directly from a browser. All the interfaces you see on the web constitute the frontend layer of the web application.
Insights
Insights, or valuable information, are what is learned from analyzing data. They are typically used to make business decisions, devise strategies, or take other actions within an organization, hence their value. For instance, an insight could be which customers are likely to unsubscribe from a service under specific circumstances.
Machine Learning
Machine learning, also known as automatic learning, is a branch of artificial intelligence with the aim of enabling computers to learn. In machine learning, computers observe vast amounts of data and construct a model capable of generating predictions to solve problems.
Machine Learning Algorithms
Within machine learning, there are various algorithms, such as linear regression and decision trees, that learn from data to produce trained models. Those models are then used to make predictions.
Machine Learning Model
A machine learning model is the output produced when a machine learning algorithm is trained with data. After training a predictive model, we can give it new data similar to the data it was trained on and obtain a prediction as output.
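To make the train-then-predict idea concrete, here is a minimal sketch using least-squares linear regression on one input variable. The function names `train` and `predict` and the toy data are assumptions for this example; real projects would normally use a machine learning library instead.

```python
def train(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept  # the "model" is just these two numbers


def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


# Training data that follows y = 2x + 1.
model = train([1, 2, 3, 4], [3, 5, 7, 9])
print(predict(model, 5))  # 11.0
```

The trained model here is nothing more than the pair of numbers returned by `train`; feeding it a new input yields a prediction.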
OLAP Database
The acronym stands for Online Analytical Processing. These are specialized databases designed to facilitate efficient querying of large volumes of data.
OLTP Database
The acronym stands for Online Transaction Processing. These databases focus on recording and updating data (transactions). They are the databases that applications commonly use for their day-to-day operation, and where the application's data resides.
Parallel Computing
It's a computing approach in which a large problem is broken down into smaller ones whose instructions are executed simultaneously, for example on several processor cores.
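The split-then-combine structure can be sketched with Python's `concurrent.futures` module. This example uses a thread pool to keep it short and portable; note that in CPython, threads do not run CPU-bound Python code truly simultaneously, so for real speedups on this kind of work one would likely swap in `ProcessPoolExecutor`. The function names and the choice of four workers are assumptions for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_sum(chunk):
    """Solve one small sub-problem: the sum of squares of one chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(numbers, workers=4):
    # Break the large problem into roughly equal chunks (ceiling division).
    size = max(1, -(-len(numbers) // workers))
    chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    # Hand the chunks to a pool of workers, then combine the partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk_sum, chunks))


print(parallel_sum_of_squares(list(range(1, 101))))  # 338350
```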
Production
Placing software into "Production" signifies that the software is ready and available for use by its intended users. Software is not directly developed where users access it; it's created internally and undergoes testing beforehand. When it's deemed ready for user interaction, it's put into production and becomes usable. When discussing machine learning in production, it refers to the scenario where software employing trained models operates within user-facing applications.
Programming Language
It's a language used to provide instructions to computers in the form of algorithms. Within the language, preset commands are written to instruct the computer on what actions to perform.
Programming Library
A programming library, or simply a library, is a collection of functions written by others in a programming language. Essentially, it's code that others have written and that you can easily incorporate into the programs you create. For example, Python has the Pandas library, which offers useful functions for data manipulation and analysis that the language doesn't include by default.