A new study reports a model that is 98.48% accurate at detecting Covid-19 from chest X-rays.
Researchers trained a convolutional neural network on a Kaggle dataset.
The hope is that the technology can be used to quickly and effectively identify Covid-19 patients.
As the Covid-19 pandemic continues to evolve, there is a pressing need for a faster diagnostic system. Testing kit shortages, virus mutations, and soaring numbers of cases have overwhelmed health care systems worldwide. Even when a good testing policy is in place, lab testing is arduous, expensive, and time consuming. Cheap antigen tests, which can give results within minutes, are widely available but suffer from low sensitivity: the tests correctly identify just 75% of Covid-19 cases a week after symptoms start.
Shashwat Sanket and colleagues set out to find an easy, fast, and accurate alternative using simple chest X-ray images. The team found that bilateral changes seen in chest X-rays of patients with Covid-19 can be analyzed and classified without a radiologist’s interpretation, using convolutional neural networks (CNNs). The study, published in the September issue of Multimedia Tools and Applications, trained a CNN to diagnose Covid-19 from chest X-rays, achieving 98.48% classification accuracy. The journal article, titled Detection of novel coronavirus from chest X-rays using deep convolutional neural networks, shows promise in the ongoing efforts to find ways to detect Covid-19 quickly and effectively.
What are Convolutional Neural Networks?
A convolutional neural network (CNN) is a deep learning algorithm modeled loosely on the response of neurons in the visual cortex. The algorithm takes an input image and weighs the relative importance of various aspects of the image. Each neuron responds to a small region of the image, and these regions overlap to span the entire field of vision, with neurons in one layer feeding neurons in the next. The multilayered CNN includes an input layer, an output layer, and several hidden layers. A simple process called pooling keeps the most important features while reducing the dimensionality of the feature map.
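To make the pooling step concrete, here is a minimal sketch in plain Python. It assumes max pooling over non-overlapping 2 × 2 windows, which is a common choice but not one the article specifies; the function name and the sample feature map are made up for illustration.

```python
def max_pool_2x2(feature_map):
    """Downsample a 2D feature map by keeping the largest value in each 2x2 block."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    # Step by 2 so the blocks do not overlap; a trailing odd row or column is dropped.
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            block = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            pooled_row.append(max(block))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 feature map (not data from the study).
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[6, 2], [2, 9]]
```

The 4 × 4 map shrinks to 2 × 2, yet the strongest response in each region survives, which is the sense in which pooling reduces dimensionality while keeping the most important features.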
One major advantage of CNNs is that, compared to other classification algorithms, they require much less pre-processing. In addition, CNNs use regularized weights over fewer parameters, which helps mitigate the exploding gradient and vanishing gradient problems that affect traditional neural networks during backpropagation.
The study began with a Kaggle dataset containing radiography images. As well as chest X-ray images for 219 Covid-19 positive cases, the dataset contained 1341 normal chest X-rays and 1345 viral pneumonia images. Random selection was used to reduce the normal and viral pneumonia images to a balanced 219 each. The model, which the authors dubbed CovCNN, was trained with augmented chest X-ray images. The raw images were standardized using transformations like shearing, shifting, and rotation. They were also converted to the same size: 224 × 224 × 3 pixels. Following the augmentation, the dataset was split into 525 images for training and 132 images for testing.
The following image, from the study authors, demonstrates how the augmented images appear. Image a in the top row shows how Covid-19 appears on an X-ray, in comparison to four normal chest X-rays:
Seven existing pre-trained transfer learning models were used in the study, including ResNet-101 (a CNN 101 layers deep), Xception (71 layers deep), and VGG-16, which is widely used in image classification problems but painfully slow to train. Transfer learning takes lessons learned from previous classification problems and transfers that knowledge to a new task, in this case correctly identifying Covid-19 patients.
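The class balancing and train/test split described for the dataset can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file names are placeholders, the helper `balance_and_split` is hypothetical, and the authors’ actual sampling and splitting procedure (for example, whether the split was stratified) is not described in the article. Augmentation is left out.

```python
import random


def balance_and_split(paths_by_class, per_class, n_train, seed=0):
    """Randomly keep `per_class` items from each class, shuffle, then split."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    balanced = []
    for label, paths in paths_by_class.items():
        # Random selection without replacement reduces larger classes.
        for path in rng.sample(paths, per_class):
            balanced.append((path, label))
    rng.shuffle(balanced)
    return balanced[:n_train], balanced[n_train:]


# Placeholder file names standing in for the real images.
paths_by_class = {
    "covid": [f"covid_{i}.png" for i in range(219)],
    "normal": [f"normal_{i}.png" for i in range(1341)],
    "viral_pneumonia": [f"pneumonia_{i}.png" for i in range(1345)],
}
train, test = balance_and_split(paths_by_class, per_class=219, n_train=525)
print(len(train), len(test))  # 525 132
```

Three classes of 219 images give 657 images in total, which matches the 525 training and 132 testing images reported.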
Four variant CovCNN models were tested for effectiveness with several metrics, including accuracy, F1-score, sensitivity, and specificity. The F1-score is the harmonic mean of precision and recall. Sensitivity is the true positive rate, the proportion of actual positive cases that are correctly identified. Specificity is the true negative rate, the proportion of actual negative cases that are correctly identified. The CovCNN_4 model outperformed all the other models, achieving 98.48% accuracy, 100% sensitivity, and 97.73% specificity. This fine-tuned deep network contained 15 layers, stacked sequentially with increasing filter sizes. This next image shows the layout of the model:
The authors conclude that their CovCNN_4 model can be employed to assist medical practitioners and radiologists with faster, more accurate Covid-19 diagnosis, as well as with follow-up cases. In addition, they suggest that the model’s accuracy could be further improved by “fusion of CNN and pre-trained model features”.
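For readers who want to see how the metrics used to compare the CovCNN variants are computed, here is a minimal sketch for a single positive class. The confusion-matrix counts are hypothetical and are not taken from the study, and the function name is an assumption.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true positive rate, also called recall
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts chosen for illustration only.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In a three-class problem like this one, the same formulas are usually applied per class by treating one class as positive and the other two as negative.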
As the market gets more competitive with time, businesses are altering their strategies to stay afloat and to cater better to changing customer needs. Present-day customers have smartened up considerably! They know what they want, and luring them with glitzy ads and lofty marketing pitches does not cut much ice anymore. They want better value for money and an enhanced experience. So, businesses need to offer better service, enhance product quality, and become more productive and efficient.
Data analytics is a big weapon for enhancing the operational efficacy of businesses.
Nowadays, businesses of varying types and sizes are turning to data analytics applications to enhance efficiency and productivity levels. They obtain data from a number of sources, both offline and online. This huge amount of data is then compiled and analysed using specialized BI solutions. The resulting reports and insights help businesses get a better grasp of the various nuances of their operations. Many use cutting-edge data analytics applications, including Power BI solutions.
How using data analytics software and applications can be useful for businesses.
It helps businesses identify market needs: BI and data analytics tools can be useful for identifying market needs. Data obtained from online and offline customer surveys, polls and other types of feedback is compiled and analysed by such applications. The results can help businesses understand the precise needs of the market, which can vary from one location to another. When businesses understand regional market needs better, they can tweak their production plans accordingly. This proves beneficial in the long run.
It helps brands detect and eliminate supply chain hurdles: For a brand manufacturing physical products, supply chain optimization can prove tedious. Logistics-related issues can crop up unexpectedly, hampering the sales and supply chain system. Issues that can affect the supply chain include shipping delays, damage to fragile items, weather-related hassles, employee issues, etc. This is where data analytics tools like Power BI can come in handy.
The data collected through sensors, cloud services and wearable devices is analysed by such applications. The generated reports help Power BI consultants figure out the existing loopholes leading to disruptions in the supply chain. They can then come up with strategies to tackle and eliminate such issues.
It helps identify and resolve team coordination issues: Sometimes, a company may find it hard to achieve its operational targets owing to poor sync between various departments. Departments like HR, sales and advertising may not be well synced with one another. This can lead to inefficient resource sharing. For the management, it may be hard to figure out these internal glitches. However, hiring a data analytics expert can help resolve such situations.
A veteran Power BI developer can use the tool to analyze collected data and find the issues leading to a lack of sync between various departments. Suitable remedial measures can then be taken to boost resource sharing, which can help improve efficiency.
It helps detect employee and team productivity issues: Not everyone on a team has the same efficacy and productivity. A senior team member may work smarter and faster than newly inducted ones. Sometimes, disgruntled employees may deliberately work in an unproductive way. Overall output suffers when such issues affect the productivity and efficacy of a company’s employees.
For the management, checking the efficacy of every single employee may not be easy. In a large organization, it is nearly impossible. However, identifying employee efficacy and productivity becomes easier when a suitable data analytics solution is used. Hiring a Power BI development professional can be handy in such situations. Once the factors leading to a productivity deficit are identified, corrective measures can be deployed.
It helps detect third-party and vendor-related issues: In many companies, working with third-party vendors and suppliers becomes necessary. Businesses may rely on such vendors for the supply of raw materials, and they also hire such vendors to outsource specific operations. Sometimes, the operational output of the company may suffer owing to reliance on a vendor not suited to its needs. The suitability of such vendors can be understood well by deploying data analytics services.
It aids in understanding speed-related issues: Sluggishness in production may affect the output in a business setup. Production or manufacturing involves a number of stages, and delay in one or more stages can affect productivity and efficacy. It may be hard for the company management to fathom what is causing the delay in the production workflow. The reasons can be worn-out machinery or an unskilled workforce. Deploying the latest data analytics solutions can be useful for detecting and resolving the issues affecting production speed.
It helps in detecting IT infrastructure issues: Sometimes, your business may find it hard to achieve operational targets owing to the use of outdated or ageing IT infrastructure. Both hardware and software issues can affect output and efficiency. The legacy systems used in some organizations have been seen to bottleneck the prowess of a skilled and efficient workforce. Deploying the latest data analytics solutions helps companies understand which part of the IT infrastructure is causing the deficit in output.
It aids in understanding cost overrun factors: In every company, incurring costs is a prerequisite for keeping the workflow alive. However, it is also necessary that the running expenditure of the workplace is kept within a limit. It can be hard to figure out whether the money spent on overheads like electricity, internet, sanitation, etc., is being kept within a limit or whether overspending is taking place. Sometimes, hidden costs may be involved, which may escape the scrutiny of the accounts department.
When data analytics tools are used, it is easier to find out instances of cost overrun in such setups. The management then can take up corrective measures to ensure running cost is kept within feasible limits.
Summing it up
Usage of data analytics tools like Power BI helps a company in figuring out issues that are bottlenecking productivity and output. The advanced data analysis and report generation capabilities of such tools help businesses fathom issues that can be hard to interpret and analyze otherwise. By using such tools, businesses can also make near accurate predictions about market dynamics and customer preferences. However, to leverage the full potential of such tools, hiring suitable data analytics professionals will be necessary.
Big Data is trending right now, but how does it change the eCommerce industry? Let’s understand in detail.
eCommerce is booming, and consumer data has become a lifeline for online stores. The eCommerce industry generates a huge volume of data on customer patterns and purchasing habits.
It is projected that by 2025, the digital universe of data will reach 175 zettabytes, a 61 percent increase. This includes e-commerce data: shoppers’ activities, their locations, web browser histories, and abandoned shopping carts.
Modern technologies such as Artificial Intelligence (AI), Machine Learning, and Big Data are not just for books and sci-fi movies anymore. They are now among the most common tools used to optimize an e-commerce site’s performance.
Gartner predicted that by 2020, 85% of customer communications might not require human intervention due to advancements in AI. Online businesses should have access to a large volume of data, enabling them to make better decisions about their customers, the products they recommend, and how they will plan and implement their marketing campaigns.
A great deal of success in e-commerce relies on Big Data to plan future business moves. Now before discussing how Big Data impacts eCommerce, let’s understand the meaning of Big Data Analytics.
Big Data Analytics means examining a huge volume of data to identify hidden patterns, correlations, and other valuable insights. This enables online stores to make informed decisions based on data.
E-commerce companies use Big Data analytics to understand their customers better, forecast consumer behaviour patterns, and increase revenue. According to a study conducted by BARC, brands can gain the following benefits from using Big Data:
Making data-driven decisions
Improved control on business operations
Deliver top-notch customer experience
Reduce operational cost
Allow customers to make secure online payments
Supply management and logistics
The eCommerce market is skyrocketing. According to Elluminatiinc.com, 2.15 billion people shop online today, and this figure will continue to grow because customers today value comfort over anything else.
Now think from the eCommerce business owner’s point of view: how do they identify the preferences of these billions of customers and provide them with a personalized experience? Here Big Data comes to the rescue. eCommerce Big Data includes structured and unstructured information about customers, such as their addresses, zip codes, shopping cart contents, and more.
When it comes to serving a huge range of customers, especially for multi-channel selling firms, it’s hard to manage and constantly update information across sales channels. This can make it hard to keep your loyal customers. That’s why integrating your business with a reliable tool is the best way to deal with the problem. Here, Big Data and LitCommerce come to the rescue.
Email, video, tweets, and social media comments are unstructured eCommerce data that can also serve as valuable sources of information. The ability to examine shopping carts or show individual content based on an IP address via a content management system is already available to online retailers, but Big Data discovery will extend their capabilities in the short term.
Big Data in eCommerce
Are you able to benefit from this? Well, Big Data lets you organize information in a pretty structured way so that you can provide a top-notch experience to customers. As a result, e-commerce business owners gain valuable insight into the choices and behaviours of their customers, resulting in increased sales and traffic.
The following are the most notable ways Big Data will affect eCommerce in the future.
Enhance Customer Service
Big Data plays an important role in delivering excellent customer service because it keeps track of your existing and new customers’ records and helps you study their preferences to boost the engagement ratio.
This includes what your customers like, which payment methods they use, what kind of products they buy frequently, and much more. Consequently, eCommerce business owners can understand the user’s mind and offer a personalized experience to drive sales and traffic.
Take the online streaming service Netflix as an example. Along with personalization, it has also implemented autoscaling support to meet the evolving needs of customers.
Improve Payment Methods and Security
Unsecured payments and a lack of variety in payment methods contribute to abandoned carts. For instance, customers will not purchase from your store if they can’t find their preferred payment method. eCommerce stores can improve conversion rates by offering a variety of digital payment methods, but these should be swift and secure.
Big Data can also improve payment security in the future. A variety of payment methods and safe, secure transactions are important for customers. Here Big Data can detect fraudulent activity and ensure a smooth experience for customers.
Business owners can set up alerts for transactions on the same credit card that are out of the ordinary or for orders coming from the same IP address using various payment methods.
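The two alert rules just described can be sketched as simple checks over a list of transactions. Everything here is an assumption made for illustration: the record fields, the thresholds, and the `flag_suspicious` helper are hypothetical, and a production fraud system would be considerably more involved.

```python
from collections import defaultdict


def flag_suspicious(transactions, amount_factor=5.0, max_methods_per_ip=2):
    """Return the ids of transactions that break either alert rule."""
    amounts_by_card = defaultdict(list)
    methods_by_ip = defaultdict(set)
    for t in transactions:
        amounts_by_card[t["card"]].append(t["amount"])
        methods_by_ip[t["ip"]].add(t["method"])

    flagged = set()
    for t in transactions:
        # Rule 1: amount far above the average of the card's other transactions.
        others = list(amounts_by_card[t["card"]])
        others.remove(t["amount"])
        if others and t["amount"] > amount_factor * (sum(others) / len(others)):
            flagged.add(t["id"])
        # Rule 2: one IP address using many different payment methods.
        if len(methods_by_ip[t["ip"]]) > max_methods_per_ip:
            flagged.add(t["id"])
    return flagged


# Made-up transactions; the IP addresses come from documentation ranges.
transactions = [
    {"id": 1, "card": "A", "ip": "203.0.113.5", "method": "visa", "amount": 20.0},
    {"id": 2, "card": "A", "ip": "203.0.113.5", "method": "visa", "amount": 25.0},
    {"id": 3, "card": "A", "ip": "203.0.113.5", "method": "visa", "amount": 400.0},
    {"id": 4, "card": "B", "ip": "198.51.100.7", "method": "visa", "amount": 30.0},
    {"id": 5, "card": "C", "ip": "198.51.100.7", "method": "paypal", "amount": 35.0},
    {"id": 6, "card": "D", "ip": "198.51.100.7", "method": "giftcard", "amount": 40.0},
]
flagged = flag_suspicious(transactions)
print(sorted(flagged))  # [3, 4, 5, 6]
```

Transaction 3 is flagged because it is far larger than the other purchases on the same card, and transactions 4 to 6 are flagged because a single IP address used three payment methods.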
Who does not like to get huge discounts on products they love? We all do, right? Use customer data to determine specific offers relevant to customers’ previous purchases, and send them discount codes and other offers based on their buying habits.
Additionally, Big Data can be used to find potential customers who browse a website or abandon a purchase without buying. You can send a customer an email inviting them to purchase a product they looked at or reminding them of it. This is how Amazon and eBay perfected the art of online selling.
Helps to Conduct A/B Testing
To ensure a seamless and efficient online experience, A/B testing is essential. Detecting bugs and removing errors will help your business grow. Data collected from your store lets you test your pricing model over time, and it also helps you optimize overall store performance.
Especially on days when demand is high, incorrect pricing can make retailers lose money. With time-based A/B testing, marketers can also identify where the lift occurs and how to use it to increase volume, which helps them determine discounting strategies.
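One common way to judge whether an A/B test result is a real lift or just noise is a two-proportion z-test on conversion rates. The sketch below is an assumption about how such a comparison might be done, not a method described in the article, and the visitor and conversion counts are invented.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare two conversion rates; return the z statistic and two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the assumption that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical test: price A converts 100 of 1000 visitors, price B 130 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value suggests the difference between the two variants is unlikely to be chance alone, though the threshold to act on is a business decision.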
Forecast Trends and Demand
Predicting trends and demand is just as important as meeting buyers’ needs. Having the right inventory on hand for the future is crucial for e-commerce. eCommerce brands can make plans for upcoming events, seasonal changes, or emerging trends via Big Data.
Businesses that sell products online amass massive datasets. Analysis of previous data allows them to plan inventory, predict peak times, forecast demand, and streamline operations overall.
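As a very small illustration of forecasting demand from previous data, the sketch below uses a simple moving average. This is one of the most basic forecasting techniques and is chosen here only as an example; the sales figures and function name are hypothetical.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the most recent `window` values."""
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    recent = history[-window:]
    return sum(recent) / window


# Made-up weekly unit sales for one product.
weekly_units = [120, 135, 150, 160, 170]
forecast = moving_average_forecast(weekly_units)
print(forecast)  # 160.0
```

Real demand forecasting would typically also account for seasonality, promotions, and trends, which a plain moving average ignores.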
E-commerce companies can also offer various discounts to optimize pricing. Machine Learning and Big Data here make it easier to predict when and how discounts should be offered as well as how long they should last.
Big Data Is Here to Stay
The above points clearly show that Big Data is making eCommerce better. eCommerce is already being affected by Big Data, and that impact will only continue to grow. With its use, online stores do business faster and more easily than ever before. Their business activities are improved through the use of Big Data in all areas, and customer satisfaction is always maintained.
Data science is exploding in popularity because it is tethered to the future of technology, the supply and demand for high-paying jobs, and the bleeding edge of corporate culture, startups and innovation!
Students from South and East Asia especially can fast-track lucrative technology careers with data science, even as tech startups are exploding in those areas with increased foreign funding. Think carefully. Would you consider becoming a data scientist? According to Coursera:
A data scientist might do the following tasks on a day-to-day basis:
Find patterns and trends in datasets to uncover insights
Create algorithms and data models to forecast outcomes
Use machine learning techniques to improve quality of data or product offerings
Communicate recommendations to other teams and senior staff
Deploy data tools such as Python, R, SAS, or SQL in data analysis
Stay on top of innovations in the data science field
In a data-based world of algorithms, data science encompasses many roles since data scientists help organizations to make the best out of their business data.
In many countries there’s still a shortage of expert data scientists that are familiar with the latest tools and technologies. As fields such as machine learning, AI, data analytics, cloud computing and related industries get moving, the labor shortage of skilled professionals will continue.
Some Data Science Tasks Are Being Automated with RPA
As some tasks of data scientists become automated, it’s important for programming students and data science enthusiasts to focus on learning hard skills that should continue to be in demand well into the 2020s and 2030s. As such, I wanted to make an easy list of the top skills for knowledge workers in this exciting area of the labor market for tech jobs.
Shortage of Data Scientists Continues in Labor Pool
So the idea here is to acquire skills that are more difficult for RPA and other automation technologies to automate at organizations. It’s also important to specialize in skills where business adoption is already high and still growing fast, such as cloud computing and artificial intelligence.
AI Jobs Will Grow Significantly in the 2020s
In India, according to LinkedIn, AI is one of the fastest-growing job categories. LinkedIn notes: Artificial Intelligence roles play an important role in India’s emerging jobs landscape, as machine learning unlocks innovation and opportunities. Roles in the sector range from helping machines automate processes to teaching them to perceive the environment and make autonomous decisions. This technology is being developed across a range of sectors, from healthcare to cybersecurity.
The top skills they cite are Deep Learning, Machine Learning, Artificial Intelligence (AI), Natural Language Processing (NLP), and TensorFlow.
With such young cohorts of Millennials and Gen Z, countries like India and Nigeria are positioned to have some of the most productive workforces in the world in the latter half of the 2020s and the 2030s, and yes, demographics really matter here. So for a young Indian, Nigerian, Indonesian, Brazilian or Malaysian in 2021, this really is the right time to start a career in data science, since that could lead to bigger and brighter things.
So let’s start the list of general skills that I think matter most for future data scientists and for students now studying programming and related fields, skills that are transferable to the innovation boom that is coming.
1. Machine Learning
Machine learning is a branch of artificial intelligence (AI) that has become one of the most important developments in data science. This skill focuses on building algorithms designed to find patterns in big data sets, improving their accuracy over time.
The more data a machine learning algorithm processes, the “smarter” it becomes, allowing for more accurate predictions.
Data analysts (average U.S. salary of $67,500) aren’t generally expected to have a mastery of machine learning. But developing your machine learning skills could give you a competitive advantage and set you on a course for a future career as a data scientist.
2. Python
Python is often seen as the all-star entry point into the data science domain. Python is the most popular programming language for data science. If you’re looking for a new job as a data scientist, you’ll find that Python is also required in most job postings for data science roles.
Why is that?
Python libraries including TensorFlow, Scikit-learn, Pandas, Keras, PyTorch, and NumPy also appear in many data science job postings.
According to SlashData, there are 8.2 million active Python users with “a whopping 69% of machine learning developers and data scientists now using Python”.
Python syntax is easy to follow and write, which makes it a simple programming language to get started with and learn quickly. A lot of data scientists actually come from backgrounds in statistics, mathematics, or other technical fields and may not have as much coding experience when they enter the field of data science. Since Big Data and AI are exploding, the Python community is, as you may know, large, thriving, and welcoming.
A library in Python is a collection of modules with pre-built code to help with common tasks. The number of libraries available for Python is staggering to me.
You may want to familiarize yourself with what they actually do:
Data Cleaning, Analysis and Visualization
NumPy: NumPy is a Python library that provides support for many mathematical tasks on large, multidimensional arrays and matrices.
Matplotlib: This library provides simple ways to create static or interactive boxplots, scatterplots, line graphs, and bar charts. It’s useful for simplifying your data visualization tasks.
Pandas: The Pandas library is one of the most popular and easy-to-use libraries available. It allows for easy manipulation of tabular data for data cleaning and data analysis.
SciPy: SciPy is a library used for scientific computing that helps with linear algebra, optimization, and statistical tasks.
Seaborn: Seaborn is another data visualization library built on top of Matplotlib that allows for visually appealing statistical graphs. It allows you to easily visualize confidence intervals, distributions and other graphs.
Statsmodels: This statistical modeling library supports a wide range of statistical models and statistical tests, including linear regression, generalized linear models, and time series analysis models.
Requests: This is a useful library for fetching data from websites, for example when scraping. It provides a user-friendly way to configure and send HTTP requests.
Then there are the Python libraries more related to machine learning itself.
TensorFlow: TensorFlow is a library for building neural networks. Since it is mostly written in C++, it provides the simplicity of Python without sacrificing power and performance.
Scikit-learn: This popular machine learning library is a one-stop shop for many machine learning needs, with support for both supervised and unsupervised tasks.
Keras: Keras is a popular high-level API that acts as an interface for the TensorFlow library. It’s a tool for building neural networks using a TensorFlow backend that’s extremely user friendly and easy to get started with.
PyTorch: PyTorch is another framework for deep learning, created by Facebook’s AI research group. It is often regarded as offering more flexibility than Keras.
So as you can see Python is a great foot-in-the-door skill that’s related to entering the field of data science.
3. R, A Great Programming Language for Data Science in Industry
R is not always mentioned alongside data science. Here’s why I think it’s important.
R is another programming language that’s widely used in the data science industry. One can learn data science with R via a reliable online course. R is suitable for extracting key statistics from a large chunk of data. Various industries use R for data science like healthcare, e-commerce, banking and others.
R’s open interfaces allow it to integrate with other applications and systems. As a programming language, R provides objects, operators and functions that allow users to explore, model and visualize data.
As you may know, machine learning is entering the finance, banking, healthcare and E-commerce sectors more and more.
R is more specialized than Python and as such might have higher demand in some sectors. R is typically used in statistical computing. So if you are technically minded, R could be a good bet, because R for data science focuses on the language’s statistical and graphical uses. When you learn R for data science, you’ll learn how to use the language to perform statistical analyses and develop data visualizations. R’s statistical functions also make it easy to clean, import and analyze data. So if that’s your cup of tea, R is great for finance at the intersection of data science.
4. Tableau for Data Analytics
With more data comes the need for better data analytics. The evolution of data science workers really is a marvel to behold. In a sense, data science is nothing new and is just the practical application of statistical techniques that have existed for a long time. But honestly, I think data analytics, and Big Data even more so, changes how we can visualize and use data to drive business outcomes.
Tableau is an in-demand data analytics and visualization tool used in the industry. Tableau offers visual dashboards to understand the insights quickly. It supports numerous data sources, thus offering flexibility to data scientists. Tableau offers an expansive visual BI and analytics platform and is widely regarded as the major player in the marketplace.
It’s worth taking a look at if data visualization interests you. Other data visualization tools include Power BI, Excel and others.
5. SQL and NoSQL
Even in 2021, SQL remains surprisingly common in data science jobs. SQL (Structured Query Language) is used for performing various operations on the data stored in databases, such as updating records, deleting records, and creating and modifying tables and views. SQL is also the standard for current big data platforms, which use it as the key API for their relational databases.
So if you are into databases, the general operations of data, data analytics and working in a data-driven environment, SQL is certainly good to know.
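The SQL operations mentioned in this section (creating tables and views, updating records, deleting records) can be tried out with Python’s built-in sqlite3 module. The table, column names, and rows below are invented for the example, and SQLite is used only because it needs no setup.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Create a table and insert a few made-up orders.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 0.0)],
)

# Update a record, then delete records that match a condition.
conn.execute("UPDATE orders SET total = 15.0 WHERE customer = 'bob'")
conn.execute("DELETE FROM orders WHERE total = 0")

# Create a view that summarizes spending per customer, then query it.
conn.execute(
    "CREATE VIEW customer_totals AS "
    "SELECT customer, SUM(total) AS spent FROM orders GROUP BY customer"
)
rows = conn.execute(
    "SELECT customer, spent FROM customer_totals ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 50.0), ('bob', 15.0)]
conn.close()
```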
Are you good at trend spotting? Do you enjoy thinking critically with data? As data collection has increased exponentially, so has the need for people skilled at using and interacting with data to be able to think critically, and provide insights to make better decisions and optimize their businesses.
Becoming a data analyst could be more enjoyable than you think, even if it lacks some of the glamor and hype of other sectors of data science.
According to Coursera, data analysis is the process of gleaning insights from data to help inform better business decisions. The process of analyzing data typically moves through five iterative phases:
Identify the data you want to analyze
Collect the data
Clean the data in preparation for analysis
Analyze the data
Interpret the results of the analysis
6. Microsoft Power BI
With Azure doing so well in the cloud, Microsoft’s Power BI is good to specialize in if you are less interested in algorithms and more interested in data analytics and data visualization. So what is it?
Microsoft Power BI is essentially a collection of apps, software services, tools, and connectors that work together to turn our data sources into insights and visually attractive, immersive reports.
Power BI is an all-in-one, high-level tool for the data analytics part of data science. It can be thought of as less of a programming-language type of application and more of a high-level application akin to something like Microsoft Excel.
If you are highly specialized in Power BI, it’s likely you’d always be able to find productive work. It’s what I would consider a safe bet in data science. While it’s considered user friendly, it’s not open source, which might put off some people.
7. Math and Statistics Foundations or Specialization
It seems only common sense to add this, but if you are interested in a future with algorithms or deep learning, a background in math or statistics will be very helpful. Not all data scientists will want to go in this direction, but a data scientist will of course be expected to understand the different approaches to statistics, including maximum likelihood estimators, distributions, and statistical tests, in order to help make recommendations and decisions. Calculus and linear algebra are both key, as they’re both tied to machine learning algorithms.
The easiest way to think of it is that math and stats are the building blocks of machine learning algorithms. For instance, statistics is used to process complex problems in the real world so that data scientists and analysts can look for meaningful trends and changes in data. In simple words, statistics can be used to derive meaningful insights from data by performing mathematical computations on it. Therefore the aspiring student of data science will want to be strong in stats and math. Since many algorithms deal with predictive analytics, it will also be useful to be well grounded in probability.
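To give a feel for the maximum likelihood estimators mentioned in this section, here is a minimal sketch for the simplest textbook case: data assumed to follow a normal distribution. Under that assumption, the maximum likelihood estimates are the sample mean and the variance computed with a divisor of n rather than n - 1. The sample values are made up.

```python
def normal_mle(samples):
    """Maximum likelihood estimates of the mean and variance of a normal distribution."""
    n = len(samples)
    mu = sum(samples) / n
    # The MLE of the variance divides by n, not n - 1.
    variance = sum((x - mu) ** 2 for x in samples) / n
    return mu, variance


# Hypothetical measurements.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, variance = normal_mle(samples)
print(mu, variance)  # 5.0 4.0
```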
8. Data Wrangling
The manipulation of data, or wrangling, is also an important part of data science; data cleaning is one example. Data manipulation and wrangling may take up a lot of time but ultimately help you make better data-driven decisions. Commonly applied steps include missing value imputation, outlier treatment, correcting data types, scaling, and transformation. In general, this is what makes data analysis possible.
Data wrangling is essentially the process of cleaning and unifying messy and complex data sets for easy access and analysis. With the amount of data and data sources rapidly growing and expanding, it is increasingly essential for large amounts of available data to be organized for analysis. There are software platforms that specialize in the data analytics lifecycle.
The steps of this cycle might include:
Collecting data: The first step is to decide which data you need, where to extract it from, and then, of course, to collect it (or scrape it).
Exploratory data analysis: Carrying out an initial analysis helps summarize a dataset’s core features and defines its structure (or lack of one).
Structuring the data: Most raw data is unstructured and text-heavy. You’ll need to parse your data (break it down into its syntactic components) and transform it into a more user-friendly format.
Data cleaning: Once your data has some structure, it needs cleaning. This involves removing errors, duplicate values, unwanted outliers, and so on.
Enriching: Next you’ll need to enhance your data, either by filling in missing values or by merging it with additional sources to accumulate additional data points.
Validation: Then you’ll need to check that your data meets all your requirements and that you’ve properly carried out all the previous steps. This commonly involves using tools like Python.
Storing the data: Finally, store and publish your data in a dedicated architecture, database, or warehouse so it is accessible to end users, whoever they might be.
Tools that might be used in data wrangling include Scrapy, Tableau, Parsehub, Microsoft Power Query, Talend, Alteryx APA Platform, Altair Monarch, and many others.
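As a rough illustration of the cleaning steps mentioned in this section (correcting data types, missing value imputation, outlier treatment, and scaling), here is a small sketch using only the Python standard library. The records are made up, and the specific choices (median imputation, clipping at the median plus or minus 10 times the median absolute deviation, min-max scaling) are assumptions for the example; in practice you would pick methods that suit your data.

```python
from statistics import median

# Hypothetical messy records: everything arrives as text, with gaps and one outlier.
raw_rows = [
    {"age": "34", "income": "52000"},
    {"age": "", "income": "48000"},
    {"age": "29", "income": "1000000"},  # suspicious outlier
    {"age": "41", "income": ""},
    {"age": "38", "income": "51000"},
]

def to_float(text):
    """Correct data types: convert text to float, empty strings to None."""
    return float(text) if text.strip() else None

def impute_median(values):
    """Missing value imputation: replace None with the median of the known values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def clip_outliers(values, k=10.0):
    """Outlier treatment: clip to median +/- k * MAD (median absolute deviation)."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    low, high = center - k * mad, center + k * mad
    return [min(max(v, low), high) for v in values]

def min_max_scale(values):
    """Scaling: map values linearly onto [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

def wrangle(rows, column):
    """Run the four steps in order for one column."""
    values = [to_float(row[column]) for row in rows]
    values = impute_median(values)
    values = clip_outliers(values)
    return min_max_scale(values)

clean_income = wrangle(raw_rows, "income")
clean_age = wrangle(raw_rows, "age")
```

Libraries such as pandas offer the same operations in far fewer lines; the point here is only to show what each step does to the data.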
9. Machine Learning Methodology
Will data science become more automated? This is an interesting question. At its core, data science is a field of study that aims to use a scientific approach to extract meaning and insights from data. Machine learning, on the other hand, refers to a group of techniques used by data scientists that allow computers to learn from data.
Machine learning refers to techniques that produce results that perform well without explicitly programmed rules. If data science is the scientific approach to extracting meaning and insights from data, it is really a combination of information technology, modeling, and business management. However, machine learning or even deep learning often does the heavy lifting.
Since there is just a massive explosion of big data, data scientists will be in high demand for likely the next couple of decades at least. Machine learning creates a useful model or program by autonomously testing many solutions against the available data and finding the best fit for the problem. Machine learning leads to deep learning and is the basis for artificial intelligence as we know it today. Deep learning is a type of machine learning, which is a subset of artificial intelligence.
So if a data science student is interested in working on AI, they will need a firm grounding in machine learning methodology. While machine learning requires less computing power, deep learning typically needs less ongoing human intervention. They are both being used to solve significant problems in smart cities and the future of humanity.
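The idea of "testing many solutions against the available data and finding the best fit" can be sketched in a few lines of Python. This is a deliberately naive illustration: it tries every candidate straight line on a small grid and keeps the one with the lowest error. Real machine learning libraries use much more efficient search methods, and the data points and grid here are made up.

```python
def mean_squared_error(slope, intercept, points):
    """Average squared gap between the line's predictions and the observed y values."""
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)

def best_fit_by_search(points, slopes, intercepts):
    """Try every (slope, intercept) pair and return the one with the lowest error."""
    candidates = ((s, b) for s in slopes for b in intercepts)
    return min(candidates, key=lambda c: mean_squared_error(c[0], c[1], points))

# Hypothetical observations that roughly follow y = 2x + 1.
points = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1)]
grid = [i / 10 for i in range(0, 51)]  # candidate values 0.0, 0.1, ..., 5.0

slope, intercept = best_fit_by_search(points, grid, grid)
```

No rule such as "multiply by two and add one" was ever programmed; the pattern was recovered from the data, which is the essence of the approach.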
10. Soft Skills for Data Science
When working in technology, soft skills can be huge differentiators when everyone on the team has the same level of knowledge. Communication, curiosity, critical thinking, storytelling, business acumen, product understanding, and being a team player, among many other soft skills, are all important for the aspiring data scientist and should not be neglected.
Ultimately, data scientists work with data and insights to improve the human world. Soft skills are a huge asset for a programming student who wants to be a manager one day, to transition to a more executive role later in life, or to become an entrepreneur once their engineering career is less dynamic. You will want to especially work on:
Empathetic leadership skills
Power of observation that leads to insight into others
Having more polished soft skills can also obviously enable you to perform better on important job interviews, in critical phases of projects and to have a solid reputation within a company. All of this greatly enhances your ability to move your career in data science forward or even work at some of the top companies in the world.
A career in data science is incredibly exciting now that AI and Big Data permeate our lives more than ever before. There are many incredible resources online to learn about data science and particular career paths for programming, machine learning, data analysis and AI.
Finally, whether you choose data science or machine learning will depend on your aptitude, interests and willingness to get postgraduate degrees. The two paths can be summarized as follows:
Skills Needed for Data Scientists
Data mining and cleaning
Unstructured data management techniques
Programming languages such as R and Python
Understand SQL databases
Use big data tools like Hadoop, Hive and Pig
Skills Needed for Machine Learning Engineers
Computer science fundamentals
Data evaluation and modeling
Understanding and application of algorithms
Natural language processing
Data architecture design
Text representation techniques
I hope this has been a helpful introductory overview, meant to stimulate students or aspiring students of programming, data science and machine learning while giving a sense of some key skills, concepts and software to become familiar with. The range of jobs in the field of data science is really quite astounding, all with slightly different salary expectations. The average salary for a data scientist in Canada (where I live) is $86,000, which is about 5 million Indian rupees (50 lakh), for example.
Share this article with someone you know that might benefit from it. Thanks for reading.
This question is raised on occasion. Salaries are not increasing as fast as they used to, though this is natural for any discipline reaching some maturity. Some job seekers claim it is not that easy anymore to find a job as a data scientist. Some employers have complained about the costs associated with a data science team, and ROI expectations not being met. And some employees, especially those with a PhD, complained that the job can be boring.
I believe there is some truth to all of this, but my opinion is more nuanced. Data scientist is too generic a term, and many times not even related to science. I myself, about 20 years ago, experienced some disillusion about my job title as a statistician. There were so many promising paths, but the statistical community, in part because of the major statistical associations and academic training back then, missed some big opportunities, focusing more and more on narrow areas such as epidemiology or census data, but failing to catch on to serious programming (besides SAS and R) and algorithms. I was back then working on digital image processing, and I saw the field of statistics missing the machine learning opportunity and operations research in particular. I eventually called myself a computational statistician: that’s what I was doing, and it was getting more and more different from what my peers were doing. I am sure by now, statistics curricula have caught up, and include more machine learning and programming.
More recently, I called myself data scientist, but today, I think it does not represent well what I do. Computational or algorithmic data scientist would be a much better description. And I think this applies to many data scientists. Some, focusing more on the data aspects, could call themselves data science engineers or data science architects. Some may find the word business data scientist more appropriate. Junior ones are probably better defined as analysts.
Some progress has been made in the last 5 years for sure. Applicants are better trained, hiring managers are more knowledgeable about the field and have clearer requirements, and applicants have a better idea as to whether an advertised position is as interesting as it sounds in the description. Indeed, many jobs are filled without even posting a job ad, by directly contacting potential candidates that the hiring manager is familiar with, even if by word-of-mouth only. While there is still no well-known, highly recognized professional association (with a large number of members) or well-known, comprehensive certification for data scientists as there is for actuaries (and I don’t think it is needed), there are clearer paths to reaching excellence in the profession, both as a company and as an employee. A physicist familiar with data could easily succeed with little on-the-job practice. There are companies open to hiring people from various backgrounds, which broadens the possibilities. And given the numerous poorly solved problems (they pop up faster than they can properly be solved), the future looks bright. Examples include counting the actual number of people once infected by Covid (requiring imputation methods), which might be twice as high as official numbers; assessing the efficiency of various Covid vaccines versus natural immunization; better detection of fake reviews / recommendations or fake news; or optimizing driving directions from Google Maps by including more criteria in the algorithm and taking into account HOV lanes, air quality, rarity of gas stations, and peak commute times (more on this in my next article about my 3,000-mile road trip using Google navigation).
Renaissance Technologies is a good example: they have been working on quantitative trading since 1982, developing black-box strategies for high frequency trading, and mastering trading cost optimization. Many times, they had no idea and did not care why their automated self-learning trading system made some obscure trades (leveraging volatile patterns undetectable by humans or unused by competitors), yet it is by far the most successful hedge fund of all time, with an annualized return of more than 66 percent (that is, per year, each year on average) for about 30 years. Yet they never hired traditional quants or data scientists, though some of their top executives came from IBM, with a background in computational linguistics. Many core employees had backgrounds in astronomy, physics, dynamical systems, and even pure number theory, but not in finance.
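To get a feel for what an annualized figure means, here is a small arithmetic sketch in Python. It assumes a constant yearly return that is fully reinvested, with no fees, withdrawals, or limits on capital, so it illustrates the compounding formula only and says nothing about the actual size or payouts of any fund.

```python
def growth_multiple(annual_return, years):
    """Factor by which 1 unit grows at a constant annual return, fully reinvested."""
    return (1 + annual_return) ** years

def annualized_return(total_multiple, years):
    """Constant yearly return that would produce the given total growth factor."""
    return total_multiple ** (1 / years) - 1

# A 66 percent return repeated for 30 years, under the assumptions above.
multiple = growth_multiple(0.66, 30)

# Going the other way recovers the yearly rate from the total growth.
recovered_rate = annualized_return(multiple, 30)
```

The multiple comes out in the millions, which shows how sensitive long-run compounding is to the yearly rate.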
Incidentally, I have used many machine learning techniques and computational data science, processing huge volumes of multivariate data (numbers like integers or real numbers) with efficient algorithms, to try to pierce some of the deepest secrets in number theory. So I can easily imagine that a math background, especially one with strong experimental / probabilistic / computational number theory, where you routinely uncover and leverage hard-to-find patterns in an ocean of seemingly very noisy data behaving worse than many messy business data sets (indeed dealing with chaotic processes), would be helpful in quantitative finance, and certainly elsewhere like fraud detection or risk management. I came to call these chaotic environments gentle or controlled chaos, because in the end, they are less chaotic than they appear to be at first glance. I am sure many people in the business world can relate to that.
The job title data scientist might not be a great title, as it means so many things to different people. Better job titles include data science engineer, algorithmic data scientist, mathematical data scientist, computational data scientist, business data scientist, or analyst, reflecting the various fields that data science covers. There are still many unsolved problems, the list growing faster than that of solved problems, so the future looks bright. Some problems, such as spam detection and maybe even automated translation, have seen considerable progress. Employers and employees have become better at matching with each other, and pay scales may not increase much more. Some tasks, such as data cleaning, may disappear in the future, replaced by robots. Even coding might be absent in some jobs, or partially automated. For instance, the Data Science Central article that you are reading now was created on a platform built in 2008 (by me, actually) without a single line of code. This will open more possibilities, as it frees a lot of time for the data scientist to focus on higher level tasks.
About the author: Vincent Granville is a data science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, eBay. Vincent is also self-publisher at DataShaping.com, and founded and co-founded a few start-ups, including one with a successful exit (Data Science Central acquired by Tech Target). You can access Vincent’s articles and books, here. A selection of the most recent ones can be found on vgranville.com.
In this post, we examine applications of deep learning to three key biomedical problems: patient classification, fundamental biological processes, and treatment of patients. The objective is to predict whether deep learning will transform these tasks.
The paper sets a high bar, along the lines of Andy Grove’s “inflection point”: a change in technologies or environment that requires a business to be fundamentally reshaped.
The three classes of applications are described as follows:
Disease and patient categorization: the accurate classification of diseases and disease subtypes. In oncology, current “gold standard” approaches include histology, which requires interpretation by experts, or assessment of molecular markers such as cell surface receptors or gene expression.
Fundamental biological study: application of deep learning to fundamental biological questions using methods based on leveraging large amounts.
Treatment of patients: new methods to recommend patient treatments, predict treatment outcomes, and guide the development of new therapies.
Within these, areas where deep learning plays a part for biology and medicine are
Deep learning and patient categorization
Imaging applications in healthcare
Electronic health records
Challenges and opportunities in patient categorization
Deep learning to study the fundamental biological processes underlying human disease
Transcription factors and RNA-binding proteins
Promoters, enhancers, and related epigenomic tasks
Protein secondary and tertiary structure
Sequencing and variant calling
The impact of deep learning in treating disease and developing new treatments
Clinical decision making
There are a number of areas that impact deep learning in biology and medicine
Evaluation metrics for imbalanced classification
Formulation of classification labels
Formulation of a performance upper bound
Interpretation and explainable results
Hardware limitations and scaling
Data, code, and model sharing
Multimodal, multi-task, and transfer learning
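The first item in this list, evaluation metrics for imbalanced classification, is easy to illustrate. The sketch below uses made-up labels in which only 5 of 100 patients have the condition, and a useless classifier that always predicts "negative". The function name and data are hypothetical; the metric definitions are the standard ones.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical imbalanced data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100

baseline = classification_metrics(y_true, always_negative)
```

The baseline scores 95% accuracy while finding none of the positive cases (recall of zero), which is why accuracy alone can be misleading on imbalanced biomedical data.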
I found two particularly interesting aspects: interpretability and data limitations. As per the paper:
deep learning lags behind most Bayesian models in terms of interpretability but the interpretability of deep learning is comparable to other widely-used machine learning methods such as random forests or SVMs.
A lack of large-scale, high-quality, correctly labeled training data has impacted deep learning in nearly all applications discussed, from healthcare to genomics to drug discovery.
The challenges of training complex, high-parameter neural networks from few examples are obvious, but uncertainty in the labels of those examples can be just as problematic.
For some types of data, especially images, it is straightforward to augment training datasets by splitting a single labeled example into multiple
Simulated or semi-synthetic training data has been employed in multiple biomedical domains, though many of these ideas are not specific to deep
Data can be simulated to create negative examples when only positive training instances are available.
Multimodal, multi-task, and transfer learning, can also combat data limitations to some
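One simple way to picture augmentation for images is to cut each labeled image into smaller patches that all inherit the original label. The sketch below does this for images stored as plain nested lists. It is only an illustration under that assumption (the patch size, function names, and tiny example image are made up), and whether a patch can safely keep the parent label depends on the task.

```python
def split_into_patches(image, patch_size):
    """Cut a 2D list of pixel values into non-overlapping square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            rows = image[top:top + patch_size]
            patches.append([row[left:left + patch_size] for row in rows])
    return patches

def augment(dataset, patch_size):
    """Turn each (image, label) pair into several (patch, label) pairs."""
    return [
        (patch, label)
        for image, label in dataset
        for patch in split_into_patches(image, patch_size)
    ]

# Hypothetical 4x4 "image" with a single label.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
augmented = augment([(image, "positive")], patch_size=2)
```

Here one labeled example becomes four, at the cost of each patch showing only part of the original picture.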
The authors conclude that deep learning has yet to revolutionize or definitively resolve any of these problems, but that even when improvement over a previous baseline has been modest, there are signs that deep learning methods may speed or aid human investigation.
Ugh. “Data Monetization” … a term that seems to confuse so many folks (probably thanks to me). When most folks hear the phrase “data monetization”, they immediately think of “selling” their data. And while there are some situations in which some organizations can successfully sell their data, there are actually more powerful, more common, and less risky ways for ANY organization to monetize – or derive business / economics value – from their data.
I’ve thought of 4 ways that organizations could monetize their data, and there are probably more. Let’s review them.
There are organizations whose business model is based on selling third-party data. Nielsen, Acxiom, Experian, Equifax and CoreLogic are companies whose business is the acquisition, aggregation, packaging, marketing, and selling of third-party data. For example, Figure 1 shows the personal data that one can buy from Acxiom.
Selling data requires dedicated technical and business organizations to acquire, cleanse, align, package, market, sell, support, and manage the third-party data for external consumption. And there is a myriad of growing legal, privacy and ethical concerns to navigate, so a sizable legal team is also advised.
Some organizations can monetize their data by creating data services that facilitate the exchange of their data for something of value from other organizations. Walmart’s Retail Link® is an example of this sort of “data monetization through exchange.”
Walmart’s Retail Link® exchanges (for a price) Walmart’s point-of-sales (POS) data with its Consumer Packaged Goods (CPG) manufacturing partners such as Procter & Gamble, PepsiCo, and Unilever. Retail Link provides the CPG manufacturers access to that manufacturer’s specific product sell-through data by SKU, by hour, by store as well as inventory on-hand, gross margin achieved, inventory turns, in-stock percentages, and Gross Margin Return on Inventory Investment (Figure 2).
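For readers unfamiliar with the retail metrics named above, here is a minimal Python sketch using commonly used textbook definitions. How Retail Link itself calculates these figures is not described here and may differ, and the numbers in the example are invented.

```python
def gross_margin(sales, cost_of_goods_sold):
    """Gross margin in currency units: sales minus the cost of the goods sold."""
    return sales - cost_of_goods_sold

def inventory_turns(cost_of_goods_sold, average_inventory_cost):
    """How many times the average inventory was sold through in the period."""
    return cost_of_goods_sold / average_inventory_cost

def gmroi(sales, cost_of_goods_sold, average_inventory_cost):
    """Gross Margin Return on Inventory Investment: margin earned per unit of inventory cost."""
    return gross_margin(sales, cost_of_goods_sold) / average_inventory_cost

def in_stock_percentage(store_days_in_stock, store_days_total):
    """Share of store-days on which the item was available on the shelf."""
    return 100.0 * store_days_in_stock / store_days_total

# Hypothetical figures for one SKU over one period.
sku_gmroi = gmroi(sales=1000.0, cost_of_goods_sold=600.0, average_inventory_cost=200.0)
sku_turns = inventory_turns(cost_of_goods_sold=600.0, average_inventory_cost=200.0)
```

A GMROI of 2.0 in this made-up case would mean every unit of currency tied up in inventory returned two units of gross margin.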
Unfortunately, not all organizations have the clout and financial and technology resources of a Walmart to dictate this sort of relationship. Plus, Walmart invests a significant amount of time, money, and people resources to develop, support, and upgrade Retail Link. In that aspect, Walmart looks and behaves like an enterprise software vendor.
But for organizations that lack the clout, finances, and technology expertise of a Walmart, there are other more profitable, less risky “monetization” options.
Probably the most common way for organizations to monetize or derive value from their data is in the application of their data to optimize the organization’s most important business and operational use cases. And the funny thing here is that it isn’t really the data that one uses to monetize an organization’s internal use cases; it’s actually the customer, product, and operational insights that are used to optimize these use cases.
Insights Monetization is about leveraging the customer, product, and operational insights (predicted behavioral and performance propensities) buried in your data sources to optimize and/or reengineer key business and operational processes, mitigate (compliance, regulatory, and business) risks, create new revenue opportunities (such as new products, services, audiences, channels, markets, partnerships, consumption models, etc.), and construct a more compelling, differentiated customer experience (Figure 3).
Figure 3: Data Monetization through Internal Use Case Optimization
To apply “Insights” to drive internal Use Case Optimization requires some key concepts:
(1) Nanoeconomics. Nanoeconomics is the economics of individualized human and/or device predicted behavioral or performance propensities. Nanoeconomics helps organizations transition from overly generalized decisions based upon averages to precision decisions based upon the predicted propensities, patterns, and trends of individual humans or devices.
(2) Analytic Profiles provide an asset model for capturing and codifying the organization’s customer, product, and operational analytic insights in a way that facilitates the sharing and refinement of those analytic insights across multiple use cases. An Analytic Profile captures metrics, predictive indicators, segments, analytic scores, and business rules that codify the behaviors, preferences, propensities, inclinations, tendencies, interests, associations, and affiliations for the organization’s key business entities such as customers, patients, students, athletes, jet engines, cars, locomotives, CAT scanners, and wind turbines (Figure 4).
(3) Use Cases are made up of Decisions clustered around a common Key Performance Indicator (KPI), where a Decision is a conclusion or resolution reached after analysis that leads to an informed action. Sample use cases include reducing customer attrition, improving operational uptime, and optimizing asset utilization. Analytic Profiles are used to optimize the organization’s top priority use cases.
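The three concepts above can be sketched as a tiny data structure in Python. Everything here is hypothetical: the field names, the "attrition_propensity" score, the threshold, and the three customers are invented for illustration and are not a prescribed schema for Analytic Profiles.

```python
from dataclasses import dataclass, field

@dataclass
class AnalyticProfile:
    """Hypothetical container for the analytic insights about one business entity."""
    entity_id: str
    entity_type: str
    scores: dict = field(default_factory=dict)    # predicted propensities in [0, 1]
    segments: list = field(default_factory=list)  # e.g. ["frequent_buyer"]

def retention_decision(attrition_propensity, threshold=0.7):
    """One Decision in a 'reduce customer attrition' use case."""
    if attrition_propensity >= threshold:
        return "offer_retention_incentive"
    return "no_action"

profiles = [
    AnalyticProfile("cust-A", "customer", {"attrition_propensity": 0.9}),
    AnalyticProfile("cust-B", "customer", {"attrition_propensity": 0.2}),
    AnalyticProfile("cust-C", "customer", {"attrition_propensity": 0.1}),
]

# Decision based on the average: one answer applied to everyone.
average = sum(p.scores["attrition_propensity"] for p in profiles) / len(profiles)
average_based = retention_decision(average)

# Nanoeconomics: a separate decision per individual propensity.
individual = {
    p.entity_id: retention_decision(p.scores["attrition_propensity"]) for p in profiles
}
```

The average suggests doing nothing, while the individual view singles out the one customer who is actually at risk, which is the shift from averages to individual propensities described in point (1).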
Finally, some organizations are fortunate to have a broad overview of their market. They know what products or services are hot, which ones are in decline, who is buying and not buying those products or services, and what sorts of marketing and actions work best for driving engagement. For those organizations, there is a fourth way to monetize their data: by packaging and selling “decisions” in the form of Data Products to their customers, partners, and suppliers (Figure 5).
Figure 5: Data Monetization thru Selling “Decisions” via Data Products
Instead of just selling or exchanging data with your partners and suppliers, these organizations leverage their broader market perspective to build data products that help their customers, partners, and suppliers optimize their key business and operational decisions in areas such as:
New Product Introductions
To sell Data Products requires an intimate understanding of your partners’ and suppliers’ business models and the key decisions that they are trying to make.
For example, a large digital media company has enough customer, product, and operational insights across its ad network to help their customers and business partners (ad agencies) make better decisions in the areas of ad placement, dayparting, audience targeting and retargeting, and keyword bidding. The digital media company could build a data product that delivers operational recommendations that optimize their customers’ and business partners’ digital marketing spend (Figure 6).
Figure 6: Packaging and Selling “Decisions”
Any organization that has a broad view of a market (think OpenTable and GrubHub for restaurants, Fandango for movies, Travelocity or Orbitz for travel and entertainment) could build such a data product for their customers, industry partners, and suppliers.
Don’t miss the boat on Data Monetization. Focusing just on trying to sell your data is not practical for the vast majority of companies whose business model is not in the acquisition, aggregation, selling, and supporting of third-party data sources. And creating “data exchanges” really only works if your organization has enough industry market share and clout to dictate the terms and conditions of these sorts of relationships.
However, any organization can monetize their customer, product, and operational insights. The easiest is in the application of these insights to optimize the organization’s internal use cases. And organizations can go one step further and build data products that package these insights into recommendations that support their partners’ and suppliers’ most important business decisions (Figure 7).
Figure 7: 4 Types of Data Monetization
Best of luck on your data monetization journey!
Third-party data is any data that is collected by an entity that does not have a direct relationship with the user whose data is being collected
 Trend Results is in no way associated with or endorsed by Wal-Mart Stores, Inc. All references to Wal-Mart Stores, Inc. trademarks and brands are used in strict accordance with the Fair Use Doctrine and are not intended to imply any affiliation of Trend Results with Wal-Mart Stores, Inc. Retail Link is a registered trademark of Wal-Mart Stores, Inc.
“The time will come when no human investment manager will be able to beat the computer,” David Siegel (co-founder, Two Sigma)
Generally, hedge funds invest for the long term rather than day trade. In that context, investing requires developing a macro or microeconomic thesis, understanding the market, using that understanding to form a view, building a position, and then holding and managing that position for a while, from a few days to, often, months.
A quantitative analyst explores hundreds of different pieces of information to identify and measure attractive long-term or short-term positions in the market. In most cases, an algorithm can process more information than human analysts and keep track of that information in its database.
Likewise, there are some significant advantages the trading robots offer for quantitative hedge funds. They are:
Unbiased Trading Ability Based on Symmetrical Analysis:
Hedge funds fundamentally analyze economic and financial data to evaluate possible trades. The entire investment method is a research-driven process based on a consistent, symmetrical application of strategies.
Trading robots are unemotional and impartial about information, and they perform all trades based on substantial and symmetrical data analysis. Therefore, nothing can stop trading robots from sticking to their algorithms.
The most challenging task for a regular trader is to stay on the course of discipline and maintain the strategy without getting distracted. However, a trading robot successfully remains on the trail of discipline and strategy.
Here the hedge fund manager plays an indispensable role because the robots may burn the fund if the market responds differently than the program. In such cases, the manager makes the call and trades manually or upgrades the instructions to the robots considering the situation. Therefore, an experienced hedge fund manager with the assistance of effective trading robots can achieve a significant return.
One of the most significant advantages of automated trading for hedge funds is backtesting, a method of running the robot on historical data. This gives you a performance graph of the robot across multiple market scenarios from different times.
In general, while backtesting, hedge fund managers run the trading robot through previous market uptrends, downtrends, and sideways trends. The outcome indicates how well the robot would manage trades in comparable environments.
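To show what backtesting looks like in its simplest form, here is a short Python sketch. It replays a made-up price series through a basic moving-average crossover rule and records the account value each day. The strategy, the window lengths, the starting cash, and the assumptions (trades execute at the same day's price, no fees or slippage) are all illustrative and are not how any particular platform implements it.

```python
def moving_average(prices, window, end):
    """Average of the `window` prices ending at index `end` (inclusive)."""
    return sum(prices[end - window + 1:end + 1]) / window

def backtest(prices, fast=3, slow=5, starting_cash=1000.0):
    """Replay historical prices and return the account value after each day."""
    cash, units = starting_cash, 0.0
    equity_curve = []
    for day, price in enumerate(prices):
        if day >= slow - 1:  # wait until the slow average has enough history
            fast_ma = moving_average(prices, fast, day)
            slow_ma = moving_average(prices, slow, day)
            if fast_ma > slow_ma and units == 0.0:
                units, cash = cash / price, 0.0   # buy with all available cash
            elif fast_ma < slow_ma and units > 0.0:
                cash, units = units * price, 0.0  # sell the whole position
        equity_curve.append(cash + units * price)
    return equity_curve

# Hypothetical history: a steady uptrend, then a flat market for comparison.
uptrend = [float(p) for p in range(1, 11)]
sideways = [5.0] * 10

uptrend_curve = backtest(uptrend)
sideways_curve = backtest(sideways)
```

Running the same rule over different historical periods (uptrends, downtrends, sideways markets) gives the performance graph described above, though good past results are no guarantee of future ones.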
Availability of Efficient Automated Trading Solutions
Nowadays, many trading platforms offer complete automated solutions for hedge funds. But among them, the MetaTrader platform provides the most sophisticated and efficient automated trading capabilities to operate large amounts of funds tirelessly. And the growing number of similar automated solutions is making automated hedge fund management more accessible than ever.
The MQL Market provides a collection of more than 13,000 pre-made trading robots and solutions for legitimate automated trading! Therefore, if you are looking for a trading-related solution, trading robot, or anything related to automated trading, the MQL community is there to assist you.
If you are using MetaTrader 5 for hedge funds, you can choose from thousands of different trading robots for a very efficient trading solution.
Characteristics of The Most Profitable Trading Robot
Numerous trading robots have helped users earn a lot of money through automated trading. Also, some robots promise profit and a proficient gain in the market, although they end up burning out all the cash. Likewise, there are thousands of similar disappointments where the trader lost the investment depending on either fraudulent or unprofitable robots.
Running a trading robot can be tricky as its algorithm determines the most reliable results from simultaneous events, and concurrently all trading robots claim to be better than the others. Therefore, the traders usually use specific indicators to find the most profitable trading robot for them.
There are no set standards yet for determining the most profitable robots. However, several functionalities determine the performance of a trading robot. Here are a few characteristics an ideal trading robot should have:
Efficient Money Management: A trading robot remains active 24/7 while the trader is not always available, which leaves the robot to manage money. One of the most vital responsibilities of a trading robot is to take profits and stop losses. Therefore, taking profit while gaining and protecting the account balance with a hard stop loss while losing becomes a full-time duty for the automated system.
AI & Machine Learning: An expert robot should be capable of identifying varied market situations using artificial intelligence. As a result, the robot would apply different patterns in trading to adapt to sudden changes in the trading market. Whether it’s the reality or the trading market, revolution is the key to survival.
Sustainable Profit: A proficient trading robot will provide you a regular, steady flow of profit rather than uneven growth. Usually, a prominent trading robot should generate an 8-12% monthly gain. Some robots may gain 20% or 25%, but the steadier, the better.
Fewer Drawdowns: When it comes to drawdowns, anything less than 20% is excellent for an ideal trading robot. A robust strategy and risk management can keep the drawdown level between 2% and 20%.
Bug-free Lifespan: Some very efficient trading robots often start acting differently after six months to a year. That happens because of vulnerability to bugs. Therefore, a proficient trading robot should have strong protection against bugs. Otherwise, they could take you down or even become scammers!
User friendly: A functional trading robot should be user-friendly, considering the significant number of beginner automated traders.
User Reviews: The better a trading robot, the better users speak about it. In the MQL community, there are tens of thousands of pre-made trading robots available with plenty of user comments regarding their performance and efficiency. So choose the one that comes with more positive comments.
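Two of the characteristics above, drawdown and take-profit / stop-loss handling, are easy to express in code. The Python sketch below is a minimal illustration for a long position; the function names, thresholds, and sample account values are made up.

```python
def max_drawdown(equity_curve):
    """Largest peak-to-trough fall in account value, as a fraction of the peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

def exit_signal(entry_price, current_price, stop_loss_pct, take_profit_pct):
    """Decide whether a long position should be closed (percentages as fractions)."""
    change = (current_price - entry_price) / entry_price
    if change <= -stop_loss_pct:
        return "stop_loss"
    if change >= take_profit_pct:
        return "take_profit"
    return "hold"

# Hypothetical account history: the value peaks at 120, then falls to 90.
history = [100.0, 120.0, 90.0, 130.0]
drawdown = max_drawdown(history)  # (120 - 90) / 120

# Hypothetical position entered at 100 with a 5% stop loss and 10% take profit.
signals = [exit_signal(100.0, price, 0.05, 0.10) for price in (94.0, 102.0, 111.0)]
```

In this made-up history the drawdown is 25%, above the 20% level the article treats as excellent, even though the account ends higher than it started.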
This article can help companies step into the Hadoop world or move an existing Hadoop strategy into profitability or production status.
Though they may lack functionality to which we have become accustomed, scale-out file systems that can handle modern levels of complex data are here to stay. Hadoop is the epitome of the scale-out file system. Although it has been pivoted a few times, its simple file system (HDFS) persists, and an extensive ecosystem has built up around it.
While there used to be little overlap between Hadoop and a relational database (RDBMS) as the choice of platform for a given workload, that has changed. Hadoop has withstood the test of time and has grown to the extent that quite a few applications originally platformed on RDBMS will be migrated to Hadoop.
Cost savings combined with the ability to execute the complete application at scale are strong motivators for adopting Hadoop. Inside of some organizations, the conversion to Hadoop will be like a levee breaking, with Hadoop quickly gaining internal market share. Hadoop is not just for big data anymore.
With unprecedented global contribution and interest, Spark is moving quickly to become the method of choice for data access in HDFS (as well as other storage formats). Users have demanded improved performance and Spark delivers. While the node specification is in the hands of the users, in many cases Spark provides an ideal balance between cost and performance. This clearly makes Hadoop much more than cold storage and opens it up to a multitude of processing possibilities.
Hadoop has evolved since the early days when the technology was invented to make batch-processing big data affordable and scalable. Today, with a lively community of open-source contributors and vendors innovating a plethora of tools that natively support Hadoop components, usage and data are expanding. Loading Hadoop clusters will continue to be a top job at companies far and wide.
Data leadership is a solid business strategy today and the Hadoop ecosystem is at the center of the technical response. This report will address considerations in adopting Hadoop, classify the Hadoop ecosystem vendors across the top vectors, and provide selection criteria for the enormous number of companies that have made strides towards adopting Hadoop, yet have trepidation in making the final leap.
This article cuts out all the non-value-added noise about Hadoop and presents a minimum viable product (MVP) for building a Hadoop cluster for the enterprise that is both cost-effective and scalable. This approach gets the Hadoop cluster up and running fast and will ensure that it is scalable to the enterprise’s needs. This approach encapsulates broad enterprise knowledge and foresight born of numerous Hadoop lifecycles through production and iterations.
Data Management Today
Due to increasing data volume and data’s high utility, there has been an explosion of capabilities brought into use in the enterprise in the past few years. While stalwarts of our information, like the relational row-based enterprise data warehouse (EDW), remain highly supported, it is widely acknowledged that no single solution will satisfy all enterprise data management needs.
Though the cost of storage remains at its historic low, costs for keeping “all data for all time” in an EDW are still financially material to the enterprise due to the high volume of data. This is driving some systems heterogeneity as well.
The key to making the correct data storage selection is an understanding of workloads – current, projected and envisioned. This section will explore the major categories of information stores available in the market, help you make the best choices based on the workloads, and set up the context for the Hadoop discussion.
Relational database theory is based on the table: a collection of rows for a consistent set of columns. The rest of the relational database is in support of this basic structure. Row orientation describes the physical layout of the table as a series of rows, each comprising a series of values that form the columns, which are stored in the same order for each row.
By far, most data warehouses are stored in a relational row-oriented (storage of consecutive rows, with a value for every column) database. The data warehouse has been the center of the post-operational systems universe for some time as it is the collection point for all data interesting to the post-operational world. Reports, dashboards, analytics, ad-hoc access and more are either directly supported by or served from the data warehouse. Furthermore, the data warehouse is not simply a copy of operational data; frequently, the data goes through transformation and data cleansing before landing in the data warehouse.
Over time, the data warehouse will increasingly support buffering of data through solid-state components for high-use data and other means, reuse of previously queried results, and other optimizer plans.
Multidimensional databases (MDBs), or cubes, are specialized structures that support access by the data’s dimensions. The information store associated with multidimensional access is often overshadowed by robust data access capabilities. However, it is the multidimensional database itself (not the access) that is the source of overhead for the organization.
If a query is paired well with the MDB (i.e., the query asks for most columns of the MDB), the MDB will outperform the relational database. Sometimes this level of response is the business requirement. However, that pairing is usually short-lived as query patterns evolve. There are more elegant approaches to meeting performance requirements today.
In columnar databases, each physical structure contains all the values of one or a subset of columns of one table. This isolates columns, making the column the unit of I/O and bringing only the useful columns into a query cycle. This is a way around the all-too-common I/O bottleneck that analytical systems face today. Columnar databases also excel at avoiding the I/O bottleneck through compression.
The columnar information store has a clear ideal workload: when the queries require a small subset (of the field length, not necessarily the number of columns) of the entire row. Columnar databases show their distinction best with large row lengths and large data sets. Single-row retrievals in the columnar database will underperform those of the row-wise database, and since loading is to multiple structures, loading will take longer in a columnar database.
For a columnar database to make sense, the performance gain on that workload must be valuable enough to justify it. Interestingly, upon further analysis, many enterprises, including most data warehouses, have substantial workloads that would perform better in a columnar database.
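The difference between the row-oriented and columnar layouts described above can be sketched in a few lines of Python. This is an illustration only, using a hypothetical three-column table; real engines add compression, indexing, and paging on top of the basic idea.

```python
# Hypothetical table: (order_id, customer, amount).
rows = [
    (1, "acme", 250.0),
    (2, "globex", 75.5),
    (3, "acme", 120.0),
]

# Row-oriented layout: each record's values are stored together.
row_store = list(rows)

# Columnar layout: each column's values are stored together.
column_names = ("order_id", "customer", "amount")
column_store = {name: [r[i] for r in rows] for i, name in enumerate(column_names)}

# An analytic query that touches one column reads only that structure.
total_amount = sum(column_store["amount"])


def fetch_row(store, position):
    """A single-row lookup must visit every column structure to rebuild the record."""
    return tuple(store[name][position] for name in column_names)


print(total_amount)
print(fetch_row(column_store, 1))
```

The aggregate reads one list, while the single-row fetch touches all three, which mirrors why columnar stores favor analytic scans over single-row retrievals.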
Storing a whole operational or analytic database in RAM as the primary persistence layer is possible. With an increasing number of cores (multi-core CPUs) becoming standard, CPUs are able to process increased data volumes in parallel. Main memory is no longer a limited resource. These systems recognize this and fully exploit main memory. Caches and layers are eliminated because the entire physical database is sitting on the motherboard and is therefore in memory all the time. I/Os are eliminated. And this has been shown to be nearly linearly scalable.
To achieve best performance, the DBMS must be engineered for in-memory data. Simply putting a traditional database in RAM has been shown to dramatically underperform an in-memory database system, especially in the area of writes. Memory is becoming the “new disk.” For cost of business (cost per megabyte retrieved per time measure), there is no comparison to other forms of data storage. The ability to achieve orders of magnitude improvement in transactional speed or value-added quality is a requirement for systems scaling to meet future demand. Hard disk drive (HDD) may eventually find its rightful spot as archive and backup storage. For now, small to midsize data workloads belong in memory when very high performance is required.
Data streams already exist in operational systems. From an architecture perspective, the fast data “data stream” has a very high rate of data flow and contains business value if queried in-stream. That is the value that must be captured today to pursue a data leadership strategy.
Identifying the workload for data stream processing is different than for any other information store described in this paper. Data stream processing is limited by the capabilities of the technology. The question is whether accessing the stream – or waiting until the stream hits a different information store, like a data warehouse – is more valuable. Quite often, the data flow volume is too high to store the data in a database and ever get any value out of it.
Fast data that will serve as an information store is most suitable when analysis on the data must occur immediately, without human intervention. The return on investment is quite high for those cases where companies treat fast data as an information store.
Cross-referencing the “last ten transactions” or the transactions “in the last five minutes” for fraud or immediate offer can pay huge dividends. If the stream data can be analysed while it’s still a stream, in-line, with light requirements for integration with other data, stream data analysis can be effectively added.
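As a rough illustration of the "last ten transactions" and "last five minutes" checks mentioned above, the following Python sketch keeps a sliding window over a stream. The class name, the combination of both limits in one window, and the spending threshold are all assumptions made for the example, not a description of any particular streaming product.

```python
from collections import deque


class RecentTransactions:
    """Keep at most `max_count` transactions from the last `window_seconds`."""

    def __init__(self, max_count=10, window_seconds=300):
        self.max_count = max_count
        self.window_seconds = window_seconds
        self.items = deque()

    def add(self, timestamp, amount):
        self.items.append((timestamp, amount))
        # Drop entries that have fallen outside the time window.
        while self.items and timestamp - self.items[0][0] > self.window_seconds:
            self.items.popleft()
        # Keep no more than max_count entries.
        while len(self.items) > self.max_count:
            self.items.popleft()

    def total(self):
        return sum(amount for _, amount in self.items)


def looks_suspicious(window, limit=1000.0):
    """Hypothetical rule: flag when recent spending exceeds `limit`."""
    return window.total() > limit


window = RecentTransactions()
flags = []
for ts, amount in [(0, 200.0), (60, 300.0), (400, 450.0), (420, 600.0)]:
    window.add(ts, amount)
    flags.append(looks_suspicious(window))
print(flags)
```

Each event is evaluated as it arrives, in-line, without first landing in a database, which is the kind of light-integration analysis the text has in mind.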
This all leads us to Hadoop. The next section will describe how Hadoop impacts and works with (and without) these main categories of information stores.
Hadoop Use Patterns
Hadoop can be a specialized, analytical store for a single application, receiving data from operational systems that originate the data. The data can be unstructured data, like sensor data, clickstream data, system log data, smart grid data, electronic medical records, binary files, geolocation data or social data. Hadoop is a clear winner for unstructured batch data, which almost always tends to be high-volume data as compared to other enterprise data stores, with access needs fully met by the Hadoop ecosystem today.
Hadoop can also store structured data as ‘data mart’ replacement technology. This use is more subjective and requires more careful consideration of the capabilities of the Hadoop infrastructure as it relates to performance, provisioning, functionality and cost. This pattern usually requires a proof of concept.
Scaling is not a question for Hadoop.
Hadoop can also serve as a data lake. A data lake is a Hadoop cluster collecting point for data scientists and others who require far less refinement to data presentation than an analyst or knowledge worker. A lake can collect data from many sources. Data can flow on to a data warehouse from the lake, at which point some refinement and cleansing of the data may be necessary.
Hadoop can also simply perform many of the data integration functions for the data warehouse with or without having any access allowed at the Hadoop cluster.
Finally, Hadoop can be an archive, collecting data off the data warehouse that is less useful due to age or other factors. Data in Hadoop remains very accessible. However, this option will create the potential for query access to multiple technical platforms, should the archive data be needed. Data virtualization and active-to-transactional data movement are useful in this and other scenarios, and are part of modern data architecture with Hadoop.
A successful Hadoop MVP means selecting a good-fit use pattern for Hadoop.
Hadoop Ecosystem Evolution
Hadoop technology was developed in 2006 to meet the data needs of elite Silicon Valley companies which had far surpassed the budget and capacity for any RDBMS then available. The scale required was webscale, or indeterminate, large scale.
Eventually, the code for Hadoop (written in Java) was placed into open source, where it remains today.
Hadoop historically referred to a couple of open source products, Hadoop Distributed File System (HDFS) (a derivative of the Google File System) and MapReduce, although the Hadoop family of products continues to grow. HDFS and MapReduce were co-designed, developed and deployed to work together.
Upon adding the node, HDFS may rebalance the nodes by redistributing data to that node.
Sharding can be utilized to spread the data set to nodes across data centers, potentially all across the world, if required.
A rack is a collection of nodes, usually dozens, that are physically stored close together and are connected to a network switch. A Hadoop cluster is a collection of racks. This could be up to thousands of machines.
Hadoop data is not considered sequenced and is stored in 64 MB (usual), 128 MB or 256 MB blocks (although records can span blocks). Each block is replicated a number of times (three is the default) to ensure redundancy, instead of relying on RAID or mirroring. Each block is stored as a separate file in the local file system (e.g. NTFS). Hadoop programmers have no control over how HDFS works and where it chooses to place the files. The nodes that contain data, which is well over 99% of them, are called datanodes.
Where the replicas are placed is entirely up to the NameNode. The objectives are load balancing, fast access and fault tolerance. Assuming three is the number of replicas, the first copy is written to the node creating the file. The second is written to a separate node within the same rack. This minimizes cross-network traffic. The third copy is written to a node in a different rack to support the possibility of switch failure. Nodes are fully functional computers so they handle these writes to their local disk.
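The block and replica arithmetic above can be sketched as follows. This is a simplified Python illustration of the placement order as described in the text, not HDFS's actual implementation; the function names, node names, and rack names are hypothetical.

```python
import math


def block_count(file_size_mb, block_size_mb=64):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replicas=3):
    """Disk consumed across the cluster once every block is replicated."""
    return file_size_mb * replicas


def place_replicas(writer, cluster):
    """Pick three nodes in the order described in the text.

    `cluster` maps a rack name to its list of node names; `writer` is the
    node creating the file. Simplified: always takes the first eligible node.
    """
    writer_rack = next(rack for rack, nodes in cluster.items() if writer in nodes)
    same_rack = next(n for n in cluster[writer_rack] if n != writer)
    other_rack = next(
        nodes[0] for rack, nodes in cluster.items() if rack != writer_rack and nodes
    )
    return [writer, same_rack, other_rack]


cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(block_count(1000))        # blocks for a 1000 MB file at 64 MB per block
print(raw_storage_mb(1000))     # total MB written with three replicas
print(place_replicas("n1", cluster))
```

A real NameNode also weighs load and free space when choosing nodes; the sketch only shows the rack-awareness idea behind the three copies.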
Here are some other components worth having:
Hive – SQL-like access layer to Hadoop
Presto – Interactive querying of Hadoop and other platforms
Pig – Translator to MapReduce
HBase – Turns Hadoop into a NoSQL database for interactive query
ODBC – Access to popular access tools like Tableau, Birst, Qlik, Pentaho, Alteryx
MapReduce was developed as a tool for high-level analysts, programmers and data scientists. Not only is it difficult to use, but its disk-centric nature is also irritatingly slow, given the recent steep decline in the cost of memory. Enter Spark.
Spark allows the subsequent steps of a query to be executed in memory. While it is still necessary to specify the nodes, Spark will utilize memory for processing, yielding substantial performance gains over a MapReduce approach. Spark has proven to be the best tradeoff for most HDFS processing.
Hadoop in the Cloud
Running your Hadoop cluster in the Cloud is part of the MVP approach. It is justifiable for some of the same reasons as running any other component of your enterprise information ecosystem in the Cloud. At the least, the cloud should be considered an extension of the data center, if not the eventual center of gravity for an enterprise data center.
Reasons for choosing the Cloud for Hadoop include, but are not limited to, the following:
Firing up large scale resources quickly. With Cloud providers like Amazon Web Services (AWS) you can launch a Hadoop cluster in the Cloud in half an hour or less. Hadoop cluster nodes can be allocated as Cloud instances very quickly. For example, in a recent benchmark, our firm was able to launch instances and install a three-node Hadoop cluster with basic components like HDFS, Hive, Pig, Zookeeper, and several others in less than 20 minutes, starting with launching an AWS EC2 instance through loading our first file into HDFS.
Dealing with highly variable resource requirements. If you are new to Hadoop, your use case is likely small at first, with the intent to scale it as data volumes and use case complexities increase. The Cloud will enable you to stand up a proof-of-concept that easily scales to an enterprise-wide solution without procuring in-house hardware.
Simplifying operations, administration, and cost management. Hadoop in the Cloud also greatly simplifies daily operations and administration (such as configuration and user job management) and cost management (such as billing, budgeting, and measuring ROI). Cloud providers like AWS bill monthly and only for the resources, storage, and other services your organization uses. This makes the cost of your Hadoop solution highly predictable and scalable as the business value of the solution increases.
Making the decision to take Hadoop to the Cloud is a process involving business and technology stakeholders. The process should answer questions like the following:
Will the Cloud provide ease of data access to developers and analysts?
Do the Cloud and the Hadoop distribution we choose comply with our organization’s information security policies?
How will Hadoop in the Cloud interweave with our enterprise’s current architecture?
Does our company have an actionable big data use case that could be enabled by a quick Cloud deployment that can make a big impact?
Getting Hadoop in the Cloud will require your organization to overcome some obstacles, particularly if this is your first entrée into the Cloud. Whatever your big data needs and uses of information are, it is imperative to consider the value propositions of Hadoop and the Cloud.
Hadoop Data Integration
Modern data integration tools were built in a world abounding with structured data, relational databases, and data warehouses. The big data and Hadoop paradigm shift has changed and disrupted some of the ways we derive business value from data. Unfortunately, the data integration tool landscape has lagged behind in this shift. Early adopters of big data for their enterprise architecture have only recently found some variety and choices in data integration tools and capabilities to accompany their increased data storage capabilities.
Even while reaching out to grasp all these exciting capabilities, companies still have their feet firmly planted in the old paradigm of relational, structured, OLTP systems that run their day-in-day-out business. That world is and will be around for a long time. The key then is to marry capabilities and bring these two worlds together. Data integration is that key: it brings together the transactional and master data from traditional SQL-based, relational databases and the big data from a vast array and variety of sources.
Many data integration vendors have recognized this key and have stepped up to the plate by introducing big data and Hadoop capabilities to their toolsets. The idea is to give data integration specialists the ability to harness these tools just like they would the traditional sources and transformations they are used to.
With many vendors throwing their hat in the big data arena, it will be increasingly challenging to identify and select the right/best tool. The key differentiators to watch will be the depth by which a tool leverages Hadoop and the performance of the integration jobs. As volumes of data to be integrated expand, so too will the processing times of integration jobs. This could spell the difference between a “just-in-time” answer to a business question and a “too-little-too-late” result.
There are incomparable advantages to leveraging Spark directly through the chosen data integration tool, as opposed to through another medium (e.g., Hive), which is futile due to lack of support by even enterprise distributions of Hadoop.
Traditionally, data preparation has consumed an estimated 80% of analytic development efforts. One of the most common uses of Hadoop is to drive this analytic overhead down. Data preparation can be accomplished through a traditional ETL process: extracting data from sources, transforming it (cleansing, normalizing, integrating) to meet requirements of the data warehouse or downstream repositories and apps, and loading it into those destinations. However, as in the relational database world, many organizations prefer ELT processes, where higher performance is achieved by performing transformations after loading. Instead of burdening the data warehouse with this processing, however, Hadoop handles the transformations. This yields high-performance, fault-tolerant, elastic processing without detracting from query speeds.
In Hadoop environments, you also need massive processing power because transformations often involve integrating very different types of data from a multitude of sources. The analyses might encompass data from ERP and CRM systems, in-memory analytic environments, and internal and external apps via APIs. You might want to blend and distill data from customer master files with clickstream data stored in clouds and social media data from your own NoSQL databases or accessed from third-party aggregation services.
Due to increasing, not decreasing, levels of safe harbor privacy restrictions, many multi-national companies will find Hadoop deployments becoming more distributed. As a result, we can expect a need to keep a level of data synchronized across the cluster.
Query patterns will eventually necessitate the use of data virtualization in addition to data integration. The SQL-on-Hadoop set of products has integrated data virtualization capability.
Hadoop Ecosystem Categories
While you could download the Hadoop source tarballs from Apache yourself, the main benefit of commercial distributions for Hadoop is that they assemble the various open source projects from Apache and test and certify the countless new releases together. These are presented as a package. This saves businesses the cost of the science project of testing and assembling projects, since it will take more than HDFS and MapReduce to really get Hadoop-enabled in an enterprise.
Given version dependencies, the process of assembling the components will be very time-consuming.
Vendors also provide additional software or enhancements to the open source software, support, consulting, and training. One area lacking for enterprises in the open source-only software is software that helps administrators configure, monitor, and manage Hadoop. Another area needed for the enterprise is enterprise integration. Distributions provide additional connectors with the same availability, scalability, and reliability as other enterprise systems.
These are well covered by the major commercial distributions. Some of the vendors push their wares back into the open source en masse, while others do not. Neither approach is a “Top 10 Mistake” if you follow it, but be aware of the difference.
When selecting how to deploy Hadoop for the enterprise, keep in mind the process for getting it into production. You could spend equal time developing and productionizing if you do not use a commercial distribution. You are already saving tremendous dollars in data storage by going with Hadoop (over a relational database). Some expenditure for a commercial distribution is worth it and part of an MVP approach.
An alternative to running and managing Hadoop in-house, whether on-premises or in the Cloud, is to take advantage of big data as a service. As with any as-a-service model, Hadoop as a service makes medium-to-large-scale data processing more accessible to businesses without in-house expertise or infrastructure, easier to execute, faster to realize business value, and less expensive to run. Hadoop as a service is aimed at overcoming the operational challenges of running Hadoop.
Decoupled Storage from Compute
Data is decoupled from the data platform by taking advantage of Cloud providers’ persistent low-cost storage mechanisms (e.g., Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Store) and data connectors to fluidly move data from passive storage to active processing and back to storage again. This way, you only pay for processing resources when they are actually processing data. When data is at rest, you are only paying for its storage, which is significantly cheaper in terms of cost per hour than a running instance, even if its CPUs are idling.
For example, imagine you have a big data transformation job that runs once a week to turn raw data into an analysis-ready data set for a data science team. The raw data could be collected and stored on Amazon S3 until it’s time to be processed. Over the weekend, a Hadoop cluster of EC2 instances is launched from a pre-configured image. That cluster takes the data from S3, runs its transformation jobs, and puts the resultant dataset back on S3 where it awaits the data science team until Monday morning. The Hadoop cluster goes down and terminates once the last byte is transferred to S3. You, as the big data program director, only pay for the Hadoop cluster while it is running its assigned workload and no more!
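The weekly-job example above comes down to simple arithmetic, sketched here in Python. The node count, hourly rate, and job duration are hypothetical figures chosen for illustration and are not actual provider pricing; storage charges, which apply in both cases, are left out.

```python
HOURS_PER_WEEK = 7 * 24


def weekly_compute_cost(nodes, hourly_rate, hours_running):
    """Compute cost for one week, billed only while instances are running."""
    return nodes * hourly_rate * hours_running


# Hypothetical figures for illustration only.
NODES = 10
HOURLY_RATE = 0.50   # dollars per node-hour
JOB_HOURS = 6        # duration of the weekly transformation job

always_on = weekly_compute_cost(NODES, HOURLY_RATE, HOURS_PER_WEEK)
transient = weekly_compute_cost(NODES, HOURLY_RATE, JOB_HOURS)
print(always_on, transient)
```

Under these assumed numbers the transient cluster is billed for 6 hours instead of 168, which is the saving that decoupling storage from compute makes possible.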
Automated Spot Instance Management
Hadoop workloads can take advantage of a unique feature that can significantly reduce costs—bidding for Cloud services. Just like any commodity market, Cloud providers, like AWS, offer their computing power based on the supply and demand of their resources. During periods of time when supply (available resources) is high and demand is low, Cloud resources can be procured on the spot at much cheaper prices than quoted instance pricing. AWS calls these spot instances. Spot instances let you bid on unused Amazon EC2 instances—essentially allowing you to name your own price for computing resources! To obtain a spot instance, you bid your price, and when the market price drops below your specified price, your instance launches. You get to keep running at that price until you terminate the spot instance or the market price rises above your price.
While bidding for Cloud resources offers a significant cost savings opportunity, therein lies a problem. Bidding for resources is a completely manual process—requiring you to constantly monitor the spot market price and adjust your price accordingly to get the resources you need when you need them. Most Hadoop program managers can’t sit and wait for the “right price.” It’s actually quite difficult to bid on spot instances and constantly monitor spot market prices to try to get the best price.
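The bidding rule described above (launch when the market price drops below your price, stop when it rises above) can be simulated with a short Python sketch. The hourly price series is made up, and the sketch assumes you are charged the market price for each hour the instance runs; neither reflects real AWS data or billing rules.

```python
def simulate_spot(bid, market_prices):
    """Return the hours during which a spot instance would be running.

    The instance launches when the market price drops below the bid and
    stops when the market price rises above it.
    """
    running = False
    hours_running = []
    for hour, price in enumerate(market_prices):
        if not running and price < bid:
            running = True
        elif running and price > bid:
            running = False
        if running:
            hours_running.append(hour)
    return hours_running


# Hypothetical hourly market prices in dollars.
prices = [0.30, 0.22, 0.18, 0.19, 0.27, 0.21, 0.35]
hours = simulate_spot(bid=0.25, market_prices=prices)
cost = sum(prices[h] for h in hours)  # assumption: billed at the market price
print(hours, round(cost, 2))
```

The gaps in the running hours show why manual bidding is awkward for scheduled Hadoop workloads: the job is interrupted whenever the market moves above the chosen price.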
Hadoop Data Movement
Data architect and integration professionals are well versed in the methods of moving and replicating data around and within a conventional information ecosystem. They also know the inherent value of having powerful and robust data integration tools for change data capture, ETL, and ELT to populate analytical databases and data warehouses. Those conventional tools work well within the traditional on-premises environments with which we are all familiar.
However, what does data movement look like in the big data and hybrid on-premises and cloud architectures of today? With blended architecture, the Cloud, and the ability to scale with Hadoop, it is imperative that you have the capability to manage the necessary movement and replication of data quickly and easily. Also, most enterprises’ platform landscapes are changing and evolving rapidly. Analytical systems and Hadoop are being migrated to the Cloud, and organizations must figure out how to migrate the most important aspect—the data.
There are multiple methods to migrate data to (and from) the Cloud, depending on the use case. One use case is a one-time, massive data migration. One example of this is the use of DistCp to back up and recover a Hadoop cluster or migrate data from one Hadoop cluster to another.
DistCp is built on MapReduce, which of course is a batch-oriented tool. The problem with this method is the poor performance and the costs. For example, if you needed to migrate 1TB of data to the cloud over a 100Mbps internet connection with 80% network utilization, it would take over 30 hours just to move the data. As an attempt to mitigate this huge time-performance lag for slower internet connections (1TB over a 1.5Mbps T1 would require 82 days!), Amazon offers a service called Snowball where the customer actually loads their data onto physical devices, called “Snowballs,” and then ships those devices to Amazon to be loaded directly onto their servers. In 2016, this seems archaic. Neither option is attractive.
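The transfer times quoted above can be checked with a few lines of Python. The sketch assumes that 1TB means 2^40 bytes and that the T1 line runs at its nominal 1.544 Mbps; with those assumptions it lands close to the 30-hour and 82-day figures in the text.

```python
def transfer_seconds(size_bytes, link_mbps, utilization=0.8):
    """Seconds needed to move `size_bytes` over a link at the given utilization."""
    usable_bits_per_second = link_mbps * 1_000_000 * utilization
    return size_bytes * 8 / usable_bits_per_second


ONE_TB = 2 ** 40  # assumption: 1 TB taken as 2^40 bytes

hours_100mbps = transfer_seconds(ONE_TB, 100) / 3600
days_t1 = transfer_seconds(ONE_TB, 1.544) / 86400  # T1 line rate is 1.544 Mbps
print(f"{hours_100mbps:.1f} hours over 100 Mbps, {days_t1:.0f} days over a T1")
```

Changing `utilization` or the link speed shows how quickly bulk migration over the network becomes impractical as data volumes grow.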
Another use is the ongoing data migration from on-premises to the Cloud. One method is the use of a dedicated, direct connection to the Cloud that bypasses the ISP. Cloud providers, such as Amazon, have their own dedicated gateways that can accomplish a direct connection for minimal network latency through an iSCSI connection through the local storage gateway IP address. This is typical of the solutions out there. There are some performance benefits with this method, but in all likelihood, you will need a third party tool to manage the complexity of the data movement.
Another method is the use of a third-party migration tool to monitor for change data capture and regularly push this data up to the Cloud. Most tools in this space use periodic log-scanning and pick the data up in batches. The downside is that this creates a lot of overhead. The scheduled batch data replication process requires the source system to go offline and/or be read-only during the replication process. Also, the data synchronization is one-way and the target destination must be read-only to all other users in order to avoid divergence from the original source. This makes the target Cloud source consistent…eventually. Other problems with these tools include the lack of disaster recovery (requires manual intervention) and the complexities when more than one data center is involved.
The number one problem is that replicating or migrating data up to (or down from) the Cloud using any of these methods requires both the source and target to “remain still” while the data is transferred, just like you have to pause when having your photograph taken. The challenge with data migration from on-premises to the Cloud, particularly with Hadoop, is overcoming “data friction.” Data friction is caused by the batch-orientation of most tools in the arena. Furthermore, batch-orientation tends to dominate the conventional thinking in data integration spheres. For example, most data warehouse architects have fixed windows of extraction when a bulk of data is loaded from production systems to staging. This is batch thinking. In the modern, global big data era, data is always moving and changing. It is never stagnant.
If your organization needed to quickly move data to a Hadoop cluster in the Cloud and offload a workload onto it, the time-cost of replicating the needed data would be high. When “data friction” is high, a robust hybrid Cloud cannot exist.
With active-transactional, data is pumped directly to the Cloud as it is changed on-premises or vice versa, making it ideal for hybrid cloud elastic data center deployments as well as migration.
SQL on Hadoop
It’s not just how you do something that’s important; rather, it’s whether you’re doing something that matters. Your Hadoop project should not store data “just in case.” Enterprises should integrate data into Hadoop because the processing is critical to business success.
Wherever you store data, you should have a business purpose for keeping the data accessible. Data just accumulating in Hadoop, without being used, costs storage space (i.e., money) and clutters the cluster. Business purposes, however, tend to be readily apparent in modern enterprises that are clamoring for a 360-degree view of the customer made intelligently available in real time to online applications.
The best way, in MVP fashion, to provide access to Hadoop data is through the class of tools known as SQL-on-Hadoop. With SQL-on-Hadoop, you access data in Hadoop clusters by using the standard and ubiquitous SQL language. Knowledge of APIs is not necessary.
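As a rough illustration of what "just SQL" means for the person asking the question, the sketch below runs an ordinary aggregate query. Python's built-in sqlite3 module is used purely as a stand-in so the example runs anywhere; it is not a SQL-on-Hadoop engine, and the page_views table and its columns are hypothetical.

```python
import sqlite3

# Stand-in only: sqlite3 is NOT a SQL-on-Hadoop engine. It is used here so the
# example runs anywhere; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (customer_id TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("c1", "/home"), ("c1", "/pricing"), ("c2", "/home")],
)

# An ordinary aggregate query: this kind of statement, rather than a
# lower-level API call, is what the analyst writes.
query = """
    SELECT customer_id, COUNT(*) AS views
    FROM page_views
    GROUP BY customer_id
    ORDER BY views DESC, customer_id
"""
rows = conn.execute(query).fetchall()
conn.close()
```

Each engine has its own dialect and connection mechanics, so treat the query as indicative of the style of access rather than something to paste into a specific tool.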
You should grow the data science of your organization to the point that it can utilize a large amount of high-quality data for your online applications. This is the demand for the data that will be provided by Hadoop.
SQL-on-Hadoop helps ensure the ability to reach an expansive user community. With the investment in a Hadoop cluster, you do not want to limit the possibilities. Putting a SQL interface layer on top of Hadoop will expand the possibilities for user access, analytics, and application development.
There are numerous options for SQL-on-Hadoop. The original, Apache Hive, is the de facto standard. The Hive flavor of SQL is sometimes called HQL. Each of the three major Hadoop enterprise distributions discussed earlier (Hortonworks, Cloudera, and MapR) includes its own SQL-on-Hadoop engine. Hortonworks offers Hive bolstered by Tez and their own Stinger project. Cloudera includes Apache Impala with their distribution. MapR uses Apache Drill.
The list only begins there. The large vendors (IBM, Oracle, Teradata, and Hewlett-Packard) each have their own SQL-on-Hadoop tools: BigSQL, Big Data SQL, Presto, and Vertica SQL On Hadoop, respectively. Other not-so-small players have offerings, like Actian Vortex and Pivotal’s Apache HAWQ. And of course, Spark proponents tout Spark SQL as the go-to choice.
Besides the vendor-backed offerings, there are two additional open source projects: Phoenix, a SQL engine for HBase, and Tajo, an ANSI SQL compliant data warehousing framework that manages data on top of HDFS with support for Hive via HCatalog.
Look for features that complement your current architecture and match your appetite for proofs of concept.
Evaluation Criteria for Hadoop in the Cloud
The critical path for evaluating Hadoop in the Cloud solutions for your organization is to set yourself on a path to take action. The need for big data is only going to get bigger and the use cases and business problems to solve will only get more varied and complex. Therefore, we leave you with the following criteria to consider as you build a business case for Hadoop in the Cloud, a key component of a Hadoop MVP implementation.
Data leadership must be part of company strategy today and Hadoop is a necessary part of that leadership. The use patterns Hadoop supports are many and are necessary in enterprises today. Data lakes, archiving data, unstructured batch data, data marts, data integration and other workloads can take advantage of Hadoop’s unique architecture.
The ability to fire up large-scale resources quickly, deal with highly variable resource requirements, and simplify operations, administration, and cost management makes the cloud a natural fit for Hadoop. It is part of a minimum viable product (MVP) approach to Hadoop.
Selecting a cloud service, or big data as a service, should put you in the best position for long-term, low total cost of ownership.
The challenge with data migration from on-premises to the Cloud—particularly with Hadoop—is overcoming “data friction”. There are multiple methods to migrate data to (and from) the Cloud—depending on the use case. WANdisco Fusion sets up your MVP for the inevitable data movement (migration and replication) required in a heterogeneous modern enterprise data architecture.
Finally, round out your use case, distribution, cloud service, and data movement selections with SQL-on-Hadoop to provide access to the data assets and enable an MVP of Hadoop to accede to its role in data leadership.
Due to the pandemic, most businesses are increasing their investments in AI. Organizations have accelerated their AI efforts to ensure their business is not majorly affected by the current pandemic.
Though the implementation is a positive development in terms of AI adoption, organizations need to be aware of the challenges in adopting AI. Building an AI system is not a simple task. It comes with challenges at every stage.
Even if you build an AI project, there is a high chance of it failing upon deployment, for numerous reasons. This blog post covers the top five reasons why AI projects fail and suggests solutions for a successful AI project implementation.
1. Improper Strategic Approach
There are two facets to a strategic approach. The first is being over-ambitious, and the second is the lack of a business approach.
When it comes to adopting an AI project, most organizations tend to start with a large-scale problem. One of the main reasons is the false belief people have about AI.
Currently, AI is overhyped but under-delivered. Most people believe AI to be that advanced piece of technology that is nothing short of magic. Though AI is potent enough to be such a technology, it is still at a very nascent stage.
Furthermore, adopting AI in an organization is a considerable investment of time, money, resources, and people. Since companies make that huge investment, they also expect higher returns.
But as mentioned before, AI is still too narrow to drive such returns in one go. Does that mean you cannot get a positive ROI? Not at all.
AI adoption is a step-by-step process. Every AI project you build is a step forward to making AI the core of your business. So start with smaller projects like gauging demand for your products, predicting credit score, personalizing marketing, etc. As you build more projects, your AI will better understand your needs (with all the data), and you will start seeing much better ROI.
Moving on to the second facet of the problem: when companies decide to build an AI project, they usually see the problem statement from a technical perspective. This approach prevents them from measuring their true business success.
Companies have to start seeing a problem from a business perspective first. Ask yourself the following questions:
What business problem are you trying to solve?
What are the metrics that define success?
Once you have answered these questions, move on to decide what technology you would use to solve the problem. Remember, AI is an ocean that covers multiple technologies like machine learning, neural networks, deep learning, computer vision, and so much more.
Understand which technology would be most suitable for the problem at hand and then start building an AI solution.
2. Lack of Good Talent
Most people forget that AI is a tool created by humans. Of course, data is the crucial ingredient, but humans are the ones who use it to develop AI. And currently, there is a shortage of talented professionals who can build effective AI systems.
In its Emerging Jobs 2020 report, LinkedIn ranked AI specialists in the first position. However, the supply does not seem to match the demand yet.
The shortage deepens when you consider quality and experience as well. Becoming an expert in AI takes years. Before that, one needs to master underlying skills like statistics, mathematics, and programming. AI practitioners also have to keep updating their knowledge, as AI is a continuously evolving field.
According to Gartner, 56% of the organizations surveyed reported a lack of skills as the main reason for failing to develop successful AI projects.
Organizations can solve this problem in two ways.
First, they need to identify talent within their workforce and start upskilling them. They can gradually extend this process to the rest of the organization.
Second, organizations need to partner with universities to bridge the gap between academia and the industry. With a clear picture of the skills needed and the right resources, universities can train students with the skills required in the industry.
While pedagogical changes will boost AI upskilling, this is still a long-term approach. What about now? A third approach is slowly gaining traction: dedicated AI companies that have the right talent build AI models and offer AI-as-a-Service.
3. Data Quality and Quantity
Having addressed the issue of the people who make AI, let’s now talk about the ingredient that makes this technology possible – data.
AI, as a concept, was first introduced in the 1950s. But at that time, researchers did not have enough data to bring the technology to reality. In the last decade, however, the situation has changed drastically.
With technology and gadgets now close companions of humans, most devices and software have gained the ability to collect vast amounts of data. And with this rising data collection, AI started gaining traction.
But then a new problem arose – the quality of the data.
Data is one of the two most crucial requirements to create an effective AI system. Though companies had started to collect tons of data, issues like unwanted and unstructured data persisted.
Data usually gets collected in multiple forms: structured, unstructured, or semi-structured. It is usually unorganized and contains various parameters that may or may not be essential for your AI project specifically.
For example, if you are building a recommendation system, you would want to avoid collecting unnecessary data like mail id, customer picture, phone number, etc. This data would not help you solve the problem of understanding your customer preferences. Worse, a large number of irrelevant fields increases the risk of overfitting, where the model learns noise instead of real preference patterns.
In the above example, if you had data like web browsing history, previous purchases, interests, location, then your AI system would give you much better results.
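A minimal sketch of that filtering step, assuming customer records arrive as dictionaries: the field names and the sample record are made up to mirror the examples above, and a real project would decide the exclusion list together with its stakeholders.

```python
# Hypothetical field names, taken from the examples in the text.
IRRELEVANT_FIELDS = {"mail_id", "customer_picture", "phone_number"}


def select_features(record):
    """Return a copy of a customer record without fields that say nothing about preferences."""
    return {k: v for k, v in record.items() if k not in IRRELEVANT_FIELDS}


# Made-up record for illustration only.
raw_record = {
    "mail_id": "user@example.com",
    "phone_number": "000-0000",
    "previous_purchases": ["tent", "stove"],
    "interests": ["camping"],
    "location": "Denver",
}
features = select_features(raw_record)
```

Only the purchase history, interests, and location survive, which are the fields the recommendation system can actually learn from.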
To solve the data issue, consider involving all the stakeholders before starting an AI project – the business heads, data analysts, data scientists, ML engineers, IT analysts, and DevOps engineers. You can then have a clear picture of what data is required to build the AI model, what quantity, and what form. Once you have an understanding of this, you can clean and transform your data as required.
Also, while you prepare the data, make sure you keep aside a part of it as testing data to ensure the AI you build works as you intend it to work.
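Setting aside test data can be as simple as shuffling the records and holding out a fixed share. The sketch below uses only the Python standard library; the function name, the 20% default, and the fixed seed are assumptions for illustration, not a recommendation for every project.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out a fraction for testing."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder data: the integers 0..99 stand in for real records.
train, test = train_test_split(range(100), test_fraction=0.2)
```

The important property is that no record appears in both sets, so the test data gives an honest picture of how the model behaves on data it has not seen.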
4. Lack of AI Awareness in Employees
Most people believe artificial intelligence would replace them in their jobs. However, this is certainly not the case.
As companies adopt AI, they will also have to educate their workforce on how AI is an “augmentor.” This education, aptly termed “data literacy,” is crucial if you want enterprise-wide AI adoption in your organization.
Data literacy needs to be prioritized for two reasons:
To ensure that your workforce (especially non-technical staff) is aware of what AI does and the capacity in which it helps them
To ensure that upon successful education, they do not blindly rely on AI for the decisions it makes
There have been scenarios where even though companies have deployed AI in their day-to-day operations, the workforce has rejected it. This indicates that employees have trust issues with the technology.
Conversely, you do not want your workforce to blindly accept all the decisions made by your AI. You need to ensure the decisions are justified and make sense.
For these reasons, as your organization starts adopting AI, you will also have to start educating your workforce on the technology. Promote AI as a technology that takes up tasks and not jobs. Let your workforce understand that the sole purpose of AI is to free up human time so that they can focus on complex problems. It is important that people understand AI as not just artificial intelligence but augmented intelligence.
5. Post-Deployment Governance and Monitoring
Suppose you bought a car. It has all the features that you wanted. It drives smoothly, helps you get to work in a matter of minutes, and even customizes the ambiance as per your preferences. But does that mean it does not need any attention or maintenance from your end? Absolutely not.
Put simply, building and deploying an AI system is like owning a car. You will also have to maintain the AI after deployment. However, maintaining AI is a far bigger undertaking than maintaining a vehicle.
AI systems make myriad decisions based on the data that are fed to them. If people cannot understand how an AI arrives at a particular decision, then that AI system can be labeled as a “black box AI.”
Ensuring your AI does not turn into a black box is crucial, especially when it makes decisions like processing loans, suggesting medical treatments, accepting applications for universities, and so on. Many governments are recognizing the black box issue and are considering regulating the technology. Even where developers are not legally bound, it is their ethical responsibility to ensure AI is fair and just.
An additional challenge here is the dynamic nature of data and the business scenario. Data is unlikely to remain static throughout the lifetime of an AI project. As the data changes, the AI also needs to be recalibrated so that its performance does not degrade.
Recalibrating an AI system is much like building an all-new model and, like any AI project, it takes time and resources. For this reason, most companies try to stretch their models for a long time without “maintaining” them or accommodating business changes in the model. But you cannot predict when the model will start drifting and produce unintended consequences.
To address these problems, organizations will have to constantly monitor their AI systems. The AI needs to be regularly updated with the changing data and business scenarios. To make this process a little less difficult, you can use an AI observability tool that helps you monitor your models and reports unwanted drift.
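To give a feel for what a drift check involves, here is a small sketch that compares the distribution of one numeric feature at training time with its distribution today, using the Population Stability Index (PSI). This is one common drift measure among several, not the method of any specific observability tool; the function name, bin count, and sample data are assumptions.

```python
import math


def population_stability_index(baseline, current, bins=10):
    """Compare two samples of one numeric feature; larger values mean more drift."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    expected, actual = bin_shares(baseline), bin_shares(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# Made-up data: a baseline spread over 0..0.99 and a copy shifted upward by 0.5.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in baseline]
psi_same = population_stability_index(baseline, list(baseline))
psi_shifted = population_stability_index(baseline, shifted)
```

An unchanged feature scores zero, while the shifted one scores far higher. The alert threshold is a judgment call that depends on the feature and the business impact of a wrong prediction.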
AI Adoption Is A Journey
AI is a powerful technology that is changing the way we do business today. However, like every good thing, it needs time and effort before it can function to the best of its abilities.
As AI enablers, we recommend organizations adopt AI in a step-by-step fashion. The returns on AI investments are not linear; they compound as you start utilizing AI across your organization. Once you have specific AI use cases, you can extend the AI system to enterprise-wide adoption.