Saturday, December 04, 2010

Predictive Analytics

Predictive analytics is hot. Advances in hardware, statistics and business intelligence software have made it usable and performant. Predictive analytics, as the name suggests, helps gain business intelligence from data using various data mining, pattern recognition and probabilistic algorithms. Consider the following example - given the history of orders for a product, how likely is a customer belonging to a certain age group to buy that product? The answer can be computed in several ways.

1. Find the clusters of customers who bought the product. The distance of this customer from those clusters indicates the likelihood of buying.
2. Use regression analysis: Y = a1*X1 + a2*X2 + ...
where Y represents revenue for the product and X1, X2, X3 could represent causal factors such as age, geography etc., with a1, a2, a3 as the coefficients. If age is statistically significant, its coefficient will have a significant value. Once age is confirmed to be statistically significant, we could introduce a causal variable for each age bracket and then find out which bracket is the most significant. A rough sketch of this idea appears below.
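As a minimal sketch of the regression idea (not any particular tool's implementation), here is a least-squares fit in Python on fabricated data; the factor names, figures and coefficients are invented purely for illustration.

import numpy as np

# Illustrative only: rows are customers, columns are candidate causal factors.
# X1 = age (years), X2 = region code; Y = revenue for the product.
rng = np.random.default_rng(0)
age = rng.integers(18, 70, size=200)
region = rng.integers(0, 3, size=200)
# Fabricated relationship: revenue rises mildly with age, plus noise.
revenue = 50 + 1.5 * age + 5 * region + rng.normal(0, 20, size=200)

# Design matrix with an intercept column: Y = a0 + a1*X1 + a2*X2
X = np.column_stack([np.ones(age.size), age, region])
coeffs, residuals, rank, _ = np.linalg.lstsq(X, revenue, rcond=None)
print("intercept, age coefficient, region coefficient:", coeffs)
# A large, stable age coefficient (relative to its standard error) would suggest
# age is statistically significant; a full analysis would also compute p-values.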

Oracle offers predictive analytics at several layers. PL/SQL and Java come with an API for predictive analytics called DBMS_PREDICTIVE_ANALYTICS. Oracle also has a product called RTD (Real-Time Decisions) that is bundled with OBIEE, Oracle Business Intelligence Enterprise Edition.
Other tools such as Crystal Ball, the Excel add-in for predictive analytics, and Oracle Data Mining round out Oracle's arsenal. Oracle Demantra provides predictive analytics related to forecasting and demand management. With IBM having acquired SPSS, the industry's landscape has become interesting.

Publications like "Competing on Analytics" from Harvard Business School Press and the recent survey on BI trends published in MIT's Sloan Management Review have contributed to the heightened interest and investment in this upcoming discipline. Companies like Netflix have grown into billion-dollar businesses by predicting consumer buying patterns based on "clicks".

Friday, October 08, 2010

Correlation vs Causality

I was in Chennai a few years ago at a conference, sitting atop the terrace of a restaurant with a few colleagues. The sales of cold drinks were at an all-time high, with everybody ordering Coke, beer and the like.
At the same time, I noticed a huge influx of patients at the hospital near the restaurant.
So there must have been a positive correlation between the sales of cold drinks and the inflow of patients to the hospital, meaning as one went up or down, the other did too.
However, does that mean the cold drinks caused people to get hospitalized? Or vice versa - did people drink because someone got hospitalized?
Neither was correct in this situation. In reality, the influx of people to the hospital and the sales of cold drinks were both caused by the sweltering heat of Chennai. So if we were to forecast the sale of cold drinks, the causal factor would be "temperature" and not the "number of people admitted to the nearby hospital".
And this is precisely one of the key things to watch out for when analyzing the results of regression analysis. While regression will give you the correlation between two variables, it may take an expert to confirm whether there is causality between them.
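To make the point concrete, here is a toy Python sketch with made-up numbers in which cold-drink sales and hospital admissions are both driven by temperature; they end up strongly correlated even though neither causes the other.

import numpy as np

# A toy illustration (fabricated numbers): both series are driven by temperature,
# so they correlate strongly without either causing the other.
rng = np.random.default_rng(1)
temperature = rng.uniform(30, 42, size=60)                 # daily highs, deg C
cold_drink_sales = 100 + 20 * temperature + rng.normal(0, 30, size=60)
hospital_admissions = 5 + 2 * temperature + rng.normal(0, 5, size=60)

print(np.corrcoef(cold_drink_sales, hospital_admissions)[0, 1])
# Prints a correlation close to 1, even though neither variable causes the other;
# temperature is the common (confounding) cause.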

Wednesday, October 06, 2010

The dangers of serial thinking

I bought a horse for 10$ and sold it to a friend for 20$. I then bought the same horse back from that person for 30$ and sold it to them again for 40$. What was my profit?

If you came up with 10$ as the answer, i.e. (20 - 10) + (20 - 30) + (40 - 30), you fell into the trap of serial thinking.

If you came up with 20$ as the answer, you are right, because these are two separate transactions.

Transaction 1 - You bought and sold the horse. Profit = 20 - 10 = 10$
Transaction 2 - You bought and sold the horse. Profit = 40 - 30 = 10$

Total profit = 10 + 10 = 20$

If you are financially savvy, you would do

Profit = total cash inflow - total cash outflow
       = (40 + 20) - (30 + 10)
       = 60 - 40
       = 20$

Beware the dangers of serial thinking.
Start thinking laterally.

Tuesday, September 28, 2010

The three best practices in software development

Michael Cusumano has documented the three most effective practices in software development in his
book "The Business of Software". These three practices have resulted in significant defect reductions at large companies. While Six Sigma, Lean, CMMI and other methodologies can be applied to software and have been effective to varying degrees, the following three practices must be implemented first; all other methodologies and practices are add-ons. So, what are the three key practices?

1. Early prototyping
Do a proof of concept by prototyping early and show it to the users.
Seeing is believing. When users see something, they can spot obvious flaws and limitations.

2. Reviews, reviews, reviews
We need reviews at each stage - high-level design reviews, detailed design reviews, code reviews, unit test case reviews, system test reviews, project plan reviews and so on. A review provides a negative feedback mechanism and hence stabilizes the output, to borrow the analogy of a closed-loop negative feedback control system. In a control system with negative feedback, the output is Y = X * G / (1 + G*H),
where X is the input, G is the gain (amplification/distortion) and H is the amount of feedback.
In a large project, G can be assumed to be large and H close to 1, so G*H is much greater than 1. Thus Y is approximately X * G / (G*H) = X/H.

Now if the review is 100% perfect, H = 1 and therefore Y = X.
Thus the output of a software stage equals its input, which is the specification for that stage.
In essence, the output of software development will follow the specifications accurately if there are
rigorous reviews.
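Here is a quick numeric check of that closed-loop formula in Python, with arbitrary illustrative values for G and H.

# A quick numeric check of the closed-loop formula Y = X * G / (1 + G*H),
# with made-up values for the forward gain G and the review feedback H.
def closed_loop_output(x, g, h):
    return x * g / (1 + g * h)

x = 1.0          # the specification fed into a development stage
g = 1000.0       # large "gain" (scope for amplification/distortion) on a big project
for h in (0.5, 0.9, 1.0):        # review rigour: 1.0 = a perfect review
    print(h, closed_loop_output(x, g, h))
# As H approaches 1, the output approaches X/H = X: the stage's output tracks
# its specification when reviews are rigorous.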

Linus Torvalds, the creator of Linux, is credited with the observation that "given enough eyeballs, all bugs are shallow". How true! No wonder Linux is such a robust operating system: being open source, millions of eyes look at its code and suggest improvements. So the next time you want to review code, call all the developers into a room and project the code. Let them all look at it and have fun dissecting it in a cordial and friendly environment.

3. Daily regression tests and builds

The code must be compiled and linked daily.
Automated regression tests must be run daily - the more the better - and emails sent to all concerned, including senior management, when the regression tests fail. This ensures that as new features are added, existing features continue to work. A rough sketch of such a nightly job appears below.
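As a rough sketch (not a prescription), a nightly job along these lines could be written in Python; the build and test commands, SMTP host and email addresses below are placeholders, not a real setup.

import smtplib
import subprocess
from email.message import EmailMessage

# A sketch of a nightly job (e.g. run from cron): build, run the automated
# regression suite, and email everyone concerned if anything fails.
build = subprocess.run(["make", "clean", "all"], capture_output=True, text=True)
tests = subprocess.run(["pytest", "tests/", "-q"], capture_output=True, text=True)

if build.returncode != 0 or tests.returncode != 0:
    msg = EmailMessage()
    msg["Subject"] = "Nightly build / regression test FAILURE"
    msg["From"] = "buildbot@example.com"
    msg["To"] = "dev-team@example.com, senior-management@example.com"
    msg.set_content(build.stdout + build.stderr + "\n" + tests.stdout + tests.stderr)
    with smtplib.SMTP("smtp.example.com") as server:
        server.send_message(msg)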

Sunday, September 26, 2010

Bayes theorem

Bayes' theorem is the foundation of Bayesian statistics and of the new and emerging discipline of predictive analytics. Reverend Thomas Bayes, an 18th-century British mathematician, derived the theorem as a special case of probability theory. He did not publish it during his lifetime, reportedly fearing that it might not pass rigorous scientific scrutiny; the theorem was published after his death.
Here's the theorem and  a  few practical applications.

P(Ri/E) = P(E/Ri) * P(Ri) / sum over j of [ P(E/Rj) * P(Rj) ],   j = 1 to n

P(Ri) is called the prior probability of event Ri. It represents what we already know about Ri from past history. E is the fresh new evidence that has arrived that would influence the event Ri.

P(Ri/E) is the posterior probability of event Ri given the evidence E.
P(E/Ri) is the likelihood of the evidence E given Ri.

As can be seen above, as new evidence surfaces the theorem lets us update our knowledge about the probability of occurrence of the event Ri. P(Ri) was our knowledge of the event based on its history,
while P(Ri/E) is our updated knowledge after taking the evidence E into consideration.

These probabilities can be point values or may follow a certain distribution.
Here are a few practical examples where the theorem could be applied.
(An example similar to the first one below was cited in a recent issue of Sloan Management Review.)

1. I know the history of rainfall in my region for the past 10 years. I now have evidence that this year's temperatures were higher than normal. We know that higher temperatures correspond to higher rainfall, following a certain distribution. Using this evidence, we can update the probability of rain this year. If I calculated the probability based on history alone (which is what frequentists would do), I would be ignoring the key evidence that surfaced this year.

2. We know the delivery history of a certain supplier. New evidence has arrived that the supplier's capacity is full due to a contract they have signed with another customer. We know the relationship between the supplier's capacity and on-time delivery. Given the evidence, we can find the probability of on-time delivery this month. Had we not used Bayes' theorem (and instead relied on history alone, as a frequentist would), the probability would have been based solely on the supplier's historical performance and would have ignored the critical new evidence. A small numeric sketch of this example appears below.
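Here is a minimal Python sketch of the supplier example; the prior and likelihood numbers are invented purely for illustration.

# A minimal sketch of the supplier example, with invented numbers.
# "on_time" and "late" are the two outcomes Ri; the evidence E is
# "the supplier's capacity is full this month".
p_R = {"on_time": 0.8, "late": 0.2}              # priors from delivery history
p_E_given_R = {"on_time": 0.3, "late": 0.9}      # likelihood of full capacity under each outcome

evidence = sum(p_E_given_R[r] * p_R[r] for r in p_R)               # the denominator
posterior = {r: p_E_given_R[r] * p_R[r] / evidence for r in p_R}
print(posterior)
# The prior 0.8 chance of on-time delivery drops to roughly 0.57 once the
# capacity evidence is taken into account.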

Bayes' theorem does have certain limitations and gotchas in practice, as will be seen in future posts.

Monday, September 06, 2010

System thinking

Most engineers, when given something to analyze, try to break it up into several parts and then analyze each of those parts one by one. While there is merit to analyzing a problem, it is important to first synthesize it. Synthesis is the opposite of analysis. It involves three simple steps.

1. What system is the object under investigation a part of?
2. What are the other objects in that system?
3. How do these other objects interact?

Once these questions are answered, the engineer gets the so-called holistic view of the problem under investigation. Thereafter the process of analyzing the object in question may begin, which will be described in my subsequent posts.

The system is always greater than the sum of its parts because of the complex interactions between those parts. A car transports people - yet none of its parts, for example the engine, gearbox or frame, can perform that function on its own. Hence theorists posit that systems have emergent properties, meaning functions emerge as the various parts of the system interact.

This underscores the need for multiple rounds of system testing in the software development cycle, because each round is essentially an attempt to discover emergent properties of the system that were not anticipated in the unit testing of its parts. People, process and technology are the three basic parts of a software system, or of any system for that matter. They can interact in at least four ways - people and process, process and technology, people and technology, and people, process and technology together - so system test case design should consider these interactions (a small enumeration appears below).

Compounding the equation, these interactions are usually non-linear (output is not proportional to input) and of higher order, meaning one needs to differentiate multiple times with respect to an independent variable to remove its effect. The interactions can be static (time-independent) or dynamic (time-dependent), and location-dependent or location-independent - e.g. the geography of where the people, process or technology are located. The interactions may also form a causal loop, meaning you do not know what is the cause and what is the effect - e.g. is the technology poor because the processes are poor, or are the processes poor because the technology is? Modelling such an effect is very difficult because it is not clear what the dependent and independent variables are. That is where a system consultant's expertise comes into play - applying judgement in situations where modelling is infeasible.
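Here is a tiny Python sketch enumerating those interaction sets; it is only a bookkeeping aid for test design, with the three parts taken from the paragraph above.

from itertools import combinations

# The three basic parts of a system give three pairwise interactions
# plus one three-way interaction.
parts = ["people", "process", "technology"]
interactions = [c for k in (2, 3) for c in combinations(parts, k)]
for combo in interactions:
    print(" & ".join(combo))
# System test design would allocate at least one scenario to each of these
# four interaction sets, in addition to tests of the parts in isolation.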

Saturday, September 04, 2010

3 idiots and Five Point Someone

I finally managed to find some time to watch Aamir Khan's Bollywood movie '3 Idiots' and read Chetan Bhagat's "Five Point Someone", the novel on which Aamir's movie is based. I am very impressed with Chetan's portrayal of life at the IITs and the story's unusual plot. The protagonist - "Ryan" in the novel and "Rancho" in the movie - carries the theme that resonated with me the most: "Follow your passion and success will follow you sooner or later". Victor Menezes, former Chairman of Citigroup, once quoted the following in a speech to students at Sloan: "Do a job that you like and you would not have to work throughout your life."
Of course, following one's passion is easier said than done. There are constraints in practical life; there are sacrifices, risks and costs associated with following your passions, and more often than not pragmatic considerations trump the pursuit of passion. Chetan represents the voice of Indian youth aptly but would do well to avoid profanity in order to take his literature to the next level. Aamir, on the other hand, has replaced some of the profanity in the novel with profound thoughts, events and music, making the movie truly entertaining yet full of substance.

Sunday, August 22, 2010

The 5 stage hierarchy

Russell Ackoff describes the five stages of the human mind using the data-information-knowledge-understanding-wisdom hierarchy in
• Ackoff, R. L., "From Data to Wisdom", Journal of Applied Systems Analysis, Volume 16, 1989, pp. 3-9.
The framework is powerful in a number of ways.
1. It helps evaluate the current state of a company's IT systems. It turns out that a lot of IT systems are still at the information stage, or halfway through the knowledge stage.

Here's an example -
1. Data - Cost of goods sold is X million $ and average inventory for the period is, say, Y million $.
2. Information - Inventory turns are X/Y, say 5.
3. Knowledge - Inventory turns are low compared to industry averages for this sector.
4. Understanding - We understand exactly which parts are contributing to the low inventory turns and why. We know the characteristics of those parts, their historical consumption rates and future demand.
5. Wisdom - We still need to hold these parts because manufacturers have already obsoleted them and our valued customers are likely to order them in the future. This means carrying inventory of slow-moving parts, but the decision is in the interest of our most profitable customers. It is therefore a wise policy to keep stock of obsolete parts even though it hurts the inventory turns metric.

Data, information and knowledge are representations of events that have already happened.
While data and information are tangible, knowledge is not, although knowledge can be tacit or explicit. Data and information are "real" objects that can be stored, accessed, retrieved, copied, transmitted, received and destroyed. Tacit knowledge is the stickiest and lives in the expert's head, while data and information are either already public or can at least be accessed privately.

Wisdom provides the behaviour necessary to react to future events and includes ethical and moral elements. A nation's policies represent wisdom (or the lack thereof), while data, information and knowledge govern processes. Understanding lies between knowledge and wisdom and is about "internalizing" knowledge. "I know" does not mean "I understand". Knowledge is often theoretical while understanding is experiential. Hence the Chinese proverb, "I read, I know. I do, I understand". No wonder MIT, my alma mater, is ranked one of the best engineering institutes in the world: the institute's philosophy of "mens et manus", literally "mind and hand", reinforces both knowledge and understanding. Finally, though, it turns out that highly knowledgeable people with a great understanding of the system may still be unwise.

Sunday, May 30, 2010

Patent awarded

The United States Patent and Trademark Office recently awarded me and my co-inventors a patent on quantifying the output of credit research systems. The work was done while I worked at Bank of America, and all the co-inventors are my former colleagues and bosses at the bank. Since all intellectual property filed on behalf of the bank belongs to it (and rightfully so), BofA holds all legal and financial rights to the patent. The invention may seem too little, too late given the magnitude of the global financial crisis we witnessed, but I am glad that I could put probability and statistics to good use here.

Thursday, January 21, 2010

Quantitative techniques for risk management

How can we quantify risks? A deceptively simple question indeed.
The state of the art in risk management leads one to the following approaches to quantifying risks.

1. The probabilistic approach
2. The statistical approach
3. The operations research approach
4. The artificial intelligence approach

The probabilistic approach seeks to find the probability of each occurrence and the payoff associated with it. This leads to a decision tree, with a payoff associated with each outcome. The expected monetary value (EMV) is then computed for each option (a small sketch follows below). My experience in modelling real-life problems is that it is very hard to find the probability of each outcome; nevertheless a probability distribution can be inferred if there is enough data. A more powerful technique is Bayesian modelling. A common scenario that I have modelled is the following: given that a supplier delivered the material 10 days late last month, what is the probability that they will deliver on time this month? Bayesian statistics is a rapidly evolving branch and is the basis of forecasting engines such as Demantra; at the heart of it is Bayes' theorem. My current work involves improving the accuracy and stability of forecasts by tuning the forecasting engine. Quantification of credit risk for fixed income securities is another area where I have used probabilistic techniques.
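Here is a minimal Python sketch of the EMV calculation over a decision tree with two hypothetical options; the probabilities and payoffs are invented.

# A minimal expected-monetary-value (EMV) sketch with invented figures:
# each branch of a decision option carries a probability and a payoff.
def emv(branches):
    """branches: list of (probability, payoff) pairs for one decision option."""
    return sum(p * payoff for p, payoff in branches)

# Option A: expedite shipping at extra cost; Option B: accept the delay risk.
option_a = [(0.9, -10_000), (0.1, -25_000)]      # mostly a small cost, rare overrun
option_b = [(0.6, 0), (0.4, -60_000)]            # free if on time, costly if late

print("EMV expedite:", emv(option_a))
print("EMV wait:    ", emv(option_b))
# Choose the option with the better (here, less negative) expected value,
# keeping in mind that EMV hides the spread of outcomes.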

Statistical approaches rely on inference from data. Regression analysis is a common technique; I have quantified marketing risks using JMP in the past. MINITAB is another friendly tool for statistical analysis - not as heavy duty as SAS, but powerful enough for many practical applications. Some of the scenarios where I have used it include finding the factors that lead to mutual fund redemptions; I know my cousin used it to quantify the factors that lead to the risk of malnutrition, specifically fluorosis. A rough sketch of the redemption example appears below.
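As a rough sketch of this kind of analysis (hand-rolled rather than JMP or MINITAB output), here is a small logistic regression in Python on fabricated redemption data; the factor names and figures are invented.

import numpy as np

# Score the risk of mutual fund redemptions from candidate factors (fabricated data).
rng = np.random.default_rng(2)
n = 500
fee_increase = rng.integers(0, 2, size=n)        # 1 = fees were raised
poor_returns = rng.integers(0, 2, size=n)        # 1 = fund underperformed
# Fabricated ground truth: both factors raise the chance of redemption.
true_logit = -2 + 1.5 * fee_increase + 2.0 * poor_returns
redeemed = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(float)

# Plain gradient ascent on the average log-likelihood (logistic regression by hand).
X = np.column_stack([np.ones(n), fee_increase, poor_returns])
w = np.zeros(3)
for _ in range(10_000):
    p = 1 / (1 + np.exp(-X @ w))
    w += 0.5 * X.T @ (redeemed - p) / n
print("intercept and factor weights:", w)
# Larger weights flag the factors most associated with redemption risk; tools like
# JMP, MINITAB or SAS report the same idea with proper inference (p-values etc.).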

The OR-based approaches typically formulate a linear or non-linear model and try to solve it. SOLVER is a powerful tool here. I have used it for the quantification of strategic supply chain risks and what-if scenarios, and to find the amount of insurance required for factories. The other powerful technique for risk quantification is Monte Carlo simulation; Crystal Ball is a powerful tool here. I have used it to quantify project schedule risks, budget risks, the probability of background processes getting delayed, compensation risks and post-acquisition integration risks. A minimal schedule-risk sketch follows below.
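Here is a minimal Monte Carlo sketch in Python of project schedule risk, in the spirit of what a tool like Crystal Ball automates; the three-point estimates are invented.

import numpy as np

# Monte Carlo simulation of a schedule made of three sequential tasks.
# Triangular (optimistic, most likely, pessimistic) durations in days are invented.
rng = np.random.default_rng(3)
n = 100_000
design = rng.triangular(8, 10, 15, size=n)
build = rng.triangular(20, 25, 40, size=n)
test = rng.triangular(10, 12, 20, size=n)
total = design + build + test

print("mean duration:", total.mean())
print("P90 duration :", np.percentile(total, 90))
print("P(finish within 55 days):", (total <= 55).mean())
# The distribution of the total, not a single-point estimate, is what lets you
# quantify the schedule risk.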

Some of the AI approaches to risk quantification include neural networks, data mining and fuzzy logic. Fuzzy logic, developed by Prof. Zadeh, is used in consumer devices such as washing machines. Fuzzy sets differ from regular sets in that each member of a fuzzy set has a degree of association with the set between 0 and 1, while regular sets either contain a member (degree = 1) or do not (degree = 0); a tiny illustration follows below. I have read about fuzzy logic being used to reduce the risk of an overdose of anesthesia administered to patients. I tried fuzzy logic almost a decade ago, but I think I should revisit that approach in the light of the business knowledge that I have gained over the years.
Neural networks are powerful but have been less prevalent in industry; the amount of data needed to train a neural network has been an issue.
Oracle's DARWIN is a powerful suite of data mining algorithms - there are tons of them - and I have used them in the past to improve forecast accuracy where data was sparse.
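Here is a tiny Python illustration of the fuzzy membership idea; the "hot temperature" set and its ramp are made up for this example.

# A tiny illustration of the fuzzy-set idea: membership is a degree between 0 and 1
# rather than a yes/no. The membership function below is a made-up example.
def membership_hot(temperature_c):
    """Degree to which a temperature belongs to the fuzzy set 'hot'."""
    if temperature_c <= 25:
        return 0.0
    if temperature_c >= 40:
        return 1.0
    return (temperature_c - 25) / 15          # linear ramp between 25 and 40 deg C

for t in (20, 30, 35, 42):
    print(t, "deg C -> membership in 'hot':", round(membership_hot(t), 2))
# A crisp (regular) set would assign only 0 or 1; the fuzzy set assigns 0.33, 0.67, etc.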

Finally, we should not disregard quasi-quantitative techniques such as FMEA - failure mode and effects analysis. FMEA does not require one to be a mathematician; instead it takes inputs from all stakeholders to quantify risks. The outcome is therefore generally acceptable to all, although FMEA might involve conducting several sessions (sometimes heated) with the stakeholders to ensure convergence of opinions.
I have used FMEA in scenarios such as assessing project risks, evaluating design options and enforcing KYC regulations in the banking sector. FMEA is a key component of the Six Sigma methodology and is a technique developed during the second world war. A small RPN sketch follows below.
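Here is a minimal Python sketch of the usual FMEA arithmetic (risk priority number = severity x occurrence x detection); the failure modes and ratings are invented for illustration.

# A minimal FMEA-style sketch: stakeholders rate each failure mode for severity,
# occurrence and detectability (conventionally 1-10), and the risk priority number
# (RPN) is the product. The failure modes and ratings below are illustrative.
failure_modes = [
    # (description, severity, occurrence, detection)
    ("Key design assumption wrong", 9, 3, 7),
    ("Critical resource leaves project", 7, 4, 5),
    ("KYC check missed for a customer", 10, 2, 6),
]

ranked = sorted(failure_modes, key=lambda fm: fm[1] * fm[2] * fm[3], reverse=True)
for description, s, o, d in ranked:
    print(f"RPN {s * o * d:4d}  {description}")
# The highest-RPN items get mitigation actions first; the ratings themselves come
# out of the (sometimes heated) stakeholder sessions mentioned above.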